If you wish to have a full look at this project's R code please check my Github page for more details.


A bit about the dataset


The dataset comes from crowdsourced transcription of digitized images from the National Archives and Records Administration. The NARA images cover over 100 American cities and neighborhoods. This project makes use of a subset of the entire database consisting of 172 observations and 39 variables. The University of Maryland Digital Curation Innovation Center (DCIC) is in the process of crowdsourcing the transcription of digitized survey images and the project focuses on cities for which corrected data is available.
Initial Thoughts At project start, we developed several goals. The goals included: ● Identifying potential variables and stronger features;
● Normalizing and recoding selected variables as needed;
● Correlation testing of chosen variables for linear regression;
● Developing and fitting models (random forests, decision trees and support vector machines (SVM)) that accurately predicts the grading of a neighborhood based on economic, geographic and racial data.
● Determining the priority columns for transcription correction for the project.

The first step involves knowing about the columns in this dataset

data_final <- read.csv("20170428_NormalizedData_MappingProject.csv")
names(data_final)
 [1] "ID"                         "State"                      "City"                      
 [4] "Security_Grade"             "Terrain_Description"        "Terrain_high"              
 [7] "Terrain_hilly"              "Terrain_flat"               "Favorable_Influences"      
[10] "Detrimental_Influences"     "detriment_far_city"         "detriment_far_stores"      
[13] "detriment_far_schools"      "detriment_far_bus"          "detriment_far_church"      
[16] "detriment_lack_police_fire" "detriment_close_rr"         "INHABITANTS_Type"          
[19] "InhType_business"           "InhType_clerical"           "InhType_professional"      
[22] "InhType_mechanics"          "InhType_retired"            "InhType_gov"               
[25] "InhType_white_collar"       "InhType_labor"              "InhType_railroad"          
[28] "InhType_mixture"            "INHABITANTS_Annual_Income"  "Median_Income"             
[31] "INHABITANTS_Foreignborn"    "INHABITANTS_F"              "INHABITANTS_Negro"         
[34] "INHABITANTS_N"              "INHABITANTS_Infiltration"   "Infiltration_Normalized"   
[37] "Ten_Fifteen_Desirability"   "Desirability_Normalized"    "Remarks"                   

Exploratory Analysis on the dataset

Since the analysis is on the loans handed out, it makes sense to know about the distribution of the income as recorded in the survey

summary(data_final$Median_Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    300    1700    3000   35710    5000 5000000      14 

On looking at the dataset we find that the median is $3000 whereas the mean is $35,710 with the maximum 'Median Income' being $5,000,000 - suggesting that this must be an outlier.


A general assumption that the surveyors made during the survey was that areas having higher concentration of Black Families would be necessarily poor and, as such, should be awarded a lower grade. Below is a plot to see the extent to which this assumption held true.

##plot of the mean number of Black Families with the security Grade
ggplot(data = data_final2, aes(Security_Grade, INHABITANTS_N)) + geom_bar(stat = "summary") + theme_light() + ylab("Average percent of Black Families") + xlab("Security Grade")


Another assumption is that more the income more should be the security grade. Below is a plot to test the same

## plot of mean income with security grade
ggplot(data= data_final2, aes(Security_Grade, Median_Income)) + geom_bar(stat = "summary")

##plot of count of Security Grade
ggplot(data_final, aes(Security_Grade)) + geom_bar() + theme_light()

Looking at the above two plots it is clear that the areas having a considerable percent of Black families were blanketed with a security grade of 'D'.


It makes sense to see whether the assumption that an increase in the number of Black families did lead to the area becoming worse off?

For this I constructed a linear regression model to asses the degree to which the Median Income and the mean percentage of Black families were correlated.

Calculating the baseline accuracy

###baseline model
base <- length(data_final$Security_Grade[data_final$Security_Grade == "B"])/length(data_final$Security_Grade)

The Baseline model has an accuracy of 0.395


Firstly I removed the outliers in the Median Income. It was just one area having an income of $5,000,000. I suspect that there must have been a transcribing error in this data as no other area has a median income even close to this.


Below is a scaled plot of the Median income and the percent of Black families

##Removing the outliers
plot(scale(data_final2$INHABITANTS_N), scale(data_final2$Median_Income))

Observation number 2 has a median income of $5,000,000

A linear moodel to know whether there was a relation between them at all or not

Relation between Median Income and Inhabitants

###unaltered plots
lm7 <- lm(log(Median_Income) ~ INHABITANTS_N, data= data_final)
summary(lm7)

Call:
lm(formula = log(Median_Income) ~ INHABITANTS_N, data = data_final)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4396 -0.5857 -0.1371  0.3738  7.2815 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.143421   0.079020 103.055  < 2e-16 ***
INHABITANTS_N -0.023545   0.005224  -4.507 1.31e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9387 on 150 degrees of freedom
  (20 observations deleted due to missingness)
Multiple R-squared:  0.1193,    Adjusted R-squared:  0.1134 
F-statistic: 20.32 on 1 and 150 DF,  p-value: 1.314e-05

A preliminary analysis shows that the median income and the percent of black families in an area are, in fact, inversely related. However, if you look at the plot you would see that most of the data points are cluttered at the left side - areas having no black population. And, there are a few areas, towards the right of the plot, having a very large black population and a pretty low median income. These data-points seemingly have a high leverage and further analysis needs to be done on them.


plot(data_final$INHABITANTS_N, log(data_final$Median_Income), xlab = "Percent of Black Families", ylab = "Log of the Median Income")
abline(lm7)

plot(lm7)

###After removing the Outliers
linearPlot6 <- subset(data_final, data_final$INHABITANTS_N > 0 & data_final$INHABITANTS_N < 75)
lm6 <- lm(log(Median_Income) ~ INHABITANTS_N, data = linearPlot6)
summary(lm6)

Call:
lm(formula = log(Median_Income) ~ INHABITANTS_N, data = linearPlot6)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71765 -0.25812 -0.04348  0.18599  0.98666 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.146204   0.141957  50.341   <2e-16 ***
INHABITANTS_N -0.006325   0.008556  -0.739    0.469    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3985 on 19 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.02796,   Adjusted R-squared:  -0.0232 
F-statistic: 0.5465 on 1 and 19 DF,  p-value: 0.4688
plot(linearPlot6$INHABITANTS_N, log(linearPlot6$Median_Income), xlab = "Percent of Black Families", ylab = "Log of the Median Income")
abline(lm6)

plot(lm6)

Segregating the dataset into Test and Train set

We have used 80-20 split

index <- 1:nrow(data_final)
testindex <- sample(index, trunc(length(index)/5))
testset <- data_final[testindex,]
trainset <- data_final[-testindex,]

Decision Tree model

For our first analysis we have used Decision Trees. We have taken all the Normalized variables and have tried to predict the Security grades based on it.

attach(data_final)
mod1 <- rpart(Security_Grade ~ State + Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class")
##accuracy of the model
fancyRpartPlot(mod1)

mod1Predict <- predict(mod1, newdata = testset, type = "class")
tab1 <- table(testset$Security_Grade, mod1Predict)
accMod1 <- sum(diag(tab1))/sum(tab1)
###An accuracy of 0.676
####
TreeMod2 <- rpart(Security_Grade ~  Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class")
###Accuracy of model - 0.705
fancyRpartPlot(TreeMod2)

Treemod2Predict <- predict(TreeMod2, newdata = testset, type = "class")
tab1 <- table(testset$Security_Grade, Treemod2Predict)
accMod2 <- sum(diag(tab1))/sum(tab1)
### Accuracy of the Model - 0.705
TreeMod3 <- rpart(Security_Grade ~  InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class")
###accuracy of the model
fancyRpartPlot(TreeMod3)

Treemod3Predict <- predict(TreeMod3, newdata = testset, type = "class")
tab1 <- table(testset$Security_Grade, Treemod3Predict)
accMod3 <- sum(diag(tab1))/sum(tab1)
###Accuracy of the Model - 0.705

The best accuracy that we obtained from Decision trees was 70.58%

Svm models

SvmTest_set <- na.omit(testset)
SvmMod1 <- svm(Security_Grade ~ State + Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class", kernel = "radial")
##calculating Accuracy
SvmMod1Pred <- predict(SvmMod1, newdata = SvmTest_set, type = "class")
SvmTab1 <- table(SvmTest_set$Security_Grade, SvmMod1Pred)
accSVm1 <- sum(diag(SvmTab1))/sum(SvmTab1)
###
SvmModel2 <- svm(Security_Grade ~ InhType_labor + InhType_professional + Median_Income + INHABITANTS_N + Infiltration_Normalized , data= trainset, kernel = "linear")
SvmModelPred2 <- predict(SvmModel2, newdata = SvmTest_set, type = "class")
SvmTab2 <- table(SvmTest_set$Security_Grade, SvmModelPred2)
accSVM2 <- sum(diag(SvmTab2))/sum(SvmTab2)
summary(SvmModel2)

Call:
svm(formula = Security_Grade ~ InhType_labor + InhType_professional + Median_Income + 
    INHABITANTS_N + Infiltration_Normalized, data = trainset, kernel = "linear")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.09090909 

Number of Support Vectors:  99

 ( 27 41 23 8 )


Number of Classes:  4 

Levels: 
 A B C D
summary(SvmModelPred2)
 A  B  C  D 
10  8  4  1 

With SVM the highest accuracy that we obtained was 74.07%

Random Forests Model with 2000 Decision trees

trainCom <- na.omit(trainset)
testCom <- na.omit(trainset)
rfMod1 <- randomForest(Security_Grade ~ Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainCom, importance = TRUE, ntree = 2000, proximity = TRUE)
##Calculation of accuracy
rfpredicted <- predict(rfMod1, newdata = testCom)
rfTab <- table(testCom$Security_Grade, rfpredicted)
rfAccuracy <- sum(diag(rfTab))/sum(rfTab)
varImpPlot(rfMod1)