If you wish to have a full look at this project's R code, please check my GitHub page for more details.

The dataset comes from the crowdsourced transcription of digitized survey images held by the National Archives and Records Administration (NARA). The NARA images cover over 100 American cities and neighborhoods. This project uses a subset of the full database consisting of 172 observations and 39 variables. The University of Maryland Digital Curation Innovation Center (DCIC) is in the process of crowdsourcing the transcription of the digitized survey images, and this project focuses on the cities for which corrected data is available.

Initial Thoughts
At the start of the project, we developed several goals:
● Identifying potential variables and stronger features;

● Normalizing and recoding selected variables as needed;

● Correlation testing of chosen variables for linear regression;

● Developing and fitting models (random forests, decision trees and support vector
machines (SVM)) that accurately predict the grading of a neighborhood based on
economic, geographic and racial data;

● Determining the priority columns for transcription correction.

The first step is getting to know the columns in this dataset:

```
library(ggplot2)       # plots
library(rpart)         # decision trees
library(rattle)        # fancyRpartPlot
library(e1071)         # support vector machines
library(randomForest)  # random forests

data_final <- read.csv("20170428_NormalizedData_MappingProject.csv")
names(data_final)
```

```
[1] "ID" "State" "City"
[4] "Security_Grade" "Terrain_Description" "Terrain_high"
[7] "Terrain_hilly" "Terrain_flat" "Favorable_Influences"
[10] "Detrimental_Influences" "detriment_far_city" "detriment_far_stores"
[13] "detriment_far_schools" "detriment_far_bus" "detriment_far_church"
[16] "detriment_lack_police_fire" "detriment_close_rr" "INHABITANTS_Type"
[19] "InhType_business" "InhType_clerical" "InhType_professional"
[22] "InhType_mechanics" "InhType_retired" "InhType_gov"
[25] "InhType_white_collar" "InhType_labor" "InhType_railroad"
[28] "InhType_mixture" "INHABITANTS_Annual_Income" "Median_Income"
[31] "INHABITANTS_Foreignborn" "INHABITANTS_F" "INHABITANTS_Negro"
[34] "INHABITANTS_N" "INHABITANTS_Infiltration" "Infiltration_Normalized"
[37] "Ten_Fifteen_Desirability" "Desirability_Normalized" "Remarks"
```

Since the analysis concerns the loans that were handed out, it makes sense to look at the distribution of income as recorded in the survey:

`summary(data_final$Median_Income)`

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    300    1700    3000   35710    5000 5000000      14 
```

Looking at the dataset, we find that the median is $3,000 whereas the mean is $35,710, with the maximum 'Median Income' being $5,000,000, which suggests that this value must be an outlier.
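
The plots below use a cleaned data frame, `data_final2`, whose construction is not shown; a minimal sketch of what that step presumably looks like, assuming `data_final2` is simply `data_final` with the $5,000,000 row dropped (toy values stand in for the survey file):

```r
## Sketch of the outlier removal on a toy column; on the real data the
## analogous call would be (an assumption, not shown in the original code):
## data_final2 <- subset(data_final, is.na(Median_Income) | Median_Income < 1e6)
toy <- data.frame(Median_Income = c(300, 1700, 3000, 5000000, 5000, NA))
cleaned <- subset(toy, is.na(Median_Income) | Median_Income < 1e6)
nrow(cleaned)  # 5 rows: only the $5,000,000 entry is gone
```

The `is.na()` guard keeps rows with missing income; without it, `subset()` would silently drop them as well.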

A general assumption the surveyors made during the survey was that areas with a higher concentration of Black families would necessarily be poor and, as such, should be awarded a lower grade. Below is a plot showing the extent to which this assumption held.

```
## Plot of the mean percent of Black families by security grade
ggplot(data = data_final2, aes(Security_Grade, INHABITANTS_N)) +
  geom_bar(stat = "summary", fun = "mean") +
  theme_light() +
  ylab("Average percent of Black Families") +
  xlab("Security Grade")
```

Another assumption is that the higher the income, the higher the security grade should be. Below is a plot to test this.

```
## Plot of mean income by security grade
ggplot(data = data_final2, aes(Security_Grade, Median_Income)) +
  geom_bar(stat = "summary", fun = "mean")
```

```
## Plot of the count of each security grade
ggplot(data_final, aes(Security_Grade)) + geom_bar() + theme_light()
```

Looking at the plots above, it is clear that areas with a considerable percentage of Black families were blanketed with a security grade of 'D'.

It makes sense, then, to test whether an increase in the percentage of Black families was actually associated with an area being worse off.

For this I constructed a linear regression model to assess the degree to which the median income and the percentage of Black families were correlated.

```
### Baseline model: always predict the most common grade ("B")
base <- length(data_final$Security_Grade[data_final$Security_Grade == "B"])/length(data_final$Security_Grade)
```

The baseline model has an accuracy of 0.395.
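
The snippet above hard-codes grade "B" as the most common class. A more general way to compute a majority-class baseline (shown here on a toy label vector, not the real data) is:

```r
## Majority-class baseline: always predict the most frequent grade
grades <- factor(c("A", "B", "B", "C", "B", "D"))
baseline_acc <- max(table(grades)) / length(grades)
baseline_acc  # 0.5 on this toy vector
```

Any fitted model should beat this number to be worth keeping.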

First, I removed the outlier in Median Income: a single area with an income of $5,000,000. I suspect a transcription error here, as no other area has a median income anywhere close to this.

Below is a scaled plot of the median income against the percent of Black families:

```
## Scaled scatterplot (data_final2 has the income outlier removed)
plot(scale(data_final2$INHABITANTS_N), scale(data_final2$Median_Income))
```

Observation number 2 is the area with the $5,000,000 median income.

Next, a linear model to test whether there is any relationship between the two at all:

```
### Linear model on the unaltered data
lm7 <- lm(log(Median_Income) ~ INHABITANTS_N, data= data_final)
summary(lm7)
```

```
Call:
lm(formula = log(Median_Income) ~ INHABITANTS_N, data = data_final)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4396 -0.5857 -0.1371  0.3738  7.2815 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.143421   0.079020 103.055  < 2e-16 ***
INHABITANTS_N -0.023545   0.005224  -4.507 1.31e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9387 on 150 degrees of freedom
  (20 observations deleted due to missingness)
Multiple R-squared:  0.1193,	Adjusted R-squared:  0.1134 
F-statistic: 20.32 on 1 and 150 DF,  p-value: 1.314e-05
```

A preliminary analysis shows that the median income and the percentage of Black families in an area are, in fact, inversely related. However, if you look at the plot, most of the data points are clustered on the left side, in areas with no Black population, while a few areas towards the right of the plot have a very large Black population and a fairly low median income. These data points seemingly have high leverage, and further analysis needs to be done on them.

```
plot(data_final$INHABITANTS_N, log(data_final$Median_Income), xlab = "Percent of Black Families", ylab = "Log of the Median Income")
abline(lm7)
```

`plot(lm7)`
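
`plot(lm7)` includes a residuals-vs-leverage panel; the leverage of individual observations can also be checked numerically with `hatvalues()`. A self-contained sketch on synthetic data (the same two calls apply directly to `lm7`):

```r
## Flag high-leverage points; 2 * mean(leverage) is a common rule-of-thumb cutoff
set.seed(1)
x <- c(rnorm(20), 8)         # one extreme predictor value, like the
y <- 2 * x + rnorm(21)       # high-percentage areas on the right of the plot
fit <- lm(y ~ x)
lev <- hatvalues(fit)
which(lev > 2 * mean(lev))   # flags only the extreme point (index 21)
```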

```
### Refit after removing the high-leverage points
### (keeping only areas with 0 < INHABITANTS_N < 75)
linearPlot6 <- subset(data_final, INHABITANTS_N > 0 & INHABITANTS_N < 75)
lm6 <- lm(log(Median_Income) ~ INHABITANTS_N, data = linearPlot6)
summary(lm6)
```

```
Call:
lm(formula = log(Median_Income) ~ INHABITANTS_N, data = linearPlot6)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71765 -0.25812 -0.04348  0.18599  0.98666 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.146204   0.141957  50.341   <2e-16 ***
INHABITANTS_N -0.006325   0.008556  -0.739    0.469    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3985 on 19 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.02796,	Adjusted R-squared:  -0.0232 
F-statistic: 0.5465 on 1 and 19 DF,  p-value: 0.4688
```

```
plot(linearPlot6$INHABITANTS_N, log(linearPlot6$Median_Income), xlab = "Percent of Black Families", ylab = "Log of the Median Income")
abline(lm6)
```

`plot(lm6)`

Notice that once the high-leverage points are removed, the coefficient on INHABITANTS_N is no longer statistically significant (p ≈ 0.47): the apparent relationship in the full data is driven largely by those few areas.

Moving on to the predictive models, we have used an 80-20 train-test split.

```
index <- 1:nrow(data_final)
testindex <- sample(index, trunc(length(index)/5))
testset <- data_final[testindex,]
trainset <- data_final[-testindex,]
```
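
One caveat: `sample()` is random, so this split (and hence the accuracies reported below) will differ between runs unless a seed is fixed first, e.g.:

```r
## Fixing the RNG seed makes the 80-20 split reproducible
set.seed(42)                      # any fixed value works
index <- 1:100                    # toy stand-in for 1:nrow(data_final)
testindex <- sample(index, trunc(length(index) / 5))
length(testindex)                 # 20 of the 100 rows go to the test set
```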

For our first analysis we used decision trees, taking all the normalized variables and trying to predict the security grade from them.

```
## Fit a classification tree on all normalized predictors
## (data is passed explicitly via data = trainset, so attach() is not needed)
mod1 <- rpart(Security_Grade ~ State + Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class")
## Plot the fitted tree
fancyRpartPlot(mod1)
```

```
mod1Predict <- predict(mod1, newdata = testset, type = "class")
tab1 <- table(testset$Security_Grade, mod1Predict)
accMod1 <- sum(diag(tab1))/sum(tab1)
### Accuracy of the model: 0.676

## Second tree: the same predictors, minus State
TreeMod2 <- rpart(Security_Grade ~ Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class")
### Accuracy of the model: 0.705 (computed below)
fancyRpartPlot(TreeMod2)
```

```
Treemod2Predict <- predict(TreeMod2, newdata = testset, type = "class")
tab2 <- table(testset$Security_Grade, Treemod2Predict)
accMod2 <- sum(diag(tab2))/sum(tab2)
### Accuracy of the model: 0.705

## Third tree: inhabitant-type and economic predictors only
TreeMod3 <- rpart(Security_Grade ~ InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainset, method = "class")
## Plot the fitted tree
fancyRpartPlot(TreeMod3)
```

```
Treemod3Predict <- predict(TreeMod3, newdata = testset, type = "class")
tab3 <- table(testset$Security_Grade, Treemod3Predict)
accMod3 <- sum(diag(tab3))/sum(tab3)
### Accuracy of the model: 0.705
```

The best accuracy we obtained from decision trees was 70.58%.

```
## Drop NA rows from the test set (svm() handles NAs in training via na.action)
SvmTest_set <- na.omit(testset)
SvmMod1 <- svm(Security_Grade ~ State + Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor + InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad + InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data = trainset, kernel = "radial")
## Calculating accuracy
SvmMod1Pred <- predict(SvmMod1, newdata = SvmTest_set)
SvmTab1 <- table(SvmTest_set$Security_Grade, SvmMod1Pred)
accSVm1 <- sum(diag(SvmTab1))/sum(SvmTab1)
## Second SVM: a smaller predictor set with a linear kernel
SvmModel2 <- svm(Security_Grade ~ InhType_labor + InhType_professional + Median_Income + INHABITANTS_N + Infiltration_Normalized , data= trainset, kernel = "linear")
SvmModelPred2 <- predict(SvmModel2, newdata = SvmTest_set)
SvmTab2 <- table(SvmTest_set$Security_Grade, SvmModelPred2)
accSVM2 <- sum(diag(SvmTab2))/sum(SvmTab2)
summary(SvmModel2)
```

```
Call:
svm(formula = Security_Grade ~ InhType_labor + InhType_professional + Median_Income + 
    INHABITANTS_N + Infiltration_Normalized, data = trainset, kernel = "linear")

Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.09090909 

Number of Support Vectors:  99

 ( 27 41 23 8 )

Number of Classes:  4 

Levels: 
 A B C D
```

`summary(SvmModelPred2)`

```
 A  B  C  D 
10  8  4  1 
```

With SVM, the highest accuracy we obtained was 74.07%.

```
## randomForest cannot handle NAs, so drop incomplete rows
trainCom <- na.omit(trainset)
testCom <- na.omit(testset)
rfMod1 <- randomForest(Security_Grade ~ Terrain_high + Terrain_flat + Terrain_hilly + detriment_close_rr + detriment_far_bus + detriment_far_church + detriment_far_city + detriment_far_schools + detriment_far_stores + detriment_lack_police_fire + InhType_business + InhType_labor+ InhType_mixture + InhType_white_collar + InhType_retired + InhType_clerical + InhType_railroad +InhType_mechanics + InhType_gov + InhType_professional + Median_Income + INHABITANTS_F + INHABITANTS_N + Infiltration_Normalized + Desirability_Normalized, data= trainCom, importance = TRUE, ntree = 2000, proximity = TRUE)
##Calculation of accuracy
rfpredicted <- predict(rfMod1, newdata = testCom)
rfTab <- table(testCom$Security_Grade, rfpredicted)
rfAccuracy <- sum(diag(rfTab))/sum(rfTab)
varImpPlot(rfMod1)
```