https://github.com/jsonzilla/course_blight_capstone

Data At Scale: Capstone Project - Final report Predicting relation between crime and blight

# Blight Fight Capstone Project

## Introduction

The Blight Fight Capstone Project aims to develop a machine learning model that predicts the relation between building blight and crime in Detroit. Blight is a serious problem in Detroit, and it can harm the city's economy and its residents. The goal of this project is a model that helps city planners identify buildings that are at risk of becoming blighted so that they can take preventive action.

## Data Analysis
The data used in this project was obtained from the City of Detroit's Open Data Portal. It covers blight violations, demolition permits, 311 calls, and crime incidents. The data was cleaned and preprocessed to remove errors and inconsistencies.

### Source Data
The original data can be found at the following links:
* [Blight Violations](https://d3c33hcgiwev3.cloudfront.net/_97bd1c1e5df9537bb13398c9898deed7_detroit-blight-violations.csv?Expires=1525305600&Signature=eFD51fReW6lMTqFcCeDbQ1kgqi8sYAwZP7vMgVhsOOlv2QDrs61mX1tIBwdaqjaCVezWE30qjVPBzimiEafX2itPJiEPovHCzlOWMvnPe8x5RIeS-SbU0yPbG6~AHF7Cb0XITaoKlVmnMIAtiLstPqOOYfdhjUILu5~1dRNu8SE_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A) - This data was obtained from the City of Detroit's Open Data Portal. It contains information on blight violations in the city of Detroit.
* [Demolition Permits](https://d3c33hcgiwev3.cloudfront.net/_dcebfb2135a2bf5a6392493bd61aba22_detroit-demolition-permits.tsv?Expires=1525305600&Signature=f1GJGdUQshH1-g6GhDS-yUETWwkRis45gX9L85AGoXEuDHSzEhfdXWdo~tEbD~XHSXeRMNrgogVy-EYmO8t~7nfTnYt96LAcNxe73BTgvGdBDO~LdvA222Z9zs0sRxZAifSPbXnsW9jbxlDaqdnZNvc~zOkWxCN9aYQ2sBFujYQ_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A) - This data was obtained from the City of Detroit's Open Data Portal. It contains information on demolition permits in the city of Detroit.
* [311 Calls](https://d3c33hcgiwev3.cloudfront.net/_dcebfb2135a2bf5a6392493bd61aba22_detroit-311.csv?Expires=1525305600&Signature=al11Hk1qn7CACblYpryhkWbcoHm6dXQkRw~Mu7y6ONBEDy0rdQIjLHvwN8IqOsV6OFNOudDrinXBEz-m5Q9A2GZrvR0~bbnuZbsFd7y7eQTv~o-NazxOfLPwRUwViGCxSxcQhwBTiV0VjySP4fgExZwFX2xVwQqInjrKQYHs8iM_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A) - This data was obtained from the City of Detroit's Open Data Portal. It contains information on 311 calls in the city of Detroit.
* [Crimes](https://d3c33hcgiwev3.cloudfront.net/_dcebfb2135a2bf5a6392493bd61aba22_detroit-crime.csv?Expires=1525305600&Signature=N0vg~4Jgqn6b-w-UEAQ3q0v3EbzgM9akDsHbn2o54WCdzh0Zzw~yJrcoBTHzkpBu0b3e4IiEFHUdKgVjryQRumLznIbaqEe146IC7FJqq8QqYWHVLcfquK9f3IXfCzoSNGeXGPWgT2l0BbzwAR6hxHeKxAxrXEajM7YWNwwIL7U_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A) - This data was obtained from the City of Detroit's Open Data Portal. It contains information on crimes in the city of Detroit.

## Cleaning the data

The first step is to clean the data. We will remove the columns that are not useful for our analysis and remove the rows that have missing values.

### Import the base libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
```

### Auxiliary functions
Functions to filter and extract the data from the datasets.

#### Filter date limits
Filter the data from the datasets by the date limits.

```python
def filter_date_limits(df: pd.DataFrame) -> pd.DataFrame:
    """
    Filter the dataframe to only include dates between 2013-01-01 and 2017-01-01
    """
    return df[(df["DATE"] <= datetime.datetime(2017,1,1)) & (df["DATE"] >= datetime.datetime(2013,1,1))]
```
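
As a small illustration of the date limits, the sketch below applies the same inclusive range to a few made-up rows. It uses `Series.between` as an equivalent way to write the comparison; the sample dates are hypothetical and not taken from the datasets.

```python
import pandas as pd

# Hypothetical sample rows; only the DATE column matters here.
sample = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["2012-12-31", "2013-01-01", "2015-06-03", "2017-01-01", "2017-01-02"]
    ),
})

# Series.between is inclusive on both ends by default,
# which matches the <= / >= comparison in filter_date_limits.
in_range = sample[sample["DATE"].between(pd.Timestamp("2013-01-01"), pd.Timestamp("2017-01-01"))]
kept_dates = in_range["DATE"].dt.strftime("%Y-%m-%d").tolist()
print(kept_dates)
```

Both boundary dates are kept, so a record dated exactly 2017-01-01 still passes the filter.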

#### Filter geographic limits
Filter by a pseudo-geographical bounding box, to get a better precision in the results.

```python
def filter_geo_limits(df: pd.DataFrame) -> pd.DataFrame:
    """
    Filter the dataframe to only include locations within the Detroit city limits
    """
    df = df[(df['GEO_LAT'] > 42.25) & (df['GEO_LAT'] < 42.47)]
    df = df[(df['GEO_LON'] > -83.3) & (df['GEO_LON'] < -82.9)]
    df = df.dropna(subset=['GEO_LAT', 'GEO_LON'])
    return df
```

#### Extract geolocation
Extract the geolocation from the dataset. Some data has a format like ```SOMETHING (LAT, LON)```, so we need to extract the latitude and longitude from the string.

```python
def extract_geo_location(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """
    Extract the geo location from the given label and add it to the dataframe
    """
    # Work on a copy so the caller's dataframe is not modified in place
    df = df.dropna(subset=[label]).copy()
    # Run the regex once: group 0 is the latitude, group 1 the (negative) longitude
    coords = df[label].str.extract(r"(\d+\.\d+),\s(-\d+\.\d+)", expand=True)
    df["GEO_LAT"] = coords[0].astype(float)
    df["GEO_LON"] = coords[1].astype(float)
    return df.dropna(subset=['GEO_LAT', 'GEO_LON'])
```
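
To make the pattern concrete, here is a minimal sketch that runs the same regular expression on a few hypothetical location strings. The addresses are invented for illustration; only the `(LAT, LON)` suffix shape follows the description above.

```python
import pandas as pd

# Hypothetical location strings in the "SOMETHING (LAT, LON)" shape.
sample = pd.DataFrame({
    "site_location": [
        "123 Example St (42.404182, -82.988822)",
        "456 Sample Ave (42.331681, -83.047996)",
        "No coordinates here",
    ]
})

# Same pattern as extract_geo_location: a decimal number, a comma,
# one whitespace character, then a negative decimal number.
coords = sample["site_location"].str.extract(r"(\d+\.\d+),\s(-\d+\.\d+)", expand=True)
sample["GEO_LAT"] = coords[0].astype(float)
sample["GEO_LON"] = coords[1].astype(float)

# Rows without a match end up as NaN and are dropped.
parsed = sample.dropna(subset=["GEO_LAT", "GEO_LON"])
print(parsed[["GEO_LAT", "GEO_LON"]])
```

Note that the pattern only accepts a negative longitude, which holds for Detroit but would need adjusting for other regions.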

#### Create a geolocation grid
Create a data structure that describes the grid of geolocations. The grid covers the bounding box used by `filter_geo_limits`, which is treated as a square of 26,300 m per side, and is divided into square tiles of `tile_size` meters. It is a pseudo grid rather than a regular one because the conversion from meters to degrees of latitude and longitude is only approximate.

```python
def create_geo_location_grid(tile_size: int):
    """
    Create a grid of the given tile size for the Detroit city limits
    """
    grid = {"lat":[42.25,42.47],"lon":[-83.3,-82.9],"x": 26300,"y": 26300}
    lat = (grid["lat"][1]-grid["lat"][0])*tile_size/grid["y"]
    lon = (grid["lon"][1]-grid["lon"][0])*tile_size/grid["x"]
    x = int(grid["x"]/tile_size) + 1
    y = int(grid["y"]/tile_size) + 1
    return {"lat":lat,"lon":lon,"x":x,"y":y,"factor":tile_size}
```
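
As a rough check of the arithmetic, the following sketch recomputes the grid values for `tile_size=30` step by step. The variable names are illustrative; the constants are copied from `create_geo_location_grid`.

```python
# Bounding box and nominal side length copied from create_geo_location_grid.
lat_min, lat_max = 42.25, 42.47
lon_min, lon_max = -83.3, -82.9
side_m = 26300
tile_size = 30

lat_step = (lat_max - lat_min) * tile_size / side_m  # tile height in degrees of latitude
lon_step = (lon_max - lon_min) * tile_size / side_m  # tile width in degrees of longitude
tiles_per_side = int(side_m / tile_size) + 1         # number of tiles along each axis

print(round(lat_step, 6), round(lon_step, 6), tiles_per_side)
# prints: 0.000251 0.000456 877
```

A 30 m tile therefore corresponds to about 0.00025 degrees of latitude and 0.00046 degrees of longitude under this approximation.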

#### Convert geography coordinates to the pseudo grid coordinates
Convert the geographical coordinates to the pseudo grid coordinates. The pseudo grid coordinates are used to calculate the distance between the points and an index for aggregation.

```python
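def convert_geo_location_to_grid(df: pd.DataFrame, grid: dict) -> pd.DataFrame:
    """
    Convert the geo location to a grid location
    """
    # grid["lon"] and grid["lat"] are the tile steps in degrees, so each expression
    # equals coordinate/step - 1. The indices are consistent across datasets,
    # but they are not zero-based offsets from the bounding-box corner.
    x = ((df["GEO_LON"]-grid["lon"])/grid["lon"]).astype(int)
    y = ((df["GEO_LAT"]-grid["lat"])/grid["lat"]).astype(int)
    df.insert(0, "GEO_INDEX", x + y*grid["x"])
    return df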
```
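
The index formula can be traced by hand for a single point. The sketch below repeats the calculation in plain Python for the coordinates of the first row of the permits output shown later in this document, assuming `tile_size=30`; the variable names are illustrative.

```python
# Grid values for tile_size=30, computed as in create_geo_location_grid.
tile_size = 30
lat_step = (42.47 - 42.25) * tile_size / 26300
lon_step = (-82.9 - (-83.3)) * tile_size / 26300
tiles_per_row = int(26300 / tile_size) + 1

# Coordinates of the first permit row.
lat, lon = 42.404182, -82.988822

# int() truncates toward zero, like .astype(int) on a float Series.
x = int((lon - lon_step) / lon_step)
y = int((lat - lat_step) / lat_step)
geo_index = x + y * tiles_per_row

print(x, y, geo_index)
# prints: -181884 168973 148007437
```

The result matches the `GEO_INDEX` value reported for that row, which confirms that the index is a row-major combination of the two tile coordinates.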

### Cleaning process
The first step is to create the pseudo grid. The grid object is shared by all processing steps and is a dictionary with the following keys:
* lat: height of one tile in degrees of latitude
* lon: width of one tile in degrees of longitude
* x: number of tiles along the x (longitude) axis
* y: number of tiles along the y (latitude) axis
* factor: the tile size in meters

```python
{"lat":lat,"lon":lon,"x":x,"y":y,"factor":tile_size}
```

```python
grid = create_geo_location_grid(tile_size=30)
```

#### Cleaning Permits Data
After reading the data, we extract the date when the permit was applied for, the parcel size, and the parcel ground area. We then use the common functions to extract the geographic coordinates and convert them to the grid.

```python
permits = pd.read_csv("./data/detroit-demolition-permits.tsv", sep="\t")
permits = permits[permits["PERMIT_APPLIED"].str.contains("^[0-9]{2}/[0-9]{2}/[0-9]{2}$")]
permits["DATE"] = pd.to_datetime(permits["PERMIT_APPLIED"], format="%m/%d/%y")
permits["PARCEL_SIZE"] = permits["PARCEL_SIZE"].astype(float).fillna(0)
permits["PARCEL_GROUND_AREA"] = permits["PARCEL_GROUND_AREA"].astype(float).fillna(0)
permits = extract_geo_location(permits, "site_location")
permits = filter_geo_limits(permits)
permits = filter_date_limits(permits)
permits = convert_geo_location_to_grid(permits, grid)
permits = permits[["GEO_INDEX", "GEO_LAT", "GEO_LON", "DATE", "PARCEL_SIZE", "PARCEL_GROUND_AREA"]]

permits.head()
```

| | GEO_INDEX | GEO_LAT | GEO_LON | DATE | PARCEL_SIZE | PARCEL_GROUND_AREA |
|---|---|---|---|---|---|---|
| 1517 | 148007437 | 42.404182 | -82.988822 | 2014-12-19 | 6011.0 | 982.0 |
| 1518 | 147972940 | 42.394451 | -83.123028 | 2014-12-19 | 3920.0 | 0.0 |
| 1519 | 148056587 | 42.418207 | -82.971459 | 2014-12-19 | 3877.0 | 792.0 |
| 1520 | 148056587 | 42.418207 | -82.971459 | 2014-12-19 | 3877.0 | 792.0 |
| 1521 | 147753854 | 42.331681 | -83.047996 | 2014-12-19 | 3006.0 | 960.0 |

#### Cleaning Crimes Data
After reading the data, we extract the date of the incident. We then use the common functions to filter the geographic coordinates and convert them to the grid.

```python
crimes = pd.read_csv("./data/detroit-crime.csv", low_memory=False)
crimes['GEO_LON'] = crimes['LON'].astype(float)
crimes['GEO_LAT'] = crimes['LAT'].astype(float)
crimes['DATE'] = pd.to_datetime(crimes['INCIDENTDATE'], format='%m/%d/%Y %I:%M:%S %p')

crimes = filter_geo_limits(crimes)
crimes = filter_date_limits(crimes)
crimes = convert_geo_location_to_grid(crimes, grid)
crimes = crimes[['GEO_INDEX', 'GEO_LON', 'GEO_LAT', 'CATEGORY', 'DATE']]

crimes.head()
```

| | GEO_INDEX | GEO_LON | GEO_LAT | CATEGORY | DATE |
|---|---|---|---|---|---|
| 0 | 147879980 | -83.1221 | 42.3678 | ASSAULT | 2015-06-03 |
| 1 | 147895587 | -83.2035 | 42.3724 | LARCENY | 2015-03-01 |
| 2 | 148110845 | -83.0241 | 42.4338 | STOLEN VEHICLE | 2015-02-08 |
| 3 | 147815923 | -83.1381 | 42.3496 | WEAPONS OFFENSES | 2015-11-09 |
| 4 | 147810813 | -83.0692 | 42.3481 | LARCENY | 2015-08-14 |

#### Cleaning Violations Data
After reading the data, we extract the date when the ticket was issued, then extract all the monetary columns to look for correlations. We then use the common functions to extract the geographic coordinates and convert them to the grid.

```python
violations = pd.read_csv('./data/detroit-blight-violations.csv', low_memory=False)
violations = violations[violations["TicketIssuedDT"].str.contains("^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}")]
violations["DATE"] = pd.to_datetime(violations["TicketIssuedDT"], format='%m/%d/%Y %I:%M:%S %p')
violations = extract_geo_location(violations, "ViolationAddress")
violations = filter_geo_limits(violations)
violations = filter_date_limits(violations)
# Strip the literal "$" from every monetary column; regex=False keeps "$" from
# being interpreted as the end-of-string anchor
money_columns = ['FineAmt', 'AdminFee', 'LateFee', 'StateFee', 'CleanUpCost', 'JudgmentAmt']
for column in money_columns:
    violations[column] = violations[column].str.replace('$', '', regex=False).astype(float).fillna(0)
violations = violations[['GEO_LAT', 'GEO_LON', 'DATE', 'FineAmt', 'AdminFee', 'LateFee', 'StateFee', 'CleanUpCost', 'JudgmentAmt']]
violations = violations.dropna()
violations = convert_geo_location_to_grid(violations, grid)

violations.head()
```

| | GEO_INDEX | GEO_LAT | GEO_LON | DATE | FineAmt | AdminFee | LateFee | StateFee | CleanUpCost | JudgmentAmt |
|---|---|---|---|---|---|---|---|---|---|---|
| 263034 | 147948266 | 42.387481 | -83.176853 | 2013-01-09 | 100.0 | 20.0 | 10.0 | 10.0 | 0.0 | 140.0 |
| 263035 | 147938619 | 42.384741 | -83.176734 | 2013-01-09 | 50.0 | 20.0 | 5.0 | 10.0 | 0.0 | 85.0 |
| 263036 | 147940373 | 42.385158 | -83.176818 | 2013-01-09 | 50.0 | 20.0 | 5.0 | 10.0 | 0.0 | 85.0 |
| 263037 | 147945638 | 42.386708 | -83.175247 | 2013-01-09 | 100.0 | 20.0 | 10.0 | 10.0 | 0.0 | 140.0 |
| 263038 | 147941256 | 42.385398 | -83.173928 | 2013-01-09 | 50.0 | 20.0 | 5.0 | 10.0 | 0.0 | 85.0 |
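
The monetary columns are stored as text and need their currency symbol removed before they can be summed. The sketch below shows that step on a few hypothetical fee strings; the exact formatting of the real columns is an assumption.

```python
import pandas as pd

# Hypothetical fee strings; the real column contents may differ.
fees = pd.Series(["$100.00", "$20.00", None, "$1250.50"])

# regex=False treats "$" as a literal character instead of the
# end-of-string anchor it would be in a regular expression.
cleaned = fees.str.replace("$", "", regex=False).astype(float).fillna(0)
print(cleaned.tolist())
# prints: [100.0, 20.0, 0.0, 1250.5]
```

Missing values become 0 here, which mirrors the `fillna(0)` used in the cleaning step; whether a missing fee should count as zero is a modeling choice.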

#### Cleaning D311 Data
After reading the data, we take the date from the `acknowledged_at` column. We then use the common functions to filter the geographic coordinates and convert them to the grid.

```python
issues = pd.read_csv('./data/detroit-311.csv')
issues['issue_type'] = issues['issue_type'].astype('category')
issues['DATE'] = pd.to_datetime(issues['acknowledged_at'], format='%m/%d/%Y %I:%M:%S %p')
issues['GEO_LAT'] = issues['lat'].astype('float')
issues['GEO_LON'] = issues['lng'].astype('float')

issues = filter_geo_limits(issues)
issues = filter_date_limits(issues)
issues = convert_geo_location_to_grid(issues, grid)
issues = issues[['GEO_INDEX', 'GEO_LAT', 'GEO_LON', 'DATE', 'issue_type']]
issues.head()
```

| | GEO_INDEX | GEO_LAT | GEO_LON | DATE | issue_type |
|---|---|---|---|---|---|
| 0 | 147936022 | 42.383998 | -83.161039 | 2015-03-06 22:03:38 | Clogged Drain |
| 1 | 148133523 | 42.440471 | -83.080919 | 2015-03-11 16:23:11 | Clogged Drain |
| 2 | 148150446 | 42.445244 | -82.962038 | 2015-03-11 15:39:05 | Clogged Drain |
| 3 | 148065807 | 42.421043 | -83.166194 | 2015-03-11 15:35:02 | Clogged Drain |
| 4 | 147999162 | 42.402033 | -83.162874 | 2015-03-11 15:04:59 | Clogged Drain |

## Processing: Aggregation and Grouping

### Counter functions

The grouping and counting functions use GEO_INDEX as the key and keep GEO_LAT and GEO_LON as coordinates, so that the aggregated data can later be searched and analyzed by location.

The first step is to use the functions below to keep only the columns of interest.
All date columns are dropped because they are not used in the analysis.

```python
def create_base_df_counter(df : pd.DataFrame) -> pd.DataFrame:
    base_df = pd.DataFrame(index=df.index)
    base_df["GEO_INDEX"] = df["GEO_INDEX"]
    base_df["GEO_LAT"] = df["GEO_LAT"]
    base_df["GEO_LON"] = df["GEO_LON"]
    return base_df

def create_violations_df_counter(df : pd.DataFrame) -> pd.DataFrame:
    base_df = pd.DataFrame(index=df.index)
    base_df["GEO_INDEX"] = df["GEO_INDEX"]
    base_df["GEO_LAT"] = df["GEO_LAT"]
    base_df["GEO_LON"] = df["GEO_LON"]
    base_df["FineAmt"] = df["FineAmt"]
    base_df["AdminFee"] = df["AdminFee"]
    base_df["LateFee"] = df["LateFee"]
    base_df["StateFee"] = df["StateFee"]
    base_df["CleanUpCost"] = df["CleanUpCost"]
    base_df["JudgmentAmt"] = df["JudgmentAmt"]
    return base_df

def create_permits_df_counter(df : pd.DataFrame) -> pd.DataFrame:
    base_df = pd.DataFrame(index=df.index)
    base_df["GEO_INDEX"] = df["GEO_INDEX"]
    base_df["GEO_LAT"] = df["GEO_LAT"]
    base_df["GEO_LON"] = df["GEO_LON"]
    base_df["PARCEL_SIZE"] = df["PARCEL_SIZE"]
    base_df["PARCEL_GROUND_AREA"] = df["PARCEL_GROUND_AREA"]
    return base_df
```

### Aggregate functions

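Group the rows by GEO_INDEX, count the rows in each group, and calculate the average latitude and longitude of each GEO_INDEX for the next step. The `add_features` and `merge` functions then sum, for each row of one dataset, the values of another dataset that fall inside a window of `factor` tiles around that row.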

```python
def aggregate_by_geo_index(df : pd.DataFrame) -> pd.DataFrame:
    df = df.groupby("GEO_INDEX").agg({
        "GEO_INDEX": "count",
        "GEO_LAT": "mean",
        "GEO_LON": "mean",
    })
    return df

def aggregate_by_permit_type(df : pd.DataFrame) -> pd.DataFrame:
    df = df.groupby("GEO_INDEX").agg({
        "GEO_INDEX": "count",
        "GEO_LAT": "mean",
        "GEO_LON": "mean",
        "PARCEL_SIZE": "sum",
        "PARCEL_GROUND_AREA": "sum",
    })
    df.rename(columns={'GEO_INDEX': 'PERMITS'}, inplace=True)
    return df

def aggregate_by_geo_index_violations(df : pd.DataFrame) -> pd.DataFrame:
    df = df.groupby("GEO_INDEX").agg({
        "GEO_INDEX": "count",
        "GEO_LAT": "mean",
        "GEO_LON": "mean",
        "FineAmt": "sum",
        "AdminFee": "sum",
        "LateFee": "sum",
        "StateFee": "sum",
        "CleanUpCost": "sum",
        "JudgmentAmt": "sum",
    })
    df.rename(columns={'GEO_INDEX': 'VIOLATIONS'}, inplace=True)
    return df

def add_features(df, line, label, grid: dict, factor=25):
    # Half-widths of the search window around `line`, in degrees
    x = grid["lat"]*(factor+0.5)
    y = grid["lon"]*(factor+0.5)
    lat_sel = np.logical_and(df.GEO_LAT < line["GEO_LAT"]+x,df.GEO_LAT > line["GEO_LAT"]-x)
    # Use y (the longitude half-width) on both sides so the window is symmetric
    long_sel = np.logical_and(df.GEO_LON < line["GEO_LON"]+y,df.GEO_LON > line["GEO_LON"]-y)
    g_sel = np.logical_and(lat_sel,long_sel)
    return df.loc[g_sel,label].sum()

def merge(df1, df2, label, grid: dict):
    # For every row of df1, sum `label` over the rows of df2 inside its window
    a1 = df1.apply(lambda x: add_features(df2,x,label,grid),axis=1)
    a1.name = label
    return df1.merge(a1,left_index=True,right_index=True)
```
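
The window sum behind `add_features` can be tried on a toy example. The sketch below uses a hypothetical helper, `sum_in_window`, and invented coordinates and counts; the half-widths of roughly 0.0064 and 0.0116 degrees are an assumption that corresponds to 25.5 tiles of 30 m.

```python
import pandas as pd

# Hypothetical aggregated crime counts per grid cell
# (mean cell coordinates plus a count).
agg_crimes = pd.DataFrame({
    "GEO_LAT": [42.3000, 42.3010, 42.3500],
    "GEO_LON": [-83.1000, -83.1010, -83.1000],
    "CRIMES": [5, 7, 11],
})

def sum_in_window(df, lat, lon, half_lat, half_lon, label):
    """Sum `label` over rows strictly inside a box centred on (lat, lon)."""
    in_lat = (df["GEO_LAT"] > lat - half_lat) & (df["GEO_LAT"] < lat + half_lat)
    in_lon = (df["GEO_LON"] > lon - half_lon) & (df["GEO_LON"] < lon + half_lon)
    return df.loc[in_lat & in_lon, label].sum()

# The first two cells are close to the query point; the third is about
# 0.05 degrees of latitude away and falls outside the window.
nearby = sum_in_window(agg_crimes, 42.3005, -83.1005, 0.0064, 0.0116, "CRIMES")
print(nearby)
# prints: 12
```

Because the lookup is repeated for every row, the cost grows with the product of the two dataset sizes, which is worth keeping in mind for larger grids.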

### Processing
The same process is applied to all the datasets:
* read the datasets
* clean the datasets
* create counter using the datasets
* aggregate the datasets
* merge the datasets
* fill the missing values

```python
permits_count = create_permits_df_counter(permits)
agg_permits = aggregate_by_permit_type(permits_count)

violations_count = create_violations_df_counter(violations)
agg_violations = aggregate_by_geo_index_violations(violations_count)

crimes_count = create_base_df_counter(crimes)
agg_crimes = aggregate_by_geo_index(crimes_count)
agg_crimes.rename(columns={'GEO_INDEX': 'CRIMES'}, inplace=True)

issues_count = create_base_df_counter(issues)
agg_issues = aggregate_by_geo_index(issues_count)
agg_issues.rename(columns={'GEO_INDEX': 'ISSUES'}, inplace=True)

processed_data = merge(agg_permits, agg_crimes, "CRIMES", grid)
processed_data = merge(processed_data, agg_issues, "ISSUES", grid)
processed_data = merge(processed_data, agg_violations, "VIOLATIONS", grid)

processed_data = processed_data.fillna(0)
processed_data.head()
```

| GEO_INDEX | PERMITS | GEO_LAT | GEO_LON | PARCEL_SIZE | PARCEL_GROUND_AREA | CRIMES | ISSUES | VIOLATIONS |
|---|---|---|---|---|---|---|---|---|
| 147603664 | 1 | 42.288789 | -83.149756 | 28793.0 | 0.0 | 49 | 17 | 14 |
| 147609815 | 2 | 42.290677 | -83.144212 | 7230.0 | 0.0 | 62 | 22 | 23 |
| 147612480 | 1 | 42.291378 | -83.128821 | 3180.0 | 445.0 | 176 | 46 | 74 |
| 147651934 | 1 | 42.302566 | -83.133601 | 3006.0 | 1008.0 | 703 | 106 | 111 |
| 147651945 | 1 | 42.302684 | -83.128960 | 3006.0 | 616.0 | 795 | 119 | 146 |

## Modeling
A random forest regressor was trained to predict the number of nearby crimes (`CRIMES`) from the other columns of the processed data. A random forest is a type of ensemble model that combines the predictions of multiple decision trees. The model was trained on a training set of 80% of the data, and it was evaluated on a validation set of the remaining 20%.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

train, test = train_test_split(processed_data, test_size=0.2)

# train the model
model = RandomForestRegressor(n_estimators=300, max_depth=13, random_state=0)
model.fit(train.drop(columns=['CRIMES']), train['CRIMES'])

# predict the test data
predictions = model.predict(test.drop(columns=['CRIMES']))

# evaluate the model
print("Mean Squared Error: ", mean_squared_error(test['CRIMES'], predictions))
print("Mean Absolute Error: ", mean_absolute_error(test['CRIMES'], predictions))
print("Model Confidence: ", model.score(test.drop(columns=['CRIMES']), test['CRIMES']))

# plot the results
plt.scatter(test['CRIMES'], predictions)
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.show()
```

Mean Squared Error: 17795.52957708802
Mean Absolute Error: 95.73542779523734
Model Confidence: 0.8608635973486785


![png](output_33_1.png)
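
The value printed as "Model Confidence" comes from the `score` method, which for scikit-learn regressors returns the coefficient of determination (R²) rather than a classification accuracy. The sketch below illustrates this on synthetic data; the data and the smaller forest are assumptions made only to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate what score() returns.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# score() on a regressor is R^2 computed on the given data.
score = model.score(X_test, y_test)
r2 = r2_score(y_test, model.predict(X_test))
print(score, r2)
```

The two printed values agree, so the 0.86 reported above is best read as the share of variance in the crime counts that the model explains.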

## Results
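The random forest model achieved an R² score of about 0.86 on the validation set (the value printed as "Model Confidence"), with a mean absolute error of about 96 crimes. Because the model is a regressor, this is not a classification accuracy: it means the model explains roughly 86% of the variance in the number of crimes near each demolition permit location.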

## Conclusion
The results of this project suggest that it is possible to develop a machine learning model that relates blight indicators to crime in Detroit. The random forest model achieved an R² score of about 0.86 on the validation set, which is a promising result. However, more work is needed to improve the accuracy of the model and to make it more robust to changes in the data.

## Limitations
There are a number of limitations to this project. First, the data used was limited, and the relation between the datasets is not clear because their timelines differ. The data also did not include all the factors that can contribute to crime and blight, such as economic conditions in the city. Second, the model was only trained on data from Detroit, so it may not perform as well on data from other cities.

## Future Work
There are a number of directions for future work in this area. First, more data could be collected to improve the accuracy of the model. Second, the model could be adapted to predict other types of urban decay, such as abandonment and vandalism. Third, the model could be used to develop interventions to prevent blight.

Fourth, a time series model could be created to predict the relation between crime and blighted buildings in Detroit. Such a model could forecast the number of blighted buildings in the city, which could inform policy decisions about how to allocate resources to prevent blight and crime.