https://github.com/iyashwantsaini/911_capstone
For this capstone project we will be analyzing some 911 call data from Kaggle.
- Host: GitHub
- URL: https://github.com/iyashwantsaini/911_capstone
- Owner: iyashwantsaini
- License: mit
- Created: 2020-08-14T11:37:01.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-08-14T11:41:48.000Z (almost 5 years ago)
- Last Synced: 2025-01-16T22:30:01.381Z (5 months ago)
- Topics: capstone, data-science, data-visualization, python3
- Language: HTML
- Homepage:
- Size: 4.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# 911 Calls Capstone Project
In this capstone project we analyze some 911 call data from [Kaggle](https://www.kaggle.com/mchirico/montcoalert). The data contains the following fields:
* lat : String variable, Latitude
* lng: String variable, Longitude
* desc: String variable, Description of the Emergency Call
* zip: String variable, Zipcode
* title: String variable, Title
* timeStamp: String variable, YYYY-MM-DD HH:MM:SS
* twp: String variable, Township
* addr: String variable, Address
* e: String variable, Dummy variable (always 1)

## Data and Setup
____
**Import numpy and pandas**
```python
import numpy as np
import pandas as pd
```

**Import visualization libraries and set %matplotlib inline.**
```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

**Read in the csv file as a dataframe called df**
```python
df=pd.read_csv('911.csv')
```

**Check the info() of the df**
```python
df.info()
```

    RangeIndex: 99492 entries, 0 to 99491
    Data columns (total 9 columns):
     #   Column     Non-Null Count  Dtype
    ---  ------     --------------  -----
     0   lat        99492 non-null  float64
     1   lng        99492 non-null  float64
     2   desc       99492 non-null  object
     3   zip        86637 non-null  float64
     4   title      99492 non-null  object
     5   timeStamp  99492 non-null  object
     6   twp        99449 non-null  object
     7   addr       98973 non-null  object
     8   e          99492 non-null  int64
    dtypes: float64(3), int64(1), object(5)
    memory usage: 6.8+ MB

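
The non-null counts above imply missing entries in `zip`, `twp`, and `addr` (for example, 99492 - 86637 = 12855 rows have no zip code). A minimal sketch of how to count missing values per column directly, using a small made-up frame (`toy` and its values are illustrative assumptions, not rows from 911.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 911 data; values are illustrative only
toy = pd.DataFrame({
    'zip': [19525.0, 19446.0, np.nan, 19401.0],
    'twp': ['NEW HANOVER', 'HATFIELD TOWNSHIP', 'LOWER POTTSGROVE', None],
    'e': [1, 1, 1, 1],
})

# isna() marks missing cells; summing the booleans counts them per column
missing = toy.isna().sum()
print(missing)
```

On the real data, `df.isna().sum()` should report the same gaps that `df.info()` shows indirectly.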
**Check the head of df**
```python
df.head()
```

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 |

## Basic Questions
**What are the top 5 zipcodes for 911 calls?**
```python
df['zip'].value_counts().head(5)
```

    19401.0    6979
    19464.0    6643
    19403.0    4854
    19446.0    4748
    19406.0    3174
    Name: zip, dtype: int64

**What are the top 5 townships (twp) for 911 calls?**
```python
df['twp'].value_counts().head(5)
```

    LOWER MERION    8443
    ABINGTON        5977
    NORRISTOWN      5890
    UPPER MERION    5227
    CHELTENHAM      4575
    Name: twp, dtype: int64

**In the 'title' column, how many unique title codes are there?**
```python
df['title'].nunique()
```

    110

## Creating new features
**In the title column there are "Reasons/Departments" specified before the title code. These are EMS, Fire, and Traffic. Use .apply() with a custom lambda expression to create a new column called "Reason" that contains this string value.**
**For example, if the title column value is EMS: BACK PAINS/INJURY, the Reason column value would be EMS.**
```python
df['Reason']=df['title'].apply(lambda x : x.split(':')[0])
df.head()
```

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 | EMS |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 | EMS |

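
The `.apply()` with a lambda used above is what the exercise asks for. As a side note, pandas string methods can do the same split without a Python-level lambda. A minimal sketch on three titles taken from the table above (the `titles` Series is built here only for illustration):

```python
import pandas as pd

# Three title values copied from the head of the DataFrame
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'EMS: DIABETIC EMERGENCY',
    'Fire: GAS-ODOR/LEAK',
])

# Approach used in this notebook: element-wise lambda
reason_apply = titles.apply(lambda x: x.split(':')[0])

# Alternative: vectorized string accessor, then take the first piece
reason_str = titles.str.split(':').str[0]

print(reason_str.tolist())
```

Both approaches should give the same column; the `.str` form also passes missing values through as NaN instead of raising an error.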
**What is the most common Reason for a 911 call based off of this new column?**
```python
df['Reason'].value_counts().head(3)
```

    EMS        48877
    Traffic    35695
    Fire       14920
    Name: Reason, dtype: int64

**Now use seaborn to create a countplot of 911 calls by Reason.**
```python
sns.countplot(x='Reason',data=df)
```

___
**Now let us begin to focus on time information. What is the data type of the objects in the timeStamp column?**
```python
type(df['timeStamp'][0])
```

    str

**You should have seen that these timestamps are still strings. Use [pd.to_datetime](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) to convert the column from strings to DateTime objects.**
```python
df['timeStamp']=pd.to_datetime(df['timeStamp'])
type(df['timeStamp'][0])
```

    pandas._libs.tslibs.timestamps.Timestamp

**You can now grab specific attributes from a Datetime object by calling them. For example:**

    time = df['timeStamp'].iloc[0]
    time.hour

**You can use Jupyter's tab completion to explore the various attributes you can call. Now that the timeStamp column holds DateTime objects, use .apply() to create 3 new columns called Hour, Month, and Day of Week. You will create these columns based off of the timeStamp column; reference the solutions if you get stuck on this step.**
```python
df['timeStamp'][0]
```

    Timestamp('2015-12-10 17:40:00')

**Notice how the Day of Week is an integer 0-6. Use .map() with this dictionary to map the actual string names to the day of the week:**

    dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

```python
# Day of week: map the 0-6 integers from .dayofweek to short day names
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['day']=df['timeStamp'].apply(lambda x : dmap[x.dayofweek])
df.head(2)
```

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | Thu |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | Thu |

```python
# Month: integer 1-12 taken from each timestamp
df['month']=df['timeStamp'].apply(lambda x : x.month)
df.head(2)
```

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | day | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | Thu | 12 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | Thu | 12 |

```python
# Hour: integer 0-23 taken from each timestamp
df['hour']=df['timeStamp'].apply(lambda x : x.hour)
df.head(2)
```

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | day | month | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | Thu | 12 | 17 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | Thu | 12 | 17 |

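
The day, month, and hour columns above are built with `.apply()`, as the exercise requests. For comparison, here is a minimal sketch of the same three features using the pandas `.dt` accessor, which works on a whole datetime column at once. The two timestamps are copied from the table above; the `ts` and `features` names are illustrative only:

```python
import pandas as pd

# Two timestamps from the head of the DataFrame, parsed to datetimes
ts = pd.to_datetime(pd.Series(['2015-12-10 17:40:00', '2015-12-10 17:40:01']))

dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

# .dt exposes datetime attributes for every element of the Series
features = pd.DataFrame({
    'hour': ts.dt.hour,
    'month': ts.dt.month,
    'day': ts.dt.dayofweek.map(dmap),  # integer 0-6 mapped to a day name
})
print(features)
```

On the full DataFrame the equivalent would be `df['timeStamp'].dt.hour` and so on, assuming the column has already been converted with `pd.to_datetime`.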
**Now use seaborn to create a countplot of the Day of Week column with the hue based off of the Reason column.**
```python
sns.countplot(data=df,x='day',hue='Reason')
```

**Now do the same for Month:**
```python
sns.countplot(data=df,x='month',hue='Reason')
```

**Did you notice something strange about the Plot?**
_____
**You should have noticed that some months are missing. Let's see if we can fill in this information by plotting it another way, possibly as a simple line plot that bridges the missing months. To do this, we'll need to do some work with pandas.**
**Now create a groupby object called byMonth (named monthcount in the code below), where you group the DataFrame by the month column and use the count() method for aggregation. Use the head() method on this returned DataFrame.**
```python
monthcount=df.groupby('month').count()
# df.head(2)
monthcount
```

| month | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | day | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13205 | 13205 | 13205 | 11527 | 13205 | 13205 | 13203 | 13096 | 13205 | 13205 | 13205 | 13205 |
| 2 | 11467 | 11467 | 11467 | 9930 | 11467 | 11467 | 11465 | 11396 | 11467 | 11467 | 11467 | 11467 |
| 3 | 11101 | 11101 | 11101 | 9755 | 11101 | 11101 | 11092 | 11059 | 11101 | 11101 | 11101 | 11101 |
| 4 | 11326 | 11326 | 11326 | 9895 | 11326 | 11326 | 11323 | 11283 | 11326 | 11326 | 11326 | 11326 |
| 5 | 11423 | 11423 | 11423 | 9946 | 11423 | 11423 | 11420 | 11378 | 11423 | 11423 | 11423 | 11423 |
| 6 | 11786 | 11786 | 11786 | 10212 | 11786 | 11786 | 11777 | 11732 | 11786 | 11786 | 11786 | 11786 |
| 7 | 12137 | 12137 | 12137 | 10633 | 12137 | 12137 | 12133 | 12088 | 12137 | 12137 | 12137 | 12137 |
| 8 | 9078 | 9078 | 9078 | 7832 | 9078 | 9078 | 9073 | 9025 | 9078 | 9078 | 9078 | 9078 |
| 12 | 7969 | 7969 | 7969 | 6907 | 7969 | 7969 | 7963 | 7916 | 7969 | 7969 | 7969 | 7969 |

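
The grouped table makes the gap visible: its index jumps from 8 to 12. A minimal sketch of how to list the missing months programmatically and expose them as NaN rows, using the month index and the `twp` counts copied from the table above (the variable names are illustrative):

```python
import pandas as pd

# Months present in the grouped table, with their twp counts
months_present = [1, 2, 3, 4, 5, 6, 7, 8, 12]
twp_counts = pd.Series(
    [13203, 11465, 11092, 11323, 11420, 11777, 12133, 9073, 7963],
    index=pd.Index(months_present, name='month'),
)

# Months 1-12 that never appear in the index
missing_months = sorted(set(range(1, 13)) - set(months_present))
print(missing_months)  # [9, 10, 11]

# Reindexing to the full year inserts NaN for the absent months
full_year = twp_counts.reindex(range(1, 13))
print(full_year)
```

A line plot of `full_year` would show a break at the missing months, whereas plotting the original index simply connects August to December.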
**Now create a simple plot off of the dataframe indicating the count of calls per month.**
```python
monthcount.plot()
```

**Now see if you can use seaborn's lmplot() to create a linear fit on the number of calls per month. Keep in mind you may need to reset the index to a column.**
```python
sns.lmplot(data=monthcount.reset_index(),x='month',y='twp')
```
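
To see the numbers behind a straight-line fit like the one `lmplot()` draws, here is a minimal sketch that fits an ordinary least-squares line with `numpy.polyfit`. It uses the per-month `twp` counts from the grouped table above. This is an illustration of a linear fit under that assumption, not a reproduction of seaborn's internals:

```python
import numpy as np

# Month numbers and twp counts copied from the monthcount table
months = np.array([1, 2, 3, 4, 5, 6, 7, 8, 12])
twp_counts = np.array([13203, 11465, 11092, 11323, 11420, 11777, 12133, 9073, 7963])

# Degree-1 polynomial fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(months, twp_counts, deg=1)
print(f"calls ~ {slope:.1f} * month + {intercept:.1f}")
```

A negative slope here indicates fewer recorded calls in later calendar months. With only nine points and three months missing, the fit should be read as a rough trend.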

**Create a new column called 'Date' that contains the date from the timeStamp column. You'll need to use apply along with the .date() method.**
```python
df['date']=df['timeStamp'].apply(lambda x : x.date())
df.head(2)
```

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | day | month | hour | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | Thu | 12 | 17 | 2015-12-10 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | Thu | 12 | 17 | 2015-12-10 |

**Now groupby this Date column with the count() aggregate and create a plot of counts of 911 calls.**
```python
df.groupby('date').count().plot(figsize=(10,5))
```
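
Note that `count()` returns one non-null count per column, so plotting its result draws one line per column. If a single "calls per day" series is wanted, `size()` is one option. A minimal sketch of the difference on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data: two calls on the first date (one with a missing township), one on the second
toy = pd.DataFrame({
    'date': ['2015-12-10', '2015-12-10', '2015-12-11'],
    'twp': ['NEW HANOVER', None, 'NORRISTOWN'],
})

# count(): non-null values per column, so missing twp entries are not counted
per_column = toy.groupby('date').count()

# size(): number of rows per group, regardless of missing values
per_day = toy.groupby('date').size()

print(per_column)
print(per_day)
```

On the real data, `df.groupby('date').size().plot(figsize=(10,5))` would likely give a single line of total calls per day.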

**Now recreate this plot but create 3 separate plots with each plot representing a Reason for the 911 call**
```python
df[df['Reason']=='Traffic'].groupby('date').count()['twp'].plot(figsize=(10,5))
plt.title('traffic')
```

    Text(0.5, 1.0, 'traffic')


```python
df[df['Reason']=='Fire'].groupby('date').count()['twp'].plot(figsize=(10,5))
plt.title('fire')
```

    Text(0.5, 1.0, 'fire')


```python
df[df['Reason']=='EMS'].groupby('date').count()['twp'].plot(figsize=(10,5))
plt.title('ems')
```

    Text(0.5, 1.0, 'ems')

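
The three cells above repeat the same filter, group, and plot steps once per Reason. One possible alternative is to build a single table with one column per Reason and plot from that. A minimal sketch on made-up rows (the `toy` frame and its values are illustrative assumptions):

```python
import pandas as pd

# Toy call log: date and Reason only
toy = pd.DataFrame({
    'date': ['2015-12-10', '2015-12-10', '2015-12-10', '2015-12-11', '2015-12-11'],
    'Reason': ['EMS', 'EMS', 'Fire', 'Traffic', 'EMS'],
})

# Rows per (date, Reason) pair, pivoted so that each Reason becomes a column;
# fill_value=0 records "no calls" instead of NaN
by_reason = toy.groupby(['date', 'Reason']).size().unstack(fill_value=0)
print(by_reason)

# by_reason.plot(subplots=True, figsize=(10, 8)) would draw one panel per Reason
```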

____
**Now let's move on to creating heatmaps with seaborn and our data. We'll first need to restructure the dataframe so that the columns become the Hours and the Index becomes the Day of the Week. There are lots of ways to do this, but I would recommend trying to combine groupby with an [unstack](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html) method.**
```python
dayhour=df.groupby(by=['day','hour']).count()['Reason'].unstack()
dayhour.head(5)
```

| day \ hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474 |
| Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325 |
| Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467 |
| Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330 |
| Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354 |

5 rows × 24 columns

**Now create a HeatMap using this new DataFrame.**
```python
plt.figure(figsize=(12,4))
sns.heatmap(data=dayhour,cmap='viridis')
```
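
Because the day names are strings, the grouped rows come out in alphabetical order (Fri, Mon, Sat, ...), and the heatmap follows that order. A minimal sketch of putting the rows in calendar order with `reindex`, shown on a made-up frame (`toy`, `order`, and `ordered` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Toy data with the same column names used in this notebook
toy = pd.DataFrame({
    'day': ['Thu', 'Thu', 'Mon', 'Sun', 'Mon'],
    'hour': [17, 17, 8, 8, 9],
    'Reason': ['EMS', 'Fire', 'EMS', 'Traffic', 'EMS'],
})

# Same restructuring as above: days as rows, hours as columns
dayhour_toy = toy.groupby(by=['day', 'hour']).count()['Reason'].unstack()

# Calendar order; keep only the days that actually occur in the data
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ordered = dayhour_toy.reindex([d for d in order if d in dayhour_toy.index])

print(ordered.index.tolist())
```

Passing the reordered frame to `sns.heatmap` should then list Monday at the top and Sunday at the bottom.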

**Now create a clustermap using this DataFrame.**
```python
sns.clustermap(data=dayhour,cmap='viridis')
```

**Now repeat the same plots and operations for a DataFrame that shows the Month as the column.**
```python
daymonth=df.groupby(by=['day','month']).count()['Reason'].unstack()
daymonth.head(5)
```

| day \ month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 12 |
|---|---|---|---|---|---|---|---|---|---|
| Fri | 1970 | 1581 | 1525 | 1958 | 1730 | 1649 | 2045 | 1310 | 1065 |
| Mon | 1727 | 1964 | 1535 | 1598 | 1779 | 1617 | 1692 | 1511 | 1257 |
| Sat | 2291 | 1441 | 1266 | 1734 | 1444 | 1388 | 1695 | 1099 | 978 |
| Sun | 1960 | 1229 | 1102 | 1488 | 1424 | 1333 | 1672 | 1021 | 907 |
| Thu | 1584 | 1596 | 1900 | 1601 | 1590 | 2065 | 1646 | 1230 | 1266 |

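
As mentioned earlier, there are several ways to build this kind of day-by-month table. One alternative to groupby plus unstack is `pd.crosstab`, which counts co-occurrences directly. A minimal sketch comparing the two on a made-up frame (`toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Toy data with the same column names used in this notebook
toy = pd.DataFrame({
    'day': ['Thu', 'Thu', 'Mon', 'Sun'],
    'month': [12, 12, 1, 1],
    'Reason': ['EMS', 'Fire', 'EMS', 'Traffic'],
})

# Approach used in this notebook; absent combinations become NaN
via_groupby = toy.groupby(by=['day', 'month']).count()['Reason'].unstack()

# crosstab counts rows per (day, month) pair; absent combinations become 0
via_crosstab = pd.crosstab(toy['day'], toy['month'])

print(via_crosstab)
```

The two results differ only in how empty cells are represented, which matters for a heatmap: NaN cells are left blank, while zeros are coloured.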
```python
plt.figure(figsize=(10,4))
sns.heatmap(data=daymonth,cmap='viridis')
```

```python
sns.clustermap(data=daymonth,cmap='viridis')
```
