https://github.com/jdkato/openpdi
A Python 3 library for decentralized aggregation of data from the Police Data Initiative (PDI).
https://github.com/jdkato/openpdi
data-science machine-learning nlp open-data python3
Last synced: 8 months ago
JSON representation
A Python 3 library for decentralized aggregation of data from the Police Data Initiative (PDI).
- Host: GitHub
- URL: https://github.com/jdkato/openpdi
- Owner: jdkato
- License: mit
- Created: 2018-10-20T20:02:35.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-12-18T01:35:50.000Z (over 5 years ago)
- Last Synced: 2025-03-14T03:43:07.694Z (over 1 year ago)
- Topics: data-science, machine-learning, nlp, open-data, python3
- Language: Python
- Homepage:
- Size: 125 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION
Awesome Lists containing this project
README
# OpenPDI [](https://travis-ci.org/OpenPDI/openpdi) [](https://github.com/OpenPDI/openpdi) [](https://zenodo.org/badge/latestdoi/153943607) [](https://pypi.org/project/openpdi/)
OpenPDI is an unofficial effort to document and standardize data submitted to
the [Police Data Initiative][3] (PDI). The goal is to make the data more accessible
by addressing a number of issues related to a lack of
standardization—namely,
- **File types**: While some agencies make use if the
[Socrata Open Data API](https://dev.socrata.com/), many provide their data
in raw `.csv`, `.xlsx`, or `.xls` files of varying structures.
- **Column names**: Many columns that represent the same data (e.g., `race`) are named differently across departments, cities, and states.
- **Value formats**: Dates, times, and other comparable fields are submitted in
many different formats.
- **Column availability**: It's currently very difficult to identify data
sources that contain certain columns—e.g., *Use of Force* data
specifying the hire date of the involved officer(s).
## Getting Started
###### Installation
```shell
$ pip install openpdi
```
###### Usage
| Dataset | ID | Source |
|-------------------|-------|-------------------------------------------------------------|
| [Use of Force][1] | `uof` | https://www.policedatainitiative.org/datasets/use-of-force/ |
```python
import csv
import openpdi
# The library has a single entry point:
dataset = openpdi.Dataset(
# The dataset ID (see the table above).
"uof",
# Limit the data sources to a specific state using its two-letter code.
#
# Default: `scope=[]`.
scope=["TX"],
# A list of columns that must be provided in every data source included in
# this dataset. See `openpdi/meta/{ID}/schema.json` for the available
# columns.
#
# Default: `columns=[]`.
columns=["reason"],
# If `True`, only return the user-specified columns -- i.e., those listed
# in the `columns` parameter.
#
# Default: `strict=False`.
strict=False)
# The names of the agencies included in this dataset:
print(dataset.agencies)
# The URLs of the external data sources inlcuded in this dataset:
print(dataset.sources)
# `gen` is a generator object for iterating over the CSV-formatted dataset.
gen = dataset.download()
# Write to a CSV file:
with open("dataset.csv", "w+") as f:
writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL)
writer.writerows(gen)
```
## Datasets
In an attempt to avoid unnecessary bloat (in terms of GBs), we don't actually store any PDI data in this repository. Instead, we store small, JSON-formatted descriptions of externally hosted datasets—for example, [`uof/CA/meta.json`](https://github.com/OpenPDI/openpdi/blob/master/openpdi/meta/uof/CA/meta.json):
```json
[
{
"url": "https://www.norwichct.org/Archive.aspx?AMID=61&Type=Recent",
"type": "csv",
"start": 1,
"columns": {
"date": {
"index": 0,
"specifier": "%m/%d/%Y"
},
"city": {
"raw": "Richmond"
},
"state": {
"raw": "CA"
},
"service_type": {
"index": 1
},
"force_type": {
"index": 10
},
"light_conditions": {
"index": 8
},
"weather_conditions": {
"index": 7
},
"reason": {
"index": 2
},
"officer_injured": {
"index": 6
},
"officer_race": {
"index": 9
},
"subject_injured": {
"index": 5
},
"aggravating_factors": {
"index": 3
},
"arrested": {
"index": 4
}
}
}
]
```
This file describes a Use of Force (`uof`) dataset from Richmond, CA. Each entry in the `columns` array maps a column from the externally-hosted data to a column in the dataset's schema file ([`uof/schema.json`](https://github.com/OpenPDI/openpdi/blob/master/openpdi/meta/uof/schema.json)).
![flow][4]
The `schema.json` file assigns a `format` to every possible column in a particular dataset, which is a Python function tasked with standardizing a raw column value (see [`openpdi/validators.py`](https://github.com/OpenPDI/openpdi/blob/master/openpdi/validators.py)).
[1]: https://github.com/jdkato/OpenPDI/tree/master/openpdi/meta/uof
[2]: https://www.policedatainitiative.org/datasets/use-of-force/
[3]: https://www.policedatainitiative.org/
[4]: https://user-images.githubusercontent.com/8785025/49119503-6975ac80-f25d-11e8-9310-802492815b39.png