https://github.com/uk-ipop/open-data-pipeline
A pipeline for processing, enhancing, and sharing open datasets.
- Host: GitHub
- URL: https://github.com/uk-ipop/open-data-pipeline
- Owner: UK-IPOP
- License: gpl-3.0
- Created: 2022-08-31T14:46:19.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-12T21:02:44.000Z (7 months ago)
- Last Synced: 2024-04-13T04:06:14.752Z (7 months ago)
- Topics: actions, automation, data, python
- Language: Python
- Homepage: https://uk-ipop.github.io/open-data-pipeline/
- Size: 7.73 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# Medical Examiner Open Data Pipeline
[![Pipeline](https://github.com/UK-IPOP/open-data-pipeline/actions/workflows/pipeline.yml/badge.svg?branch=main)](https://github.com/UK-IPOP/open-data-pipeline/actions/workflows/pipeline.yml)
[![Docs](https://github.com/UK-IPOP/open-data-pipeline/actions/workflows/pages/pages-build-deployment/badge.svg?branch=gh-pages)](https://github.com/UK-IPOP/open-data-pipeline/actions/workflows/pages/pages-build-deployment)

This repository contains the code for the Medical Examiner Open Data Pipeline.
We currently fetch data from the following sources:
- [Cook County Medical Examiner's Archives](https://datacatalog.cookcountyil.gov/Public-Safety/Medical-Examiner-Case-Archive/cjeq-bs86)
- [San Diego Medical Examiner's Office](https://data.sandiegocounty.gov/Safety/Medical-Examiner-Cases/jkvb-n4p7)
- [Milwaukee County Medical Examiner's Office](https://county.milwaukee.gov/EN/Medical-Examiner)
- [Connecticut (State) Accidental Drug Deaths](https://data.ct.gov/Health-and-Human-Services/Accidental-Drug-Related-Deaths-2012-2022/rybz-nyjw/about_data)
- [Santa Clara County Medical Examiner's Office](https://data.sccgov.org/Health/Medical-Examiner-Coroner-Full-dataset/s3fb-yrjp/about_data)
- [Sacramento County Medical Examiner's Office](https://sacramentocounty.maps.arcgis.com/apps/dashboards/0661fb44435b4611bf52be84708c4591)

The resulting data are used in other analyses here on GitHub:
- [Cook County](https://github.com/UK-IPOP/cook-county-analysis)
  - Where we add geospatial data to the Cook County data
  - This was excluded from the automated pipeline because its data requirements are specific to Cook County

## Getting Started
This repo exists mainly to take advantage of GitHub Actions for automation.
The Actions workflow is located in `.github/workflows/pipeline.yml` and is triggered weekly or manually.
This workflow fetches data from the data sources configured in `config.json`,
geocodes addresses (when available) using ArcGIS, extracts drugs using the drug extraction [toolbox](https://github.com/UK-IPOP/drug-extraction),
and then compiles and zips the results and publishes them to GitHub Releases.

The data is then available for download from the [releases page](https://github.com/UK-IPOP/open-data-pipeline/releases).
Further, the entire workflow effectively runs a series of commands using the CLI application `opendata-pipeline`, which is located in the `src` directory.
This is also available via a Docker image hosted on [ghcr.io](https://github.com/UK-IPOP/open-data-pipeline/pkgs/container/opendata-pipeline). The
benefit of using the CLI via the Docker image is that you don't need Python 3.10 or the drug toolbox on your local machine 🙂.

We use async methods to speed up the large number of web requests we make to the data sources.
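To illustrate why async helps here, the following is a minimal sketch of issuing many requests concurrently with `asyncio`. It is not the pipeline's actual code: `fetch_page`, `fetch_all`, and the `max_concurrent` parameter are hypothetical names, and the network call is simulated with a short sleep so the example runs without any external service.

```python
import asyncio


async def fetch_page(source: str, offset: int) -> dict:
    """Hypothetical stand-in for a real HTTP request to a data source."""
    await asyncio.sleep(0.01)  # simulates network latency; no real request is made
    return {"source": source, "offset": offset}


async def fetch_all(source: str, offsets: list[int], max_concurrent: int = 5) -> list[dict]:
    """Fetch every page concurrently, with at most `max_concurrent` in flight."""
    # The semaphore caps concurrency so the data source is not overwhelmed.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(offset: int) -> dict:
        async with semaphore:
            return await fetch_page(source, offset)

    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(bounded(o) for o in offsets))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all("example-source", [0, 1000, 2000]))
    print(len(pages))
```

Because the requests spend most of their time waiting on the network, running them concurrently lets the total time approach that of the slowest request rather than the sum of all of them.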
> It is important to regularly fetch/pull from this repo to maintain an updated `config.json`.

Unfortunately, we do not currently guarantee Windows support. If you want to help make that a reality, please submit a new [Pull Request](https://github.com/UK-IPOP/open-data-pipeline/pulls).

There is further API documentation available on the GitHub Pages [website](https://uk-ipop.github.io/open-data-pipeline/) for this repo if you want to interact with the CLI.
I recommend using the Docker image, as it is easier to use, and always referring to the CLI `--help` for more information.

### Workflow
The workflow can best be described by looking at the `pipeline.yml` file.
## Data Enhancements
The following table shows the fields that we **add** to the original datafiles:
| Column Name | Description |
| :------ | :------ |
| `CaseIdentifier` | A *unique* identifier *across all* the datasets. |
| `death_day` | Day of the Month death occurred |
| `death_month` | Month Name death occurred |
| `death_month_num` | Month Number death occurred |
| `death_year` | Year death occurred |
| `death_day_of_week` | Day of week death occurred. Starting with 0 on Monday. Weekends are 5 (Saturday) & 6 (Sunday). |
| `death_day_is_weekend` | Death occurred on weekend day |
| `death_day_week_of_year` | Week of the year (of 52) that death occurred |
| `geocoded_latitude` | Geocoded latitude. |
| `geocoded_longitude` | Geocoded longitude. |
| `geocoded_score` | Confidence of geocoding. 70-100. |
| `geocoded_address` | The address that the geocoded results correspond to. Not the address provided to the geocoder. |

### Drug Columns
In addition to providing the extracted drugs as a separate file in each release, we also convert this data to wide-form for each dataset. This adds the following columns in the subsequent pattern:
| Column Name/Pattern | Description |
| :--- | :--- |
| `*_1` | `*` drug found in first search column provided in drug configuration |
| `*_2` | `*` drug found in second search column provided in drug configuration |
| `*_meta` | Drug of `*` category/class found in this record across _any_ search column. |

## Requirements
- Python 3.10
- Drug Extraction ToolBox

## Installation
To install the Python CLI, I recommend using [pipx](https://pypa.github.io/pipx/).
```bash
pipx install opendata-pipeline
```

To install the Docker image, you can use the following command:
```bash
docker pull ghcr.io/uk-ipop/opendata-pipeline:latest
```

## Usage
Usage is very similar to any other command line application. The most important thing is to follow the workflow defined above.
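When working with a downloaded release, it can help to see how the date-derived columns listed under Data Enhancements relate to a single death date. The sketch below reproduces them with the standard library only. The function name `derive_date_fields` is hypothetical, and using the ISO week number for `death_day_week_of_year` is an assumption; the pipeline may count weeks differently.

```python
import calendar
from datetime import date


def derive_date_fields(death_date: date) -> dict:
    """Build the date-derived columns for one record (illustrative only)."""
    day_of_week = death_date.weekday()  # Monday is 0, Sunday is 6
    return {
        "death_day": death_date.day,
        "death_month": calendar.month_name[death_date.month],
        "death_month_num": death_date.month,
        "death_year": death_date.year,
        "death_day_of_week": day_of_week,
        "death_day_is_weekend": day_of_week >= 5,  # 5 = Saturday, 6 = Sunday
        # Assumption: ISO 8601 week number; the pipeline may use another convention.
        "death_day_week_of_year": death_date.isocalendar()[1],
    }


if __name__ == "__main__":
    print(derive_date_fields(date(2022, 9, 3)))
```

Python's `date.weekday()` already uses the Monday-is-0 convention described in the table, so no remapping is needed for `death_day_of_week`.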
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Help me write some tests!
## License
[GPL-3.0](https://choosealicense.com/licenses/gpl-3.0/)
## BibTeX Citation
If you use this software or the enhanced data, please cite this repository:
```
@software{Anthony_Medical_Examiner_OpenData_2022,
author = {Anthony, Nicholas},
month = {9},
title = {{Medical Examiner OpenData Pipeline}},
url = {https://github.com/UK-IPOP/open-data-pipeline},
version = {0.2.1},
year = {2022}
}
```

Thank you.