https://github.com/apelullo/yelp_health_data_curation_ops
An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data in support of ongoing health system initiatives.
academic-research automation aws data-access data-curation data-infrastructure data-pipelines health-data operations operations-research python yelp-dataset
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/apelullo/yelp_health_data_curation_ops
- Owner: apelullo
- License: MIT
- Created: 2025-03-06T00:19:34.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-06-27T20:17:12.000Z (4 months ago)
- Last Synced: 2025-06-27T21:25:57.920Z (4 months ago)
- Topics: academic-research, automation, aws, data-access, data-curation, data-infrastructure, data-pipelines, health-data, operations, operations-research, python, yelp-dataset
- Language: Jupyter Notebook
- Homepage:
- Size: 1.18 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Yelp Health Data Curation - An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data
* **Description**: The first, foundational component of the Center for Healthcare Transformation and Innovation (CHTI) AWS data infrastructure, meant to automate novel dataset creation and manage access to high-value data assets for use in academic and operations research. The essential functions of the *Yelp health data pipeline* are as follows:
  * Launch a preconfigured, CHTI-owned EC2 instance, optimized for data collection and processing, from a launch template and execute the *yelp_health_pipeline.py* script
  * Extract Yelp-provided zip files of daily database snapshots for all Yelp "health-related" facilities from a Yelp-owned S3 bucket
  * Unzip each zip file and process the corresponding JSON file into three master CSV files for facilities, facility categories, and facility reviews
  * Save zip files, JSON files, and processed master files in corresponding CHTI-owned S3 buckets with appropriate storage classes to minimize costs
  * Create user groups and roles with appropriate permissions to streamline data access requests and ensure data integrity/consistency in support of ongoing health system initiatives
* **Role**: *Lead Data Scientist* allocated to the CHTI AWS data infrastructure
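
The unzip-and-process step listed above could look roughly like the sketch below. It is not the code in *yelp_health_pipeline.py*: the snapshot schema is not documented here, so the record fields (`id`, `name`, `categories`, `reviews`), the column layouts, the output file names, and the helper names `split_records`, `append_rows`, and `process_snapshot` are all assumptions made for illustration. It uses only the Python standard library and leaves out the S3 transfer steps.

```python
import csv
import json
import zipfile
from pathlib import Path

# Hypothetical column layouts; the real snapshot schema may differ.
FACILITY_COLS = ["facility_id", "name"]
CATEGORY_COLS = ["facility_id", "category"]
REVIEW_COLS = ["facility_id", "review_id", "rating", "text"]


def split_records(records):
    """Split a list of facility records into rows for the three master tables."""
    facilities, categories, reviews = [], [], []
    for rec in records:
        fid = rec["id"]  # assumed field name
        facilities.append({"facility_id": fid, "name": rec.get("name", "")})
        for cat in rec.get("categories", []):
            categories.append({"facility_id": fid, "category": cat})
        for rev in rec.get("reviews", []):
            reviews.append({
                "facility_id": fid,
                "review_id": rev["id"],
                "rating": rev.get("rating"),
                "text": rev.get("text", ""),
            })
    return facilities, categories, reviews


def append_rows(csv_path, columns, rows):
    """Append rows to a master CSV, writing the header only for a new file."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def process_snapshot(zip_path, out_dir):
    """Read every JSON file in one snapshot zip and append it to the master CSVs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue
            with zf.open(name) as fh:
                records = json.load(fh)  # assumes a top-level JSON array
            facilities, categories, reviews = split_records(records)
            append_rows(out_dir / "facilities.csv", FACILITY_COLS, facilities)
            append_rows(out_dir / "facility_categories.csv", CATEGORY_COLS, categories)
            append_rows(out_dir / "facility_reviews.csv", REVIEW_COLS, reviews)
```

Calling `process_snapshot("snapshot.zip", "master/")` once per daily zip file would grow the three master files over time. This sketch appends without deduplicating, so a facility that appears in several daily snapshots would be written once per snapshot; a production pipeline would likely need a merge or upsert step instead.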