https://github.com/apelullo/yelp_health_data_curation_ops

An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data in support of ongoing health system initiatives.

academic-research automation aws data-access data-curation data-infrastructure data-pipelines health-data operations operations-research python yelp-dataset

Host: GitHub
URL: https://github.com/apelullo/yelp_health_data_curation_ops
Owner: apelullo
License: mit
Created: 2025-03-06T00:19:34.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-06-27T20:17:12.000Z (4 months ago)
Last Synced: 2025-06-27T21:25:57.920Z (4 months ago)
Topics: academic-research, automation, aws, data-access, data-curation, data-infrastructure, data-pipelines, health-data, operations, operations-research, python, yelp-dataset
Language: Jupyter Notebook
Homepage:
Size: 1.18 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Yelp Health Data Curation - An AWS-based data pipeline to extract, process, store, and monitor Yelp "health-related" facility data
* **Description**: The first, foundational component of the Center for Healthcare Transformation and Innovation (CHTI) AWS data infrastructure, built to automate the creation of novel datasets and manage access to high-value data assets for academic and operations research. The essential functions of the *Yelp health data pipeline* are as follows:
    * Launch a preconfigured, CHTI-owned EC2 instance, optimized for data collection and processing, from a launch template and execute the *yelp_health_pipeline.py* script.
    * Extract the Yelp-provided zip files of daily database snapshots for all Yelp "health-related" facilities from a Yelp-owned S3 bucket.
    * Unzip each zip file and process the corresponding JSON file into three master CSV files: facilities, facility categories, and facility reviews.
    * Save the zip files, JSON files, and processed master files in corresponding CHTI-owned S3 buckets, using appropriate storage classes to minimize costs.
    * Create user groups and roles with appropriate permissions to streamline data access requests and ensure data integrity and consistency in support of ongoing health system initiatives.
* **Role**: *Lead Data Scientist* allocated to the CHTI AWS data infrastructure
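
The unzip-and-process step described above can be illustrated with a minimal sketch. It is not the code in *yelp_health_pipeline.py*: the snapshot schema is not documented here, so the field names (`business_id`, `name`, `categories`, `reviews`, `review_id`, `stars`, `text`) and the helper names `split_snapshot` and `rows_to_csv` are assumptions chosen for illustration. It assumes each archive holds one JSON file containing a list of facility records.

```python
import csv
import io
import json
import zipfile


def split_snapshot(zip_bytes):
    """Split one zipped JSON snapshot into facility, category, and review rows.

    Assumption: the archive contains a single .json file holding a list of
    facility records. All field names below are hypothetical.
    """
    facilities, categories, reviews = [], [], []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        json_name = next(n for n in archive.namelist() if n.endswith(".json"))
        records = json.loads(archive.read(json_name))

    for rec in records:
        fid = rec["business_id"]
        facilities.append({"business_id": fid, "name": rec.get("name", "")})
        # One row per (facility, category) pair.
        for cat in rec.get("categories", []):
            categories.append({"business_id": fid, "category": cat})
        # One row per review, keyed back to its facility.
        for rev in rec.get("reviews", []):
            reviews.append({
                "business_id": fid,
                "review_id": rev["review_id"],
                "stars": rev.get("stars"),
                "text": rev.get("text", ""),
            })
    return facilities, categories, reviews


def rows_to_csv(rows, fieldnames):
    """Serialize a list of row dicts to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Keeping categories and reviews in separate tables keyed by a facility identifier means each master file stays flat, so new daily snapshots can be appended without restructuring earlier rows.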
