{"id":15756738,"url":"https://github.com/davidgamero/gatech-covid-data-scraper","last_synced_at":"2025-03-31T08:25:08.752Z","repository":{"id":75379824,"uuid":"290240134","full_name":"davidgamero/gatech-covid-data-scraper","owner":"davidgamero","description":"Utility for scraping GATech Exposure Alert Information into a CSV file with automated case number extraction and aggregation","archived":false,"fork":false,"pushed_at":"2020-09-25T18:35:22.000Z","size":19,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-06T12:48:15.811Z","etag":null,"topics":["covid","data","gatech","georgia","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidgamero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-25T14:39:38.000Z","updated_at":"2020-09-25T18:35:24.000Z","dependencies_parsed_at":"2023-06-06T08:15:31.246Z","dependency_job_id":null,"html_url":"https://github.com/davidgamero/gatech-covid-data-scraper","commit_stats":{"total_commits":13,"total_committers":1,"mean_commits":13.0,"dds":0.0,"last_synced_commit":"0262d40ebd5388d110d664323c1f333cf602def3"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidgamero%2Fgatech-covid-data-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidgamero%2Fgatech-covid-data-scraper/tags","releases_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/davidgamero%2Fgatech-covid-data-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidgamero%2Fgatech-covid-data-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidgamero","download_url":"https://codeload.github.com/davidgamero/gatech-covid-data-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246437901,"owners_count":20777301,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["covid","data","gatech","georgia","scraper"],"created_at":"2024-10-04T09:01:27.211Z","updated_at":"2025-03-31T08:25:08.734Z","avatar_url":"https://github.com/davidgamero.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 😷 GATech COVID-19 Data Scraper\nNumber of cases per day. As a CSV. Data as it should be. EZ to read.\n## 🎬 Demo\n\n[⬇️ Download the Current Data (updated hourly)](https://gatech-covid-19-data.s3.amazonaws.com/gatech_covid_data.csv)\n\n[📈 View the data in an interactive Chart and Table](https://davidgamero.github.io/gatech-covid-chart/)\n\nBelow is the link to a public S3 Object that gets updated hourly from a Lambda running this project's code. 
Feel free to use it for powering a dashboard or investigating the data yourself.\n```\nhttps://gatech-covid-19-data.s3.amazonaws.com/gatech_covid_data.csv\n```\n\n\n## 🏁 Getting Started\nFor those who want to run the data scraper locally:\n\n```\ngit clone git@github.com:davidgamero/gatech-covid-data-scraper.git\ncd gatech-covid-data-scraper\npip install -r requirements.txt\n```\n\n```\npython scrape_covid_data.py\n```\nData will be written to `gatech_covid_data.csv`.\n\n## ℹ️ Project Info\nQ: Why did I make this?\n\nA: I searched \"gatech covid\" on GitHub and got only one result, which was in R, by [cjwichman](https://github.com/cjwichman/gatech_covid).\n\nI believe that pandemic health data should be freely and easily accessible and wanted to make my own plots, so I decided to make a Python scraper implementation to better understand the data.\n\nMy main improvement is automated extraction of case numbers, aggregated by day, even for rows that group multiple cases. This was trickier than I expected because rows differ in formatting due to the [GATech Health Alert Site](https://health.gatech.edu/coronavirus/health-alerts)'s wildly inconsistent conventions 🤢. I used a series of regular expressions to match keywords and then extract integers using observed rules. All fuzzy extractions are printed to the command line for manual verification.\n\nThe patterns currently recognized are:\n1) Rows with a 'Position' value of 'Students (N)' or 'Various (N)', where N is the number of cases, which I extract with a regex capture group for the numeric contents of the parentheses.\n2) Rows with a 'Position' value of 'Students' or 'Various'. For these rows, I use a regex search to find the first integer in the 'Campus Impact' column and treat it as the number of cases. It would be nice to eventually check that there is only a single match and throw an error for manual review if there are multiple integers.\n\n## 💾 AWS Lambda -\u003e S3\nTo deploy as an AWS Lambda function, build `gatech-covid-data-lambda.zip` with `build_lambda_zip.sh` and upload it to a Python Lambda with `s3:PutObject` and `s3:PutObjectAcl` permissions on the target bucket:\n```\nchmod +x build_lambda_zip.sh\n./build_lambda_zip.sh\n```\n\nUpload `gatech-covid-data-lambda.zip` to AWS Lambda.\n\nI recommend increasing the Lambda timeout to \u003e5s, since the data grows over time as more rows are added.\n\n## Acknowledgements\nShout out to [cjwichman](https://github.com/cjwichman/gatech_covid) for paving the way with their [gatech_covid](https://github.com/cjwichman/gatech_covid) repo.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidgamero%2Fgatech-covid-data-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidgamero%2Fgatech-covid-data-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidgamero%2Fgatech-covid-data-scraper/lists"}