{"id":47285396,"url":"https://github.com/zimolzak/covid-datathon","last_synced_at":"2026-03-16T05:52:18.486Z","repository":{"id":68878729,"uuid":"270824021","full_name":"zimolzak/covid-datathon","owner":"zimolzak","description":"Analysis of COVID outcome predictors, and of datathon survey responses.","archived":false,"fork":false,"pushed_at":"2024-08-08T00:13:05.000Z","size":180,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-09-04T21:56:35.527Z","etag":null,"topics":["covid-19","datathon","hackathon"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zimolzak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-06-08T20:52:35.000Z","updated_at":"2024-08-08T00:12:19.000Z","dependencies_parsed_at":"2025-04-12T00:22:40.755Z","dependency_job_id":"eb449824-8829-486c-88b7-fbbc6d3dabb6","html_url":"https://github.com/zimolzak/covid-datathon","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/zimolzak/covid-datathon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zimolzak%2Fcovid-datathon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zimolzak%2Fcovid-datathon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zimolzak%2Fcovid-datathon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zimolzak%2Fcovi
d-datathon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zimolzak","download_url":"https://codeload.github.com/zimolzak/covid-datathon/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zimolzak%2Fcovid-datathon/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30569654,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-16T04:42:47.996Z","status":"ssl_error","status_checked_at":"2026-03-16T04:42:44.668Z","response_time":96,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["covid-19","datathon","hackathon"],"created_at":"2026-03-16T05:52:17.952Z","updated_at":"2026-03-16T05:52:18.478Z","avatar_url":"https://github.com/zimolzak.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"Lessons learned from an enterprise-wide clinical datathon\n========\n\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13261085.svg)](https://doi.org/10.5281/zenodo.13261085)\n\nThis repository contains code that was used in the research reported\nin the following journal article:\n\nZimolzak AJ, Davila JA, Punugoti V, et al. Lessons learned from an\nenterprise-wide clinical datathon. J Clin Transl Sci. 
2022;6(1):e125.\nPublished 2022 Aug 24.\n[doi:10.1017/cts.2022.450](https://doi.org/10.1017/cts.2022.450)\n\n- [PMID 36590351](https://pubmed.ncbi.nlm.nih.gov/36590351/)\n\n- Free full text at [PMCID PMC9794964](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9794964/)\n\n- Direct [link to\n  paper](https://www.cambridge.org/core/journals/journal-of-clinical-and-translational-science/article/lessons-learned-from-an-enterprisewide-clinical-datathon/A399543C00670D6A43F1911D28A111ED)\n  at *Journal of Clinical and Translational Science* and\n  [PDF](https://www.cambridge.org/core/services/aop-cambridge-core/content/view/A399543C00670D6A43F1911D28A111ED/S2059866122004502a.pdf/lessons-learned-from-an-enterprise-wide-clinical-datathon.pdf)\n  from the journal. (Paper is licensed CC BY-NC-SA 4.0.)\n\n\n\n\n* * *\n\n\nItems from my own datathon project below this line.\n\nDatathon: Predictors of severe COVID-19 outcomes\n========\n\n1. Characterize the BCM experience with COVID, including\nhospitalization rate by comorbidity, and ICU utilization.\n\n2. Train a multivariable predictive model for severity of COVID (ICU\nadmission, incidence of ARDS criteria, length of stay, and in-hospital\nmortality) as a function of known and novel factors such as\ncomorbidity.\n\n3. Study the covariation in severe outcomes with treatment modalities\nto evaluate population-level changes.\n\nThe file you probably want\n--------\n\n`bslmc_v6.R` (click on it up above).\n\n`Routputs_v6.txt` for results (although you should treat them as proof\nof concept only and not believe them. 
Not even hypothesis-generating.)\n\nAlso I feel the `outputs/makefile` file is useful and nifty.\n\nAnalytic data set sketch\n--------\n\n* one row per ED encounter\n    * patient ID of course\n    * ER visit diagnoses\n    * admission diagnoses, if admitted\n    * covid test date(s) \u0026 results\n    * ER visit index date\n* outcomes as follows:\n    * admitted yes/no\n    * ER directly to ICU yes/no\n    * length of stay, if admitted (continuous)\n    * pao2:fio2 ratio (future summary measure, of oxygenation)\n    * mortality yes/no (and date)\n    * intubated yes/no (and date)\n    * maybe future ICU admit \u0026 date if I can manage it\n* predictors as follows\n    * demographics\n        * age\n        * sex\n        * race\n        * ethnicity\n        * ZIP\n    * comorbidities (two columns for each: N prior visits or rate, and prob list yes/no)\n        * diabetes\n        * copd\n        * asthma\n        * hypertension\n        * coronary disease\n        * cancer\n    * number of prior hospital admissions (or rate)\n    * number of prior ER visits (or rate)\n    * vitals (summary meas if needed)\n        * temp\n        * pulse\n        * respirations\n        * BP\n        * SpO2\n        * height\n        * weight\n    * labs (summary measure of labs just before/on index date)\n        * wbc\n        * hgb\n        * plt\n        * sodium\n        * K\n        * bicarb\n        * creatinine\n        * d-dimer\n        * CRP\n        * LDH\n        * BUN\n        * HDL\n        * direct bilirubin\n        * RDW\n        * albumin\n        * neutrophils\n        * lymphocytes\n        * ALT\n        * P:F ratio can be predictor for more \"hard\" downstream events\n\nData pull spec\n--------\n\nInclude: anyone with *positive* covid test. BSLMC and BCM outpatient.\nIf more than 10,000 patients, OK to randomly sample. (Seems to be 1100\nto 1800). 
700 ish at office visit in person.\n\nTables (inpatient/BSLMC):\n\n- PAT_ID\n\n- ENC_DX including distant past\n\n- PROBLEM including distant past\n\n- ORDER_RESULTS only need recent dates like 2020\n\n- HSP, including all types of visits (ER, inpatient, etc.). OK to\n  limit to 2020 only, but not necessary.\n\n- possibly flowsheet: height weight systolic diastolic pulse\n  respirations spo2. limit to intubated or other o2 params, and spo2\n  po2.\n\n- discharge disposition (looking for discharge disposition)\n\nTables (outpatient):\n\n- TBD\n\nWhere to look in this code\n--------\n\nHigh value areas would be `bslmc_v4_DataSets_pipe.R` and `bslmc_v6.R` for inpatient\ndata and `analysis_outpat.R` for outpatient. Pay the most attention to\nvariables \u0026 values mentioned within `select()` and `filter()`\nstatements to get a sense of how the data is structured.\n\nRequirements for current repo\n--------\n\n- R, Rscript\n- R packages: dplyr, ggplot2, tidyr, lubridate, here, earth, ROCR\n- make and usual UNIX-like toolchain (mv, rm, cp)\n- pandoc (only for documentation)\n\n\nDatathon \"alpha phase\" use case examples\n========\n\n|ID| Hard? | Waiting on:   | Description                                                 |\n|--|-------|---------------|-------------------------------------------------------------|\n|1 | Easy  | **Done**      | Test volume, positive tests, by date. **Foundational.**     |\n|2 | Easy  | **Done**      | Count tests, positive tests, by comorbidity (see below).    |\n|3 | Easy  | Andy          | Retest volume, likelihood of positivity. By clinic.         |\n|4o| Int.  | **Done**      | Pulse oximetry (SpO2) by positive/negative.                 |\n|4i| Int.  | **Done**/Rory | \" \" \"                                                       |\n|5 | Int.  | Andy          | Basic labs (see below) by positive/negative.                |\n|6 | Adva. | Rory          | *I:* People who \"touch\" the chart (PPE estimate).           |\n|7 | Adva. 
| Andy          | *I:* Rate of testing late in an admission, rate of positive.|\n|8 | Adva. | Rory          | *I:* Basic descriptives (LOS, floor/hosp census by date).   |\n|9 | Adva. | Andy          | *O:* Rt. of admission/ER (manual rev.?); risk factors.      |\n|10| Adva. | Andy          | Anything to do with mortality.                              |\n\nCapture what Baylor Medicine clinic it was sent from. Capture what lab\n(vendor) it went to (LabCorp, or whoever), called \"test performed by.\"\nMost common \"lab facility\" are CPL and LabCorp. Slicer has the\n*confirmed* and the *suspected* registries for COVID.\n\nDefine \"comorbidity\"\n--------\n\n1. The \"big four\": Asthma, COPD, DM, HTN.[^ehrn]\n\n    a. Define first using just the problem list. This is \"middle\n    school level.\"\n\n    b. Then define using encounters, \"high school:\" anyone with 2 or more\n    encounters, from the beginning of time, is defined as having the\n    diabetes (copd, htn, etc.) phenotype. Also note that we want to\n    know the *value* of the counts: really the *distribution* of the\n    count over patients. (100 patients have 2 COPD encounters, 10 have\n    3 encounters, and 1 patient had 10 encounters, since beginning of\n    the database, which means 5 years.) Final note: for big four, this\n    depends only on the encounters, not on problem list. (A woman with 3\n    COPD office visits in past year should be considered as having\n    COPD according to this rule, even if she does not have it in the\n    problem list.)\n\n    c. Then meds, \"college level.\" Requires increasingly more work.\n\n    d. Then more fancy, such as moving beyond rules based phenotype\n    definitions. Probably don't do this for purposes of this stage in\n    the datathon.\n\n2. Also want a broad view of all \"medical history\" items in problem\nlist (not just big four). 
*This is shown on one of Gloria's SlicerDicer\nslides as a bar graph.*\n\nEpic has notion of encounter for office, procedure, and rx. Gloria\nknows someone for data quality of tobacco use.\n\nDefine \"labs\"\n--------\n\nLabs are called \"procedure.\" Labs are of definite interest. Big ones\n(values are interesting): cmp, cbc, flu, bmp, crp, d-dimer, ferritin. Within\nthose lab panels, the values I care most about are: sodium,\ncreatinine, and white blood cell count, detailed white cell\ndifferential (neutrophils, lymphocytes, monocytes, eosinophils,\nbasophils).\n\nIn broader \"non big-one\" sense, I *do* want to know *whether* they\nhave the lab ordered (e.g. A1c, lipid panel: I don't care so much\nabout the exact value, but I want to know whether or not the patient\nhad the lab ordered/done). *This is shown on one of Gloria's\nSlicerDicer slides.*\n\n[^ehrn]: Epic Health Research Network, https://ehrn.org/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzimolzak%2Fcovid-datathon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzimolzak%2Fcovid-datathon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzimolzak%2Fcovid-datathon/lists"}