{"id":13592339,"url":"https://github.com/philipdarke/ukbb-ehr-data","last_synced_at":"2025-04-23T20:24:07.477Z","repository":{"id":134820474,"uuid":"424007145","full_name":"philipdarke/ukbb-ehr-data","owner":"philipdarke","description":"Prepare UK Biobank Electronic Health Record data for research","archived":false,"fork":false,"pushed_at":"2022-08-31T10:44:08.000Z","size":258,"stargazers_count":26,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-11T18:27:25.276Z","etag":null,"topics":["ehr","electronic-health-records","healthcare","uk-biobank"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/philipdarke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-11-02T21:40:49.000Z","updated_at":"2025-04-01T14:59:46.000Z","dependencies_parsed_at":"2024-01-16T22:19:52.771Z","dependency_job_id":"5d5bbfe6-b361-4fbc-9a55-7e9385cf21e1","html_url":"https://github.com/philipdarke/ukbb-ehr-data","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipdarke%2Fukbb-ehr-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipdarke%2Fukbb-ehr-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipdarke%2Fukbb-ehr-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipdarke%2Fukbb-ehr-da
ta/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/philipdarke","download_url":"https://codeload.github.com/philipdarke/ukbb-ehr-data/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250507139,"owners_count":21441926,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ehr","electronic-health-records","healthcare","uk-biobank"],"created_at":"2024-08-01T16:01:08.193Z","updated_at":"2025-04-23T20:24:07.450Z","avatar_url":"https://github.com/philipdarke.png","language":"R","funding_links":[],"categories":["Derivation of variables"],"sub_categories":["Electronic health records"],"readme":"# Prepare UK Biobank EHR data for research\n\n[![DOI](https://img.shields.io/badge/DOI-10.1093/jamia/ocab260-blue)](https://doi.org/10.1093/jamia/ocab260)\n\nClean and prepare UK Biobank primary care EHR for research. Tested with the [interim EHR data release](https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/primary_care_data.pdf).\n\n## Installation\n\n1. Install the `ukbbhelpr` R package from [here](https://github.com/philipdarke/ukbbhelpr).\n2. Clone the EHR code set repository [here](https://github.com/philipdarke/ehr-codesets).\n3. Clone this repository and follow the instructions below.\n\nThe following R packages are required. 
Install them using:\n\n```R\nrequired \u003c- c(\"zoo\", \"dplyr\", \"plyr\", \"ggplot2\", \"cowplot\")\noptional \u003c- c(\"caret\", \"QDiabetes\", \"survival\")\ninstall.packages(required)\ninstall.packages(optional)  # needed to run code in the paper directory\n```\n\n## UK Biobank data\n\nDownload the data for your UK Biobank application from the [data showcase](https://biobank.ndph.ox.ac.uk/showcase/). The following fields are required to process the primary care EHR data:\n\nDescription | Field\n----------- | -----\nYear and month of birth | `34`, `52`\nDate of assessment centre visit | `53`\nLinked date of death | `40000`\n\nThe fields below are required to run the code in the `02_extract_records` and `paper` directories:\n\nDescription | Field\n----------- | -----\nDemographic data | `31`, `189`, `21000`\nAnthropometric measurements | `48`, `50`, `21002`\nHbA1c blood glucose | `30750`\nSelf-reported non-cancer medical history | `2986`, `20002`, `20003`, `20008`\nSmoking history | `1249`, `2887`, `3456`, `20116`\nSummary secondary care data | `41270`, `41271`, `41272`, `41273`, `41280`, `41281`, `41282`, `41283`\n\n:warning: Edit `01_prepare_data/01_subset_visit_data.R` if any of the optional fields above are unavailable.\n\nIn addition, the primary care data are required:\n\nDescription | File\n----------- | ----\nParticipant registration records | `gp_registrations.txt`\nClinical event records | `gp_clinical.txt`\nPrescription records | `gp_scripts.txt`\n\n## Prepare the data for research\n\n1. Update `file_paths.R` with the paths to your downloaded data.\n2. Run the scripts in the `01_prepare_data` directory sequentially to infer periods of data collection for each participant. The results are saved in `data/data_period.rds` by default.\n3. 
Run the scripts in the `02_extract_records` directory sequentially to extract the files marked * in the table below.\n\nAlternatively, `run_all.R` can be run instead of steps 2 and 3.\n\n:warning: The EHR data are large files and `run_all.R` in particular is very memory intensive. Use of a high performance computing service is recommended. UK Biobank data must be stored and processed as required under the Material Transfer Agreement.\n\nTested with the September 2019 interim EHR release on an Intel Xeon E5-2699 v4 processor (2.2 GHz, 22 cores, 55 MB cache) with 256 GB RAM running R 3.6 on CentOS Linux 7. The code has not been tested on R 4.0+.\n\n## Output summary\n\nThe following files are saved in the `data` directory by default:\n\nFile | Description\n---- | -----------\n`data_period.rds` | Period(s) of EHR data collection for each participant\n`gp_event.rds` | Clean event/diagnosis data\n`gp_presc.rds` | Clean prescription data\n`biomarkers.rds`* | Extracted biomarkers\n`demographic.rds`* | Ethnicity, smoking history and Townsend deprivation\n`family_history.rds`* | Family history data\n`diagnoses.rds`* | Extracted diagnosis codes for a range of common conditions\n`prescriptions.rds`* | Estimated periods during which selected drugs were prescribed\n\nFiles marked * are generated by the scripts in the `02_extract_records` directory.\n\n## Visualising the results\n\n### Estimating periods of EHR data collection\n\n`visualisation/01_algorithm.R` can be used to plot the results of the algorithm used to infer periods of EHR data collection for a participant.\n\n![Data collection algorithm example](algo_output.png)\n\n### Diabetes phenotyping case study\n\n`visualisation/02_phenotyping.R` can be used to plot the results of the diabetes phenotyping algorithm. 
`paper/02_diabetes_phenotyping.R` must be run first.\n\n![Example output from diabetes phenotyping tool](pheno_output.png)\n\n## Citing this work\n\nIf you use this work, please cite it as below:\n\n```\n@article{10.1093/jamia/ocab260,\n    author = {Darke, Philip and Cassidy, Sophie and Catt, Michael and Taylor, Roy and Missier, Paolo and Bacardit, Jaume},\n    title = \"{Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study}\",\n    journal = {Journal of the American Medical Informatics Association},\n    volume = {29},\n    number = {3},\n    pages = {546-552},\n    year = {2021},\n    month = {12},\n    issn = {1527-974X},\n    doi = {10.1093/jamia/ocab260},\n    url = {https://doi.org/10.1093/jamia/ocab260},\n    eprint = {https://academic.oup.com/jamia/article-pdf/29/3/546/42333190/ocab260.pdf},\n}\n```\n\n## Licence\n\nMade available under the [MIT Licence](LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilipdarke%2Fukbb-ehr-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphilipdarke%2Fukbb-ehr-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilipdarke%2Fukbb-ehr-data/lists"}