{"id":18907036,"url":"https://github.com/jordandeklerk/ehr-bert","last_synced_at":"2025-07-21T04:07:48.460Z","repository":{"id":241012662,"uuid":"804048595","full_name":"jordandeklerk/EHR-BERT","owner":"jordandeklerk","description":"BERT style transformer model on CMS synthetic EHR data for diagnosis and procedure prediction in PyTorch","archived":false,"fork":false,"pushed_at":"2025-05-05T18:40:32.000Z","size":53208,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-01T00:49:44.139Z","etag":null,"topics":["artificial-intelligence","bert","cms","deep-learning","ehr-data","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jordandeklerk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-21T21:21:14.000Z","updated_at":"2025-05-05T18:40:37.000Z","dependencies_parsed_at":"2025-05-24T13:57:30.348Z","dependency_job_id":null,"html_url":"https://github.com/jordandeklerk/EHR-BERT","commit_stats":null,"previous_names":["jordandeklerk/ehr-bert"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jordandeklerk/EHR-BERT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jordandeklerk%2FEHR-BERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jordandeklerk%2FEHR-BERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jordandeklerk%2FEHR-BERT/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jordandeklerk%2FEHR-BERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jordandeklerk","download_url":"https://codeload.github.com/jordandeklerk/EHR-BERT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jordandeklerk%2FEHR-BERT/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266236913,"owners_count":23897283,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","bert","cms","deep-learning","ehr-data","machine-learning"],"created_at":"2024-11-08T09:19:33.649Z","updated_at":"2025-07-21T04:07:48.455Z","avatar_url":"https://github.com/jordandeklerk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# EHR-BERT: Masked Token Learning for Inpatient Diagnosis and Procedure Prediction \n\n## Overview\n\n![BEHRT Architecture](./images/bert.png)\n\nEHR-BERT is aimed at leveraging BERT (Bidirectional Encoder Representations from Transformers) for Electronic Health Records (EHR) data. BERT is an encoder-only transformer model that processes sequences of data by capturing context from both directions (left-to-right and right-to-left) simultaneously. This bidirectional capability allows it to understand the nuanced relationships within the data. 
In the context of EHR data, BERT is adapted to handle the unique structure of medical records, which include sequential visits, diagnoses, procedures, and patient information such as age and event positions.\n\nEHR-BERT takes into account diagnoses, age, positional encodings for events, and segment encodings that differentiate between visits. These elements are combined to form a final embedding that represents the latent contextual information of a patient's EHR at any given visit.\n\n## Project Structure\n\n```\n├── cms_de10_data\n├── data\n│   ├── Bert-Pretraining\n│   │   ├── bert_config.json\n│   │   └── pytorch_model.bin\n│   ├── combined_ip_claims.pkl\n│   ├── data-comb-visit.pkl\n│   ├── eval-id.txt\n│   ├── test-id.txt\n│   └── train-id.txt\n├── download_data.sh\n├── main.py\n├── requirements.txt\n├── src\n│   ├── EHRBert\n│   │   ├── bert.py\n│   │   ├── bert_config.py\n│   │   └── bert_pretrain.py\n│   ├── ehr_dataset.py\n│   ├── preprocess_ip_claims.py\n│   └── utils.py\n└── vocab\n    ├── dx-vocab.txt\n    └── proc-vocab.txt\n```\n\n## CMS Data Summary\n\nMedicare Claims Synthetic Public Use Files (SynPUFs) provide a way to work with realistic Medicare claims data while protecting beneficiary privacy. The SynPUFs are similar in structure to CMS Limited Data Sets but with fewer variables, allowing users to develop programs and products that will work on actual CMS data files. These synthetic files include a robust set of metadata not available in the public domain.\n\nThough limited in inferential research value, SynPUFs offer a timely and cost-effective way to access realistic data, fostering innovation and better care for Medicare beneficiaries. 
In this project, we use the 2008-2010 Data Entrepreneurs’ SynPUF, available for free download.\n\n## Usage\n\n### Install Dependencies\n\nClone this repository, then create a virtual environment and install the dependencies:\n\n```bash\n# Clone the repo\ngit clone git@github.com:jordandeklerk/EHR-BERT.git\ncd EHR-BERT\n\n# Create a virtual environment\npython3 -m venv myenv\n\n# Activate the virtual environment\nsource myenv/bin/activate\n\n# Install the required Python packages\npip install -r requirements.txt\n```\n\n### Download Data\nRun the `download_data.sh` script to download and unzip the CMS data files into the `cms_de10_data` folder (the full sample takes a while). Edit the script first if you only want a subset of the CMS data:\n```bash\nbash download_data.sh\n```\n\n### Preprocess Data\nRun the `preprocess_ip_claims.py` script (under `src/`) to preprocess the downloaded inpatient claims and create the diagnosis (ICD) and procedure (PROC) code vocab files:\n```bash\npython preprocess_ip_claims.py\n```\n\n### Main Script\nTo train and evaluate the model, run `main.py` with the training parameters:\n```bash\npython main.py --model_name 'Bert-Pretraining' \\\n               --data_dir './data' \\\n               --pretrain_dir './data' \\\n               --train_file 'data-comb-visit.pkl' \\\n               --output_dir './data/Bert-Pretraining' \\\n               --do_train \\\n               --do_eval \\\n               --num_train_epochs 15 \\\n               --learning_rate 5e-4\n```\n\n### Results\n\n| Metric                | Value      | Hyperparameters       |\n|-----------------------|------------|-----------------------|\n| Diagnosis PR-AUC      | ~83%       | `num_train_epochs=15`, `learning_rate=5e-4` |\n| Procedure PR-AUC      | ~84%       | `num_train_epochs=15`, `learning_rate=5e-4` |\n\n## Todo\n\n- [ ] Use the pre-trained model for downstream re-admission prediction.\n\n## File Descriptions\n\n- `download_data.sh`: Script for 
downloading the necessary CMS data files.\n- `preprocess_ip_claims.py`: Script to preprocess the raw claims data.\n- `ehr_dataset.py`: Contains functions for dataset loading and preprocessing.\n- `main.py`: Main script for training and evaluating the model.\n\n## Citations\n\n```bibtex\n@article{devlin2018bert,\n  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},\n  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},\n  journal={arXiv preprint arXiv:1810.04805},\n  year={2018}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjordandeklerk%2Fehr-bert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjordandeklerk%2Fehr-bert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjordandeklerk%2Fehr-bert/lists"}