{"id":35048222,"url":"https://github.com/sanofi-public/clm-ler","last_synced_at":"2026-05-23T07:08:02.387Z","repository":{"id":293498247,"uuid":"983048137","full_name":"Sanofi-Public/CLM-LER","owner":"Sanofi-Public","description":"Train, and fine-tune transformer models on huggingface for tasks involving electronic health records (EHR) and laboratory data","archived":false,"fork":false,"pushed_at":"2025-05-13T20:17:18.000Z","size":80,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-19T18:40:59.739Z","etag":null,"topics":["bert","clinical-nlp","ehr-data"],"latest_commit_sha":null,"homepage":"https://www.sanofi.com/en/our-science/digital","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sanofi-Public.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"license.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-13T19:50:25.000Z","updated_at":"2025-05-15T13:23:39.000Z","dependencies_parsed_at":"2025-05-15T17:11:50.180Z","dependency_job_id":null,"html_url":"https://github.com/Sanofi-Public/CLM-LER","commit_stats":null,"previous_names":["sanofi-public/clm-ler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Sanofi-Public/CLM-LER","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2FCLM-LER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2FCLM-LER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2FCLM-LER/releases","manifes
ts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2FCLM-LER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sanofi-Public","download_url":"https://codeload.github.com/Sanofi-Public/CLM-LER/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2FCLM-LER/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28076528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-27T02:00:05.897Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","clinical-nlp","ehr-data"],"created_at":"2025-12-27T09:02:20.912Z","updated_at":"2025-12-27T09:02:23.008Z","avatar_url":"https://github.com/Sanofi-Public.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![License](https://img.shields.io/badge/License-Academic%20Non--Commercial-blue.svg)](LICENSE)\n[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)\n\n# CLM-LER: Clinical Language Models for Lab and Electronic Health Records\n\nCLM-LER is a package designed to preprocess, train, and fine-tune transformer models on huggingface for tasks involving electronic health records (EHR) and laboratory data. 
It supports workflows for data preprocessing, model training, and evaluation, leveraging tools like PySpark, Hugging Face Transformers, and WandB for efficient and scalable operations.\n\n---\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Installation](#installation)\n3. [Pretraining Pipeline Description](#pretraining-pipeline-description)\n   - [Data Processing for Model Pre-Training](#data-processing-for-model-pre-training)\n   - [Model Pre-Training](#model-pre-training)\n   - [Model Fine-Tuning](#model-fine-tuning)\n4. [Testing with EHRSHOT Benchmarks](#testing-with-ehrshot-benchmarks)\n5. [UMLS for Mapping](#umls-for-mapping)\n6. [Acknowledgements](#acknowledgements)\n\n---\n\n## Overview\n\nCLM-LER provides a modular framework for working with EHR data. It includes utilities for:\n- Preprocessing raw EHR data into tokenized formats.\n- Training CLM models on large-scale EHR datasets.\n- Fine-tuning models for specific downstream tasks like classification.\n- Handling unit conversions, percentile calculations, and UMLS-based translations.\n\n---\n\n## Installation\n\n### Prerequisites\n- Python 3.10\n- PySpark\n- Hugging Face Transformers\n- WandB (Weights and Biases)\n\n### Work in a Conda Environment or Python Virtual Environment\nUsing a virtual environment prevents conflicting installations of packages. You can create one with the following:\n```\nconda create -n train-clm python=3.10 -y\nconda activate train-clm\n```\n\n### Define Torch Dependencies\nSpecify the version of `torch` to install and the index for downloading it. Torch 2.0.1 (compiled for CUDA 11.7) was found to work well for this project. If you're using another version of CUDA, adjust the torch version accordingly.\n\n### Basic Installation\nGood for development work and training models interactively.\n```\n\u003cinstall-torch\u003e # e.g. 
pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117/\npip install -e .[dev,train,spark]\n```\n\n### Full Installation\nFor all dependencies:\n```\n\u003cinstall-torch\u003e # e.g. pip install torch==2.0.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117/\npip install -e .[full]\n```\n\n### Installation Options:\n- `dev`: dependencies to run unit tests and develop in this package.\n- `train`: dependencies to train a model.\n- `spark`: what you need for any data processing, i.e. a PySpark installation.\n\nN.B. None of the options above declares torch as an explicit dependency, but torch must be installed.\n\n## The Use of S3\nWe used Amazon S3 for much of our I/O. This open-source release shows how AWS access keys with permissions to an S3 bucket can be used.\nWe encourage anyone following along to use a similar setup to reduce friction in getting started!\n\n## The Use of W\u0026B\nGiven how many clinical language models we trained on the way to the CLM-LER architecture, we used W\u0026B for support and logging.\n\nBoth online and offline options are available; please raise an issue if you run into trouble!\n\nTo set this up:\n```\nwandb login\n```\n\n## Spark Cluster Setup (Optional and Encouraged)\nWe used Spark clusters to handle the tens of millions of patients in the electronic medical records (EMRs) we work with.\nIf you have Amazon EMR, this could work for you! We suggest the use of [emrflow](https://github.com/Sanofi-Public/emrflow).\n\nWe removed it from this repo to minimize dependencies and ease adoption in case you use another distributed compute tool (e.g. Databricks clusters).\n\n## Environment Setup\nSet the following environment variables:\n```\nexport CLM_AWS_ACCESS_KEY_ID=XYZ\nexport CLM_AWS_SECRET_ACCESS_KEY=XYZ\nexport CLM_AWS_DEFAULT_REGION=\u003cregion\u003e # e.g. 
eu-west-1\nexport WANDB_API_KEY=XYZ\nexport WANDB_USERNAME=XYZ\nexport WANDB_ENTITY=XYZ\nexport CLMENCODER_DEPS=train,spark\n```\n\n---\n\n# Pretraining Pipeline Description\n\nThere are four key stages handled by this package: data processing for model training, model pre-training, model fine-tuning, and testing with EHRSHOT benchmarks. _Note: Explainability has been split into a separate repository: [Clinical-BERT-Explainability](https://github.com/Sanofi-Public/Clinical-BERT-Explainability?tab=readme-ov-file)._\n\n---\n\n## Data Processing for Model Pre-Training\n\nThe input datasets are defined in configuration files, such as `clm_ler/config/data_files_full.yaml`.\n\n### Step 1: Create Train/Test/Validation Splits\nThe first step in building the clinical language model is to split the clinical data into train, validation, and test sets. The following script shows how you might trigger this process:\n```\nbash scripts/create_global_data_split.sh\n```\n\n### Step 2: Arrange Data in the Required Format\nThe data must be arranged in the CLM-LER data model. For an example of how to preprocess the data, refer to the script `scripts/preprocess_data.py`. This script demonstrates the steps to preprocess patient, diagnosis, prescription, procedure, and lab data into tokenized formats.\n\n---\n\n## Step 3: Model Pre-Training\n\nOnce the data is preprocessed, follow these steps to pre-train a CLM model:\n\n### Generate Vocabulary\nGenerate a vocabulary file for the model. 
See `scripts/preprocess_data.py` for an example!\n\n\n### Train the Model\nTrain the CLM model using the preprocessed data and generated vocabulary.\nTraining a full CLM model typically takes about a week on an NVIDIA A10G GPU (e.g., g5.xlarge EC2 instance) for an EHR dataset with over 40M U.S. patients.\nSee `scripts/preprocess_data.py` for a usage example.\n\n---\n\n## Step 4: Model Fine-Tuning\n\nFine-tune the pre-trained model for specific tasks, such as classification.\n\n### Prepare Data like Step 3 and Add Labeled Data\nEnsure the dataset includes a label column for the target classification task.\n\n### Fine-Tune the Model\nSee the `scripts/run_asthma_with_labs.sh` or `run_all_ehrshot_training.sh` scripts for examples of the fine-tuning call.\n\n---\n\n## Testing with EHRSHOT Benchmarks\n\nEHRSHOT is a benchmarking dataset for evaluating model performance on various EHR-related tasks. Learn more: [EHRSHOT](https://github.com/som-shahlab/ehrshot-benchmark).\nIt provides multiple tasks for which a model can be fine-tuned. Using this dataset involves a few steps.\n\nFirst, if you are using a pre-trained model, you will want to map the EHRSHOT dataset's tokens to those expected by your model's vocabulary. This is handled by the script `src/clm_ler/data_processing/process_ehrshot_data.py`, which takes a model and the raw data. Given a config file like `src/clm_ler/config/mapping_config_to_clm_ler.yaml`, the codes in the EHRSHOT data and in the model's vocabulary are normalized to the names expected by UMLS. When running the script, you will be notified of any code sources that could not be mapped to UMLS. For example, this could happen because you did not map ICD9 -\u003e ICD9CM (the name of this source in UMLS).\n\nSecond, this is a time-series dataset with labelled events stored separately. Once you have created the dataset above, you need to join the labels into it, creating the dataset needed for inference and training. 
An example config is supplied: `src/clm_ler/config/config_add_labels_translated_data.yaml`.\n\nFor an example of processing data for CLM-LER, see `scripts/run_clmler_ehrshot_preprocess.py`.\n\n---\n\n## UMLS for Mapping\n\nThe Unified Medical Language System (UMLS) is used for mapping medical codes to standardized concepts. This ensures consistency across datasets and models.\n\n- **Mapping Configurations**: See `clm_ler/config/mapping_config_to_clm_ler.yaml` for examples of UMLS mappings.\n- **Translation Utilities**: The `clm_ler.data_processing.data_processing_utils` module provides functions for deriving UMLS translations.\n\nFor more details, refer to the [UMLS documentation](https://www.nlm.nih.gov/research/umls/index.html).\n\n---\n\n## Acknowledgements\n\nThis project leverages several key resources and contributions:\n\n1. **UMLS (Unified Medical Language System)**  \n   - The UMLS Metathesaurus is used for mapping medical codes to standardized concepts.  \n   - Learn more: [UMLS](https://www.nlm.nih.gov/research/umls/index.html)\n   - _Ref_: UMLS Knowledge Sources [dataset on the Internet]. Release 2024AA. Bethesda (MD): National Library of Medicine (US); 2024 May 6 [cited 2024 Jul 15]. Available from: http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html\n\n2. **EHRSHOT**  \n   - The EHRSHOT benchmark datasets are used for evaluating model performance on various EHR-related tasks.  \n   - Learn more: [EHRSHOT](https://github.com/som-shahlab/ehrshot-benchmark)\n   - _Ref_: Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason A. Fries, and Nigam H. Shah. 2023. EHRSHOT: an EHR benchmark for few-shot evaluation of foundation models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 2933, 67125–67137.\n\n3. 
**Inventors**  \n   This project was developed by:\n   - Lukas Adamek\n   - Jenny Du\n   - Maksim Kriukov\n   - Towsif Rahman\n   - Utkarsh Vashisth\n   - Brandon Rufino\n\n   Special thanks to the inventors for their contributions to the development of CLM-LER.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanofi-public%2Fclm-ler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsanofi-public%2Fclm-ler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanofi-public%2Fclm-ler/lists"}