{"id":14483086,"url":"https://github.com/aphp/eds-pseudo","last_synced_at":"2025-07-17T13:33:35.370Z","repository":{"id":64658347,"uuid":"576346858","full_name":"aphp/eds-pseudo","owner":"aphp","description":"EDS-Pseudo is a hybrid model for detecting personally identifying entities in clinical reports","archived":false,"fork":false,"pushed_at":"2025-04-07T17:01:01.000Z","size":4807,"stargazers_count":52,"open_issues_count":2,"forks_count":6,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-13T00:03:40.580Z","etag":null,"topics":["edsnlp","nlp","pseudonymisation"],"latest_commit_sha":null,"homepage":"https://aphp.github.io/eds-pseudo","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aphp.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-09T16:05:32.000Z","updated_at":"2025-03-17T10:55:32.000Z","dependencies_parsed_at":"2023-09-29T11:16:07.279Z","dependency_job_id":"fc7b1ca7-d922-4454-a339-4fae1eeb090e","html_url":"https://github.com/aphp/eds-pseudo","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/aphp/eds-pseudo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Feds-pseudo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Feds-pseudo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Feds-pseudo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Feds-pseudo/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aphp","download_url":"https://codeload.github.com/aphp/eds-pseudo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Feds-pseudo/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265611323,"owners_count":23797859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["edsnlp","nlp","pseudonymisation"],"created_at":"2024-09-03T00:01:29.386Z","updated_at":"2025-07-17T13:33:35.349Z","avatar_url":"https://github.com/aphp.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003c!-- modelcard --\u003e\n\u003cdiv\u003e\n\n[\u003cimg style=\"display: inline\" src=\"https://img.shields.io/github/actions/workflow/status/aphp/eds-pseudo/tests.yml?branch=main\u0026label=tests\u0026style=flat-square\" alt=\"Tests\"\u003e]()\n[\u003cimg style=\"display: inline\" src=\"https://img.shields.io/github/actions/workflow/status/aphp/eds-pseudo/documentation.yml?branch=main\u0026label=docs\u0026style=flat-square\" alt=\"Documentation\"\u003e](https://aphp.github.io/eds-pseudo/latest/)\n[\u003cimg style=\"display: inline\" src=\"https://img.shields.io/codecov/c/github/aphp/eds-pseudo?logo=codecov\u0026style=flat-square\" alt=\"Codecov\"\u003e](https://codecov.io/gh/aphp/eds-pseudo)\n[\u003cimg style=\"display: inline\" src=\"https://img.shields.io/badge/repro-poetry-blue?style=flat-square\" alt=\"Poetry\"\u003e](https://python-poetry.org)\n[\u003cimg style=\"display: inline\" 
src=\"https://img.shields.io/badge/repro-dvc-blue?style=flat-square\" alt=\"DVC\"\u003e](https://dvc.org)\n[\u003cimg style=\"display: inline\" src=\"https://img.shields.io/badge/demo%20%F0%9F%9A%80-streamlit-purple?style=flat-square\" alt=\"Demo\"\u003e](https://eds-pseudo-public.streamlit.app/)\n\n\u003c/div\u003e\n\n# EDS-Pseudo\n\nThe EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested\non clinical reports at AP-HP's Clinical Data Warehouse (EDS).\n\nThe model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists in a\nhybrid model (rule-based + deep learning) for which we provide\nrules ([`eds-pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes))\nand a training recipe [`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py).\n\nWe also provide some fictitious\ntemplates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a script to\ngenerate a synthetic\ndataset [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py).\n\nThe entities that are detected are listed below.\n\n| Label            | Description                                                   |\n|------------------|---------------------------------------------------------------|\n| `ADRESSE`        | Street address, eg `33 boulevard de Picpus`                   |\n| `DATE`           | Any absolute date other than a birthdate                      |\n| `DATE_NAISSANCE` | Birthdate                                                     |\n| `HOPITAL`        | Hospital name, eg `Hôpital Rothschild`                        |\n| `IPP`            | Internal AP-HP identifier for patients, displayed as a number |\n| `MAIL`           | Email address                                                 |\n| `NDA`            | Internal AP-HP identifier for visits, displayed as a number   |\n| `NOM`            | Any last name 
(patients, doctors, third parties)              |\n| `PRENOM`         | Any first name (patients, doctors, etc)                       |\n| `SECU`           | Social security number                                        |\n| `TEL`            | Any phone number                                              |\n| `VILLE`          | Any city                                                      |\n| `ZIP`            | Any zip code                                                  |\n\n## Downloading the public pre-trained model\n\nThe public pretrained model is available on the HuggingFace model hub at\n[AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data\n(see [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)). You can also\ntest it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.\n\n1. Install the latest version of edsnlp\n\n    ```shell\n    pip install \"edsnlp[ml]\" -U\n    ```\n\n2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public)\n3. Create and copy a huggingface token https://huggingface.co/settings/tokens?new_token=true\n4. Register the token (only once) on your machine\n\n    ```python\n    import huggingface_hub\n\n    huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)\n    ```\n\n5. Load the model\n\n   ```python\n   import edsnlp\n\n   nlp = edsnlp.load(\"AP-HP/eds-pseudo-public\", auto_update=True)\n   doc = nlp(\n       \"En 2015, M. Charles-François-Bienvenu \"\n       \"Myriel était évêque de Digne. 
C’était un vieillard \"\n       \"d’environ soixante-quinze ans ; il occupait le \"\n       \"siège de Digne depuis 2006.\"\n   )\n\n   for ent in doc.ents:\n       print(ent, ent.label_, str(ent._.date))\n   ```\n\nTo apply the model on many documents using one or more GPUs, refer to the documentation\nof [edsnlp](https://aphp.github.io/eds-pseudo/main/inference).\n\n\u003c!-- metrics --\u003e\n\n## Installation to reproduce\n\nIf you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:\n\n```shell\ngit clone https://github.com/aphp/eds-pseudo.git\ncd eds-pseudo\n```\n\nAnd install the dependencies. We recommend pinning the library version in your projects, or use a strict package manager\nlike [Poetry](https://python-poetry.org/).\n\n```shell\npoetry install\n```\n\n## How to use without machine learning\n\n```python\nimport edsnlp\n\nnlp = edsnlp.blank(\"eds\")\n\n# Some text cleaning\nnlp.add_pipe(\"eds.normalizer\")\n\n# Various simple rules\nnlp.add_pipe(\n    \"eds_pseudo.simple_rules\",\n    config={\"pattern_keys\": [\"TEL\", \"MAIL\", \"SECU\", \"PERSON\"]},\n)\n\n# Address detection\nnlp.add_pipe(\"eds_pseudo.addresses\")\n\n# Date detection\nnlp.add_pipe(\"eds_pseudo.dates\")\n\n# Contextual rules (requires a dict of info about the patient)\nnlp.add_pipe(\"eds_pseudo.context\")\n\n# Apply it to a text\ndoc = nlp(\n    \"En 2015, M. Charles-François-Bienvenu \"\n    \"Myriel était évêque de Digne. 
C’était un vieillard \"\n    \"d’environ soixante-quinze ans ; il occupait le \"\n    \"siège de Digne depuis 2006.\"\n)\n\nfor ent in doc.ents:\n    print(ent, ent.label_)\n\n# 2015 DATE\n# Charles-François-Bienvenu NOM\n# Myriel PRENOM\n# 2006 DATE\n```\n\n## How to train\n\nBefore training a model, you should update the\n[configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) and\n[pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) files to\nfit your needs.\n\nPut your data in the `data/dataset` folder (or edit the paths `configs/config.cfg` file to point\nto `data/gen_dataset/train.jsonl`).\n\nThen, run the training script\n\n```shell\npython scripts/train.py --config configs/config.cfg --seed 43\n```\n\nThis will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults\nto `data/dataset/test.jsonl`) with:\n\n```shell\npython scripts/evaluate.py --config configs/config.cfg\n```\n\nTo package it, run:\n\n```shell\npython scripts/package.py\n```\n\nThis will create a `dist/eds-pseudo-aphp-***.whl` file that you can install with `pip install dist/eds-pseudo-aphp-***`.\n\nYou can use it in your code:\n\n```python\nimport edsnlp\n\n# Either from the model path directly\nnlp = edsnlp.load(\"artifacts/model-last\")\n\n# Or from the wheel file\nimport eds_pseudo_aphp\n\nnlp = eds_pseudo_aphp.load()\n```\n\n## Documentation\n\nVisit the [documentation](https://aphp.github.io/eds-pseudo/) for more information!\n\n## Publication\n\nPlease find our publication at the following link: https://doi.org/mkfv.\n\nIf you use EDS-Pseudo, please cite us as below:\n\n```\n@article{eds_pseudo,\n  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},\n  author={Tannier, Xavier and Wajsb{\\\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, 
Romain},\n  journal={Methods of Information in Medicine},\n  year={2024},\n  publisher={Georg Thieme Verlag KG}\n}\n```\n\n## Acknowledgement\n\nWe would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/)\nand the [AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faphp%2Feds-pseudo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faphp%2Feds-pseudo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faphp%2Feds-pseudo/lists"}