https://github.com/aphp/eds-pseudo
EDS-Pseudo is a hybrid model for detecting personally identifying entities in clinical reports
- Host: GitHub
- URL: https://github.com/aphp/eds-pseudo
- Owner: aphp
- License: other
- Created: 2022-12-09T16:05:32.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-14T17:41:45.000Z (29 days ago)
- Last Synced: 2024-11-14T18:32:17.474Z (29 days ago)
- Topics: edsnlp, nlp, pseudonymisation
- Language: Python
- Homepage: https://aphp.github.io/eds-pseudo
- Size: 4.1 MB
- Stars: 46
- Watchers: 4
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: changelog.md
- License: LICENSE
# EDS-Pseudo

The EDS-Pseudo project aims at detecting identifying entities in clinical documents, and was primarily tested
on clinical reports at AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of [edsnlp](https://github.com/aphp/edsnlp), and consists of a
hybrid model (rule-based + deep learning) for which we provide
rules ([`eds_pseudo/pipes`](https://github.com/aphp/eds-pseudo/tree/main/eds_pseudo/pipes))
and a training recipe ([`train.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/train.py)).

We also provide some fictitious
templates ([`templates.txt`](https://github.com/aphp/eds-pseudo/blob/main/data/templates.txt)) and a script to
generate a synthetic
dataset ([`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)).

The entities that are detected are listed below.

| Label | Description |
|------------------|---------------------------------------------------------------|
| `ADRESSE` | Street address, e.g. `33 boulevard de Picpus` |
| `DATE` | Any absolute date other than a birthdate |
| `DATE_NAISSANCE` | Birthdate |
| `HOPITAL` | Hospital name, e.g. `Hôpital Rothschild` |
| `IPP` | Internal AP-HP identifier for patients, displayed as a number |
| `MAIL` | Email address |
| `NDA` | Internal AP-HP identifier for visits, displayed as a number |
| `NOM` | Any last name (patients, doctors, third parties) |
| `PRENOM` | Any first name (patients, doctors, etc) |
| `SECU` | Social security number |
| `TEL` | Any phone number |
| `VILLE` | Any city |
| `ZIP` | Any zip code |

## Downloading the public pre-trained model

The public pretrained model is available on the HuggingFace model hub at
[AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public) and was trained on synthetic data
(see [`generate_dataset.py`](https://github.com/aphp/eds-pseudo/blob/main/scripts/generate_dataset.py)). You can also
test it directly on the **[demo](https://eds-pseudo-public.streamlit.app/)**.

1. Install the latest version of edsnlp
```shell
pip install "edsnlp[ml]" -U
```

2. Get access to the model at [AP-HP/eds-pseudo-public](https://hf.co/AP-HP/eds-pseudo-public)
3. Create and copy a Hugging Face token at https://huggingface.co/settings/tokens?new_token=true
4. Register the token (only once) on your machine

```python
import huggingface_hub

huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
```

5. Load the model
```python
import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))
```

To apply the model on many documents using one or more GPUs, refer to the documentation
of [edsnlp](https://aphp.github.io/eds-pseudo/main/inference).

## Installation to reproduce
If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:
```shell
git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo
```

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager
like [Poetry](https://python-poetry.org/).

```shell
poetry install
```

## How to use without machine learning
```python
import edsnlp

nlp = edsnlp.blank("eds")

# Some text cleaning
nlp.add_pipe("eds.normalizer")

# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)

# Address detection
nlp.add_pipe("eds_pseudo.addresses")

# Date detection
nlp.add_pipe("eds_pseudo.dates")

# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")

# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_)

# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
```

## How to train
Before training a model, you should update the
[configs/config.cfg](https://github.com/aphp/eds-pseudo/blob/main/configs/config.cfg) and
[pyproject.toml](https://github.com/aphp/eds-pseudo/blob/main/pyproject.toml) files to
fit your needs.

Put your data in the `data/dataset` folder (or edit the paths in the `configs/config.cfg` file to point
to `data/gen_dataset/train.jsonl`).

Then, run the training script:
```shell
python scripts/train.py --config configs/config.cfg --seed 43
```

This will train a model and save it in `artifacts/model-last`. You can evaluate it on the test set (defaults
to `data/dataset/test.jsonl`) with:

```shell
python scripts/evaluate.py --config configs/config.cfg
```

To package it, run:

```shell
python scripts/package.py
```

This will create a `dist/eds-pseudo-aphp-***.whl` file that you can install with `pip install dist/eds-pseudo-aphp-***`.
You can use it in your code:
```python
import edsnlp

# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")

# Or from the wheel file
import eds_pseudo_aphp

nlp = eds_pseudo_aphp.load()
```

## Documentation
Visit the [documentation](https://aphp.github.io/eds-pseudo/) for more information!
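
Once entities have been detected, with either the pre-trained model or the rule-based pipeline, a common next step is to mask them in the text. The sketch below shows one possible way to do this; it is not part of EDS-Pseudo. `mask_entities` is a hypothetical helper that works on plain `(start, end, label)` character offsets and assumes the spans do not overlap. With a loaded pipeline, such triples could presumably be built from `doc.ents` using the `start_char`, `end_char` and `label_` attributes of each span.

```python
from typing import Iterable, Tuple

# (start_char, end_char, label) triples. With a pipeline output `doc`, these
# could be built as [(e.start_char, e.end_char, e.label_) for e in doc.ents].
Span = Tuple[int, int, str]


def mask_entities(text: str, spans: Iterable[Span]) -> str:
    """Replace each character span with a `<LABEL>` placeholder.

    Hypothetical helper, not part of EDS-Pseudo. Assumes spans do not overlap.
    """
    masked = text
    # Process spans from right to left so that earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{label}>" + masked[end:]
    return masked


# Hand-written offsets for illustration only (no model is run here).
text = "En 2015, M. Myriel était évêque de Digne."
spans = [(3, 7, "DATE"), (12, 18, "NOM")]
print(mask_entities(text, spans))
# En <DATE>, M. <NOM> était évêque de Digne.
```

Replacing from the end of the text backwards avoids recomputing offsets after each substitution. Placeholders are only one option: substituting realistic fake values is another, and is out of scope for this sketch.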
## Publication
Please find our publication at the following link: https://doi.org/mkfv.
If you use EDS-Pseudo, please cite us as below:
```
@article{eds_pseudo,
title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
journal={Methods of Information in Medicine},
year={2024},
publisher={Georg Thieme Verlag KG}
}
```

## Acknowledgement
We would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/)
and [AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.