{"id":13530876,"url":"https://github.com/nedap/deidentify","last_synced_at":"2025-04-07T13:05:30.852Z","repository":{"id":40483524,"uuid":"228331179","full_name":"nedap/deidentify","owner":"nedap","description":"A Python library to de-identify medical records with state-of-the-art NLP methods.","archived":false,"fork":false,"pushed_at":"2023-11-14T03:18:28.000Z","size":241,"stargazers_count":129,"open_issues_count":5,"forks_count":24,"subscribers_count":39,"default_branch":"master","last_synced_at":"2025-03-31T12:04:31.073Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nedap.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-12-16T07:47:18.000Z","updated_at":"2025-03-27T15:34:37.000Z","dependencies_parsed_at":"2024-01-12T17:35:54.500Z","dependency_job_id":"f8f0de84-a127-4622-abee-ebb991869083","html_url":"https://github.com/nedap/deidentify","commit_stats":{"total_commits":109,"total_committers":8,"mean_commits":13.625,"dds":"0.44036697247706424","last_synced_commit":"a827378b5b454a928cccdb8fe85d6e1ae5c26464"},"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nedap%2Fdeidentify","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nedap%2Fdeidentify/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nedap%2Fdeidentify/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nedap%2Fdeide
ntify/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nedap","download_url":"https://codeload.github.com/nedap/deidentify/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247657275,"owners_count":20974344,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T07:00:56.921Z","updated_at":"2025-04-07T13:05:30.832Z","avatar_url":"https://github.com/nedap.png","language":"Python","funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# deidentify\n\nA Python library to de-identify medical records with state-of-the-art NLP methods. Pre-trained models for the Dutch language are available.\n\nThis repository shares the resources developed in the following paper:\n\n\u003e J. Trienes, D. Trieschnigg, C. Seifert, and D. Hiemstra. Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records. In: *Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop (HSDM)*, 2020.\n\nRead more about the work in our [paper](https://arxiv.org/abs/2001.05714) or [blog post](https://medium.com/nedap/de-identification-of-ehr-using-nlp-a270d40fc442).\n\n## Quick Start\n\n### Installation\n\nCreate a new virtual environment with an environment manager of your choice. Then, install `deidentify`:\n\n```sh\npip install deidentify\n```\n\nWe use the spaCy tokenizer. 
For good compatibility with the pre-trained models, we recommend using the same spaCy version that we used to train the de-identification models.\n\n```sh\npip install -U \"spacy\u003c3\" https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-2.3.0/nl_core_news_sm-2.3.0.tar.gz#egg=nl_core_news_sm==2.3.0\n```\n\n### Example Usage\n\nThe code below shows how to apply a pre-trained de-identification pipeline to an example document. We provide a [list of available models](#pre-trained-models) below.\n\n```py\nfrom deidentify.base import Document\nfrom deidentify.taggers import FlairTagger\nfrom deidentify.tokenizer import TokenizerFactory\n\n# Create some text\ntext = (\n    \"Dit is stukje tekst met daarin de naam Jan Jansen. De patient J. Jansen (e: \"\n    \"j.jnsen@email.com, t: 06-12345678) is 64 jaar oud en woonachtig in Utrecht. Hij werd op 10 \"\n    \"oktober door arts Peter de Visser ontslagen van de kliniek van het UMCU.\"\n)\n\n# Wrap text in document\ndocuments = [\n    Document(name='doc_01', text=text)\n]\n\n# Select downloaded model\nmodel = 'model_bilstmcrf_ons_fast-v0.2.0'\n\n# Instantiate tokenizer\ntokenizer = TokenizerFactory().tokenizer(corpus='ons', disable=(\"tagger\", \"ner\"))\n\n# Load tagger with a downloaded model file and tokenizer\ntagger = FlairTagger(model=model, tokenizer=tokenizer, verbose=False)\n\n# Annotate your documents\nannotated_docs = tagger.annotate(documents)\n```\n\nThis completes the annotation stage. Let's inspect the entities that the tagger found:\n\n```py\nfrom pprint import pprint\n\nfirst_doc = annotated_docs[0]\npprint(first_doc.annotations)\n```\n\nThis should print the entities of the first document.\n\n```py\n[Annotation(text='Jan Jansen', start=39, end=49, tag='Name', doc_id='', ann_id='T0'),\n Annotation(text='J. 
Jansen', start=62, end=71, tag='Name', doc_id='', ann_id='T1'),\n Annotation(text='j.jnsen@email.com', start=76, end=93, tag='Email', doc_id='', ann_id='T2'),\n Annotation(text='06-12345678', start=98, end=109, tag='Phone_fax', doc_id='', ann_id='T3'),\n Annotation(text='64 jaar', start=114, end=121, tag='Age', doc_id='', ann_id='T4'),\n Annotation(text='Utrecht', start=143, end=150, tag='Address', doc_id='', ann_id='T5'),\n Annotation(text='10 oktober', start=164, end=174, tag='Date', doc_id='', ann_id='T6'),\n Annotation(text='Peter de Visser', start=185, end=200, tag='Name', doc_id='', ann_id='T7'),\n Annotation(text='UMCU', start=234, end=238, tag='Hospital', doc_id='', ann_id='T8')]\n```\n\n#### Mask Annotations\n\nUse masking to replace annotations with placeholders. Example: `Jan Jansen -\u003e [NAME]`\n\n```py\nfrom deidentify.util import mask_annotations\n\nmasked_doc = mask_annotations(first_doc)\nprint(masked_doc.text)\n```\n\nThis should print:\n\n\u003e Dit is stukje tekst met daarin de naam [NAME]. De patient [NAME] (e: [EMAIL], t: [PHONE_FAX]) is [AGE] oud en woonachtig in [ADDRESS]. Hij werd op [DATE] door arts [NAME] ontslagen van de kliniek van het [HOSPITAL].\n\n#### Replace Annotations with Surrogates [experimental]\n\nUse surrogate generation to replace annotations with random but realistic alternatives. Example: `Jan Jansen -\u003e Bart Bakker`. The surrogate replacement strategy follows [Stubbs et al. (2015)](https://doi.org/10.1007/978-3-319-23633-9_27).\n\n```py\nfrom deidentify.util import surrogate_annotations\n\n# The surrogate generation process involves some randomness.\n# You can set a seed to make the process deterministic.\niter_docs = surrogate_annotations(docs=[first_doc], seed=1)\nsurrogate_doc = list(iter_docs)[0]\nprint(surrogate_doc.text)\n```\n\nThis code should print:\n\n\u003e Dit is stukje tekst met daarin de naam Gijs Hermelink. De patient G. 
Hermelink (e: n.qvgjj@spqms.com, t: 06-83662585) is 64 jaar oud en woonachtig in Cothen. Hij werd op 28 juni door arts Jullian van Troost ontslagen van de kliniek van het UMCU.\n\n### Available Taggers\n\nThere are currently three taggers that you can use:\n\n   * `DeduceTagger`: A wrapper around the DEDUCE tagger by Menger et al. (2018, [code](https://github.com/vmenger/deduce), [paper](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365))\n   * `CRFTagger`: A CRF tagger using the feature set by Liu et al. (2015, [paper](https://www.sciencedirect.com/science/article/pii/S1532046415001197))\n   * `FlairTagger`: A wrapper around the Flair [`SequenceTagger`](https://github.com/zalandoresearch/flair/blob/2d6e89bdfe05644b4e5c7e8327f6ecc6b834ec9e/flair/models/sequence_tagger_model.py#L68) allowing the use of neural architectures such as BiLSTM-CRF. The pre-trained models below use contextualized string embeddings by Akbik et al. (2018, [paper](https://www.aclweb.org/anthology/C18-1139/))\n\nAll taggers implement the `deidentify.taggers.TextTagger` interface, which you can also implement to provide your own taggers.\n\n### Tag Set\n\nUse `TextTagger.tags` to get a list of supported tags. For the `FlairTagger` in the demo above, this looks as follows:\n\n```py\n\u003e\u003e\u003e tagger.tags\n['Internal_Location', 'Age', 'Phone_fax', 'Name', 'SSN', 'Hospital', 'Email', 'Initials', 'O',\n'Organization_Company', 'ID', 'Profession', 'Care_Institute', 'Other', 'Date', 'URL_IP', 'Address']\n```\n\n### Pre-trained Models\n\nWe provide a number of pre-trained models for the Dutch language. The models were developed on the Nedap/University of Twente (NUT) dataset. The dataset consists of 1260 documents from three domains of Dutch healthcare: elderly care, mental care and disabled care (note: in the codebase we sometimes also refer to this dataset as `ons`). 
More information on the design of the dataset can be found in [our paper](https://arxiv.org/abs/2001.05714).\n\n\n| Name | Tagger | Lang | Dataset | F1* | Precision* | Recall* | Tags |\n|------|--------|----------|---------|----|-----------|--------|--------|\n| [DEDUCE (Menger et al., 2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365)** | `DeduceTagger` | NL | NUT | 0.6649 | 0.8192 | 0.5595 | [8 PHI Tags](https://github.com/nedap/deidentify/blob/168ad67aec586263250900faaf5a756d3b8dd6fa/deidentify/methods/deduce/run_deduce.py#L17) |\n| [model_crf_ons_tuned-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.2.0) | `CRFTagger` | NL | NUT | 0.8511 | 0.9337 | 0.7820 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_crf_ons_tuned-v0.2.0) |\n| [model_bilstmcrf_ons_fast-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.2.0) | `FlairTagger`  | NL | NUT | 0.8914 | 0.9101 | 0.8735 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_fast-v0.2.0) |\n| [model_bilstmcrf_ons_large-v0.2.0](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.2.0) | `FlairTagger` | NL | NUT | 0.8990 | 0.9240 | 0.8754 | [15 PHI Tags](https://github.com/nedap/deidentify/releases/tag/model_bilstmcrf_ons_large-v0.2.0) |\n\n*\\*All scores are micro-averaged entity-level precision/recall/F1 obtained on the test portion of each dataset. For additional metrics, see the corresponding model release.*\n\n*\\*\\*DEDUCE was developed on a dataset of psychiatric nursing notes and treatment plans. The numbers reported here were obtained by applying DEDUCE to our NUT dataset. For more information on the development of DEDUCE, see the paper by [Menger et al. 
(2018)](https://www.sciencedirect.com/science/article/abs/pii/S0736585316307365).*\n\n## Running Experiments and Training Models\n\nIf you have your own dataset of annotated documents and you want to train your own models on it, you can take a look at the following guides:\n\n   * [Convert your data into our corpus format](docs/01_data_format.md)\n   * [Train and evaluate your own models](docs/02_train_evaluate_models.md)\n   * [Logging and pipeline verbosity](docs/06_pipeline_verbosity.md)\n\nIf you want more information on the experiments in our paper, have a look here:\n\n   * [NUT annotation guidelines](docs/03_hsdm2020_nut_annotation_guidelines.md)\n   * [Surrogate generation procedure](docs/04_hsdm2020_surrogate_generation.md)\n   * [Experiments on English corpora: i2b2/UTHealth and nursing notes](docs/05_hsdm2020_english_datasets.md)\n\n### Computational Environment\n\nTo run your own experiments, we assume that you have cloned this code base locally and execute all scripts under `deidentify/` within the following conda environment:\n\n```sh\n# Install package dependencies and add local files to the Python path of that environment.\nconda env create -f environment.yml\nconda activate deidentify \u0026\u0026 export PYTHONPATH=\"${PYTHONPATH}:$(pwd)\"\n```\n\n## Citation\n\nPlease cite the following paper when using `deidentify`:\n\n```bibtex\n@inproceedings{Trienes:2020:CRF,\n  title={Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records},\n  author={Trienes, Jan and Trieschnigg, Dolf and Seifert, Christin and Hiemstra, Djoerd},\n  booktitle = {Proceedings of the 1st ACM WSDM Health Search and Data Mining Workshop},\n  series = {{HSDM} 2020},\n  year = {2020}\n}\n```\n\n## Contact\n\nIf you have any questions, please contact Jan Trienes at 
jan.trienes@gmail.com.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnedap%2Fdeidentify","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnedap%2Fdeidentify","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnedap%2Fdeidentify/lists"}