{"id":13673789,"url":"https://github.com/microsoft/presidio-research","last_synced_at":"2025-04-12T14:58:13.695Z","repository":{"id":38369227,"uuid":"231945987","full_name":"microsoft/presidio-research","owner":"microsoft","description":"This package features data-science related tasks for developing new recognizers for Presidio. It is used for the evaluation of the entire system, as well as for evaluating specific PII recognizers or PII detection models.","archived":false,"fork":false,"pushed_at":"2025-03-03T22:18:27.000Z","size":10788,"stargazers_count":196,"open_issues_count":13,"forks_count":65,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-12T14:58:07.944Z","etag":null,"topics":["deep-learning","flair","machine-learning","named-entity-recognition","natural-language-processing","ner","nlp","pii","privacy","spacy","transformers"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-05T16:35:12.000Z","updated_at":"2025-04-04T02:04:15.000Z","dependencies_parsed_at":"2023-02-12T01:16:06.796Z","dependency_job_id":"61befb65-20e6-4363-85bd-855d19c92730","html_url":"https://github.com/microsoft/presidio-research","commit_stats":{"total_commits":210,"total_committers":24,"mean_commits":8.75,"dds":0.6857142857142857,"last_synced_commit":"1e984aec81aae3e6ebc1c931028cd41f65447bf6"},"previous_names":[],"tags_count":5,"template":false,"template_full_na
me":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fpresidio-research","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fpresidio-research/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fpresidio-research/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2Fpresidio-research/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/presidio-research/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248586250,"owners_count":21128997,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","flair","machine-learning","named-entity-recognition","natural-language-processing","ner","nlp","pii","privacy","spacy","transformers"],"created_at":"2024-08-02T11:00:22.487Z","updated_at":"2025-04-12T14:58:13.672Z","avatar_url":"https://github.com/microsoft.png","language":"Jupyter Notebook","funding_links":[],"categories":["Safety, Security \u0026 LLMOps","Evaluation and analysis"],"sub_categories":["Other"],"readme":"# Presidio-research\r\n\r\nThis package provides evaluation and data-science capabilities for \r\n[Presidio](https://github.com/microsoft/presidio) and PII detection models in general.\r\n\r\nIt also includes a fake data generator that creates synthetic sentences based on templates and fake PII.\r\n\r\n## Who should use it?\r\n\r\n- Anyone interested in 
**developing or evaluating PII detection models**, an existing Presidio instance or a Presidio PII recognizer.\r\n- Anyone interested in **generating new data based on previous datasets or sentence templates** (e.g., to increase the coverage of entity values) for Named Entity Recognition models.\r\n\r\n## Getting started\r\n\r\n\r\n### Using notebooks\r\nThe easiest way to get started is by reviewing the notebooks. \r\n- [Notebook 1](notebooks/1_Generate_data.ipynb): Shows how to use the PII data generator.\r\n- [Notebook 2](notebooks/2_PII_EDA.ipynb): Shows a simple analysis of the PII dataset.\r\n- [Notebook 3](notebooks/3_Split_by_pattern_number.ipynb): Provides tools to split the dataset into train/test/validation sets while avoiding leakage due to the same pattern appearing in multiple folds (only applicable for synthetically generated data).\r\n- [Notebook 4](notebooks/4_Evaluate_Presidio_Analyzer.ipynb): Shows how to use the evaluation tools to evaluate how well Presidio detects PII. Note that this uses vanilla Presidio, and the results aren't very accurate.\r\n- [Notebook 5](notebooks/5_Evaluate_Custom_Presidio_Analyzer.ipynb): Shows how one can configure Presidio to detect PII much more accurately and boost the F-score by ~30%.\r\n\r\n### Installation\r\n\r\n\u003eNote: Presidio evaluator requires Python version 3.9 or higher.\r\n\r\n#### From PyPI\r\n\r\n``` sh\r\nconda create --name presidio python=3.9\r\nconda activate presidio\r\npip install presidio-evaluator\r\npython -m spacy download en_core_web_sm # for tokenization\r\npython -m spacy download en_core_web_lg # for NER\r\n\r\n```\r\n\r\n#### From source\r\n\r\nTo install the package:\r\n1. Clone the repo\r\n2. 
Install all dependencies:\r\n\r\n``` sh\r\n# Install package+dependencies\r\npip install poetry\r\npoetry install --with=dev\r\n\r\n# Download the spaCy pipeline used for tokenization\r\npoetry run python -m spacy download en_core_web_sm\r\n\r\n# To install with all additional NER dependencies (e.g. Flair, Stanza), run:\r\n# poetry install --with='ner,dev'\r\n\r\n# To use the default Presidio configuration, a spaCy model is required:\r\npoetry run python -m spacy download en_core_web_lg\r\n\r\n# Verify installation\r\npytest\r\n```\r\n\r\nNote that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.\r\n\r\n## What's in this package?\r\n\r\n1. **Fake data generator** for PII recognizers and NER models\r\n2. **Data representation layer** for data generation, modeling and analysis\r\n3. Multiple **Model/Recognizer evaluation** files (e.g. for Presidio, spaCy, Flair, Azure AI Language)\r\n4. **Training and modeling code** for multiple models\r\n5. Helper functions for **results analysis**\r\n\r\n## 1. Data generation\r\n\r\nSee [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.\r\n\r\nThe data generation process takes a file with templates, e.g. `My name is {{name}}`. \r\nThen, it creates new synthetic sentences by sampling templates and PII values. 
\r\nFurthermore, it tokenizes the data, creates tags (either IO/BIO/BILUO) and spans for the newly created samples.\r\n\r\n- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).\r\n- For an example of running the generation process, see [this notebook](notebooks/1_Generate_data.ipynb).\r\n- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/2_PII_EDA.ipynb).\r\n\r\nOnce data is generated, it can be split into train/test/validation sets \r\nwhile ensuring that each template only exists in one set. \r\nSee [this notebook for more details](notebooks/3_Split_by_pattern_number.ipynb).\r\n\r\n## 2. Data representation\r\n\r\nIn order to standardize the process, \r\nwe use specific data objects that hold all the information needed for generating, \r\nanalyzing, modeling and evaluating data and models. Specifically, \r\nsee [data_objects.py](presidio_evaluator/data_objects.py).\r\n\r\nThe standardized structure, `List[InputSample]`, can be translated into different formats:\r\n- CoNLL\r\n  - To CoNLL:\r\n    ```python\r\n    from presidio_evaluator import InputSample\r\n    dataset = InputSample.read_dataset_json(\"data/synth_dataset_v2.json\")\r\n    conll = InputSample.create_conll_dataset(dataset)\r\n    conll.to_csv(\"dataset.csv\", sep=\"\\t\")\r\n    ```\r\n\r\n  - From CoNLL:\r\n    ```python\r\n    from pathlib import Path\r\n    from presidio_evaluator.dataset_formatters import CONLL2003Formatter\r\n    # Read from a folder containing CoNLL-2003 files\r\n    conll_formatter = CONLL2003Formatter(files_path=Path(\"data/conll2003\").resolve())\r\n    train_samples = conll_formatter.to_input_samples(fold=\"train\")\r\n    ```  \r\n\r\n\r\n- spaCy v3\r\n  ```python\r\n  from presidio_evaluator import InputSample\r\n  dataset = InputSample.read_dataset_json(\"data/synth_dataset_v2.json\")\r\n  
InputSample.create_spacy_dataset(dataset, output_path=\"dataset.spacy\")\r\n  ```\r\n\r\n- Flair\r\n  ```python\r\n  from presidio_evaluator import InputSample\r\n  dataset = InputSample.read_dataset_json(\"data/synth_dataset_v2.json\")\r\n  flair = InputSample.create_flair_dataset(dataset)\r\n  ```\r\n\r\n- json\r\n  ```python\r\n  from presidio_evaluator import InputSample\r\n  dataset = InputSample.read_dataset_json(\"data/synth_dataset_v2.json\")\r\n  InputSample.to_json(dataset, output_file=\"dataset_json\")\r\n  ```\r\n\r\n## 3. PII models evaluation\r\n\r\nThe presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer for precision, recall, and error analysis. See [Notebook 5](notebooks/5_Evaluate_Custom_Presidio_Analyzer.ipynb) for an example.\r\n\r\n## For more information\r\n\r\n- [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)\r\n- [How to evaluate PII Detection output with Presidio Evaluator](https://tranguyen221.medium.com/how-to-evaluate-pii-detection-output-with-presidio-evaluator-3f2684ba3091)\r\n- [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)\r\n\r\n# Contributing\r\n\r\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\r\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\r\nthe rights to use your contribution. For details, visit \u003chttps://cla.opensource.microsoft.com\u003e.\r\n\r\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\r\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\r\nprovided by the bot. 
You will only need to do this once across all repos using our CLA.\r\n\r\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\r\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\r\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\r\n\r\nCopyright notice:\r\n\r\nFake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/)\r\nare licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/).\r\nFake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fpresidio-research","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fpresidio-research","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fpresidio-research/lists"}