https://github.com/norskregnesentral/neuraltextsanitizer
Neural models for detecting and masking personal information from texts
- Host: GitHub
- URL: https://github.com/norskregnesentral/neuraltextsanitizer
- Owner: NorskRegnesentral
- License: MIT
- Created: 2022-06-01T17:13:49.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-25T13:07:21.000Z (almost 3 years ago)
- Last Synced: 2025-04-05T20:04:41.633Z (6 months ago)
- Language: Python
- Size: 4.2 MB
- Stars: 15
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# NeuralTextSanitizer
Text sanitization with explicit measures of privacy risk.

For ```python>=3.7```:
* Download the three models from [this link](https://drive.google.com/drive/folders/1p9znczAIruZKvUxY0hLRy5YXyj0SfOYk?usp=sharing) and place them in the SampleData folder
* ```python -m pip install -r requirements.txt```

The input should be a JSON file containing the text(s) to be sanitized. See *sample2.json* and *sample.json* in the SampleData folder for example inputs.

| Field | Description | Status |
| ------------- | ------------- | ------------- |
| text | The text to be sanitized | required |
| target | The individual to be protected in the text | required |
| annotations | Manually annotated start and end offsets, and semantic labels, of PII in the text | optional |

To run the whole pipeline, provide the path to an input file as follows:
* ```python sanitize.py SampleData/sample2.json```

The output is a JSON file containing the masking decisions of each module of the pipeline. More specifically:

| Field | Description |
| ------------- | ------------- |
| opt_decision | The masking decisions after the Optimization Algorithm |
| PII | Personally Identifiable Information in the text |
| blacklist1 | The masking decisions of the Language Model |
| blacklist2 | The masking decisions of the Web Query model |
| blacklist3 | The masking decisions of the Mask Classifier model |
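
The field tables above can be exercised with a small helper script. The following is only a sketch under stated assumptions, not part of the repository: the helper names (`make_record`, `write_input`, `documented_output_fields`) are hypothetical, the input file is assumed to be a JSON list of records, `target` is assumed to be a plain string, and the inner layout of `annotations` and of each output field is not specified here, so the code only passes them through or checks for their presence. Compare against *sample.json* before relying on it.

```python
import json
from pathlib import Path

# Field names taken from the tables above.
REQUIRED_FIELDS = ("text", "target")
OUTPUT_FIELDS = ("opt_decision", "PII", "blacklist1", "blacklist2", "blacklist3")


def make_record(text, target, annotations=None):
    """Build one input record (hypothetical helper).

    `target` as a plain string is an assumption; `annotations` is passed
    through unchanged because its exact layout is not documented here.
    """
    if not text or not target:
        raise ValueError("'text' and 'target' are required")
    record = {"text": text, "target": target}
    if annotations is not None:
        record["annotations"] = annotations
    return record


def write_input(records, path):
    """Write records as a JSON list (assumed layout) after checking required fields."""
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"missing required field(s): {missing}")
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def documented_output_fields(result):
    """Report which of the documented output fields appear in a result dict."""
    return {name: name in result for name in OUTPUT_FIELDS}
```

For instance, `write_input([make_record("Jane Doe lives in Oslo.", "Jane Doe")], "my_input.json")` would produce a file whose path can then be passed to `sanitize.py` in place of `SampleData/sample2.json`. After loading the pipeline's output with `json.load`, `documented_output_fields` gives a quick check that a result dict carries the fields listed in the output table.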