Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/NorskRegnesentral/text-anonymization-benchmark
Annotated corpus + evaluation metrics for text anonymisation
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/NorskRegnesentral/text-anonymization-benchmark
- Owner: NorskRegnesentral
- License: mit
- Created: 2021-10-28T21:46:13.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-02-09T10:45:06.000Z (5 months ago)
- Last Synced: 2024-03-19T10:04:44.461Z (4 months ago)
- Language: Python
- Size: 7.06 MB
- Stars: 40
- Watchers: 7
- Forks: 7
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Lists
- awesome-pii - Text Anonymization Benchmark (TAB) - source corpus for text anonymization. It comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR). (Datasets / Other)
README
The _Text Anonymization Benchmark_ (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268 English-language court cases from the [European Court of Human Rights (ECHR)](https://www.echr.coe.int/Pages/home.aspx?p=home) manually annotated with:
* semantic categories for personal identifiers,
* masking decisions (in regard to the re-identification risk for the person to protect),
* confidential attributes,
* co-reference relations.

Details about the annotation process employed to develop this corpus can be found in the following paper:
> Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, Montserrat Batet.
[The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization](https://arxiv.org/abs/2202.00443),
_arXiv:2202.00443_.

## General information
This repository contains the v1.0 release of the Text Anonymization Benchmark, a corpus for text anonymization.
The corpus comprises 1,268 English-language court cases from the [European Court of Human Rights (ECHR)](https://www.echr.coe.int/). The documents were manually annotated with information about personal identifiers (including their semantic category and need for masking), confidential attributes and co-reference relations.

## Data format
The data is distributed in a standoff JSON format consisting of a list of document objects with the following information:

| Variable name | Description |
|---------------|-------------|
| annotations | an object with document annotations, each containing an object with entity mention annotations |
| dataset_type | which data split the court case belongs to (train / dev / test) |
| doc_id | the ID of the court case (e.g. “001-61807”) |
| meta | an object with metadata for each case (year, countries and legal articles involved etc.) |
| quality_checked | whether the document was revised by another annotator |
| task | the target of the anonymisation task (i.e. who to anonymise) |
| text | the text of the court case used during the annotation |

Each entity mention object under 'annotations' has the following attributes:

| Variable name | Description |
|---------------|-------------|
| entity_type | the semantic category of the entity (e.g. PERSON) |
| entity_mention_id | ID of the entity mention |
| start_offset | start character offset of the annotated span |
| end_offset | end character offset of the annotated span |
| span_text | the text of the annotated span |
| edit_type | type of annotator action for the mention (check / insert / correct) |
| identifier_type | the need for masking: 'DIRECT' or 'QUASI' if the mention should be masked, 'NO_MASK' otherwise |
| entity_id | ID of the entity that the mention refers to (shared by co-referring mentions) |
| confidential_status | category of a potential source of discrimination (e.g. beliefs, sexual orientation etc.) |

## License
TAB is released under an MIT License.
The MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve the copyright and license notices (see file [License](https://github.com/NorskRegnesentral/text-anonymisation-benchmark/blob/master/LICENSE.txt)). Licensed works, modifications, and larger works may be distributed under different terms and without source code.
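
## Reading the annotations (illustrative sketch)

To make the standoff format described under "Data format" more concrete, here is a minimal Python sketch showing how the character offsets and `identifier_type` values could be used to mask a document. It is not part of the TAB code base. The helper names (`spans_to_mask`, `merge_spans`, `mask_text`), the `***` placeholder, the annotator name and the toy document are all hypothetical. The sketch also assumes that each annotator's mentions are stored in a list under an `entity_mentions` key and that `end_offset` is exclusive, so that `text[start_offset:end_offset]` equals `span_text`. Both assumptions should be checked against the released files.

```python
def spans_to_mask(doc):
    """Collect (start, end) spans marked DIRECT or QUASI by any annotator."""
    spans = set()
    for annotation in doc["annotations"].values():
        # Assumption: mentions live in a list under "entity_mentions".
        for mention in annotation.get("entity_mentions", []):
            if mention["identifier_type"] in ("DIRECT", "QUASI"):
                spans.add((mention["start_offset"], mention["end_offset"]))
    return sorted(spans)


def merge_spans(spans):
    """Merge overlapping or touching spans so each character is masked once."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_text(text, spans, placeholder="***"):
    """Replace each span with a placeholder, working from the end of the
    text so that earlier offsets remain valid."""
    for start, end in reversed(merge_spans(spans)):
        text = text[:start] + placeholder + text[end:]
    return text


def toy_mention(text, span_text, identifier_type):
    """Build a hypothetical mention object with offsets computed from the text."""
    start = text.index(span_text)
    return {
        "span_text": span_text,
        "start_offset": start,
        "end_offset": start + len(span_text),  # assumed exclusive
        "identifier_type": identifier_type,
    }


# Toy document (not taken from the corpus) that mimics the documented fields.
# With the real data, json.load on a corpus file would give a list of such objects.
text = "Mr John Smith was born in Oslo in 1980."
doc = {
    "doc_id": "toy-0001",
    "text": text,
    "annotations": {
        "annotator1": {
            "entity_mentions": [
                toy_mention(text, "John Smith", "DIRECT"),
                toy_mention(text, "Oslo", "QUASI"),
                toy_mention(text, "1980", "NO_MASK"),
            ]
        }
    },
}

masked = mask_text(doc["text"], spans_to_mask(doc))
print(masked)
```

Pooling the spans of all annotators, as done here, is only one possible choice. An evaluation that compares a system against individual annotators would keep them separate.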