https://github.com/NorskRegnesentral/text-anonymization-benchmark

Annotated corpus + evaluation metrics for text anonymisation

Host: GitHub
URL: https://github.com/NorskRegnesentral/text-anonymization-benchmark
Owner: NorskRegnesentral
License: mit
Created: 2021-10-28T21:46:13.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2024-02-09T10:45:06.000Z (5 months ago)
Last Synced: 2024-03-19T10:04:44.461Z (4 months ago)
Language: Python
Size: 7.06 MB
Stars: 40
Watchers: 7
Forks: 7
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Lists

awesome-pii - Text Anonymization Benchmark (TAB) - source corpus for text anonymization. It comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR). (Datasets / Other)

README

The _Text Anonymization Benchmark_ (TAB) is a new, open-source corpus for text anonymization. It comprises 1,268 English-language court cases from the [European Court of Human Rights (ECHR)](https://www.echr.coe.int/Pages/home.aspx?p=home), manually annotated with:

* semantic categories for personal identifiers,

* masking decisions (in regard to the re-identification risk for the person to protect),

* confidential attributes,

* co-reference relations.

Details about the annotation process employed to develop this corpus can be found in the following paper: 

> Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, Montserrat Batet.
> [The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization](https://arxiv.org/abs/2202.00443),
> _arXiv:2202.00443_.

## General information

This repository contains the v1.0 release of the Text Anonymization Benchmark, a corpus for text anonymization.

The corpus comprises 1,268 English-language court cases from the [European Court of Human Rights (ECHR)](https://www.echr.coe.int/). The documents were manually annotated with information about personal identifiers (including their semantic category and need for masking), confidential attributes and co-reference relations.

## Data format

The data is distributed in a standoff JSON format consisting of a list of document objects, each with the following fields:

| Variable name | Description |
|---------------|-------------|
| annotations | an object with the document's annotations, each containing an object with entity mention annotations |
| dataset_type | the data split the court case belongs to (train / dev / test) |
| doc_id | the ID of the court case (e.g. “001-61807”) |
| meta | an object with metadata for each case (year, countries and legal articles involved, etc.) |
| quality_checked | whether the document was revised by another annotator |
| task | the target of the anonymisation task (i.e. who to anonymise) |
| text | the text of the court case used during the annotation |
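As a rough illustration of the document-level layout, the sketch below parses a list of document objects and counts how many fall into each split. The inline sample is hypothetical: "001-00000" is a made-up ID, the `task` and `text` values are placeholders, and `annotations` and `meta` are left empty. The helper names `load_documents` and `split_counts` are likewise illustrative and not part of the repository.

```python
import json
from collections import Counter

# Hypothetical two-document sample using the field names from the table above.
# "001-00000" is a made-up ID; "annotations" and "meta" are left empty here.
SAMPLE_JSON = """
[
  {"doc_id": "001-61807", "dataset_type": "train", "quality_checked": true,
   "task": "...", "text": "...", "meta": {}, "annotations": {}},
  {"doc_id": "001-00000", "dataset_type": "dev", "quality_checked": false,
   "task": "...", "text": "...", "meta": {}, "annotations": {}}
]
"""

# Document-level fields listed in the table above.
REQUIRED_FIELDS = {
    "annotations", "dataset_type", "doc_id", "meta",
    "quality_checked", "task", "text",
}


def load_documents(raw_json):
    """Parse a JSON string and check that it is a list of document objects."""
    docs = json.loads(raw_json)
    if not isinstance(docs, list):
        raise ValueError("expected a JSON list of document objects")
    for doc in docs:
        missing = REQUIRED_FIELDS - set(doc)
        if missing:
            raise ValueError(f"document is missing fields: {sorted(missing)}")
    return docs


def split_counts(docs):
    """Count documents per data split (train / dev / test)."""
    return Counter(doc["dataset_type"] for doc in docs)


docs = load_documents(SAMPLE_JSON)
print(split_counts(docs))
```

To run this on a downloaded corpus file, read the file with `open(path, encoding="utf-8")` and pass its contents to `load_documents`; the path depends on where the data was saved.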

Each entity mention object under 'annotations' has the following attributes:

| Variable name | Description |
|---------------|-------------|
| entity_type | the semantic category of the entity (e.g. PERSON) |
| entity_mention_id | ID of the entity mention |
| start_offset | start character offset of the annotated span |
| end_offset | end character offset of the annotated span |
| span_text | the text of the annotated span |
| edit_type | type of annotator action for the mention (check / insert / correct) |
| identifier_type | whether the mention needs masking: 'DIRECT' or 'QUASI' if it should be masked, 'NO_MASK' otherwise |
| entity_id | ID of the entity that the mention refers to |
| confidential_status | category of a potential source of discrimination (e.g. beliefs, sexual orientation, etc.) |
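The mention attributes above can be combined to produce a masked version of a text. The following is a minimal sketch, not the repository's own tooling: it assumes that `end_offset` is exclusive (so that `text[start_offset:end_offset]` equals `span_text`), and it uses an invented sentence with invented mentions. The function names and the `***` placeholder are illustrative choices.

```python
# Identifier types that call for masking, per the table above.
MASKED_TYPES = {"DIRECT", "QUASI"}

# Invented example sentence and mentions; only the attributes used here are shown.
TEXT = "Mr John Smith was born in Oslo in 1980."
MENTIONS = [
    {"start_offset": 3, "end_offset": 13, "span_text": "John Smith",
     "identifier_type": "DIRECT"},
    {"start_offset": 26, "end_offset": 30, "span_text": "Oslo",
     "identifier_type": "NO_MASK"},
    {"start_offset": 34, "end_offset": 38, "span_text": "1980",
     "identifier_type": "QUASI"},
]


def offsets_match(text, mentions):
    """Check the assumption that offsets slice out exactly span_text."""
    return all(
        text[m["start_offset"]:m["end_offset"]] == m["span_text"]
        for m in mentions
    )


def mask_text(text, mentions, placeholder="***"):
    """Replace every DIRECT or QUASI span with a placeholder."""
    spans = sorted(
        (m["start_offset"], m["end_offset"])
        for m in mentions
        if m["identifier_type"] in MASKED_TYPES
    )
    # Merge overlapping spans so no character is replaced twice.
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Rebuild the text, copying unmasked stretches and inserting placeholders.
    pieces = []
    cursor = 0
    for start, end in merged:
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


print(offsets_match(TEXT, MENTIONS))  # True
print(mask_text(TEXT, MENTIONS))      # Mr *** was born in Oslo in ***.
```

Running `offsets_match` first is a cheap way to confirm the offset convention on real data before relying on the masked output.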

## License

TAB is released under an MIT License.

The MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve the copyright and license notices (see file [License](https://github.com/NorskRegnesentral/text-anonymisation-benchmark/blob/master/LICENSE.txt)). Licensed works, modifications, and larger works may be distributed under different terms and without source code.