https://github.com/centre-for-humanities-computing/danish-ner-bias

Investigating bias in Danish language models in Named Entity Recognition (NER). Code from the paper titled "Detecting intersectionality in NER models: A data-driven approach."

language-models named-entity-recognition nlp


Host: GitHub
URL: https://github.com/centre-for-humanities-computing/danish-ner-bias
Owner: centre-for-humanities-computing
License: apache-2.0
Created: 2023-01-31T09:53:18.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-05-08T21:28:26.000Z (about 3 years ago)
Last Synced: 2025-02-22T22:41:53.284Z (over 1 year ago)
Topics: language-models, named-entity-recognition, nlp
Language: Python
Homepage:
Size: 3.85 MB
Stars: 2
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.md


# Detecting intersectionality in NER models: A data-driven approach

This repository contains the code used to produce the results in the paper "**Detecting intersectionality in NER models: A data-driven approach**" by Lassen et al. (2023). 

The project investigates intersectional biases in Danish language models (see [list](https://github.com/centre-for-humanities-computing/Danish-NER-bias#danish-language-models)) used for Named Entity Recognition (NER). It does so through data augmentation: every name in the [DaNE](https://aclanthology.org/2020.lrec-1.565/) test set is replaced with names drawn from gender-divided name lists, for both majority and minority names, and the models are then evaluated on the augmented data.

For instructions on how to reproduce the results, please refer to the [Pipeline](https://github.com/centre-for-humanities-computing/Danish-NER-bias#pipeline) section.
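The name-replacement idea described above can be illustrated with a minimal sketch in plain Python. This is not the repository's implementation (which uses augmenty, see Acknowledgements); the function name `augment_person_names`, the BIO-tagged input format, and the placeholder names are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Token = Tuple[str, str]  # (token text, BIO tag), a hypothetical input format


def augment_person_names(
    sentence: List[Token],
    first_names: List[str],
    last_names: List[str],
    rng: random.Random,
) -> List[Token]:
    """Replace every PER span with a name sampled from the given lists.

    Assumption for this sketch: a one-token span becomes a first name,
    and any longer span becomes exactly "first name + last name".
    """
    augmented: List[Token] = []
    i = 0
    while i < len(sentence):
        text, tag = sentence[i]
        if tag == "B-PER":
            # Find the end of the current person span.
            j = i + 1
            while j < len(sentence) and sentence[j][1] == "I-PER":
                j += 1
            augmented.append((rng.choice(first_names), "B-PER"))
            if j - i > 1:
                augmented.append((rng.choice(last_names), "I-PER"))
            i = j
        else:
            # Non-person tokens are copied unchanged.
            augmented.append((text, tag))
            i += 1
    return augmented


# Placeholder names, not taken from the repository's name lists.
sentence = [
    ("Peter", "B-PER"),
    ("Jensen", "I-PER"),
    ("bor", "O"),
    ("i", "O"),
    ("Aarhus", "B-LOC"),
]
augmented = augment_person_names(sentence, ["Anna"], ["Hansen"], random.Random(0))
print(augmented)
```

Running the same test sentences through a model once per name list (for example majority women, minority men) and comparing scores is what makes performance differences between groups visible.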

## Project structure 

The repository has the following directory structure:

| Path | Description |
|---------|:-----------|
| ```name_lists``` | Contains name lists used for data augmentation |
| ```requirements``` | Requirements file for all models, and a separate file for ```polyglot``` |
| ```results``` | Results from all model evaluations saved as CSV files |
| ```src``` | Contains scripts for evaluating all models (```evaluate_XX.py```). Also has helper modules for preprocessing name lists (```process_names```), importing models (```apply_fns```), and augmenting names and evaluating models (```evaluate_fns```). |
| ```utils-R``` | Utilities for running ```results.md``` |
| ```results.md``` | Rmarkdown for producing the results table from the paper (```Table 2```) |
| ```polyglot.sh``` | Installs necessary tools and packages and runs the evaluation of ```polyglot``` |
| ```setup.sh``` | Installs necessary packages for running the evaluation of all models except ```polyglot``` |
| ```run-models.sh``` | Runs the evaluation of all models except ```polyglot``` |
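Since the evaluation results are stored as CSV files in ```results```, they can be inspected with a few lines of Python. The sketch below is only an illustration: the helper name `load_result_tables` is hypothetical, and because the column layout of the CSV files is not documented here, it just reports the header of each file.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_result_tables(results_dir: str) -> Dict[str, List[dict]]:
    """Read every CSV file in a directory into a list of row dictionaries."""
    directory = Path(results_dir)
    tables: Dict[str, List[dict]] = {}
    if not directory.is_dir():
        # Nothing to load, e.g. when run outside the repository root.
        return tables
    for csv_path in sorted(directory.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            tables[csv_path.stem] = list(csv.DictReader(handle))
    return tables


# Assumes the script is run from the repository root.
for name, rows in load_result_tables("results").items():
    columns = list(rows[0].keys()) if rows else []
    print(f"{name}: {len(rows)} rows, columns = {columns}")
```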

### Danish language models

The following models are evaluated:

* [ScandiNER](https://huggingface.co/saattrupdan/nbailab-base-ner-scandi)

* [DaCy models](https://github.com/centre-for-humanities-computing/DaCy)

    * DaCy large (da_dacy_large_trf-0.1.0)

    * DaCy medium (da_dacy_medium_trf-0.1.0)

    * DaCy small (da_dacy_small_trf-0.1.0)

* [DaNLP BERT](https://danlp-alexandra.readthedocs.io/en/stable/docs/tasks/ner.html#bert)

* [Flair](https://github.com/flairNLP/flair)

* [NERDA](https://github.com/ebanalyse/NERDA/)

* [spaCy models](https://spacy.io/models/da)

    * spaCy large (da_core_news_lg-3.4.0)

    * spaCy medium (da_core_news_md-3.4.0)

    * spaCy small (da_core_news_sm-3.4.0)

* [Polyglot](https://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html)

## Pipeline 

The pipeline has been built on Ubuntu ([UCloud](https://cloud.sdu.dk/)). 

For all models except ```polyglot```, first run ```setup.sh```:

```
bash setup.sh
```

This will create a virtual environment (```env```) and install the necessary packages. 

To then evaluate all models except ```polyglot```, run:

```
bash run-models.sh
```

### Polyglot

To set up and evaluate ```polyglot```, run:

```
sudo bash polyglot.sh
```

**NB! This script must be run with sudo, because the setup requires certain devtools that will not be installed otherwise. Run at your own risk!**

The ```polyglot.sh``` script installs the devtools and packages, then runs the evaluation of the model in a separately created environment called ```polyenv```.

## Acknowledgements

The name augmentation was performed using [augmenty](https://kennethenevoldsen.github.io/augmenty/) and model evaluation was performed using the [DaCy](https://github.com/centre-for-humanities-computing/DaCy) framework.
