Named Entity Recognition
https://github.com/qurator-spk/sbb_ner

![sbb-ner-demo example](.screenshots/sbb_ner_demo.png?raw=true)

How the models have been obtained is described in our [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).

***

# Installation:

The recommended Python version is 3.11.
Consider using [pyenv](https://github.com/pyenv/pyenv) if that Python version is not available on your system.

Activate virtual environment (virtualenv):
```
source venv/bin/activate
```
or (pyenv):
```
pyenv activate my-python-3.11-virtualenv
```

Update pip:
```
pip install -U pip
```
Install sbb_ner:
```
pip install git+https://github.com/qurator-spk/sbb_ner.git
```
Download required models: https://qurator-data.de/sbb_ner/models.tar.gz

Extract model archive:
```
tar -xzf models.tar.gz
```

Copy the [config file](qurator/sbb_ner/webapp/config.json) into the working directory.
Set the USE_CUDA environment variable to True or False, depending on GPU availability. In the commands below, replace `True/False` with one of the two values.

Run webapp directly:

```
env CONFIG=config.json env FLASK_APP=qurator/sbb_ner/webapp/app.py env FLASK_ENV=development env USE_CUDA=True/False flask run --host=0.0.0.0
```

For production, use gunicorn instead:
```
env CONFIG=config.json env USE_CUDA=True/False gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
```

# Docker

## CPU-only:

```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-cpu -f Dockerfile.cpu .
```

```
docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-cpu
```

## GPU:

Make sure that your GPU is correctly set up and that nvidia-docker has been installed.

```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-gpu -f Dockerfile .
```

```
docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-gpu
```

The NER web interface is available at http://localhost:5000.

# REST - Interface

Get available models:
```
curl http://localhost:5000/models
```

Output:

```
[
  {
    "default": true,
    "id": 1,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL"
  },
  {
    "default": false,
    "id": 2,
    "model_dir": "data/konvens2019/build-on-all-german-de-finetuned/bert-sbb-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL + SBB"
  },
  {
    "default": false,
    "id": 3,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-sbb-de-finetuned",
    "name": "DC-SBB + SBB"
  },
  {
    "default": false,
    "id": 4,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-baseline",
    "name": "CONLL + GERMEVAL"
  }
]
```
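
The same query can be made from Python. The following is a minimal sketch that uses only the standard library. The helper names `fetch_models` and `default_model` are illustrative and not part of sbb_ner, and the base URL assumes the webapp is running locally as described above.

```python
import json
from urllib.request import urlopen


def fetch_models(base_url="http://localhost:5000"):
    """Return the parsed JSON list served by the /models endpoint."""
    with urlopen(f"{base_url}/models") as response:
        return json.load(response)


def default_model(models):
    """Return the entry flagged as default, or None if there is none."""
    for model in models:
        if model.get("default"):
            return model
    return None


# Example with two of the entries shown above (no network access needed).
models = [
    {"default": True, "id": 1, "name": "DC-SBB + CONLL + GERMEVAL"},
    {"default": False, "id": 2, "name": "DC-SBB + CONLL + GERMEVAL + SBB"},
]
print(default_model(models)["id"])  # prints 1
```

With a running server, `default_model(fetch_models())` would select the model to use in the `/ner/<id>` call.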

Perform NER using model 1:

```
curl -d '{ "text": "Paris Hilton wohnt im Hilton Paris in Paris." }' -H "Content-Type: application/json" http://localhost:5000/ner/1
```

Output:

```
[
  [
    {
      "prediction": "B-PER",
      "word": "Paris"
    },
    {
      "prediction": "I-PER",
      "word": "Hilton"
    },
    {
      "prediction": "O",
      "word": "wohnt"
    },
    {
      "prediction": "O",
      "word": "im"
    },
    {
      "prediction": "B-ORG",
      "word": "Hilton"
    },
    {
      "prediction": "I-ORG",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "in"
    },
    {
      "prediction": "B-LOC",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "."
    }
  ]
]
```
The JSON above is the expected input format of the
[SBB named entity linking and disambiguation system](https://github.com/qurator-spk/sbb_ned).
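
For clients that only need the recognized entities, the token-level tags can be merged into spans. The sketch below is one possible way to do this in Python and is not part of sbb_ner: `build_ner_request` and `group_entities` are hypothetical helper names, the response is assumed to be a list of sentences as in the output above, and joining tokens with a single space is an approximation of the original text.

```python
import json
from urllib.request import Request


def build_ner_request(text, model_id=1, base_url="http://localhost:5000"):
    """Build (but do not send) the POST request for the /ner/<id> endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return Request(
        f"{base_url}/ner/{model_id}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def group_entities(sentence):
    """Merge B-/I- tagged tokens of one sentence into (type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for token in sentence:
        tag, word = token["prediction"], token["word"]
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts (a stray I- tag is treated like B-).
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continuation of the currently open entity.
            current_words.append(word)
        else:
            # "O" closes any open entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# The single sentence of the response shown above.
sentence = [
    {"prediction": "B-PER", "word": "Paris"},
    {"prediction": "I-PER", "word": "Hilton"},
    {"prediction": "O", "word": "wohnt"},
    {"prediction": "O", "word": "im"},
    {"prediction": "B-ORG", "word": "Hilton"},
    {"prediction": "I-ORG", "word": "Paris"},
    {"prediction": "O", "word": "in"},
    {"prediction": "B-LOC", "word": "Paris"},
    {"prediction": "O", "word": "."},
]
print(group_entities(sentence))
# [('PER', 'Paris Hilton'), ('ORG', 'Hilton Paris'), ('LOC', 'Paris')]
```

To query a running server, pass the request to `urllib.request.urlopen`, parse the body with `json.load`, and call `group_entities` on each sentence of the result.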

# Model-Training

***
## Preprocessing of NER ground-truth:

### compile_conll

Reads CoNLL 2003 NER ground-truth files from a directory and
writes the parsed data to a pandas DataFrame that is
stored as a pickle file.

#### Usage

```
compile_conll --help
```
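
To illustrate the kind of parsing involved, here is a minimal Python sketch of reading CoNLL-style lines (one token per line, whitespace-separated columns with the NER tag last, blank lines between sentences). It is not the actual `compile_conll` implementation: `parse_conll` is a hypothetical helper, the sample lines are made up in the CoNLL 2003 column layout, and the columns of the DataFrame written by the tool may differ.

```python
def parse_conll(lines):
    """Yield (sentence_index, token, ner_tag) rows from CoNLL-style lines."""
    sentence_index = 0
    sentence_open = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines and document markers end the current sentence.
            if sentence_open:
                sentence_index += 1
                sentence_open = False
            continue
        columns = line.split()
        sentence_open = True
        # First column is the token, last column is the NER tag.
        yield sentence_index, columns[0], columns[-1]


# Made-up sample in the CoNLL 2003 column layout (token, POS, chunk, NER).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

rows = list(parse_conll(sample))
print(rows[0])   # (0, 'EU', 'B-ORG')
print(rows[-1])  # (1, 'Blackburn', 'I-PER')
```

Rows of this shape can be passed to `pandas.DataFrame` and saved with `DataFrame.to_pickle`, which corresponds to the pickle output described above.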

### compile_germ_eval

Reads GermEval .tsv files from a directory and writes the
parsed data to a pandas DataFrame that is stored as a
pickle file.

#### Usage

```
compile_germ_eval --help
```

### compile_europeana_historic

Reads Europeana historic NER ground-truth .bio files from a directory
and writes the parsed data to a pandas
DataFrame that is stored as a pickle file.

#### Usage

```
compile_europeana_historic --help
```

### compile_wikiner

Reads WikiNER files from a directory and writes the parsed data
to a pandas DataFrame that is stored as a pickle file.

#### Usage

```
compile_wikiner --help
```

***
## Train BERT - NER model:

### bert-ner

Performs supervised training of BERT for NER, along with testing/cross-validation.

#### Usage

```
bert-ner --help
```

## BERT-Pre-training:

### collectcorpus

```
collectcorpus --help

Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE

Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
write it to one big text file.

FULLTEXT_FILE: The CSV or SQLITE3 file to read from.

SELECTION_FILE: Consider only a subset of all pages that is defined by the
DataFrame that is stored in .

CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.

Options:
--chunksize INTEGER Process the corpus in chunks of .
default:10**4

--processes INTEGER Number of parallel processes. default: 6
--min-line-len INTEGER Lower bound of line length in output file.
default:80

--help Show this message and exit.

```

### bert-pregenerate-trainingdata

Generates data for BERT pre-training from a corpus text file in which
the documents are separated by an empty line (the output of collectcorpus).

#### Usage

```
bert-pregenerate-trainingdata --help
```
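
The expected input layout (documents separated by an empty line) can be illustrated with a short Python sketch. It only demonstrates the file format and is not taken from the tool itself; `iter_documents` is a hypothetical helper and the sample text is made up.

```python
def iter_documents(lines):
    """Yield each document as a list of its non-empty lines."""
    document = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            document.append(line)
        elif document:
            # An empty line closes the current document.
            yield document
            document = []
    if document:
        yield document


# Made-up corpus text with two documents.
corpus = "First line of document one.\nSecond line.\n\nOnly line of document two.\n"
documents = list(iter_documents(corpus.splitlines()))
print(len(documents))  # 2
```

Running a check like this over a corpus file before pre-generating training data is a cheap way to confirm that document boundaries were written as intended.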

### bert-finetune

Perform BERT pre-training on pre-generated data.

#### Usage

```
bert-finetune --help
```