https://github.com/derlin/swisstext-bert-lid

Package to finetune and use a BERT model for Swiss-German language identification.

Host: GitHub
URL: https://github.com/derlin/swisstext-bert-lid
Owner: derlin
Created: 2020-02-14T11:39:14.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-02-16T18:13:39.000Z (over 6 years ago)
Last Synced: 2025-11-10T07:33:13.670Z (9 months ago)
Language: Python
Size: 19.5 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Swiss-German LID using BERT

This repository lets you fine-tune a BERT model to perform the task of Language Identification.
The target task is to properly identify Swiss German.

The languages the model will be trained on are:

* `afr`: Afrikaans
* `deu`: German
* `gsw`: Swiss German
* `gsw_like`: a mix of Bavarian, Kölsch, Limburgan, Low German, Northern Frisian and Palatine German
* `ltz`: Luxembourgian
* `nld`: Dutch
* `other`: a mix of Catalan, Croatian, Danish, Esperanto, Estonian, Finnish, French, Irish, Galician,
Icelandic, Italian, Javanese, Konkani, Papiamento, Portuguese, Romanian, Slovenian, Spanish, Swahili and Swedish

The procedure:

1. ensure you have a pip version >= 15.0: `pip install --upgrade pip`
2. install this repo: `pip install .` (or `pip install -e .`, for editable mode). **DO NOT USE setup.py directly**;
3. get Swiss German sentences into a CSV file;
4. use the scripts in `training` to generate a model (see below);
5. set the generated model as a default in the module `bert_lid`, by copying the out directory to `bert_lid/models/default`;
(Note: if you didn't install the module in development mode, the model must be written to the location of the installed module);
6. you can now use `bert_lid.BertLid` and install it in other environments.

## Training a model

**Important notice:** we provide everything needed to train the model, **except the Swiss German data**.
It is your task to generate one CSV file containing Swiss German sentences in a column named `text`.
Tip: you can access Swiss German sentences from the Leipzig Corpora Collection.
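
As an illustration of the expected input, here is a minimal sketch that writes a list of sentences to a CSV file with a single `text` column. The helper name `write_text_csv` and the output file name are assumptions made for this example; only the `text` column name comes from the requirement above.

```python
import csv


def write_text_csv(sentences, path):
    """Write one sentence per row under a single `text` header.

    Hypothetical helper, not part of this repository.
    Returns the number of sentences written.
    """
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # the column name the training scripts expect
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence:  # skip empty lines
                writer.writerow([sentence])
                written += 1
    return written
```

For example, `write_text_csv(['Das isch sone seich'], 'swiss-german-sentences.csv')` produces a file that can be passed to the `--gsw` option. The `csv` module takes care of quoting sentences that contain commas or quotes.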

Once you have a Swiss German CSV file ready, the only thing left to do is to run the scripts in the `training` folder in order.

```bash
# ensure you launch the scripts from the training directory!
cd training

./1_download-data.py
./2_prepare-data.py --gsw path/to/swiss-german-sentences.csv
./3_split-data.py
./4_finetune-bert.sh # <= long-running (>20 minutes); best run inside a screen session
./4_eval-model.py
```

At this point, you should have a model saved in `training/out`. The only thing left to do is to make it the default model
by copying it to `bert_lid/models/default` (the actual location depends on whether you did a regular or a development install):

```bash
bert_lid_install_model -i training/out
```
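
If you want to check where the default model is expected to live for your installation, the following sketch may help. It assumes the default model sits in a `models/default` directory next to the package's `__init__.py`, as described above; the function name `default_model_dir` is hypothetical and not part of the package.

```python
import importlib.util
from pathlib import Path


def default_model_dir(package="bert_lid"):
    """Return the expected `models/default` directory of an installed package.

    Hypothetical helper: returns None if the package cannot be located.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py
    return Path(spec.origin).parent / "models" / "default"


print(default_model_dir())
```

With a development install (`pip install -e .`) the printed path should point into your checkout; with a regular install it should point into `site-packages`.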

## Inference

As long as you have a model somewhere (or installed one as the default using script 6 in `training`), this is straightforward:

```python
>>> from bert_lid import BertLid
>>> lid = BertLid()
>>> lid.predict(['Das isch sone seich'])
(['gsw'], [99.83619689941406])
>>> lid.predict(['Trop top ce module, il marche bien et est bien documenté!'], mode='row')
[('Trop top ce module, il marche bien et est bien documenté!', 'other', 99.75711822509766)]
```
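
A common follow-up is to keep only the sentences identified as Swiss German with high confidence. The sketch below assumes that `predict` returns two parallel lists (labels and scores in percent), as in the first example above; the helper `filter_by_label` and the `99.0` threshold are illustrative assumptions. To stay runnable without a trained model, it reuses the values shown above instead of calling `lid.predict`.

```python
def filter_by_label(sentences, labels, scores, target="gsw", min_score=99.0):
    """Keep (sentence, score) pairs predicted as `target` with enough confidence.

    Hypothetical helper; `labels` and `scores` are assumed to be the two lists
    returned by `BertLid.predict` for the same `sentences`.
    """
    return [
        (sentence, score)
        for sentence, label, score in zip(sentences, labels, scores)
        if label == target and score >= min_score
    ]


# Values copied from the examples above, standing in for a real call:
#   labels, scores = lid.predict(sentences)
sentences = [
    "Das isch sone seich",
    "Trop top ce module, il marche bien et est bien documenté!",
]
labels = ["gsw", "other"]
scores = [99.83619689941406, 99.75711822509766]

print(filter_by_label(sentences, labels, scores))
# [('Das isch sone seich', 99.83619689941406)]
```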
