Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hallerpatrick/gerpt
German Generative Pre-Trained Transformer Model
- Host: GitHub
- URL: https://github.com/hallerpatrick/gerpt
- Owner: HallerPatrick
- Created: 2022-04-27T15:42:46.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-05T14:29:09.000Z (over 1 year ago)
- Last Synced: 2024-12-08T18:04:55.478Z (17 days ago)
- Language: Python
- Size: 6.18 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# GERPT - Training a German Generative Transformer Model using N-Gram Multihot Encodings
Experiments for my thesis 🤗
## Setup
Install necessary dependencies:
```
pip install -r requirements.txt
```

### Optional
Compile the CUDA extensions for the N-Gram Multihot approach:
```
cd cpp
python setup.py install
```

To run the training on GPUs, please install `pytorch` with CUDA support.
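For reference, a build script for such a PyTorch CUDA extension typically follows the standard `torch.utils.cpp_extension` flow. The sketch below is illustrative only; the package name and source file names are assumptions, not the repository's actual layout:

```
# Sketch of a PyTorch CUDA extension build script (illustrative names).
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="ngram_multihot_cuda",              # hypothetical package name
    ext_modules=[
        CUDAExtension(
            name="ngram_multihot_cuda",      # module importable from Python
            sources=[
                "ngram_multihot.cpp",        # C++ bindings (illustrative)
                "ngram_multihot_kernel.cu",  # CUDA kernels (illustrative)
            ],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```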
The following tasks can all be run with: `tools/run_all.sh`
## Pre-Training
### Pre-Process
The preprocessing script builds the vocabulary and the tokenized dataset.
The easiest way is to use the training config, with the keys `data` for the dataset, and `saved_dict` and `saved_data` for the output files of the dictionary and the tokenized dataset respectively.

*NOTE:* The `data` setting can be a Hugging Face dataset or a local one prefixed with `"text/"`
```
python preprocess.py --config configs/base.yaml
```
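A minimal sketch of the relevant keys in such a config (the key names come from the description above; all values and paths are illustrative, see `configs/base.yaml` for the actual options):

```
# Illustrative preprocessing keys; the real schema lives in configs/base.yaml
data: "text/corpus.txt"          # Hugging Face dataset name, or a local file prefixed with "text/"
saved_dict: "out/dictionary.pt"  # where the built vocabulary is written (illustrative path)
saved_data: "out/dataset.pt"     # where the tokenized dataset is written (illustrative path)
```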
### Training

The training script will train either a standard implementation of an LSTM or a Transformer model, with the N-Gram Multihot approach.

All parameters can be defined in a YAML configuration file. See `configs/base.yaml` for the possible options, or run `python train.py --help`.

```
python train.py --config configs/base.yaml
```

Parameters can also be set on the command line and will overwrite the YAML configs.
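For example, a single value from the YAML file can be overridden directly on the command line (the `--lr` flag here is hypothetical; run `python train.py --help` for the real parameter names):

```
# --lr is a hypothetical override flag used only for illustration
python train.py --config configs/base.yaml --lr 0.001
```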
## Downstream Evaluation
For downstream evaluation we use the `flair` library. Different downstream tasks can be declared in another YAML configuration file (see `configs/flair_base.yaml`). If a task's `use` setting is `True`, training for that task is started. Multiple training tasks can be declared.
```
python train_ds.py --config configs/flair_base.yaml
```
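A rough sketch of how a task entry in such a config could look (the task name and extra fields are illustrative assumptions; only the `use` flag is described above, and the actual schema is in `configs/flair_base.yaml`):

```
# Illustrative downstream-task entry; the real schema is in configs/flair_base.yaml
ner:                # hypothetical task name
  use: True         # training for this task runs only if use is True
  # further task-specific settings would go here
```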
## Troubleshooting

* DeepSpeed tries to access some temporary folders for CUDA extensions that the user may not have permission for. Export `TORCH_EXTENSIONS_DIR` to point to a new location.
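For example, pointing it to a writable location:

```
export TORCH_EXTENSIONS_DIR=/path/with/write/access
```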
Example: character unigrams and bigrams of "hello world" (spaces shown as `␣`):

* Unigrams: `h, e, l, l, o, ␣, w, o, r, l, d`
* Bigrams: `h, he, el, ll, lo, o␣, ␣w, wo, or, rl, ld`
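A minimal sketch of how such character n-grams could be collected and turned into a multi-hot vector per position. This illustrates the general idea of the N-Gram Multihot encoding; it is not the repository's actual implementation:

```
# Illustration of character n-gram multi-hot encoding (not the repo's actual code).
import torch

def char_ngrams(text: str, n: int) -> list[str]:
    """Return the n-gram ending at each character position (shorter at the start)."""
    return [text[max(0, i - n + 1) : i + 1] for i in range(len(text))]

text = "hello world"
vocab = sorted(set(char_ngrams(text, 1) + char_ngrams(text, 2)))
index = {gram: i for i, gram in enumerate(vocab)}

# One multi-hot vector per position: the unigram and bigram ending at that character.
multihot = torch.zeros(len(text), len(vocab))
for pos in range(len(text)):
    for n in (1, 2):
        gram = text[max(0, pos - n + 1) : pos + 1]
        multihot[pos, index[gram]] = 1.0

print(multihot.shape)  # (11, vocabulary size)
```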