Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hallerpatrick/gerpt
German Generative Pre-Trained Transformer Model
- Host: GitHub
- URL: https://github.com/hallerpatrick/gerpt
- Owner: HallerPatrick
- Created: 2022-04-27T15:42:46.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-05T14:29:09.000Z (over 1 year ago)
- Last Synced: 2024-12-08T18:04:55.478Z (17 days ago)
- Language: Python
- Size: 6.18 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# GERPT - Training a German Generative Transformer Model using N-Gram Multihot Encodings
Experiments for my thesis 🤗
## Setup
Install necessary dependencies:
```
pip install -r requirements.txt
```

### Optional
Compile the CUDA extensions for the N-Gram Multihot approach:
```
cd cpp
python setup.py install
```

To run the training on GPUs, please install `pytorch` with CUDA support.
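For reference, a build script for such a PyTorch CUDA extension typically follows the standard `torch.utils.cpp_extension` flow. The sketch below is illustrative only; the package name and source file names are assumptions, not the repository's actual layout:

```
# Sketch of a PyTorch CUDA extension build script (illustrative names).
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="ngram_multihot_cuda",              # hypothetical package name
    ext_modules=[
        CUDAExtension(
            name="ngram_multihot_cuda",      # module importable from Python
            sources=[
                "ngram_multihot.cpp",        # C++ bindings (illustrative)
                "ngram_multihot_kernel.cu",  # CUDA kernels (illustrative)
            ],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```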
The following tasks can all be run with: `tools/run_all.sh`
## Pre-Training
### Pre-Process
The preprocessing script builds the vocabulary and the tokenized dataset.
The easiest way is to use the training config, with the keys `data` for the dataset, and `saved_dict` and `saved_data` for the output files of the dictionary and the tokenized dataset respectively.

*NOTE:* The `data` setting can be a Hugging Face dataset or a local one prefixed with `"text/"`
```
python preprocess.py --config configs/base.yaml
```
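A minimal sketch of the relevant keys in such a config (the key names come from the description above; all values and paths are illustrative, see `configs/base.yaml` for the actual options):

```
# Illustrative preprocessing keys; the real schema lives in configs/base.yaml
data: "text/corpus.txt"          # Hugging Face dataset name, or a local file prefixed with "text/"
saved_dict: "out/dictionary.pt"  # where the built vocabulary is written (illustrative path)
saved_data: "out/dataset.pt"     # where the tokenized dataset is written (illustrative path)
```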
### Training

The training script will train either a standard implementation of an LSTM or a Transformer model, with the N-Gram Multihot approach.

All parameters can be defined in a YAML configuration file. See `configs/base.yaml` for the possible options, or run `python train.py --help`.

```
python train.py --config configs/base.yaml
```

Parameters can also be set on the command line and will overwrite the YAML configs.
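For example, a single value from the YAML file can be overridden directly on the command line (the `--lr` flag here is hypothetical; run `python train.py --help` for the real parameter names):

```
# --lr is a hypothetical override flag used only for illustration
python train.py --config configs/base.yaml --lr 0.001
```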
## Downstream Evaluation
For downstream evaluation we use the `flair` library. Different downstream tasks can be declared in another YAML configuration file (see `configs/flair_base.yaml`). If a task's `use` setting is `True`, training for that task is started. Multiple training tasks can be declared.
```
python train_ds.py --config configs/flair_base.yaml
```
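A rough sketch of how a task entry in such a config could look (the task name and extra fields are illustrative assumptions; only the `use` flag is described above, and the actual schema is in `configs/flair_base.yaml`):

```
# Illustrative downstream-task entry; the real schema is in configs/flair_base.yaml
ner:                # hypothetical task name
  use: True         # training for this task runs only if use is True
  # further task-specific settings would go here
```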
## Troubleshooting

* DeepSpeed tries to access some temporary folders for CUDA extensions that the user may not have permission for. Export `TORCH_EXTENSIONS_DIR` to point to a new location.
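For example, pointing it to a writable location:

```
export TORCH_EXTENSIONS_DIR=/path/with/write/access
```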
Example: character unigrams and bigrams of "hello world" (spaces shown as `␣`):

* Unigrams: `h, e, l, l, o, ␣, w, o, r, l, d`
* Bigrams: `h, he, el, ll, lo, o␣, ␣w, wo, or, rl, ld`
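A minimal sketch of how such character n-grams could be collected and turned into a multi-hot vector per position. This illustrates the general idea of the N-Gram Multihot encoding; it is not the repository's actual implementation:

```
# Illustration of character n-gram multi-hot encoding (not the repo's actual code).
import torch

def char_ngrams(text: str, n: int) -> list[str]:
    """Return the n-gram ending at each character position (shorter at the start)."""
    return [text[max(0, i - n + 1) : i + 1] for i in range(len(text))]

text = "hello world"
vocab = sorted(set(char_ngrams(text, 1) + char_ngrams(text, 2)))
index = {gram: i for i, gram in enumerate(vocab)}

# One multi-hot vector per position: the unigram and bigram ending at that character.
multihot = torch.zeros(len(text), len(vocab))
for pos in range(len(text)):
    for n in (1, 2):
        gram = text[max(0, pos - n + 1) : pos + 1]
        multihot[pos, index[gram]] = 1.0

print(multihot.shape)  # (11, vocabulary size)
```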