https://github.com/mideind/greynirt2t
Machine Translation between Icelandic and English using Tensor2Tensor
- Host: GitHub
- URL: https://github.com/mideind/greynirt2t
- Owner: mideind
- License: mit
- Created: 2020-09-24T02:15:30.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-09-29T13:57:32.000Z (over 5 years ago)
- Last Synced: 2025-06-27T08:43:12.054Z (12 months ago)
- Language: Python
- Size: 82 KB
- Stars: 1
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# GreynirT2T
Machine Translation between Icelandic and English using [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor).
Most of the provided scripts assume they are running inside a Docker container, although they should run outside containers as well.
Many of the paths are hard-coded and need to be adapted to your environment (see e.g. `greynirt2t/translate_enis.py`).
## Data ##
Bilingual parallel data and monolingual Icelandic data can be downloaded from [CLARIN](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/16 "CLARIN").
Note that a license must be accepted before downloading, and that the OpenSubtitles2018 subcorpus must be downloaded separately and cleaned with a provided script.
### Data preparation ###
Cleaning the data before training is optional. To do so, use the cleaning and preprocessing script `filters.py` (see the script itself for details).
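As an illustration of what cleaning parallel data typically involves, here is a minimal sketch. It is not the logic of `filters.py`: the function name `clean_pairs`, the thresholds, and the specific checks (empty lines, overlong sentences, length ratio, exact duplicates) are assumptions chosen for the example.

```python
def clean_pairs(pairs, max_ratio=3.0, max_len=200):
    """Yield (source, target) pairs that pass a few simple sanity checks.

    Hypothetical example: the checks and default thresholds are
    illustrative and are not taken from filters.py.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is very long.
        if max(n_src, n_tgt) > max_len:
            continue
        # Drop pairs whose lengths differ too much (likely misaligned).
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt


example = [
    ("Halló heimur", "Hello world"),
    ("", "Empty source"),
    ("Halló heimur", "Hello world"),
    ("Já", "Yes indeed this is far too long a translation"),
]
cleaned = list(clean_pairs(example))
```

With these inputs only the first pair survives: the second has an empty source, the third is a duplicate, and the fourth exceeds the length ratio.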
### Vocabulary ###
To run the pre-trained models, you must use the same vocabulary that was used at training time.
## Training ##
Before training can begin, the training data must be binarized and a vocabulary must be generated (if one does not already exist).
Note that the batch size must be tuned to your GPU by trial and error, since it depends on available VRAM, model size, and maximum sequence length.
If you want a larger batch size than fits on your GPU, it can be simulated by accumulating gradients over several steps (with `multistep_adam`, see `scripts/env.sh`).
The training script can be found at `greynirt2t/scripts/train.sh`.
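The arithmetic behind simulating a larger batch can be sketched as follows. This is a small helper under stated assumptions, not part of the repository: the function names are hypothetical, and the numbers are examples only. In Tensor2Tensor the batch size for text problems is usually counted in subword tokens, and the number of accumulation steps is, to our understanding, set through the `optimizer_multistep_accumulate_steps` hyperparameter; check `scripts/env.sh` for the values actually used.

```python
import math


def accumulation_steps(target_batch_size, per_step_batch_size, num_gpus=1):
    """Number of gradient-accumulation steps needed to reach at least
    target_batch_size, given what fits on the hardware in one step.

    Hypothetical helper for illustration; not part of GreynirT2T.
    """
    if target_batch_size <= 0 or per_step_batch_size <= 0 or num_gpus <= 0:
        raise ValueError("all arguments must be positive")
    per_update = per_step_batch_size * num_gpus
    return max(1, math.ceil(target_batch_size / per_update))


def effective_batch_size(per_step_batch_size, steps, num_gpus=1):
    """Batch size seen by each parameter update when gradients are
    accumulated over `steps` forward/backward passes."""
    return per_step_batch_size * steps * num_gpus


# Example: we want about 16000 tokens per update,
# but only 3000 tokens fit on a single GPU.
steps = accumulation_steps(16000, 3000)
effective = effective_batch_size(3000, steps)
print(steps, effective)
```

Because the step count is rounded up, the effective batch size can end up slightly larger than the target. Accumulating gradients trades wall-clock time for memory: each parameter update takes `steps` times as many forward and backward passes.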
## Inference / translation ##
To view model predictions, see `interactive_decode.sh` or `translate_file.sh` (or refer to the T2T repository for documentation).