Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pcyin/pytorch_basic_nmt
A simple yet strong implementation of neural machine translation in pytorch
- Host: GitHub
- URL: https://github.com/pcyin/pytorch_basic_nmt
- Owner: pcyin
- Created: 2018-09-14T21:14:15.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-02-22T22:37:20.000Z (over 3 years ago)
- Last Synced: 2024-08-01T16:35:31.570Z (3 months ago)
- Topics: deep-learning, natural-language-processing, neural-machine-translation, pytorch, pytorch-implmention
- Language: Python
- Homepage:
- Size: 16.5 MB
- Stars: 87
- Watchers: 7
- Forks: 24
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
Awesome Lists containing this project
README
## A Basic PyTorch Implementation of Attentional Neural Machine Translation
This is a basic implementation of attentional neural machine translation (Bahdanau et al., 2015; Luong et al., 2015) in PyTorch.
It implements the model described in [Luong et al., 2015](https://arxiv.org/abs/1508.04025), and supports label smoothing, beam-search decoding and random sampling.
With a 256-dimensional LSTM hidden size, it achieves a BLEU score of 28.13 on the IWSLT 2014 German-English dataset (Ranzato et al., 2015).

This codebase is used for instructional purposes in Stanford's [CS224N Natural Language Processing with Deep Learning](http://web.stanford.edu/class/cs224n/) and CMU's [11-731 Machine Translation and Sequence-to-Sequence Models](http://www.phontron.com/class/mtandseq2seq2018/).
### File Structure
* `nmt.py`: contains the neural machine translation model and training/testing code.
* `vocab.py`: a script that extracts vocabulary from training data
* `util.py`: contains utility/helper functions

### Example Dataset
We provide a preprocessed version of the IWSLT 2014 German-English translation task used in (Ranzato et al., 2015) [[script]](https://github.com/harvardnlp/BSO/blob/master/data_prep/MT/prepareData.sh). To download the dataset:
```bash
wget http://www.cs.cmu.edu/~pengchey/iwslt2014_ende.zip
unzip iwslt2014_ende.zip
```

Running the script will extract a `data/` folder which contains the IWSLT 2014 dataset.
The dataset has 150K German-English training sentences. The `data/` folder contains a copy of the public release of the dataset. Files with the suffix `*.wmixerprep` are pre-processed versions of the dataset from Ranzato et al., 2015, with long sentences chopped and rare words replaced by a special unknown-word token. You could use the pre-processed training files for training/development (or come up with your own pre-processing strategy), but for testing you have to use the **original** version of the test files, i.e., `test.de-en.(de|en)`.
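As a quick sanity check after extraction, you can count the parallel sentences; the file names below follow the naming described above and may differ slightly in your copy:

```bash
# Count training sentences; the source and target line counts should match
# and be roughly 150K (file names are assumptions based on the description above).
wc -l data/train.de-en.de data/train.de-en.en
```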
### Environment

The code is written in Python 3.6 using some supporting third-party libraries. We provide a conda environment file to install Python 3.6 with the required libraries. Simply run
```bash
conda env create -f environment.yml
```
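After the environment is created, activate it before running any of the scripts. The environment name below is only an assumption; use whatever name is defined in `environment.yml`:

```bash
# List available environments to find the name defined in environment.yml,
# then activate it (the name "pytorch_nmt" here is an assumption).
conda env list
conda activate pytorch_nmt
```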
### Usage

Each runnable script (`nmt.py`, `vocab.py`) is annotated using `docopt`.
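Because of the `docopt` annotation, invoking a script without its required arguments prints the usage string from the module docstring, which is a quick way to see the expected options:

```bash
# Print the docopt usage message for the vocabulary script
# (docopt exits with the usage string when required arguments are missing).
python vocab.py
```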
Please refer to the source file for complete usage.

First, we extract a vocabulary file from the training data using the command:
```bash
python vocab.py \
--train-src=data/train.de-en.de.wmixerprep \
--train-tgt=data/train.de-en.en.wmixerprep \
data/vocab.json
```

This generates a vocabulary file `data/vocab.json`.
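You can take a quick look at the generated file to make sure it is sensible; it is plain JSON, so any JSON pretty-printer works (the exact structure of its contents is not documented here):

```bash
# Pretty-print the first few lines of the generated vocabulary file.
python -m json.tool data/vocab.json | head -n 20
```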
The script also has options to control the cutoff frequency and the size of the generated vocabulary, which you may play with.
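For example, a run that sets both options explicitly might look like the sketch below. The flag names `--size` and `--freq-cutoff` are assumptions, so check the docopt docstring in `vocab.py` for the exact names and defaults:

```bash
# Hypothetical invocation: cap the vocabulary size and drop infrequent words
# (the --size and --freq-cutoff flag names are assumptions; see vocab.py's docstring).
python vocab.py \
    --train-src=data/train.de-en.de.wmixerprep \
    --train-tgt=data/train.de-en.en.wmixerprep \
    --size=50000 \
    --freq-cutoff=2 \
    data/vocab.json
```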
To start training and evaluation, simply run `scripts/train.sh`.

After training and decoding, we call the official evaluation script `multi-bleu.perl` to compute the corpus-level BLEU score of the decoding results against the gold standard.
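If you want to run the evaluation step by hand, `multi-bleu.perl` follows the standard Moses convention of reading hypotheses from stdin and taking the reference file as an argument; the hypothesis file name below is only an illustration:

```bash
# Corpus-level BLEU of the decoded hypotheses against the original test reference
# (decode_output.txt is a placeholder for wherever your decoding results are written).
perl multi-bleu.perl data/test.de-en.en < decode_output.txt
```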
### License

This work is licensed under a Creative Commons Attribution 4.0 International License.