{"id":17686987,"url":"https://github.com/posenhuang/npmt","last_synced_at":"2025-04-14T09:41:04.176Z","repository":{"id":82901751,"uuid":"119445328","full_name":"posenhuang/NPMT","owner":"posenhuang","description":"Towards Neural Phrase-based Machine Translation","archived":false,"fork":false,"pushed_at":"2018-07-11T23:20:47.000Z","size":2596,"stargazers_count":178,"open_issues_count":0,"forks_count":29,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-03-27T23:01:40.749Z","etag":null,"topics":["fairseq","lua","machine-translation","neural-machine-translation","npmt","sequence-to-sequence","swan","torch"],"latest_commit_sha":null,"homepage":"","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/posenhuang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-29T21:38:48.000Z","updated_at":"2024-04-03T11:02:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"32b55c32-0a7c-4e92-b13a-6e1b0e3ea291","html_url":"https://github.com/posenhuang/NPMT","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posenhuang%2FNPMT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posenhuang%2FNPMT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posenhuang%2FNPMT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/posenhuang%2FNPMT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/posenhuang","download_url":"https://codeload.github.com/posenhuang/NPMT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248855700,"owners_count":21172628,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fairseq","lua","machine-translation","neural-machine-translation","npmt","sequence-to-sequence","swan","torch"],"created_at":"2024-10-24T10:46:37.124Z","updated_at":"2025-04-14T09:41:04.141Z","avatar_url":"https://github.com/posenhuang.png","language":"Lua","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Introduction\nThis is NPMT, the source codes of [Towards Neural Phrase-based Machine Translation](https://arxiv.org/abs/1706.05565) and [Sequence Modeling via Segmentations](https://arxiv.org/abs/1702.07463) from Microsoft Research.\nIt is built on top of the [fairseq toolkit](https://github.com/facebookresearch/fairseq) in [Torch](http://torch.ch/).\nWe present the setup and Neural Machine Translation (NMT) experiments in [Towards Neural Phrase-based Machine Translation](https://arxiv.org/abs/1706.05565).\n\n## NPMT \nNeural Phrase-based Machine Translation (NPMT) explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. \nTo mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. \nDifferent from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. 
\nInstead, it directly outputs phrases in a sequential order and can decode in linear time. \n\nModel architecture\n![Example](npmt.png)\n\nAn illustration of using NPMT in German-English translation\n![Example](de-en_example.png)\n\n\nPlease refer to the [PR](https://github.com/posenhuang/NPMT/pull/1) for our implementation. Our implementation is based on the [latest version](https://github.com/posenhuang/NPMT/commit/7d017f0a46a3cddfc420a4778d9541ba38b6a43d) of fairseq.  \n\n\n# Citation\n\nIf you use the code in your paper, then please cite it as:\n\n```\n@article{pshuang2018NPMT,\n  author    = {Po{-}Sen Huang and\n               Chong Wang and\n               Sitao Huang and\n               Dengyong Zhou and\n               Li Deng},\n  title     = {Towards Neural Phrase-based Machine Translation},\n  journal   = {CoRR},\n  volume    = {abs/1706.05565},\n  year      = {2017},\n  url       = {http://arxiv.org/abs/1706.05565},\n  archivePrefix = {arXiv},\n  eprint    = {1706.05565},\n}\n```\n\nand\n\n```\n@inproceedings{wang2017SWAN,\n  author    = {Chong Wang and\n               Yining Wang and\n               Po{-}Sen Huang and\n               Abdelrahman Mohamed and\n               Dengyong Zhou and\n               Li Deng},\n  title     = {Sequence Modeling via Segmentations},\n  booktitle = {Proceedings of the 34th International Conference on Machine Learning,\n               {ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},\n  pages     = {3674--3683},\n  year      = {2017},\n}\n```\n\n# Requirements and Installation\n* A computer running macOS or Linux\n* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)\n* A [Torch installation](http://torch.ch/docs/getting-started.html). For maximum speed, we recommend using LuaJIT and [Intel MKL](https://software.intel.com/en-us/intel-mkl).\n* A recent version of [nn](https://github.com/torch/nn). The minimum required version is from May 5th, 2017. 
A simple `luarocks install nn` is sufficient to update your locally installed version.\n\nInstall fairseq by cloning the GitHub repository and running\n```\nluarocks make rocks/fairseq-scm-1.rockspec\n```\nLuaRocks will fetch and build any additional dependencies that may be missing.\nIn order to install the CPU-only version (which is only useful for translating new data with an existing model), do\n```\nluarocks make rocks/fairseq-cpu-scm-1.rockspec\n```\n\nThe LuaRocks installation provides a command-line tool that includes the following functionality:\n* `fairseq preprocess`: Data pre-processing: build vocabularies and binarize training data\n* `fairseq train`: Train a new model on one or multiple GPUs\n* `fairseq generate`: Translate pre-processed data with a trained model\n* `fairseq generate-lines`: Translate raw text with a trained model\n* `fairseq score`: BLEU scoring of generated translations against reference translations\n* `fairseq tofloat`: Convert a trained model to a CPU model\n* `fairseq optimize-fconv`: Optimize a fully convolutional model for generation. 
This can also be achieved by passing the `-fconvfast` flag to the generation scripts.\n\n# Quick Start\n\n## Training a New Model\n\n### Data Pre-processing\nThe fairseq source distribution contains an example pre-processing script for\nthe IWSLT14 German-English corpus.\nPre-process and binarize the data as follows:\n```\n$ cd data/\n$ bash prepare-iwslt14.sh\n$ cd ..\n$ TEXT=data/iwslt14.tokenized.de-en\n$ fairseq preprocess -sourcelang de -targetlang en \\\n  -trainpref $TEXT/train -validpref $TEXT/valid -testpref $TEXT/test \\\n  -thresholdsrc 3 -thresholdtgt 3 -destdir data-bin/iwslt14.tokenized.de-en\n```\nThis will write binarized data that can be used for model training to data-bin/iwslt14.tokenized.de-en.\n\nWe also provide an example pre-processing script for the IWSLT15 English-Vietnamese corpus.\nPre-process and binarize the data as follows:\n```\n$ cd data/\n$ bash prepare-iwslt15.sh\n$ cd ..\n$ TEXT=data/iwslt15\n$ fairseq preprocess -sourcelang en -targetlang vi \\\n -trainpref $TEXT/train -validpref $TEXT/tst2012 -testpref $TEXT/tst2013 \\\n -thresholdsrc 5 -thresholdtgt 5 -destdir data-bin/iwslt15.tokenized.en-vi\n```\n\n### Training\nUse `fairseq train` to train a new model.\nHere are a few example settings that work well for the IWSLT14 and IWSLT15 datasets:\n```\n# NPMT model (IWSLT DE-EN)\n$ mkdir -p trainings/iwslt_de_en\n$ fairseq train -sourcelang de -targetlang en -datadir data-bin/iwslt14.tokenized.de-en \\\n  -model npmt -nhid 256 -dec_unit_size 512 -dropout .5 -dropout_hid 0 -npmt_dropout .5 \\\n  -optim adam -lr 0.001 -batchsize 32 -log_interval 100 -nlayer 2 -nenclayer 2 -kwidth 7 \\\n  -max_segment_len 6 -rnn_mode GRU -group_size 500 -use_resnet_enc -use_resnet_dec -log \\\n  -momentum 0.99 -clip 10 -maxbatch 600 -bptt 0 -maxepoch 100 -ndatathreads 4 -seed 1002 \\\n  -maxsourcelen 75 -num_lower_win_layers 1 -save_interval 250 -use_accel -noearlystop \\\n  -validbleu -lrshrink 1.25 -minepochtoanneal 18 -annealing_type slow \\\n  -savedir 
trainings/iwslt_de_en\n\n# NPMT model (IWSLT EN-DE)\n$ mkdir -p trainings/iwslt_en_de\n$ fairseq train -sourcelang en -targetlang de -datadir data-bin/iwslt14.tokenized.en-de \\\n  -model npmt -nhid 256 -dec_unit_size 512 -dropout .5 -dropout_hid 0 -npmt_dropout .5 \\\n  -optim adam -lr 0.001 -batchsize 32 -log_interval 100 -nlayer 2 -nenclayer 2 -kwidth 7 \\\n  -max_segment_len 6 -rnn_mode GRU -group_size 500 -use_resnet_enc -use_resnet_dec \\\n  -log -momentum 0.99 -clip 10 -maxbatch 800 -bptt 0 -maxepoch 100 -ndatathreads 4 \\\n  -seed 1002 -maxsourcelen 75 -num_lower_win_layers 1 -save_interval 250 -use_accel \\\n  -noearlystop -validbleu -lrshrink 1.25 -minepochtoanneal 15 \\\n  -annealing_type slow -savedir trainings/iwslt_en_de\n  \n# NPMT model (IWSLT EN-VI)\n$ mkdir -p trainings/iwslt_en_vi\n$ fairseq train -sourcelang en -targetlang vi -datadir data-bin/iwslt15.tokenized.en-vi \\\n  -model npmt -nhid 512 -dec_unit_size 512 -dropout .4 -dropout_hid 0 -npmt_dropout .4 \\\n  -optim adam -lr 0.001 -batchsize 48 -log_interval 100 -nlayer 3 -nenclayer 2 -kwidth 7 \\\n  -max_segment_len 7 -rnn_mode LSTM   -group_size 800 -use_resnet_enc -use_resnet_dec -log \\\n  -momentum 0.99 -clip 500 -maxbatch 800 -bptt 0 -maxepoch 50 -ndatathreads 4 -seed 1002 \\\n  -maxsourcelen 75 -num_lower_win_layers 1 -save_interval 250 -use_accel -noearlystop \\\n  -validbleu -nembed 512 -lrshrink 1.25 -minepochtoanneal 8 -annealing_type slow \\\n  -savedir trainings/iwslt_en_vi\n```\n\n\nBy default, `fairseq train` will use all available GPUs on your machine.\nUse the [CUDA_VISIBLE_DEVICES](http://acceleware.com/blog/cudavisibledevices-masking-gpus) environment variable to select specific GPUs or `-ngpus` to change the number of GPU devices that will be used.\n\n### Generation\nOnce your model is trained, you can translate with it using `fairseq generate` (for binarized data) or `fairseq generate-lines` (for text).\nHere, we'll do it for a NPMT model:\n```\n\n# Translate some text\n$ 
DATA=data-bin/iwslt14.tokenized.de-en\n$ fairseq generate-lines -sourcedict $DATA/dict.de.th7 -targetdict $DATA/dict.en.th7 \\\n  -path trainings/iwslt_de_en/model_bestbleu.th7 -beam 1 -model npmt\n| [target] Dictionary: 22823 types\n| [source] Dictionary: 32010 types\n\u003e danke , aber das beste kommt noch .\nmax decoding:   | 1:184 1:15| 2:4| 3:28| 4:6 4:282| 6:16 6:201 6:311| 8:5|\navg. phrase size 1.666667\nS       danke , aber das beste kommt noch . \u003cpad\u003e\nO       danke , aber das beste kommt noch .\nH       -0.10934638977051       thank you , but the best is still coming .\nA       1\n\n```\nwhere the ``max decoding`` suggests the output segments are ``| thank you | , | but | the best | is still coming | . |``, and ``avg. phrase size`` represents the average phrase length ``10/6 = 1.666667``.\n\n\nGeneration with the binarized test sets can be run as follows (not in batched mode), e.g. for German-English:\n```\n\n$ fairseq generate -sourcelang de -targetlang en -datadir data-bin/iwslt14.tokenized.de-en \\\n  -path trainings/iwslt_de_en/model_bestbleu.th7 -beam 10 -lenpen 1 -dataset test -model npmt | tee /tmp/gen.out\n...\n| Translated 6750 sentences (137891 tokens) in 3013.7s (45.75 tokens/s)\n| Timings: setup 10.7s (0.4%), encoder 28.2s (0.9%), decoder 2747.9s (91.2%), search_results 0.0s (0.0%), search_prune 0.0s (0.0%)\n| BLEU4 = 29.92, 64.7/37.9/23.8/15.3 (BP=0.973, ratio=1.027, sys_len=127660, ref_len=131141)\n\n# Word-level BLEU scoring:\n$ grep ^H /tmp/gen.out | cut -f3- | sed 's/@@ //g' \u003e /tmp/gen.out.sys\n$ grep ^T /tmp/gen.out | cut -f2- | sed 's/@@ //g' \u003e /tmp/gen.out.ref\n$ fairseq score -sys /tmp/gen.out.sys -ref /tmp/gen.out.ref\nBLEU4 = 29.92, 64.7/37.9/23.8/15.3 (BP=0.973, ratio=1.027, sys_len=127660, ref_len=131141)\n\n```\n\n\n# License\nfairseq is BSD-licensed. 
The released code that modifies the original fairseq is also BSD-licensed.\nThe rest of the code is MIT-licensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposenhuang%2Fnpmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fposenhuang%2Fnpmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fposenhuang%2Fnpmt/lists"}