{"id":13754407,"url":"https://github.com/Jingjing-NLP/VOLT","last_synced_at":"2025-05-09T22:32:22.004Z","repository":{"id":38401075,"uuid":"382539842","full_name":"Jingjing-NLP/VOLT","owner":"Jingjing-NLP","description":"Code for paper \"Vocabulary Learning via Optimal Transport for Neural Machine Translation\"","archived":false,"fork":false,"pushed_at":"2022-02-02T12:21:55.000Z","size":20684,"stargazers_count":443,"open_issues_count":17,"forks_count":46,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-20T11:39:15.002Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Jingjing-NLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-03T06:09:25.000Z","updated_at":"2025-03-18T17:22:53.000Z","dependencies_parsed_at":"2022-08-18T13:20:45.162Z","dependency_job_id":null,"html_url":"https://github.com/Jingjing-NLP/VOLT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jingjing-NLP%2FVOLT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jingjing-NLP%2FVOLT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jingjing-NLP%2FVOLT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jingjing-NLP%2FVOLT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Jingjing-NLP","download_url":"https://codeload.github.com/Jingjing-NLP/VOLT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github
.com","kind":"github","repositories_count":253336054,"owners_count":21892781,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:58.499Z","updated_at":"2025-05-09T22:32:16.981Z","avatar_url":"https://github.com/Jingjing-NLP.png","language":"Python","funding_links":[],"categories":["其他_NLP自然语言处理"],"sub_categories":["其他_文本生成、文本对话"],"readme":"**The codebase and data are still being uploaded.**\n\n\nVOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.  \nTo help readers understand our work better, I have written a [blog post](https://jingjing-nlp.github.io/volt-blog/) about it. \n\n\n### What's New:\n* July 2021: Support vocabulary learning for classification. \n* July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.  \n* July 2021: Support subword-nmt tokenization. \n* July 2021: Support sentencepiece tokenization.\n\n### What's Ongoing:\n* Support pip usage.\n\n\n### Features:\n\n* Efficient: CPU learning on one machine.\n* Easy-to-use: Supports the widely used tokenization toolkits subword-nmt and sentencepiece.   
\n  \n# Requirements and Installation\n\nThe required environment:\n* python 3\n* tqdm\n* mosesdecoder\n* subword-nmt\n* sentencepiece\n* POT (local POT)\n\n\n**To use VOLT** and develop locally:\n\n``` bash\ngit clone https://github.com/Jingjing-NLP/VOLT/\ncd VOLT\ngit clone https://github.com/moses-smt/mosesdecoder.git\ngit clone https://github.com/rsennrich/subword-nmt.git\npip3 install sentencepiece\npip3 install tqdm \ncd POT\npip3 install --editable ./ -i https://pypi.doubanio.com/simple --user\ncd ../\n```\n\n# Usage\n\n* The first step is to get vocabulary candidates based on tokenized texts. Note: the tokenized texts should be at the character level. Please do not use segmentation tools to segment your texts. The sub-word vocabulary can be generated by subword-nmt or sentencepiece. Here are two examples.\n  * This example shows how to learn a vocabulary for seq2seq tasks (including source data and target data). \n  ```\n  #Assume source_file is the file storing the texts of the source data\n  #Assume target_file is the file storing the texts of the target data\n  size=30000 # the size of BPE\n  cat source_file \u003e training_data\n  cat target_file \u003e\u003e training_data \n\n \n  #subword-nmt style:\n  mkdir bpeoutput\n  BPE_CODE=bpeoutput/code # the path to save vocabulary\n  python3 subword-nmt/learn_bpe.py -s $size  \u003c training_data \u003e $BPE_CODE\n  python3 subword-nmt/apply_bpe.py -c $BPE_CODE \u003c source_file \u003e bpeoutput/source.file\n  python3 subword-nmt/apply_bpe.py -c $BPE_CODE \u003c target_file \u003e bpeoutput/target.file \n\n  #sentencepiece style:\n  cd examples\n  mkdir spmout\n  python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe\n  #After this step, you will see spm.vocab and spm.model. 
\n  #Convert spm.vocab to a file where each line is split by a single space, for example \"abc 100\"\n  sed -i 's/\\t/ /g' spm.vocab\n  python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece\n  python3 spm/spm_encoder.py --model spm.model --inputs target_file --outputs spmout/target.file --output_format piece\n  ```\n  * This example shows how to get a vocabulary from a single file for non-seq2seq tasks.\n   ```\n  #Assume source_file is the file storing your data\n  size=30000 # the size of BPE\n\n  #subword-nmt style:\n  mkdir bpeoutput\n  BPE_CODE=bpeoutput/code # the path to save vocabulary\n  python3 subword-nmt/learn_bpe.py -s $size  \u003c source_file \u003e $BPE_CODE\n  python3 subword-nmt/apply_bpe.py -c $BPE_CODE \u003c source_file \u003e bpeoutput/source.file\n  \n\n  #sentencepiece style:\n  cd examples\n  mkdir spmout\n  python3 spm/spm_train.py --input=source_file --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe\n  #After this step, you will see spm.vocab and spm.model\n  python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece\n  ```\n  \n\n* The second step is to run the VOLT script. It accepts the following parameters:\n  * --source_file: the file storing source data for seq2seq tasks, or the file storing all raw texts for non-seq2seq tasks.\n  * --token_candidate_file: the file storing token candidates. Each line is split by a single space, for example \"abc 100\".\n  * --tokenizer: the toolkit used to get token candidates. Only two choices are supported: subword-nmt and sentencepiece. \n  * --size_file: the file to store the vocabulary size recommended by VOLT.\n  * --target_file: (optional) the file storing target data for seq2seq tasks. None by default.\n  * --max_number: (optional) the maximum size of the vocabulary generated by VOLT. 10,000 by default. 
\n  * --interval: (optional) the search granularity in VOLT. 1,000 by default. \n  * --loop_in_ot: (optional) the maximum number of iterations in the Sinkhorn solver. 500 by default.\n  * --threshold: (optional) the threshold that decides which tokens from the optimal matrix are added to the final vocabulary. A smaller threshold makes the final vocabulary closer to a BPE-style vocabulary. 1e-5 by default.\n  ```\n  #For seq2seq tasks with source file and target file, you can use the following commands:\n  #subword-nmt style\n  python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \\\n            --token_candidate_file $BPE_CODE \\\n            --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size \n  #sentencepiece style\n  python3 ../ot_run.py --source_file spmoutput/source.file --target_file spmoutput/target.file \\\n            --token_candidate_file spm.vocab \\\n            --vocab_file spmoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmoutput/size \n\n  #For non-seq2seq tasks with one source file, you can use the following commands:\n  #subword-nmt style\n  python3 ../ot_run.py --source_file bpeoutput/source.file \\\n            --token_candidate_file $BPE_CODE \\\n            --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size \n    \n  #sentencepiece style\n  BPE_CODE=spm.vocab\n  python3 ../ot_run.py --source_file spmoutput/source.file \\\n            --token_candidate_file spm.vocab  \\\n            --vocab_file spmoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmoutput/size \n\n  ```\n* The third step is to use the generated vocabulary to segment your texts:\n  \n  ```\n    #subword-nmt style\n    echo \"#version: 0.2\" \u003e bpeoutput/vocab.seg 
# add version info\n    cat bpeoutput/vocab \u003e\u003e bpeoutput/vocab.seg # append the contents of the generated vocabulary\n    python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab.seg \u003c source_file \u003e bpeoutput/source.file\n    python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab.seg \u003c target_file \u003e bpeoutput/target.file #optional if your task does not contain target texts\n\n    #sentencepiece style\n    #for the sentencepiece toolkit, we only keep the optimal size\n    best_size=$(cat spmoutput/size)\n    #training_data contains the source data and, if provided, the target data\n    python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe\n    python3 spm/spm_encoder.py --model spm.model --inputs source_file --outputs spmout/source.file --output_format piece\n    python3 spm/spm_encoder.py --model spm.model --inputs target_file --outputs spmout/target.file --output_format piece #optional if your task does not contain target texts\n  ```\n\n* The last step is to use the segmented texts for downstream tasks. You can use the repo [Fairseq](https://github.com/pytorch/fairseq) for training and evaluation. We also provide the training and evaluation code under \"examples/\". Note: to compare BLEU scores, you need to apply the \"remove-bpe\" operation to the generated texts. \n\n# Examples\n\nSeveral examples are provided under \"examples/\", including En-De translation, En-Fr translation, multilingual translation, and En-De translation without joint vocabularies. 
\n* En-De translation: run_ende.sh\n* En-De translation without joint vocabularies: run_ende_withoutjoint.sh\n* En-Fr translation:  run_enfr.sh \n* TED bilingual translation:   run_ted_bilingual.sh\n* TED bilingual translation with sentencepiece: run_ted_bilingual_senencepiece.sh\n* TED many-to-one translation: run_ted_multilingual.sh\n\n# Datasets\n\nThe WMT-14 En-De translation data can be downloaded via the run scripts.\n\nThe TED X-EN data can be downloaded from [X-EN](https://drive.google.com/drive/folders/1FNH7cXFYWWnUdH2LyUFFRYmaWYJJveKy?usp=sharing).\nThe TED EN-X data can be downloaded from [EN-X](https://drive.google.com/drive/u/1/folders/1du13KQG6JM9u1JLhnS47Pu4BQtfP2AK3).\n\n# Citation\n\nPlease cite as:\n\n``` bibtex\n@inproceedings{volt,\n  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},\n  author= {Jingjing Xu and\n               Hao Zhou and\n               Chun Gan and\n               Zaixiang Zheng and\n               Lei Li},\n  booktitle = {Proceedings of ACL 2021},\n  year = {2021},\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJingjing-NLP%2FVOLT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJingjing-NLP%2FVOLT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJingjing-NLP%2FVOLT/lists"}