{"id":18439567,"url":"https://github.com/idiap/han_nmt","last_synced_at":"2025-04-07T21:32:43.687Z","repository":{"id":34390152,"uuid":"149727575","full_name":"idiap/HAN_NMT","owner":"idiap","description":"Document-Level Neural Machine Translation with Hierarchical Attention Networks","archived":false,"fork":false,"pushed_at":"2022-05-09T01:08:02.000Z","size":3763,"stargazers_count":68,"open_issues_count":2,"forks_count":19,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-23T01:03:29.588Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/idiap.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-21T07:36:50.000Z","updated_at":"2024-07-08T01:08:19.000Z","dependencies_parsed_at":"2022-08-08T17:31:23.158Z","dependency_job_id":null,"html_url":"https://github.com/idiap/HAN_NMT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2FHAN_NMT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2FHAN_NMT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2FHAN_NMT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2FHAN_NMT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/idiap","download_url":"https://codeload.github.com/idiap/HAN_NMT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247732736,"o
wners_count":20986913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T06:25:27.934Z","updated_at":"2025-04-07T21:32:38.673Z","avatar_url":"https://github.com/idiap.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Description\n\nImplementation of the paper [\"Document-Level Neural Machine Translation with Hierarchical Attention Networks\"](http://www.aclweb.org/anthology/D18-1325). It is based on OpenNMT (v.2.1) https://github.com/OpenNMT/OpenNMT-py\n\nThis is a restricted version. It DOES NOT work for shards or multimodal translation.\n\n## Preprocess\nThe data, as for any NMT baseline, consists of a source file and a target file which are aligned at sentence level. However, the sentences should be in order for each document (i.e. not shuffled). Additionally, the model requires a file (doc_file) indicating the beginning of each document in the source file. Each line of the doc_file gives the line number in the source file where a new document starts. \n\nExample: \n\n\u003e\t0  \n\u003e\t10  \n\u003e\t25 \n\nThere are 3 documents. 
The first one from line 0 to line 9, the second from line 10 to 24, the third from line 25 to the end.\n\nCommand:\n```\npython preprocess.py -train_src [source_file] -train_tgt [target_file] -train_doc [doc_file] \n-valid_src [source_dev_file] -valid_tgt [target_dev_file] -valid_doc [doc_dev_file] -save_data [out_file]\n```\nThe folder preprocess_TED_zh-en contains the files to preprocess the TED Talks zh-en dataset from https://wit3.fbk.eu/mt.php?release=2015-01.\n\n## Training\nTraining the sentence-level NMT baseline:\n\n```\npython train.py -data [data_set] -save_model [sentence_level_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 4096 -start_decay_at 20 -report_every 500 -epochs 20 -gpuid 0 -max_generator_batches 16 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \n-train_part sentences\n```\n\nTraining HAN-encoder using the sentence-level NMT model:\n\n```\npython train.py -data [data_set] -save_model [HAN_enc_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \n-train_part all -context_type HAN_enc -context_size 3 -train_from [sentence_level_model]\n```\n\nTraining HAN-decoder using the sentence-level NMT model:\n\n```\npython train.py -data [data_set] -save_model [HAN_dec_model] -encoder_type transformer -decoder_type 
transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \n-train_part all -context_type HAN_dec -context_size 3 -train_from [sentence_level_model]\n```\n\nTraining HAN-joint using the HAN-encoder model:\n\n```\npython train.py -data [data_set] -save_model [HAN_joint_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \n-train_part all -context_type HAN_join -context_size 3 -train_from [HAN_enc_model]\n```\n\nInput options:\n\n- train_part:\t[sentences, context, all]   \n- context_type:\t[HAN_enc, HAN_dec, HAN_join, HAN_dec_source, HAN_dec_context]\n- context_size:\tnumber of previous sentences\n\nNOTE: The transformer model is sensitive to variation on hyperparameters. 
The HAN is also sensitive to the batch size.\n\n## Translation\nTranslation is done sentence by sentence, although this is not necessary for HAN_enc or the baseline (this could be improved).\n\n```\npython translate.py -model [model] -src [test_source_file] -doc [test_doc_file] \n-output [out_file] -translate_part all -batch_size 1000 -gpu 0\n```\nInput options:\n\n- translate_part: [sentences, all]\n- batch_size: maximum number of sentences to keep in memory at once.\n\n\n## Test files reported in the paper\nThe output files of the 5 reported systems: transformer NMT, cache NMT, HAN-decoder NMT, HAN-encoder NMT, HAN-encoder-decoder NMT.\n\u003e\t- sub_es-en: Opensubtitles \n\u003e\t- sub_zh-en: TV subtitles \n\u003e\t- TED_es-en: TED Talks WIT 2015\n\u003e\t- TED_zh-en: TED Talks WIT 2014\n\n\n## Reference:\n\u003eMiculicich, L., Ram, D., Pappas, N. \u0026 Henderson, J. Document-Level Neural Machine Translation with Hierarchical Attention Networks. EMNLP 2018. \nhttps://www.aclweb.org/anthology/D18-1325/\n \n## Contact:\nlmiculicich@idiap.ch\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fhan_nmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fidiap%2Fhan_nmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fhan_nmt/lists"}