{"id":41847108,"url":"https://github.com/camel-lab/seq2seq-transliteration-tool","last_synced_at":"2026-01-25T10:03:13.650Z","repository":{"id":131318374,"uuid":"308044271","full_name":"CAMeL-Lab/seq2seq-transliteration-tool","owner":"CAMeL-Lab","description":null,"archived":false,"fork":false,"pushed_at":"2020-10-28T14:39:10.000Z","size":49,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-09T22:06:33.758Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CAMeL-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2020-10-28T14:35:08.000Z","updated_at":"2022-03-25T15:28:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"9769111a-517e-4531-847e-4e9c2068c631","html_url":"https://github.com/CAMeL-Lab/seq2seq-transliteration-tool","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CAMeL-Lab/seq2seq-transliteration-tool","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fseq2seq-transliteration-tool","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fseq2seq-transliteration-tool/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fseq2seq-transliteration-tool/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fseq2seq-transliteration-tool/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CAMeL-Lab","download_url":"https://codeload.github.com/CAMeL-Lab/seq2seq-transliteration-tool/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CAMeL-Lab%2Fseq2seq-transliteration-tool/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28751065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-25T09:58:17.166Z","status":"ssl_error","status_checked_at":"2026-01-25T09:55:56.104Z","response_time":113,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-25T10:03:10.694Z","updated_at":"2026-01-25T10:03:13.643Z","avatar_url":"https://github.com/CAMeL-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Seq2Seq Transliteration Tool\n\n## Authors\n- [Ali Shazal](https://github.com/alishazal)\n- [Aiza Usman](https://github.com/aizausman)\n\n## Prerequisites\nIt is important to match the versions of the prerequisites mentioned below in order to avoid errors. These prereqs assume you have a GPU. 
If you don't, ignore tensorflow-gpu (and instead install tensorflow), cudatoolkit 8.0, and cudnn 6.0.21.\n\n- Python 3.6 and its following libraries:\n    - [camel-tools](https://camel-tools.readthedocs.io/en/latest/getting_started.html)\n    - tensorflow-gpu 1.4.0\n    - cudatoolkit 8.0\n    - cudnn 6.0.21\n    - editdistance\n    - numpy\n    - pandas\n    - scipy\n- Anaconda 4.1.1\n- CUDA 8.0\n- GCC 4.9.3 (very important to match this version because of a grep command in seq2seq python scripts)\n\nWe ran our seq2seq systems with the GPU [NVIDIA Tesla V100 PCIe 32 GB](https://www.techpowerup.com/gpu-specs/tesla-v100-pcie-32-gb.c3184) on NYU Abu Dhabi's High Performance Computing cluster, known as Dalma. We set the memory flag to 30GB and set the max. time to 12 hours for each run. All other Dalma flags were kept as default. The .sh scripts that we ran can be seen in the file dalma_scripts.sh\n\n## Repository Structure\n```\nai/\n    datasets/                   #module for preprocessing of any dataset\n    models/                     #contains the model architectures of the fasttext-enabled seq2seq (named char_seq2seq) and simple seq2seq models\n    tests/                      #contains the scripts that run the seq2seq systems, MLE system, and also contains the accuracy and bleu score scripts.\nhelpers/                        #contains helper files for the transliterate.py script that runs complete systems\noutput/\n    evaluations/                #folder to store evaluation txt files\n    models/                     #folder to store trained models\n    predictions/                #folder to store predictions of trained systems\npretrained_word_embeddings/     #folder to store word embedding .bin files that are produced by Fasttext\nsplits_ldc/                     #contains the LDC data split into train, dev and test; there is also a source split that contains unannotated arabizi data which we use to train Fasttext.\ntemp/                           #folder to 
store machine learning input and output files that are produced during system runs. These files are produced after preprocessing or ay-normalization.\n```\n\n\n# Transliteration Tool\nThis tool has 4 components. We will demonstrate the use of each component using the [LDC BOLT Egyptian Arabic SMS/Chat and Transliteration](https://catalog.ldc.upenn.edu/LDC2017T07) data to transliterate Arabizi to Arabic.\n\n## 1. Data Extraction from LDC XML Files\nDownload the [data](https://catalog.ldc.upenn.edu/LDC2017T07) and unzip the downloaded file. After unzipping you will get the folder 'bolt_sms_chat_ara_src_transliteration', which contains a 'data' folder, a 'docs' folder and an 'index.html' file. Place the folder 'bolt_sms_chat_ara_src_transliteration' in the root of this repository.\n\n### Extracting Data Splits: Train, Dev \u0026 Test\n\nWe split the chat and SMS transliteration files in the following way:\n- Train: CHT_ARZ_{20121228.0001-20150101.0002} and SMS_ARZ_{20120223.0001-20130902.0002}\n- Dev: CHT_ARZ_{20120130.0000-20121226.0003} and SMS_ARZ_{20110705.0000-20120220.0000}\n- Test: CHT_ARZ_{20150101.0008-20160201.0001} and SMS_ARZ_{20130904.0001-20130929.0000}.\n\nThe exact files for each split are already listed in this repo in the splits_ldc folder, in train.txt, dev.txt, and test.txt. These txt files can be used to move the xml files of a specific split into a separate folder using the script splits_ldc/makeNewLDCSplits.py in the following way:\n\n1. 
To split the xml files into train, dev and test, run the following three commands (one command for each split):\n\n```unix\n# train\npython3 splits_ldc/makeSplits.py bolt_sms_chat_ara_src_transliteration/data/transliteration/ splits_ldc/train.txt splits_ldc/train/xml_files\n\n# dev\npython3 splits_ldc/makeSplits.py bolt_sms_chat_ara_src_transliteration/data/transliteration/ splits_ldc/dev.txt splits_ldc/dev/xml_files\n\n# test\npython3 splits_ldc/makeSplits.py bolt_sms_chat_ara_src_transliteration/data/transliteration/ splits_ldc/test.txt splits_ldc/test/xml_files\n```\n\nNow the xml files for each split will reside in the specific folder of the split. Next, we will extract data from these XML files.\n\n2. Extract the source and target. To do this, run the following three commands (one command for each split):\n\n```unix\n# training data\npython3 splits_ldc/getSourceAndTarget.py splits_ldc/train/xml_files/ splits_ldc/train/train-source.arabizi splits_ldc/train/train-word-aligned-target.gold splits_ldc/train/train-sentence-aligned-target.gold\n\n# dev data\npython3 splits_ldc/getSourceAndTarget.py splits_ldc/dev/xml_files/ splits_ldc/dev/dev-source.arabizi splits_ldc/dev/dev-word-aligned-target.gold splits_ldc/dev/dev-sentence-aligned-target.gold\n\n# test data\npython3 splits_ldc/getSourceAndTarget.py splits_ldc/test/xml_files/ splits_ldc/test/test-source.arabizi splits_ldc/test/test-word-aligned-target.gold splits_ldc/test/test-sentence-aligned-target.gold\n```\n\nThe word-aligned-target.gold and sentence-aligned-target.gold files differ in the presence or absence of [+] and [-] tokens.\n\nAt this point, each of the split folders (train, dev and test) will contain three files: source.arabizi, word-aligned-target.gold, and sentence-aligned-target.gold.\n\n### Extracting Unannotated Arabizi Data\n\nWe also extract data from the unannotated Arabizi files. This data is used to train Fasttext for pre-trained word embeddings. 
The files include the train, dev and test Arabizi lines and many more (~1M words in total). However, in order to make sure that dev and test lines are unseen, we exclude them when extracting all lines. To do all this, simply run the following command:\n\n```unix\npython3 splits_ldc/getSourceArabiziWithoutDevAndTest.py bolt_sms_chat_ara_src_transliteration/data/source splits_ldc/dev/xml_files splits_ldc/test/xml_files splits_ldc/source/source-without-dev-test.arabizi\n```\n\n### Training Fasttext with Unannotated Arabizi Data\n\nDownload and build Fasttext in the root folder of this repository with the following commands:\n\n```unix\ngit clone https://github.com/facebookresearch/fastText.git\n\ncd fastText\n\nmake\n```\n\nNow train word embeddings on the unannotated Arabizi data we extracted (without dev and test) using Fasttext. First, preprocess the data and then start training.\n\n```unix\n# Preprocess\n\ncd ../ #move up one directory to come back to the root\n\npython3 helpers/preprocess_fasttext_data.py --input_file=splits_ldc/source/source-without-dev-test.arabizi --output_file=splits_ldc/source/source-without-dev-test-preprocessed.arabizi\n\n# Word-embeddings training\n./fastText/fasttext skipgram -input splits_ldc/source/source-without-dev-test-preprocessed.arabizi -output pretrained_word_embeddings/arabizi_300_narrow -dim 300 -minn 2 -ws 2\n```\n\nThis will save a .bin file at the output path specified in the command. This .bin file is used during training to supply the pre-trained word embeddings.\n\n## 2. 
Training\nTo train models on the data we've extracted run any of the following scripts depending on which model you're training:\n```unix\n# Word2Word\npython3 transliterate.py --predict=False --evaluate_accuracy=False --evaluate_bleu=False\n```\n```unix\n# Line2Line\npython3 transliterate.py --predict=False --evaluate_accuracy=False --evaluate_bleu=False --model_name=line2line --model_output_path=output/models/line2line_model --batch_size=1024\n```\n```unix\n# MLE\npython3 transliterate.py --predict=False --evaluate_accuracy=False --evaluate_bleu=False --model_name=mle --model_output_path=output/models/mle_model\n```\n\n**To train on any other data, please look at the flags in transliterate.py and run the scripts by setting the appropriate flags for your data.**\n\n## 3. Prediction with Evaluation\nTo predict the dev (or test) files using the trained models (with their temp files for word embeddings) and evaluate them using the gold files, run the following scripts according to your model prediction input/output files. 
To disable preprocessing set the --preprocess flag as False.\n```unix\n# Word2word\npython3 transliterate.py --train=False --predict_input_file=\u003cprediction-input-file\u003e --predict_output_file=\u003cprediction-output-file\u003e --predict_output_word_aligned_gold=\u003cword-aligned-gold-file\u003e --predict_output_sentence_aligned_gold=\u003csentence-aligned-gold-file\u003e --evaluation_results_file=\u003cevaluation-results-file\u003e\n```\n```unix\n# Line2Line\npython3 transliterate.py --train=False --model_name=line2line --model_output_path=output/models/line2line_model --prediction_loaded_model_training_train_input=temp/line2line_training_train_input --prediction_loaded_model_training_train_output=temp/line2line_training_train_output --prediction_loaded_model_training_dev_input=temp/line2line_training_dev_input --prediction_loaded_model_training_dev_output=temp/line2line_training_dev_output --prediction_loaded_model_training_test_input=temp/line2line_training_test_input --predict_input_file=\u003cprediction-input-file\u003e --predict_output_file=\u003cprediction-output-file\u003e --predict_output_word_aligned_gold=\u003cword-aligned-gold-file\u003e --predict_output_sentence_aligned_gold=\u003csentence-aligned-gold-file\u003e --evaluation_results_file=\u003cevaluation-results-file\u003e --batch_size=1024\n```\n```unix\n# MLE\npython3 transliterate.py --train=False --model_name=mle --model_output_path=output/models/mle_model --predict_input_file=\u003cprediction-input-file\u003e --predict_output_file=\u003cprediction-output-file\u003e --predict_output_word_aligned_gold=\u003cword-aligned-gold-file\u003e --predict_output_sentence_aligned_gold=\u003csentence-aligned-gold-file\u003e --evaluation_results_file=\u003cevaluation-results-file\u003e\n```\n\nTo run the hybrid system, which combines MLE (for OOV words) and Word2Word (for INV words) run the following script. 
It expects an MLE model and a Word2Word model through the --mle_model_file and --word2word_model_dir flags.\n```unix\npython3 transliterate.py --model_name=hybrid --train=False --mle_model_file=output/models/mle_model --word2word_model_dir output/models/word2word_model --predict_input_file=\u003cprediction-input-file\u003e --predict_output_file=\u003cprediction-output-file\u003e --predict_output_word_aligned_gold=\u003cword-aligned-gold-file\u003e --predict_output_sentence_aligned_gold=\u003csentence-aligned-gold-file\u003e --evaluation_results_file=\u003cevaluation-results-file\u003e\n```\n\n## 4. Prediction without Evaluation\nTo generate predictions given a file using our best model (word2word) settings run the following script replacing \\\u003cinput\\_file\\\u003e with the path to your input file and \\\u003coutput\\_file\\\u003e with the path to the output file. Note: we're assuming that (a). the word2word model has already been trained and is stored in output/models/word2word_model (b). 
the temp folder has files that were automatically generated by the system for training (if you don't have these, we'd suggest running a complete cycle of the word2word system using the script given under the \"development set results\" below)\n\n```unix\npython3 transliterate.py --train=False --evaluate_accuracy=False --evaluate_bleu=False --predict_input_file=\u003cinput_file\u003e --predict_output_file=\u003coutput_file\u003e\n```\n\nTo run the hybrid system to generate predictions, run the following:\n```unix\npython3 transliterate.py --model_name=hybrid --train=False --evaluate_accuracy=False --evaluate_bleu=False --mle_model_file=output/models/mle_model --word2word_model_dir output/models/word2word_model --predict_input_file=\u003cinput-file\u003e --predict_output_file=\u003coutput-file\u003e\n```\n\n## Running Complete Systems\nThe following scripts replicate the results in our paper using the [LDC BOLT Egyptian Arabic SMS/Chat and Transliteration](https://catalog.ldc.upenn.edu/LDC2017T07) data with the word2word, line2line and MLE models (assuming all the data files are present and Fasttext has been trained, as explained in step 1). 
For results without preprocessing, pass the flag --preprocess=False\n\n#### Development Set Results\n```unix\n# Word2Word\npython3 transliterate.py\n```\n```unix\n# Line2Line\npython3 transliterate.py --model_name=line2line --model_output_path=output/models/line2line_model --prediction_loaded_model_training_train_input=temp/line2line_training_train_input --prediction_loaded_model_training_train_output=temp/line2line_training_train_output --prediction_loaded_model_training_dev_input=temp/line2line_training_dev_input --prediction_loaded_model_training_dev_output=temp/line2line_training_dev_output --prediction_loaded_model_training_test_input=temp/line2line_training_test_input --predict_output_file=output/predictions/line2line-dev.out --evaluation_results_file=output/evaluations/line2line_dev_evaluation_results.txt --batch_size=1024\n```\n```unix\n# MLE\npython3 transliterate.py --model_name=mle --model_output_path=output/models/mle_model --predict_output_file=output/predictions/mle_dev.out --evaluation_results_file=output/evaluations/mle_dev_evaluation_results.txt\n```\n\n#### Test Set Results\n```unix\n# Word2Word\npython3 transliterate.py --predict_input_file=splits_ldc/test/test-source.arabizi --predict_output_file=output/predictions/word2word-test.out --predict_output_word_aligned_gold=splits_ldc/test/test-word-aligned-target.gold --predict_output_sentence_aligned_gold=splits_ldc/test/test-sentence-aligned-target.gold --evaluation_results_file=output/evaluations/word2word_test_evaluation_results.txt\n```\n```unix\n# MLE\npython3 transliterate.py --model_name=mle --model_output_path=output/models/mle_model --predict_input_file=splits_ldc/test/test-source.arabizi --predict_output_file=output/predictions/mle-test.out --predict_output_word_aligned_gold=splits_ldc/test/test-word-aligned-target.gold --predict_output_sentence_aligned_gold=splits_ldc/test/test-sentence-aligned-target.gold --evaluation_results_file=output/evaluations/mle_test_evaluation_results.txt\n```\n\n## 
Troubleshooting\n#### Tensor Shape Error on Word Embeddings: LHS not equal to RHS\nThis error occurs when the files that the seq2seq model was trained on differ from the files that the model is told are \"training\" files during prediction. The model needs the same files in training and prediction because it has to load the same word embeddings every time. If the files are different, the shapes of the tensors won't match. So make sure that the files passed to the flags --prediction_loaded_model_training_train_input, --prediction_loaded_model_training_train_output, --prediction_loaded_model_training_dev_input, --prediction_loaded_model_training_dev_output, and --prediction_loaded_model_training_test_input are the same files that were produced in temp during training. \n\n## License\nThis tool is available under the MIT license. See the [LICENSE file](LICENSE) for more info.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fseq2seq-transliteration-tool","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcamel-lab%2Fseq2seq-transliteration-tool","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcamel-lab%2Fseq2seq-transliteration-tool/lists"}