{"id":19066184,"url":"https://github.com/epfml/bi-sent2vec","last_synced_at":"2025-04-28T12:25:20.319Z","repository":{"id":78965905,"uuid":"253824913","full_name":"epfml/Bi-Sent2Vec","owner":"epfml","description":"Robust Cross-lingual Embeddings from Parallel Sentences ","archived":false,"fork":false,"pushed_at":"2020-06-27T17:28:41.000Z","size":52,"stargazers_count":22,"open_issues_count":1,"forks_count":2,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-18T16:16:52.788Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-04-07T14:52:07.000Z","updated_at":"2025-02-22T16:27:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"e2bd7cba-a2e2-4d9a-b695-d760407d7200","html_url":"https://github.com/epfml/Bi-Sent2Vec","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2FBi-Sent2Vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2FBi-Sent2Vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2FBi-Sent2Vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2FBi-Sent2Vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/Bi-Sent2Vec/tar.gz/refs/heads
/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251312327,"owners_count":21569215,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T00:55:05.273Z","updated_at":"2025-04-28T12:25:20.313Z","avatar_url":"https://github.com/epfml.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Bi-Sent2Vec\n\nTLDR: This library provides cross-lingual numerical representations (features) for words, short texts, or sentences. These can be used as input to any machine learning task, with applications geared towards cross-lingual word translation, cross-lingual sentence retrieval, and cross-lingual downstream NLP tasks. The library is a cross-lingual extension of [Sent2Vec](https://github.com/epfml/sent2vec).\n\nBi-Sent2Vec vectors are also well suited to monolingual tasks, as indicated by a marked improvement in the monolingual quality of the word embeddings. 
(For more details, see the [paper](https://arxiv.org/abs/1912.12481).)\n\n### Table of Contents\n\n* [Setup and Requirements](#setup-and-requirements)\n* [Using the model](#using-the-model)\n    - [Downloading Bi-Sent2Vec pre-trained vectors](#downloading-bi-sent2vec-pre-trained-vectors)\n    - [Train a New Bi-Sent2Vec Model](#train-a-new-bi-sent2vec-model)\n* [Evaluation](#evaluation)\n* [References](#references)\n\n# Setup and Requirements\n\nOur code builds upon [Facebook's fastText library](https://github.com/facebookresearch/fastText).\n\nTo compile the library, simply run the `make` command.\n\n# Using the model\n\nTo generate cross-lingual word and sentence representations, we introduce our Bi-Sent2Vec method and provide code and models.\n\nThe method uses a simple but efficient objective to train distributed representations of sentences. The algorithm outperforms the current state-of-the-art bag-of-words based models on most of the benchmark tasks and is also competitive with deep models on some of the tasks, highlighting the robustness of the produced word and sentence embeddings. See [*the paper*](https://arxiv.org/abs/1912.12481) for more details.\n\n## Downloading Bi-Sent2Vec pre-trained vectors\n\nModels trained and tested in the Bi-Sent2Vec paper can be downloaded from the following links. 
Users are encouraged to add more bi-lingual models to the list provided they have been benchmarked properly.\n\n### Unigram\n\n[EN](https://drive.google.com/file/d/1schNkg0OLTrTqA_VSCcpJnczaZbEfUiW/view?usp=sharing)-[DE](https://drive.google.com/file/d/1S76Pf_UByF9vHfGHx3EAP5bB3Vvyi8_l/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1b_q6WCXdQEKz0Grx21mzxVaGqBV7Y5WY/view?usp=sharing)-[ES](https://drive.google.com/file/d/1pEusR2238oJwLmRzC0j7pduaKW6FLzOv/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1Omac6Cbkb7cmyOeTZpyacGOKGHy9ixo8/view?usp=sharing)-[FI](https://drive.google.com/file/d/1rr_ZhDPjp901vGKUuK4gjXEM9aDBDOOD/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1Ny7TDW_1jRZTH327OhrGbSpIPgsSr3LJ/view?usp=sharing)-[FR](https://drive.google.com/file/d/1WTsLmVcjG_M8vwgUvfvM_A1386q7Nv0H/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1dPmM270pUTW2ETl14SfcFFI0hscEeQXO/view?usp=sharing)-[HU](https://drive.google.com/file/d/1aLe8CsB2o0fjmMTonYozcj0V69pPY59_/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1C6e-6qkhsoYjWlJ0OcjWeStOSUvQUOuQ/view?usp=sharing)-[IT](https://drive.google.com/file/d/1_rO75UgpZpug7kzjtho-jkigwl_igFqm/view?usp=sharing)\n\n### 
Bigram\n\n[EN](https://drive.google.com/file/d/1CI4sFR0Y6v6zHzdaFN17vn9Wo_m1ywno/view?usp=sharing)-[DE](https://drive.google.com/file/d/1HyKS0QpBHd_2pLp_0JxGFDAMASydR5Xe/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1XKk4Vw4ATMcYAhmDl_nUx1HnpIcfuEwX/view?usp=sharing)-[ES](https://drive.google.com/file/d/1oJ2LXUk0CZzwj02sWIVZtsXH6psICOHl/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1q9dn76Sau3ArOEJ-J2mYghAnnoPjb9us/view?usp=sharing)-[FI](https://drive.google.com/file/d/1cqen99e_BNZp13wWGBpJf7x5b-PmHEKl/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1ztCsll3YUUBVMDHTZBPDmNMcZPkQbEu-/view?usp=sharing)-[FR](https://drive.google.com/file/d/1KWuuFpNDOmEXoLvwWU2OastlK5MyIke6/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/15sMJQNm3s6uWh80y2SkxY6hCWnMx5pmx/view?usp=sharing)-[HU](https://drive.google.com/file/d/1K88rEsVM7mrcHZlqwHPvtjWy5JPkIvJ6/view?usp=sharing)\n•\n[EN](https://drive.google.com/file/d/1Iv-vuPWw40mvkbzfRJ7c0EIvvO_VRjT6/view?usp=sharing)-[IT](https://drive.google.com/file/d/1XXDGKQFscr_snJFzGc-aafUVRy_q6r0g/view?usp=sharing)\n\n\n## Train a New Bi-Sent2Vec Model\n### Tokenizing and data format\nBi-Sent2Vec requires parallel sentences (sentences which are translations of each other) for training.\nWe use the [spaCy](https://spacy.io/) tokenizer to tokenize the text.\n\nThe required data format is one sentence pair per line. The two parallel sentences are separated by a \\\u003c\\\u003csplit\\\u003e\\\u003e token and each word has its language code attached to it as a suffix. 
For example, here is a snapshot of a valid English-French dataset:\n```\nthe_en train_en is_en arriving_en ._en \u003c\u003csplit\u003e\u003e le_fr train_fr arrive_fr ._fr\nfrance_en won_en the_en world_en cup_en ._en \u003c\u003csplit\u003e\u003e la_fr france_fr a_fr gagné_fr la_fr coupe_fr du_fr monde_fr ._fr\n```\n\n### Training\n\nAssuming `en-fr_sentences.txt` is the pre-processed training corpus, here is an example command to train a Bi-Sent2Vec model:\n\n    ./fasttext bisent2vec -input en-fr_sentences.txt -output model-en-fr -dim 300 -lr 0.2 -neg 10 -bucket 2000000 -maxVocabSize 750000 -thread 30 -t 0.000005 -epoch 5 -minCount 8 -dropoutK 4 -loss ns -wordNgrams 2 -numCheckPoints 5\n\nHere is a description of all available arguments:\n\n```\nThe following arguments are mandatory:\n  -input              training file path\n  -output             output file path (model is stored in the .bin file and the vectors in .vec file)\n\nThe following arguments are optional:\n  -lr                 learning rate [0.2]\n  -lrUpdateRate       change the rate of updates for the learning rate [100]\n  -dim                dimension of word and sentence vectors [100]\n  -epoch              number of epochs [5]\n  -minCount           minimal number of word occurences [5]\n  -minCountLabel      minimal number of label occurences [0]\n  -neg                number of negatives sampled [10]\n  -wordNgrams         max length of word ngram [2]\n  -loss               loss function {ns, hs, softmax} [ns]\n  -bucket             number of hash buckets for vocabulary [2000000]\n  -thread             number of threads [2]\n  -t                  sampling threshold [0.0001]\n  -dropoutK           number of ngrams dropped when training a Bi-Sent2Vec model [2]\n  -verbose            verbosity level [2]\n  -maxVocabSize       vocabulary exceeding this size will be truncated [None]\n  -numCheckPoints     number of intermediary checkpoints to save when training [1]\n```\n### Post 
Processing\nUse `vectors_by_lang.py` to separate the vectors for the two languages.\nExample:\n```\npython vectors_by_lang.py model-en-fr.vec en fr\n```\nThis command creates two files, `model-en-fr_en.vec` and `model-en-fr_fr.vec`, in word2vec format, containing the vectors for English and French respectively.\n\n# Evaluation\nOur models are evaluated using the standard evaluation tool in the [MUSE](https://github.com/facebookresearch/MUSE) repository by Facebook AI Research.\n\n# References\nWhen using this code or some of our pretrained vectors for your application, please cite the following paper:\n\n  Ali Sabet, Prakhar Gupta, Jean-Baptiste Cordonnier, Robert West, Martin Jaggi. [*Robust Cross-lingual Embeddings from Parallel Sentences*](https://arxiv.org/abs/1912.12481)\n\n```\n@article{Sabet2019RobustCE,\n  title={Robust Cross-lingual Embeddings from Parallel Sentences},\n  author={Ali Sabet and Prakhar Gupta and Jean-Baptiste Cordonnier and Robert West and Martin Jaggi},\n  journal={ArXiv 1912.12481},\n  year={2020},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Fbi-sent2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepfml%2Fbi-sent2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Fbi-sent2vec/lists"}