## Updates

Code and pre-trained models related to [Bi-Sent2vec](https://arxiv.org/abs/1912.12481), the cross-lingual extension of Sent2vec, can be found [here](https://github.com/epfml/Bi-sent2vec).

# Sent2vec

TL;DR: This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.

### Table of Contents

* [Setup and Requirements](#setup-and-requirements)
* [Sentence Embeddings](#sentence-embeddings)
    - [Generating Features from Pre-Trained Models](#generating-features-from-pre-trained-models)
    - [Downloading Sent2vec Pre-Trained Models](#downloading-sent2vec-pre-trained-models)
    - [Train a New Sent2vec Model](#train-a-new-sent2vec-model)
    - [Nearest Neighbour Search and Analogies](#nearest-neighbour-search-and-analogies)
* [Word (Unigram) Embeddings](#unigram-embeddings)
    - [Extracting Word Embeddings from Pre-Trained Models](#extracting-word-embeddings-from-pre-trained-models)
    - [Downloading Pre-Trained Models](#downloading-pre-trained-models)
    - [Train a CBOW Character and Word Ngrams Model](#train-a-cbow-character-and-word-ngrams-model)
* [References](#references)

# Setup and Requirements

Our code builds upon [Facebook's FastText library](https://github.com/facebookresearch/fastText); see also their documentation and Python
interfaces.

To compile the library, simply run a `make` command.

A Cython module allows you to keep the model in memory while inferring sentence embeddings. To compile and install the module, run the following from the project root folder:

```
pip install .
```

## Note

If you install sent2vec using

```
$ pip install sent2vec
```

then you'll get the wrong package. Please follow the instructions in this README to install it correctly.

# Sentence Embeddings

To generate sentence representations, we introduce our sent2vec method and provide code and models. Think of it as an unsupervised version of [FastText](https://github.com/facebookresearch/fastText), and an extension of word2vec (CBOW) to sentences.

The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. The algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks, and on many tasks even beats supervised models, highlighting the robustness of the produced sentence embeddings; see [*the paper*](https://aclweb.org/anthology/N18-1049) for more details.

## Generating Features from Pre-Trained Models

### Directly from Python

If you've installed the Cython module, you can infer sentence embeddings while keeping the model in memory:

```python
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")
embs = model.embed_sentences(["first sentence .", "another sentence"])
```

Text preprocessing (tokenization and lowercasing) is not handled by the module; check `wikiTokenize.py` for tokenization using NLTK and Stanford NLP.

An alternative to the Cython module is the Python code provided in the `get_sentence_embeddings_from_pre-trained_models` notebook. It handles tokenization and can be given raw sentences, but does not keep the model in memory.
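Because the module expects lowercased, pre-tokenized input, raw text should be normalized before it is passed to `embed_sentence`. The repo's own `wikiTokenize.py` and `tweetTokenize.py` (NLTK and Stanford NLP) are the tokenizers to use for results matching the pre-trained models; the following stdlib-only sketch is merely an illustration of the kind of normalization expected:

```python
import re

def simple_preprocess(text):
    """Lowercase and crudely tokenize a sentence, separating punctuation.

    Only a rough stand-in for the repository's NLTK/Stanford-based
    tokenizers; use those for results matching the pre-trained models.
    """
    text = text.lower()
    # put spaces around punctuation so tokens like "time." become "time ."
    text = re.sub(r"([.,!?()])", r" \1 ", text)
    return " ".join(text.split())

print(simple_preprocess("Once upon a time."))  # once upon a time .
```

The normalized string can then be handed to `model.embed_sentence(...)` as in the example above.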
#### Running Inference with Multiple Processes

There is an 'inference' mode for loading the model in the Cython module, which loads the model's input matrix into a shared memory segment and does not load the output matrix, which is not needed for inference. This is an optimization for the use case of running inference with multiple independent processes, which would otherwise each need to load a copy of the model into their address space. To use it:

```python
model.load_model('model.bin', inference_mode=True)
```

The model is loaded into a shared memory segment named after the model name. The model will stay in memory until you explicitly remove the shared memory segment. To do so from Python:

```python
model.release_shared_mem('model.bin')
```

### Using the Command-line Interface

Given a pre-trained model `model.bin` (see download links below), here is how to generate sentence features for an input text. Use the `print-sentence-vectors` command; the input text file must contain one sentence per line:

```
./fasttext print-sentence-vectors model.bin < text.txt
```

This outputs the sentence vectors (the features for each input sentence) to standard output, one vector per line.
This can also be used with pipes:

```
cat text.txt | ./fasttext print-sentence-vectors model.bin
```

## Downloading Sent2vec Pre-Trained Models

- [sent2vec_wiki_unigrams](https://drive.google.com/file/d/0B6VhzidiLvjSa19uYWlLUEkzX3c/view?usp=sharing&resourcekey=0-p9iI_hJbCuNiUq5gWz7Qpg) 5GB (600 dim, trained on English Wikipedia)
- [sent2vec_wiki_bigrams](https://drive.google.com/file/d/0B6VhzidiLvjSaER5YkJUdWdPWU0/view?usp=sharing&resourcekey=0-MVSyokxog2m4EQ4AGsssww) 16GB (700 dim, trained on English Wikipedia)
- [sent2vec_twitter_unigrams](https://drive.google.com/file/d/0B6VhzidiLvjSaVFLM0xJNk9DTzg/view?usp=sharing&resourcekey=0--yCdYMEuuD2Ml7jIBhJiDw) 13GB (700 dim, trained on English
tweets)
- [sent2vec_twitter_bigrams](https://drive.google.com/file/d/0B6VhzidiLvjSeHI4cmdQdXpTRHc/view?usp=sharing&resourcekey=0-5wNEK0boM-tRvmkCIb8Txw) 23GB (700 dim, trained on English tweets)
- [sent2vec_toronto books_unigrams](https://drive.google.com/file/d/0B6VhzidiLvjSOWdGM0tOX1lUNEk/view?usp=sharing&resourcekey=0-dQDQ3OZWooMbg-g48GRf1Q) 2GB (700 dim, trained on the [BookCorpus dataset](http://yknzhu.wixsite.com/mbweb))
- [sent2vec_toronto books_bigrams](https://drive.google.com/file/d/0B6VhzidiLvjSdENLSEhrdWprQ0k/view?usp=sharing&resourcekey=0-c1Qyo6RNF5TRsVzrNXhdRw) 7GB (700 dim, trained on the [BookCorpus dataset](http://yknzhu.wixsite.com/mbweb))

(as used in the NAACL 2018 paper)

Note: users who downloaded models prior to [this release](https://github.com/epfml/sent2vec/releases/tag/v1) will encounter compatibility issues when trying to use the old models with the latest commit. Those users can still use the code in the release to keep using the old models.

### Tokenizing

Both feature generation, as above, and training, as below, require that the input texts (sentences) are already tokenized. To tokenize and preprocess text for the above models, use

```
python3 tweetTokenize.py <tweets_folder> <dest_folder> <num_process>
```

for tweets, or the following for Wikipedia:

```
python3 wikiTokenize.py corpora > destinationFile
```

Note: for `wikiTokenize.py`, set the `SNLP_TAGGER_JAR` parameter to the path of `stanford-postagger.jar`, which you can download [here](http://www.java2s.com/Code/Jar/s/Downloadstanfordpostaggerjar.htm).

## Train a New Sent2vec Model

To train a new sent2vec model, you first need a large training text file containing one sentence per line. The provided code does not perform tokenization and lowercasing; you have to preprocess your input data yourself, as described above.

You can then train a new model.
Here is an example command:

    ./fasttext sent2vec -input wiki_sentences.txt -output my_model -minCount 8 -dim 700 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000 -maxVocabSize 750000 -numCheckPoints 10

Here is a description of all available arguments:

```
sent2vec -input train.txt -output model

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -lr                 learning rate [0.2]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                dimension of word and sentence vectors [100]
  -epoch              number of epochs [5]
  -minCount           minimal number of word occurrences [5]
  -minCountLabel      minimal number of label occurrences [0]
  -neg                number of negatives sampled [10]
  -wordNgrams         max length of word ngram [2]
  -loss               loss function {ns, hs, softmax} [ns]
  -bucket             number of hash buckets for vocabulary [2000000]
  -thread             number of threads [2]
  -t                  sampling threshold [0.0001]
  -dropoutK           number of ngrams dropped when training a sent2vec model [2]
  -verbose            verbosity level [2]
  -maxVocabSize       vocabulary exceeding this size will be truncated [None]
  -numCheckPoints     number of intermediary checkpoints to save when training [1]
```

## Nearest Neighbour Search and Analogies

Given a pre-trained model `model.bin`, here is how to use these features. For the nearest-neighbour sentence feature, you need the model as well as a corpus in which to search for the nearest neighbours of your input sentence. We use cosine distance as our distance metric.
To do so, use the command `nnSent`; the input should be one sentence per line:

```
./fasttext nnSent model.bin corpora [k]
```

`k` is optional and is the number of nearest sentences to output.

For `analogiesSent`, the user inputs three sentences A, B and C, and the command finds the sentence D from the corpus that best completes the A:B::C:D analogy pattern:

```
./fasttext analogiesSent model.bin corpora [k]
```

`k` is optional and is the number of nearest sentences to output.

# Unigram Embeddings

To generate word representations, we compared word embeddings obtained by training sent2vec models with other word embedding models, including a novel method we refer to as CBOW char + word ngrams (`cbow-c+w-ngrams`). This method augments FastText's character-augmented CBOW with word n-grams. You can see the full comparison of results in [*this paper*](https://www.aclweb.org/anthology/N19-1098).

## Extracting Word Embeddings from Pre-Trained Models

If you have the Cython wrapper installed, some functionalities allow you to work with word embeddings obtained from `sent2vec` or `cbow-c+w-ngrams` models:

```python
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')  # the model can be sent2vec or cbow-c+w-ngrams
vocab = model.get_vocabulary()  # returns a dictionary of words and their frequency in the corpus
uni_embs, vocab = model.get_unigram_embeddings()  # returns the full unigram embedding matrix
uni_embs = model.embed_unigrams(['dog', 'cat'])  # returns unigram embeddings for a list of unigrams
```

Asking for a unigram embedding not present in the vocabulary returns a zero vector in the case of sent2vec. The `cbow-c+w-ngrams` method can use sub-character ngrams to infer some representation.

## Downloading Pre-Trained Models

Coming soon.

## Train a CBOW Character and Word Ngrams Model

Very similar to the sent2vec instructions.
A plausible command would be:

    ./fasttext cbow-c+w-ngrams -input wiki_sentences.txt -output my_model -lr 0.05 -dim 300 -ws 10 -epoch 9 -maxVocabSize 750000 -thread 20 -numCheckPoints 20 -t 0.0001 -neg 5 -bucket 4000000 -bucketChar 2000000 -wordNgrams 3 -minn 3 -maxn 6

# References

When using this code or some of our pre-trained models for your application, please cite the following paper for sentence embeddings:

  Matteo Pagliardini, Prakhar Gupta, Martin Jaggi, [*Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features*](https://aclweb.org/anthology/N18-1049), NAACL 2018

```
@inproceedings{pgj2017unsup,
  title = {{Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features}},
  author = {Pagliardini, Matteo and Gupta, Prakhar and Jaggi, Martin},
  booktitle = {NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics},
  year = {2018}
}
```

For word embeddings:

  Prakhar Gupta, Matteo Pagliardini, Martin Jaggi, [*Better Word Embeddings by Disentangling Contextual n-Gram Information*](https://www.aclweb.org/anthology/N19-1098), NAACL 2019

```
@inproceedings{DBLP:conf/naacl/GuptaPJ19,
  author    = {Prakhar Gupta and
               Matteo Pagliardini and
               Martin Jaggi},
  title     = {Better Word Embeddings by Disentangling Contextual n-Gram Information},
  booktitle = {{NAACL-HLT} {(1)}},
  pages     = {933--939},
  publisher = {Association for Computational Linguistics},
  year      = {2019}
}
```
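As noted in the nearest-neighbour section above, `nnSent` ranks corpus sentences by cosine distance between embeddings. The same ranking can be reproduced outside the CLI over vectors returned by `embed_sentences`; the following is a self-contained sketch of that metric in plain Python (not part of this library's API), using toy vectors in place of real sentence embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_emb, corpus_embs, k=1):
    """Indices of the k corpus embeddings closest to query_emb by cosine distance."""
    ranked = sorted(range(len(corpus_embs)),
                    key=lambda i: cosine_similarity(query_emb, corpus_embs[i]),
                    reverse=True)
    return ranked[:k]

# toy 2-d vectors standing in for sentence embeddings
print(nearest([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]], k=2))  # [1, 0]
```

In practice the corpus embeddings would come from `model.embed_sentences(...)`, and for large corpora you would use a vectorized library rather than this loop.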