{"id":13594508,"url":"https://github.com/facebookresearch/InferSent","last_synced_at":"2025-04-09T07:32:43.229Z","repository":{"id":37819426,"uuid":"91736182","full_name":"facebookresearch/InferSent","owner":"facebookresearch","description":"InferSent sentence embeddings","archived":true,"fork":false,"pushed_at":"2021-08-30T21:16:25.000Z","size":443,"stargazers_count":2278,"open_issues_count":34,"forks_count":470,"subscribers_count":78,"default_branch":"main","last_synced_at":"2024-09-27T03:40:14.635Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-18T20:45:29.000Z","updated_at":"2024-09-24T19:12:27.000Z","dependencies_parsed_at":"2022-08-19T11:20:43.614Z","dependency_job_id":null,"html_url":"https://github.com/facebookresearch/InferSent","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FInferSent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FInferSent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FInferSent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FInferSent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/InferSent/tar.gz/
refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223375360,"owners_count":17135355,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:01:34.769Z","updated_at":"2024-11-06T16:31:28.858Z","avatar_url":"https://github.com/facebookresearch.png","language":"Jupyter Notebook","funding_links":[],"categories":["Language modelling","Jupyter Notebook","Pytorch \u0026 related libraries｜Pytorch \u0026 相关库","Pytorch \u0026 related libraries","Feature Extraction"],"sub_categories":["NLP \u0026 Speech Processing｜自然语言处理 \u0026 语音处理:","NLP \u0026 Speech Processing:","Text/NLP"],"readme":"# InferSent\n\n*InferSent* is a *sentence embeddings* method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks.\n\nWe provide our pre-trained English sentence encoder from [our paper](https://arxiv.org/abs/1705.02364) and our [SentEval](https://github.com/facebookresearch/SentEval) evaluation toolkit.\n\n**Recent changes**: Removed train_nli.py and only kept pretrained models for simplicity. Reason is I do not have time anymore to maintain the repo beyond simple scripts to get sentence embeddings.\n\n## Dependencies\n\nThis code is written in python. 
Dependencies include:\n\n* Python 2/3\n* [Pytorch](http://pytorch.org/) (recent version)\n* NLTK \u003e= 3\n\n## Download word vectors\n\nDownload [GloVe](https://nlp.stanford.edu/projects/glove/) (V1) or [fastText](https://fasttext.cc/docs/en/english-vectors.html) (V2) vectors:\n```bash\nmkdir GloVe\ncurl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip\nunzip GloVe/glove.840B.300d.zip -d GloVe/\nmkdir fastText\ncurl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip\nunzip fastText/crawl-300d-2M.vec.zip -d fastText/\n```\n\n## Use our sentence encoder\nWe provide a simple interface to encode English sentences. **See [**demo.ipynb**](https://github.com/facebookresearch/InferSent/blob/master/demo.ipynb)\nfor a practical example.** Get started with the following steps:\n\n*0.0) Download our InferSent models (V1 trained with GloVe, V2 trained with fastText)[147MB]:*\n```bash\nmkdir encoder\ncurl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl\ncurl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl\n```\nNote that infersent1 is trained with GloVe (which have been trained on text preprocessed with the PTB tokenizer) and infersent2 is trained with fastText (which have been trained on text preprocessed with the MOSES tokenizer). 
The latter also removes the padding of zeros with max-pooling which was inconvenient when embedding sentences outside of their batches.\n\n*0.1) Make sure you have the NLTK tokenizer by running the following once:*\n```python\nimport nltk\nnltk.download('punkt')\n```\n\n*1) [Load our pre-trained model](https://github.com/facebookresearch/InferSent/blob/master/encoder/demo.ipynb) (in encoder/):*\n```python\nimport torch\nfrom models import InferSent\nV = 2\nMODEL_PATH = 'encoder/infersent%s.pkl' % V\nparams_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,\n                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}\ninfersent = InferSent(params_model)\ninfersent.load_state_dict(torch.load(MODEL_PATH))\n```\n\n*2) Set word vector path for the model:*\n```python\nW2V_PATH = 'fastText/crawl-300d-2M.vec'\ninfersent.set_w2v_path(W2V_PATH)\n```\n\n*3) Build the vocabulary of word vectors (i.e., keep only those needed):*\n```python\ninfersent.build_vocab(sentences, tokenize=True)\n```\nwhere *sentences* is your list of **n** sentences. You can update your vocabulary using *infersent.update_vocab(sentences)*, or directly load the **K** most common English words with *infersent.build_vocab_k_words(K=100000)*.\nIf **tokenize** is True (by default), sentences will be tokenized using NLTK.\n\n*4) Encode your sentences (list of *n* sentences):*\n```python\nembeddings = infersent.encode(sentences, tokenize=True)\n```\nThis outputs a numpy array with *n* vectors of dimension **4096**. 
Speed is around *1000 sentences per second* with batch size 128 on a single GPU.\n\n*5) Visualize the importance that our model attributes to each word:*\n\nWe provide a function to visualize the importance of each word in the encoding of a sentence:\n```python\ninfersent.visualize('A man plays an instrument.', tokenize=True)\n```\n![Model](https://dl.fbaipublicfiles.com/infersent/visualization.png)\n\n## Evaluate the encoder on transfer tasks\nTo evaluate the model on transfer tasks, see [SentEval](https://github.com/facebookresearch/SentEval/tree/master/examples). Be mindful to choose the same tokenization used for training the encoder. You should obtain the following test results for the baselines and the InferSent models:\n\nModel | MR | CR | SUBJ | MPQA | STS14 | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | SICK Relatedness | SICK Entailment | SST | TREC | MRPC\n:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:\n`InferSent1` | **81.1** | **86.3** | 92.4 | **90.2** | **.68/.65** | 75.8/75.5 | 0.884 | 86.1 | **84.6** | 88.2 | **76.2**/83.1\n`InferSent2` | 79.7 | 84.2 | 92.7 | 89.4 | **.68/.66** | **78.4/78.4** | **0.888** | **86.3** | 84.3 | **90.8** | 76.0/**83.8**\n`SkipThought` | 79.4 | 83.1 | **93.7** | 89.3 | .44/.45 | 72.1/70.2| 0.858 | 79.5 | 82.9 | 88.4 | -\n`fastText-BoV` | 78.2 | 80.2 | 91.8 | 88.0 | .65/.63 | 70.2/68.3 | 0.823 | 78.9 | 82.3 | 83.4 | 74.4/82.4\n\n## Reference\n\nPlease consider citing [[1]](https://arxiv.org/abs/1705.02364) if you found this code useful.\n\n### Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (EMNLP 2017)\n\n[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. 
Bordes, [*Supervised Learning of Universal Sentence Representations from Natural Language Inference Data*](https://arxiv.org/abs/1705.02364)\n\n```\n@InProceedings{conneau-EtAl:2017:EMNLP2017,\n  author    = {Conneau, Alexis  and  Kiela, Douwe  and  Schwenk, Holger  and  Barrault, Lo\\\"{i}c  and  Bordes, Antoine},\n  title     = {Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},\n  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},\n  month     = {September},\n  year      = {2017},\n  address   = {Copenhagen, Denmark},\n  publisher = {Association for Computational Linguistics},\n  pages     = {670--680},\n  url       = {https://www.aclweb.org/anthology/D17-1070}\n}\n```\n\n### Related work\n* [J. R Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler - SkipThought Vectors, NIPS 2015](https://arxiv.org/abs/1506.06726)\n* [S. Arora, Y. Liang, T. Ma - A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017](https://openreview.net/pdf?id=SyK00v5xx)\n* [Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg - Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, ICLR 2017](https://arxiv.org/abs/1608.04207)\n* [A. Conneau, D. Kiela - SentEval: An Evaluation Toolkit for Universal Sentence Representations, LREC 2018](https://arxiv.org/abs/1803.05449)\n* [S. Subramanian, A. Trischler, Y. Bengio, C. J Pal - Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, ICLR 2018](https://arxiv.org/abs/1804.00079)\n* [A. Nie, E. D. Bennett, N. D. Goodman - DisSent: Sentence Representation Learning from Explicit Discourse Relations, 2018](https://arxiv.org/abs/1710.04334)\n* [D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. 
Kurzweil - Universal Sentence Encoder, 2018](https://arxiv.org/abs/1803.11175)\n* [A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni - What you can cram into a single vector: Probing sentence embeddings for linguistic properties, ACL 2018](https://arxiv.org/abs/1805.01070)\n* [A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman - GLUE: A Multi-Task Benchmark and Analysis Platform\nfor Natural Language Understanding](https://arxiv.org/abs/1804.07461)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FInferSent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FInferSent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FInferSent/lists"}