{"id":20652458,"url":"https://github.com/thesofakillers/infersent-replication","last_synced_at":"2025-08-02T20:33:31.265Z","repository":{"id":70961864,"uuid":"478582674","full_name":"thesofakillers/infersent-replication","owner":"thesofakillers","description":"Partial replication of Conneau et al. (2017)","archived":false,"fork":false,"pushed_at":"2022-06-05T18:57:13.000Z","size":4702,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T19:03:32.084Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thesofakillers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-06T13:55:21.000Z","updated_at":"2022-06-05T18:57:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"d1b6f97c-5ea6-4965-a4c2-85c9df2fb9a6","html_url":"https://github.com/thesofakillers/infersent-replication","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/thesofakillers/infersent-replication","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Finfersent-replication","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Finfersent-replication/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Finfersent-replication/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Fi
nfersent-replication/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thesofakillers","download_url":"https://codeload.github.com/thesofakillers/infersent-replication/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Finfersent-replication/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268448362,"owners_count":24252019,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T17:35:03.145Z","updated_at":"2025-08-02T20:33:31.238Z","avatar_url":"https://github.com/thesofakillers.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# InferSent: A partial replication\n\nThis repository contains the code for a partial replication of Conneau et al.\n(2017), _Supervised Learning of Universal Sentence Representations from Natural\nLanguage Inference Data_.\n\nFour different sentence encoders are implemented and trained in the \"Generic NLI\ntraining scheme\" as described in the original paper. In particular:\n\n1. _Baseline_: averaging word embeddings to obtain sentence representations\n2. _LSTM_: applied on the word embeddings, where the last hidden state is\n   considered as sentence representation.\n3. 
_BiLSTM_: where the last hidden states of the forward and backward layers are\n   concatenated as the sentence representation.\n4. _Max-Pool BiLSTM_: with max pooling applied to the concatenation of\n   word-level hidden states from both directions to retrieve sentence\n   representations\n\nEvaluation is then done with [SentEval](https://aclanthology.org/L18-1269/).\n\n## Requirements and Setup\n\nDetails such as Python and package versions can be found in the generated\n[pyproject.toml](pyproject.toml) and [poetry.lock](poetry.lock) files.\n\nWe recommend using an environment manager such as\n[conda](https://docs.conda.io/en/latest/). After setting up your environment\nwith the correct Python version, please proceed with the installation of the\nrequired packages.\n\nFor [poetry](https://python-poetry.org/) users, getting set up is as easy as\nrunning\n\n```terminal\npoetry install\n```\n\nWe also provide a [requirements.txt](requirements.txt) file for\n[pip](https://pypi.org/project/pip/) users who do not wish to use poetry. In\nthis case, simply run\n\n```terminal\npip install -r requirements.txt\n```\n\nOnce these packages have been installed, we have to manually set up\n[SentEval](https://github.com/facebookresearch/SentEval), since the Facebook\nresearchers and engineers are not paid enough to make a PyPI package like\neveryone else. 
To do this, first clone the repository to a folder of your\nchoice:\n\n```terminal\ngit clone git@github.com:facebookresearch/SentEval.git\n```\n\nThen navigate to the SentEval repository and install it to the same environment\nas used above:\n\n```terminal\ncd SentEval\npython setup.py install\n```\n\n### Data, Pretrained Embeddings and Models\n\nThe repository does not include the datasets and pretrained embeddings used to\ntrain the models mentioned above, nor the trained model checkpoints themselves,\nas these are inappropriate for git version control.\n\nThe datasets used for NLI training are public and will be automatically\ndownloaded when necessary.\n\nThe model checkpoints and evaluation results, hosted on the\n[Internet Archive](https://archive.org), can be downloaded from\n[this link](https://archive.org/download/thesofakillers-infersent-logs/logs.zip).\nPlease download and unzip the file, placing the resulting `logs/` directory in\nthe repository root.\n\nThe public dataset and embeddings used are\n[SNLI](https://nlp.stanford.edu/projects/snli/) and 840B-token 300-d\n[GloVe](https://nlp.stanford.edu/projects/glove/) respectively. If users already\nhave these locally and do not wish to re-download them, simply move (or\nsymbolically link) them to a shared data directory, and then pass this\ndirectory and the resulting paths as arguments to the scripts.\n\nWe also make use of the SentEval datasets. To download them, visit the SentEval\nrepository you previously cloned and run\n\n```terminal\ncd data/downstream/\n./get_transfer_data.bash\n```\n\nOnce this is complete, you may then rsync or mv the `downstream/` directory to a\ndirectory of choice. Keep this directory in mind as we will then point to it\nwhen using SentEval for evaluation. 
For example:\n\n```terminal\nrsync -r -v -h senteval/data/downstream infersent-replication/data/\n```\n\nWe would then point to `infersent-replication/data` when using SentEval for\nevaluation.\n\n## Repository Structure\n\n```bash\n.\n├── models/                            # models folder\n│   ├── __init__.py\n│   ├── encoder.py                     # sentence encoder models\n│   └── infersent.py                   # Generic NLI model\n├── data/                              # data folder (not committed)\n├── logs/                              # logs folder (not committed)\n├── utils.py                           # for miscellaneous utils\n├── data.py                            # for data loading and processing\n├── train.py                           # for model training\n├── eval.py                            # for model evaluation\n├── infer.py                           # for model inference\n├── demo.ipynb                         # demo jupyter notebook\n├── error_analysis.md                  # error analysis markdown file\n├── error_analysis.pdf                 # error analysis pdf file\n├── images/                            # images folder for error_analysis.md\n├── pyproject.toml                     # repo metadata\n├── poetry.lock\n├── gen_pip_reqs.sh                    # script for gen. pip requirements\n└── README.md                          # you are here\n\n```\n\n## Usage\n\n### Demo\n\nThe repository comes with a demo [Jupyter Notebook](https://jupyter.org/) that\nallows users to load a trained model and run inference on different examples.\n\nThe notebook also provides an overview and analysis of the results.\n\nFor more fine-grained usage, please refer to the following sections.\n\n### Data and Embeddings\n\nWhen called directly, the `data.py` script will take care of setting up data and\nembedding requirements for you. In particular, it will\n\n1. Download GloVe embeddings if the embeddings .txt file is not found.\n2. Parse the embeddings .txt file.\n3. 
Download the SNLI dataset if it is not already downloaded.\n4. Process the SNLI dataset, building the vocabulary in the process.\n5. Save the vocab to disk, to avoid having to build it again.\n6. Align GloVe embeddings to the vocab.\n7. Save the aligned GloVe embeddings as a Tensor to disk.\n\nFor usage:\n\n```stdout\nusage: data.py [-h] [-d DATA_DIR] [-g GLOVE] [-gv GLOVE_VARIANT] -ag\n               ALIGNED_GLOVE [-b BATCH_SIZE] [-cv CACHED_VOCAB]\n\nSets up data: downloads data and aligns GloVe embeddings to SNLI vocab.\n\noptions:\n  -h, --help            show this help message and exit\n  -d DATA_DIR, --data-dir DATA_DIR\n                        path to data directory\n  -g GLOVE, --glove GLOVE\n                        path to glove embeddings\n  -gv GLOVE_VARIANT, --glove-variant GLOVE_VARIANT\n                        which variant of glove embeddings to use\n  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE\n                        path to save aligned glove embeddings tensor\n  -b BATCH_SIZE, --batch-size BATCH_SIZE\n                        batch size for training\n  -cv CACHED_VOCAB, --cached-vocab CACHED_VOCAB\n                        path to save/load serialized vocabulary\n```\n\n### Training\n\nWe use `train.py` for training. Most arguments should work fine with default\nvalues.\n\nFor usage:\n\n```stdout\nusage: train.py [-h] -e ENCODER_TYPE [-c CHECKPOINT_PATH] [-s SEED] [-p]\n                [-l LOG_DIR] [-d DATA_DIR] [-g GLOVE] [-gv GLOVE_VARIANT]\n                [-ag ALIGNED_GLOVE] [-b BATCH_SIZE] [-cv CACHED_VOCAB]\n                [-w NUM_WORKERS]\n\nTrains an InferSent model. 
Test-set evaluation is deferred to eval.py\n\noptions:\n  -h, --help            show this help message and exit\n  -e ENCODER_TYPE, --encoder-type ENCODER_TYPE\n                        one of 'baseline', 'lstm', 'bilstm', maxpoolbilstm'\n  -c CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH\n                        path for loading previously saved checkpoint\n  -s SEED, --seed SEED  the random seed to use\n  -p, --progress-bar    whether to show the progress bar\n  -l LOG_DIR, --log-dir LOG_DIR\n                        path to log directory\n  -d DATA_DIR, --data-dir DATA_DIR\n                        path to data directory\n  -g GLOVE, --glove GLOVE\n                        path to glove embeddings\n  -gv GLOVE_VARIANT, --glove-variant GLOVE_VARIANT\n                        which variant of glove embeddings to use\n  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE\n                        path to aligned glove embeddings tensor\n  -b BATCH_SIZE, --batch-size BATCH_SIZE\n                        batch size for training\n  -cv CACHED_VOCAB, --cached-vocab CACHED_VOCAB\n                        path to save/load serialized vocabulary\n  -w NUM_WORKERS, --num-workers NUM_WORKERS\n                        number of workers for data loading\n```\n\n### Evaluation\n\nWe use `eval.py` for evaluation, both on the original SNLI task as well as on\nSentEval, configurably via the command-line arguments.\n\nFor usage:\n\n```stdout\nusage: eval.py [-h] [-d DATA_DIR] [-o OUTPUT_DIR] [--snli]\n               [--snli-output-dir SNLI_OUTPUT_DIR] [--senteval]\n               [--senteval-output-dir SENTEVAL_OUTPUT_DIR] -e ENCODER_TYPE -c\n               CHECKPOINT_PATH [-ag ALIGNED_GLOVE] [-cv CACHED_VOCAB]\n               [-w NUM_WORKERS] [-b BATCH_SIZE] [-p] [-s SEED] [-g]\n\nEvaluate a trained model on SNLI and SentEval\n\noptions:\n  -h, --help            show this help message and exit\n  -d DATA_DIR, --data-dir DATA_DIR\n                        path to data directory\n  -o 
OUTPUT_DIR, --output-dir OUTPUT_DIR\n                        Parent directory for saving results\n  --snli                Evaluate on SNLI\n  --snli-output-dir SNLI_OUTPUT_DIR\n                        Directory to save SNLI results\n  --senteval            Evaluate on SentEval\n  --senteval-output-dir SENTEVAL_OUTPUT_DIR\n                        Directory to save SentEval results\n  -e ENCODER_TYPE, --encoder-type ENCODER_TYPE\n                        one of 'baseline', 'lstm', 'bilstm', maxpoolbilstm'\n  -c CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH\n                        path to the checkpoint file\n  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE\n                        path to the aligned glove file\n  -cv CACHED_VOCAB, --cached-vocab CACHED_VOCAB\n                        path to save/load serialized vocabulary\n  -w NUM_WORKERS, --num-workers NUM_WORKERS\n                        number of workers for data loading\n  -b BATCH_SIZE, --batch-size BATCH_SIZE\n                        batch size\n  -p, --progress-bar    whether to show the progress bar\n  -s SEED, --seed SEED  the random seed to use\n  -g, --gpu             whether to use gpu\n```\n\n### Inference\n\nWe provide a simple script for performing inference, `infer.py`. This can be\nused either to predict the entailment of a pair of sentences, or to embed a\nparticular sentence. For usage:\n\n```stdout\nusage: infer.py [-h] -m MODE -c CHECKPOINT_PATH [-ag ALIGNED_GLOVE] -s1\n                SENTENCE_1 [-s2 SENTENCE_2] [-map]\n\nScript for inference\n\noptions:\n  -h, --help            show this help message and exit\n  -m MODE, --mode MODE  Mode for inference. 
One of 'nli' or 'sentembed'\n  -c CHECKPOINT_PATH, --checkpoint-path CHECKPOINT_PATH\n                        Path to the checkpoint file\n  -ag ALIGNED_GLOVE, --aligned-glove ALIGNED_GLOVE\n                        path to the aligned glove file\n  -s1 SENTENCE_1, --sentence-1 SENTENCE_1\n                        Sentence to embed if embedding, premise if NLI'ing\n  -s2 SENTENCE_2, --sentence-2 SENTENCE_2\n                        Hypothesis. Only required if NLI'ing\n  -map, --map           Flag whether to return one of {'entailment',\n                        'neutral', 'contradiction'} instead of {0, 1, 2}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthesofakillers%2Finfersent-replication","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthesofakillers%2Finfersent-replication","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthesofakillers%2Finfersent-replication/lists"}