Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/PrincetonML/SIF
sentence embedding by Smooth Inverse Frequency weighting scheme
- Host: GitHub
- URL: https://github.com/PrincetonML/SIF
- Owner: PrincetonML
- License: MIT
- Created: 2016-11-11T21:19:02.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2019-07-23T19:22:37.000Z (over 5 years ago)
- Last Synced: 2024-11-12T22:03:08.877Z (about 2 months ago)
- Language: Python
- Size: 2.73 MB
- Stars: 1,084
- Watchers: 34
- Forks: 306
- Open Issues: 38
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SIF
This is the code for [the paper](https://openreview.net/forum?id=SyK00v5xx) "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".
The code is written in Python and requires NumPy, SciPy, scikit-learn (`sklearn`), Theano, and Lasagne; `pickle` is part of the Python standard library.
Some functions/classes are based on the [code](https://github.com/jwieting/iclr2016) of John Wieting for the paper "Towards Universal Paraphrastic Sentence Embeddings" (thanks, John!). The example data sets are also preprocessed using the code there.

## Install
To install all dependencies, using `virtualenv` is suggested:

```
$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt
```

## Get started
To get started, `cd` into the `examples/` directory and run `demo.sh`. It downloads the pretrained GloVe word embeddings and then runs the following scripts:
* `sif_embedding.py` is a demo of how to generate sentence embeddings using the SIF weighting scheme,
* `sim_sif.py` and `sim_tfidf.py` are for the textual similarity tasks in the paper,
* `supervised_sif_proj.sh` is for the supervised tasks in the paper.

Check these files to see the options.
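The weighting scheme the demo applies can be sketched in a few lines of NumPy: weight each word vector by `a / (a + p(w))`, where `p(w)` is the word's unigram probability, average the weighted vectors, then subtract each sentence's projection onto the first principal component of the embedding matrix. This is only an illustrative sketch under assumed inputs (a `dict` of word vectors and raw frequency counts); the repo's actual implementation is in `SIF_embedding.py`, and the names below (`sif_embeddings`, `word_freq`) are made up for the example.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """Sketch of SIF: weighted average of word vectors, then
    removal of the first principal component."""
    total = sum(word_freq.values())
    embs = []
    for sent in sentences:
        words = [w for w in sent.split() if w in word_vecs]
        # weight each word by a / (a + p(w)), p(w) = unigram probability
        weights = np.array([a / (a + word_freq[w] / total) for w in words])
        vecs = np.array([word_vecs[w] for w in words])
        embs.append(weights @ vecs / len(words))
    X = np.array(embs)
    # first right singular vector = first principal direction of X
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    # subtract the projection of each row onto u
    return X - np.outer(X @ u, u)
```

With real data the word vectors would come from the downloaded GloVe file and the frequencies from the word-count file used by the demo.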
## Source code
The code is separated into the following parts:
* SIF embedding: involves `SIF_embedding.py`. The SIF weighting scheme is very simple and is implemented in a few lines.
* textual similarity tasks: involves `data_io.py`, `eval.py`, and `sim_algo.py`. `data_io` provides the code for reading the data, `eval` evaluates the performance, and `sim_algo` provides the code for our sentence embedding algorithm.
* supervised tasks: involves `data_io.py`, `eval.py`, `train.py`, `proj_model_sim.py`, and `proj_model_sentiment.py`. `train` provides the entry point for training the models (`proj_model_sim` is for the similarity and entailment tasks; `proj_model_sentiment` is for the sentiment task). Check `train.py` to see the options.
* utilities: includes `lasagne_average_layer.py`, `params.py`, and `tree.py`. These provide utility functions/classes for the above two parts.

## References
For technical details and full experimental results, see [the paper](https://openreview.net/forum?id=SyK00v5xx).
```
@inproceedings{arora2017asimple,
  author    = {Sanjeev Arora and Yingyu Liang and Tengyu Ma},
  title     = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings},
  booktitle = {International Conference on Learning Representations},
  year      = {2017}
}
```