https://github.com/kostyaev/sentence2vec

Deep sentence embedding using Sequence to Sequence learning

cuda sentence2vec seq2seq torch

Host: GitHub
URL: https://github.com/kostyaev/sentence2vec
Owner: kostyaev
License: mit
Created: 2016-05-26T14:42:23.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2017-01-04T10:11:07.000Z (about 8 years ago)
Last Synced: 2024-07-31T05:10:38.379Z (6 months ago)
Topics: cuda, sentence2vec, seq2seq, torch
Language: Jupyter Notebook
Homepage:
Size: 103 KB
Stars: 22
Watchers: 3
Forks: 10
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Deep sentence embedding using Sequence to Sequence learning

![screenshot](images/2d_pca_projection.png)

## Installing

1. [Install Torch](http://torch.ch/docs/getting-started.html).
2. Install the following additional Lua libs:

```sh
luarocks install nn
luarocks install rnn
luarocks install penlight
```

To train with CUDA, install the latest CUDA drivers and toolkit, then run:

```sh
luarocks install cutorch
luarocks install cunn
```

To train with OpenCL, install the latest OpenCL Torch libraries:

```sh
luarocks install cltorch
luarocks install clnn
```

3. Download the [Cornell Movie-Dialogs Corpus](http://www.mpi-sws.org/~cristian/Cornell_Movie-Dialogs_Corpus.html) and extract all the files into `data/cornell_movie_dialogs`.

## Training

```sh
th train.lua [-h / options]
```

Use the `--dataset NUMBER` option to control the size of the dataset. Training on the full dataset takes about 5 hours for a single epoch.

The model is saved to `data/model.t7` after every epoch in which the error has decreased.
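
The save-on-improvement rule can be sketched as follows. This is an illustration in Python, not the repository's Lua training code: the function name `epochs_to_checkpoint` is hypothetical, and it assumes that "improved" means strictly lower than the best error seen so far, so the first epoch always triggers a save.

```python
def epochs_to_checkpoint(epoch_errors):
    """Return the 1-based epochs after which a checkpoint would be written.

    Assumption: a checkpoint is written only when the epoch error is
    strictly lower than the best error observed so far.
    """
    best = float("inf")
    saved = []
    for epoch, error in enumerate(epoch_errors, start=1):
        if error < best:
            best = error          # remember the new best error
            saved.append(epoch)   # this is where the model file would be overwritten
    return saved


if __name__ == "__main__":
    # Made-up error values, for illustration only.
    print(epochs_to_checkpoint([3.0, 2.5, 2.7, 2.1]))
```

Under this rule, an epoch whose error rises leaves the previously saved model untouched, so the file on disk always holds the best model seen so far.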

## Getting a pretrained model
Download:

1. The pretrained [model.t7](https://drive.google.com/file/d/0BwsDa5L6bdMpTC1GUEtPbWE2Zms/view?usp=sharing)
2. Vocabulary [vocab.t7](https://drive.google.com/file/d/0BwsDa5L6bdMpQV9zOTRhZlNPWG8/view?usp=sharing)

Put them into the `data` directory.
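
Before running the extraction step, it may help to confirm that both files are in place. The snippet below is a minimal sketch in Python; the helper name `missing_files` is hypothetical and not part of this repository. It only checks that the files exist, not that they are valid Torch files.

```python
from pathlib import Path


def missing_files(data_dir="data", names=("model.t7", "vocab.t7")):
    """Return the names from `names` that are not regular files in `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing from data/: " + ", ".join(missing))
    else:
        print("Pretrained model and vocabulary found.")
```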

## Extracting embeddings from sentences
Run the following command:
```sh
th -i extract_embeddings.lua --model_file data/model.t7 --input_file data/test_sentences.txt --output_file data/embeddings.t7 --cuda
```

To visualize 2D projections of the embeddings, refer to [example.ipynb](https://github.com/kostyaev/sentence2vec/blob/master/example.ipynb).
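
For readers who want to see what a 2D PCA projection involves without opening the notebook, here is a dependency-free sketch. It is not the notebook's code: it assumes the embeddings have already been loaded into Python as a list of equal-length lists of floats (reading the `.t7` file is out of scope here), and the function name `pca_2d` is hypothetical. It finds the top two principal components by power iteration, which is a simple stand-in for the SVD-based PCA that numerical libraries usually provide.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_2d(vectors, iterations=200, seed=0, tol=1e-12):
    """Project row vectors onto their top two principal components."""
    n, dim = len(vectors), len(vectors[0])
    # Center the data so the components describe variance around the mean.
    mean = [sum(row[j] for row in vectors) / n for j in range(dim)]
    centered = [[row[j] - mean[j] for j in range(dim)] for row in vectors]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iterations):
            # Apply X^T X to v without forming the covariance matrix.
            scores = [_dot(row, v) for row in centered]
            v = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(dim)]
            # Remove any part along components that were already found.
            for c in components:
                overlap = _dot(v, c)
                v = [a - overlap * b for a, b in zip(v, c)]
            norm = math.sqrt(_dot(v, v))
            if norm < tol:
                # No variance left in any remaining direction.
                v = [0.0] * dim
                break
            v = [a / norm for a in v]
        components.append(v)
    # Coordinates of each centered row along the two components.
    return [[_dot(row, c) for c in components] for row in centered]


if __name__ == "__main__":
    # Toy 3D points standing in for sentence embeddings.
    toy = [[0, 0, 5], [2, 0, 5], [0, 1, 5], [2, 1, 5]]
    for point in pca_2d(toy):
        print(point)
```

The sign of each component is arbitrary, so a plot may appear mirrored between runs with different seeds. For real embeddings with hundreds of dimensions, a library PCA will be much faster than this pure-Python loop.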

## Acknowledgments
This implementation uses code from [Marc-André Cournoyer's repo](https://github.com/macournoyer/neuralconvo).

## License
MIT License