
Fast Sentence Embeddings
==================================

Fast Sentence Embeddings is a Python library that serves as an addition to Gensim. This library is intended to compute *sentence vectors* for large collections of sentences or documents with as little hassle as possible:

```
from fse import Vectors, Average, IndexedList

vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
model = Average(vecs)

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

model.train(IndexedList(sentences))

model.sv.similarity(0, 1)
```

If you want to support fse, take a quick [survey](https://forms.gle/8uSU323fWUVtVwcAA) to improve it.

Audience
------------

This package builds upon Gensim and is intended to compute sentence/paragraph vectors for large corpora. Use this package if:
- (Sentence) Transformers are too slow
- Your dataset is too large for existing solutions (e.g., spaCy)
- Using GPUs is not an option

The average (online) inference time for a well-optimized (and batched) sentence transformer is around 1–10 ms per sentence. If that is not fast enough and you are willing to sacrifice a bit of quality, this is your package.

Features
------------

Find the corresponding blog post(s) here (code may be outdated):

- [Visualizing 100,000 Amazon Products](https://towardsdatascience.com/vis-amz-83dea6fcb059)
- [Sentence Embeddings. Fast, please!](https://towardsdatascience.com/fse-2b1ffa791cf9)

**fse** implements three algorithms for sentence embeddings. You can choose
between *unweighted sentence averages*, *smooth inverse frequency averages*, and *unsupervised smooth inverse frequency averages*.
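
As a minimal sketch of how the three model classes line up (an assumption: `SIF` and `uSIF` are importable from `fse.models` alongside `Average`; the `components` and `length` parameters mirror those listed in the Results table below):

```
from fse import Vectors
from fse.models import Average, SIF, uSIF  # import path is an assumption

vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")

avg = Average(vecs)             # unweighted sentence averages
sif = SIF(vecs, components=10)  # smooth inverse frequency averages
usif = uSIF(vecs, length=11)    # unsupervised smooth inverse frequency averages
```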

Key features of **fse** are:

**[X]** Up to 500,000 sentences/second (1)

**[X]** Provides HUB access to various pre-trained models for convenience

**[X]** Supports Average, SIF, and uSIF Embeddings

**[X]** Full support for Gensim's Word2Vec and all other compatible classes

**[X]** Full support for Gensim's FastText with out-of-vocabulary words

**[X]** Induction of word frequencies for pre-trained embeddings

**[X]** Incredibly fast Cython core routines

**[X]** Dedicated input file formats for easy usage (including disk streaming)

**[X]** RAM-to-disk training for large corpora

**[X]** Disk-to-disk training for even larger corpora

**[X]** Many fail-safe checks for easy usage

**[X]** Simple interface for developing your own models

**[X]** Extensive documentation of all functions

**[X]** Optimized Input Classes

(1) Throughput may vary significantly from system to system (e.g., when swap memory is used) and depends on preprocessing.
I regularly observe 300k–500k sentences/s for preprocessed data on my MacBook (2016).
Visit **Tutorial.ipynb** for an example.
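
To illustrate the dedicated input classes, here is a hedged sketch; the class names `SplitIndexedList` and `IndexedLineDocument` follow the tutorial's naming and should be treated as assumptions if your version differs:

```
from fse import SplitIndexedList, IndexedLineDocument  # names assumed from the tutorial

# For raw, untokenized strings: splits on whitespace lazily to save RAM,
# yielding (words, index) pairs the models can train on.
s = SplitIndexedList(["cat say meow", "dog say woof"])

# For corpora that do not fit into RAM: streams one sentence per line from disk.
doc = IndexedLineDocument("corpus.txt")  # hypothetical file path
```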

Installation
------------

This software depends on NumPy, SciPy, scikit-learn, Gensim, and wordfreq.
You must have them installed prior to installing fse.

As with Gensim, it is recommended that you install a BLAS library before installing fse.

The simple way to install **fse** is:

```
pip install -U fse
```

In case you want to build from source, just run:

```
python setup.py install
```

If building the Cython extension fails (you will be notified), try:

```
pip install -U git+https://github.com/oborchers/Fast_Sentence_Embeddings
```

Usage
-------------

Using pre-trained models with **fse** is easy: just download them from the hub.
They are stored locally, so you can re-use them later.

```
from fse import Vectors, Average, IndexedList

# Downloads the vectors on first use and caches them locally.
vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
model = Average(vecs)

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

# IndexedList wraps the sentences with the indices the model trains on.
model.train(IndexedList(sentences))

# Cosine similarity between sentence 0 and sentence 1.
model.sv.similarity(0, 1)
```

If your vectors are large and you don't have a lot of RAM, you can supply the `mmap` argument as follows to read the vectors from disk instead of loading them into RAM:

```
Vectors.from_pretrained("glove-wiki-gigaword-50", mmap="r")
```

To check which vectors are on the hub, please check: https://huggingface.co/fse. For example, you will find:
- glove-twitter-25
- glove-twitter-50
- glove-twitter-100
- glove-twitter-200
- glove-wiki-gigaword-100
- glove-wiki-gigaword-300
- word2vec-google-news-300
- paragram-25
- paranmt-300
- paragram-300-sl999
- paragram-300-ws353
- fasttext-wiki-news-subwords-300
- fasttext-crawl-subwords-300 (Use with `FTVectors`)

In order to use **fse** with a custom model, you must first train a Gensim model that contains a
gensim.models.keyedvectors.KeyedVectors instance, for example *Word2Vec* or *FastText*. Then you can proceed to compute sentence embeddings for a corpus as follows:

```
from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, vector_size=10)

from fse import Average, IndexedList
model = Average(ft)
model.train(IndexedList(sentences))

model.sv.similarity(0, 1)
```
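
For sentences that were not part of training, the changelog mentions an `infer` method. Continuing the example above, a minimal sketch; the exact signature is an assumption and may differ between versions:

```
# Assumption: `infer` accepts the same (words, index) tuples as `train`
# and returns the embeddings without adding them to model.sv.
unseen = [(["bird", "say", "tweet"], 0)]
vec = model.infer(unseen)
```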

fse offers multi-thread support out of the box. However, for most applications a *single thread will most likely be sufficient*.
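
If you do want more threads, a hedged sketch, assuming the constructors expose a `workers` argument mirroring Gensim's convention:

```
from fse import Vectors, Average

vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
# Assumption: a `workers` keyword controls the thread count, as in Gensim.
model = Average(vecs, workers=2)
```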

Additional Information
-------------

Within the notebooks folder, you can find the following guides:

**Tutorial.ipynb** offers a detailed walk-through of some of the most important functions fse has to offer.

**STS-Benchmarks.ipynb** contains an example of how to use the library with pre-trained models to
replicate the STS Benchmark results [4] reported in the papers.

**Speed Comparision.ipynb** compares the speed of the NumPy and the Cython routines.

In order to use an **fse** model, you first need a pre-trained Gensim
word embedding model, which **fse** then uses to compute the sentence embeddings.

After computing sentence embeddings, you can use them in supervised or
unsupervised NLP applications, as they serve as a formidable baseline.

The models presented are based on
- Deep-averaging embeddings [1]
- Smooth inverse frequency embeddings [2] (see the formula below)
- Unsupervised smooth inverse frequency embeddings [3]
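
For reference, the SIF construction from [2] computes a weighted average of word vectors and then removes the projection onto the first principal component of the sentence matrix:

```
v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w,
\qquad \tilde{v}_s = v_s - u u^\top v_s
```

where p(w) is the unigram probability of word w, a is a smoothing constant (around 1e-3 in [2]), and u is the first singular vector of the matrix of all sentence vectors.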

Credits to Radim Řehůřek and all contributors for the **awesome** library
and code that [Gensim](https://github.com/RaRe-Technologies/gensim) provides. A whole lot of the code in this library is based on Gensim.

To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi

Results
------------

Model | Vectors | Parameters | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
:---: | :---: | :---: | :---:
`CBOW` | `paranmt-300` | | 79.82
`uSIF` | `paranmt-300` | length=11 | 79.00
`SIF-10` | `paranmt-300` | components=10 | 76.72
`SIF-10` | `paragram-300-sl999` | components=10 | 74.21
`SIF-10` | `paragram-300-ws353` | components=10 | 74.03
`SIF-10` | `fasttext-crawl-subwords-300` | components=10 | 73.38
`uSIF` | `paragram-300-sl999` | length=11 | 73.04
`SIF-10` | `fasttext-wiki-news-subwords-300` | components=10 | 72.29
`uSIF` | `paragram-300-ws353` | length=11 | 71.84
`SIF-10` | `glove-twitter-200` | components=10 | 71.62
`SIF-10` | `glove-wiki-gigaword-300` | components=10 | 71.35
`SIF-10` | `word2vec-google-news-300` | components=10 | 71.12
`SIF-10` | `glove-wiki-gigaword-200` | components=10 | 70.62
`SIF-10` | `glove-twitter-100` | components=10 | 69.65
`uSIF` | `fasttext-crawl-subwords-300` | length=11 | 69.40
`uSIF` | `fasttext-wiki-news-subwords-300` | length=11 | 68.63
`SIF-10` | `glove-wiki-gigaword-100` | components=10 | 68.34
`uSIF` | `glove-wiki-gigaword-300` | length=11 | 67.60
`uSIF` | `glove-wiki-gigaword-200` | length=11 | 67.11
`uSIF` | `word2vec-google-news-300` | length=11 | 66.99
`uSIF` | `glove-twitter-200` | length=11 | 66.67
`SIF-10` | `glove-twitter-50` | components=10 | 65.52
`uSIF` | `glove-wiki-gigaword-100` | length=11 | 65.33
`uSIF` | `paragram-25` | length=11 | 64.22
`uSIF` | `glove-twitter-100` | length=11 | 64.13
`SIF-10` | `glove-wiki-gigaword-50` | components=10 | 64.11
`uSIF` | `glove-wiki-gigaword-50` | length=11 | 62.06
`CBOW` | `word2vec-google-news-300` | | 61.54
`uSIF` | `glove-twitter-50` | length=11 | 60.41
`SIF-10` | `paragram-25` | components=10 | 59.07
`uSIF` | `glove-twitter-25` | length=11 | 55.06
`CBOW` | `paragram-300-ws353` | | 54.72
`SIF-10` | `glove-twitter-25` | components=10 | 54.16
`CBOW` | `paragram-300-sl999` | | 51.46
`CBOW` | `fasttext-crawl-subwords-300` | | 48.49
`CBOW` | `glove-wiki-gigaword-300` | | 44.46
`CBOW` | `glove-wiki-gigaword-200` | | 42.40
`CBOW` | `paragram-25` | | 40.13
`CBOW` | `glove-wiki-gigaword-100` | | 38.12
`CBOW` | `glove-wiki-gigaword-50` | | 37.47
`CBOW` | `glove-twitter-200` | | 34.94
`CBOW` | `glove-twitter-100` | | 33.81
`CBOW` | `glove-twitter-50` | | 30.78
`CBOW` | `glove-twitter-25` | | 26.15
`CBOW` | `fasttext-wiki-news-subwords-300` | | 26.08

Changelog
-------------

1.0.0:
- Added support for gensim>=4. This library is no longer compatible with gensim<4. For migration, see the Gensim [migration guide](https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4).
- `size` argument is now `vector_size`

0.2.0:
- Added `Vectors` and `FTVectors` class and hub support by `from_pretrained`
- Extended benchmark
- Fixed zero division bug for uSIF
- Moved tests out of the main folder
- Moved sts out of the main folder

0.1.17:
- Fixed a dependency issue that prevented fse from installing properly
- Updated readme
- Updated travis python versions (3.6, 3.9)

0.1.15 from 0.1.11:
- Fixed major FT Ngram computation bug
- Rewrote the input class. Turns out NamedTuple was pretty slow.
- Added further unittests
- Added documentation
- Major speed improvements
- Fixed division by zero for empty sentences
- Fixed overflow when infer method is used with too many sentences
- Fixed similar_by_sentence bug

Literature
-------------

1. Iyyer M, Manjunatha V, Boyd-Graber J, Daumé III H (2015) Deep Unordered
Composition Rivals Syntactic Methods for Text Classification. Proc. 53rd Annu.
Meet. Assoc. Comput. Linguist. 7th Int. Jt. Conf. Nat. Lang. Process., 1681–1691.

2. Arora S, Liang Y, Ma T (2017) A Simple but Tough-to-Beat Baseline for Sentence
Embeddings. Int. Conf. Learn. Represent. (Toulon, France), 1–16.

3. Ethayarajh K (2018) Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline.
Proceedings of the 3rd Workshop on Representation Learning for NLP, 91–100.

4. Eneko Agirre, Daniel Cer, Mona Diab, Iñigo Lopez-Gazpio, Lucia Specia. Semeval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. Proceedings of SemEval 2017.

Copyright
-------------

**Disclaimer**: I am working full time. Unfortunately, I have yet to find the time to add all the features I'd like to see. In particular, the API needs some overhaul.

I am looking for active contributors to keep this package alive. Please feel free to ping me if you are interested.

Author: Oliver Borchers

Copyright (C) 2022 Oliver Borchers

Citation
-------------

If you found this software useful, please cite it in your publication.

```
@misc{Borchers2019,
    author = {Borchers, Oliver},
    title = {Fast sentence embeddings},
    year = {2019},
    publisher = {GitHub},
    journal = {GitHub Repository},
    howpublished = {\url{https://github.com/oborchers/Fast_Sentence_Embeddings}},
}
```