https://github.com/talmago/simple-but-tough-to-beat-examples

Bunch of examples of a "Simple but tough to beat baseline for sentence embeddings" in classification tasks

Host: GitHub
URL: https://github.com/talmago/simple-but-tough-to-beat-examples
Owner: talmago
License: mit
Created: 2020-01-15T16:55:25.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2020-01-21T16:37:45.000Z (almost 6 years ago)
Last Synced: 2025-03-29T22:11:45.990Z (7 months ago)
Topics: fake-news-classification, fasttext, fasttext-python, imdb-dataset, machine-learning, nlp, sentence-embeddings, sentence2vec, w2v, word-embeddings, word2vec
Language: Python
Homepage:
Size: 33.2 KB
Stars: 5
Watchers: 3
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

## simple-but-tough-to-beat-examples

A set of examples demonstrating how "simple but tough to beat" sentence embeddings perform on different downstream tasks. For simplicity and speed, we use `FastText` to learn word representations from a corpus and build the baseline on top of them.
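
For readers unfamiliar with the baseline, the method of Arora et al. [1] can be sketched roughly as follows: each sentence embedding is a weighted average of its word vectors, where a word with estimated unigram probability p(w) gets weight a / (a + p(w)), and the projection onto the first principal direction of the resulting sentence matrix is then subtracted. This is a minimal sketch, not the code in this repository's `encoder` module. The function name `sif_embeddings`, the random toy word vectors, and estimating p(w) from the input sentences themselves are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def sif_embeddings(sentences, word_vectors, a=1e-3):
    """Weighted average of word vectors, then common component removal.

    sentences: list of token lists.
    word_vectors: dict mapping token -> 1-D numpy array.
    a: small smoothing constant in the weight a / (a + p(w)).
    """
    # Assumption: unigram probabilities are estimated from the input itself.
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())

    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        known = [tok for tok in sent if tok in word_vectors]
        if not known:
            continue  # sentences without known tokens keep an all-zero row
        weights = np.array([a / (a + counts[tok] / total) for tok in known])
        vectors = np.stack([word_vectors[tok] for tok in known])
        # Frequent words get small weights, rare words get weights close to 1.
        emb[i] = weights @ vectors / len(known)

    # Common component removal: subtract the projection of every sentence
    # vector onto the first right singular vector of the sentence matrix.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)


# Toy usage with random stand-in word vectors (not real FastText vectors).
rng = np.random.default_rng(0)
vocab = ["this", "is", "a", "another", "sentence"]
toy_vectors = {word: rng.normal(size=8) for word in vocab}
toy_corpus = [s.split() for s in ["this is a sentence",
                                  "this is another sentence",
                                  "a sentence"]]
toy_embeddings = sif_embeddings(toy_corpus, toy_vectors)
```

The result has one row per sentence. Because the dominant direction has been removed, the rows no longer share the component that frequent words tend to push every sentence towards.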

## Quick Setup

Install the dependencies:

```sh
pip install -U pipenv && pipenv install --dev
```

Add the repository to `$PYTHONPATH`:

```sh
export PYTHONPATH=$PYTHONPATH:/path/to/simple-but-tough-to-beat-examples
```

## Usage

Load pre-trained word vectors from a `.vec` file:

```python
from encoder import build_from_w2v_path

sentence_encoder = build_from_w2v_path('wiki-news-300d-1M-subword.vec')
```

Load a pre-trained `FastText` model:

```python
from encoder import build_from_fasttext_bin

sentence_encoder = build_from_fasttext_bin('cc.en.300.bin')
```

Fine-tune on a corpus:

```python
# `corpus` is a list of sentence strings
sentence_encoder.fit(corpus)
```

Transform sentences to embeddings:

```python
corpus = [
    'this is a sentence',
    'this is another sentence',
    ...
]

sentence_encoder.fit_transform(corpus)
```
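
Once sentences are embedded, a common next step is to compare them. The sketch below computes pairwise cosine similarity with NumPy. It assumes the encoder output can be treated as a 2-D array with one row per sentence, which this README does not state explicitly. The helper name `cosine_similarity_matrix` and the small hand-written array standing in for real embeddings are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip the norms so that all-zero rows do not cause a division by zero.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Stand-in for the array returned by the encoder (assumed shape: n x dim).
embeddings = np.array([[1.0, 0.0, 1.0],
                       [1.0, 0.0, 0.9],
                       [0.0, 1.0, 0.0]])
similarity = cosine_similarity_matrix(embeddings)
```

Entry `similarity[i, j]` is close to 1 when sentences `i` and `j` point in nearly the same direction and close to 0 when they are orthogonal.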

## Examples

  

  - [Fake News Classification](https://github.com/talmago/simple-but-tough-to-beat-examples/blob/master/examples/fake_news.ipynb) - Following this [blog post](https://towardsdatascience.com/using-use-universal-sentence-encoder-to-detect-fake-news-dfc02dc32ae9), we show that the baseline performs roughly on par with the Universal Sentence Encoder at classifying fake news.

  - [IMDB Review Sentiment Analysis](https://github.com/talmago/simple-but-tough-to-beat-examples/blob/master/examples/imdb.ipynb) - We show that the baseline encoder reaches ~90% accuracy (state of the art: [XLNet](http://nlpprogress.com/english/sentiment_analysis.html)).
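
Both examples follow the same pattern: sentence embeddings serve as fixed-length feature vectors for a classifier. The sketch below illustrates that pattern with a nearest-centroid classifier written in NumPy. This classifier is a stand-in chosen to keep the example dependency-light, and the notebooks may well use a different model. The function names and the synthetic "embeddings" and labels are hypothetical and do not come from the notebooks.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the class labels and the mean embedding of each class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    labels = np.unique(y)
    centroids = np.stack([X[y == label].mean(axis=0) for label in labels])
    return labels, centroids


def predict(X, labels, centroids):
    """Assign each row of X to the class whose centroid is closest."""
    X = np.asarray(X, dtype=float)
    # Euclidean distance from every row to every centroid: shape (n, classes).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[distances.argmin(axis=1)]


# Synthetic stand-ins for sentence embeddings and binary labels.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1.0, 0.5, size=(20, 4)),
                     rng.normal(1.0, 0.5, size=(20, 4))])
y_train = np.array([0] * 20 + [1] * 20)

labels, centroids = fit_centroids(X_train, y_train)
accuracy = (predict(X_train, labels, centroids) == y_train).mean()
```

With real data, `X_train` would hold the encoder output for the training sentences, and accuracy would be measured on a held-out split rather than on the training set.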

## References

[1] Sanjeev Arora, Yingyu Liang, Tengyu Ma, [*A Simple but Tough-to-Beat Baseline for Sentence Embeddings*](https://openreview.net/forum?id=SyK00v5xx)

```
@inproceedings{arora2017asimple,
  author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma},
  title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings},
  booktitle = {International Conference on Learning Representations},
  year = {2017}
}
```

[2] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)

```
@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}
```
