ShallowLearn
============
A collection of supervised learning models based on shallow neural network approaches (e.g., word2vec and fastText)
with some additional exclusive features.
Written in Python and fully compatible with `scikit-learn <https://scikit-learn.org>`_.

**Discussion group** for users and developers: https://groups.google.com/d/forum/shallowlearn

.. image:: https://travis-ci.org/giacbrd/ShallowLearn.svg?branch=master
    :target: https://travis-ci.org/giacbrd/ShallowLearn
.. image:: https://img.shields.io/pypi/v/shallowlearn.svg
    :target: https://pypi.python.org/pypi/ShallowLearn

Getting Started
---------------
Install the latest version:

.. code:: shell

    pip install cython
    pip install shallowlearn

Import the models from ``shallowlearn.models``; they implement the standard scikit-learn methods for supervised
learning, e.g., ``fit(X, y)``, ``predict(X)``, ``predict_proba(X)``, etc.

Data is raw text: each sample in the iterable ``X`` is a list of tokens (the words of a document),
while each element in the iterable ``y`` (corresponding to an element in ``X``) can be a single label,
or a list of labels in the multi-label case. Obviously, ``y`` must be of the same size as ``X``.
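For example, a minimal sketch of the expected input format (the tokens and labels here are purely illustrative):

.. code:: python

    >>> X = [['the', 'cat', 'sat'], ['dogs', 'bark', 'loudly']]  # each sample is a list of tokens
    >>> y = ['feline', 'canine']  # one label per sample
    >>> y_multi = [['feline', 'pet'], ['canine', 'pet']]  # multi-label variant: a list of labels per sample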

Models
------

GensimFastText
~~~~~~~~~~~~~~
**Choose this model if your goal is classification with fastText!** (it is going to be the most stable and feature-rich option)

A supervised learning model based on the fastText algorithm [1]_.
The code is mostly taken and rewritten from `Gensim <https://radimrehurek.com/gensim/>`_;
it takes advantage of its optimizations (e.g., Cython) and support.

It is possible to choose the Softmax loss function (default) or one of its two "approximations":
Hierarchical Softmax and Negative Sampling.

The parameter ``bucket`` configures the feature hashing space, i.e., the *hashing trick* described in [1]_.
Using the hashing trick together with ``partial_fit(X, y)`` yields a powerful *online* text classifier (see `Online learning`_).
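A minimal sketch of this usage, with toy data and an arbitrary ``bucket`` value (``partial_fit`` and ``bucket``
are the interfaces described above):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, bucket=100000)
    >>> # each call updates the model with a new batch, without retraining from scratch
    >>> clf.partial_fit([('i', 'am', 'tall')], ['yes'])
    >>> clf.partial_fit([('you', 'are', 'fat')], ['no'])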

It is possible to load pre-trained word vectors at initialization,
by passing a Gensim ``Word2Vec`` or a ShallowLearn ``LabeledWord2Vec`` instance (the latter can be retrieved from a
``GensimFastText`` model via the ``classifier`` attribute).
With the method ``fit_embeddings(X)`` it is possible to pre-train word vectors, using the current parameter values of the model.
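For instance, a possible pre-training workflow (a sketch based on the methods described above, with a toy corpus):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, iter=3, seed=66)
    >>> # pre-train word vectors on (unlabeled) documents, with the current parameter values of the model
    >>> clf.fit_embeddings([('i', 'am', 'tall'), ('you', 'are', 'fat')])
    >>> # then train the classifier as usual
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])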

Constructor argument names are a mix of the Gensim ones and the fastText ones (see the ``GensimFastText`` class docstring).

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']

FastText
~~~~~~~~
The supervised algorithm of fastText implemented in `fastText.py <https://github.com/salestock/fastText.py>`_,
which exposes an interface over the original C++ code.
The current advantages of this class over ``GensimFastText`` are the *subwords* and the *n-gram features* implemented
via the *hashing trick*.
The constructor arguments are equivalent to those of the original supervised model, except for ``input_file``,
``output`` and ``label_prefix``.

**WARNING**: The only way to load datasets in fastText.py is through the filesystem (as of version 0.8.2),
so data passed to ``fit(X, y)`` will be written to temporary files on disk.

.. code:: python

    >>> from shallowlearn.models import FastText
    >>> clf = FastText(dim=100, min_count=0, loss='hs', epoch=3, bucket=5, word_ngrams=2)
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']

DeepInverseRegression
~~~~~~~~~~~~~~~~~~~~~
*TODO*: Based on https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score

DeepAveragingNetworks
~~~~~~~~~~~~~~~~~~~~~
*TODO*: Based on https://github.com/miyyer/dan

Exclusive Features
------------------
Upcoming features will be listed as issues on GitHub; for now:

Persistence
~~~~~~~~~~~
Any model can be serialized and de-serialized with the two methods ``save`` and ``load``.
They overload the `SaveLoad <https://radimrehurek.com/gensim/utils.html#gensim.utils.SaveLoad>`_ interface of Gensim,
so it is possible to control the disk usage of the models, instead of simply *pickling* the objects.
The original interface also allows compressing the serialized outputs.

``save`` may create multiple files with names prefixed by the name given to the serialized model.

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)
    >>> clf.save('./model')  # it also creates ./model.CLF
    >>> loaded = GensimFastText.load('./model')

Benchmarks
----------

Text classification
~~~~~~~~~~~~~~~~~~~

The script ``scripts/document_classification_20newsgroups.py`` is based on this
`scikit-learn example <https://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html>`_,
in which text classifiers are compared on a reference dataset;
we added our models to the comparison.
**The current results, although still preliminary, are comparable with the other approaches, while achieving the best
performance in speed**.

Results as of release `0.0.5 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5>`_,
with the *chi2_select* option set to 80%.
The times take into account the *tf-idf* vectorization in the "classic" classifiers, and the I/O operations for the
training of fastText.py.
The evaluation measure is *macro F1*.

.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/benchmark.svg
    :alt: Text classifiers comparison
    :width: 888 px
    :align: center

Online learning
~~~~~~~~~~~~~~~

The script ``scripts/plot_out_of_core_classification.py`` benchmarks some scikit-learn classifiers that are able to
learn incrementally, a batch of examples at a time.
These classifiers can learn online via the scikit-learn method ``partial_fit(X, y)``.
The `original example <https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html>`_
describes the approach through feature hashing, which we configure with the ``bucket`` parameter.

**The results are decent, but there is room for improvement**.
We configure our classifier with ``iter=1, size=100, alpha=0.1, sample=0, min_count=0``, so as to keep the model fast
and small, and to avoid cutting off words from the few samples we have.
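As a sketch of this setup (the mini-batch iterable and the ``bucket`` value here are illustrative, not the exact
benchmark code):

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(iter=1, size=100, alpha=0.1, sample=0, min_count=0, bucket=2 ** 21)
    >>> for X_batch, y_batch in batches:  # any iterable of (tokens, labels) mini-batches
    ...     clf.partial_fit(X_batch, y_batch)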

.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/onlinelearning.svg
    :alt: Online learning
    :width: 700 px
    :align: center

References
----------
.. [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification