{"id":13935804,"url":"https://github.com/giacbrd/ShallowLearn","last_synced_at":"2025-07-19T21:30:41.483Z","repository":{"id":57466595,"uuid":"70350749","full_name":"giacbrd/ShallowLearn","owner":"giacbrd","description":"An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.","archived":false,"fork":false,"pushed_at":"2017-08-08T22:46:45.000Z","size":550,"stargazers_count":198,"open_issues_count":17,"forks_count":29,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-06-12T04:56:08.957Z","etag":null,"topics":["fasttext","gensim","machine-learning","neural-network","online-learning","scikit-learn","shallow-learning","supervised-learning","text-classification","text-mining","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/giacbrd.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-10-08T18:44:11.000Z","updated_at":"2024-07-03T01:14:53.000Z","dependencies_parsed_at":"2022-09-19T11:21:16.293Z","dependency_job_id":null,"html_url":"https://github.com/giacbrd/ShallowLearn","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/giacbrd/ShallowLearn","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FShallowLearn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FShallowLearn/tags","releases_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FShallowLearn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FShallowLearn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/giacbrd","download_url":"https://codeload.github.com/giacbrd/ShallowLearn/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giacbrd%2FShallowLearn/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265002961,"owners_count":23696109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fasttext","gensim","machine-learning","neural-network","online-learning","scikit-learn","shallow-learning","supervised-learning","text-classification","text-mining","word-embeddings","word2vec"],"created_at":"2024-08-07T23:02:06.492Z","updated_at":"2025-07-19T21:30:41.477Z","avatar_url":"https://github.com/giacbrd.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"ShallowLearn\n============\nA collection of supervised learning models based on shallow neural network approaches (e.g., word2vec and fastText)\nwith some additional exclusive features.\nWritten in Python and fully compatible with `scikit-learn \u003chttp://scikit-learn.org\u003e`_.\n\n**Discussion group** for users and developers: https://groups.google.com/d/forum/shallowlearn\n\n.. image:: https://travis-ci.org/giacbrd/ShallowLearn.svg?branch=master\n    :target: https://travis-ci.org/giacbrd/ShallowLearn\n.. 
image:: https://img.shields.io/pypi/v/shallowlearn.svg\n    :target: https://pypi.python.org/pypi/ShallowLearn\n\nGetting Started\n---------------\nInstall the latest version:\n\n.. code:: shell\n\n    pip install cython\n    pip install shallowlearn\n\nImport models from ``shallowlearn.models``; they implement the standard methods for supervised learning in scikit-learn,\ne.g., ``fit(X, y)``, ``predict(X)``, ``predict_proba(X)``, etc.\n\nData is raw text: each sample in the iterable ``X`` is a list of tokens (words of a document),\nwhile each element in the iterable ``y`` (corresponding to an element in ``X``) can be a single label or a list in case\nof a multi-label training set. ``y`` must be the same size as ``X``.\n\nModels\n------\n\nGensimFastText\n~~~~~~~~~~~~~~\n**Choose this model if your goal is classification with fastText!** (it is going to be the most stable and feature-rich)\n\nA supervised learning model based on the fastText algorithm [1]_.\nThe code is mostly taken and rewritten from `Gensim \u003chttps://radimrehurek.com/gensim\u003e`_,\nand it takes advantage of its optimizations (e.g. Cython) and support.\n\nIt is possible to choose the Softmax loss function (default) or one of its two \"approximations\":\nHierarchical Softmax and Negative Sampling. 
\n\nThe parameter ``bucket`` configures the feature hashing space, i.e., the *hashing trick* described in [1]_.\nUsing the hashing trick together with ``partial_fit(X, y)`` yields a powerful *online* text classifier (see `Online learning`_).\n\nIt is possible to load pre-trained word vectors at initialization,\npassing a Gensim ``Word2Vec`` or a ShallowLearn ``LabeledWord2Vec`` instance (the latter is retrievable from a\n``GensimFastText`` model by the attribute ``classifier``).\nWith method ``fit_embeddings(X)`` it is possible to pre-train word vectors, using the current parameter values of the model.\n\nConstructor argument names are a mix between the ones of Gensim and the ones of fastText (see this `class docstring \u003chttps://github.com/giacbrd/ShallowLearn/blob/master/shallowlearn/models.py#L74\u003e`_).\n\n.. code:: python\n\n    \u003e\u003e\u003e from shallowlearn.models import GensimFastText\n    \u003e\u003e\u003e clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)\n    \u003e\u003e\u003e clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])\n    \u003e\u003e\u003e clf.predict([('tall', 'am', 'i')])\n    ['yes']\n\nFastText\n~~~~~~~~\nThe supervised algorithm of fastText implemented in `fastText.py \u003chttps://github.com/salestock/fastText.py\u003e`_ ,\nwhich exposes an interface on the original C++ code.\nThe current advantages of this class over ``GensimFastText`` are the *subwords* and the *n-gram features* implemented\nvia the *hashing trick*.\nThe constructor arguments are equivalent to the original `supervised model\n\u003chttps://github.com/salestock/fastText.py#supervised-model\u003e`_, except for ``input_file``, ``output`` and\n``label_prefix``.\n\n**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.2),\nso data passed to ``fit(X, y)`` will be written in temporary files on disk.\n\n.. 
code:: python\n\n    \u003e\u003e\u003e from shallowlearn.models import FastText\n    \u003e\u003e\u003e clf = FastText(dim=100, min_count=0, loss='hs', epoch=3, bucket=5, word_ngrams=2)\n    \u003e\u003e\u003e clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])\n    \u003e\u003e\u003e clf.predict([('tall', 'am', 'i')])\n    ['yes']\n\nDeepInverseRegression\n~~~~~~~~~~~~~~~~~~~~~\n*TODO*: Based on https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score\n\nDeepAveragingNetworks\n~~~~~~~~~~~~~~~~~~~~~\n*TODO*: Based on https://github.com/miyyer/dan\n\nExclusive Features\n------------------\nUpcoming features will be listed as issues on GitHub; for now:\n\nPersistence\n~~~~~~~~~~~\nAny model can be serialized and de-serialized with the two methods ``save`` and ``load``.\nThey overload the `SaveLoad \u003chttps://radimrehurek.com/gensim/utils.html#gensim.utils.SaveLoad\u003e`_ interface of Gensim,\nso it is possible to control the disk usage of the models, instead of simply *pickling* the objects.\nThe original interface also supports compression of the serialized output.\n\n``save`` may create multiple files with names prefixed by the name given to the serialized model.\n\n.. 
code:: python\n\n    \u003e\u003e\u003e from shallowlearn.models import GensimFastText\n    \u003e\u003e\u003e clf = GensimFastText(size=100, min_count=0, loss='hs', iter=3, seed=66)\n    \u003e\u003e\u003e clf.save('./model') # it also creates ./model.CLF\n    \u003e\u003e\u003e loaded = GensimFastText.load('./model')\n\nBenchmarks\n----------\n\nText classification\n~~~~~~~~~~~~~~~~~~~\n\nThe script ``scripts/document_classification_20newsgroups.py`` refers to this\n`scikit-learn example \u003chttp://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html\u003e`_\nin which text classifiers are compared on a reference dataset;\nwe added our models to the comparison.\n**The current results, even if still preliminary, are comparable with other\napproaches, achieving the best performance in speed**.\n\nResults as of release `0.0.5 \u003chttps://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5\u003e`_,\nwith the *chi2_select* option set to 80%.\nThe times include the *tf-idf* vectorization for the “classic” classifiers, and the I/O operations for the\ntraining of fastText.py.\nThe evaluation measure is *macro F1*.\n\n.. 
image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/benchmark.svg\n    :alt: Text classifiers comparison\n    :width: 888 px\n    :align: center\n\nOnline learning\n~~~~~~~~~~~~~~~\n\nThe script ``scripts/plot_out_of_core_classification.py`` computes a benchmark on some scikit-learn classifiers which are able to\nlearn incrementally,\na batch of examples at a time.\nThese classifiers can learn online by using the scikit-learn method ``partial_fit(X, y)``.\nThe `original example \u003chttp://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html\u003e`_\ndescribes the approach through feature hashing, which we set with parameter ``bucket``.\n\n**The results are decent but there is room for improvement**.\nWe configure our classifier with ``iter=1, size=100, alpha=0.1, sample=0, min_count=0``, so as to keep the model fast and\nsmall, and to avoid cutting off words from the few samples we have.\n\n.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/onlinelearning.svg\n    :alt: Online learning\n    :width: 700 px\n    :align: center\n\nReferences\n----------\n.. [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgiacbrd%2FShallowLearn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgiacbrd%2FShallowLearn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgiacbrd%2FShallowLearn/lists"}