{"id":13609767,"url":"https://github.com/shaypal5/skift","last_synced_at":"2025-04-05T16:03:44.312Z","repository":{"id":29045711,"uuid":"120085546","full_name":"shaypal5/skift","owner":"shaypal5","description":"scikit-learn wrappers for Python fastText.","archived":false,"fork":false,"pushed_at":"2022-06-07T15:07:07.000Z","size":649,"stargazers_count":233,"open_issues_count":1,"forks_count":24,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-29T15:03:16.873Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shaypal5.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-03T11:37:21.000Z","updated_at":"2024-11-22T06:45:36.000Z","dependencies_parsed_at":"2022-08-07T14:01:07.819Z","dependency_job_id":null,"html_url":"https://github.com/shaypal5/skift","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaypal5%2Fskift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaypal5%2Fskift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaypal5%2Fskift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shaypal5%2Fskift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shaypal5","download_url":"https://codeload.github.com/shaypal5/skift/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247361602,"owners_count":209
26641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:01:37.832Z","updated_at":"2025-04-05T16:03:44.287Z","avatar_url":"https://github.com/shaypal5.png","language":"Jupyter Notebook","funding_links":[],"categories":["Natural Language Processing","文本数据和NLP","Jupyter Notebook","Projects by main language"],"sub_categories":["Others","NLP","Old Projects"],"readme":"skift |skift_icon|\n##################\n|PyPI-Status| |Downloads| |PyPI-Versions| |Build-Status| |Codecov| |Codefactor| |LICENCE|\n\n.. |skift_icon| image:: https://github.com/shaypal5/skift/blob/be1f8e84d311f926fd39e8ea421525782b4cb39f/skift.png\n\n``scikit-learn`` wrappers for Python ``fastText``.\n\n.. code-block:: python\n\n  \u003e\u003e\u003e from skift import FirstColFtClassifier\n  \u003e\u003e\u003e df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)\n  \u003e\u003e\u003e sk_clf.fit(df[['txt']], df['lbl'])\n  \u003e\u003e\u003e sk_clf.predict([['woof']])\n  [0]\n\n.. contents::\n\n.. section-numbering::\n\n\nInstallation\n============\n\nDependencies:\n\n* ``numpy``\n* ``scipy``\n* ``scikit-learn``\n* The ``fasttext`` Python package\n\n.. code-block:: bash\n\n  pip install skift\n\n\nConfiguration\n=============\n\nBecause ``fasttext`` reads input data from files, ``skift`` has to dump the input data into temporary files for ``fasttext`` to use. A dedicated folder is created for those files on the filesystem.  
By default, this storage is allocated in the system temporary storage location (i.e. /tmp on \\*nix systems).  To override this default location, use the ``SKIFT_TEMP_DIR`` environment variable:\n\n.. code-block:: bash\n\n  export SKIFT_TEMP_DIR=/path/to/desired/temp/folder\n\n**NOTE:** The directory will be created if it does not already exist.\n\n\nFeatures\n========\n\n* Adheres to the ``scikit-learn`` classifier API, including ``predict_proba``.\n* Also caters to the common use case of ``pandas.DataFrame`` inputs.\n* Enables easy stacking of ``fastText`` with other types of ``scikit-learn``-compliant classifiers.\n* Pickle-able classifier objects.\n* Built around the `official fasttext Python package \u003chttps://github.com/facebookresearch/fastText/tree/master/python\u003e`_.\n* Pure python.\n* Supports Python 3.5+.\n* `Fully tested on Linux, OSX and Windows operating systems \u003chttps://travis-ci.org/shaypal5/skift\u003e`_.\n\n\nWrappers\n=========\n\n``fastText`` works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the ``fastText`` classifier use a single column as input, ignoring other columns. 
This is especially true when ``fastText`` is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.\n\n``skift`` includes several ``scikit-learn``-compatible wrappers (for the `official \u003chttps://github.com/facebookresearch/fastText/tree/master/python\u003e`_ ``fastText`` Python package) which cater to these use cases.\n\n**NOTICE:** Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the ``fastText.train_supervised`` method on every call to ``fit``.\n\nStandard wrappers\n-----------------\n\nThese wrappers make no assumptions about input beyond those commonly made by ``scikit-learn`` classifiers, i.e. that input is a 2D ``ndarray`` object.\n\n* ``FirstColFtClassifier`` - An sklearn classifier adapter for fasttext that takes the first column of input ``ndarray`` objects as input.\n\n.. code-block:: python\n\n  \u003e\u003e\u003e import pandas\n  \u003e\u003e\u003e from skift import FirstColFtClassifier\n  \u003e\u003e\u003e df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)\n  \u003e\u003e\u003e sk_clf.fit(df[['txt']], df['lbl'])\n  \u003e\u003e\u003e sk_clf.predict([['woof']])\n  [0]\n\n* ``IdxBasedFtClassifier`` - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the ``input_ix`` parameter to the constructor.\n\n.. 
code-block:: python\n\n  \u003e\u003e\u003e import pandas\n  \u003e\u003e\u003e from skift import IdxBasedFtClassifier\n  \u003e\u003e\u003e df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)\n  \u003e\u003e\u003e sk_clf.fit(df[['count', 'txt']], df['lbl'])\n  \u003e\u003e\u003e sk_clf.predict([[5, 'woof']])\n  [0]\n\n\n\npandas-dependent wrappers\n-------------------------\n\nThese wrappers assume the ``X`` parameter given to ``fit``, ``predict``, and ``predict_proba`` methods is a ``pandas.DataFrame`` object:\n\n* ``FirstObjFtClassifier`` - An sklearn adapter for fasttext using the first column of ``dtype == object`` as input.\n\n.. code-block:: python\n\n  \u003e\u003e\u003e import pandas\n  \u003e\u003e\u003e from skift import FirstObjFtClassifier\n  \u003e\u003e\u003e df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = FirstObjFtClassifier(lr=0.2)\n  \u003e\u003e\u003e sk_clf.fit(df[['txt']], df['lbl'])\n  \u003e\u003e\u003e sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))\n  [0]\n\n* ``ColLblBasedFtClassifier`` - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the ``input_col_lbl`` parameter to the constructor.\n\n.. code-block:: python\n\n  \u003e\u003e\u003e import pandas\n  \u003e\u003e\u003e from skift import ColLblBasedFtClassifier\n  \u003e\u003e\u003e df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)\n  \u003e\u003e\u003e sk_clf.fit(df[['txt']], df['lbl'])\n  \u003e\u003e\u003e sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))\n  [0]\n\n* ``SeriesFtClassifier`` - An sklearn adapter for fasttext taking a pandas ``Series`` as input.\n\n.. 
code-block:: python\n\n  \u003e\u003e\u003e from skift import SeriesFtClassifier\n  \u003e\u003e\u003e df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = SeriesFtClassifier(input_col_lbl='txt', epoch=8)\n  \u003e\u003e\u003e sk_clf.fit(df['txt'], df['lbl'])\n  \u003e\u003e\u003e sk_clf.predict(['woof'])\n  \u003e\u003e\u003e sk_clf.predict(df['txt'])\n\nHyperparameter auto-tuning\n----------------------------\n\nIt's possible to pass a validation set to ``fit()`` in order to optimize the hyper-parameters.\n\nFirst, to adjust the `auto-tune settings \u003chttps://fasttext.cc/docs/en/autotune.html\u003e`_, the corresponding keyword arguments can be passed to the constructor (if none are passed the default settings are used):\n\n.. code-block:: python\n\n  \u003e\u003e\u003e from skift import SeriesFtClassifier\n  \u003e\u003e\u003e df_train = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e df_val = pandas.DataFrame([['woof woof', 0], ['meow meow', 1]], columns=['txt', 'lbl'])\n  \u003e\u003e\u003e sk_clf = SeriesFtClassifier(epoch=8, autotuneDuration=5)\n\nThen, the validation dataframe (or series, in this case, since we constructed a ``SeriesFtClassifier``) and label column should be provided to the ``fit()`` method:\n\n.. code-block:: python\n\n  \u003e\u003e\u003e sk_clf.fit(df_train['txt'], df_train['lbl'], X_validation=df_val['txt'], y_validation=df_val['lbl'])\n\nOr simply by position:\n\n.. code-block:: python\n\n  \u003e\u003e\u003e sk_clf.fit(df_train['txt'], df_train['lbl'], df_val['txt'], df_val['lbl'])\n\n\nUsing Pre-trained word vectors\n-------------------------------\n\nThis is done in the exact same way as with the Python module or the fastText CLI, but not setting the right vector dimensions in the constructor (identical to the dimensions of the pretrained vectors you are using) will crash fastText without explanation, so we provide an example:\n\n.. 
code-block:: python\n\n    from skift import SeriesFtClassifier\n    ft_clf = SeriesFtClassifier(\n        autotuneDuration=900,\n        pretrainedVectors='/Users/myuser/data/word_vectors/crawl-300d-2M.vec',\n        dim=300,\n    )\n\nIn this case, not providing the constructor with ``dim=300`` would cause a crash when calling ``ft_clf.fit()``.\n\n\nContributing\n============\n\nThe package author and current maintainer is Shay Palachy (shay.palachy@gmail.com); you are more than welcome to approach him for help. Contributions are very welcome.\n\nInstalling for development\n----------------------------\n\nClone:\n\n.. code-block:: bash\n\n  git clone git@github.com:shaypal5/skift.git\n\n\nInstall in development mode, including test dependencies:\n\n.. code-block:: bash\n\n  cd skift\n  pip install -e '.[test]'\n\n\nTo also install ``fasttext``, see the instructions in the Installation section.\n\n\nRunning the tests\n-----------------\n\nTo run the tests use:\n\n.. code-block:: bash\n\n  cd skift\n  pytest\n\n\nAdding documentation\n--------------------\n\nThe project is documented using the `numpy docstring conventions`_, which were chosen as they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow `these conventions`_.\n\n.. _`numpy docstring conventions`: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt\n.. 
_`these conventions`: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt\n\nAdditionally, if you update this ``README.rst`` file, use ``python setup.py checkdocs`` to validate that it compiles.\n\n\nCredits\n=======\n\nCreated by Shay Palachy (shay.palachy@gmail.com).\n\nContributions:\n\n* `Dimid Duchovny \u003chttps://github.com/dimidd\u003e`_ contributed the ``SeriesFtClassifier`` class and the hyperparameter auto-tuning capability.\n\nFixes: `uniaz \u003chttps://github.com/uniaz\u003e`_, `crouffer \u003chttps://github.com/crouffer\u003e`_, `amirzamli \u003chttps://github.com/amirzamli\u003e`_ and `sgt \u003chttps://github.com/sgt\u003e`_.\n\n\n.. |PyPI-Status| image:: https://img.shields.io/pypi/v/skift.svg\n  :target: https://pypi.python.org/pypi/skift\n\n.. |PyPI-Versions| image:: https://img.shields.io/pypi/pyversions/skift.svg\n   :target: https://pypi.python.org/pypi/skift\n\n.. |Build-Status| image:: https://github.com/shaypal5/skift/actions/workflows/test.yml/badge.svg\n  :target: https://github.com/shaypal5/skift/actions/workflows/test.yml\n\n.. |LICENCE| image:: https://github.com/shaypal5/skift/blob/master/mit_license_badge.svg\n  :target: https://github.com/shaypal5/skift/blob/master/LICENSE\n\n.. https://img.shields.io/github/license/shaypal5/skift.svg\n\n.. |Codecov| image:: https://codecov.io/github/shaypal5/skift/coverage.svg?branch=master\n   :target: https://codecov.io/github/shaypal5/skift?branch=master\n\n.. |Downloads| image:: https://pepy.tech/badge/skift\n     :target: https://pepy.tech/project/skift\n     :alt: PePy stats\n\n.. |Codefactor| image:: https://www.codefactor.io/repository/github/shaypal5/skift/badge?style=plastic\n     :target: https://www.codefactor.io/repository/github/shaypal5/skift\n     :alt: Codefactor code quality\n\n.. 
Triggering Travis builds\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshaypal5%2Fskift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshaypal5%2Fskift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshaypal5%2Fskift/lists"}