{"id":25975759,"url":"https://github.com/seznam/lightning-text","last_synced_at":"2026-02-16T17:03:08.091Z","repository":{"id":279216550,"uuid":"938081897","full_name":"seznam/lightning-text","owner":"seznam","description":"Adapter for using FastText library with scikit-learn and optuna.","archived":false,"fork":false,"pushed_at":"2025-02-24T12:06:51.000Z","size":23,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-11-27T18:34:03.496Z","etag":null,"topics":["fasttext","optuna","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seznam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-24T11:48:42.000Z","updated_at":"2025-02-24T12:06:53.000Z","dependencies_parsed_at":"2025-02-24T12:49:21.130Z","dependency_job_id":null,"html_url":"https://github.com/seznam/lightning-text","commit_stats":null,"previous_names":["seznam/lightning-text"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/seznam/lightning-text","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seznam%2Flightning-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seznam%2Flightning-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seznam%2Flightning-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seznam%2Flightning-text/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/seznam","download_url":"https://codeload.github.com/seznam/lightning-text/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seznam%2Flightning-text/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29513436,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-16T09:05:14.864Z","status":"ssl_error","status_checked_at":"2026-02-16T08:55:59.364Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fasttext","optuna","scikit-learn"],"created_at":"2025-03-05T03:23:59.188Z","updated_at":"2026-02-16T17:03:08.074Z","avatar_url":"https://github.com/seznam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LightningText\n\nLightning-fast text classification with scikit-learn integration.\n\nLightningText is an adapter for using\n[FastText's Python module](https://fasttext.cc/docs/en/python-module.html)\nwith\n[scikit-learn](https://scikit-learn.org/), enabling easy use of scikit-learn's\nfeatures (cross validation, various metrics, multi-output, ...) 
with FastText.\n\nPlease note that while this project strives for maximum possible compatibility\nwith scikit-learn, it is not currently possible to pass all tests executed by\n[`check_estimator`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.check_estimator.html),\nmostly due to FastText's behavior.\n\nWhile this project builds upon both FastText and scikit-learn, it is an\nindependent project not associated with either of the two.\n\n## Table of Contents\n\n- [Installation](#installation)\n- [API overview](#api-overview)\n  - [Base FastText API wrappers](#base-fasttext-api-wrappers)\n  - [Scikit-learn compatible FastText classifier](#scikit-learn-compatible-fasttext-classifier)\n    - [Note on labels representation](#note-on-labels-representation)\n  - [Text preprocessing](#text-preprocessing)\n  - [FastText dataset API](#fasttext-dataset-api)\n  - [Hyperparameter search using Optuna](#hyperparameter-search-using-optuna)\n  - [Additional scoring utilities](#additional-scoring-utilities)\n- [Examples](#examples)\n  - [Training and evaluating a model on train-test split](#training-and-evaluating-a-model-on-train-test-split)\n  - [Multi-label classification](#multi-label-classification)\n  - [K-fold Cross-validation](#k-fold-cross-validation)\n  - [Hyperparameter search example with Optuna](#hyperparameter-search-example-with-optuna)\n- [License](#license)\n\n## Installation\n\n```sh\npip install lightning-text\n```\n\n## API overview\n\n### Base FastText API wrappers\n\nThese are thin wrappers of the APIs exposed by FastText's Python module,\nproviding just explicit declaration of parameters, their default values and\ndocumentation. 
They pass the arguments directly to their respective counterparts\nin the FastText module:\n\n- `tokenize`\n- `load_model`\n- `train_supervised`\n- `train_unsupervised`\n\n### Scikit-learn compatible FastText classifier\n\nThe `FastTextClassifier` class wraps supervised learning of FastText as a\nscikit-learn classifier.\n\nTo ensure compatibility with scikit-learn, the classifier requires the targets\n(labels) to be integers instead of strings. One way to encode string labels to\nintegers is to use scikit-learn's\n[`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)\nor\n[`MultiLabelBinarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html).\nA more convenient way of handling a dataset in FastText format is to use the\n`Dataset` class (see [below](#fasttext-dataset-api)).\n\nThe classifier also adds support for pickling, including serialization of the\nunderlying FastText model.\n\n#### Note on labels representation\n\nScikit-learn uses integers to represent classification targets (classes), and\nby default, these are used as label names when fitting the underlying FastText\nmodel.\n\nIf, however, the FastText model should know the text representation (usually\nthe original names) of the classes (e.g. when deploying the final model in\na stand-alone way), a label encoder can be passed to `FastTextClassifier`'s\n`fit()` method using the `label_encoder` parameter. 
`FastTextClassifier`\nsupports both\n[`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)\n(for binary and multi-class classification) and\n[`MultiLabelBinarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html)\n(for multi-label classification).\n\nNote that if you use the `label_encoder` parameter, the class names must not\ncontain whitespace, otherwise you'll encounter exceptions during inference.\n\n### Text preprocessing\n\nThe module provides the `preprocess_text` utility for basic preprocessing of\nraw text for FastText. The function also provides optional removal of HTTP(S)\nURLs from text.\n\n### FastText dataset API\n\nThe `Dataset` class provides a convenient way of handling existing FastText\ndatasets, while representing the dataset in a scikit-learn-compatible way.\n\nLoading a FastText dataset using `Dataset.load_from` or `Dataset.from_file` will\nautomatically convert the string labels to integers, using either scikit-learn's\n[`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)\nor\n[`MultiLabelBinarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html)\n(if any sample has multiple labels assigned) to do the conversion. 
These class\nmethods return the label encoder they used together with the created dataset, so\nthat a fitted model's predictions can be converted back to text labels using the\n`inverse_transform` method on the encoder.\n\n### Hyperparameter search using Optuna\n\nLightningText provides a scikit-learn-style [Optuna](https://optuna.org/)-powered\nhyperparameter search with cross-validation, matching the APIs of\nscikit-learn's other hyperparameter searches, e.g.\n[`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).\nSee the `OptunaSearchCV` class for details.\n\n### Additional scoring utilities\n\nLightningText provides additional utilities for interpreting scores:\n\n- `get_fold_scores_per_candidate` - Takes results of hyperparameter search with\n  cross-validation (the `cv_results_` field) and returns a dictionary mapping metric\n  names to 2D NumPy arrays of shape `(n_candidates, n_folds)`.\n- `robust_cv_score` - Returns the harmonic mean of the provided scores. 
Harmonic\n  mean puts more weight on lower scores, naturally penalizing high variance.\n- `penalized_cv_score` - Explicitly penalize the scores by their standard\n  deviation and the provided `penalty_weight`.\n- `stability_score` - Measure how many scores are within `threshold` of the mean and\n  return the ratio.\n\n## Examples\n\n### Training and evaluating a model on train-test split\n\nThis example demonstrates use with a binary or multi-class (single-label)\nclassification problem.\n\n```python\nfrom lightning_text import FastTextClassifier\nfrom sklearn.metrics import classification_report\n\nclassifier = FastTextClassifier()\nclassifier.fit(X_train, y_train)\n\ny_pred = classifier.predict(X_test)\nprint(classification_report(y_test, y_pred))\n```\n\n### Multi-label classification\n\nThere are two options for multi-label classification:\n\n#### Using the `ova` (one-vs-all) loss\n\n```python\nfrom lightning_text import FastTextClassifier\nfrom sklearn.metrics import classification_report, hamming_loss\n\nclassifier = FastTextClassifier(\n    loss='ova',\n)\nclassifier.fit(X_train, Y_train)\n\nY_pred = classifier.predict(X_test)\nprint(f'Hamming loss: {hamming_loss(Y_test, Y_pred)}')\nprint(classification_report(Y_test, Y_pred))\n```\n\nThe classifier will be faster to fit and will occupy less space when saved.\nHowever, it requires tuning the decision threshold for its `predict()` method to be\nuseful (see\n[`TunedThresholdClassifierCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TunedThresholdClassifierCV.html)\nas one option for achieving this).\n\n#### Training a binary classifier for each class using scikit-learn's `MultiOutputClassifier` meta-estimator\n\n```python\nfrom lightning_text import FastTextClassifier\nfrom sklearn.metrics import classification_report, hamming_loss\nfrom sklearn.multioutput import MultiOutputClassifier\n\n# Binarizes the problem and trains a 
FastTextClassifier for predicting the label\n# for each individual class.\nclassifier = MultiOutputClassifier(\n    FastTextClassifier(\n        verbose=0,\n    ),\n    n_jobs=4,\n)\nclassifier.fit(X_train, Y_train)\n\nY_pred = classifier.predict(X_test)\nprint(f'Hamming loss: {hamming_loss(Y_test, Y_pred)}')\nprint(classification_report(Y_test, Y_pred))\n```\n\nThis will train a binary FastText classifier for detecting each class using the\none-vs-all strategy, and the resulting classifier will be usable right after\nfitting. However, it will be slower to fit and to predict, and it will occupy\nmore space when saved, which could make it impractical if the number of\nclasses is large.\n\n### K-fold Cross-validation\n\nNote that the use of\n[`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)\nis likely beneficial for correct evaluation of a binary classification problem.\nStratification is difficult, and often impractical or impossible, for multi-label problems.\n\n```python\nfrom lightning_text import FastTextClassifier\nfrom sklearn.model_selection import cross_validate\n\nclassifier = FastTextClassifier()\ncv = cross_validate(\n    classifier,\n    X,\n    y,\n    cv=5,\n    scoring='f1',\n    n_jobs=4,\n)\n\nprint(cv)\n```\n\n### Hyperparameter search example with Optuna\n\n```python\nfrom typing import Any\n\nfrom lightning_text import FastTextClassifier\nfrom lightning_text.optuna import (\n    OptunaSearchCV,\n    DEFAULT_SUPERVISED_TRAINING_HYPERPARAMETERS_SPACE,\n)\nimport optuna\nfrom sklearn.metrics import fbeta_score, make_scorer\n\n\ndef metrics_to_optuna_goals(metrics: dict[str, Any]) -\u003e float:\n    last_mean_fbeta = metrics['mean_test_score'][-1]\n    return last_mean_fbeta\n\n\ntries = 128\n\nestimator = FastTextClassifier(\n    verbose=0,\n)\n\nstudy = optuna.create_study(direction='maximize')\nsearch = OptunaSearchCV(\n    estimator=estimator,\n    study=study,\n    
hyperparameters_space=DEFAULT_SUPERVISED_TRAINING_HYPERPARAMETERS_SPACE,\n    n_iter=tries,\n    scoring=make_scorer(fbeta_score, pos_label=1, beta=1),\n    optuna_metrics_exporter=metrics_to_optuna_goals,\n    n_jobs=4,\n    refit='fbeta',\n    cv=5,\n    show_progress_bar=True,\n)\nsearch.fit(X, y)\n\nprint(search.cv_results_)\nprint(search.best_params_)\nprint(search.best_score_)\n\nbest_estimator = search.best_estimator_\n```\n\n## License\n\n`lightning_text` is distributed under the terms of the\n[MIT](https://spdx.org/licenses/MIT.html) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseznam%2Flightning-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseznam%2Flightning-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseznam%2Flightning-text/lists"}