{"id":17132059,"url":"https://github.com/dayyass/text-classification-baseline","last_synced_at":"2025-07-19T20:39:09.373Z","repository":{"id":46213335,"uuid":"390099277","full_name":"dayyass/text-classification-baseline","owner":"dayyass","description":"Pipeline for fast building text classification TF-IDF + LogReg baselines.","archived":false,"fork":false,"pushed_at":"2021-11-06T09:00:24.000Z","size":1638,"stargazers_count":63,"open_issues_count":18,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-09-14T07:47:25.628Z","etag":null,"topics":["baseline","classification","data-science","fast","hacktoberfest","logistic-regression","machine-learning","natural-language-processing","nlp","python","text","text-classification","tf-idf"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/text-classification-baseline/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dayyass.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-27T19:11:31.000Z","updated_at":"2024-04-28T05:13:07.000Z","dependencies_parsed_at":"2022-09-10T02:21:33.744Z","dependency_job_id":null,"html_url":"https://github.com/dayyass/text-classification-baseline","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dayyass%2Ftext-classification-baseline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dayyass%2Ftext-classification-baseline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dayyass%2Ftext-classification-baseline/re
leases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dayyass%2Ftext-classification-baseline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dayyass","download_url":"https://codeload.github.com/dayyass/text-classification-baseline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219851597,"owners_count":16556236,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["baseline","classification","data-science","fast","hacktoberfest","logistic-regression","machine-learning","natural-language-processing","nlp","python","text","text-classification","tf-idf"],"created_at":"2024-10-14T19:25:52.036Z","updated_at":"2024-10-14T19:25:52.604Z","avatar_url":"https://github.com/dayyass.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![tests](https://github.com/dayyass/text-classification-baseline/actions/workflows/tests.yml/badge.svg)](https://github.com/dayyass/text-classification-baseline/actions/workflows/tests.yml)\n[![linter](https://github.com/dayyass/text-classification-baseline/actions/workflows/linter.yml/badge.svg)](https://github.com/dayyass/text-classification-baseline/actions/workflows/linter.yml)\n[![codecov](https://codecov.io/gh/dayyass/text-classification-baseline/branch/main/graph/badge.svg?token=ABFF3YQBJV)](https://codecov.io/gh/dayyass/text-classification-baseline)\n\n[![python 
3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://github.com/dayyass/text-classification-baseline#requirements)\n[![release (latest by date)](https://img.shields.io/github/v/release/dayyass/text-classification-baseline)](https://github.com/dayyass/text-classification-baseline/releases/latest)\n[![license](https://img.shields.io/github/license/dayyass/text-classification-baseline?color=blue)](https://github.com/dayyass/text-classification-baseline/blob/main/LICENSE)\n\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-black)](https://github.com/dayyass/text-classification-baseline/blob/main/.pre-commit-config.yaml)\n[![code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n[![pypi version](https://img.shields.io/pypi/v/text-classification-baseline)](https://pypi.org/project/text-classification-baseline)\n[![pypi downloads](https://img.shields.io/pypi/dm/text-classification-baseline)](https://pypi.org/project/text-classification-baseline)\n\n## Text Classification Baseline\nPipeline for quickly building text classification baselines with **TF-IDF + LogReg**.\n\n## Usage\nInstead of writing custom code for a specific text classification task, you just need to:\n1. install pipeline:\n```shell script\npip install text-classification-baseline\n```\n2. 
run pipeline:\n- either in **terminal**:\n```shell script\ntext-clf-train --path_to_config config.yaml\n```\n- or in **python**:\n```python3\nimport text_clf\n\nmodel, target_names_mapping = text_clf.train(path_to_config=\"config.yaml\")\n```\n\n**NOTE**: more about config file [here](https://github.com/dayyass/text-classification-baseline/tree/main#config).\n\nNo data preparation is needed, only a **csv** file with two raw columns (with arbitrary names):\n- `text`\n- `target`\n\nThe **target** can be presented in any format, including text - not necessarily integers from *0* to *n_classes-1*.\n\n### Config\nThe user interface consists of two files:\n- [**config.yaml**](https://github.com/dayyass/text-classification-baseline/blob/main/config.yaml) - general configuration with sklearn **TF-IDF** and **LogReg** parameters\n- [**hyperparams.py**](https://github.com/dayyass/text-classification-baseline/blob/main/hyperparams.py) - sklearn **GridSearchCV** parameters\n\nChange **config.yaml** and **hyperparams.py** to create the desired configuration and train text classification model with the following command:\n- **terminal**:\n```shell script\ntext-clf-train --path_to_config config.yaml\n```\n- **python**:\n```python3\nimport text_clf\n\nmodel, target_names_mapping = text_clf.train(path_to_config=\"config.yaml\")\n```\n\nDefault **config.yaml**:\n```yaml\nseed: 42\npath_to_save_folder: models\nexperiment_name: model\n\n# data\ndata:\n  train_data_path: data/train.csv\n  test_data_path: data/test.csv\n  sep: ','\n  text_column: text\n  target_column: target_name_short\n\n# preprocessing\n# (included in resulting model pipeline, so preserved for inference)\npreprocessing:\n  lemmatization: null  # pymorphy2\n\n# tf-idf\ntf-idf:\n  lowercase: true\n  ngram_range: (1, 1)\n  max_df: 1.0\n  min_df: 1\n\n# logreg\nlogreg:\n  penalty: l2\n  C: 1.0\n  class_weight: balanced\n  solver: saga\n  n_jobs: -1\n\n# grid-search\ngrid-search:\n  do_grid_search: false\n  
grid_search_params_path: hyperparams.py\n```\n\n**NOTE**: grid search is disabled by default; to use it, set `do_grid_search: true`.\n\n**NOTE**: `tf-idf` and `logreg` are sklearn [**TfidfVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) and [**LogisticRegression**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) parameters respectively, so you can parameterize instances of these classes however you want. The same logic applies to `grid-search`, which is sklearn [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) parameterized with [**hyperparams.py**](https://github.com/dayyass/text-classification-baseline/blob/main/hyperparams.py).\n\n### Output\nAfter training the model, the pipeline produces the following files:\n- `model.joblib` - sklearn pipeline with TF-IDF and LogReg steps\n- `target_names.json` - mapping from encoded target labels (*0* to *n_classes-1*) to their names\n- `config.yaml` - config that was used to train the model\n- `hyperparams.py` - grid-search parameters (if grid-search was used)\n- `logging.txt` - logging file\n\n\n### Additional functions\n- `text_clf.token_frequency.get_token_frequency(path_to_config)` - \u003cbr\u003e get token frequency of **train dataset** according to the config file parameters\n\n**Only for binary classifiers**:\n- `text_clf.pr_roc_curve.get_precision_recall_curve(path_to_model_folder)` - \u003cbr\u003e get *precision* and *recall* metrics for precision-recall curve\n- `text_clf.pr_roc_curve.get_roc_curve(path_to_model_folder)` - \u003cbr\u003e get *false positive rate (fpr)* and *true positive rate (tpr)* metrics for roc curve\n- `text_clf.pr_roc_curve.plot_precision_recall_curve(precision, recall)` - \u003cbr\u003e plot *precision-recall 
curve*\n- `text_clf.pr_roc_curve.plot_roc_curve(fpr, tpr)` - \u003cbr\u003e plot *roc curve*\n- `text_clf.pr_roc_curve.plot_precision_recall_f1_curves_for_thresholds(precision, recall, thresholds)` - \u003cbr\u003e plot *precision*, *recall*, *f1-score* curves for probability thresholds\n\n## Requirements\nPython \u003e= 3.6\n\n## Citation\nIf you use **text-classification-baseline** in a scientific publication, we would appreciate a reference to the following BibTeX entry:\n```bibtex\n@misc{dayyass2021textclf,\n    author       = {El-Ayyass, Dani},\n    title        = {Pipeline for training text classification baselines},\n    howpublished = {\\url{https://github.com/dayyass/text-classification-baseline}},\n    year         = {2021}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdayyass%2Ftext-classification-baseline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdayyass%2Ftext-classification-baseline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdayyass%2Ftext-classification-baseline/lists"}