{"id":31769600,"url":"https://github.com/kenhktsui/anyclassifier","last_synced_at":"2025-10-10T02:51:45.488Z","repository":{"id":251761519,"uuid":"835932538","full_name":"kenhktsui/anyclassifier","owner":"kenhktsui","description":"One Line To Build Zero-Data Classifiers in Minutes","archived":false,"fork":false,"pushed_at":"2024-09-25T17:01:43.000Z","size":2854,"stargazers_count":58,"open_issues_count":1,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-30T02:16:57.835Z","etag":null,"topics":["automl","llm","machine-learning","natural-language-processing","unlabeled-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kenhktsui.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T20:16:38.000Z","updated_at":"2025-07-02T09:02:25.000Z","dependencies_parsed_at":"2024-08-28T14:15:47.675Z","dependency_job_id":"d1fee3e3-0677-4610-a0e5-e6dbb4aeac93","html_url":"https://github.com/kenhktsui/anyclassifier","commit_stats":null,"previous_names":["kenhktsui/anyclassifier"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/kenhktsui/anyclassifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhktsui%2Fanyclassifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhktsui%2Fanyclassifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhktsui%2Fanyclassifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/kenhktsui%2Fanyclassifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kenhktsui","download_url":"https://codeload.github.com/kenhktsui/anyclassifier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhktsui%2Fanyclassifier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279002551,"owners_count":26083403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","llm","machine-learning","natural-language-processing","unlabeled-data"],"created_at":"2025-10-10T02:51:41.584Z","updated_at":"2025-10-10T02:51:45.483Z","avatar_url":"https://github.com/kenhktsui.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![GitHub License](https://img.shields.io/github/license/kenhktsui/anyclassifier?)![PyPI - Downloads](https://img.shields.io/pypi/dm/anyclassifier?)![PyPI - Version](https://img.shields.io/pypi/v/anyclassifier?)\n\n# ∞🧙🏼‍♂️AnyClassifier - One Line To Build Zero-Data Classifiers in Minutes, And A Step Towards The First AI ML Engineer\n![image](assets/Traditional_ML_Cycle.png)\n\n![image](assets/AnyClassifier.png)\n\n\u003eHave you ever wanted/ been requested to build a classifier without any data? 
All it takes now is one line 🤯.   \n\n**AnyClassifier** is a framework that empowers you to create high-performance classifiers without any labels or data, using minimal code. \nIt's designed to revolutionize the machine learning development process by eliminating the need for extensive data curation and labeling.\n\n## 🚀 Key Features\n- **Zero Data Required**: Build classifiers from scratch, even without a dataset\n- **Competitive Results**: Achieves competitive results with synthetic data, close to those obtained with real data 🚀 - See [Benchmark](#benchmark)\n- **Multilingual Synthetic Data Generation**: Generates data in any language your LLM supports. \n- **One-Line Implementation**: Democratizes AI for everyone, including agents (as a plugin for agentic flows)\n- **LLM-Powered Synthetic Data and Annotations**: Leverages SOTA LLMs for high-quality synthetic data generation and labeling -  See [approach](docs/synthetic_data_generation.md)\n- **Designed for Real-World Use**: Created by ML engineers, for ML and software engineers\n\n## 🎯 Why AnyClassifier?\nAs machine learning engineers, we understand the challenges of building and curating high-quality labeled datasets.  \nThe challenge grows when you need to build a multilingual dataset.  
\nAnyClassifier eliminates this bottleneck, allowing you to focus on what matters most - solving real-world problems.\n\n## 🏁 QuickStart in Apple Silicon - Train a model in 5 min!\n\u003cdetails\u003e\n\u003csummary\u003eExpand\u003c/summary\u003e\n\n```shell\n# install (cp39 = python3.9, other valid values are cp310, cp311, cp312)\ncurl -L -O https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.88-metal/llama_cpp_python-0.2.88-cp39-cp39-macosx_11_0_arm64.whl\npip install llama_cpp_python-0.2.88-cp39-cp39-macosx_11_0_arm64.whl\nrm llama_cpp_python-0.2.88-cp39-cp39-macosx_11_0_arm64.whl\npip install anyclassifier\n# run\ncd examples\npython train_setfit_model_imdb.py\n```\n\u003c/details\u003e\n\n![image](assets/demo.gif)\n## 🏁 QuickStart in Colab\n\n| Dataset                       | Colab Link                                                                                                                                                          |\n|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| No Data!                      
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Wt_IlilfTqBbyn3gZQ3kITjObrVAypyi?usp=sharing) |\n| imdb sentiment classification | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LB8PUTT9wM1Qb2cY-6Dx-RNiqmyCvRr1?usp=sharing) |\n\n\n## 🛠️ Usage\n### Download a small LLM (please accept the respective terms and condition of model license beforehand)\n[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)  \n[google/gemma-2-9b](https://huggingface.co/google/gemma-2-9b)\n```python\nfrom huggingface_hub import hf_hub_download\n\n# meta-llama/Meta-Llama-3.1-8B-Instruct\nhf_hub_download(\"lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF\", \"Meta-Llama-3.1-8B-Instruct-Q8_0.gguf\")\n\n# google/gemma-2-9b\nhf_hub_download(\"lmstudio-community/gemma-2-9b-it-GGUF\", \"gemma-2-9b-it-Q8_0.gguf\")\n```\n\n### One Liner For No Data\n\n```python\n\nfrom huggingface_hub import hf_hub_download\nfrom anyclassifier import train_anyclassifier\nfrom anyclassifier.llm.llm_client import LlamaCppClient\nfrom anyclassifier.schema import Label\n\n# Magic One Line!\ntrainer = train_anyclassifier(\n  \"Classify a text's sentiment.\",\n  [\n    Label(id=1, desc='positive sentiment'),\n    Label(id=0, desc='negative sentiment')\n  ],\n  LlamaCppClient(hf_hub_download(\n    \"lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF\", \"Meta-Llama-3.1-8B-Instruct-Q8_0.gguf\" # as you like\n  ))\n)\n# Share Your Model!\ntrainer.push_to_hub(\"user_id/any_model\")\n```\n\n### One Liner For Unlabeled Data\n```python\nfrom huggingface_hub import hf_hub_download\nfrom anyclassifier import train_anyclassifier\nfrom anyclassifier.llm.llm_client import LlamaCppClient\nfrom anyclassifier.schema import Label\n\nunlabeled_dataset  # a huggingface datasets.Dataset class can be from your local json/ csv, or from huggingface 
hub.\n\n# Magic One Line!\ntrainer = train_anyclassifier(\n  \"Classify a text's sentiment.\",\n  [\n    Label(id=1, desc='positive sentiment'),\n    Label(id=0, desc='negative sentiment')\n  ],\n  LlamaCppClient(hf_hub_download(\n    \"lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF\", \"Meta-Llama-3.1-8B-Instruct-Q8_0.gguf\" # as you like\n  )),\n  unlabeled_dataset\n)\n# Share Your Model!\ntrainer.push_to_hub(\"user_id/any_model\")\n```\n\n### To Use Model\n\n```python\n# SetFit\nfrom setfit import SetFitModel\n\nmodel = SetFitModel.from_pretrained(\"user_id/any_model\")\npreds = model.predict([\"i loved the spiderman movie!\", \"pineapple on pizza is the worst 🤮\"])\nprint(preds)\n\n# FastText\nfrom anyclassifier.fasttext_wrapper import FastTextForSequenceClassification\n\nmodel = FastTextForSequenceClassification.from_pretrained(\"user_id/any_model\")\npreds = model.predict([\"i loved the spiderman movie!\", \"pineapple on pizza is the worst 🤮\"])\nprint(preds)\n```\n\n## 🔧 Installation\nAnyClassifier uses llama.cpp as its backend. Building the wheel from source can take a long time (10+ minutes), so instructions for installing a pre-built wheel are also provided.\n\u003cdetails\u003e\n\u003csummary\u003eMetal Backend (Apple's GPU) - cp39 = python3.9, other valid values are cp310, cp311, cp312\u003c/summary\u003e\n\n```shell\ncurl -L -O https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.88-metal/llama_cpp_python-0.2.88-cp39-cp39-macosx_11_0_arm64.whl\npip install llama_cpp_python-0.2.88-cp39-cp39-macosx_11_0_arm64.whl\nrm llama_cpp_python-0.2.88-cp39-cp39-macosx_11_0_arm64.whl\npip install anyclassifier\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eColab (T4) Prebuilt Wheel\u003c/summary\u003e\n\n```shell\ncurl -L -O https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.88-cu124/llama_cpp_python-0.2.88-cp310-cp310-linux_x86_64.whl\npip install llama_cpp_python-0.2.88-cp310-cp310-linux_x86_64.whl\nrm 
llama_cpp_python-0.2.88-cp310-cp310-linux_x86_64.whl\npip install anyclassifier\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eCUDA Backend (Please read [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/#installation))\u003c/summary\u003e\n\n```shell\nCMAKE_ARGS=\"-DGGML_CUDA=on\" pip install anyclassifier\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eCPU\u003c/summary\u003e\n\n```shell\nCMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install anyclassifier\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eDeveloper's installation\u003c/summary\u003e\n\n```shell\npip install -e .\n```\n\u003c/details\u003e\n\n## Other Usages\n\n### Use OpenAI instead of LlamaCpp, enabling concurrency\n```python\nfrom anyclassifier.schema import Label\nfrom anyclassifier.llm.llm_client import OpenAIClient\nfrom anyclassifier.train_any import train_anyclassifier\n\ntrainer = train_anyclassifier(\n  \"Classify a text's sentiment.\",\n  [\n    Label(id=1, desc='positive sentiment'),\n    Label(id=0, desc='negative sentiment')\n  ],\n  OpenAIClient(api_key=\"\u003cYOUR_OPENAI_API_KEY\u003e\", model=\"gpt-4o-mini\"),\n  generation_concurrency=10,\n  labeling_concurrency=10\n)\n```\n\n### To Label a Dataset\n\n```python\nfrom datasets import load_dataset\nfrom anyclassifier.annotation.prompt import AnnotationPrompt\nfrom anyclassifier.llm.llm_client import LlamaCppClient\nfrom anyclassifier.schema import Label\nfrom anyclassifier.annotation.annotator import LLMAnnotator\n\nunlabeled_dataset = load_dataset(\"somepath\")\nprompt = AnnotationPrompt(\n  task_description=\"Classify a text's sentiment.\",\n  label_definition=[\n      Label(id=0, desc='negative sentiment'),\n      Label(id=1, desc='positive sentiment')\n  ]  \n)\nannotator = LLMAnnotator(LlamaCppClient(), prompt)\nlabel_dataset = annotator.annotate_dataset(unlabeled_dataset, 
n_record=1000)\nlabel_dataset.push_to_hub('user_id/any_data')\n```\n\n### To Generate a Synthetic Dataset\n```python\nfrom anyclassifier.schema import Label\nfrom anyclassifier.synthetic_data_generation import SyntheticDataGeneratorForSequenceClassification\nfrom anyclassifier.llm.llm_client import LlamaCppClient\n\ntree_constructor = SyntheticDataGeneratorForSequenceClassification(LlamaCppClient())\ndataset = tree_constructor.generate(\n    \"Classify a text's sentiment.\",\n    [\n        Label(id=0, desc='negative sentiment'),\n        Label(id=1, desc='positive sentiment')\n    ]\n)\ndataset.push_to_hub('user_id/any_data')\n```\n\nSee more examples in [examples](./examples)  \n\n| dataset                                          | approach                  | model_type | example                                                          | resulting model                                                       | resulting dataset |\n|--------------------------------------------------|---------------------------|------------|------------------------------------------------------------------|-----------------------------------------------------------------------|--------------|\n| \"stanfordnlp/imdb\" like                          | synthetic data generation | setfit     | [link](examples/train_setfit_model_imdb_synthetic.py)            | [link](https://huggingface.co/kenhktsui/setfit_test_imdb)             | [link](https://huggingface.co/datasets/kenhktsui/test_imdb_syn) |\n| stanfordnlp/imdb                                 |  annotation               | setfit     | [link](examples/train_setfit_model_imdb.py)                      | [link](https://huggingface.co/kenhktsui/anyclassifier_setfit_demo)    | [link](https://huggingface.co/datasets/kenhktsui/test_imdb) |\n| \"zeroshot/twitter-financial-news-sentiment\" like | synthetic data generation | setfit     | [link](examples/train_setfit_model_twitter_financial_news_sentiment_synthetic.py) | 
[link](https://huggingface.co/kenhktsui/setfit_test_twitter_news_syn) | [link](https://huggingface.co/datasets/kenhktsui/test_twitter_financial_news_syn) |\n| zeroshot/twitter-financial-news-sentiment        |  annotation               | setfit     | [link](examples/train_setfit_model_twitter_financial_news_sentiment.py) | [link](https://huggingface.co/kenhktsui/setfit_test_twitter_news)     | [link](https://huggingface.co/datasets/kenhktsui/test_twitter_financial_news) |\n| \"ccdv/arxiv-classification\" like                 | synthetic data generation | setfit     | [link](examples/train_setfit_model_arxiv_topic_synthetic.py)     | [link](kenhktsui/setfit_test_arxiv_classification_syn)                | [link](https://huggingface.co/datasets/kenhktsui/test_arxiv_classification_syn) |\n| ccdv/arxiv-classification                        |  annotation               | setfit     | [link](examples/train_setfit_model_arxiv_topic.py)               | [link](kenhktsui/setfit_test_arxiv_classification)                    | [link](https://huggingface.co/datasets/kenhktsui/test_arxiv_classification_syn) |\n| \"lmsys/toxic-chat, toxicchat0124\" like           | synthetic data generation | setfit     | [link](examples/train_setfit_model_toxic_chat_synthetic.py)      | [link](kenhktsui/setfit_test_toxic_chat_syn)                          | [link](https://huggingface.co/datasets/kenhktsui/test_toxic_chat_syn) |\n| lmsys/toxic-chat, toxicchat0124                  |  annotation               | setfit     | [link](examples/train_setfit_model_toxic_chat.py)                | [link](kenhktsui/setfit_test_toxic_chat)                              | [link](https://huggingface.co/datasets/kenhktsui/test_toxic_chat) |\n| \"fancyzhx/ag_news\" like                          | synthetic data generation | setfit     | [link](examples/train_setfit_model_ag_news_synthetic.py)         | [link](https://huggingface.co/kenhktsui/setfit_test_ag_news_syn)      | 
[link](https://huggingface.co/datasets/kenhktsui/test_ag_news_syn) |\n| fancyzhx/ag_news                                 |  annotation               | setfit     | [link](examples/train_setfit_model_ag_news.py)                   | [link](https://huggingface.co/kenhktsui/setfit_test_ag_news)          | [link](https://huggingface.co/datasets/kenhktsui/test_ag_news) |\n| chinese sentiment                                | synthetic data generation | N/A        | [link](examples/generate_synthetic_chinese.py)                                                         | -                                                                     | [link](https://huggingface.co/datasets/kenhktsui/chinese_sentiment_syn) |\n\n\n## 🧪Benchmark\nThe objective is to see whether a model trained on synthetic data performs as well as one trained on annotated real data. The full training dataset column indicates the upper limit of performance, since more data is available. \nModel performance with synthetic data is on par with, or close to, that with real data. This is encouraging, because test data is by design usually more similar to the real training data than to synthetic data.\nSynthetic data is also advantageous when classes are highly imbalanced, as in the toxic chat problem.  \nThe benchmark suggests that the generated synthetic data is close to the distribution of the test data, which shows the effectiveness of this synthetic data generation approach without using any real data.  \nAll models are fine-tuned from `sentence-transformers/paraphrase-mpnet-base-v2` (109M). 
The performance can be boosted by using a larger base model and generating more data.\n\n| dataset  | metric             | synthetic data generation | annotation | full training dataset | full training reference|\n|--|--------------------|---------------------------|------------|-----------------------|--|\n|stanfordnlp/imdb| accuracy           | 0.878                     | 0.908      | 0.928                 |[lvwerra/distilbert-imdb](https://huggingface.co/lvwerra/distilbert-imdb)  |\n|zeroshot/twitter-financial-news-sentiment| f1 (weighted)      | 0.631                     | 0.676      | 0.866                 |[nickmuchi/finbert-tone-finetuned-fintwitter-classification](https://huggingface.co/nickmuchi/finbert-tone-finetuned-fintwitter-classification) |\n|ccdv/arxiv-classification| accuracy           | 0.618                     | 0.566      | 0.805                 | [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=8675939)                           |\n|lmsys/toxic-chat, toxicchat0124 | f1 (binary) | 0.362                     | 0.00*      | 0.822                 | [lmsys/toxicchat-t5-large-v1.0](https://huggingface.co/lmsys/toxicchat-t5-large-v1.0)               |\n|fancyzhx/ag_news| accuracy           | 0.768                     | 0.765      | 0.938                 | [fabriceyhc/bert-base-uncased-ag_news](https://huggingface.co/fabriceyhc/bert-base-uncased-ag_news) |\n\n\\* Out of 42 annotations, only 2 labels are positive, which makes learning difficult.\n\nThe code to replicate the benchmark is in [examples](examples/benchmark.py).\nWe will continue to add benchmarks on other datasets.\n\n\n## 📜 Documentation\nSee [docs](./docs) for our detailed methodology and documentation.\n\n## 🗺️ Roadmap\n- High Quality Data:\n  - Prompt validation\n  - Label validation - inter-model annotation\n  - R\u0026D in synthetic data generation\n- High Quality Model:\n  - Auto error analysis\n  - Auto model documentation\n  - Auto targeted synthetic data\n- More Benchmarking\n- 
Multilingual Support\n\n# 👋 Contributing\n- build models with AnyClassifier\n- suggest features\n- create issue/ PR\n- benchmarking synthetic data generation vs annotation vs full training dataset\n\n# 📄 License\nAnyClassifier is released under the MIT License.\n\n# 🤝🏻 Credits\nWithout these open source models and libraries, this project would not have been possible. \n- powerful open source and consumer friendly LLMs e.g. Llama 3.1 8B and Gemma 2 9B\n- [llama.cpp](https://github.com/ggerganov/llama.cpp) and [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)\n- [setfit](https://github.com/huggingface/setfit)\n- huggingface ecosystem\n\n# 📖 Citation\nIf you like this work, please cite:\n```\n@software{Tsui_AnyClassifier_2024,\nauthor = {Tsui, Ken},\nmonth = {8},\ntitle = {{AnyClassifier}},\nurl = {https://github.com/kenhktsui/anyclassifier},\nyear = {2024}\n}\n```\n\n# 📩 Follow Me For Update:\n[X](https://x.com/kenhktsui)/ [huggingface](https://huggingface.co/kenhktsui)/ [github](https://github.com/kenhktsui)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenhktsui%2Fanyclassifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkenhktsui%2Fanyclassifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenhktsui%2Fanyclassifier/lists"}