{"id":15014135,"url":"https://github.com/davidberenstein1957/classy-classification","last_synced_at":"2025-04-08T10:14:16.116Z","repository":{"id":37809385,"uuid":"461821907","full_name":"davidberenstein1957/classy-classification","owner":"davidberenstein1957","description":"This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface. ","archived":false,"fork":false,"pushed_at":"2025-01-20T09:25:17.000Z","size":628,"stargazers_count":214,"open_issues_count":0,"forks_count":15,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-01T09:19:50.577Z","etag":null,"topics":["few-shot-classifcation","hacktoberfest","machine-learning","natural-language-processing","nlp","nlu","sentence-transformers","spacy","text-classification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidberenstein1957.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-21T11:02:17.000Z","updated_at":"2025-03-08T17:55:37.000Z","dependencies_parsed_at":"2024-05-31T13:46:04.277Z","dependency_job_id":"1f7c950d-2650-4d66-a89f-8fba24ce0e45","html_url":"https://github.com/davidberenstein1957/classy-classification","commit_stats":{"total_commits":107,"total_committers":4,"mean_commits":26.75,"dds":0.3738317757009346,"last_synced_commit":"c16bc1dedb5615dc5ac5644cad83b641ae23148c"},"previous_names":["davidberenstein1957/classy-classification","pandora-intelligence/classy
-classification"],"tags_count":24,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fclassy-classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fclassy-classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fclassy-classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fclassy-classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidberenstein1957","download_url":"https://codeload.github.com/davidberenstein1957/classy-classification/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247819933,"owners_count":21001394,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["few-shot-classifcation","hacktoberfest","machine-learning","natural-language-processing","nlp","nlu","sentence-transformers","spacy","text-classification"],"created_at":"2024-09-24T19:45:14.608Z","updated_at":"2025-04-08T10:14:16.088Z","avatar_url":"https://github.com/davidberenstein1957.png","language":"Python","funding_links":["https://www.buymeacoffee.com/98kf2552674"],"categories":["Python"],"sub_categories":[],"readme":"# Classy Classification\nHave you ever struggled with needing a [Spacy TextCategorizer](https://spacy.io/api/textcategorizer) but didn't have the time to train one from scratch? Classy Classification is the way to go! 
For few-shot classification using [sentence-transformers](https://github.com/UKPLab/sentence-transformers) or [spaCy models](https://spacy.io/usage/models), provide a dictionary with labels and examples, or just provide a list of labels for zero-shot classification with [Hugging Face zero-shot classifiers](https://huggingface.co/models?pipeline_tag=zero-shot-classification).\n\n[![Current Release Version](https://img.shields.io/github/release/pandora-intelligence/classy-classification.svg?style=flat-square\u0026logo=github)](https://github.com/pandora-intelligence/classy-classification/releases)\n[![pypi Version](https://img.shields.io/pypi/v/classy-classification.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/classy-classification/)\n[![PyPi downloads](https://static.pepy.tech/personalized-badge/classy-classification?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=orange\u0026left_text=pip%20downloads)](https://pypi.org/project/classy-classification/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)\n\n# Install\n``` pip install classy-classification```\n\n## SetFit support\n\nI got a lot of requests for SetFit support, but I decided to create a [separate package](https://github.com/davidberenstein1957/spacy-setfit) for this. Feel free to check it out. 
❤️\n\n# Quickstart\n## SpaCy embeddings\n```python\nimport spacy\n# or import standalone\n# from classy_classification import ClassyClassifier\n\ndata = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\"],\n    \"kitchen\": [\"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\"]\n}\n\nnlp = spacy.load(\"en_core_web_trf\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data,\n        \"model\": \"spacy\"\n    }\n)\n\nprint(nlp(\"I am looking for kitchen appliances.\")._.cats)\n\n# Output:\n#\n# [{\"furniture\" : 0.21}, {\"kitchen\": 0.79}]\n```\n### Sentence-level classification\n```python\nimport spacy\n\ndata = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\"],\n    \"kitchen\": [\"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\"]\n}\n\n# load a fresh pipeline (same model as in the example above)\nnlp = spacy.load(\"en_core_web_trf\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data,\n        \"model\": \"spacy\",\n        \"include_sent\": True\n    }\n)\n\ndoc = nlp(\"I am looking for kitchen appliances. And I love doing so.\")\n# doc.sents is a generator, so materialize it before indexing\nprint(list(doc.sents)[0]._.cats)\n\n# Output:\n#\n# [{\"furniture\" : 0.21}, {\"kitchen\": 0.79}]\n```\n\n### Define random seed and verbosity\n\n```python\n\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data,\n        \"verbose\": True,\n        \"config\": {\"seed\": 42}\n    }\n)\n```\n\n### Multi-label classification\n\nSometimes multiple labels are necessary to fully describe the contents of a text. In that case, use the **multi-label** implementation, in which the sum of label scores is not limited to 1. 
To do so, list the same training example under every label it belongs to.\n\n```python\nimport spacy\n\ndata = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\",\n               \"We have a new dinner table.\",\n               \"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\",\n                \"We have a new dinner table.\"],\n    \"kitchen\": [\"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\",\n                \"We have a new dinner table.\",\n                \"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\",\n                \"We have a new dinner table.\"]\n}\n\nnlp = spacy.load(\"en_core_web_md\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data,\n        \"model\": \"spacy\",\n        \"multi_label\": True,\n    }\n)\n\nprint(nlp(\"I am looking for furniture and kitchen equipment.\")._.cats)\n\n# Output:\n#\n# [{\"furniture\": 0.92}, {\"kitchen\": 0.91}]\n```\n\n### Outlier detection\n\nSometimes it is useful to do outlier detection or binary classification. This can be approached using\na binary training dataset; alternatively, I have implemented support for a `OneClassSVM` for [outlier detection using a single label](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html). 
Note that this method does not return probabilities; the output is still formatted as label-score pairs to ensure uniformity.\n\nApproach 1:\n\n```python\nimport spacy\n\ndata_binary = {\n    \"inlier\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\"],\n    \"outlier\": [\"Text about kitchen equipment\",\n                \"This text is about politics\",\n                \"Comments about AI and stuff.\"]\n}\n\nnlp = spacy.load(\"en_core_web_md\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data_binary,\n    }\n)\n\nprint(nlp(\"This text is a random text\")._.cats)\n\n# Output:\n#\n# [{'inlier': 0.2926672385488411, 'outlier': 0.707332761451159}]\n```\n\nApproach 2:\n\n```python\nimport spacy\n\ndata_singular = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\",\n               \"We have a new dinner table.\"]\n}\nnlp = spacy.load(\"en_core_web_md\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data_singular,\n    }\n)\n\nprint(nlp(\"This text is a random text\")._.cats)\n\n# Output:\n#\n# [{'furniture': 0, 'not_furniture': 1}]\n```\n\n## Sentence-transformer embeddings\n\n```python\nimport spacy\n\ndata = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\"],\n    \"kitchen\": [\"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\"]\n}\n\nnlp = spacy.blank(\"en\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data,\n        \"model\": \"sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2\",\n        \"device\": \"gpu\"\n    
}\n)\n\nprint(nlp(\"I am looking for kitchen appliances.\")._.cats)\n\n# Output:\n#\n# [{\"furniture\": 0.21}, {\"kitchen\": 0.79}]\n```\n\n## Hugging Face zero-shot classifiers\n\n```python\nimport spacy\n\ndata = [\"furniture\", \"kitchen\"]\n\nnlp = spacy.blank(\"en\")\nnlp.add_pipe(\n    \"classy_classification\",\n    config={\n        \"data\": data,\n        \"model\": \"typeform/distilbert-base-uncased-mnli\",\n        \"cat_type\": \"zero\",\n        \"device\": \"gpu\"\n    }\n)\n\nprint(nlp(\"I am looking for kitchen appliances.\")._.cats)\n\n# Output:\n#\n# [{\"furniture\": 0.21}, {\"kitchen\": 0.79}]\n```\n\n# Credits\n\n## Inspiration Drawn From\n\n[Hugging Face](https://huggingface.co/) does offer some nice models for few/zero-shot classification, but these are not tailored to multi-lingual approaches. Rasa NLU has [a nice approach](https://rasa.com/blog/rasa-nlu-in-depth-part-1-intent-classification/) for this, but it's too embedded in their codebase for easy usage outside of Rasa/chatbots. Additionally, it made sense to integrate [sentence-transformers](https://github.com/UKPLab/sentence-transformers) and [Hugging Face zero-shot](https://huggingface.co/models?pipeline_tag=zero-shot-classification), instead of default [word embeddings](https://arxiv.org/abs/1301.3781). 
Finally, I decided to integrate with Spacy, since training a custom [Spacy TextCategorizer](https://spacy.io/api/textcategorizer) seems like a lot of hassle if you want something quick and dirty.\n\n- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)\n- [Rasa NLU](https://github.com/RasaHQ/rasa)\n- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)\n- [Spacy](https://github.com/explosion/spaCy)\n\n## Or buy me a coffee\n\n[![\"Buy Me A Coffee\"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://www.buymeacoffee.com/98kf2552674)\n\n# Standalone usage without spaCy\n\n```python\n\nfrom classy_classification import ClassyClassifier\n\ndata = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to get a new sofa.\"],\n    \"kitchen\": [\"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\"]\n}\n\nclassifier = ClassyClassifier(data=data)\nclassifier(\"I am looking for kitchen appliances.\")\nclassifier.pipe([\"I am looking for kitchen appliances.\"])\n\n# overwrite training data\nclassifier.set_training_data(data=data)\nclassifier(\"I am looking for kitchen appliances.\")\n\n# overwrite [embedding model](https://www.sbert.net/docs/pretrained_models.html)\nclassifier.set_embedding_model(model=\"paraphrase-MiniLM-L3-v2\")\nclassifier(\"I am looking for kitchen appliances.\")\n\n# overwrite SVC config\nclassifier.set_classification_model(\n    config={\n        \"C\": [1, 2, 5, 10, 20, 100],\n        \"kernel\": [\"linear\"],\n        \"max_cross_validation_folds\": 5\n    }\n)\nclassifier(\"I am looking for kitchen appliances.\")\n```\n\n## Save and load models\n\n```python\nimport pickle\n\nfrom classy_classification import ClassyClassifier\n\ndata = {\n    \"furniture\": [\"This text is about chairs.\",\n               \"Couches, benches and televisions.\",\n               \"I really need to 
get a new sofa.\"],\n    \"kitchen\": [\"There also exist things like fridges.\",\n                \"I hope to be getting a new stove today.\",\n                \"Do you also have some ovens.\"]\n}\nclassifier = ClassyClassifier(data=data)\n\n# serialize the trained classifier to disk\nwith open(\"./classifier.pkl\", \"wb\") as f:\n    pickle.dump(classifier, f)\n\n# load it back; the context manager closes the file handle\nwith open(\"./classifier.pkl\", \"rb\") as f:\n    classifier = pickle.load(f)\nclassifier(\"I am looking for kitchen appliances.\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidberenstein1957%2Fclassy-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidberenstein1957%2Fclassy-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidberenstein1957%2Fclassy-classification/lists"}