{"id":15014131,"url":"https://github.com/davidberenstein1957/concise-concepts","last_synced_at":"2025-04-09T17:25:43.845Z","repository":{"id":42083980,"uuid":"469388007","full_name":"davidberenstein1957/concise-concepts","owner":"davidberenstein1957","description":"This repository contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring. ","archived":false,"fork":false,"pushed_at":"2023-06-19T13:17:26.000Z","size":14164,"stargazers_count":245,"open_issues_count":5,"forks_count":14,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-02T11:52:32.209Z","etag":null,"topics":["few-shot-classifcation","gensim","hacktoberfest","machine-learning","natural-language-processing","ner","nlp","spacy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidberenstein1957.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-03-13T14:00:36.000Z","updated_at":"2025-04-01T10:51:28.000Z","dependencies_parsed_at":"2023-07-26T17:11:52.771Z","dependency_job_id":"efba4edb-783f-47fd-b357-61b5cc11323f","html_url":"https://github.com/davidberenstein1957/concise-concepts","commit_stats":null,"previous_names":["pandora-intelligence/concise-concepts"],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fconcise-concepts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fconcise-concepts/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fconcise-concepts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidberenstein1957%2Fconcise-concepts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidberenstein1957","download_url":"https://codeload.github.com/davidberenstein1957/concise-concepts/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247999865,"owners_count":21031046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["few-shot-classifcation","gensim","hacktoberfest","machine-learning","natural-language-processing","ner","nlp","spacy"],"created_at":"2024-09-24T19:45:14.173Z","updated_at":"2025-04-09T17:25:43.823Z","avatar_url":"https://github.com/davidberenstein1957.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Concise Concepts\nWhen you want to apply NER to concise concepts, it is easy to come up with examples but difficult to train an entire pipeline. Concise Concepts uses few-shot NER based on word embedding similarity to get you going\nwith ease! 
Now with entity scoring!\n\n\n[![Python package](https://github.com/Pandora-Intelligence/concise-concepts/actions/workflows/python-package.yml/badge.svg?branch=main)](https://github.com/Pandora-Intelligence/concise-concepts/actions/workflows/python-package.yml)\n[![Current Release Version](https://img.shields.io/github/release/pandora-intelligence/concise-concepts.svg?style=flat-square\u0026logo=github)](https://github.com/pandora-intelligence/concise-concepts/releases)\n[![pypi Version](https://img.shields.io/pypi/v/concise-concepts.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/concise-concepts/)\n[![PyPi downloads](https://static.pepy.tech/personalized-badge/concise-concepts?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=orange\u0026left_text=pip%20downloads)](https://pypi.org/project/concise-concepts/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)\n\n\n## Usage\nThis library defines matching patterns based on the most similar words found in each group, which are used to fill a [spaCy EntityRuler](https://spacy.io/api/entityruler). 
To better understand the rule definition, I recommend playing around with the [spaCy Rule-based Matcher Explorer](https://demos.explosion.ai/matcher).\n\n### Tutorials\n- [TechVizTheDataScienceGuy](https://www.youtube.com/c/TechVizTheDataScienceGuy) created a [nice tutorial](https://prakhar-mishra.medium.com/few-shot-named-entity-recognition-in-natural-language-processing-92d31f0d1143) on how to use it.\n\n- [I](https://www.linkedin.com/in/david-berenstein-1bab11105/) created a [tutorial](https://www.rubrix.ml/blog/concise-concepts-rubrix/) in collaboration with Rubrix.\n\nThe section [Matching Pattern Rules](#matching-pattern-rules) expands on the construction, analysis and customization of these matching patterns.\n\n\n# Install\n\n```\npip install concise-concepts\n```\n\n# Quickstart\n\nTake a look at the [configuration section](#configuration) for more info.\n\n## spaCy Pipeline Component\n\nNote that [custom embedding models](#custom-embedding-models) are passed via `model_path`.\n\n```python\nimport spacy\nfrom spacy import displacy\n\ndata = {\n    \"fruit\": [\"apple\", \"pear\", \"orange\"],\n    \"vegetable\": [\"broccoli\", \"spinach\", \"tomato\"],\n    \"meat\": [\"beef\", \"pork\", \"turkey\", \"duck\"]\n}\n\ntext = \"\"\"\n    Heat the oil in a large pan and add the Onion, celery and carrots.\n    Then, cook over a medium–low heat for 10 minutes, or until softened.\n    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.\n    Later, add some oranges and chickens. 
\"\"\"\n\nnlp = spacy.load(\"en_core_web_md\", disable=[\"ner\"])\n\nnlp.add_pipe(\n    \"concise_concepts\",\n    config={\n        \"data\": data,\n        \"ent_score\": True,  # Entity Scoring section\n        \"verbose\": True,\n        \"exclude_pos\": [\"VERB\", \"AUX\"],\n        \"exclude_dep\": [\"DOBJ\", \"PCOMP\"],\n        \"include_compound_words\": False,\n        \"json_path\": \"./fruitful_patterns.json\",\n        \"topn\": (100, 500, 300)\n    },\n)\ndoc = nlp(text)\n\noptions = {\n    \"colors\": {\"fruit\": \"darkorange\", \"vegetable\": \"limegreen\", \"meat\": \"salmon\"},\n    \"ents\": [\"fruit\", \"vegetable\", \"meat\"],\n}\n\nents = doc.ents\nfor ent in ents:\n    new_label = f\"{ent.label_} ({ent._.ent_score:.0%})\"\n    options[\"colors\"][new_label] = options[\"colors\"].get(ent.label_.lower(), None)\n    options[\"ents\"].append(new_label)\n    ent.label_ = new_label\ndoc.ents = ents\n\ndisplacy.render(doc, style=\"ent\", options=options)\n```\n![](https://raw.githubusercontent.com/Pandora-Intelligence/concise-concepts/master/img/example.png)\n\n## Standalone\n\nThis is useful when iterating over few-shot training data without reloading larger models each time.\nNote that [custom embedding models](#custom-embedding-models) are passed via `model`.\n\n```python\nimport gensim.downloader\nimport spacy\n\nfrom concise_concepts import Conceptualizer\n\nmodel = gensim.downloader.load(\"fasttext-wiki-news-subwords-300\")\nnlp = spacy.load(\"en_core_web_sm\")\ndata = {\n    \"disease\": [\"cancer\", \"diabetes\", \"heart disease\", \"influenza\", \"pneumonia\"],\n    \"symptom\": [\"headache\", \"fever\", \"cough\", \"nausea\", \"vomiting\", \"diarrhea\"],\n}\nconceptualizer = Conceptualizer(nlp, data, model)\nconceptualizer.nlp(\"I have a headache and a fever.\").ents\n\ndata = {\n    \"disease\": [\"cancer\", \"diabetes\"],\n    \"symptom\": [\"headache\", \"fever\"],\n}\nconceptualizer = Conceptualizer(nlp, data, 
model)\nconceptualizer.nlp(\"I have a headache and a fever.\").ents\n```\n\n# Configuration\n## Matching Pattern Rules\nA general introduction to matching patterns is given in the [usage section](#usage).\n### Customizing Matching Pattern Rules\nEven though the baseline parameters provide a decent result, the construction of these matching rules can be customized via the config passed to the spaCy pipeline.\n\n - `exclude_pos`: A list of POS tags to be excluded from the rule-based match.\n - `exclude_dep`: A list of dependencies to be excluded from the rule-based match.\n - `include_compound_words`: If True, it will include compound words in the entity. For example, if the entity is \"New York\", it will also include \"New York City\" as an entity.\n - `case_sensitive`: Whether to match the case of the words in the text.\n\n\n### Analyze Matching Pattern Rules\nTo encourage inspecting the data and to support interpretability, the generated matching patterns are stored as `./main_patterns.json`. 
This behavior can be changed by setting `json_path` in the config passed to the spaCy pipeline.\n\n\n## Fuzzy matching using `spaczz`\n\n - `fuzzy`: A boolean value that determines whether to use fuzzy matching.\n\n```python\ndata = {\n    \"fruit\": [\"apple\", \"pear\", \"orange\"],\n    \"vegetable\": [\"broccoli\", \"spinach\", \"tomato\"],\n    \"meat\": [\"beef\", \"pork\", \"fish\", \"lamb\"]\n}\n\nnlp.add_pipe(\"concise_concepts\", config={\"data\": data, \"fuzzy\": True})\n```\n\n## Most Similar Word Expansion\n\n- `topn`: The number of most similar words to expand over, with one value per group in `data`.\n\n```python\ndata = {\n    \"fruit\": [\"apple\", \"pear\", \"orange\"],\n    \"vegetable\": [\"broccoli\", \"spinach\", \"tomato\"],\n    \"meat\": [\"beef\", \"pork\", \"fish\", \"lamb\"]\n}\n\ntopn = [50, 50, 150]\n\nassert len(topn) == len(data)\n\nnlp.add_pipe(\"concise_concepts\", config={\"data\": data, \"topn\": topn})\n```\n\n## Entity Scoring\n\n- `ent_score`: Use embedding-based word similarity to score entities against their groups.\n\n```python\nimport spacy\n\ndata = {\n    \"ORG\": [\"Google\", \"Apple\", \"Amazon\"],\n    \"GPE\": [\"Netherlands\", \"France\", \"China\"],\n}\n\ntext = \"\"\"Sony was founded in Japan.\"\"\"\n\nnlp = spacy.load(\"en_core_web_lg\")\nnlp.add_pipe(\"concise_concepts\", config={\"data\": data, \"ent_score\": True, \"case_sensitive\": True})\ndoc = nlp(text)\n\nprint([(ent.text, ent.label_, ent._.ent_score) for ent in doc.ents])\n# output\n#\n# [('Sony', 'ORG', 0.5207586), ('Japan', 'GPE', 0.7371268)]\n```\n\n## Custom Embedding Models\n\n- `model_path`: Use a custom `sense2vec.Sense2Vec`, `gensim.Word2vec`, `gensim.FastText`, or `gensim.KeyedVectors` model, a pretrained model from the [gensim](https://radimrehurek.com/gensim/downloader.html) library, or a custom model path. 
To use a `sense2vec.Sense2Vec` model, take a look [here](https://github.com/explosion/sense2vec#pretrained-vectors).\n- `model`: Within [standalone usage](#standalone), it is possible to pass these models directly.\n\n```python\ndata = {\n    \"fruit\": [\"apple\", \"pear\", \"orange\"],\n    \"vegetable\": [\"broccoli\", \"spinach\", \"tomato\"],\n    \"meat\": [\"beef\", \"pork\", \"fish\", \"lamb\"]\n}\n\n# model from https://radimrehurek.com/gensim/downloader.html or path to local file\nmodel_path = \"glove-wiki-gigaword-300\"\n\nnlp.add_pipe(\"concise_concepts\", config={\"data\": data, \"model_path\": model_path})\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidberenstein1957%2Fconcise-concepts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidberenstein1957%2Fconcise-concepts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidberenstein1957%2Fconcise-concepts/lists"}