{"id":20837746,"url":"https://github.com/astrazeneca/vecner","last_synced_at":"2025-05-08T20:29:40.809Z","repository":{"id":103037929,"uuid":"535673971","full_name":"AstraZeneca/VecNER","owner":"AstraZeneca","description":"A library of tools for dictionary-based Named Entity Recognition (NER), based on word vector representations to expand dictionary terms.","archived":false,"fork":false,"pushed_at":"2023-07-25T21:03:18.000Z","size":351,"stargazers_count":24,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-31T17:59:02.310Z","etag":null,"topics":["dictionary-based-ner","entity-extraction","natural-language-processing","ner","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraZeneca.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-12T13:14:09.000Z","updated_at":"2024-10-10T12:57:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"bc6629d1-4f89-4f8d-a28e-e615afa2e5a4","html_url":"https://github.com/AstraZeneca/VecNER","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FVecNER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FVecNER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FVecNER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/AstraZeneca%2FVecNER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraZeneca","download_url":"https://codeload.github.com/AstraZeneca/VecNER/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253144250,"owners_count":21861029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dictionary-based-ner","entity-extraction","natural-language-processing","ner","nlp"],"created_at":"2024-11-18T01:08:28.488Z","updated_at":"2025-05-08T20:29:40.801Z","avatar_url":"https://github.com/AstraZeneca.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003ch2\u003eProject Description\u003c/h2\u003e\n\n\u003ch3\u003e vecner :  A set of tools for lexical-based NER \u003c/h3\u003e\n\n![Maturity level-Prototype](https://img.shields.io/badge/Maturity%20Level-Prototype-red)\n[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/PyCQA/pylint)\n[![mypy checked](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)\n[![black](https://camo.githubusercontent.com/d91ed7ac7abbd5a6102cbe988dd8e9ac21bde0a73d97be7603b891ad08ce3479/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f64652532307374796c652d626c61636b2d3030303030302e737667)](https://pypi.org/project/black/)\n\u003c!-- [![unittest](https://github.com/python/cpython/workflows/Tests/badge.svg)](https://docs.python.org/3/library/unittest.html) --\u003e\n\nA library of tools for lexical-based Named Entity Recognition 
(NER), based on word vector representations to expand lexicon terms. Vecner is particularly helpful when: a) the lexicon contains only a few terms; b) you are interested in domain-specific NER; and c) a large unlabelled corpus is available.\n\nAn example: if the lexicon contains only the word *football* under the label **sports**, we can expect similar terms to include *soccer*, *basketball* and others. We can use this to our advantage when the lexicon has limited terms (see Examples below), by expanding the terms under **sports** using a w2vec model. This also works well for a domain-specific problem when a corpus is available to train a w2vec model on, because the similar terms then become much more relevant to the application.\n\nVecner supports:\n* Exact entity matching based on [Spacy](https://spacy.io/)'s PhraseMatcher\n* Finding entities based on similar terms in the lexicon\n* Chunking by:\n  * Using [Spacy](https://spacy.io/)'s noun chunking\n  * Using entity edges from a dependency graph\n  * Using a user-defined script ([see more here](#user-defined))\n\n### Installation\n\nInstall by running:\n\n```bash\npython setup.py install\n```\n\n\u003ch3\u003e Examples \u003c/h3\u003e\n\n\u003ch5\u003e 1. 
Using pretrained models\u003c/h5\u003e\n\nThis example shows how to use a pre-trained on general corpora [Gensim](https://radimrehurek.com/gensim/) w2vec model to perform NER with vecner (run this in [jupyter](examples/restaurant-example.ipynb)).\n\n```python\nfrom vecner import ExactMatcher, ExtendedMatcher\nimport gensim.downloader as api\n\ntext = \"\"\"\nThe burger had absolutely no flavor,\nthe place itself was totally dirty, the burger\nwas overcooked and the staff incredibly rude.\n\"\"\"\n\n# loads the pretrained on general corpora model\nmodel = api.load(\"glove-wiki-gigaword-100\")\n\n# sense check and test\nmodel.most_similar('food')\n\n# custom defined lexicon\nfood_lexicon = {\n    'service' : [\n        'rude',\n        'service',\n        'friendly'\n    ],\n    'general' : [\n        'clean',\n        'dirty',\n        'decoration',\n        'atmosphere'\n    ],\n    'food' : [\n        'warm',\n        'cold',\n        'flavor',\n        'tasty',\n        'stale',\n        'disgusting',\n        'delicious'\n    ]\n}\n\n# init the exact matcher to not miss\n# any entities from the lexicon if in text\nmatcher = ExactMatcher(\n    food_lexicon,\n    spacy_model     = 'en_core_web_sm'\n)\n\n# init the Extended Matcher, which expands the lexicon\n# using the w2vec model based on similar terms\n# and then matches them in the sequence\nextendedmatcher = ExtendedMatcher(\n    food_lexicon,\n    w2vec_model     = model,\n    in_pipeline     = True,\n    spacy_model     = 'en_core_web_sm',\n    chunking_method = 'edge_chunking',\n    sensitivity     = 20\n)\n\n# exact mapping\noutput = matcher.map(\n    text = text\n)\n\n# extended matching mapping\noutput = extendedmatcher.map(\n    document = output['doc'],\n    ents = output['ents'],\n    ids = output['ids']\n)\n```\n![food example](examples/example_food.png)\n\n\u003c!-- \u003cdiv class=\"entities\" style=\"line-height: 2.5; direction: ltr\"\u003eThe burger had \u003cmark class=\"entity\" 
style=\"background: #FFAC33; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    absolutely no flavor    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003efood\u003c/span\u003e\u003c/mark\u003e , the place itself was \u003cmark class=\"entity\" style=\"background: #33FFC1; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    totally dirty    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003econdition\u003c/span\u003e\u003c/mark\u003e , the burger \u003cmark class=\"entity\" style=\"background: #FFAC33; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    was overcooked    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003efood\u003c/span\u003e\u003c/mark\u003e and \u003cmark class=\"entity\" style=\"background: #DDFF33; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    the staff incredibly rude    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003eservice\u003c/span\u003e\u003c/mark\u003e .\u003c/div\u003e --\u003e\n\n\u003ch5\u003e 2. 
Using custom trained Gensim w2vec models\u003c/h5\u003e\n\nThis example shows how to use a custom trained [Gensim](https://radimrehurek.com/gensim/) w2vec model on your own corpora to perform NER with vecner (run this in [jupyter](examples/bio-example.ipynb)).\n\n```python\n\nfrom vecner import ExactMatcher, ThresholdMatcher\nfrom gensim.models import KeyedVectors\n\ntext = \"\"\"\nI am really amazed by the overall survival,\nhowever still concerned with 3-year treatment duration.\nWould be interesting to see the predictive biomarkers in NSCLC advanced patients.\n\"\"\"\n\n# loads the pre-trained Gensim model\nmodel = KeyedVectors.load('path_to_model')\n\n# check that model was loaded properly\nmodel.most_similar('pfs')\n\n# custom defined lexicon\nbio_lexicon = {\n    'efficacy' : [\n        'overall survival',\n        'pfs'\n    ],\n    'diagnostics' : [\n        'marker'\n    ],\n    'time' : [\n        'year',\n        'month'\n    ],\n    'patient groups' : [\n        'squamous',\n        'resectable'\n    ]\n\n}\n\n# init the Exact Matcher for finding entities\n# as exactly mentioned in the lexicon\nmatcher = ExactMatcher(\n  bio_lexicon,\n  spacy_model='en_core_web_sm'\n)\n\n# init the ThresholdMatcher which finds entities\n# based on a cosine similarity threshold\nthresholdmatcher = ThresholdMatcher(\n    bio_lexicon,\n    w2vec_model=model,\n    in_pipeline=True,\n    spacy_model='en_core_web_sm',\n    chunking_method='noun_chunking',\n    threshold = 0.55\n)\n\n# map exact entities\noutput = matcher.map(\n    text = text\n)\n\n# use in pipeline to map inexact entities\noutput = thresholdmatcher.map(\n    document = output['doc'],\n    ents = output['ents'],\n    ids = output['ids']\n)\n```\n![bio example](examples/example_bio.png)\n\n\u003c!-- \u003cdiv class=\"entities\" style=\"line-height: 2.5; direction: ltr\"\u003eI am really amazed by \u003cmark class=\"entity\" style=\"background: #DDFF33; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; 
border-radius: 0.35em;\"\u003e    the overall survival    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003eefficacy\u003c/span\u003e\u003c/mark\u003e , however still concerned with \u003cmark class=\"entity\" style=\"background: #33FFC1; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    3-year treatment duration    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003etime\u003c/span\u003e\u003c/mark\u003e . Would be interesting to see \u003cmark class=\"entity\" style=\"background: #FFAC33; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    the predictive biomarkers    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003ediagnostics\u003c/span\u003e\u003c/mark\u003e in \u003cmark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\"\u003e    NSCLC advanced patients    \u003cspan style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\"\u003epatient groups\u003c/span\u003e\u003c/mark\u003e .\u003c/div\u003e --\u003e\n\n\u003ch5\u003e 3. 
Using a User-Defined Script \u003c/h5\u003e\n \u003ca name=\"user-defined\"\u003e\u003c/a\u003e\n\n```python\nfrom vecner import ExactMatcher, ThresholdMatcher\nfrom gensim.models import KeyedVectors\n\ntext = \"\"\"\nI am really amazed by the overall survival,\nhowever still concerned with 3-year treatment duration.\nWould be interesting to see the predictive biomarkers in NSCLC advanced patients.\n\"\"\"\n\n# loads the pre-trained Gensim model\nmodel = KeyedVectors.load('path_to_model')\n\n# check that model was loaded properly\nmodel.most_similar('pfs')\n\n# custom defined lexicon\nbio_lexicon = {\n    'efficacy' : [\n        'overall survival',\n        'pfs'\n    ],\n    'diagnostics' : [\n        'marker'\n    ],\n    'time' : [\n        'year',\n        'month'\n    ],\n    'patient groups' : [\n        'squamous',\n        'resectable'\n    ]\n\n}\n\n# init the ThresholdMatcher which finds entities\n# based on a cosine similarity threshold\nthresholdmatcher = ThresholdMatcher(\n    bio_lexicon,\n    w2vec_model=model,\n    in_pipeline=False,\n    spacy_model='en_core_web_sm',\n    chunking_method='custom_chunking',\n    threshold = 0.55\n)\n\n# map inexact entities directly from the raw text (in_pipeline=False)\noutput = thresholdmatcher.map(\n    document = text\n)\n```\n\nThe custom script that the matcher reads from must:\n* be named ```custom_file.py```\n* define the function ```rule_chunker```, with input arguments:\n  * doc\n  * ents\n  * ids\n\nAn [example script](examples/example-custom_file.py) is provided to experiment with. To run it, rename it to ```custom_file.py``` and place it in the directory from which you run your main script. 
To prepare your own, you can follow the template below:\n\n```python\n## Template - custom_file.py\n\"\"\"\nTemplate for the custom rules file\n\"\"\"\nfrom typing import List, Dict, Any, Optional\n\ndef rule_chunker(\n                    doc     : object,\n                    ents    : Optional[List[Dict[str, Any]]],\n                    ids     : Optional[Dict[int,str]] = None\n                ) -\u003e List[Dict[str, Any]]:\n    \"\"\"\n    a custom chunker and/or entity expander\n    Args:\n        doc (object): spacy.doc object\n        ents (List[Dict[str, Any]]): the entities -\u003e [  {\n                                                            'start' : int,\n                                                            'end'   : int,\n                                                            'text'  : str,\n                                                            'label' : label,\n                                                            'idx'   : int\n                                                        },\n                                                        ...\n                                                    ]\n            where 'start' is the start character in the sequence\n                  'end' the end character in the sequence\n                  'text' the entity text string\n                  'label' the entity name\n                  'idx' the position of the first word in the entity in the sequence\n        ids (Dict[int,str]) : the entity idx's and their labels for easy recognition -\u003e e.g. 
{\n                                                                                                1 : 'time',\n                                                                                                5 : 'cost',\n                                                                                                ...\n                                                                                            }\n    Returns:\n        List[Dict[str, Any]]\n    \"\"\"\n\n    new_ents = []\n\n    raise NotImplementedError\n\n    return new_ents\n```\n\n\n\u003ch2\u003e Third Party Licenses \u003c/h2\u003e\n\nThird-party library licenses for the dependencies:\\\n[Spacy](https://github.com/explosion/spaCy) : [MIT License](https://github.com/explosion/spaCy/blob/master/LICENSE)\\\n[Gensim](https://github.com/RaRe-Technologies/gensim) : [LGPL-2.1 license](https://github.com/RaRe-Technologies/gensim/blob/develop/COPYING)\\\n[NLTK](https://github.com/nltk/nltk) : [Apache License 2.0](https://github.com/nltk/nltk/blob/develop/LICENSE.txt)\n\n\n\u003ch2\u003e License \u003c/h2\u003e\n\u003c!-- Copyright 2020 AXA Group Operations S.A. --\u003e\nLicensed under the Apache 2.0 license (see the [LICENSE file](LICENSE)).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fvecner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fastrazeneca%2Fvecner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fastrazeneca%2Fvecner/lists"}