{"id":15118728,"url":"https://github.com/AstraZeneca/KAZU","last_synced_at":"2025-09-28T01:30:40.768Z","repository":{"id":102562691,"uuid":"549271080","full_name":"AstraZeneca/KAZU","owner":"AstraZeneca","description":"Fast, world class biomedical NER","archived":false,"fork":false,"pushed_at":"2024-12-17T16:37:15.000Z","size":12047,"stargazers_count":78,"open_issues_count":34,"forks_count":7,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-01-13T01:05:22.799Z","etag":null,"topics":["biomedical-text-mining","natural-language-processing","nlp"],"latest_commit_sha":null,"homepage":"https://AstraZeneca.github.io/KAZU/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AstraZeneca.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-10T23:58:43.000Z","updated_at":"2024-12-31T01:19:24.000Z","dependencies_parsed_at":"2024-06-13T17:18:12.085Z","dependency_job_id":"cad5562f-8a78-44d8-ae57-55ad57a98b54","html_url":"https://github.com/AstraZeneca/KAZU","commit_stats":null,"previous_names":[],"tags_count":38,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FKAZU","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FKAZU/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FKAZU/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AstraZeneca%2FKAZU/manifests","owner_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AstraZeneca","download_url":"https://codeload.github.com/AstraZeneca/KAZU/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234475315,"owners_count":18839358,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["biomedical-text-mining","natural-language-processing","nlp"],"created_at":"2024-09-26T01:53:37.205Z","updated_at":"2025-09-28T01:30:39.751Z","avatar_url":"https://github.com/AstraZeneca.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"\n![Maturity level-1](https://img.shields.io/badge/Maturity%20Level-ML--2-green)\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/AstraZeneca/KAZU/main/docs/kazu_logo.png\" alt=\"Kazu - Biomedical NLP Framework\" align=middle style=\"width: 66%;height: auto;\"/\u003e\n  \u003cbr\u003e\u003cbr\u003e\n\u003c/p\u003e\n\n[Find our docs here](https://astrazeneca.github.io/KAZU/index.html)\n\n# Kazu - Biomedical NLP Framework\n\n**Note: the recent 2.0 release has large elements of backwards incompatibility if you are using a custom model pack and curations.**\n\nWelcome to Kazu (Korea AstraZeneca University), a python biomedical NLP framework built in collaboration with Korea University,\ndesigned to handle production workloads.\n\nThis library aims to simplify the process of using state of the art NLP research in production systems. 
Some of the\nresearch contained within is our own, but most of it comes from the community, for which we are immensely grateful.\n\nIf you want to use Kazu, please cite our [EMNLP 2022 publication](https://aclanthology.org/2022.emnlp-industry.63)!\n([**citation link**](https://aclanthology.org/2022.emnlp-industry.63.bib))\n\n[Please click here for the TinyBERN2 training and evaluation code](https://github.com/dmis-lab/KAZU-NER-module)\n\n# Quickstart\n\n## Install\n\nPython version 3.9 or higher is required (tested with Python 3.11).\n\nEither:\n\n`pip install kazu`\n\nor download the wheel from the release page and install locally.\n\nIf you intend to use [Mypy](https://mypy.readthedocs.io/en/stable/#) on your own codebase, consider installing Kazu using:\n\n`pip install kazu[typed]`\n\nThis will pull in typing stubs for kazu's dependencies (such as [types-requests](https://pypi.org/project/types-requests/) for [Requests](https://requests.readthedocs.io/en/latest/))\nso that mypy has access to as much relevant typing information as possible when type checking your codebase. Otherwise (depending on mypy config), you may see errors when running mypy like:\n\n```\n.venv/lib/python3.10/site-packages/kazu/steps/linking/post_processing/xref_manager.py:10: error: Library stubs not installed for \"requests\" [import]\n```\n\n## Getting the model pack\n\nFor most functionality, you will also need the Kazu model pack. This is tied to each release, and can be found on the [release page](https://github.com/astrazeneca/kazu/releases). 
Once downloaded,\nextract the archive and:\n\n`export KAZU_MODEL_PACK=\u003cpath to the extracted archive\u003e`\n\nKazu is highly configurable (using [Hydra](https://hydra.cc/docs/intro/)), although it comes preconfigured with defaults appropriate for most literature processing use cases.\nTo make use of these, and process a simple document:\n\n```python\nimport hydra\nfrom hydra.utils import instantiate\n\nfrom kazu.data import Document\nfrom kazu.pipeline import Pipeline\nfrom kazu.utils.constants import HYDRA_VERSION_BASE\nfrom pathlib import Path\nimport os\n\n# the hydra config is kept in the model pack\ncdir = Path(os.environ[\"KAZU_MODEL_PACK\"]).joinpath(\"conf\")\n\n\n@hydra.main(\n    version_base=HYDRA_VERSION_BASE, config_path=str(cdir), config_name=\"config\"\n)\ndef kazu_test(cfg):\n    pipeline: Pipeline = instantiate(cfg.Pipeline)\n    text = \"EGFR mutations are often implicated in lung cancer\"\n    doc = Document.create_simple_document(text)\n    pipeline([doc])\n    print(f\"{doc.get_entities()}\")\n\n\nif __name__ == \"__main__\":\n    kazu_test()\n```\n\n## License\n\nLicensed under [Apache 2.0](https://github.com/AstraZeneca/KAZU/blob/main/LICENSE).\n\nKazu includes elements under compatible licenses (full licenses are in relevant files or as indicated):\n- Some elements are a modification of code licensed under MIT by Explosion.AI - see the README [here](https://github.com/AstraZeneca/KAZU/blob/main/kazu/ontology_matching/README.md).\n- The doc build process (conf.py's linkcode_resolve function) uses code modified from pandas, in turn modified from numpy. 
See [PANDAS_LICENSE.txt](https://github.com/AstraZeneca/KAZU/blob/main/docs/PANDAS_LICENSE.txt) and [NUMPY_LICENSE.txt](https://github.com/AstraZeneca/KAZU/blob/main/docs/NUMPY_LICENSE.txt).\n- Elements of the model distillation code are inspired by or modified from Huawei Noah's Ark Lab [TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/blob/master/TinyBERT) and DMIS-Lab's [BioBERT](https://github.com/dmis-lab/biobert/tree/master).\n  See the details in dataprocessor.py, models.py and tiny_transformer.py.\n- PLSapbertModel is inspired by the code from [sapbert](https://github.com/cambridgeltl/sapbert), licensed under MIT. See the file for details, and see the [SapBert](#sapbert) section below regarding use of the model.\n- GildaUtils in the string_normalizer.py file is modified from [Gilda](https://github.com/indralab/gilda). See the file for full details\n  including the full BSD 2-Clause license.\n- The AbbreviationFinderStep uses KazuAbbreviationDetector, which is a modified version of\n  [SciSpacy](https://allenai.github.io/scispacy/)'s abbreviation finding algorithm, licensed under Apache 2.0 - see the files for full details.\n- The JWTAuthenticationBackend Starlette Middleware in jwtauth.py is originally from [starlette-jwt](https://raw.githubusercontent.com/amitripshtos/starlette-jwt/master/starlette_jwt/middleware.py), licensed under BSD 3-Clause.\n- The AddRequestIdMiddleware Starlette Middleware in req_id_header.py is modified from 'CustomHeaderMiddleware' in the [Starlette Middleware docs](https://www.starlette.io/middleware/#basehttpmiddleware).\n  This is licensed under BSD 3-Clause along with the rest of Starlette.\n- The kazu-jvm folder includes files like gradlew and gradlew.bat distributed by Gradle under Apache 2.0 - see the files for details.\n- [kazu/data.py](https://github.com/AstraZeneca/KAZU/blob/main/kazu/data.py) contains `AutoNameEnum`, which is `AutoName` from\n  the [Python Enum 
Docs](https://docs.python.org/3/howto/enum.html#using-automatic-values) licensed under [Zero-Clause BSD](https://docs.python.org/3/license.html#zero-clause-bsd-license-for-code-in-the-python-release-documentation).\n\n## Dataset licences\n\nFor the version of each ontology currently in use, please see the 'data_origin' field in kazu/conf/ontologies.\n\n### Under [Creative Commons Attribution-Share Alike 3.0 Unported Licence](https://creativecommons.org/licenses/by/3.0/legalcode)\n\n#### ChEMBL\nChEMBL data is from http://www.ebi.ac.uk/chembl\n\n#### CLO\nCLO data is from http://www.ebi.ac.uk/ols/ontologies/clo\n\n#### UBERON\nUBERON data is from http://www.ebi.ac.uk/ols/ontologies/uberon\n\n### Under [Creative Commons Attribution 4.0 International Licence](https://creativecommons.org/licenses/by/4.0/legalcode)\n\n#### MONDO\nMONDO data is from http://www.ebi.ac.uk/ols/ontologies/mondo\n\n#### CELLOSAURUS\nCELLOSAURUS data is from https://www.cellosaurus.org/\n\n#### Gene Ontology\nGene Ontology data is from http://purl.obolibrary.org/obo/go.owl\n\n\n### Other licenced datasets and models\n\n#### HPO\n\nThis service/product uses the Human Phenotype Ontology (version information). 
Find out more at http://www.human-phenotype-ontology.org\n\nFreely licenced under https://hpo.jax.org/app/license\n\nSebastian Köhler, Michael Gargano, Nicolas Matentzoglu, Leigh C Carmody, David Lewis-Smith,\nNicole A Vasilevsky, Daniel Danis, Ganna Balagura, Gareth Baynam, Amy M Brower,\nTiffany J Callahan, Christopher G Chute, Johanna L Est, Peter D Galer, Shiva Ganesan,\nMatthias Griese, Matthias Haimel, Julia Pazmandi, Marc Hanauer, Nomi L Harris,\nMichael J Hartnett, Maximilian Hastreiter, Fabian Hauck, Yongqun He, Tim Jeske, Hugh Kearney,\nGerhard Kindle, Christoph Klein, Katrin Knoflach, Roland Krause, David Lagorce, Julie A McMurry,\nJillian A Miller, Monica C Munoz-Torres, Rebecca L Peters, Christina K Rapp, Ana M Rath,\nShahmir A Rind, Avi Z Rosenberg, Michael M Segal, Markus G Seidel, Damian Smedley,\nTomer Talmy, Yarlalu Thomas, Samuel A Wiafe, Julie Xian, Zafer Yüksel, Ingo Helbig,\nChristopher J Mungall, Melissa A Haendel, Peter N Robinson,\n\nThe Human Phenotype Ontology in 2021,\n\nNucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D1207–D1217,\u003cbr\u003e\nhttps://doi.org/10.1093/nar/gkaa1043\n\n\n#### OPEN TARGETS\nOpen Targets datasets are kindly provided by www.opentargets.org, which are free for commercial use cases \u003chttps://platform-docs.opentargets.org/licence\u003e\n\nOchoa, D. et al. (2021). Open Targets Platform: supporting systematic drug–target identification and prioritisation. Nucleic Acids Research.\u003cbr\u003e\nhttps://doi.org/10.1093/nar/gkaa1027\n\n#### STANZA\n\nThe Stanza framework:\n\nPeng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations. 2020.\u003cbr\u003e\nhttps://arxiv.org/abs/2003.07082\n\nBiomedical NLP models are derived from:\n\nYuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. 
Langlotz.\u003cbr\u003e\nBiomedical and Clinical English Model Packages in the Stanza Python NLP Library,\u003cbr\u003e\nJournal of the American Medical Informatics Association. 2021.\u003cbr\u003e\nhttps://doi.org/10.1093/jamia/ocab090\n\n#### SCISPACY\n\nBiomedical scispacy models are derived from\n\nMark Neumann, Daniel King, Iz Beltagy, Waleed Ammar\u003cbr\u003e\nScispaCy: Fast and Robust Models for Biomedical Natural Language Processing\u003cbr\u003e\nProceedings of the 18th BioNLP Workshop and Shared Task\u003cbr\u003e\nACL 2019\u003cbr\u003e\nhttps://www.aclweb.org/anthology/W19-5034\n\n#### SAPBERT\n\nKazu uses a distilled form of SAPBERT, from\n\nFangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, Nigel Collier\u003cbr\u003e\nSelf-Alignment Pretraining for Biomedical Entity Representations\u003cbr\u003e\nNAACL 2021\u003cbr\u003e\nhttps://aclanthology.org/2021.naacl-main.334/\n\n#### GLINER\n\nGLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer.\u003cbr\u003e\nUrchade Zaratiana, Nadi Tomeh, Pierre Holat, Thierry Charnois\u003cbr\u003e\nhttps://arxiv.org/abs/2311.08526\n\n#### SETH\n\nKazu's SethStep uses Py4j to call the SETH mutation finder.\n\nThomas, P., Rocktäschel, T., Hakenberg, J., Mayer, L., and Leser, U. (2016).\u003cbr\u003e\n[SETH detects and normalizes genetic variants in text](https://pubmed.ncbi.nlm.nih.gov/27256315/)\u003cbr\u003e\nBioinformatics (2016)\u003cbr\u003e\nhttp://dx.doi.org/10.1093/bioinformatics/btw234\n\n\n#### OPSIN\n\nKazu's OpsinStep uses Py4j to call OPSIN: Open Parser for Systematic IUPAC nomenclature.\n\nDaniel M. Lowe, Peter T. Corbett, Peter Murray-Rust, and Robert C. 
Glen\u003cbr\u003e\nChemical Name to Structure: OPSIN, an Open Source Solution\u003cbr\u003e\nJournal of Chemical Information and Modeling 2011 51 (3), 739-753\u003cbr\u003e\nDOI: [10.1021/ci100384d](https://doi.org/10.1021/ci100384d)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAstraZeneca%2FKAZU","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAstraZeneca%2FKAZU","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAstraZeneca%2FKAZU/lists"}