{"id":18487887,"url":"https://github.com/MaartenGr/KeyBERT","last_synced_at":"2025-04-08T20:32:01.011Z","repository":{"id":37102330,"uuid":"306327980","full_name":"MaartenGr/KeyBERT","owner":"MaartenGr","description":"Minimal keyword extraction with BERT","archived":false,"fork":false,"pushed_at":"2025-03-25T11:51:56.000Z","size":4024,"stargazers_count":3809,"open_issues_count":70,"forks_count":368,"subscribers_count":34,"default_branch":"master","last_synced_at":"2025-04-01T23:04:52.072Z","etag":null,"topics":["bert","keyphrase-extraction","keyword-extraction","mmr"],"latest_commit_sha":null,"homepage":"https://MaartenGr.github.io/KeyBERT/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MaartenGr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-22T12:21:22.000Z","updated_at":"2025-04-01T13:42:50.000Z","dependencies_parsed_at":"2023-02-12T02:40:22.008Z","dependency_job_id":"4ad5593e-2b3e-4287-8a5f-7aaaecfbdbdf","html_url":"https://github.com/MaartenGr/KeyBERT","commit_stats":{"total_commits":30,"total_committers":9,"mean_commits":"3.3333333333333335","dds":"0.33333333333333337","last_synced_commit":"4f79073dfbd56fc7602a45266ed692af8e9cca8b"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FKeyBERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FKeyBERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FKeyBERT/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MaartenGr%2FKeyBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MaartenGr","download_url":"https://codeload.github.com/MaartenGr/KeyBERT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247923144,"owners_count":21018937,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","keyphrase-extraction","keyword-extraction","mmr"],"created_at":"2024-11-06T12:50:56.665Z","updated_at":"2025-04-08T20:32:00.995Z","avatar_url":"https://github.com/MaartenGr.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![PyPI Downloads](https://static.pepy.tech/badge/keybert)](https://pepy.tech/projects/keybert)\n[![PyPI - Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://pypi.org/project/keybert/)\n[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/MaartenGr/keybert/blob/master/LICENSE)\n[![PyPI - PyPi](https://img.shields.io/pypi/v/keyBERT)](https://pypi.org/project/keybert/)\n[![Build](https://img.shields.io/github/actions/workflow/status/MaartenGr/keyBERT/testing.yml?branch=master)](https://pypi.org/keybert/)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OxpgwKqSzODtO3vS7Xe1nEmZMCAIMckX?usp=sharing)\n\n\u003cimg src=\"images/logo.png\" width=\"35%\" height=\"35%\" align=\"right\" /\u003e\n\n# KeyBERT\n\nKeyBERT is a minimal and 
easy-to-use keyword extraction technique that leverages BERT embeddings to\ncreate keywords and keyphrases that are most similar to a document.\n\nThe corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).\n\n\u003ca name=\"toc\"/\u003e\u003c/a\u003e\n## Table of Contents  \n\u003c!--ts--\u003e  \n   1. [About the Project](#about)  \n   2. [Getting Started](#gettingstarted)  \n        2.1. [Installation](#installation)  \n        2.2. [Basic Usage](#usage)  \n        2.3. [Max Sum Distance](#maxsum)  \n        2.4. [Maximal Marginal Relevance](#maximal)  \n        2.5. [Embedding Models](#embeddings)  \n   3. [Large Language Models](#llms)  \n\u003c!--te--\u003e  \n\n\n\u003ca name=\"about\"/\u003e\u003c/a\u003e\n## 1. About the Project\n[Back to ToC](#toc)\n\nAlthough there are already many methods available for keyword generation\n(e.g.,\n[Rake](https://github.com/aneesha/RAKE),\n[YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.),\nI wanted to create a very basic, but powerful method for extracting keywords and keyphrases.\nThis is where **KeyBERT** comes in! It uses BERT embeddings and simple cosine similarity\nto find the sub-phrases in a document that are the most similar to the document itself.\n\nFirst, document embeddings are extracted with BERT to get a document-level representation.\nThen, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity\nto find the words/phrases that are the most similar to the document. The most similar words could\nthen be identified as the words that best describe the entire document.\n\nKeyBERT is by no means unique and was created as a quick and easy method\nfor creating keywords and keyphrases. 
Although there are many great\npapers and solutions out there that use BERT-embeddings\n(e.g.,\n[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),\n[2](https://github.com/ibatra/BERT-Keyword-Extractor),\n[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)\n), I could not find a BERT-based solution that did not have to be trained from scratch and\ncould be used by beginners (**correct me if I'm wrong!**).\nThus, the goal was a `pip install keybert` and at most 3 lines of code in usage.\n\n\u003ca name=\"gettingstarted\"/\u003e\u003c/a\u003e\n## 2. Getting Started\n[Back to ToC](#toc)\n\n\u003ca name=\"installation\"/\u003e\u003c/a\u003e\n###  2.1. Installation\nInstallation can be done using [PyPI](https://pypi.org/project/keybert/):\n\n```\npip install keybert\n```\n\nYou may want to install more depending on the transformers and language backends that you will be using. The possible installations are:\n\n```\npip install keybert[flair]\npip install keybert[gensim]\npip install keybert[spacy]\npip install keybert[use]\n```\n\nFor a lightweight installation without PyTorch, run the following to make use of [Model2Vec](https://github.com/MinishLab/model2vec) instead:\n\n```\npip install keybert --no-deps scikit-learn model2vec\n```\n\n\u003ca name=\"usage\"/\u003e\u003c/a\u003e\n###  2.2. Usage\n\nThe most minimal example can be seen below for the extraction of keywords:\n```python\nfrom keybert import KeyBERT\n\ndoc = \"\"\"\n         Supervised learning is the machine learning task of learning a function that\n         maps an input to an output based on example input-output pairs. 
It infers a\n         function from labeled training data consisting of a set of training examples.\n         In supervised learning, each example is a pair consisting of an input object\n         (typically a vector) and a desired output value (also called the supervisory signal).\n         A supervised learning algorithm analyzes the training data and produces an inferred function,\n         which can be used for mapping new examples. An optimal scenario will allow for the\n         algorithm to correctly determine the class labels for unseen instances. This requires\n         the learning algorithm to generalize from the training data to unseen situations in a\n         'reasonable' way (see inductive bias).\n      \"\"\"\nkw_model = KeyBERT()\nkeywords = kw_model.extract_keywords(doc)\n```\n\nYou can set `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:\n\n```python\n\u003e\u003e\u003e kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)\n[('learning', 0.4604),\n ('algorithm', 0.4556),\n ('training', 0.4487),\n ('class', 0.4086),\n ('mapping', 0.3700)]\n```\n\nTo extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number\nof words you would like in the resulting keyphrases:\n\n```python\n\u003e\u003e\u003e kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)\n[('learning algorithm', 0.6978),\n ('machine learning', 0.6305),\n ('supervised learning', 0.5985),\n ('algorithm analyzes', 0.5860),\n ('learning function', 0.5850)]\n```\n\nWe can highlight the keywords in the document by simply setting `highlight`:\n\n```python\nkeywords = kw_model.extract_keywords(doc, highlight=True)\n```\n\u003cimg src=\"images/highlight.png\" width=\"75%\" height=\"75%\" /\u003e\n\n\n**NOTE**: For a full overview of all possible transformer models see [sentence-transformer](https://www.sbert.net/docs/pretrained_models.html).\nI would advise either 
`\"all-MiniLM-L6-v2\"` for English documents or `\"paraphrase-multilingual-MiniLM-L12-v2\"`\nfor multi-lingual documents or any other language.\n\n\u003ca name=\"maxsum\"/\u003e\u003c/a\u003e\n###  2.3. Max Sum Distance\n\nTo diversify the results, we take the 2 x top_n most similar words/phrases to the document.\nThen, we take all top_n combinations from the 2 x top_n words and extract the combination\nwhose members are the least similar to each other by cosine similarity.\n\n```python\n\u003e\u003e\u003e kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',\n                              use_maxsum=True, nr_candidates=20, top_n=5)\n[('set training examples', 0.7504),\n ('generalize training data', 0.7727),\n ('requires learning algorithm', 0.5050),\n ('supervised learning algorithm', 0.3779),\n ('learning machine learning', 0.2891)]\n```\n\n\n\u003ca name=\"maximal\"/\u003e\u003c/a\u003e\n###  2.4. Maximal Marginal Relevance\n\nTo diversify the results, we can use Maximal Marginal Relevance (MMR) to create\nkeywords / keyphrases, which is also based on cosine similarity. 
The results\nwith **high diversity**:\n\n```python\n\u003e\u003e\u003e kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',\n                              use_mmr=True, diversity=0.7)\n[('algorithm generalize training', 0.7727),\n ('labels unseen instances', 0.1649),\n ('new examples optimal', 0.4185),\n ('determine class labels', 0.4774),\n ('supervised learning algorithm', 0.7502)]\n```\n\nThe results with **low diversity**:\n\n```python\n\u003e\u003e\u003e kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',\n                              use_mmr=True, diversity=0.2)\n[('algorithm generalize training', 0.7727),\n ('supervised learning algorithm', 0.7502),\n ('learning machine learning', 0.7577),\n ('learning algorithm analyzes', 0.7587),\n ('learning algorithm generalize', 0.7514)]\n```\n\n\n\u003ca name=\"embeddings\"/\u003e\u003c/a\u003e\n###  2.5. Embedding Models\nKeyBERT supports many embedding models that can be used to embed the documents and words:\n\n* Sentence-Transformers\n* Flair\n* Spacy\n* Gensim\n* USE\n\nClick [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.\n\n**Sentence-Transformers**  \nYou can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)\nand pass it through KeyBERT with `model`:\n\n```python\nfrom keybert import KeyBERT\nkw_model = KeyBERT(model='all-MiniLM-L6-v2')\n```\n\nOr select a SentenceTransformer model with your own parameters:\n\n```python\nfrom keybert import KeyBERT\nfrom sentence_transformers import SentenceTransformer\n\nsentence_model = SentenceTransformer(\"all-MiniLM-L6-v2\")\nkw_model = KeyBERT(model=sentence_model)\n```\n\n**Flair**  \n[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that\nis publicly available. 
Flair can be used as follows:\n\n```python\nfrom keybert import KeyBERT\nfrom flair.embeddings import TransformerDocumentEmbeddings\n\nroberta = TransformerDocumentEmbeddings('roberta-base')\nkw_model = KeyBERT(model=roberta)\n```\n\nYou can select any 🤗 transformers model [here](https://huggingface.co/models).\n\n\u003ca name=\"llms\"/\u003e\u003c/a\u003e\n## 3. Large Language Models\n[Back to ToC](#toc)\n\nWith `KeyLLM` you can now perform keyword extraction with Large Language Models (LLMs). You can find the full documentation [here](https://maartengr.github.io/KeyBERT/guides/keyllm.html), but there are two examples that are common with this new method. Make sure to install the OpenAI package through `pip install openai` before you start.\n\nFirst, we can ask OpenAI directly to extract keywords:\n\n```python\nimport openai\nfrom keybert.llm import OpenAI\nfrom keybert import KeyLLM\n\n# Create your LLM\nclient = openai.OpenAI(api_key=MY_API_KEY)\nllm = OpenAI(client)\n\n# Load it in KeyLLM\nkw_model = KeyLLM(llm)\n```\n\nThis will query any ChatGPT model and ask it to extract keywords from text.\n\nSecond, we can find documents that are likely to have the same keywords and only extract keywords for those. \nThis is much more efficient than asking for the keywords of every single document. There are likely documents that \nhave the exact same keywords. 
Doing so is straightforward:\n\n```python\nimport openai\nfrom keybert.llm import OpenAI\nfrom keybert import KeyLLM\nfrom sentence_transformers import SentenceTransformer\n\n# Extract embeddings\nmodel = SentenceTransformer('all-MiniLM-L6-v2')\nembeddings = model.encode(MY_DOCUMENTS, convert_to_tensor=True)\n\n# Create your LLM\nclient = openai.OpenAI(api_key=MY_API_KEY)\nllm = OpenAI(client)\n\n# Load it in KeyLLM\nkw_model = KeyLLM(llm)\n\n# Extract keywords\nkeywords = kw_model.extract_keywords(MY_DOCUMENTS, embeddings=embeddings, threshold=.75)\n```\n\nYou can use the `threshold` parameter to decide how similar documents need to be in order to receive the same keywords.\n\n## Citation\nTo cite KeyBERT in your work, please use the following bibtex reference:\n\n```bibtex\n@misc{grootendorst2020keybert,\n  author       = {Maarten Grootendorst},\n  title        = {KeyBERT: Minimal keyword extraction with BERT.},\n  year         = 2020,\n  publisher    = {Zenodo},\n  version      = {v0.3.0},\n  doi          = {10.5281/zenodo.4461265},\n  url          = {https://doi.org/10.5281/zenodo.4461265}\n}\n```\n\n## References\nBelow, you can find several resources that were used for the creation of KeyBERT\nbut most importantly, these are amazing resources for creating impressive keyword extraction models:\n\n**Papers**:\n* Sharma, P., \u0026 Li, Y. (2019). 
[Self-Supervised Contextual Keyword and Keyphrase Retrieval with Self-Labelling.](https://www.preprints.org/manuscript/201908.0073/download/final_file)\n\n**Github Repos**:\n* https://github.com/thunlp/BERT-KPE\n* https://github.com/ibatra/BERT-Keyword-Extractor\n* https://github.com/pranav-ust/BERT-keyphrase-extraction\n* https://github.com/swisscom/ai-research-keyphrase-extraction\n\n**MMR**:\nThe selection of keywords/keyphrases was modeled after:\n* https://github.com/swisscom/ai-research-keyphrase-extraction\n\n**NOTE**: If you find a paper or github repo that has an easy-to-use implementation\nof BERT-embeddings for keyword/keyphrase extraction, let me know! I'll make sure to\nadd a reference to this repo.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMaartenGr%2FKeyBERT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMaartenGr%2FKeyBERT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMaartenGr%2FKeyBERT/lists"}