{"id":48550035,"url":"https://github.com/yaniv-shulman/chunkey-bert","last_synced_at":"2026-04-08T08:04:00.076Z","repository":{"id":325918031,"uuid":"806867028","full_name":"yaniv-shulman/chunkey-bert","owner":"yaniv-shulman","description":"ChunkeyBert is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings for unsupervised keyphrase extraction from long text documents.","archived":false,"fork":false,"pushed_at":"2024-06-07T06:03:20.000Z","size":306,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-05T06:51:22.760Z","etag":null,"topics":["keyphrase-extraction","keyword-extraction","machine-learning","mit-license","nlp","nlp-keywords-extraction","nlp-machine-learning","python","python3","topic-modeling","unsupervised-machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yaniv-shulman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-05-28T04:10:25.000Z","updated_at":"2024-06-07T06:32:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/yaniv-shulman/chunkey-bert","commit_stats":null,"previous_names":["yaniv-shulman/chunkey-bert"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/yaniv-shulman/chunkey-bert","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/yaniv-shulman%2Fchunkey-bert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yaniv-shulman%2Fchunkey-bert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yaniv-shulman%2Fchunkey-bert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yaniv-shulman%2Fchunkey-bert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yaniv-shulman","download_url":"https://codeload.github.com/yaniv-shulman/chunkey-bert/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yaniv-shulman%2Fchunkey-bert/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31545909,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T16:28:08.000Z","status":"online","status_checked_at":"2026-04-08T02:00:06.127Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["keyphrase-extraction","keyword-extraction","machine-learning","mit-license","nlp","nlp-keywords-extraction","nlp-machine-learning","python","python3","topic-modeling","unsupervised-machine-learning"],"created_at":"2026-04-08T08:03:42.459Z","updated_at":"2026-04-08T08:04:00.067Z","avatar_url":"https://github.com/yaniv-shulman.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Tests](https://github.com/yaniv-shulman/chunkey-bert/actions/work
flows/linting_and_tests.yml/badge.svg?branch=main)\n[![phorm.ai](https://img.shields.io/badge/ask%20phorm.ai-8A2BE2)](https://www.phorm.ai/query?projectId=f7ddaf97-2b90-4515-a364-855258454655)\n[![Pyversions](https://img.shields.io/pypi/pyversions/chunkey-bert.svg?style=flat-square)](https://pypi.python.org/pypi/chunkey-bert)\n\n# ChunkeyBERT - Unsupervised Keyword Extraction from Long Documents #\n## Overview ##\nChunkeyBert is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings for unsupervised \nkeyphrase extraction from text documents. ChunkeyBert is a modification of the \n[KeyBERT method](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea) to handle documents of \narbitrary length with better results. ChunkeyBERT works by chunking the documents and using KeyBERT to extract candidate\nkeywords/keyphrases from all chunks, followed by a similarity-based selection stage that produces the final keywords for the\nentire document. ChunkeyBert can use any document chunking method as long as it can be wrapped in a simple function; \nhowever, it can also work without a chunker and process the entire document as a single chunk. ChunkeyBert works with any\nconfiguration of KeyBERT and can handle batches of documents. \n\n## Installation ##\nInstall from [PyPI](https://pypi.org/project/chunkey-bert/) using pip (preferred method):\n```bash\npip install chunkey-bert\n```\n\n## Details ##\n### How does ChunkeyBERT differ from KeyBERT? ###\nChunkeyBERT differs from KeyBERT primarily in its approach to handling long documents for keyword extraction. While \nKeyBERT directly applies keyword extraction techniques to the entire document, ChunkeyBERT introduces an additional step\nof chunking the document into smaller, manageable pieces before applying KeyBERT's keyword extraction methods. 
This \nmodification aims to improve the performance and relevance of the extracted keywords, especially for longer documents \nwhere directly applying KeyBERT might not yield optimal results due to the complexity and size of the document. Here are\nthe key differences:\n\n**Document chunking**: ChunkeyBERT uses a chunking method to divide a long document into smaller chunks. This is done through the `chunker` \nparameter of the `extract_keywords` method. The chunker can be any callable that takes a string (the document) and returns\na list of strings (the chunks). This allows ChunkeyBERT to process each chunk independently, making it more effective at\nhandling long documents. A chunker could be as simple as \n```python\nfrom typing import Callable, List\n\n# Split on blank lines and keep only paragraphs longer than 25 characters\nchunker: Callable[[str], List[str]] = lambda text: [t for t in text.split(\"\\n\\n\") if len(t) \u003e 25]\n```\nor it can wrap more complicated logic, such as a LangChain chunker.\n\n**Handling of chunks**: After chunking, ChunkeyBERT applies KeyBERT's keyword extraction to each chunk separately.\n\n**Keyword scoring and selection**: ChunkeyBERT introduces additional logic to score and select keywords based on their occurrence across different chunks \nand their similarity. \n\n### Flexibility in keyword extraction ###\nChunkeyBERT offers flexibility in keyword extraction in a number of ways. It can work with any configuration of KeyBERT\nand exposes a superset of KeyBERT's extract_keywords() API, which allows fine-tuning of the keyword extraction process \nbased on the characteristics of the chunks and the overall document. It can also work with any chunking method, including\nsemantic chunking, chunk filtering, and even sampling from the document, to fine-tune the process. 
ChunkeyBERT can be \nconfigured to consider the multiplicity of keywords across chunks to account for repetitions.\n\n### Batching and GPU support ###\nChunkeyBERT works with document batches and attempts to process these batches in parallel on the GPU if possible.\n\n### Compatible with KeyBERT return values ###\nChunkeyBERT returns results in a format similar to KeyBERT's but can also optionally return the embeddings for each of the\nkeywords extracted.\n\n## Usage ##\n\nThe following steps describe a basic example of how to use ChunkeyBert for keyword extraction:\n\n**Install ChunkeyBert**: First, ensure that ChunkeyBert is installed in your environment. You can install it using pip as\nshown below:\n\n```bash\npip install chunkey-bert\n```\n**Import required libraries**: Import the necessary libraries, including ChunkeyBert, KeyBERT, and any other dependencies you\nmight need for your specific use case.\n\n```python\nfrom keybert import KeyBERT\nfrom sentence_transformers import SentenceTransformer\nfrom chunkey_bert.model import ChunkeyBert\n```\n\n**Initialize KeyBERT**: This can be done, for example, using a Sentence Transformer model that generates embeddings\nfor the text. _Note that the quality of extracted keywords depends greatly on how KeyBERT is configured_, so you \nneed to understand how to use KeyBERT effectively.\n\n```python\nsentence_model = SentenceTransformer(model_name_or_path=\"all-MiniLM-L6-v2\")\nkeybert = KeyBERT(model=sentence_model)\n```\n\n**Define a chunker function (optional)**: If you want to chunk your text into smaller parts (which is the main feature of\nChunkeyBert), define a chunker function. This function takes a string and returns a list of strings (chunks). If you\ndon't provide a chunker, ChunkeyBert will process the entire document as a single chunk but will still apply a different\nkeyword selection method from the one KeyBERT uses. 
Here is an example of a very simple chunker:\n\n```python\nchunker = lambda text: [t for t in text.split(\"\\n\\n\") if len(t) \u003e 25]  # Splits text into paragraphs and keeps those longer than 25 characters\n```\n\n**Create a ChunkeyBert instance**: Initialize ChunkeyBert with the KeyBERT instance you created earlier.\n\n```python\nchunkey_bert = ChunkeyBert(keybert=keybert)\n```\n**Extract keywords**: Use the extract_keywords method of ChunkeyBert to extract keywords from your document. You can specify\nthe number of keywords, the chunker to use (if any), and other parameters related to keyword extraction and to \nKeyBERT.extract_keywords.\n\n```python\ntext = \"Your long document text goes here...\"\nkeywords = chunkey_bert.extract_keywords(\n    docs=text, \n    num_keywords=10, \n    chunker=chunker,  # Pass your chunker here. If None, the entire document is treated as a single chunk.\n    top_n=3,  # Number of keywords to extract from each chunk\n    nr_candidates=20,  # Number of candidate keywords/keyphrases to consider from each chunk\n)\nprint(keywords)\n```\nSee a more advanced example in this notebook: https://nbviewer.org/github/yaniv-shulman/chunkey-bert/tree/main/src/experiments/\n\n## Experimental results ##\nVery limited experimental results and a demonstration of the library on a small number of documents are available at \nhttps://nbviewer.org/github/yaniv-shulman/chunkey-bert/tree/main/src/experiments/.\n\n## Contribution and feedback ##\nContributions and feedback are most welcome. Please see\n[CONTRIBUTING.md](https://github.com/yaniv-shulman/chunkey-bert/tree/main/CONTRIBUTING.md) for further details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyaniv-shulman%2Fchunkey-bert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyaniv-shulman%2Fchunkey-bert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyaniv-shulman%2Fchunkey-bert/lists"}