{"id":20833104,"url":"https://github.com/benja1972/topicphrase","last_synced_at":"2025-07-22T17:37:47.222Z","repository":{"id":169759009,"uuid":"315223485","full_name":"Benja1972/topicphrase","owner":"Benja1972","description":"Simple project for extraction of key-phrases from single document based on Sentence Trasfomers","archived":false,"fork":false,"pushed_at":"2025-04-08T13:44:28.000Z","size":3405,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-08T14:38:34.206Z","etag":null,"topics":["bert-embeddings","clusters","embeddings","key-phrase-extraction","nlp","noun-phrases-candidates","sentence-transformers","topics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Benja1972.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-23T06:40:28.000Z","updated_at":"2025-01-12T01:52:10.000Z","dependencies_parsed_at":"2023-10-19T16:30:47.905Z","dependency_job_id":"0b877660-3b49-4ad4-9878-a7383adf9012","html_url":"https://github.com/Benja1972/topicphrase","commit_stats":null,"previous_names":["benja1972/key-topic-bert"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Benja1972%2Ftopicphrase","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Benja1972%2Ftopicphrase/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Benja1972%2Ftopicphrase/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Benja1972%2Ftopicphrase/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Benja1972","download_url":"https://codeload.github.com/Benja1972/topicphrase/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252983761,"owners_count":21835758,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert-embeddings","clusters","embeddings","key-phrase-extraction","nlp","noun-phrases-candidates","sentence-transformers","topics"],"created_at":"2024-11-18T00:14:16.786Z","updated_at":"2025-05-08T01:40:56.037Z","avatar_url":"https://github.com/Benja1972.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Key-phrases extraction and topic modeling with Sentence Transformers\nSimple code for extracting key-phrases and grouping them into topics from a single document or a set of documents, based on dense vector representations (embeddings). The [sentence-transformers](https://github.com/UKPLab/sentence-transformers) package is used to embed the documents and the key-phrase candidates. It combines several ideas from different packages. 
Core steps of the pipeline include:\n- extract noun phrase candidates using spaCy (for simplicity, we take part of the code from the [pke](https://github.com/boudinfl/pke) package);\n- calculate embeddings of the phrases and the original document with the help of Sentence Transformers;\n- cluster key-phrase vectors with HDBSCAN to group them into topics (the idea comes from [Top2Vec](https://github.com/ddangelov/Top2Vec));\n- sort groups (topics) and key-phrases inside clusters by relevance to the original document.\n\n\n```python\nfrom pprint import pprint  # pretty-printer used in the examples below\n\nfrom topicphrase.key_topic import *\n```\n\n## Load data\nAs data we use the Wikipedia article about the [self-driving car](https://en.wikipedia.org/wiki/Self-driving_car):\n\n```\n    A self-driving car, also known as an autonomous vehicle (AV), connected and autonomous vehicle (CAV), full self-driving car or driverless car, or robo-car or robotic car, (automated vehicles and fully automated vehicles in the European Union) is a vehicle that is capable of sensing its environment and moving safely with little or no human input.\n```\n\n```python\nf_in  = 'data/self-car.txt'\ndocs = []\n\n# Read the article line by line, dropping trailing line breaks\nwith open(f_in, 'r') as fin:\n    for dcc in fin:\n        docs.append(dcc.strip('\\r\\n'))\n\n# Join all lines into a single document string\ndoc = ' '.join(docs)\n```\n\n## Initialize the key-phrase extractor\n\n\n```python\nkph = KeyPhraser()\n```\n\n\n## Fit documents\nIt learns the keyphrases that match the defined part-of-speech or grammar pattern from the list of raw documents\nand clusters the keyphrases into topics based on their similarity.\n\n```python\nkph.fit(doc)\n```\n\nThe extracted keyphrases form the vocabulary of the model. 
Now one can transform any list of raw documents into a document-keyphrase matrix against this keyphrase vocabulary.\n\n```python\n# raw_docs: a list of raw document strings\ndoc_matrix = kph.transform(raw_docs)\n```\n\n## Sorting topics by relevance to the original document or to cluster centroids\n\n### Sort by similarity to the cluster centroid and print topics:\n```python\nwsr_c = kph.output_topn_topics()\n\npprint(wsr_c)\n```\n\nExtracted topics ranked by similarity to the *whole corpus*, with phrases sorted by similarity to the cluster centroid:\n```sh\n[(6,\n  0.6340678,\n  [('autonomous vehicle', 0.9270642),\n   ('autonomous driving', 0.9261311),\n   ('autonomous car', 0.92045784),\n   ('autonomous system', 0.91655105),\n   ('autonomous transportation', 0.9070121)]),\n (3,\n  0.5180938,\n  [('self-driving car', 0.94790494),\n   ('self-driving vehicle', 0.90880316),\n   ('full self-driving car', 0.904925),\n   ('self-driving mode %', 0.89748883),\n   ('self-driving car industry', 0.89390194)]),\n (27,\n  0.4600281,\n  [('radar perception', 0.8565458),\n   ('visual perception', 0.83902097),\n   ('perception', 0.8050251),\n   ('visual object recognition', 0.7988533),\n   ('computer vision', 0.78779423)]),\n (12,\n  0.44258612,\n  [('automotive vehicles', 0.91408443),\n   ('automobiles', 0.8592148),\n   ('car manufacturers', 0.8539634),\n   ('cars', 0.85164714),\n   ('automotive industry', 0.84958434)]),\n (11,\n  0.43725145,\n  [('driving function', 0.9042318),\n   ('driving tasks', 0.89559156),\n   ('driving features', 0.8852626),\n   ('driving', 0.8605223),\n   ('driving systems', 0.8573065)])]\n\n```\n\n### Sort by similarity to the original document:\n```python\nwsr_d = kph.doc_topn_topics(doc_id=0)\npprint(wsr_d)\n```\n\nExtracted topics and phrases ranked by *similarity to doc 0* in the corpus:\n```sh\n[(6,\n  0.6340678,\n  [('autonomous vehicles market', 0.583479),\n   ('autonomous vehicle', 0.5822828),\n   ('autonomous car', 0.57185763),\n   ('autonomous vehicle industry', 0.56481636),\n   ('many autonomous vehicles', 
0.55988705)]),\n (3,\n  0.5180938,\n  [('self-driving car project', 0.52936566),\n   ('modern self-driving cars', 0.5236278),\n   ('self-driving-car testing', 0.50891966),\n   ('self-driving car industry', 0.50497735),\n   ('self-driving car story', 0.49697512)]),\n (27,\n  0.4600281,\n  [('ultrasonic sensors', 0.4619211),\n   ('sensory data', 0.45013982),\n   ('sensory information', 0.4288388),\n   ('electronic blind-spot assistance', 0.38294083),\n   ('visual object recognition', 0.36293906)]),\n (12,\n  0.44258612,\n  [('car navigation system', 0.4831993),\n   ('vehicle control method', 0.47435156),\n   ('car sensors.citation', 0.46583062),\n   ('vehicle communication systems', 0.46028528),\n   ('vehicle control', 0.45426378)]),\n (11,\n  0.43725145,\n  [('driver assistance technologies', 0.46755123),\n   ('auto-piloted car', 0.46486485),\n   ('advanced driver-assistance systems', 0.45932925),\n   ('driving systems', 0.41803557),\n   ('vehicle ai', 0.4142478)])]\n\n```\n\n## Granularity of clusters\nCluster granularity can be controlled by decreasing `min_cluster_size` (the default is 10). Moreover, one can filter keyphrases by their counts in the whole corpus of documents by tuning `min_phrase_freq`. \n\n```python\nkph = KeyPhraser(min_cluster_size = 6, min_phrase_freq = 5)\n\nkph.fit(doc)\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenja1972%2Ftopicphrase","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenja1972%2Ftopicphrase","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenja1972%2Ftopicphrase/lists"}