https://github.com/benja1972/topicphrase

Simple project for extraction of key-phrases from single document based on Sentence Trasfomers
https://github.com/benja1972/topicphrase

bert-embeddings clusters embeddings key-phrase-extraction nlp noun-phrases-candidates sentence-transformers topics

Last synced: 11 months ago
JSON representation

Simple project for extraction of key-phrases from single document based on Sentence Trasfomers

Host: GitHub
URL: https://github.com/benja1972/topicphrase
Owner: Benja1972
License: bsd-3-clause
Created: 2020-11-23T06:40:28.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2025-04-08T13:44:28.000Z (about 1 year ago)
Last Synced: 2025-04-08T14:38:34.206Z (about 1 year ago)
Topics: bert-embeddings, clusters, embeddings, key-phrase-extraction, nlp, noun-phrases-candidates, sentence-transformers, topics
Language: Python
Homepage:
Size: 3.25 MB
Stars: 7
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

Awesome Lists containing this project

README

          # Key-phrases extraction and topic modeling with Sentence Transformers

Simple code for extraction of key-phrases and group them in topics from a single document or set of documents based on dense vectors representations (embeddings). The Sentence Transformers [sentence-transformers](https://github.com/UKPLab/sentence-transformers) is used to embed the documents and key-phrases candidates. It combines several ideas from different packages. Core steps of pipe-line include:

- extract noun phrase candidates using spacy (for simplicity we take part of the  code from pke package [pke](https://github.com/boudinfl/pke));

- calculate embedding of phrases and original document with help of  Sentence Transformers;

- cluster key-phrase vectors with HDBSCAN to group them in topics (idea comes from nice [Top2Vec](https://github.com/ddangelov/Top2Vec));

- sort groups (topics) and key-phases inside clusters by relevance to original document.

```python

from topicphrase.key_topic import *

```

## Load data

For data we use Wikipedia article about [self-driving car](https://en.wikipedia.org/wiki/Self-driving_car)

```

    A self-driving car, also known as an autonomous vehicle (AV), connected and autonomous vehicle (CAV), full self-driving car or driverless car, or robo-car or robotic car, (automated vehicles and fully automated vehicles in the European Union) is a vehicle that is capable of sensing its environment and moving safely with little or no human input.

```

```python

f_in  = 'data/self-car.txt'

docs = []

with open(f_in, 'r') as fin:

    for dcc in fin:

        docs.append(dcc.strip('\r\n'))

doc = ' '.join(docs)

```

## Initiate key-phrases extractor

```python

kph = KeyPhraser()

```

## Fit documents

In will learn the keyphrases that match the defined part-of-speech or grammar pattern from the list of raw documents

and clusters keyphrases in topics based on their similarity.

```python

kph.fit(doc)

```

Extracted keyphrases are sorted as an vocabulary of the model. Now one can transform any list of raw documents to the document-keyphrase matrix against this keyphrases vocabulary.

```python

doc_matrix = kph.transform(raw_docs)

```

## Topics sorting by relevance to original document or centroids of clusters

### Sort by similarity to centroid of cluster and print  topics:

```python

wsr_c = kph.output_topn_topics()

pprint(wsr_c)

```

Extracted topics ranked by similarity to the *whole corpus* and phrased sorted by centroids

```sh

[(6,

  0.6340678,

  [('autonomous vehicle', 0.9270642),

   ('autonomous driving', 0.9261311),

   ('autonomous car', 0.92045784),

   ('autonomous system', 0.91655105),

   ('autonomous transportation', 0.9070121)]),

 (3,

  0.5180938,

  [('self-driving car', 0.94790494),

   ('self-driving vehicle', 0.90880316),

   ('full self-driving car', 0.904925),

   ('self-driving mode %', 0.89748883),

   ('self-driving car industry', 0.89390194)]),

 (27,

  0.4600281,

  [('radar perception', 0.8565458),

   ('visual perception', 0.83902097),

   ('perception', 0.8050251),

   ('visual object recognition', 0.7988533),

   ('computer vision', 0.78779423)]),

 (12,

  0.44258612,

  [('automotive vehicles', 0.91408443),

   ('automobiles', 0.8592148),

   ('car manufacturers', 0.8539634),

   ('cars', 0.85164714),

   ('automotive industry', 0.84958434)]),

 (11,

  0.43725145,

  [('driving function', 0.9042318),

   ('driving tasks', 0.89559156),

   ('driving features', 0.8852626),

   ('driving', 0.8605223),

   ('driving systems', 0.8573065)])]

```

### Sort by similarity to original document:

```python

wsr_d = kph.doc_topn_topics(doc_id=0)

pprint(wsr_d)

```

Extracted topics and phrases ranked by *similarity to the doc 0* in the corpus:

```sh

[(6,

  0.6340678,

  [('autonomous vehicles market', 0.583479),

   ('autonomous vehicle', 0.5822828),

   ('autonomous car', 0.57185763),

   ('autonomous vehicle industry', 0.56481636),

   ('many autonomous vehicles', 0.55988705)]),

 (3,

  0.5180938,

  [('self-driving car project', 0.52936566),

   ('modern self-driving cars', 0.5236278),

   ('self-driving-car testing', 0.50891966),

   ('self-driving car industry', 0.50497735),

   ('self-driving car story', 0.49697512)]),

 (27,

  0.4600281,

  [('ultrasonic sensors', 0.4619211),

   ('sensory data', 0.45013982),

   ('sensory information', 0.4288388),

   ('electronic blind-spot assistance', 0.38294083),

   ('visual object recognition', 0.36293906)]),

 (12,

  0.44258612,

  [('car navigation system', 0.4831993),

   ('vehicle control method', 0.47435156),

   ('car sensors.citation', 0.46583062),

   ('vehicle communication systems', 0.46028528),

   ('vehicle control', 0.45426378)]),

 (11,

  0.43725145,

  [('driver assistance technologies', 0.46755123),

   ('auto-piloted car', 0.46486485),

   ('advanced driver-assistance systems', 0.45932925),

   ('driving systems', 0.41803557),

   ('vehicle ai', 0.4142478)])]

```

## Granularity of clusters

Granularity of clusters could be controlled by decreasing `min_cluster_size`, the default is 10. Moreover one can filter kepyphrases by their counts in total corpus of documents by tuning `min_phrase_freq`. 

```python

kph = KeyPhraser(min_cluster_size = 6, min_phrase_freq = 5)

kph.fit(doc)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/benja1972/topicphrase

Awesome Lists containing this project

README