Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/surajiyer/spacycake

Simple keyphrase extraction extensions and pipeline components for spaCy.
https://github.com/surajiyer/spacycake

keyphrase-extraction natural-language-processing nlp spacy spacy-extension spacy-pipeline

Last synced: 3 months ago
JSON representation

Simple keyphrase extraction extensions and pipeline components for spaCy.

Awesome Lists containing this project

README

        

# spacycaKE: Keyphrase Extraction for spaCy
[spaCy v2.0](https://spacy.io/usage/v2) extension and pipeline component for Keyphrase Extraction methods meta data to `Doc` objects.

## Installation
`spacycaKE` requires `spacy` v2.0.0 or higher and `spacybert` v1.0.0 or higher.

## Usage
```
import spacy
from spacycake import BertKeyphraseExtraction as bake
nlp = spacy.load('en')
```

Then use `bake` as part of the spacy pipeline,
```
cake = bake(nlp, from_pretrained='bert-base-cased', top_k=3)
nlp.add_pipe(cake, last=True)
```

Extract the keyphrases.
```
doc = nlp("This is a test but obviously you need to place a bigger document here to extract meaningful keyphrases")
print(doc._.extracted_phrases) # <-- List of 3 keyphrases
```

## Available attributes
The extension sets attributes on the `Doc` object. You can change the attribute names on initializing the extension.
| | | |
|-|-|-|
| `Doc._.bert_repr` | `torch.Tensor` | Document BERT embedding |
| `Doc._.noun_phrases` | `List[str]` | List of the candidate phrases from the document |
| `Doc._.extracted_phrases` | `List[str]` | List of the final extracted keyphrases |

## Settings
On initialization of `bake`, you can define the following:

| name | type | default | description |
|-|-|-|-|
| `nlp` | `spacy.lang.(...)` | - | Only used to get the language vocabulary to initialize the phrase matcher |
| `from_pretrained` | `str` | `None` | Path to Bert model directory or name of HuggingFace transformers pre-trained Bert weights, e.g., `bert-base-cased` |
| `attr_names` | `Tuple[str]` | `('bert_repr', 'noun_phrases', 'extracted_phrases')` | Name of the various available attributes set to the `._` property (in order) |
| `force_extension` | `bool` | `True` | A boolean value to create the same 'Extension Attribute' upon being executed again |
| `top_k` | `int` | 5 | Max number of extracted phrases |
| `mmr_lambda` | `float` | .5 | Maximum Marginal Relevance lambda parameter. Used to control diversity of extracted keyphrases. Closer to 1., the more diverse the results. Closer to 0., the more similar the extracted phrases will be to the source document. |
| `kws` | `kwargs` | - | More keyword arguments to supply to `spacybert.BertInference()` |

## Roadmap
This extension is still experimental. Possible future updates include:
* Adding other keyphrase extraction methods.