https://github.com/surajiyer/spacycake

Simple keyphrase extraction extensions and pipeline components for spaCy.

keyphrase-extraction natural-language-processing nlp spacy spacy-extension spacy-pipeline

Host: GitHub
URL: https://github.com/surajiyer/spacycake
Owner: surajiyer
License: mit
Created: 2020-07-27T00:43:18.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2020-08-05T18:04:22.000Z (almost 5 years ago)
Last Synced: 2025-01-31T02:22:42.137Z (3 months ago)
Topics: keyphrase-extraction, natural-language-processing, nlp, spacy, spacy-extension, spacy-pipeline
Language: Jupyter Notebook
Homepage:
Size: 16.6 KB
Stars: 4
Watchers: 4
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

# spacycaKE: Keyphrase Extraction for spaCy

[spaCy v2.0](https://spacy.io/usage/v2) extension and pipeline component that adds keyphrase extraction metadata to `Doc` objects.

## Installation

`spacycaKE` requires `spacy` v2.0.0 or higher and `spacybert` v1.0.0 or higher.

## Usage

```
import spacy
from spacycake import BertKeyphraseExtraction as bake

nlp = spacy.load('en')
```

Then add `bake` to the spaCy pipeline:

```
cake = bake(nlp, from_pretrained='bert-base-cased', top_k=3)
nlp.add_pipe(cake, last=True)
```

Extract the keyphrases:

```
doc = nlp("This is a test but obviously you need to place a bigger document here to extract meaningful keyphrases")
print(doc._.extracted_phrases)  # <-- List of 3 keyphrases
```

## Available attributes

The extension sets the following attributes on the `Doc` object. You can change the attribute names when initializing the extension.

| attribute | type | description |
|-|-|-|
| `Doc._.bert_repr` | `torch.Tensor` | Document BERT embedding |
| `Doc._.noun_phrases` | `List[str]` | List of the candidate phrases from the document |
| `Doc._.extracted_phrases` | `List[str]` | List of the final extracted keyphrases |

## Settings

When initializing `bake`, you can set the following:

| name | type | default | description |
|-|-|-|-|
| `nlp` | `spacy.lang.(...)` | - | Only used to get the language vocabulary to initialize the phrase matcher |
| `from_pretrained` | `str` | `None` | Path to a BERT model directory or name of HuggingFace transformers pre-trained BERT weights, e.g., `bert-base-cased` |
| `attr_names` | `Tuple[str]` | `('bert_repr', 'noun_phrases', 'extracted_phrases')` | Names of the attributes set on the `._` property, in the order shown in the default |
| `force_extension` | `bool` | `True` | Whether to overwrite an existing extension attribute of the same name when the component is initialized again |
| `top_k` | `int` | 5 | Maximum number of extracted phrases |
| `mmr_lambda` | `float` | .5 | Maximal Marginal Relevance (MMR) lambda parameter, which controls the diversity of the extracted keyphrases. The closer to 1., the more diverse the results; the closer to 0., the more similar the extracted phrases are to the source document. |
| `kws` | `kwargs` | - | Additional keyword arguments passed to `spacybert.BertInference()` |
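
To make the interaction between `top_k` and `mmr_lambda` more concrete, here is a minimal, self-contained sketch of greedy MMR-style selection over toy embedding vectors. It is not taken from the spacycaKE source: the function names `cosine` and `mmr_select`, the toy vectors, and the exact weighting are assumptions, chosen so that values near 1 favour diversity and values near 0 favour similarity to the document, as the table describes. Note that many textbook MMR formulations weight the two terms the other way round.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(
    doc_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int = 5,
    mmr_lambda: float = 0.5,
) -> List[str]:
    """Greedily pick at most top_k phrases (hypothetical helper, not the library's API).

    Assumed weighting, chosen to match the settings table:
        score = (1 - mmr_lambda) * sim(phrase, doc)
                - mmr_lambda * max(sim(phrase, already selected))
    """
    # Similarity of every candidate phrase to the whole document.
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    selected: List[str] = []
    remaining = list(candidates)

    while remaining and len(selected) < top_k:
        def score(phrase: str) -> float:
            # Penalise phrases that resemble something already picked.
            redundancy = max(
                (cosine(candidates[phrase], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - mmr_lambda) * relevance[phrase] - mmr_lambda * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy 2-d "embeddings": "a" and "b" are near-duplicates close to the document,
# "c" is unrelated to both.
doc_vec = [1.0, 0.0]
cands = {"a": [1.0, 0.1], "b": [1.0, 0.2], "c": [0.0, 1.0]}

print(mmr_select(doc_vec, cands, top_k=2, mmr_lambda=0.0))  # ['a', 'b']
print(mmr_select(doc_vec, cands, top_k=2, mmr_lambda=0.9))  # ['a', 'c']
```

With `mmr_lambda=0.0` the two phrases closest to the document win even though they are near-duplicates; with `mmr_lambda=0.9` the second slot goes to the dissimilar phrase instead. In both cases `top_k` only caps the length of the result.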

## Roadmap

This extension is still experimental. Possible future updates include:

* Adding other keyphrase extraction methods.
