https://github.com/boudinfl/pke
Python Keyphrase Extraction module
- Host: GitHub
- URL: https://github.com/boudinfl/pke
- Owner: boudinfl
- License: gpl-3.0
- Created: 2015-11-13T08:11:45.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2023-07-12T16:18:04.000Z (over 1 year ago)
- Last Synced: 2025-02-14T11:09:39.969Z (7 days ago)
- Topics: computational-linguistics, information-retrieval, keyphrase, keyphrase-extraction, keyword, keyword-extraction, natural-language-processing, python
- Language: Python
- Homepage:
- Size: 82.6 MB
- Stars: 1,577
- Watchers: 31
- Forks: 291
- Open Issues: 4
# `pke` - python keyphrase extraction
`pke` is an **open source** python-based **keyphrase extraction** toolkit. It
provides an end-to-end keyphrase extraction pipeline in which each component can
be easily modified or extended to develop new models. `pke` also allows for
easy benchmarking of state-of-the-art keyphrase extraction models, and
ships with supervised models trained on the
[SemEval-2010 dataset](http://aclweb.org/anthology/S10-1004).
## Table of Contents
* [Installation](#installation)
* [Minimal example](#minimal-example)
* [Getting started](#getting-started)
* [Implemented models](#implemented-models)
* [Model performances](#model-performances)
* [Citing pke](#citing-pke)

## Installation
To install `pke` from GitHub with pip:
```bash
pip install git+https://github.com/boudinfl/pke.git
```

`pke` relies on `spacy` (>= 3.2.3) for text processing and requires [models](https://spacy.io/usage/models) to be installed:
```bash
# download the english model
python -m spacy download en_core_web_sm
```

## Minimal example
`pke` provides a standardized API for extracting keyphrases from a document.
Start by typing the 5 lines below. To use another model, simply replace
`pke.unsupervised.TopicRank` with another model ([list of implemented models](#implemented-models)).

```python
import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be a simple
# test string and preprocessing is carried out using spacy
extractor.load_document(input='text', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
```

A detailed example is provided in the [`examples/`](examples/) directory.
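Since every model exposes the same four steps, swapping models can be reduced to passing a different class. The helper below is a minimal sketch of that idea, not part of `pke` itself: the function name `extract_keyphrases` and its `model_factory` parameter are hypothetical, and it assumes the chosen model works with its default parameters (models that need extra resources, such as a trained supervised model, may require arguments not shown here).

```python
from typing import Callable, List, Tuple


def extract_keyphrases(model_factory: Callable[[], object], text: str,
                       language: str = "en", n: int = 10) -> List[Tuple[str, float]]:
    """Run the four pipeline steps of the minimal example with any model.

    `model_factory` (a hypothetical name) is a model class such as
    `pke.unsupervised.TopicRank`, passed as a class and not as an instance.
    """
    extractor = model_factory()                              # initialize the model
    extractor.load_document(input=text, language=language)  # preprocess with spacy
    extractor.candidate_selection()                          # select keyphrase candidates
    extractor.candidate_weighting()                          # score the candidates
    return extractor.get_n_best(n=n)                         # (keyphrase, score) tuples


# Intended usage, assuming `pke` and the spaCy English model are installed:
#
#   import pke
#   for model in (pke.unsupervised.TopicRank, pke.unsupervised.MultipartiteRank):
#       print(model.__name__, extract_keyphrases(model, "some document text"))
```

Passing the class instead of an instance keeps each run independent, because a fresh extractor is created for every document.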
## Getting started
To get your hands dirty with `pke`, we invite you to try our tutorials out.
| Name | Link |
| ---------------------------------------------- | ---------- |
| Getting started with `pke` and keyphrase extraction | [Open in Colab](https://colab.research.google.com/github/keyphrasification/hands-on-with-pke/blob/main/part-1-graph-based-keyphrase-extraction.ipynb) |
| Model parameterization | [Open in Colab](https://colab.research.google.com/github/keyphrasification/hands-on-with-pke/blob/main/part-2-parameterization.ipynb) |
| Benchmarking models | [Open in Colab](https://colab.research.google.com/github/keyphrasification/hands-on-with-pke/blob/main/part-3-benchmarking-models.ipynb) |

## Implemented models
`pke` currently implements the following keyphrase extraction models:
* Unsupervised models
  * Statistical models
    * FirstPhrases
    * TfIdf
    * KPMiner [(El-Beltagy and Rafea, 2010)](http://www.aclweb.org/anthology/S10-1041.pdf)
    * YAKE [(Campos et al., 2020)](https://doi.org/10.1016/j.ins.2019.09.013)
  * Graph-based models
    * TextRank [(Mihalcea and Tarau, 2004)](http://www.aclweb.org/anthology/W04-3252.pdf)
    * SingleRank [(Wan and Xiao, 2008)](http://www.aclweb.org/anthology/C08-1122.pdf)
    * TopicRank [(Bougouin et al., 2013)](http://aclweb.org/anthology/I13-1062.pdf)
    * TopicalPageRank [(Sterckx et al., 2015)](http://users.intec.ugent.be/cdvelder/papers/2015/sterckx2015wwwb.pdf)
    * PositionRank [(Florescu and Caragea, 2017)](http://www.aclweb.org/anthology/P17-1102.pdf)
    * MultipartiteRank [(Boudin, 2018)](https://arxiv.org/abs/1803.08721)
* Supervised models
  * Feature-based models
    * Kea [(Witten et al., 2005)](https://www.cs.waikato.ac.nz/ml/publications/2005/chap_Witten-et-al_Windows.pdf)

## Model performances
For comparison purposes, overall results of implemented models on commonly-used benchmark datasets are available in [results](results.md).
Code for reproducing these experiments is in the [benchmarking](examples/benchmarking-models.ipynb) notebook
(also available on [Colab](https://colab.research.google.com/github/boudinfl/pke/blob/main/examples/benchmarking-models.ipynb)).

## Citing pke
If you use `pke`, please cite the following paper:
```
@InProceedings{boudin:2016:COLINGDEMO,
author = {Boudin, Florian},
title = {pke: an open source python-based keyphrase extraction toolkit},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
month = {December},
year = {2016},
address = {Osaka, Japan},
pages = {69--73},
url = {http://aclweb.org/anthology/C16-2015}
}
```