https://github.com/tugascript/keyword-extraction-rs
Keyword extraction algorithms in Rust
- Host: GitHub
- URL: https://github.com/tugascript/keyword-extraction-rs
- Owner: tugascript
- License: lgpl-3.0
- Created: 2023-04-01T18:50:17.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-10-12T02:55:05.000Z (3 months ago)
- Last Synced: 2024-11-07T00:22:27.099Z (2 months ago)
- Topics: keyword-extraction, nlp
- Language: Rust
- Homepage: https://crates.io/crates/keyword_extraction
- Size: 266 KB
- Stars: 14
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: COPYING
- Codeowners: .github/CODEOWNERS
# Rust Keyword Extraction
## Introduction
This is a simple NLP library with a list of unsupervised keyword extraction algorithms:
- Tokenizer for tokenizing text;
- TF-IDF for calculating the importance of a word in one or more documents;
- Co-occurrence for calculating relationships between words within a specific window size;
- RAKE for extracting key phrases from a document;
- TextRank for extracting keywords and key phrases from a document;
- YAKE for extracting keywords with an n-gram size (defaults to 3) from a document.

## Algorithms
The full list of the algorithms in this library:
- Helper algorithms:
  - [x] Tokenizer
  - [x] Co-occurrence
- Keyword extraction algorithms:
  - [x] TF-IDF
  - [x] RAKE
  - [x] TextRank
  - [x] YAKE

## Usage
Add the library to your `Cargo.toml`:
```toml
[dependencies]
keyword_extraction = "1.5.0"
```

Or use `cargo add`:
```bash
cargo add keyword_extraction
```

### Features
It is possible to enable or disable features:
- `"tf_idf"`: TF-IDF algorithm;
- `"rake"`: RAKE algorithm;
- `"text_rank"`: TextRank algorithm;
- `"yake"`: YAKE algorithm;
- `"all"`: all algorithms and helpers;
- `"parallel"`: parallelization of the algorithms with Rayon;
- `"co_occurrence"`: Co-occurrence algorithm;

Default features: `["tf_idf", "rake", "text_rank"]`. By default all algorithms apart from `"co_occurrence"` and `"yake"` are enabled.

NOTE: the `"parallel"` feature is only recommended for large documents, as it trades memory for computation resources.
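As an illustration of the feature list above, a dependency entry that keeps the defaults and additionally turns on YAKE could look like the following. This is a sketch: `features` is the standard Cargo dependency key, the feature name comes from the list above, and the version is the one used earlier in this README.

```toml
[dependencies]
keyword_extraction = { version = "1.5.0", features = ["yake"] }
```

Adding Cargo's standard `default-features = false` key to the same entry would restrict the build to the listed features only. Whether each feature builds in isolation is not stated here, so treat that variant as something to verify.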
### Examples
For the stop words, you can use the `stop-words` crate:
```toml
[dependencies]
stop-words = "0.8.0"
```

For example, for English:
```rust
use stop_words::{get, LANGUAGE};

fn main() {
    // Stop words for the chosen language.
    let stop_words = get(LANGUAGE::English);
    // Punctuation marks passed to the algorithms below.
    let punctuation: Vec<String> = [
        ".", ",", ":", ";", "!", "?", "(", ")", "[", "]", "{", "}", "\"", "'",
    ].iter().map(|s| s.to_string()).collect();
    // ...
}
```

#### TF-IDF
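For intuition about what the scores mean: TF-IDF weights a word by how often it occurs in a document and discounts words that occur in many documents. The block below is a minimal, standard-library-only sketch of one textbook variant (raw term frequency, natural-log IDF, summed across documents). It is not this crate's implementation, and the helper names `tokenize` and `tf_idf_scores` are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercases each word, trims non-alphanumeric edges, and drops stop words.
/// (Hypothetical helper, not part of the crate.)
fn tokenize(text: &str, stop_words: &HashSet<&str>) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty() && !stop_words.contains(w.as_str()))
        .collect()
}

/// Sums tf * idf for each word across all documents, where
/// tf = count / document length and idf = ln(N / document frequency).
fn tf_idf_scores(documents: &[&str], stop_words: &HashSet<&str>) -> HashMap<String, f32> {
    let docs: Vec<Vec<String>> = documents.iter().map(|d| tokenize(d, stop_words)).collect();
    let n_docs = docs.len() as f32;

    // Document frequency: the number of documents that contain each word.
    let mut df: HashMap<&str, f32> = HashMap::new();
    for doc in &docs {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for word in unique {
            *df.entry(word).or_insert(0.0) += 1.0;
        }
    }

    let mut scores: HashMap<String, f32> = HashMap::new();
    for doc in &docs {
        if doc.is_empty() {
            continue;
        }
        // Raw counts of each word inside this document.
        let mut counts: HashMap<&str, f32> = HashMap::new();
        for word in doc {
            *counts.entry(word.as_str()).or_insert(0.0) += 1.0;
        }
        for (word, count) in counts {
            let tf = count / doc.len() as f32;
            let idf = (n_docs / df[word]).ln();
            *scores.entry(word.to_string()).or_insert(0.0) += tf * idf;
        }
    }
    scores
}
```

With `the` as the only stop word, calling `tf_idf_scores` on `"The cat sat."`, `"The dog sat."` and `"The cat ran."` scores `dog` (found in one document) above `cat` (found in two). The rest of this section covers the crate's own API.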
Create a `TfIdfParams` enum which can be one of the following:
1. Unprocessed Documents: `TfIdfParams::UnprocessedDocuments`;
2. Processed Documents: `TfIdfParams::ProcessedDocuments`;
3. Single Unprocessed Document/Text block: `TfIdfParams::TextBlock`;

```rust
use keyword_extraction::tf_idf::{TfIdf, TfIdfParams};

fn main() {
    // ... stop_words & punctuation
    let documents: Vec<String> = vec![
        "This is a test document.".to_string(),
        "This is another test document.".to_string(),
        "This is a third test document.".to_string(),
    ];

    let params = TfIdfParams::UnprocessedDocuments(&documents, &stop_words, Some(&punctuation));
    let tf_idf = TfIdf::new(params);
    let ranked_keywords: Vec<String> = tf_idf.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = tf_idf.get_ranked_word_scores(10);

    // ...
}
```

#### RAKE
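As background, RAKE splits the text into candidate phrases at stop words and punctuation, scores each word as its degree divided by its frequency, and scores a phrase as the sum of its word scores. The block below is a minimal, standard-library-only sketch of that textbook scheme. It is not this crate's implementation, and `candidate_phrases` and `rake_scores` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Splits text into candidate phrases at stop words and trailing punctuation.
/// (Hypothetical helper, not part of the crate.)
fn candidate_phrases(text: &str, stop_words: &HashSet<&str>) -> Vec<Vec<String>> {
    let mut phrases = Vec::new();
    let mut current: Vec<String> = Vec::new();
    for raw in text.split_whitespace() {
        let word = raw.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        let is_break = word.is_empty() || stop_words.contains(word.as_str());
        if !is_break {
            current.push(word);
        }
        // Trailing punctuation (as in "document.") also ends the phrase.
        let ends_with_punct = raw.ends_with(|c: char| !c.is_alphanumeric());
        if (is_break || ends_with_punct) && !current.is_empty() {
            phrases.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        phrases.push(current);
    }
    phrases
}

/// Returns phrases ranked by the sum of their word scores (degree / frequency).
fn rake_scores(text: &str, stop_words: &HashSet<&str>) -> Vec<(String, f32)> {
    let phrases = candidate_phrases(text, stop_words);

    // frequency: occurrences of the word; degree: total length of the
    // phrases the word appears in (the word itself included).
    let mut freq: HashMap<&str, f32> = HashMap::new();
    let mut degree: HashMap<&str, f32> = HashMap::new();
    for phrase in &phrases {
        for word in phrase {
            *freq.entry(word.as_str()).or_insert(0.0) += 1.0;
            *degree.entry(word.as_str()).or_insert(0.0) += phrase.len() as f32;
        }
    }

    let mut scored: HashMap<String, f32> = HashMap::new();
    for phrase in &phrases {
        let score: f32 = phrase
            .iter()
            .map(|w| degree[w.as_str()] / freq[w.as_str()])
            .sum();
        scored.insert(phrase.join(" "), score);
    }

    // Highest score first; ties are broken alphabetically for stable output.
    let mut ranked: Vec<(String, f32)> = scored.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

With the stop words `makes`, `and` and `is`, the sentence "Rust makes systems programming safe and keyword extraction is fun." yields four candidate phrases, and the three-word phrase `systems programming safe` ranks first because each of its words has degree 3. The rest of this section covers the crate's own API.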
Create a `RakeParams` enum which can be one of the following:
1. With defaults: `RakeParams::WithDefaults`;
2. With defaults and phrase length (phrase window size limit): `RakeParams::WithDefaultsAndPhraseLength`;
3. All: `RakeParams::All`;

```rust
use keyword_extraction::rake::{Rake, RakeParams};

fn main() {
    // ... stop_words
    let text = r#"
        This is a test document.
        This is another test document.
        This is a third test document.
    "#;

    let rake = Rake::new(RakeParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = rake.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = rake.get_ranked_word_scores(10);

    // ...
}
```

#### TextRank
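As background, TextRank links words that co-occur within a window and then runs a PageRank-style iteration over that graph, so well-connected words rank highest. The block below is a minimal, standard-library-only sketch of that idea on pre-tokenized input. It is not this crate's implementation: `co_occurrence_graph` and `text_rank` are hypothetical names, and the damping factor and iteration count are left as parameters rather than taken from the crate.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Builds an undirected co-occurrence graph: two distinct words are linked
/// when they appear within `window` tokens of each other (window = 2 links
/// adjacent tokens). Hypothetical helper, not part of the crate.
fn co_occurrence_graph<'a>(
    tokens: &[&'a str],
    window: usize,
) -> BTreeMap<&'a str, BTreeSet<&'a str>> {
    let mut graph: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for (i, &word) in tokens.iter().enumerate() {
        graph.entry(word).or_default();
        // Clamp the window to the token list; never let the range invert.
        let end = (i + window).min(tokens.len()).max(i + 1);
        for &other in &tokens[i + 1..end] {
            if other != word {
                graph.entry(word).or_default().insert(other);
                graph.entry(other).or_default().insert(word);
            }
        }
    }
    graph
}

/// Runs a fixed number of PageRank iterations over the co-occurrence graph
/// and returns the words sorted by descending score.
fn text_rank(
    tokens: &[&str],
    window: usize,
    damping: f32,
    iterations: usize,
) -> Vec<(String, f32)> {
    let graph = co_occurrence_graph(tokens, window);
    let n = graph.len() as f32;

    // Start from a uniform distribution over the words.
    let mut scores: BTreeMap<&str, f32> = graph.keys().map(|&w| (w, 1.0 / n)).collect();
    for _ in 0..iterations {
        let mut next = BTreeMap::new();
        for (&word, neighbours) in &graph {
            // Each neighbour shares its score equally among its own links.
            let incoming: f32 = neighbours
                .iter()
                .map(|nb| scores[nb] / graph[nb].len() as f32)
                .sum();
            next.insert(word, (1.0 - damping) / n + damping * incoming);
        }
        scores = next;
    }

    // Highest score first; ties are broken alphabetically for stable output.
    let mut ranked: Vec<(String, f32)> =
        scores.into_iter().map(|(w, s)| (w.to_string(), s)).collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

For the tokens `rust keyword extraction rust library rust nlp` with a window of 2 and a damping factor of 0.85, `rust` comes out on top because it is linked to every other word. The rest of this section covers the crate's own API.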
Create a `TextRankParams` enum which can be one of the following:
1. With defaults: `TextRankParams::WithDefaults`;
2. With defaults and phrase length (phrase window size limit): `TextRankParams::WithDefaultsAndPhraseLength`;
3. All: `TextRankParams::All`;

```rust
use keyword_extraction::text_rank::{TextRank, TextRankParams};

fn main() {
    // ... stop_words
    let text = r#"
        This is a test document.
        This is another test document.
        This is a third test document.
    "#;

    let text_rank = TextRank::new(TextRankParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = text_rank.get_ranked_words(10);
    let ranked_keywords_scores: Vec<(String, f32)> = text_rank.get_ranked_word_scores(10);
}
```

#### YAKE
Create a `YakeParams` enum which can be one of the following:
1. With defaults: `YakeParams::WithDefaults`;
2. All: `YakeParams::All`;

```rust
use keyword_extraction::yake::{Yake, YakeParams};

fn main() {
    // ... stop_words
    let text = r#"
        This is a test document.
        This is another test document.
        This is a third test document.
    "#;

    let yake = Yake::new(YakeParams::WithDefaults(text, &stop_words));
    let ranked_keywords: Vec<String> = yake.get_ranked_keywords(10);
    let ranked_keywords_scores: Vec<(String, f32)> = yake.get_ranked_keyword_scores(10);
    // ...
}
```

## Contributing
I would love your input! I want to make contributing to this project as easy and transparent as possible. Please read the [CONTRIBUTING.md](CONTRIBUTING.md) file for details.
## License
This project is licensed under the GNU Lesser General Public License v3.0. See the [Copying](COPYING)
and [Copying Lesser](COPYING.LESSER) files for details.