# Rust-based Natural Language Toolkit (rsnltk)
A Rust library supporting natural language processing with both pure Rust implementations and Python bindings.

[Rust Docs](https://docs.rs/rsnltk/0.1.1) | [Crates Home Page](https://crates.io/crates/rsnltk) | [Tests](https://github.com/dhchenx/rsnltk/tree/main/tests) | [NER-Kit](https://pypi.org/project/ner-kit/)

## Features
The `rsnltk` library integrates various existing Python-written NLP toolkits for powerful text analysis in Rust-based applications.

## Functions
This toolkit is based on the Python-written [Stanza](https://stanfordnlp.github.io/stanza/) and other important NLP crates. The functions we bind from Stanza and others include:
- Tokenize
- Sentence Segmentation
- Multi-Word Token Expansion
- Part-of-Speech & Morphological Features
- Named Entity Recognition
- Sentiment Analysis
- Language Identification
- Dependency Tree Analysis

Some excellent crates are also included in `rsnltk`, with simplified APIs for practical use:
- [word2vec](https://crates.io/crates/word2vec)
- [natural](https://crates.io/crates/natural), [yn](https://crates.io/crates/yn), [whatlang](https://crates.io/crates/whatlang)

Additionally, you can calculate the similarity between words based on WordNet through the `semantic-kit` PyPI project, installed via `pip install semantic-kit`.
## Installation
1. Make sure Python 3.6.6+ and pip are installed on your computer; typing `python -V` in the terminal should print the version without errors;
2. Install our Python-based [ner-kit](https://pypi.org/project/ner-kit/) package (version >= 0.0.5a2), which binds the `Stanza` package, via `pip install ner-kit==0.0.5a2`;
3. Rust should also be installed on your computer. I use IntelliJ to develop Rust-based applications, but any Rust-capable editor will do;
4. Create a simple Rust application project with a `main()` function.
5. Add the `rsnltk` dependency to your `Cargo.toml` file, using the latest version (see the example `Cargo.toml` entry at the end of this section).
6. After adding the `rsnltk` dependency to `Cargo.toml`, install the necessary Stanza language models using the following Rust code the first time you use this package:
```rust
fn init_rsnltk_and_test(){
    // 1. First, install the necessary language models
    //    using language codes; here we install two models,
    //    for English and Chinese text analysis.
    let list_lang = vec!["en", "zh"];
    download_langs(list_lang);
    // 2. Then run a test NLP task, e.g. NER.
    let text = "I like Beijing!";
    let lang = "en";
    // Uncomment the lines below for Chinese NER instead:
    // let text = "我喜欢北京、上海和纽约!";
    // let lang = "zh";
    let list_ner = ner(text, lang);
    for ner in list_ner {
        println!("{:?}", ner);
    }
}
```

Alternatively, you can manually install those [language models](https://stanfordnlp.github.io/stanza/available_models.html) via the Python-written [ner-kit](https://pypi.org/project/ner-kit/) package, which provides more features for working with Stanza.
If the above example runs without errors, the setup works. You can then try the following more advanced examples.
So far we have tested the English and Chinese language models; other language models should work as well.
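For reference (step 5 above), a minimal `Cargo.toml` dependency entry looks like the following; the version shown matches the docs linked above and may be outdated, so check [crates.io](https://crates.io/crates/rsnltk) for the latest release:
```toml
[dependencies]
# Use the latest version published on crates.io.
rsnltk = "0.1.1"
```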
## Examples with Stanza Bindings
Example 1: Part-of-speech Analysis
```rust
fn test_pos(){
    // let text = "我喜欢北京、上海和纽约!";
    // let lang = "zh";
    let text = "I like apple";
    let lang = "en";
    let list_result = pos(text, lang);
    for word in list_result {
        println!("{:?}", word);
    }
}
```

Example 2: Sentiment Analysis
```rust
fn test_sentiment(){
    // let text = "I like Beijing!";
    // let lang = "en";
    let text = "我喜欢北京";
    let lang = "zh";
    let sentiments = sentiment(text, lang);
    for sen in sentiments {
        println!("{:?}", sen);
    }
}
```

Example 3: Named Entity Recognition
```rust
fn test_ner(){
    // 1. For English NER:
    let text = "I like Beijing!";
    let lang = "en";
    // 2. Uncomment the lines below for Chinese NER instead:
    // let text = "我喜欢北京、上海和纽约!";
    // let lang = "zh";
    let list_ner = ner(text, lang);
    for ner in list_ner {
        println!("{:?}", ner);
    }
}
```

Example 4: Tokenize for Multiple Languages
```rust
fn test_tokenize(){
    let text = "我喜欢北京、上海和纽约!";
    let lang = "zh";
    let list_result = tokenize(text, lang);
    for token in list_result {
        println!("{:?}", token);
    }
}
```

Example 5: Tokenize Sentences
```rust
fn test_tokenize_sentence(){
    let text = "I like apple. Do you like it? No, I am not sure!";
    let lang = "en";
    let list_sentences = tokenize_sentence(text, lang);
    for sentence in list_sentences {
        println!("Sentence: {}", sentence);
    }
}
```

Example 6: Language Identification
```rust
fn test_lang(){
    let list_text = vec!["I like Beijing!",
                         "我喜欢北京!",
                         "Bonjour le monde!"];
    let list_result = lang(list_text);
    for lang in list_result {
        println!("{:?}", lang);
    }
}
```

Example 7: Multi-Word Token (MWT) Expansion
```rust
fn test_mwt_expand(){
    let text = "Nous avons atteint la fin du sentier.";
    let lang = "fr";
    // In French, "du" should be expanded into "de" + "le".
    let _list_result = mwt_expand(text, lang);
}
```

Example 8: Estimate the similarity between words in WordNet
You first need to install the `semantic-kit` PyPI package via `pip install semantic-kit`!
```rust
fn test_wordnet_similarity(){
    // Synsets are given in WordNet's word.pos.sense-number notation.
    let s1 = "dog.n.1";
    let s2 = "cat.n.2";
    let sims = wordnet_similarity(s1, s2);
    for sim in sims {
        println!("{:?}", sim);
    }
}
```

Example 9: Obtain a dependency tree from a text
```rust
fn test_dependency_tree(){
    let text = "I like you. Do you like me?";
    let lang = "en";
    let list_results = dependency_tree(text, lang);
    for list_token in list_results {
        for token in list_token {
            println!("{:?}", token);
        }
    }
}
```

## Examples in Pure Rust
Example 1: Word2Vec similarity
```rust
fn test_open_wv_bin(){
    // Load a pre-trained word2vec model in binary format.
    let wv_model = wv_get_model("GoogleNews-vectors-negative300.bin");
    // Classic analogy: king - man + woman ≈ queen.
    let positive = vec!["woman", "king"];
    let negative = vec!["man"];
    println!("analogy: {:?}", wv_analogy(&wv_model, positive, negative, 10));
    println!("cosine: {:?}", wv_cosine(&wv_model, "man", 10));
}
```

Example 2: Text summarization
```rust
use rsnltk::native::summarizer::*;

fn test_summarize(){
    let text = "Some large text...";
    let stopwords = &[];
    // The third argument sets the length of the summary.
    let summarized_text = summarize(text, stopwords, 5);
    println!("{}", summarized_text);
}
```

Example 3: Get token list from English strings
```rust
use rsnltk::native::token::get_token_list;

fn test_get_token_list(){
    let s = "Hello, Rust. How are you?";
    let result = get_token_list(s);
    for r in result {
        println!("{}\t{:?}", r.text, r);
    }
}
```

Example 4: Word segmentation for languages in which no space exists between terms, e.g. Chinese text.
We implement three word segmentation methods in this version (a sketch of the underlying idea follows this example):
- Forward Maximum Matching (fmm), the baseline method
- Backward Maximum Matching (bmm), which is considered better
- Bidirectional Maximum Matching (bimm), high accuracy but lower speed
```rust
use rsnltk::native::segmentation::*;

fn test_real_word_segmentation(){
    let dict_path = "30wdict.txt";          // empty if only tokenizing
    let stop_path = "baidu_stopwords.txt";  // empty when no stop words are used
    let _sentence = "美国太空总署希望,在深海的探险发现将有助于解开一些外太空的秘密,\
        同时也可以测试前往太阳系其他星球探险所需的一些设备和实验。";
    // "bimm" can be changed to "fmm" or "bmm".
    let meaningful_words = get_segmentation(_sentence, dict_path, stop_path, "bimm");
    println!("Result: {:?}", meaningful_words);
}
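```

For intuition, forward maximum matching (fmm) scans the sentence from the left and greedily takes the longest dictionary word starting at the current position (falling back to a single character); bmm does the same from the right, and bimm runs both and keeps the better segmentation. Below is a minimal, self-contained sketch of fmm only; `fmm_segment` is a hypothetical illustration of the idea, not the crate's implementation:
```rust
use std::collections::HashSet;

// Hypothetical illustration of forward maximum matching (fmm);
// not part of the rsnltk API.
fn fmm_segment(sentence: &str, dict: &HashSet<String>, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = sentence.chars().collect();
    let mut result = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Try the longest window first, then shrink it.
        let mut end = usize::min(i + max_len, chars.len());
        while end > i + 1 {
            let candidate: String = chars[i..end].iter().collect();
            if dict.contains(&candidate) {
                break;
            }
            end -= 1;
        }
        // Push either a dictionary word or a single fallback character.
        result.push(chars[i..end].iter().collect());
        i = end;
    }
    result
}

fn main() {
    let dict: HashSet<String> = ["喜欢", "北京"].iter().map(|s| s.to_string()).collect();
    println!("{:?}", fmm_segment("我喜欢北京", &dict, 4)); // ["我", "喜欢", "北京"]
}
```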
## Credits
Thanks to the [Stanford NLP Group](https://github.com/stanfordnlp/stanza) for their hard work on [Stanza](https://stanfordnlp.github.io/stanza/).
## License
The `rsnltk` library is provided by [Donghua Chen](https://github.com/dhchenx) under the MIT License.