Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lindera-morphology/lindera
A morphological analysis library.
analyzer hacktoberfest library morphological tokenizer
Last synced: 4 months ago
JSON representation
A morphological analysis library.
- Host: GitHub
- URL: https://github.com/lindera-morphology/lindera
- Owner: lindera-morphology
- License: mit
- Created: 2020-01-22T14:27:57.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-02-28T10:57:29.000Z (4 months ago)
- Last Synced: 2024-02-28T11:51:27.591Z (4 months ago)
- Topics: analyzer, hacktoberfest, library, morphological, tokenizer
- Language: Rust
- Homepage:
- Size: 178 MB
- Stars: 329
- Watchers: 6
- Forks: 37
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Authors: AUTHORS
Lists
- awesome-stars - lindera-morphology/lindera - A multilingual morphological analysis library. (Rust)
- Awesome-Rust-MachineLearning - lindera-morphology/lindera - A morphological analysis library. (Natural Language Processing (preprocessing))
- awesome-stars - lindera-morphology/lindera - A multilingual morphological analysis library. (Rust)
- awesome-stars - lindera-morphology | 222 | (Rust)
README
# Lindera
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Join the chat at https://gitter.im/lindera-morphology/lindera](https://badges.gitter.im/lindera-morphology/lindera.svg)](https://gitter.im/lindera-morphology/lindera?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) [![Crates.io](https://img.shields.io/crates/v/lindera.svg)](https://crates.io/crates/lindera)
A morphological analysis library in Rust. This project is a fork of [kuromoji-rs](https://github.com/fulmicoton/kuromoji-rs).
Lindera aims to be a library that is easy to install and provides concise APIs for various Rust applications.
The following is required to build Lindera:
- Rust >= 1.46.0
## Tokenizer Usage
### Basic example
Put the following in Cargo.toml:
```
[dependencies]
lindera-tokenizer = { version = "0.24.0", features = ["ipadic"] }
```

This example covers the basic usage of Lindera.
It will:
- Create a tokenizer in normal mode
- Tokenize the input text
- Output the tokens

```rust
use lindera_core::{mode::Mode, LinderaResult};
use lindera_dictionary::{DictionaryConfig, DictionaryKind};
use lindera_tokenizer::tokenizer::{Tokenizer, TokenizerConfig};

fn main() -> LinderaResult<()> {
    let dictionary = DictionaryConfig {
        kind: Some(DictionaryKind::IPADIC),
        path: None,
    };

    let config = TokenizerConfig {
        dictionary,
        user_dictionary: None,
        mode: Mode::Normal,
    };

    // create tokenizer
    let tokenizer = Tokenizer::from_config(config)?;

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}
```

The above example can be run as follows:
```shell script
% cargo run --features=ipadic --example=ipadic_basic_example
```

You can see the result as follows:
```text
関西国際空港
限定
トートバッグ
```

### User dictionary example
You can give user dictionary entries along with the default system dictionary. The user dictionary should be a CSV file in the following format:

```
<surface_form>,<part_of_speech>,<reading>
```

For example:
```shell
% cat ./resources/simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
```

With a user dictionary, the extra entries are loaded when the `Tokenizer` is created, as the example below shows.
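As a brief aside before the `Tokenizer` example below: each row of the simple user dictionary is plain CSV, so its three fields (surface form, part-of-speech tag, reading) can be split with the Rust standard library alone. The helper below is a hypothetical sketch for illustration only; it is not part of Lindera's API, which performs this parsing and validation itself:

```rust
/// Hypothetical helper (not part of Lindera): split one simple
/// user-dictionary row into its three comma-separated fields.
fn split_userdic_row(row: &str) -> Option<(&str, &str, &str)> {
    let mut fields = row.split(',');
    match (fields.next(), fields.next(), fields.next(), fields.next()) {
        // Exactly three fields: surface form, part-of-speech tag, reading.
        (Some(surface), Some(pos), Some(reading), None) => Some((surface, pos, reading)),
        _ => None,
    }
}

fn main() {
    let row = "東京スカイツリー,カスタム名詞,トウキョウスカイツリー";
    if let Some((surface, pos, reading)) = split_userdic_row(row) {
        println!("surface={} pos={} reading={}", surface, pos, reading);
    }
}
```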
```rust
use std::path::PathBuf;use lindera_tokenizer::tokenizer::{Tokenizer, TokenizerConfig};
use lindera_core::viterbi::Mode;
use lindera_core::LinderaResult;fn main() -> LinderaResult<()> {
let dictionary = DictionaryConfig {
kind: Some(DictionaryKind::IPADIC),
path: None,
};let user_dictionary = Some(UserDictionaryConfig {
kind: DictionaryKind::IPADIC,
path: PathBuf::from("./resources/ipadic_simple_userdic.csv"),
});let config = TokenizerConfig {
dictionary,
user_dictionary,
mode: Mode::Normal,
};let tokenizer = Tokenizer::from_config(config)?;
// tokenize the text
let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です")?;// output the tokens
for token in tokens {
println!("{}", token.text);
}Ok(())
}
```The above example can be by `cargo run --example`:
```shell
% cargo run --features=ipadic --example=ipadic_userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です
```

## Analyzer Usage
### Basic example
Put the following in Cargo.toml:
```
[dependencies]
lindera-analyzer = { version = "0.24.0", features = ["ipadic", "ipadic-filter"] }
```

This example covers the basic usage of the Lindera Analysis Framework.
It will:
- Apply character filter for Unicode normalization (NFKC)
- Tokenize the input text with IPADIC
- Apply token filters to remove stop tags (by part-of-speech) and to stem Japanese katakana

```rust
use std::{fs, path::PathBuf};

use lindera_analyzer::analyzer::Analyzer;
use lindera_core::LinderaResult;

fn main() -> LinderaResult<()> {
    let path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("../resources")
        .join("lindera_ipadic_conf.json");
    let config_bytes = fs::read(path).unwrap();
    let analyzer = Analyzer::from_slice(&config_bytes).unwrap();

    let mut text = "Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。".to_string();
    println!("text: {}", text);

    // analyze the text
    let tokens = analyzer.analyze(&text)?;

    // output the tokens
    for token in tokens {
        println!(
            "token: {:?}, start: {:?}, end: {:?}, details: {:?}",
            token.text,
            token.byte_start,
            token.byte_end,
            token.details
        );
    }

    Ok(())
}
```

The above example can be run as follows:
```shell script
% cargo run --features=ipadic,ipadic-filter --example=analysis_example
```

You can see the result as follows:
```text
text: Linderaは形態素解析エンジンです。ユーザー辞書も利用可能です。
token: Lindera, start: 0, end: 21, details: Some(["UNK"])
token: 形態素, start: 24, end: 33, details: Some(["名詞", "一般", "*", "*", "*", "*", "形態素", "ケイタイソ", "ケイタイソ"])
token: 解析, start: 33, end: 39, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "解析", "カイセキ", "カイセキ"])
token: エンジン, start: 39, end: 54, details: Some(["名詞", "一般", "*", "*", "*", "*", "エンジン", "エンジン", "エンジン"])
token: ユーザ, start: 0, end: 26, details: Some(["名詞", "一般", "*", "*", "*", "*", "ユーザー", "ユーザー", "ユーザー"])
token: 辞書, start: 26, end: 32, details: Some(["名詞", "一般", "*", "*", "*", "*", "辞書", "ジショ", "ジショ"])
token: 利用, start: 35, end: 41, details: Some(["名詞", "サ変接続", "*", "*", "*", "*", "利用", "リヨウ", "リヨー"])
token: 可能, start: 41, end: 47, details: Some(["名詞", "形容動詞語幹", "*", "*", "*", "*", "可能", "カノウ", "カノー"])
```

## API reference

The API reference is available. Please see the following URL:
- lindera