Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Yasu-umi/sudachiclone-rs
sudachiclone-rs is a Rust version of Sudachi, a Japanese morphological analyzer.
Last synced: 8 days ago
- Host: GitHub
- URL: https://github.com/Yasu-umi/sudachiclone-rs
- Owner: Yasu-umi
- License: apache-2.0
- Created: 2020-02-15T20:30:46.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-02-24T15:58:51.000Z (over 3 years ago)
- Last Synced: 2024-10-30T16:51:46.447Z (13 days ago)
- Language: Rust
- Homepage:
- Size: 395 KB
- Stars: 8
- Watchers: 2
- Forks: 2
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-Rust-MachineLearning - Yasu-umi/sudachiclone-rs - sudachiclone-rs is a Rust version of Sudachi, a Japanese morphological analyzer. (Natural Language Processing (preprocessing))
README
# sudachiclone-rs - SudachiPyClone by rust
[![sudachiclone at crates.io](https://img.shields.io/crates/v/sudachiclone.svg)](https://crates.io/crates/sudachiclone)
[![sudachiclone at docs.rs](https://docs.rs/sudachiclone/badge.svg)](https://docs.rs/sudachiclone)
[![Actions Status](https://github.com/Yasu-umi/sudachiclone-rs/workflows/test/badge.svg)](https://github.com/Yasu-umi/sudachiclone-rs/actions)

sudachiclone-rs is a Rust version of [Sudachi](https://github.com/WorksApplications/sudachi), a Japanese morphological analyzer.
## Install CLI
### Setup 1. Install sudachiclone
sudachiclone is distributed via [crates.io](https://crates.io/crates/sudachiclone). Install it by running `cargo install sudachiclone` from the command line.
```bash
$ cargo install sudachiclone
```

### Setup 2. Install dictionary
The default dictionary package, SudachiDict_core, is distributed from the WorksApplications download site. Install it with `pip install` as shown below:
```bash
$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20200127.tar.gz
```

## Usage CLI
After installing sudachiclone, you can also use it from the terminal via the `sudachiclone` command.
You can run `sudachiclone` on standard input like this:
```bash
$ sudachiclone
```

`sudachiclone` has 4 subcommands (default: `tokenize`)
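The standard-input usage can also be scripted from a Rust program by spawning the CLI as a child process. The following is a minimal sketch, not part of sudachiclone itself: `tokenize_args` and `run_tokenize` are hypothetical helper names, only the `-m`, `-a`, and `-o` flags listed in `sudachiclone tokenize -h` are used, and it assumes the `sudachiclone` binary is on `PATH` and reads standard input when no input files are given.

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Hypothetical helper: build the argument list for `sudachiclone tokenize`.
/// Only flags shown in `sudachiclone tokenize -h` are used: -m, -a, -o.
fn tokenize_args(mode: &str, all_fields: bool, output: Option<&str>) -> Vec<String> {
    let mut args = vec!["tokenize".to_string(), "-m".to_string(), mode.to_string()];
    if all_fields {
        args.push("-a".to_string());
    }
    if let Some(path) = output {
        args.push("-o".to_string());
        args.push(path.to_string());
    }
    args
}

/// Hypothetical helper: run the CLI, feed `text` on stdin, and return its stdout.
/// Assumes the `sudachiclone` binary is installed and on PATH.
fn run_tokenize(text: &str, args: &[String]) -> std::io::Result<String> {
    let mut child = Command::new("sudachiclone")
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // Write the input; the handle is dropped at the end of the statement,
    // which closes stdin so the CLI sees end-of-input.
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(text.as_bytes())?;
    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() {
    let args = tokenize_args("A", false, None);
    match run_tokenize("国家公務員\n", &args) {
        Ok(out) => print!("{}", out),
        Err(err) => eprintln!("could not run sudachiclone: {}", err),
    }
}
```

Keeping argument construction separate from process spawning makes the flag handling easy to check without the binary or a dictionary installed.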
```bash
$ sudachiclone -h
Japanese Morphological Analyzer

USAGE:
    sudachiclone [FLAGS] [OPTIONS] [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -q               Silence all output
    -V, --version    Prints version information
    -v               Increase message verbosity

OPTIONS:
    -z    prepend timestamp to log lines [possible values: none, sec, ms, ns]

SUBCOMMANDS:
    build       Build Sudachi Dictionary
    help        Prints this message or the help of the given subcommand(s)
    link        Link Default Dict Package
    tokenize    Tokenize Text
    ubuild      Build User Dictionary
```

```bash
$ sudachiclone tokenize -h
sudachiclone-tokenize 0.2.1
Tokenize Text

USAGE:
    sudachiclone tokenize [FLAGS] [OPTIONS] [in_files]...

FLAGS:
    -h, --help       (default) see `tokenize -h`
    -a               print all of the fields
    -V, --version    Prints version information

OPTIONS:
    -o    the output file
    -r    the setting file in JSON format
    -m    the mode of splitting [possible values: A, B, C]
    -p    path to Python executable

ARGS:
    ...    text written in utf-8
```

```bash
$ sudachiclone link -h
sudachiclone-link
Link Default Dict Package

USAGE:
    sudachiclone link [OPTIONS]

FLAGS:
    -h, --help       see `link -h`
    -V, --version    Prints version information

OPTIONS:
    -t    dict dict [default: core] [possible values: small, core, full]
    -p    path to Python executable
```

```bash
$ sudachiclone build -h
sudachiclone-build
Build Sudachi Dictionary

USAGE:
    sudachiclone build [FLAGS] [OPTIONS] -m [in_files]

FLAGS:
    -h, --help       see `build -h`
    -m               connection matrix file with MeCab's matrix.def format
    -V, --version    Prints version information

OPTIONS:
    -d    description comment to be embedded on dictionary [default: ]
    -o    output file (default: system.dic) [default: system.dic]

ARGS:
    source files with CSV format (one of more)
```

## As a Rust package
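To call sudachiclone from your own crate, it first has to be declared as a dependency. A minimal sketch of the `Cargo.toml` entry, assuming the 0.2 release line (the `tokenize` help output above reports 0.2.1); check crates.io for the version you actually want:

```toml
[dependencies]
# Assumed version requirement; adjust to the release you are targeting.
sudachiclone = "0.2"
```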
Here is an example usage:
```rust
use sudachiclone::prelude::*;

let dictionary = Dictionary::setup(None, None, None).unwrap();
let tokenizer = dictionary.create();

// Multi-granular tokenization
// using `system_core.dic` or `system_full.dic` version 20190781
// you may not be able to replicate this particular example due to dictionary you use

for m in tokenizer.tokenize("国家公務員", Some(SplitMode::C), None).unwrap() {
    println!("{}", m.surface());
};
// => 国家公務員

for m in tokenizer.tokenize("国家公務員", Some(SplitMode::B), None).unwrap() {
    println!("{}", m.surface());
};
// => 国家
// => 公務員

for m in tokenizer.tokenize("国家公務員", Some(SplitMode::A), None).unwrap() {
    println!("{}", m.surface());
};
// => 国家
// => 公務
// => 員

// Morpheme information
let m = tokenizer.tokenize("食べ", Some(SplitMode::A), None).unwrap().get(0).unwrap();
println!("{}", m.surface());
// => 食べ
println!("{}", m.dictionary_form());
// => 食べる
println!("{}", m.reading_form());
// => タベ
println!("{:?}", m.part_of_speech());
// => ["動詞", "一般", "*", "*", "下一段-バ行", "連用形-一般"]

// Normalization
println!("{}", tokenizer.tokenize("附属", Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
// => 付属

println!("{}", tokenizer.tokenize("SUMMER", Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
// => サマー

println!("{}", tokenizer.tokenize("シュミレーション", Some(SplitMode::A), None).unwrap().get(0).unwrap().normalized_form());
// => シミュレーション
```

## License
[Apache 2.0](./LICENSE).