https://github.com/centre-for-humanities-computing/chinese-tokenizer
A Rusty way of tokenizing Chinese texts
- Host: GitHub
- URL: https://github.com/centre-for-humanities-computing/chinese-tokenizer
- Owner: centre-for-humanities-computing
- License: mit
- Created: 2020-01-15T09:04:07.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-20T12:37:22.000Z (over 6 years ago)
- Last Synced: 2025-01-03T21:42:12.791Z (over 1 year ago)
- Topics: jieba, rust, tokenizer
- Language: Rust
- Homepage:
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# A Rust-y tokenizer for Chinese texts #
This is a short program for tokenizing Chinese text, using jieba-rs, a Rust port of jieba.
The default tokenizer is a dictionary-based maximum likelihood matching algorithm that works from a Chinese lexicon. However, jieba-rs also implements a Hidden Markov Model tokenizer. To select your preferred tokenizer, make the necessary changes in src/main.rs.
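To make the dictionary-based idea concrete, here is a minimal sketch of maximum-probability segmentation over a lexicon, using only the standard library. It is a simplified illustration and not the jieba-rs implementation: the function names `segment` and `demo_lexicon`, the tiny lexicon, its frequencies, and the floor frequency for unknown characters are all assumptions made up for this example.

```rust
use std::collections::HashMap;

/// Hypothetical toy lexicon mapping words to made-up frequencies.
fn demo_lexicon() -> HashMap<&'static str, f64> {
    let mut lexicon = HashMap::new();
    lexicon.insert("北京", 50.0);
    lexicon.insert("大学", 60.0);
    lexicon.insert("北京大学", 40.0);
    lexicon.insert("大学生", 30.0);
    lexicon.insert("生", 10.0);
    lexicon
}

/// Split `text` into the word sequence with the highest total
/// log-probability under a unigram lexicon (dynamic programming,
/// filled in from the end of the text towards the start).
fn segment(text: &str, lexicon: &HashMap<&str, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let total: f64 = lexicon.values().sum::<f64>().max(1.0);
    let max_len = lexicon
        .keys()
        .map(|w| w.chars().count())
        .max()
        .unwrap_or(1)
        .max(1);

    // best[i] = (score of the best segmentation of chars[i..],
    //            end index of the first word in that segmentation)
    let mut best: Vec<(f64, usize)> = vec![(0.0, n); n + 1];
    for i in (0..n).rev() {
        let mut choice = (f64::NEG_INFINITY, i + 1);
        for j in (i + 1)..=(i + max_len).min(n) {
            let word: String = chars[i..j].iter().collect();
            let freq = match lexicon.get(word.as_str()) {
                Some(&f) => f,
                // Assumption: an unknown single character gets a floor frequency of 1.
                None if j == i + 1 => 1.0,
                // Unknown multi-character strings are not candidate words.
                None => continue,
            };
            let score = (freq / total).ln() + best[j].0;
            if score > choice.0 {
                choice = (score, j);
            }
        }
        best[i] = choice;
    }

    // Follow the stored end indices to read off the best path.
    let mut tokens: Vec<String> = Vec::new();
    let mut i = 0;
    while i < n {
        let j = best[i].1;
        tokens.push(chars[i..j].iter().collect());
        i = j;
    }
    tokens
}
```

With this toy lexicon, `segment("北京大学生", &demo_lexicon())` yields 北京 / 大学生 rather than 北京大学 / 生, because the first split has the higher combined probability. A Hidden Markov Model tokenizer takes a different approach and can propose words that are absent from the lexicon.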
## Getting started
In order to run on your machine, you'll first need to install Rust and the Cargo package manager. This can be done in a number of different ways, depending on whether you use macOS, Linux, or Windows. You can find more information on how to do this [here](https://www.rust-lang.org/tools/install) and [here](https://doc.rust-lang.org/cargo/getting-started/installation.html).
Once that's completed, you'll need to copy your data into the empty 'data' folder. Note that the current structure of this program only allows for folder structures one level deep. In other words, every file must sit inside a subfolder of 'data':
```
data/subfolder/file.txt
```
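For illustration, a one-level-deep layout like this could be traversed as in the sketch below. This is not the code in src/main.rs: the function name `collect_files` is hypothetical, and skipping files that sit directly in the root or in deeper folders is an assumption about how such a layout might be handled.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect the files that sit exactly one level below `root`,
/// i.e. paths of the form root/subfolder/file.
fn collect_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(root)? {
        let subfolder = entry?.path();
        if !subfolder.is_dir() {
            // Assumption: files placed directly inside the root are skipped.
            continue;
        }
        for entry in fs::read_dir(&subfolder)? {
            let path = entry?.path();
            if path.is_file() {
                // Assumption: nested directories are ignored, not descended into.
                files.push(path);
            }
        }
    }
    // read_dir gives no ordering guarantee, so sort for reproducible runs.
    files.sort();
    Ok(files)
}
```

Calling `collect_files(Path::new("data"))` returns the sorted paths of all files matching `data/subfolder/file`, which could then be read and tokenized one by one.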
Be sure to check the comments at the beginning of src/main.rs. Some paths and variables may need to be modified to suit your needs.
## Building the program
With Rust, you have two options when running the program. Firstly, you can simply do the following in the root directory:
```
cargo run --release
```
This builds the local package and executes the binary. However, you can also run these steps separately.
First build:
```
cargo build --release
```
Then run:
```
./target/release/chinese
```
Note that in both cases, we're using the `--release` flag. This prompts the compiler to perform optimisations which substantially improve the performance of the tokenizer.
## NB!
This was written quite quickly to solve a specific problem and is still essentially a work in progress. It will work for any collection of Chinese texts, as long as the corpus is structured in the format outlined above. However, I hope at some point to return to this and make it more flexible, as well as offering the user the chance to set certain flags.
## Author
Author: [rdkm89](https://github.com/rdkm89)
Date: 2020-01-13
## Built with
This tokenizer pipeline is dependent on _jieba-rs_ by GitHub user [messense](https://github.com/messense). The original repo for that project can be found [here](https://github.com/messense/jieba-rs).
## License
This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.