https://github.com/bingqiao/mandarin-to-pinyin
Mandarin Chinese characters to pinyin
Mandarin Chinese characters to pinyin
- Host: GitHub
- URL: https://github.com/bingqiao/mandarin-to-pinyin
- Owner: bingqiao
- License: mit
- Created: 2025-07-03T15:26:43.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-07-03T16:15:23.000Z (4 months ago)
- Last Synced: 2025-07-27T14:05:01.583Z (3 months ago)
- Topics: chinese, mandarin, pinyin, rust
- Language: Rust
- Homepage:
- Size: 377 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Mandarin to Pinyin
[crates.io](https://crates.io/crates/mandarin-to-pinyin)
[docs.rs](https://docs.rs/mandarin-to-pinyin)
A lightweight, fast, and easy-to-use Rust crate for converting Mandarin Chinese characters to their corresponding Pinyin representation. It uses a pre-compiled Perfect Hash Function (PHF) map for instant lookups.
## Key Features
* **Fast:** Blazing-fast lookups using `phf`.
* **Simple API:** Get started with just a few lines of code.
* **Self-Contained:** Includes a default Unicode-to-Pinyin mapping, no external files needed.
* **Lightweight:** Option to exclude the default data to minimize your binary size if you provide your own.
* **Customizable:** Includes a utility to build your own mapping file from a tab-separated source.
* **Convenient String Conversion:** Easily convert entire Chinese sentences to Pinyin strings.
## Usage
1. **Add to your project:**
Add the dependency to your `Cargo.toml`:
```toml
[dependencies]
mandarin-to-pinyin = "0.0.1" # Replace with the latest version from crates.io
```
2. **Use in your code:**
The primary way to use the crate is to initialize the global map and use the lookup functions.
```rust
use mandarin_to_pinyin::{init_map, to_pinyin_string};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize the map (loads default data)
    init_map(None)?;

    // 2. Convert a Chinese sentence to Pinyin
    let chinese_sentence = "你好世界";
    let pinyin_sentence = to_pinyin_string(chinese_sentence, " ")?;
    println!("Pinyin for '{}': {}", chinese_sentence, pinyin_sentence);
    // Expected output: Pinyin for '你好世界': nǐ hǎo shì jiè

    // You can also use a different separator
    let pinyin_with_hyphens = to_pinyin_string("你好", "-")?;
    println!("Pinyin for '你好': {}", pinyin_with_hyphens);
    // Expected output: Pinyin for '你好': nǐ-hǎo

    Ok(())
}
```
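The usage above depends on a global map that has to be initialized once before any lookup. The following is a minimal sketch of that pattern using only the standard library, not the crate's actual implementation: `PINYIN_MAP`, `init_toy_map`, and `toy_to_pinyin_string` are hypothetical names, the two-entry table is made up for illustration, and passing unknown characters through unchanged is an assumption about one reasonable behavior.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical global table: code point -> readings (the first one is the default).
static PINYIN_MAP: OnceLock<HashMap<u32, Vec<String>>> = OnceLock::new();

/// Installs a tiny illustrative table once; a second call returns an error
/// instead of overwriting the existing map.
fn init_toy_map() -> Result<(), String> {
    let mut map = HashMap::new();
    map.insert('你' as u32, vec!["nǐ".to_string()]);
    map.insert('好' as u32, vec!["hǎo".to_string(), "hào".to_string()]);
    PINYIN_MAP
        .set(map)
        .map_err(|_| "map already initialized".to_string())
}

/// Joins the first reading of each character with `separator`.
/// Assumption: characters missing from the table are passed through unchanged.
fn toy_to_pinyin_string(text: &str, separator: &str) -> Result<String, String> {
    let map = PINYIN_MAP
        .get()
        .ok_or_else(|| "map not initialized".to_string())?;
    let parts: Vec<String> = text
        .chars()
        .map(|c| match map.get(&(c as u32)).and_then(|r| r.first()) {
            Some(reading) => reading.clone(),
            None => c.to_string(),
        })
        .collect();
    Ok(parts.join(separator))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    init_toy_map()?;
    println!("{}", toy_to_pinyin_string("你好", "-")?); // prints "nǐ-hǎo"
    Ok(())
}
```

`OnceLock` keeps initialization explicit and thread-safe without extra dependencies, which is why a lookup made before initialization surfaces as an error rather than a panic in this sketch.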
## Feature Flags
This crate uses feature flags to control its behavior and size.
#### `default-data` (enabled by default)
This feature embeds the `unicode-to-pinyin.bin` file directly into the compiled crate, so `init_map(None)` works without any external files.
If you want to minimize binary size and provide your own data file at runtime, you can disable this feature.
**Disabling default features:**
```toml
[dependencies]
mandarin-to-pinyin = { version = "0.0.1", default-features = false }
```
When `default-data` is disabled, you must pass your own byte slice to `init_map()`:
```rust
use mandarin_to_pinyin::init_map;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read your custom .bin file
    let custom_data_bytes = fs::read("path/to/your/unicode-to-pinyin.bin")?;

    // Initialize the map with your custom data
    init_map(Some(&custom_data_bytes))?;

    // ... now you can use the lookup functions
    Ok(())
}
```
#### `prepare-data` (optional)
This feature is for developers who want to create their own `unicode-to-pinyin.bin` file from a source file. It enables a binary target that you can use as a command-line tool. The source file should be a text file where each line contains a Unicode code point and its Pinyin representation, separated by a tab.
Most users of this library will not need to enable this feature.
**To install the conversion tool:**
```bash
cargo install mandarin-to-pinyin --features prepare-data --no-default-features
```
**To run the tool:**
The tool will read `data/Mandarin.dat` and generate `bincode/unicode-to-pinyin.bin`.
```bash
mandarin-to-pinyin
```
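To make the tab-separated source format more concrete, here is a rough sketch of how such a file could be parsed into an in-memory table. It is not the tool's actual code. `parse_mapping` is a hypothetical helper, and two details are assumptions that the description above does not confirm: code points are written in hexadecimal (optionally prefixed with `U+`), and multiple readings on one line are separated by whitespace.

```rust
use std::collections::HashMap;

/// Parses lines of the form "<code point>\t<readings>".
/// Assumptions (not confirmed by the crate): the code point is hexadecimal,
/// optionally prefixed with "U+", and readings are whitespace-separated.
fn parse_mapping(source: &str) -> Result<HashMap<u32, Vec<String>>, String> {
    let mut map = HashMap::new();
    for (idx, line) in source.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // skip blank lines
        }
        let (code, readings) = line
            .split_once('\t')
            .ok_or_else(|| format!("line {}: missing tab separator", idx + 1))?;
        let code = code.trim().trim_start_matches("U+");
        let code_point = u32::from_str_radix(code, 16)
            .map_err(|e| format!("line {}: bad code point {:?}: {}", idx + 1, code, e))?;
        let readings: Vec<String> = readings
            .split_whitespace()
            .map(str::to_string)
            .collect();
        map.insert(code_point, readings);
    }
    Ok(map)
}

fn main() {
    // Made-up sample data for 你 (U+4F60) and 好 (U+597D).
    let sample = "4F60\tni3\n597D\thao3 hao4\n";
    let map = parse_mapping(sample).expect("sample should parse");
    println!("{:?}", map.get(&0x597D));
}
```

Reporting the line number in each error makes it easier to track down a malformed entry in a large source file.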
## Data Source
The `data/Mandarin.dat` file used in this project is sourced from the [Lingua::Han::PinYin Perl module](https://github.com/fayland/perl-lingua-han/tree/master/Lingua-Han-PinYin/lib/Lingua/Han/PinYin) by Fayland Lam.
## API Reference
* `fn init_map(bytes: Option<&[u8]>) -> Result<(), Box<dyn std::error::Error>>`
Initializes the global Pinyin map. If `bytes` is `None`, it uses the default embedded data (requires the `default-data` feature). If `bytes` is `Some`, it uses the provided byte slice.
* `fn to_pinyin_string(text: &str, separator: &str) -> Result<String, String>`
Converts a Chinese string to a Pinyin string, using the first Pinyin pronunciation for each character and joining them with the specified separator.
* `fn lookup_chars_for_str(chars: &str) -> Result, String>`
Looks up the Pinyin for a string slice and returns a space-separated string of Pinyin.
* `fn lookup_unicodes(unicodes: &[u32]) -> Result, String>`
Looks up the Pinyin for a slice of Unicode code points and returns a space-separated string of Pinyin.
* `fn lookup_chars_map_for_str(chars: &str) -> Result>>, String>`
Looks up the Pinyin for a string slice and returns a `HashMap` of characters to their Pinyin.
* `fn lookup_unicodes_map(unicodes: &[u32]) -> Result>>, String>`
Looks up the Pinyin for a slice of Unicode code points and returns a `HashMap` of code points to their Pinyin.
* `fn lookup_chars_vec_for_str(chars: &str) -> Result>>, String>`
Looks up the Pinyin for a string slice and returns a `Vec` of Pinyin strings.
* `fn lookup_unicodes_vec(unicodes: &[u32]) -> Result>>, String>`
Looks up the Pinyin for a slice of Unicode code points and returns a `Vec` of Pinyin strings.
* `fn diacritic_to_tone_plus_number(pinyins: &[&str]) -> Vec<String>`
Converts Pinyin with diacritics to Pinyin with tone numbers (e.g., "xiāng" -> "xiang1").
* `fn tone_plus_number_to_diacritic(pinyins: &[&str]) -> Vec<String>`
Converts Pinyin with tone numbers to Pinyin with diacritics (e.g., "xiang1" -> "xiāng").
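
As an illustration of the diacritic-to-number conversion described above, the sketch below shows one way it could work for a single syllable. It is not the crate's implementation: `strip_tone` and `to_tone_number` are hypothetical names, the input is assumed to use precomposed tone-marked vowels (not combining marks), and neutral-tone syllables are assumed to be returned without a digit.

```rust
/// Maps a tone-marked vowel to its plain vowel and tone number (1-4).
/// Returns `None` for any character that carries no tone mark.
fn strip_tone(c: char) -> Option<(char, u8)> {
    const TABLE: [(&str, char); 6] = [
        ("āáǎà", 'a'),
        ("ēéěè", 'e'),
        ("īíǐì", 'i'),
        ("ōóǒò", 'o'),
        ("ūúǔù", 'u'),
        ("ǖǘǚǜ", 'ü'),
    ];
    for &(marked, plain) in TABLE.iter() {
        // The position within each group encodes the tone: index 0 is tone 1.
        if let Some(pos) = marked.chars().position(|m| m == c) {
            return Some((plain, pos as u8 + 1));
        }
    }
    None
}

/// Converts one syllable such as "xiāng" to "xiang1".
/// Assumption: a syllable without a tone mark is returned without a digit.
fn to_tone_number(syllable: &str) -> String {
    let mut out = String::new();
    let mut tone = None;
    for c in syllable.chars() {
        match strip_tone(c) {
            Some((plain, t)) => {
                out.push(plain);
                tone = Some(t);
            }
            None => out.push(c),
        }
    }
    if let Some(t) = tone {
        out.push_str(&t.to_string());
    }
    out
}

fn main() {
    println!("{}", to_tone_number("xiāng")); // prints "xiang1"
    println!("{}", to_tone_number("nǚ")); // prints "nü3"
}
```

The reverse direction needs an extra rule for choosing which vowel receives the mark, so it is left out of this sketch.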