https://github.com/yishn/chinese-tokenizer
Tokenizes Chinese texts into words.
- Host: GitHub
- URL: https://github.com/yishn/chinese-tokenizer
- Owner: yishn
- License: mit
- Created: 2016-09-14T05:22:05.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2022-12-21T23:02:41.000Z (almost 3 years ago)
- Last Synced: 2025-03-24T08:10:28.427Z (7 months ago)
- Topics: chinese, language, tokenizer, words
- Language: JavaScript
- Homepage: https://yishn.github.io/chinese-tokenizer/
- Size: 11.2 MB
- Stars: 96
- Watchers: 6
- Forks: 25
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# chinese-tokenizer
Simple algorithm to tokenize Chinese texts into words using [CC-CEDICT](https://cc-cedict.org/). You can try it out at [the demo page](https://yishn.github.io/chinese-tokenizer/). The code for the demo page can be found in the [`gh-pages` branch](https://github.com/yishn/chinese-tokenizer/tree/gh-pages) of this repository.
## How this works
This tokenizer uses a simple greedy algorithm: starting from the current position in the input, it repeatedly matches the longest entry in the CC-CEDICT dictionary and emits it as the next token.
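As a rough illustration (not the library's actual implementation), a greedy longest-match tokenizer can be sketched as follows, assuming a hypothetical `dict` set of dictionary words and a known maximum entry length:

~~~js
// Minimal sketch of greedy longest-match tokenization over a word set.
// `dict` is a hypothetical Set of dictionary entries, `maxLen` the longest entry length.
function greedyTokenize(text, dict, maxLen = 8) {
  const tokens = []
  let i = 0

  while (i < text.length) {
    // Try the longest slice starting at position i, then shrink it until it matches a dictionary word.
    let len = Math.min(maxLen, text.length - i)
    while (len > 1 && !dict.has(text.slice(i, i + len))) len--

    tokens.push(text.slice(i, i + len))
    i += len
  }

  return tokens
}

console.log(greedyTokenize('我是中国人', new Set(['我', '是', '中国', '中国人'])))
// => [ '我', '是', '中国人' ]
~~~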
## Installation
Use npm to install:
~~~
npm install chinese-tokenizer --save
~~~

## Usage
Make sure to provide the [CC-CEDICT](https://cc-cedict.org/) data.
~~~js
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

console.log(JSON.stringify(tokenize('我是中国人。'), null, ' '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, ' '))
~~~

Output:
~~~js
[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]
~~~

## API
### `chineseTokenizer.loadFile(path)`
Reads the [CC-CEDICT](https://cc-cedict.org/) file from the given `path` and returns a tokenize function based on that dictionary.
### `chineseTokenizer.load(content)`
Parses [CC-CEDICT](https://cc-cedict.org/) data from the `content` string and returns a tokenize function based on that dictionary.
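For example, if the dictionary data is already available as a string (read with `fs` here; the path is just a placeholder), it can be passed to `load` directly:

~~~js
const fs = require('fs')
const chineseTokenizer = require('chinese-tokenizer')

// Read the CC-CEDICT file yourself and hand its contents to `load`.
// './cedict_ts.u8' stands in for wherever the dictionary file lives.
const content = fs.readFileSync('./cedict_ts.u8', 'utf8')
const tokenize = chineseTokenizer.load(content)
~~~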
### `tokenize(text)`
Tokenizes the given `text` string and returns an array with tokens of the following form:
~~~js
{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}
~~~
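Tokens with no dictionary entry (such as the `。` punctuation in the output above) are still returned, but with an empty `matches` array, so they can be filtered out if needed:

~~~js
// Keep only tokens that have at least one dictionary match.
const words = tokenize('我是中国人。')
  .filter(token => token.matches.length > 0)
  .map(token => token.text)

console.log(words) // => [ '我', '是', '中国人' ]
~~~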