https://github.com/yishn/chinese-tokenizer
Tokenizes Chinese texts into words.
- Host: GitHub
- URL: https://github.com/yishn/chinese-tokenizer
- Owner: yishn
- License: mit
- Created: 2016-09-14T05:22:05.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2022-12-21T23:02:41.000Z (almost 3 years ago)
- Last Synced: 2025-03-24T08:10:28.427Z (7 months ago)
- Topics: chinese, language, tokenizer, words
- Language: JavaScript
- Homepage: https://yishn.github.io/chinese-tokenizer/
- Size: 11.2 MB
- Stars: 96
- Watchers: 6
- Forks: 25
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# chinese-tokenizer
Simple algorithm to tokenize Chinese texts into words using [CC-CEDICT](https://cc-cedict.org/). You can try it out at [the demo page](https://yishn.github.io/chinese-tokenizer/). The code for the demo page can be found in the [`gh-pages` branch](https://github.com/yishn/chinese-tokenizer/tree/gh-pages) of this repository.
## How this works
This tokenizer uses a simple greedy algorithm: starting from the current position in the input, it repeatedly matches the longest entry in the CC-CEDICT dictionary and emits it as the next token.
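As a rough illustration (not the library's actual implementation), a greedy longest-match tokenizer can be sketched as follows, assuming a hypothetical `dict` set of dictionary words and a known maximum entry length:

~~~js
// Minimal sketch of greedy longest-match tokenization over a word set.
// `dict` is a hypothetical Set of dictionary entries, `maxLen` the longest entry length.
function greedyTokenize(text, dict, maxLen = 8) {
  const tokens = []
  let i = 0

  while (i < text.length) {
    // Try the longest slice starting at position i, then shrink it until it matches a dictionary word.
    let len = Math.min(maxLen, text.length - i)
    while (len > 1 && !dict.has(text.slice(i, i + len))) len--

    tokens.push(text.slice(i, i + len))
    i += len
  }

  return tokens
}

console.log(greedyTokenize('我是中国人', new Set(['我', '是', '中国', '中国人'])))
// => [ '我', '是', '中国人' ]
~~~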
## Installation
Use npm to install:
~~~
npm install chinese-tokenizer --save
~~~

## Usage
Make sure to provide the [CC-CEDICT](https://cc-cedict.org/) data.
~~~js
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

console.log(JSON.stringify(tokenize('我是中国人。'), null, ' '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, ' '))
~~~

Output:
~~~js
[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]
~~~

## API
### `chineseTokenizer.loadFile(path)`
Reads the [CC-CEDICT](https://cc-cedict.org/) file from the given `path` and returns a tokenize function based on that dictionary.
### `chineseTokenizer.load(content)`
Parses [CC-CEDICT](https://cc-cedict.org/) data from the `content` string and returns a tokenize function based on that dictionary.
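For example, if the dictionary data is already available as a string (read with `fs` here; the path is just a placeholder), it can be passed to `load` directly:

~~~js
const fs = require('fs')
const chineseTokenizer = require('chinese-tokenizer')

// Read the CC-CEDICT file yourself and hand its contents to `load`.
// './cedict_ts.u8' stands in for wherever the dictionary file lives.
const content = fs.readFileSync('./cedict_ts.u8', 'utf8')
const tokenize = chineseTokenizer.load(content)
~~~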
### `tokenize(text)`
Tokenizes the given `text` string and returns an array with tokens of the following form:
~~~js
{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}
~~~
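Tokens with no dictionary entry (such as the `。` punctuation in the output above) are still returned, but with an empty `matches` array, so they can be filtered out if needed:

~~~js
// Keep only tokens that have at least one dictionary match.
const words = tokenize('我是中国人。')
  .filter(token => token.matches.length > 0)
  .map(token => token.text)

console.log(words) // => [ '我', '是', '中国人' ]
~~~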