https://github.com/seanghay/khmertokenizer

## Khmer Tokenizer

A fast Khmer text tokenizer that ensures all characters are included in the process.

[Web demo](https://khmertokenizer.netlify.app/)

```js
import { tokenize } from 'khmertokenizer';

tokenize("ភាសាខ្មែរ១២ 123 ABC")
// => ["ភា","សា","ខ្មែ","រ","១","២"," ","1","2","3"," ","A","B","C"]
```
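One way to check the "all characters are included" property is a round-trip test: joining the tokens should reproduce the input exactly. The sketch below does not call the library; it reuses the token list from the example above, and `isLossless` is a hypothetical helper name introduced only for illustration.

```js
// Input and tokens copied from the README example above.
// No library call is made here, so this runs without installing anything.
const input = "ភាសាខ្មែរ១២ 123 ABC";
const tokens = ["ភា","សា","ខ្មែ","រ","១","២"," ","1","2","3"," ","A","B","C"];

// Hypothetical helper (not part of khmertokenizer): returns true when
// concatenating the tokens gives back the original text unchanged.
function isLossless(text, toks) {
  return toks.join("") === text;
}

console.log(isLossless(input, tokens));
```

In a real project the same helper could be applied to the result of `tokenize(input)` as a quick regression check.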

### Iterator

```js
import { tokenizeAsIterator } from 'khmertokenizer';

for (const c of tokenizeAsIterator("ភាសាខ្មែរ១២ 123 ABC")) {
  console.log(c);
}
```
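Since the example above consumes `tokenizeAsIterator` with `for...of`, its result can presumably be passed to any helper that accepts an iterable. The sketch below shows one such helper that stops after the first `n` tokens without consuming the rest. `take` and `standInTokens` are hypothetical names; the stand-in generator replaces the library call so the snippet runs on its own.

```js
// Hypothetical helper (not part of khmertokenizer): lazily yields at most
// n items from any iterable, then stops pulling from the source.
function* take(iterable, n) {
  if (n <= 0) return;
  let count = 0;
  for (const item of iterable) {
    yield item;
    count += 1;
    if (count >= n) return;
  }
}

// Stand-in for tokenizeAsIterator(...), used only to keep this runnable.
function* standInTokens() {
  yield* ["ភា", "សា", "ខ្មែ", "រ"];
}

// Collect only the first two tokens.
const firstTwo = [...take(standInTokens(), 2)];
console.log(firstTwo.length);
```

This pattern is useful for previews or length limits, where tokenizing the whole input up front would be wasted work.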

### Grapheme Validation

```js
import { tokenize, isInvalidKhmerGrapheme } from 'khmertokenizer';

const input = "ភាសាខ្មែរ១២ 123 ABC ២ ៗាា";
const output = tokenize(input)
  .filter(c => !isInvalidKhmerGrapheme(c)) // remove invalid graphemes
  .join("");

// => "ភាសាខ្មែរ១២ 123 ABC ២ ៗ"
```
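Instead of silently dropping invalid graphemes, it can help to report where they occur. The sketch below is one possible approach, not part of the library: `findRejected` is a hypothetical helper that takes a token array and a predicate, and returns each rejected token with its offset in UTF-16 code units. The predicate used in the demo is a stand-in; in practice `isInvalidKhmerGrapheme` could be passed in its place, assuming it accepts a single token as in the example above.

```js
// Hypothetical helper (not part of khmertokenizer): walks the tokens,
// tracks the running offset, and records every token the predicate rejects.
function findRejected(tokens, isInvalid) {
  const rejected = [];
  let offset = 0; // position in the original string, in UTF-16 code units
  for (const token of tokens) {
    if (isInvalid(token)) {
      rejected.push({ token, offset });
    }
    offset += token.length;
  }
  return rejected;
}

// Demo with ASCII tokens and a stand-in predicate, for illustration only.
const demo = findRejected(["a", "b", "??", "c"], (t) => t === "??");
console.log(demo);
```

Because offsets are derived from token lengths, they are only meaningful when the tokens concatenate back to the original input.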