Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/seanghay/khmertokenizer
A fast Khmer text tokenizer that ensures all characters are included in the process.
- Host: GitHub
- URL: https://github.com/seanghay/khmertokenizer
- Owner: seanghay
- License: mit
- Created: 2023-04-28T08:01:35.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-04-29T06:13:10.000Z (over 1 year ago)
- Last Synced: 2024-04-24T15:26:22.699Z (7 months ago)
- Topics: khmer, nodejs, tokenizer
- Language: JavaScript
- Homepage: https://npmjs.com/package/khmertokenizer
- Size: 21.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license
Awesome Lists containing this project
- awesome-khmer-language - seanghay/khmertokenizer
README
## Khmer Tokenizer
A fast Khmer text tokenizer that ensures all characters are included in the process.
[Web demo](https://khmertokenizer.netlify.app/)
```js
import { tokenize } from 'khmertokenizer';

tokenize("ភាសាខ្មែរ១២ 123 ABC")
// => ["ភា","សា","ខ្មែ","រ","១","២"," ","1","2","3"," ","A","B","C"]
```

### Iterator
```js
import { tokenizeAsIterator } from 'khmertokenizer';

for (const c of tokenizeAsIterator("ភាសាខ្មែរ១២ 123 ABC")) {
  console.log(c);
}
```

### Grapheme Validation
```js
import { tokenize, isInvalidKhmerGrapheme } from 'khmertokenizer';

const input = "ភាសាខ្មែរ១២ 123 ABC ២ ៗាា"
const output = tokenize(input)
  .filter(c => !isInvalidKhmerGrapheme(c)) // remove invalid graphemes
  .join("")
// => "ភាសាខ្មែរ១២ 123 ABC ២ ៗ"
```
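
The description says that all characters are included in the process. One way to read that claim, as an assumption here, is that the tokens concatenate back to the original input. The sketch below checks this property against the token list copied from the first example above. `isLossless` is a hypothetical helper and not part of `khmertokenizer`; the snippet does not import the package, so it runs on its own.

```js
// Hypothetical helper (not part of khmertokenizer): returns true when the
// tokens, joined in order, reproduce the input string exactly.
function isLossless(input, tokens) {
  return tokens.join("") === input;
}

// Input and expected tokens copied from the first README example.
const sampleInput = "ភាសាខ្មែរ១២ 123 ABC";
const sampleTokens = ["ភា","សា","ខ្មែ","រ","១","២"," ","1","2","3"," ","A","B","C"];

// Every character, including spaces, digits, and Latin letters, is kept.
const keepsEverything = isLossless(sampleInput, sampleTokens);

// Dropping the space tokens breaks the round trip.
const withoutSpaces = isLossless(sampleInput, sampleTokens.filter(t => t !== " "));

console.log(keepsEverything, withoutSpaces);
```

In a project that has the package installed, the same check could be run as `isLossless(text, tokenize(text))`. Note that filtering with `isInvalidKhmerGrapheme`, as in the Grapheme Validation example, removes tokens on purpose, so the filtered output is not expected to pass this check.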