Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/veer66/chamkho
Khmer, Lao, Myanmar, and Thai word segmentation/breaking library and command line
https://github.com/veer66/chamkho
Last synced: about 2 months ago
JSON representation
Khmer, Lao, Myanmar, and Thai word segmentation/breaking library and command line
- Host: GitHub
- URL: https://github.com/veer66/chamkho
- Owner: veer66
- License: apache-2.0
- Archived: true
- Created: 2015-03-11T06:36:20.000Z (almost 10 years ago)
- Default Branch: main
- Last Pushed: 2023-10-22T11:23:13.000Z (about 1 year ago)
- Last Synced: 2024-10-28T17:22:02.802Z (2 months ago)
- Language: Rust
- Homepage:
- Size: 1010 KB
- Stars: 38
- Watchers: 7
- Forks: 7
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- nlp_thai_resources - GitHub
README
# Moved to https://codeberg.org/mekong-lang/chamkho
# Chamkho
Khmer, Lao, Myanmar, and Thai word segmentation/breaking library and command line## Algorithm
https://github.com/veer66/wordcut-engine
## C-ABI Library
https://github.com/veer66/wordcutw
## Usage
### Binary tarball
#### On Google Colab
```
!wget -q https://github.com/veer66/chamkho/releases/download/1.4.0/chamkho-1.4.0-linux-x64-musl.tar.gz -O - | tar -xzf -
with open('input.txt', 'w') as f:
f.write(u'กาเตเภไก่')
!cd chamkho-1.4.0-linux-x64-musl; ./wordcut < ../input.txt; cd ..
```#### On GNU/Linux
```
wget -q https://github.com/veer66/chamkho/releases/download/1.4.0/chamkho-1.4.0-linux-x64-musl.tar.gz -O - | tar -xzf -
cd chamkho-1.4.0-linux-x64-musl
echo กากากา | ./wordcut
cd ..
```#### On Windows Powershell
```
PS C:\ex1> $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
PS C:\ex1> Invoke-WebRequest -uri https://github.com/veer66/chamkho/releases/download/1.1.0/chamkho-1.1.0-windows-amd64.zip -OutFile chamkho.zip
PS C:\ex1> Expand-Archive -Path .\chamkho.zip -DestinationPath .
PS C:\ex1> cd .\chamkho-1.1.0-windows-amd64\
PS C:\ex1\chamkho-1.1.0-windows-amd64> echo ฉันง่วงมาก | .\wordcut.
exe
ฉัน|ง่วง|มาก
```### As command line
echo "กากกา" | wordcut
#### External dictionary
echo "กากกา" | wordcut -d
#### Specific language
```Bash
echo "ພາສາລາວມີ" | wordcut -l lao
``````Bash
echo នៅក្នុងការប្រកបរបរអាជីវកម្ | wordcut -l khmer
``````Bash
echo မြန်မာမှာ | wordcut -l myanmar
```#### WebAssembly
```Bash
# Add wasm32-wasi target
rustup target add wasm32-wasi# Install wasmtime (macOS) https://docs.wasmtime.dev/cli-install.html
curl https://wasmtime.dev/install.sh -sSf | bash# Source (or restart terminal)
source ~/.zshrc# Build
cargo build --target=wasm32-wasi# Run with input
echo "ฉันง่วงมาก" | wasmtime --dir=$(pwd)/data target/wasm32-wasi/debug/wordcut.was# Run and wait for input
wasmtime --dir=$(pwd)/data target/wasm32-wasi/debug/wordcut.wasm
```## Benchmark
### Setup
* Computer: Hetzner's CX11
* Rustc: rustc 1.53.0 (53cb7b09b 2021-06-17)
* OS: Linux exper1 5.12.15-300.fc34.x86_64 #1 SMP Wed Jul 7 19:46:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
* Script:```Bash
#!/bin/bash
set -x
INPUT=thwik-head1m.txt
#INPUT=x.txt
for i in {1..10}
do
{ time nlpo3 segment < $INPUT > o3 ; } 2>> bench_o3.txt
{ time wordcut < $INPUT > wc.txt ; } 2>> bench_wc.txt
done
```* nlpo3 version: 1.1.2
* nlpo3-cli version: 0.0.1
* chamkho version: 0.5.0
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt### Result
#### nlpo3
```
[root@exper1 ~]# grep real bench_o3.txt
real 3m26.884s
real 3m15.001s
real 3m12.829s
real 3m11.998s
real 3m12.399s
real 3m13.829s
real 3m14.506s
real 3m9.198s
real 3m6.749s
real 3m8.729s
```#### chamko
```
[root@exper1 ~]# grep real bench_wc.txt
real 1m41.611s
real 1m40.262s
real 1m40.488s
real 1m40.765s
real 1m39.385s
real 1m41.002s
real 1m38.292s
real 1m35.906s
real 1m40.263s
real 1m36.523s
```#### Average
* nlpo3: 193.21s
* chamkho: 99.44s## Benchmark 2
Chamkho and Newmm on Mac mini M1
### Setup
* Computer: Scaleway's Mac mini M1
* Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
* Python: Python 3.8.2
* OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
* Script:```Bash
#!/bin/bashset -x
INPUT=thwik-head1m.txt
for i in {1..10}
do
{ time python3 newmm.py < $INPUT > newmm.out ; } 2>> bench_newmm.txt
{ time wordcut < $INPUT > cham.out ; } 2>> bench_chamkho.txt
done
```
* PyThaiNLP: 2.3.1
* chamkho version: 1.0.2
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt### Result
#### newmm
```
real 7m40.693s
real 7m40.623s
real 7m41.623s
real 7m40.438s
real 7m41.363s
real 7m39.108s
real 7m39.486s
real 7m39.946s
real 7m39.960s
real 7m40.279s
```#### chamko
```
real 1m2.110s
real 1m2.200s
real 1m1.954s
real 1m1.823s
real 1m1.836s
real 1m1.864s
real 1m1.638s
real 1m1.641s
real 1m1.688s
real 1m1.923s
```#### Average
* newmm```
$ grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
460.3519
```* chamkho:
```
$ grep real bench_chamkho.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
61.8677
```#### Performance ratio
7.4
## Benchmark 3
Chamkho and Newmm on Xeon
### Setup
* Computer: Hetzner's CX11
* Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
* Python: Python 3.9.6
* OS: Linux fedora-2gb-hel1-1 5.12.15-300.fc34.x86_64 #1 SMP Wed Jul 7 19:46:50 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
* Script:```Bash
#!/bin/bash
set -xINPUT=thwik-head1m.txt
for i in {1..10}
do
{ time python newmm.py < $INPUT > newmm.out ; } 2>> bench_newmm.txt
{ time wordcut < $INPUT > cham.out ; } 2>> bench_chamkho.txt
done
```* PyThaiNLP: 2.3.1
* chamkho version: 1.0.2
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt### Result
#### newmm
```
# grep real bench_newmm.txt
real 17m15.608s
real 17m14.038s
real 17m7.864s
real 17m17.329s
real 17m5.501s
real 17m10.841s
real 17m16.348s
real 17m19.813s
real 17m28.796s
real 17m26.056s
```#### chamko
```
# grep real bench_chamkho.txt
real 1m46.157s
real 1m47.785s
real 1m47.173s
real 1m45.656s
real 1m45.554s
real 1m46.612s
real 1m48.991s
real 1m49.656s
real 1m47.677s
real 1m47.876s
```#### Average
* newmm```
# grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
1036.2194000000002
```* chamkho:
```
$ # grep real bench_chamkho.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
107.31370000000001
```#### Performance ratio
9.65
## Benchmark 4
Chamkho and Nlpo3 on Mac mini M1
### Setup
* Computer: Scaleway's Mac mini M1
* Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
* OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
* Script:```Bash
#!/bin/bashset -x
INPUT=thwik-head1m.txt
for i in {1..10}
do
{ time wordcut < $INPUT > newmm.out ; } 2>> bench_chamkho.txt
{ time nlpo3 segment < $INPUT > cham.out ; } 2>> bench_o3.txt
done
```* nlpo3-cli version: 0.0.1
* chamkho version: 1.0.2
* dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt### Result
#### nlpo3
```
% grep real bench_o3.txt
real 2m7.639s
real 2m7.024s
real 2m6.296s
real 2m7.731s
real 2m7.873s
real 2m7.028s
real 2m6.411s
real 2m6.974s
real 2m7.746s
real 2m6.955s
```#### chamko
```
% grep real bench_chamkho.txt
real 1m0.237s
real 1m1.799s
real 1m1.752s
real 1m1.373s
real 1m1.128s
real 1m1.870s
real 1m1.878s
real 1m1.709s
real 1m1.690s
real 1m1.030s
```#### Average
* nlpo3```
% grep real bench_o3.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
127.1677
```* chamkho:
```
% grep real bench_chamkho.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
61.44659999999999
```
#### Performance ratio2.07