# thaitokens
Experimenting with extracting Thai subword tokens for language model creation, using [TokenMonster](https://github.com/alasdairforsythe/tokenmonster/) on Thai-language datasets.

Below are example subword tokens generated from the [Wisesight Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment) (see the full list in [wss.vocab.yaml](wss/wss.vocab.yaml)):
```yaml
charset: utf-8
normalization: "nfd quotemarks collapse trim unixlines"
capcode: 0
training-param: 34
tokens:
- token: "TokenMonsterHexEncode{b8}"
id: 155
score: 0.0063883355
encoded: true
- token: "TokenMonsterHexEncode{b9}"
id: 156
score: 0.0019254258
encoded: true
- token: " "
id: 4
score: 0.0017494631
encoded: true
- token: "\n"
id: 1
score: 0.0014612209
encoded: true
- token: " #"
id: 237
score: 0.0011326408
encoded: true
- token: " และ"
id: 19632
score: 0.00089728273
encoded: true
- token: "ครับ"
id: 22685
score: 0.0007870678
encoded: true
```

[...]
```yaml
- token: "ลยค่ะ"
id: 36162
score: 0.00013340132
encoded: true
- token: "สมิติเวช"
id: 54060
score: 0.00013340132
encoded: true
- token: "ไม่เห็น"
id: 51286
score: 0.00013340132
encoded: true
- token: "แนะนำให้"
id: 54773
score: 0.00013340132
encoded: true
- token: "ผู้โชคดี"
id: 53678
score: 0.00013340132
encoded: true
```
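The hex-encoded single-byte tokens at the top of the list reflect how Thai text is encoded: Thai characters (U+0E01–U+0E5B) all have three-byte UTF-8 encodings beginning with `e0 b8` or `e0 b9`, so the bytes `b8` and `b9` occur in virtually every Thai character. A quick check (a sketch using POSIX `od`):

```sh
# Dump the UTF-8 bytes of the Thai letter ko kai (ก, U+0E01); expect e0 b8 81.
printf 'ก' | od -An -tx1
```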
## Steps

Follow the [4 training steps](https://github.com/alasdairforsythe/tokenmonster/tree/main/training) detailed by the TokenMonster project. You need the Go compiler to build the training toolchain.
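A sketch of building the tools, assuming they ship as single-file Go programs in the repository's `training` directory (verify against the training README):

```sh
# Assumption: the three training tools are standalone Go programs in
# tokenmonster/training; check the project README if the layout differs.
git clone https://github.com/alasdairforsythe/tokenmonster
cd tokenmonster/training
go mod tidy
go build getalltokens.go
go build trainvocab.go
go build exportvocab.go
```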
### 1. Prepare the dataset
Build a mini dataset (about 6 MiB) from the [Wisesight Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment). One way to fetch the corpus files is sketched below (an assumption; check the corpus repository for the actual file locations):
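```sh
# Hypothetical fetch step: clone the corpus and copy the four label files
# (neg, neu, pos, q) into the working directory; verify the paths in the repo.
git clone https://github.com/PyThaiNLP/wisesight-sentiment
cp wisesight-sentiment/*.txt .
```

Then concatenate them into a single training file: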
```sh
cat neg.txt neu.txt pos.txt q.txt > wss.txt
```

### 2. Generate tokens
```sh
./getalltokens -dataset wss.txt -output wss.alltokens -mode balanced -capcode 0 -charset utf-8 -norm "collapse quotemarks nfd trim unixlines" -only-valid -min-occur 2 -workers 2
```

- `-capcode 0` is recommended by TokenMonster for languages that don't use spaces as word separators.
- `-workers N` sets the number of worker threads to run, excluding the main thread. It is best set to one less than the number of CPU threads, as in the sketch below.
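A sketch for deriving that worker count from the machine (assumes Linux; on macOS, `sysctl -n hw.ncpu` replaces `nproc`):

```sh
# Use one worker thread fewer than the number of CPU threads.
WORKERS=$(( $(nproc) - 1 ))
./getalltokens -dataset wss.txt -output wss.alltokens -mode balanced \
  -capcode 0 -charset utf-8 -norm "collapse quotemarks nfd trim unixlines" \
  -only-valid -min-occur 2 -workers "$WORKERS"
```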
It will start generating tokens:

```text
Charset: UTF-8
Normalization: NFD Quotemarks Collapse Trim UnixLines
Capcode: 0 (disabled)
Optimization mode: 2 (balanced)
Only valid UTF-8 allowed
2024/04/08 21:31:14 Loading wss.txt
2024/04/08 21:31:14 Finding tokens in chunk 1 of 1
2024/04/08 21:45:06 Tokens before final trim: 25,759,395
2024/04/08 21:45:06 Trimming final tokens for min 2
2024/04/08 21:45:11 Tokens after trimming: 7,317,906
2024/04/08 21:45:11 Filtered 251,869,920 tokens in 13m56.882s
2024/04/08 21:45:11 Saving tokens...
2024/04/08 21:45:16 Saved: wss.alltokens
```

### 3. Train vocabulary
Use the dataset from Step (1) and tokens from Step (2) to get a vocabulary:
```sh
./trainvocab -dataset wss.txt -dictionary wss.alltokens -dir wss-results -include-utf8-bytes -vocab-size 65536 -workers 2
```

Results will be saved to the `wss-results` directory as training progresses:
```text
Loading wss.alltokens
Charset: UTF-8
Normalization: NFD Quotemarks Collapse Trim UnixLines
Capcode: 0 (disabled)
Optimization mode: 2 (balanced)
Vocabulary size: 65536
Single byte tokens: 213
Loading wss.txt
2024/04/08 23:14:07 Worker 1 starting run 1
2024/04/08 23:14:07 Worker 0 starting run 1
2024/04/08 23:14:09 Worker 1 completed run 1 Score: 635,748
2024/04/08 23:14:09 Worker 0 completed run 1 Score: 629,851
[...]
2024/04/09 00:46:01 Worker 1 completed run 1028 Score: 651,943
2024/04/09 00:46:01 Deleted 3 of 3 tokens; Remaining 65,560 tokens; reached_vocab Best: 651,555; Tries:998
2024/04/09 00:46:03 Worker 0 completed run 1029 Score: 651,902
2024/04/09 00:46:03 Deleted 1 of 2 tokens; Remaining 65,559 tokens; reached_vocab Best: 651,555; Tries:999
2024/04/09 00:46:04 Worker 1 completed run 1029 Score: 652,022
2024/04/09 00:46:04 -- FINISHED --
No new best score in 1000 runs
Best result tokenized 6,296,789 bytes with 651,555 tokens
Average 9.664 characters/token
Best result:
wss-results/651555_568.tok
```
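The log names `wss-results/651555_568.tok` as the best vocabulary. Judging from that name, result files appear to be prefixed with the token count they achieved, so the best (fewest-token) result can be located with a numeric sort (an assumption based on the log above):

```sh
# List result files by token count, lowest (best) first.
ls wss-results | sort -n | head -n 3
```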
### 4. Export vocabulary

Extract tokens from the best vocabulary:
```sh
./exportvocab -input wss-results -output wss.vocab
```

The export prints a vocabulary summary:

```text
Loading wss-results/651555_568.tok
Capcode: 0 (disabled)
Charset: UTF-8
Normalization: NFD Quotemarks Collapse Trim UnixLines
Optimization mode: 2 (balanced)
Maximum token length: 40
Regular tokens: 65322
Single byte tokens: 214
Special tokens: 0
UNK token: No (can be added)
Deleted tokens: 0
Total tokens: 65536
Exported: wss.vocab
```

Convert it to YAML format:
```sh
./exportvocab -input-vocab wss.vocab -output-yaml wss.vocab.yaml -order-by-score
See [wss.vocab.yaml](wss/wss.vocab.yaml) for an example of what the resulting vocabulary looks like.
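A minimal sketch for inspecting the exported YAML, assuming one `- token:` line per entry as in the excerpts above:

```sh
# Show the vocabulary header and the first few tokens (highest-scoring,
# given -order-by-score).
head -n 30 wss.vocab.yaml

# Count the token entries; this should match the reported total of 65,536
# if every token gets its own entry.
grep -c '^- token:' wss.vocab.yaml
```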