# thaitokens
Experimenting with extracting Thai subword tokens for language model creation, using [TokenMonster](https://github.com/alasdairforsythe/tokenmonster/) on Thai-language datasets.

Below are example subword tokens generated from the [Wisesight Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment) (see the full list in [wss.vocab.yaml](wss/wss.vocab.yaml)):
```yaml
charset: utf-8
normalization: "nfd quotemarks collapse trim unixlines"
capcode: 0
training-param: 34
tokens:
- token: "TokenMonsterHexEncode{b8}"
id: 155
score: 0.0063883355
encoded: true
- token: "TokenMonsterHexEncode{b9}"
id: 156
score: 0.0019254258
encoded: true
- token: " "
id: 4
score: 0.0017494631
encoded: true
- token: "\n"
id: 1
score: 0.0014612209
encoded: true
- token: " #"
id: 237
score: 0.0011326408
encoded: true
- token: " และ"
id: 19632
score: 0.00089728273
encoded: true
- token: "ครับ"
id: 22685
score: 0.0007870678
encoded: true
```

[...]
```yaml
- token: "ลยค่ะ"
id: 36162
score: 0.00013340132
encoded: true
- token: "สมิติเวช"
id: 54060
score: 0.00013340132
encoded: true
- token: "ไม่เห็น"
id: 51286
score: 0.00013340132
encoded: true
- token: "แนะนำให้"
id: 54773
score: 0.00013340132
encoded: true
- token: "ผู้โชคดี"
id: 53678
score: 0.00013340132
encoded: true
```
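The hex-encoded single-byte tokens at the top of the list reflect how Thai text is encoded: Thai characters (U+0E01–U+0E5B) all have three-byte UTF-8 encodings beginning with `e0 b8` or `e0 b9`, so the bytes `b8` and `b9` occur in virtually every Thai character. A quick check (a sketch using POSIX `od`):

```sh
# Dump the UTF-8 bytes of the Thai letter ko kai (ก, U+0E01); expect e0 b8 81.
printf 'ก' | od -An -tx1
```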
## Steps

Follow the [4 training steps](https://github.com/alasdairforsythe/tokenmonster/tree/main/training) detailed by the TokenMonster project. You need the Go compiler to build the training toolchain.
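A sketch of building the tools, assuming they ship as single-file Go programs in the repository's `training` directory (verify against the training README):

```sh
# Assumption: the three training tools are standalone Go programs in
# tokenmonster/training; check the project README if the layout differs.
git clone https://github.com/alasdairforsythe/tokenmonster
cd tokenmonster/training
go mod tidy
go build getalltokens.go
go build trainvocab.go
go build exportvocab.go
```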
### 1. Prepare the dataset
Build a mini dataset (about 6 MiB) from the [Wisesight Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment). One way to fetch the corpus files is sketched below (an assumption; check the corpus repository for the actual file locations):
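```sh
# Hypothetical fetch step: clone the corpus and copy the four label files
# (neg, neu, pos, q) into the working directory; verify the paths in the repo.
git clone https://github.com/PyThaiNLP/wisesight-sentiment
cp wisesight-sentiment/*.txt .
```

Then concatenate them into a single training file: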
```sh
cat neg.txt neu.txt pos.txt q.txt > wss.txt
```

### 2. Generate tokens
```sh
./getalltokens -dataset wss.txt -output wss.alltokens -mode balanced -capcode 0 -charset utf-8 -norm "collapse quotemarks nfd trim unixlines" -only-valid -min-occur 2 -workers 2
```

- `-capcode 0` is recommended by TokenMonster for languages that don't use spaces as word separators.
- `-workers N` sets the number of worker threads to run, excluding the main thread. It is best set to one less than the number of CPU threads, as in the sketch below.
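A sketch for deriving that worker count from the machine (assumes Linux; on macOS, `sysctl -n hw.ncpu` replaces `nproc`):

```sh
# Use one worker thread fewer than the number of CPU threads.
WORKERS=$(( $(nproc) - 1 ))
./getalltokens -dataset wss.txt -output wss.alltokens -mode balanced \
  -capcode 0 -charset utf-8 -norm "collapse quotemarks nfd trim unixlines" \
  -only-valid -min-occur 2 -workers "$WORKERS"
```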
It will start generating tokens:

```text
Charset: UTF-8
Normalization: NFD Quotemarks Collapse Trim UnixLines
Capcode: 0 (disabled)
Optimization mode: 2 (balanced)
Only valid UTF-8 allowed
2024/04/08 21:31:14 Loading wss.txt
2024/04/08 21:31:14 Finding tokens in chunk 1 of 1
2024/04/08 21:45:06 Tokens before final trim: 25,759,395
2024/04/08 21:45:06 Trimming final tokens for min 2
2024/04/08 21:45:11 Tokens after trimming: 7,317,906
2024/04/08 21:45:11 Filtered 251,869,920 tokens in 13m56.882s
2024/04/08 21:45:11 Saving tokens...
2024/04/08 21:45:16 Saved: wss.alltokens
```

### 3. Train vocabulary
Use the dataset from Step (1) and tokens from Step (2) to get a vocabulary:
```sh
./trainvocab -dataset wss.txt -dictionary wss.alltokens -dir wss-results -include-utf8-bytes -vocab-size 65536 -workers 2
```

Results will be saved to the `wss-results` directory as training progresses:
```text
Loading wss.alltokens
Charset: UTF-8
Normalization: NFD Quotemarks Collapse Trim UnixLines
Capcode: 0 (disabled)
Optimization mode: 2 (balanced)
Vocabulary size: 65536
Single byte tokens: 213
Loading wss.txt
2024/04/08 23:14:07 Worker 1 starting run 1
2024/04/08 23:14:07 Worker 0 starting run 1
2024/04/08 23:14:09 Worker 1 completed run 1 Score: 635,748
2024/04/08 23:14:09 Worker 0 completed run 1 Score: 629,851
[...]
2024/04/09 00:46:01 Worker 1 completed run 1028 Score: 651,943
2024/04/09 00:46:01 Deleted 3 of 3 tokens; Remaining 65,560 tokens; reached_vocab Best: 651,555; Tries:998
2024/04/09 00:46:03 Worker 0 completed run 1029 Score: 651,902
2024/04/09 00:46:03 Deleted 1 of 2 tokens; Remaining 65,559 tokens; reached_vocab Best: 651,555; Tries:999
2024/04/09 00:46:04 Worker 1 completed run 1029 Score: 652,022
2024/04/09 00:46:04 -- FINISHED --
No new best score in 1000 runs
Best result tokenized 6,296,789 bytes with 651,555 tokens
Average 9.664 characters/token
Best result:
wss-results/651555_568.tok
```
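The log names `wss-results/651555_568.tok` as the best vocabulary. Judging from that name, result files appear to be prefixed with the token count they achieved, so the best (fewest-token) result can be located with a numeric sort (an assumption based on the log above):

```sh
# List result files by token count, lowest (best) first.
ls wss-results | sort -n | head -n 3
```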
### 4. Export vocabulary

Extract tokens from the best vocabulary:
```sh
./exportvocab -input wss-results -output wss.vocab
```

The export prints a vocabulary summary:

```text
Loading wss-results/651555_568.tok
Capcode: 0 (disabled)
Charset: UTF-8
Normalization: NFD Quotemarks Collapse Trim UnixLines
Optimization mode: 2 (balanced)
Maximum token length: 40
Regular tokens: 65322
Single byte tokens: 214
Special tokens: 0
UNK token: No (can be added)
Deleted tokens: 0
Total tokens: 65536
Exported: wss.vocab
```

Convert it to YAML format:
```sh
./exportvocab -input-vocab wss.vocab -output-yaml wss.vocab.yaml -order-by-score
See [wss.vocab.yaml](wss/wss.vocab.yaml) for an example of what the resulting vocabulary looks like.
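A minimal sketch for inspecting the exported YAML, assuming one `- token:` line per entry as in the excerpts above:

```sh
# Show the vocabulary header and the first few tokens (highest-scoring,
# given -order-by-score).
head -n 30 wss.vocab.yaml

# Count the token entries; this should match the reported total of 65,536
# if every token gets its own entry.
grep -c '^- token:' wss.vocab.yaml
```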