{"id":20462064,"url":"https://github.com/bact/thaitokens","last_synced_at":"2025-03-05T11:32:32.215Z","repository":{"id":232321152,"uuid":"783859187","full_name":"bact/thaitokens","owner":"bact","description":"Thai subword tokens","archived":false,"fork":false,"pushed_at":"2024-04-28T18:09:33.000Z","size":1659,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-24T12:46:40.431Z","etag":null,"topics":["thai","thai-language","tokenization","tokens"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bact.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-08T18:00:21.000Z","updated_at":"2024-04-28T18:09:36.000Z","dependencies_parsed_at":"2024-04-12T04:41:55.703Z","dependency_job_id":null,"html_url":"https://github.com/bact/thaitokens","commit_stats":null,"previous_names":["bact/thaitokens"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bact%2Fthaitokens","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bact%2Fthaitokens/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bact%2Fthaitokens/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bact%2Fthaitokens/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bact","download_url":"https://codeload.github.com/bact/thaitokens/tar.gz/refs/
heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242018977,"owners_count":20058710,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["thai","thai-language","tokenization","tokens"],"created_at":"2024-11-15T12:29:50.639Z","updated_at":"2025-03-05T11:32:31.845Z","avatar_url":"https://github.com/bact.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# thaitokens\n\nExperimenting with extracting Thai subword tokens for language model creation, using [TokenMonster](https://github.com/alasdairforsythe/tokenmonster/) to build a subword token list from Thai datasets.\n\nExample subword tokens generated from the [Wisesight Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment) dataset (see the full list in [wss.vocab.yaml](wss/wss.vocab.yaml)):\n\n```yaml\ncharset: utf-8\nnormalization: \"nfd quotemarks collapse trim unixlines\"\ncapcode: 0\ntraining-param: 34\ntokens:\n    - token:   \"TokenMonsterHexEncode{b8}\"\n      id:      155\n      score:   0.0063883355\n      encoded: true\n    - token:   \"TokenMonsterHexEncode{b9}\"\n      id:      156\n      score:   0.0019254258\n      encoded: true\n    - token:   \" \"\n      id:      4\n      score:   0.0017494631\n      encoded: true\n    - token:   \"\\n\"\n      id:      1\n      score:   0.0014612209\n      encoded: true\n    - token:   \" #\"\n      id:      237\n      score:   0.0011326408\n      encoded: true\n    - token:   \" และ\"\n      id:      19632\n   
   score:   0.00089728273\n      encoded: true\n    - token:   \"ครับ\"\n      id:      22685\n      score:   0.0007870678\n      encoded: true\n```\n\n[...]\n\n```yaml\n    - token:   \"ลยค่ะ\"\n      id:      36162\n      score:   0.00013340132\n      encoded: true\n    - token:   \"สมิติเวช\"\n      id:      54060\n      score:   0.00013340132\n      encoded: true\n    - token:   \"ไม่เห็น\"\n      id:      51286\n      score:   0.00013340132\n      encoded: true\n    - token:   \"แนะนำให้\"\n      id:      54773\n      score:   0.00013340132\n      encoded: true\n    - token:   \"ผู้โชคดี\"\n      id:      53678\n      score:   0.00013340132\n      encoded: true\n```\n\n## Steps\n\nFollow the [4 training steps](https://github.com/alasdairforsythe/tokenmonster/tree/main/training) as detailed by the TokenMonster project. You need the Go compiler to build the training toolchain.\n\n### 1. Prepare the dataset\n\nBuild a mini dataset from the [Wisesight Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment) (6 MiB):\n\n```sh\ncat neg.txt neu.txt pos.txt q.txt \u003e wss.txt\n```\n\n### 2. Generate tokens\n\n```sh\n./getalltokens -dataset wss.txt -output wss.alltokens -mode balanced -capcode 0 -charset utf-8 -norm \"collapse quotemarks nfd trim unixlines\" -only-valid -min-occur 2 -workers 2\n```\n\n- `-capcode 0` is recommended by TokenMonster for languages that don't use spaces as word separators.\n- `-workers N` sets the number of worker threads to run, excluding the main thread. 
Best to set it to 1 less than the number of CPU threads.\n\nIt will start generating tokens:\n\n```text\nCharset: UTF-8\nNormalization: NFD Quotemarks Collapse Trim UnixLines\nCapcode: 0 (disabled)\nOptimization mode: 2 (balanced)\nOnly valid UTF-8 allowed\n2024/04/08 21:31:14 Loading wss.txt\n2024/04/08 21:31:14 Finding tokens in chunk 1 of 1\n2024/04/08 21:45:06 Tokens before final trim: 25,759,395\n2024/04/08 21:45:06 Trimming final tokens for min 2\n2024/04/08 21:45:11 Tokens after trimming: 7,317,906\n2024/04/08 21:45:11 Filtered 251,869,920 tokens in 13m56.882s\n2024/04/08 21:45:11 Saving tokens...\n2024/04/08 21:45:16 Saved: wss.alltokens\n```\n\n### 3. Train vocabulary\n\nUse the dataset from Step (1) and tokens from Step (2) to get a vocabulary:\n\n```sh\n./trainvocab -dataset wss.txt -dictionary wss.alltokens -dir wss-results -include-utf8-bytes -vocab-size 65536 -workers 2\n```\n\nDifferent results will be saved to the `wss-results` directory:\n\n```text\nLoading wss.alltokens\nCharset: UTF-8\nNormalization: NFD Quotemarks Collapse Trim UnixLines\nCapcode: 0 (disabled)\nOptimization mode: 2 (balanced)\nVocabulary size: 65536\nSingle byte tokens: 213\nLoading wss.txt\n2024/04/08 23:14:07 Worker 1 starting run 1\n2024/04/08 23:14:07 Worker 0 starting run 1\n2024/04/08 23:14:09 Worker 1 completed run 1  Score: 635,748\n2024/04/08 23:14:09 Worker 0 completed run 1  Score: 629,851\n\n[...]\n\n2024/04/09 00:46:01 Worker 1 completed run 1028  Score: 651,943\n2024/04/09 00:46:01 Deleted 3 of 3 tokens; Remaining 65,560 tokens;  reached_vocab Best: 651,555; Tries:998\n2024/04/09 00:46:03 Worker 0 completed run 1029  Score: 651,902\n2024/04/09 00:46:03 Deleted 1 of 2 tokens; Remaining 65,559 tokens;  reached_vocab Best: 651,555; Tries:999\n2024/04/09 00:46:04 Worker 1 completed run 1029  Score: 652,022\n2024/04/09 00:46:04 -- FINISHED --\nNo new best score in 1000 runs\nBest result tokenized 6,296,789 bytes with 651,555 tokens\nAverage 9.664 characters/token\nBest 
result:\n  wss-results/651555_568.tok\n```\n\n### 4. Export vocabulary\n\nExtract tokens from the best vocabulary:\n\n```sh\n./exportvocab -input wss-results -output wss.vocab\n```\n\n```text\nLoading wss-results/651555_568.tok\nCapcode:               0 (disabled)\nCharset:               UTF-8\nNormalization:         NFD Quotemarks Collapse Trim UnixLines\nOptimization mode:     2 (balanced)\nMaximum token length:  40\nRegular tokens:        65322\nSingle byte tokens:    214\nSpecial tokens:        0\nUNK token:             No (can be added)\nDeleted tokens:        0\nTotal tokens:          65536\n\nExported: wss.vocab\n```\n\nConvert it to YAML format:\n\n```sh\n./exportvocab -input-vocab wss.vocab -output-yaml wss.vocab.yaml -order-by-score\n```\n\nSee [wss.vocab.yaml](wss/wss.vocab.yaml) for an example of the resulting vocabulary.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbact%2Fthaitokens","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbact%2Fthaitokens","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbact%2Fthaitokens/lists"}