{"id":18107695,"url":"https://github.com/zetavg/twlm","last_synced_at":"2025-08-20T20:33:46.984Z","repository":{"id":169749557,"uuid":"645774510","full_name":"zetavg/twlm","owner":"zetavg","description":"Taiwanese Mandarin LLM Project","archived":false,"fork":false,"pushed_at":"2023-05-26T14:01:47.000Z","size":1242,"stargazers_count":19,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-12-13T08:31:56.778Z","etag":null,"topics":["llm","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zetavg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-26T12:04:52.000Z","updated_at":"2024-02-16T05:37:25.000Z","dependencies_parsed_at":null,"dependency_job_id":"376b0af8-b0f0-4697-84d0-d5e023c16a88","html_url":"https://github.com/zetavg/twlm","commit_stats":null,"previous_names":["zetavg/twlm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zetavg%2Ftwlm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zetavg%2Ftwlm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zetavg%2Ftwlm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zetavg%2Ftwlm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zetavg","download_url":"https://codeload.github.com/zetavg/twlm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":230454447,"owners_count":18228393,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","machine-learning"],"created_at":"2024-10-31T23:13:18.437Z","updated_at":"2024-12-19T15:16:32.198Z","avatar_url":"https://github.com/zetavg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Taiwanese Mandarin LM\n\nAn attempt to re-train English language models to understand and generate fluent Taiwanese Mandarin (Traditional Chinese).\n\n## Trained Models\n\n* [TW-Pythia-6.9B-Chat](https://huggingface.co/twlm/tw-pythia-6.9b-chat-v0_2)\n\nDemo: see https://hackmd.io/@z/twlm-demo\n\n## Usage\n\nThere are three main steps:\n\n1. Build Tokenizer - extend the specified base tokenizer with new Chinese tokens (`python build_tokenizer.py`).\n2. Prepare Dataset - prepare the training dataset (`python prepare_dataset.py \u003ctrain_name\u003e`).\n3. 
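Train Model - train the model (`python train.py \u003ctrain_name\u003e`).\n\nThe parameters for each step are set in a config file; see `configs/sample.yaml` for details.\n\nMultiple trains can be defined, each with its own training data, hyperparameters, and set of trainable parameters. For example, the config file can contain:\n\n```yaml\ntraining:\n  # First training run: train only the text embeddings\n  embeddings:\n    dataset:\n      # ...\n    only_train_parameters_matching:\n      - 'embed'  # train only parameters whose names match 'embed'\n    training_arguments:\n      # ...\n\n  # Second training run: instruction tuning\n  instruction_tuning:\n    base_on:\n      output_of: embeddings  # continue training from the model produced by the 'embeddings' run\n    dataset:\n      # ...\n    training_arguments:\n      # ...\n```\n\nWith the config above, run `python prepare_dataset.py embeddings` to prepare the training data for 'embeddings', then run `python train.py embeddings` to start the 'embeddings' training.\n\nAll of the commands above accept `--cfg` to select which config file to use; for example, `python build_tokenizer.py --cfg default` uses `configs/default.yaml`. Alternatively, use `--config_file_path` to give the path to a config file, for example `python train.py --config_file_path '~/configs/80k_tokens.yaml'`.\n\n\n### Saving Immediately and Aborting Early\n\nWhile `train.py` is training, if it detects a file named `save_now` in the project directory, it immediately saves a checkpoint.\n\nIf it detects a file named `abort` in the project directory, it stops training early. A run that is stopped early still saves the model and uploads it to the Hugging Face Hub (if enabled). For a model uploaded after an early stop, the model card automatically notes the epoch and global step at which training stopped.\n\nFor example, change to the directory containing train.py and run `touch save_now` to save immediately, or `touch abort` to stop training early.\n\n\n### Training in the Cloud with SkyPilot\n\n(SkyPilot must be installed and configured first; see https://skypilot.readthedocs.io/en/latest/getting-started/installation.html for details.)\n\nFirst, copy `sky_training.yaml.sample` to `sky_training.yaml` (`cp sky_training.yaml.sample sky_training.yaml`), then edit `sky_training.yaml` to adjust the machine resources and the storage bucket to use.\n\nNext, if needed, copy `sky_prepare.sh.sample` to `sky_prepare.sh` and edit it to add commands that should run before each cloud training starts, such as switching to a specific Google Cloud configuration.\n\nOnce that is done, run `./sky_train.sh \u003ctrain_name\u003e` to start cloud training. `sky_train.sh` wraps the underlying `sky launch` or `sky exec` command: it copies the necessary local training code to the cloud and shares the locally logged-in Hugging Face and Weights \u0026 Biases credentials with the cloud machine.\n\n`./sky_train.sh` works like `python train.py`: you can use `--cfg` to select the config (but it does not support 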
`--config_file_path`).\n\nIn addition, `./sky_train.sh` accepts `--cluster_name \u003cname\u003e` or `-n \u003cname\u003e` to select the SkyPilot cluster to use (equivalent to `-c` for `sky launch`), and `--skip_setup` or `-s` to skip setup of the cloud machine (with `--skip_setup`, `sky exec` is used behind the scenes instead of `sky launch`).\n\n\n### Other Tools\n\n* Preview a dataset: `python preview_dataset.py --cfg=... \u003ctrain_name\u003e --split=test --range_=10,20` (the arguments are essentially the same as for `train.py`, plus three more: `--split`, `--range_`, and `--only_preview`).\n* Run a preliminary check of the config before training: `python train_check_config.py --cfg=... \u003ctrain_name\u003e`.\n* Compare two configs: `python diff_configs.py config_1 config_2`.\n\n## Related Projects\n\n* [zetavg/LLM-Research](https://github.com/zetavg/LLM-Research)\n* [zetavg/LLaMA-LoRA-Tuner](https://github.com/zetavg/LLaMA-LoRA-Tuner)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzetavg%2Ftwlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzetavg%2Ftwlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzetavg%2Ftwlm/lists"}