{"id":20457845,"url":"https://github.com/pythainlp/wangchanglm","last_synced_at":"2025-04-13T05:30:21.578Z","repository":{"id":158479797,"uuid":"629091754","full_name":"PyThaiNLP/WangChanGLM","owner":"PyThaiNLP","description":"WangChanGLM 🐘 - The Multilingual Instruction-Following Model","archived":false,"fork":false,"pushed_at":"2023-11-22T01:44:47.000Z","size":3163,"stargazers_count":94,"open_issues_count":0,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-26T22:11:38.665Z","etag":null,"topics":["chatbot","instruction-following","llm","multilingual-models"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PyThaiNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-04-17T15:46:18.000Z","updated_at":"2024-11-10T06:09:49.000Z","dependencies_parsed_at":"2023-06-25T19:37:51.470Z","dependency_job_id":null,"html_url":"https://github.com/PyThaiNLP/WangChanGLM","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2FWangChanGLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2FWangChanGLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2FWangChanGLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2FWangChanGLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PyThaiNLP","download_url":"https://codeload.github.com/PyThai
NLP/WangChanGLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248669844,"owners_count":21142892,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","instruction-following","llm","multilingual-models"],"created_at":"2024-11-15T12:09:27.578Z","updated_at":"2025-04-13T05:30:21.547Z","avatar_url":"https://github.com/PyThaiNLP.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WangChanGLM 🐘 - The Multilingual Instruction-Following Model\n\n[Blog](https://medium.com/@iwishcognitivedissonance/wangchanglm-the-thai-turned-multilingual-instruction-following-model-7aa9a0f51f5f) | [Code](https://github.com/pythainlp/wangchanglm) | [Demo](https://colab.research.google.com/github/pythainlp/WangChanGLM/blob/main/demo/WangChanGLM_v0_1_demo.ipynb) \n\n![](https://i.imgur.com/RL5Mqs3.png)\n\nWangChanGLM is a multilingual, instruction-finetuned version of Facebook's XGLM-7.5B, trained on open-source, commercially permissible datasets (LAION OIG chip2 and infill_dbpedia, DataBricks Dolly v2, OpenAI TL;DR, and Hello-SimpleAI HC3; about 400k examples) and released under CC-BY-SA 4.0. The models are trained to perform the subset of instruction-following tasks we found most relevant, namely reading comprehension, brainstorming, and creative writing. 
We provide the weights for a model finetuned on an English-only dataset ([wangchanglm-7.5B-sft-en](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en)) and another checkpoint further finetuned on a Google-translated Thai dataset ([wangchanglm-7.5B-sft-enth](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-enth)). We perform Vicuna-style evaluation using both humans and ChatGPT (in our case, `gpt-3.5-turbo`, since we are still on the waitlist for `gpt-4`) and observe some discrepancies between the two types of annotators. All training and evaluation code is shared under the [Apache-2.0 license](https://github.com/pythainlp/wangchanglm/blob/main/LICENSE) on our GitHub, and the datasets and model weights are available on [HuggingFace](https://huggingface.co/pythainlp). In a similar manner to [Dolly v2](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm), we use only open-source, commercially permissive pretrained models and datasets, so our models are restricted neither by a non-commercial clause, like models that use LLaMA as a base, nor by a non-compete clause, like models that use self-instruct datasets from ChatGPT. 
See our live demo [here](https://colab.research.google.com/github/pythainlp/WangChanGLM/blob/main/demo/WangChanGLM_v0_1_demo.ipynb).\n\n## Models\n\nWe provide various versions of our models as follows:\n* [pythainlp/wangchanglm-7.5B-sft-en](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en)\n* [pythainlp/wangchanglm-7.5B-sft-enth](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-enth)\n\nSharded versions used in the demo:\n* [pythainlp/wangchanglm-7.5B-sft-en-sharded](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-en-sharded)\n* [pythainlp/wangchanglm-7.5B-sft-enth-sharded](https://huggingface.co/pythainlp/wangchanglm-7.5B-sft-enth-sharded)\n\n## Training Sets\n\nWe provide our training sets as follows:\n* [pythainlp/final_training_set_v1](https://huggingface.co/datasets/pythainlp/final_training_set_v1)\n* [pythainlp/final_training_set_v1_enth](https://huggingface.co/datasets/pythainlp/final_training_set_v1_enth)\n\n## Finetuning\n\n### Multi-world LoRA\n\nWe finetuned XGLM-7.5B on 4 V100 GPUs (32GB VRAM each) with the hyperparameters described in `script/train_sft_peft_multi_world.py`.\n\n```\n# effective batch size = 128 (4 GPUs * 1 batch size * 32 gradient accumulation)\npython -m torch.distributed.launch --nproc_per_node=4 train_sft_peft_multi_world.py \\\n--per_device_train_batch_size 1 --gradient_accumulation_steps 32 \\\n--wandb_project your_project_name \\\n--model_name facebook/xglm-7.5B \\\n--dataset_name pythainlp/final_training_set_v1 \\\n--adapter_name save_adapter_to\n```\n\nThe adapter is merged into the main weights with the script from [lvwerra/trl](https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt-neox-20b_peft/merge_peft_adapter.py).\n\n### Single-world LoRA\n\nIt is possible to finetune XGLM-7.5B on a single 32GB-VRAM GPU, or on multiple GPUs with less VRAM each, with the hyperparameters described in `script/train_sft_peft_single_world.py`.\n\n```\n# effective batch size = 128 (1 GPU * 2 batch size * 64 gradient accumulation)\npython train_sft_peft_single_world.py \\\n--per_device_train_batch_size 2 
--gradient_accumulation_steps 64 \\\n--wandb_project your_project_name \\\n--model_name facebook/xglm-7.5B \\\n--dataset_name pythainlp/final_training_set_v1 \\\n--adapter_name save_adapter_to\n```\n\n### Full-finetuning\n\nWe also provide a script for full finetuning, which we experimented with on a smaller model and a different set of training data.\n\n```\npython -m torch.distributed.launch --nproc_per_node=8 train_sft.py \\\n--per_device_train_batch_size=8 --per_device_eval_batch_size=8 --gradient_accumulation_steps=16 \\\n--model_name=facebook/xglm-1.7B --bf16 --deepspeed=../config/sft_deepspeed_config.json\n```\n\n## Inference\n\nWe performed inference on the OpenAssistant prompts using the hyperparameters described in `script/generate_huggingface_answer.py`.\n\n```\npython generate_huggingface_answer.py --input_fname ../data/oasst1_gpt35turbo_answer.csv \\\n--model_name pythainlp/wangchanglm-7.5B-sft-en \\\n--tokenizer_name pythainlp/wangchanglm-7.5B-sft-en \\\n--output_fname ../data/oasst1_wangchang_sft_en_only_answer_answer.csv \n```\n\n## Evaluation\n\nWe evaluated any pair of model answers using `gpt-3.5-turbo` as described in `script/eval_vicuna_style.py`. The entire inference and evaluation pipeline is stored in `script/infer_and_eval.sh`. The human questionnaires are stored in `data/human_questionnaire`.\n\n## Environmental Impact\n\nExperiments were conducted using a private infrastructure, which has a carbon efficiency of 0.432 kgCO2eq/kWh. A cumulative 500 hours of computation was performed on hardware of type Tesla V100-SXM2-32GB (TDP of 300W). Total emissions are estimated to be 64.8 kgCO2eq (500 h * 0.3 kW * 0.432 kgCO2eq/kWh), of which 0 percent was directly offset. 
Estimations were conducted using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).\n\n## BibTeX\n\n```\n@software{charin_polpanumas_2023_7878101,\n  author       = {Charin Polpanumas and\n                  Wannaphong Phatthiyaphaibun and\n                  Patomporn Payoungkhamdee and\n                  Peerat Limkonchotiwat and\n                  Lalita Lowphansirikul and\n                  Can Udomcharoenchaikit and\n                  Titipat Achakulwisut and\n                  Ekapol Chuangsuwanich and\n                  Sarana Nutanong},\n  title        = {{WangChanGLM🐘 — The Multilingual\n                   Instruction-Following Model}},\n  month        = apr,\n  year         = 2023,\n  publisher    = {Zenodo},\n  version      = {v0.1},\n  doi          = {10.5281/zenodo.7878101},\n  url          = {https://doi.org/10.5281/zenodo.7878101}\n}\n```\n\n## Acknowledgements\n\nWe would like to thank Hugging Face for the open-source infrastructure and ecosystem they have built, especially [lvwerra](https://github.com/lvwerra/) of the [trl repository](https://github.com/lvwerra/). We give our appreciation to the open-source finetuning pioneers who came before us, including but not limited to [Alpaca](https://github.com/tatsu-lab/stanford_alpaca), [Alpaca-LoRA](https://github.com/tloen/alpaca-lora/), [GPT4All](https://github.com/nomic-ai/gpt4all), [OpenAssistant](https://github.com/LAION-AI/Open-Assistant), [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/), [Vicuna](https://vicuna.lmsys.org/), and [Dolly](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm).\n\n## License\n\nThe source code is licensed under the [Apache-2.0 license](https://github.com/pythainlp/wangchanglm/blob/main/LICENSE). The model weights are licensed under [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). 
Finetuning datasets are sourced from [LAION OIG chip2 and infill_dbpedia](https://huggingface.co/datasets/laion/OIG) ([Apache-2.0](https://github.com/pythainlp/wangchanglm/blob/main/LICENSE)), [DataBricks Dolly v2](https://github.com/databrickslabs/dolly) ([Apache-2.0](https://github.com/pythainlp/wangchanglm/blob/main/LICENSE)), [OpenAI TL;DR](https://github.com/openai/summarize-from-feedback) ([MIT](https://opensource.org/license/mit/)), and [Hello-SimpleAI HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) ([CC-BY SA](https://creativecommons.org/licenses/by-sa/4.0/)).\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fwangchanglm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythainlp%2Fwangchanglm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fwangchanglm/lists"}