https://github.com/pythainlp/khanomtanllm

Host: GitHub
URL: https://github.com/pythainlp/khanomtanllm
Owner: PyThaiNLP
License: apache-2.0
Created: 2024-08-31T06:49:30.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-09-24T15:15:51.000Z (over 1 year ago)
Last Synced: 2025-03-26T22:11:14.656Z (about 1 year ago)
Language: Python
Homepage: http://pythainlp.org/KhanomTanLLM/
Size: 19.5 KB
Stars: 15
Watchers: 1
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# KhanomTanLLM

> KhanomTan (Thai name is ขนมตาล) + LLM

![](https://imgur.com/LpQmJqY.png)

> Image generated with [FLUX.1 [dev]](https://huggingface.co/spaces/black-forest-labs/FLUX.1-dev)

KhanomTan LLM is a bilingual language model by PyThaiNLP, trained on Thai and English text from open-source datasets. We trained the model on public datasets only, and it is fully open source: we release the dataset, the training pipeline, and the models.

Codename: numfa-v2

Blog Post (Thai): [https://pythainlp.org/2024-09-12-khanomtanllm/](https://pythainlp.org/2024-09-12-khanomtanllm/)

- **Online Demo**: [https://huggingface.co/spaces/wannaphong/KhanomTanLLM-demo](https://huggingface.co/spaces/wannaphong/KhanomTanLLM-demo)

- Pretraining dataset: [https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset](https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset)

    * Thai subset only: [https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset-thai-subset](https://huggingface.co/datasets/wannaphong/KhanomTanLLM-pretrained-dataset-thai-subset)

    * List of Thai subset datasets: [https://huggingface.co/collections/pythainlp/datasets-for-pretrained-thai-llm-65db96ab730386b492889a98](https://huggingface.co/collections/pythainlp/datasets-for-pretrained-thai-llm-65db96ab730386b492889a98)

- Pretraining script: [https://github.com/wannaphong/EasyLM/tree/KhanomTanLLM-pretraining](https://github.com/wannaphong/EasyLM/tree/KhanomTanLLM-pretraining)

- Pretrained Models:

    * 1B: [https://huggingface.co/pythainlp/KhanomTanLLM-1B](https://huggingface.co/pythainlp/KhanomTanLLM-1B)

    * 3B: [https://huggingface.co/pythainlp/KhanomTanLLM-3B](https://huggingface.co/pythainlp/KhanomTanLLM-3B)

- Instruct Models:

    * Instruct dataset: [wannaphong/KhanomTanLLM-Instruct-dataset](https://huggingface.co/datasets/wannaphong/KhanomTanLLM-Instruct-dataset)

    * SFT Script: [https://github.com/PyThaiNLP/KhanomTanLLM/tree/main/finetuning](https://github.com/PyThaiNLP/KhanomTanLLM/tree/main/finetuning)

    * 1B: [https://huggingface.co/pythainlp/KhanomTanLLM-1B-Instruct](https://huggingface.co/pythainlp/KhanomTanLLM-1B-Instruct)

    * 3B: [https://huggingface.co/pythainlp/KhanomTanLLM-3B-Instruct/](https://huggingface.co/pythainlp/KhanomTanLLM-3B-Instruct/)

## Instruct Models

We fine-tuned the models on [wannaphong/KhanomTanLLM-Instruct-dataset](https://huggingface.co/datasets/wannaphong/KhanomTanLLM-Instruct-dataset). The models do not have any safeguards, so use them at your own risk.

For the best results, we suggest the following sampling settings:

- temperature: 2 - 4

- min_p: > 0.6
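
The settings above could be passed to a text-generation call roughly as follows. This is a minimal sketch, not an official usage example: it assumes the Hugging Face `transformers` library (a version whose `generate()` accepts `min_p`) with PyTorch installed, and the helper names `suggested_sampling` and `generate`, the default values 3.0 and 0.7, and the plain-text prompt format are illustrative choices that do not come from this README.

```python
def suggested_sampling(temperature: float = 3.0, min_p: float = 0.7) -> dict:
    """Build generate() keyword arguments inside the README's suggested ranges.

    The defaults (3.0 and 0.7) are arbitrary picks within those ranges.
    """
    if not 2.0 <= temperature <= 4.0:
        raise ValueError("temperature should be between 2 and 4")
    if not 0.6 < min_p <= 1.0:
        raise ValueError("min_p should be above 0.6 and at most 1")
    # do_sample=True is required for temperature and min_p to take effect.
    return {"do_sample": True, "temperature": temperature, "min_p": min_p}


def generate(
    prompt: str,
    model_id: str = "pythainlp/KhanomTanLLM-1B-Instruct",
    max_new_tokens: int = 128,
) -> str:
    """Generate text with the suggested settings (downloads the model on first use)."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Assumption: the prompt is passed as plain text; the README does not
    # specify a prompt or chat format.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        **suggested_sampling(),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("สวัสดี")` would download the 1B instruct model and return the decoded output; pass `model_id="pythainlp/KhanomTanLLM-3B-Instruct"` to try the 3B variant.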

## Acknowledgements

Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC). We used a TPU v4-64 to train the models.

Thank you to the [TPU Research Cloud](https://sites.research.google/trc/about/) and the [EasyLM project](https://github.com/young-geng/EasyLM)! We used EasyLM to pretrain the models.
