https://github.com/xingyaoww/nemo-util
- Host: GitHub
- URL: https://github.com/xingyaoww/nemo-util
- Owner: xingyaoww
- Created: 2024-03-29T06:03:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-07T08:56:38.000Z (about 1 year ago)
- Last Synced: 2024-04-07T09:38:08.170Z (about 1 year ago)
- Language: Python
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Setup
```bash
git clone https://github.com/xingyaoww/nemo-util
git submodule update --init --recursive
cd NeMo; git checkout chatml-support; cd ..
```

# SFT Training pipeline (use Mistral-7B as an example)
## Step 1: Convert a Hugging Face model to NeMo format
You should first `git clone` the model into `data/models/raw_hf`, following Hugging Face's instructions, for example to `data/models/raw_hf/Mistral-7B-v0.1`.
You should then add ChatML tokens to the model's tokenizer and resize the model's embedding for subsequent SFT.
```bash
python3 scripts/model_conversion/expand_mistral_7b_hf.py \
--ckpt_dir data/models/raw_hf/Mistral-7B-v0.1 \
--output_dir data/models/converted_hf/Mistral-7B-v0.1
```

You will see the converted model at `data/models/converted_hf/Mistral-7B-v0.1`. Now you can convert it to NeMo format:
```bash
# enter NEMO docker
MODEL_DIR=`pwd`/data/models/converted_hf ./scripts/docker/run_nemo_interactive.sh
# Do the conversion
./scripts/model_conversion/convert_mistral_7b.sh
```

Then you will see your model at `data/models/nemo/mistral-7b-base.nemo`.
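The ChatML tokens added during conversion delimit each chat turn in the training data. As a rough illustration (not the repo's code), a ChatML-formatted conversation looks like this:

```python
# Minimal sketch of the ChatML template (illustrative only; the
# actual template used for training lives in the NeMo scripts).
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render_chatml(messages):
    """Render a list of {role, content} dicts as a ChatML string."""
    return "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )

example = render_chatml([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
```

The `<|im_start|>` and `<|im_end|>` markers are new vocabulary entries, which is why the tokenizer and embedding matrix must be expanded before SFT.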
## Step 2: Prepare your dataset
We will use [CodeActInstruct](https://huggingface.co/datasets/xingyaoww/code-act) as an example. First, download it to `.jsonl` files:
```bash
python3 scripts/data/download_dataset_from_hf.py
```

Because CodeActInstruct stores chat in the OpenAI messages format, you need to convert it to NeMo's chat format. A comparison between the two formats can be found [here](./scripts/data/convert_openai_to_nemo_chat_format.py).
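As a rough sketch of the difference (the linked conversion script is authoritative, and the field names below are assumptions): OpenAI uses a flat `messages` list with `role`/`content` pairs, while NeMo-style chat records group turns under `conversations` with `from`/`value` fields. A hypothetical converter:

```python
# Illustrative sketch only -- the repo's conversion script defines the
# actual NeMo chat schema; the field names here are assumptions.
ROLE_MAP = {"user": "User", "assistant": "Assistant"}

def openai_to_nemo(messages):
    """Convert OpenAI-style messages to a NeMo-style chat record."""
    system = "".join(m["content"] for m in messages if m["role"] == "system")
    return {
        "system": system,
        "conversations": [
            {"from": ROLE_MAP[m["role"]], "value": m["content"]}
            for m in messages if m["role"] in ROLE_MAP
        ],
        "mask": "User",  # which speaker's turns are masked from the loss
    }

record = openai_to_nemo([
    {"role": "system", "content": "Be helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
```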
```bash
python3 scripts/data/convert_openai_to_nemo_chat_format.py \
data/datasets/codeact.jsonl,data/datasets/general.jsonl \
--output_file data/datasets/codeact-mixture.nemo.jsonl
```

Then you can pack shorter examples in `data/datasets/codeact-mixture.nemo.jsonl` into longer sequences by running:
```bash
./scripts/data/convert_nemo_chat_to_packed.sh
```

By default, this script uses the ChatML chat template and a max sequence length of 16k. You can customize the script to better suit your needs.
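Packing concatenates short examples into sequences up to the max length so fewer tokens are wasted on padding. The repo's script handles tokenization, templates, and loss masking; the core grouping idea can be sketched as a greedy first-fit over example lengths:

```python
def pack_examples(lengths, max_len=16384):
    """Greedy first-fit packing: group example lengths into bins
    whose total length stays at or under max_len."""
    bins = []  # each bin: [total_length, [example indices]]
    for i, n in enumerate(lengths):
        for b in bins:
            if b[0] + n <= max_len:  # fits in an existing bin
                b[0] += n
                b[1].append(i)
                break
        else:  # no bin had room; open a new one
            bins.append([n, [i]])
    return [b[1] for b in bins]

packed = pack_examples([9000, 6000, 12000, 2000], max_len=16384)
# -> [[0, 1], [2, 3]]: two packed sequences instead of four padded ones
```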