https://github.com/xingyaoww/nemo-util

Last synced: about 2 months ago
JSON representation

Host: GitHub
URL: https://github.com/xingyaoww/nemo-util
Owner: xingyaoww
Created: 2024-03-29T06:03:33.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-04-07T08:56:38.000Z (about 1 year ago)
Last Synced: 2024-04-07T09:38:08.170Z (about 1 year ago)
Language: Python
Size: 8.79 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Setup

```
git clone https://github.com/xingyaoww/nemo-util
git submodule update --init --recursive
cd NeMo; git checkout chatml-support; cd ..
```

# SFT Training pipleine (use Mistral-7B as an example)

## Step 1: Convert a huggingface model to NEMO format

You should `git clone` the model to `data/models/raw_hf` first following huggingface's instruction, for example to `data/models/raw_hf/Mistral-7B-v0.1`.

You should add chatML tokens to the model's tokenizer and re-size the model's embedding for subsequent SFT.

```bash
python3 scripts/model_conversion/expand_mistral_7b_hf.py \
--ckpt_dir data/models/raw_hf/Mistral-7B-v0.1 \
--output_dir data/models/converted_hf/Mistral-7B-v0.1
```

You will see the converted model at `data/models/converted_hf/Mistral-7B-v0.1`. Now you can convert it to NEMO format:

```bash
# enter NEMO docker
MODEL_DIR=`pwd`/data/models/converted_hf ./scripts/docker/run_nemo_interactive.sh
# Do the conversion
./scripts/model_conversion/convert_mistral_7b.sh
```

Then you will be able to see your model at `data/models/nemo/mistral-7b-base.nemo`.

## Step 2: Prepare your dataset

We will use [CodeActInstruct](https://huggingface.co/datasets/xingyaoww/code-act) for example. You can first download it to `.jsonl` files:

```bash
python3 scripts/data/download_dataset_from_hf.py
```

Because CodeActInstruct uses OpenAI messages format for chat, you need to convert it to NeMo's chat format. Comparison between two format can be found [here](./scripts/data/convert_openai_to_nemo_chat_format.py).

```bash
python3 scripts/data/convert_openai_to_nemo_chat_format.py \
data/datasets/codeact.jsonl,data/datasets/general.jsonl \
--output_file data/datasets/codeact-mixture.nemo.jsonl
```

Then you can pack shorter examples in `data/datasets/codeact-mixture.nemo.jsonl` to a longer sequence by running:

```bash
./scripts/data/convert_nemo_chat_to_packed.sh
```

This script by default uses ChatML chat template and max sequence length of 16k. You can customize the script to better suite your need.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/xingyaoww/nemo-util

Awesome Lists containing this project

README