https://github.com/bobazooba/xllm-demo

Demo project using XLLM
https://github.com/bobazooba/xllm-demo
chatbot chatgpt deep-learning deep-neural-networks gpt gpt-4 large-language-models llm mistral natural-language-processing nlp openai pytorch torch transfer-learning transformer transformers
Last synced: 4 months ago
JSON representation
Demo project using XLLM
Host: GitHub
URL: https://github.com/bobazooba/xllm-demo
Owner: BobaZooba
License: apache-2.0
Created: 2023-11-10T17:56:14.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-11-17T11:55:26.000Z (over 1 year ago)
Last Synced: 2025-01-10T23:47:05.928Z (4 months ago)
Topics: chatbot, chatgpt, deep-learning, deep-neural-networks, gpt, gpt-4, large-language-models, llm, mistral, natural-language-processing, nlp, openai, pytorch, torch, transfer-learning, transformer, transformers
Language: Python
Homepage: https://github.com/BobaZooba/xllm
Size: 67.4 KB
Stars: 9
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # 🦖 Demo project using XLLM

This project is made in order to demonstrate the possibilities of customization of the `xllm` library. In this project,

the simplest extensions will be implemented so that the user can understand as much as possible how to make his project

using `xllm`.

`xllm` enables not only to prototype models, but also facilitates the development of production-ready solutions through

built-in capabilities and customization.

Using `xllm` to train a model is simple and involves these few steps:

1. `Prepare` — Get the data and the model ready by downloading and preparing them. Saves data locally

   to `config.train_local_path_to_data` and `config.eval_local_path_to_data` if you are using eval dataset

2. `Train` — Use the data prepared in the previous step to train the model

3. `Fuse` — If you used LoRA during the training, fuse LoRA

4. `Quantize` — Make your model take less memory by quantizing it

Remember, these tasks in `xllm` start from the command line. In this project, we will cover all aspects of

customizing `xllm`.

[](https://github.com/BobaZooba/xllm)

## Useful materials

- [X—LLM Repo](https://github.com/BobaZooba/xllm): main repo of the `xllm` library

- [Quickstart](https://github.com/KompleteAI/xllm/tree/docs-v1#quickstart-): basics of `xllm`

- [Examples](https://github.com/BobaZooba/xllm/examples): minimal examples of using `xllm`

- [Guide](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md): here, we go into detail about everything the library

  can

  do

- [Demo project](https://github.com/BobaZooba/xllm-demo): here's a minimal step-by-step example of how to use X—LLM and

  fit it

  into your own project

- [WeatherGPT](https://github.com/BobaZooba/wgpt): this repository features an example of how to utilize the xllm

  library. Included is a solution for a common type of assessment given to LLM engineers, who typically earn between

  $120,000 to $140,000 annually

- [Shurale](https://github.com/BobaZooba/shurale): project with the finetuned 7B Mistal model

# Installation

Create python virtual environment

```bash

python3 -m venv venv

```

Activate venv

```bash

source venv/bin/activate

```

Install project (editable) and requirements

```bash

pip install -e .

```

## Environment variables

Make `.env` file and fill with your values. Please take a look at `.env.template` file.

W&B values needs only for `train` step.

# Dataset

For demo purposes we will use [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)

This dataset is widely used in the field of language models. It is made up of pairs of dialogues, labeled as `chosen`

and `rejected`, with the distinction being only in the final line of dialogue. Our approach to utilizing this dataset

will be unconventional; rather than adhering to the authors' original design, we will select the `chosen` dialogues and

use them to train a conventional language model to predict the following token.

To achieve this, we will need to implement a new `xllm` dataset.

Implementing a new dataset is very simple. You need to create a new class that inherits from `BaseDataset` and implement

two methods: `get_data` (classmethod) and `get_sample`.

```python

from typing import Tuple, Dict, List, Optional

from xllm import Config

from xllm.datasets import BaseDataset

from xllm.types import RawSample

class AntropicDataset(BaseDataset):

  @classmethod

  def get_data(cls, config: Config) -> Tuple[List[RawSample], Optional[List[RawSample]]]:

    ...

  def get_sample(self, index: int) -> RawSample:

    ...

```

Please read this information to understand what you need to do at each

method: [How to implement dataset](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#how-to-implement-dataset)

In brief, the `get_data` method should return a list of training samples (and optionally evaluation samples) with

arbitrary structure. In the `get_sample` method, you need to transform each sample you created in `get_data` into a

specific structure, an example of which is provided below.

Example of `get_sample` output:

```python

{

  "text_parts": [

    "Hello!",

    "My name is Boris"

  ]

}

```

# Registry

We need to register new components such as dataset, collator, trainer, and experiment to make them accessible through

the `xllm` command-line tools.

To do this, let's implement the `components_registry` function. We need to import `datasets_registry` and

add `AntropicDataset` to it with the key `antropic`. We can place the key `antropic` in the constants of our project.

`xllm_demo/core/registry.py`

```python

from xllm.datasets import datasets_registry

from xllm_demo.core.constants import DATASET_KEY

from xllm_demo.core.dataset import AntropicDataset

def components_registry():

  datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)

```

# CLI

At this stage, we are now ready to initiate the first step in our pipeline, which is `prepare`. This step is responsible

for data preparation, downloading the tokenizer, and the model.

Let's make a file `prepare.py`

`xllm_demo/cli/prepare.py`

```python

from xllm.cli import cli_run_prepare

from xllm_demo.core.registry import components_registry

if __name__ == "__main__":

  components_registry()

  cli_run_prepare()

```

Now we can run the `prepare` step, specifying the key of the new dataset.

```sh

python xllm_demo/cli/prepare.py --dataset_key antropic

```

More details you could find

here: [How to add CLI tools to your project](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#how-to-add-cli-tools-to-your-project)

### Extend CLI

Let's add the remaining CLI functions: `train`, `fuse`, `quantize`. In each function, we will call the component

registration and pass the new config.

`xllm_demo/cli/train.py`

```python

from xllm.cli import cli_run_train

from xllm_demo.core.registry import components_registry

if __name__ == "__main__":

  components_registry()

  cli_run_train()

```

`xllm_demo/cli/fuse.py`

```python

from xllm.cli import cli_run_fuse

from xllm_demo.core.registry import components_registry

if __name__ == "__main__":

  components_registry()

  cli_run_fuse()

```

`xllm_demo/cli/quantize.py`

```python

from xllm.cli import cli_run_quantize

from xllm_demo.core.registry import components_registry

if __name__ == "__main__":

  components_registry()

  cli_run_quantize()

```

## Run

Now our project is ready and we can train the model.

### Prepare

`cli/prepare.sh`

  Bash script

  ```bash

  #!/bin/bash

  

  python3 xllm_demo/cli/prepare.py \

    --dataset_key antropic \

    --model_name_or_path mistralai/Mistral-7B-v0.1 \

    --path_to_env_file ./.env

  ```

### Train

`cli/train.sh`

  Bash script

```bash

#!/bin/bash

python3 xllm_demo/cli/train.py \

  --use_gradient_checkpointing True \

  --deepspeed_stage 0 \

  --stabilize True \

  --model_name_or_path mistralai/Mistral-7B-v0.1 \

  --use_flash_attention_2 False \

  --load_in_4bit True \

  --apply_lora True \

  --raw_lora_target_modules all \

  --per_device_train_batch_size 2 \

  --warmup_steps 1000 \

  --save_total_limit 0 \

  --push_to_hub True \

  --hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \

  --hub_private_repo True \

  --report_to_wandb True \

  --path_to_env_file ./.env

```

### Fuse

`cli/fuse.sh`

  Bash script

```bash

#!/bin/bash

python3 xllm_demo/cli/fuse.py \

  --model_name_or_path mistralai/Mistral-7B-v0.1 \

  --lora_hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \

  --hub_model_id DemoXLLM7B-v1 \

  --hub_private_repo True \

  --force_fp16 True \

  --fused_model_local_path ./fused_model/ \

  --path_to_env_file ./.env

```

### Quantize

`cli/quantize.sh`

  Bash script

```bash

#!/bin/bash

python3 xllm_demo/cli/quantize.py \

  --model_name_or_path ./fused_model/ \

  --apply_lora False --stabilize False \

  --quantized_model_path ./quantized_model/ \

  --prepare_model_for_kbit_training False \

  --quantized_hub_model_id DemoXLLM7B-v1-GPTQ \

  --quantized_hub_private_repo True \

  --path_to_env_file ./.env

```

### DeepSpeed (Multiple GPUs)

We can train our model on multiple GPUs using DeepSpeed.

`cli/train-deepspeed.sh`

  Bash script

```sh

#!/bin/bash

deepspeed --num_gpus=8 xllm_demo/cli/train.py \

  --use_gradient_checkpointing True \

  --deepspeed_stage 2 \

  --stabilize True \

  --model_name_or_path mistralai/Mistral-7B-v0.1 \

  --use_flash_attention_2 False \

  --load_in_4bit True \

  --apply_lora True \

  --raw_lora_target_modules all \

  --per_device_train_batch_size 8 \

  --warmup_steps 1000 \

  --save_total_limit 0 \

  --push_to_hub True \

  --hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \

  --hub_private_repo True \

  --report_to_wandb True \

  --logging_steps 1 \

  --num_train_epochs 3 \

  --save_steps 1000 \

  --max_steps 10050 \

  --path_to_env_file ./.env

```

### Run scripts

Don't forget to install additional libraries (deepspeed) for this case, which are specified in `requirements-train.txt`.

```sh

pip install -r requirements-train.txt

```

Now we need to grant executable permission to our bash scripts, and then we can start our training.

```sh

chmod -R 755 ./scripts/

```

Run all steps

```sh

./scripts/prepare.sh

./scripts/train.sh

./scripts/fuse.sh

./scripts/quantize.sh

```

**IMPORTANT:** Actually, we can already run our entire project by implementing only the new dataset. All other steps are

optional.

# Collator

The main task of the collator is to convert a list of samples into a batch that can be input to the model. That is, to

turn texts into a tokenized batch and create targets.

Every collator must be inherited from `BaseCollator` and the method `parse_batch` must be implemented.

```python

from typing import List

from xllm.types import RawSample, Batch

from xllm.collators import BaseCollator

class LastPartCollator(BaseCollator):

  def parse_batch(self, raw_batch: List[RawSample]) -> Batch:

    ...

```

Your task is to write the logic of how to process the list in order to eventually obtain a `Batch`. A `Batch` is a

dictionary where the key is a string, and the value is a `PyTorch Tensor`.

More details you could find

here: [How to implement collator](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#how-to-implement-collator)

### Registry

Let's expand the registration of new components by adding a collator.

`xllm_demo/core/registry.py`

```python

from xllm.datasets import datasets_registry

from xllm.collators import collators_registry

from xllm_demo.core.constants import DATASET_KEY, COLLATOR_KEY

from xllm_demo.core.dataset import AntropicDataset

from xllm_demo.core.collator import LastPartCollator

def components_registry():

  datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)

  collators_registry.add(key=COLLATOR_KEY, item=LastPartCollator)

```

# Config

We can extend Config. This may be necessary in order to pass the values of new fields via the command line.

Let's add a new field to the config.

`xllm_demo/core/config.py`

```python

from dataclasses import dataclass, field

from xllm import Config

from xllm_demo.core.constants import DATASET_KEY

@dataclass

class DemoXLLMConfig(Config):

  text_field: str = field(default="chosen", metadata={

    "help": "Field for Antropic RLHF dataset",

  })

  dataset_key: str = field(default=DATASET_KEY, metadata={

    "help": "Dataset key",

  })

```

At the moment of calling `get_data` of the newly implemented `Dataset`, we can choose not the `chosen` dialog but, for

example, the `rejected` one. Although this logic is made for demonstration purposes, it can be useful in real projects.

Now we will be able to pass through the command line which field needs to be selected.

Now by default, the `dataset_key` in the project is the one that we have recently implemented. Therefore, there is no

need for us to pass it every time we call the CLI functions. This is an optional step for demonstration.

Now we need to pass the new config class to every CLI method that we implement.

`xllm_demo/cli/prepare.py`

```python

from xllm.cli import cli_run_prepare

from xllm_demo.core.config import DemoXLLMConfig

from xllm_demo.core.registry import components_registry

if __name__ == "__main__":

  components_registry()

  cli_run_prepare(config_cls=DemoXLLMConfig)

```

Run CLI

```sh

python xllm_demo/cli/prepare.py --text_field rejected

```

Please note that we do not need to register a new config. We need to explicitly specify it in the calls to the `xllm`

CLI functions.

More details you could find

here: [How to extend config](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#how-to-extend-config-1)

In this example, we do not need to update the config for the other steps (`train`, `fuse`, `quantize`), because all that

we added to the config only concerns the preparation of the dataset, meaning only the `prepare` step. However, sometimes

it may be necessary. We can do this similarly to the `prepare` step.

`xllm_demo/cli/train.py`

```python

from xllm.cli import cli_run_train

from xllm_demo.core.config import DemoXLLMConfig

from xllm_demo.core.registry import components_registry

if __name__ == "__main__":

  components_registry()

  cli_run_train(config_cls=DemoXLLMConfig)

```

# Trainer

We can also implement a new `trainer`. It should inherit from the standard trainer of `transformers`.

`xllm_demo/core/trainer.py`

```python

from typing import Union, Dict, Tuple, Optional

from peft import PeftModel

from torch import Tensor

from transformers import PreTrainedModel, TrainingArguments

from xllm.trainers import LMTrainer

from xllm.datasets import BaseDataset

from xllm.collators import BaseCollator

from xllm_demo.core.config import DemoXLLMConfig

class MyLMTrainer(LMTrainer):

  def __init__(

    self,

    config: DemoXLLMConfig,

    model: Union[PreTrainedModel, PeftModel],

    args: TrainingArguments,

    data_collator: BaseCollator,

    train_dataset: BaseDataset,

    ignore_index: int,

    eval_dataset: Optional[BaseDataset] = None,

  ):

    super().__init__(config, model, args, data_collator, train_dataset, ignore_index, eval_dataset)

    self.my_steps = 0

  def compute_loss(

    self,

    model: Union[PreTrainedModel, PeftModel],

    inputs: Dict[str, Tensor],

    return_outputs: bool = False,

  ) -> Union[Tensor, Tuple[Tensor, Dict[str, Tensor]]]:

    self.my_steps += 1

    return super().compute_loss(model=model, inputs=inputs, return_outputs=return_outputs)

```

Above, we simply added a step counter.

We also need to register the new trainer.

`xllm_demo/core/registry.py`

```python

from xllm.datasets import datasets_registry

from xllm.collators import collators_registry

from xllm.trainers import trainers_registry

from xllm_demo.core.constants import DATASET_KEY, COLLATOR_KEY, TRAINER_KEY

from xllm_demo.core.dataset import AntropicDataset

from xllm_demo.core.collator import LastPartCollator

from xllm_demo.core.trainer import MyLMTrainer

def components_registry():

  datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)

  collators_registry.add(key=COLLATOR_KEY, item=LastPartCollator)

  trainers_registry.add(key=TRAINER_KEY, item=MyLMTrainer)

```

More details you could find

here: [How to implement trainer](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#how-to-implement-trainer)

# Experiment

The experiment acts as an aggregator for training, where the necessary components are loaded. We can customize

the `Experiment` by overriding hook methods.

Let's add a few checks and slightly expand the logging.

`xllm_demo/core/experiment.py`

```python

from xllm.experiments import Experiment

from xllm import dist_logger

class MyExperiment(Experiment):

  def before_model_build(self) -> None:

    assert self.model is None

    dist_logger.info("Model is not None", local_rank=self.config.local_rank)

  def after_model_build(self) -> None:

    assert self.model is not None

    dist_logger.info("Model is not None", local_rank=self.config.local_rank)

  def after_train(self) -> None:

    if hasattr(self.model, "my_steps"):

      num_steps = self.model.my_steps

      dist_logger.info(f"Steps: {num_steps}", local_rank=self.config.local_rank)

```

Let's register the new experiment.

`xllm_demo/core/registry.py`

```python

from xllm.datasets import datasets_registry

from xllm.collators import collators_registry

from xllm.trainers import trainers_registry

from xllm.experiments import experiments_registry

from xllm_demo.core.constants import DATASET_KEY, COLLATOR_KEY, TRAINER_KEY, EXPERIMENT_KEY

from xllm_demo.core.dataset import AntropicDataset

from xllm_demo.core.collator import LastPartCollator

from xllm_demo.core.trainer import MyLMTrainer

from xllm_demo.core.experiment import MyExperiment

def components_registry():

  datasets_registry.add(key=DATASET_KEY, item=AntropicDataset)

  collators_registry.add(key=COLLATOR_KEY, item=LastPartCollator)

  trainers_registry.add(key=TRAINER_KEY, item=MyLMTrainer)

  experiments_registry.add(key=EXPERIMENT_KEY, item=MyExperiment)

```

More details you could find

here: [How to implement experiment](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md#how-to-implement-experiment)

# Run

Having implemented and registered the new components, we can make some adjustments to the training bash scripts, as only

in this step might the new components be utilized.

We add three new lines:

```bash

  --collator_key last_part \

  --trainer_key steps \

  --experiment_key check_model

```

### Train

`cli/train.sh`

  Bash script

```bash

#!/bin/bash

python3 xllm_demo/cli/train.py \

  --use_gradient_checkpointing True \

  --deepspeed_stage 0 \

  --stabilize True \

  --model_name_or_path mistralai/Mistral-7B-v0.1 \

  --use_flash_attention_2 False \

  --load_in_4bit True \

  --apply_lora True \

  --raw_lora_target_modules all \

  --per_device_train_batch_size 2 \

  --warmup_steps 1000 \

  --save_total_limit 0 \

  --push_to_hub True \

  --hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \

  --hub_private_repo True \

  --report_to_wandb True \

  --path_to_env_file ./.env \

  --collator_key last_part \

  --trainer_key steps \

  --experiment_key check_model

```

### DeepSpeed (Multiple GPUs)

`cli/train-deepspeed.sh`

  Bash script

```sh

#!/bin/bash

deepspeed --num_gpus=8 xllm_demo/cli/train.py \

  --use_gradient_checkpointing True \

  --deepspeed_stage 2 \

  --stabilize True \

  --model_name_or_path mistralai/Mistral-7B-v0.1 \

  --use_flash_attention_2 False \

  --load_in_4bit True \

  --apply_lora True \

  --raw_lora_target_modules all \

  --per_device_train_batch_size 8 \

  --warmup_steps 1000 \

  --save_total_limit 0 \

  --push_to_hub True \

  --hub_model_id BobaZooba/DemoXLLM7B-v1-LoRA \

  --hub_private_repo True \

  --report_to_wandb True \

  --logging_steps 1 \

  --num_train_epochs 3 \

  --save_steps 1000 \

  --max_steps 10050 \

  --path_to_env_file ./.env \

  --collator_key last_part \

  --trainer_key steps \

  --experiment_key check_model

```

## Run scripts

Afterward, we can run all the scripts just as we did before.

```sh

./scripts/prepare.sh

./scripts/train.sh

./scripts/fuse.sh

./scripts/quantize.sh

```

# Conclusions

`xllm` library has a plethora of customization options. Not all of them are necessary, and most often you will only

utilize the implementation of a new dataset, which is by design.

# 🎉 Done! You are awesome!

## Now you know how to create projects using `xllm`

## Useful materials

- [X—LLM Repo](https://github.com/BobaZooba/xllm): main repo of the `xllm` library

- [Quickstart](https://github.com/KompleteAI/xllm/tree/docs-v1#quickstart-): basics of `xllm`

- [Examples](https://github.com/BobaZooba/xllm/examples): minimal examples of using `xllm`

- [Guide](https://github.com/BobaZooba/xllm/blob/main/GUIDE.md): here, we go into detail about everything the library

  can

  do

- [Demo project](https://github.com/BobaZooba/xllm-demo): here's a minimal step-by-step example of how to use X—LLM and

  fit it

  into your own project

- [WeatherGPT](https://github.com/BobaZooba/wgpt): this repository features an example of how to utilize the xllm

  library. Included is a solution for a common type of assessment given to LLM engineers, who typically earn between

  $120,000 to $140,000 annually

- [Shurale](https://github.com/BobaZooba/shurale): project with the finetuned 7B Mistal model

## Tale Quest

`Tale Quest` is my personal project which was built using `xllm` and `Shurale`. It's an interactive text-based game

in `Telegram` with dynamic AI characters, offering infinite scenarios

You will get into exciting journeys and complete fascinating quests. Chat

with `George Orwell`, `Tech Entrepreneur`, `Young Wizard`, `Noir Detective`, `Femme Fatale` and many more

Try it now: [https://t.me/talequestbot](https://t.me/TaleQuestBot?start=Z2g)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bobazooba/xllm-demo

Awesome Lists containing this project

README