Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/evilfreelancer/toxicator-ru
A playful project that cleverly transforms everyday sentences into their mischievous "toxic" counterparts. 😈
- Host: GitHub
- URL: https://github.com/evilfreelancer/toxicator-ru
- Owner: EvilFreelancer
- Created: 2024-04-21T18:12:06.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-04-26T17:00:41.000Z (7 months ago)
- Last Synced: 2024-09-23T18:32:03.071Z (about 2 months ago)
- Topics: gpt, instruct, llama2, torch, torchtune
- Language: Jupyter Notebook
- Homepage: https://huggingface.co/evilfreelancer/llama2-7b-toxicator-ru
- Size: 12.7 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.full.md
README
# Toxicator RU - Model Training with TorchTune (full guide)
This project provides detailed instructions for setting up, training, and using a model designed to transform
neutral sentences in Russian into their "toxic" counterparts. It uses the [TorchTune](https://github.com/pytorch/torchtune) tool along with
the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model and a custom dataset
converted from the [RUSSE detox 2022 competition](https://github.com/s-nlp/russe_detox_2022) and published on the HuggingFace
platform.

## Prerequisites
Before you begin, ensure you have Python 3.11 and the Python `venv` module installed on your system. It's recommended
to run this project on a machine with a CUDA-capable GPU, due to the computational demands of training the model.

## Setting Up the Virtual Environment
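Before or after creating the environment, a quick check like the one below may help confirm the prerequisites. This is a minimal sketch, not part of the project: the helper name `check_environment` is hypothetical, and treating "Python 3.11" as a minimum version (rather than an exact one) is an assumption. If `torch` is not installed yet, the CUDA check is simply skipped.

```python
import sys


def check_environment(min_python=(3, 11)):
    """Report whether the documented prerequisites appear to be met.

    `min_python` is an assumption: the guide asks for Python 3.11, and this
    sketch accepts that version or anything newer.
    """
    report = {
        "python_version": tuple(sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        import torch  # present only after `pip install -r requirements.txt`
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


# Print a short, human-readable summary.
for key, value in check_environment().items():
    print(f"{key}: {value}")
```

Running it a second time inside the activated virtual environment shows whether the installed PyTorch build can see the GPU.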
Create and activate a virtual environment to manage dependencies:
```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Accepting Model Use Agreement
Before using the LLaMA 2 7B model, you must agree to its terms and conditions on HuggingFace. Review and
accept the agreement at the following link: https://huggingface.co/meta-llama/Llama-2-7b-hf
## Generating the "Toxicator RU" Dataset
A Jupyter notebook with a detailed example is available [here](./dataset_build.ipynb).
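
For readers who want the general idea without opening the notebook, here is a rough sketch of how detox pairs could be turned into instruct-style rows. It is not the notebook's code: the `neutral` and `toxic` keys, the instruction wording, and the `build_records` helper are all hypothetical, chosen only to illustrate reversing a (toxic, neutral) pair into the `instruction` / `input` / `output` layout that Alpaca-style templates expect.

```python
import json

# Hypothetical instruction text; the published dataset may word it differently.
INSTRUCTION = "Rewrite the neutral sentence as a toxic one, keeping the original meaning."


def build_records(pairs, instruction=INSTRUCTION):
    """Convert detox pairs into instruct rows with the direction reversed.

    Each item in `pairs` is assumed to be a dict with `neutral` and `toxic`
    keys (hypothetical names). Rows with an empty side are skipped.
    """
    records = []
    for pair in pairs:
        neutral = (pair.get("neutral") or "").strip()
        toxic = (pair.get("toxic") or "").strip()
        if not neutral or not toxic:
            continue
        records.append({"instruction": instruction, "input": neutral, "output": toxic})
    return records


# Placeholder data standing in for real competition rows.
sample_pairs = [
    {"toxic": "<toxic sentence A>", "neutral": "<neutral sentence A>"},
    {"toxic": "<toxic sentence B>", "neutral": ""},  # dropped: no neutral side
]

# One JSON object per line, a common layout for uploading datasets.
for row in build_records(sample_pairs):
    print(json.dumps(row, ensure_ascii=False))
```

Keeping the neutral sentence in `input` and the toxic one in `output` is what makes the model learn the neutral-to-toxic direction, the reverse of the original detox task.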
## Downloading the Model
Download the LLaMA 2 7B model locally:
```shell
tune download meta-llama/Llama-2-7b-hf --output-dir ./Llama-2-7b-hf
```

## Configuration Setup
Copy the training configuration file suited for low-memory setups:
```shell
tune cp llama2/7B_full_low_memory ./toxicator.train.yaml
```

Modify the configuration file to change directory paths from `/tmp` to the current directory:
```shell
sed -r 's#/tmp/#./#g' -i ./toxicator.train.yaml
```

The model is trained on a dataset hosted on HuggingFace, which has been prepared from the `russe_detox_2022` project.
Here's how to set it in the configuration:

```yaml
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: evilfreelancer/toxicator-ru
  template: AlpacaInstructTemplate
  split: train
  train_on_input: True
seed: null
shuffle: True
```

See the [evilfreelancer/toxicator-ru](https://huggingface.co/datasets/evilfreelancer/toxicator-ru) dataset card on
HuggingFace.

## Training the Model
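For orientation before launching a run: with `template: AlpacaInstructTemplate`, each dataset row is rendered into an Alpaca-style prompt, and `train_on_input: True` means the prompt tokens also contribute to the loss instead of being masked out. The sketch below re-creates the widely used Alpaca layout in plain Python. It does not call TorchTune, the `render_alpaca_prompt` helper is hypothetical, and the exact wording in the installed TorchTune version may differ.

```python
# Alpaca-style prompt layouts, written out by hand for illustration.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def render_alpaca_prompt(sample):
    """Render one row (dict with `instruction` and optional `input`) as a prompt."""
    if sample.get("input"):
        return ALPACA_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return ALPACA_NO_INPUT.format(instruction=sample["instruction"])


# Placeholder row; real rows come from the dataset configured above.
print(render_alpaca_prompt({
    "instruction": "<instruction text>",
    "input": "<neutral sentence>",
}))
```

The model's expected completion (the `output` column) follows directly after the final `### Response:` line.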
If you wish to use Weights & Biases for tracking experiments, log in using the following command:
```shell
wandb login
```

Launch the training process with the configured settings:
```shell
tune run full_finetune_single_device --config toxicator.train.yaml
```

## Inference Setup
Copy the inference configuration file:
```shell
tune cp generation ./toxicator.gen.yaml
```

Modify the configuration file to change directory paths from `/tmp` to the current directory:
```shell
sed -r 's#/tmp/#./#g' -i ./toxicator.gen.yaml
```

Next, update the `checkpointer` section:
```yaml
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: ./Llama-2-7b-hf/
  checkpoint_files: [
    hf_model_0001_2.pt,
    hf_model_0002_2.pt,
  ]
  output_dir: ./Llama-2-7b-hf/
  model_type: LLAMA2
```

As you can see, the `checkpoint_files` subsection was changed from the defaults.
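
A mismatch between `checkpoint_files` and what is actually on disk is an easy mistake to make, so a small pre-flight check may be useful before running generation. The sketch below is an illustration only: `missing_checkpoints` is a hypothetical helper, and the directory and file names are copied from the configuration in this guide rather than read from the YAML file.

```python
from pathlib import Path


def missing_checkpoints(checkpoint_dir, checkpoint_files):
    """Return the names from `checkpoint_files` not found in `checkpoint_dir`."""
    base = Path(checkpoint_dir)
    return [name for name in checkpoint_files if not (base / name).is_file()]


# Values taken from the `checkpointer` section of this guide.
missing = missing_checkpoints(
    "./Llama-2-7b-hf/",
    ["hf_model_0001_2.pt", "hf_model_0002_2.pt"],
)
if missing:
    print("Missing checkpoint files:", ", ".join(missing))
else:
    print("All checkpoint files are present.")
```

If files are reported missing, listing the directory shows which checkpoint names the training run actually wrote.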
## Links
* https://huggingface.co/evilfreelancer/llama2-7b-toxicator-ru - LLaMA 2 7B - Toxicator RU
* https://huggingface.co/datasets/evilfreelancer/toxicator-ru - dataset
* https://api.wandb.ai/links/evilfreelancer/33t8pqze - wandb report about training