https://github.com/SpingenceAI/FTDataGen

Generate LLM finetune dataset
https://github.com/SpingenceAI/FTDataGen

Last synced: 3 months ago
JSON representation

Generate LLM finetune dataset

Host: GitHub
URL: https://github.com/SpingenceAI/FTDataGen
Owner: SpingenceAI
License: apache-2.0
Created: 2024-12-05T08:13:16.000Z (4 months ago)
Default Branch: main
Last Pushed: 2024-12-05T08:36:24.000Z (4 months ago)
Last Synced: 2024-12-05T09:29:49.479Z (4 months ago)
Language: Python
Size: 16.6 KB
Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome_ai_agents - Ftdatagen - Generate LLM finetune dataset (Building / Datasets)
awesome_ai_agents - Ftdatagen - Generate LLM finetune dataset (Building / Datasets)

README

# FTDataGen

Use LLM to generate training data for fine-tuning LLM

## Steps:
1. Parse input file (pdf, docx, txt, etc.)
2. Generate questions and answers by LLM
3. Save data to jsonl file

### output data format: jsonl
```txt
{"instruction": "instruction", "output": "output"}
{"instruction": "instruction", "output": "output"}
```

## Get Started
### 1. Build docker image
```bash
# for CPU
docker build -t ft-data-gen:cpu .
# for GPU
docker build -t ft-data-gen:gpu .
```
### 2. Run docker container
```bash
# for CPU
docker run -it --rm -v ${PWD}:/workspace ft-data-gen:cpu bash
# for GPU
docker run -it --rm -v ${PWD}:/workspace --gpus all ft-data-gen:gpu
```

### 3. Setup environment variables
```bash
cp .env.example .env
```
Modify `.env` file with your own LLM model and API key
Here we use litellm to support multiple LLM models, you can refer to [litellm](https://docs.litellm.ai/docs/providers) for more details.
##### Ollama example:
```bash
LLM_MODEL=ollama/llama3.1:70b
LLM_BASE_URL=http://localhost:11434
```
##### OpenAI example:
```bash
LLM_MODEL=openai/gpt-4o
LLM_API_KEY=sk-proj-.....
```

### 4. Generate data
Arguments:
- `--input_file`: input file path
- `--qa_num`: number of questions
- `--output_folder`: output folder

```bash
python generate_data.py --input_file data/test.txt --qa_num 2 --output_folder output
```

### 5. Find the output data in `output` folder, `output/training_data.jsonl` is the final training data

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/SpingenceAI/FTDataGen

Awesome Lists containing this project

README