Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/SpingenceAI/FTDataGen
Generate LLM finetune dataset
- Host: GitHub
- URL: https://github.com/SpingenceAI/FTDataGen
- Owner: SpingenceAI
- License: apache-2.0
- Created: 2024-12-05T08:13:16.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2024-12-05T08:36:24.000Z (about 1 month ago)
- Last Synced: 2024-12-05T09:29:49.479Z (about 1 month ago)
- Language: Python
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Ftdatagen - Generate LLM finetune dataset (Building / Datasets)
README
# FTDataGen
Use an LLM to generate training data for fine-tuning LLMs.
## Steps:
1. Parse input file (pdf, docx, txt, etc.)
2. Generate questions and answers with an LLM
3. Save the data to a JSONL file
### Output data format: JSONL
```txt
{"instruction": "instruction", "output": "output"}
{"instruction": "instruction", "output": "output"}
```
## Get Started
### 1. Build docker image
```bash
# for CPU
docker build -t ft-data-gen:cpu .
# for GPU
docker build -t ft-data-gen:gpu .
```
### 2. Run docker container
```bash
# for CPU
docker run -it --rm -v ${PWD}:/workspace ft-data-gen:cpu bash
# for GPU
docker run -it --rm -v ${PWD}:/workspace --gpus all ft-data-gen:gpu
```
### 3. Set up environment variables
```bash
cp .env.example .env
```
Edit the `.env` file to set your own LLM model and API key.
This project uses litellm to support multiple LLM providers; see the [litellm provider documentation](https://docs.litellm.ai/docs/providers) for details.
##### Ollama example:
```bash
LLM_MODEL=ollama/llama3.1:70b
LLM_BASE_URL=http://localhost:11434
```
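As a rough illustration of how `LLM_*` variables like these could map onto a litellm call, the sketch below collects them into keyword arguments. The README does not show how FTDataGen itself reads the `.env` file, so the helper name `build_llm_kwargs` and the mapping of `LLM_BASE_URL` to litellm's `api_base` argument are assumptions made for this example.

```python
import os


def build_llm_kwargs(env=None):
    """Collect litellm-style keyword arguments from LLM_* variables.

    Hypothetical helper: the variable-to-argument mapping is an assumption,
    not taken from the FTDataGen source.
    """
    env = os.environ if env is None else env
    kwargs = {"model": env["LLM_MODEL"]}  # required, e.g. "ollama/llama3.1:70b"
    if env.get("LLM_BASE_URL"):
        kwargs["api_base"] = env["LLM_BASE_URL"]  # set for self-hosted servers
    if env.get("LLM_API_KEY"):
        kwargs["api_key"] = env["LLM_API_KEY"]  # set for hosted providers
    return kwargs


# Values copied from the Ollama example; no network call is made here.
ollama_kwargs = build_llm_kwargs(
    {"LLM_MODEL": "ollama/llama3.1:70b", "LLM_BASE_URL": "http://localhost:11434"}
)
print(ollama_kwargs)
```

With litellm installed, the resulting dictionary could be passed on as `litellm.completion(messages=[...], **ollama_kwargs)`.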
##### OpenAI example:
```bash
LLM_MODEL=openai/gpt-4o
LLM_API_KEY=sk-proj-.....
```
### 4. Generate data
Arguments:
- `--input_file`: input file path
- `--qa_num`: number of questions
- `--output_folder`: output folder
```bash
python generate_data.py --input_file data/test.txt --qa_num 2 --output_folder output
```
### 5. Find the output data
The output is written to the `output` folder; `output/training_data.jsonl` is the final training data.
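Before using the file for fine-tuning, it can help to check that every line parses and carries the two documented fields. The following is a minimal sketch, assuming each line is a JSON object with string `instruction` and `output` values as in the format shown earlier; the function names `parse_jsonl` and `load_training_data` are made up for this example and are not part of FTDataGen.

```python
import json

REQUIRED_KEYS = ("instruction", "output")


def parse_jsonl(lines):
    """Parse JSONL lines into dicts and check the expected string fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        for key in REQUIRED_KEYS:
            if not isinstance(record.get(key), str):
                raise ValueError(f"line {line_no}: missing or non-string field {key!r}")
        records.append(record)
    return records


def load_training_data(path):
    """Read and validate a JSONL file such as output/training_data.jsonl."""
    with open(path, encoding="utf-8") as f:
        return parse_jsonl(f)


# In-memory sample in the documented format, so the sketch runs without a file.
sample = [
    '{"instruction": "instruction", "output": "output"}',
    '{"instruction": "instruction", "output": "output"}',
]
records = parse_jsonl(sample)
print(len(records), "records parsed")
```

For a real run, `load_training_data("output/training_data.jsonl")` would apply the same checks to the generated file.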