https://github.com/AsadNizami/Dataset-generator-for-LLM-finetuning
- Host: GitHub
- URL: https://github.com/AsadNizami/Dataset-generator-for-LLM-finetuning
- Owner: AsadNizami
- Created: 2024-12-30T10:25:51.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-12-30T13:09:53.000Z (11 months ago)
- Last Synced: 2024-12-30T14:20:19.301Z (11 months ago)
- Language: JavaScript
- Size: 212 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_ai_agents - Dataset-Generator-For-Llm-Finetuning - A web application that generates high-quality question-answer pairs from text documents for LLM finetuning (Building / Datasets)
README
# Synthetic Dataset Generator for LLM Finetuning
This web application creates high-quality question-answer pairs from documents for fine-tuning large language models (LLMs). It uses Ollama to run local LLMs and provides a user-friendly interface for generating datasets. Uploaded documents are stored in a vector database (ChromaDB), and relevant content is retrieved based on user-specified keywords.
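As a rough illustration of the store-and-retrieve flow described above, here is a minimal ChromaDB sketch; the collection name, chunk text, and keyword are placeholders, not identifiers from this project's code:

```python
# Minimal sketch of keyword-based retrieval with ChromaDB.
# All names and sample text are illustrative, not taken from this repo.
import chromadb

client = chromadb.Client()  # in-memory client; a real app may persist to disk
collection = client.get_or_create_collection(name="documents")

# Store a chunk extracted from an uploaded document.
collection.add(
    documents=["LLM finetuning benefits from high-quality Q&A pairs."],
    ids=["doc-0"],
)

# Retrieve the chunk most relevant to a user-specified keyword.
results = collection.query(query_texts=["finetuning"], n_results=1)
print(results["documents"])
```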
## Features
- Generate Q&A pairs with customizable parameters
- Interactive results display
- Export datasets in JSON format
- Customizable instruction prompts
- Multiple model support through Ollama
## Installation
### 1. Clone the Repository
```bash
git clone https://github.com/AsadNizami/Dataset-generator-for-LLM-finetuning.git
cd Dataset-generator-for-LLM-finetuning
```
### 2. Backend Setup
```bash
# Navigate to backend directory
cd backend
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install Python dependencies (chromadb backs the vector store described above)
pip install fastapi uvicorn httpx python-multipart langchain langchain-ollama chromadb
```
### 3. Frontend Setup
```bash
# Navigate to frontend directory
cd frontend
# Install Node dependencies
npm install
```
### 4. Install and Setup Ollama
1. Install Ollama from [ollama.ai](https://ollama.ai)
2. Pull a compatible model:
```bash
ollama pull llama3.2
# or
ollama pull mistral
```
## Starting the Application
### 1. Start Ollama Server
```bash
ollama serve
```
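Before starting the backend, you can confirm the server is up by querying Ollama's model-listing endpoint (`GET /api/tags` on the default port 11434). This check is a convenience, not part of the project; it uses `httpx`, which the backend setup already installs:

```python
# Health check for a local Ollama server (default port 11434).
# /api/tags lists the models you have pulled locally.
import httpx

resp = httpx.get("http://localhost:11434/api/tags")
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)
```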
### 2. Start Backend Server
```bash
# Make sure you're in the backend directory with the virtual environment activated
cd backend
uvicorn main:app --reload --port 8000
```
The backend will be available at `http://localhost:8000`.
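Since the backend is a FastAPI app, with default settings interactive API documentation should also be available at `http://localhost:8000/docs`.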
### 3. Start Frontend Development Server
```bash
# In a new terminal, navigate to frontend directory
cd frontend
npm start
```
The application will open automatically at `http://localhost:3000`.
## Usage
1. Open your browser and go to `http://localhost:3000`
2. Upload a PDF file
3. Configure generation parameters:
- Number of Q&A pairs to generate
- Temperature (0.1-1.0)
- Select LLM model
- Customize instruction prompt if needed
4. Click "Generate Dataset" to start generation
5. Review generated pairs in the interface
6. Download the dataset using the "Save" button
## Development Notes
- Backend runs on FastAPI with async support
- Frontend built with React
- Real-time streaming of generated pairs (see the sketch after this list)
- Automatic retry mechanism for failed generations
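For reference, streaming tokens from a local model via `langchain-ollama` (installed during backend setup) looks roughly like this; the model name, temperature, and prompt are illustrative, and this is a sketch of the technique rather than the project's actual generation code:

```python
# Rough sketch of streaming generation with langchain-ollama.
# Model name, temperature, and prompt are illustrative.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0.7)
for chunk in llm.stream("Write one question and answer about vector databases."):
    print(chunk.content, end="", flush=True)
```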
## Output Format
Generated datasets are saved in JSON format:
```json
{
  "conversations": [
    {
      "from": "human",
      "value": "Generated question?"
    },
    {
      "from": "assistant",
      "value": "Generated answer."
    }
  ],
  "source": "filename.pdf"
}
```
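To sanity-check a saved file, you can load it back with the standard library; the filename below is illustrative (use whatever name you chose when saving):

```python
import json

# Load a saved dataset and print each conversation turn.
# "dataset.json" is an illustrative filename.
with open("dataset.json") as f:
    record = json.load(f)

for turn in record["conversations"]:
    print(f'{turn["from"]}: {turn["value"]}')
print("source:", record["source"])
```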
## Demo
