https://github.com/ConardLi/easy-dataset
A powerful tool for creating fine-tuning datasets for LLM
dataset javascript llm
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/ConardLi/easy-dataset
- Owner: ConardLi
- Created: 2025-03-04T16:14:14.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-20T15:18:40.000Z (about 1 year ago)
- Last Synced: 2025-03-20T16:22:02.200Z (about 1 year ago)
- Topics: dataset, javascript, llm
- Language: JavaScript
- Homepage:
- Size: 14.4 MB
- Stars: 2,748
- Watchers: 20
- Forks: 275
- Open Issues: 52
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-hacking-lists - ConardLi/easy-dataset - A powerful tool for creating fine-tuning datasets for LLM (JavaScript)
- awesome-side-quests - ConardLi/easy-dataset - tuning and RAG datasets from documents; streamlines the data prep pipeline (AI & LLM / RAG & Vector Search)
- awesome-github-projects - easy-dataset - A powerful tool for creating datasets for LLM fine-tuning, RAG and Eval ⭐14,255 `JavaScript` 🔥 (🤖 AI & Machine Learning)
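- StarryDivineSky - ConardLi/easy-dataset - dataset is a powerful tool for creating LLM fine-tuning datasets. It aims to simplify and accelerate dataset construction, especially for large language models. Project features include ease of use, flexibility, and efficiency. It lets users extract and transform data from various sources (such as text files and web pages) through simple configuration and scripts. At its core, easy-dataset provides a set of extensible, modular tools for data cleaning, labeling, and formatting, ultimately producing standard datasets that meet LLM training requirements. It supports custom data-processing pipelines and provides a variety of predefined transformers and filters. With easy-dataset, developers can focus on model training itself rather than spending large amounts of time on tedious data preparation. The project aims to lower the barrier to LLM fine-tuning so that more people can easily build high-quality training datasets. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)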
- awesome-gpt - https://github.com/ConardLi/easy-dataset
- awesome-LLM-resources - Easy Dataset (`🔥`) - tuning datasets for LLM. (Data)
README


**A powerful tool for creating fine-tuning datasets for Large Language Models**
[简体中文](./README.zh-CN.md) | [English](./README.md)
[Features](#features) • [Getting Started](#getting-started) • [Usage](#usage) • [Documentation](https://rncg5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb?302from=wiki) • [Contributing](#contributing) • [License](#license)
If you like this project, please leave a Star ⭐️ for it. Or you can buy the author a cup of coffee => [Support the author](./public/imgs/aw.jpg) ❤️!
## Overview
Easy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform your domain knowledge into structured datasets that work with any LLM API following the OpenAI format, making the fine-tuning process accessible and efficient.

## Features
- **Intelligent Document Processing**: Upload Markdown files and automatically split them into meaningful segments
- **Smart Question Generation**: Extract relevant questions from each text segment
- **Answer Generation**: Generate comprehensive answers for each question using LLM APIs
- **Flexible Editing**: Edit questions, answers, and datasets at any stage of the process
- **Multiple Export Formats**: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
- **Wide Model Support**: Compatible with all LLM APIs that follow the OpenAI format
- **User-Friendly Interface**: Intuitive UI designed for both technical and non-technical users
- **Customizable System Prompts**: Add custom system prompts to guide model responses
## Getting Started
### Download Client
| Platform | Download |
| --- | --- |
| Windows | Setup.exe |
| MacOS | Intel, M |
| Linux | AppImage |
### Using npm
Prerequisites:
- Node.js 18.x or higher
- pnpm (recommended) or npm
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Install dependencies:
```bash
npm install
```
3. Build the application and start the server:
```bash
npm run build
npm run start
```
### Build with Local Dockerfile
If you want to build the image yourself, you can use the Dockerfile in the project root directory:
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Build the Docker image:
```bash
docker build -t easy-dataset .
```
3. Run the container:
```bash
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
```
**Note:** Replace `{YOUR_LOCAL_DB_PATH}` with the actual path where you want to store the local database.
4. Open your browser and navigate to `http://localhost:1717`
## Usage
### Creating a Project
1. Click the "Create Project" button on the home page
2. Enter a project name and description
3. Configure your preferred LLM API settings
### Processing Documents
1. Upload your Markdown files in the "Text Split" section
2. Review the automatically split text segments
3. Adjust the segmentation if needed
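To make the splitting step more concrete, here is a minimal sketch of heading-based Markdown segmentation. It is not Easy Dataset's actual splitter: the function name `splitMarkdownByHeadings`, the `maxLevel` parameter, and the segment shape are all assumptions made for illustration.

```javascript
// Minimal sketch: split Markdown into segments at heading boundaries.
// Hypothetical helper, not the project's real text-splitter implementation.
const FENCE = '`'.repeat(3); // code-fence marker, built programmatically

function splitMarkdownByHeadings(markdown, maxLevel = 2) {
  const headingPattern = new RegExp(`^#{1,${maxLevel}}\\s+(.*)$`);
  const segments = [];
  let current = { title: '(preamble)', lines: [] };
  let inFence = false;

  const hasContent = (segment) => segment.lines.some((l) => l.trim() !== '');

  for (const line of markdown.split('\n')) {
    // Track fenced code blocks so '#' comments inside code are not read as headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;

    const match = inFence ? null : headingPattern.exec(line);
    if (match) {
      if (hasContent(current)) segments.push(current);
      current = { title: match[1].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  if (hasContent(current)) segments.push(current);

  return segments.map(({ title, lines }) => ({
    title,
    content: lines.join('\n').trim(),
  }));
}

const sample = [
  '# Install',
  'Run the installer.',
  '## Configure',
  'Set the API key.',
  FENCE + 'bash',
  '# not a heading',
  FENCE,
].join('\n');

const segments = splitMarkdownByHeadings(sample);
console.log(segments.map((s) => s.title));
```

Splitting on headings keeps each segment topically coherent, which tends to help when questions are later generated per segment. A real splitter would likely also enforce minimum and maximum segment lengths.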
### Generating Questions
1. Navigate to the "Questions" section
2. Select text segments to generate questions from
3. Review and edit the generated questions
4. Organize questions using the tag tree
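One way to picture a tag tree is as a nested structure keyed by tag labels. The sketch below assumes each question carries a slash-separated `tagPath` string; that field, and the helper names `buildTagTree` and `countQuestions`, are hypothetical and not taken from the project's data model.

```javascript
// Sketch: group questions into a nested tag tree.
// Assumed input shape (hypothetical): { text: string, tagPath: 'Parent/Child' }
function buildTagTree(questions) {
  const root = { label: 'root', children: new Map(), questions: [] };
  for (const q of questions) {
    let node = root;
    // Walk (and create, if needed) one tree level per tag in the path.
    for (const label of q.tagPath.split('/').filter(Boolean)) {
      if (!node.children.has(label)) {
        node.children.set(label, { label, children: new Map(), questions: [] });
      }
      node = node.children.get(label);
    }
    node.questions.push(q.text);
  }
  return root;
}

// Count the questions stored at a node and in all of its descendants.
function countQuestions(node) {
  let total = node.questions.length;
  for (const child of node.children.values()) total += countQuestions(child);
  return total;
}

const tree = buildTagTree([
  { text: 'Which port does the container expose?', tagPath: 'Setup/Docker' },
  { text: 'Which Node.js version is required?', tagPath: 'Setup/npm' },
  { text: 'Which export formats are available?', tagPath: 'Export' },
]);

console.log(countQuestions(tree));
console.log([...tree.children.keys()]);
```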
### Creating Datasets
1. Go to the "Datasets" section
2. Select questions to include in your dataset
3. Generate answers using your configured LLM
4. Review and edit the generated answers
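For readers unfamiliar with what "OpenAI format" means in practice, the following sketch builds a chat completion request for one question and reads the answer back from the response. The prompt wording, the `temperature` value, the model name, and the example base URL (a local Ollama OpenAI-compatible endpoint) are illustrative assumptions, not the prompts or settings Easy Dataset ships with.

```javascript
// Sketch: build an OpenAI-format chat completion request for answer generation.
// The prompt text and parameter values below are placeholders, not the project's own.
function buildAnswerRequest({ baseUrl, apiKey, model, question, context, systemPrompt }) {
  const messages = [];
  if (systemPrompt) messages.push({ role: 'system', content: systemPrompt });
  messages.push({
    role: 'user',
    content:
      'Answer the question using only the reference text.\n\n' +
      `Reference:\n${context}\n\nQuestion: ${question}`,
  });

  return {
    // Strip trailing slashes so the path is joined cleanly.
    url: `${baseUrl.replace(/\/+$/, '')}/chat/completions`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages, temperature: 0.7 }),
    },
  };
}

// Pull the answer text out of an OpenAI-format response body.
function extractAnswer(responseJson) {
  return responseJson?.choices?.[0]?.message?.content ?? '';
}

const request = buildAnswerRequest({
  baseUrl: 'http://localhost:11434/v1/', // assumed example endpoint
  apiKey: 'placeholder-key',
  model: 'example-model', // hypothetical model name
  question: 'Which port does the app listen on?',
  context: 'Open your browser and navigate to http://localhost:1717',
  systemPrompt: 'You are a concise technical assistant.',
});

console.log(request.url);
// To send it: const res = await fetch(request.url, request.options);
//             const answer = extractAnswer(await res.json());
```

Because only the base URL, key, and model name change between providers, the same request shape works against any service that implements this endpoint.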
### Exporting Datasets
1. Click the "Export" button in the Datasets section
2. Select your preferred format (Alpaca or ShareGPT)
3. Choose file format (JSON or JSONL)
4. Add custom system prompts if needed
5. Export your dataset
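The export options above can be illustrated with a short sketch that converts question/answer pairs into Alpaca-style and ShareGPT-style records and serializes them as JSON or JSONL. The field names follow the commonly used conventions for these formats; the exact fields Easy Dataset writes may differ, and the helper names are hypothetical.

```javascript
// Sketch: convert Q&A pairs to Alpaca-style or ShareGPT-style records.
// Field names follow common conventions and are assumptions about the real export.
function toAlpaca({ question, answer }, systemPrompt = '') {
  const record = { instruction: question, input: '', output: answer };
  if (systemPrompt) record.system = systemPrompt;
  return record;
}

function toShareGPT({ question, answer }, systemPrompt = '') {
  const conversations = [];
  if (systemPrompt) conversations.push({ from: 'system', value: systemPrompt });
  conversations.push({ from: 'human', value: question }, { from: 'gpt', value: answer });
  return { conversations };
}

// JSON: one array holding every record. JSONL: one JSON object per line.
function serialize(records, fileType) {
  if (fileType === 'jsonl') return records.map((r) => JSON.stringify(r)).join('\n');
  return JSON.stringify(records, null, 2);
}

const pairs = [
  { question: 'What port does the app use?', answer: '1717' },
  { question: 'Which Node.js version is required?', answer: '18.x or higher' },
];

// Wrap the converters in arrow functions so map's index argument
// is not mistaken for the systemPrompt parameter.
const alpacaJsonl = serialize(pairs.map((p) => toAlpaca(p)), 'jsonl');
const shareGptJson = serialize(pairs.map((p) => toShareGPT(p, 'Be brief.')), 'json');

console.log(alpacaJsonl);
console.log(shareGptJson);
```

JSONL is usually the more convenient choice for large datasets because each line can be read and validated independently.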
## Project Structure
```
easy-dataset/
├── app/                                # Next.js application directory
│   ├── api/                            # API routes
│   │   ├── llm/                        # LLM API integration
│   │   │   ├── ollama/                 # Ollama API integration
│   │   │   └── openai/                 # OpenAI API integration
│   │   ├── projects/                   # Project management APIs
│   │   │   ├── [projectId]/            # Project-specific operations
│   │   │   │   ├── chunks/             # Text chunk operations
│   │   │   │   ├── datasets/           # Dataset generation and management
│   │   │   │   │   └── optimize/       # Dataset optimization API
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/          # Question management
│   │   │   │   └── split/              # Text splitting operations
│   │   │   └── user/                   # User-specific project operations
│   ├── projects/                       # Front-end project pages
│   │   └── [projectId]/                # Project-specific pages
│   │       ├── datasets/               # Dataset management UI
│   │       ├── questions/              # Question management UI
│   │       ├── settings/               # Project settings UI
│   │       └── text-split/             # Text processing UI
│   └── page.js                         # Home page
├── components/                         # React components
│   ├── datasets/                       # Dataset-related components
│   ├── home/                           # Home page components
│   ├── projects/                       # Project management components
│   ├── questions/                      # Question management components
│   └── text-split/                     # Text processing components
├── lib/                                # Core libraries and utilities
│   ├── db/                             # Database operations
│   ├── i18n/                           # Internationalization
│   ├── llm/                            # LLM integration
│   │   ├── common/                     # Common LLM utilities
│   │   ├── core/                       # Core LLM client
│   │   └── prompts/                    # Prompt templates
│   │       ├── answer.js               # Answer generation prompts (Chinese)
│   │       ├── answerEn.js             # Answer generation prompts (English)
│   │       ├── question.js             # Question generation prompts (Chinese)
│   │       ├── questionEn.js           # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                  # Text splitting utilities
├── locales/                            # Internationalization resources
│   ├── en/                             # English translations
│   └── zh-CN/                          # Chinese translations
├── public/                             # Static assets
│   └── imgs/                           # Image resources
└── local-db/                           # Local file-based database
    └── projects/                       # Project data storage
```
## Documentation
For detailed documentation on all features and APIs, please visit our [Documentation Site](https://rncg5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb?302from=wiki).
## Contributing
We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
1. Fork the repository
2. Create a new branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request
Please make sure to update tests as appropriate and adhere to the existing coding style.
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## Star History
[Star History Chart](https://www.star-history.com/#ConardLi/easy-dataset&Date)