https://github.com/ConardLi/easy-dataset

A powerful tool for creating fine-tuning datasets for LLM
dataset javascript llm

Last synced: 8 months ago

Host: GitHub
URL: https://github.com/ConardLi/easy-dataset
Owner: ConardLi
Created: 2025-03-04T16:14:14.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-03-20T15:18:40.000Z (8 months ago)
Last Synced: 2025-03-20T16:22:02.200Z (8 months ago)
Topics: dataset, javascript, llm
Language: JavaScript
Homepage:
Size: 14.4 MB
Stars: 2,748
Watchers: 20
Forks: 275
Open Issues: 52
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

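StarryDivineSky - ConardLi/easy-dataset - easy-dataset is a powerful tool for creating LLM fine-tuning datasets. It aims to simplify and speed up dataset construction, especially for large language models. Its highlights are ease of use, flexibility, and efficiency. It lets users extract and transform data from various sources (such as text files and web pages) through simple configuration and scripts. At its core, easy-dataset provides a set of extensible, modular tools for data cleaning, annotation, and formatting, ultimately producing standard datasets that meet LLM training requirements. It supports custom data-processing pipelines and provides a variety of predefined transformers and filters. With easy-dataset, developers can focus on model training itself instead of spending large amounts of time on tedious data preparation. The project aims to lower the barrier to LLM fine-tuning so that more people can easily build high-quality training datasets. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)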
awesome-gpt - https://github.com/ConardLi/easy-dataset
awesome-LLM-resources - Easy Dataset (`🔥`) - tuning datasets for LLM. (Data)
awesome-hacking-lists - ConardLi/easy-dataset - A powerful tool for creating fine-tuning datasets for LLM (JavaScript)

README

![](./public/imgs/bg2.png)

**A powerful tool for creating fine-tuning datasets for Large Language Models**

[简体中文](./README.zh-CN.md) | [English](./README.md)

[Features](#features) • [Getting Started](#getting-started) • [Usage](#usage) • [Documentation](https://rncg5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb?302from=wiki) • [Contributing](#contributing) • [License](#license)

If you like this project, please give it a Star ⭐️, or buy the author a cup of coffee => [Support the author](./public/imgs/aw.jpg) ❤️!

## Overview

Easy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform your domain knowledge into structured datasets that work with any LLM API following the OpenAI format, making the fine-tuning process accessible and efficient.

![](./public/imgs/en-arc.png)

## Features

- **Intelligent Document Processing**: Upload Markdown files and automatically split them into meaningful segments
- **Smart Question Generation**: Extract relevant questions from each text segment
- **Answer Generation**: Generate comprehensive answers for each question using LLM APIs
- **Flexible Editing**: Edit questions, answers, and datasets at any stage of the process
- **Multiple Export Formats**: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
- **Wide Model Support**: Compatible with all LLM APIs that follow the OpenAI format
- **User-Friendly Interface**: Intuitive UI designed for both technical and non-technical users
- **Customizable System Prompts**: Add custom system prompts to guide model responses

## Getting Started

### Download Client

| Windows | MacOS | Linux |
| --- | --- | --- |
| Setup.exe | Intel, M | AppImage |

### Using npm

Prerequisites:

- Node.js 18.x or higher
- pnpm (recommended) or npm

1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```

2. Install dependencies:
```bash
npm install
```

3. Build the application and start the server:
```bash
npm run build

npm run start
```

### Build with Local Dockerfile

If you want to build the image yourself, you can use the Dockerfile in the project root directory:

1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Build the Docker image:
```bash
docker build -t easy-dataset .
```
3. Run the container:
```bash
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
```
**Note:** Replace `{YOUR_LOCAL_DB_PATH}` with the actual path where you want to store the local database.

4. Open your browser and navigate to `http://localhost:1717`

## Usage

### Creating a Project

1. Click the "Create Project" button on the home page
2. Enter a project name and description
3. Configure your preferred LLM API settings

### Processing Documents

1. Upload your Markdown files in the "Text Split" section
2. Review the automatically split text segments
3. Adjust the segmentation if needed
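
For intuition about what splitting a Markdown file into segments can involve, the sketch below cuts a document at its heading lines. It is an illustration only and is not the algorithm Easy Dataset uses; the function name `splitMarkdownByHeadings` and the `{ heading, content }` segment shape are assumptions.

```javascript
// Split Markdown into segments at heading lines, ignoring '#' inside fenced code.
function splitMarkdownByHeadings(markdown) {
  const segments = [];
  let current = { heading: null, lines: [] };
  let inFence = false;

  // A segment is worth keeping if it has a heading or any non-blank line.
  const hasContent = s => s.heading !== null || s.lines.some(l => l.trim() !== '');

  for (const line of markdown.split('\n')) {
    // Track fenced code blocks so '# comment' lines in code are not treated as headings.
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;

    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      // A new heading closes the segment collected so far.
      if (hasContent(current)) segments.push(current);
      current = { heading: match[2].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  if (hasContent(current)) segments.push(current);

  return segments.map(s => ({ heading: s.heading, content: s.lines.join('\n').trim() }));
}
```

A production splitter would likely also enforce minimum and maximum segment lengths, which is one reason reviewing and adjusting the segments by hand remains useful.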

### Generating Questions

1. Navigate to the "Questions" section
2. Select text segments to generate questions from
3. Review and edit the generated questions
4. Organize questions using the tag tree
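
Question generation for a selected segment can be pictured as two small steps: build a prompt around the segment, then parse the model's reply into a list. The sketch below assumes the model is asked for a JSON array of strings; the prompt wording and both function names are hypothetical, and the project's actual templates live under `lib/llm/prompts/`.

```javascript
// Hypothetical prompt builder (not the project's real template).
function buildQuestionPrompt(segment, count = 3) {
  return [
    `Read the text below and write ${count} questions that it can answer.`,
    'Reply with a JSON array of strings and nothing else.',
    '',
    segment
  ].join('\n');
}

// Extract a list of questions from an LLM reply, tolerating code fences or
// extra prose around the JSON array. Returns [] when nothing usable is found.
function parseQuestions(reply) {
  const start = reply.indexOf('[');
  const end = reply.lastIndexOf(']');
  if (start === -1 || end <= start) return [];
  try {
    const parsed = JSON.parse(reply.slice(start, end + 1));
    if (!Array.isArray(parsed)) return [];
    return parsed
      .filter(q => typeof q === 'string' && q.trim() !== '')
      .map(q => q.trim());
  } catch {
    return []; // malformed JSON: leave it to manual review
  }
}
```

Returning an empty list on malformed output keeps a bad reply from breaking a batch run, and the review step above is where such gaps would be caught.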

### Creating Datasets

1. Go to the "Datasets" section
2. Select questions to include in your dataset
3. Generate answers using your configured LLM
4. Review and edit the generated answers

### Exporting Datasets

1. Click the "Export" button in the Datasets section
2. Select your preferred format (Alpaca or ShareGPT)
3. Choose file format (JSON or JSONL)
4. Add custom system prompts if needed
5. Export your dataset
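
The export options can be illustrated with a short sketch that converts question/answer pairs into the two formats and the two file types. The field names follow the commonly used Alpaca (`instruction`, `input`, `output`) and ShareGPT (`conversations` with `from`/`value`) conventions; the input record shape, the placement of the optional `system` field, and the function names are assumptions, so the exact output of Easy Dataset may differ.

```javascript
// Alpaca-style record: instruction / input / output, plus an optional system prompt.
function toAlpaca({ question, answer }, systemPrompt = '') {
  const item = { instruction: question, input: '', output: answer };
  if (systemPrompt) item.system = systemPrompt;
  return item;
}

// ShareGPT-style record: a list of conversation turns, plus an optional system prompt.
function toShareGPT({ question, answer }, systemPrompt = '') {
  const item = {
    conversations: [
      { from: 'human', value: question },
      { from: 'gpt', value: answer }
    ]
  };
  if (systemPrompt) item.system = systemPrompt;
  return item;
}

// JSON: one array holding every record. JSONL: one JSON object per line.
function serialize(items, fileType) {
  return fileType === 'jsonl'
    ? items.map(item => JSON.stringify(item)).join('\n')
    : JSON.stringify(items, null, 2);
}

// Example usage with a hypothetical record.
const records = [{ question: 'What is fine-tuning?', answer: 'Further training of a pretrained model on task-specific data.' }];
const alpacaJsonl = serialize(records.map(r => toAlpaca(r, 'You are a helpful assistant.')), 'jsonl');
```

JSONL is convenient for large datasets because each line can be read and validated independently, while a single JSON array is easier to inspect by hand.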

## Project Structure

```
easy-dataset/
├── app/ # Next.js application directory
│   ├── api/ # API routes
│   │   ├── llm/ # LLM API integration
│   │   │   ├── ollama/ # Ollama API integration
│   │   │   └── openai/ # OpenAI API integration
│   │   ├── projects/ # Project management APIs
│   │   │   ├── [projectId]/ # Project-specific operations
│   │   │   │   ├── chunks/ # Text chunk operations
│   │   │   │   ├── datasets/ # Dataset generation and management
│   │   │   │   │   └── optimize/ # Dataset optimization API
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/ # Question management
│   │   │   │   └── split/ # Text splitting operations
│   │   │   └── user/ # User-specific project operations
│   ├── projects/ # Front-end project pages
│   │   └── [projectId]/ # Project-specific pages
│   │       ├── datasets/ # Dataset management UI
│   │       ├── questions/ # Question management UI
│   │       ├── settings/ # Project settings UI
│   │       └── text-split/ # Text processing UI
│   └── page.js # Home page
├── components/ # React components
│   ├── datasets/ # Dataset-related components
│   ├── home/ # Home page components
│   ├── projects/ # Project management components
│   ├── questions/ # Question management components
│   └── text-split/ # Text processing components
├── lib/ # Core libraries and utilities
│   ├── db/ # Database operations
│   ├── i18n/ # Internationalization
│   ├── llm/ # LLM integration
│   │   ├── common/ # Common LLM utilities
│   │   ├── core/ # Core LLM client
│   │   └── prompts/ # Prompt templates
│   │       ├── answer.js # Answer generation prompts (Chinese)
│   │       ├── answerEn.js # Answer generation prompts (English)
│   │       ├── question.js # Question generation prompts (Chinese)
│   │       ├── questionEn.js # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/ # Text splitting utilities
├── locales/ # Internationalization resources
│   ├── en/ # English translations
│   └── zh-CN/ # Chinese translations
├── public/ # Static assets
│   └── imgs/ # Image resources
└── local-db/ # Local file-based database
    └── projects/ # Project data storage
```

## Documentation

For detailed documentation on all features and APIs, please visit our [Documentation Site](https://rncg5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb?302from=wiki).

## Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

1. Fork the repository
2. Create a new branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

Please make sure to update tests as appropriate and adhere to the existing coding style.

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=ConardLi/easy-dataset&type=Date)](https://www.star-history.com/#ConardLi/easy-dataset&Date)

Built with ❤️ by ConardLi • Follow me: WeChat | Bilibili | Juejin | Zhihu
