{"id":26538357,"url":"https://github.com/ConardLi/easy-dataset","last_synced_at":"2025-03-21T23:02:49.765Z","repository":{"id":280668573,"uuid":"942756187","full_name":"ConardLi/easy-dataset","owner":"ConardLi","description":"A powerful tool for creating fine-tuning datasets for LLM","archived":false,"fork":false,"pushed_at":"2025-03-20T15:18:40.000Z","size":15140,"stargazers_count":2748,"open_issues_count":52,"forks_count":275,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-03-20T16:22:02.200Z","etag":null,"topics":["dataset","javascript","llm"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ConardLi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-04T16:14:14.000Z","updated_at":"2025-03-20T16:11:31.000Z","dependencies_parsed_at":"2025-03-04T17:40:46.107Z","dependency_job_id":"1e41e5eb-b984-405b-8a39-4bc57ed2c683","html_url":"https://github.com/ConardLi/easy-dataset","commit_stats":null,"previous_names":["conardli/easy-dataset"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ConardLi%2Feasy-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ConardLi%2Feasy-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ConardLi%2Feasy-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ConardLi%2Feasy-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/ConardLi","download_url":"https://codeload.github.com/ConardLi/easy-dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244880607,"owners_count":20525513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","javascript","llm"],"created_at":"2025-03-21T23:01:42.119Z","updated_at":"2025-03-21T23:02:49.757Z","avatar_url":"https://github.com/ConardLi.png","language":"JavaScript","funding_links":[],"categories":["JavaScript","javascript","A01_文本生成_文本对话","Azure Cognitive Search \u0026 OpenAI","数据 Data","llm","AI \u0026 LLM","🤖 AI \u0026 Machine Learning"],"sub_categories":["大语言对话模型及数据","RAG \u0026 Vector Search"],"readme":"\u003cdiv align=\"center\"\u003e\n\n![](./public/imgs/bg2.png)\n\n\u003cimg src=\"https://img.shields.io/badge/version-0.1.0-blue.svg\" alt=\"version 0.1.0\"/\u003e\n\u003cimg src=\"https://img.shields.io/badge/license-Apache--2.0-green.svg\" alt=\"Apache 2.0 License\"/\u003e\n\u003cimg src=\"https://img.shields.io/badge/Next.js-14.1.0-black\" alt=\"Next.js 14.1.0\"/\u003e\n\u003cimg src=\"https://img.shields.io/badge/React-18.2.0-61DAFB\" alt=\"React 18.2.0\"/\u003e\n\u003cimg src=\"https://img.shields.io/badge/MUI-5.15.7-007FFF\" alt=\"Material UI 5.15.7\"/\u003e\n\n**A powerful tool for creating fine-tuning datasets for Large Language Models**\n\n[简体中文](./README.zh-CN.md) | [English](./README.md)\n\n[Features](#features) • [Getting Started](#getting-started) • [Usage](#usage) • 
[Documentation](https://rncg5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb?302from=wiki) • [Contributing](#contributing) • [License](#license)\n\n\u003c/div\u003e\n\nIf you like this project, please leave a Star ⭐️ for it. Or you can buy the author a cup of coffee =\u003e [Support the author](./public/imgs/aw.jpg) ❤️! \n\n## Overview\n\nEasy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.\n\nWith Easy Dataset, you can transform your domain knowledge into structured datasets using any LLM API that follows the OpenAI format, making the fine-tuning process accessible and efficient.\n\n![](./public/imgs/en-arc.png)\n\n## Features\n\n- **Intelligent Document Processing**: Upload Markdown files and automatically split them into meaningful segments\n- **Smart Question Generation**: Extract relevant questions from each text segment\n- **Answer Generation**: Generate comprehensive answers for each question using LLM APIs\n- **Flexible Editing**: Edit questions, answers, and datasets at any stage of the process\n- **Multiple Export Formats**: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)\n- **Wide Model Support**: Compatible with all LLM APIs that follow the OpenAI format\n- **User-Friendly Interface**: Intuitive UI designed for both technical and non-technical users\n- **Customizable System Prompts**: Add custom system prompts to guide model responses\n\n## Getting Started\n\n### Download Client\n\n\u003ctable style=\"width: 400px\"\u003e\n  \u003ctr\u003e\n    \u003ctd width=\"25%\" align=\"center\"\u003e\n      \u003cb\u003eWindows\u003c/b\u003e\n    \u003c/td\u003e\n    \u003ctd width=\"25%\" align=\"center\" colspan=\"2\"\u003e\n      
\u003cb\u003emacOS\u003c/b\u003e\n    \u003c/td\u003e\n    \u003ctd width=\"25%\" align=\"center\"\u003e\n      \u003cb\u003eLinux\u003c/b\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr style=\"text-align: center\"\u003e\n    \u003ctd align=\"center\" valign=\"middle\"\u003e\n      \u003ca href='https://github.com/ConardLi/easy-dataset/releases/latest'\u003e\n        \u003cimg src='./public/imgs/windows.png' style=\"height:24px; width: 24px\" /\u003e\n        \u003cbr /\u003e\n        \u003cb\u003eSetup.exe\u003c/b\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\" valign=\"middle\"\u003e\n      \u003ca href='https://github.com/ConardLi/easy-dataset/releases/latest'\u003e\n        \u003cimg src='./public/imgs/mac.png' style=\"height:24px; width: 24px\" /\u003e\n        \u003cbr /\u003e\n        \u003cb\u003eIntel\u003c/b\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\" valign=\"middle\"\u003e\n      \u003ca href='https://github.com/ConardLi/easy-dataset/releases/latest'\u003e\n        \u003cimg src='./public/imgs/mac.png' style=\"height:24px; width: 24px\" /\u003e\n        \u003cbr /\u003e\n        \u003cb\u003eM\u003c/b\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\" valign=\"middle\"\u003e\n      \u003ca href='https://github.com/ConardLi/easy-dataset/releases/latest'\u003e\n        \u003cimg src='./public/imgs/linux.png' style=\"height:24px; width: 24px\" /\u003e\n        \u003cbr /\u003e\n        \u003cb\u003eAppImage\u003c/b\u003e\n      \u003c/a\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### Using npm\n\nPrerequisites:\n\n- Node.js 18.x or higher\n- pnpm (recommended) or npm\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/ConardLi/easy-dataset.git\n   cd easy-dataset\n   ```\n\n2. Install dependencies:\n   ```bash\n   npm install\n   ```\n\n3. 
Build and start the application:\n   ```bash\n   npm run build\n\n   npm run start\n   ```\n\n### Build with Local Dockerfile  \n\nIf you want to build the image yourself, you can use the Dockerfile in the project root directory:  \n\n1. Clone the repository:  \n   ```bash\n   git clone https://github.com/ConardLi/easy-dataset.git\n   cd easy-dataset\n   ```  \n2. Build the Docker image:  \n   ```bash\n   docker build -t easy-dataset .\n   ```  \n3. Run the container:  \n   ```bash\n   docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset\n   ```  \n   **Note:** Replace `{YOUR_LOCAL_DB_PATH}` with the actual path where you want to store the local database.  \n\n4. Open your browser and navigate to `http://localhost:1717`\n\n## Usage\n\n### Creating a Project\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/1.png\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/2.png\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n\n1. Click the \"Create Project\" button on the home page\n2. Enter a project name and description\n3. Configure your preferred LLM API settings\n\n### Processing Documents\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/3.png\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/4.png\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n1. Upload your Markdown files in the \"Text Split\" section\n2. Review the automatically split text segments\n3. Adjust the segmentation if needed\n\n### Generating Questions\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/5.png\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/6.png\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n1. Navigate to the \"Questions\" section\n2. 
Select text segments to generate questions from\n3. Review and edit the generated questions\n4. Organize questions using the tag tree\n\n### Creating Datasets\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/7.png\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/8.png\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n1. Go to the \"Datasets\" section\n2. Select questions to include in your dataset\n3. Generate answers using your configured LLM\n4. Review and edit the generated answers\n\n### Exporting Datasets\n\n\u003ctable\u003e\n    \u003ctr\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/9.png\"\u003e\u003c/td\u003e\n        \u003ctd\u003e\u003cimg src=\"./public/imgs/10.png\"\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n\u003c/table\u003e\n\n1. Click the \"Export\" button in the Datasets section\n2. Select your preferred format (Alpaca or ShareGPT)\n3. Choose file format (JSON or JSONL)\n4. Add custom system prompts if needed\n5. 
Export your dataset\n\n## Project Structure\n\n```\neasy-dataset/\n├── app/                                # Next.js application directory\n│   ├── api/                            # API routes\n│   │   ├── llm/                        # LLM API integration\n│   │   │   ├── ollama/                 # Ollama API integration\n│   │   │   └── openai/                 # OpenAI API integration\n│   │   ├── projects/                   # Project management APIs\n│   │   │   ├── [projectId]/            # Project-specific operations\n│   │   │   │   ├── chunks/             # Text chunk operations\n│   │   │   │   ├── datasets/           # Dataset generation and management\n│   │   │   │   │   └── optimize/       # Dataset optimization API\n│   │   │   │   ├── generate-questions/ # Batch question generation\n│   │   │   │   ├── questions/          # Question management\n│   │   │   │   └── split/              # Text splitting operations\n│   │   │   └── user/                   # User-specific project operations\n│   ├── projects/                       # Front-end project pages\n│   │   └── [projectId]/                # Project-specific pages\n│   │       ├── datasets/               # Dataset management UI\n│   │       ├── questions/              # Question management UI\n│   │       ├── settings/               # Project settings UI\n│   │       └── text-split/             # Text processing UI\n│   └── page.js                         # Home page\n├── components/                         # React components\n│   ├── datasets/                       # Dataset-related components\n│   ├── home/                           # Home page components\n│   ├── projects/                       # Project management components\n│   ├── questions/                      # Question management components\n│   └── text-split/                     # Text processing components\n├── lib/                                # Core libraries and utilities\n│   ├── db/                             # Database 
operations\n│   ├── i18n/                           # Internationalization\n│   ├── llm/                            # LLM integration\n│   │   ├── common/                     # Common LLM utilities\n│   │   ├── core/                       # Core LLM client\n│   │   └── prompts/                    # Prompt templates\n│   │       ├── answer.js               # Answer generation prompts (Chinese)\n│   │       ├── answerEn.js             # Answer generation prompts (English)\n│   │       ├── question.js             # Question generation prompts (Chinese)\n│   │       ├── questionEn.js           # Question generation prompts (English)\n│   │       └── ... other prompts\n│   └── text-splitter/                  # Text splitting utilities\n├── locales/                            # Internationalization resources\n│   ├── en/                             # English translations\n│   └── zh-CN/                          # Chinese translations\n├── public/                             # Static assets\n│   └── imgs/                           # Image resources\n└── local-db/                           # Local file-based database\n    └── projects/                       # Project data storage\n```\n\n\n## Documentation\n\nFor detailed documentation on all features and APIs, please visit our [Documentation Site](https://rncg5jvpme.feishu.cn/docx/IRuad1eUIo8qLoxxwAGcZvqJnDb?302from=wiki).\n\n## Contributing\n\nWe welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:\n\n1. Fork the repository\n2. Create a new branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes\n4. Commit your changes (`git commit -m 'Add some amazing feature'`)\n5. Push to the branch (`git push origin feature/amazing-feature`)\n6. 
Open a Pull Request\n\nPlease make sure to update tests as appropriate and adhere to the existing coding style.\n\n## License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=ConardLi/easy-dataset\u0026type=Date)](https://www.star-history.com/#ConardLi/easy-dataset\u0026Date)\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003csub\u003eBuilt with ❤️ by \u003ca href=\"https://github.com/ConardLi\"\u003eConardLi\u003c/a\u003e • Follow me: \u003ca href=\"https://mp.weixin.qq.com/s/ac9XWvVsaXpSH1HH2x4TRQ\"\u003eWeChat\u003c/a\u003e｜\u003ca href=\"https://space.bilibili.com/474921808\"\u003eBilibili\u003c/a\u003e｜\u003ca href=\"https://juejin.cn/user/3949101466785709\"\u003eJuejin\u003c/a\u003e｜\u003ca href=\"https://www.zhihu.com/people/wen-ti-chao-ji-duo-de-xiao-qi\"\u003eZhihu\u003c/a\u003e\u003c/sub\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FConardLi%2Feasy-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FConardLi%2Feasy-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FConardLi%2Feasy-dataset/lists"}