{"id":49494182,"url":"https://github.com/arondaron/dataset-generator","last_synced_at":"2026-05-03T18:04:31.643Z","repository":{"id":353088381,"uuid":"1205800578","full_name":"AronDaron/dataset-generator","owner":"AronDaron","description":"No-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs — plan-then-execute pipeline, LLM-as-judge, HuggingFace upload.","archived":false,"fork":false,"pushed_at":"2026-04-30T18:20:05.000Z","size":10906,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-01T08:38:18.949Z","etag":null,"topics":["alpaca","chatml","dataset-generation","desktop-app","fastapi","fine-tuning","huggingface","llm","llm-as-judge","llm-fine-tuning","nextjs","openrouter","sft","sharegpt","synthetic-data"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AronDaron.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"ko_fi":"arondaron"}},"created_at":"2026-04-09T09:38:34.000Z","updated_at":"2026-04-30T06:29:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/AronDaron/dataset-generator","commit_stats":null,"previous_names":["arondaron/dataset-generator"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/AronDaron/dataset-generator","repository_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/AronDaron%2Fdataset-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AronDaron%2Fdataset-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AronDaron%2Fdataset-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AronDaron%2Fdataset-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AronDaron","download_url":"https://codeload.github.com/AronDaron/dataset-generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AronDaron%2Fdataset-generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32579092,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alpaca","chatml","dataset-generation","desktop-app","fastapi","fine-tuning","huggingface","llm","llm-as-judge","llm-fine-tuning","nextjs","openrouter","sft","sharegpt","synthetic-data"],"created_at":"2026-05-01T08:33:15.321Z","updated_at":"2026-05-03T18:04:31.637Z","avatar_url":"https://github.com/AronDaron.png","language":"TypeScript","funding_links":["https://ko-fi.com/arondaron"],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"docs/assets/logo.png\" alt=\"Dataset Generator\" width=\"220\" /\u003e\n\n# Dataset Generator\n\n**A no-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs.**\n\nPick categories, set proportions, click Generate — the app handles topic planning, example generation, quality scoring, and export to a ready-to-train JSONL file.\n\n[![Stack](https://img.shields.io/badge/stack-Next.js%2016%20%7C%20FastAPI%20%7C%20SQLite-7c3aed?style=flat-square)](#quick-start)\n[![Python](https://img.shields.io/badge/python-3.10%2B-blue?style=flat-square)](https://www.python.org/)\n[![Node](https://img.shields.io/badge/node-20%2B-339933?style=flat-square)](https://nodejs.org/)\n[![License](https://img.shields.io/badge/license-AGPL--3.0-green?style=flat-square)](LICENSE)\n[![Ko-fi](https://img.shields.io/badge/Ko--fi-support-FF5E5B?style=flat-square\u0026logo=ko-fi\u0026logoColor=white)](https://ko-fi.com/arondaron)\n\n\u003c/div\u003e\n\n---\n\n## About\n\nDataset Generator is a desktop app that automates the 
full dataset generation pipeline — topic planning, multi-turn conversation generation, quality validation via LLM Judge, deduplication, and HuggingFace Hub upload. No scripts to write, no ML infra to configure.\n\nUnder the hood it runs a three-stage engine: instead of a single \"generate 100 examples\" prompt, the app first decomposes the job into unique topics and outlines, and only then generates the actual examples. The result: diverse, coherent data without the repetitive patterns of naive generation.\n\nEverything stays local. API keys live in SQLite on your device, and datasets land in `~/.datasetgenerator/`. Talk to OpenRouter for ~300 cloud models, or point the app at a local Ollama / LM Studio / llama.cpp server for fully offline generation — both modes share the same pipeline.\n\n\u003e **Note on provider terms.** Users are responsible for complying with the terms of service of the LLM providers they use through OpenRouter. Some providers restrict using model outputs for training competing models — check the ToS of your chosen model before generating datasets for fine-tuning.\n\n---\n\n## Why I built this\n\nI recently started fine-tuning open-source LLMs as a hobby, and I build software with AI coding agents. I wanted a simple way to generate training datasets without writing custom scripts every time — pick categories, configure the pipeline, click Generate, get a JSONL ready for training. There are plenty of datasets on HuggingFace, but sometimes you want one tailored to your specific categories and proportions. So I built the tool I wanted to use.\n\n---\n\n## Benchmark\n\nDatasets generated by this app were used to fine-tune **Qwen2.5-Coder-7B-Instruct**, and the fine-tuned models were evaluated against the base model on **HumanEval / HumanEval+** (pass@1, average across 5 runs). **Every model in the pipeline — topic planner, example generator, and LLM Judge — was open-source** (Llama, Qwen, DeepSeek, Mistral via OpenRouter). 
No proprietary APIs.\n\n\u003cdiv align=\"center\"\u003e\n\n| Model | HumanEval | HumanEval+ |\n|---|:---:|:---:|\n| Base Qwen2.5-Coder-7B-Instruct | 55.5% (±2.1) | 49.0% (±1.9) |\n| FT V1 (750 samples) | 57.2% (±1.0) | 51.0% (±0.5) |\n| **FT V2 (this pipeline, 1135 samples)** | **60.0% (±0.9)** | **54.0% (±1.8)** |\n\n\u003c/div\u003e\n\n**+4.5 pts on HumanEval, +5.0 pts on HumanEval+ vs base.** The FT V2 and base error bars don't overlap on either benchmark, which suggests the gain is larger than run-to-run noise.\n\n🤗 **Artifacts:** [fine-tuned model](https://huggingface.co/AronDaron/Qwen2.5-Coder-7B-Instruct-DatasetGen-v2) · [V1 dataset (750 samples)](https://huggingface.co/datasets/AronDaron/dataset-gen-v1) · [V2 dataset (1135 samples)](https://huggingface.co/datasets/AronDaron/dataset-gen-v2)\n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"docs/assets/benchmark.png\" alt=\"Benchmark results — HumanEval / HumanEval+ pass@1\" width=\"800\" /\u003e\n\u003c/div\u003e\n\n\u003csub\u003eThis benchmark validates the pipeline on a coding-focused dataset (multi-turn \ncoding assistance with explanations). The tool itself is domain-agnostic — define \nany categories (writing, Q\u0026A, math, customer support, etc.) and the same workflow \napplies. 
Results depend on category configuration, judge criteria, and model \nselection — your mileage may vary.\u003c/sub\u003e\n\n---\n\n## Demo\n\n\u003cdiv align=\"center\"\u003e\n\nhttps://github.com/user-attachments/assets/73f43f6c-a5b8-47c9-8de2-8e016e57cfef\n\n\u003csub\u003eGenerating 10 examples across 2 categories in ShareGPT format with the LLM Judge enabled.\u003c/sub\u003e\n\n\u003c/div\u003e\n\n---\n\n## Features\n\n\u003e Actively developed — bug reports and feature requests welcome via [Issues](../../issues).\n\u003e General questions and ideas → [Discussions](../../discussions).\n\n\n- **Plan-then-Execute pipeline** — three stages (topics → outlines → examples), each of which can use a different model\n- **Tests** — 460-test internal suite (unit + integration + E2E)\n- **Cloud + local providers** — OpenRouter for ~300 cloud models, plus Ollama / LM Studio / any OpenAI-compatible endpoint for fully offline generation. Mix and match per category (e.g. local generator + cloud judge).\n- **Per-category configuration** — any number of categories with custom proportions, descriptions, and dedicated models\n- **LLM Judge** — a second model scores every example 0–100 against editable criteria; rejected examples are regenerated\n- **Real-time SSE dashboard** — global and per-category progress, live example feed, running cost\n- **Three export formats** — ShareGPT, Alpaca, ChatML\n- **Multi-turn conversations** — 1–5 turns generated coherently in one LLM call\n- **Actual cost tracking** — pulls real `usage` tokens from every response, multiplies by live pricing\n- **Embedding-based deduplication** — cosine similarity over OpenRouter embeddings\n- **Quality Report** — judge histogram, token stats, efficiency, export to JSON/CSV\n- **Dataset history + in-app preview** — turn-by-turn rendering, code highlighting, dataset merging\n- **HuggingFace Hub upload** — one-click push to your repo\n\n\n---\n\n## Use cases\n\n- **Fine-tuning a domain-specific assistant** — coding, 
legal, medical, customer support. The benchmark above is exactly this flow.\n- **Instruction datasets at any scale** — SFT-ready JSONL for models from 7B edge deployments up to 70B+; merge multiple jobs to grow the corpus.\n- **Experimenting with fine-tuning** — quickly test how different category compositions affect model behavior without weeks of data curation.\n- **Multi-turn conversation datasets** — generate 3–5 turn dialogues for training agentic behaviors.\n\n---\n\n## Local models (Ollama / LM Studio / OpenAI-compatible)\n\nBeyond OpenRouter cloud models, the app talks to any **OpenAI-compatible** endpoint — Ollama, LM Studio, llama.cpp, vLLM, TGI, or your own server. Run the entire pipeline offline, or mix freely: e.g. local generator + cloud judge, or different models per category.\n\n**Setup:** start your local server (`ollama serve` on port 11434, LM Studio's server tab on 1234, etc.), then in the app open **Settings → Providers → Auto-detect local**. Endpoints are discovered automatically; any custom base URL of the form `http://host:port/v1` also works. For fully offline runs, pick a local embedding model (e.g. `nomic-embed-text`) in **Settings → Dedup**.\n\n### Model size matters\n\nDataset generation is more demanding than general chat — the model has to produce strict JSON, follow multi-turn structure, and stay coherent across many examples. A model that's perfectly fine for chat may fail validation here.\n\n| Size | Recommendation | Notes |\n|---|---|---|\n| **\u003c7B** (Llama 3.2:3B, etc.) 
| Not recommended | Frequent JSON validation failures, repetitive content, schema drift |\n| **7B–13B** (Mistral 7B, Llama 3.1:8B) | Casual use only | Works for experimentation, but expect noticeable skip rate and lower diversity |\n| **14B** (Qwen2.5-Coder:14B, Qwen3:14B) | Pragmatic minimum | Stable generation, clean output, low skip rate |\n| **32B+** (Qwen2.5-Coder-32B, DeepSeek-V3, GLM-4-32B) | Recommended target | Quality approaches cloud providers |\n\nIf you don't have the GPU for 14B+, **OpenRouter is the better path** — same pipeline, no hardware constraint, open-source models cost cents per 1000 examples.\n\n---\n\n## Download\n\nPre-built binaries are on the [Releases](../../releases) page — no Python, no Node.js required.\n\n| Platform | File | Size | Usage |\n|---|---|---|---|\n| Windows 10/11 (x64) | `DatasetGenerator-windows-x64.zip` | ~100 MB | Extract → double-click `DatasetGenerator.exe` |\n| Linux (AppImage) | `DatasetGenerator-x86_64.AppImage` | ~140 MB | `chmod +x` → double-click |\n| Linux (tar.gz) | `DatasetGenerator-linux-x64.tar.gz` | ~140 MB | Extract → run `./DatasetGenerator` |\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eWindows — SmartScreen warning\u003c/b\u003e\u003c/summary\u003e\n\nUnsigned executable: on first run click **More info** → **Run anyway**. 
App data is stored in `%APPDATA%\\DatasetGenerator\\`.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLinux AppImage — FUSE on Ubuntu 24.04\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nchmod +x DatasetGenerator-x86_64.AppImage\n./DatasetGenerator-x86_64.AppImage\n```\n\nIf `dlopen(): error loading libfuse.so.2` appears:\n```bash\nsudo apt install libfuse2t64   # Ubuntu 24.04+\nsudo apt install libfuse2      # Ubuntu 22.04 and older\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLinux tar.gz — GTK/WebKit requirements\u003c/b\u003e\u003c/summary\u003e\n\n```bash\ntar -xzf DatasetGenerator-linux-x64.tar.gz\ncd DatasetGenerator\n./DatasetGenerator\n```\n\nRequires GTK 3 and WebKit2GTK 4.1 (pre-installed on Ubuntu 24.04+, Fedora 38+). On older systems:\n```bash\nsudo apt install libgtk-3-0 libwebkit2gtk-4.1-0\n```\n\u003c/details\u003e\n\n---\n\n## Quick start\n\n```bash\ngit clone https://github.com/AronDaron/dataset-generator.git\ncd dataset-generator\n\n# Backend\ncd backend\npython3 -m venv venv\n./venv/bin/pip install -r requirements.txt\n./venv/bin/uvicorn app.main:app --reload --port 8000\n\n# Frontend (new terminal)\ncd frontend\nnpm install\nnpm run dev\n```\n\nBackend on `http://localhost:8000`, frontend on `http://localhost:3000`. Open Settings → enter your OpenRouter API key → pick a category → click Generate.\n\n\u003e **Windows:** replace `./venv/bin/pip` and `./venv/bin/uvicorn` with `venv\\Scripts\\pip.exe` and `venv\\Scripts\\uvicorn.exe`.\n\n**Requirements:** Python 3.10+, Node.js 20+, an [OpenRouter API key](https://openrouter.ai/keys).\n\n**Stack:** Next.js 16 + React 19, FastAPI + Pydantic v2, SQLite (aiosqlite), SSE for progress, Pywebview + PyInstaller for packaging.\n\n---\n\n## FAQ\n\n**Is Linux fully supported?**\nYes — the app ships AppImage and tar.gz builds and all features work cross-platform. 
That said, day-to-day development and manual testing happen on Windows; Linux builds are verified with automated smoke tests but don't get the same amount of hands-on time. If something feels off on Linux, please open an Issue — I'll take a look.\n\n**How much does it cost to generate 1000 examples?**\nDepends on model choice, turn count, and judge strictness. With open-source models available on OpenRouter (Llama 3.x, Qwen 2.5, DeepSeek, Mistral), expect single-digit dollars per 1000 multi-turn examples. Note that the UI shows the cost of accepted examples only — real spend includes rejected and skipped examples plus retries, typically 1.5–2x the displayed cost depending on judge threshold.\n\n**Is my API key safe?**\nKeys are stored locally in SQLite (`~/.datasetgenerator/database.sqlite`). No telemetry, no remote calls except to OpenRouter and (optionally) HuggingFace Hub. Datasets stay on your machine unless you push them to the Hub.\n\n**Why AGPL-3.0 and not MIT?**\nTo prevent closed-source SaaS forks. You're free to use, modify, and self-host — but if you deploy a derivative as a hosted service, your users have the right to receive your source code. Commercial licensing is negotiable — open an Issue or contact me directly.\n\n---\n\n## License\n\n**GNU Affero General Public License v3.0** — see [LICENSE](LICENSE).\n\nStrong copyleft: you're free to use, modify, and redistribute, but any derivative work — including SaaS / network-deployed versions — must release its full source under the same license. For proprietary commercial use, open an issue or contact me directly.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farondaron%2Fdataset-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farondaron%2Fdataset-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farondaron%2Fdataset-generator/lists"}