{"id":51041873,"url":"https://github.com/voidful/awesome-agent-dataset","last_synced_at":"2026-06-22T11:02:07.524Z","repository":{"id":365120994,"uuid":"1270645402","full_name":"voidful/awesome-agent-dataset","owner":"voidful","description":"📚 Curated catalog of agent-training datasets + a toolkit that normalizes, deduplicates, and quality-tiers them into one schema. Produces 🤗 voidful/agent-sft.","archived":false,"fork":false,"pushed_at":"2026-06-16T00:26:24.000Z","size":60,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-16T01:24:00.104Z","etag":null,"topics":["agent","agent-traces","awesome-list","dataset","fine-tuning","function-calling","huggingface","llm","swe-agent","tool-use"],"latest_commit_sha":null,"homepage":"https://huggingface.co/datasets/voidful/agent-sft","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/voidful.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-15T23:17:15.000Z","updated_at":"2026-06-16T01:00:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/voidful/awesome-agent-dataset","commit_stats":null,"previous_names":["voidful/awesome-agent-dataset"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/voidful/awesome-agent-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Faw
esome-agent-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-agent-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-agent-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-agent-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/voidful","download_url":"https://codeload.github.com/voidful/awesome-agent-dataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fawesome-agent-dataset/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34645688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-22T02:00:06.391Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","agent-traces","awesome-list","dataset","fine-tuning","function-calling","huggingface","llm","swe-agent","tool-use"],"created_at":"2026-06-22T11:02:06.740Z","updated_at":"2026-06-22T11:02:07.473Z","avatar_url":"https://github.com/voidful.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🤖 awesome-agent-dataset\n\n**A curated catalog of agent-training datasets — *plus* a working toolkit that normalizes, 
filters, and deduplicates them into one canonical schema.**\n\n[![CI](https://github.com/voidful/awesome-agent-dataset/actions/workflows/ci.yml/badge.svg)](https://github.com/voidful/awesome-agent-dataset/actions/workflows/ci.yml)\n[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![HF Dataset](https://img.shields.io/badge/🤗%20Dataset-voidful%2Fagent--sft-orange)](https://huggingface.co/datasets/voidful/agent-sft)\n\n*Most \"agent dataset\" lists stop at links. This one ships the pipeline too —*\n*so you get **clean, deduplicated, schema-unified data**, not a pile of incompatible formats.*\n\n[**📚 Full Catalog**](CATALOG.md) · [**🤗 Output Dataset**](https://huggingface.co/datasets/voidful/agent-sft) · [**🚀 Quickstart**](#-quickstart) · [**🧩 Schema**](#-canonical-schema) · [**🤝 Contributing**](CONTRIBUTING.md)\n\n\u003c/div\u003e\n\n---\n\n## Why this exists\n\nFor fine-tuning an agent model, **finding tokens is not the bottleneck** — public agent data already far exceeds what a 30B FT needs. The real bottlenecks are:\n\n1. **Format chaos** — every dataset uses a different shape (xLAM `query/answers`, Glaive flat-text, Hermes XML tags, ToolACE Python-call DSL, OpenHands event streams, WebArena action grammars, the new HF `agent-traces` format…).\n2. **Massive overlap** — the same GitHub issue appears across 5 SWE datasets; xLAM/Glaive/ToolACE get re-packaged a dozen times. Naively concatenating **severely overestimates** your real data volume.\n3. **Quality variance** — valid JSON ≠ a successful task; you need stratification.\n4. 
**Coding-agent skew** — SWE/terminal data is so abundant it drowns general agent ability if unbalanced.\n\n`agentds` solves all four: **one normalizer per format → group-level dedup → quality tiers → balanced mixture**, producing [`voidful/agent-sft`](https://huggingface.co/datasets/voidful/agent-sft) in a **standard, model-agnostic OpenAI-style schema** — train any model on it (Qwen, Llama, Gemma, GPT, …).\n\n## ✨ Output dataset\n\n[**🤗 voidful/agent-sft**](https://huggingface.co/datasets/voidful/agent-sft) — a model-agnostic agent/tool-use SFT dataset, produced entirely by this repo from the wired sources.\n\n\u003c!-- STATS:START --\u003e\n**309,322 rows** from 27 wired sources, deduplicated (incl. against a schema-compatible reference dataset).\n\n| Tier | Rows | Share |\n|---|--:|--:|\n| 🛠️ function_calling | 171,910 | 56% |\n| 💻 swe_terminal | 61,153 | 20% |\n| 💬 general | 41,470 | 13% |\n| 🌐 web | 27,830 | 9% |\n| 🧵 agent_traces | 6,959 | 2% |\n| **Total** | **309,322** | |\n\n**Quality:** 147,238 high (48%) · 161,580 medium (52%) · 504 low (0.2%)\n**Dedup removed 99,492 candidates (24%)** — 43,914 SWE-group (same GitHub issue across SWE datasets) · 46,725 near-dup (MinHash) · 8,853 exact, *plus* dedup against a schema-compatible reference dataset. (E.g. `ansulev/DeepSeek-v4-Pro-Agent` → **0 kept**, fully collapsed into its `TeichAI` twin.)\n**`agentds audit`:** 0 CoT leakage · 0 schema corruption · 0 id collisions · 0 foreign-marker leaks.\n\nCoding-heavy data (swe_terminal + agent_traces) is held to **~22%** so general agent ability isn't drowned. 
See [CATALOG.md](CATALOG.md) for per-source counts.\n\u003c!-- STATS:END --\u003e\n\n```python\nimport json\nfrom datasets import load_dataset\n\nds = load_dataset(\"voidful/agent-sft\", split=\"train\")\nhi = ds.filter(lambda r: json.loads(r[\"quality\"])[\"tier\"] == \"high\")   # high-quality SFT subset\nex = hi[0]\nmessages = json.loads(ex[\"messages\"])   # OpenAI-style turns — feed to any chat template\ntools    = json.loads(ex[\"tools\"])      # function definitions\n# schema-compatible with voidful/gemma4-agent-sft, so you can concatenate them\n```\n\n## 🚀 Quickstart\n\n```bash\npython3 -m venv .venv \u0026\u0026 source .venv/bin/activate\npip install -e .\n\nagentds catalog                       # regenerate CATALOG.md from the registry\nagentds validate --tier function_calling -n 30   # normalize LIVE rows, sanity-check\nagentds run --tier function_calling --tier agent_traces   # stream → normalize → dedup → shards\nagentds stats                         # composition + dedup + quality report\nagentds audit                         # quantify data-quality defects (quality gate)\nagentds push --repo you/your-agent-sft --public          # to a new HF dataset repo\n```\n\nPick tiers (`function_calling`, `agent_traces`, `swe_terminal`, `web`, `general`) or individual\nsources (`--key apigen swe_gym`). `--limit N` caps rows/subset for a quick dry run.\n\n## 🧩 Canonical schema\n\nA standard, **model-agnostic** OpenAI-style chat-with-tools schema (works with any model's chat template). 
Wire-compatible with [`voidful/gemma4-agent-sft`](https://huggingface.co/datasets/voidful/gemma4-agent-sft), so shards concatenate cleanly:\n\n| field | type | description |\n|---|---|---|\n| `id` | str | `{source}_{config}_{hash16}` |\n| `source` | str | normalized source key |\n| `source_subset` | str | `config/split` within the source |\n| `messages` | str (JSON) | `list[{role, content, tool_calls?, tool_responses?}]` |\n| `tools` | str (JSON) | `list[{type:\"function\", function:{name, description, parameters}}]` |\n| `tool_names` | list[str] | declared tool names |\n| `quality` | str (JSON) | `{tier, score, curated, signals}` |\n| `metadata` | str (JSON) | `{hf_id, license, dedup_group, instance_id, …}` |\n\n- `tool_calls[].function.arguments` are **objects** (string-encoded args parsed).\n- Chain-of-thought (`\u003cthink\u003e…\u003c/think\u003e`, `reasoning_content`) and foreign chat-template markers are stripped.\n- `parameters` coerced to a JSON-schema `object` (xLAM/Hermes flat styles wrapped; `str`/`int` → `string`/`integer`).\n\n\u003cdetails\u003e\u003csummary\u003eExample normalized row\u003c/summary\u003e\n\n```json\n{\n  \"id\": \"apigen_mt_dataset_9009cd98a0542977\",\n  \"source\": \"apigen_mt\",\n  \"source_subset\": \"dataset/train\",\n  \"messages\": \"[{\\\"role\\\":\\\"system\\\",\\\"content\\\":\\\"# Airline Agent Policy…\\\"},{\\\"role\\\":\\\"user\\\",\\\"content\\\":\\\"I'd like to cancel a reservation.\\\"},{\\\"role\\\":\\\"assistant\\\",\\\"content\\\":null,\\\"tool_calls\\\":[{\\\"function\\\":{\\\"name\\\":\\\"get_reservation_details\\\",\\\"arguments\\\":{\\\"reservation_id\\\":\\\"0U4NPP\\\"}}}]},{\\\"role\\\":\\\"tool\\\",\\\"tool_responses\\\":[{\\\"name\\\":\\\"get_reservation_details\\\",\\\"response\\\":{\\\"reservation_id\\\":\\\"0U4NPP\\\",\\\"status\\\":\\\"active\\\"}}]},{\\\"role\\\":\\\"assistant\\\",\\\"content\\\":\\\"Your reservation 0U4NPP is active — shall I cancel it?\\\"}]\",\n  \"tools\": 
\"[{\\\"type\\\":\\\"function\\\",\\\"function\\\":{\\\"name\\\":\\\"get_reservation_details\\\",\\\"description\\\":\\\"…\\\",\\\"parameters\\\":{\\\"type\\\":\\\"object\\\",\\\"properties\\\":{\\\"reservation_id\\\":{\\\"type\\\":\\\"string\\\"}},\\\"required\\\":[\\\"reservation_id\\\"]}}}]\",\n  \"tool_names\": [\"get_reservation_details\", \"cancel_reservation\", \"...\"],\n  \"quality\": \"{\\\"tier\\\":\\\"high\\\",\\\"score\\\":0.9,\\\"curated\\\":false,\\\"signals\\\":{\\\"n_turns\\\":5,\\\"n_tool_calls\\\":1,\\\"multi_turn\\\":true,\\\"valid_arg_ratio\\\":1.0}}\",\n  \"metadata\": \"{\\\"hf_id\\\":\\\"Salesforce/APIGen-MT-5k\\\",\\\"license\\\":\\\"cc-by-4.0\\\",\\\"dedup_group\\\":\\\"xlam_apigen\\\"}\"\n}\n```\n\u003c/details\u003e\n\n## 📚 Catalog\n\n**[→ Full catalog of 60+ datasets, by tier, with normalization status (CATALOG.md)](CATALOG.md)**\n\n| Tier | What it teaches | Wired sources (sample) |\n|---|---|---|\n| 🛠️ **function_calling** | when/how to call tools, schema grounding, when *not* to call | apigen(xLAM), glaive, toolace, when2call, hermes, hermes_reasoning, toolmind |\n| 🧵 **agent_traces** | real `claude_code`/`pi` coding-agent sessions (HF `format: agent-traces`, decoded via `teich`) | DeepSeek-v4-Pro, synthtraces, qwen3.7-max-pi, minimax-m3, ml-intern |\n| 💻 **swe_terminal** | SWE repair, shell/terminal, long-horizon coding (streamed + sampled) | swe_gym, swe_rebench, swe_zero, swe_smith, coderforge, nemotron-terminal |\n| 🌐 **web** | observation→action loops (DOM/AXTree as text) | weblinx, mind2web, nnetnav |\n| 💬 **general** | retention — keep natural-answer ability, avoid over-tool-calling | openhermes, smoltalk2 |\n\nAdding a dataset = an entry in [`configs/registry.yaml`](configs/registry.yaml) (+ a normalizer if it's a new format). 
See [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## ⚙️ How it works\n\n```\n                        configs/registry.yaml  (single source of truth)\n                                  │\n   HF source ──stream──▶ normalize ──▶ validate ──▶ group-dedup ──▶ quality ──▶ parquet shards ──▶ 🤗 push\n  (streaming=True,     (per-format)   (schema     (exact + SWE-     (tiers)\n   never fully          │             well-formed) provenance +\n   downloaded)          │                          MinHash near-dup,\n                        │                          incl. vs reference)\n                  agentds/normalizers.py     agentds/dedup.py   agentds/quality.py\n```\n\n- **Normalizers** ([`agentds/normalizers.py`](agentds/normalizers.py)) — one per format family: xLAM, ShareGPT (incl. Glaive `function_call`/`observation`), Hermes `\u003ctool_call\u003e` XML, ToolACE BFCL `[Func(k=v)]` (paren/space/path-style names), When2Call `\u003cTOOLCALL\u003e` + appropriate-refusal rows, native OpenHands SWE trajectories, Nemotron terminal transcripts, WebLINX/Mind2Web/WebArena action grammars, and the HF `agent-traces` format. Tools are synthesized from observed calls when a source ships no schema.\n- **Group-level dedup** ([`agentds/dedup.py`](agentds/dedup.py)) — (1) exact xxhash of normalized content; (2) **SWE-provenance** key so the same GitHub issue across SWE-Zero/nebius/SWE-Gym/SWE-smith/CoderForge collapses to one (real `repo-NNNN` ids by issue number; synthetic ids at full granularity); (3) **MinHash + LSH** near-dup over assistant action/tool-schema shingles. 
Dedup state persists across the whole run and can be preloaded with any reference dataset's hashes (`--dedup-against`).\n- **Quality** ([`agentds/quality.py`](agentds/quality.py)) — `{tier: high|medium|low, score, curated, signals}`; rewards multi-turn, schema-valid, observation-grounded tool use; folds in source success signals (SWE `resolved`, CoderForge `reward`); penalizes degenerate trajectories.\n\n## 🧪 Recommended training recipe\n\n- **Stage A — agentic continued post-training** (10–30B tokens): SWE/terminal 55% · tool-use 20% · web 15% · general 10%.\n- **Stage B — high-quality agent SFT** (1–3B tokens): filter to `quality.tier == \"high\"` and verified successes.\n- **Stage C — RL / rejection sampling**: use executable/verified subsets (`reward==1`, `resolved`).\n\nRecommended per-token loss weights (0 masks the token out of the loss):\n\n```\nsystem / user / tool-schema / tool-observation : 0\nassistant natural language / final answer      : 1.0\nassistant tool-call JSON                        : 1.5\nassistant recovery-after-error action           : 2.0\n```\n\n## 🤝 Contributing\n\nPRs that **add datasets** or **wire up catalog-only entries** are the most valuable — see [CONTRIBUTING.md](CONTRIBUTING.md). The bar: a new source must normalize cleanly (`agentds validate` green) and declare a `dedup_group`.\n\n## 🧪 Tests\n\n```bash\n.venv/bin/python -m tests.test_normalizers   # offline, fixture-based\n```\n\n## 📄 License \u0026 citation\n\nCode: [MIT](LICENSE). 
**Each dataset keeps its upstream license** — recorded in every row's `metadata.license`; review before downstream use (sources span apache-2.0 / mit / cc-by-4.0 and restricted terms like cc-by-nc-sa-4.0).\n\n```bibtex\n@misc{awesome-agent-dataset,\n  title  = {awesome-agent-dataset: a catalog and normalization toolkit for agent-training data},\n  author = {voidful},\n  year   = {2026},\n  url    = {https://github.com/voidful/awesome-agent-dataset}\n}\n```\n\n## 🙏 Acknowledgements\n\nBuilt on the open datasets catalogued here and the HuggingFace `datasets` / `teich` / `datasketch` ecosystems. Schema is compatible with [`voidful/gemma4-agent-sft`](https://huggingface.co/datasets/voidful/gemma4-agent-sft).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fawesome-agent-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoidful%2Fawesome-agent-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fawesome-agent-dataset/lists"}