https://github.com/voidful/awesome-agent-dataset
π Curated catalog of agent-training datasets + a toolkit that normalizes, deduplicates, and quality-tiers them into one schema. Produces π€ voidful/agent-sft.
https://github.com/voidful/awesome-agent-dataset
List: awesome-agent-dataset
agent agent-traces awesome-list dataset fine-tuning function-calling huggingface llm swe-agent tool-use
Last synced: 2 days ago
JSON representation
π Curated catalog of agent-training datasets + a toolkit that normalizes, deduplicates, and quality-tiers them into one schema. Produces π€ voidful/agent-sft.
- Host: GitHub
- URL: https://github.com/voidful/awesome-agent-dataset
- Owner: voidful
- License: other
- Created: 2026-06-15T23:17:15.000Z (9 days ago)
- Default Branch: main
- Last Pushed: 2026-06-16T00:26:24.000Z (9 days ago)
- Last Synced: 2026-06-16T01:24:00.104Z (8 days ago)
- Topics: agent, agent-traces, awesome-list, dataset, fine-tuning, function-calling, huggingface, llm, swe-agent, tool-use
- Language: Python
- Homepage: https://huggingface.co/datasets/voidful/agent-sft
- Size: 58.6 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# π€ awesome-agent-dataset
**A curated catalog of agent-training datasets β *plus* a working toolkit that normalizes, filters, and deduplicates them into one canonical schema.**
[](https://github.com/voidful/awesome-agent-dataset/actions/workflows/ci.yml)
[](https://awesome.re)
[](https://www.python.org/)
[](LICENSE)
[](CONTRIBUTING.md)
[](https://huggingface.co/datasets/voidful/agent-sft)
*Most "agent dataset" lists stop at links. This one ships the pipeline too β*
*so you get **clean, deduplicated, schema-unified data**, not a pile of incompatible formats.*
[**π Full Catalog**](CATALOG.md) Β· [**π€ Output Dataset**](https://huggingface.co/datasets/voidful/agent-sft) Β· [**π Quickstart**](#-quickstart) Β· [**π§© Schema**](#-canonical-schema) Β· [**π€ Contributing**](CONTRIBUTING.md)
---
## Why this exists
For fine-tuning an agent model, **finding tokens is not the bottleneck** β public agent data already far exceeds what a 30B FT needs. The real bottlenecks are:
1. **Format chaos** β every dataset uses a different shape (xLAM `query/answers`, Glaive flat-text, Hermes XML tags, ToolACE Python-call DSL, OpenHands event streams, WebArena action grammars, the new HF `agent-traces` formatβ¦).
2. **Massive overlap** β the same GitHub issue appears across 5 SWE datasets; xLAM/Glaive/ToolACE get re-packaged a dozen times. Naively concatenating **severely overestimates** your real data volume.
3. **Quality variance** β valid JSON β a successful task; you need stratification.
4. **Coding-agent skew** β SWE/terminal data is so abundant it drowns general agent ability if unbalanced.
`agentds` solves all four: **one normalizer per format β group-level dedup β quality tiers β balanced mixture**, producing [`voidful/agent-sft`](https://huggingface.co/datasets/voidful/agent-sft) in a **standard, model-agnostic OpenAI-style schema** β train any model on it (Qwen, Llama, Gemma, GPT, β¦).
## β¨ Output dataset
[**π€ voidful/agent-sft**](https://huggingface.co/datasets/voidful/agent-sft) β a model-agnostic agent/tool-use SFT dataset, produced entirely by this repo from the wired sources.
**309,322 rows** from 27 wired sources, deduplicated (incl. against a schema-compatible reference dataset).
| Tier | Rows | Share |
|---|--:|--:|
| π οΈ function_calling | 171,910 | 56% |
| π» swe_terminal | 61,153 | 20% |
| π¬ general | 41,470 | 13% |
| π web | 27,830 | 9% |
| π§΅ agent_traces | 6,959 | 2% |
| **Total** | **309,322** | |
**Quality:** 147,238 high (48%) Β· 161,580 medium (52%) Β· 504 low (0.2%)
**Dedup removed 99,492 candidates (24%)** β 43,914 SWE-group (same GitHub issue across SWE datasets) Β· 46,725 near-dup (MinHash) Β· 8,853 exact, *plus* dedup against a schema-compatible reference dataset. (E.g. `ansulev/DeepSeek-v4-Pro-Agent` β **0 kept**, fully collapsed into its `TeichAI` twin.)
**`agentds audit`:** 0 CoT leakage Β· 0 schema corruption Β· 0 id collisions Β· 0 foreign-marker leaks.
Coding-heavy data (swe_terminal + agent_traces) is held to **~22%** so general agent ability isn't drowned. See [CATALOG.md](CATALOG.md) for per-source counts.
```python
import json
from datasets import load_dataset
ds = load_dataset("voidful/agent-sft", split="train")
hi = ds.filter(lambda r: json.loads(r["quality"])["tier"] == "high") # high-quality SFT subset
ex = hi[0]
messages = json.loads(ex["messages"]) # OpenAI-style turns β feed to any chat template
tools = json.loads(ex["tools"]) # function definitions
# schema-compatible with voidful/gemma4-agent-sft, so you can concatenate them
```
## π Quickstart
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
agentds catalog # regenerate CATALOG.md from the registry
agentds validate --tier function_calling -n 30 # normalize LIVE rows, sanity-check
agentds run --tier function_calling --tier agent_traces # stream β normalize β dedup β shards
agentds stats # composition + dedup + quality report
agentds audit # quantify data-quality defects (quality gate)
agentds push --repo you/your-agent-sft --public # to a new HF dataset repo
```
Pick tiers (`function_calling`, `agent_traces`, `swe_terminal`, `web`, `general`) or individual
sources (`--key apigen swe_gym`). `--limit N` caps rows/subset for a quick dry run.
## π§© Canonical schema
A standard, **model-agnostic** OpenAI-style chat-with-tools schema (works with any model's chat template). Wire-compatible with [`voidful/gemma4-agent-sft`](https://huggingface.co/datasets/voidful/gemma4-agent-sft), so shards concatenate cleanly:
| field | type | description |
|---|---|---|
| `id` | str | `{source}_{config}_{hash16}` |
| `source` | str | normalized source key |
| `source_subset` | str | `config/split` within the source |
| `messages` | str (JSON) | `list[{role, content, tool_calls?, tool_responses?}]` |
| `tools` | str (JSON) | `list[{type:"function", function:{name, description, parameters}}]` |
| `tool_names` | list[str] | declared tool names |
| `quality` | str (JSON) | `{tier, score, curated, signals}` |
| `metadata` | str (JSON) | `{hf_id, license, dedup_group, instance_id, β¦}` |
- `tool_calls[].function.arguments` are **objects** (string-encoded args parsed).
- Chain-of-thought (`β¦`, `reasoning_content`) and foreign chat-template markers are stripped.
- `parameters` coerced to a JSON-schema `object` (xLAM/Hermes flat styles wrapped; `str`/`int` β `string`/`integer`).
Example normalized row
```json
{
"id": "apigen_mt_dataset_9009cd98a0542977",
"source": "apigen_mt",
"source_subset": "dataset/train",
"messages": "[{\"role\":\"system\",\"content\":\"# Airline Agent Policyβ¦\"},{\"role\":\"user\",\"content\":\"I'd like to cancel a reservation.\"},{\"role\":\"assistant\",\"content\":null,\"tool_calls\":[{\"function\":{\"name\":\"get_reservation_details\",\"arguments\":{\"reservation_id\":\"0U4NPP\"}}}]},{\"role\":\"tool\",\"tool_responses\":[{\"name\":\"get_reservation_details\",\"response\":{\"reservation_id\":\"0U4NPP\",\"status\":\"active\"}}]},{\"role\":\"assistant\",\"content\":\"Your reservation 0U4NPP is active β shall I cancel it?\"}]",
"tools": "[{\"type\":\"function\",\"function\":{\"name\":\"get_reservation_details\",\"description\":\"β¦\",\"parameters\":{\"type\":\"object\",\"properties\":{\"reservation_id\":{\"type\":\"string\"}},\"required\":[\"reservation_id\"]}}}]",
"tool_names": ["get_reservation_details", "cancel_reservation", "..."],
"quality": "{\"tier\":\"high\",\"score\":0.9,\"curated\":false,\"signals\":{\"n_turns\":5,\"n_tool_calls\":1,\"multi_turn\":true,\"valid_arg_ratio\":1.0}}",
"metadata": "{\"hf_id\":\"Salesforce/APIGen-MT-5k\",\"license\":\"cc-by-4.0\",\"dedup_group\":\"xlam_apigen\"}"
}
```
## π Catalog
**[β Full catalog of 60+ datasets, by tier, with normalization status (CATALOG.md)](CATALOG.md)**
| Tier | What it teaches | Wired sources (sample) |
|---|---|---|
| π οΈ **function_calling** | when/how to call tools, schema grounding, when *not* to call | apigen(xLAM), glaive, toolace, when2call, hermes, hermes_reasoning, toolmind |
| π§΅ **agent_traces** | real `claude_code`/`pi` coding-agent sessions (HF `format: agent-traces`, decoded via `teich`) | DeepSeek-v4-Pro, synthtraces, qwen3.7-max-pi, minimax-m3, ml-intern |
| π» **swe_terminal** | SWE repair, shell/terminal, long-horizon coding (streamed + sampled) | swe_gym, swe_rebench, swe_zero, swe_smith, coderforge, nemotron-terminal |
| π **web** | observationβaction loops (DOM/AXTree as text) | weblinx, mind2web, nnetnav |
| π¬ **general** | retention β keep natural-answer ability, avoid over-tool-calling | openhermes, smoltalk2 |
Adding a dataset = an entry in [`configs/registry.yaml`](configs/registry.yaml) (+ a normalizer if it's a new format). See [CONTRIBUTING.md](CONTRIBUTING.md).
## βοΈ How it works
```
configs/registry.yaml (single source of truth)
β
HF source ββstreamβββΆ normalize βββΆ validate βββΆ group-dedup βββΆ quality βββΆ parquet shards βββΆ π€ push
(streaming=True, (per-format) (schema (exact + SWE- (tiers)
never fully β well-formed) provenance +
downloaded) β MinHash near-dup,
β incl. vs reference)
agentds/normalizers.py agentds/dedup.py agentds/quality.py
```
- **Normalizers** ([`agentds/normalizers.py`](agentds/normalizers.py)) β one per format family: xLAM, ShareGPT (incl. Glaive `function_call`/`observation`), Hermes `` XML, ToolACE BFCL `[Func(k=v)]` (paren/space/path-style names), When2Call `` + appropriate-refusal rows, native OpenHands SWE trajectories, Nemotron terminal transcripts, WebLINX/Mind2Web/WebArena action grammars, and the HF `agent-traces` format. Tools are synthesized from observed calls when a source ships no schema.
- **Group-level dedup** ([`agentds/dedup.py`](agentds/dedup.py)) β (1) exact xxhash of normalized content; (2) **SWE-provenance** key so the same GitHub issue across SWE-Zero/nebius/SWE-Gym/SWE-smith/CoderForge collapses to one (real `repo-NNNN` ids by issue number; synthetic ids at full granularity); (3) **MinHash + LSH** near-dup over assistant action/tool-schema shingles. Stateful across the whole run + can preload any reference dataset's hashes (`--dedup-against`).
- **Quality** ([`agentds/quality.py`](agentds/quality.py)) β `{tier: high|medium|low, score, curated, signals}`; rewards multi-turn, schema-valid, observation-grounded tool use; folds in source success signals (SWE `resolved`, CoderForge `reward`); penalizes degenerate trajectories.
## π§ͺ Recommended training recipe
- **Stage A β agentic continued post-training** (10β30B tok): SWE/terminal 55% Β· tool-use 20% Β· web 15% Β· general 10%.
- **Stage B β high-quality agent SFT** (1β3B tok): filter to `quality.tier == "high"` + verified successes.
- **Stage C β RL / rejection sampling**: use executable/verified subsets (`reward==1`, `resolved`).
Recommended loss mask:
```
system / user / tool-schema / tool-observation : 0
assistant natural language / final answer : 1.0
assistant tool-call JSON : 1.5
assistant recovery-after-error action : 2.0
```
## π€ Contributing
PRs that **add datasets** or **wire up catalog-only entries** are the most valuable β see [CONTRIBUTING.md](CONTRIBUTING.md). The bar: it must normalize cleanly (`agentds validate` green) and declare a `dedup_group`.
## π§ͺ Tests
```bash
.venv/bin/python -m tests.test_normalizers # offline, fixture-based
```
## π License & citation
Code: [MIT](LICENSE). **Each dataset keeps its upstream license** β recorded in every row's `metadata.license`; review before downstream use (sources span apache-2.0 / mit / cc-by-4.0 and restricted terms like cc-by-nc-sa-4.0).
```bibtex
@misc{awesome-agent-dataset,
title = {awesome-agent-dataset: a catalog and normalization toolkit for agent-training data},
author = {voidful},
year = {2026},
url = {https://github.com/voidful/awesome-agent-dataset}
}
```
## π Acknowledgements
Built on the open datasets catalogued here and the HuggingFace `datasets` / `teich` / `datasketch` ecosystems. Schema is compatible with [`voidful/gemma4-agent-sft`](https://huggingface.co/datasets/voidful/gemma4-agent-sft).