
# 🤖 awesome-agent-dataset

**A curated catalog of agent-training datasets, *plus* a working toolkit that normalizes, filters, and deduplicates them into one canonical schema.**

[![CI](https://github.com/voidful/awesome-agent-dataset/actions/workflows/ci.yml/badge.svg)](https://github.com/voidful/awesome-agent-dataset/actions/workflows/ci.yml)
[![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![HF Dataset](https://img.shields.io/badge/🤗%20Dataset-voidful%2Fagent--sft-orange)](https://huggingface.co/datasets/voidful/agent-sft)

*Most "agent dataset" lists stop at links. This one ships the pipeline too,*
*so you get **clean, deduplicated, schema-unified data**, not a pile of incompatible formats.*

[**📚 Full Catalog**](CATALOG.md) · [**🤗 Output Dataset**](https://huggingface.co/datasets/voidful/agent-sft) · [**🚀 Quickstart**](#-quickstart) · [**🧩 Schema**](#-canonical-schema) · [**🤝 Contributing**](CONTRIBUTING.md)

---

## Why this exists

For fine-tuning an agent model, **finding tokens is not the bottleneck**: public agent data already far exceeds what a 30B fine-tune needs. The real bottlenecks are:

1. **Format chaos**: every dataset uses a different shape (xLAM `query/answers`, Glaive flat-text, Hermes XML tags, ToolACE Python-call DSL, OpenHands event streams, WebArena action grammars, the new HF `agent-traces` format…).
2. **Massive overlap**: the same GitHub issue appears across 5 SWE datasets; xLAM/Glaive/ToolACE get re-packaged a dozen times. Naively concatenating **severely overestimates** your real data volume.
3. **Quality variance**: valid JSON ≠ a successful task; you need stratification.
4. **Coding-agent skew**: SWE/terminal data is so abundant it drowns general agent ability if unbalanced.

`agentds` addresses all four: **one normalizer per format → group-level dedup → quality tiers → balanced mixture**, producing [`voidful/agent-sft`](https://huggingface.co/datasets/voidful/agent-sft) in a **standard, model-agnostic OpenAI-style schema**, so you can train any model on it (Qwen, Llama, Gemma, GPT, …).

## ✨ Output dataset

[**🤗 voidful/agent-sft**](https://huggingface.co/datasets/voidful/agent-sft): a model-agnostic agent/tool-use SFT dataset, produced entirely by this repo from the wired sources.

**309,322 rows** from 27 wired sources, deduplicated (incl. against a schema-compatible reference dataset).

| Tier | Rows | Share |
|---|--:|--:|
| 🛠️ function_calling | 171,910 | 56% |
| 💻 swe_terminal | 61,153 | 20% |
| 💬 general | 41,470 | 13% |
| 🌐 web | 27,830 | 9% |
| 🧵 agent_traces | 6,959 | 2% |
| **Total** | **309,322** | |

- **Quality:** 147,238 high (48%) · 161,580 medium (52%) · 504 low (0.2%)
- **Dedup removed 99,492 candidates (24%):** 43,914 SWE-group (same GitHub issue across SWE datasets) · 46,725 near-dup (MinHash) · 8,853 exact, *plus* dedup against a schema-compatible reference dataset. (E.g. `ansulev/DeepSeek-v4-Pro-Agent` → **0 kept**, fully collapsed into its `TeichAI` twin.)
- **`agentds audit`:** 0 CoT leakage · 0 schema corruption · 0 id collisions · 0 foreign-marker leaks.

Coding-heavy data (swe_terminal + agent_traces) is held to **~22%** so general agent ability isn't drowned. See [CATALOG.md](CATALOG.md) for per-source counts.

```python
import json
from datasets import load_dataset

ds = load_dataset("voidful/agent-sft", split="train")
hi = ds.filter(lambda r: json.loads(r["quality"])["tier"] == "high")  # high-quality SFT subset
ex = hi[0]
messages = json.loads(ex["messages"])  # OpenAI-style turns; feed to any chat template
tools = json.loads(ex["tools"])  # function definitions
# schema-compatible with voidful/gemma4-agent-sft, so you can concatenate them
```

## 🚀 Quickstart

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

agentds catalog # regenerate CATALOG.md from the registry
agentds validate --tier function_calling -n 30 # normalize LIVE rows, sanity-check
agentds run --tier function_calling --tier agent_traces # stream → normalize → dedup → shards
agentds stats # composition + dedup + quality report
agentds audit # quantify data-quality defects (quality gate)
agentds push --repo you/your-agent-sft --public # to a new HF dataset repo
```

Pick tiers (`function_calling`, `agent_traces`, `swe_terminal`, `web`, `general`) or individual
sources (`--key apigen swe_gym`). `--limit N` caps the rows taken per subset for a quick dry run.

## 🧩 Canonical schema

A standard, **model-agnostic** OpenAI-style chat-with-tools schema (works with any model's chat template). Wire-compatible with [`voidful/gemma4-agent-sft`](https://huggingface.co/datasets/voidful/gemma4-agent-sft), so shards concatenate cleanly:

| field | type | description |
|---|---|---|
| `id` | str | `{source}_{config}_{hash16}` |
| `source` | str | normalized source key |
| `source_subset` | str | `config/split` within the source |
| `messages` | str (JSON) | `list[{role, content, tool_calls?, tool_responses?}]` |
| `tools` | str (JSON) | `list[{type:"function", function:{name, description, parameters}}]` |
| `tool_names` | list[str] | declared tool names |
| `quality` | str (JSON) | `{tier, score, curated, signals}` |
| `metadata` | str (JSON) | `{hf_id, license, dedup_group, instance_id, …}` |

- `tool_calls[].function.arguments` are **objects** (string-encoded args are parsed).
- Chain-of-thought (`…`, `reasoning_content`) and foreign chat-template markers are stripped.
- `parameters` is coerced to a JSON-schema `object` (xLAM/Hermes flat styles are wrapped; `str`/`int` → `string`/`integer`).
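
The first and third rules above can be sketched in a few lines. This is a minimal illustration, not the code in `agentds/normalizers.py`: the helper names `parse_arguments` and `coerce_parameters` are hypothetical, and the type-alias table covers only the two coercions named above.

```python
import json

# Hypothetical alias table: only the two coercions the list above mentions.
TYPE_ALIASES = {"str": "string", "int": "integer"}


def parse_arguments(arguments):
    """Return tool-call arguments as a dict, parsing string-encoded JSON."""
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    if not isinstance(arguments, dict):
        raise ValueError("tool-call arguments must decode to an object")
    return arguments


def _coerce_type(spec):
    """Copy one property spec, mapping Python-style type names to JSON-schema names."""
    spec = dict(spec)
    if isinstance(spec.get("type"), str):
        spec["type"] = TYPE_ALIASES.get(spec["type"], spec["type"])
    return spec


def coerce_parameters(params):
    """Wrap a flat {name: spec} mapping into a JSON-schema object."""
    if params.get("type") == "object" and "properties" in params:
        wrapped = dict(params)  # already an object schema: keep its other keys
        properties = params["properties"]
    else:
        wrapped = {"type": "object"}  # flat style: every key is a property
        properties = params
    wrapped["properties"] = {name: _coerce_type(spec) for name, spec in properties.items()}
    return wrapped
```

For example, `coerce_parameters({"reservation_id": {"type": "str"}})` yields an object schema whose `reservation_id` property has type `string`. A flat mapping that happens to contain a parameter literally named `properties` would need extra care in a real normalizer.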

Example normalized row:

```json
{
  "id": "apigen_mt_dataset_9009cd98a0542977",
  "source": "apigen_mt",
  "source_subset": "dataset/train",
  "messages": "[{\"role\":\"system\",\"content\":\"# Airline Agent Policy…\"},{\"role\":\"user\",\"content\":\"I'd like to cancel a reservation.\"},{\"role\":\"assistant\",\"content\":null,\"tool_calls\":[{\"function\":{\"name\":\"get_reservation_details\",\"arguments\":{\"reservation_id\":\"0U4NPP\"}}}]},{\"role\":\"tool\",\"tool_responses\":[{\"name\":\"get_reservation_details\",\"response\":{\"reservation_id\":\"0U4NPP\",\"status\":\"active\"}}]},{\"role\":\"assistant\",\"content\":\"Your reservation 0U4NPP is active, shall I cancel it?\"}]",
  "tools": "[{\"type\":\"function\",\"function\":{\"name\":\"get_reservation_details\",\"description\":\"…\",\"parameters\":{\"type\":\"object\",\"properties\":{\"reservation_id\":{\"type\":\"string\"}},\"required\":[\"reservation_id\"]}}}]",
  "tool_names": ["get_reservation_details", "cancel_reservation", "..."],
  "quality": "{\"tier\":\"high\",\"score\":0.9,\"curated\":false,\"signals\":{\"n_turns\":5,\"n_tool_calls\":1,\"multi_turn\":true,\"valid_arg_ratio\":1.0}}",
  "metadata": "{\"hf_id\":\"Salesforce/APIGen-MT-5k\",\"license\":\"cc-by-4.0\",\"dedup_group\":\"xlam_apigen\"}"
}
```

## 📚 Catalog

**[→ Full catalog of 60+ datasets, by tier, with normalization status (CATALOG.md)](CATALOG.md)**

| Tier | What it teaches | Wired sources (sample) |
|---|---|---|
| 🛠️ **function_calling** | when/how to call tools, schema grounding, when *not* to call | apigen(xLAM), glaive, toolace, when2call, hermes, hermes_reasoning, toolmind |
| 🧵 **agent_traces** | real `claude_code`/`pi` coding-agent sessions (HF `format: agent-traces`, decoded via `teich`) | DeepSeek-v4-Pro, synthtraces, qwen3.7-max-pi, minimax-m3, ml-intern |
| 💻 **swe_terminal** | SWE repair, shell/terminal, long-horizon coding (streamed + sampled) | swe_gym, swe_rebench, swe_zero, swe_smith, coderforge, nemotron-terminal |
| 🌐 **web** | observation→action loops (DOM/AXTree as text) | weblinx, mind2web, nnetnav |
| 💬 **general** | retention: keep natural-answer ability, avoid over-tool-calling | openhermes, smoltalk2 |

Adding a dataset = an entry in [`configs/registry.yaml`](configs/registry.yaml) (+ a normalizer if it's a new format). See [CONTRIBUTING.md](CONTRIBUTING.md).

## ⚙️ How it works

```
                configs/registry.yaml  (single source of truth)
                          │
HF source ──stream──▶ normalize ──▶ validate ──▶ group-dedup ──▶ quality ──▶ parquet shards ──▶ 🤗 push
(streaming=True,     (per-format)   (schema      (exact + SWE-   (tiers)
 never fully              │          well-formed) provenance +
 downloaded)              │                       MinHash near-dup,
                          │                       incl. vs reference)
               agentds/normalizers.py         agentds/dedup.py  agentds/quality.py
```

- **Normalizers** ([`agentds/normalizers.py`](agentds/normalizers.py)): one per format family, covering xLAM, ShareGPT (incl. Glaive `function_call`/`observation`), Hermes XML tags, ToolACE BFCL `[Func(k=v)]` (paren/space/path-style names), When2Call tagged calls plus appropriate-refusal rows, native OpenHands SWE trajectories, Nemotron terminal transcripts, WebLINX/Mind2Web/WebArena action grammars, and the HF `agent-traces` format. Tools are synthesized from observed calls when a source ships no schema.
- **Group-level dedup** ([`agentds/dedup.py`](agentds/dedup.py)): (1) exact xxhash of normalized content; (2) a **SWE-provenance** key, so the same GitHub issue across SWE-Zero/nebius/SWE-Gym/SWE-smith/CoderForge collapses to one (real `repo-NNNN` ids by issue number; synthetic ids at full granularity); (3) **MinHash + LSH** near-dup over assistant action/tool-schema shingles. The deduper is stateful across the whole run and can preload any reference dataset's hashes (`--dedup-against`).
- **Quality** ([`agentds/quality.py`](agentds/quality.py)): `{tier: high|medium|low, score, curated, signals}`; rewards multi-turn, schema-valid, observation-grounded tool use; folds in source success signals (SWE `resolved`, CoderForge `reward`); penalizes degenerate trajectories.
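
The three dedup stages can be illustrated with a stdlib-only sketch. It is not the code in `agentds/dedup.py`. It assumes SHA-1 in place of xxhash, a hand-rolled MinHash with LSH banding in place of `datasketch`, whole-text word shingles in place of assistant action/tool-schema shingles, and a caller-supplied `group_key` standing in for the SWE-provenance key. The `Deduper` class and its parameters are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-token word shingles; short texts yield a single shingle."""
    tokens = text.split()
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(items, num_perm=64):
    """MinHash signature: the minimum salted 64-bit hash per 'permutation'."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
            for item in items
        ))
    return tuple(signature)


class Deduper:
    """Stateful three-stage check: exact hash, group key, MinHash + LSH."""

    def __init__(self, num_perm=64, bands=16, threshold=0.8):
        self.num_perm = num_perm
        self.rows = num_perm // bands  # signature values per LSH band
        self.threshold = threshold     # estimated Jaccard needed to call a near-dup
        self.exact = set()
        self.groups = set()
        self.buckets = {}              # (band start, band values) -> signatures

    def check(self, text, group_key=None):
        """Return 'exact', 'group' or 'near' for a duplicate, None if kept."""
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.exact:
            return "exact"
        if group_key is not None and group_key in self.groups:
            return "group"
        sig = minhash(shingles(text), self.num_perm)
        bands = [(start, sig[start:start + self.rows])
                 for start in range(0, self.num_perm, self.rows)]
        for band in bands:
            for other in self.buckets.get(band, []):
                agree = sum(a == b for a, b in zip(sig, other)) / self.num_perm
                if agree >= self.threshold:
                    return "near"
        # Row is kept: remember it so later rows are compared against it.
        self.exact.add(digest)
        if group_key is not None:
            self.groups.add(group_key)
        for band in bands:
            self.buckets.setdefault(band, []).append(sig)
        return None
```

Keeping one `Deduper` instance alive for the whole run is what makes the check stateful across sources; preloading a reference dataset would amount to calling `check` on its rows first and discarding the results.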

## 🧪 Recommended training recipe

- **Stage A: agentic continued post-training** (10–30B tokens): SWE/terminal 55% · tool-use 20% · web 15% · general 10%.
- **Stage B: high-quality agent SFT** (1–3B tokens): filter to `quality.tier == "high"` + verified successes.
- **Stage C: RL / rejection sampling**: use executable/verified subsets (`reward==1`, `resolved`).
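
To turn the Stage A shares into concrete numbers, a small sketch like the following may help. The bucket labels are illustrative and need not match the dataset's tier names, and both helper functions are hypothetical rather than part of `agentds`.

```python
# Stage A shares from the recipe above; keys are illustrative labels.
STAGE_A_MIX = {"swe_terminal": 0.55, "tool_use": 0.20, "web": 0.15, "general": 0.10}


def token_budgets(total_tokens, mix):
    """Split a total token budget across buckets according to mixture shares."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mixture shares must sum to 1")
    return {name: round(total_tokens * share) for name, share in mix.items()}


def sampling_rates(available_tokens, budgets):
    """Per-bucket rate: below 1 means subsample, above 1 means repeat data."""
    return {name: budgets[name] / available_tokens[name] for name in budgets}
```

With a 10B-token budget, `token_budgets(10_000_000_000, STAGE_A_MIX)` assigns 5.5B tokens to the SWE/terminal bucket; comparing budgets against the tokens actually available shows which buckets need subsampling.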

Recommended loss mask:

```
system / user / tool-schema / tool-observation : 0
assistant natural language / final answer      : 1.0
assistant tool-call JSON                       : 1.5
assistant recovery-after-error action          : 2.0
```
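
One way to apply this mask is as per-token weights in a weighted mean of the token losses. The sketch below assumes the caller has already labelled each span of tokens with one of the segment kinds (detecting a "recovery-after-error" action is left out). The kind names and the normalization by the sum of weights are assumptions, not something this repo prescribes.

```python
# Weights from the table above; the segment-kind names are hypothetical labels.
LOSS_WEIGHTS = {
    "system": 0.0,
    "user": 0.0,
    "tool_schema": 0.0,
    "tool_observation": 0.0,
    "assistant_text": 1.0,
    "assistant_tool_call": 1.5,
    "assistant_recovery": 2.0,
}


def token_weights(segments):
    """Expand (kind, n_tokens) segments into one loss weight per token."""
    weights = []
    for kind, n_tokens in segments:
        weights.extend([LOSS_WEIGHTS[kind]] * n_tokens)
    return weights


def weighted_loss(token_losses, weights):
    """Weighted mean of per-token losses; weight-0 tokens do not contribute."""
    if len(token_losses) != len(weights):
        raise ValueError("need exactly one weight per token")
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(loss * w for loss, w in zip(token_losses, weights)) / total
```

In a real training loop the same weights would typically be a tensor multiplied into the unreduced cross-entropy before averaging.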

## 🤝 Contributing

PRs that **add datasets** or **wire up catalog-only entries** are the most valuable; see [CONTRIBUTING.md](CONTRIBUTING.md). The bar: a new source must normalize cleanly (`agentds validate` green) and declare a `dedup_group`.

## 🧪 Tests

```bash
.venv/bin/python -m tests.test_normalizers # offline, fixture-based
```

## 📄 License & citation

Code: [MIT](LICENSE). **Each dataset keeps its upstream license**, recorded in every row's `metadata.license`; review it before downstream use (sources span apache-2.0 / mit / cc-by-4.0 and restricted terms like cc-by-nc-sa-4.0).
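
Because `metadata` is a JSON string, filtering by license takes one decode per row. The sketch below is a minimal example: the allow-list is only an illustration of the license ids named above, not a recommendation, and the helper names are hypothetical.

```python
import json

# Example allow-list only; choose the set that fits your own use case.
ALLOWED_LICENSES = {"apache-2.0", "mit", "cc-by-4.0"}


def license_of(row):
    """Read the upstream license id from a row's JSON-encoded metadata."""
    return (json.loads(row["metadata"]).get("license") or "").lower()


def keep_allowed(rows, allowed=ALLOWED_LICENSES):
    """Keep only rows whose recorded license is in the allow-list."""
    return [row for row in rows if license_of(row) in allowed]


rows = [
    {"id": "a", "metadata": json.dumps({"license": "cc-by-4.0"})},
    {"id": "b", "metadata": json.dumps({"license": "cc-by-nc-sa-4.0"})},
    {"id": "c", "metadata": json.dumps({"hf_id": "example/no-license"})},
]
kept_ids = [row["id"] for row in keep_allowed(rows)]
```

On a loaded `datasets.Dataset`, the same predicate can be passed to `ds.filter(lambda r: license_of(r) in ALLOWED_LICENSES)`. Rows with no recorded license are dropped here, which is the conservative choice.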

```bibtex
@misc{awesome-agent-dataset,
title = {awesome-agent-dataset: a catalog and normalization toolkit for agent-training data},
author = {voidful},
year = {2026},
url = {https://github.com/voidful/awesome-agent-dataset}
}
```

## 🙏 Acknowledgements

Built on the open datasets catalogued here and the HuggingFace `datasets` / `teich` / `datasketch` ecosystems. Schema is compatible with [`voidful/gemma4-agent-sft`](https://huggingface.co/datasets/voidful/gemma4-agent-sft).