{"id":44097944,"url":"https://github.com/liuxiaotong/data-recipe","last_synced_at":"2026-02-08T13:13:56.749Z","repository":{"id":335429645,"uuid":"1145689082","full_name":"liuxiaotong/data-recipe","owner":"liuxiaotong","description":"Reverse-engineering framework for AI datasets — extract annotation specs, cost models \u0026 reproducibility from samples or requirement docs.","archived":false,"fork":false,"pushed_at":"2026-02-07T06:19:50.000Z","size":1084,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-07T13:56:16.146Z","etag":null,"topics":["ai-agent","ai-data-pipeline","annotation-spec","cost-estimation","dataset-analysis","huggingface","llm","mcp","python","reverse-engineering","training-data","workflow-automation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liuxiaotong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-30T04:59:59.000Z","updated_at":"2026-02-07T06:19:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/liuxiaotong/data-recipe","commit_stats":null,"previous_names":["liuxiaotong/data-recipe"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/liuxiaotong/data-recipe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liuxiaotong%2Fdata-recipe","tags_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/liuxiaotong%2Fdata-recipe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liuxiaotong%2Fdata-recipe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liuxiaotong%2Fdata-recipe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liuxiaotong","download_url":"https://codeload.github.com/liuxiaotong/data-recipe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liuxiaotong%2Fdata-recipe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29231139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-08T13:10:22.947Z","status":"ssl_error","status_checked_at":"2026-02-08T13:08:18.779Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","ai-data-pipeline","annotation-spec","cost-estimation","dataset-analysis","huggingface","llm","mcp","python","reverse-engineering","training-data","workflow-automation"],"created_at":"2026-02-08T13:13:53.709Z","updated_at":"2026-02-08T13:13:56.741Z","avatar_url":"https://github.com/liuxiaotong.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# DataRecipe\n\n**AI 
Dataset Reverse-Engineering Framework**\n\n[![PyPI](https://img.shields.io/pypi/v/knowlyr-datarecipe?color=blue\u0026v=3)](https://pypi.org/project/knowlyr-datarecipe/)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%E2%80%933.13-blue.svg)](https://www.python.org/downloads/)\n[![Tests](https://img.shields.io/badge/tests-3399_passed-brightgreen.svg)](#开发)\n[![Coverage](https://img.shields.io/badge/coverage-97%25-brightgreen.svg)](#开发)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![MCP](https://img.shields.io/badge/MCP-10_Tools-purple.svg)](#mcp-server)\n\n[Quick Start](#快速开始) · [LLM Enhancement](#llm-增强层) · [Requirements Document Analysis](#需求文档分析) · [MCP Server](#mcp-server) · [Data Pipeline Ecosystem](#data-pipeline-生态)\n\n\u003c/div\u003e\n\n---\n\nDataRecipe automatically extracts the construction paradigm from dataset samples or requirements documents and generates **23+ production-grade documents** covering decision-making, project management, annotation specifications, and cost analysis end to end.\n\n```\nDataset / requirements document → Reverse analysis → [LLM enhancement layer] → 23+ structured documents (human-readable + machine-parseable)\n```\n\n### Who Uses It\n\n| Role | Directory of interest | What they get |\n|------|---------|---------|\n| Decision makers | `01_决策参考/` | Value scoring, ROI analysis, competitive positioning |\n| Project managers | `02_项目管理/` | Milestones, acceptance criteria, risk management |\n| Annotation team | `03_标注规范/` | Annotation guide, training manual, QA checklist |\n| Engineering team | `04_复刻指南/` | Production SOP, data schema, reproduction strategy |\n| Finance | `05_成本分析/` | Phased costs, human/machine split |\n| AI Agent | `08_AI_Agent/` | Structured context, executable pipeline |\n\n## Installation\n\n```bash\npip install knowlyr-datarecipe\n\n# Optional dependencies\npip install knowlyr-datarecipe[llm]      # LLM analysis (Anthropic/OpenAI)\npip install knowlyr-datarecipe[pdf]      # PDF parsing\npip install knowlyr-datarecipe[mcp]      # MCP server\npip install knowlyr-datarecipe[all]      # Everything\n```\n\n## Quick Start\n\n### Analyze a HuggingFace Dataset\n\n```bash\n# Basic analysis (fully local, no API key required)\nknowlyr-datarecipe deep-analyze tencent/CL-bench\n\n# Enable LLM enhancement (when run inside Claude Code/App, the host LLM is used automatically)\nknowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm\n\n# Use the API when running standalone\nknowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm --enhance-mode api\n```\n\n### Analyze a Requirements Document\n\n```bash\n# API mode (requires ANTHROPIC_API_KEY)\nknowlyr-datarecipe analyze-spec requirements.pdf\n\n# Interactive mode (for use in Claude Code, no API key required)\nknowlyr-datarecipe analyze-spec requirements.pdf 
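--interactive\n\n# Load from a precomputed JSON file\nknowlyr-datarecipe analyze-spec requirements.pdf --from-json analysis.json\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eSample output (deep-analyze)\u003c/summary\u003e\n\n```\n============================================================\n  DataRecipe 深度逆向分析\n============================================================\n\n数据集: tencent/CL-bench\n✓ 加载完成: 300 样本\n✓ 评分标准: 4120 条, 2412 种模式\n✓ Prompt模板: 293 个\n✓ 人机分配: 人工 84%, 机器 16%\n✓ LLM 增强完成\n\n输出目录: ./projects/tencent_CL-bench/\n生成文件: 29 个\n  📄 01_决策参考/EXECUTIVE_SUMMARY.md\n  📋 02_项目管理/MILESTONE_PLAN.md\n  📝 03_标注规范/ANNOTATION_SPEC.md\n  ...\n```\n\n\u003c/details\u003e\n\n---\n\n## LLM Enhancement Layer\n\nThe core innovation is an **LLM enhancement layer** inserted between analysis and generation: a single call produces a context-rich `EnhancedContext` object, which every document generator then consumes.\n\n```\nLocal analysis results → [LLM enhancement: 1 call] → EnhancedContext → Generators → High-quality documents\n```\n\n### Three Run Modes\n\n| Mode | Scenario | Description |\n|------|------|------|\n| `auto` (default) | Auto-detect | Uses the API if an API key is available, otherwise falls back to interactive mode |\n| `interactive` | Claude Code / Claude App | Emits the prompt for the host LLM to process directly |\n| `api` | Standalone | Calls the Anthropic / OpenAI API |\n\n### Enhancement Comparison\n\n| Document | Without LLM | With LLM |\n|------|--------|--------|\n| **EXECUTIVE_SUMMARY** | Generic placeholders such as \"场景A/B/C\" (Scenario A/B/C) | Concrete ROI figures, targeted risks, competitive positioning |\n| **ANNOTATION_SPEC** | Templated specification | Domain-specific annotation guidance, common mistakes, per-sample analysis |\n| **REPRODUCTION_GUIDE** | Nearly empty | Complete reproduction strategy, team setup, risk matrix |\n| **MILESTONE_PLAN** | Boilerplate risks | Concrete per-phase risks + mitigations |\n| **ANALYSIS_REPORT** | Nearly empty | Methodology insights, competitive analysis, domain recommendations |\n\n### Two-Step MCP Enhancement (Recommended)\n\nWhen invoked through the MCP Server, the Claude agent itself acts as the LLM that processes the enhancement prompt, so no API key is needed:\n\n```\n1. Claude calls analyze_huggingface_dataset(\"tencent/CL-bench\")\n   → returns the analysis results + enhancement_prompt\n\n2. Claude processes enhancement_prompt and produces the enhanced JSON\n\n3. 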
Claude calls enhance_analysis_reports(output_dir, enhanced_context)\n   → reports go from template placeholders to specific, targeted analysis\n```\n\n### Programmatic Interface\n\nIn LLM environments such as Claude Code, you can also integrate via the `get_prompt()` + `enhance_from_response()` pattern:\n\n```python\nfrom datarecipe.generators.llm_enhancer import LLMEnhancer\n\nenhancer = LLMEnhancer(mode=\"auto\")\n\n# Get the enhancement prompt (hand it to the host LLM)\nprompt = enhancer.get_prompt(dataset_id=\"my/dataset\", dataset_type=\"evaluation\", ...)\n\n# Parse the JSON returned by the LLM\nctx = enhancer.enhance_from_response(llm_json_response)\n\n# Or load from cache\nctx = enhancer.enhance_from_json(\"enhanced_context.json\")\n```\n\n`EnhancedContext` contains 14 enhancement fields, including purpose summary, methodology insights, reproduction strategy, ROI scenarios, risk assessment, domain-specific annotation guidance, quality pitfalls, sample analysis, and team recommendations.\n\n---\n\n## Output Structure\n\nAll commands (`deep-analyze`, `analyze-spec`, `deploy`, `integrate-report`) write their output under `projects/`, with one project folder per dataset:\n\n```\nprojects/{dataset_name}/\n├── README.md                        # Auto-generated navigation hub\n├── recipe_summary.json              # Core summary (Radar-compatible)\n├── .project_manifest.json           # Records executed commands and timestamps\n│\n├── 01_决策参考/                      # deep-analyze / analyze-spec\n│   └── EXECUTIVE_SUMMARY.md         # Scoring + ROI + risks + competitive positioning\n├── 02_项目管理/                      # deep-analyze / analyze-spec\n│   ├── MILESTONE_PLAN.md            # Milestones + acceptance criteria + risk management\n│   └── INDUSTRY_BENCHMARK.md        # Industry benchmark comparison\n├── 03_标注规范/                      # deep-analyze / analyze-spec\n│   ├── ANNOTATION_SPEC.md           # Annotation spec + domain guidance\n│   ├── TRAINING_GUIDE.md            # Annotator training manual\n│   └── QA_CHECKLIST.md              # QA checklist\n├── 04_复刻指南/                      # deep-analyze / analyze-spec\n│   ├── REPRODUCTION_GUIDE.md        # Reproduction strategy + team setup\n│   ├── PRODUCTION_SOP.md            # Production SOP\n│   ├── ANALYSIS_REPORT.md           # Analysis report\n│   └── DATA_SCHEMA.json             # Data format definition\n├── 05_成本分析/                      # deep-analyze / analyze-spec\n│   └── COST_BREAKDOWN.md            # Phased cost breakdown\n├── 06_原始数据/                      # deep-analyze / analyze-spec\n│   ├── enhanced_context.json        # LLM-enhanced context (reusable)\n│   └── *.json            
           # Raw analysis data\n├── 07_模板/                          # analyze-spec\n│   └── data_template.json           # Data entry template\n├── 08_AI_Agent/                      # deep-analyze / analyze-spec\n│   ├── agent_context.json           # Aggregated context entry point\n│   ├── workflow_state.json          # Workflow state\n│   ├── reasoning_traces.json        # Reasoning traces\n│   └── pipeline.yaml                # Executable pipeline\n├── 09_样例数据/                      # analyze-spec\n│   ├── samples.json                 # Sample data (up to 50 records)\n│   └── SAMPLE_GUIDE.md              # Sample guide + automated evaluation\n├── 10_生产部署/                      # deploy\n│   ├── recipe.yaml                  # Data recipe\n│   ├── annotation_guide.md          # Annotation guide\n│   ├── quality_rules.yaml/.md       # QA rules\n│   ├── acceptance_criteria.yaml/.md # Acceptance criteria\n│   ├── timeline.md                  # Project timeline\n│   └── scripts/                     # Automation scripts\n└── 11_综合报告/                      # integrate-report\n    └── weekly_report_*.md           # Combined Radar + Recipe report\n```\n\n### Dual-Format Output\n\nAll documents are generated in both human-readable (Markdown) and machine-parseable (JSON/YAML) formats:\n\n| Human document | Machine file | Purpose |\n|---------|---------|------|\n| `EXECUTIVE_SUMMARY.md` | `reasoning_traces.json` | Decision rationale + reasoning traces |\n| `MILESTONE_PLAN.md` | `workflow_state.json` | Progress status + blockers |\n| `PRODUCTION_SOP.md` | `pipeline.yaml` | Executable workflow |\n\n---\n\n## Requirements Document Analysis\n\nGenerate the full set of project documents directly from a PDF / Word / image requirements document, with no existing dataset needed.\n\n**Supported formats**: PDF (`.pdf`), Word (`.docx`), images (`.png`, `.jpg`), text (`.txt`, `.md`)\n\n**Smart difficulty validation**: When a document contains a difficulty requirement (e.g. \"run doubao1.8 3 times, at most 1 correct\"), the validation configuration is extracted automatically and `DIFFICULTY_VALIDATION.md` is generated.\n\n---\n\n## MCP Server\n\nUse it directly in Claude Desktop / Claude Code; 10 tools cover the complete workflow.\n\n```json\n{\n  \"mcpServers\": {\n    \"knowlyr-datarecipe\": {\n      \"command\": \"uv\",\n      \"args\": [\"--directory\", \"/path/to/data-recipe\", \"run\", \"knowlyr-datarecipe-mcp\"]\n    }\n  }\n}\n```\n\n| Tool | Function |\n|------|------|\n| `parse_spec_document` | Parse a requirements document and return the extraction prompt |\n| `generate_spec_output` | Generate 23+ project documents |\n| `analyze_huggingface_dataset` | Deep-analyze an HF dataset and return enhancement_prompt |\n| 
`enhance_analysis_reports` | Apply the LLM-enhanced content and regenerate high-quality reports |\n| `get_extraction_prompt` | Get the LLM extraction template |\n| `extract_rubrics` | Extract scoring rubrics |\n| `extract_prompts` | Extract prompt templates |\n| `compare_datasets` | Compare multiple datasets |\n| `profile_dataset` | Dataset profile + cost estimate |\n| `get_agent_context` | Get the AI Agent context |\n\n---\n\n## Data Pipeline Ecosystem\n\nDataRecipe is the analysis engine of the Data Pipeline ecosystem and works alongside the annotation, synthesis, and quality-check tools:\n\n```mermaid\ngraph LR\n    Radar[\"🔍 Radar\u003cbr/\u003eIntelligence discovery\"] --\u003e Recipe[\"📋 Recipe\u003cbr/\u003eReverse analysis\"]\n    Recipe --\u003e Synth[\"🔄 Synth\u003cbr/\u003eData synthesis\"]\n    Recipe --\u003e Label[\"🏷️ Label\u003cbr/\u003eData annotation\"]\n    Synth --\u003e Check[\"✅ Check\u003cbr/\u003eData quality checks\"]\n    Label --\u003e Check\n    Check --\u003e Hub[\"🎯 Hub\u003cbr/\u003eOrchestration layer\"]\n    Hub --\u003e Sandbox[\"📦 Sandbox\u003cbr/\u003eExecution sandbox\"]\n    Sandbox --\u003e Recorder[\"📹 Recorder\u003cbr/\u003eTrajectory recording\"]\n    Recorder --\u003e Reward[\"⭐ Reward\u003cbr/\u003eProcess scoring\"]\n    style Recipe fill:#0969da,color:#fff,stroke:#0969da\n```\n\n| Layer | Project | Description | Repository |\n|---|---|---|---|\n| Intelligence | **AI Dataset Radar** | Competitive intelligence on datasets, trend analysis | [GitHub](https://github.com/liuxiaotong/ai-dataset-radar) |\n| Analysis | **DataRecipe** | Reverse analysis, schema extraction, cost estimation | You are here |\n| Production | **DataSynth** | Batch LLM synthesis, seed data expansion | [GitHub](https://github.com/liuxiaotong/data-synth) |\n| Production | **DataLabel** | Lightweight annotation tool, multi-annotator merging | [GitHub](https://github.com/liuxiaotong/data-label) |\n| Quality | **DataCheck** | Rule validation, duplicate detection, distribution analysis | [GitHub](https://github.com/liuxiaotong/data-check) |\n| Agent | **AgentSandbox** | Docker execution sandbox, trajectory replay | [GitHub](https://github.com/liuxiaotong/agent-sandbox) |\n| Agent | **AgentRecorder** | Standardized trajectory recording, multi-framework adapters | [GitHub](https://github.com/liuxiaotong/agent-recorder) |\n| Agent | **AgentReward** | Process-level reward, multi-dimensional rubric evaluation | [GitHub](https://github.com/liuxiaotong/agent-reward) |\n| Orchestration | **TrajectoryHub** | Pipeline orchestration, dataset export | [GitHub](https://github.com/liuxiaotong/agent-trajectory-hub) |\n\n```bash\n# End-to-end workflow\nknowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm    
  # Analyze\nknowlyr-datalabel generate ./projects/tencent_CL-bench/          # Annotate\nknowlyr-datasynth generate ./projects/tencent_CL-bench/ -n 1000  # Synthesize\nknowlyr-datacheck validate ./projects/tencent_CL-bench/          # Quality check\n```\n\n---\n\n## Command Reference\n\n| Command | Function |\n|------|------|\n| `deep-analyze \u003cdataset\u003e` | Deep-analyze an HF dataset |\n| `deep-analyze \u003cdataset\u003e --use-llm` | Enable LLM enhancement |\n| `deep-analyze \u003cdataset\u003e --enhance-mode api` | Select the enhancement mode |\n| `analyze-spec \u003cfile\u003e` | Analyze a requirements document (API mode) |\n| `analyze-spec \u003cfile\u003e --interactive` | Interactive mode (Claude Code) |\n| `analyze-spec \u003cfile\u003e --from-json` | Load the analysis from JSON |\n| `analyze \u003cdataset\u003e` | Quick analysis |\n| `profile \u003cdataset\u003e` | Annotator profile + cost estimate |\n| `extract-rubrics \u003cdataset\u003e` | Extract scoring rubrics |\n| `deploy \u003cdataset\u003e` | Generate production deployment configuration |\n| `integrate-report` | Generate a combined Radar + Recipe report |\n| `batch-from-radar \u003creport\u003e` | Batch-analyze from a Radar report |\n\n---\n\n## Project Architecture\n\n```\nsrc/datarecipe/\n├── core/\n│   ├── deep_analyzer.py            # Deep analysis engine (6-stage pipeline)\n│   └── project_layout.py           # Unified output directory layout\n├── analyzers/\n│   ├── spec_analyzer.py            # Requirements document analysis (LLM extraction)\n│   ├── context_strategy.py         # Context strategy detection\n│   └── llm_dataset_analyzer.py     # LLM-based dataset analysis\n├── generators/\n│   ├── llm_enhancer.py             # LLM enhancement layer (EnhancedContext)\n│   ├── spec_output.py              # Requirements document output (23+ files)\n│   ├── executive_summary.py        # Executive summary generator\n│   ├── annotation_spec.py          # Annotation spec generator\n│   ├── milestone_plan.py           # Milestone plan generator\n│   ├── enhanced_guide.py           # Enhanced production guide\n│   ├── human_machine_split.py      # Human/machine split analysis\n│   ├── industry_benchmark.py       # Industry benchmark comparison\n│   └── pattern_generator.py        # Pattern generator\n├── parsers/\n│   └── document_parser.py          # PDF / Word / image parsing\n├── extractors/\n│   ├── rubrics_analyzer.py         # Rubric extraction\n│   └── prompt_extractor.py         # Prompt template extraction\n├── cost/\n│   ├── token_analyzer.py           # Precise token analysis\n│   ├── 
phased_model.py             # Phased cost model\n│   ├── calibrator.py               # Cost calibrator\n│   └── complexity_analyzer.py      # Complexity analysis\n├── knowledge/\n│   ├── knowledge_base.py           # Knowledge base (pattern accumulation)\n│   └── dataset_catalog.py          # Dataset catalog\n├── integrations/\n│   └── radar.py                    # AI Dataset Radar integration\n├── cache/\n│   └── analysis_cache.py           # Analysis cache\n├── constants.py                    # Global constants\n├── schema.py                       # Data models (Recipe / DataRecipe)\n├── task_profiles.py                # Task type registry (5 built-in types)\n├── cost_calculator.py              # Cost calculator\n├── comparator.py                   # Dataset comparison\n├── profiler.py                     # Annotator profiling\n├── workflow.py                     # Production workflow generation\n├── quality_metrics.py              # Quality evaluation metrics\n├── pipeline.py                     # Multi-stage pipeline templates\n├── mcp_server.py                   # MCP Server (10 tools)\n└── cli/                            # CLI command package\n    ├── __init__.py                 # Command registration\n    ├── _helpers.py                 # Shared helper functions\n    ├── analyze.py                  # analyze, show, export, guide\n    ├── deep.py                     # deep-analyze\n    ├── spec.py                     # analyze-spec\n    ├── batch.py                    # batch, batch-from-radar, integrate-report\n    ├── tools.py                    # cost, quality, deploy, workflow, etc.\n    └── infra.py                    # watch, cache, knowledge\n```\n\n---\n\n## Development\n\n```bash\n# Install development dependencies\nmake install\n\n# Run the tests (3399 cases)\nmake test\n\n# View test coverage (97%+)\nmake cov\n\n# Lint + format code\nmake lint\nmake format\n\n# Install pre-commit hooks\nmake hooks\n```\n\n**Test coverage**: 35+ test files, 3399 test cases, 97% statement coverage.\n\n**CI**: GitHub Actions on Python 3.10 / 3.11 / 3.12 / 3.13 with an 80% coverage threshold. Pushing a tag automatically publishes to PyPI and creates a GitHub Release.\n\n**Pre-commit**: ruff lint + format, trailing-whitespace, check-yaml, check-added-large-files.\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for details.\n\n---\n\n## License\n\n[MIT](LICENSE)\n\n---\n\n## AI Data Pipeline Ecosystem\n\n\u003e Nine tools cover the full AI data engineering workflow; all support 
CLI + MCP and can be used standalone or combined into a pipeline.\n\n| Tool | Description | Link |\n|------|-------------|------|\n| **AI Dataset Radar** | Competitive intelligence for AI training datasets | [GitHub](https://github.com/liuxiaotong/ai-dataset-radar) |\n| **DataRecipe** | Reverse-engineer datasets into annotation specs \u0026 cost models | You are here |\n| **DataSynth** | Seed-to-scale synthetic data generation | [GitHub](https://github.com/liuxiaotong/data-synth) |\n| **DataLabel** | Lightweight, serverless HTML labeling tool | [GitHub](https://github.com/liuxiaotong/data-label) |\n| **DataCheck** | Automated quality checks \u0026 anomaly detection | [GitHub](https://github.com/liuxiaotong/data-check) |\n| **AgentSandbox** | Reproducible Docker sandbox for Code Agent execution | [GitHub](https://github.com/liuxiaotong/agent-sandbox) |\n| **AgentRecorder** | Standardized trajectory recording for Code Agents | [GitHub](https://github.com/liuxiaotong/agent-recorder) |\n| **AgentReward** | Process-level rubric-based reward engine | [GitHub](https://github.com/liuxiaotong/agent-reward) |\n| **TrajectoryHub** | Pipeline orchestrator for Agent trajectory data | [GitHub](https://github.com/liuxiaotong/agent-trajectory-hub) |\n\n```mermaid\ngraph LR\n    A[Radar] --\u003e B[Recipe] --\u003e C[Synth] --\u003e E[Check] --\u003e F[Hub]\n    B --\u003e D[Label] --\u003e E\n    F --\u003e G[Sandbox] --\u003e H[Recorder] --\u003e I[Reward]\n```\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\u003csub\u003eA reusable reverse-engineering methodology for data engineering teams, annotation service providers, and AI dataset researchers\u003c/sub\u003e\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliuxiaotong%2Fdata-recipe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliuxiaotong%2Fdata-recipe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliuxiaotong%2Fdata-recipe/lists"}