{"id":49070120,"url":"https://github.com/thu-ml/embodied-data-toolkit","last_synced_at":"2026-04-20T07:04:09.522Z","repository":{"id":340378758,"uuid":"1135407952","full_name":"thu-ml/embodied-data-toolkit","owner":"thu-ml","description":"A toolkit for processing raw embodied data into standardized formats and converting between embodied dataset schemas.","archived":false,"fork":false,"pushed_at":"2026-02-24T14:42:49.000Z","size":232,"stargazers_count":15,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-24T19:04:20.339Z","etag":null,"topics":["data-processing","embodied-ai","format-converter-tool"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-ml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-16T03:49:58.000Z","updated_at":"2026-02-24T14:45:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/thu-ml/embodied-data-toolkit","commit_stats":null,"previous_names":["thu-ml/embodied-data-toolkit"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/thu-ml/embodied-data-toolkit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fembodied-data-toolkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fembodied-data-toolkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/thu-ml%2Fembodied-data-toolkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fembodied-data-toolkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thu-ml","download_url":"https://codeload.github.com/thu-ml/embodied-data-toolkit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-ml%2Fembodied-data-toolkit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32036803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-processing","embodied-ai","format-converter-tool"],"created_at":"2026-04-20T07:04:08.569Z","updated_at":"2026-04-20T07:04:09.507Z","avatar_url":"https://github.com/thu-ml.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Embodied Data Toolkit\n\n![Status](https://img.shields.io/badge/Status-Active-success)\n![Python](https://img.shields.io/badge/Python-3.10%2B-blue)\n\n**Embodied Data Toolkit** is an end-to-end framework designed for Embodied AI and robotics learning. 
It provides a complete solution from raw data ingestion and format conversion to high-level trajectory processing.\n\nThe toolkit consists of two core components:\n1.  **Unified Data Converter**: A configuration-driven engine that transforms heterogeneous raw data (HDF5, PyTorch tensors, JSON, MP4, etc.) into a designated target format.\n2.  **Process Pipeline**: A modular workflow manager for trajectory processing (Trimming, Captioning, Concatenation) with built-in checkpointing.\n\n---\n\n## 🏗️ Architecture\n\nThe framework adopts a layered processing architecture to ensure high throughput and reliability.\n\n![Architecture Diagram](./assets/process_pipeline.png)\n\n### 1. Unified Data Converter (Format Engine)\n- **No-Code Mapping**: Define source-to-target mappings in JSON configs without writing code.\n- **Protocol Support**: Native support for `src://` (source root) and `dest://` (target root) protocols.\n- **Multimedia Expert**: Extract compressed videos from HDF5, merge tensors, and handle multi-modal data.\n- **Advanced Aggregation**: Capable of querying and aggregating data across logical levels (e.g., gathering all episodes for a task summary).\n\n### 2. Process Pipeline (Workflow Manager)\n- **Multi-level Concurrency**: Parallel processing at Episode, Task, and Dataset levels using `multiprocessing`.\n- **Resumable Execution**: Crash-safe processing using Redis and local `.status.json` files to track progress.\n- **Pluggable Steps**: Built-in processors for **Validation**, **Structure**, **Concat**, **Caption**, and **Trim**.\n\n---\n\n## 🚀 Quick Start\n\n### 1. Installation\n```bash\ngit clone https://github.com/thu-ml/embodied-data-toolkit.git\ncd embodied-data-toolkit\n\nconda create -n embodied-data-toolkit python==3.10\nconda activate embodied-data-toolkit\n\npip install -r requirements.txt\n# Ensure system-level ffmpeg is installed\n# sudo apt install ffmpeg\n```\n\n### 2. Component Usage\n\n#### A. 
Data Conversion\nConvert raw datasets to a standard structure using a JSON config (define the config file for your dataset first):\n```bash\npython unified_data_converter/run_conversion.py \\\n    --config unified_data_converter/configs/my_config.json \\\n    --src_root /path/to/raw_data \\\n    --dest_root /path/to/standard_data \\\n    --workers 16\n```\n\nor\n\n```bash\nbash scripts/run_conversion.sh\n```\n\n\n#### B. Trajectory Processing\nRun the high-level processing pipeline (Trimming, Captioning, etc.). Edit `config.yaml` and add processors to adapt the pipeline to your own workflow:\n```bash\npython process_pipeline/process_pipeline.py \\\n  --config process_pipeline/configs/config.yaml\n```\n\nor\n\n```bash\nbash scripts/run_process_pipeline.sh\n```\n\n---\n\n## 📂 Data Formats\n\n### 1. Process Pipeline Input Format\n\n```text\ndataset_root/\n├── folder_1/\n│   └── ...\n└── folder_n/\n    └── {task_name}/\n        ├── episode_0/\n        │   ├── episode_0_cam_front.mp4      # Front view (Deprecated)\n        │   ├── episode_0_cam_high.mp4       # High-angle view\n        │   ├── episode_0_cam_left_wrist.mp4  # Left wrist camera\n        │   ├── episode_0_cam_right_wrist.mp4 # Right wrist camera\n        │   ├── episode_0_qpos.pt            # Joint positions (T, 14)\n        │   └── episode_0_tts.mp4            # Audio/TTS (Optional)\n        ├── episode_1/\n        │   └── ...\n        └── ...\n```\n\n### 2. 
Standard Data Format (Pipeline Output)\nThe **Process Pipeline** takes the input above, applies trimming, concatenation, and captioning, and generates the final standardized structure for training.\n\n```text\n{task_name}/\n├── task_meta.json           # Global metadata and task-level instructions\n└── episode_{id}/            # Individual episode directory\n    ├── video.mp4            # Main/Merged video (result of Concat processor: cam_high.mp4 on top, cam_left_wrist.mp4 at bottom left and cam_right_wrist.mp4 at bottom right, each resized to half the height and width of cam_high.mp4)\n    ├── qpos.pt              # Joint positions and gripper states (torch.Tensor)\n    ├── endpose.pt           # End-effector Cartesian poses (Optional, torch.Tensor)\n    ├── instructions.json    # Language metadata (total_frames, instructions, segments)\n    ├── umt5_wan/            # (Optional; an example of adding extra information to the data format) Language embeddings (UMT5/Wan2.2)\n    └── raw_video/           # Original camera views\n        ├── cam_high.mp4     # Fixed high-angle view (e.g., top/rear)\n        ├── cam_left_wrist.mp4\n        ├── cam_right_wrist.mp4\n        └── cam_front.mp4    # (Optional) Front/Side view\n```\n\n### Key Data Specifications\n- **Tensors**: `.pt` files are expected to be saved via `torch.save()`.\n- **Videos**: `.mp4` files should ideally be H.264 encoded for maximum compatibility.\n- **Instructions**: `instructions.json` should contain at least a top-level `instructions` list of strings and a `sub_instructions` list of frame-level sub-instructions.\n\n```json\n{\n  \"instructions\": [\"aaa\", \"bbb\", \"ccc\"],\n  \"sub_instructions\": [\n    {\"start_frame\": 0, \"end_frame\": 150, \"instruction\": [\"aaa\"]},\n    {\"start_frame\": 150, \"end_frame\": 340, \"instruction\": [\"bbb\", \"ccc\"]}\n  ]\n}\n```\n\n---\n\n## 🛠️ Redis Management\n\nThe **Process Pipeline** uses Redis to maintain a global state for breakpoint 
resumption (checkpointing).\n\n### Installation (Linux/Ubuntu)\n```bash\nsudo apt update\nsudo apt install redis-server\n```\n\n### Starting Redis\n- **As a System Service (Recommended)**:\n  ```bash\n  sudo systemctl start redis-server\n  # Enable auto-start on boot\n  sudo systemctl enable redis-server\n  ```\n- **Manually in Background**:\n  ```bash\n  redis-server --daemonize yes\n  ```\n\n### Stopping Redis\n- **As a System Service**:\n  ```bash\n  sudo systemctl stop redis-server\n  ```\n- **Manually**:\n  ```bash\n  redis-cli shutdown\n  ```\n\n### Checking Status\n```bash\nredis-cli ping\n# Should return \"PONG\"\n```\n\n---\n\n## 🧩 Processors Detail (partial)\n\n| Component | Processor | Description | Key Parameters |\n| :--- | :--- | :--- | :--- |\n| **Pipeline** | **Validation** | Verifies data integrity and compliance | `perform: true` |\n| **Pipeline** | **Structure** | Restructures directory hierarchy | `fast_video_copy` |\n| **Pipeline** | **Concat** | Merges multi-view videos (Top/Left/Right) | `fps` |\n| **Pipeline** | **Caption** | Generates text descriptions (GPT/VLM) | `api_key`, `system_prompt` |\n| **Pipeline** | **Trim** | Trims static frames based on movement | `threshold`, `video_trim_mode` |\n| **Converter** | **copy** | Simple file copy | `source` |\n| **Converter** | **hdf5_extractor** | Extract data from HDF5 files | `source_h5`, `fields` |\n| **Converter** | **json_transformer** | Transform JSON structure | `template` |\n\n---\n\n## 📂 Project Structure\n\n```text\n.\n├── unified_data_converter/   # Format conversion engine\n│   ├── configs/              # JSON conversion rules\n│   ├── core/                 # Resolver, Planner, Context\n│   ├── processors/           # HDF5, Video, JSON converters\n│   └── run_conversion.py     # Entry point\n├── process_pipeline/         # Workflow \u0026 Trajectory manager\n│   ├── configs/              # Pipeline YAML configs\n│   ├── core/                 # Pipeline \u0026 Runners (Episode/Task)\n│  
 ├── processors/           # Trim, Caption, Concat steps\n│   └── process_pipeline.py   # Entry point\n├── utils/                    # Shared IO, Video, and Tensor utilities\n└── README.md\n```\n\n---\n\n## ⚠️ Notes\n\n1.  **Concurrency**: Both components support the `--workers` or `workers` config to adjust CPU usage.\n2.  **Trim Mode**: \n    - `ffmpeg`: High quality, slow.\n    - `fast` (OpenCV): High speed (8-10x), larger files.\n3.  **HDF5 Dependencies**: If using `hdf5_extractor`, ensure `h5py` is installed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2Fembodied-data-toolkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-ml%2Fembodied-data-toolkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-ml%2Fembodied-data-toolkit/lists"}