{"id":50963379,"url":"https://github.com/duoan/mega-data-factory","last_synced_at":"2026-06-18T17:03:10.637Z","repository":{"id":333259247,"uuid":"1136540501","full_name":"duoan/mega-data-factory","owner":"duoan","description":"🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models","archived":false,"fork":false,"pushed_at":"2026-03-10T05:30:13.000Z","size":9282,"stargazers_count":353,"open_issues_count":10,"forks_count":44,"subscribers_count":29,"default_branch":"master","last_synced_at":"2026-03-10T13:46:41.127Z","etag":null,"topics":["data-centric-ai","data-curation","data-quality","datapipeline","datapipelines","deeplearning","foundation-models","image-editing","image-generation","llm","machine-learning","mllm","multimodal","ray","rust","video-generation","vlm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/duoan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-17T21:51:33.000Z","updated_at":"2026-03-10T05:30:16.000Z","dependencies_parsed_at":"2026-02-17T06:30:48.265Z","dependency_job_id":null,"html_url":"https://github.com/duoan/mega-data-factory","commit_stats":null,"previous_names":["duoan/datapipeline_z_image","duoan/webscale-multimodal-datapipeline","duoan/mega-data-factory"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/duoan/mega-data-factory","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/duoan%2Fmega-data-factory","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duoan%2Fmega-data-factory/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duoan%2Fmega-data-factory/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duoan%2Fmega-data-factory/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/duoan","download_url":"https://codeload.github.com/duoan/mega-data-factory/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/duoan%2Fmega-data-factory/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34499413,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-18T02:00:06.871Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-centric-ai","data-curation","data-quality","datapipeline","datapipelines","deeplearning","foundation-models","image-editing","image-generation","llm","machine-learning","mllm","multimodal","ray","rust","video-generation","vlm"],"created_at":"2026-06-18T17:03:04.751Z","updated_at":"2026-06-18T17:03:10.610Z","avatar_url":"https://github.com/duoan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Mega Data Factory\n\nA reproducible, high-throughput, distributed 
open-source pipeline for processing web-scale (hundreds of billions) multimodal datasets. Built on Ray with Rust-accelerated and GPU-optimized operators for ablation, scoring, and deduplication at scale.\n\n![Mega Data Factory](mdf.png)\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=duoan/mega-data-factory\u0026type=date\u0026legend=top-left)](https://www.star-history.com/#duoan/mega-data-factory\u0026type=date\u0026legend=top-left)\n\n## Vision\n\n**Reproduce SOTA foundation model data pipelines** — from rule-based to model-based, spanning text, image, and multimodal data.\n\n### Text Data Pipelines\n\n| Pipeline | Paper | Status |\n|----------|-------|--------|\n| [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) | 15T tokens, quality filtering | 🚧 In Progress |\n| [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Educational content classifier | 🚧 In Progress |\n| [RefinedWeb](https://arxiv.org/pdf/2306.01116) | URL filtering, trafilatura, dedup | ✅ URL Filter |\n| [DCLM](https://arxiv.org/pdf/2406.11794) | Data curation for LLMs | 📋 Planned |\n| [Dolma](https://arxiv.org/pdf/2402.00159) | Open corpus toolkit | 📋 Planned |\n| [RedPajama-V2](https://together.ai/blog/redpajama-data-v2) | 30T tokens, quality signals | 📋 Planned |\n\n### Image \u0026 Vision-Language Pipelines\n\n| Pipeline | Paper | Status |\n|----------|-------|--------|\n| [Z-Image](https://arxiv.org/pdf/2511.22699) | Image generation foundation model | ✅ Implemented |\n| [Imagen 3](https://arxiv.org/abs/2408.07009) | Image quality \u0026 AIGC detection | ✅ Implemented |\n| [LAION-5B](https://arxiv.org/pdf/2210.08402) | CLIP filtering, dedup | ✅ Implemented |\n| [DataComp](https://arxiv.org/pdf/2304.14108) | CLIP/SigLIP filtering | ✅ Implemented |\n| [Qwen-VL](https://arxiv.org/pdf/2511.21631) | Vision-language data | 🚧 In Progress |\n| [Seed1.5-VL](https://arxiv.org/pdf/2505.07062) | Vision-language reasoning | 
📋 Planned |\n| [HoneyBee](https://arxiv.org/pdf/2510.12225) | Data recipes for VL reasoners | 📋 Planned |\n| [Cosmos](https://arxiv.org/pdf/2501.03575) | World model platform | 📋 Planned |\n\n### Video \u0026 Multimodal Pipelines\n\n| Pipeline | Paper | Status |\n|----------|-------|--------|\n| [Panda-70M](https://arxiv.org/pdf/2402.19479) | Video captioning | 📋 Planned |\n| [InternVid](https://arxiv.org/pdf/2307.06942) | Video-language | 📋 Planned |\n| [OpenVid-1M](https://arxiv.org/pdf/2407.02371) | Video generation | 📋 Planned |\n\n## Pipeline Run Reports\n\n\u003chttps://huggingface.co/spaces/classtag/mega-data-factory-reports\u003e\nThis space contains interactive HTML reports for pipeline runs, showcasing metrics, visualizations, and performance statistics.\n\n### Data Quality Funnel\n\n![data quality funnel](images/data_quality_funnel.png)\n\n### Data Flow Sankey\n\n![data flow sankey](images/data_flow_sankey.png)\n\n### Data Detail Metrics\n\n![data detail metrics](images/data_detail_metrics.png)\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/duoan/mega-data-factory.git\ncd mega-data-factory\n\n# Install with Rust acceleration (recommended)\nuv pip install -e .\n\n# Or install without Rust (pure Python fallback)\nuv sync\n```\n\n\u003e Requires Rust toolchain for building accelerated operators. 
Install via [rustup](https://rustup.rs/).\n\n## Quick Start\n\n```bash\n# Run pipeline with config\nmdf run --config configs/z_image.yaml\n\n# Or with options\nmdf run -c configs/z_image.yaml --max-samples 1000 --batch-size 500\n```\n\n## Operators\n\n\u003e 🦀 = Rust Accelerated | 🖥️ = GPU Optimized\n\n### Data Loaders\n\n| Loader | Description | Features |\n|--------|-------------|----------|\n| `HuggingFaceLoader` | Load from HuggingFace datasets | Streaming, sharding |\n| `CommonCrawlLoader` | Load from CommonCrawl WARC files | 🦀 Rust text extraction, distributed |\n\n### Text Operators\n\n**Refiners** (normalize/enrich text fields):\n\n| Operator | Description |\n|----------|-------------|\n| [`TextNewLineRemovalRefiner`](mega_data_factory/operators/refiners/text_new_line_removal_refiner.md) | Limit maximum consecutive newlines in text |\n\n**Filters** (rule-based, from [RefinedWeb](https://arxiv.org/pdf/2306.01116)):\n\n| Operator | Description | Reference |\n|----------|-------------|-----------|\n| [`URLFilter`](mega_data_factory/operators/filters/url_filter.md) | Domain blocklist, URL word scoring, quality source exclusion | RefinedWeb §G.1 |\n| [`TextLengthFilter`](mega_data_factory/operators/filters/text_length_filter.md) | Filter by character/word count | FineWeb, RefinedWeb |\n| [`TextAlphabeticWordRationFilter`](mega_data_factory/operators/filters/text_alphabetic_word_ration_filter.md) (`text_alphabetic_word_ration_filter`) | Filter by ratio of words without alphabetic chars | Gopher-style heuristic |\n| [`TextAvgWordLengthFilter`](mega_data_factory/operators/filters/text_avg_word_length_filter.md) (`text_avg_word_length_filter`) | Filter by average word length range | RefinedWeb-style heuristic |\n| [`TextBulletFilter`](mega_data_factory/operators/filters/text_bullet_filter.md) (`text_bullet_filter`) | Filter by bullet-line ratio | RefinedWeb-style heuristic |\n| 
[`TextEllipsisLineRatioFilter`](mega_data_factory/operators/filters/text_ellipsis_line_ratio_filter.md) (`text_ellipsis_line_ratio_filter`) | Filter by ellipsis-ending line ratio | RefinedWeb-style heuristic |\n| [`TextSymbolRatioFilter`](mega_data_factory/operators/filters/text_symbol_ratio_filter.md) (`text_symbol_ratio_filter`) | Filter by symbol-to-word ratio (`#`, `...`, `. . .`, `…`) | RefinedWeb-style heuristic |\n| [`TextRepetitionFilter`](mega_data_factory/operators/filters/text_repetition_filter.md) (`text_repetition_filter`) | Multi-granularity n-gram repetition checks (line/paragraph/word) | Gopher / MassiveText heuristic |\n| [`TextTargetLanguageFilter`](mega_data_factory/operators/filters/text_target_language_filter.md) (`text_target_language_filter`) | FastText language detection with score threshold | CCNet |\n\n**Deduplicators:**\n\n| Operator | Description |\n|----------|-------------|\n| [`TextExactDeduplicator`](mega_data_factory/operators/dedup/text_exact_dedup.md) | Exact content hash deduplication (xxhash/MD5) |\n\n**Coming Soon:**\n\n- `PerplexityFilter` - KenLM perplexity scoring\n- `QualityClassifierFilter` - Model-based quality (FineWeb-Edu style)\n- `MinHashDeduplicator` - Near-duplicate detection\n\n### Image Operators\n\n**Refiners** (enrich records with new fields):\n\n| Operator | Description | Acceleration |\n|----------|-------------|--------------|\n| [`ImageMetadataRefiner`](mega_data_factory/operators/refiners/image_metadata.md) | Width, height, format, file size | CPU |\n| [`ImageTechnicalQualityRefiner`](mega_data_factory/operators/refiners/image_technical_quality.md) | Compression artifacts, entropy | 🦀 Rust |\n| [`ImageVisualDegradationsRefiner`](mega_data_factory/operators/refiners/image_visual_degradations.md) | Color cast, blur, watermark, noise | CPU |\n| [`ImageClipEmbeddingRefiner`](mega_data_factory/operators/refiners/image_clip_embedding.md) | CLIP embeddings (OpenCLIP) | 🖥️ GPU |\n| 
[`ImageSigLIPEmbeddingRefiner`](mega_data_factory/operators/refiners/image_siglip_embedding.md) | SigLIP2 embeddings | 🖥️ GPU |\n| [`ImageAestheticQualityRefiner`](mega_data_factory/operators/refiners/image_aesthetic_quality.md) | Aesthetic score (CLIP-based) | CPU |\n| [`ImageAIGCDetectorRefiner`](mega_data_factory/operators/refiners/image_aigc_detector.md) | AI-generated image detection | CPU |\n\n**Filters:**\n\n| Operator | Description |\n|----------|-------------|\n| [`ImageQualityFilter`](mega_data_factory/operators/filters/image_quality_filter.md) | Filter by size, quality metrics, aesthetic score |\n\n**Deduplicators:**\n\n| Operator | Description | Acceleration |\n|----------|-------------|--------------|\n| [`ImagePhashDeduplicator`](mega_data_factory/operators/dedup/image_phash_dedup.md) | Perceptual hash deduplication | 🦀 Rust |\n\n### General Operators\n\n**Filters:**\n\n| Operator | Description |\n|----------|-------------|\n| [`RangeFilter`](mega_data_factory/operators/filters/range_filter.md) | Generic range filter for any numeric field (min/max bounds) |\n\n### Video Operators\n\n**Refiners:**\n\n| Operator | Description | Requirements |\n|----------|-------------|--------------|\n| [`VideoMetadataRefiner`](mega_data_factory/operators/refiners/video_metadata.md) | Extract video metadata (duration, resolution, fps, codec, bitrate, audio info) | FFprobe |\n| [`VideoAestheticsScoreRefiner`](mega_data_factory/operators/refiners/video_aesthetics_score_refiner.md) | Video aesthetic quality scoring via frame sampling | 🖥️ GPU |\n| [`VideoClipEmbeddingRefiner`](mega_data_factory/operators/refiners/video_clip_embedding.md) | CLIP embeddings for video frames (mean/max pooling) | 🖥️ GPU |\n\n**Deduplicators:**\n\n| Operator | Description | Requirements |\n|----------|-------------|--------------|\n| [`VideoExactByteLevelDeduplicator`](mega_data_factory/operators/dedup/video_exact_byte_level_dedup.md) | Exact file hash deduplication (SHA-256/MD5/SHA-512) | - 
|\n| [`VideoExactStreamLevelDeduplicator`](mega_data_factory/operators/dedup/video_exact_stream_level_dedup.md) | Raw stream hash deduplication (container-agnostic) | FFmpeg |\n\n### LLM Synthesis Operators\n\n**Refiners** (synthesize data via LLM APIs or local models):\n\n| Operator | Description | Mode |\n|----------|-------------|------|\n| [`LLMOnlineSynthesisRefiner`](mega_data_factory/operators/refiners/llm_synthesis/llm_online_synthesis.md) | Call remote LLM APIs (OpenAI, Claude, Gemini, MiniMax, DeepSeek, etc.) with account pool + proxy pool | Online |\n| [`LLMOfflineSynthesisRefiner`](mega_data_factory/operators/refiners/llm_synthesis/llm_offline_synthesis.md) | Run models locally on GPUs via vLLM engine for high-throughput batch inference | Offline |\n| [`LLMResponseParserRefiner`](mega_data_factory/operators/refiners/llm_synthesis/llm_response_parser.md) | Post-process LLM responses: JSON/regex/JMESPath extraction, schema validation, field mapping | Post-processing |\n\n![LLM Synthesis Architecture](mega_data_factory/operators/refiners/llm_synthesis/architecture.png)\n\n**Online mode** supports any OpenAI-compatible endpoint (vLLM server, Ollama, Together, Groq) plus native Anthropic, Gemini, and [MiniMax](https://www.minimax.io) APIs. 
Account pool rotates API keys with rate-limit awareness; proxy pool rotates HTTP/SOCKS proxies with failure tracking.\n\n**Offline mode** uses vLLM's Python API for zero-HTTP-overhead GPU inference with continuous batching, tensor parallelism, and quantization (AWQ/GPTQ) support.\n\nInstall dependencies:\n\n```bash\npip install -e \".[llm-online]\"    # httpx for online mode\npip install -e \".[llm-offline]\"   # vllm for offline mode\n```\n\n### Data Writers\n\n| Writer | Description |\n|--------|-------------|\n| `ParquetDataWriter` | Write to Parquet files |\n| `IcebergDataWriter` | Write to Apache Iceberg tables |\n\n## Architecture\n\n\u003e **Deep Dive**: See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for a comprehensive explanation of the distributed pipeline-parallel design, including ObjectRef chaining, backpressure control, bucketed deduplication, and theoretical scalability analysis.\n\n### Pipeline Overview\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4f46e5', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6366f1', 'lineColor': '#a5b4fc', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#312e81', 'background': '#0f0f23', 'mainBkg': '#1e1b4b', 'nodeBorder': '#6366f1', 'clusterBkg': '#1e1b4b', 'clusterBorder': '#6366f1', 'titleColor': '#e0e7ff', 'edgeLabelBackground': '#312e81'}}}%%\nflowchart TB\n    subgraph Driver[\"Ray Driver\"]\n        Config[Config]\n        Executor[Executor]\n        Progress[Stats]\n    end\n\n    subgraph ObjectStore[\"Object Store\"]\n        Batches[\"Shared Memory\"]\n    end\n\n    subgraph Stage0[\"CPU Pool ×8\"]\n        direction LR\n        W0[\"W0\"]\n        W1[\"W1\"]\n        W2[\"W2\"]\n        Wn[\"...\"]\n        W7[\"W7\"]\n    end\n\n    subgraph Stage1[\"GPU Pool ×2\"]\n        direction LR\n        GPU0[\"GPU0\"]\n        GPU1[\"GPU1\"]\n    end\n\n    subgraph Output[\"Output\"]\n        Writer[Parquet]\n    end\n\n    HF[\"HuggingFace\"] --\u003e Driver\n    Driver 
--\u003e ObjectStore\n    ObjectStore --\u003e Stage0\n    Stage0 --\u003e ObjectStore\n    ObjectStore --\u003e Stage1\n    Stage1 --\u003e Writer\n```\n\n### Worker Pool \u0026 Load Balancing\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#059669', 'primaryTextColor': '#fff', 'primaryBorderColor': '#10b981', 'lineColor': '#6ee7b7', 'secondaryColor': '#064e3b', 'tertiaryColor': '#065f46', 'background': '#0f0f23', 'mainBkg': '#064e3b', 'nodeBorder': '#10b981', 'clusterBkg': '#064e3b', 'clusterBorder': '#10b981'}}}%%\nflowchart LR\n    subgraph Input[\"Batches\"]\n        B0[\"B0\"] \u0026 B1[\"B1\"] \u0026 B2[\"B2\"] \u0026 B3[\"B3\"]\n        B4[\"B4\"] \u0026 B5[\"B5\"] \u0026 B6[\"B6\"] \u0026 B7[\"B7\"]\n    end\n\n    subgraph CPU[\"CPU Pool ×8 workers\"]\n        C0[\"C0 🦀\"] \u0026 C1[\"C1 🦀\"] \u0026 C2[\"C2 🦀\"] \u0026 C3[\"C3 🦀\"]\n        C4[\"C4 🦀\"] \u0026 C5[\"C5 🦀\"] \u0026 C6[\"C6 🦀\"] \u0026 C7[\"C7 🦀\"]\n    end\n\n    subgraph GPU[\"GPU Pool ×2 workers\"]\n        G0[\"G0 CLIP\"]\n        G1[\"G1 CLIP\"]\n    end\n\n    B0 --\u003e C0\n    B1 --\u003e C1\n    B2 --\u003e C2\n    B3 --\u003e C3\n    B4 --\u003e C4\n    B5 --\u003e C5\n    B6 --\u003e C6\n    B7 --\u003e C7\n\n    C0 \u0026 C1 \u0026 C2 \u0026 C3 --\u003e G0\n    C4 \u0026 C5 \u0026 C6 \u0026 C7 --\u003e G1\n```\n\n### Execution Sequence\n\n```mermaid\n%%{init: {'theme': 'dark'}}%%\nsequenceDiagram\n    participant D as Driver\n    participant OS as ObjectStore\n    participant CPU as CPU ×8\n    participant GPU as GPU ×2\n    participant W as Writer\n\n    D-\u003e\u003eOS: Submit batches\n\n    par CPU Processing\n        OS-\u003e\u003eCPU: Batch 0-7\n    end\n\n    CPU-\u003e\u003eOS: Processed\n\n    par GPU Processing\n        OS-\u003e\u003eGPU: Batch 0-7\n    end\n\n    GPU-\u003e\u003eW: Write Parquet\n    W-\u003e\u003eD: Done\n```\n\n### Timeline (Parallel Execution)\n\n```mermaid\n%%{init: {'theme': 'dark'}}%%\ngantt\n    title Batch 
Processing Timeline\n    dateFormat X\n    axisFormat %s\n\n    section CPU-0\n        B0    :c0, 0, 2\n        B8    :c0b, 8, 2\n\n    section CPU-1\n        B1    :c1, 0, 2\n        B9    :c1b, 8, 2\n\n    section CPU-7\n        B7    :c7, 0, 2\n        B15   :c7b, 8, 2\n\n    section GPU-0\n        B0    :g0a, 2, 3\n        B2    :g0b, 5, 3\n\n    section GPU-1\n        B1    :g1a, 2, 3\n        B3    :g1b, 5, 3\n```\n\n\u003e **Key Points**:\n\u003e\n\u003e - **CPU Pool**: 8 workers for metadata, quality (🦀 Rust), filtering, dedup\n\u003e - **GPU Pool**: 2 workers for CLIP embeddings (limited by VRAM)\n\u003e - **Load Balancing**: Ray auto-distributes batches to idle workers\n\n## Configuration\n\n### Text Pipeline: CommonCrawl Processing\n\n```yaml\n# configs/example_commoncrawl.yaml\n# RefinedWeb-style text extraction pipeline\n\ndata_loader:\n  type: CommonCrawlLoader\n  params:\n    crawl_id: \"CC-MAIN-2024-51\"\n  num_workers: 1\n\nstages:\n  - name: content_filtering\n    operators:\n      # RefinedWeb §G.1: URL filtering\n      - name: url_filter\n        params:\n          url_field: \"url\"\n      # Length filtering\n      - name: text_length_filter\n        params:\n          min_length: 50\n          max_length: 100000\n          text_field: \"text\"\n          length_type: \"word\"\n      # Additional text quality filters\n      - name: text_alphabetic_word_ration_filter\n        params:\n          text_field: \"text\"\n          max_ratio: 0.8\n      - name: text_avg_word_length_filter\n        params:\n          text_field: \"text\"\n          lower_bound: 2.0\n          upper_bound: 20.0\n      - name: text_bullet_filter\n        params:\n          text_field: \"text\"\n          max_bullet_ratio: 0.9\n      - name: text_ellipsis_line_ratio_filter\n        params:\n          text_field: \"text\"\n          max_ratio: 0.3\n      - name: text_symbol_ratio_filter\n        params:\n          text_field: \"text\"\n          max_symbol_to_word_ratio: 
0.5\n      - name: text_repetition_filter\n        params:\n          text_field: \"text\"\n      # Normalize newlines before dedup\n      - name: text_new_line_removal_refiner\n        params:\n          text_field: \"text\"\n          max_consecutive: 2\n      # Exact deduplication\n      - name: text_exact_deduplicator\n        params:\n          text_field: \"text\"\n    worker:\n      min_replicas: 2\n      max_replicas: 2\n\ndata_writer:\n  type: ParquetDataWriter\n  params:\n    output_path: \"./output/commoncrawl\"\n\nexecutor:\n  max_samples: 10000\n  batch_size: 200\n  dedup_num_buckets: 1\n  rejected_samples:\n    enabled: true\n  metrics:\n    enabled: true\n    generate_report: true\n    debug_samples_per_operator: 20\n```\n\n### Image Pipeline: Z-Image Style\n\n```yaml\n# configs/z_image.yaml\n# Image quality + aesthetic + AIGC detection pipeline\n\ndata_loader:\n  type: HuggingFaceLoader\n  params:\n    dataset_name: \"jp1924/Laion400m-1\"\n    split: \"train\"\n    streaming: true\n\nstages:\n  # Stage 1: Basic metadata and quality (CPU, Rust-accelerated)\n  - name: basic_stage\n    operators:\n      - name: image_metadata_refiner\n      - name: image_technical_quality_refiner  # 🦀 Rust\n      - name: image_quality_filter\n        params:\n          min_width: 128\n          min_height: 128\n          max_compression_artifacts: 0.8\n      - name: image_phash_deduplicator  # 🦀 Rust\n    worker:\n      min_replicas: 2\n      max_replicas: 8\n      resources:\n        cpu: 1\n\n  # Stage 2: Embedding extraction (GPU)\n  - name: embedding_stage\n    operators:\n      - name: image_clip_embedding_refiner\n        params:\n          model_name: \"ViT-L-14\"\n          pretrained: \"openai\"\n          use_fp16: true\n      - name: image_siglip_embedding_refiner\n        params:\n          model_name: \"google/siglip2-so400m-patch14-384\"\n          use_fp16: true\n    worker:\n      min_replicas: 1\n      max_replicas: 2\n      resources:\n        gpu: 
1\n\n  # Stage 3: Quality scoring\n  - name: scoring_stage\n    operators:\n      - name: image_aesthetic_quality_refiner\n      - name: image_aigc_detector_refiner\n        params:\n          threshold: 0.5\n    worker:\n      min_replicas: 2\n      max_replicas: 4\n      resources:\n        cpu: 1\n\ndata_writer:\n  type: ParquetDataWriter\n  params:\n    output_path: \"./output/z_image\"\n\nexecutor:\n  max_samples: 100000\n  batch_size: 256\n  dedup_num_buckets: 16\n  metrics:\n    enabled: true\n    generate_report: true\n```\n\n### LLM Synthesis Pipeline\n\n```yaml\n# configs/example_llm_synthesis.yaml\n# Knowledge synthesis with post-processing\n\ndata_loader:\n  type: HuggingFaceLoader\n  params:\n    dataset_name: \"your-org/seed-prompts\"\n    split: \"train\"\n    streaming: true\n\nstages:\n  - name: synthesis_stage\n    operators:\n      # Step 1: Call LLM API\n      - name: llm_online_synthesis_refiner\n        params:\n          provider: anthropic\n          model: claude-sonnet-4-20250514\n          system_prompt: |\n            Analyze the text and return JSON:\n            {\"category\": \"...\", \"confidence\": 0.0-1.0, \"reasoning\": \"...\"}\n          prompt_template: \"Classify: {text}\"\n          enable_thinking: true\n          thinking_budget: 10000\n          accounts:\n            - api_key: \"${ANTHROPIC_API_KEY_1}\"\n            - api_key: \"${ANTHROPIC_API_KEY_2}\"\n          proxies:\n            - \"http://user:pass@proxy1:8080\"\n          max_concurrent: 8\n\n      # Step 2: Extract structured output\n      - name: llm_response_parser_refiner\n        params:\n          input_field: llm_response\n          parse_mode: json\n          field_mapping:\n            category: \"category\"\n            confidence: \"confidence\"\n            reasoning: \"reasoning\"\n          required_fields: [\"category\", \"confidence\"]\n          field_types:\n            category: str\n            confidence: float\n    worker:\n      
num_replicas: 1\n      resources:\n        cpu: 2\n\ndata_writer:\n  type: ParquetDataWriter\n  params:\n    output_path: \"./output/llm_synthesis\"\n\nexecutor:\n  max_samples: 10000\n  batch_size: 64\n```\n\n## Performance\n\n### Text Pipeline (CommonCrawl)\n\n```text\n============================================================\nPipeline: CommonCrawl text extraction (1M records)\nHardware: 8 CPU cores\n============================================================\n\nstage_0:\n  [Stage Summary]\n    Input: 1,000,000 → Output: 945,866 (94.6% pass)\n    Total time: 49.11s\n    Throughput: 20,362 records/sec\n\n  URLFilter:           20,362 rec/sec   (98.1% pass)  # RefinedWeb §G.1\n  TextLengthFilter:  1,976,454 rec/sec   (96.4% pass)  # Near instant\n============================================================\n\nProjections:\n  10M records   →  ~8 minutes\n  100M records  →  ~1.4 hours\n  1B records    →  ~14 hours\n```\n\n### Image Pipeline (LAION)\n\nBenchmark on Mac M1 Pro (MPS):\n\n```text\n============================================================\nPipeline: Image quality + embedding (1K records)\n============================================================\n\nstage_0 (CPU, Rust-accelerated):\n  [Stage Summary]\n    Input: 1,000 → Output: 898 (89.8% pass)\n    Total time: 0.61s\n    Throughput: 1,630 records/sec\n\n  ImageMetadataRefiner:        27,000 rec/sec\n  ImageTechnicalQualityRefiner: 2,500 rec/sec  🦀 Rust\n  ImageQualityFilter:       4,200,000 rec/sec\n  ImagePhashDeduplicator:      1,500 rec/sec  🦀 Rust\n\nstage_1 (GPU):\n  [Stage Summary]\n    Input: 898 → Output: 898\n    Total time: 6.80s\n    Throughput: 132 records/sec\n\n  ImageClipEmbeddingRefiner:     132 rec/sec  🖥️ GPU\n============================================================\n```\n\n## Project Structure\n\n```text\nmega-data-factory/\n├── mega_data_factory/\n│   ├── cli.py                          # CLI entry point (mdf command)\n│   ├── framework/\n│   │   ├── executor.py           
      # Pipeline orchestration\n│   │   ├── stage_actor.py               # StageActor\n│   │   ├── loader_actor.py             # LoaderActor\n│   │   ├── dedup_backend.py            # DedupBackend (ABC), ExactDedupBackend, SemanticDedupBackend\n│   │   ├── operator.py                 # Operator, Refiner, Filter, Deduplicator\n│   │   ├── config.py                   # YAML config parsing\n│   │   ├── registry.py                 # Component registries\n│   │   └── metrics/                    # Metrics collection \u0026 reporting\n│   ├── loaders/\n│   │   ├── huggingface_loader.py       # HuggingFace datasets\n│   │   └── commoncrawl_loader.py       # CommonCrawl WARC files\n│   ├── operators/\n│   │   ├── refiners/                   # Refiners (text, image, video)\n│   │   │   └── llm_synthesis/          # LLM synthesis (online, offline, parser)\n│   │   ├── filters/                    # Text + Image filters\n│   │   └── dedup/                      # Deduplicators (phash, minhash)\n│   ├── writers/\n│   │   ├── parquet_writer.py           # Parquet output\n│   │   └── iceberg_writer.py           # Apache Iceberg output\n│   └── models/                         # Model trainers (aesthetic, AIGC, k-means)\n├── src/lib.rs                          # 🦀 Rust operators (quality, phash, HTML extraction)\n├── configs/                            # Pipeline configurations\n│   ├── z_image.yaml                    # Image pipeline\n│   ├── example_commoncrawl.yaml        # Text pipeline\n│   └── example_llm_synthesis.yaml      # LLM synthesis pipeline\n├── tests/                              # Unit tests\n├── Cargo.toml                          # Rust dependencies\n└── pyproject.toml                      # Python config (maturin build)\n```\n\n## Extending the Pipeline\n\n### Custom Text Filter\n\n```python\nfrom mega_data_factory.framework import Filter, OperatorRegistry\n\nclass MyTextFilter(Filter):\n    def __init__(self, min_words: int = 50):\n        super().__init__()\n    
    self.min_words = min_words\n\n    def should_keep_batch(self, records: list[dict]) -\u003e list[bool]:\n        return [len(r.get(\"text\", \"\").split()) \u003e= self.min_words for r in records]\n\nOperatorRegistry.register(\"MyTextFilter\", MyTextFilter)\n```\n\n### Custom Image Refiner\n\n```python\nfrom mega_data_factory.framework import Refiner, OperatorRegistry\nimport pyarrow as pa\n\nclass MyImageRefiner(Refiner):\n    def refine_batch(self, records: list[dict]) -\u003e None:\n        for record in records:\n            # compute_score is a placeholder; supply your own scoring function\n            record[\"my_score\"] = compute_score(record[\"image\"])\n\n    def get_output_schema(self) -\u003e dict[str, pa.DataType]:\n        return {\"my_score\": pa.float32()}\n\nOperatorRegistry.register(\"MyImageRefiner\", MyImageRefiner)\n```\n\n## Key Features\n\n- **Pipeline Parallelism**: Ray ObjectRef chaining enables concurrent stage execution without blocking ([details](docs/ARCHITECTURE.md#pipeline-parallelism-via-objectref-chaining))\n- **Distributed Data Loading**: Sharded file loading with checkpoint support for fault recovery\n- **Backpressure Control**: Bounded in-flight batches prevent OOM on large datasets\n- **Bucketed Deduplication**: Distributed state sharding scales to 100B+ keys ([details](docs/ARCHITECTURE.md#distributed-deduplication))\n- **Rust Acceleration**: 10-25x speedup for image quality, hashing, and HTML extraction\n- **GPU Optimization**: CLIP/SigLIP embedding extraction with FP16 and batch inference\n- **Elastic Scaling**: Dynamic worker allocation with min/max replicas per stage\n- **LLM Synthesis**: Online (API) and offline (vLLM) modes with account/proxy pools and response parsing\n- **Config-Driven**: YAML configs define entire pipelines with no code changes\n\n## References\n\n### Text Data Pipelines\n\n- [RefinedWeb (arXiv:2306.01116)](https://arxiv.org/pdf/2306.01116) - URL filtering, trafilatura, MassiveText dedup\n- [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) - 15T token 
dataset, quality filtering\n- [DCLM (arXiv:2406.11794)](https://arxiv.org/pdf/2406.11794) - Data curation for language models\n- [Dolma (arXiv:2402.00159)](https://arxiv.org/pdf/2402.00159) - Open corpus for LLM pretraining\n\n### Image \u0026 Vision-Language\n\n- [Z-Image (arXiv:2511.22699)](https://arxiv.org/pdf/2511.22699) - Image generation foundation model data\n- [DataComp (arXiv:2304.14108)](https://arxiv.org/pdf/2304.14108) - CLIP filtering benchmark\n- [LAION-5B (arXiv:2210.08402)](https://arxiv.org/pdf/2210.08402) - Large-scale image-text dataset\n\n### Tools \u0026 Models\n\n- [OpenCLIP](https://github.com/mlfoundations/open_clip) - CLIP implementation\n- [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) - Vision encoder\n- [dom_smoothie](https://github.com/nicr9/dom_smoothie) - Rust readability.js port\n\n## License\n\nMIT License\n\n## Citation\n\n```bibtex\n@software{mega_data_factory,\n  author       = {Duo An},\n  title        = {Mega Data Factory},\n  year         = {2025},\n  publisher    = {GitHub},\n  url          = {https://github.com/duoan/mega-data-factory}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduoan%2Fmega-data-factory","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fduoan%2Fmega-data-factory","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fduoan%2Fmega-data-factory/lists"}