https://github.com/duoan/mega-data-factory
🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models
- Host: GitHub
- URL: https://github.com/duoan/mega-data-factory
- Owner: duoan
- License: mit
- Created: 2026-01-17T21:51:33.000Z
- Default Branch: master
- Last Pushed: 2026-03-10T05:30:13.000Z
- Last Synced: 2026-03-10T13:46:41.127Z
- Topics: data-centric-ai, data-curation, data-quality, datapipeline, datapipelines, deeplearning, foundation-models, image-editing, image-generation, llm, machine-learning, mllm, multimodal, ray, rust, video-generation, vlm
- Language: Python
- Homepage:
- Size: 8.85 MB
- Stars: 353
- Watchers: 29
- Forks: 44
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# Mega Data Factory
A reproducible, high-throughput, distributed open-source pipeline for processing web-scale multimodal datasets (hundreds of billions of records). Built on Ray, with Rust-accelerated and GPU-optimized operators for ablation, scoring, and deduplication at scale.

## Star History
[Star History Chart](https://www.star-history.com/#duoan/mega-data-factory&type=date&legend=top-left)
## Vision
**Reproduce SOTA foundation model data pipelines**, from rule-based to model-based, spanning text, image, and multimodal data.
### Text Data Pipelines
| Pipeline | Paper | Status |
|----------|-------|--------|
| [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) | 15T tokens, quality filtering | 🚧 In Progress |
| [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Educational content classifier | 🚧 In Progress |
| [RefinedWeb](https://arxiv.org/pdf/2306.01116) | URL filtering, trafilatura, dedup | ✅ URL Filter |
| [DCLM](https://arxiv.org/pdf/2406.11794) | Data curation for LLMs | Planned |
| [Dolma](https://arxiv.org/pdf/2402.00159) | Open corpus toolkit | Planned |
| [RedPajama-V2](https://together.ai/blog/redpajama-data-v2) | 30T tokens, quality signals | Planned |
### Image & Vision-Language Pipelines
| Pipeline | Paper | Status |
|----------|-------|--------|
| [Z-Image](https://arxiv.org/pdf/2511.22699) | Image generation foundation model | ✅ Implemented |
| [Imagen 3](https://arxiv.org/abs/2408.07009) | Image quality & AIGC detection | ✅ Implemented |
| [LAION-5B](https://arxiv.org/pdf/2210.08402) | CLIP filtering, dedup | ✅ Implemented |
| [DataComp](https://arxiv.org/pdf/2304.14108) | CLIP/SigLIP filtering | ✅ Implemented |
| [Qwen-VL](https://arxiv.org/pdf/2511.21631) | Vision-language data | 🚧 In Progress |
| [Seed1.5-VL](https://arxiv.org/pdf/2505.07062) | Vision-language reasoning | Planned |
| [HoneyBee](https://arxiv.org/pdf/2510.12225) | Data recipes for VL reasoners | Planned |
| [Cosmos](https://arxiv.org/pdf/2501.03575) | World model platform | Planned |
### Video & Multimodal Pipelines
| Pipeline | Paper | Status |
|----------|-------|--------|
| [Panda-70M](https://arxiv.org/pdf/2402.19479) | Video captioning | Planned |
| [InternVid](https://arxiv.org/pdf/2307.06942) | Video-language | Planned |
| [OpenVid-1M](https://arxiv.org/pdf/2407.02371) | Video generation | Planned |
## Pipeline Run Reports
This space contains interactive HTML reports for pipeline runs, showcasing metrics, visualizations, and performance statistics.
### Data Quality Funnel

### Data Flow Sankey

### Data Detail Metrics

## Installation
```bash
# Clone the repository
git clone https://github.com/duoan/mega-data-factory.git
cd mega-data-factory
# Install with Rust acceleration (recommended)
uv pip install -e .
# Or install without Rust (pure Python fallback)
uv sync
```
> Requires Rust toolchain for building accelerated operators. Install via [rustup](https://rustup.rs/).
## Quick Start
```bash
# Run pipeline with config
mdf run --config configs/z_image.yaml
# Or with options
mdf run -c configs/z_image.yaml --max-samples 1000 --batch-size 500
```
## Operators
> 🦀 = Rust Accelerated | 🖥️ = GPU Optimized
### Data Loaders
| Loader | Description | Features |
|--------|-------------|----------|
| `HuggingFaceLoader` | Load from HuggingFace datasets | Streaming, sharding |
| `CommonCrawlLoader` | Load from CommonCrawl WARC files | 🦀 Rust text extraction, distributed |
### Text Operators
**Refiners** (normalize/enrich text fields):
| Operator | Description |
|----------|-------------|
| [`TextNewLineRemovalRefiner`](mega_data_factory/operators/refiners/text_new_line_removal_refiner.md) | Limit maximum consecutive newlines in text |
**Filters** (rule-based, from [RefinedWeb](https://arxiv.org/pdf/2306.01116)):
| Operator | Description | Reference |
|----------|-------------|-----------|
| [`URLFilter`](mega_data_factory/operators/filters/url_filter.md) | Domain blocklist, URL word scoring, quality source exclusion | RefinedWeb §G.1 |
| [`TextLengthFilter`](mega_data_factory/operators/filters/text_length_filter.md) | Filter by character/word count | FineWeb, RefinedWeb |
| [`TextAlphabeticWordRationFilter`](mega_data_factory/operators/filters/text_alphabetic_word_ration_filter.md) (`text_alphabetic_word_ration_filter`) | Filter by ratio of words without alphabetic chars | Gopher-style heuristic |
| [`TextAvgWordLengthFilter`](mega_data_factory/operators/filters/text_avg_word_length_filter.md) (`text_avg_word_length_filter`) | Filter by average word length range | RefinedWeb-style heuristic |
| [`TextBulletFilter`](mega_data_factory/operators/filters/text_bullet_filter.md) (`text_bullet_filter`) | Filter by bullet-line ratio | RefinedWeb-style heuristic |
| [`TextEllipsisLineRatioFilter`](mega_data_factory/operators/filters/text_ellipsis_line_ratio_filter.md) (`text_ellipsis_line_ratio_filter`) | Filter by ellipsis-ending line ratio | RefinedWeb-style heuristic |
| [`TextSymbolRatioFilter`](mega_data_factory/operators/filters/text_symbol_ratio_filter.md) (`text_symbol_ratio_filter`) | Filter by symbol-to-word ratio (`#`, `...`, `. . .`, `…`) | RefinedWeb-style heuristic |
| [`TextRepetitionFilter`](mega_data_factory/operators/filters/text_repetition_filter.md) (`text_repetition_filter`) | Multi-granularity n-gram repetition checks (line/paragraph/word) | Gopher / MassiveText heuristic |
| [`TextTargetLanguageFilter`](mega_data_factory/operators/filters/text_target_language_filter.md) (`text_target_language_filter`) | FastText language detection with score threshold | CCNet |
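
As a rough illustration of how one of these heuristics works, the sketch below computes a symbol-to-word ratio for a document and applies a threshold. The function names and the exact counting rule are assumptions made for illustration and may differ from what `TextSymbolRatioFilter` does internally.

```python
import re

# Symbols counted by this sketch: '#', '...', '. . .', and the ellipsis character.
SYMBOL_PATTERN = re.compile(r"#|\.\.\.|\. \. \.|\u2026")


def symbol_to_word_ratio(text: str) -> float:
    """Return symbol occurrences divided by whitespace-separated word count."""
    words = text.split()
    if not words:
        return 0.0
    return len(SYMBOL_PATTERN.findall(text)) / len(words)


def keep_by_symbol_ratio(text: str, max_symbol_to_word_ratio: float = 0.5) -> bool:
    """Keep the document only if its ratio does not exceed the bound."""
    return symbol_to_word_ratio(text) <= max_symbol_to_word_ratio
```

A document made mostly of hashtags or trailing ellipses scores high and is dropped, while ordinary prose scores near zero.
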
**Deduplicators:**
| Operator | Description |
|----------|-------------|
| [`TextExactDeduplicator`](mega_data_factory/operators/dedup/text_exact_dedup.md) | Exact content hash deduplication (xxhash/MD5) |
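
Exact deduplication can be pictured as hashing each record's text and keeping only the first record per digest. This is a minimal single-process sketch using MD5 from the standard library (xxhash is a third-party package); `exact_dedup` is a hypothetical helper, not the operator's real interface.

```python
import hashlib


def exact_dedup(records: list[dict], text_field: str = "text") -> list[dict]:
    """Keep the first record for each distinct text value, compared by content hash."""
    seen: set[str] = set()
    kept = []
    for record in records:
        digest = hashlib.md5(record.get(text_field, "").encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept
```

Storing digests instead of full texts keeps the seen-set small and of fixed size per record.
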
**Coming Soon:**
- `PerplexityFilter` - KenLM perplexity scoring
- `QualityClassifierFilter` - Model-based quality (FineWeb-Edu style)
- `MinHashDeduplicator` - Near-duplicate detection
### Image Operators
**Refiners** (enrich records with new fields):
| Operator | Description | Acceleration |
|----------|-------------|--------------|
| [`ImageMetadataRefiner`](mega_data_factory/operators/refiners/image_metadata.md) | Width, height, format, file size | CPU |
| [`ImageTechnicalQualityRefiner`](mega_data_factory/operators/refiners/image_technical_quality.md) | Compression artifacts, entropy | 🦀 Rust |
| [`ImageVisualDegradationsRefiner`](mega_data_factory/operators/refiners/image_visual_degradations.md) | Color cast, blur, watermark, noise | CPU |
| [`ImageClipEmbeddingRefiner`](mega_data_factory/operators/refiners/image_clip_embedding.md) | CLIP embeddings (OpenCLIP) | 🖥️ GPU |
| [`ImageSigLIPEmbeddingRefiner`](mega_data_factory/operators/refiners/image_siglip_embedding.md) | SigLIP2 embeddings | 🖥️ GPU |
| [`ImageAestheticQualityRefiner`](mega_data_factory/operators/refiners/image_aesthetic_quality.md) | Aesthetic score (CLIP-based) | CPU |
| [`ImageAIGCDetectorRefiner`](mega_data_factory/operators/refiners/image_aigc_detector.md) | AI-generated image detection | CPU |
**Filters:**
| Operator | Description |
|----------|-------------|
| [`ImageQualityFilter`](mega_data_factory/operators/filters/image_quality_filter.md) | Filter by size, quality metrics, aesthetic score |
**Deduplicators:**
| Operator | Description | Acceleration |
|----------|-------------|--------------|
| [`ImagePhashDeduplicator`](mega_data_factory/operators/dedup/image_phash_dedup.md) | Perceptual hash deduplication | 🦀 Rust |
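
Perceptual hashing compares compact fingerprints of image content instead of raw bytes, so small changes such as a brightness shift still match. The sketch below uses a difference hash (dHash), a simpler relative of DCT-based pHash, on an already downscaled grayscale grid given as nested lists. It is an illustrative stand-in; the algorithm and thresholds used by the Rust operator are not specified here.

```python
def dhash_bits(gray: list[list[int]]) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair (1 if left > right)."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 4) -> bool:
    """Treat two images as duplicates when their hashes differ in few bits."""
    return hamming_distance(a, b) <= max_distance
```

Because the hash only records whether each pixel is brighter than its neighbor, adding a constant to every pixel leaves it unchanged.
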
### General Operators
**Filters:**
| Operator | Description |
|----------|-------------|
| [`RangeFilter`](mega_data_factory/operators/filters/range_filter.md) | Generic range filter for any numeric field (min/max bounds) |
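
A generic range filter reduces to a keep/drop mask over one numeric field. The helper below is a hypothetical sketch; in particular, dropping records where the field is missing is an assumption, not documented `RangeFilter` behavior.

```python
def range_filter(records, field, min_value=None, max_value=None):
    """Return a mask that is True when record[field] lies within the given bounds."""
    mask = []
    for record in records:
        value = record.get(field)
        keep = value is not None  # assumption: records missing the field are dropped
        if keep and min_value is not None:
            keep = value >= min_value
        if keep and max_value is not None:
            keep = value <= max_value
        mask.append(keep)
    return mask
```

Leaving either bound as `None` makes the filter one-sided.
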
### Video Operators
**Refiners:**
| Operator | Description | Requirements |
|----------|-------------|--------------|
| [`VideoMetadataRefiner`](mega_data_factory/operators/refiners/video_metadata.md) | Extract video metadata (duration, resolution, fps, codec, bitrate, audio info) | FFprobe |
| [`VideoAestheticsScoreRefiner`](mega_data_factory/operators/refiners/video_aesthetics_score_refiner.md) | Video aesthetic quality scoring via frame sampling | 🖥️ GPU |
| [`VideoClipEmbeddingRefiner`](mega_data_factory/operators/refiners/video_clip_embedding.md) | CLIP embeddings for video frames (mean/max pooling) | 🖥️ GPU |
**Deduplicators:**
| Operator | Description | Requirements |
|----------|-------------|--------------|
| [`VideoExactByteLevelDeduplicator`](mega_data_factory/operators/dedup/video_exact_byte_level_dedup.md) | Exact file hash deduplication (SHA-256/MD5/SHA-512) | - |
| [`VideoExactStreamLevelDeduplicator`](mega_data_factory/operators/dedup/video_exact_stream_level_dedup.md) | Raw stream hash deduplication (container-agnostic) | FFmpeg |
### LLM Synthesis Operators
**Refiners** (synthesize data via LLM APIs or local models):
| Operator | Description | Mode |
|----------|-------------|------|
| [`LLMOnlineSynthesisRefiner`](mega_data_factory/operators/refiners/llm_synthesis/llm_online_synthesis.md) | Call remote LLM APIs (OpenAI, Claude, Gemini, MiniMax, DeepSeek, etc.) with account pool + proxy pool | Online |
| [`LLMOfflineSynthesisRefiner`](mega_data_factory/operators/refiners/llm_synthesis/llm_offline_synthesis.md) | Run models locally on GPUs via vLLM engine for high-throughput batch inference | Offline |
| [`LLMResponseParserRefiner`](mega_data_factory/operators/refiners/llm_synthesis/llm_response_parser.md) | Post-process LLM responses: JSON/regex/JMESPath extraction, schema validation, field mapping | Post-processing |

**Online mode** supports any OpenAI-compatible endpoint (vLLM server, Ollama, Together, Groq) plus native Anthropic, Gemini, and [MiniMax](https://www.minimax.io) APIs. Account pool rotates API keys with rate-limit awareness; proxy pool rotates HTTP/SOCKS proxies with failure tracking.
**Offline mode** uses vLLM's Python API for zero-HTTP-overhead GPU inference with continuous batching, tensor parallelism, and quantization (AWQ/GPTQ) support.
Install dependencies:
```bash
pip install -e ".[llm-online]" # httpx for online mode
pip install -e ".[llm-offline]" # vllm for offline mode
```
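
To make the account-pool idea concrete, here is a minimal sketch of round-robin key rotation that skips keys which are cooling down after a rate-limit response. `Account`, `AccountPool`, and their methods are hypothetical names for illustration; the real operator's classes and policy may differ.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class Account:
    api_key: str
    cooldown_until: float = 0.0  # monotonic time before which this key is skipped


@dataclass
class AccountPool:
    accounts: list[Account]
    _next: int = field(default=0, repr=False)

    def acquire(self, now: float | None = None) -> Account | None:
        """Return the next account that is not cooling down, or None if all are."""
        now = time.monotonic() if now is None else now
        for offset in range(len(self.accounts)):
            index = (self._next + offset) % len(self.accounts)
            account = self.accounts[index]
            if account.cooldown_until <= now:
                self._next = (index + 1) % len(self.accounts)
                return account
        return None

    def mark_rate_limited(self, account: Account, retry_after: float,
                          now: float | None = None) -> None:
        """Sideline an account for retry_after seconds (for example after HTTP 429)."""
        now = time.monotonic() if now is None else now
        account.cooldown_until = now + retry_after
```

A proxy pool with failure tracking could follow the same shape, with a failure counter in place of the cooldown timestamp.
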
### Data Writers
| Writer | Description |
|--------|-------------|
| `ParquetDataWriter` | Write to Parquet files |
| `IcebergDataWriter` | Write to Apache Iceberg tables |
## Architecture
> **Deep Dive**: See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for a comprehensive explanation of the distributed pipeline-parallel design, including ObjectRef chaining, backpressure control, bucketed deduplication, and theoretical scalability analysis.
### Pipeline Overview
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4f46e5', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6366f1', 'lineColor': '#a5b4fc', 'secondaryColor': '#1e1b4b', 'tertiaryColor': '#312e81', 'background': '#0f0f23', 'mainBkg': '#1e1b4b', 'nodeBorder': '#6366f1', 'clusterBkg': '#1e1b4b', 'clusterBorder': '#6366f1', 'titleColor': '#e0e7ff', 'edgeLabelBackground': '#312e81'}}}%%
flowchart TB
subgraph Driver["Ray Driver"]
Config[Config]
Executor[Executor]
Progress[Stats]
end
subgraph ObjectStore["Object Store"]
Batches["Shared Memory"]
end
subgraph Stage0["CPU Pool ×8"]
direction LR
W0["W0"]
W1["W1"]
W2["W2"]
Wn["..."]
W7["W7"]
end
subgraph Stage1["GPU Pool ×2"]
direction LR
GPU0["GPU0"]
GPU1["GPU1"]
end
subgraph Output["Output"]
Writer[Parquet]
end
HF["HuggingFace"] --> Driver
Driver --> ObjectStore
ObjectStore --> Stage0
Stage0 --> ObjectStore
ObjectStore --> Stage1
Stage1 --> Writer
```
### Worker Pool & Load Balancing
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#059669', 'primaryTextColor': '#fff', 'primaryBorderColor': '#10b981', 'lineColor': '#6ee7b7', 'secondaryColor': '#064e3b', 'tertiaryColor': '#065f46', 'background': '#0f0f23', 'mainBkg': '#064e3b', 'nodeBorder': '#10b981', 'clusterBkg': '#064e3b', 'clusterBorder': '#10b981'}}}%%
flowchart LR
subgraph Input["Batches"]
B0["B0"] & B1["B1"] & B2["B2"] & B3["B3"]
B4["B4"] & B5["B5"] & B6["B6"] & B7["B7"]
end
subgraph CPU["CPU Pool ×8 workers"]
C0["C0 🦀"] & C1["C1 🦀"] & C2["C2 🦀"] & C3["C3 🦀"]
C4["C4 🦀"] & C5["C5 🦀"] & C6["C6 🦀"] & C7["C7 🦀"]
end
subgraph GPU["GPU Pool ×2 workers"]
G0["G0 CLIP"]
G1["G1 CLIP"]
end
B0 --> C0
B1 --> C1
B2 --> C2
B3 --> C3
B4 --> C4
B5 --> C5
B6 --> C6
B7 --> C7
C0 & C1 & C2 & C3 --> G0
C4 & C5 & C6 & C7 --> G1
```
### Execution Sequence
```mermaid
%%{init: {'theme': 'dark'}}%%
sequenceDiagram
participant D as Driver
participant OS as ObjectStore
participant CPU as CPU ×8
participant GPU as GPU ×2
participant W as Writer
D->>OS: Submit batches
par CPU Processing
OS->>CPU: Batch 0-7
end
CPU->>OS: Processed
par GPU Processing
OS->>GPU: Batch 0-7
end
GPU->>W: Write Parquet
W->>D: Done
```
### Timeline (Parallel Execution)
```mermaid
%%{init: {'theme': 'dark'}}%%
gantt
title Batch Processing Timeline
dateFormat X
axisFormat %s
section CPU-0
B0 :c0, 0, 2
B8 :c0b, 8, 2
section CPU-1
B1 :c1, 0, 2
B9 :c1b, 8, 2
section CPU-7
B7 :c7, 0, 2
B15 :c7b, 8, 2
section GPU-0
B0 :g0a, 2, 3
B2 :g0b, 5, 3
section GPU-1
B1 :g1a, 2, 3
B3 :g1b, 5, 3
```
> **Key Points**:
>
> - **CPU Pool**: 8 workers for metadata, quality (🦀 Rust), filtering, dedup
> - **GPU Pool**: 2 workers for CLIP embeddings (limited by VRAM)
> - **Load Balancing**: Ray auto-distributes batches to idle workers
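
The scheduling idea, handing each batch to whichever worker is idle while capping how many batches are pending, can be sketched without Ray. The example below swaps in a standard-library thread pool so it stays self-contained; `run_bounded` and its parameters are illustrative names, not part of this project's API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(batches, process, num_workers=8, max_in_flight=16):
    """Process batches on a worker pool while capping the number of pending batches."""
    results = []
    in_flight = set()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for batch in batches:
            # Backpressure: stop submitting until at least one slot frees up.
            while len(in_flight) >= max_in_flight:
                done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            in_flight.add(pool.submit(process, batch))
        # Drain whatever is still running.
        done, _ = wait(in_flight)
        results.extend(f.result() for f in done)
    return results
```

Results arrive in completion order rather than submission order, which is the usual trade-off when idle workers pull the next available batch.
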
## Configuration
### Text Pipeline: CommonCrawl Processing
```yaml
# configs/example_commoncrawl.yaml
# RefinedWeb-style text extraction pipeline
data_loader:
  type: CommonCrawlLoader
  params:
    crawl_id: "CC-MAIN-2024-51"
  num_workers: 1
stages:
  - name: content_filtering
    operators:
      # RefinedWeb §G.1: URL filtering
      - name: url_filter
        params:
          url_field: "url"
      # Length filtering
      - name: text_length_filter
        params:
          min_length: 50
          max_length: 100000
          text_field: "text"
          length_type: "word"
      # Additional text quality filters
      - name: text_alphabetic_word_ration_filter
        params:
          text_field: "text"
          max_ratio: 0.8
      - name: text_avg_word_length_filter
        params:
          text_field: "text"
          lower_bound: 2.0
          upper_bound: 20.0
      - name: text_bullet_filter
        params:
          text_field: "text"
          max_bullet_ratio: 0.9
      - name: text_ellipsis_line_ratio_filter
        params:
          text_field: "text"
          max_ratio: 0.3
      - name: text_symbol_ratio_filter
        params:
          text_field: "text"
          max_symbol_to_word_ratio: 0.5
      - name: text_repetition_filter
        params:
          text_field: "text"
      # Normalize newlines before dedup
      - name: text_new_line_removal_refiner
        params:
          text_field: "text"
          max_consecutive: 2
      # Exact deduplication
      - name: text_exact_deduplicator
        params:
          text_field: "text"
    worker:
      min_replicas: 2
      max_replicas: 2
data_writer:
  type: ParquetDataWriter
  params:
    output_path: "./output/commoncrawl"
executor:
  max_samples: 10000
  batch_size: 200
  dedup_num_buckets: 1
  rejected_samples:
    enabled: true
  metrics:
    enabled: true
    generate_report: true
    debug_samples_per_operator: 20
```
### Image Pipeline: Z-Image Style
```yaml
# configs/z_image.yaml
# Image quality + aesthetic + AIGC detection pipeline
data_loader:
  type: HuggingFaceLoader
  params:
    dataset_name: "jp1924/Laion400m-1"
    split: "train"
    streaming: true
stages:
  # Stage 1: Basic metadata and quality (CPU, Rust-accelerated)
  - name: basic_stage
    operators:
      - name: image_metadata_refiner
      - name: image_technical_quality_refiner  # 🦀 Rust
      - name: image_quality_filter
        params:
          min_width: 128
          min_height: 128
          max_compression_artifacts: 0.8
      - name: image_phash_deduplicator  # 🦀 Rust
    worker:
      min_replicas: 2
      max_replicas: 8
      resources:
        cpu: 1
  # Stage 2: Embedding extraction (GPU)
  - name: embedding_stage
    operators:
      - name: image_clip_embedding_refiner
        params:
          model_name: "ViT-L-14"
          pretrained: "openai"
          use_fp16: true
      - name: image_siglip_embedding_refiner
        params:
          model_name: "google/siglip2-so400m-patch14-384"
          use_fp16: true
    worker:
      min_replicas: 1
      max_replicas: 2
      resources:
        gpu: 1
  # Stage 3: Quality scoring
  - name: scoring_stage
    operators:
      - name: image_aesthetic_quality_refiner
      - name: image_aigc_detector_refiner
        params:
          threshold: 0.5
    worker:
      min_replicas: 2
      max_replicas: 4
      resources:
        cpu: 1
data_writer:
  type: ParquetDataWriter
  params:
    output_path: "./output/z_image"
executor:
  max_samples: 100000
  batch_size: 256
  dedup_num_buckets: 16
  metrics:
    enabled: true
    generate_report: true
```
### LLM Synthesis Pipeline
```yaml
# configs/example_llm_synthesis.yaml
# Knowledge synthesis with post-processing
data_loader:
  type: HuggingFaceLoader
  params:
    dataset_name: "your-org/seed-prompts"
    split: "train"
    streaming: true
stages:
  - name: synthesis_stage
    operators:
      # Step 1: Call LLM API
      - name: llm_online_synthesis_refiner
        params:
          provider: anthropic
          model: claude-sonnet-4-20250514
          system_prompt: |
            Analyze the text and return JSON:
            {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}
          prompt_template: "Classify: {text}"
          enable_thinking: true
          thinking_budget: 10000
          accounts:
            - api_key: "${ANTHROPIC_API_KEY_1}"
            - api_key: "${ANTHROPIC_API_KEY_2}"
          proxies:
            - "http://user:pass@proxy1:8080"
          max_concurrent: 8
      # Step 2: Extract structured output
      - name: llm_response_parser_refiner
        params:
          input_field: llm_response
          parse_mode: json
          field_mapping:
            category: "category"
            confidence: "confidence"
            reasoning: "reasoning"
          required_fields: ["category", "confidence"]
          field_types:
            category: str
            confidence: float
    worker:
      num_replicas: 1
      resources:
        cpu: 2
data_writer:
  type: ParquetDataWriter
  params:
    output_path: "./output/llm_synthesis"
executor:
  max_samples: 10000
  batch_size: 64
```
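
The parsing step in the config above (JSON mode, required fields, type coercion) can be approximated in a few lines. `parse_llm_response` is a hypothetical helper written for illustration; the real `LLMResponseParserRefiner` also supports regex and JMESPath extraction and may handle failures differently.

```python
import json
import re


def parse_llm_response(response, required_fields, field_types):
    """Extract a JSON object from an LLM reply and validate it.

    Returns the parsed dict, or None when parsing or validation fails.
    """
    # Take everything from the first '{' to the last '}' in the reply.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(name not in data for name in required_fields):
        return None
    try:
        for name, caster in field_types.items():
            if name in data:
                data[name] = caster(data[name])
    except (TypeError, ValueError):
        return None
    return data
```

Returning `None` on any failure lets a downstream filter drop malformed responses instead of writing partial records.
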
## Performance
### Text Pipeline (CommonCrawl)
```text
============================================================
Pipeline: CommonCrawl text extraction (1M records)
Hardware: 8 CPU cores
============================================================
stage_0:
[Stage Summary]
Input: 1,000,000 → Output: 945,866 (94.6% pass)
Total time: 49.11s
Throughput: 20,362 records/sec
URLFilter: 20,362 rec/sec (98.1% pass) # RefinedWeb §G.1
TextLengthFilter: 1,976,454 rec/sec (96.4% pass) # Near instant
============================================================
Projections:
10M records → ~8 minutes
100M records → ~1.4 hours
1B records → ~14 hours
```
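
The projections follow from dividing the record count by the measured stage throughput, assuming throughput stays constant at larger scale on the same hardware. A small sketch of that arithmetic:

```python
THROUGHPUT_RECORDS_PER_SEC = 20_362  # stage throughput reported in the run above


def projected_hours(num_records: int, throughput: float = THROUGHPUT_RECORDS_PER_SEC) -> float:
    """Estimated wall-clock hours at a constant throughput."""
    return num_records / throughput / 3600


for n in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {projected_hours(n):.2f} h")
```

These are linear extrapolations from a 1M-record run, so they ignore effects such as network limits or growth of the deduplication state.
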
### Image Pipeline (LAION)
Benchmark on Mac M1 Pro (MPS):
```text
============================================================
Pipeline: Image quality + embedding (1K records)
============================================================
stage_0 (CPU, Rust-accelerated):
[Stage Summary]
Input: 1,000 → Output: 898 (89.8% pass)
Total time: 0.61s
Throughput: 1,630 records/sec
ImageMetadataRefiner: 27,000 rec/sec
ImageTechnicalQualityRefiner: 2,500 rec/sec 🦀 Rust
ImageQualityFilter: 4,200,000 rec/sec
ImagePhashDeduplicator: 1,500 rec/sec 🦀 Rust
stage_1 (GPU):
[Stage Summary]
Input: 898 → Output: 898
Total time: 6.80s
Throughput: 132 records/sec
ImageClipEmbeddingRefiner: 132 rec/sec 🖥️ GPU
============================================================
```
## Project Structure
```text
mega-data-factory/
├── mega_data_factory/
│   ├── cli.py                     # CLI entry point (mdf command)
│   ├── framework/
│   │   ├── executor.py            # Pipeline orchestration
│   │   ├── stage_actor.py         # StageActor
│   │   ├── loader_actor.py        # LoaderActor
│   │   ├── dedup_backend.py       # DedupBackend (ABC), ExactDedupBackend, SemanticDedupBackend
│   │   ├── operator.py            # Operator, Refiner, Filter, Deduplicator
│   │   ├── config.py              # YAML config parsing
│   │   ├── registry.py            # Component registries
│   │   └── metrics/               # Metrics collection & reporting
│   ├── loaders/
│   │   ├── huggingface_loader.py  # HuggingFace datasets
│   │   └── commoncrawl_loader.py  # CommonCrawl WARC files
│   ├── operators/
│   │   ├── refiners/              # Refiners (text, image, video)
│   │   │   └── llm_synthesis/     # LLM synthesis (online, offline, parser)
│   │   ├── filters/               # Text + Image filters
│   │   └── dedup/                 # Deduplicators (phash, minhash)
│   ├── writers/
│   │   ├── parquet_writer.py      # Parquet output
│   │   └── iceberg_writer.py      # Apache Iceberg output
│   └── models/                    # Model trainers (aesthetic, AIGC, k-means)
├── src/lib.rs                     # 🦀 Rust operators (quality, phash, HTML extraction)
├── configs/                       # Pipeline configurations
│   ├── z_image.yaml               # Image pipeline
│   ├── example_commoncrawl.yaml   # Text pipeline
│   └── example_llm_synthesis.yaml # LLM synthesis pipeline
├── tests/                         # Unit tests
├── Cargo.toml                     # Rust dependencies
└── pyproject.toml                 # Python config (maturin build)
```
## Extending the Pipeline
### Custom Text Filter
```python
from mega_data_factory.framework import Filter, OperatorRegistry


class MyTextFilter(Filter):
    def __init__(self, min_words: int = 50):
        super().__init__()
        self.min_words = min_words

    def should_keep_batch(self, records: list[dict]) -> list[bool]:
        return [len(r.get("text", "").split()) >= self.min_words for r in records]


OperatorRegistry.register("MyTextFilter", MyTextFilter)
```
### Custom Image Refiner
```python
from mega_data_factory.framework import Refiner, OperatorRegistry
import pyarrow as pa


class MyImageRefiner(Refiner):
    def refine_batch(self, records: list[dict]) -> None:
        for record in records:
            # compute_score is a placeholder for your own scoring function
            record["my_score"] = compute_score(record["image"])

    def get_output_schema(self) -> dict[str, pa.DataType]:
        return {"my_score": pa.float32()}


OperatorRegistry.register("MyImageRefiner", MyImageRefiner)
```
## Key Features
- **Pipeline Parallelism**: Ray ObjectRef chaining enables concurrent stage execution without blocking ([details](docs/ARCHITECTURE.md#pipeline-parallelism-via-objectref-chaining))
- **Distributed Data Loading**: Sharded file loading with checkpoint support for fault recovery
- **Backpressure Control**: Bounded in-flight batches prevent OOM on large datasets
- **Bucketed Deduplication**: Distributed state sharding scales to 100B+ keys ([details](docs/ARCHITECTURE.md#distributed-deduplication))
- **Rust Acceleration**: 10-25x speedup for image quality, hashing, and HTML extraction
- **GPU Optimization**: CLIP/SigLIP embedding extraction with FP16 and batch inference
- **Elastic Scaling**: Dynamic worker allocation with min/max replicas per stage
- **LLM Synthesis**: Online (API) and offline (vLLM) modes with account/proxy pools and response parsing
- **Config-Driven**: YAML configs define entire pipelines with no code changes
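
The bucketed deduplication idea from the list above can be sketched as follows: each key is routed by a stable hash to one of N buckets, and each bucket keeps its own seen-set, so the state can be split across workers. This single-process version only illustrates the routing; `bucket_for` and `BucketedDedup` are hypothetical names, and the distributed design is described in docs/ARCHITECTURE.md.

```python
import hashlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a dedup key to a bucket with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


class BucketedDedup:
    """Each bucket owns a disjoint slice of the key space and its own seen-set."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [set() for _ in range(num_buckets)]

    def is_new(self, key: str) -> bool:
        """Return True the first time a key is seen, False on every repeat."""
        bucket = self.buckets[bucket_for(key, len(self.buckets))]
        if key in bucket:
            return False
        bucket.add(key)
        return True
```

Because a given key always lands in the same bucket, buckets never need to consult each other, which is what lets the state be sharded.
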
## References
### Text Data Pipelines
- [RefinedWeb (arXiv:2306.01116)](https://arxiv.org/pdf/2306.01116) - URL filtering, trafilatura, MassiveText dedup
- [FineWeb](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) - 15T token dataset, quality filtering
- [DCLM (arXiv:2406.11794)](https://arxiv.org/pdf/2406.11794) - Data curation for language models
- [Dolma (arXiv:2402.00159)](https://arxiv.org/pdf/2402.00159) - Open corpus for LLM pretraining
### Image & Vision-Language
- [Z-Image (arXiv:2511.22699)](https://arxiv.org/pdf/2511.22699) - Image generation foundation model data
- [DataComp (arXiv:2304.14108)](https://arxiv.org/pdf/2304.14108) - CLIP filtering benchmark
- [LAION-5B (arXiv:2210.08402)](https://arxiv.org/pdf/2210.08402) - Large-scale image-text dataset
### Tools & Models
- [OpenCLIP](https://github.com/mlfoundations/open_clip) - CLIP implementation
- [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) - Vision encoder
- [dom_smoothie](https://github.com/nicr9/dom_smoothie) - Rust readability.js port
## License
MIT License
## Citation
```bibtex
@software{mega_data_factory,
author = {Duo An},
title = {Mega Data Factory},
year = {2025},
publisher = {GitHub},
url = {https://github.com/duoan/mega-data-factory}
}
```