{"id":49284802,"url":"https://github.com/avitai/datarax","last_synced_at":"2026-04-25T21:01:13.164Z","repository":{"id":332234512,"uuid":"1133129450","full_name":"avitai/datarax","owner":"avitai","description":"A Differentiable Data Pipeline Framework for JAX","archived":false,"fork":false,"pushed_at":"2026-04-24T23:56:30.000Z","size":9034,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-25T01:34:54.205Z","etag":null,"topics":["autograd","data","data-analysis","data-science","differentiable","flax-nnx","jax","jit","machine-learning","xla"],"latest_commit_sha":null,"homepage":"https://datarax.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/avitai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing/contributing_guide.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-12T23:28:16.000Z","updated_at":"2026-04-24T23:56:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/avitai/datarax","commit_stats":null,"previous_names":["avitai/datarax"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/avitai/datarax","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avitai%2Fdatarax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avitai%2Fdatarax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avitai%2
Fdatarax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avitai%2Fdatarax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/avitai","download_url":"https://codeload.github.com/avitai/datarax/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avitai%2Fdatarax/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32276628,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-25T18:29:39.964Z","status":"ssl_error","status_checked_at":"2026-04-25T18:29:32.149Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["autograd","data","data-analysis","data-science","differentiable","flax-nnx","jax","jit","machine-learning","xla"],"created_at":"2026-04-25T21:01:12.576Z","updated_at":"2026-04-25T21:01:13.151Z","avatar_url":"https://github.com/avitai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datarax: A Data Pipeline Framework for JAX\n\n[![CI](https://github.com/avitai/datarax/actions/workflows/ci.yml/badge.svg)](https://github.com/avitai/datarax/actions/workflows/ci.yml)\n[![Test 
Coverage](https://github.com/avitai/datarax/actions/workflows/test-coverage.yml/badge.svg)](https://github.com/avitai/datarax/actions/workflows/test-coverage.yml)\n[![codecov](https://codecov.io/gh/avitai/datarax/branch/main/graph/badge.svg)](https://codecov.io/gh/avitai/datarax)\n[![Build](https://github.com/avitai/datarax/actions/workflows/build-verification.yml/badge.svg)](https://github.com/avitai/datarax/actions/workflows/build-verification.yml)\n[![Summary](https://github.com/avitai/datarax/actions/workflows/summary.yml/badge.svg)](https://github.com/avitai/datarax/actions/workflows/summary.yml)\n\n[![Project Status: Active](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n\n---\n\n\u003e **Early Development - API Unstable**\n\u003e\n\u003e Datarax is in early development and undergoing rapid iteration.\n\u003e Breaking changes are expected. Pin to specific commits if stability is required.\n\u003e We recommend waiting for a stable release (v1.0) before using Datarax in production.\n\n---\n\n**Datarax** (*Data + Array/JAX*) is an extensible data pipeline framework built for JAX-based machine learning workflows. 
It leverages JAX's JIT compilation, automatic differentiation, and hardware acceleration to build data loading, preprocessing, and augmentation pipelines that run on CPUs, GPUs, and TPUs.\n\n## Key Features\n\n- **JAX-Native Design:** All core components are built on JAX's functional paradigm, with the Flax NNX module system for state management\n- **High Performance:** JIT-compiled pipelines via XLA, with built-in profiling and roofline analysis\n- **DAG Execution Engine:** Graph-based pipeline construction with branching, parallel execution, caching, and rebatching nodes\n- **Scalability:** Multi-device and multi-host data distribution with device mesh sharding\n- **Determinism:** Reproducible pipelines by default using Grain's Feistel cipher shuffling (O(1) memory)\n- **Extensibility:** Custom data sources, operators, and augmentation strategies via composable NNX modules\n- **Benchmarking Suite:** Comparative benchmarks against 12+ frameworks with CalibraX-powered analysis and regression checks\n- **Ecosystem Integration:** Works with Flax, Optax, Orbax, HuggingFace Datasets, and TensorFlow Datasets\n\n## Why Datarax?\n\nJAX has mature libraries for models (Flax), optimizers (Optax), and checkpointing (Orbax), but lacks a dedicated data pipeline framework that operates at the same level of abstraction. Existing options are either framework-agnostic loaders that return NumPy arrays (losing JIT/autodiff benefits) or wrappers around tf.data/PyTorch that introduce cross-framework overhead. Datarax aims to fill this gap. The framework is under active development with ongoing performance optimization: the architecture is functional, but throughput and API surface are still being refined.\n\n### JAX-Native from the Ground Up\n\nEvery component (sources, operators, batchers, samplers, sharders) is a Flax NNX module. 
Pipeline state is managed through NNX's variable system, which means operators can hold learnable parameters, be serialized with Orbax, and participate in JAX transformations (`jit`, `vmap`, `grad`) without special handling.\n\n### Differentiable Data Pipelines\n\nBecause operators are NNX modules, gradients flow through the entire pipeline. This enables approaches that are not possible with standard data loaders:\n\n- [Gradient-based augmentation search](examples/advanced/differentiable/01_dada_learned_augmentation_guide.py) — replacing RL-based methods like AutoAugment with direct optimization\n- [Task-optimized preprocessing](examples/advanced/differentiable/02_learned_isp_guide.py) — backpropagating task loss through every processing stage\n- [Differentiable audio synthesis](examples/advanced/differentiable/03_ddsp_audio_synthesis_guide.py) — extending the same pattern to non-vision domains\n\nSee the [differentiable pipeline examples](docs/examples/advanced/differentiable/) for details.\n\n### DAG Execution Model\n\nPipelines are directed acyclic graphs, not linear chains. The `\u003e\u003e` operator composes sequential steps, `|` creates parallel branches, and control-flow nodes (`Branch`, `Merge`, `SplitField`) handle conditional and multi-path logic. The DAG executor manages scheduling, caching, and rebatching across the graph.\n\n### Deterministic Reproducibility\n\nShuffling uses Grain's Feistel cipher permutation, which generates a full-epoch permutation in O(1) memory without materializing the index array. Combined with explicit RNG key threading through every stochastic operator, pipelines produce identical output given the same seed — across restarts, devices, and host counts.\n\n### Built-in Competitive Benchmarking\n\nThe benchmarking suite profiles datarax against 12+ frameworks (Grain, tf.data, PyTorch DataLoader, DALI, Ray Data, and others) across standardized scenarios. 
Results are converted to CalibraX runs for direction-aware metrics, regression gating, and W\u0026B export. This benchmark-driven loop is how Datarax tracks progress toward competitive throughput; current results and optimization status are tracked in the [benchmarking documentation](docs/benchmarks/index.md).\n\n## Installation\n\n```bash\n# Basic installation\nuv pip install datarax\n\n# With data loading support (HuggingFace, TFDS, audio/image libs)\nuv pip install \"datarax[data]\"\n\n# With GPU support (CUDA 12)\nuv pip install \"datarax[gpu]\"\n\n# Full development installation\nuv pip install \"datarax[all]\"\n```\n\n### macOS / Apple Silicon\n\n```bash\n# macOS CPU mode (recommended)\nuv pip install \"datarax[all-cpu]\"\nJAX_PLATFORMS=cpu python your_script.py\n\n# Metal GPU acceleration (experimental, M1/M2/M3+)\nuv pip install jax-metal\nJAX_PLATFORMS=metal python your_script.py\n```\n\n\u003e **Note:** Metal GPU acceleration is community-tested. CI runs on macOS with CPU only.\n\n## Quick Start\n\n```python\nimport jax\nimport jax.numpy as jnp\nimport numpy as np\nfrom flax import nnx\n\nfrom datarax import build_source_pipeline\nfrom datarax.dag.nodes import OperatorNode\nfrom datarax.operators import ElementOperator, ElementOperatorConfig\nfrom datarax.sources import MemorySource, MemorySourceConfig\nfrom datarax.typing import Element\n\n\ndef normalize(element: Element, key: jax.Array | None = None) -\u003e Element:\n    return element.update_data({\"image\": element.data[\"image\"] / 255.0})\n\n\ndef augment(element: Element, key: jax.Array) -\u003e Element:\n    key1, _ = jax.random.split(key)\n    flip = jax.random.bernoulli(key1, 0.5)\n    new_image = jax.lax.cond(\n        flip, lambda img: jnp.flip(img, axis=1), lambda img: img,\n        element.data[\"image\"],\n    )\n    return element.update_data({\"image\": new_image})\n\n\n# Create in-memory data source (randint's upper bound is exclusive, so 256 covers 0..255)\ndata = {\n    \"image\": np.random.randint(0, 256, (1000, 28, 28, 
1)).astype(np.float32),\n    \"label\": np.random.randint(0, 10, (1000,)).astype(np.int32),\n}\nsource = MemorySource(MemorySourceConfig(), data=data, rngs=nnx.Rngs(0))\n\n# Build pipeline with DAG-based API\nnormalizer = ElementOperator(\n    ElementOperatorConfig(stochastic=False), fn=normalize, rngs=nnx.Rngs(0),\n)\naugmenter = ElementOperator(\n    ElementOperatorConfig(stochastic=True, stream_name=\"augmentations\"),\n    fn=augment, rngs=nnx.Rngs(42),\n)\n\npipeline = (\n    build_source_pipeline(source, batch_size=32)\n    \u003e\u003e OperatorNode(normalizer)\n    \u003e\u003e OperatorNode(augmenter)\n)\n\n# Process batches\nfor i, batch in enumerate(pipeline):\n    if i \u003e= 3:\n        break\n    print(f\"Batch {i}: images {batch['image'].shape}, labels {batch['label'].shape}\")\n```\n\n### Advanced: Branching and Parallel DAGs\n\n```python\nfrom datarax.dag.nodes import OperatorNode, Merge, Branch\n\n# Define additional operators\ndef invert(element: Element, key=None) -\u003e Element:\n    return element.update_data({\"image\": 1.0 - element.data[\"image\"]})\n\ninverter = ElementOperator(\n    ElementOperatorConfig(stochastic=False), fn=invert, rngs=nnx.Rngs(0),\n)\n\ndef is_high_contrast(element):\n    return jnp.var(element.data[\"image\"]) \u003e 0.1\n\n# Build a complex DAG:\n# 1. Source -\u003e Batching\n# 2. Parallel: normalizer AND inverter (| creates a Parallel node)\n# 3. Merge: average the two branches\n# 4. 
Branch: conditional path based on image variance\ncomplex_pipeline = (\n    build_source_pipeline(source, batch_size=32)\n    \u003e\u003e (OperatorNode(normalizer) | OperatorNode(inverter))\n    \u003e\u003e Merge(\"mean\")\n    \u003e\u003e Branch(\n           condition=is_high_contrast,\n           true_path=OperatorNode(augmenter),\n           false_path=OperatorNode(normalizer),\n       )\n)\n```\n\n## Architecture\n\n```text\nsrc/datarax/\n  core/         # Base modules: DataSourceModule, OperatorModule, Element, Batcher, Sampler, Sharder\n  dag/          # DAG executor and node system (source, operator, batch, cache, control flow)\n  sources/      # MemorySource, TFDS (eager/streaming), HuggingFace (eager/streaming), ArrayRecord, MixedSource\n  operators/    # ElementOperator, MapOperator, CompositeOperator, modality-specific (image, text)\n    strategies/ # Sequential, Parallel, Branching, Ensemble, Merging execution strategies\n  samplers/     # Sequential, Shuffle (Feistel cipher), Range, EpochAware samplers\n  sharding/     # ArraySharder, JaxProcessSharder for multi-device distribution\n  distributed/  # DeviceMesh, DataParallel for multi-host training\n  batching/     # DefaultBatcher with buffer state management\n  checkpoint/   # NNXCheckpointHandler with Orbax integration\n  monitoring/   # Pipeline monitor, DAG monitor, reporters\n  performance/  # Roofline analysis, XLA optimization utilities\n  control/      # Prefetcher for asynchronous data loading\n  memory/       # Shared memory manager for multi-process data sharing\n  config/       # TOML-based configuration system with schema validation\n  cli/          # datarax CLI entry point\n  utils/        # PyTree utilities, external integration helpers\n```\n\n## Benchmarking\n\nDatarax includes a benchmarking suite for comparison against 12+ data loading frameworks across a range of workload scenarios (vision, NLP, tabular, multimodal, distributed).\n\n```bash\n# Install benchmark dependencies 
(adds PyTorch, DALI, Ray, etc.)\nuv sync --extra benchmark\n\n# Optional: install CalibraX with W\u0026B support explicitly\nuv pip install \"calibrax[wandb] @ git+https://github.com/avitai/calibrax.git\"\n\n# Run benchmarks locally\nuv run python -m benchmarks.runners.full_runner --platform cpu --repetitions 5\n\n# Run on cloud (SkyPilot)\nsky launch benchmarks/sky/gpu-benchmark.yaml --env WANDB_API_KEY=$WANDB_API_KEY\n```\n\nBenchmark results are exported to W\u0026B with charts, gap analysis, stability reports, and raw result artifacts. See [Benchmarking Guide](docs/benchmarks/index.md) for methodology and cloud deployment.\n\n## Development Setup\n\nDatarax uses `uv` as its package manager:\n\n```bash\n# Clone and setup\ngit clone https://github.com/avitai/datarax.git\ncd datarax\n\n# Automatic setup\n./setup.sh \u0026\u0026 source activate.sh\n\n# Or manual install\nuv sync --extra dev\n```\n\n### Running Tests\n\n```bash\n# CPU-only (most stable)\nJAX_PLATFORMS=cpu uv run pytest\n\n# Include benchmark test suite in the same run\nJAX_PLATFORMS=cpu uv run pytest --all-suites\n\n# Specific module\nJAX_PLATFORMS=cpu uv run pytest tests/sources/test_memory_source_module.py\n```\n\n### Docker\n\n```bash\n# Build and run\ndocker build -t datarax:latest .\ndocker run --rm --gpus all datarax:latest python -c \"import datarax, jax; print(jax.devices())\"\n\n# Benchmark images\ndocker build -f benchmarks/docker/Dockerfile.gpu -t datarax-bench:gpu .\n```\n\nSee [Docker Guide](docs/contributing/docker.md) for full details.\n\n## Documentation\n\n- [Installation Guide](docs/getting_started/installation.md)\n- [Quick Start](docs/getting_started/quick_start.md)\n- [Core Concepts](docs/getting_started/core_concepts.md)\n- [User Guide](docs/user_guide/)\n- [API Reference](docs/api_reference/index.md)\n- [Examples](docs/examples/overview.md)\n- [Benchmarking](docs/benchmarks/index.md)\n- [Contributing](docs/contributing/contributing_guide.md)\n- 
[Docker](docs/contributing/docker.md)\n\n## License\n\nDatarax is licensed under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favitai%2Fdatarax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Favitai%2Fdatarax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favitai%2Fdatarax/lists"}