{"id":46029419,"url":"https://github.com/russfellows/dl-driver","last_synced_at":"2026-03-01T03:33:02.406Z","repository":{"id":315840819,"uuid":"1011601484","full_name":"russfellows/dl-driver","owner":"russfellows","description":"Realistic Driver for AI/ML Storage Workloads.  Part of the sai3 project that delivers multi-protocol storage access for AI/ML workflows. ","archived":false,"fork":false,"pushed_at":"2025-11-14T18:58:36.000Z","size":1546,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-14T20:36:04.779Z","etag":null,"topics":["ai-ml","azure-blob","benchmark","gcs","hacktoberfest","jax","nfs","pytorch","rust","s3","sai3","storage","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/russfellows.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-01T04:33:39.000Z","updated_at":"2025-11-14T18:44:12.000Z","dependencies_parsed_at":"2025-09-21T05:39:49.490Z","dependency_job_id":"ebc696da-da35-49f6-ab6f-0f6fa8e22a6f","html_url":"https://github.com/russfellows/dl-driver","commit_stats":null,"previous_names":["russfellows/dl-driver"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/russfellows/dl-driver","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fdl-driver","tags_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/russfellows%2Fdl-driver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fdl-driver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fdl-driver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/russfellows","download_url":"https://codeload.github.com/russfellows/dl-driver/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fdl-driver/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29959375,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T01:47:18.291Z","status":"online","status_checked_at":"2026-03-01T02:00:07.437Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-ml","azure-blob","benchmark","gcs","hacktoberfest","jax","nfs","pytorch","rust","s3","sai3","storage","tensorflow"],"created_at":"2026-03-01T03:33:01.756Z","updated_at":"2026-03-01T03:33:02.383Z","avatar_url":"https://github.com/russfellows.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dl-driver\n\n**A tool for performing realistic testing of storage performance when running AI/ML 
workloads**\n\n[![Rust](https://img.shields.io/badge/rust-1.91.0+-blue.svg)](https://www.rust-lang.org)\n[![Version](https://img.shields.io/badge/version-0.8.11-green.svg)](./docs/Changelog.md)\n[![Build](https://img.shields.io/badge/build-passing-success.svg)](#compilation-status)\n[![Formats](https://img.shields.io/badge/formats-3%20validated-brightgreen.svg)](#format-compatibility)\n[![Validation](https://img.shields.io/badge/tests-133%20passing-success.svg)](#testing--validation)\n[![Storage](https://img.shields.io/badge/storage-4%20backends-orange.svg)](#storage-backends)\n[![Distributed](https://img.shields.io/badge/distributed-multi--agent-purple.svg)](#distributed-execution)\n[![Architecture](https://img.shields.io/badge/architecture-unified-blue.svg)](#architecture-overview)\n[![REUSE status](https://api.reuse.software/badge/github.com/russfellows/dl-driver)](https://api.reuse.software/info/github.com/russfellows/dl-driver)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![ScanCode Compatible](https://img.shields.io/badge/ScanCode-Compatible-green.svg)](https://scancode.io/)\n\n## 🚀 Overview\n\n**dl-driver** is a tool for testing storage performance during AI/ML workloads.  For training workloads it supports running data generation, data loading and checkpoint tests that provide **format compatibility** with standard Python libraries. 
Built in Rust for performance and reliability, it serves as a drop-in replacement for [DLIO benchmarks](https://github.com/argonne-lcf/dlio_benchmark) while delivering enterprise-grade capabilities through the powerful [s3dlio](https://github.com/russfellows/s3dlio) library.\n\n**Key Achievement**: Validation of object/file formats with numpy, h5py, and TensorFlow provides integration with existing ML pipelines.\n\n## 🎯 Current Status\n- **🎉 v0.8.11 RELEASED**: Updated to s3dlio v0.9.18 with dependency synchronization\n- **🎉 v0.8.10 RELEASED**: Realistic checkpoint sizes (100MB+) with s3dlio integration and architecture fixes\n- **🎉 v0.8.9 RELEASED**: Multi-array NPZ + TFRecord index generation\n- **🎯 NPZ ENHANCEMENT**: Multi-array support via s3dlio's build_multi_npz() (data + labels + metadata)\n- **📊 TFRECORD INDICES**: Automatic index file generation for TensorFlow Data Service compatibility\n- **🎉 v0.8.8**: Distributed multi-rank with file sharding and bug fixes \n- **🎯 DISTRIBUTED MULTI-RANK**: Complete Phase 1 \u0026 2 implementation with interleaved/contiguous sharding\n- **📊 ACCURATE PERCENTILES**: Bucket-level histogram aggregation for distributed workloads (\u003c1% error)\n- **⚡ ACCELERATOR UTILIZATION**: Fixed AU calculation (now compute_time / batch_time, not inverted)\n- **📝 UNIFIED OUTPUT**: Consistent dual-perspective format (Storage + AI/ML) across all modes\n- **🔧 FIRST-BATCH EXCLUSION**: Steady-state metrics exclude cold-start batch for accuracy\n- **📊 LIVE STATS STREAMING**: Real-time progress updates via gRPC streaming (1s intervals)\n- **📈 PROGRESS BARS**: Multi-line display with percentage, epoch counter, and detailed statistics\n- **🤝 STARTUP HANDSHAKE**: READY/ERROR validation before workload execution\n- **🎯 ZERO WARNINGS**: Production-quality code with zero compiler warnings\n- **⚡ MULTI-ENDPOINT**: Load balance across multiple S3/storage endpoints with round-robin or least-connections\n- **♻️ CHECKPOINT RELOAD**: Resume training from 
saved checkpoints with --resume-from-checkpoint flag\n- **💾 CHECKPOINT SUPPORT**: Step-based and epoch-based checkpointing across all storage backends\n- **🔧 CLI SIMPLIFIED**: Removed legacy commands, unified interface with validate/--dry-run\n- **🎉 DISTRIBUTED CONTROLLER**: Multi-agent orchestration for true distributed workloads\n- **🌐 MULTI-NODE EXECUTION**: Coordinate workloads across multiple hosts with shared/local storage\n- **📊 HISTOGRAM AGGREGATION**: Accurate percentile calculation with \u003c1% error for distributed workloads\n- **📁 RESULTS DIRECTORY**: Complete, reproducible results with per-agent and consolidated metrics\n- **✅ 133/133 TESTS PASSING**: Full validation across all features and backends\n\n### Core Capabilities\n- **🎯 Multi-Array NPZ**: Create NPZ archives with multiple named arrays (data, labels, metadata) using s3dlio's zero-copy API\n- **📊 TFRecord Indices**: Automatic index generation for TensorFlow Data Service (16 bytes/record, optional separate folder)\n- **🎯 Distributed Multi-Rank**: Complete Phase 1 \u0026 2 implementation with file sharding (interleaved/contiguous strategies)\n- **📊 Accurate Percentiles**: Bucket-level HDR histogram aggregation for distributed workloads (\u003c1% error)\n- **⚡ Accelerator Utilization**: Fixed AU metric calculation (compute_time / batch_time ratio)\n- **📝 Unified Output Format**: Consistent dual-perspective reporting (Storage I/O + AI/ML Training)\n- **🔧 Steady-State Metrics**: First-batch exclusion prevents cold-start skew in statistics\n- **⚠️ Storage Latency**: Currently reports 0µs (full instrumentation planned - see `docs/STORAGE_LATENCY_LIMITATION.md`)\n- **📊 Live Stats Streaming**: Real-time progress updates via gRPC streaming with 1-second intervals\n- **📈 Progress Bars**: Multi-line display showing percentage, epoch counter, and detailed I/O statistics\n- **🤝 Startup Handshake**: READY/ERROR validation ensures all agents are healthy before workload starts\n- **⏱️ Microsecond Precision**: 
All distributed mode latencies now displayed in microseconds (µs) for accuracy\n- **📊 Distributed Histogram Aggregation**: Bucket-level HDR histogram merging for accurate percentiles across agents\n- **📁 Enhanced Results Capture**: console.log includes all completion messages, latencies, and throughput statistics\n- **⚡ Multi-Endpoint Load Balancing**: Distribute requests across multiple storage endpoints (round-robin or least-connections)\n- **♻️ Checkpoint Reload**: Resume training from saved checkpoints with automatic state restoration\n- **💾 Checkpoint Plugin**: Step-based and epoch-based checkpointing with multi-backend support (file://, s3://, az://, gs://)\n- **🔧 Clean CLI**: Unified interface with validate and --dry-run as aliases, legacy commands removed\n- **🌐 Multi-Agent Orchestration**: Controller coordinates workloads across multiple agent instances\n- **💓 Coordinated Start**: Synchronized workload execution with health checking\n- **📊 Aggregate Metrics**: Automatic collection and aggregation from all agents with histogram-based percentiles\n- **📁 Structured Results**: Complete results directory with per-agent TSV files and consolidated bucket-level histograms\n- **🗂️ Path Isolation**: Agent-specific path prefixes for local storage isolation\n- **☁️ Shared Storage**: Automatic detection and handling of GCS/S3/Azure shared backends (--shared-storage flag)\n- **✅ E2E Validated**: 2-node and 4-node configurations tested (local + cloud storage)\n- **📈 Performance**: Multi-GiB/s aggregate throughput with accurate percentile tracking\n\n**For storage I/O replay**, use [sai3-bench](https://github.com/russfellows/sai3-bench) instead.\n\n## 📚 Documentation\n\n**👉 For complete documentation, see [docs/USER_GUIDE.md](docs/USER_GUIDE.md)**\n\n### Quick Links\n\n- **[User Guide](docs/USER_GUIDE.md)** - Comprehensive guide covering all features\n- **[Quick Start](docs/QUICK_START.md)** - Get started in minutes\n- **[Distributed 
Setup](tests/dlio_configs/DISTRIBUTED_README.md)** - Multi-agent orchestration guide\n- **[Changelog](docs/Changelog.md)** - Version history and release notes\n- **[Dual Metrics](docs/DUAL_METRICS_REPORTING.md)** - Metrics specification\n- **[Results Directory Format](docs/RESULTS_DIRECTORY_FORMAT.md)** - Structured results output specification\n\n## 🌐 Distributed Execution\n\n### Multi-Agent Orchestration\nExecute DLIO workloads across multiple agent instances with centralized controller:\n\n```bash\n# Start agent processes on each host\n# Host 1:\n./target/release/dl_driver_agent --agent-id agent-0 --port 50051 --bind-addr 0.0.0.0 \u0026\n\n# Host 2:\n./target/release/dl_driver_agent --agent-id agent-1 --port 50051 --bind-addr 0.0.0.0 \u0026\n\n# Run distributed workload from controller\n./target/release/dl-driver distributed run \\\n  --config tests/dlio_configs/distributed_2node_local.yaml \\\n  --agents http://host1:50051,http://host2:50051 \\\n  --path-template \"{id}/\"\n\n# Output shows aggregated results:\n╔════════════════════════════════════════════════╗\n║   Distributed Workload Complete! 
🎉           ║\n╚════════════════════════════════════════════════╝\n\n📊 Storage Performance (I/O Perspective):\n   Total Throughput: 687.5 MiB/s\n   Total Operations: 40\n   Errors: 0\n\n🤖 AI/ML Training Performance (Training Perspective):\n   Training Velocity: 297.9 samples/s, 45.8 batches/s\n   Pipeline Efficiency: 37.8%\n```\n\n### Storage Backend Modes\n\n**Local Storage** (requires path template for agent isolation):\n```bash\n# Each agent writes to separate subdirectory\n./target/release/dl-driver distributed run \\\n  --config distributed_local.yaml \\\n  --agents http://host1:50051,http://host2:50052 \\\n  --path-template \"{id}/\"\n# Creates: /tmp/data/agent-0/, /tmp/data/agent-1/, etc.\n```\n\n**Shared Storage** (no path template needed):\n```bash\n# All agents write to same GCS/S3 bucket\n./target/release/dl-driver distributed run \\\n  --config distributed_gcs.yaml \\\n  --agents http://host1:50051,http://host2:50052\n# All write to: gs://bucket/distributed-test/\n```\n\n### Key Distributed Features\n- **🌐 Multi-Host Orchestration**: Controller coordinates agents across network\n- **💓 Health Checking**: Automatic agent health verification before execution\n- **🔗 Coordinated Start**: Synchronized workload start across all agents\n- **📊 Aggregate Metrics**: Automatic collection and aggregation from all agents\n- **🗂️ Path Isolation**: Agent-specific subdirectories for local storage\n- **☁️ Shared Storage**: Automatic detection of GCS/S3/Azure shared backends\n- **📈 Dual Metrics**: Separate storage and AI/ML training perspectives\n\nSee `tests/dlio_configs/DISTRIBUTED_README.md` for complete usage guide.\n\n## 🌟 Multi-Process Scaling Usage\n\n### Multi-Rank Distributed Execution\nExecute DLIO workloads across multiple processes with shared memory coordination:\n\n```bash\n# 2-Process execution (simulating 2 GPUs)\n./target/release/dl-driver run --config config.yaml --world-size 2 --rank 0 \u0026\n./target/release/dl-driver run --config config.yaml 
--world-size 2 --rank 1 \u0026\n\n# 4-Process execution (simulating 4 GPUs) \n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 0 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 1 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 2 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 3 \u0026\n\n# Rank 0 will display aggregated results:\n🎉 Plan A1 Multi-GPU Results (Shared Memory Coordination):\n================================================================\nTotal files processed: 28\nTotal data read: 0.40 GiB\nCombined throughput: 11.16 GiB/s\nGlobal runtime: 0.071s\nNumber of ranks: 4\n✅ Multi-rank coordination successful - NO TEMP FILES USED\n```\n\n### Key Multi-Process Features\n- **🔗 Shared Memory Coordination**: Zero temp files, atomic operations, cross-process barriers\n- **📊 Automatic Aggregation**: Rank 0 displays combined performance across all processes  \n- **⚡ Synchronized Execution**: All ranks coordinate start/stop for accurate timing\n- **🎯 Interleaved Sharding**: Optimal data distribution across ranks\n- **🧹 Automatic Cleanup**: Proper shared memory cleanup on completion or failure\n\n## 🚀 Single-Process DLIO Execution\n\n```bash\n# Build and run standard DLIO workload\ncargo build --release\n./target/release/dl-driver run --config tests/dlio_configs/minimal_config.yaml\n\n# Generate data separately (optional)\n./target/release/dl-driver generate --config config.yaml\n\n# Validate configuration\n./target/release/dl-driver validate --config config.yaml\n\n# MLPerf compliance mode (enhanced reporting)\n./target/release/dl-driver run --mlperf --config config.yaml --format json\n```\n\n### ✨ Key Features\n\n- **🌐 Distributed Controller**: Multi-agent orchestration with coordinated start and histogram-based aggregate metrics\n- **📁 Results Directory**: Complete, reproducible results with per-agent and consolidated TSV files\n- **📊 
Histogram Aggregation**: Accurate percentile calculation (\u003c1% error) for distributed workloads\n- **🗂️ Directory Tree Modes**: 3-mode system (Flat, DLIO Sharding, Hierarchical) for realistic dataset organization\n- **🔍 Dry-Run Validation**: `--dry-run` flag validates configs and shows workload summary before execution\n- **🌟 Multi-Process Scaling**: `--world-size N --rank R` distributed execution with shared memory coordination\n- **🔥 Enterprise Coordination**: Atomic operations, cross-process barriers, zero temp files  \n- **🚀 TRUE DLIO Parallel I/O**: Background workers with I/O+compute overlap for realistic performance\n- **🎯 Complete Format Compatibility**: NPZ, HDF5, TFRecord validated with numpy, h5py, TensorFlow\n- **🏪 Universal Storage**: File, S3/MinIO, Azure Blob, DirectIO backends with unified interface  \n- **📋 DLIO Compatible**: Drop-in replacement for existing DLIO benchmark 
configurations\n- **📊 Dual Metrics**: Separate storage (ops/s, MiB/s) and AI/ML (samples/s, batches/s) perspectives\n- **☁️ Production Cloud Ready**: Real S3 and Azure credential support\n- **🧪 Comprehensively Validated**: 119 comprehensive tests with golden reference validation and MLCommons DLIO compatibility\n\n## 🧠 Workstream A: Realistic AI/ML Framework Simulation\n\n### Framework-Specific Workload Profiles\nExecute workloads optimized for specific AI/ML frameworks:\n\n```bash\n# PyTorch-optimized workload simulation\n./target/release/dl-driver run --config config.yaml --profile torch\n\n# TensorFlow-optimized configuration  \n./target/release/dl-driver run --config config.yaml --profile tf\n\n# JAX-optimized workload patterns\n./target/release/dl-driver run --config config.yaml --profile jax\n```\n\n### Advanced Metrics Export \u0026 CI Integration\nExport comprehensive performance metrics for automated analysis:\n\n```bash\n# Export metrics to JSON for programmatic analysis\n./target/release/dl-driver run --config config.yaml --metrics-json results.json\n\n# Export metrics to CSV for spreadsheet analysis\n./target/release/dl-driver run --config config.yaml --metrics-csv results.csv\n\n# Both formats simultaneously for comprehensive reporting\n./target/release/dl-driver run --config config.yaml --metrics-json metrics.json --metrics-csv metrics.csv\n```\n\n### Operation Log Validation \u0026 Benchmarking\nValidate workload performance against reference operation logs:\n\n```bash\n# Validate against compressed operation log (supports .csv.zst, .jsonl.zst)\n./target/release/dl-driver run --config config.yaml --op-log reference-benchmark.csv.zst\n\n# Example with comprehensive validation and metrics export\n./target/release/dl-driver run \\\n    --config config.yaml \\\n    --profile torch \\\n    --metrics-json validation-results.json \\\n    --op-log production-reference.csv.zst\n\n# Validation output with CI-friendly exit codes:\n✅ PASS: Workload performance 
within tolerance (±5.0%)\n📊 Files processed: 1000 (reference: 1000)  \n📊 Throughput: 12.4 GiB/s (reference: 12.1 GiB/s, +2.5%)\n📊 Total runtime: 45.2s (reference: 46.1s, -2.0%)\n```\n\n### Key Workstream A Features\n- **🧠 Intelligent Profiles**: Framework-specific optimizations for PyTorch, TensorFlow, and JAX\n- **📊 Production Metrics**: JSON/CSV export for CI/CD pipelines and performance tracking\n- **🔍 Validation Engine**: Compare against reference operation logs with configurable tolerance\n- **⚡ Real-World Testing**: Validated with 2.78M record operation logs from production systems\n- **🎯 CI Integration**: PASS/FAIL validation with proper exit codes for automated testing\n\n## 🎯 Technical Specifications\n\n### Binaries\n- **`dl-driver`**: Main CLI for single-process, multi-rank, and distributed controller execution\n- **`dl_driver_agent`**: Standalone agent process for distributed workloads (gRPC service)\n\n### Storage Backends\n- **File System**: POSIX-compliant file I/O with DirectIO optimization\n- **Cloud Storage**: S3/MinIO and Azure Blob with credential support\n- **Performance**: Multi-GiB/s throughput with enterprise-grade reliability\n\n### Data Formats  \n- **NPZ, HDF5, TFRecord**: 100% compatible with numpy, h5py, and TensorFlow\n- **Framework Support**: PyTorch, TensorFlow, and JAX configuration profiles\n- **Validation**: Comprehensive test suite ensuring standard library compatibility\n\n## 🏆 Key Achievements\n\n### 🎯 Production-Ready AI/ML Data Pipeline\ndl-driver has evolved into a complete, enterprise-grade testing framework for AI/ML workloads:\n\n- **100% Format Compatibility**: All generated files work seamlessly with standard Python libraries (numpy, h5py, TensorFlow)\n- **Distributed Orchestration**: Multi-agent coordination with histogram-based percentile aggregation (\u003c1% error)\n- **Results Directory**: Complete, reproducible results with per-agent and consolidated metrics in TSV format\n- **DLIO Drop-in Replacement**: Full 
MLCommons configuration compatibility with enhanced features\n- **Multi-Backend Excellence**: Unified performance across File, S3, Azure, and DirectIO storage\n- **Enterprise Validation**: Comprehensive test suite ensuring reliability and correctness\n\n### 📊 Validation Confidence\n```\n✅ Core Tests:       60/60 tests passing (metrics, config, workload, distributed, histogram aggregation)\n✅ CLI Tests:        29/29 tests passing (configuration, backend integration)\n✅ Integration Tests: 10/10 tests passing (histogram E2E, results directory workflow)\n✅ Framework Tests:   7/7 tests passing (PyTorch integration, validation, serialization)\n✅ Format Tests:      5/5 tests passing (NPZ, HDF5, TFRecord)\n✅ Other Tests:       8/8 tests passing (replay, coordination, etc.)\n✅ Total Coverage:  119/119 comprehensive tests validating all functionality\n```\n\n## 🏗️ Architecture\n\ndl-driver follows a clean workspace architecture with 6 focused crates:\n\n```\nreal_dlio/\n├── crates/\n│   ├── cli/          # Command-line interface\n│   ├── core/         # Workload orchestration and config parsing  \n│   ├── frameworks/   # Framework integrations (PyTorch, TensorFlow, JAX)\n│   ├── storage/      # Storage backend abstractions\n│   ├── formats/      # Data format handlers (HDF5, NPZ, etc.)\n│   └── py_api/       # Python bindings (PyO3)\n├── tests/            # Integration and regression tests\n└── docs/             # Documentation and changelog\n```\n\n## 🚀 Quick Start\n\n### Installation\n\n```bash\ngit clone https://github.com/russfellows/dl-driver.git\ncd dl-driver\ncargo build --release\n```\n\n### Basic Usage\n\n```bash\n# Generate test datasets with different formats\n./target/release/dl-driver generate --config tests/dlio_configs/minimal_config.yaml\n\n# Run DLIO-compatible workloads (unified execution engine)\n./target/release/dl-driver run --config tests/dlio_configs/unet3d_config.yaml\n\n# Validate configuration without running\n./target/release/dl-driver validate 
--config tests/dlio_configs/bert_config.yaml\n\n# Multi-rank execution (shared memory coordination)\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 0 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 1 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 2 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 4 --rank 3 \u0026\n\n# Distributed multi-agent execution\n./target/release/dl_driver_agent --agent-id agent-0 --port 50051 \u0026\n./target/release/dl_driver_agent --agent-id agent-1 --port 50052 \u0026\n./target/release/dl-driver distributed run \\\n  --config tests/dlio_configs/distributed_2node_local.yaml \\\n  --agents http://host1:50051,http://host2:50052 \\\n  --path-template \"{id}/\"\n\n# Framework-specific workload profiles (Workstream A)\n./target/release/dl-driver run --config config.yaml --profile torch\n./target/release/dl-driver run --config config.yaml --profile tf\n./target/release/dl-driver run --config config.yaml --profile jax\n\n# Metrics export for CI/CD integration (Workstream A)\n./target/release/dl-driver run --config config.yaml --metrics-json results.json\n./target/release/dl-driver run --config config.yaml --metrics-csv results.csv\n\n# Operation log validation (Workstream A)\n./target/release/dl-driver run --config config.yaml --op-log reference.csv.zst\n\n# Run format validation (requires Python environment)\npython tools/validation/validate_formats.py\n```\n\n### Command Overview\n```bash\ndl-driver --help                    # Show all available commands\ndl-driver generate --help           # Generate synthetic datasets  \ndl-driver run --help               # Run DLIO workloads (with optional MLPerf mode)\ndl-driver validate --help          # Validate configurations\ndl-driver distributed --help       # Distributed multi-agent orchestration\n\n# Multi-rank execution\ndl-driver run --world-size N --rank R     # Multi-process 
shared memory coordination\n\n# Distributed execution\ndl_driver_agent --agent-id ID --port PORT  # Start agent process\ndl-driver distributed run --agents LIST    # Controller for multi-agent workloads\n\n# Workstream A: Advanced execution options\ndl-driver run --profile [torch|tf|jax]     # Framework-specific optimization profiles\ndl-driver run --metrics-json FILE          # Export metrics in JSON format\ndl-driver run --metrics-csv FILE           # Export metrics in CSV format  \ndl-driver run --op-log FILE                # Validate against reference operation log\n```\n\n## 📝 Configuration\n\nDLIO-compatible YAML configuration with multi-backend storage support:\n\n```yaml\ndataset:\n  data_folder: file:///mnt/vast1/data/  # file://, s3://, az://, direct://\n  format: npz                           # npz, hdf5, tfrecord  \n  num_files_train: 1000\n\nreader:\n  batch_size: 32\n  read_threads: 4\n  \ntrain:\n  epochs: 5\n  computation_time: 0.05\n```\n\nConfiguration examples available in `tests/dlio_configs/`\n\n## 🧪 Testing \u0026 Validation\n\n```bash\n# Build and test\ncargo build --release\ncargo test\n\n# Test multi-rank coordination\n./target/release/dl-driver run --config config.yaml --world-size 2 --rank 0 \u0026\n./target/release/dl-driver run --config config.yaml --world-size 2 --rank 1\n\n# NEW: Test Workstream A features (v0.6.4)\n./target/release/dl-driver run --config config.yaml --profile torch --metrics-json test.json\n./target/release/dl-driver run --config config.yaml --op-log tests/dlio_configs/reference.csv.zst\n```\n\n### ⚠️ Known Testing Limitations\n\n**Storage Latency Measurement (v0.8.8)**: Current Phase 2 multi-rank tests use `/tmp` (tmpfs, memory-backed) with small files (64KB) that fit entirely in page cache. While metrics are logically correct, **verification requires real disk I/O testing**. 
See [docs/testing/PHASE2_VERIFICATION_PLAN.md](docs/testing/PHASE2_VERIFICATION_PLAN.md) for planned verification using:\n- `direct://` I/O to bypass page cache\n- `/mnt/test` (real disk, NOT tmpfs)\n- Large datasets (5-10GB) exceeding available RAM\n- Expected latency ranges: 5-50ms for disk I/O, \u003c1ms with prefetch\n\nThis verification is planned but not yet executed. Current 0µs latencies are consistent with prefetched+cached data but don't prove measurement correctness.\n\n### Validation Results\n- ✅ **119 comprehensive tests** passing across all features\n- ✅ **Format validation** with numpy, h5py, and TensorFlow standard libraries\n- ✅ **Distributed workloads** validated with histogram aggregation and results directory output\n- ✅ **Framework profiles** validated with PyTorch, TensorFlow, and JAX configurations\n- ✅ **Operation log validation** tested with multi-million record production datasets\n- ✅ **Metrics export** validated in JSON, CSV, and TSV formats for CI integration\n- ✅ **100% compatibility** with numpy, h5py, tensorflow\n- ✅ **MLCommons DLIO configs** fully validated\n\n### Test Categories\n- **Backend Integration**: File, S3, Azure, DirectIO validation\n- **Format Compatibility**: NPZ, HDF5, TFRecord with standard libraries\n- **DLIO Compliance**: Configuration parsing and workload execution\n- **Performance**: s3dlio AsyncPoolDataLoader benchmarks\n\n## 🛠️ Development\n\n### Prerequisites\n- Rust 1.89.0 or later\n- s3dlio library (automatically handled by Cargo)\n\n### Building from Source\n```bash\ngit clone https://github.com/russfellows/dl-driver.git\ncd dl-driver\ncargo build --release\n```\n\n### Contributing\n1. Fork the repository\n2. Create a feature branch\n3. Add tests for new functionality  \n4. Ensure all tests pass\n5. 
Submit a pull request\n\n##  Documentation\n\n- [Changelog](./docs/Changelog.md) - Detailed version history\n- [Configuration Guide](./tests/configs/) - Example configurations\n- [API Documentation](https://docs.rs/real_dlio) - Rust API docs\n\n## 🤝 Acknowledgments\n\n- [DLIO Benchmark](https://github.com/argonne-lcf/dlio_benchmark) - Original inspiration and configuration format\n- [s3dlio](https://github.com/russfellows/s3dlio) - Powerful multi-backend storage library\n- Rust ecosystem - tokio, serde, anyhow, and many other excellent crates\n\n## 📄 License \u0026 Compliance\n\nThis project maintains **enterprise-grade license compliance** with comprehensive scanning and validation.\n\n### License Information\n- **License**: [GPL-3.0-or-later](LICENSES/GPL-3.0-or-later.txt) \n- **REUSE Compliant**: Full compliance with [REUSE Specification 3.3](https://reuse.software/spec/)\n- **SPDX Standards**: All source files include proper SPDX license identifiers\n- **ScanCode Compatible**: Validated with ScanCode Toolkit for enterprise scanning\n\n### Compliance Summary\n- ✅ **201 files scanned** by ScanCode Toolkit\n- ✅ **72 files** with SPDX GPL-3.0 identifiers  \n- ✅ **80 files** with proper copyright attribution\n- ✅ **Automated CI/CD** license validation via GitHub Actions\n\n📋 **[View Detailed Compliance Report](docs/LICENSE-COMPLIANCE.md)**\n\n### Local Validation\n```bash\n# REUSE compliance check\nreuse lint\n\n# ScanCode analysis (via Docker)\ndocker run --rm -v $(pwd):/workdir sixarm/scancode \\\n  --copyright --license --package --info --license-text \\\n  --strip-root --format html-app /workdir /workdir/compliance-report.html\n```\n\n---\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frussfellows%2Fdl-driver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frussfellows%2Fdl-driver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frussfellows%2Fdl-driver/lists"}