https://github.com/russfellows/dl-driver
Realistic Driver for AI/ML Storage Workloads. Part of the sai3 project that delivers multi-protocol storage access for AI/ML workflows.
ai-ml azure-blob benchmark gcs hacktoberfest jax nfs pytorch rust s3 sai3 storage tensorflow
- Host: GitHub
- URL: https://github.com/russfellows/dl-driver
- Owner: russfellows
- License: gpl-3.0
- Created: 2025-07-01T04:33:39.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-11-14T18:58:36.000Z (8 months ago)
- Last Synced: 2025-11-14T20:36:04.779Z (8 months ago)
- Topics: ai-ml, azure-blob, benchmark, gcs, hacktoberfest, jax, nfs, pytorch, rust, s3, sai3, storage, tensorflow
- Language: Rust
- Homepage:
- Size: 1.47 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
README
# dl-driver
**A tool for performing realistic testing of storage performance when running AI/ML workloads**
## Overview
**dl-driver** is a tool for testing storage performance under AI/ML workloads. For training workloads it runs data generation, data loading, and checkpoint tests, and the files it produces are **format compatible** with standard Python libraries. It is built in Rust, serves as a drop-in replacement for [DLIO benchmarks](https://github.com/argonne-lcf/dlio_benchmark), and uses the [s3dlio](https://github.com/russfellows/s3dlio) library for storage access.
**Key Achievement**: Generated object and file formats are validated with numpy, h5py, and TensorFlow, so they integrate with existing ML pipelines.
## Current Status
- **v0.8.11 RELEASED**: Updated to s3dlio v0.9.18 with dependency synchronization
- **v0.8.10 RELEASED**: Realistic checkpoint sizes (100MB+) with s3dlio integration and architecture fixes
- **v0.8.9 RELEASED**: Multi-array NPZ + TFRecord index generation
- **NPZ ENHANCEMENT**: Multi-array support via s3dlio's `build_multi_npz()` (data + labels + metadata)
- **TFRECORD INDICES**: Automatic index file generation for TensorFlow Data Service compatibility
- **v0.8.8**: Distributed multi-rank with file sharding and bug fixes
- **DISTRIBUTED MULTI-RANK**: Complete Phase 1 & 2 implementation with interleaved/contiguous sharding
- **ACCURATE PERCENTILES**: Bucket-level histogram aggregation for distributed workloads (<1% error)
- **ACCELERATOR UTILIZATION**: Fixed AU calculation (now compute_time / batch_time, not inverted)
- **UNIFIED OUTPUT**: Consistent dual-perspective format (Storage + AI/ML) across all modes
- **FIRST-BATCH EXCLUSION**: Steady-state metrics exclude cold-start batch for accuracy
- **LIVE STATS STREAMING**: Real-time progress updates via gRPC streaming (1s intervals)
- **PROGRESS BARS**: Multi-line display with percentage, epoch counter, and detailed statistics
- **STARTUP HANDSHAKE**: READY/ERROR validation before workload execution
- **ZERO WARNINGS**: Production-quality code with zero compiler warnings
- **MULTI-ENDPOINT**: Load balance across multiple S3/storage endpoints with round-robin or least-connections
- **CHECKPOINT RELOAD**: Resume training from saved checkpoints with `--resume-from-checkpoint` flag
- **CHECKPOINT SUPPORT**: Step-based and epoch-based checkpointing across all storage backends
- **CLI SIMPLIFIED**: Removed legacy commands, unified interface with validate/`--dry-run`
- **DISTRIBUTED CONTROLLER**: Multi-agent orchestration for true distributed workloads
- **MULTI-NODE EXECUTION**: Coordinate workloads across multiple hosts with shared/local storage
- **HISTOGRAM AGGREGATION**: Accurate percentile calculation with <1% error for distributed workloads
- **RESULTS DIRECTORY**: Complete, reproducible results with per-agent and consolidated metrics
- **133/133 TESTS PASSING**: Full validation across all features and backends
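The multi-endpoint entry in the list above names two balancing strategies, round-robin and least-connections. As a rough illustration of how they differ, here is a minimal sketch. The `Balancer` type and its methods are hypothetical and are not taken from dl-driver or s3dlio; the tie-breaking rule (lowest index wins) is also an assumption.

```rust
/// Endpoint selection strategies named in the feature list.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Strategy {
    RoundRobin,
    LeastConnections,
}

/// Hypothetical balancer that tracks in-flight requests per endpoint.
pub struct Balancer {
    endpoints: Vec<String>,
    in_flight: Vec<usize>,
    next: usize,
    strategy: Strategy,
}

impl Balancer {
    pub fn new(endpoints: Vec<String>, strategy: Strategy) -> Self {
        assert!(!endpoints.is_empty(), "at least one endpoint is required");
        let n = endpoints.len();
        Balancer { endpoints, in_flight: vec![0; n], next: 0, strategy }
    }

    /// Pick an endpoint index and count the request as in flight.
    pub fn acquire(&mut self) -> usize {
        let idx = match self.strategy {
            // Cycle through the endpoints in order.
            Strategy::RoundRobin => {
                let i = self.next;
                self.next = (self.next + 1) % self.endpoints.len();
                i
            }
            // First endpoint with the fewest in-flight requests.
            Strategy::LeastConnections => {
                let mut best = 0;
                for (i, &n) in self.in_flight.iter().enumerate() {
                    if n < self.in_flight[best] {
                        best = i;
                    }
                }
                best
            }
        };
        self.in_flight[idx] += 1;
        idx
    }

    /// Mark one request on endpoint `idx` as finished.
    pub fn release(&mut self, idx: usize) {
        self.in_flight[idx] = self.in_flight[idx].saturating_sub(1);
    }

    /// URL of the endpoint at `idx`.
    pub fn url(&self, idx: usize) -> &str {
        &self.endpoints[idx]
    }
}
```

Round-robin ignores load and suits endpoints of equal capacity; least-connections needs the `release` call on every completion so that slow endpoints receive fewer new requests.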
### Core Capabilities
- **Multi-Array NPZ**: Create NPZ archives with multiple named arrays (data, labels, metadata) using s3dlio's zero-copy API
- **TFRecord Indices**: Automatic index generation for TensorFlow Data Service (16 bytes/record, optional separate folder)
- **Distributed Multi-Rank**: Complete Phase 1 & 2 implementation with file sharding (interleaved/contiguous strategies)
- **Accurate Percentiles**: Bucket-level HDR histogram aggregation for distributed workloads (<1% error)
- **Accelerator Utilization**: Fixed AU metric calculation (compute_time / batch_time ratio)
- **Unified Output Format**: Consistent dual-perspective reporting (Storage I/O + AI/ML Training)
- **Steady-State Metrics**: First-batch exclusion prevents cold-start skew in statistics
- **Storage Latency**: Currently reports 0µs (full instrumentation planned - see `docs/STORAGE_LATENCY_LIMITATION.md`)
- **Live Stats Streaming**: Real-time progress updates via gRPC streaming with 1-second intervals
- **Progress Bars**: Multi-line display showing percentage, epoch counter, and detailed I/O statistics
- **Startup Handshake**: READY/ERROR validation ensures all agents are healthy before workload starts
- **Microsecond Precision**: All distributed mode latencies now displayed in microseconds (µs) for accuracy
- **Distributed Histogram Aggregation**: Bucket-level HDR histogram merging for accurate percentiles across agents
- **Enhanced Results Capture**: console.log includes all completion messages, latencies, and throughput statistics
- **Multi-Endpoint Load Balancing**: Distribute requests across multiple storage endpoints (round-robin or least-connections)
- **Checkpoint Reload**: Resume training from saved checkpoints with automatic state restoration
- **Checkpoint Plugin**: Step-based and epoch-based checkpointing with multi-backend support (file://, s3://, az://, gs://)
- **Clean CLI**: Unified interface with validate and `--dry-run` as aliases, legacy commands removed
- **Multi-Agent Orchestration**: Controller coordinates workloads across multiple agent instances
- **Coordinated Start**: Synchronized workload execution with health checking
- **Aggregate Metrics**: Automatic collection and aggregation from all agents with histogram-based percentiles
- **Structured Results**: Complete results directory with per-agent TSV files and consolidated bucket-level histograms
- **Path Isolation**: Agent-specific path prefixes for local storage isolation
- **Shared Storage**: Automatic detection and handling of GCS/S3/Azure shared backends (`--shared-storage` flag)
- **E2E Validated**: 2-node and 4-node configurations tested (local + cloud storage)
- **Performance**: Multi-GiB/s aggregate throughput with accurate percentile tracking
**For storage I/O replay**, use [sai3-bench](https://github.com/russfellows/sai3-bench) instead.
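The accelerator utilization (AU) metric and the first-batch exclusion listed above can be combined in a few lines. This is only a sketch: the `BatchTiming` struct and its field names are illustrative, and aggregating as a ratio of sums over the steady-state batches is an assumption rather than dl-driver's documented behavior. The only detail taken from the list is that AU is compute_time / batch_time and that the cold-start batch is excluded.

```rust
/// Per-batch timings in seconds. Field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct BatchTiming {
    /// Simulated accelerator compute time for the batch.
    pub compute_time: f64,
    /// Wall time for the batch: data wait plus compute.
    pub batch_time: f64,
}

/// AU = sum(compute_time) / sum(batch_time) over steady-state batches.
/// The first batch is skipped as the cold-start batch. Returns None when
/// no steady-state batch exists or the total batch time is zero.
pub fn accelerator_utilization(batches: &[BatchTiming]) -> Option<f64> {
    let steady = batches.get(1..)?;
    if steady.is_empty() {
        return None;
    }
    let compute: f64 = steady.iter().map(|b| b.compute_time).sum();
    let total: f64 = steady.iter().map(|b| b.batch_time).sum();
    if total > 0.0 {
        Some(compute / total)
    } else {
        None
    }
}
```

With a slow first batch (1.0 s) followed by two batches of 0.10 s that each contain 0.05 s of compute, the function returns 0.5; including the first batch would have pulled the value down to about 0.125.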
## Documentation
**For complete documentation, see [docs/USER_GUIDE.md](docs/USER_GUIDE.md)**
### Quick Links
- **[User Guide](docs/USER_GUIDE.md)** - Comprehensive guide covering all features
- **[Quick Start](docs/QUICK_START.md)** - Get started in minutes
- **[Distributed Setup](tests/dlio_configs/DISTRIBUTED_README.md)** - Multi-agent orchestration guide
- **[Changelog](docs/Changelog.md)** - Version history and release notes
- **[Dual Metrics](docs/DUAL_METRICS_REPORTING.md)** - Metrics specification
- **[Results Directory Format](docs/RESULTS_DIRECTORY_FORMAT.md)** - Structured results output specification
## Distributed Execution
### Multi-Agent Orchestration
Execute DLIO workloads across multiple agent instances with a centralized controller:
```bash
# Start agent processes on each host
# Host 1:
./target/release/dl_driver_agent --agent-id agent-0 --port 50051 --bind-addr 0.0.0.0 &
# Host 2:
./target/release/dl_driver_agent --agent-id agent-1 --port 50051 --bind-addr 0.0.0.0 &
# Run distributed workload from controller
./target/release/dl-driver distributed run \
--config tests/dlio_configs/distributed_2node_local.yaml \
--agents http://host1:50051,http://host2:50051 \
--path-template "{id}/"
# Output shows aggregated results:
# Distributed Workload Complete!
# Storage Performance (I/O Perspective):
#   Total Throughput: 687.5 MiB/s
#   Total Operations: 40
#   Errors: 0
# AI/ML Training Performance (Training Perspective):
#   Training Velocity: 297.9 samples/s, 45.8 batches/s
#   Pipeline Efficiency: 37.8%
```
### Storage Backend Modes
**Local Storage** (requires path template for agent isolation):
```bash
# Each agent writes to separate subdirectory
./target/release/dl-driver distributed run \
--config distributed_local.yaml \
--agents http://host1:50051,http://host2:50052 \
--path-template "{id}/"
# Creates: /tmp/data/agent-0/, /tmp/data/agent-1/, etc.
```
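The effect of `--path-template "{id}/"` in the local-storage example above can be pictured as a simple placeholder substitution. The helper below is a hypothetical sketch, not dl-driver's code; it assumes `{id}` is the only placeholder and that the expanded template is appended to the configured data folder.

```rust
/// Expand a path template such as "{id}/" for one agent and join it
/// onto a base data folder. Only the `{id}` placeholder is handled.
pub fn agent_path(base: &str, template: &str, agent_id: &str) -> String {
    let suffix = template.replace("{id}", agent_id);
    // Avoid a doubled slash when the base already ends with one.
    format!("{}/{}", base.trim_end_matches('/'), suffix)
}
```

For a base of `/tmp/data` and agent IDs `agent-0` and `agent-1`, this yields the two directories shown in the comment of the example above.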
**Shared Storage** (no path template needed):
```bash
# All agents write to same GCS/S3 bucket
./target/release/dl-driver distributed run \
--config distributed_gcs.yaml \
--agents http://host1:50051,http://host2:50052
# All write to: gs://bucket/distributed-test/
```
### Key Distributed Features
- **Multi-Host Orchestration**: Controller coordinates agents across network
- **Health Checking**: Automatic agent health verification before execution
- **Coordinated Start**: Synchronized workload start across all agents
- **Aggregate Metrics**: Automatic collection and aggregation from all agents
- **Path Isolation**: Agent-specific subdirectories for local storage
- **Shared Storage**: Automatic detection of GCS/S3/Azure shared backends
- **Dual Metrics**: Separate storage and AI/ML training perspectives
See `tests/dlio_configs/DISTRIBUTED_README.md` for complete usage guide.
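Aggregating percentiles at the bucket level means the controller adds per-bucket counts from every agent and only then looks up a percentile, instead of averaging each agent's own percentiles. The sketch below shows the idea with a simplified fixed-bucket histogram. It is not an HDR histogram and not dl-driver's implementation; the `Histogram` type and the bucket bounds are assumptions made for illustration.

```rust
/// Latency histogram whose bucket upper bounds are shared by all agents.
#[derive(Debug, Clone, PartialEq)]
pub struct Histogram {
    /// Ascending upper bounds in microseconds, one per bucket.
    pub upper_bounds_us: Vec<u64>,
    pub counts: Vec<u64>,
}

impl Histogram {
    pub fn new(upper_bounds_us: Vec<u64>) -> Self {
        assert!(!upper_bounds_us.is_empty(), "at least one bucket is required");
        let counts = vec![0; upper_bounds_us.len()];
        Histogram { upper_bounds_us, counts }
    }

    /// Record one sample; values past the last bound land in the last bucket.
    pub fn record(&mut self, latency_us: u64) {
        let last = self.counts.len() - 1;
        let idx = self
            .upper_bounds_us
            .iter()
            .position(|&b| latency_us <= b)
            .unwrap_or(last);
        self.counts[idx] += 1;
    }

    /// Merge another agent's histogram by adding counts bucket by bucket.
    pub fn merge(&mut self, other: &Histogram) {
        assert_eq!(self.upper_bounds_us, other.upper_bounds_us, "bucket layouts must match");
        for (mine, theirs) in self.counts.iter_mut().zip(&other.counts) {
            *mine += *theirs;
        }
    }

    /// Upper bound of the bucket holding the p-th percentile (0 < p <= 100).
    pub fn percentile(&self, p: f64) -> Option<u64> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        // Rank of the sample that marks the percentile, at least 1.
        let rank = ((p / 100.0) * total as f64).ceil().max(1.0) as u64;
        let mut seen = 0u64;
        for (bound, count) in self.upper_bounds_us.iter().zip(&self.counts) {
            seen += *count;
            if seen >= rank {
                return Some(*bound);
            }
        }
        self.upper_bounds_us.last().copied()
    }
}
```

The accuracy of the result depends only on bucket width, which is why finer, logarithmically spaced buckets (as in HDR histograms) can keep the error small regardless of how many agents are merged.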
## Multi-Process Scaling Usage
### Multi-Rank Distributed Execution
Execute DLIO workloads across multiple processes with shared memory coordination:
```bash
# 2-Process execution (simulating 2 GPUs)
./target/release/dl-driver run --config config.yaml --world-size 2 --rank 0 &
./target/release/dl-driver run --config config.yaml --world-size 2 --rank 1 &
# 4-Process execution (simulating 4 GPUs)
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 0 &
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 1 &
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 2 &
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 3 &
# Rank 0 will display aggregated results:
# Plan A1 Multi-GPU Results (Shared Memory Coordination):
# ================================================================
# Total files processed: 28
# Total data read: 0.40 GiB
# Combined throughput: 11.16 GiB/s
# Global runtime: 0.071s
# Number of ranks: 4
# Multi-rank coordination successful - NO TEMP FILES USED
```
### Key Multi-Process Features
- **Shared Memory Coordination**: Zero temp files, atomic operations, cross-process barriers
- **Automatic Aggregation**: Rank 0 displays combined performance across all processes
- **Synchronized Execution**: All ranks coordinate start/stop for accurate timing
- **Interleaved Sharding**: Optimal data distribution across ranks
- **Automatic Cleanup**: Proper shared memory cleanup on completion or failure
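The interleaved and contiguous sharding strategies mentioned in this README differ only in which file indices a rank receives. The function below is a minimal sketch under stated assumptions: the `Sharding` enum and `shard` function are hypothetical names, and giving the leftover files to the lowest ranks in contiguous mode is one possible choice, not necessarily the one dl-driver makes.

```rust
/// File-to-rank assignment strategies.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Sharding {
    Interleaved,
    Contiguous,
}

/// Indices of the files assigned to `rank` out of `world_size` ranks.
pub fn shard(num_files: usize, world_size: usize, rank: usize, mode: Sharding) -> Vec<usize> {
    assert!(world_size > 0 && rank < world_size, "rank must be in 0..world_size");
    match mode {
        // Rank r takes files r, r + world_size, r + 2 * world_size, ...
        Sharding::Interleaved => (rank..num_files).step_by(world_size).collect(),
        // Each rank takes one consecutive block; the first `rem` ranks
        // receive one extra file when the count does not divide evenly.
        Sharding::Contiguous => {
            let base = num_files / world_size;
            let rem = num_files % world_size;
            let start = rank * base + rank.min(rem);
            let len = base + usize::from(rank < rem);
            (start..start + len).collect()
        }
    }
}
```

With 28 files and 4 ranks, as in the sample output above, either strategy gives each rank 7 files; the strategies only differ in whether those files are adjacent in the listing.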
## Single-Process DLIO Execution
```bash
# Build and run standard DLIO workload
cargo build --release
./target/release/dl-driver run --config tests/dlio_configs/minimal_config.yaml
# Generate data separately (optional)
./target/release/dl-driver generate --config config.yaml
# Validate configuration
./target/release/dl-driver validate --config config.yaml
# MLPerf compliance mode (enhanced reporting)
./target/release/dl-driver run --mlperf --config config.yaml --format json
```
### Key Features
- **Distributed Controller**: Multi-agent orchestration with coordinated start and histogram-based aggregate metrics
- **Results Directory**: Complete, reproducible results with per-agent and consolidated TSV files
- **Histogram Aggregation**: Accurate percentile calculation (<1% error) for distributed workloads
- **Directory Tree Modes**: 3-mode system (Flat, DLIO Sharding, Hierarchical) for realistic dataset organization
- **Dry-Run Validation**: `--dry-run` flag validates configs and shows workload summary before execution
- **Multi-Process Scaling**: `--world-size N --rank R` distributed execution with shared memory coordination
- **Enterprise Coordination**: Atomic operations, cross-process barriers, zero temp files
- **TRUE DLIO Parallel I/O**: Background workers with I/O+compute overlap for realistic performance
- **Complete Format Compatibility**: NPZ, HDF5, TFRecord validated with numpy, h5py, TensorFlow
- **Universal Storage**: File, S3/MinIO, Azure Blob, DirectIO backends with unified interface
- **DLIO Compatible**: Drop-in replacement for existing DLIO benchmark configurations
- **Dual Metrics**: Separate storage (ops/s, MiB/s) and AI/ML (samples/s, batches/s) perspectives
- **Production Cloud Ready**: Real S3 and Azure credential support
- **Comprehensively Validated**: 119 comprehensive tests with golden reference validation and MLCommons DLIO compatibility
## Workstream A: Realistic AI/ML Framework Simulation
### Framework-Specific Workload Profiles
Execute workloads optimized for specific AI/ML frameworks:
```bash
# PyTorch-optimized workload simulation
./target/release/dl-driver run --config config.yaml --profile torch
# TensorFlow-optimized configuration
./target/release/dl-driver run --config config.yaml --profile tf
# JAX-optimized workload patterns
./target/release/dl-driver run --config config.yaml --profile jax
```
### Advanced Metrics Export & CI Integration
Export comprehensive performance metrics for automated analysis:
```bash
# Export metrics to JSON for programmatic analysis
./target/release/dl-driver run --config config.yaml --metrics-json results.json
# Export metrics to CSV for spreadsheet analysis
./target/release/dl-driver run --config config.yaml --metrics-csv results.csv
# Both formats simultaneously for comprehensive reporting
./target/release/dl-driver run --config config.yaml --metrics-json metrics.json --metrics-csv metrics.csv
```
### Operation Log Validation & Benchmarking
Validate workload performance against reference operation logs:
```bash
# Validate against compressed operation log (supports .csv.zst, .jsonl.zst)
./target/release/dl-driver run --config config.yaml --op-log reference-benchmark.csv.zst
# Example with comprehensive validation and metrics export
./target/release/dl-driver run \
--config config.yaml \
--profile torch \
--metrics-json validation-results.json \
--op-log production-reference.csv.zst
# Validation output with CI-friendly exit codes:
# PASS: Workload performance within tolerance (±5.0%)
# Files processed: 1000 (reference: 1000)
# Throughput: 12.4 GiB/s (reference: 12.1 GiB/s, +2.5%)
# Total runtime: 45.2s (reference: 46.1s, -2.0%)
```
### Key Workstream A Features
- **Intelligent Profiles**: Framework-specific optimizations for PyTorch, TensorFlow, and JAX
- **Production Metrics**: JSON/CSV export for CI/CD pipelines and performance tracking
- **Validation Engine**: Compare against reference operation logs with configurable tolerance
- **Real-World Testing**: Validated with 2.78M record operation logs from production systems
- **CI Integration**: PASS/FAIL validation with proper exit codes for automated testing
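The tolerance check behind the PASS/FAIL result can be expressed as a percentage delta against the reference value. The functions below are a sketch of that arithmetic only. The function names and the exit-code mapping (0 for PASS, 1 for FAIL) are assumptions; the README states that exit codes are CI-friendly but does not list their values.

```rust
/// Signed percentage difference of `measured` relative to `reference`.
pub fn percent_delta(measured: f64, reference: f64) -> f64 {
    (measured - reference) / reference * 100.0
}

/// True when the measured value lies within +/- `tolerance_pct` percent
/// of the reference value.
pub fn within_tolerance(measured: f64, reference: f64, tolerance_pct: f64) -> bool {
    percent_delta(measured, reference).abs() <= tolerance_pct
}

/// Map a set of check results to a process exit code.
/// Assumed convention: 0 when every check passes, 1 otherwise.
pub fn exit_code(checks: &[bool]) -> i32 {
    if checks.iter().all(|&ok| ok) {
        0
    } else {
        1
    }
}
```

Applied to the sample output above, 12.4 GiB/s against a 12.1 GiB/s reference is a delta of about +2.5%, and 45.2 s against 46.1 s is about -2.0%, so both fall inside a 5% tolerance.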
## Technical Specifications
### Binaries
- **`dl-driver`**: Main CLI for single-process, multi-rank, and distributed controller execution
- **`dl_driver_agent`**: Standalone agent process for distributed workloads (gRPC service)
### Storage Backends
- **File System**: POSIX-compliant file I/O with DirectIO optimization
- **Cloud Storage**: S3/MinIO and Azure Blob with credential support
- **Performance**: Multi-GiB/s throughput with enterprise-grade reliability
### Data Formats
- **NPZ, HDF5, TFRecord**: 100% compatible with numpy, h5py, and TensorFlow
- **Framework Support**: PyTorch, TensorFlow, and JAX configuration profiles
- **Validation**: Comprehensive test suite ensuring standard library compatibility
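For the TFRecord format, this README mentions index files of 16 bytes per record. One layout that fits that size is a pair of 64-bit values, offset and length, per record; that layout is an assumption here, not something the README specifies. The sketch below scans the standard TFRecord framing (8-byte little-endian payload length, 4-byte length CRC, payload, 4-byte payload CRC) and emits such pairs. CRCs are skipped rather than verified, and the function names are hypothetical.

```rust
/// Bytes of framing around each payload: 8 (length) + 4 (CRC) + 4 (CRC).
const FRAMING: usize = 16;

/// Scan an in-memory TFRecord file and return (offset, total_len) for each
/// record, where total_len includes the framing. CRCs are not verified.
pub fn scan_tfrecords(buf: &[u8]) -> Result<Vec<(u64, u64)>, String> {
    let mut entries = Vec::new();
    let mut pos = 0usize;
    while pos < buf.len() {
        let remaining = buf.len() - pos;
        if remaining < FRAMING {
            return Err(format!("truncated record at offset {}", pos));
        }
        // Payload length is stored as a little-endian u64.
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&buf[pos..pos + 8]);
        let payload_len = u64::from_le_bytes(len_bytes);
        if payload_len > (remaining - FRAMING) as u64 {
            return Err(format!("record at offset {} exceeds file size", pos));
        }
        let total = FRAMING as u64 + payload_len;
        entries.push((pos as u64, total));
        pos += total as usize;
    }
    Ok(entries)
}

/// Serialize entries as 16 bytes each: offset, then length, both
/// little-endian u64 (assumed layout).
pub fn encode_index(entries: &[(u64, u64)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(entries.len() * 16);
    for &(offset, len) in entries {
        out.extend_from_slice(&offset.to_le_bytes());
        out.extend_from_slice(&len.to_le_bytes());
    }
    out
}
```

An index of this kind lets a reader seek directly to record N without walking the file; whether the stored length includes the framing is another point where a real index format may differ from this sketch.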
## Key Achievements
### Production-Ready AI/ML Data Pipeline
dl-driver has evolved into a complete, enterprise-grade testing framework for AI/ML workloads:
- **100% Format Compatibility**: All generated files work seamlessly with standard Python libraries (numpy, h5py, TensorFlow)
- **Distributed Orchestration**: Multi-agent coordination with histogram-based percentile aggregation (<1% error)
- **Results Directory**: Complete, reproducible results with per-agent and consolidated metrics in TSV format
- **DLIO Drop-in Replacement**: Full MLCommons configuration compatibility with enhanced features
- **Multi-Backend Excellence**: Unified performance across File, S3, Azure, and DirectIO storage
- **Enterprise Validation**: Comprehensive test suite ensuring reliability and correctness
### Validation Confidence
```
Core Tests: 60/60 tests passing (metrics, config, workload, distributed, histogram aggregation)
CLI Tests: 29/29 tests passing (configuration, backend integration)
Integration Tests: 10/10 tests passing (histogram E2E, results directory workflow)
Framework Tests: 7/7 tests passing (PyTorch integration, validation, serialization)
Format Tests: 5/5 tests passing (NPZ, HDF5, TFRecord)
Other Tests: 8/8 tests passing (replay, coordination, etc.)
Total Coverage: 119/119 comprehensive tests validating all functionality
```
## Architecture
dl-driver follows a clean workspace architecture with 6 focused crates:
```
real_dlio/
├── crates/
│   ├── cli/           # Command-line interface
│   ├── core/          # Workload orchestration and config parsing
│   ├── frameworks/    # Framework integrations (PyTorch, TensorFlow, JAX)
│   ├── storage/       # Storage backend abstractions
│   ├── formats/       # Data format handlers (HDF5, NPZ, etc.)
│   └── py_api/        # Python bindings (PyO3)
├── tests/             # Integration and regression tests
└── docs/              # Documentation and changelog
```
## Quick Start
### Installation
```bash
git clone https://github.com/russfellows/dl-driver.git
cd dl-driver
cargo build --release
```
### Basic Usage
```bash
# Generate test datasets with different formats
./target/release/dl-driver generate --config tests/dlio_configs/minimal_config.yaml
# Run DLIO-compatible workloads (unified execution engine)
./target/release/dl-driver run --config tests/dlio_configs/unet3d_config.yaml
# Validate configuration without running
./target/release/dl-driver validate --config tests/dlio_configs/bert_config.yaml
# Multi-rank execution (shared memory coordination)
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 0 &
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 1 &
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 2 &
./target/release/dl-driver run --config config.yaml --world-size 4 --rank 3 &
# Distributed multi-agent execution
./target/release/dl_driver_agent --agent-id agent-0 --port 50051 &
./target/release/dl_driver_agent --agent-id agent-1 --port 50052 &
./target/release/dl-driver distributed run \
--config tests/dlio_configs/distributed_2node_local.yaml \
--agents http://host1:50051,http://host2:50052 \
--path-template "{id}/"
# Framework-specific workload profiles (Workstream A)
./target/release/dl-driver run --config config.yaml --profile torch
./target/release/dl-driver run --config config.yaml --profile tf
./target/release/dl-driver run --config config.yaml --profile jax
# Metrics export for CI/CD integration (Workstream A)
./target/release/dl-driver run --config config.yaml --metrics-json results.json
./target/release/dl-driver run --config config.yaml --metrics-csv results.csv
# Operation log validation (Workstream A)
./target/release/dl-driver run --config config.yaml --op-log reference.csv.zst
# Run format validation (requires Python environment)
python tools/validation/validate_formats.py
```
### Command Overview
```bash
dl-driver --help # Show all available commands
dl-driver generate --help # Generate synthetic datasets
dl-driver run --help # Run DLIO workloads (with optional MLPerf mode)
dl-driver validate --help # Validate configurations
dl-driver distributed --help # Distributed multi-agent orchestration
# Multi-rank execution
dl-driver run --world-size N --rank R # Multi-process shared memory coordination
# Distributed execution
dl_driver_agent --agent-id ID --port PORT # Start agent process
dl-driver distributed run --agents LIST # Controller for multi-agent workloads
# Workstream A: Advanced execution options
dl-driver run --profile [torch|tf|jax] # Framework-specific optimization profiles
dl-driver run --metrics-json FILE # Export metrics in JSON format
dl-driver run --metrics-csv FILE # Export metrics in CSV format
dl-driver run --op-log FILE # Validate against reference operation log
```
## Configuration
DLIO-compatible YAML configuration with multi-backend storage support:
```yaml
dataset:
data_folder: file:///mnt/vast1/data/ # file://, s3://, az://, direct://
format: npz # npz, hdf5, tfrecord
num_files_train: 1000
reader:
batch_size: 32
read_threads: 4
train:
epochs: 5
computation_time: 0.05
```
Configuration examples available in `tests/dlio_configs/`
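The `data_folder` value in the configuration above selects the storage backend through its URI scheme. A dispatch on that scheme might look like the following sketch. The `Backend` enum and `parse_data_folder` function are hypothetical names; only the scheme strings (`file://`, `s3://`, `az://`, `gs://`, `direct://`) come from this README.

```rust
/// Storage backends selectable through the URI scheme.
#[derive(Debug, PartialEq)]
pub enum Backend {
    File,
    S3,
    Azure,
    Gcs,
    DirectIo,
}

/// Split a `data_folder` URI into a backend and the part after "://".
pub fn parse_data_folder(uri: &str) -> Result<(Backend, &str), String> {
    let (scheme, rest) = uri
        .split_once("://")
        .ok_or_else(|| format!("missing scheme in '{}'", uri))?;
    let backend = match scheme {
        "file" => Backend::File,
        "s3" => Backend::S3,
        "az" => Backend::Azure,
        "gs" => Backend::Gcs,
        "direct" => Backend::DirectIo,
        other => return Err(format!("unsupported scheme '{}'", other)),
    };
    Ok((backend, rest))
}
```

For `file:///mnt/vast1/data/` the remainder is the absolute path `/mnt/vast1/data/`; for object stores the remainder starts with the bucket or container name.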
## Testing & Validation
```bash
# Build and test
cargo build --release
cargo test
# Test multi-rank coordination
./target/release/dl-driver run --config config.yaml --world-size 2 --rank 0 &
./target/release/dl-driver run --config config.yaml --world-size 2 --rank 1
# NEW: Test Workstream A features (v0.6.4)
./target/release/dl-driver run --config config.yaml --profile torch --metrics-json test.json
./target/release/dl-driver run --config config.yaml --op-log tests/dlio_configs/reference.csv.zst
```
### Known Testing Limitations
**Storage Latency Measurement (v0.8.8)**: Current Phase 2 multi-rank tests use `/tmp` (tmpfs, memory-backed) with small files (64KB) that fit entirely in page cache. While metrics are logically correct, **verification requires real disk I/O testing**. See [docs/testing/PHASE2_VERIFICATION_PLAN.md](docs/testing/PHASE2_VERIFICATION_PLAN.md) for planned verification using:
- `direct://` I/O to bypass page cache
- `/mnt/test` (real disk, NOT tmpfs)
- Large datasets (5-10GB) exceeding available RAM
- Expected latency ranges: 5-50ms for disk I/O, <1ms with prefetch
This verification is planned but not yet executed. Current 0µs latencies are consistent with prefetched+cached data but don't prove measurement correctness.
### Validation Results
- **119 comprehensive tests** passing across all features
- **Format validation** with numpy, h5py, and TensorFlow standard libraries
- **Distributed workloads** validated with histogram aggregation and results directory output
- **Framework profiles** validated with PyTorch, TensorFlow, and JAX configurations
- **Operation log validation** tested with multi-million record production datasets
- **Metrics export** validated in JSON, CSV, and TSV formats for CI integration
- **100% compatibility** with numpy, h5py, tensorflow
- **MLCommons DLIO configs** fully validated
### Test Categories
- **Backend Integration**: File, S3, Azure, DirectIO validation
- **Format Compatibility**: NPZ, HDF5, TFRecord with standard libraries
- **DLIO Compliance**: Configuration parsing and workload execution
- **Performance**: s3dlio AsyncPoolDataLoader benchmarks
## Development
### Prerequisites
- Rust 1.89.0 or later
- s3dlio library (automatically handled by Cargo)
### Building from Source
```bash
git clone https://github.com/russfellows/dl-driver.git
cd dl-driver
cargo build --release
```
### Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request
## Documentation
- [Changelog](./docs/Changelog.md) - Detailed version history
- [Configuration Guide](./tests/configs/) - Example configurations
- [API Documentation](https://docs.rs/real_dlio) - Rust API docs
## Acknowledgments
- [DLIO Benchmark](https://github.com/argonne-lcf/dlio_benchmark) - Original inspiration and configuration format
- [s3dlio](https://github.com/russfellows/s3dlio) - Powerful multi-backend storage library
- Rust ecosystem - tokio, serde, anyhow, and many other excellent crates
## License & Compliance
This project maintains **enterprise-grade license compliance** with comprehensive scanning and validation.
### License Information
- **License**: [GPL-3.0-or-later](LICENSES/GPL-3.0-or-later.txt)
- **REUSE Compliant**: Full compliance with [REUSE Specification 3.3](https://reuse.software/spec/)
- **SPDX Standards**: All source files include proper SPDX license identifiers
- **ScanCode Compatible**: Validated with ScanCode Toolkit for enterprise scanning
### Compliance Summary
- **201 files scanned** by ScanCode Toolkit
- **72 files** with SPDX GPL-3.0 identifiers
- **80 files** with proper copyright attribution
- **Automated CI/CD** license validation via GitHub Actions

**[View Detailed Compliance Report](docs/LICENSE-COMPLIANCE.md)**
### Local Validation
```bash
# REUSE compliance check
reuse lint
# ScanCode analysis (via Docker)
docker run --rm -v "$(pwd)":/workdir sixarm/scancode \
--copyright --license --package --info --license-text \
--strip-root --format html-app /workdir /workdir/compliance-report.html
```
---