https://github.com/russfellows/s3dlio
Part of the sai3 project, which delivers multi-protocol storage access for AI/ML workflows with support for PyTorch, TensorFlow, and JAX. This project provides a CLI along with Rust and Python libraries for AI/ML storage workflows, supporting S3, File, Azure Blob, and GCS using the latest Rust SDKs.
ai-ml ai-storage aws-s3 azure-blob file-io gcs jax python-library pytorch rust rust-crate s3 sai3 tensorflow
- Host: GitHub
- URL: https://github.com/russfellows/s3dlio
- Owner: russfellows
- License: apache-2.0
- Created: 2025-05-12T19:40:19.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2026-03-15T04:20:00.000Z (3 months ago)
- Last Synced: 2026-03-15T18:42:39.075Z (3 months ago)
- Topics: ai-ml, ai-storage, aws-s3, azure-blob, file-io, gcs, jax, python-library, pytorch, rust, rust-crate, s3, sai3, tensorflow
- Language: Rust
- Homepage:
- Size: 3.98 MB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
# s3dlio - Universal Storage I/O Library
High-performance, multi-protocol storage library for AI/ML workloads with universal copy operations across S3, Azure, GCS, local file systems, and DirectIO.
> **v0.9.100 - General-purpose object data loader: `PyDataset.from_uris()`, `items()`, `collect_batch()`, `skip_head` HEAD optimisation**
>
> **`PyDataset.from_uris(uris)`**: map-style dataset from a pre-built URI list, with no listing overhead. **`PyBytesAsyncDataLoader.items()`**: URI-carrying sliding-window iterator; Tokio keeps `prefetch` GETs permanently in flight via `buffer_unordered`. **`collect_batch(n)`**: drains `n` items with one GIL crossing. **`skip_head=True` (new default)**: skips the per-object HEAD request; actual size is cached after each GET so range splitting fires correctly from epoch 2. See [docs/Python_Data-Loader.md](docs/Python_Data-Loader.md).
>
> **v0.9.98 (prior):** `ParquetRowGroupDataset` - epoch-aware Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. See [docs/Changelog.md](docs/Changelog.md) for full history.
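The sliding-window behaviour described for `items()` can be pictured with plain `asyncio`. The sketch below is not s3dlio code: `sliding_window`, `fake_fetch`, and the URI list are hypothetical stand-ins. It only mimics the idea of keeping up to `prefetch` requests in flight and yielding `(uri, data)` pairs in completion order, similar in spirit to `buffer_unordered` in Rust.

```python
import asyncio
from itertools import islice


async def sliding_window(uris, fetch, prefetch=4):
    """Yield (uri, data) in completion order with at most `prefetch` fetches in flight."""
    it = iter(uris)

    async def tagged(uri):
        # Carry the URI alongside the payload so the caller knows what arrived.
        return uri, await fetch(uri)

    # Prime the window with the first `prefetch` requests.
    pending = {asyncio.create_task(tagged(u)) for u in islice(it, prefetch)}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            # Start one replacement per completed fetch so the window stays full.
            for u in islice(it, 1):
                pending.add(asyncio.create_task(tagged(u)))
            yield task.result()


async def fake_fetch(uri):
    # Hypothetical stand-in for a storage GET.
    await asyncio.sleep(0)
    return uri.encode()


async def collect(uris, prefetch=4):
    return [item async for item in sliding_window(uris, fake_fetch, prefetch)]


uris = [f"s3://bucket/obj-{i}.bin" for i in range(10)]
results = asyncio.run(collect(uris))
```

Because results arrive in completion order, each item carries its URI; a caller that needs the original order has to re-sort by URI or index.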
## Installation
### Quick Install (Python)
```bash
# If using uv package manager + uv virtual environment:
uv pip install s3dlio
# If using pip without uv:
pip install s3dlio
```
### Python Backend Profiles (PyPI vs Full Build)
- If using `uv` package manager + `uv` virtual environment: `uv pip install s3dlio`.
- If using standard `pip` without `uv`: `pip install s3dlio`.
- The default published wheel is now S3-focused (Azure Blob and GCS are excluded).
- If you want full backends (S3 + Azure Blob + GCS), build from source with:
```bash
# uv workflow:
uv pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"
# pip-only workflow:
pip install s3dlio --no-binary s3dlio --config-settings "cargo-extra-args=--features extension-module,full-backends"
```
You can still add a separate package name (for example `s3dlio-full`) later if you want a dedicated prebuilt full wheel distribution.
> Maintainer note: for PyPI uploads, publish the default (`./build_pyo3.sh`) wheel unless intentionally releasing a separate distribution. `full-backends` is currently source-build only via the command above.
### Building from Source (Rust)
#### System Dependencies
s3dlio requires some system libraries to build. **Only OpenSSL and pkg-config are required by default.** HDF5 and hwloc are optional and improve functionality but are not needed for the core library:
**Ubuntu/Debian:**
```bash
# Quick install - run our helper script
./scripts/install-system-deps.sh
# Or manually (required only):
sudo apt-get install -y build-essential pkg-config libssl-dev
# Optional - for NUMA topology support (--features numa):
sudo apt-get install -y libhwloc-dev
# Optional - for HDF5 data format support (--features hdf5):
sudo apt-get install -y libhdf5-dev
# All optional libraries at once:
sudo apt-get install -y libhdf5-dev libhwloc-dev cmake
```
**RHEL/CentOS/Fedora/Rocky/AlmaLinux:**
```bash
# Quick install
./scripts/install-system-deps.sh
# Or manually (required only):
sudo dnf install -y gcc gcc-c++ make pkg-config openssl-devel
# Optional - for NUMA topology support:
sudo dnf install -y hwloc-devel
# Optional - for HDF5 data format support:
sudo dnf install -y hdf5-devel
# All optional libraries at once:
sudo dnf install -y hdf5-devel hwloc-devel cmake
```
**macOS:**
```bash
# Quick install
./scripts/install-system-deps.sh
# Or manually (required only):
brew install pkg-config openssl@3
# Optional - for NUMA/HDF5 support:
brew install hdf5 hwloc cmake
# Set environment variables (add to ~/.zshrc or ~/.bash_profile):
export PKG_CONFIG_PATH="$(brew --prefix openssl@3)/lib/pkgconfig:$PKG_CONFIG_PATH"
export OPENSSL_DIR="$(brew --prefix openssl@3)"
```
**Arch Linux:**
```bash
# Quick install
./scripts/install-system-deps.sh
# Or manually (required only):
sudo pacman -S base-devel pkg-config openssl
# Optional - for NUMA/HDF5 support:
sudo pacman -S hdf5 hwloc cmake
```
**WSL (Windows Subsystem for Linux) / Minimal Environments:**
If you are building on WSL or any environment where `libhdf5` or `libhwloc` may not be available, s3dlio builds without them by default. No extra libraries are required:
```bash
# Just the basics - works on WSL, Docker, CI, and minimal installs:
sudo apt-get install -y build-essential pkg-config libssl-dev
cargo build --release
# install Python package (no system HDF5/hwloc needed):
# uv workflow:
uv pip install s3dlio
# pip-only workflow:
pip install s3dlio
```
#### Install Rust (if not already installed)
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
#### Build s3dlio
```bash
# Clone the repository
git clone https://github.com/russfellows/s3dlio.git
cd s3dlio
# Build with default features (no HDF5 or NUMA required)
cargo build --release
# Build s3-cli with all cloud backends enabled (AWS + Azure + GCS)
cargo build --release --bin s3-cli --features full-backends
# Build s3-cli with GCS enabled only (plus default backends)
cargo build --release --bin s3-cli --features backend-gcs
# Build with NUMA topology support (requires libhwloc-dev)
cargo build --release --features numa
# Build with HDF5 data format support (requires libhdf5-dev)
cargo build --release --features hdf5
# Build with all optional features
cargo build --release --features numa,hdf5
# Run tests
cargo test
# Build Python bindings (optional)
./build_pyo3.sh
# Build Python bindings with full backends (S3 + Azure + GCS)
./build_pyo3.sh full
# Named profile form is also supported:
./build_pyo3.sh --profile full
./build_pyo3.sh --profile default
# Show profile/help usage
./build_pyo3.sh --help
```
### Build Profile Quick Reference
Rust backend feature profiles:
- Default build (`cargo build --release`): S3-focused default backend set.
- GCS-enabled build (`--features backend-gcs`): enables GCS in addition to default set.
- Full cloud build (`--features full-backends`): enables AWS + Azure + GCS.
Python wheel build profiles via `build_pyo3.sh`:
- `default` or `slim`: AWS + file/direct; excludes Azure and GCS.
- `full`: AWS + Azure + GCS + file/direct.
- Positional and named forms are equivalent:
- `./build_pyo3.sh full`
- `./build_pyo3.sh -p full`
- `./build_pyo3.sh --profile full`
Optional extra Rust features for wheel builds can still be passed with `EXTRA_FEATURES`.
Example: `EXTRA_FEATURES="numa,hdf5" ./build_pyo3.sh full`.
**Note:** NUMA support (`--features numa`) improves multi-socket performance but requires the `hwloc2` C library. HDF5 support (`--features hdf5`) enables HDF5 data format generation but requires `libhdf5`. Both are optional and s3dlio is fully functional without them.
**Platform support:** s3dlio builds natively on Linux (x86\_64, aarch64), macOS (x86\_64 and Apple Silicon arm64), and WSL. Making `numa` and `hdf5` optional was the key change for broad platform support: all remaining dependencies are pure Rust or use platform-independent system libraries (OpenSSL). To cross-compile Python wheels for Linux ARM64 from an x86\_64 host, see `build_pyo3.sh` for instructions using the `--zig` linker. For macOS universal2 (fat binary covering both architectures), see the commented section in `build_pyo3.sh`.
## Key Features
- **High Performance**: High-throughput multi GB/s reads and writes on platforms with sufficient network and storage capabilities
- **Zero-Copy Architecture**: `bytes::Bytes` throughout for minimal memory overhead
- **Multi-Protocol**: S3, Azure Blob, GCS, file://, direct:// (O_DIRECT)
- **HTTP/2 Support (opt-in)**: HTTP/2 is available but **HTTP/1.1 is the default**; HTTP/2 is almost always slower for bulk storage workloads. Opt in by setting `S3DLIO_H2C=1`. Both TLS ALPN (`https://`) and cleartext h2c (`http://`) are supported when enabled. See [docs/HTTP2_ALPN_INVESTIGATION.md](docs/HTTP2_ALPN_INVESTIGATION.md).
- **Python & Rust**: Native Rust library with zero-copy Python bindings (PyO3), bytearray support for efficient memory management
- **Multi-Endpoint Load Balancing**: RoundRobin/LeastConnections across storage endpoints
- **AI/ML Ready**: PyTorch DataLoader integration, TFRecord/NPZ format support
- **Parquet DataLoader**: Per-row-group epoch-aware DataLoader for any s3dlio storage backend (S3, Azure Blob, GCS, file://, direct://). Epoch-2 zero re-fetch speedup (2.5×+), Raw + Arrow IPC decode modes, 8-worker concurrency with shared metadata caches - [see guide](docs/Parquet_Data-Loader.md)
- **High-Speed Data Generation**: 50+ GB/s test data with configurable compression/dedup
## Latest Release
**v0.9.98** (May 2026) - Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. See [docs/Changelog.md](docs/Changelog.md).
**Recent highlights:**
- **v0.9.98** - **Parquet DataLoader** (`ParquetRowGroupDataset`): per-row-group Dataset, epoch-2 zero-re-fetch (2.5× speedup proven), Raw + ArrowIpc decode modes, 8-worker shared caches; 648 tests passing
- **v0.9.97** - `XorStream` (dedup-safe, ~15 GB/s/core); `S3DLIO_UNSIGNED_PAYLOAD` opt-in for private S3-compatible endpoints; 613 tests passing
- **v0.9.92** - MPU coordinator task, auto-scale, async write/flush/finish safety fixes, MAX_MULTIPART_PARTS guard; 580 tests passing
- **v0.9.90** - Full NVIDIA AIStore support (`S3DLIO_FOLLOW_REDIRECTS=1`) with all TLS security policies; HTTP/2 available (opt-in via `S3DLIO_H2C=1`, **not the default**); 5 issues closed (#126, #133, #134, #135, #136); 559 tests passing
- **v0.9.86** - Redirect follower for NVIDIA AIStore (S3 path); HTTPS→HTTP downgrade prevention; 21 new redirect tests; redirect security analysis documented
- **v0.9.84** - HEAD elimination (ObjectSizeCache); OnceLock env-var caching; lock-free range assembly; `AWS_CA_BUNDLE_PATH` → `AWS_CA_BUNDLE`; structured tracing
- **v0.9.80** - Python list hang fix (IMDSv2 legacy call removed); tracing deadlock fix (`tokio::spawn` → inline stream); async S3 delete/bucket helpers; deprecated Python APIs cleaned up
**[Complete Changelog](docs/Changelog.md)** - Full version history, migration guides, API details
---
## Version History
For detailed release notes and migration guides, see the [Complete Changelog](docs/Changelog.md).
---
## Storage Backend Support
### Universal Backend Architecture
s3dlio provides unified storage operations across all backends with consistent URI patterns:
- **Amazon S3**: `s3://bucket/prefix/` - High-performance S3 operations (5+ GB/s reads, 2.5+ GB/s writes) with built-in concurrent range GETs (on by default)
- **Azure Blob Storage**: `az://container/prefix/` - Complete Azure integration with **RangeEngine** (30-50% faster for large blobs)
- **Google Cloud Storage**: `gs://bucket/prefix/` or `gcs://bucket/prefix/` - Production ready with **RangeEngine** and full ObjectStore integration
- **Local File System**: `file:///path/to/directory/` - High-speed local file operations with **RangeEngine** support
- **DirectIO**: `direct:///path/to/directory/` - Bypass OS cache for maximum I/O performance with **RangeEngine**
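As a rough illustration of how the URI prefix selects a backend, the mapping above can be written as a small lookup. This is a sketch using only the Python standard library; `BACKENDS` and `backend_for` are hypothetical names and do not reflect how s3dlio dispatches internally.

```python
from urllib.parse import urlparse

# Scheme-to-backend table taken from the list above.
BACKENDS = {
    "s3": "Amazon S3",
    "az": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "gcs": "Google Cloud Storage",
    "file": "Local File System",
    "direct": "DirectIO",
}


def backend_for(uri):
    """Return the backend name for a URI, or raise ValueError for an unknown scheme."""
    scheme = urlparse(uri).scheme
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported URI scheme: {scheme!r}") from None


print(backend_for("s3://bucket/prefix/"))
print(backend_for("direct:///path/to/directory/"))
```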
### Concurrent Range GET Performance Features (v0.9.3+, Updated v0.9.60)
Concurrent range downloads hide network latency by parallelizing HTTP range requests.
**All backends support concurrent range GETs, but via two different mechanisms:**
**Mechanism 1 - S3 built-in (on by default, v0.9.60+)**
- ✅ **Amazon S3**: Concurrent range splitting enabled by default via `S3DLIO_ENABLE_RANGE_OPTIMIZATION` (default: on). Uses `get_object_concurrent_range_async()`, which fires parallel `GetObject(Range: bytes=N-M)` requests via the AWS SDK with lock-free chunk assembly. Controlled by `S3DLIO_RANGE_THRESHOLD_MB` (default: 32 MiB) and `S3DLIO_RANGE_CONCURRENCY` (default: auto-scaled). Disable with `S3DLIO_ENABLE_RANGE_OPTIMIZATION=0`.
**Mechanism 2 - RangeEngine (per-store config flag, must enable explicitly)**
- ✅ **Azure Blob Storage**: 30-50% faster for large files (`enable_range_engine: true` in `AzureConfig`)
- ✅ **Google Cloud Storage**: 30-50% faster for large files (`enable_range_engine: true` in `GcsConfig`)
- ⚠️ **Local File System**: Rarely beneficial due to seek overhead (disabled by default)
- ⚠️ **DirectIO**: Rarely beneficial due to O_DIRECT overhead (disabled by default)
**RangeEngine config flag defaults (v0.9.6+):**
- **Status**: `enable_range_engine: false` by default in all per-store config structs
- **Reason**: Extra HEAD request on every GET causes ~50% slowdown for small-object workloads
- **Threshold**: 32 MiB default (tunable per-store via `RangeEngineConfig::min_split_size`)
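The range-splitting idea behind both mechanisms can be sketched in a few lines. This is an illustration only, not s3dlio's algorithm: `split_ranges` and `range_header` are hypothetical helpers, the 8 MiB part size is an assumed value, and only the 32 MiB threshold comes from the text above.

```python
MIB = 1024 * 1024


def split_ranges(size, threshold=32 * MIB, part_size=8 * MIB):
    """Return inclusive (start, end) byte ranges covering an object of `size` bytes."""
    if size <= 0:
        return []
    if size < threshold:
        # Below the threshold a single plain GET is cheaper than splitting.
        return [(0, size - 1)]
    return [(start, min(start + part_size, size) - 1) for start in range(0, size, part_size)]


def range_header(start, end):
    """Format an HTTP Range header value for one inclusive byte range."""
    return f"bytes={start}-{end}"


# A 40 MiB object is above the threshold, so it is split into 8 MiB parts.
for start, end in split_ranges(40 * MIB):
    print(range_header(start, end))
```

Each range can then be fetched concurrently and written into its slot of a pre-sized buffer, which is why the object size has to be known (or cached) before splitting.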
**How to Enable for Large-File Workloads:**
```rust
use s3dlio::object_store::{AzureObjectStore, AzureConfig};
let config = AzureConfig {
    enable_range_engine: true, // Explicitly enable for large files
    ..Default::default()
};
let store = AzureObjectStore::with_config(config);
```
**When to Enable:**
- ✅ Large-file workloads (average size >= 32 MiB)
- ✅ High-bandwidth, high-latency networks
- ❌ Mixed or small-object workloads
- ❌ Local file systems
### S3 Backend Options
s3dlio supports two S3 backend implementations. **Native AWS SDK is the default and recommended** for production use:
```bash
# Default: Native AWS SDK backend (RECOMMENDED for production)
cargo build --release
# or explicitly:
cargo build --no-default-features --features native-backends
# Experimental: Apache Arrow object_store backend (optional, for testing)
cargo build --no-default-features --features arrow-backend
```
**Why native-backends is default:**
- Proven performance in production workloads
- Optimized for high-throughput S3 operations (5+ GB/s reads, 2.5+ GB/s writes)
- Well-tested with MinIO, Vast, and AWS S3
**About arrow-backend:**
- Experimental alternative implementation
- No proven performance advantage over native backend
- Useful for comparison testing and development
- Not recommended for production use
### GCS Backend Options (Current)
GCS is now **optional** at build time.
- Default build (`cargo build --release`) does **not** include GCS.
- To include GCS, enable `backend-gcs` (or `full-backends`).
- When enabled, s3dlio uses the **official Google crates** (`google-cloud-storage` + gax) from a patched fork maintained for s3dlio.
```bash
# Default build (S3-focused; no GCS)
cargo build --release
# Enable GCS explicitly
cargo build --release --features backend-gcs
# Enable all cloud backends (AWS + Azure + GCS)
cargo build --release --features full-backends
```
**Patched official GCS fork used by s3dlio:**
- Repository: https://github.com/russfellows/google-cloud-rust
- Integration in this repo is pinned in [Cargo.toml](Cargo.toml) (currently via release tag from that fork).
**Legacy note:** `gcs-community` remains as a legacy opt-in path, but the primary supported path is the official Google crates from the patched `russfellows/google-cloud-rust` fork.
## Quick Start
### Installation
**Rust CLI:**
```bash
git clone https://github.com/russfellows/s3dlio.git
cd s3dlio
cargo build --release
# Full cloud backend CLI build:
cargo build --release --bin s3-cli --features full-backends
```
**Python Library:**
```bash
# uv workflow:
uv pip install s3dlio
# pip-only workflow:
pip install s3dlio
# or build from source:
./build_pyo3.sh && ./install_pyo3_wheel.sh
# build from source with full cloud backends:
./build_pyo3.sh --profile full && ./install_pyo3_wheel.sh
```
### Documentation
- **[CLI Guide](docs/CLI_GUIDE.md)** - Complete command-line interface reference with examples
- **[Python API Guide](docs/PYTHON_API_GUIDE.md)** - Complete Python library reference with examples
- **[Parquet DataLoader Guide](docs/Parquet_Data-Loader.md)** - Epoch-aware per-row-group Parquet DataLoader (v0.9.98+)
- **[Multi-Endpoint Guide](docs/MULTI_ENDPOINT_GUIDE.md)** - Load balancing across multiple storage endpoints (v0.9.14+)
- **[Rust API Guide v0.9.0](docs/api/rust-api-v0.9.0.md)** - Complete Rust library reference with migration guide
- **[Changelog](docs/Changelog.md)** - Version history and release notes
- **[Adaptive Tuning Guide](docs/ADAPTIVE-TUNING.md)** - Optional performance auto-tuning
- **[Testing Guide](docs/TESTING-GUIDE.md)** - Test suite documentation
## Core Capabilities
### Universal Copy Operations
s3dlio treats upload and download as enhanced versions of the Unix `cp` command, working across all storage backends:
**CLI Usage:**
```bash
# Upload to any backend with real-time progress
s3-cli upload /local/data/*.log s3://mybucket/logs/
s3-cli upload /local/files/* az://container/data/
s3-cli upload /local/models/* gs://ml-bucket/models/
s3-cli upload /local/backup/* file:///remote-mount/backup/
s3-cli upload /local/cache/* direct:///nvme-storage/cache/
# Download from any backend
s3-cli download s3://bucket/data/ ./local-data/
s3-cli download az://container/logs/ ./logs/
s3-cli download gs://ml-bucket/datasets/ ./datasets/
s3-cli download file:///network-storage/data/ ./data/
# Cross-backend copying workflow
s3-cli download s3://source-bucket/data/ ./temp/
s3-cli upload ./temp/* gs://dest-bucket/data/
```
**Advanced Pattern Matching:**
```bash
# Glob patterns for file selection (upload)
s3-cli upload "/data/*.log" s3://bucket/logs/
s3-cli upload "/files/data_*.csv" az://container/data/
# Regex patterns for listing (use single quotes to prevent shell expansion)
s3-cli ls -r s3://bucket/ -p '.*\.txt$' # Only .txt files
s3-cli ls -r gs://bucket/ -p '.*\.(csv|json)$' # CSV or JSON files
s3-cli ls -r az://acct/cont/ -p '.*/data_.*' # Files with "data_" in path
# Count objects matching pattern (with progress indicator)
s3-cli ls -rc gs://bucket/data/ -p '.*\.npz$'
# Output: [00:00:05] 71,305 objects (14,261 obj/s)
# Total objects: 142,610 (10.0s, rate: 14,261 objects/s)
# Delete only matching files
s3-cli delete -r s3://bucket/logs/ -p '.*\.log$'
```
See **[CLI Guide](docs/CLI_GUIDE.md)** for complete command reference and pattern syntax.
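Before running a destructive command such as `delete -p`, it can help to preview which keys a pattern selects. The sketch below uses Python's `re` module on a hypothetical key list; it assumes the CLI applies the pattern with search-style matching, which holds for the simple patterns shown above but should be confirmed against the CLI Guide.

```python
import re

# Hypothetical object keys used only for this preview.
keys = [
    "logs/app.txt",
    "data/train.csv",
    "data/meta.json",
    "data/data_001.npz",
    "notes.txt.bak",
]


def preview(pattern, keys):
    """Return the keys that a regex pattern would select."""
    rx = re.compile(pattern)
    return [k for k in keys if rx.search(k)]


print(preview(r".*\.txt$", keys))
print(preview(r".*\.(csv|json)$", keys))
print(preview(r".*/data_.*", keys))
```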
### Python Integration
**High-Performance Data Operations:**
```python
import s3dlio
# Universal upload/download across all backends
s3dlio.upload(['/local/data.csv'], 's3://bucket/data/')
s3dlio.upload(['/local/logs/*.log'], 'az://container/logs/')
s3dlio.upload(['/local/models/*.pt'], 'gs://ml-bucket/models/')
s3dlio.download('s3://bucket/data/', './local-data/')
s3dlio.download('gs://ml-bucket/datasets/', './datasets/')
# High-level AI/ML operations
dataset = s3dlio.create_dataset("s3://bucket/training-data/")
loader = s3dlio.create_async_loader("gs://ml-bucket/data/", {"batch_size": 32})
# PyTorch integration
from s3dlio.torch import S3IterableDataset
from torch.utils.data import DataLoader
dataset = S3IterableDataset("gs://bucket/data/", loader_opts={})
dataloader = DataLoader(dataset, batch_size=16)
```
**Streaming & Compression:**
```python
# High-performance streaming with compression
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
options.compression_level = 3
writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data_bytes)
stats = writer.finalize() # Returns (bytes_written, compressed_bytes)
# Data generation with configurable modes
s3dlio.put("s3://bucket/test-data-{}.bin", num=1000, size=4194304,
data_gen_mode="streaming") # 2.6-3.5x faster for most cases
```
**XorStream - dedup-safe data generation (v0.9.97+):**
`XorStream` generates unique, incompressible data at ~15 GB/s per core without Rayon
thread management overhead. Every `fill()` / `generate()` call is guaranteed to produce a
different 512-byte-block-level fingerprint. Ideal for PUT-heavy benchmarks where many
worker threads share a single generator.
```python
import s3dlio
import numpy as np
stream = s3dlio.XorStream()
# --- Fastest path: in-place fill into pre-allocated bytearray ---
buf = bytearray(8 * 1024 * 1024) # 8 MiB working buffer
stream.fill(buf) # fill once: unique payload
stream.fill(buf) # fill again: different payload, guaranteed
print(stream.objects_generated) # == 2
# --- Convenience path: allocate + fill in one call ---
data = stream.generate(8 * 1024 * 1024) # returns BytesView
view = memoryview(data) # zero-copy Python view
arr = np.frombuffer(view, dtype=np.uint8) # zero-copy numpy array
# --- PUT with XorStream data (benchmark pattern) ---
import threading
def worker(stream, uri_template, worker_id, n):
    buf = bytearray(8 * 1024 * 1024)
    for i in range(n):
        stream.fill(buf)  # reuse buffer, unique data each time
        # Include the worker id so threads do not overwrite each other's keys
        s3dlio.put_bytes(buf, uri_template.format(worker_id, i))
threads = [threading.Thread(target=worker, args=(stream, "s3://bucket/obj-{}-{}.bin", w, 100))
           for w in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
| Scenario | Best choice |
|---|---|
| High concurrency PUT (≥ 32 workers) | **`XorStream`** - no Rayon scheduling |
| Medium objects 1-32 MiB | **`XorStream`** - no per-call allocation |
| Controllable compress/dedup ratios | `generate_data()` / `Generator` |
| Very large objects (โฅ 256 MiB) | `Generator.fill_chunk()` |
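One way to sanity-check a dedup-safety claim is to count repeated 512-byte blocks across generated buffers. The sketch below is a generic checker, not part of s3dlio: `duplicate_blocks` is a hypothetical helper, and `os.urandom` stands in for the generator so the example runs without the library.

```python
import hashlib
import os

BLOCK = 512  # block size named in the XorStream description above


def duplicate_blocks(buffers, block=BLOCK):
    """Count blocks whose content already appeared earlier in any buffer."""
    seen = set()
    dupes = 0
    for buf in buffers:
        view = memoryview(buf)
        for off in range(0, len(view), block):
            digest = hashlib.blake2b(view[off:off + block], digest_size=16).digest()
            if digest in seen:
                dupes += 1
            else:
                seen.add(digest)
    return dupes


# An all-zero buffer dedups heavily; random data should not repeat at all.
print(duplicate_blocks([bytes(4096)]))
print(duplicate_blocks([os.urandom(4096), os.urandom(4096)]))
```

To check real generator output, take a copy after each fill (for example `bytes(buf)`), since the working buffer is reused in place.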
**Multi-Endpoint Load Balancing (v0.9.14+):**
```python
import numpy as np
import s3dlio
# Distribute I/O across multiple storage endpoints
store = s3dlio.create_multi_endpoint_store(
    uris=[
        "s3://bucket-1/data",
        "s3://bucket-2/data",
        "s3://bucket-3/data",
    ],
    strategy="least_connections"  # or "round_robin"
)
# Zero-copy data access (memoryview compatible)
data = store.get("s3://bucket-1/file.bin")
array = np.frombuffer(memoryview(data), dtype=np.float32)
# Monitor load distribution
stats = store.get_endpoint_stats()
for i, s in enumerate(stats):
    print(f"Endpoint {i}: {s['requests']} requests, {s['bytes_transferred']} bytes")
```
**[Complete Multi-Endpoint Guide](docs/MULTI_ENDPOINT_GUIDE.md)** - Load balancing, configuration, use cases
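The difference between the two strategies is easiest to see in a toy model. The classes below are hypothetical and single-threaded; they illustrate the selection rules only and are not s3dlio's implementation.

```python
from itertools import count


class RoundRobin:
    """Cycle through endpoints in order, ignoring their current load."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._next = count()

    def pick(self):
        return self.endpoints[next(self._next) % len(self.endpoints)]


class LeastConnections:
    """Pick the endpoint with the fewest in-flight requests."""

    def __init__(self, endpoints):
        self.active = {ep: 0 for ep in endpoints}

    def acquire(self):
        # min() returns the first endpoint with the lowest count, so ties follow list order.
        ep = min(self.active, key=self.active.get)
        self.active[ep] += 1
        return ep

    def release(self, ep):
        self.active[ep] -= 1


rr = RoundRobin(["s3://bucket-1/data", "s3://bucket-2/data", "s3://bucket-3/data"])
print([rr.pick() for _ in range(4)])
```

Round-robin spreads requests evenly by count, while least-connections adapts when one endpoint is slower and its requests stay in flight longer.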
### Parquet DataLoader - Epoch-Aware Training (v0.9.98+)
The Parquet DataLoader provides per-row-group streaming for Parquet files on **any
s3dlio-accessible storage**: S3, Azure Blob, GCS, `file://`, and `direct://` (O_DIRECT;
tested and working). Only the URI prefix changes; no code changes needed to switch backends.
Two decode modes: `Raw` (Python decodes) and `ArrowIpc` (Rust decodes to Arrow IPC bytes).
**Zero footer re-fetches on epoch 2+** (row-group byte ranges cached in a process-global
DashMap after epoch 1; backend-agnostic).
```python
import io
import pyarrow as pa
import pyarrow.parquet as pq
import s3dlio
# Works with any s3dlio URI; just change the prefix:
#   "s3://bucket/train/"          Amazon S3 / MinIO / Ceph
#   "az://container/train/"       Azure Blob Storage
#   "gs://bucket/train/"          Google Cloud Storage
#   "file:///mnt/data/train/"     Local filesystem
#   "direct:///mnt/nvme/train/"   Local O_DIRECT (bypass page cache)
# Raw mode: Python decodes with PyArrow (default)
loader = s3dlio.create_async_loader(
    "s3://bucket/train/",
    {"format": "parquet", "prefetch": 32}
)
for item in loader:
    # item["data"]: bytes, item["uri"]: str, item["rg_idx"]: int
    table = pq.read_table(io.BytesIO(item["data"]))
# Arrow IPC mode: Rust decodes, Python gets ready-to-use RecordBatch bytes
loader = s3dlio.create_async_loader(
    "direct:///mnt/nvme/train/",  # same API, different backend
    {"format": "parquet", "decode": "arrow", "prefetch": 32}
)
for item in loader:
    batch = pa.ipc.open_stream(pa.py_buffer(item["data"])).read_next_batch()
```
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `"format"` | `str` | (none) | Must be `"parquet"` to activate Parquet mode |
| `"decode"` | `str` | `"raw"` | `"raw"` or `"arrow"` (Rust-side decode) |
| `"columns"` | `list[int]` | `None` | Column subset; `None` = all columns |
| `"footer_cap"` | `int` | 4 MiB | Bytes from file tail for footer parsing |
| `"prefetch"` | `int` | `32` | Concurrent in-flight row-group GETs |
**Epoch-2+ speedup** - measured against live MinIO:
| Epoch | Construction time | Notes |
|-------|-----------------|-------|
| 1 | 20.4 ms | `list_objects` + footer GETs |
| 2+ | 8.3 ms | `list_objects` only - **2.5× faster** |
**Memory**: only metadata in RAM; 8 concurrent workers share process-global caches (no 8× duplication).
**[Complete Parquet DataLoader Guide](docs/Parquet_Data-Loader.md)**
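The epoch-2 fast path comes down to caching footer metadata by URI so that later epochs skip the footer GETs. The sketch below models that with a module-level dict; `fake_fetch_footer`, the byte ranges, and the file names are hypothetical stand-ins, and the real cache is described above as a process-global DashMap on the Rust side.

```python
_FOOTER_CACHE = {}  # uri -> list of (offset, length) row-group byte ranges
footer_fetches = 0


def fake_fetch_footer(uri):
    """Hypothetical stand-in for a footer GET plus parse."""
    global footer_fetches
    footer_fetches += 1
    return [(0, 1024), (1024, 2048)]


def row_group_ranges(uri):
    # Only the first request for a URI pays for the footer fetch.
    if uri not in _FOOTER_CACHE:
        _FOOTER_CACHE[uri] = fake_fetch_footer(uri)
    return _FOOTER_CACHE[uri]


def build_epoch(uris):
    """Expand files into (uri, row_group_index, byte_range) work items."""
    return [(uri, idx, rng) for uri in uris for idx, rng in enumerate(row_group_ranges(uri))]


files = ["s3://bucket/train/a.parquet", "s3://bucket/train/b.parquet"]
epoch1 = build_epoch(files)
fetches_after_epoch1 = footer_fetches
epoch2 = build_epoch(files)
fetches_after_epoch2 = footer_fetches

# Ratio of the construction times reported in the table above.
speedup = 20.4 / 8.3
print(fetches_after_epoch1, fetches_after_epoch2, round(speedup, 2))
```

The ratio of the two reported construction times works out to about 2.46, which the table rounds to 2.5×.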
## Performance
### Benchmark Results
Peak throughput measured for representative operations:
| Operation | Performance | Notes |
|-----------|-------------|-------|
| **S3 PUT** | Up to 3.089 GB/s | Exceeds steady-state baseline by 17.8% |
| **S3 GET** | Up to 4.826 GB/s | Near line-speed performance |
| **Multi-Process** | 2-3x faster | Improvement over single process |
| **Streaming Mode** | 2.6-3.5x faster | For 1-8MB objects vs single-pass |
### Optimization Features
- **HTTP/2 Support (opt-in)**: HTTP/2 is supported but **not the default**: HTTP/1.1 is used unless you set `S3DLIO_H2C=1`. HTTP/2 is available for scenarios that benefit from multiplexing, but is typically slower for bulk storage workloads.
- **Intelligent Defaults**: Streaming mode automatically selected based on benchmarks
- **Multi-Process Architecture**: Massive parallelism for maximum performance
- **Zero-Copy Streaming**: Memory-efficient operations for large datasets
- **Configurable Chunk Sizes**: Fine-tune performance for your workload
**Checkpointing model state:**
```python
import s3dlio
# Checkpoint system for model states
store = s3dlio.PyCheckpointStore('file:///tmp/checkpoints/')
store.save('model_state', your_model_data)
loaded_data = store.load('model_state')
```
**Ready for Production**: All core functionality validated, comprehensive test suite, and honest documentation matching actual capabilities.
## Configuration & Tuning
### Environment Variables
s3dlio supports comprehensive configuration through environment variables:
- **NVIDIA AIStore**: `S3DLIO_FOLLOW_REDIRECTS=1` - Enable HTTP 307 redirect following for AIStore (opt-in, disabled by default); `S3DLIO_REDIRECT_MAX=5` - Maximum redirect hops per request
- **HTTP/2 mode**: `S3DLIO_H2C=1` - Force h2c (HTTP/2 cleartext) on http:// endpoints; `S3DLIO_H2C=0` - Force HTTP/1.1; unset = auto-probe (default)
- **Runtime Scaling**: `S3DLIO_RT_THREADS=32` - Tokio worker threads
- **Connection Pool**: `S3DLIO_POOL_MAX_IDLE_PER_HOST=32` - Max idle connections per host (default: 32)
- **S3 Range GET**: `S3DLIO_ENABLE_RANGE_OPTIMIZATION=0` - Disable concurrent range splitting (enabled by default); `S3DLIO_RANGE_THRESHOLD_MB=64` - Size threshold in MiB (default: 32); `S3DLIO_RANGE_CONCURRENCY=64` - Max concurrent range requests
- **Operation Logging**: `S3DLIO_OPLOG_LEVEL=2` - S3 operation tracking
[Environment Variables Reference](docs/Environment_Variables.md)
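When driving s3dlio from Python, one option is to set these variables in-process. The helper below is a hypothetical sketch using only `os.environ`; the variable names and example values come from the list above. It assumes the variables must be set before `import s3dlio`, since the v0.9.84 notes mention env-var caching, so later changes may not be picked up.

```python
import os

# Example values taken from the list above; tune them for your workload.
S3DLIO_TUNING = {
    "S3DLIO_RT_THREADS": "32",
    "S3DLIO_POOL_MAX_IDLE_PER_HOST": "32",
    "S3DLIO_RANGE_THRESHOLD_MB": "64",
    "S3DLIO_RANGE_CONCURRENCY": "64",
}


def apply_env(settings, environ=os.environ):
    """Set each variable unless the environment already defines it; return what was set."""
    applied = {}
    for name, value in settings.items():
        if name not in environ:
            environ[name] = value
            applied[name] = value
    return applied


# Demonstrated on a plain dict so the real process environment is left untouched.
demo_env = {"S3DLIO_RT_THREADS": "16"}
print(apply_env(S3DLIO_TUNING, demo_env))
```

Leaving already-defined variables alone lets a shell-level export override the script's defaults.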
### Operation Logging (Op-Log)
Universal operation trace logging across all backends with zstd-compressed TSV format, warp-replay compatible.
```python
import s3dlio
s3dlio.init_op_log("operations.tsv.zst")
# All operations automatically logged
s3dlio.finalize_op_log()
```
See [S3DLIO OpLog Implementation](docs/S3DLIO_OPLOG_IMPLEMENTATION_SUMMARY.md) for detailed usage.
## Building from Source
### Prerequisites
- **Rust**: [Install Rust toolchain](https://www.rust-lang.org/tools/install)
- **Python 3.12+**: For Python library development
- **UV** (recommended): [Install UV](https://docs.astral.sh/uv/getting-started/installation/)
- **OpenSSL**: Required (`libssl-dev` on Ubuntu)
- **HDF5** *(optional)*: Only needed with `--features hdf5` (`libhdf5-dev` on Ubuntu, `brew install hdf5` on macOS)
- **hwloc** *(optional)*: Only needed with `--features numa` (`libhwloc-dev` on Ubuntu)
### Build Steps
```bash
# Python environment
uv venv && source .venv/bin/activate
# Rust CLI
cargo build --release
# Python library
./build_pyo3.sh && ./install_pyo3_wheel.sh
```
## Configuration
### Environment Setup
```bash
# Required for S3 operations
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_ENDPOINT_URL=https://your-s3-endpoint
AWS_REGION=us-east-1
```
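A quick pre-flight check can catch a missing credential before the first request fails. This is a hypothetical helper built on the four variable names listed above; it treats all four as required, following the comment in the block, although `AWS_ENDPOINT_URL` is typically only relevant for non-AWS, S3-compatible endpoints.

```python
import os

REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL",
    "AWS_REGION",
)


def missing_aws_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_AWS_VARS if not environ.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```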
S3 operation logging compatible with the MinIO warp format is described under [Operation Logging (Op-Log)](#operation-logging-op-log).
## Advanced Features
### CPU Profiling & Analysis
```bash
cargo build --release --features profiling
cargo run --example simple_flamegraph_test --features profiling
```
### Compression & Streaming
```python
import s3dlio
options = s3dlio.PyWriterOptions()
options.compression = "zstd"
writer = s3dlio.create_s3_writer('s3://bucket/data.zst', options)
writer.write_chunk(large_data)
stats = writer.finalize()
```
## Container Deployment
```bash
# Use pre-built container
podman pull quay.io/russfellows-sig65/s3dlio
podman run --net=host --rm -it quay.io/russfellows-sig65/s3dlio
# Or build locally
podman build -t s3dlio .
```
**Note**: Always use `--net=host` for storage backend connectivity.
## Documentation & Support
- **CLI Guide**: [docs/CLI_GUIDE.md](docs/CLI_GUIDE.md) - Complete command-line reference
- **Python API**: [docs/PYTHON_API_GUIDE.md](docs/PYTHON_API_GUIDE.md) - Python library reference
- **Parquet DataLoader**: [docs/Parquet_Data-Loader.md](docs/Parquet_Data-Loader.md) - Epoch-aware Parquet DataLoader guide (v0.9.98+)
- **API Documentation**: [docs/api/](docs/api/)
- **Changelog**: [docs/Changelog.md](docs/Changelog.md)
- **Testing Guide**: [docs/TESTING-GUIDE.md](docs/TESTING-GUIDE.md)
- **Performance**: [docs/performance/](docs/performance/)
## Related Projects
- **[sai3-bench](https://github.com/russfellows/sai3-bench)** - Multi-protocol I/O benchmarking suite built on s3dlio
- **[polarWarp](https://github.com/russfellows/polarWarp)** - Op-log analysis tool for parsing and visualizing s3dlio operation logs
- **[google-cloud-rust (s3dlio patched fork)](https://github.com/russfellows/google-cloud-rust)** - Official Google Cloud Rust client fork used by s3dlio for patched GCS support
## License
Licensed under the Apache License 2.0 - see [LICENSE](LICENSE) file.
---
**Ready to get started?** Check out the [Quick Start](#quick-start) section above or explore our [example scripts](examples/) for common use cases!