{"id":42808777,"url":"https://github.com/russfellows/s3dlio","last_synced_at":"2026-05-13T06:12:31.086Z","repository":{"id":302170085,"uuid":"982376659","full_name":"russfellows/s3dlio","owner":"russfellows","description":"Part of the sai3 project that delivers multi-protocol storage access for AI/ML workflows, supporting Pytorch, Tensorflow and Jax.  This project provides a CLI, along with Rust and Python libraries for AI/ML storage workflows.  Supporting S3, File, Azure Blob and GCS using the latest Rust SDKs.","archived":false,"fork":false,"pushed_at":"2026-03-15T04:20:00.000Z","size":4178,"stargazers_count":8,"open_issues_count":7,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-03-15T18:42:39.075Z","etag":null,"topics":["ai-ml","ai-storage","aws-s3","azure-blob","file-io","gcs","jax","python-library","pytorch","rust","rust-crate","s3","sai3","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/russfellows.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-12T19:40:19.000Z","updated_at":"2026-03-15T04:14:40.000Z","dependencies_parsed_at":"2025-07-01T00:30:08.409Z","dependency_job_id":"fac60b01-c5b4-45cf-a42c-7d8e3f8f928c","html_url":"https://github.com/russfellows/s3dlio","commit_stats":null,"previous_names":["russfellows/s3dlio"],"tags_count":56,"template":false,"template_full_name":null,"purl":"pkg:github/r
ussfellows/s3dlio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fs3dlio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fs3dlio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fs3dlio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fs3dlio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/russfellows","download_url":"https://codeload.github.com/russfellows/s3dlio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/russfellows%2Fs3dlio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30639120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-18T00:09:27.587Z","status":"ssl_error","status_checked_at":"2026-03-18T00:09:26.123Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-ml","ai-storage","aws-s3","azure-blob","file-io","gcs","jax","python-library","pytorch","rust","rust-crate","s3","sai3","tensorflow"],"created_at":"2026-01-30T04:34:10.017Z","updated_at":"2026-05-13T06:12:31.078Z","avatar_url":"https://github.com/russfellows.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# s3dlio - Universal Storage I/O Library\n\n[![Build 
Status](https://img.shields.io/badge/build-passing-brightgreen)](https://github.com/russfellows/s3dlio)\n[![Rust Tests](https://img.shields.io/badge/rust%20tests-314-brightgreen)](docs/Changelog.md)\n[![Version](https://img.shields.io/badge/version-0.9.100-blue)](https://github.com/russfellows/s3dlio/releases)\n[![PyPI](https://img.shields.io/pypi/v/s3dlio)](https://pypi.org/project/s3dlio/)\n[![License](https://img.shields.io/badge/license-Apache--2.0-blue)](LICENSE)\n[![Rust](https://img.shields.io/badge/rust-1.91%2B-orange)](https://www.rust-lang.org)\n[![Python](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org)\n\nHigh-performance, multi-protocol storage library for AI/ML workloads with universal copy operations across S3, Azure, GCS, local file systems, and DirectIO.\n\n\u003e **v0.9.100 — General-purpose object data loader: `PyDataset.from_uris()`, `items()`, `collect_batch()`, `skip_head` HEAD optimisation**\n\u003e\n\u003e **`PyDataset.from_uris(uris)`**: map-style dataset from a pre-built URI list — no listing overhead. **`PyBytesAsyncDataLoader.items()`**: URI-carrying sliding-window iterator; Tokio keeps `prefetch` GETs permanently in flight via `buffer_unordered`. **`collect_batch(n)`**: drains `n` items with one GIL crossing. **`skip_head=True` (new default)**: skips the per-object HEAD request; actual size is cached after each GET so range splitting fires correctly from epoch 2. See [docs/Python_Data-Loader.md](docs/Python_Data-Loader.md).\n\u003e\n\u003e **v0.9.98 (prior):** `ParquetRowGroupDataset` — epoch-aware Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. 
See [docs/Changelog.md](docs/Changelog.md) for full history.\n\n## 📦 Installation\n\n### Quick Install (Python)\n\n```bash\n# If using uv package manager + uv virtual environment:\nuv pip install s3dlio\n\n# If using pip without uv:\npip install s3dlio\n```\n\n### Python Backend Profiles (PyPI vs Full Build)\n\n- If using `uv` package manager + `uv` virtual environment: `uv pip install s3dlio`.\n- If using standard `pip` without `uv`: `pip install s3dlio`.\n- The default published wheel is now S3-focused (Azure Blob and GCS are excluded).\n- If you want full backends (S3 + Azure Blob + GCS), build from source with:\n\n```bash\n# uv workflow:\nuv pip install s3dlio --no-binary s3dlio --config-settings \"cargo-extra-args=--features extension-module,full-backends\"\n\n# pip-only workflow:\npip install s3dlio --no-binary s3dlio --config-settings \"cargo-extra-args=--features extension-module,full-backends\"\n```\n\nYou can still add a separate package name (for example `s3dlio-full`) later if you want a dedicated prebuilt full wheel distribution.\n\n\u003e Maintainer note: for PyPI uploads, publish the default (`./build_pyo3.sh`) wheel unless intentionally releasing a separate distribution. `full-backends` is currently source-build only via the command above.\n\n### Building from Source (Rust)\n\n#### System Dependencies\n\ns3dlio requires some system libraries to build. 
**Only OpenSSL and pkg-config are required by default.** HDF5 and hwloc are optional and improve functionality but are not needed for the core library:\n\n**Ubuntu/Debian:**\n```bash\n# Quick install - run our helper script\n./scripts/install-system-deps.sh\n\n# Or manually (required only):\nsudo apt-get install -y build-essential pkg-config libssl-dev\n\n# Optional - for NUMA topology support (--features numa):\nsudo apt-get install -y libhwloc-dev\n\n# Optional - for HDF5 data format support (--features hdf5):\nsudo apt-get install -y libhdf5-dev\n\n# All optional libraries at once:\nsudo apt-get install -y libhdf5-dev libhwloc-dev cmake\n```\n\n**RHEL/CentOS/Fedora/Rocky/AlmaLinux:**\n```bash\n# Quick install\n./scripts/install-system-deps.sh\n\n# Or manually (required only):\nsudo dnf install -y gcc gcc-c++ make pkg-config openssl-devel\n\n# Optional - for NUMA topology support:\nsudo dnf install -y hwloc-devel\n\n# Optional - for HDF5 data format support:\nsudo dnf install -y hdf5-devel\n\n# All optional libraries at once:\nsudo dnf install -y hdf5-devel hwloc-devel cmake\n```\n\n**macOS:**\n```bash\n# Quick install\n./scripts/install-system-deps.sh\n\n# Or manually (required only):\nbrew install pkg-config openssl@3\n\n# Optional - for NUMA/HDF5 support:\nbrew install hdf5 hwloc cmake\n\n# Set environment variables (add to ~/.zshrc or ~/.bash_profile):\nexport PKG_CONFIG_PATH=\"$(brew --prefix openssl@3)/lib/pkgconfig:$PKG_CONFIG_PATH\"\nexport OPENSSL_DIR=\"$(brew --prefix openssl@3)\"\n```\n\n**Arch Linux:**\n```bash\n# Quick install\n./scripts/install-system-deps.sh\n\n# Or manually (required only):\nsudo pacman -S base-devel pkg-config openssl\n\n# Optional - for NUMA/HDF5 support:\nsudo pacman -S hdf5 hwloc cmake\n```\n\n**WSL (Windows Subsystem for Linux) / Minimal Environments:**\n\nIf you are building on WSL or any environment where `libhdf5` or `libhwloc` may not be available, s3dlio builds without them by default. 
No extra libraries are required:\n```bash\n# Just the basics - works on WSL, Docker, CI, and minimal installs:\nsudo apt-get install -y build-essential pkg-config libssl-dev\ncargo build --release\n# install Python package (no system HDF5/hwloc needed):\n# uv workflow:\nuv pip install s3dlio\n# pip-only workflow:\npip install s3dlio\n```\n\n#### Install Rust (if not already installed)\n\n```bash\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\nsource $HOME/.cargo/env\n```\n\n#### Build s3dlio\n\n```bash\n# Clone the repository\ngit clone https://github.com/russfellows/s3dlio.git\ncd s3dlio\n\n# Build with default features (no HDF5 or NUMA required)\ncargo build --release\n\n# Build s3-cli with all cloud backends enabled (AWS + Azure + GCS)\ncargo build --release --bin s3-cli --features full-backends\n\n# Build s3-cli with GCS enabled only (plus default backends)\ncargo build --release --bin s3-cli --features backend-gcs\n\n# Build with NUMA topology support (requires libhwloc-dev)\ncargo build --release --features numa\n\n# Build with HDF5 data format support (requires libhdf5-dev)\ncargo build --release --features hdf5\n\n# Build with all optional features\ncargo build --release --features numa,hdf5\n\n# Run tests\ncargo test\n\n# Build Python bindings (optional)\n./build_pyo3.sh\n\n# Build Python bindings with full backends (S3 + Azure + GCS)\n./build_pyo3.sh full\n\n# Named profile form is also supported:\n./build_pyo3.sh --profile full\n./build_pyo3.sh --profile default\n\n# Show profile/help usage\n./build_pyo3.sh --help\n```\n\n### Build Profile Quick Reference\n\nRust backend feature profiles:\n\n- Default build (`cargo build --release`): S3-focused default backend set.\n- GCS-enabled build (`--features backend-gcs`): enables GCS in addition to default set.\n- Full cloud build (`--features full-backends`): enables AWS + Azure + GCS.\n\nPython wheel build profiles via `build_pyo3.sh`:\n\n- `default` or `slim`: AWS + file/direct; excludes Azure 
and GCS.\n- `full`: AWS + Azure + GCS + file/direct.\n- Positional and named forms are equivalent:\n    - `./build_pyo3.sh full`\n    - `./build_pyo3.sh -p full`\n    - `./build_pyo3.sh --profile full`\n\nOptional extra Rust features for wheel builds can still be passed with `EXTRA_FEATURES`.\nExample: `EXTRA_FEATURES=\"numa,hdf5\" ./build_pyo3.sh full`.\n\n**Note:** NUMA support (`--features numa`) improves multi-socket performance but requires the `hwloc2` C library. HDF5 support (`--features hdf5`) enables HDF5 data format generation but requires `libhdf5`. Both are optional and s3dlio is fully functional without them.\n\n**Platform support:** s3dlio builds natively on Linux (x86\\_64, aarch64), macOS (x86\\_64 and Apple Silicon arm64), and WSL. Making `numa` and `hdf5` optional was the key change for broad platform support — all remaining dependencies are pure Rust or use platform-independent system libraries (OpenSSL). To cross-compile Python wheels for Linux ARM64 from an x86\\_64 host, see `build_pyo3.sh` for instructions using the `--zig` linker. For macOS universal2 (fat binary covering both architectures), see the commented section in `build_pyo3.sh`.\n\n## ✨ Key Features\n\n- **High Performance**: High-throughput multi GB/s reads and writes on platforms with sufficient network and storage capabilities\n- **Zero-Copy Architecture**: `bytes::Bytes` throughout for minimal memory overhead\n- **Multi-Protocol**: S3, Azure Blob, GCS, file://, direct:// (O_DIRECT)\n- **HTTP/2 Support (opt-in)**: HTTP/2 is available but **HTTP/1.1 is the default** — HTTP/2 is almost always slower for bulk storage workloads. Opt in by setting `S3DLIO_H2C=1`. Both TLS ALPN (`https://`) and cleartext h2c (`http://`) are supported when enabled. 
See [docs/HTTP2_ALPN_INVESTIGATION.md](docs/HTTP2_ALPN_INVESTIGATION.md).\n- **Python \u0026 Rust**: Native Rust library with zero-copy Python bindings (PyO3), bytearray support for efficient memory management\n- **Multi-Endpoint Load Balancing**: RoundRobin/LeastConnections across storage endpoints\n- **AI/ML Ready**: PyTorch DataLoader integration, TFRecord/NPZ format support\n- **Parquet DataLoader**: Per-row-group epoch-aware DataLoader for any s3dlio storage backend (S3, Azure Blob, GCS, file://, direct://). Epoch-2 zero re-fetch speedup (2.5×+), Raw + Arrow IPC decode modes, 8-worker concurrency with shared metadata caches — [see guide](docs/Parquet_Data-Loader.md)\n- **High-Speed Data Generation**: 50+ GB/s test data with configurable compression/dedup\n\n## 🌟 Latest Release\n\n**v0.9.98** (May 2026) — Parquet DataLoader with epoch-2 fast path and Arrow IPC decode. See [docs/Changelog.md](docs/Changelog.md).\n\n**Recent highlights:**\n- **v0.9.98** - **Parquet DataLoader** (`ParquetRowGroupDataset`): per-row-group Dataset, epoch-2 zero-re-fetch (2.5× speedup proven), Raw + ArrowIpc decode modes, 8-worker shared caches; 648 tests passing\n- **v0.9.97** - `XorStream` (dedup-safe, ~15 GB/s/core); `S3DLIO_UNSIGNED_PAYLOAD` opt-in for private S3-compatible endpoints; 613 tests passing\n- **v0.9.92** - MPU coordinator task, auto-scale, async write/flush/finish safety fixes, MAX_MULTIPART_PARTS guard; 580 tests passing\n- **v0.9.90** - Full NVIDIA AIStore support (`S3DLIO_FOLLOW_REDIRECTS=1`) with all TLS security policies; HTTP/2 available (opt-in via `S3DLIO_H2C=1`, **not the default**); 5 issues closed (#126, #133, #134, #135, #136); 559 tests passing\n- **v0.9.86** - Redirect follower for NVIDIA AIStore (S3 path); HTTPS→HTTP downgrade prevention; 21 new redirect tests; redirect security analysis documented\n- **v0.9.84** - HEAD elimination (ObjectSizeCache); OnceLock env-var caching; lock-free range assembly; `AWS_CA_BUNDLE_PATH` → `AWS_CA_BUNDLE`; structured 
tracing\n- **v0.9.80** - Python list hang fix (IMDSv2 legacy call removed); tracing deadlock fix (`tokio::spawn` → inline stream); async S3 delete/bucket helpers; deprecated Python APIs cleaned up\n\n📖 **[Complete Changelog](docs/Changelog.md)** - Full version history, migration guides, API details\n\n---\n\n## 📚 Version History\n\nFor detailed release notes and migration guides, see the [Complete Changelog](docs/Changelog.md).\n\n---\n\n## Storage Backend Support\n\n### Universal Backend Architecture\ns3dlio provides unified storage operations across all backends with consistent URI patterns:\n\n- **🗄️ Amazon S3**: `s3://bucket/prefix/` - High-performance S3 operations (5+ GB/s reads, 2.5+ GB/s writes) with built-in concurrent range GETs (on by default)\n- **☁️ Azure Blob Storage**: `az://container/prefix/` - Complete Azure integration with **RangeEngine** (30-50% faster for large blobs)\n- **🌐 Google Cloud Storage**: `gs://bucket/prefix/` or `gcs://bucket/prefix/` - Production ready with **RangeEngine** and full ObjectStore integration\n- **📁 Local File System**: `file:///path/to/directory/` - High-speed local file operations with **RangeEngine** support\n- **⚡ DirectIO**: `direct:///path/to/directory/` - Bypass OS cache for maximum I/O performance with **RangeEngine**\n\n### Concurrent Range GET Performance Features (v0.9.3+, Updated v0.9.60)\nConcurrent range downloads hide network latency by parallelizing HTTP range requests.\n\n**All backends support concurrent range GETs — but via two different mechanisms:**\n\n**Mechanism 1 — S3 built-in (on by default, v0.9.60+)**\n- ✅ **Amazon S3**: Concurrent range splitting enabled by default via `S3DLIO_ENABLE_RANGE_OPTIMIZATION` (default: on). Uses `get_object_concurrent_range_async()` — fires parallel `GetObject(Range: bytes=N-M)` requests via the AWS SDK with lock-free chunk assembly. Controlled by `S3DLIO_RANGE_THRESHOLD_MB` (default: 32 MiB) and `S3DLIO_RANGE_CONCURRENCY` (default: auto-scaled). 
Disable with `S3DLIO_ENABLE_RANGE_OPTIMIZATION=0`.\n\n**Mechanism 2 — RangeEngine (per-store config flag, must enable explicitly)**\n- ✅ **Azure Blob Storage**: 30-50% faster for large files (`enable_range_engine: true` in `AzureConfig`)\n- ✅ **Google Cloud Storage**: 30-50% faster for large files (`enable_range_engine: true` in `GcsConfig`)\n- ⚠️ **Local File System**: Rarely beneficial due to seek overhead (disabled by default)\n- ⚠️ **DirectIO**: Rarely beneficial due to O_DIRECT overhead (disabled by default)\n\n**RangeEngine config flag defaults (v0.9.6+):**\n- **Status**: `enable_range_engine: false` by default in all per-store config structs\n- **Reason**: Extra HEAD request on every GET causes ~50% slowdown for small-object workloads\n- **Threshold**: 32 MiB default (tunable per-store via `RangeEngineConfig::min_split_size`)\n\n**How to Enable for Large-File Workloads:**\n```rust\nuse s3dlio::object_store::{AzureObjectStore, AzureConfig};\n\nlet config = AzureConfig {\n    enable_range_engine: true,  // Explicitly enable for large files\n    ..Default::default()\n};\nlet store = AzureObjectStore::with_config(config);\n```\n\n**When to Enable:**\n- ✅ Large-file workloads (average size \u003e= 32 MiB)\n- ✅ High-bandwidth, high-latency networks\n- ❌ Mixed or small-object workloads\n- ❌ Local file systems\n\n### S3 Backend Options\ns3dlio supports two S3 backend implementations. 
**Native AWS SDK is the default and recommended** for production use:\n\n```bash\n# Default: Native AWS SDK backend (RECOMMENDED for production)\ncargo build --release\n# or explicitly:\ncargo build --no-default-features --features native-backends\n\n# Experimental: Apache Arrow object_store backend (optional, for testing)\ncargo build --no-default-features --features arrow-backend\n```\n\n**Why native-backends is default:**\n- Proven performance in production workloads\n- Optimized for high-throughput S3 operations (5+ GB/s reads, 2.5+ GB/s writes)\n- Well-tested with MinIO, Vast, and AWS S3\n\n**About arrow-backend:**\n- Experimental alternative implementation\n- No proven performance advantage over native backend\n- Useful for comparison testing and development\n- Not recommended for production use\n\n### GCS Backend Options (Current)\n\nGCS is now **optional** at build time.\n\n- Default build (`cargo build --release`) does **not** include GCS.\n- To include GCS, enable `backend-gcs` (or `full-backends`).\n- When enabled, s3dlio uses the **official Google crates** (`google-cloud-storage` + gax) from a patched fork maintained for s3dlio.\n\n```bash\n# Default build (S3-focused; no GCS)\ncargo build --release\n\n# Enable GCS explicitly\ncargo build --release --features backend-gcs\n\n# Enable all cloud backends (AWS + Azure + GCS)\ncargo build --release --features full-backends\n```\n\n**Patched official GCS fork used by s3dlio:**\n- Repository: https://github.com/russfellows/google-cloud-rust\n- Integration in this repo is pinned in [Cargo.toml](Cargo.toml) (currently via release tag from that fork).\n\n**Legacy note:** `gcs-community` remains as a legacy opt-in path, but the primary supported path is the official Google crates from the patched `russfellows/google-cloud-rust` fork.\n\n## Quick Start\n\n### Installation\n\n**Rust CLI:**\n```bash\ngit clone https://github.com/russfellows/s3dlio.git\ncd s3dlio\ncargo build --release\n\n# Full cloud backend CLI 
build:\ncargo build --release --bin s3-cli --features full-backends\n```\n\n**Python Library:**\n```bash\n# uv workflow:\nuv pip install s3dlio\n\n# pip-only workflow:\npip install s3dlio\n\n# or build from source:\n./build_pyo3.sh \u0026\u0026 ./install_pyo3_wheel.sh\n\n# build from source with full cloud backends:\n./build_pyo3.sh --profile full \u0026\u0026 ./install_pyo3_wheel.sh\n```\n\n### Documentation\n\n- **[CLI Guide](docs/CLI_GUIDE.md)** - Complete command-line interface reference with examples\n- **[Python API Guide](docs/PYTHON_API_GUIDE.md)** - Complete Python library reference with examples\n- **[Parquet DataLoader Guide](docs/Parquet_Data-Loader.md)** - Epoch-aware per-row-group Parquet DataLoader (v0.9.98+)\n- **[Multi-Endpoint Guide](docs/MULTI_ENDPOINT_GUIDE.md)** - Load balancing across multiple storage endpoints (v0.9.14+)\n- **[Rust API Guide v0.9.0](docs/api/rust-api-v0.9.0.md)** - Complete Rust library reference with migration guide\n- **[Changelog](docs/Changelog.md)** - Version history and release notes\n- **[Adaptive Tuning Guide](docs/ADAPTIVE-TUNING.md)** - Optional performance auto-tuning\n- **[Testing Guide](docs/TESTING-GUIDE.md)** - Test suite documentation\n\n## Core Capabilities\n\n### 🚀 Universal Copy Operations\n\ns3dlio treats upload and download as enhanced versions of the Unix `cp` command, working across all storage backends:\n\n**CLI Usage:**\n```bash\n# Upload to any backend with real-time progress\ns3-cli upload /local/data/*.log s3://mybucket/logs/\ns3-cli upload /local/files/* az://container/data/  \ns3-cli upload /local/models/* gs://ml-bucket/models/\ns3-cli upload /local/backup/* file:///remote-mount/backup/\ns3-cli upload /local/cache/* direct:///nvme-storage/cache/\n\n# Download from any backend  \ns3-cli download s3://bucket/data/ ./local-data/\ns3-cli download az://container/logs/ ./logs/\ns3-cli download gs://ml-bucket/datasets/ ./datasets/\ns3-cli download file:///network-storage/data/ ./data/\n\n# 
Cross-backend copying workflow\ns3-cli download s3://source-bucket/data/ ./temp/\ns3-cli upload ./temp/* gs://dest-bucket/data/\n```\n\n**Advanced Pattern Matching:**\n```bash\n# Glob patterns for file selection (upload)\ns3-cli upload \"/data/*.log\" s3://bucket/logs/\ns3-cli upload \"/files/data_*.csv\" az://container/data/\n\n# Regex patterns for listing (use single quotes to prevent shell expansion)\ns3-cli ls -r s3://bucket/ -p '.*\\.txt$'           # Only .txt files\ns3-cli ls -r gs://bucket/ -p '.*\\.(csv|json)$'    # CSV or JSON files\ns3-cli ls -r az://acct/cont/ -p '.*/data_.*'      # Files with \"data_\" in path\n\n# Count objects matching pattern (with progress indicator)\ns3-cli ls -rc gs://bucket/data/ -p '.*\\.npz$'\n# Output: ⠙ [00:00:05] 71,305 objects (14,261 obj/s)\n#         Total objects: 142,610 (10.0s, rate: 14,261 objects/s)\n\n# Delete only matching files\ns3-cli delete -r s3://bucket/logs/ -p '.*\\.log$'\n```\n\nSee **[CLI Guide](docs/CLI_GUIDE.md)** for complete command reference and pattern syntax.\n\n### 🐍 Python Integration\n\n**High-Performance Data Operations:**\n```python\nimport s3dlio\n\n# Universal upload/download across all backends\ns3dlio.upload(['/local/data.csv'], 's3://bucket/data/')\ns3dlio.upload(['/local/logs/*.log'], 'az://container/logs/')  \ns3dlio.upload(['/local/models/*.pt'], 'gs://ml-bucket/models/')\ns3dlio.download('s3://bucket/data/', './local-data/')\ns3dlio.download('gs://ml-bucket/datasets/', './datasets/')\n\n# High-level AI/ML operations\ndataset = s3dlio.create_dataset(\"s3://bucket/training-data/\")\nloader = s3dlio.create_async_loader(\"gs://ml-bucket/data/\", {\"batch_size\": 32})\n\n# PyTorch integration\nfrom s3dlio.torch import S3IterableDataset\nfrom torch.utils.data import DataLoader\n\ndataset = S3IterableDataset(\"gs://bucket/data/\", loader_opts={})\ndataloader = DataLoader(dataset, batch_size=16)\n```\n\n**Streaming \u0026 Compression:**\n```python\n# High-performance streaming with 
compression\noptions = s3dlio.PyWriterOptions()\noptions.compression = \"zstd\"\noptions.compression_level = 3\n\nwriter = s3dlio.create_s3_writer('s3://bucket/data.zst', options)\nwriter.write_chunk(large_data_bytes)\nstats = writer.finalize()  # Returns (bytes_written, compressed_bytes)\n\n# Data generation with configurable modes\ns3dlio.put(\"s3://bucket/test-data-{}.bin\", num=1000, size=4194304, \n          data_gen_mode=\"streaming\")  # 2.6-3.5x faster for most cases\n```\n\n**XorStream — dedup-safe data generation (v0.9.97+):**\n\n`XorStream` generates unique, incompressible data at ~15 GB/s per core without Rayon\nthread management overhead. Every `fill()` / `generate()` call is guaranteed to produce a\ndifferent 512-byte-block-level fingerprint. Ideal for PUT-heavy benchmarks where many\nworker threads share a single generator.\n\n```python\nimport s3dlio\nimport numpy as np\n\nstream = s3dlio.XorStream()\n\n# --- Fastest path: in-place fill into pre-allocated bytearray ---\nbuf = bytearray(8 * 1024 * 1024)   # 8 MiB working buffer\nstream.fill(buf)                    # fill once — unique payload\nstream.fill(buf)                    # fill again — different payload, guaranteed\n\nprint(stream.objects_generated)     # == 2\n\n# --- Convenience path: allocate + fill in one call ---\ndata = stream.generate(8 * 1024 * 1024)  # returns BytesView\nview = memoryview(data)                   # zero-copy Python view\narr  = np.frombuffer(view, dtype=np.uint8)  # zero-copy numpy array\n\n# --- PUT with XorStream data (benchmark pattern) ---\nimport threading\n\ndef worker(stream, uri_template, n):\n    buf = bytearray(8 * 1024 * 1024)\n    for i in range(n):\n        stream.fill(buf)       # reuse buffer, unique data each time\n        s3dlio.put_bytes(buf, uri_template.format(i))\n\nthreads = [threading.Thread(target=worker, args=(stream, \"s3://bucket/obj-{}.bin\", 100))\n           for _ in range(32)]\nfor t in threads: t.start()\nfor t in threads: 
t.join()\n```\n\n| Scenario | Best choice |\n|---|---|\n| High concurrency PUT (≥ 32 workers) | **`XorStream`** — no Rayon scheduling |\n| Medium objects 1–32 MiB | **`XorStream`** — no per-call allocation |\n| Controllable compress/dedup ratios | `generate_data()` / `Generator` |\n| Very large objects (≥ 256 MiB) | `Generator.fill_chunk()` |\n\n\n**Multi-Endpoint Load Balancing (v0.9.14+):**\n```python\nimport numpy as np\nimport s3dlio\n\n# Distribute I/O across multiple storage endpoints\nstore = s3dlio.create_multi_endpoint_store(\n    uris=[\n        \"s3://bucket-1/data\",\n        \"s3://bucket-2/data\", \n        \"s3://bucket-3/data\",\n    ],\n    strategy=\"least_connections\"  # or \"round_robin\"\n)\n\n# Zero-copy data access (memoryview compatible)\ndata = store.get(\"s3://bucket-1/file.bin\")\narray = np.frombuffer(memoryview(data), dtype=np.float32)\n\n# Monitor load distribution\nstats = store.get_endpoint_stats()\nfor i, s in enumerate(stats):\n    print(f\"Endpoint {i}: {s['requests']} requests, {s['bytes_transferred']} bytes\")\n```\n📖 **[Complete Multi-Endpoint Guide](docs/MULTI_ENDPOINT_GUIDE.md)** - Load balancing, configuration, use cases\n\n### 📦 Parquet DataLoader — Epoch-Aware Training (v0.9.98+)\n\nThe Parquet DataLoader provides per-row-group streaming for Parquet files on **any\ns3dlio-accessible storage** — S3, Azure Blob, GCS, `file://`, and `direct://` (O_DIRECT;\ntested and working). 
Only the URI prefix changes; no code changes needed to switch backends.\nTwo decode modes: `Raw` (Python decodes) and `ArrowIpc` (Rust decodes to Arrow IPC bytes).\n**Zero footer re-fetches on epoch 2+** (row-group byte ranges cached in a process-global\nDashMap after epoch 1; backend-agnostic).\n\n```python\nimport s3dlio\n\n# Works with any s3dlio URI — just change the prefix:\n#   \"s3://bucket/train/\"           Amazon S3 / MinIO / Ceph\n#   \"az://container/train/\"        Azure Blob Storage\n#   \"gs://bucket/train/\"           Google Cloud Storage\n#   \"file:///mnt/data/train/\"      Local filesystem\n#   \"direct:///mnt/nvme/train/\"    Local O_DIRECT (bypass page cache)\n\n# Raw mode — Python decodes with PyArrow (default)\nloader = s3dlio.create_async_loader(\n    \"s3://bucket/train/\",\n    {\"format\": \"parquet\", \"prefetch\": 32}\n)\nfor item in loader:\n    # item[\"data\"]: bytes, item[\"uri\"]: str, item[\"rg_idx\"]: int\n    table = pyarrow.parquet.read_table(io.BytesIO(item[\"data\"]))\n\n# Arrow IPC mode — Rust decodes, Python gets ready-to-use RecordBatch bytes\nloader = s3dlio.create_async_loader(\n    \"direct:///mnt/nvme/train/\",   # same API, different backend\n    {\"format\": \"parquet\", \"decode\": \"arrow\", \"prefetch\": 32}\n)\nfor item in loader:\n    batch = pa.ipc.open_stream(pa.py_buffer(item[\"data\"])).read_next_batch()\n```\n\n| Option | Type | Default | Description |\n|--------|------|---------|-------------|\n| `\"format\"` | `str` | — | Must be `\"parquet\"` to activate Parquet mode |\n| `\"decode\"` | `str` | `\"raw\"` | `\"raw\"` or `\"arrow\"` (Rust-side decode) |\n| `\"columns\"` | `list[int]` | `None` | Column subset; `None` = all columns |\n| `\"footer_cap\"` | `int` | 4 MiB | Bytes from file tail for footer parsing |\n| `\"prefetch\"` | `int` | `32` | Concurrent in-flight row-group GETs |\n\n**Epoch-2+ speedup** — measured against live MinIO:\n\n| Epoch | Construction time | Notes 
|\n|-------|-----------------|-------|\n| 1 | 20.4 ms | `list_objects` + footer GETs |\n| 2+ | 8.3 ms | `list_objects` only — **2.5× faster** |\n\n**Memory**: only metadata in RAM; 8 concurrent workers share process-global caches (no 8× duplication).\n\n📖 **[Complete Parquet DataLoader Guide](docs/Parquet_Data-Loader.md)**\n\n## Performance\n\n\n### Benchmark Results\ns3dlio delivers world-class performance across all operations:\n\n| Operation | Performance | Notes |\n|-----------|-------------|-------|\n| **S3 PUT** | Up to 3.089 GB/s | Exceeds steady-state baseline by 17.8% |\n| **S3 GET** | Up to 4.826 GB/s | Near line-speed performance |\n| **Multi-Process** | 2-3x faster | Improvement over single process |\n| **Streaming Mode** | 2.6-3.5x faster | For 1-8MB objects vs single-pass |\n\n### Optimization Features\n- **HTTP/2 Support (opt-in)**: HTTP/2 is supported but **not the default** — HTTP/1.1 is used unless you set `S3DLIO_H2C=1`. HTTP/2 is available for scenarios that benefit from multiplexing, but is typically slower for bulk storage workloads.\n- **Intelligent Defaults**: Streaming mode automatically selected based on benchmarks\n- **Multi-Process Architecture**: Massive parallelism for maximum performance\n- **Zero-Copy Streaming**: Memory-efficient operations for large datasets\n- **Configurable Chunk Sizes**: Fine-tune performance for your workload\n\n```python\nimport s3dlio\n\n# Checkpoint system for model states\nstore = s3dlio.PyCheckpointStore('file:///tmp/checkpoints/')\nstore.save('model_state', your_model_data)\nloaded_data = store.load('model_state')\n```\n\n**Ready for Production**: All core functionality validated, comprehensive test suite, and honest documentation matching actual capabilities.\n\n## Configuration \u0026 Tuning\n\n### Environment Variables\ns3dlio supports comprehensive configuration through environment variables:\n\n- **NVIDIA AIStore**: `S3DLIO_FOLLOW_REDIRECTS=1` - Enable HTTP 307 redirect following for AIStore (opt-in, disabled by default); 
`S3DLIO_REDIRECT_MAX=5` - Maximum redirect hops per request\n- **HTTP/2 mode**: `S3DLIO_H2C=1` - Force h2c (HTTP/2 cleartext) on http:// endpoints; `S3DLIO_H2C=0` - Force HTTP/1.1; unset = auto-probe (default)\n- **Runtime Scaling**: `S3DLIO_RT_THREADS=32` - Tokio worker threads\n- **Connection Pool**: `S3DLIO_POOL_MAX_IDLE_PER_HOST=32` - Max idle connections per host (default: 32)\n- **S3 Range GET**: `S3DLIO_ENABLE_RANGE_OPTIMIZATION=0` - Disable concurrent range splitting (enabled by default); `S3DLIO_RANGE_THRESHOLD_MB=64` - Size threshold in MiB (default: 32); `S3DLIO_RANGE_CONCURRENCY=64` - Max concurrent range requests\n- **Operation Logging**: `S3DLIO_OPLOG_LEVEL=2` - S3 operation tracking\n\n📖 [Environment Variables Reference](docs/Environment_Variables.md)\n\n### Operation Logging (Op-Log)\nUniversal operation trace logging across all backends with zstd-compressed TSV format, warp-replay compatible.\n\n```python\nimport s3dlio\ns3dlio.init_op_log(\"operations.tsv.zst\")\n# All operations automatically logged\ns3dlio.finalize_op_log()\n```\n\nSee [S3DLIO OpLog Implementation](docs/S3DLIO_OPLOG_IMPLEMENTATION_SUMMARY.md) for detailed usage.\n\n## Building from Source\n\n### Prerequisites\n- **Rust**: [Install Rust toolchain](https://www.rust-lang.org/tools/install)\n- **Python 3.12+**: For Python library development\n- **UV** (recommended): [Install UV](https://docs.astral.sh/uv/getting-started/installation/)\n- **OpenSSL**: Required (`libssl-dev` on Ubuntu)\n- **HDF5** *(optional)*: Only needed with `--features hdf5` (`libhdf5-dev` on Ubuntu, `brew install hdf5` on macOS)\n- **hwloc** *(optional)*: Only needed with `--features numa` (`libhwloc-dev` on Ubuntu)\n\n### Build Steps\n```bash\n# Python environment\nuv venv \u0026\u0026 source .venv/bin/activate\n\n# Rust CLI\ncargo build --release\n\n# Python library\n./build_pyo3.sh \u0026\u0026 ./install_pyo3_wheel.sh\n```\n\n## Configuration\n\n### Environment Setup\n```bash\n# Required for S3 
operations\nAWS_ACCESS_KEY_ID=your-access-key\nAWS_SECRET_ACCESS_KEY=your-secret-key\nAWS_ENDPOINT_URL=https://your-s3-endpoint\nAWS_REGION=us-east-1\n```\n\nFor S3 operation logging compatible with the MinIO warp format, see Operation Logging (Op-Log) above.\n\n## Advanced Features\n\n### CPU Profiling \u0026 Analysis\n```bash\ncargo build --release --features profiling\ncargo run --example simple_flamegraph_test --features profiling\n```\n\n### Compression \u0026 Streaming\n```python\nimport s3dlio\noptions = s3dlio.PyWriterOptions()\noptions.compression = \"zstd\"\nwriter = s3dlio.create_s3_writer('s3://bucket/data.zst', options)\nwriter.write_chunk(large_data)\nstats = writer.finalize()\n```\n\n## Container Deployment\n\n```bash\n# Use pre-built container\npodman pull quay.io/russfellows-sig65/s3dlio\npodman run --net=host --rm -it quay.io/russfellows-sig65/s3dlio\n\n# Or build locally\npodman build -t s3dlio .\n```\n\n**Note**: Always use `--net=host` for storage backend connectivity.\n\n## Documentation \u0026 Support\n\n- **🖥️ CLI Guide**: [docs/CLI_GUIDE.md](docs/CLI_GUIDE.md) - Complete command-line reference\n- **🐍 Python API**: [docs/PYTHON_API_GUIDE.md](docs/PYTHON_API_GUIDE.md) - Python library reference\n- **Parquet DataLoader**: [docs/Parquet_Data-Loader.md](docs/Parquet_Data-Loader.md) - Epoch-aware Parquet DataLoader guide (v0.9.98+)\n- **📚 API Documentation**: [docs/api/](docs/api/)\n- **📝 Changelog**: [docs/Changelog.md](docs/Changelog.md)\n- **🧪 Testing Guide**: [docs/TESTING-GUIDE.md](docs/TESTING-GUIDE.md)\n- **🚀 Performance**: [docs/performance/](docs/performance/)\n\n## 🔗 Related Projects\n\n- **[sai3-bench](https://github.com/russfellows/sai3-bench)** - Multi-protocol I/O benchmarking suite built on s3dlio\n- **[polarWarp](https://github.com/russfellows/polarWarp)** - Op-log analysis tool for parsing and visualizing s3dlio operation logs\n- **[google-cloud-rust (s3dlio patched fork)](https://github.com/russfellows/google-cloud-rust)** - Official Google Cloud 
Rust client fork used by s3dlio for patched GCS support\n\n## License\n\nLicensed under the Apache License 2.0 - see [LICENSE](LICENSE) file.\n\n---\n\n**🚀 Ready to get started?** Check out the [Quick Start](#quick-start) section above or explore our [example scripts](examples/) for common use cases!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frussfellows%2Fs3dlio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frussfellows%2Fs3dlio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frussfellows%2Fs3dlio/lists"}