https://github.com/logannye/emsqrt
Process any data size with a fixed, small memory footprint. EM-√ is an external-memory ETL/log processing engine with hard peak-RAM guarantees. Unlike traditional systems that "try" to stay within memory limits, EM-√ enforces a strict memory cap, enabling you to process arbitrarily large datasets using small memory footprints.
- Host: GitHub
- URL: https://github.com/logannye/emsqrt
- Owner: logannye
- Created: 2025-11-04T05:50:15.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-18T04:58:55.000Z (6 months ago)
- Last Synced: 2025-11-18T06:23:46.509Z (6 months ago)
- Topics: big-data, big-data-analytics, cloud, cloud-computing, edge-ai, edge-computing, efficiency, efficient-algorithm, memory-allocation, rust, streaming, streaming-algorithms, streaming-data
- Language: Rust
- Homepage:
- Size: 283 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# EM-√ (EM-Sqrt): External-Memory ETL Engine
## **Process any dataset size with a fixed, small memory footprint.**
EM-√ is an external-memory ETL/log processing engine with **hard peak-RAM guarantees**. Unlike traditional systems that "try" to stay within memory limits, EM-√ **enforces** a strict memory cap, enabling you to process arbitrarily large datasets using small memory footprints.
## Key Features
- **Hard Memory Guarantees**: Never exceeds the configured memory cap (default 512MB). All allocations are tracked via RAII guards.
- **External-Memory Operators**: Sort, join, and aggregate operations automatically spill to disk when memory limits are hit.
- **Tree Evaluation (TE) Scheduling**: Principled execution schedule that decomposes plans into blocks with bounded fan-in to control peak memory.
- **Cloud-Ready**: Spill segments are written to the local filesystem with checksums and compression. S3 and GCS adapters are planned.
- **Pluggable Spill Storage**: Spill targets are selected by URI, with retry/backoff controls. Local paths work today; the cloud object store options (S3, GCS, Azure) are configuration placeholders until the adapters are implemented.
- **Parquet Support**: Native columnar Parquet I/O with Arrow integration (optional `--features parquet`).
- **Grace Hash Join**: Automatic partition-based hash join for datasets exceeding memory limits.
- **Deterministic Execution**: Stable plan hashing for reproducibility and auditability.
- **Memory-Constrained Environments**: Designed for edge computing, serverless, embedded systems, and containerized deployments.
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/logannye/emsqrt.git
cd emsqrt
# Build the project
cargo build --release
# Run tests
cargo test
```
### Basic Usage
#### Programmatic API
```rust
use emsqrt_core::schema::{Field, DataType, Schema};
use emsqrt_core::dag::LogicalPlan as L;
use emsqrt_core::config::EngineConfig;
use emsqrt_planner::{rules, lower_to_physical, estimate_work};
use emsqrt_te::plan_te;
use emsqrt_exec::Engine;

// Define your schema
let schema = Schema {
    fields: vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, false),
    ],
};

// Build a logical plan: scan → filter → project → sink
let scan = L::Scan {
    source: "file:///path/to/input.csv".to_string(),
    schema: schema.clone(),
};
let filter = L::Filter {
    input: Box::new(scan),
    expr: "age > 25".to_string(),
};
let project = L::Project {
    input: Box::new(filter),
    columns: vec!["name".to_string(), "age".to_string()],
};
let sink = L::Sink {
    input: Box::new(project),
    destination: "file:///path/to/output.csv".to_string(),
    format: "csv".to_string(),
};

// Optimize and execute
let optimized = rules::optimize(sink);
let phys_prog = lower_to_physical(&optimized);
let work = estimate_work(&optimized, None);
let te = plan_te(&phys_prog.plan, &work, 512 * 1024 * 1024)?; // 512MB memory cap

// Configure and run
let mut config = EngineConfig::default();
config.mem_cap_bytes = 512 * 1024 * 1024; // 512MB
config.spill_dir = "/tmp/emsqrt-spill".to_string();
let mut engine = Engine::new(config).expect("engine initialization");
let manifest = engine.run(&phys_prog, &te)?;
println!("Execution completed in {}ms", manifest.finished_ms - manifest.started_ms);
```
#### YAML DSL
The YAML DSL supports linear pipelines. A minimal example:
```yaml
steps:
  - op: scan
    source: "data/logs.csv"
    schema:
      - { name: "ts", type: "Utf8", nullable: false }
      - { name: "uid", type: "Utf8", nullable: false }
      - { name: "amount", type: "Float64", nullable: true }
  - op: filter
    expr: "amount > 1000"
  - op: project
    columns: ["ts", "uid", "amount"]
  - op: sink
    destination: "results/filtered.csv"
    format: "csv" # or "parquet" (requires --features parquet)
```
**Note**: Currently supports `scan`, `filter`, `project`, `map`, and `sink`. Aggregate and join operators are not yet supported in YAML (use the programmatic API for these operations).
Add an optional `config` block to describe spill targets without touching CLI flags:
```yaml
config:
  spill_uri: "s3://my-bucket/spill"
  spill_aws_region: "us-east-1"
  spill_gcs_service_account: "/path/to/service-account.json"
steps:
  - op: scan
    source: "data/logs.csv"
    schema: []
  - op: sink
    destination: "stdout"
    format: "csv"
```
Values from `config` merge with environment variables and command-line overrides.
**Parquet Support**: Scan and Sink operators support Parquet format when built with `--features parquet`. Files are automatically detected by extension (`.parquet`, `.parq`) or can be explicitly specified with `format: "parquet"`.
#### CLI Usage
The EM-√ CLI provides a convenient way to run pipelines from YAML files:
```bash
# Validate a pipeline YAML file
emsqrt validate --pipeline examples/simple_pipeline.yaml
# Show execution plan (EXPLAIN)
emsqrt explain --pipeline examples/simple_pipeline.yaml --memory-cap 536870912
# Execute a pipeline
emsqrt run --pipeline examples/simple_pipeline.yaml
# Override configuration via command-line flags
emsqrt run \
  --pipeline examples/simple_pipeline.yaml \
  --memory-cap 1073741824 \
  --spill-uri s3://my-bucket/spill \
  --spill-aws-region us-east-1 \
  --spill-aws-access-key-id AKIA... \
  --spill-aws-secret-access-key SECRET... \
  --spill-gcs-service-account /path/to/sa.json \
  --spill-azure-access-key azureKey \
  --spill-retry-max 5 \
  --spill-dir /tmp/emsqrt-spill \
  --max-parallel 4
```
See `examples/README.md` for more details on YAML pipeline syntax.
### Cloud Spill Authentication
When using S3/GCS/Azure spill URIs, provide credentials via CLI flags, environment variables, or the platform's SDK defaults:
- **S3**: export `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` or use `aws configure`; optionally include `--spill-aws-region`.
- **GCS**: set `GOOGLE_SERVICE_ACCOUNT`/`GOOGLE_SERVICE_ACCOUNT_PATH` or run `gcloud auth application-default login`.
- **Azure**: use `az login` and `AZURE_STORAGE_CONNECTION_STRING` or pass `--spill-azure-access-key`.
The `config` block in `examples/cloud_spill/pipeline.yaml` illustrates a spill URI plus retry tuning so you can avoid repeating CLI flags per run.
## Examples of Practical Use Cases
### 1. Serverless Data Pipelines
Process large datasets in AWS Lambda, Google Cloud Functions, or Azure Functions with strict memory limits:
```rust
// Process 100GB dataset in a 512MB Lambda
// Note: S3 spill support is planned; currently use local filesystem
let config = EngineConfig {
    mem_cap_bytes: 512 * 1024 * 1024, // 512MB
    spill_dir: "/tmp/lambda-spill".to_string(),
    ..Default::default()
};
```
**Value**: A projected 10-100x cost reduction vs. large EC2 instances or EMR clusters. This is a target rather than a measured result; benchmarks are still planned.
### 2. Edge Data Processing
Aggregate sensor data on IoT gateways or embedded devices with limited RAM:
```rust
// Process 1M sensor readings on a Raspberry Pi with 256MB RAM
let config = EngineConfig {
    mem_cap_bytes: 128 * 1024 * 1024, // Use only 128MB
    spill_dir: "/tmp/sensor-spill".to_string(),
    ..Default::default()
};
```
**Value**: Enable edge analytics without hardware upgrades.
### 3. Multi-Tenant Data Platforms
Run customer queries with isolated memory budgets:
```rust
// Each customer gets a memory budget
// Note: S3 spill support is planned; currently use local filesystem
let config = EngineConfig {
    mem_cap_bytes: customer_memory_budget,
    spill_dir: format!("/tmp/platform-spill/customer-{}", customer_id),
    ..Default::default()
};
```
**Value**: Predictable performance, resource isolation, accurate cost attribution.
### 4. Cost-Optimized Analytics
Use smaller, cheaper instances by trading I/O for memory:
```rust
// Process 500GB dataset on a 4GB RAM instance instead of 64GB
let config = EngineConfig {
    mem_cap_bytes: 4 * 1024 * 1024 * 1024, // 4GB
    spill_dir: "/fast-nvme/spill".to_string(),
    ..Default::default()
};
```
**Value**: A projected 10x cost reduction for memory-bound workloads, pending the planned benchmarks.
## Architecture
EM-√ is built as a modular Rust workspace with the following crates:
```
emsqrt-core/ - Core types, schemas, DAGs, memory budget traits
emsqrt-te/ - Tree Evaluation planner (bounded fan-in decomposition)
emsqrt-mem/ - Memory budget implementation, spill manager, buffer pool
emsqrt-io/ - I/O adapters (CSV, JSONL, Parquet, storage backends)
emsqrt-operators/ - Query operators (filter, project, sort, join, aggregate)
emsqrt-planner/ - Logical/physical planning, optimization, YAML DSL
emsqrt-exec/ - Execution runtime, scheduler, engine
emsqrt-cli/ - Command-line interface for running pipelines
```
### Execution Flow
1. **Planning**: YAML/Logical plan → Optimized logical plan → Physical plan with operator bindings
2. **TE Scheduling**: Physical plan → Tree Evaluation blocks with bounded fan-in
3. **Execution**: Blocks executed in dependency order, respecting memory budget
4. **Spilling**: Operators automatically spill to disk when memory limits are hit
5. **Manifest**: Deterministic execution manifest with plan hashes for reproducibility
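Step 3 above says blocks run in dependency order. The README does not describe the scheduler itself, so the following is only a minimal sketch of one standard way to derive such an order (Kahn's topological sort). The function name `dependency_order` and the adjacency-list input are assumptions for illustration, not part of the EM-√ API.

```rust
use std::collections::VecDeque;

/// Return an order in which every block appears after all of its inputs.
/// `deps[i]` lists the blocks that block `i` depends on.
/// Returns `None` if the graph has a cycle or references an unknown block.
/// Hypothetical helper, not an EM-√ function.
fn dependency_order(deps: &[Vec<usize>]) -> Option<Vec<usize>> {
    let n = deps.len();
    // Number of inputs each block is still waiting for.
    let mut remaining: Vec<usize> = deps.iter().map(|d| d.len()).collect();
    // Reverse edges: for each block, the blocks that consume its output.
    let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (block, inputs) in deps.iter().enumerate() {
        for &input in inputs {
            dependents.get_mut(input)?.push(block);
        }
    }
    // Blocks with no inputs can run immediately.
    let mut ready: VecDeque<usize> = (0..n).filter(|&b| remaining[b] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(block) = ready.pop_front() {
        order.push(block);
        for &d in &dependents[block] {
            remaining[d] -= 1;
            if remaining[d] == 0 {
                ready.push_back(d);
            }
        }
    }
    // Any block left unscheduled is part of a cycle.
    if order.len() == n { Some(order) } else { None }
}
```

For two scans (blocks 0 and 1) feeding a join (block 2) that feeds a sink (block 3), the input `[[], [], [0, 1], [2]]` yields the order `0, 1, 2, 3`.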
### Memory Management
- **MemoryBudget**: RAII guards track all allocations
- **SpillManager**: Checksummed, compressed segments for external-memory operations
- **TE Frontier**: Bounded live blocks guarantee peak memory ≤ cap
## Configuration
### EngineConfig
```rust
pub struct EngineConfig {
    /// Hard memory cap (bytes). The engine must NEVER exceed this.
    pub mem_cap_bytes: usize,
    /// Optional block-size hint (TE planner may override)
    pub block_size_hint: Option<usize>,
    /// Max on-disk spill concurrency
    pub max_spill_concurrency: usize,
    /// Optional seed for deterministic shuffles
    pub seed: Option<u64>,
    /// Execution parallelism
    pub max_parallel_tasks: usize,
    /// Directory for spill files
    pub spill_dir: String,
}
```
### Environment Variables
```bash
export EMSQRT_MEM_CAP_BYTES=536870912 # 512MB
export EMSQRT_SPILL_DIR=/tmp/emsqrt-spill
export EMSQRT_MAX_PARALLEL_TASKS=4
export EMSQRT_SPILL_URI=s3://my-bucket/emsqrt
export EMSQRT_SPILL_AWS_REGION=us-east-1
export EMSQRT_SPILL_AWS_ACCESS_KEY_ID=AKIA...
export EMSQRT_SPILL_AWS_SECRET_ACCESS_KEY=SECRET...
export EMSQRT_SPILL_AWS_SESSION_TOKEN=optionalSession
export EMSQRT_SPILL_GCS_SA_PATH=/path/to/service-account.json
export EMSQRT_SPILL_AZURE_ACCESS_KEY=azureKey
export EMSQRT_SPILL_RETRY_MAX_RETRIES=5
export EMSQRT_SPILL_RETRY_INITIAL_MS=250
export EMSQRT_SPILL_RETRY_MAX_MS=5000
```
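As an illustration of how one of these variables could be consumed, here is a minimal sketch that reads `EMSQRT_MEM_CAP_BYTES` and falls back to the documented 512 MiB default. The helper names `parse_mem_cap` and `mem_cap_from_env` are hypothetical, and the engine's actual parsing and precedence rules may differ.

```rust
/// Documented default cap: 512 MiB.
const DEFAULT_MEM_CAP_BYTES: usize = 512 * 1024 * 1024;

/// Parse a raw byte count, or use the default when the variable is unset.
/// Hypothetical helper, not an EM-√ function.
fn parse_mem_cap(raw: Option<&str>) -> Result<usize, String> {
    match raw {
        None => Ok(DEFAULT_MEM_CAP_BYTES),
        Some(s) => s
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("EMSQRT_MEM_CAP_BYTES={s:?} is not a byte count: {e}")),
    }
}

/// Read the cap from the process environment.
fn mem_cap_from_env() -> Result<usize, String> {
    let raw = std::env::var("EMSQRT_MEM_CAP_BYTES").ok();
    parse_mem_cap(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback and error paths easy to test without mutating process state.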
### Default Configuration
```rust
EngineConfig::default()
// mem_cap_bytes: 512 MiB
// max_spill_concurrency: 4
// max_parallel_tasks: 4
// spill_dir: "/tmp/emsqrt-spill"
```
#### StorageConfig
```rust
pub struct StorageConfig {
    pub uri: Option<String>, // e.g. s3://bucket/prefix
    pub root: String,        // normalized spill root
    pub aws_region: Option<String>,
    pub aws_access_key_id: Option<String>,
    pub aws_secret_access_key: Option<String>,
    pub aws_session_token: Option<String>,
    pub gcs_service_account_path: Option<String>,
    pub azure_access_key: Option<String>,
    pub retry_max_retries: usize,
    pub retry_initial_backoff_ms: u64,
    pub retry_max_backoff_ms: u64,
}
```
`EngineConfig::storage_config()` produces this snapshot and `emsqrt-io` uses it to choose between filesystem and cloud adapters.
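The three `retry_*` fields suggest a capped backoff schedule. The README does not say how delays grow between retries, so the sketch below assumes plain exponential doubling clamped to the maximum. `backoff_delay` and `backoff_schedule` are hypothetical names, not EM-√ functions.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): the initial delay doubles
/// on each attempt and is clamped to `max_ms`.
/// Doubling is an assumption; the source only names the initial and maximum values.
fn backoff_delay(attempt: u32, initial_ms: u64, max_ms: u64) -> Duration {
    // 2^attempt, saturating once the shift would overflow a u64.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let ms = initial_ms.saturating_mul(factor).min(max_ms);
    Duration::from_millis(ms)
}

/// The full list of delays for `max_retries` attempts.
fn backoff_schedule(max_retries: usize, initial_ms: u64, max_ms: u64) -> Vec<Duration> {
    (0..max_retries)
        .map(|attempt| backoff_delay(attempt as u32, initial_ms, max_ms))
        .collect()
}
```

With the values from the environment example (`250` ms initial, `5000` ms maximum, 5 retries), this schedule would wait 250, 500, 1000, 2000, and 4000 ms.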
## Building & Testing
### Build
```bash
# Debug build
cargo build
# Release build (optimized)
cargo build --release
# Build specific crate
cargo build -p emsqrt-exec
```
### Run Tests
```bash
# All tests (unit tests in crates)
cargo test --all --lib
# Specific test suite (in workspace root tests/ directory)
cargo test --test integration_tests
cargo test --test expression_tests
cargo test --test cost_estimation_tests
# Run comprehensive test suite (10 phases)
./scripts/run_all_tests.sh
```
### Test Coverage
The comprehensive test suite (`scripts/run_all_tests.sh`) includes 10 phases:
1. **Unit Tests**: SpillManager, RowBatch helpers, Memory budget
2. **Integration Tests**: Full pipeline tests (scan, filter, project, sort, aggregate, sink, join)
3. **E2E Tests**: End-to-end smoke tests
4. **Crate-Level Tests**: All library unit tests across crates
5. **Expression Engine Tests**: Expression parsing and evaluation
6. **Column Statistics Tests**: Statistics collection and cost estimation
7. **Error Handling Tests**: Error context and recovery
8. **Operator Tests**: Merge join, filter with expressions
9. **Feature-Specific Tests**: Parquet, Arrow (when features enabled)
10. **CLI Tests**: YAML parsing and validation
## Supported Operations
### Currently Implemented
- ✅ **Scan**: Read CSV and Parquet files with schema inference
- ✅ **Filter**: Predicate filtering (e.g., `age > 25`, `name == "Alice"`)
- ✅ **Project**: Column selection and renaming
- ✅ **Map**: Column renaming (e.g., `old_name AS new_name`)
- ✅ **Sort**: External sort with k-way merge
- ✅ **Aggregate**: Group-by with COUNT, SUM, AVG, MIN, MAX
- ✅ **Join**: Hash join (with Grace hash join for large datasets), merge join (sorted merge join for pre-sorted inputs)
- ✅ **Sink**: Write CSV and Parquet files
- ✅ **Expression Engine**: Full SQL-like expressions with operator precedence, cross-type arithmetic, and logical operations
- ✅ **Statistics**: Column statistics (min/max/distinct_count/null_count) for cost estimation and selectivity modeling
- ✅ **Parquet I/O**: Native columnar read/write with Arrow integration (requires `--features parquet`)
- ✅ **Arrow Integration**: Columnar processing with RecordBatch ↔ RowBatch conversion utilities
- ✅ **Grace Hash Join**: Partition-based hash join for very large datasets with automatic spilling
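
The Grace hash join idea listed above can be sketched in a few lines: both inputs are partitioned by a hash of the join key, so matching rows always land in the same partition, and each partition pair is then joined with an ordinary in-memory hash table. This is a generic illustration with in-memory vectors standing in for spilled partitions. The function names and the `(u64, String)` row shape are assumptions, not EM-√ types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Map a join key to one of `n` partitions.
fn partition_of(key: u64, n: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n as u64) as usize
}

/// Split rows into `n` partitions by key hash.
/// In an external-memory engine each partition would be a spill file.
fn partition<V>(rows: Vec<(u64, V)>, n: usize) -> Vec<Vec<(u64, V)>> {
    let mut parts: Vec<Vec<(u64, V)>> = (0..n).map(|_| Vec::new()).collect();
    for row in rows {
        let p = partition_of(row.0, n);
        parts[p].push(row);
    }
    parts
}

/// Inner equi-join on the key, processing one partition pair at a time so
/// only one partition's hash table is resident at once.
fn grace_hash_join(
    left: Vec<(u64, String)>,
    right: Vec<(u64, String)>,
    n: usize,
) -> Vec<(u64, String, String)> {
    assert!(n > 0, "need at least one partition");
    let left_parts = partition(left, n);
    let right_parts = partition(right, n);
    let mut out = Vec::new();
    for (lp, rp) in left_parts.into_iter().zip(right_parts) {
        // Build phase: hash table over the left partition.
        let mut table: HashMap<u64, Vec<String>> = HashMap::new();
        for (k, v) in lp {
            table.entry(k).or_default().push(v);
        }
        // Probe phase: stream the right partition against the table.
        for (k, rv) in rp {
            if let Some(lvs) = table.get(&k) {
                for lv in lvs {
                    out.push((k, lv.clone(), rv.clone()));
                }
            }
        }
    }
    out
}
```

The partition count trades memory for I/O: more partitions mean smaller per-partition hash tables but more spill files to write and read back.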
### Planned Features
- 🔄 **Cloud Storage**: S3, GCS adapters for spill segments (currently filesystem only)
## How It Works
### Tree Evaluation (TE)
Tree Evaluation is a principled execution scheduling approach that:
1. **Decomposes plans into blocks** with bounded fan-in (e.g., each join block depends on at most K input blocks)
2. **Controls the live frontier** (the set of materialized blocks at any time)
3. **Guarantees peak memory** ≤ `K * block_size + overhead`
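
As a worked instance of the bound above, the sketch below evaluates `K * block_size + overhead` and inverts it to find the largest block size that fits under a cap. The helper names and the numbers in the usage note are illustrative assumptions. The TE planner's actual block sizing is not described here.

```rust
/// Upper bound on peak memory: at most `fan_in` blocks are live at once,
/// plus a fixed overhead. Returns `None` on arithmetic overflow.
/// Hypothetical helper, not an EM-√ function.
fn peak_bound_bytes(fan_in: usize, block_size: usize, overhead: usize) -> Option<usize> {
    fan_in.checked_mul(block_size)?.checked_add(overhead)
}

/// Largest block size that keeps the bound within `cap` for a given fan-in.
/// Returns `None` if the overhead alone exceeds the cap or `fan_in` is zero.
fn max_block_size(cap: usize, fan_in: usize, overhead: usize) -> Option<usize> {
    if fan_in == 0 {
        return None;
    }
    cap.checked_sub(overhead).map(|usable| usable / fan_in)
}
```

For example, with a fan-in of 4, 64 MiB blocks, and 16 MiB of overhead, the bound is 4 × 64 + 16 = 272 MiB. Under a 512 MiB cap with the same fan-in and overhead, blocks can be at most (512 − 16) / 4 = 124 MiB.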
### External-Memory Operators
When memory limits are hit, operators automatically:
1. **Spill to disk**: Write intermediate results to checksummed, compressed segments
2. **Partition**: Divide work into smaller chunks that fit in memory
3. **Merge**: Combine results from multiple partitions/runs
Example: External sort generates sorted runs, then performs k-way merge.
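
The two phases of that example can be sketched as follows: a generic external-sort outline with in-memory vectors standing in for spilled runs. `sorted_runs` and `k_way_merge` are illustrative names, not EM-√ functions.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Phase 1: cut the input into chunks of at most `run_len` items and sort
/// each chunk independently. Each chunk models one spilled run.
fn sorted_runs(input: &[i64], run_len: usize) -> Vec<Vec<i64>> {
    input
        .chunks(run_len.max(1)) // `chunks` panics on a zero size
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// Phase 2: merge the sorted runs with a min-heap that holds one head
/// element per run, so memory use scales with the number of runs rather
/// than the total input size.
fn k_way_merge(runs: Vec<Vec<i64>>) -> Vec<i64> {
    let mut iters: Vec<std::vec::IntoIter<i64>> =
        runs.into_iter().map(|run| run.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first element of every non-empty run.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(value) = it.next() {
            heap.push(Reverse((value, idx)));
        }
    }
    let mut out = Vec::new();
    // Repeatedly emit the smallest head and refill from the same run.
    while let Some(Reverse((value, idx))) = heap.pop() {
        out.push(value);
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```

`std::collections::BinaryHeap` is a max-heap, so elements are wrapped in `Reverse` to pop the smallest value first.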
### Memory Budget Enforcement
Every allocation requires a `BudgetGuard`:
```rust
let guard = budget.try_acquire(bytes, "my-buffer")?;
// Allocate memory...
// Guard automatically releases bytes on drop (RAII)
```
If `try_acquire` returns `None`, the operator must spill or partition.
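
To make the RAII pattern concrete, here is a minimal single-threaded sketch of a budget whose guards return their bytes on drop. The `Budget` and `Guard` types below are illustrative stand-ins. The real `MemoryBudget` and `BudgetGuard` in `emsqrt-mem` may differ in signature and are presumably thread-safe.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Tracks bytes in use against a fixed cap (illustrative, single-threaded).
struct Budget {
    cap: usize,
    used: Rc<Cell<usize>>,
}

/// A reservation; its bytes are returned to the budget when it is dropped.
struct Guard {
    bytes: usize,
    used: Rc<Cell<usize>>,
}

impl Budget {
    fn new(cap: usize) -> Self {
        Budget { cap, used: Rc::new(Cell::new(0)) }
    }

    /// Reserve `bytes`, or return `None` rather than exceed the cap.
    /// `_tag` mirrors the label argument in the snippet above and is unused here.
    fn try_acquire(&self, bytes: usize, _tag: &str) -> Option<Guard> {
        let new_used = self.used.get().checked_add(bytes)?;
        if new_used > self.cap {
            return None; // caller must spill or partition instead
        }
        self.used.set(new_used);
        Some(Guard { bytes, used: Rc::clone(&self.used) })
    }

    /// Bytes currently reserved by live guards.
    fn used(&self) -> usize {
        self.used.get()
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Release exactly what this guard reserved.
        self.used.set(self.used.get() - self.bytes);
    }
}
```

Because the release happens in `Drop`, a reservation cannot leak on early returns or error paths, which is what makes the cap enforceable rather than advisory.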
## Performance
### Benchmarks (Planned)
- Sort 10GB with 512MB memory
- Join 1GB × 1GB with 50MB memory
- Aggregate 1M groups with 20MB memory
- TPC-H queries (Q1, Q3, Q6)
See `docs/benchmarks.md` for the current Criterion harness and the `scripts/benchmarks/run_benchmarks.sh` helper.
### Expected Characteristics
- **Throughput**: 10-100x slower than in-memory systems (by design)
- **Memory**: **Guaranteed** to never exceed the configured cap, rather than treating the limit as a best-effort target
- **Scalability**: Can process datasets 100-1000x larger than available RAM
## Development
### Project Structure
```
emsqrt/
├── Cargo.toml # Workspace configuration
├── crates/
│ ├── emsqrt-core/ # Core types and traits
│ ├── emsqrt-te/ # Tree Evaluation planner
│ ├── emsqrt-mem/ # Memory budget and spill manager
│ ├── emsqrt-io/ # I/O adapters
│ ├── emsqrt-operators/ # Query operators
│ ├── emsqrt-planner/ # Planning and optimization
│ ├── emsqrt-exec/ # Execution runtime
│ └── emsqrt-cli/ # Command-line interface
├── tests/ # Integration and unit tests (workspace root)
├── scripts/ # Utility scripts (run_all_tests.sh)
├── examples/ # YAML pipeline examples
└── README.md # This file
```
### Adding a New Operator
1. Implement the `Operator` trait in `emsqrt-operators/src/`
2. Register in `emsqrt-operators/src/registry.rs`
3. Add to planner lowering in `emsqrt-planner/src/lower.rs`
4. Add tests in `tests/`
### Code Style
- Follow Rust standard formatting: `cargo fmt`
- All code must compile with `#![forbid(unsafe_code)]`
- Use `thiserror` for error types
- Use `serde` for serialization
## Contributing
Contributions are welcome! Areas of particular interest:
- Cloud storage adapters (S3, GCS, Azure) - placeholders exist, need implementation
- Additional operators (window functions, lateral joins)
- YAML DSL support for aggregate and join operators
- Performance optimizations (SIMD in Arrow operations, parallel processing)
- Documentation improvements
## Acknowledgments
This project implements Tree Evaluation (TE) scheduling for external-memory query processing, enabling predictable memory usage in constrained environments.
---
**Take-home**: EM-√ trades throughput for guaranteed memory bounds. Use it when memory constraints are more important than raw speed.
This repository is a work in progress and will continue to evolve over time.