https://github.com/logannye/emsqrt

Process any data size with a fixed, small memory footprint. EM-√ is an external-memory ETL/log processing engine with hard peak-RAM guarantees. Unlike traditional systems that "try" to stay within memory limits, EM-√ enforces a strict memory cap, enabling you to process arbitrarily large datasets using small memory footprints.
Host: GitHub
URL: https://github.com/logannye/emsqrt
Owner: logannye
Created: 2025-11-04T05:50:15.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-11-18T04:58:55.000Z (8 months ago)
Last Synced: 2025-11-18T06:23:46.509Z (8 months ago)
Topics: big-data, big-data-analytics, cloud, cloud-computing, edge-ai, edge-computing, efficiency, efficient-algorithm, memory-allocation, rust, streaming, streaming-algorithms, streaming-data
Language: Rust
Homepage:
Size: 283 KB
Stars: 3
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# EM-√ (EM-Sqrt): External-Memory ETL Engine

## **Process any dataset size with a fixed, small memory footprint.**

[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-blue.svg)](LICENSE)

[![Rust](https://img.shields.io/badge/Rust-1.70+-orange.svg)](https://www.rust-lang.org/)

EM-√ is an external-memory ETL/log processing engine with **hard peak-RAM guarantees**. Unlike traditional systems that "try" to stay within memory limits, EM-√ **enforces** a strict memory cap, enabling you to process arbitrarily large datasets using small memory footprints.

## Key Features

- **Hard Memory Guarantees**: Never exceeds the configured memory cap (default 512 MiB). All allocations are tracked via RAII guards.

- **External-Memory Operators**: Sort, join, and aggregate operations automatically spill to disk when memory limits are hit.

- **Tree Evaluation (TE) Scheduling**: Principled execution schedule that decomposes plans into blocks with bounded fan-in to control peak memory.

- **Cloud-Ready**: Spill segments support local filesystem with checksums and compression. S3 and GCS adapters are planned.

- **Pluggable Spill Storage**: Point spills at local paths or cloud object stores (S3, GCS, Azure) with retry/backoff controls.

- **Parquet Support**: Native columnar Parquet I/O with Arrow integration (optional `--features parquet`).

- **Grace Hash Join**: Automatic partition-based hash join for datasets exceeding memory limits.

- **Deterministic Execution**: Stable plan hashing for reproducibility and auditability.

- **Memory-Constrained Environments**: Designed for edge computing, serverless, embedded systems, and containerized deployments.

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/logannye/emsqrt.git
cd emsqrt

# Build the project
cargo build --release

# Run tests
cargo test
```

### Basic Usage

#### Programmatic API

```rust
use emsqrt_core::schema::{Field, DataType, Schema};
use emsqrt_core::dag::LogicalPlan as L;
use emsqrt_core::config::EngineConfig;
use emsqrt_planner::{rules, lower_to_physical, estimate_work};
use emsqrt_te::plan_te;
use emsqrt_exec::Engine;

// Define your schema
let schema = Schema {
    fields: vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int64, false),
    ],
};

// Build a logical plan: scan → filter → project → sink
let scan = L::Scan {
    source: "file:///path/to/input.csv".to_string(),
    schema: schema.clone(),
};

let filter = L::Filter {
    input: Box::new(scan),
    expr: "age > 25".to_string(),
};

let project = L::Project {
    input: Box::new(filter),
    columns: vec!["name".to_string(), "age".to_string()],
};

let sink = L::Sink {
    input: Box::new(project),
    destination: "file:///path/to/output.csv".to_string(),
    format: "csv".to_string(),
};

// Optimize and execute
let optimized = rules::optimize(sink);
let phys_prog = lower_to_physical(&optimized);
let work = estimate_work(&optimized, None);
let te = plan_te(&phys_prog.plan, &work, 512 * 1024 * 1024)?; // 512MB memory cap

// Configure and run
let mut config = EngineConfig::default();
config.mem_cap_bytes = 512 * 1024 * 1024; // 512MB
config.spill_dir = "/tmp/emsqrt-spill".to_string();

let mut engine = Engine::new(config).expect("engine initialization");
let manifest = engine.run(&phys_prog, &te)?;

println!("Execution completed in {}ms", manifest.finished_ms - manifest.started_ms);
```

#### YAML DSL

The YAML DSL supports linear pipelines with the following operators:

```yaml
steps:
  - op: scan
    source: "data/logs.csv"
    schema:
      - { name: "ts", type: "Utf8", nullable: false }
      - { name: "uid", type: "Utf8", nullable: false }
      - { name: "amount", type: "Float64", nullable: true }

  - op: filter
    expr: "amount > 1000"

  - op: project
    columns: ["ts", "uid", "amount"]

  - op: sink
    destination: "results/filtered.csv"
    format: "csv"  # or "parquet" (requires --features parquet)
```

**Note**: Currently supports `scan`, `filter`, `project`, `map`, and `sink`. Aggregate and join operators are not yet supported in YAML (use the programmatic API for these operations).

Add an optional `config` block to describe spill targets without touching CLI flags:

```yaml
config:
  spill_uri: "s3://my-bucket/spill"
  spill_aws_region: "us-east-1"
  spill_gcs_service_account: "/path/to/service-account.json"

steps:
  - op: scan
    source: "data/logs.csv"
    schema: []

  - op: sink
    destination: "stdout"
    format: "csv"
```

Values from `config` merge with environment variables and command-line overrides.

**Parquet Support**: Scan and Sink operators support Parquet format when built with `--features parquet`. Files are automatically detected by extension (`.parquet`, `.parq`) or can be explicitly specified with `format: "parquet"`.

#### CLI Usage

The EM-√ CLI provides a convenient way to run pipelines from YAML files:

```bash
# Validate a pipeline YAML file
emsqrt validate --pipeline examples/simple_pipeline.yaml

# Show execution plan (EXPLAIN)
emsqrt explain --pipeline examples/simple_pipeline.yaml --memory-cap 536870912

# Execute a pipeline
emsqrt run --pipeline examples/simple_pipeline.yaml

# Override configuration via command-line flags
emsqrt run \
  --pipeline examples/simple_pipeline.yaml \
  --memory-cap 1073741824 \
  --spill-uri s3://my-bucket/spill \
  --spill-aws-region us-east-1 \
  --spill-aws-access-key-id AKIA... \
  --spill-aws-secret-access-key SECRET... \
  --spill-gcs-service-account /path/to/sa.json \
  --spill-azure-access-key azureKey \
  --spill-retry-max 5 \
  --spill-dir /tmp/emsqrt-spill \
  --max-parallel 4
```

See `examples/README.md` for more details on YAML pipeline syntax.

### Cloud Spill Authentication

When using S3/GCS/Azure spill URIs, provide credentials via CLI flags, environment variables, or the platform's SDK defaults:

- **S3**: export `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` or use `aws configure`; optionally include `--spill-aws-region`.

- **GCS**: set `GOOGLE_SERVICE_ACCOUNT`/`GOOGLE_SERVICE_ACCOUNT_PATH` or run `gcloud auth application-default login`.

- **Azure**: use `az login` and `AZURE_STORAGE_CONNECTION_STRING` or pass `--spill-azure-access-key`.

The `config` block in `examples/cloud_spill/pipeline.yaml` illustrates a spill URI plus retry tuning so you can avoid repeating CLI flags per run.

## Examples of Practical Use Cases

### 1. Serverless Data Pipelines

Process large datasets in AWS Lambda, Google Cloud Functions, or Azure Functions with strict memory limits:

```rust
// Process 100GB dataset in a 512MB Lambda
// Note: S3 spill support is planned; currently use local filesystem
let config = EngineConfig {
    mem_cap_bytes: 512 * 1024 * 1024, // 512MB
    spill_dir: "/tmp/lambda-spill".to_string(),
    ..Default::default()
};
```

**Value**: 10-100x cost reduction vs. large EC2 instances or EMR clusters.

### 2. Edge Data Processing

Aggregate sensor data on IoT gateways or embedded devices with limited RAM:

```rust
// Process 1M sensor readings on a Raspberry Pi with 256MB RAM
let config = EngineConfig {
    mem_cap_bytes: 128 * 1024 * 1024, // Use only 128MB
    spill_dir: "/tmp/sensor-spill".to_string(),
    ..Default::default()
};
```

**Value**: Enable edge analytics without hardware upgrades.

### 3. Multi-Tenant Data Platforms

Run customer queries with isolated memory budgets:

```rust
// Each customer gets a memory budget
// Note: S3 spill support is planned; currently use local filesystem
let config = EngineConfig {
    mem_cap_bytes: customer_memory_budget,
    spill_dir: format!("/tmp/platform-spill/customer-{}", customer_id),
    ..Default::default()
};
```

**Value**: Predictable performance, resource isolation, accurate cost attribution.

### 4. Cost-Optimized Analytics

Use smaller, cheaper instances by trading I/O for memory:

```rust
// Process 500GB dataset on a 4GB RAM instance instead of 64GB
let config = EngineConfig {
    mem_cap_bytes: 4 * 1024 * 1024 * 1024, // 4GB
    spill_dir: "/fast-nvme/spill".to_string(),
    ..Default::default()
};
```

**Value**: 10x cost reduction for memory-bound workloads.

## Architecture

EM-√ is built as a modular Rust workspace with the following crates:

```
emsqrt-core/      - Core types, schemas, DAGs, memory budget traits
emsqrt-te/        - Tree Evaluation planner (bounded fan-in decomposition)
emsqrt-mem/       - Memory budget implementation, spill manager, buffer pool
emsqrt-io/        - I/O adapters (CSV, JSONL, Parquet, storage backends)
emsqrt-operators/ - Query operators (filter, project, sort, join, aggregate)
emsqrt-planner/   - Logical/physical planning, optimization, YAML DSL
emsqrt-exec/      - Execution runtime, scheduler, engine
emsqrt-cli/       - Command-line interface for running pipelines
```

### Execution Flow

1. **Planning**: YAML/Logical plan → Optimized logical plan → Physical plan with operator bindings

2. **TE Scheduling**: Physical plan → Tree Evaluation blocks with bounded fan-in

3. **Execution**: Blocks executed in dependency order, respecting memory budget

4. **Spilling**: Operators automatically spill to disk when memory limits are hit

5. **Manifest**: Deterministic execution manifest with plan hashes for reproducibility
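Steps 2 and 3 can be pictured with a small scheduling sketch. It is only an illustration under stated assumptions: the function `schedule`, the `deps` representation, and the `max_fan_in` parameter are hypothetical and are not taken from the `emsqrt-te` or `emsqrt-exec` crates. It rejects any block whose fan-in exceeds the bound, then orders the remaining blocks so that each one runs after its inputs (Kahn's topological sort).

```rust
use std::collections::VecDeque;

/// `deps[i]` lists the blocks that block `i` reads from.
/// Returns an order in which every block comes after its inputs, or `None`
/// if a block exceeds `max_fan_in`, references an unknown block, or the
/// dependencies contain a cycle.
/// Hypothetical helper for illustration; not the EM-√ scheduler.
pub fn schedule(deps: &[Vec<usize>], max_fan_in: usize) -> Option<Vec<usize>> {
    let n = deps.len();
    if deps
        .iter()
        .any(|d| d.len() > max_fan_in || d.iter().any(|&i| i >= n))
    {
        return None;
    }

    // Number of inputs each block is still waiting for.
    let mut pending: Vec<usize> = deps.iter().map(Vec::len).collect();

    // Reverse edges: which blocks consume the output of block `i`.
    let mut consumers: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (block, inputs) in deps.iter().enumerate() {
        for &input in inputs {
            consumers[input].push(block);
        }
    }

    // Blocks with no inputs can run immediately.
    let mut ready: VecDeque<usize> = (0..n).filter(|&b| pending[b] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(block) = ready.pop_front() {
        order.push(block);
        for &c in &consumers[block] {
            pending[c] -= 1;
            if pending[c] == 0 {
                ready.push_back(c);
            }
        }
    }

    // If some block never became ready, the graph had a cycle.
    if order.len() == n {
        Some(order)
    } else {
        None
    }
}
```

For two scan blocks feeding a join block that feeds a sink, `schedule(&deps, 2)` yields the scans first, then the join, then the sink, while `schedule(&deps, 1)` returns `None` because the join has fan-in 2.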

### Memory Management

- **MemoryBudget**: RAII guards track all allocations

- **SpillManager**: Checksummed, compressed segments for external-memory operations

- **TE Frontier**: Bounded live blocks guarantee peak memory ≤ cap

## Configuration

### EngineConfig

```rust
// NOTE: the type parameters on the two `Option` fields are assumed
// (`usize` for the block size, `u64` for the seed).
pub struct EngineConfig {
    /// Hard memory cap (bytes). The engine must NEVER exceed this.
    pub mem_cap_bytes: usize,

    /// Optional block-size hint (TE planner may override)
    pub block_size_hint: Option<usize>,

    /// Max on-disk spill concurrency
    pub max_spill_concurrency: usize,

    /// Optional seed for deterministic shuffles
    pub seed: Option<u64>,

    /// Execution parallelism
    pub max_parallel_tasks: usize,

    /// Directory for spill files
    pub spill_dir: String,
}
```

### Environment Variables

```bash
export EMSQRT_MEM_CAP_BYTES=536870912  # 512MB
export EMSQRT_SPILL_DIR=/tmp/emsqrt-spill
export EMSQRT_MAX_PARALLEL_TASKS=4
export EMSQRT_SPILL_URI=s3://my-bucket/emsqrt
export EMSQRT_SPILL_AWS_REGION=us-east-1
export EMSQRT_SPILL_AWS_ACCESS_KEY_ID=AKIA...
export EMSQRT_SPILL_AWS_SECRET_ACCESS_KEY=SECRET...
export EMSQRT_SPILL_AWS_SESSION_TOKEN=optionalSession
export EMSQRT_SPILL_GCS_SA_PATH=/path/to/service-account.json
export EMSQRT_SPILL_AZURE_ACCESS_KEY=azureKey
export EMSQRT_SPILL_RETRY_MAX_RETRIES=5
export EMSQRT_SPILL_RETRY_INITIAL_MS=250
export EMSQRT_SPILL_RETRY_MAX_MS=5000
```
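The three `EMSQRT_SPILL_RETRY_*` values suggest a bounded backoff schedule. The sketch below assumes exponential doubling from the initial delay up to the maximum; the README does not state which backoff policy EM-√ uses, and `backoff_delay` and `backoff_schedule` are hypothetical names, not part of the engine.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): starts at `initial_ms`,
/// doubles on each attempt, and is capped at `max_ms`.
/// Exponential doubling is an assumption made for illustration.
pub fn backoff_delay(attempt: u32, initial_ms: u64, max_ms: u64) -> Duration {
    // 2^attempt, saturating to u64::MAX when the shift would overflow.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_millis(initial_ms.saturating_mul(factor).min(max_ms))
}

/// The full list of delays for `max_retries` retries.
pub fn backoff_schedule(max_retries: u32, initial_ms: u64, max_ms: u64) -> Vec<Duration> {
    (0..max_retries)
        .map(|attempt| backoff_delay(attempt, initial_ms, max_ms))
        .collect()
}
```

Under this assumption, the values shown above (5 retries, 250 ms initial, 5000 ms maximum) give delays of 250, 500, 1000, 2000, and 4000 ms; a sixth retry would be clamped to 5000 ms.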

### Default Configuration

```rust
EngineConfig::default()
// mem_cap_bytes: 512 MiB
// max_spill_concurrency: 4
// max_parallel_tasks: 4
// spill_dir: "/tmp/emsqrt-spill"
```

#### StorageConfig

```rust
// NOTE: `Option<String>` is assumed for the optional fields.
pub struct StorageConfig {
    pub uri: Option<String>,        // e.g. s3://bucket/prefix
    pub root: String,               // normalized spill root
    pub aws_region: Option<String>,
    pub aws_access_key_id: Option<String>,
    pub aws_secret_access_key: Option<String>,
    pub aws_session_token: Option<String>,
    pub gcs_service_account_path: Option<String>,
    pub azure_access_key: Option<String>,
    pub retry_max_retries: usize,
    pub retry_initial_backoff_ms: u64,
    pub retry_max_backoff_ms: u64,
}
```

`EngineConfig::storage_config()` produces this snapshot and `emsqrt-io` uses it to choose between filesystem and cloud adapters.

## Building & Testing

### Build

```bash
# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Build specific crate
cargo build -p emsqrt-exec
```

### Run Tests

```bash
# All tests (unit tests in crates)
cargo test --all --lib

# Specific test suite (in workspace root tests/ directory)
cargo test --test integration_tests
cargo test --test expression_tests
cargo test --test cost_estimation_tests

# Run comprehensive test suite (10 phases)
./scripts/run_all_tests.sh
```

### Test Coverage

The comprehensive test suite (`scripts/run_all_tests.sh`) includes 10 phases:

1. **Unit Tests**: SpillManager, RowBatch helpers, Memory budget

2. **Integration Tests**: Full pipeline tests (scan, filter, project, sort, aggregate, sink, join)

3. **E2E Tests**: End-to-end smoke tests

4. **Crate-Level Tests**: All library unit tests across crates

5. **Expression Engine Tests**: Expression parsing and evaluation

6. **Column Statistics Tests**: Statistics collection and cost estimation

7. **Error Handling Tests**: Error context and recovery

8. **Operator Tests**: Merge join, filter with expressions

9. **Feature-Specific Tests**: Parquet, Arrow (when features enabled)

10. **CLI Tests**: YAML parsing and validation

## Supported Operations

### Currently Implemented

- ✅ **Scan**: Read CSV and Parquet files with schema inference

- ✅ **Filter**: Predicate filtering (e.g., `age > 25`, `name == "Alice"`)

- ✅ **Project**: Column selection and renaming

- ✅ **Map**: Column renaming (e.g., `old_name AS new_name`)

- ✅ **Sort**: External sort with k-way merge

- ✅ **Aggregate**: Group-by with COUNT, SUM, AVG, MIN, MAX

- ✅ **Join**: Hash join (with Grace hash join for large datasets), merge join (sorted merge join for pre-sorted inputs)

- ✅ **Sink**: Write CSV and Parquet files

- ✅ **Expression Engine**: Full SQL-like expressions with operator precedence, cross-type arithmetic, and logical operations

- ✅ **Statistics**: Column statistics (min/max/distinct_count/null_count) for cost estimation and selectivity modeling

- ✅ **Parquet I/O**: Native columnar read/write with Arrow integration (requires `--features parquet`)

- ✅ **Arrow Integration**: Columnar processing with RecordBatch ↔ RowBatch conversion utilities

- ✅ **Grace Hash Join**: Partition-based hash join for very large datasets with automatic spilling
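As a rough illustration of the partition-based join listed above, the following sketch shows the two phases of a Grace hash join on in-memory vectors. It is not the EM-√ implementation: the `Row` type and the functions `partition_of`, `partition`, and `grace_hash_join` are hypothetical, and a real engine would write each partition to a spill segment and read partitions back one pair at a time.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// (join key, payload). Hypothetical row type for illustration.
type Row = (u64, String);

/// Map a key to one of `parts` partitions by hashing it.
fn partition_of(key: u64, parts: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % parts as u64) as usize
}

/// Phase 1: scatter rows into `parts` buckets. Rows with equal keys always
/// land in the same bucket, on both sides of the join.
fn partition(rows: &[Row], parts: usize) -> Vec<Vec<Row>> {
    let mut out: Vec<Vec<Row>> = vec![Vec::new(); parts];
    for row in rows {
        out[partition_of(row.0, parts)].push(row.clone());
    }
    out
}

/// Phase 2: join each pair of matching partitions with an in-memory hash
/// table. Only one partition pair has to be resident at a time.
pub fn grace_hash_join(left: &[Row], right: &[Row], parts: usize) -> Vec<(u64, String, String)> {
    assert!(parts > 0, "need at least one partition");
    let left_parts = partition(left, parts);
    let right_parts = partition(right, parts);

    let mut out = Vec::new();
    for (lp, rp) in left_parts.iter().zip(right_parts.iter()) {
        // Build side: index the left partition by key.
        let mut table: HashMap<u64, Vec<&String>> = HashMap::new();
        for (key, value) in lp {
            table.entry(*key).or_default().push(value);
        }
        // Probe side: look up each right row in the table.
        for (key, right_value) in rp {
            if let Some(left_values) = table.get(key) {
                for left_value in left_values {
                    out.push((*key, (*left_value).clone(), right_value.clone()));
                }
            }
        }
    }
    out
}
```

The output order depends on how keys hash into partitions, so callers that need a stable order should sort the result.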

### Planned Features

- 🔄 **Cloud Storage**: S3, GCS adapters for spill segments (currently filesystem only)

## How It Works

### Tree Evaluation (TE)

Tree Evaluation is a principled execution scheduling approach that:

1. **Decomposes plans into blocks** with bounded fan-in (e.g., each join block depends on at most K input blocks)

2. **Controls the live frontier** (the set of materialized blocks at any time)

3. **Guarantees peak memory** ≤ `K * block_size + overhead`
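To make the bound concrete, the sketch below solves `K * block_size + overhead ≤ cap` for the largest block size. The helper `max_block_size` and its parameter names are hypothetical and only illustrate the arithmetic; the actual TE planner may choose block sizes differently.

```rust
/// Largest block size (in bytes) satisfying `k * block_size + overhead <= cap`.
/// Returns `None` when `k` is zero or no positive block size fits.
/// Hypothetical helper for illustration; not part of the EM-√ API.
pub fn max_block_size(cap_bytes: usize, k: usize, overhead_bytes: usize) -> Option<usize> {
    if k == 0 {
        return None;
    }
    // Memory left for blocks once the fixed overhead is set aside.
    let usable = cap_bytes.checked_sub(overhead_bytes)?;
    let block = usable / k;
    if block == 0 {
        None
    } else {
        Some(block)
    }
}
```

For example, with a 512 MiB cap, a fan-in of K = 4, and an assumed 64 MiB of overhead, each block can be at most (512 − 64) / 4 = 112 MiB.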

### External-Memory Operators

When memory limits are hit, operators automatically:

1. **Spill to disk**: Write intermediate results to checksummed, compressed segments

2. **Partition**: Divide work into smaller chunks that fit in memory

3. **Merge**: Combine results from multiple partitions/runs

Example: External sort generates sorted runs, then performs a k-way merge over them.
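The two phases of an external sort can be sketched as follows. This is a simplified stand-in, not the EM-√ operator: runs are kept in memory as `Vec`s rather than written to spill segments, and the names `make_sorted_runs` and `k_way_merge` are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Phase 1: cut the input into chunks of at most `run_len` items and sort
/// each chunk on its own. A real engine would spill each run to disk.
pub fn make_sorted_runs(input: &[i64], run_len: usize) -> Vec<Vec<i64>> {
    assert!(run_len > 0, "run_len must be positive");
    input
        .chunks(run_len)
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// Phase 2: k-way merge. The min-heap holds one (value, run index, offset)
/// entry per run, so the merge state is O(k) regardless of input size.
pub fn k_way_merge(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run we just consumed from.
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}
```

Here `run_len` plays the role of the memory budget: it bounds how much data is sorted at once, while the merge only needs the head of each run.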

### Memory Budget Enforcement

Every allocation requires a `BudgetGuard`:

```rust
let guard = budget.try_acquire(bytes, "my-buffer")?;
// Allocate memory...
// Guard automatically releases bytes on drop (RAII)
```

If `try_acquire` returns `None`, the operator must spill or partition.
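The guard pattern can be illustrated with a minimal, self-contained sketch. The `Budget` and `Guard` types below are hypothetical stand-ins written for this example; they are not the `MemoryBudget` or `BudgetGuard` types from `emsqrt-mem`, whose exact signatures may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical budget: a fixed cap plus a shared counter of bytes in use.
#[derive(Clone)]
pub struct Budget {
    cap: usize,
    used: Arc<AtomicUsize>,
}

/// RAII guard: holds `bytes` of the budget until it is dropped.
pub struct Guard {
    bytes: usize,
    used: Arc<AtomicUsize>,
}

impl Budget {
    pub fn new(cap: usize) -> Self {
        Budget { cap, used: Arc::new(AtomicUsize::new(0)) }
    }

    /// Reserve `bytes`, or return `None` if that would exceed the cap.
    /// `_tag` mirrors the label argument in the snippet above; unused here.
    pub fn try_acquire(&self, bytes: usize, _tag: &str) -> Option<Guard> {
        let cap = self.cap;
        // Atomically bump the counter only if the new total stays within cap.
        self.used
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
                cur.checked_add(bytes).filter(|&total| total <= cap)
            })
            .ok()?;
        Some(Guard { bytes, used: Arc::clone(&self.used) })
    }

    /// Bytes currently reserved by live guards.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::Acquire)
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Return the reservation to the budget.
        self.used.fetch_sub(self.bytes, Ordering::AcqRel);
    }
}
```

With a cap of 100 bytes, acquiring 60 succeeds, a further request for 50 returns `None` (the caller would spill or partition at that point), and dropping the first guard brings `used()` back to 0.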

## Performance

### Benchmarks (Planned)

- Sort 10GB with 512MB memory

- Join 1GB × 1GB with 50MB memory

- Aggregate 1M groups with 20MB memory

- TPC-H queries (Q1, Q3, Q6)

See `docs/benchmarks.md` for the current Criterion harness and the `scripts/benchmarks/run_benchmarks.sh` helper.

### Expected Characteristics

- **Throughput**: 10-100x slower than in-memory systems (by design)

- **Memory**: **Guaranteed** to never exceed cap (unlike other systems)

- **Scalability**: Can process datasets 100-1000x larger than available RAM

## Development

### Project Structure

```
emsqrt/
├── Cargo.toml              # Workspace configuration
├── crates/
│   ├── emsqrt-core/        # Core types and traits
│   ├── emsqrt-te/          # Tree Evaluation planner
│   ├── emsqrt-mem/         # Memory budget and spill manager
│   ├── emsqrt-io/          # I/O adapters
│   ├── emsqrt-operators/   # Query operators
│   ├── emsqrt-planner/     # Planning and optimization
│   ├── emsqrt-exec/        # Execution runtime
│   └── emsqrt-cli/         # Command-line interface
├── tests/                  # Integration and unit tests (workspace root)
├── scripts/                # Utility scripts (run_all_tests.sh)
├── examples/               # YAML pipeline examples
└── README.md               # This file
```

### Adding a New Operator

1. Implement the `Operator` trait in `emsqrt-operators/src/`

2. Register in `emsqrt-operators/src/registry.rs`

3. Add to planner lowering in `emsqrt-planner/src/lower.rs`

4. Add tests in `tests/`

### Code Style

- Follow Rust standard formatting: `cargo fmt`

- All code must compile with `#![forbid(unsafe_code)]`

- Use `thiserror` for error types

- Use `serde` for serialization

## Contributing

Contributions are welcome! Areas of particular interest:

- Cloud storage adapters (S3, GCS, Azure) - placeholders exist, need implementation

- Additional operators (window functions, lateral joins)

- YAML DSL support for aggregate and join operators

- Performance optimizations (SIMD in Arrow operations, parallel processing)

- Documentation improvements

## Acknowledgments

This project implements Tree Evaluation (TE) scheduling for external-memory query processing, enabling predictable memory usage in constrained environments.

---

**Take-home**: EM-√ trades throughput for guaranteed memory bounds. Use it when memory constraints are more important than raw speed.

This repository is a work in progress and will continue to evolve over time.