Arrow Flight SQL Server
- Host: GitHub
- URL: https://github.com/TFMV/porter
- Owner: TFMV
- License: MIT
- Created: 2024-05-15T15:36:46.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-19T07:14:29.000Z (5 months ago)
- Last Synced: 2025-06-19T07:28:36.711Z (5 months ago)
- Language: Go
- Size: 20.2 MB
- Stars: 57
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - Hatch - A Go-based implementation of a DuckDB Arrow Flight SQL Server. (Client-Server Setups / Web Clients (WebAssembly))
README
# Porter
*Zero‑copy analytics, delivered at Mach Arrow.*
[Go Report Card](https://goreportcard.com/report/github.com/TFMV/porter)
[License](LICENSE)
[CI](https://github.com/TFMV/porter/actions/workflows/ci.yml)
> **Status: Experimental**
> Porter is under **active development**. It is stable enough for development and experimental use, but not yet intended for production.
---
> **Seeking Testers & Contributors**
> Porter is actively evolving, and I'm looking for developers willing to spend a few hours testing whatever use cases you have in mind.
> If you hit a bug, open an issue or send a PR; feel free to assign it to me.
> Your feedback will shape the future of fast, simple, open SQL over Arrow Flight.
>
> **Ready to contribute?** Check out our [Contributing Guidelines](CONTRIBUTING.md) for community standards, development setup, and technical details.
---
## What is Porter?
Porter is a **high-performance Flight SQL server** that makes Arrow-native analytics effortless. It provides a flexible abstraction layer for different database backends, supporting both **DuckDB** and **ClickHouse** with a unified Flight SQL interface. It bridges the gap between analytical databases and modern data infrastructure by speaking the Flight SQL protocol natively.
Think of it as **analytics with wings** – all the analytical power you love, now network-accessible with zero-copy Arrow streaming.
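For example, once a server is running (see Quick Start below), any Flight SQL client can query it. A minimal sketch using the ADBC Flight SQL driver against the default address:

```python
import adbc_driver_flightsql.dbapi

# Assumes a Porter server listening on the default address (see Quick Start)
conn = adbc_driver_flightsql.dbapi.connect("grpc://localhost:32010")
cursor = conn.cursor()
cursor.execute("SELECT 42 AS answer")
print(cursor.fetchall())  # [(42,)]
```

More client examples (PyArrow, Polars, JDBC) are in [Connecting to Porter](#connecting-to-porter).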
## Key Features
### Blazing Fast
- **Zero-copy Arrow streaming** for maximum throughput
- **Connection pooling** and resource optimization
- **In-memory caching** with LRU eviction
- **Concurrent query execution** with transaction support
### Developer Friendly
- **Single binary** deployment
- **Configuration via YAML** or environment variables
- **Docker support** with optimized images
- **Comprehensive CLI** with intuitive commands
- **Extensive middleware** support for customization
### Standards Compliant
- **Apache Arrow Flight SQL** protocol
- **JDBC/ODBC compatibility** through Flight SQL drivers
- **SQL standard compliance** via database backends
- **Cross-platform** support (Linux, macOS, Windows)
### Built-in Benchmarking
- **TPC-H benchmark suite** with DuckDB's native TPC-H module
- **Multiple output formats** (table, JSON, Arrow)
- **Configurable scale factors** and iterations
- **Query plan analysis** and performance metrics
- **Zero setup** – no external data files required
## Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Flight SQL    │    │      Porter      │    │    Database     │
│    Clients      │◄──►│   Flight SQL     │◄──►│    Backends     │
│                 │    │      Server      │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         │                      │                       │
   ┌─────▼──────┐        ┌─────▼──────┐         ┌─────▼──────┐
   │   Arrow    │        │  Metrics   │         │   Schema   │
   │ Streaming  │        │(Prometheus)│         │   Cache    │
   └────────────┘        └────────────┘         └────────────┘
```
**Core Components:**
- **Flight SQL Handler** – Protocol implementation and query routing
- **Database Backends** – Flexible abstraction supporting DuckDB and ClickHouse
- **Connection Pool** – Efficient database connection management with backend-specific optimizations
- **Cache Layer** – Smart caching for schemas and query results
- **Metrics Collection** – Comprehensive observability and monitoring
- **Middleware Stack** – Authentication, logging, and custom extensions
## Quick Start
### Prerequisites
- Go 1.24+ (for building from source)
- DuckDB (bundled with Go bindings)
- ClickHouse server (optional, for ClickHouse backend)
### Installation
#### Option 1: Build from Source
```bash
git clone https://github.com/TFMV/porter.git
cd porter
make build
```
#### Option 2: Using Go Install
```bash
go install github.com/TFMV/porter/cmd/server@latest
```
### Running Porter
#### Basic Usage
```bash
# Start with in-memory DuckDB database
./porter serve
# Start with persistent DuckDB database
./porter serve --database ./my-data.db --address 0.0.0.0:32010
# Start with ClickHouse backend
./porter serve --database "clickhouse://localhost:9000/default"
# Start with configuration file
./porter serve --config ./config.yaml
```
#### Using Docker
```bash
# Build the image
make docker-build
# Run the container
make docker-run
# Or run directly
docker run -p 32010:32010 -p 9090:9090 porter:latest
```
### Configuration
Create a `config.yaml` file:
```yaml
server:
  address: "0.0.0.0:32010"
  max_connections: 100
  connection_timeout: "30s"
  query_timeout: "5m"
  max_message_size: 16777216 # 16MB

database:
  # DuckDB backend (default)
  dsn: "duckdb://./data.db"
  # ClickHouse backend
  # dsn: "clickhouse://localhost:9000/default?username=default&password="
  max_open_conns: 32
  max_idle_conns: 8
  conn_max_lifetime: "1h"
  health_check_period: "30s"

logging:
  level: "info"
  format: "json"

metrics:
  enabled: true
  address: ":9090"

tls:
  enabled: false
  cert_file: ""
  key_file: ""

auth:
  enabled: false
  type: "oauth2"
```
## Supported Database Backends
Porter provides a unified Flight SQL interface across multiple database backends:
### DuckDB Backend
- **Connection String**: `duckdb://path/to/database.db` or `duckdb://:memory:`
- **Features**: Full analytical SQL, columnar storage, embedded deployment
- **Use Cases**: Development, analytics workloads, embedded applications
- **Performance**: Excellent for OLAP queries, TPC-H benchmarks
### MotherDuck Backend
- **Connection String**: `duckdb://motherduck/`
- **Token Source**: `--token` flag or `MOTHERDUCK_TOKEN` environment variable
- **Features**: Hosted DuckDB with cloud execution
### ClickHouse Backend
- **Connection String**: `clickhouse://host:port/database?username=user&password=pass`
- **Features**: Distributed analytics, real-time ingestion, horizontal scaling
- **Use Cases**: Large-scale analytics, time-series data, real-time dashboards
- **Performance**: Optimized for high-throughput analytical workloads
### Backend Selection
The backend is automatically detected from the DSN connection string:
```yaml
database:
  # DuckDB (embedded)
  dsn: "duckdb://./analytics.db"

  # MotherDuck (cloud)
  # dsn: "duckdb://motherduck/my_db"

  # ClickHouse (distributed)
  # dsn: "clickhouse://localhost:9000/analytics?username=default"
```
All backends support:
- ✅ **Full Flight SQL protocol** compliance
- ✅ **Arrow-native streaming** for zero-copy data transfer
- ✅ **Connection pooling** and resource management
- ✅ **Transaction support** and prepared statements (see the sketch after this list)
- ✅ **Schema discovery** and metadata operations
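As a concrete illustration of the transaction and prepared-statement support, both flow through the standard DBAPI surface when connecting via ADBC. A minimal sketch (table and column names are illustrative):

```python
import adbc_driver_flightsql.dbapi

conn = adbc_driver_flightsql.dbapi.connect("grpc://localhost:32010")
cursor = conn.cursor()

# Parameterized execution is sent as a Flight SQL prepared statement
cursor.execute("SELECT * FROM my_table WHERE id = ?", (42,))
print(cursor.fetchall())

# DBAPI commit/rollback map onto Flight SQL transaction support
cursor.execute("INSERT INTO my_table VALUES (?, 'hello')", (43,))
conn.commit()
```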
## Connecting to Porter
### Python (using ADBC)
```python
import adbc_driver_flightsql.dbapi
# Connect to Porter
conn = adbc_driver_flightsql.dbapi.connect(
    "grpc://localhost:32010"
)
# Execute queries
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")
results = cursor.fetchall()
```
### Python (using PyArrow)
```python
from pyarrow import flight
# Connect to Porter Flight SQL server
client = flight.FlightClient("grpc://localhost:32010")
# Execute query and get results as Arrow table
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM my_table")
)
table = client.do_get(info.endpoints[0].ticket).read_all()
```
### Python (using Polars)
```python
import adbc_driver_flightsql.dbapi
import polars as pl

# Connect and run a query
conn = adbc_driver_flightsql.dbapi.connect("grpc://localhost:32010")
cursor = conn.cursor()
cursor.execute("SELECT 1 AS id, 'example' AS label")

# fetch_record_batch() returns a pyarrow.RecordBatchReader;
# read it fully, then convert to Polars
arrow_table = cursor.fetch_record_batch().read_all()
df = pl.from_arrow(arrow_table)
print(df)

cursor.close()
conn.close()
```
### Java (using Arrow Flight SQL JDBC)
```java
String url = "jdbc:arrow-flight-sql://localhost:32010";
Connection conn = DriverManager.getConnection(url);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM my_table");
while (rs.next()) {
    System.out.println(rs.getString(1));
}
```
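The Java example assumes the Arrow Flight SQL JDBC driver (published on Maven Central as `org.apache.arrow:flight-sql-jdbc-driver`) is on the classpath; it registers itself for `jdbc:arrow-flight-sql://` URLs.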
## Benchmarking
Porter includes a comprehensive TPC-H benchmarking suite that uses DuckDB's built-in TPC-H module for reproducible performance testing.
### Quick Benchmark
```bash
# Run a single query with small dataset
./porter bench --query q1 --scale 0.01
# Run multiple queries with medium dataset
./porter bench --query q1,q3,q5 --scale 0.1 --iterations 3
# Run all queries with full dataset
./porter bench --all --scale 1 --format json --output results.json
```
### Example Output
```
TPC-H Benchmark Results
=======================

Configuration:
  Scale Factor: 0.01
  Iterations:   1
  Total Time:   120.5ms

Environment:
  DuckDB Version: v1.3.1
  Platform:       darwin/arm64

Results:
  Query   Iter   Time      Rows   Status
  -----   ----   ----      ----   ------
  q1      1      2.952ms   4      OK
  q6      1      0.562ms   1      OK
  q10     1      3.665ms   20     OK
```
### Benchmark Analysis
Use the included Python script to analyze results:
```bash
# Analyze single benchmark
python3 examples/benchmark_analysis.py results.json
# Compare two benchmark runs
python3 examples/benchmark_analysis.py --compare baseline.json current.json
```
For detailed benchmarking documentation, see [docs/benchmarking.md](docs/benchmarking.md).
### Microbenchmarks
Porter also ships a small suite of microbenchmarks under `bench/` for
quick performance checks. First generate a dataset:
```bash
go run ./bench/gen_data.go --rows=100000 --cols=5 --format=parquet --out=test.parquet
```
Run the benchmarks with:
```bash
make bench
```
Results are printed in standard Go benchmark format and can be formatted as
CSV or Markdown using the helpers in `bench/report.go`.
## Development
### Building
```bash
# Standard build
make build
# Optimized build (smaller binary)
make optimize
# Show binary size and dependencies
make size
```
### Testing
```bash
# Run all tests
make test
# Run with verbose output
go test -v ./...
# Run specific package tests
go test -v ./pkg/handlers
```
### Development Server
```bash
# Run with hot reload (requires air)
air
# Or run directly
make run
```
## Monitoring & Observability
### Metrics
Porter exposes Prometheus metrics on `:9090` by default:
- **Query Performance**: Execution time, throughput, error rates
- **Connection Pool**: Active connections, wait times, health status
- **Cache Performance**: Hit/miss ratios, eviction rates
- **Resource Usage**: Memory allocation, GC statistics
- **Flight SQL Operations**: RPC call counts, streaming metrics
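To eyeball these series without standing up a Prometheus server, here is a minimal stdlib sketch against the default `:9090` endpoint (metric names vary, so it simply prints every sample line):

```python
import urllib.request

# Fetch the raw Prometheus exposition from Porter's metrics endpoint
with urllib.request.urlopen("http://localhost:9090/metrics") as resp:
    body = resp.read().decode("utf-8")

# Print each metric sample, skipping # HELP / # TYPE comment lines
for line in body.splitlines():
    if line and not line.startswith("#"):
        print(line)
```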
### Health Checks
```bash
# Check server health
curl http://localhost:9090/health
# Prometheus metrics
curl http://localhost:9090/metrics
```
### Logging
Structured logging with configurable levels:
```bash
# Set log level
export PORTER_LOG_LEVEL=debug
# JSON format (default)
export PORTER_LOG_FORMAT=json
# Human-readable format
export PORTER_LOG_FORMAT=console
```
## Advanced Configuration
### Environment Variables
All configuration options can be set via environment variables with the `PORTER_` prefix:
```bash
export PORTER_ADDRESS="0.0.0.0:32010"
export PORTER_DATABASE=":memory:"
export PORTER_LOG_LEVEL="debug"
export PORTER_METRICS_ENABLED="true"
export PORTER_MAX_CONNECTIONS="100"
export MOTHERDUCK_TOKEN=your_token
```
### Command Line Options
```bash
./porter serve --help
# Key options:
--address string           Server listen address (default "0.0.0.0:32010")
--database string          Database path or DSN (default ":memory:")
--config string            Configuration file path
--log-level string         Log level: debug, info, warn, error (default "info")
--max-connections int      Maximum concurrent connections (default 100)
--metrics                  Enable Prometheus metrics (default true)
--metrics-address string   Metrics server address (default ":9090")
--tls                      Enable TLS encryption
--auth                     Enable authentication
```
## Contributing
We welcome contributions! Porter is designed to be extensible and community-driven.
### Areas for Contribution
- **Performance optimizations** and benchmarking
- **Additional authentication providers** (JWT, LDAP, etc.)
- **Enhanced monitoring** and alerting capabilities
- **Client libraries** and integrations
- **Documentation** and examples
- **Testing** and quality assurance
### Development Setup
```bash
# Clone the repository
git clone https://github.com/TFMV/porter.git
cd porter
# Install dependencies
go mod download
# Run tests
make test
# Build and run
make build
./porter serve
```
## Documentation
- **[Flight SQL Protocol](docs/flightsql-manifest.md)** - Protocol specification and implementation details
- **[Architecture Guide](docs/README.md)** - Deep dive into Porter's architecture
- **[API Reference](pkg/)** - Go package documentation
- **[Configuration Reference](config/)** - Complete configuration options
## Porter and the Future of Data Migration
With Porter's current capabilities:
- **Multi-backend support** (DuckDB and ClickHouse, with more planned)
- **Unified Flight SQL interface** across all backends
- **Arrow-native streaming** between different database systems
And as Porter gains:
- **Distributed query execution** (via DoExchange)
- **Additional backend support** (PostgreSQL, BigQuery, etc.)
- **Enhanced Flight SQL interoperability**
…it becomes not just a query engine, but a conduit — a streamlined pathway for moving structured data between systems without serialization overhead.
### Why This Matters
Most data migration today is:
- **Batch-oriented**
- **File-based** (CSV, Parquet)
- **Bloated** with ETL pipelines and format juggling
With Porter, you get:
- **Arrow in, Arrow out**: A columnar memory format travels intact
- **Query-based filtering**: Only move what matters
- **Streaming transport**: No intermediate staging or local disk
- **Backend-agnostic execution**: Read from DuckDB, write to ClickHouse, or vice versa
### Example Use Case
```bash
# Query DuckDB and stream results to ClickHouse (conceptual example)
# Currently: Use Porter as a unified Flight SQL gateway
porter serve --database "duckdb://source.db" &
porter serve --database "clickhouse://localhost:9000/target" --address :32011 &
# Future: Direct data migration pipeline
porter migrate --from "duckdb://source.db" --to "clickhouse://localhost:9000/target" \
  --query "SELECT * FROM users WHERE active = true" --target-table users_active
```
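Until a `migrate` command lands, the same movement can be scripted client-side against the two gateways above. A hedged Python sketch using ADBC (assumes both servers are running as shown; whether the target backend accepts Flight SQL bulk ingestion is an assumption, and the table names are illustrative):

```python
import adbc_driver_flightsql.dbapi

# Source: Porter in front of DuckDB (default port 32010)
src = adbc_driver_flightsql.dbapi.connect("grpc://localhost:32010")
src_cur = src.cursor()
src_cur.execute("SELECT * FROM users WHERE active = true")
table = src_cur.fetch_arrow_table()  # Arrow stays columnar end to end

# Target: Porter in front of ClickHouse (started on :32011 above)
dst = adbc_driver_flightsql.dbapi.connect("grpc://localhost:32011")
dst_cur = dst.cursor()
dst_cur.adbc_ingest("users_active", table, mode="create")  # bulk ingest
dst.commit()
```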
With distributed query + DoExchange, this grows into:
- **Live sync jobs** between heterogeneous databases
- **Selective replication** across cloud and edge
- **In-flight transformations** using SQL only
---
## Known Issues & Limitations
- **Experimental status**: APIs may change without notice
- **Limited authentication**: Only basic auth and OAuth2 currently supported
- **Single-node only**: No distributed query execution yet
- **Memory constraints**: Large result sets may cause memory pressure (a streaming workaround is sketched below)
- **Platform support**: Some features may be limited on Windows
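On the memory point, clients can avoid materializing large results by streaming them batch by batch; ADBC exposes results as a `pyarrow.RecordBatchReader`. A minimal sketch (table name is illustrative):

```python
import adbc_driver_flightsql.dbapi

conn = adbc_driver_flightsql.dbapi.connect("grpc://localhost:32010")
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table")

# Iterating the reader yields one RecordBatch at a time,
# keeping peak memory bounded by the batch size
for batch in cursor.fetch_record_batch():
    print(batch.num_rows)
```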
## License
Released under the MIT License. See [LICENSE](LICENSE) for details.
---