https://github.com/turbolytics/sql-flow
DuckDB for streaming data
- Host: GitHub
- URL: https://github.com/turbolytics/sql-flow
- Owner: turbolytics
- License: mit
- Created: 2023-11-21T22:18:36.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-08T20:26:16.000Z (8 months ago)
- Last Synced: 2025-03-08T21:34:38.879Z (8 months ago)
- Language: Python
- Homepage: https://sql-flow.com
- Size: 864 KB
- Stars: 338
- Watchers: 7
- Forks: 9
- Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - SQLFlow - Enables SQL-based stream processing, powered by DuckDB. (Integrations / Web Clients (WebAssembly))
README
# SQLFlow: DuckDB for Streaming Data.
[Quickstart](https://sql-flow.com/docs/tutorials/intro) | [Tutorials](https://sql-flow.com/docs/category/tutorials) | [Documentation](https://sql-flow.com)
SQLFlow is a high-performance stream processing engine that simplifies building data pipelines by enabling you to define them using just SQL. Think of SQLFlow as a lightweight, modern Flink.
Key Features:
- Process data from [Kafka](https://kafka.apache.org/), WebSockets, and more.
- Write outputs to [PostgreSQL](https://www.postgresql.org/), Kafka topics, or cloud storage (such as S3), in a variety of formats, including Parquet and Iceberg.
- Built on [DuckDB](https://duckdb.org/) and [Apache Arrow](https://arrow.apache.org/) for high-speed processing.
# Quick Start (Getting Started in 5 Minutes)
1. Pull the SQLFlow Docker image:
```
docker pull turbolytics/sql-flow:latest
```
2. Validate config against test data:
```
docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json
['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
```
3. Start Kafka locally using Docker:
```
docker-compose -f dev/kafka-single.yml up -d
```
4. Publish test messages to Kafka:
```
python3 cmd/publish-test-data.py --num-messages=10000 --topic="input-simple-agg-mem"
```
5. Start a Kafka consumer inside the docker-compose container to verify the SQLFlow output:
```
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-simple-agg-mem
```
6. Start SQLFlow in Docker:
```
docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow -e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 turbolytics/sql-flow:latest run /tmp/conf/config/examples/basic.agg.mem.yml --max-msgs-to-process=10000
```
7. Verify the output in the Kafka consumer:
```
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
```
You just ran SQLFlow against a stream of Kafka data!
# How SQLFlow Works
SQLFlow is a stream processing engine written in Python. SQLFlow embeds DuckDB and Apache Arrow for high performance. SQLFlow consists of a few core components:

## Core Components
**Input Source**
SQLFlow ingests data from a variety of input sources, including Kafka and WebSockets. SQLFlow models the input as a stream of data.
**Handler**
SQLFlow uses DuckDB and Apache Arrow to execute SQL against the input source. Handlers contain the stream processing logic: they filter, aggregate, enrich, or drop data.
**Output Sink**
SQLFlow writes the results of the SQL to output sinks, including Kafka, Postgres, the local filesystem, and blob storage.
Each SQLFlow pipeline is defined in a single configuration file. The file contains a `pipeline` configuration with `source`, `handler`, and `sink` sections. The configuration file can also contain commands that are executed before the pipeline runs; these commands support things like attaching databases to the pipeline execution context.
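As a minimal sketch only (the section names come from the description above, but every concrete field name, value, and the SQL itself are illustrative assumptions rather than SQLFlow's actual schema; refer to the examples under `dev/config/examples` for real configurations), a pipeline definition might be shaped like this:
```yaml
# Minimal illustrative sketch -- field names and values are assumptions,
# not SQLFlow's actual configuration schema.
commands:
  - "ATTACH 'enrichment.db' AS enrich;"   # assumed shape: SQL run before the pipeline starts

pipeline:
  source:
    type: kafka                           # assumed: consume JSON events from a Kafka topic
    topic: input-city-events
  handler:
    # The stream processing logic is plain DuckDB SQL.
    sql: |
      SELECT city, count(*) AS city_count
      FROM batch -- 'batch' is a placeholder name for the incoming batch of events
      GROUP BY city
  sink:
    type: kafka                           # assumed: publish aggregated rows to an output topic
    topic: output-city-events
```
Because the handler is ordinary DuckDB SQL, filtering, aggregation, and enrichment all reduce to writing a query.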
## SQLFlow Use-Cases
- **Streaming Data Transformations**: Clean data and normalize types, then publish the transformed stream ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/basic.agg.mem.yml)).
- **Stream Enrichment**: Add data to an input stream and publish the enriched stream ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/enrich.yml)).
- **Data Aggregation**: Aggregate input data batches to decrease data volume ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/basic.agg.mem.yml)).
- **Tumbling Window Aggregation**: Bucket data into arbitrary time windows (such as "hour" or "10 minutes") ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/tumbling.window.yml)); see the sketch after this list.
- **Run SQL against the Bluesky Firehose**: Execute SQL against any WebSocket source, such as the [Bluesky firehose](https://docs.bsky.app/docs/advanced-guides/firehose) ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/bluesky/bluesky.kafka.raw.yml)).
- **Stream Data to Iceberg**: Stream writes to an Iceberg catalog.
- **Enrich Streams with Postgres Data**: Query Postgres during stream processing to enrich stream data.
- **Sink Kafka to Postgres**: Insert stream processing outputs into Postgres.
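For the tumbling-window case, here is a hedged sketch of what the handler SQL could look like, using DuckDB's `time_bucket` function; the column names and the `batch` relation are assumptions, and the linked `tumbling.window.yml` is the authoritative example:
```yaml
# Illustrative handler fragment for a 10-minute tumbling window.
# Column names (event_time, city) and 'batch' are placeholder assumptions.
handler:
  sql: |
    SELECT
      time_bucket(INTERVAL '10 minutes', CAST(event_time AS TIMESTAMP)) AS bucket,
      city,
      count(*) AS city_count
    FROM batch
    GROUP BY bucket, city
```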
## SQLFlow Features & Roadmap
- Sources
  - [x] Kafka Consumer using consumer groups
  - [x] WebSocket input (for consuming the Bluesky firehose)
  - [ ] HTTP (for webhooks)
- Sinks
  - [x] Kafka Producer
  - [x] Stdout
  - [x] Local Disk
  - [x] Postgres
  - [x] S3
  - [x] Any output DuckDB supports!
- Serialization
  - [x] JSON Input
  - [x] JSON Output
  - [x] Parquet Output
  - [x] Iceberg Output (using pyiceberg)
- Handlers
  - [x] Memory Persistence
  - [x] Pipeline-scoped SQL, such as defining views or attaching to databases
  - [x] User Defined Functions (UDF)
  - [x] Dynamic Schema Inference
  - [ ] Disk Persistence
  - [ ] Static Schema Definition
- Table Managers
  - [x] Tumbling Window Aggregations
  - [ ] Buffered Table
- Operations
  - [x] Observability Metrics (Prometheus)
## Examples
Additional examples are available in the wiki: [Tutorials](https://github.com/turbolytics/sql-flow/wiki/Tutorials)
### Consume Bluesky Firehose
SQLFlow supports running DuckDB SQL over WebSockets. Running SQL against the [Bluesky firehose](https://docs.bsky.app/docs/advanced-guides/firehose) takes nothing more than a small configuration file.
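The actual configuration files are linked at the end of this section; purely as a hedged sketch (the type names, field names, and URI below are placeholders, not SQLFlow's actual schema), a WebSocket-backed pipeline might be shaped roughly like this:
```yaml
# Illustrative placeholder only -- not the actual bluesky.raw.stdout.yml.
pipeline:
  source:
    type: websocket                               # assumed type name
    uri: wss://firehose.example.invalid/subscribe # placeholder; the real URI is in the linked configs
  handler:
    sql: SELECT * FROM batch                      # 'batch' is a placeholder for the input relation
  sink:
    type: stdout                                  # assumed: print each post to standard output
```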

The following command starts a Bluesky consumer and prints every post to stdout:
```
docker run -v $(pwd)/dev/config/examples:/examples turbolytics/sql-flow:latest run /examples/bluesky/bluesky.raw.stdout.yml
```

[Check out the configuration files here](https://github.com/turbolytics/sql-flow/tree/main/dev/config/examples/bluesky).
### Stream Kafka to Iceberg
SQLFlow supports writing to Iceberg tables using [pyiceberg](https://py.iceberg.apache.org/).
The following walkthrough writes to an Iceberg table using a local SQLite catalog:
- Initialize the SQLite Iceberg catalog and test tables:
```
python3 cmd/setup-iceberg-local.py setup
created default.city_events
created default.bluesky_post_events
Catalog setup complete.
```
- Start Kafka Locally
```
docker-compose -f dev/kafka-single.yml up -d
```
- Publish Test Messages to Kafka
```
python3 cmd/publish-test-data.py --num-messages=5000 --topic="input-kafka-mem-iceberg"
```
- Run SQLFlow, which will read from Kafka and write to the Iceberg table locally:
```
docker run \
-e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 \
-e PYICEBERG_HOME=/tmp/iceberg/ \
-v $(pwd)/dev/config/iceberg/.pyiceberg.yaml:/tmp/iceberg/.pyiceberg.yaml \
-v /tmp/sqlflow/warehouse:/tmp/sqlflow/warehouse \
-v $(pwd)/dev/config/examples:/examples \
turbolytics/sql-flow:latest run /examples/kafka.mem.iceberg.yml --max-msgs-to-process=5000
```
- Verify the Iceberg data was written by querying it with DuckDB:
```
% duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from '/tmp/sqlflow/warehouse/default.db/city_events/data/*.parquet';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 5000 │
└──────────────┘
```
## Running SQLFlow
Coming soon! Until then, check out:
- [Tutorials](https://github.com/turbolytics/sql-flow/wiki/Tutorials)
If you need any support, please open an issue or contact us directly (`danny` [AT] `turbolytics.io`)!
## Development
- Install Python dependencies:
```
pip install -r requirements.txt
pip install -r requirements.dev.txt
# macOS (Homebrew): point the confluent-kafka build at the local librdkafka install
C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka
```
- Run tests
```
make test-unit
```
## [Benchmarks](https://github.com/turbolytics/sql-flow/wiki/Benchmarks)
The following table shows the performance of different test scenarios:
| Name                      | Throughput        | Max RSS Memory | Peak Memory Usage |
|---------------------------|-------------------|----------------|-------------------|
| Simple Aggregation Memory | 45,000 msgs / sec | 230 MiB        | 130 MiB           |
| Simple Aggregation Disk   | 36,000 msgs / sec | 256 MiB        | 102 MiB           |
| Enrichment                | 13,000 msgs / sec | 368 MiB        | 124 MiB           |
| CSV Disk Join             | 11,500 msgs / sec | 312 MiB        | 152 MiB           |
| CSV Memory Join           | 33,200 msgs / sec | 300 MiB        | 107 MiB           |
| In Memory Tumbling Window | 44,000 msgs / sec | 198 MiB        | 96 MiB            |
[More information about benchmarks is available in the wiki](https://github.com/turbolytics/sql-flow/wiki/Benchmarks).
# Contact Us
Like SQLFlow? Use SQLFlow? Feature Requests? Please let us know! danny@turbolytics.io