# SQLFlow: DuckDB for Streaming Data.

[Quickstart](https://sql-flow.com/docs/tutorials/intro) | [Tutorials](https://sql-flow.com/docs/category/tutorials) | ![Docker Pulls](https://img.shields.io/docker/pulls/turbolytics/sql-flow) | [Documentation](https://sql-flow.com)

SQLFlow is a high-performance stream processing engine that simplifies building data pipelines by enabling you to define them using just SQL. Think of SQLFlow as a lightweight, modern Flink.

Key Features:
- Process data from [Kafka](https://kafka.apache.org/), WebSockets, and more.
- Write outputs to [PostgreSQL](https://www.postgresql.org/), Kafka topics, or cloud storage (such as S3), in a variety of formats, including Parquet and Iceberg.
- Built on [DuckDB](https://duckdb.org/) and [Apache Arrow](https://arrow.apache.org/) for high-speed processing.

# Quick Start (Getting Started in 5 Minutes)

1. Pull the SQLFlow Docker image:
```
docker pull turbolytics/sql-flow:latest
```

2. Validate config against test data:
```
docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
```

3. Start Kafka locally using Docker Compose:
```
docker-compose -f dev/kafka-single.yml up -d
```

4. Publish test messages to Kafka:
```
python3 cmd/publish-test-data.py --num-messages=10000 --topic="input-simple-agg-mem"
```

5. Start a Kafka consumer inside the docker-compose container to verify the SQLFlow output:
```
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-simple-agg-mem
```

6. Start SQLFlow in Docker:

```
docker run -v $(pwd)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow -e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 turbolytics/sql-flow:latest run /tmp/conf/config/examples/basic.agg.mem.yml --max-msgs-to-process=10000
```

7. Verify the output in the Kafka consumer:

```
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
```

You just ran SQLFlow against a stream of Kafka data!

# How SQLFlow Works

SQLFlow is a stream processing engine written in Python. SQLFlow embeds DuckDB and Apache Arrow for high-performance processing. SQLFlow consists of a few core components:


## Core Components

**Input Source**

SQLFlow ingests data from a variety of input sources, including Kafka and WebSockets, and models the input as a stream of data.

**Handler**

SQLFlow uses DuckDB and Apache Arrow to execute SQL against the input source. Handlers contain the stream processing logic: they filter, aggregate, enrich, or drop data.

**Output Sink**

SQLFlow writes the results of the SQL to output sinks, including Kafka, Postgres, the local filesystem, and blob storage.
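
Conceptually, each batch of input is exposed to DuckDB as an Arrow table, the pipeline's SQL runs against it, and the result is handed to the sink. The snippet below is only a minimal sketch of that DuckDB-over-Arrow pattern, not SQLFlow's internal code; it uses a hard-coded in-memory batch as the source and stdout as the sink:

```
import duckdb
import pyarrow as pa

# Hypothetical in-memory "source"; SQLFlow would read these records from Kafka or a WebSocket.
batch = pa.Table.from_pylist([
    {"city": "New York"},
    {"city": "New York"},
    {"city": "Baltimore"},
])

con = duckdb.connect()
con.register("batch", batch)  # expose the Arrow table to DuckDB

# "Handler": the pipeline's SQL, executed against the registered batch.
result = con.execute(
    "SELECT city, count(*) AS city_count FROM batch GROUP BY city"
).arrow()

# "Sink": write the result somewhere; here we simply print it.
print(result.to_pylist())
```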

Example SQLFlow configuration files are available in [dev/config/examples](https://github.com/turbolytics/sql-flow/tree/main/dev/config/examples).

Each file contains a `pipeline` configuration with `source`, `handler`, and `sink` sections. A configuration file can also contain commands to be executed before the pipeline runs; these commands support tasks such as attaching databases to the pipeline execution context.

## SQLFlow Use-Cases

- **Streaming Data Transformations**: Clean data, convert types, and publish the new data ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/basic.agg.mem.yml)).
- **Stream Enrichment**: Add data to an input stream and publish the new data ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/enrich.yml)).
- **Data Aggregation**: Aggregate input data batches to decrease data volume ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/basic.agg.mem.yml)).
- **Tumbling Window Aggregation**: Bucket data into arbitrary time windows, such as "hour" or "10 minutes"; a minimal sketch of the windowing SQL appears after this list ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/tumbling.window.yml)).
- **Run SQL against the Bluesky Firehose**: Execute SQL against any WebSocket source, such as the [Bluesky firehose](https://docs.bsky.app/docs/advanced-guides/firehose) ([example config](https://github.com/turbolytics/sql-flow/blob/main/dev/config/examples/bluesky/bluesky.kafka.raw.yml)).
- **Stream Data to Iceberg**: Stream writes to an Iceberg catalog.
- **Enrich Streams with Postgres Data**: Query Postgres during stream processing to enrich stream data.
- **Sink Kafka to Postgres**: Insert stream processing outputs into Postgres.
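
For intuition on the tumbling-window use case, here is a minimal, self-contained sketch of the kind of windowing SQL a handler might run. It uses plain DuckDB over an Arrow table rather than SQLFlow itself, and buckets hypothetical events into one-hour windows:

```
from datetime import datetime

import duckdb
import pyarrow as pa

# Hypothetical events; in SQLFlow these would arrive from a source such as Kafka.
events = pa.Table.from_pylist([
    {"city": "New York", "ts": datetime(2025, 1, 1, 0, 5)},
    {"city": "New York", "ts": datetime(2025, 1, 1, 0, 55)},
    {"city": "Baltimore", "ts": datetime(2025, 1, 1, 1, 10)},
])

con = duckdb.connect()
con.register("events", events)

# Truncate timestamps to the hour to form tumbling windows, then count per city.
print(con.execute("""
    SELECT date_trunc('hour', ts) AS window_start, city, count(*) AS city_count
    FROM events
    GROUP BY window_start, city
    ORDER BY window_start, city
""").fetchall())
```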

## SQLFlow Features & Roadmap

- Sources
  - [x] Kafka Consumer using consumer groups
  - [x] WebSocket input (for consuming the Bluesky firehose)
  - [ ] HTTP (for webhooks)
- Sinks
  - [x] Kafka Producer
  - [x] Stdout
  - [x] Local Disk
  - [x] Postgres
  - [x] S3
  - [x] Any output DuckDB supports!
- Serialization
  - [x] JSON Input
  - [x] JSON Output
  - [x] Parquet Output
  - [x] Iceberg Output (using pyiceberg)
- Handlers
  - [x] Memory Persistence
  - [x] Pipeline-scoped SQL, such as defining views or attaching databases
  - [x] User Defined Functions (UDF)
  - [x] Dynamic Schema Inference
  - [ ] Disk Persistence
  - [ ] Static Schema Definition
- Table Managers
  - [x] Tumbling Window Aggregations
  - [ ] Buffered Table
- Operations
  - [x] Observability Metrics (Prometheus)

## Examples

Additional examples are available in the wiki: [Tutorials](https://github.com/turbolytics/sql-flow/wiki/Tutorials)

### Consume Bluesky Firehose

SQLFlow supports running DuckDB SQL over a WebSocket input. Running SQL against the [Bluesky firehose](https://docs.bsky.app/docs/advanced-guides/firehose) requires only a simple configuration file (see the Bluesky example configurations linked at the end of this section).

The following command starts a Bluesky consumer and prints every post to stdout:

```
docker run -v $(pwd)/dev/config/examples:/examples turbolytics/sql-flow:latest run /examples/bluesky/bluesky.raw.stdout.yml
```

![output](https://github.com/user-attachments/assets/185c6453-debc-439a-a2b9-ed20fdc82851)

[Check out the configuration files here](https://github.com/turbolytics/sql-flow/tree/main/dev/config/examples/bluesky).

### Stream Kafka to Iceberg

SQLFlow supports writing to Iceberg tables using [pyiceberg](https://py.iceberg.apache.org/).

The following example writes to an Iceberg table using a local SQLite catalog:

- Initialize the SQLite Iceberg catalog and test tables:
```
python3 cmd/setup-iceberg-local.py setup
created default.city_events
created default.bluesky_post_events
Catalog setup complete.
```

- Start Kafka locally:
```
docker-compose -f dev/kafka-single.yml up -d
```

- Publish test messages to Kafka:
```
python3 cmd/publish-test-data.py --num-messages=5000 --topic="input-kafka-mem-iceberg"
```

- Run SQLFlow, which reads from Kafka and writes to the local Iceberg table:
```
docker run \
-e SQLFLOW_KAFKA_BROKERS=host.docker.internal:29092 \
-e PYICEBERG_HOME=/tmp/iceberg/ \
-v $(pwd)/dev/config/iceberg/.pyiceberg.yaml:/tmp/iceberg/.pyiceberg.yaml \
-v /tmp/sqlflow/warehouse:/tmp/sqlflow/warehouse \
-v $(pwd)/dev/config/examples:/examples \
turbolytics/sql-flow:latest run /examples/kafka.mem.iceberg.yml --max-msgs-to-process=5000
```

- Verify the Iceberg data was written by querying it with DuckDB:
```
% duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from '/tmp/sqlflow/warehouse/default.db/city_events/data/*.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         5000 │
└──────────────┘
```
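
As an alternative to querying the Parquet files directly, the table can also be read through the Iceberg catalog with [pyiceberg](https://py.iceberg.apache.org/). A minimal sketch, assuming the catalog defined in `dev/config/iceberg/.pyiceberg.yaml` is named `default` (adjust the catalog name and `PYICEBERG_HOME` to match your setup):

```
import os

from pyiceberg.catalog import load_catalog

# Assumption: PYICEBERG_HOME points at the directory containing .pyiceberg.yaml.
os.environ.setdefault("PYICEBERG_HOME", "dev/config/iceberg")

catalog = load_catalog("default")  # assumed catalog name
table = catalog.load_table("default.city_events")

# Scan the table and count the rows written by the pipeline.
print(table.scan().to_arrow().num_rows)  # expected: 5000
```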

## Running SQLFlow

Coming soon! Until then, check out:

- [Tutorials](https://github.com/turbolytics/sql-flow/wiki/Tutorials)

If you need any support, please open an issue or contact us directly (`danny` [AT] `turbolytics.io`).

## Development

- Install the Python dependencies:
```
pip install -r requirements.txt
pip install -r requirements.dev.txt

# On macOS with Homebrew, point the build at librdkafka's headers and libraries when installing confluent-kafka:
C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka
```

- Run the unit tests:
```
make test-unit
```

## [Benchmarks](https://github.com/turbolytics/sql-flow/wiki/Benchmarks)

The following table shows the performance of different test scenarios:

| Name                      | Throughput      | Max RSS Memory | Peak Memory Usage |
|---------------------------|-----------------|----------------|-------------------|
| Simple Aggregation Memory | 45,000 msgs/sec | 230 MiB        | 130 MiB           |
| Simple Aggregation Disk   | 36,000 msgs/sec | 256 MiB        | 102 MiB           |
| Enrichment                | 13,000 msgs/sec | 368 MiB        | 124 MiB           |
| CSV Disk Join             | 11,500 msgs/sec | 312 MiB        | 152 MiB           |
| CSV Memory Join           | 33,200 msgs/sec | 300 MiB        | 107 MiB           |
| In Memory Tumbling Window | 44,000 msgs/sec | 198 MiB        | 96 MiB            |

[More information about the benchmarks is available in the wiki](https://github.com/turbolytics/sql-flow/wiki/Benchmarks).

# Contact Us

Like SQLFlow? Use SQLFlow? Feature Requests? Please let us know! danny@turbolytics.io