- Host: GitHub
- URL: https://github.com/turbolytics/sql-flow
- Owner: turbolytics
- License: mit
- Created: 2023-11-21T22:18:36.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-04-11T10:02:00.000Z (7 months ago)
- Last Synced: 2024-08-01T08:09:44.646Z (3 months ago)
- Language: Python
- Homepage:
- Size: 564 KB
- Stars: 52
- Watchers: 3
- Forks: 4
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SQLFlow: DuckDB for streaming data.
SQLFlow enables SQL-based stream processing, powered by [DuckDB](https://duckdb.org/). It lets you write kafka stream-processing logic in pure SQL.
SQLFlow supports:
- Kafka streaming: a consumer that performs SQL-based transformations and publishes the output to another kafka topic
- JSON on the wire
- Writing stream transformations in pure SQL, powered by [DuckDB](https://duckdb.org/)
- A performant [librdkafka](https://github.com/confluentinc/librdkafka) [python consumer](https://github.com/confluentinc/confluent-kafka-python)

SQLFlow is currently not a good fit for:
- Stateful stream processing
- Wire protocols other than JSON

SQLFlow is a kafka consumer that embeds SQL for stream transformation.
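To make the core idea concrete, here is a minimal, dependency-free sketch of "embed SQL in a consumer": take a batch of JSON messages, run a SQL aggregation over them, and emit the results as JSON. The stdlib `sqlite3` module stands in for DuckDB here purely for illustration; the actual project embeds DuckDB, and the `transform` helper and its schema are assumptions, not SQLFlow's API.

```python
import json
import sqlite3

def transform(messages, sql):
    """Load a batch of JSON messages into an in-memory table, run SQL, emit JSON."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE batch (city TEXT)")
    con.executemany(
        "INSERT INTO batch (city) VALUES (?)",
        [(json.loads(m)["city"],) for m in messages],
    )
    rows = con.execute(sql).fetchall()
    return [json.dumps({"city": c, "city_count": n}) for c, n in rows]

batch = ['{"city": "New York"}', '{"city": "New York"}', '{"city": "Baltimore"}']
out = transform(batch, "SELECT city, COUNT(*) FROM batch GROUP BY city ORDER BY city")
# One aggregated JSON record per city
```

A real consumer would run this transform per micro-batch and publish each output record to the configured output topic.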
## Getting Started
### Docker
[Docker is the easiest way to get started.](https://hub.docker.com/r/turbolytics/sql-flow)

- Pull the sql-flow docker image
```
docker pull turbolytics/sql-flow:latest
```

- Validate the config by invoking it on test data
```
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest dev invoke /tmp/conf/config/examples/basic.agg.yml /tmp/conf/fixtures/simple.json

['{"city":"New York","city_count":28672}', '{"city":"Baltimore","city_count":28672}']
```

- Start kafka locally using docker
```
docker-compose -f dev/kafka-single.yml up -d
```

- Publish test messages to kafka
```
python3 cmd/publish-test-data.py --num-messages=10000 --topic="topic-local-docker"
```

- Start a kafka consumer inside the docker-compose container
```
docker exec -it kafka1 kafka-console-consumer --bootstrap-server=kafka1:9092 --topic=output-local-docker
```

- Start SQLFlow in docker
```
docker run -v $(PWD)/dev:/tmp/conf -v /tmp/sqlflow:/tmp/sqlflow turbolytics/sql-flow:latest run /tmp/conf/config/local.docker.yml
```

- Verify output in the kafka consumer
```
...
...
{"city":"San Francisco504","city_count":1}
{"city":"San Francisco735","city_count":1}
{"city":"San Francisco533","city_count":1}
{"city":"San Francisco556","city_count":1}
```

The `dev invoke` command tests a sql-flow pipeline configuration against a batch of test data, providing fast local feedback before launching a sql-flow consumer that reads from kafka.
## Configuration
The heart of sql-flow is the pipeline configuration file. Each configuration file specifies:
- Kafka configuration
- Pipeline configuration
- Input configuration
- SQL transformation
- Output configuration

Every instance of sql-flow needs a pipeline configuration file.
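The shape of such a file might look like the following sketch. The keys shown here are illustrative assumptions, not the project's actual schema; see the examples under `dev/config` for real configurations.

```yaml
# Hypothetical pipeline configuration (keys are illustrative only)
kafka:
  brokers: [localhost:9092]
  group_id: sqlflow-example
  input_topic: topic-local-docker
  output_topic: output-local-docker
pipeline:
  sql: |
    SELECT city, COUNT(*) AS city_count
    FROM batch
    GROUP BY city
```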
## Recipes
Coming soon. Until then, check out:
- [Benchmark configurations](./dev/config/benchmarks)
- [Unit Test configurations](./tests/)

#### Running multiple SQLFlow instances on the same filesystem
#### Verifying a configuration locally

## Development
- Install python deps
```
pip install -r requirements.txt
pip install -r requirements.dev.txt
C_INCLUDE_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/include LIBRARY_PATH=/opt/homebrew/Cellar/librdkafka/2.3.0/lib pip install confluent-kafka
```

- Run tests
```
pytest tests
```
## Benchmarks
**Methodology**
Each test loads 1MM records into kafka, runs the sql-flow consumer until every message is processed, and records the maximum resident memory and the average message-ingestion throughput.
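The two metrics above can be captured with a small harness like the following sketch. The `benchmark` helper is an illustration, not the project's actual benchmark code; it times a processing loop and reads the process's maximum resident set size via the stdlib `resource` module.

```python
import resource
import time

def benchmark(messages, process):
    """Process every message, then report avg throughput and max RSS."""
    start = time.monotonic()
    for m in messages:
        process(m)
    elapsed = time.monotonic() - start
    throughput = len(messages) / elapsed if elapsed > 0 else float("inf")
    # ru_maxrss is reported in kilobytes on Linux and bytes on macOS
    max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return throughput, max_rss

throughput, max_rss = benchmark(range(100_000), lambda m: m * 2)
```

`resource` is Unix-only; on other platforms a tool like `psutil` would be the usual substitute.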
**System**
```
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,3
Model Number: Z15G001X2LL/A
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB
Activation Lock Status: Enabled
```

| Name | Throughput | Max RSS Memory | Peak Memory Usage |
|---------------------------|-------------------|----------------|-------------------|
| Simple Aggregation Memory | 45,000 msgs / sec | 230 MiB | 130 MiB |
| Simple Aggregation Disk | 36,000 msgs / sec | 256 MiB | 102 MiB |
| Enrichment                | 13,000 msgs / sec | 368 MiB        | 124 MiB           |
| CSV Disk Join             | 11,500 msgs / sec | 312 MiB        | 152 MiB           |
| CSV Memory Join           | 33,200 msgs / sec | 300 MiB        | 107 MiB           |

### Simple Aggregate - Disk / Mem
Performs a simple aggregate. Output is significantly smaller than input.

```
./benchmark/simple-agg-disk.sh
./benchmark/simple-agg-mem.sh
```

### Enrichment
Performs an enrichment. Output is 1:1 records with input, but each output record is enhanced with additional information.

```
./benchmark/enrichment.sh
```

### CSV Disk Join
```
./benchmark/csv-disk-join.sh
```

### CSV Memory Join
```
./benchmark/csv-disk-join.sh
```

---
Like SQLFlow? Use SQLFlow? Have a feature request? Please let us know! [email protected]