# Streaming Adventures
Experiment with:
- [x] [Arroyo](https://www.arroyo.dev/)
- [x] [Redpanda Connect](https://www.redpanda.com/connect),
- [x] [Bufstream](https://buf.build/product/bufstream), [demo](https://github.com/bufbuild/bufstream-demo), demo with [iceberg](https://github.com/bufbuild/buf-examples/tree/main/bufstream/iceberg-quickstart)
- [x] [SQLFlow](https://sql-flow.com/docs/introduction/basics)
- [ ] [Timeplus](https://docs.timeplus.com/proton-howto)
- [ ] [RisingWave](https://risingwave.com/overview/)
- [ ] [Tableflow](https://www.confluent.io/product/tableflow/)
## Prerequisites
Install the rpk CLI to use as a Kafka CLI
```shell
brew install redpanda-data/tap/redpanda
# add zsh completions
rpk generate shell-completion zsh > "${fpath[1]}/_rpk"
```
Create an rpk **profile** to connect to the **local** Redpanda Kafka cluster
```shell
rpk profile create local \
-s brokers=localhost:19092 \
-s registry.hosts=localhost:8081 \
-s admin.hosts=localhost:9644
```
Install the psql CLI on macOS
```shell
brew install libpq
# symlink psql (and other libpq tools) into /usr/local/bin:
brew link --force libpq
# connect to the local database
psql "postgresql://postgres:postgres@localhost/postgres?sslmode=require"
```
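A quick connectivity check (the credentials and `sslmode` above come from the compose setup; adjust if yours differ):
```shell
# run a trivial query to verify the connection
psql "postgresql://postgres:postgres@localhost/postgres?sslmode=require" -c 'SELECT version();'
```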
## Start
First time setup
```shell
# pull docker images to local
docker compose --profile optional pull
```
```shell
docker compose up
# docker compose --profile optional up
docker compose ps
open http://localhost:5115/ # Arroyo Console
open http://localhost:8080/ # Redpanda Console
open http://localhost:8081/subjects # Redpanda Registry
docker compose down
# (DANGER) - shutdown and delete volumes
docker compose down -v
```
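If a service doesn't come up, docker compose can tail its logs (the service name `arroyo` here is an assumption; use whatever names the compose file defines):
```shell
# follow logs for a single service
docker compose logs -f arroyo
```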
**Benthos** (now Redpanda Connect) example
```shell
# to start with benthos
docker compose up connect
docker compose down
# (DANGER) - shutdown and delete volumes
docker compose down -v
```
This will start:
1. Postgres Database
2. Kafka - [Redpanda](https://www.redpanda.com/) or [Bufstream](https://buf.build/product/bufstream)
3. [Redpanda Console](https://www.redpanda.com/redpanda-console-kafka-ui)
4. [Redpanda Connect](https://www.redpanda.com/connect) (optional)
5. [MinIO](https://min.io/) (optional)
6. [ClickHouse](https://clickhouse.com/) (optional)
7. [Arroyo](https://www.arroyo.dev/)
## Config
### Kafka
Add new topics
> [!TIP]
> You can also use [Redpanda Console](http://localhost:8080/overview) to create topics.
```shell
rpk topic list
rpk topic create -r 1 -p 1 customer-source
rpk topic create -r 1 -p 1 customer-sink
```
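To verify, a quick check with rpk (topic names from the commands above):
```shell
# confirm both topics exist with the expected partition/replica counts
rpk topic describe customer-source
rpk topic describe customer-sink
```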
### Arroyo Pipeline
From the [Arroyo Console](http://localhost:5115/), create a pipeline with:
> [!WARNING]
> By default, preview doesn't write to sinks, to avoid accidentally writing bad data.
> You can run the pipeline for real by clicking "Launch", or you can enable sinks in preview:
```sql
CREATE TABLE customer_source (
name TEXT,
age INT,
phone TEXT
) WITH (
connector = 'kafka',
format = 'json',
type = 'source',
bootstrap_servers = 'redpanda:9092',
topic = 'customer-source'
);
CREATE TABLE customer_sink (
count BIGINT,
age INT
) WITH (
connector = 'kafka',
format = 'json',
type = 'sink',
bootstrap_servers = 'redpanda:9092',
topic = 'customer-sink'
);
-- preview the windowed aggregate; hop(slide, width) emits a 10-second window every 2 seconds
SELECT count(*), age
FROM customer_source
GROUP BY age, hop(interval '2 seconds', interval '10 seconds');
-- write the same aggregate to the Kafka sink
INSERT INTO customer_sink SELECT count(*), age
FROM customer_source
GROUP BY age, hop(interval '2 seconds', interval '10 seconds');
```
## Test
Publish a couple of messages to the `customer-source` topic using the **Redpanda Console**, e.g.:
> [!IMPORTANT]
> Use TYPE: **JSON**
```json
{
"name": "sumo",
"age": 70,
"phone": "111-222-4444"
}
```
Check for new messages in the `customer-sink` topic.
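Alternatively, a sketch of the same test with the rpk CLI (assuming the `local` profile created earlier):
```shell
# publish one JSON message to the source topic
echo '{"name": "sumo", "age": 70, "phone": "111-222-4444"}' | rpk topic produce customer-source
# watch the sink topic for the aggregated output
rpk topic consume customer-sink
```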
## Examples
Try more [examples](https://github.com/ArroyoSystems/arroyo/tree/master/crates/arroyo-planner/src/test/queries) in the Arroyo Console: http://localhost:5115/
`basic_tumble_aggregate`
```sql
CREATE TABLE nexmark WITH (
connector = 'nexmark',
event_rate = 10
);
SELECT
bid.auction as auction,
tumble(INTERVAL '1' second) as window,
count(*) as count
FROM
nexmark
where
bid is not null
GROUP BY
1,
2
```
`bitcoin_exchange_rate`
```sql
CREATE TABLE coinbase (
type TEXT,
price TEXT
) WITH (
connector = 'websocket',
endpoint = 'wss://ws-feed.exchange.coinbase.com',
subscription_message = '{
"type": "subscribe",
"product_ids": [
"BTC-USD"
],
"channels": ["ticker"]
}',
format = 'json'
);
SELECT avg(CAST(price as FLOAT)) from coinbase
WHERE type = 'ticker'
GROUP BY hop(interval '5' second, interval '1 minute');
```
`first_pipeline`
```sql
CREATE TABLE nexmark with (
connector = 'nexmark',
event_rate = '100'
);
-- SELECT * from nexmark where auction is not null;
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY window
ORDER BY count DESC) AS row_num
FROM (SELECT count(*) AS count, bid.auction AS auction,
hop(interval '2 seconds', interval '60 seconds') AS window
FROM nexmark WHERE bid is not null
GROUP BY 2, window)) WHERE row_num <= 5;
```
`create_table_updating`
```sql
CREATE TABLE nexmark with (
connector = 'nexmark',
event_rate = '100'
);
CREATE TABLE bids (
auction BIGINT,
bidder BIGINT,
channel VARCHAR,
url VARCHAR,
datetime DATETIME,
avg_price BIGINT
) WITH (
connector = 'filesystem',
type = 'sink',
path = '/home/data',
format = 'parquet',
parquet_compression = 'zstd',
rollover_seconds = 60,
time_partition_pattern = '%Y/%m/%d/%H',
partition_fields = 'bidder'
);
-- SELECT bid from nexmark where bid is not null;
INSERT INTO bids
SELECT
bid.auction, bid.bidder, bid.channel, bid.url, bid.datetime, bid.price as avg_price
FROM
nexmark
where
bid is not null
```
```sql
CREATE TABLE nexmark with (
connector = 'nexmark',
event_rate = '100'
);
SELECT avg(bid.price) as avg_price
FROM nexmark
WHERE bid IS NOT NULL
GROUP BY hop(interval '2 seconds', interval '10 seconds');
```
## TODO
- Try [Redpanda Iceberg Topics for SQL-based analytics with zero ETL](https://github.com/redpanda-data/redpanda-labs/tree/main/docker-compose/iceberg)
- [Build a Streaming CDC Pipeline with MinIO and Redpanda into Snowflake](https://blog.min.io/build-a-streaming-cdc-pipeline-with-minio-and-redpanda-into-snowflake/)
- [SQLFlow](https://sql-flow.com/docs/introduction/basics) - Enables SQL-based stream-processing, powered by DuckDB.