# Streaming Adventures
Experiment with:
- [x] [Arroyo](https://www.arroyo.dev/)
- [x] [Redpanda Connect](https://www.redpanda.com/connect),
- [x] [Bufstream](https://buf.build/product/bufstream), [demo](https://github.com/bufbuild/bufstream-demo), demo with [iceberg](https://github.com/bufbuild/buf-examples/tree/main/bufstream/iceberg-quickstart)
- [x] [SQLFlow](https://sql-flow.com/docs/introduction/basics)
- [ ] [Timeplus](https://docs.timeplus.com/proton-howto)
- [ ] [RisingWave](https://risingwave.com/overview/)
- [ ] [Tableflow](https://www.confluent.io/product/tableflow/)
## Prerequisites
Install the rpk CLI to use as a Kafka CLI
```shell
brew install redpanda-data/tap/redpanda
# add zsh completions
rpk generate shell-completion zsh > "${fpath[1]}/_rpk"
```
Create an rpk **profile** to connect to the **local** Redpanda Kafka cluster
```shell
rpk profile create local \
-s brokers=localhost:19092 \
-s registry.hosts=localhost:8081 \
-s admin.hosts=localhost:9644
```
Install the psql CLI on macOS
```shell
brew install libpq
# symlink psql (and other libpq tools) into /usr/local/bin:
brew link --force libpq
# connect to the local database
psql "postgresql://postgres:postgres@localhost/postgres?sslmode=require"
```
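A quick connectivity check (the credentials and `sslmode` above come from the compose setup; adjust if yours differ):
```shell
# run a trivial query to verify the connection
psql "postgresql://postgres:postgres@localhost/postgres?sslmode=require" -c 'SELECT version();'
```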
## Start
First time setup
```shell
# pull docker images to local
docker compose --profile optional pull
```
```shell
docker compose up
# docker compose --profile optional up
docker compose ps
open http://localhost:5115/ # Arroyo Console
open http://localhost:8080/ # Redpanda Console
open http://localhost:8081/subjects # Redpanda Registry
docker compose down
# (DANGER) - shutdown and delete volumes
docker compose down -v
```
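If a service doesn't come up, docker compose can tail its logs (the service name `arroyo` here is an assumption; use whatever names the compose file defines):
```shell
# follow logs for a single service
docker compose logs -f arroyo
```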
**Benthos** (now Redpanda Connect) example
```shell
# to start with benthos
docker compose up connect
docker compose down
# (DANGER) - shutdown and delete volumes
docker compose down -v
```
This will start:
1. Postgres Database
2. Kafka - [Redpanda](https://www.redpanda.com/) or [Bufstream](https://buf.build/product/bufstream)
3. [Redpanda Console](https://www.redpanda.com/redpanda-console-kafka-ui)
4. [Redpanda Connect](https://www.redpanda.com/connect) (optional)
5. [MinIO](https://min.io/) (optional)
6. [ClickHouse](https://clickhouse.com/) (optional)
7. [Arroyo](https://www.arroyo.dev/)
## Config
### Kafka
Add new topics
> [!TIP]
> You can also use [Redpanda Console](http://localhost:8080/overview) to create topics.
```shell
rpk topic list
rpk topic create -r 1 -p 1 customer-source
rpk topic create -r 1 -p 1 customer-sink
```
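To verify, a quick check with rpk (topic names from the commands above):
```shell
# confirm both topics exist with the expected partition/replica counts
rpk topic describe customer-source
rpk topic describe customer-sink
```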
### Arroyo Pipeline
From the [Arroyo Console](http://localhost:5115/), create a pipeline with:
> [!WARNING]
> By default, preview doesn't write to sinks, to avoid accidentally writing bad data.
> You can run the pipeline for real by clicking "Launch", or you can enable sinks in preview:
```sql
CREATE TABLE customer_source (
name TEXT,
age INT,
phone TEXT
) WITH (
connector = 'kafka',
format = 'json',
type = 'source',
bootstrap_servers = 'redpanda:9092',
topic = 'customer-source'
);
CREATE TABLE customer_sink (
count BIGINT,
age INT
) WITH (
connector = 'kafka',
format = 'json',
type = 'sink',
bootstrap_servers = 'redpanda:9092',
topic = 'customer-sink'
);
-- preview the windowed aggregate; hop(slide, width) emits a 10-second window every 2 seconds
SELECT count(*), age
FROM customer_source
GROUP BY age, hop(interval '2 seconds', interval '10 seconds');
-- write the same aggregate to the Kafka sink
INSERT INTO customer_sink SELECT count(*), age
FROM customer_source
GROUP BY age, hop(interval '2 seconds', interval '10 seconds');
```
## Test
Publish a couple of messages to the `customer-source` topic using the **Redpanda Console**, e.g.:
> [!IMPORTANT]
> Use TYPE: **JSON**
```json
{
"name": "sumo",
"age": 70,
"phone": "111-222-4444"
}
```
Check for new messages in the `customer-sink` topic.
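Alternatively, a sketch of the same test with the rpk CLI (assuming the `local` profile created earlier):
```shell
# publish one JSON message to the source topic
echo '{"name": "sumo", "age": 70, "phone": "111-222-4444"}' | rpk topic produce customer-source
# watch the sink topic for the aggregated output
rpk topic consume customer-sink
```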
## Examples
Try more [examples](https://github.com/ArroyoSystems/arroyo/tree/master/crates/arroyo-planner/src/test/queries) in the Arroyo Console: http://localhost:5115/
`basic_tumble_aggregate`
```sql
CREATE TABLE nexmark WITH (
connector = 'nexmark',
event_rate = 10
);
SELECT
bid.auction as auction,
tumble(INTERVAL '1' second) as window,
count(*) as count
FROM
nexmark
where
bid is not null
GROUP BY
1,
2
```
`bitcoin_exchange_rate`
```sql
CREATE TABLE coinbase (
type TEXT,
price TEXT
) WITH (
connector = 'websocket',
endpoint = 'wss://ws-feed.exchange.coinbase.com',
subscription_message = '{
"type": "subscribe",
"product_ids": [
"BTC-USD"
],
"channels": ["ticker"]
}',
format = 'json'
);
SELECT avg(CAST(price as FLOAT)) from coinbase
WHERE type = 'ticker'
GROUP BY hop(interval '5' second, interval '1 minute');
```
`first_pipeline`
```sql
CREATE TABLE nexmark with (
connector = 'nexmark',
event_rate = '100'
);
-- SELECT * from nexmark where auction is not null;
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY window
ORDER BY count DESC) AS row_num
FROM (SELECT count(*) AS count, bid.auction AS auction,
hop(interval '2 seconds', interval '60 seconds') AS window
FROM nexmark WHERE bid is not null
GROUP BY 2, window)) WHERE row_num <= 5;
```
`create_table_updating`
```sql
CREATE TABLE nexmark with (
connector = 'nexmark',
event_rate = '100'
);
CREATE TABLE bids (
auction BIGINT,
bidder BIGINT,
channel VARCHAR,
url VARCHAR,
datetime DATETIME,
avg_price BIGINT
) WITH (
connector = 'filesystem',
type = 'sink',
path = '/home/data',
format = 'parquet',
parquet_compression = 'zstd',
rollover_seconds = 60,
time_partition_pattern = '%Y/%m/%d/%H',
partition_fields = 'bidder'
);
-- SELECT bid from nexmark where bid is not null;
INSERT INTO bids
SELECT
bid.auction, bid.bidder, bid.channel, bid.url, bid.datetime, bid.price as avg_price
FROM
nexmark
where
bid is not null
```
```sql
CREATE TABLE nexmark with (
connector = 'nexmark',
event_rate = '100'
);
SELECT avg(bid.price) as avg_price
FROM nexmark
WHERE bid IS NOT NULL
GROUP BY hop(interval '2 seconds', interval '10 seconds');
```
## TODO
- Try [Redpanda Iceberg Topics for SQL-based analytics with zero ETL](https://github.com/redpanda-data/redpanda-labs/tree/main/docker-compose/iceberg)
- [Build a Streaming CDC Pipeline with MinIO and Redpanda into Snowflake](https://blog.min.io/build-a-streaming-cdc-pipeline-with-minio-and-redpanda-into-snowflake/)
- [SQLFlow](https://sql-flow.com/docs/introduction/basics) - Enables SQL-based stream-processing, powered by DuckDB.