https://github.com/rayokota/kwack
In-Memory Analytics for Kafka using DuckDB
- Host: GitHub
- URL: https://github.com/rayokota/kwack
- Owner: rayokota
- License: apache-2.0
- Created: 2024-06-29T04:40:15.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-03-27T08:35:36.000Z (8 months ago)
- Last Synced: 2025-04-10T05:07:16.492Z (7 months ago)
- Topics: analytics, duckdb, kafka
- Language: Java
- Size: 214 MB
- Stars: 112
- Watchers: 2
- Forks: 6
- Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - kwack - In-Memory Analytics for Kafka using DuckDB. (Integrations / Web Clients (WebAssembly))
README
# kwack - In-Memory Analytics for Kafka using DuckDB
[![Build Status][github-actions-shield]][github-actions-link]
[github-actions-shield]: https://github.com/rayokota/kwack/actions/workflows/build.yml/badge.svg?branch=master
[github-actions-link]: https://github.com/rayokota/kwack/actions
kwack supports in-memory analytics for Kafka data using DuckDB.
## Getting Started
Note that kwack requires Java 11 or higher.
To run kwack, download a [release](https://github.com/rayokota/kwack/releases) and unpack it, as sketched below.
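For example (the archive name is illustrative; pick the actual asset from the releases page):
```bash
# Download and unpack a kwack release.
# <version> is a placeholder; substitute the release you want.
$ curl -LO https://github.com/rayokota/kwack/releases/download/<version>/kwack-<version>.tar.gz
$ tar -xzf kwack-<version>.tar.gz
```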
Then change to the `kwack-${version}` directory and run the following to see the command-line options:
```bash
$ bin/kwack -h
Usage: kwack [-hV] [-t=<topic>]... [-p=<partition>]... [-b=<broker>]...
             [-m=<ms>] [-F=<config-file>] [-o=<offset>] [-k=<topic=serde>]...
             [-v=<topic=serde>]... [-r=<url>] [-q=<query>] [-a=<attr>]...
             [-d=<db>] [-X=<prop=val>]...
In-Memory Analytics for Kafka using DuckDB.
  -t, --topic=<topic>               Topic(s) to consume from and produce to
  -p, --partition=<partition>       Partition(s)
  -b, --bootstrap-server=<broker>   Bootstrap broker(s) (host:[port])
  -m, --metadata-timeout=<ms>       Metadata (et.al.) request timeout
  -F, --file=<config-file>          Read configuration properties from file
  -o, --offset=<offset>             Offset to start consuming from:
                                      beginning | end |
                                      <value>  (absolute offset) |
                                      -<value> (relative offset from end)
                                      @<value> (timestamp in ms to start at)
                                      Default: beginning
  -k, --key-serde=<topic=serde>     (De)serialize keys using <serde>
  -v, --value-serde=<topic=serde>   (De)serialize values using <serde>
                                    Available serdes:
                                      short | int | long | float |
                                      double | string | json | binary |
                                      avro:<schema|@file> |
                                      json:<schema|@file> |
                                      proto:<schema|@file> |
                                      latest (use latest version in SR) |
                                      <id> (use schema id from SR)
                                      Default for key:   binary
                                      Default for value: latest
                                    The proto/latest/<id> serde formats can
                                    also take a message type name, e.g.
                                      proto:<schema|@file>;msg:<name>
                                    in case multiple message types exist
  -r, --schema-registry-url=<url>   SR (Schema Registry) URL
  -q, --query=<query>               SQL query to execute. If none is specified,
                                      interactive sqlline mode is used
  -a, --row-attribute=<attr>        Row attribute(s) to show:
                                      none
                                      rowkey (record key)
                                      ksi    (key schema id)
                                      vsi    (value schema id)
                                      top    (topic)
                                      par    (partition)
                                      off    (offset)
                                      ts     (timestamp)
                                      tst    (timestamp type)
                                      epo    (leadership epoch)
                                      hdr    (headers)
                                      Default: rowkey,ksi,vsi,par,off,ts,hdr
  -d, --db=<db>                     DuckDB db, appended to 'jdbc:duckdb:'
                                      Default: :memory:
  -x, --skip-bytes=<bytes>          Extra bytes to skip when deserializing with
                                      an external schema
  -X, --property=<prop=val>        Set configuration property.
  -h, --help                        Show this help message and exit.
  -V, --version                     Print version information and exit.
```
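The `-o` flag in particular accepts the same offset forms as kcat; for example (broker, topic, and registry URL are placeholders):
```bash
# Start from the 1000 most recent records in each partition (relative offset from end)
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -o -1000

# Start from a point in time, given as a timestamp in ms since the epoch
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -o @1693526400000
```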
kwack shares many command-line options with [kcat](https://github.com/edenhill/kcat) (formerly kafkacat).
In addition, a file containing configuration properties can be used. The available configuration properties
are listed [here](https://github.com/rayokota/kwack/blob/master/src/main/java/io/kcache/kwack/KwackConfig.java).
Simply modify `config/kwack.properties` to point to an existing Kafka broker and Schema
Registry. Then run the following:
```bash
# Run with properties file
$ bin/kwack -F config/kwack.properties
```
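For illustration, such a file might look like the following; the property names here are assumptions modeled on the command-line options, so consult `KwackConfig.java` for the authoritative keys:
```bash
$ cat config/kwack.properties
# Hypothetical property names -- check KwackConfig.java for the real keys
bootstrap.servers=mybroker:9092
topics=mytopic,mytopic2
schema.registry.url=http://schema-registry-url:8081
```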
Starting kwack is as easy as specifying a Kafka broker, topic, and Schema Registry URL:
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081
Welcome to kwack!
Enter "!help" for usage hints.
___(.)>
~~~~~~\___)~~~~~~
jdbc:duckdb::memory:>
```
When kwack starts, it will enter interactive mode, where you can enter SQL queries to analyze Kafka data.
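For example, at the sqlline prompt (table names match topic names; `mytopic` is a placeholder):
```bash
jdbc:duckdb::memory:> SHOW TABLES;
jdbc:duckdb::memory:> DESCRIBE mytopic;
jdbc:duckdb::memory:> SELECT count(*) FROM mytopic;
```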
For non-interactive mode, specify a query on the command line:
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic"
```
The output of the above command will be in JSON, and so can be piped to other commands such as `jq`, as in the sketch below.
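For example, to count rows per value of a column (assuming the topic's schema has a column named `col1`):
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 \
    -q "SELECT col1, count(*) AS cnt FROM mytopic GROUP BY col1" | jq .
```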
One can load multiple topics, and then perform a query that joins the resulting tables on a common
column:
```bash
$ bin/kwack -b mybroker -t mytopic -t mytopic2 -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic JOIN mytopic2 USING (col1)"
```
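Every table also carries the default row-attribute columns (rowkey, ksi, vsi, par, off, ts, hdr), which can make `SELECT *` noisy in joins. A sketch that suppresses them, assuming `-a none` disables all row attributes as the help text indicates:
```bash
$ bin/kwack -b mybroker -t mytopic -t mytopic2 -r http://schema-registry-url:8081 \
    -a none -q "SELECT * FROM mytopic JOIN mytopic2 USING (col1)"
```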
One can convert Kafka data into Parquet format by using the COPY command in DuckDB:
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "COPY mytopic to 'mytopic.parquet' (FORMAT 'parquet')"
```
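The result is an ordinary Parquet file; for instance, it can be inspected with the standalone DuckDB CLI (assuming it is installed):
```bash
# Query the exported file directly, outside of kwack
$ duckdb -c "SELECT count(*) FROM 'mytopic.parquet'"
```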
If not using Confluent Schema Registry, one can pass an external schema:
```bash
$ bin/kwack -b mybroker -t mytopic -v mytopic=proto:@/path/to/myschema.proto
```
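For illustration, such an external schema file might look like this (the message and field names are made up):
```bash
$ cat /path/to/myschema.proto
syntax = "proto3";

message MyRecord {
  string col1 = 1;
  int64 amount = 2;
}
```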
For a given schema, kwack creates DuckDB columns by mapping Avro, Protobuf,
or JSON Schema types to DuckDB types as follows:
|Avro | Protobuf | JSON Schema | DuckDB |
|-----|----------|-------------|--------|
|boolean | boolean | boolean | BOOLEAN |
|int | int32, sint32, sfixed32 || INTEGER |
|| uint32, fixed32 || UINTEGER |
|long | int64, sint64, sfixed64 | integer | BIGINT |
|| uint64, fixed64 || UBIGINT |
|float | float || FLOAT |
|double | double | number | DOUBLE |
|string | string | string | VARCHAR |
|bytes, fixed | bytes || BLOB |
|enum | enum| enum | ENUM |
|record | message | object | STRUCT |
|array | repeated | array | LIST |
|map | map || MAP |
|union | oneof | oneOf,anyOf | UNION |
|decimal | confluent.type.Decimal || DECIMAL |
|date | google.type.Date || DATE |
|time-millis, time-micros | google.type.TimeOfDay || TIME |
|timestamp-millis ||| TIMESTAMP_MS |
|timestamp-micros ||| TIMESTAMP |
|timestamp-nanos | google.protobuf.Timestamp || TIMESTAMP_NS |
|duration | google.protobuf.Duration || INTERVAL |
|uuid ||| UUID |
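As an example of this mapping, consider a hypothetical Avro value schema:
```bash
$ cat order.avsc
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal",
                                "precision": 10, "scale": 2}},
    {"name": "created", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```
Per the table above, these fields would become DuckDB columns of type VARCHAR, DECIMAL, and TIMESTAMP_MS, alongside any row-attribute columns.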
For more on how to use kwack, see this [blog](https://yokota.blog/2024/07/11/in-memory-analytics-for-kafka-using-duckdb/).