https://github.com/rayokota/kwack
In-Memory Analytics for Kafka using DuckDB
- Host: GitHub
- URL: https://github.com/rayokota/kwack
- Owner: rayokota
- License: apache-2.0
- Created: 2024-06-29T04:40:15.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-03-27T08:35:36.000Z (8 months ago)
- Last Synced: 2025-04-10T05:07:16.492Z (7 months ago)
- Topics: analytics, duckdb, kafka
- Language: Java
- Size: 214 MB
- Stars: 112
- Watchers: 2
- Forks: 6
- Open Issues: 18
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - kwack - In-Memory Analytics for Kafka using DuckDB. (Integrations / Web Clients (WebAssembly))
README
# kwack - In-Memory Analytics for Kafka using DuckDB
[![Build Status][github-actions-shield]][github-actions-link]
[github-actions-shield]: https://github.com/rayokota/kwack/actions/workflows/build.yml/badge.svg?branch=master
[github-actions-link]: https://github.com/rayokota/kwack/actions
kwack supports in-memory analytics for Kafka data using DuckDB.
## Getting Started
Note that kwack requires Java 11 or higher.
To run kwack, download a [release](https://github.com/rayokota/kwack/releases) and unpack it, as sketched below.
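For example (the archive name is illustrative; pick the actual asset from the releases page):
```bash
# Download and unpack a kwack release.
# <version> is a placeholder; substitute the release you want.
$ curl -LO https://github.com/rayokota/kwack/releases/download/<version>/kwack-<version>.tar.gz
$ tar -xzf kwack-<version>.tar.gz
```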
Then change to the `kwack-${version}` directory and run the following to see the command-line options:
```bash
$ bin/kwack -h
Usage: kwack [-hV] [-t=<topic>]... [-p=<partition>]... [-b=<broker>]...
             [-m=<ms>] [-F=<config-file>] [-o=<offset>] [-k=<topic=serde>]...
             [-v=<topic=serde>]... [-r=<url>] [-q=<query>] [-a=<attr>]...
             [-d=<db>] [-X=<prop=val>]...
In-Memory Analytics for Kafka using DuckDB.
  -t, --topic=<topic>               Topic(s) to consume from and produce to
  -p, --partition=<partition>       Partition(s)
  -b, --bootstrap-server=<broker>   Bootstrap broker(s) (host:[port])
  -m, --metadata-timeout=<ms>       Metadata (et.al.) request timeout
  -F, --file=<config-file>          Read configuration properties from file
  -o, --offset=<offset>             Offset to start consuming from:
                                      beginning | end |
                                      <value>  (absolute offset) |
                                      -<value> (relative offset from end)
                                      @<value> (timestamp in ms to start at)
                                      Default: beginning
  -k, --key-serde=<topic=serde>     (De)serialize keys using <serde>
  -v, --value-serde=<topic=serde>   (De)serialize values using <serde>
                                    Available serdes:
                                      short | int | long | float |
                                      double | string | json | binary |
                                      avro:<schema|@file> |
                                      json:<schema|@file> |
                                      proto:<schema|@file> |
                                      latest (use latest version in SR) |
                                      <id> (use schema id from SR)
                                      Default for key:   binary
                                      Default for value: latest
                                    The proto/latest/<id> serde formats can
                                    also take a message type name, e.g.
                                      proto:<schema|@file>;msg:<name>
                                    in case multiple message types exist
  -r, --schema-registry-url=<url>   SR (Schema Registry) URL
  -q, --query=<query>               SQL query to execute. If none is specified,
                                      interactive sqlline mode is used
  -a, --row-attribute=<attr>        Row attribute(s) to show:
                                      none
                                      rowkey (record key)
                                      ksi    (key schema id)
                                      vsi    (value schema id)
                                      top    (topic)
                                      par    (partition)
                                      off    (offset)
                                      ts     (timestamp)
                                      tst    (timestamp type)
                                      epo    (leadership epoch)
                                      hdr    (headers)
                                      Default: rowkey,ksi,vsi,par,off,ts,hdr
  -d, --db=<db>                     DuckDB db, appended to 'jdbc:duckdb:'
                                      Default: :memory:
  -x, --skip-bytes=<bytes>          Extra bytes to skip when deserializing with
                                      an external schema
  -X, --property=<prop=val>        Set configuration property.
  -h, --help                        Show this help message and exit.
  -V, --version                     Print version information and exit.
```
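The `-o` flag in particular accepts the same offset forms as kcat; for example (broker, topic, and registry URL are placeholders):
```bash
# Start from the 1000 most recent records in each partition (relative offset from end)
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -o -1000

# Start from a point in time, given as a timestamp in ms since the epoch
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -o @1693526400000
```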
kwack shares many command-line options with [kcat](https://github.com/edenhill/kcat) (formerly kafkacat).
In addition, a file containing configuration properties can be used. The available configuration properties
are listed [here](https://github.com/rayokota/kwack/blob/master/src/main/java/io/kcache/kwack/KwackConfig.java).
Simply modify `config/kwack.properties` to point to an existing Kafka broker and Schema
Registry. Then run the following:
```bash
# Run with properties file
$ bin/kwack -F config/kwack.properties
```
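For illustration, such a file might look like the following; the property names here are assumptions modeled on the command-line options, so consult `KwackConfig.java` for the authoritative keys:
```bash
$ cat config/kwack.properties
# Hypothetical property names -- check KwackConfig.java for the real keys
bootstrap.servers=mybroker:9092
topics=mytopic,mytopic2
schema.registry.url=http://schema-registry-url:8081
```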
Starting kwack is as easy as specifying a Kafka broker, topic, and Schema Registry URL:
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081
Welcome to kwack!
Enter "!help" for usage hints.
___(.)>
~~~~~~\___)~~~~~~
jdbc:duckdb::memory:>
```
When kwack starts, it will enter interactive mode, where you can enter SQL queries to analyze Kafka data.
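For example, at the sqlline prompt (table names match topic names; `mytopic` is a placeholder):
```bash
jdbc:duckdb::memory:> SHOW TABLES;
jdbc:duckdb::memory:> DESCRIBE mytopic;
jdbc:duckdb::memory:> SELECT count(*) FROM mytopic;
```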
For non-interactive mode, specify a query on the command line:
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic"
```
The output of the above command will be in JSON, and so can be piped to other commands such as `jq`, as in the sketch below.
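For example, to count rows per value of a column (assuming the topic's schema has a column named `col1`):
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 \
    -q "SELECT col1, count(*) AS cnt FROM mytopic GROUP BY col1" | jq .
```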
One can load multiple topics, and then perform a query that joins the resulting tables on a common
column:
```bash
$ bin/kwack -b mybroker -t mytopic -t mytopic2 -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic JOIN mytopic2 USING (col1)"
```
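Every table also carries the default row-attribute columns (rowkey, ksi, vsi, par, off, ts, hdr), which can make `SELECT *` noisy in joins. A sketch that suppresses them, assuming `-a none` disables all row attributes as the help text indicates:
```bash
$ bin/kwack -b mybroker -t mytopic -t mytopic2 -r http://schema-registry-url:8081 \
    -a none -q "SELECT * FROM mytopic JOIN mytopic2 USING (col1)"
```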
One can convert Kafka data into Parquet format by using the COPY command in DuckDB:
```bash
$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "COPY mytopic to 'mytopic.parquet' (FORMAT 'parquet')"
```
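The result is an ordinary Parquet file; for instance, it can be inspected with the standalone DuckDB CLI (assuming it is installed):
```bash
# Query the exported file directly, outside of kwack
$ duckdb -c "SELECT count(*) FROM 'mytopic.parquet'"
```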
If not using Confluent Schema Registry, one can pass an external schema:
```bash
$ bin/kwack -b mybroker -t mytopic -v mytopic=proto:@/path/to/myschema.proto
```
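For illustration, such an external schema file might look like this (the message and field names are made up):
```bash
$ cat /path/to/myschema.proto
syntax = "proto3";

message MyRecord {
  string col1 = 1;
  int64 amount = 2;
}
```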
For a given schema, kwack creates DuckDB columns by mapping Avro, Protobuf,
or JSON Schema types to DuckDB types as follows:
|Avro | Protobuf | JSON Schema | DuckDB |
|-----|----------|-------------|--------|
|boolean | boolean | boolean | BOOLEAN |
|int | int32, sint32, sfixed32 || INTEGER |
|| uint32, fixed32 || UINTEGER |
|long | int64, sint64, sfixed64 | integer | BIGINT |
|| uint64, fixed64 || UBIGINT |
|float | float || FLOAT |
|double | double | number | DOUBLE |
|string | string | string | VARCHAR |
|bytes, fixed | bytes || BLOB |
|enum | enum| enum | ENUM |
|record | message | object | STRUCT |
|array | repeated | array | LIST |
|map | map || MAP |
|union | oneof | oneOf,anyOf | UNION |
|decimal | confluent.type.Decimal || DECIMAL |
|date | google.type.Date || DATE |
|time-millis, time-micros | google.type.TimeOfDay || TIME |
|timestamp-millis ||| TIMESTAMP_MS |
|timestamp-micros ||| TIMESTAMP |
|timestamp-nanos | google.protobuf.Timestamp || TIMESTAMP_NS |
|duration | google.protobuf.Duration || INTERVAL |
|uuid ||| UUID |
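As an example of this mapping, consider a hypothetical Avro value schema:
```bash
$ cat order.avsc
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal",
                                "precision": 10, "scale": 2}},
    {"name": "created", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```
Per the table above, these fields would become DuckDB columns of type VARCHAR, DECIMAL, and TIMESTAMP_MS, alongside any row-attribute columns.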
For more on how to use kwack, see this [blog](https://yokota.blog/2024/07/11/in-memory-analytics-for-kafka-using-duckdb/).