https://github.com/voutilad/redpanda-flight-rs

An Apache Arrow Flight proxy for Redpanda
https://github.com/voutilad/redpanda-flight-rs

apache-arrow kafka redpanda

Last synced: 24 days ago
JSON representation

An Apache Arrow Flight proxy for Redpanda

Host: GitHub
URL: https://github.com/voutilad/redpanda-flight-rs
Owner: voutilad
License: apache-2.0
Created: 2023-10-11T02:56:52.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-11-01T18:10:37.000Z (over 2 years ago)
Last Synced: 2025-03-05T02:44:49.659Z (about 1 year ago)
Topics: apache-arrow, kafka, redpanda
Language: Rust
Homepage:
Size: 225 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Redpanda Flight

> "There is an art, it says, or rather, a knack to flying. The knack lies in learning how to throw yourself at the
> ground and miss."
>
> — Douglas Adams, _The Hitchhiker's Guide to the Galaxy_

This is a prototype of an [Apache Arrow Flight](https://arrow.apache.org/docs/format/Flight.html) proxy for Redpanda.
Put simply, it exposes a topic as a Flight stream, consumable via an Apache Arrow Flight client.

## Why?

A few reasons...

* I needed to learn about how the Rust [rdkafka](https://crates.io/crates/rdkafka) crate works and relearn a bit I've
forgotten about Rust in general.
* I'm a big believer in the power of Flight, having built [abstractions](https://github.com/neo4j-field/neo4j-arrow) for
other stacks in the past.
* I needed to learn more about how the Schema Registry works in the Kafka world.

## Who might ever use this?

Honestly, not sure, but if someone needs to quickly build a service that hydrates state from a topic _en masse_, Flight
gives a quick and efficient way to fetch all that data without dealing with all the Kafka API Consumer nonsense.

## Ok, I just want to try it.

Sure, why not. It does work with Redpanda Cloud if you're lazy.

### Building

Thankfully Rust's tooling is top-notch. Just make sure you've got Rust and Cargo (tested with 1.72 and newer).

```
$ cargo build --release
```

### Running

All config is pretty bare bones and via environment args at the moment:

* `REDPANDA_BROKERS` - broker seed address(es) (default: `localhost:9092`)
* `REDPANDA_SCHEMA_TOPIC` - topic for the schema registry (default: `_schemas`)
* `REDPANDA_SASL_USERNAME` - username for the proxy's connection to Redpanda (optional)
* `REDPANDA_SASL_PASSWORD` - password for the proxy's connection to Redpanda (optional)
* `REDPANDA_SASL_MECHANISM` - sasl mechanism for the proxy's connection to Redpanda (optional)
* `REDPANDA_TLS_ENABLED` - whether to use TLS for the connection to Redpanda (default: no)

You can change the log level by setting `RUST_LOG` to `debug` or `info`, etc.

### Caveats!

* do **NOT** use in production
- no front-end TLS support yet, so client's talk unencrypted to the proxy
- while the client's pass their Kafka login info at request time to the proxy, need to audit areas to minimize
risk/exposure. At any given time the proxy process will have the user's password in memory somewhere.
- basic auth is used only for `do_get` api calls (fetching data) and not all api calls yet
* expects a topic with an Avro schema in the Schema Registry
- not all Avro types are supported: for now use a `record` with basic primitive fields
- nullable types are approximated via Avro `union` fields, but assume a simple `["null", ]` value format.
- datetime/timestamp stuff hasn't been tested
- some logical types may be mapped to simpler Arrow types (e.g. uuid -> utf8)
* no performance tuning
- still need to learn more about how to build an efficient async stream
- batching has to occur, but not sure best way to do it
* generates lots of consumer groups
- while not using "subscribe", the Rust rdkafka library I'm using doesn't expose a direct way to consume data
without causing the creation of a consumer group

### Simple PyArrow Example

Make sure you've created a topic and that topic has an Avro schema registered properly.

An example schema:

```json
{
"type": "record",
"name": "flight_test",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "age",
"type": "int",
"default": -1
},
{
"name": "nullable",
"type": [
"null",
"float"
]
}
]
}
```

You can generate data using Python and the `confluent-kafka` library. _(Sample code coming.)_

For the consuming client side, assuming you `pip install`'d `pyarrow` ideally in a virtualenv:

```python
#!/usr/bin/env python3
import pyarrow.flight
import base64
import os

# we use basic auth for now
username = os.environ.get("REDPANDA_SASL_USERNAME")
password = os.environ.get("REDPANDA_SASL_PASSWORD")
auth_token = b"Basic " + base64.b64encode(f"{username}:{password}".encode())

# connecting to localhost without TLS(!!)
client = pyarrow.flight.FlightClient("grpc+tcp://127.0.0.1:9999")

# setup http headers for auth
opts = pyarrow.flight.FlightCallOptions(
headers=[(b"authorization", auth_token)],
timeout=10,
)

# list flights/topics avialable
print("Fetching flight infos:")
for info in client.list_flights(options=opts):
print(info)
print("----------------------\n\n")

# stream data from a topic partition into a PyArrow Table
# note: this may take a few seconds as it streams and assembles batches
ticket = pyarrow.flight.Ticket(b"flight_test/0")
print(f"Fetching flight with ticket {ticket}")
table = client.do_get(ticket, options=opts).read_all()
print(table)
```

# Copyright & License

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/voutilad/redpanda-flight-rs

Awesome Lists containing this project

README