https://github.com/spektom/kafka-dumper
Dumps events stored in Kafka to any Hadoop-supported file system using Spark Streaming
- Host: GitHub
- URL: https://github.com/spektom/kafka-dumper
- Owner: spektom
- License: apache-2.0
- Created: 2018-05-14T06:52:50.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-05-14T07:16:10.000Z (about 7 years ago)
- Last Synced: 2025-01-20T16:53:28.442Z (4 months ago)
- Topics: apache-spark, kafka, spark-kafka, spark-kafka-integration, spark-streaming
- Language: Scala
- Size: 13.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
kafka-dumper
============

Dumps events stored in Kafka to any Hadoop-supported file system using Spark Streaming.
The process reads events in slices (the slice size is defined by the `interval` argument), then
writes each slice to a text file using a predefined output file format.

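To make the relationship between `interval` and slices concrete, here is a minimal Spark Streaming setup (an illustrative sketch only; the app name and master are assumptions, not values taken from this project):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The batch interval below is what the `interval` argument controls:
    // every `interval` seconds Spark cuts a new slice of consumed events.
    val conf = new SparkConf().setAppName("kafka-dumper").setMaster("local[*]") // hypothetical local setup
    val ssc = new StreamingContext(conf, Seconds(30)) // 30 secs is the documented default
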
## Output format

The output format is defined as follows:

    topic=<topic>/dt=YYYYMMddHH/<partition>-<offset>.gz
An event's Kafka internal timestamp determines which hourly `dt` bucket it is
written to.

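For illustration, a bucket path could be derived from a record timestamp roughly like this (a sketch only; `BucketNaming` and `bucketPath` are hypothetical names, not classes from this project):

    import java.time.format.DateTimeFormatter
    import java.time.{Instant, ZoneOffset}

    // Hypothetical sketch: derive the hourly dt=YYYYMMddHH bucket
    // from a record's Kafka timestamp (milliseconds since the epoch).
    object BucketNaming {
      private val hourFormat =
        DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC)

      def bucketPath(topic: String, timestampMs: Long): String =
        s"topic=$topic/dt=${hourFormat.format(Instant.ofEpochMilli(timestampMs))}"
    }
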
To tweak the format, see the `com.github.spektom.kafka.dumper.FileNamingStrategy` class.

## Delivery semantics

Consumed offsets are stored in Kafka itself, immediately after events have been written to files.
If the process crashes in between, the next time it starts it will resume consuming events from
the previously committed offsets, and the target files will be overwritten. Because the writes
are idempotent, this yields exactly-once semantics.

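For reference, this commit-after-write ordering in Spark's Kafka 0.10 integration looks roughly like the following (a sketch, not the project's actual code; `saveSlice` is a hypothetical stand-in for the file-writing step):

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    // Commit-after-write: offsets go back to Kafka only once the slice is on disk.
    def dumpAndCommit(stream: InputDStream[ConsumerRecord[String, String]],
                      saveSlice: RDD[ConsumerRecord[String, String]] => Unit): Unit =
      stream.foreachRDD { rdd =>
        // Capture the offset ranges covered by this slice.
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // 1. Write the slice; re-running after a crash overwrites the same
        //    target files, which is what makes the write idempotent.
        saveSlice(rdd)
        // 2. Only then commit the consumed offsets back to Kafka.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
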
## Supported event formats

Any text-based event format is supported: JSON, CSV, etc.
To support binary formats, the `com.github.spektom.kafka.dumper.KafkaStream`
and `com.github.spektom.kafka.dumper.RecordSaver` classes must be tweaked.

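As one hypothetical illustration of such a tweak (an assumption, not the project's approach), binary payloads could be consumed as byte arrays and Base64-encoded before reaching the text writer:

    import java.util.Base64
    import org.apache.kafka.clients.consumer.ConsumerRecord

    // Hypothetical adapter: turn a binary Kafka record into a text line
    // (Base64) so a text-oriented file writer can handle it.
    def toTextLine(record: ConsumerRecord[Array[Byte], Array[Byte]]): String =
      Base64.getEncoder.encodeToString(record.value())
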
## Building and running

Build the shaded (uber) JAR file:

    mvn package
Usage:

    spark-submit kafka-dumper_2.11-0.0.1-uberjar.jar [OPTIONS]
Where options are:

    --brokers   Comma-separated list of Kafka bootstrap servers
    --topics    Comma-separated list of Kafka topics
    --group     Kafka consumer group
    --interval  Spark batch interval in seconds (default: 30 secs)
    --path      Target destination path under which files will be saved

For example, the following command will read events from a local Kafka topic called `events`,
and write them to local files under the `/tmp/datalake` directory every 10 seconds:

    ./spark-submit kafka-dumper_2.11-0.0.1-uberjar.jar \
        --brokers localhost:9092 --topics events \
        --path /tmp/datalake --group kafka-dumper \
        --interval 10