https://github.com/ignalina/welder

A Kafka-consuming Spark Streaming job that ingests data at maximum parallel speed.

Host: GitHub
URL: https://github.com/ignalina/welder
Owner: Ignalina
License: MIT
Created: 2021-10-10T18:47:14.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-04-25T21:11:05.000Z (about 3 years ago)
Last Synced: 2025-02-08T14:29:09.582Z (over 1 year ago)
Language: Java
Size: 1.7 MB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Welder 2022
Trying out different streaming jobs that read Avro from Kafka and write to targets such as:
* Append only: Hive/Parquet
* Full sync: Iceberg/Parquet
* Full sync: Delta
* Full sync: Hudi

The goal is to learn how to read from Kafka with multiple partitions/offsets and spread the work across multiple workers.
The name Welder is the opposite of the Shredder: the Welder makes the data "whole again" (hopefully).
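
As a rough illustration of spreading partition/offset work across workers, the sketch below assigns Kafka offset ranges to a fixed number of workers so that each gets a similar number of messages. It is not code from this repository: `OffsetRangePlanner`, `OffsetRange`, `assign` and `totalMessages` are hypothetical names, and the greedy "largest range to the least-loaded worker" strategy is only one possible choice. Spark's direct Kafka integration typically does something simpler by default, creating one task per Kafka partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not taken from the Welder code base. */
final class OffsetRangePlanner {

    /** Half-open offset range [fromOffset, untilOffset) of one Kafka partition. */
    static final class OffsetRange {
        final int partition;
        final long fromOffset;   // inclusive
        final long untilOffset;  // exclusive

        OffsetRange(int partition, long fromOffset, long untilOffset) {
            this.partition = partition;
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }

        long count() {
            return untilOffset - fromOffset;
        }
    }

    /** Greedy assignment: largest range first, always to the least-loaded worker. */
    static List<List<OffsetRange>> assign(List<OffsetRange> ranges, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        List<List<OffsetRange>> plan = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            plan.add(new ArrayList<>());
        }
        long[] load = new long[workers];

        // Sort a copy by message count, descending, so big ranges are placed first.
        List<OffsetRange> sorted = new ArrayList<>(ranges);
        sorted.sort((a, b) -> Long.compare(b.count(), a.count()));

        for (OffsetRange range : sorted) {
            int target = 0;
            for (int w = 1; w < workers; w++) {
                if (load[w] < load[target]) {
                    target = w;
                }
            }
            plan.get(target).add(range);
            load[target] += range.count();
        }
        return plan;
    }

    /** Total number of messages covered by the ranges assigned to one worker. */
    static long totalMessages(List<OffsetRange> assigned) {
        long sum = 0;
        for (OffsetRange range : assigned) {
            sum += range.count();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up offsets for four partitions of one topic.
        List<OffsetRange> ranges = Arrays.asList(
                new OffsetRange(0, 0, 100),
                new OffsetRange(1, 0, 60),
                new OffsetRange(2, 0, 50),
                new OffsetRange(3, 0, 10));
        List<List<OffsetRange>> plan = assign(ranges, 2);
        for (int w = 0; w < plan.size(); w++) {
            System.out.println("worker " + w + " handles " + totalMessages(plan.get(w)) + " messages");
        }
    }
}
```

With more partitions than workers, a plan like this keeps every worker busy; with fewer partitions than workers, some workers stay idle unless ranges are split further.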

# Screenshots
A producer called the Shredder is started, reading fixed-width column data files (30 columns in this example, 2 GB per file).

Time spend in total : 2.471591492s parsing 148804290 lines from 2620609413 bytes
Troughput bytes/s total : 1011.17MB /s
Troughput lines/s total : 57.42M Lines/s
Troughput lines/s toAvro: 4.27M Lines/s
Time spent toReadChunks : 0.7911964744166666 s
Time spent toAvro : 33.271536043333334 s
Time spent toKafka : 19.226326717666666 s
Time spent DoneKafka : 8.036e-06 s
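
The throughput figures above can be reproduced from the reported totals. The sketch below is an illustration, not the Shredder's own code; the class and method names are hypothetical. It assumes the report uses binary prefixes (1 MB = 2^20 bytes, 1 M = 2^20 lines), which is inferred from the numbers rather than stated anywhere in the source. The per-stage times exceed the total wall-clock time, which suggests they are summed over parallel workers, though that is also an inference.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the reported throughput figures. */
final class ThroughputReport {

    /** Assumed unit: the report seems to divide by 2^20 for both "MB" and "M". */
    static final double BINARY_MEGA = 1024.0 * 1024.0;

    /** Amount processed per second, expressed in units of 2^20. */
    static double megaPerSecond(long amount, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return amount / seconds / BINARY_MEGA;
    }

    public static void main(String[] args) {
        // Values copied from the report above.
        long bytes = 2_620_609_413L;
        long lines = 148_804_290L;
        double totalSeconds = 2.471591492;
        double toAvroSeconds = 33.271536043333334;

        System.out.printf(Locale.ROOT, "bytes/s total : %.2f MB/s%n",
                megaPerSecond(bytes, totalSeconds));
        System.out.printf(Locale.ROOT, "lines/s total : %.2f M lines/s%n",
                megaPerSecond(lines, totalSeconds));
        System.out.printf(Locale.ROOT, "lines/s toAvro: %.2f M lines/s%n",
                megaPerSecond(lines, toAvroSeconds));
    }
}
```

Under that assumption the three computed values round to 1011.17, 57.42 and 4.27, matching the report.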

![Screenshot](screenshots/spark_232_streaming_1.png)
![Screenshot](screenshots/spark_232_streaming_2.png)
![Screenshot](screenshots/spark_232_streaming_3.png)
![Screenshot](screenshots/spark_232_streaming_4.png)
