https://github.com/ignalina/welder
A Kafka-consuming Spark Streaming job that ingests data at maximum parallel speed.
- Host: GitHub
- URL: https://github.com/ignalina/welder
- Owner: Ignalina
- License: MIT
- Created: 2021-10-10T18:47:14.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-04-25T21:11:05.000Z (about 3 years ago)
- Last Synced: 2025-02-08T14:29:09.582Z (over 1 year ago)
- Language: Java
- Size: 1.7 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Welder 2022
Experiments with different streaming jobs that read Avro from Kafka and write to targets such as:
* Append only: Hive/Parquet
* Full sync: Iceberg/Parquet
* Full sync: Delta
* Full sync: Hudi
The goal is to learn how to read from Kafka with multiple partitions/offsets and spread the work across multiple workers.
The name Welder is the opposite of the Shredder: the Welder makes the data "whole again" (hopefully).
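
As a rough illustration of spreading partitions/offsets across workers, here is a minimal sketch in plain Java. It is not taken from the Welder code base: the class `OffsetRangeSplitter`, its `OffsetRange` type, and the round-robin assignment are all hypothetical names and choices made for this example. Spark's Kafka source exposes a `minPartitions` option for a related purpose, but whether Welder uses it is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetRangeSplitter {

    /** Half-open range [from, until) of offsets within one Kafka partition (hypothetical type). */
    static final class OffsetRange {
        final int partition;
        final long from;
        final long until;

        OffsetRange(int partition, long from, long until) {
            if (until < from) throw new IllegalArgumentException("until < from");
            this.partition = partition;
            this.from = from;
            this.until = until;
        }

        long size() { return until - from; }

        @Override
        public String toString() {
            return "p" + partition + "[" + from + "," + until + ")";
        }
    }

    /** Splits one partition's range into at most `pieces` contiguous chunks of near-equal size. */
    static List<OffsetRange> split(OffsetRange range, int pieces) {
        if (pieces <= 0) throw new IllegalArgumentException("pieces must be positive");
        List<OffsetRange> chunks = new ArrayList<>();
        long base = range.size() / pieces;
        long extra = range.size() % pieces; // the first `extra` chunks get one more offset
        long start = range.from;
        for (int i = 0; i < pieces; i++) {
            long len = base + (i < extra ? 1 : 0);
            if (len == 0) continue; // fewer offsets than pieces: skip empty chunks
            chunks.add(new OffsetRange(range.partition, start, start + len));
            start += len;
        }
        return chunks;
    }

    /** Assigns chunks to workers round-robin; returns one list of chunks per worker. */
    static List<List<OffsetRange>> assign(List<OffsetRange> chunks, int workers) {
        if (workers <= 0) throw new IllegalArgumentException("workers must be positive");
        List<List<OffsetRange>> plan = new ArrayList<>();
        for (int w = 0; w < workers; w++) plan.add(new ArrayList<>());
        for (int i = 0; i < chunks.size(); i++) plan.get(i % workers).add(chunks.get(i));
        return plan;
    }

    public static void main(String[] args) {
        // Two partitions with made-up offset ranges, each cut in two, spread over three workers.
        List<OffsetRange> chunks = new ArrayList<>();
        chunks.addAll(split(new OffsetRange(0, 0, 1_000), 2));
        chunks.addAll(split(new OffsetRange(1, 500, 1_700), 2));
        List<List<OffsetRange>> plan = assign(chunks, 3);
        for (int w = 0; w < plan.size(); w++) {
            System.out.println("worker " + w + " -> " + plan.get(w));
        }
    }
}
```

Splitting a partition into several offset ranges lets more workers than partitions read in parallel, at the cost of more consumer fetches per batch.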
# Screenshots
A producer called the Shredder is started, reading data files with fixed-size columns (30 columns in this example, 2 GB per file).
```
Time spend in total : 2.471591492s parsing 148804290 lines from 2620609413 bytes
Troughput bytes/s total : 1011.17MB /s
Troughput lines/s total : 57.42M Lines/s
Troughput lines/s toAvro: 4.27M Lines/s
Time spent toReadChunks : 0.7911964744166666 s
Time spent toAvro : 33.271536043333334 s
Time spent toKafka : 19.226326717666666 s
Time spent DoneKafka : 8.036e-06 s
```
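
The throughput figures above can be re-derived from the raw counts and timings. The sketch below is only a sanity check, not part of the Shredder; the class `ThroughputCheck` and its helper are made up for this example. It assumes that "MB" and "M" in the output mean multiples of 2^20 (1,048,576), which is the interpretation under which the numbers line up. Note that the toAvro time (33.27 s) is larger than the total wall time (2.47 s), which suggests, as an assumption, that per-stage times are summed over parallel workers.

```java
import java.util.Locale;

public class ThroughputCheck {

    /** Assumed unit: the output's "MB" and "M" are taken to mean 2^20. */
    static final double MEBI = 1024.0 * 1024.0;

    /** Rate per second, expressed in units of 2^20. */
    static double perSecondInMebi(double count, double seconds) {
        return count / seconds / MEBI;
    }

    public static void main(String[] args) {
        // Values copied from the Shredder output above.
        long bytes = 2_620_609_413L;
        long lines = 148_804_290L;
        double totalSeconds = 2.471591492;
        double toAvroSeconds = 33.271536043333334;

        System.out.println(String.format(Locale.ROOT,
                "bytes/s total  : %.2f", perSecondInMebi(bytes, totalSeconds)));
        System.out.println(String.format(Locale.ROOT,
                "lines/s total  : %.2f", perSecondInMebi(lines, totalSeconds)));
        System.out.println(String.format(Locale.ROOT,
                "lines/s toAvro : %.2f", perSecondInMebi(lines, toAvroSeconds)));
    }
}
```

With a decimal divisor (10^6) instead, the total byte rate would come out near 1060 rather than 1011.17, so the binary interpretation appears to be the one the tool uses.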



