https://github.com/najibadan/firehose_etl
- Host: GitHub
- URL: https://github.com/najibadan/firehose_etl
- Owner: NajibAdan
- Created: 2025-01-26T09:20:06.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-01-26T09:55:25.000Z (8 months ago)
- Last Synced: 2025-01-26T10:28:36.073Z (8 months ago)
- Language: Python
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# About
This repo contains the code needed to read from Bluesky's firehose websocket and send the data to Kafka.

# How to run
1. Run `make restart` (or `docker compose up -d` if you don't have make installed) to set up Kafka, Zookeeper, and MinIO. MinIO serves as a local S3 substitute for testing.
2. Once Kafka is up and running, run `python datagen/firehose.py` to start pushing data to Kafka.
3. Run `python connect/batchstream.py` to launch a PySpark job that connects to Kafka and writes the data as Parquet, partitioned by date and hour.
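
To give a rough idea of what "partitioned by date and hour" means for the output layout, here is a minimal standard-library sketch. It is not taken from `connect/batchstream.py`: the function name `partition_path`, the `output` root, and the `date`/`hour` key names are assumptions chosen for illustration, and the real job may name or format its partitions differently.

```python
from datetime import datetime, timezone


def partition_path(event_time: datetime, root: str = "output") -> str:
    """Return a Hive-style partition directory for one event.

    Hypothetical helper: the key names `date` and `hour` are assumed,
    not read from the actual PySpark job.
    """
    if event_time.tzinfo is None:
        # Refuse naive timestamps so events are never bucketed by local time.
        raise ValueError("event_time must be timezone-aware")
    ts = event_time.astimezone(timezone.utc)
    # One directory per calendar date, one subdirectory per hour of that date.
    return f"{root}/date={ts:%Y-%m-%d}/hour={ts:%H}"


if __name__ == "__main__":
    example = datetime(2025, 1, 26, 9, 20, 6, tzinfo=timezone.utc)
    print(partition_path(example))
```

With a layout like this, a query for a single hour only needs to read one directory instead of scanning the whole dataset.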