https://github.com/undisputed-jay/streaming-data-from-reddit-using-kafka-spark-and-mongodb

A data pipeline that streams Reddit comments from the 'Politics' subreddit using Kafka and Apache Spark. Processed data is stored in MongoDB for real-time analysis and management.

Host: GitHub
URL: https://github.com/undisputed-jay/streaming-data-from-reddit-using-kafka-spark-and-mongodb
Owner: Undisputed-jay
Created: 2024-12-09T00:02:24.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-12-09T00:32:39.000Z (10 months ago)
Last Synced: 2025-02-06T09:47:52.211Z (8 months ago)
Topics: apache-spark, big-data, data-engineering, etl-pipeline, kafka, mongodb, mongodb-atlas, pyspark, real-time-streaming, redditapi, streaming-analytics
Language: Python
Homepage: https://www.reddit.com/
Size: 399 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
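
As a rough illustration of the pipeline described above, the sketch below shows only the message-shaping step: a comment is serialized to JSON bytes (a common shape for a Kafka message value) and then decoded and flattened into a record suitable for a document store. This is not the repository's code. The function names, the field names (`id`, `subreddit`, `author`, `body`, `created_utc`), and the sample record are all assumptions made for illustration, and Kafka, Spark, and MongoDB are replaced here by plain Python functions so the example runs with the standard library alone.

```python
import json
from datetime import datetime, timezone


def serialize_comment(comment: dict) -> bytes:
    """Encode a comment as UTF-8 JSON, as a producer might before sending it."""
    return json.dumps(comment).encode("utf-8")


def to_document(message_value: bytes) -> dict:
    """Decode a JSON message and shape it into a flat, storage-ready record.

    The field names are hypothetical; the real pipeline's schema may differ.
    """
    raw = json.loads(message_value.decode("utf-8"))
    created = datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc)
    return {
        "_id": raw["id"],                   # reuse the comment id as the key
        "subreddit": raw["subreddit"],
        "author": raw.get("author"),        # may be missing for deleted users
        "body": raw["body"].strip(),        # trim surrounding whitespace
        "created_at": created.isoformat(),  # epoch seconds -> ISO 8601 (UTC)
    }


# Made-up sample record standing in for one streamed comment.
sample = {
    "id": "abc123",
    "subreddit": "politics",
    "author": "example_user",
    "body": "  Example comment text.  ",
    "created_utc": 0,
}

doc = to_document(serialize_comment(sample))
print(doc)
```

In a real deployment the serialized bytes would travel through a Kafka topic, the shaping step would typically run inside a Spark streaming job, and the resulting record would be written to a MongoDB collection; those pieces are deliberately left out of this sketch.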
