https://github.com/danhenriquex/real-time-streaming
End to End Data Engineering using PySpark, Kafka, Cassandra and Docker
https://github.com/danhenriquex/real-time-streaming
cassandra dbeaver docker kafka llm pyspark
Last synced: 5 months ago
JSON representation
End to End Data Engineering using PySpark, Kafka, Cassandra and Docker
- Host: GitHub
- URL: https://github.com/danhenriquex/real-time-streaming
- Owner: danhenriquex
- Created: 2024-08-23T01:21:39.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-18T22:19:14.000Z (over 1 year ago)
- Last Synced: 2025-03-18T07:43:13.589Z (about 1 year ago)
- Topics: cassandra, dbeaver, docker, kafka, llm, pyspark
- Language: Python
- Homepage:
- Size: 73.2 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
π€ Data Project
Learning Data Engineering.
Overview β’
Technologies and Tools Used β’
Project Structure β’
Getting Started β’
Running the Pipeline
What I Learned
π§ Data Engineering Project π Finished π§
### Overview
This project demonstrates a data processing pipeline using Kafka, PySpark, Docker, Cassandra, and OpenAI. The goal was to create a system for real-time data streaming and processing, integrating various technologies to build a scalable and efficient architecture.
### Features
- **Kafka**: A distributed streaming platform used for building real-time data pipelines and streaming applications.
- **PySpark**: The Python API for Apache Spark, used for large-scale data processing and analytics.
- **Docker**: A platform for automating containerized applications, ensuring consistent environments across development, testing, and production.
- **Cassandra**: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.
- **OpenAI**: Used for integrating advanced language models into the pipeline.
- **Python**: The primary programming language used for scripting and application logic.
### Project Structure
```bash
βββ jobs/requirements.txt
βββ jobs/spark-consumer.py
βββ .env.example
βββ constants.py
βββ docker-compose.yml
βββ main.py
```
### Scripts Overview
- `jobs/requirements.txt`: Lists the Python dependencies required for the Spark consumer job.
- `jobs/spark-consumer.py`: Contains the code for consuming data from Kafka and processing it using PySpark.
- `.env.example`: An example environment variable file to configure your local environment.
- `constants.py`: Defines constants used throughout the project.
- `docker-compose.yml`: Defines and runs multi-container Docker applications, configuring services for Kafka, Cassandra, and other components.
- `main.py`: The main script to initialize and run the application.
### Getting Started
To get started with this project, follow these steps:
1. **Clone the Repository:**
```bash
git clone
cd
3. **Setup Environment:**
```bash
# Install Python version
pyenv install 3.10.12
# Create a virtual environment
pyenv virtualenv 3.10.12
# Activate the environment
pyenv activate
# Install dependencies
pip install -r jobs/requirements.txt
```
2. **Run Docker Containers**
```bash
docker compose up -d
3. **Setup database**
```bash
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 spark-consumer.pydocker exec -it realestatedataengineering-spark-master-1
4. **Execute the main script**
```bash
python main.py
### What I learned
- Kafka Integration: Gained experience in using Kafka for real-time data streaming and message brokering.
- PySpark: Developed skills in large-scale data processing and analytics using PySpark.
- Docker: Learned to containerize applications and manage multi-container setups with Docker Compose.
- Cassandra: Worked with Cassandra for scalable and distributed database solutions.
- OpenAI API: Integrated OpenAIβs language models for advanced text processing and analysis.
### Author
---
Fonte: https://www.youtube.com/@CodeWithYu/videos