https://github.com/danhenriquex/real-time-streaming

End to End Data Engineering using PySpark, Kafka, Cassandra and Docker

cassandra dbeaver docker kafka llm pyspark

Host: GitHub
URL: https://github.com/danhenriquex/real-time-streaming
Owner: danhenriquex
Created: 2024-08-23T01:21:39.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-09-18T22:19:14.000Z (over 1 year ago)
Last Synced: 2025-03-18T07:43:13.589Z (about 1 year ago)
Topics: cassandra, dbeaver, docker, kafka, llm, pyspark
Language: Python
Homepage:
Size: 73.2 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

🤖 Data Project

Learning Data Engineering.

Overview • Technologies and Tools Used • Project Structure • Getting Started • Running the Pipeline • What I Learned

🚧 Data Engineering Project 🚀 Finished 🚧

### Overview

This project demonstrates a data processing pipeline using Kafka, PySpark, Docker, Cassandra, and OpenAI. The goal was to create a system for real-time data streaming and processing, integrating various technologies to build a scalable and efficient architecture.

### Features

- **Kafka**: A distributed streaming platform used for building real-time data pipelines and streaming applications.
- **PySpark**: The Python API for Apache Spark, used for large-scale data processing and analytics.
- **Docker**: A platform for building and running containerized applications, ensuring consistent environments across development, testing, and production.
- **Cassandra**: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.
- **OpenAI**: Used for integrating advanced language models into the pipeline.
- **Python**: The primary programming language used for scripting and application logic.
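
As a rough illustration of how Python and Kafka meet in a pipeline like this one, the sketch below serializes a record to JSON bytes before publishing it to a topic. The `Listing` fields, the `listings` topic name, the `localhost:9092` broker address, and the use of the `kafka-python` client are all assumptions made for illustration; the project's actual producer code lives in `main.py` and may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Listing:
    """Hypothetical record shape; the real fields are defined by the project."""
    listing_id: str
    title: str
    price: float


def encode(record: Listing) -> bytes:
    """Serialize a record to UTF-8 JSON, suitable as a Kafka message value."""
    return json.dumps(asdict(record)).encode("utf-8")


def decode(payload: bytes) -> Listing:
    """Rebuild the record from a message value on the consumer side."""
    return Listing(**json.loads(payload.decode("utf-8")))


def publish(record: Listing, topic: str = "listings",
            servers: str = "localhost:9092") -> None:
    """Send one record to Kafka (assumes kafka-python and a running broker)."""
    # Imported lazily so encode/decode can be used without Kafka installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers)
    producer.send(topic, value=encode(record))
    producer.flush()  # block until the message has actually been sent
```

Keeping serialization separate from the network call makes the message format easy to check without a broker running.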

### Project Structure

```bash
├── jobs/requirements.txt
├── jobs/spark-consumer.py
├── .env.example
├── constants.py
├── docker-compose.yml
├── main.py
```

### Scripts Overview

- `jobs/requirements.txt`: Lists the Python dependencies required for the Spark consumer job.
- `jobs/spark-consumer.py`: Contains the code for consuming data from Kafka and processing it using PySpark.
- `.env.example`: An example environment variable file to configure your local environment.
- `constants.py`: Defines constants used throughout the project.
- `docker-compose.yml`: Defines the multi-container Docker setup, configuring services for Kafka, Cassandra, and other components.
- `main.py`: The main script to initialize and run the application.
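
To make the relationship between these files more concrete, here is a minimal sketch of how environment-driven constants could feed the Spark consumer. The variable names (`KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_TOPIC`, `CASSANDRA_HOST`), their defaults, and the function names are hypothetical; the actual contents of `.env.example`, `constants.py`, and `jobs/spark-consumer.py` are not shown here and may look different.

```python
import os


def load_settings(env=None) -> dict:
    """Read connection settings from the environment, with local defaults.

    All variable names and defaults here are assumptions for illustration.
    """
    env = os.environ if env is None else env
    return {
        "kafka_servers": env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        "kafka_topic": env.get("KAFKA_TOPIC", "listings"),
        "cassandra_host": env.get("CASSANDRA_HOST", "localhost"),
    }


def kafka_source_options(settings: dict) -> dict:
    """Build options for Spark's Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": settings["kafka_servers"],
        "subscribe": settings["kafka_topic"],
        "startingOffsets": "earliest",
    }


def build_stream(settings: dict):
    """Create a streaming DataFrame of raw Kafka messages.

    Needs PySpark plus the Kafka and Cassandra connector packages on the
    classpath, for example via `spark-submit --packages`.
    """
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("spark-consumer-sketch")
        .config("spark.cassandra.connection.host", settings["cassandra_host"])
        .getOrCreate()
    )
    return (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(settings))
        .load()
        .selectExpr("CAST(value AS STRING) AS json")
    )
```

Reading every address from the environment means the same code can run on the host (`localhost`) or inside the Compose network (service names) by changing only the `.env` file.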

### Getting Started

To get started with this project, follow these steps:

1. **Clone the Repository:**

```bash
git clone https://github.com/danhenriquex/real-time-streaming
cd real-time-streaming
```

2. **Set Up the Environment:**

```bash
# Install Python version
pyenv install 3.10.12

# Create a virtual environment (replace <env-name> with a name of your choice)
pyenv virtualenv 3.10.12 <env-name>

# Activate the environment
pyenv activate <env-name>

# Install dependencies
pip install -r jobs/requirements.txt
```

3. **Run Docker Containers:**

```bash
docker compose up -d
```

4. **Set Up the Database:**

```bash
# Submit the Spark consumer from inside the Spark master container
docker exec -it realestatedataengineering-spark-master-1 spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 spark-consumer.py
```

5. **Execute the Main Script:**

```bash
python main.py
```

### What I Learned

- Kafka Integration: Gained experience in using Kafka for real-time data streaming and message brokering.
- PySpark: Developed skills in large-scale data processing and analytics using PySpark.
- Docker: Learned to containerize applications and manage multi-container setups with Docker Compose.
- Cassandra: Worked with Cassandra for scalable and distributed database solutions.
- OpenAI API: Integrated OpenAI’s language models for advanced text processing and analysis.

### Author

---

Source: https://www.youtube.com/@CodeWithYu/videos
