An open API service indexing awesome lists of open source software.

https://github.com/danhenriquex/real-time-streaming

End to End Data Engineering using PySpark, Kafka, Cassandra and Docker
https://github.com/danhenriquex/real-time-streaming

cassandra dbeaver docker kafka llm pyspark

Last synced: 5 months ago
JSON representation

End to End Data Engineering using PySpark, Kafka, Cassandra and Docker

Awesome Lists containing this project

README

          

πŸ€– Data Project


Learning Data Engineering.


Overview β€’
Technologies and Tools Used β€’
Project Structure β€’
Getting Started β€’
Running the Pipeline
What I Learned


🚧 Data Engineering Project πŸš€ Finished 🚧


wakatime

### Overview


This project demonstrates a data processing pipeline using Kafka, PySpark, Docker, Cassandra, and OpenAI. The goal was to create a system for real-time data streaming and processing, integrating various technologies to build a scalable and efficient architecture.

### Features

- **Kafka**: A distributed streaming platform used for building real-time data pipelines and streaming applications.
- **PySpark**: The Python API for Apache Spark, used for large-scale data processing and analytics.
- **Docker**: A platform for automating containerized applications, ensuring consistent environments across development, testing, and production.
- **Cassandra**: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.
- **OpenAI**: Used for integrating advanced language models into the pipeline.
- **Python**: The primary programming language used for scripting and application logic.

### Project Structure

```bash
β”œβ”€β”€ jobs/requirements.txt
β”œβ”€β”€ jobs/spark-consumer.py
β”œβ”€β”€ .env.example
β”œβ”€β”€ constants.py
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ main.py
```

### Scripts Overview

- `jobs/requirements.txt`: Lists the Python dependencies required for the Spark consumer job.
- `jobs/spark-consumer.py`: Contains the code for consuming data from Kafka and processing it using PySpark.
- `.env.example`: An example environment variable file to configure your local environment.
- `constants.py`: Defines constants used throughout the project.
- `docker-compose.yml`: Defines and runs multi-container Docker applications, configuring services for Kafka, Cassandra, and other components.
- `main.py`: The main script to initialize and run the application.



### Getting Started

To get started with this project, follow these steps:

1. **Clone the Repository:**

```bash
git clone
cd

3. **Setup Environment:**

```bash
# Install Python version
pyenv install 3.10.12

# Create a virtual environment
pyenv virtualenv 3.10.12

# Activate the environment
pyenv activate

# Install dependencies
pip install -r jobs/requirements.txt
```

2. **Run Docker Containers**
```bash
docker compose up -d
3. **Setup database**
```bash
spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 spark-consumer.pydocker exec -it realestatedataengineering-spark-master-1
4. **Execute the main script**
```bash
python main.py

### What I learned


- Kafka Integration: Gained experience in using Kafka for real-time data streaming and message brokering.
- PySpark: Developed skills in large-scale data processing and analytics using PySpark.
- Docker: Learned to containerize applications and manage multi-container setups with Docker Compose.
- Cassandra: Worked with Cassandra for scalable and distributed database solutions.
- OpenAI API: Integrated OpenAI’s language models for advanced text processing and analysis.

### Author

---

Fonte: https://www.youtube.com/@CodeWithYu/videos