https://github.com/airscholar/realtimestreamingengineering
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using a TCP/IP socket, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers every stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing into Elasticsearch.
- Host: GitHub
- URL: https://github.com/airscholar/realtimestreamingengineering
- Owner: airscholar
- Created: 2023-10-28T11:29:28.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-01-04T21:43:25.000Z (about 2 years ago)
- Last Synced: 2025-03-24T02:21:55.161Z (10 months ago)
- Topics: apache-spark, chatgpt, dataengineering, elasticsearch, kafka, openai-api, tcp-socket
- Language: Python
- Homepage: https://www.youtube.com/watch?v=ETdyFfYZaqU
- Size: 726 KB
- Stars: 34
- Watchers: 2
- Forks: 26
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch | End-to-End Data Engineering Project
## Table of Contents
- [Introduction](#introduction)
- [System Architecture](#system-architecture)
- [What You'll Learn](#what-youll-learn)
- [Technologies](#technologies)
- [Getting Started](#getting-started)
- [Watch the Video Tutorial](#watch-the-video-tutorial)
## Introduction
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using a TCP/IP socket, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers every stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing into Elasticsearch.
## System Architecture

The project is designed with the following components:
- **Data Source**: We use the `yelp.com` dataset for our pipeline.
- **TCP/IP Socket**: Streams the data over the network in chunks.
- **Apache Spark**: Processes the data with its master and worker nodes.
- **Confluent Kafka**: Our Kafka cluster in the cloud.
- **Control Center and Schema Registry**: Monitor the Kafka streams and manage their schemas.
- **Kafka Connect**: Sinks the data from Kafka into Elasticsearch.
- **Elasticsearch**: Indexes the data and serves queries.
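The socket step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names, port, and chunk size are assumptions.

```python
import json
import socket

def chunk_records(records, size):
    """Split a list of records into fixed-size chunks for streaming."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def serve_chunks(records, host="127.0.0.1", port=9999, size=2):
    """Stream JSON-encoded records to one TCP client, one record per line.

    Port 9999 is an assumption; Spark's socket source would connect here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            for chunk in chunk_records(records, size):
                for record in chunk:
                    conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
```

Sending newline-delimited JSON keeps the downstream Spark job simple, since each received line can be parsed independently.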
## What You'll Learn
- Setting up a data pipeline with TCP/IP
- Real-time data streaming with Apache Kafka
- Data processing techniques with Apache Spark
- Real-time sentiment analysis with OpenAI ChatGPT
- Synchronising data from Kafka to Elasticsearch
- Indexing and querying data in Elasticsearch
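The sentiment-analysis step amounts to sending each review to the OpenAI API with a classification prompt. The helper below is an illustrative sketch; the prompt wording and model name are assumptions, not the project's exact code.

```python
def build_sentiment_prompt(review_text):
    """Build a one-word sentiment classification prompt; wording is illustrative."""
    return (
        "Classify the sentiment of the following Yelp review as "
        "POSITIVE, NEGATIVE, or NEUTRAL. Reply with one word only.\n\n"
        f"Review: {review_text}"
    )

# The actual API call would look roughly like this (sketch, not executed here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": build_sentiment_prompt("Great food!")}],
# )
# sentiment = resp.choices[0].message.content.strip()
```

Constraining the model to a one-word reply makes the result easy to attach to the record before producing it to the Kafka topic.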
## Technologies
- Python
- TCP/IP
- Confluent Kafka
- Apache Spark
- Docker
- Elasticsearch
## Getting Started
1. Clone the repository:
```bash
git clone https://github.com/airscholar/E2EDataEngineering.git
```
2. Navigate to the project directory:
```bash
cd E2EDataEngineering
```
3. Run Docker Compose to spin up the Spark cluster:
```bash
docker-compose up
```
For more detailed instructions, please check out the video tutorial linked below.
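Once the containers are up, a quick way to verify the services is to probe their web UIs. This is a sketch; the ports in the comments are common defaults and assumptions, so check `docker-compose.yml` for the actual values.

```python
import urllib.request

def service_up(url, timeout=3):
    """Return True if an HTTP endpoint responds at all, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        return False

# Example checks (ports are assumptions):
# service_up("http://localhost:8080")  # Spark master UI
# service_up("http://localhost:9021")  # Confluent Control Center
```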
## Watch the Video Tutorial
For a complete walkthrough and practical demonstration, check out the video here: [Watch on YouTube](https://www.youtube.com/watch?v=ETdyFfYZaqU)