https://github.com/airscholar/realtimestreamingengineering
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using a TCP/IP socket, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers every stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing into Elasticsearch.
- Host: GitHub
- URL: https://github.com/airscholar/realtimestreamingengineering
- Owner: airscholar
- Created: 2023-10-28T11:29:28.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-01-04T21:43:25.000Z (about 2 years ago)
- Last Synced: 2025-03-24T02:21:55.161Z (10 months ago)
- Topics: apache-spark, chatgpt, dataengineering, elasticsearch, kafka, openai-api, tcp-socket
- Language: Python
- Homepage: https://www.youtube.com/watch?v=ETdyFfYZaqU
- Size: 726 KB
- Stars: 34
- Watchers: 2
- Forks: 26
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch | End-to-End Data Engineering Project
## Table of Contents
- [Introduction](#introduction)
- [System Architecture](#system-architecture)
- [What You'll Learn](#what-youll-learn)
- [Technologies](#technologies)
- [Getting Started](#getting-started)
- [Watch the Video Tutorial](#watch-the-video-tutorial)
## Introduction
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using a TCP/IP socket, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers every stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing into Elasticsearch.
## System Architecture

The project is designed with the following components:
- **Data Source**: We use the `yelp.com` dataset for our pipeline.
- **TCP/IP Socket**: Streams the data over the network in chunks.
- **Apache Spark**: Processes the data with its master and worker nodes.
- **Confluent Kafka**: Our Kafka cluster in the cloud.
- **Control Center and Schema Registry**: Monitor the Kafka streams and manage their schemas.
- **Kafka Connect**: Sinks the data from Kafka into Elasticsearch.
- **Elasticsearch**: Indexes the data and serves queries.
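The socket step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names, port, and chunk size are assumptions.

```python
import json
import socket

def chunk_records(records, size):
    """Split a list of records into fixed-size chunks for streaming."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def serve_chunks(records, host="127.0.0.1", port=9999, size=2):
    """Stream JSON-encoded records to one TCP client, one record per line.

    Port 9999 is an assumption; Spark's socket source would connect here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, port))
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            for chunk in chunk_records(records, size):
                for record in chunk:
                    conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
```

Sending newline-delimited JSON keeps the downstream Spark job simple, since each received line can be parsed independently.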
## What You'll Learn
- Setting up a data pipeline with TCP/IP
- Real-time data streaming with Apache Kafka
- Data processing techniques with Apache Spark
- Real-time sentiment analysis with OpenAI ChatGPT
- Synchronising data from Kafka to Elasticsearch
- Indexing and querying data in Elasticsearch
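The sentiment-analysis step amounts to sending each review to the OpenAI API with a classification prompt. The helper below is an illustrative sketch; the prompt wording and model name are assumptions, not the project's exact code.

```python
def build_sentiment_prompt(review_text):
    """Build a one-word sentiment classification prompt; wording is illustrative."""
    return (
        "Classify the sentiment of the following Yelp review as "
        "POSITIVE, NEGATIVE, or NEUTRAL. Reply with one word only.\n\n"
        f"Review: {review_text}"
    )

# The actual API call would look roughly like this (sketch, not executed here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": build_sentiment_prompt("Great food!")}],
# )
# sentiment = resp.choices[0].message.content.strip()
```

Constraining the model to a one-word reply makes the result easy to attach to the record before producing it to the Kafka topic.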
## Technologies
- Python
- TCP/IP
- Confluent Kafka
- Apache Spark
- Docker
- Elasticsearch
## Getting Started
1. Clone the repository:
```bash
git clone https://github.com/airscholar/E2EDataEngineering.git
```
2. Navigate to the project directory:
```bash
cd E2EDataEngineering
```
3. Run Docker Compose to spin up the Spark cluster:
```bash
docker-compose up
```
For more detailed instructions, please check out the video tutorial linked below.
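Once the containers are up, a quick way to verify the services is to probe their web UIs. This is a sketch; the ports in the comments are common defaults and assumptions, so check `docker-compose.yml` for the actual values.

```python
import urllib.request

def service_up(url, timeout=3):
    """Return True if an HTTP endpoint responds at all, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        return False

# Example checks (ports are assumptions):
# service_up("http://localhost:8080")  # Spark master UI
# service_up("http://localhost:9021")  # Confluent Control Center
```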
## Watch the Video Tutorial
For a complete walkthrough and practical demonstration, check out the video here: [Watch on YouTube](https://www.youtube.com/watch?v=ETdyFfYZaqU)