{"id":20162669,"url":"https://github.com/airscholar/realtimestreamingengineering","last_synced_at":"2025-04-10T00:36:02.827Z","repository":{"id":204120667,"uuid":"711152714","full_name":"airscholar/RealtimeStreamingEngineering","owner":"airscholar","description":"This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. It covers each stage from data acquisition, processing, sentiment analysis with ChatGPT, production to kafka topic and connection to elasticsearch.","archived":false,"fork":false,"pushed_at":"2024-01-04T21:43:25.000Z","size":743,"stargazers_count":34,"open_issues_count":1,"forks_count":26,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T02:21:55.161Z","etag":null,"topics":["apache-spark","chatgpt","dataengineering","elasticsearch","kafka","openai-api","tcp-socket"],"latest_commit_sha":null,"homepage":"https://www.youtube.com/watch?v=ETdyFfYZaqU","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airscholar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-10-28T11:29:28.000Z","updated_at":"2025-03-20T14:25:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"8ea036b7-9e03-4c09-a2cd-b4c56d8799b9","html_url":"https://github.com/airscholar/RealtimeStreamingEngineering","commit_stats":null,"previous_names":["airscholar/e2edataengineering","airscholar/realtimestreamingengineering"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/airscholar%2FRealtimeStreamingEngineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FRealtimeStreamingEngineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FRealtimeStreamingEngineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FRealtimeStreamingEngineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/airscholar","download_url":"https://codeload.github.com/airscholar/RealtimeStreamingEngineering/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137897,"owners_count":21053771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","chatgpt","dataengineering","elasticsearch","kafka","openai-api","tcp-socket"],"created_at":"2024-11-14T00:26:13.949Z","updated_at":"2025-04-10T00:36:02.805Z","avatar_url":"https://github.com/airscholar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch | End-to-End Data Engineering Project\n\n## Table of Contents\n- [Introduction](#introduction)\n- [System Architecture](#system-architecture)\n- [What You'll Learn](#what-youll-learn)\n- [Technologies](#technologies)\n- [Getting Started](#getting-started)\n- [Watch the Video Tutorial](#watch-the-video-tutorial)\n\n## Introduction\n\nThis project serves as a comprehensive guide to building an end-to-end 
data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. It covers each stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and connection to Elasticsearch.\n\n## System Architecture\n![System_architecture.png](assets%2FSystem_architecture.png)\n\nThe project is designed with the following components:\n\n- **Data Source**: We use the `yelp.com` dataset for our pipeline.\n- **TCP/IP Socket**: Used to stream data over the network in chunks.\n- **Apache Spark**: For data processing with its master and worker nodes.\n- **Confluent Kafka**: Our cluster on the cloud.\n- **Control Center and Schema Registry**: Help with monitoring and schema management of our Kafka streams.\n- **Kafka Connect**: For connecting to Elasticsearch.\n- **Elasticsearch**: For indexing and querying.\n\n## What You'll Learn\n\n- Setting up a data pipeline with TCP/IP\n- Real-time data streaming with Apache Kafka\n- Data processing techniques with Apache Spark\n- Real-time sentiment analysis with OpenAI ChatGPT\n- Synchronising data from Kafka to Elasticsearch\n- Indexing and querying data in Elasticsearch\n\n## Technologies\n\n- Python\n- TCP/IP\n- Confluent Kafka\n- Apache Spark\n- Docker\n- Elasticsearch\n\n## Getting Started\n\n1. Clone the repository:\n    ```bash\n    git clone https://github.com/airscholar/E2EDataEngineering.git\n    ```\n\n2. Navigate to the project directory:\n    ```bash\n    cd E2EDataEngineering\n    ```\n\n3. 
Run Docker Compose to spin up the Spark cluster:\n    ```bash\n    docker-compose up\n    ```\n\nFor more detailed instructions, please check out the video tutorial linked below.\n\n## Watch the Video Tutorial\n\nFor a complete walkthrough and practical demonstration, check out the video here: [![Realtime Streaming with TCP IP Spark LLM Kafka Elasticsearch.png](assets%2FRealtime%20Streaming%20with%20TCP%20IP%20Spark%20LLM%20Kafka%20Elasticsearch.png)](https://www.youtube.com/watch?v=ETdyFfYZaqU)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairscholar%2Frealtimestreamingengineering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fairscholar%2Frealtimestreamingengineering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairscholar%2Frealtimestreamingengineering/lists"}