{"id":26230045,"url":"https://github.com/frocode/real_streaming_kafka","last_synced_at":"2026-04-21T22:01:19.872Z","repository":{"id":272499007,"uuid":"868640931","full_name":"FroCode/Real_Streaming_Kafka","owner":"FroCode","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-14T19:35:13.000Z","size":745,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-31T12:45:44.413Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FroCode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-06T22:23:02.000Z","updated_at":"2025-01-14T19:35:16.000Z","dependencies_parsed_at":"2025-01-14T20:50:21.525Z","dependency_job_id":"32fdcc4f-f66e-45f3-87ce-b3b1c5011c1a","html_url":"https://github.com/FroCode/Real_Streaming_Kafka","commit_stats":null,"previous_names":["frocode/real_streaming_kafka"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FroCode/Real_Streaming_Kafka","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FroCode%2FReal_Streaming_Kafka","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FroCode%2FReal_Streaming_Kafka/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FroCode%2FReal_Streaming_Kafka/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FroCode%2FReal_Streaming_Kafka/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/FroCode","download_url":"https://codeload.github.com/FroCode/Real_Streaming_Kafka/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FroCode%2FReal_Streaming_Kafka/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32112030,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-21T11:25:29.218Z","status":"ssl_error","status_checked_at":"2026-04-21T11:25:28.499Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-12T22:19:38.822Z","updated_at":"2026-04-21T22:01:19.855Z","avatar_url":"https://github.com/FroCode.png","language":"Python","readme":"# Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch\n## Table of Contents\n- [Introduction](#introduction)\n- [System Architecture](#system-architecture)\n- [Technologies](#technologies)\n- [Getting Started](#getting-started)\n\n\n## Introduction\n\nThis project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. 
It covers each stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and connection to Elasticsearch.\n\n## System Architecture\n![System_architecture.png](assets%2FSystem_architecture.png)\n\nThe project is designed with the following components:\n\n- **Data Source**: We use the `yelp.com` dataset for our pipeline.\n- **TCP/IP Socket**: Used to stream data over the network in chunks.\n- **Apache Spark**: For data processing with its master and worker nodes.\n- **Confluent Kafka**: Our cluster in the cloud.\n- **Control Center and Schema Registry**: Help with monitoring and schema management of our Kafka streams.\n- **Kafka Connect**: For connecting to Elasticsearch.\n- **Elasticsearch**: For indexing and querying.\n\n## Technologies\n\n- Python\n- TCP/IP\n- Confluent Kafka\n- Apache Spark\n- Docker\n- Elasticsearch\n\n## Getting Started\n\n1. Clone the repository:\n    ```bash\n    git clone https://github.com/FroCode/Real_Streaming_Kafka.git\n    ```\n\n2. Navigate to the project directory:\n    ```bash\n    cd Real_Streaming_Kafka\n    ```\n\n3. Run Docker Compose to spin up the Spark cluster:\n    ```bash\n    docker-compose up\n    ```\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrocode%2Freal_streaming_kafka","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffrocode%2Freal_streaming_kafka","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrocode%2Freal_streaming_kafka/lists"}