{"id":15056637,"url":"https://github.com/danhenriquex/real-time-streaming","last_synced_at":"2026-01-03T22:52:52.898Z","repository":{"id":254368886,"uuid":"846327344","full_name":"danhenriquex/Real-Time-Streaming","owner":"danhenriquex","description":"End to End Data Engineering using PySpark, Kafka, Cassandra and Docker","archived":false,"fork":false,"pushed_at":"2024-09-18T22:19:14.000Z","size":75,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-18T07:43:13.589Z","etag":null,"topics":["cassandra","dbeaver","docker","kafka","llm","pyspark"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danhenriquex.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-23T01:21:39.000Z","updated_at":"2024-10-30T10:59:20.000Z","dependencies_parsed_at":"2024-11-19T22:40:23.800Z","dependency_job_id":null,"html_url":"https://github.com/danhenriquex/Real-Time-Streaming","commit_stats":{"total_commits":25,"total_committers":2,"mean_commits":12.5,"dds":0.24,"last_synced_commit":"b8efe0bd3d7a5d94651306aeb13c27357d5b20b6"},"previous_names":["danhenriquex/big-data-project","danhenriquex/real-time-streaming"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danhenriquex%2FReal-Time-Streaming","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danhenriquex%2FReal-Time-Streaming/tags","releases_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/danhenriquex%2FReal-Time-Streaming/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danhenriquex%2FReal-Time-Streaming/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danhenriquex","download_url":"https://codeload.github.com/danhenriquex/Real-Time-Streaming/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254567248,"owners_count":22092737,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra","dbeaver","docker","kafka","llm","pyspark"],"created_at":"2024-09-24T21:54:31.282Z","updated_at":"2026-01-03T22:52:52.857Z","avatar_url":"https://github.com/danhenriquex.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e🤖 Data Project\u003c/h1\u003e\n\u003cp align=\"center\" id=\"objetivo\"\u003eLearning Data Engineering. 
\n \u003c/p\u003e \n\n\n\u003cp align=\"center\"\u003e\n \u003ca href=\"#overview\"\u003eOverview\u003c/a\u003e •\n \u003ca href=\"#features\"\u003eTechnologies and Tools Used\u003c/a\u003e •\n \u003ca href=\"#roadmap\"\u003eProject Structure\u003c/a\u003e • \n \u003ca href=\"#tecnologias\"\u003eGetting Started\u003c/a\u003e • \n \u003ca href=\"#author\"\u003eRunning the Pipeline\u003c/a\u003e •\n \u003ca href=\"#author\"\u003eWhat I Learned\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch4 align=\"center\"\u003e \n\t🚧  Data Engineering Project 🚀 Finished  🚧 \n\u003c/h4\u003e\n\n\u003cdiv align=\"center\"\u003e\n\t\u003ca href=\"https://wakatime.com/badge/user/8028aaab-232d-4832-8b66-f103e1d713b9/project/6ac45fa8-dfae-463f-bca1-a84418e4883c\"\u003e\u003cimg src=\"https://wakatime.com/badge/user/8028aaab-232d-4832-8b66-f103e1d713b9/project/6ac45fa8-dfae-463f-bca1-a84418e4883c.svg\" alt=\"wakatime\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n### Overview\n\n\u003cdiv style='margin: 20px' id=\"overview\"\u003e\nThis project demonstrates a data processing pipeline using Kafka, PySpark, Docker, Cassandra, and OpenAI. 
The goal was to create a system for real-time data streaming and processing, integrating various technologies to build a scalable and efficient architecture.\n\u003c/div\u003e\n\n### Features\n\n\u003cdiv id=\"features\"\u003e\n\n- **Kafka**: A distributed streaming platform used for building real-time data pipelines and streaming applications.\n- **PySpark**: The Python API for Apache Spark, used for large-scale data processing and analytics.\n- **Docker**: A platform for building and running containerized applications, ensuring consistent environments across development, testing, and production.\n- **Cassandra**: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.\n- **OpenAI**: Used for integrating advanced language models into the pipeline.\n- **Python**: The primary programming language used for scripting and application logic.\n\n\u003c/div\u003e\n\n\u003cdiv id=\"roadmap\"\u003e\n\n### Project Structure\n\n```bash\n├── jobs/requirements.txt\n├── jobs/spark-consumer.py\n├── .env.example\n├── constants.py\n├── docker-compose.yml\n└── main.py\n```\n\n\u003c/div\u003e\n\n\n### Scripts Overview\n\n- `jobs/requirements.txt`: Lists the Python dependencies required for the Spark consumer job.\n- `jobs/spark-consumer.py`: Contains the code for consuming data from Kafka and processing it using PySpark.\n- `.env.example`: An example environment variable file to configure your local environment.\n- `constants.py`: Defines constants used throughout the project.\n- `docker-compose.yml`: Defines the multi-container Docker setup, configuring services for Kafka, Cassandra, and other components.\n- `main.py`: The main script to initialize and run the application.\n\n\u003cdiv id=\"tecnologias\"\u003e\n\t\n### Getting Started\n\nTo get started with this project, follow these steps:\n\n1. 
**Clone the Repository:**\n\n   ```bash\n   git clone \u003crepository_url\u003e\n   cd \u003crepository_directory\u003e\n   ```\n\n2. **Set Up the Environment:**\n\n   ```bash\n   # Install Python version\n   pyenv install 3.10.12\n\n   # Create a virtual environment\n   pyenv virtualenv 3.10.12 \u003cenv_name\u003e\n\n   # Activate the environment\n   pyenv activate \u003cenv_name\u003e\n\n   # Install dependencies\n   pip install -r jobs/requirements.txt\n   ```\n\n3. **Run the Docker Containers:**\n\n   ```bash\n   docker compose up -d\n   ```\n\n4. **Set Up the Database:**\n\n   ```bash\n   # Run the Spark consumer inside the Spark master container.\n   # The container name may differ on your machine; check `docker ps`.\n   docker exec -it realestatedataengineering-spark-master-1 spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 spark-consumer.py\n   ```\n\n5. **Execute the Main Script:**\n\n   ```bash\n   python main.py\n   ```\n\n\u003c/div\u003e\n\n### What I Learned\n\n- Kafka Integration: Gained experience in using Kafka for real-time data streaming and message brokering.\n- PySpark: Developed skills in large-scale data processing and analytics using PySpark.\n- Docker: Learned to containerize applications and manage multi-container setups with Docker Compose.\n- Cassandra: Worked with Cassandra for scalable and distributed database solutions.\n- OpenAI API: Integrated OpenAI’s language models for advanced text processing and analysis.\n\n\n### Author\n\n---\n\nSource: https://www.youtube.com/@CodeWithYu/videos\n\n\u003c!-- \u003cscript type=\"text/javascript\" src=\"https://platform.linkedin.com/badges/js/profile.js\" async defer\u003e\u003c/script\u003e --\u003e\n\n\u003cdiv align=\"left\" id=\"author\"\u003e\n\n\u003ca href=\"https://github.com/danhenriquex\"\u003e\n  \u003cimg src=\"https://github.com/danhenriquex.png\" width=\"100\" height=\"100\" style=\"border-radius: 50%\"/\u003e\n\u003c/a\u003e\n\n\u003c!-- \u003cdiv class=\"LI-profile-badge\"  data-version=\"v1\" data-size=\"medium\" data-locale=\"pt_BR\" data-type=\"vertical\" 
data-theme=\"dark\" data-vanity=\"danilo-henrique-santana\"\u003e\u003ca class=\"LI-simple-link\" href='https://br.linkedin.com/in/danilo-henrique-santana?trk=profile-badge'\u003eDanilo Henrique\u003c/a\u003e\u003c/div\u003e --\u003e\n\u003c/div\u003e\n\n\u003cdiv style=\"margin-top: 20px\" \u003e\n  \u003ca href=\"https://www.linkedin.com/in/danilo-henrique-480032167/\"\u003e\n    \u003cimg  src=\"https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge\u0026logo=linkedin\u0026logoColor=white\"/\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanhenriquex%2Freal-time-streaming","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanhenriquex%2Freal-time-streaming","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanhenriquex%2Freal-time-streaming/lists"}