{"id":23647650,"url":"https://github.com/hassonor/kafka-spark-data-engineering","last_synced_at":"2026-05-06T04:07:38.847Z","repository":{"id":267634828,"uuid":"901869040","full_name":"hassonor/kafka-spark-data-engineering","owner":"hassonor","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-18T15:21:36.000Z","size":3353,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-18T15:23:13.942Z","etag":null,"topics":["apachekafka","apachespark","data","docker","docker-compose","high-performance","java","kafka","producer","python"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hassonor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-11T13:19:04.000Z","updated_at":"2025-02-18T15:21:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"3a467b2d-168d-44e9-b997-ff0e848c802d","html_url":"https://github.com/hassonor/kafka-spark-data-engineering","commit_stats":null,"previous_names":["hassonor/kafka-spark-data-engineering"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hassonor%2Fkafka-spark-data-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hassonor%2Fkafka-spark-data-engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hassonor%2Fkafka-spark-data-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/hassonor%2Fkafka-spark-data-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hassonor","download_url":"https://codeload.github.com/hassonor/kafka-spark-data-engineering/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239599040,"owners_count":19665911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apachekafka","apachespark","data","docker","docker-compose","high-performance","java","kafka","producer","python"],"created_at":"2024-12-28T14:38:40.859Z","updated_at":"2025-11-12T07:30:18.038Z","avatar_url":"https://github.com/hassonor.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# E2E Data Engineering Project: High-Performance Financial Transaction Processing\n\n🚧 Project Status: Under Active Development 🚧\n\n[![Java](https://img.shields.io/badge/Java-17-orange.svg)](https://openjdk.org/projects/jdk/17/)\n[![Apache Kafka](https://img.shields.io/badge/Apache%20Kafka-3.8.1-blue.svg)](https://kafka.apache.org/)\n[![Apache Spark](https://img.shields.io/badge/Apache%20Spark-3.5.0-orange.svg)](https://spark.apache.org/)\n[![Docker](https://img.shields.io/badge/Docker-Latest-blue.svg)](https://www.docker.com/)\n\nA production-grade data engineering pipeline designed to handle massive-scale financial transaction processing with high\nthroughput, fault tolerance, and real-time analytics capabilities.\n\n## 🚀 System Capabilities\n\n- **Processing Volume**: 1.2 billion transactions per hour\n- **Storage 
Capacity**: Handles up to 1.852 PB of data (5-year projection with 20% YoY growth)\n- **Compression Ratio**: 5:1 using Snappy compression\n- **High Availability**: Triple replication for fault tolerance\n- **Real-time Processing**: Sub-second latency for transaction analytics\n- **Scalable Architecture**: Horizontally scalable Kafka-Spark infrastructure\n\n## 📊 Data Specifications\n\n### Transaction Schema\n\n```scala\ncase class Transaction (\n    transactionId: String,      // 36 bytes\n    amount: Float,              // 8 bytes\n    userId: String,             // 12 bytes\n    transactionTime: BigInt,    // 8 bytes\n    merchantId: String,         // 12 bytes\n    transactionType: String,    // 8 bytes\n    location: String,           // 12 bytes\n    paymentMethod: String,      // 15 bytes\n    isInternational: Boolean,   // 5 bytes\n    currency: String            // 5 bytes\n)                              // Total: 120 bytes\n```\n\n### Storage Requirements\n\n| Time Period | Raw Data  | With 3x Replication \u0026 Compression |\n|-------------|-----------|-----------------------------------|\n| Per Hour    | 144 GB    | 86.4 GB                           |\n| Per Day     | 3.456 TB  | 2.07 TB                           |\n| Per Month   | 103.68 TB | 62.1 TB                           |\n| Per Year    | 1.244 PB  | 745.2 TB                          |\n\n## 🛠 Technology Stack\n\n- **Apache Kafka 3.8.1**: Distributed streaming platform\n    - 3 Controller nodes for cluster management\n    - 3 Broker nodes for data distribution\n    - Schema Registry for data governance\n- **Apache Spark 3.5.0**: Real-time data processing\n    - 1 Master node\n    - 3 Worker nodes\n    - Structured Streaming for real-time analytics\n- **Additional Components**:\n    - RedPanda Console for cluster monitoring\n    - Docker \u0026 Docker Compose for containerization\n    - Java 17 for transaction generation\n    - Python for Spark processing\n\n## 🚀 Getting Started\n\n### 
Prerequisites\n\n- Docker and Docker Compose\n- Java 17 or later\n- Maven\n- Python 3.8 or later\n\n### Quick Start\n\n1. Clone the repository:\n\n```bash\ngit clone https://github.com/hassonor/kafka-spark-data-engineering.git\ncd kafka-spark-data-engineering\n```\n\n2. Start the infrastructure:\n\n```bash\ndocker-compose up -d\n```\n\n3. Build and run the transaction producer:\n\n```bash\nmvn clean package\njava -jar target/transaction-producer.jar\n```\n\n4. Start the Spark processor:\n\n```bash\npython spark_processor.py\n```\n\n### Accessing Services\n\n- RedPanda Console: http://localhost:8080\n- Spark Master UI: http://localhost:9190\n- Schema Registry: http://localhost:18081\n\n## 🏗 Architecture\n\n```mermaid\ngraph LR\n    subgraph \"Producers\"\n        java[Java]\n        python[Python]\n    end\n\n    subgraph \"Kafka Ecosystem\"\n        kc[Kafka Controllers]\n        kb[Kafka Brokers]\n        sr[Schema Registry]\n    end\n\n    subgraph \"Processing\"\n        master[Spark Master]\n        w1[Worker]\n        w2[Worker]\n        w3[Worker]\n    end\n\n    subgraph \"Analytics Stack\"\n        es[Elasticsearch]\n        logstash[Logstash]\n        kibana[Kibana]\n    end\n\n    %% Connections\n    java -- Streaming --\u003e kc\n    python -- Streaming --\u003e kc\n    kc -- Streaming --\u003e kb\n    kb --\u003e sr\n    kb -- Streaming --\u003e master\n    master --\u003e w1\n    master --\u003e w2\n    master --\u003e w3\n    w1 --\u003e es\n    w2 --\u003e es\n    w3 --\u003e es\n    es --\u003e logstash\n    logstash --\u003e kibana\n```\n\n## 🔧 Configuration\n\n### Kafka Configuration\n\n- Bootstrap Servers: `localhost:29092,localhost:39092,localhost:49092`\n- Topics:\n    - `financial_transactions`: Raw transaction data\n    - `transaction_aggregates`: Processed aggregations\n    - `transaction_anomalies`: Detected anomalies\n\n### Spark Configuration\n\n- Master URL: `spark://spark-master:7077`\n- Worker Resources:\n    - Cores: 2 per 
worker\n    - Memory: 2GB per worker\n- Checkpoint Directory: `/mnt/spark-checkpoints`\n- State Store: `/mnt/spark-state`\n\n## 📈 Performance Optimization\n\n### Producer Optimizations\n\n- Batch Size: 64KB\n- Linger Time: 3ms\n- Compression: Snappy\n- Acknowledgments: 1 (leader acknowledgment)\n- Multi-threaded production: 3 concurrent threads\n\n### Consumer Optimizations\n\n- Structured Streaming for efficient processing\n- Checkpointing for fault tolerance\n- State store for stateful operations\n- Partition optimization for parallel processing\n\n## 🔍 Monitoring\n\n- Transaction throughput logging\n- Kafka cluster health monitoring via RedPanda Console\n- Spark UI for job tracking and performance metrics\n- Producer performance metrics with throughput calculation\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🤝 Contributing\n\nContributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and\nthe process for submitting pull requests.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhassonor%2Fkafka-spark-data-engineering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhassonor%2Fkafka-spark-data-engineering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhassonor%2Fkafka-spark-data-engineering/lists"}