{"id":25709558,"url":"https://github.com/pranavbarthwal/kafka","last_synced_at":"2025-06-16T22:34:22.383Z","repository":{"id":277171613,"uuid":"930421568","full_name":"PranavBarthwal/kafka","owner":"PranavBarthwal","description":"Kafka is an open-source software platform for storing, processing, and analyzing streaming data in real time. It's used to build data pipelines and applications that can adapt to data streams. ","archived":false,"fork":false,"pushed_at":"2025-02-12T15:19:33.000Z","size":38,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-25T09:42:45.359Z","etag":null,"topics":["apache-kafka","data-streaming","distributed-systems","documentation","kafka","system-design"],"latest_commit_sha":null,"homepage":"https://medium.com/@prayagbhatt2003/kafka-x-zomato-c07c02da09cd","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PranavBarthwal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-10T15:57:00.000Z","updated_at":"2025-02-12T15:21:43.000Z","dependencies_parsed_at":"2025-02-12T14:55:41.978Z","dependency_job_id":"eaa8f33d-c98d-44df-acdd-358301563b3d","html_url":"https://github.com/PranavBarthwal/kafka","commit_stats":null,"previous_names":["pranavbarthwal/kafka"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/PranavBarthwal/kafka","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PranavBarthwal%2Fkafka","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/PranavBarthwal%2Fkafka/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PranavBarthwal%2Fkafka/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PranavBarthwal%2Fkafka/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PranavBarthwal","download_url":"https://codeload.github.com/PranavBarthwal/kafka/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PranavBarthwal%2Fkafka/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260252705,"owners_count":22981320,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-kafka","data-streaming","distributed-systems","documentation","kafka","system-design"],"created_at":"2025-02-25T09:34:24.409Z","updated_at":"2025-06-16T22:34:22.354Z","avatar_url":"https://github.com/PranavBarthwal.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📍 Apache Kafka\n\nApache Kafka is a distributed event streaming platform used for building real-time data pipelines and streaming applications. It is designed to handle high-throughput, fault-tolerant, and scalable messaging. 
Kafka is widely used for log aggregation, real-time analytics, and event-driven architectures.\n\nIn simpler terms, Apache Kafka is like a communication system that helps different parts of a computer system exchange data by publishing and subscribing to topics.\n\n\n\n# 📍 Kafka's Publisher-Subscriber model\n![411843836-8d38bcea-0ee5-41fb-89dd-8b619e77827c](https://github.com/user-attachments/assets/ad484d78-682a-4055-96f5-fe862600427e)\n\n\nIn **Kafka's Publisher-Subscriber model** (Pub-Sub model), producers (publishers) send messages to **topics**, and consumers (subscribers) read messages from these topics. Kafka brokers store and distribute these messages efficiently.  \n\n### **How It Works:**\n1. **Producers publish** messages to a topic.  \n2. **Kafka brokers store** these messages in **partitions** (subdivisions of topics for scalability).  \n3. **Consumers subscribe** to topics and process messages.  \n4. **Consumer groups ensure** each message is processed by only one consumer in the group.  \n\n\n\n# 📍 **Zomato’s Kafka-based Architecture for Real-Time Delivery Tracking**  \n\n### **Problem with Traditional Architecture**  \nIn a traditional architecture, Zomato would frequently retrieve and store the **delivery boy’s location** in the **database (DB)** and send updates to the **user**. Given Zomato’s scale, this would lead to:  \n- **Excessive DB hits** every second.  \n- **Performance issues** due to limited DB throughput.  \n- **Risk of DB crashes** from high-frequency reads and writes. 
\n\n### **Diagram:**  \n```\n+------------------+       +------------------+       +------------------+\n| Delivery Boy    | -----\u003e |   Database (DB)  | \u003c----- |      User       |\n| (Sends Location)|        | (Frequent Writes)|        | (Requests Data) |\n+------------------+       +------------------+       +------------------+\n                            ▲    ▲    ▲    ▲\n                            |    |    |    |\n              High DB Load Due to Frequent Writes \u0026 Reads\n``` \n\n### **Kafka-based Pub-Sub Model for Zomato**  \nTo handle high scale and volume efficiently, Zomato can implement a **Kafka-based Publish-Subscribe (Pub-Sub) model**, where:  \n- **Producers (Delivery Boys)** publish location updates to **Kafka topics**.  \n- **Kafka** efficiently handles high-throughput streaming.  \n- **Consumers (Users)** subscribe to receive real-time updates.  \n- The **server processes** data and stores it in the DB in bulk after order completion in a **batch process**.  \n\n### **Diagram:**  \n```\n+------------------+       +------------------+       +------------------+\n| Delivery Boy    | -----\u003e |      Kafka       | -----\u003e |      User       |\n| (Sends Location)|        | (High Throughput)|        | (Receives Live  |\n+------------------+        |  Pub-Sub Model) |        |  Updates)       |\n                            +------------------+\n                                    │\n                                    ▼\n                        +----------------------+\n                        |   Database (DB)      |\n                        | (Bulk Storage After  |\n                        |  Order Completion)   |\n                        +----------------------+\n``` \n\n### **Why Do We Still Need a Database Along with Kafka?**  \nKafka is a **message broker**, not a **permanent storage solution**. 
While Kafka can retain messages for a configurable period, we still need a **database** for:  \n- **Long-term storage** – Order and delivery history must be stored permanently.  \n- **Querying and analytics** – Databases provide structured access to historical data.  \n- **Data consistency** – Kafka handles streaming but does not ensure **ACID compliance** like relational databases.  \n- **Data retrieval** – If a user wants to check past orders, this data must be stored in a DB, not Kafka.  \n\n### **Comparison Between Traditional DB-Based Approach and Kafka-Based Approach**  \n\n| Feature                     | **Traditional DB-Based Approach** | **Kafka-Based Approach** |\n|-----------------------------|----------------------------------|-------------------------|\n| **Data Flow**               | Delivery boy updates DB, user fetches from DB | Delivery boy publishes to Kafka, user subscribes to real-time updates |\n| **Database Load**           | Very high due to frequent writes and reads | Minimal as updates are stored in Kafka and written to DB in batches |\n| **Scalability**             | Limited due to DB bottlenecks | Highly scalable with Kafka's distributed architecture |\n| **Real-Time Updates**       | No, updates depend on DB read frequency | Yes, users get instant updates through Kafka |\n| **Latency**                 | High due to DB query and write delays | Low as Kafka streams data in real-time |\n| **Reliability**             | Risk of DB crashes under high load | High as Kafka provides replication and fault tolerance |\n| **Storage Efficiency**      | Inefficient, as every update is stored in the DB | Efficient, as only final delivery data is stored in the DB |\n| **System Complexity**       | Simpler but not optimized for scale | Slightly more complex but highly optimized for performance |\n| **Cost Efficiency**         | High cost due to heavy DB infrastructure | Lower cost as Kafka handles high throughput without DB dependency |\n| **Use Case 
Suitability**    | Works for small-scale applications | Best for large-scale, high-throughput systems like Zomato |\n\n\n\n# 📍 Key Features of Kafka\n\n1. **High Throughput**  \n   - Kafka can handle **millions of messages per second** with low latency.  \n   - It achieves this by using a **distributed, partitioned, and log-based storage system**.  \n   - Messages are written and read in a **sequential** manner, reducing disk I/O overhead.  \n\n2. **Fault Tolerance (Replication)**  \n   - Kafka ensures **data reliability** through **replication** across multiple brokers.  \n   - Each topic partition has **multiple replicas**, preventing data loss in case of broker failures.  \n   - If a leader broker fails, a replica automatically takes over as the new leader.  \n\n3. **Durable**  \n   - Kafka persists messages on **disk storage**, ensuring durability.  \n   - Messages are retained for a **configurable period** (even if they have been consumed).  \n   - This allows for message replay, which is useful for event-driven architectures.  \n\n4. **Scalable**  \n   - Kafka scales **horizontally** by adding more brokers to a cluster.  \n   - Topics are divided into **partitions**, enabling parallel processing.  \n   - Consumer groups allow for **load balancing**, ensuring efficient message consumption.  \n\n\n\n# 📍 Kafka Architecture\n![Screenshot 2025-02-11 142903](https://github.com/user-attachments/assets/6d328164-f9bf-41f3-a4b5-5bc4a57fc48b)\n\nKafka follows a **distributed, event-driven, and high-throughput architecture** designed for real-time data streaming. It consists of multiple components working together to enable **efficient message publishing, storage, and consumption**.\n\n### **1. Producer**\n- Sends (publishes) messages to **Kafka topics**.\n- Can send messages to specific **partitions** within a topic.\n- Works asynchronously for **high throughput**.\n\n### **2. 
Kafka Cluster**\n- Consists of multiple **Kafka brokers** (servers) handling message storage and distribution.\n- Ensures **fault tolerance and replication** for data reliability.\n\n### **3. Broker**\n- A **Kafka server** that stores and manages data.\n- Each broker handles a subset of **topic partitions**.\n- Kafka can have **multiple brokers**, forming a **cluster** for scalability.\n\n### **4. Topic**\n- A logical channel where **messages** are published.\n- **Producers write** to topics, and **consumers read** from topics.\n- Topics are divided into **partitions** for parallelism.\n\n### **5. Partition**\n- A subset of a topic that allows **load distribution** across multiple brokers.\n- Each partition is replicated across brokers for **fault tolerance**.\n- Consumers read messages **sequentially** from partitions.\n\n### **6. Offset**\n- A sequential **ID assigned to each message**, unique within its partition.\n- Kafka tracks offsets to ensure **message ordering and retrieval**.\n- Consumers can store offsets to **resume processing** from the last read position.\n\n### **7. Consumer**\n- Subscribes to **Kafka topics** and reads messages.\n- Can be part of a **consumer group** for **parallel processing**.\n- **Manages offsets** to track message consumption.\n\n### **8. Consumer Group**\n- A collection of **consumers reading from the same topic**.\n- Each message is **processed by only one consumer per group**, ensuring load balancing.\n- Multiple groups can subscribe to the same topic, each processing independently.\n\n### **9. ZooKeeper**\n- Manages **Kafka broker metadata, leader election, and configuration**.\n- Ensures **broker coordination and failure detection**.\n- Required for maintaining **cluster health** in ZooKeeper-based clusters; newer Kafka releases can run without ZooKeeper by using the built-in **KRaft** mode.\n\n\n\n# 📍 Partition-Consumer Exclusivity and Consumer Groups\n\nKafka follows a **partition-based parallelism** model where:  \n- **One Consumer Can Consume Multiple Partitions**.  \n- **One Partition Can Be Consumed by Only One Consumer** within a consumer group at a time.  
\n\nThis ensures:  \n1. **Efficient parallel processing** by distributing partitions across consumers.  \n2. **Message order within a partition is maintained**, as only one consumer reads from a partition.  \n\n### **Case 1: One Consumer, Multiple Partitions**\n```\n+--------------------+\n|      TOPIC        |\n+--------------------+\n| Partition 0       |  ---\u003e Consumer A\n| Partition 1       |  ---\u003e Consumer A\n| Partition 2       |  ---\u003e Consumer A\n+--------------------+\n```\n- Consumer A is consuming messages from multiple partitions.\n- Each partition still has only one consumer.\n\n### **Case 2: Equal Partitions to Consumers (1:1 Mapping)**  \nWhen the number of consumers **matches** the number of partitions, each consumer **exclusively** consumes messages from a single partition.\n\n```\n+--------------------+\n|      TOPIC        |\n+--------------------+\n| Partition 0       |  ---\u003e Consumer A\n| Partition 1       |  ---\u003e Consumer B\n| Partition 2       |  ---\u003e Consumer C\n+--------------------+\n```\n- **Each consumer gets exactly one partition.**  \n- **Parallelism is maximized.**  \n- **Order is preserved within each partition.**  \n\n### **Case 3: More Consumers Than Partitions (Consumers Remain Idle)**  \nWhen there are **more consumers than partitions**, some consumers remain **idle** as a partition **cannot be shared**.\n\n```\n+--------------------+\n|      TOPIC        |\n+--------------------+\n| Partition 0       |  ---\u003e Consumer A\n| Partition 1       |  ---\u003e Consumer B\n| Partition 2       |  ---\u003e Consumer C\n+--------------------+\n              |\n              |  Consumer D (Idle)\n              |  Consumer E (Idle)\n```\n- **Two consumers are not assigned any partition.**  \n- **Adding more consumers than partitions does not improve performance.**  \n\n### **Case 4: Fewer Consumers Than Partitions (Consumers Handle Multiple Partitions)**  \nIf there are **fewer consumers than partitions**, 
Kafka **distributes partitions as evenly as possible** among the available consumers.\n\n```\n+--------------------+\n|      TOPIC        |\n+--------------------+\n| Partition 0       |  ---\u003e Consumer A\n| Partition 1       |  ---\u003e Consumer A\n| Partition 2       |  ---\u003e Consumer B\n| Partition 3       |  ---\u003e Consumer B\n| Partition 4       |  ---\u003e Consumer C\n+--------------------+\n```\n- **Each consumer handles one or more partitions.**  \n- **Kafka spreads the partitions as evenly as the counts allow** (here 2, 2, and 1).  \n\n## Consumer Groups\n![Screenshot 2025-02-12 191323](https://github.com/user-attachments/assets/6f1484d3-da0a-45a8-8dc6-4066201e6b17)\n\nIn Apache Kafka, a **consumer group** is a collection of consumer instances that work together to consume messages from one or more topics. The key idea is that Kafka allows messages in a topic to be **distributed among multiple consumers** in a scalable way while maintaining fault tolerance. Each consumer in a group reads from a subset of the partitions, ensuring that each partition is consumed by only one consumer within the group.  \n\nKafka achieves this by **assigning partitions** to different consumers within a group, and this assignment is dynamically managed. If a consumer joins or leaves the group, Kafka automatically redistributes the partitions among the current group members, a process known as **rebalancing**.  \n\n### **How Group-Level Self-Balancing Works**  \n\nKafka dynamically balances load across consumers within a group. The partition assignment strategy ensures an **even and efficient** distribution of partitions across available consumers. Together, the partition count and the number of consumers in the group determine how messages are distributed.  
\n\n### **Brief on Rebalancing**  \n\nRebalancing occurs when:  \n- A new consumer joins  \n- A consumer leaves (crashes or is stopped)  \n- The partition count changes  \n\nKafka **automatically reassigns partitions** among available consumers to maintain optimal load distribution. However, frequent rebalancing can be costly in terms of performance since consumers need to pause consuming messages during this process.  \n\n\n\n# 📍**Queue Model vs. Pub/Sub Model in Kafka**  \n\nKafka supports **both the Queue Model and the Publish-Subscribe (Pub/Sub) Model** through **Consumer Groups**, allowing it to function in different messaging patterns.\n\n## **Queue Model (Point-to-Point Messaging)**\n- In a **Queue Model**, multiple consumers act as workers, but **each message is processed by only one consumer**.\n- Kafka achieves this using **Consumer Groups**, where partitions are **evenly distributed among consumers**.\n- **Messages are load-balanced** across consumers.  \n- **No message duplication** within the consumer group.  \n- This model is useful for **task processing (e.g., background jobs, event-driven processing)**.\n\n## **Publish-Subscribe (Pub/Sub) Model**\n- In **Pub/Sub**, multiple consumers receive **a copy of the same message**.\n- Kafka enables this by using **multiple independent consumer groups**.  
\n- Each group gets **its own copy** of the messages.\n- **Each consumer group gets the entire data stream.**  \n- **Consumers in different groups do not affect each other.**  \n- Used for **real-time analytics, logging, event-driven architectures**.\n\n## **Kafka Consumer Groups: Combining Queue \u0026 Pub/Sub Models**\n- Kafka's **Consumer Groups enable both models**:\n  - Within a **single group**, Kafka works as a **Queue Model**.\n  - With **multiple groups**, Kafka behaves as a **Pub/Sub Model**.\n\n![Screenshot 2025-02-12 191204](https://github.com/user-attachments/assets/4ff4c742-22b1-4cdb-956e-5ce620150950)\n\n\n- **Each group acts as a queue** (messages distributed across group members).  \n- **Multiple groups enable pub/sub** (each group gets all messages).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpranavbarthwal%2Fkafka","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpranavbarthwal%2Fkafka","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpranavbarthwal%2Fkafka/lists"}