Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/linnykoleh/awesome-system-design
Personal notes from courses and articles about system design.
List: awesome-system-design
- Host: GitHub
- URL: https://github.com/linnykoleh/awesome-system-design
- Owner: linnykoleh
- Created: 2024-08-08T11:21:42.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2024-11-07T18:05:53.000Z (about 1 month ago)
- Last Synced: 2024-11-07T19:20:11.701Z (about 1 month ago)
- Topics: faang, faang-interview, grokking, high-performance, highload, leetcode, microservices, system-design
- Homepage:
- Size: 44.3 MB
- Stars: 83
- Watchers: 3
- Forks: 33
- Open Issues: 0
Metadata Files:
- Readme: README.MD
Awesome Lists containing this project
- ultimate-awesome - awesome-system-design - Personal notes from courses and articles about system design. (Other Lists / Monkey C Lists)
README
# Awesome System Design
> - This is a collection of personal notes from various courses, articles, and books on system design.
> - The goal is to share a comprehensive guide to system design that can be used as a reference for interviews and real-world projects.
> - The notes are organized by topics and include diagrams, code snippets, and examples to help understand the concepts better.
> - The notes are a work in progress and will be updated regularly with new information and resources.

## Resources:
### Courses:
- [x] [System Design course on neetcode.io](https://neetcode.io)
- [x] [System Design for Interviews and Beyond](https://leetcode.com/explore/interview/card/system-design-for-interviews-and-beyond)
- [x] [Highload Software Architecture on Projector](https://prjctr.com/course/highload-software-architecture)
- [ ] [Learn System Design in a Hurry](https://www.hellointerview.com/learn/system-design/in-a-hurry/introduction)
- [ ] [System Design Simplified on interviewready.io](https://interviewready.io/course-page/system-design-course)
- [ ] [Grokking the System Design Interview](https://www.designgurus.io/course/grokking-the-system-design-interview)
- [ ] [AWS certification](https://aws.amazon.com/certification)

### Articles:
- [x] [Helpful list of LeetCode Posts on System Design at Facebook, Google, Amazon, Uber, Microsoft](https://leetcode.com/discuss/interview-question/1140451/helpful-list-of-leetcode-posts-on-system-design-at-facebook-google-amazon-uber-microsoft)
- [x] [My System Design Template](https://leetcode.com/discuss/career/229177/My-System-Design-Template)
- [x] [Latency Numbers Every Programmer Should Know](https://colin-scott.github.io/personal_website/research/interactive_latency.html)
- [x] [Real-time Messaging in Slack](https://slack.engineering/real-time-messaging)
- [x] [How Discord Stores Billions of Messages](https://discord.com/blog/how-discord-stores-billions-of-messages)
- [x] [How Discord Stores Trillions of Messages](https://discord.com/blog/how-discord-stores-trillions-of-messages)
- [x] [Meta Onsite System Design Questions](https://leetcode.com/discuss/interview-experience/4428743/Meta-Onsite-System-Design-Questions)
- [x] [A Senior Engineer's Guide to the System Design Interview](https://interviewing.io/guides/system-design-interview)
- [x] [System design interview guide for Software Engineers](https://www.techinterviewhandbook.org/system-design/)
- [x] [ByteByteGo Newsletter](https://blog.bytebytego.com/)
- [x] [Netflix's interview process and questions](https://interviewing.io/guides/hiring-process/netflix)
- [x] [How software engineering behavioral interviews are evaluated at Meta](https://interviewing.io/blog/how-software-engineering-behavioral-interviews-are-evaluated-meta)
- [ ] [How Figma's databases team lived to tell the scale](https://www.figma.com/blog/how-figmas-databases-team-lived-to-tell-the-scale)

### Books:
- [x] [System Design Interview – An insider's Guide by Alex Xu](https://www.amazon.com/System-Design-Interview-insiders-Second/dp/B08CMF2CQF?tag=interior0d3-20)
- [x] [Microservices patterns](https://microservices.io/patterns)
- [x] [Microservices Patterns: With examples in Java by Chris Richardson](https://www.amazon.com/Microservices-Patterns-examples-Chris-Richardson/dp/1617294543?tag=interior0d3-20)
- [ ] [Designing Data-Intensive Applications by Martin Kleppmann](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321?tag=interior0d3-20)
- [ ] [Distributed Systems 4 by Maarten van Steen](https://www.amazon.com/Distributed-Systems-Maarten-van-Steen/dp/9081540637?tag=interior0d3-20)
- [ ] [System Design Interview – An Insider's Guide Volume 2 by Alex Xu, Sahn Lam](https://www.amazon.com/System-Design-Interview-Insiders-Guide/dp/1736049119?tag=interior0d3-20)

### Videos:
- [x] [!!! Most Tech Interview Prep is GARBAGE. (From a Principal Engineer at Amazon)](https://www.youtube.com/watch?v=0Z9RW_hhUT4&ab_channel=ALifeEngineered)
- [x] [System Design Interview – An insider's Guide discussion](https://youtube.com/playlist?list=PLlghaO_0b1OcY4yfpwGUOcFlnU7LpIkhC&si=Y7OqweK6rEgtaZUw)
- [x] [System Design for Interviews and Beyond videos](https://www.youtube.com/c/SystemDesignInterview/videos)
- [x] [System design playlist #1 on YouTube](https://www.youtube.com/playlist?list=PLMCXHnjXnTnvo6alSjVkgxV-VH6EPyvoX)
- [x] [System design playlist #2 on YouTube](https://www.youtube.com/playlist?list=PLrtCHHeadkHp92TyPt1Fj452_VGLipJnL)
- [ ] [System design playlist #3 on YouTube](https://www.youtube.com/playlist?list=PLm6XThSMgu_FFth3WUNBKnz_1dG3BKdBn)
- [ ] [System design mock interviews](https://www.youtube.com/playlist?list=PLf3F6FcQwgqEpnucyupbIqzxyvFOz9uDq)

### GitHub Repositories:
- [ ] [https://github.com/InterviewReady/system-design-resources](https://github.com/InterviewReady/system-design-resources)
- [ ] [https://github.com/donnemartin/system-design-primer](https://github.com/donnemartin/system-design-primer)
- [ ] [https://github.com/karanpratapsingh/system-design](https://github.com/karanpratapsingh/system-design)
- [ ] [https://github.com/charlax/professional-programming](https://github.com/charlax/professional-programming)

### Tech Blogs:
- [ ] https://blog.bytebytego.com
- [ ] https://newsletter.pragmaticengineer.com
- [ ] https://newsletter.systemdesign.one
- [ ] https://highscalability.com
- [ ] https://engineering.fb.com
- [ ] https://www.notion.so/blog/topic/tech
- [ ] https://www.notion.so/blog
- [ ] https://netflixtechblog.com
- [ ] https://stripe.com/blog
- [ ] https://medium.com/airbnb-engineering
- [ ] https://www.uber.com/en-IN/blog/engineering
- [ ] https://medium.com/pinterest-engineering
- [ ] https://www.figma.com/blog/engineering
- [ ] https://engineering.atspotify.com
- [ ] https://slack.engineering
- [ ] https://blog.x.com
- [ ] https://engineering.ramp.com
- [ ] https://www.canva.dev/blog/engineering
- [ ] https://stackoverflow.blog/engineering
- [ ] https://medium.com/paypal-tech
- [ ] https://medium.engineering
- [ ] https://dropbox.tech
- [ ] https://quoraengineering.quora.com
- [ ] https://blog.cloudflare.com
- [ ] https://www.akamai.com/blog
- [ ] https://www.linkedin.com/blog/engineering
- [ ] https://discord.com/category/engineering
- [ ] https://www.grammarly.com/blog
- [ ] https://eng.lyft.com
- [ ] https://medium.com/strava-engineering
- [ ] https://medium.com/tinder
- [ ] https://www.toptal.com/developers/blog
- [ ] https://github.blog/engineering
- [ ] https://devblogs.microsoft.com/engineering-at-microsoft
- [ ] https://blogs.nvidia.com

---
## Navigation:
- [A Senior Engineer's Guide to the System Design Interview](#a-senior-engineers-guide-to-the-system-design-interview)
- [System Design Interview Requirements](#system-design-interview-requirements)
- [Availability](#availability)
- [System Design Blueprint: The Ultimate Guide](#system-design-blueprint-the-ultimate-guide)
- [High Load Applications Architecture](#high-load-applications-architecture)
- [System Design Basics](#system-design-basics)
- [Networking](#networking)
- [Web servers](#web-servers)
- [TCP vs UDP](#tcp-vs-udp)
- [OSI Model](#osi-model)
- [DNS](#dns)
- [Network protocols](#network-protocols)
- [CDN](#cdn)
- [Edge computing](#edge-computing)
- [Security](#security)
- [Proxies](#proxies)
- [Forward Proxy and Reverse Proxy](#forward-proxy-and-reverse-proxy)
- [Load Balancing strategies](#load-balancing-strategies)
- [Consistent Hashing](#consistent-hashing)
- [Databases](#databases)
- [SQL Database](#sql-database)
- [ACID](#acid)
- [Replication](#replication)
- [Sharding](#sharding)
- [Indexing](#indexing)
- [NoSQL Database](#nosql-database)
- [BASE](#base)
- [SQL vs NoSQL](#sql-vs-nosql)
- [Quorum](#quorum)
- [Consensus](#consensus)
- [Cache](#cache)
- [Cache aside](#cache-aside)
- [Read through](#read-through)
- [Write through](#write-through)
- [Write-Around](#write-around)
- [Write-Back](#write-back)
- [Deduplication Cache](#deduplication-cache)
- [CAP Theorem](#cap-theorem)
- [PACELC Theorem](#pacelc-theorem)
- [Numbers every programmer should know](#numbers-every-programmer-should-know)
- [Bloom filter](#bloom-filter)
- [Message delivery guarantees](#message-delivery-guarantees)
- [Autoscaling](#autoscaling)
- [How to avoid cascading failures in a distributed system](#how-to-avoid-cascading-failures-in-a-distributed-system)
- [Region vs Availability Zone](#region-vs-availability-zone)
- [Resource monitoring](#resource-monitoring)
- [Stress Testing](#stress-testing)
- [System Design Interview Questions](#system-design-interview-questions)
- [Design Rate Limiter #1](#design-rate-limiter-1)
- [Design Consistent hashing](#desing-consistent-hashing)
- [Design a key-value store](#desing-a-key-value-store)
- [Design a unique ID generator in distributed systems](#desing-a-unique-id-generator-in-distributed-systems)
- [Design a URL shortening service like TinyURL](#design-a-url-shortening-service-like-tinyurl)
- [Design a web crawler](#design-a-web-crawler)
- [Design a notification system #1](#design-a-notification-system-1)
- [Design a news feed system](#design-a-news-feed-system)
- [Design a chat system](#design-a-chat-system)
- [Design a search autocomplete system](#desing-a-search-autocomplete-system)
- [Design video sharing service](#desing-video-sharing-service)
- [Design a google drive](#design-a-google-drive)
- [Distributed Message Queue](#distributed-message-queue)
- [Design a notification service #2](#desing-a-notification-service-2)
- [Design Rate Limiter #2](#design-rate-limiter-2)
- [Design a distributed cache system](#design-a-distributed-cache-system)
- [Top K Problem #2](#top-k-problem-2)
- [Design a system to count video views](#desing-a-system-to-count-videos-views)
- [URL shortening system questions about system](#url-shortening-system-questions-about-system)
- [Fraud detection system](#fraud-detection-system)
- [Authentication and authorization system](#authentication-and-authorization-system)
- [Monitoring system](#monitoring-system)
- [Design TikTok/Instagram reels system](#design-tictokinstagram-reels-system)
- [Online Judge for coding contests](#online-judge-for-coding-contests)
- [Design an Amazon S3 or Object Storage](#design-an-amazon-s3-or-object-storage)
- [Design a Dropbox](#design-a-dropbox)
- [Design Reddit home page feed](#design-reddit-home-page-feed)
- [Design Parking Garage](#design-parking-garage)
- [Design TikTok #2](#design-tictok)
- [Design Facebook Messenger](#design-facebook-messenger)
- [Design Instagram](#design-instagram)
- [Design Amazon Kindle Payments](#design-amazon-kindle-payments)
- [Design Amazon Prime Video](#design-amazon-prime-video)
- [Downloading User Data](#downloading-user-data)
- [Design Calendar data entities mapping](#design-calendar-data-entities-mapping)
- [Twitter API](#twitter-api)
- [Design architecture for Instagram likes and comments](#design-architecture-for-instagram-likes-and-comments)
- [Design instagram #2](#design-instagram-2)
- [Cloud design patterns](#cloud-design-patterns)
- [Five common system design interview mistakes](#five-common-system-design-interview-mistakes)

---
## [A Senior Engineer's Guide to the System Design Interview](https://interviewing.io/guides/system-design-interview)
### Part 1:
- Do not panic!
- Don't think like a coder. Think like a Tech Lead.
- During the interview, you'll spend an hour playing the role of a `Tech Lead`, so just pretend that the interviewer is a junior engineer who will be implementing your design.
- There are no optimal solutions in system design interviews.
- It's your responsibility to leave breadcrumbs for the interviewer to `go where you want them to go`. That way you have them walk you down the road where you are at your best.
- You do not need to display deep expertise in the given problem domain.
- Interviewers want to see that you have a broad, base-level understanding of system design fundamentals.
- Interviewers want to engage you in a back-and-forth conversation about problem constraints and parameters, so avoid making assumptions about the prompt.
- Interviewers are not looking for specific answers with ironclad certainty. They want to see well-reasoned, qualified decisions based on engineering trade-offs.
- Interviewers are not looking for a predefined path from the beginning to end of the problem. They want to see the unique direction your experience and decisions take them.
- Interviewers seek a holistic view of a system and its users.
- `Communicate honestly` about what you know and what you don't.
- For a senior role, it's a good sign if you direct more of the interview.
- A common failure point occurs when candidates don't make decisions.
- Not being familiar with specific databases or other components is fine. Be smart and don't say brand names just for the sake of saying them.
- _"I'm going to use Cassandra ..."_ unless you are VERY familiar with it, because the next question will be: _"Why Cassandra and not some_other_db?"_
- _"I will use Kafka ..."_ unless you're prepared to explain how Kafka works and to talk about other types of queues, because they may ask you: _"Oh, Kafka, interesting choice. Why that instead of [some other queue]?"_

### Part 2:
- There's no right way to design a system.
- If the interviewer interrupts you, it's probably because you're going off track.
- It's fine if the interviewer asks you questions, but it's a bad sign if the interviewer starts telling you how to do things.
- It's more important to cover everything broadly than it is to explain every small thing in detail.
- Whatever decision you make, explain why. In a system design interview, why is more important than what. For anything you say, be prepared to explain why.
- Mock interviews with different types of interviewers are `the best solution we've found to refining your communication skills`, as is working with a dedicated coach who can get to know you (and your areas of expertise and improvement) very well.
- There is no strictly wrong answer for which database to use, as long as you can justify yourself and demonstrate an understanding of the alternative options.

### Part 3:
- You should start with the functional requirements first, that is, the core product features and use cases that the system needs to support.
- What makes these questions difficult is not that they are inherently complicated. It's that the interviewer intentionally withholds information, information that you can only get if you ask the right questions.
- Treat the system as a black box. No thinking about design, implementation, or pretty much anything technical. The sole goal of this first step is to specify what needs to be built. Not how. Not the scale. Focus on the "what."
- Consider mutability. Can tweets be edited after they're published? Can tweets be deleted?
- It might sound like a small detail at first, but `mutability can limit our ability` to use caching in our design.
- Your interviewer cares less about your decisions, and more about whether you are able to talk about the trade-offs (positives and negatives) of your decisions.
- We measure availability by the percentage of the time the system is up and running. A common goal is to aim for five nines, i.e., 99.999% availability, which is less than 6 minutes of downtime a year.
- Whenever there is user-generated code execution involved (aka low trust code), running it in isolation should be a non-functional security requirement.
- Want speed, use a cache. Want availability, put in some redundancy. Want fast reads to db, use a read replica. Want fast writes to db, use sharding.
- Consider using caching when all three of these are true:
- Computing the result is costly
- Once computed, the result tends to not change very often (or at all)
- The objects we are caching are read often

### Part 4:
- System Design problems are almost always easier in the immutable case (compared to the mutable one).
- `How to Get Yourself Unstuck - Tip #1` - To simplify the problem, think about the data in the problem as immutable. You can choose to add mutability back in later, after you have an initial working design.
- `How to Get Yourself Unstuck - Tip #2` - Ask your interviewer: "Are there any important requirements you have in mind that I've overlooked?"
- `How to Get Yourself Unstuck - Tip #3` - Ask your interviewer if they'd like to see some calculations before jumping in and starting them; you might be able to skip these entirely if the interviewer doesn't care about them!
- `How to Get Yourself Unstuck - Tip #4` - Follow these rough guides to get the basic estimates for any system (see the sketch after this list):
- `Storage` Estimation: Storage = daily data used by 1 user * DAU count * length of time to store data
- `Bandwidth` Estimation: Bandwidth per second = (daily data used by 1 user * DAU count) / total seconds in a day
- `How to Get Yourself Unstuck - Tip #5` - Ask: "Would it be fine if the data in my system was occasionally wrong for a split second or so?"
- If the answer is **yes**, then you probably want `eventual consistency`.
- If the answer is **no**, then you're looking for `strong consistency`, specifically linearizability.
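To make Tip #4 concrete, here is a minimal back-of-envelope estimator for the two formulas above; the inputs (2 KB per user per day, 10M DAU, 5 years of retention) are made up for illustration.

```python
def storage_bytes(daily_bytes_per_user: int, dau: int, retention_days: int) -> int:
    """Storage = daily data used by 1 user * DAU count * length of time to store data."""
    return daily_bytes_per_user * dau * retention_days

def bandwidth_bytes_per_sec(daily_bytes_per_user: int, dau: int) -> float:
    """Bandwidth per second = (daily data used by 1 user * DAU count) / seconds in a day."""
    return daily_bytes_per_user * dau / 86_400

daily, dau = 2_000, 10_000_000                 # 2 KB per user per day, 10M DAU
print(f"storage for 5 years: {storage_bytes(daily, dau, 5 * 365) / 1e12:.1f} TB")   # ~36.5 TB
print(f"write bandwidth: {bandwidth_bytes_per_sec(daily, dau) / 1e6:.2f} MB/s")     # ~0.23 MB/s
```

---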
## System Design Interview Requirements
> Do not jump right in to give a solution. Slow down.
> Think deeply and ask questions to clarify requirements and assumptions.
> This is extremely important.

- **Requirement clarifications (3-5 min)**
- Ask clarifying questions to understand the problem and expectations of the interviewer.
- Use cases
- Scenarios that will not be covered
- Who will use
- How many will use
- Usage patterns
- Technology stack
- **Functional Requirements**
- Start with the customer and work backwards
- Who will use the system
- How the system will be used
- Platform (Mobile, Web, Desktop)
- **Non-Functional Requirements**
- `High availability` system uptime in percentage
- 99% => 3.65 days of downtime per year
- 99.9% => 8.76 hours of downtime per year
- 99.99% => 52 minutes 35 seconds of downtime per year
- 99.999% => 5 minutes 15 seconds of downtime per year
- 99.9999% => 31 seconds of downtime per year
- `Scalability` needs to handle peak load
- `Performance` in terms of latency and throughput
- `Durability` - data should not be lost
- `Consistency` - data should not be corrupted
- Strong consistency
- Weak consistency
- `Maintainability`
- Failure modes and mitigations
- Monitoring
- Testing
- Deployment
- `Security`
- `Cost`
- **Estimations (3-5 min)**
- Latency/Throughput expectations
- QPS (Queries Per Second) Read/Write ratio
- Total/Daily active users
- Traffic estimates
- Write (QPS, Volume of data)
- Read (QPS, Volume of data)
- Storage estimates
- Memory estimates
- If we are using a cache, what kind of data do we want to store in the cache?
- How much RAM and how many machines do we need to achieve this?
- Amount of data to store on disk/SSD
- Number of users/tweets/likes etc
- **Start with something simple**
- Single machine solution only
- Single service solution
- Design monolith service and then break it into microservices
- **High level design (5-10 min)**
- APIs for Read/Write scenarios for crucial components
- Database schema
- Basic algorithm
- High level design for Read/Write scenario
- High level design for `Read heavy` scenario
- High level design for `Write heavy` scenario
- **Design details (15-20 min)**
- Scaling the algorithm
- Scaling individual components:
- Availability, Consistency and Scale story for each component
- Consistency and availability patterns
- Think about the following components, how they would fit in and how it would help
- Client (Mobile, Browser)
- DNS
- CDN (Push vs Pull)
- Load Balancers (Active-Passive, Active-Active, Layer 4, Layer 7)
- Reverse Proxy
- Blob / Object Storage
- Application layer scaling (Microservices, Service Discovery)
- DB (RDBMS, NoSQL)
- RDBMS
- Master-slave, Master-master, Federation, Sharding, Denormalization, SQL Tuning
- NoSQL
- Key-Value, Wide-Column, Graph, Document
- Fast-lookups:
- RAM (Bounded size) => Redis, Memcached
- AP (Unbounded size) => Cassandra, Riak, Voldemort
- CP (Unbounded size) => HBase, MongoDB, Couchbase, DynamoDB
- Read heavy: MongoDB, Couchbase
- Write heavy: Cassandra, ScyllaDB
- Caches
- Client caching, CDN caching, Webserver caching, Database caching, Application caching, Cache @Query level, Cache @Object level
- Cache update strategies:
- Cache aside
- Write through
- Write behind
- Refresh ahead
- Asynchronism
- Message queues
- Task queues
- Back pressure
- Communication
- TCP
- UDP
- REST
- RPC
- Web Sockets
- HTTP/S
- Logging, Metrics and Automation
- **Justify (5 min)**
- Throughput of each layer
- Latency caused between each layer
- Overall latency justification
- **Resolve Bottlenecks and Follow-up Questions (2-3 min)**
- Error cases (server failure, network loss, etc.) and how to handle them
- Bottlenecks
- How to monitor the system and what metrics to collect
- How to scale the system to the next level

![1.png](./images/system.design.3/1.png)
![2.png](./images/system.design.3/2.png)
### Availability
![numbers_4.png](./images/system.design/numbers_4.png)
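As a cross-check of the availability table earlier in these notes, a small sketch that converts an availability target into allowed downtime per year:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability_pct: float) -> str:
    seconds = SECONDS_PER_YEAR * (1 - availability_pct / 100)
    days, rem = divmod(seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    return f"{int(days)}d {int(hours)}h {int(minutes)}m {int(secs)}s"

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines)} of downtime per year")   # 99.999% -> 5m 15s
```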
---
## System Design Blueprint: The Ultimate Guide
![sys_design_blue_print.png](./images/system.design/sys_design_blue_print_1.png)
![sys_design_blue_print_2.png](./images/system.design/sys_design_blue_print_2.png)
## High Load Applications Architecture
### Problems
- `Single point of failure` - `SPOF` is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability.
- `Bottlenecks` - A bottleneck occurs when the capacity of an application or a computer system is severely limited by a single component, which has the lowest throughput of all parts of the transaction path. **The system works only as fast as its slowest part.**

### Main principles
- Reliability
- Scalability
- Efficiency
- Performance

### Simplicity is key
- It is better to have a bunch of simple components in the system.
- Simple things are easier to understand, analyze, improve, replace.
- Look at `SOLID`. The `S` there stands exactly for that - the `Single Responsibility Principle`. Each component in the system should serve one and only one goal.

### Proper approach for building High Load Applications
- Always keep in mind what your main goal is
- Start with `simplest` solution (**Simple is not weak**)
- Continuously look for `bottlenecks` and `SPOFs` - improve the first and avoid the second
- Split complex things into `simple` ones

### Architecture Patterns
![common.png](images/common.png)
#### Load Balancer
- This pattern is used to distribute incoming traffic evenly across multiple servers to prevent overload and improve the system's overall performance. Load balancing patterns can be applied at different levels, including network, transport, application, and database.
#### N Tier
- This pattern involves dividing an application into separate tiers, each responsible for a different aspect of the application, making it easier to scale and manage
![n-tier.webp](images/system.design.6/n-tier.webp)
#### Event Notification
- A pattern in which a system sends an event in order to notify other systems of a change.
- A defining feature of this architecture is that the sender doesn't expect a response to the event at all.
- There should be a separation between the logic that sends the event and any logic that gets triggered by it.
- This pattern is handy because it achieves a lower level of coupling.

![event-notification.png](images/system.design.6/event-notification.png)
#### Event-carried State Transfer
- The event-carried state transfer pattern is very similar to the event notification pattern.
- It often shows up when you're looking for less communication between systems.
- This is achieved by duplicating data across multiple systems
- Write operations are done on the `master`, and then the `replica` can be accessed by other systems.

![5.png](images/system.design.6/5.png)
#### Event Sourcing
- This pattern is less commonly used, but the main idea of event sourcing is that every change to a system's state is recorded as an event.
- You should be able to recreate the current state by replaying all previous events.
- This is a fundamentally different way of looking at a system because the event store is the source of truth, not the database.
- A commonly used example for describing this pattern is a version control system like git where each commit is an event.
- Built for `audit` and `replay` purposes. Finance systems use this pattern a lot.
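The replay idea is small enough to sketch; a minimal example where the event log is the single source of truth (the `Event` type and the deposit/withdraw event kinds are illustrative, not from any particular framework):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str        # "deposited" or "withdrawn"
    amount: int

def replay(events: list[Event]) -> int:
    """Recreate the current state purely by replaying recorded events."""
    balance = 0
    for e in events:
        balance += e.amount if e.kind == "deposited" else -e.amount
    return balance

log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
assert replay(log) == 75         # state is derived from the log, never stored directly
```

#### CQRS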
- `Command Query Responsibility Segregation` Pattern: This pattern separates read and write operations, allowing for the scaling and optimization of each independently.
![6.png](images/system.design.6/6.png)
#### Map-Reduce
- This pattern is commonly used in big data processing and involves splitting large data sets into smaller chunks, processing them independently, and then combining the results.
![7.png](images/system.design.6/7.png)
#### Microservices
- It is a variant of the service-oriented architecture (`SOA`) style that structures an application as a collection of loosely coupled services.
- In a microservices architecture, services are fine-grained and the protocols are lightweight.
---
## System Design Basics
## Networking
### Web Servers
- A web server is a software application or hardware device that serves static content and proxies requests for dynamic content to application servers.
- It listens for requests on a specific port (usually port 80 for HTTP or 443 for HTTPS) and responds with the requested content.
- Web server market share: https://trends.builtwith.com/web-server
![8.png](images/system.design.6/8.png)
#### Nginx
- Load balancer
- Web server
- Cache server
- Static content server
- No dynamic content handling
- Compression (_this is very useful because the internet is slow but the CPU is fast_)

```nginx
user www-data;
worker_processes 4;
pid /run/nginx.pid;
daemon off;

events {
    worker_connections 2048;
    multi_accept on;
    use epoll;
}

http {
    server_tokens off;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 15;
    types_hash_max_size 2048;
    client_max_body_size 20M;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    access_log /dev/stdout;
    error_log /dev/stderr;

    gzip on;
    gzip_disable "msie6";
    gzip_comp_level 5;
    gzip_min_length 256;
    gzip_proxied any;
    gzip_vary on;
    gzip_types
        application/atom+xml
        application/javascript
        application/json
        application/rss+xml
        application/vnd.ms-fontobject
        application/x-font-ttf
        application/x-web-app-manifest+json
        application/xhtml+xml
        application/xml
        font/opentype
        image/svg+xml
        image/x-icon
        text/css
        text/plain
        text/x-component;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers 'ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS';

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-available/*.conf;

    open_file_cache off;
    charset UTF-8;
}
```

- Examples:
- https://github.com/linnykoleh/nginx
- https://github.com/laradock/laradock/tree/master/nginx
- https://medium.com/@jituboss/nginx-and-php-fpm-optimization-for-high-traffic-web-applications-f790bf1b30fb

### Applications in High Load
- `Handling High Traffic`: In high-load applications, web servers must efficiently handle a large number of concurrent requests without significant delays or downtime.
- `Scalability`: Web servers should be scalable, meaning they can handle increasing loads by adding resources (such as memory, CPU, or additional servers) without requiring significant changes to the architecture.
- `Performance Optimization`: High-load applications often require optimizations such as `caching`, `load balancing`, and `content compression` to ensure that the web server can handle the traffic without compromising on performance.

### TCP vs UDP
![Graphic-UDP-Vs-TCP-Diagram_1150x685-700x417.png](./images/system.design/Graphic-UDP-Vs-TCP-Diagram_1150x685-700x417.png)
![table-tcp-udp.png](./images/system.design/table-tcp-udp.png)
- `TCP` is a connection-oriented protocol, which means that before data can be transferred, two computers must first
establish a connection.
- Reliable
- Connection-oriented
- Slow
- Used for HTTP, emails, file transfers, SSH.
- `UDP` is a connectionless protocol, which means that data can be transferred as soon as you have the IP address and
port number of the destination computer.
- Unreliable
- Connectionless
- Fast
- Used for DNS, streaming, video, online gaming.
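To make the contrast concrete, a short sketch with Python's standard socket API (the hosts and ports are placeholders): TCP performs a handshake before any data flows, while UDP sends standalone datagrams with no delivery guarantee.

```python
import socket

# TCP: connection-oriented, reliable, ordered byte stream.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("example.com", 80))      # three-way handshake happens here
tcp.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
print(tcp.recv(1024))                 # delivery and ordering are guaranteed
tcp.close()

# UDP: connectionless and fast; each send is one datagram, possibly lost.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"ping", ("127.0.0.1", 9999))   # no handshake, no delivery guarantee
udp.close()
```

### OSI Model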
![original-seven-layers-of-osi-model-1627523878-JYjV8oybcC.png](./images/system.design/original-seven-layers-of-osi-model-1627523878-JYjV8oybcC.png)
### DNS
![dns.png](./images/system.design/dns.png)
![dns.png](./images/system.design.3/dns.png)
### Network protocols
![1696882344725.gif](./images/system.design.3/1696882344725.gif)
---
## CDN
![cdn_cover.png](./images/system.design/cdn_cover.png)
- `CDN` is a system of distributed servers (network) that deliver webpages and other web content to a user based on the
geographic locations of the user, the origin of the webpage and a content delivery server.
- `Edge server` is a server that is located at the edge of the network and is responsible for caching content. It is
usually located in a data center that is close to the user. The edge server caches content from the origin server and
serves it to users when requested. The edge server can also serve static content directly to users without
contacting the origin server. This reduces the load on the origin server and improves performance.
- `Origin server` is a server that stores the original, definitive versions of web pages and other files. It is usually
located in a data center that is far from the user. The origin server receives requests from edge servers and
fulfills them by sending the appropriate content.

### Edge computing
- `Edge computing` is a networking philosophy focused on bringing computing as close to the source of data as possible
in order to reduce latency and bandwidth use. In simpler terms, edge computing means running fewer processes in the
cloud and moving those processes to local places, such as on a userโs computer, an IoT device, or an edge server.
Bringing computation to the networkโs edge minimizes the amount of long-distance communication that has to happen
between a client and server.

---
## Security
- `Authentication` - figuring out who you're talking to
- `Authorization` - figuring out what they're allowed to do
- `Secure password storage` is the use of a cryptographic hash function to store passwords in a way that makes it
difficult for an attacker to recover the original password.

![security_1.png](images/system.design.4/security_1.png)
![security_2.png](images/system.design.4/security_2.png)
- `Salting` is the process of adding a random string of characters to a password before hashing it. This makes it more
difficult for an attacker to crack the password using a precomputed hash table or rainbow table.
- `Session Tokens` are used to authenticate users and authorize access to resources. A session token is a unique string
of characters that is generated when a user logs in and is stored in a cookie or in the URL. The session token is
sent with each request to the server to identify the user and authorize access to resources.
- Session tokens should also come with an expiration date, as short as feasible.
- Session token is equivalent to a password, so it should be stored securely.
- `JSON Web Tokens` (JWT) are a standard for representing claims securely between two parties. JWTs are signed using a
secret key or a public/private key pair, which makes them tamper-evident. JWTs can be used to authenticate
users and authorize access to resources.

![security_3.png](images/system.design.4/security_3.png)
- `Cookies` - by storing a session token or JWT in a cookie, we can ensure that all subsequent requests will include it and allow the server to validate the current user session.
### Summary of web authentication and basic security
1. The user signs up. At this point, we need to salt and hash their password and store those values (but not the password itself!).
2. The user logs in with their username and password. We verify the password by hashing it with the stored salt and checking to see if it matches the stored hash (ideally using a secure library to make the comparison). We then send some kind of identifying token, either a simple session token or a JWT or similar token, back to the client in a cookie set header.
3. On subsequent requests, the browser sends the cookie back to the server, where we can verify the session token or check the signature on/decrypt a JWT.
4. Periodically, the session token or JWT should be expired and a new one generated and sent down to the client with a cookie set header.
5. Eventually, the user's session may expire from inactivity. In this case, we go back to step 2.
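A minimal sketch of steps 1 and 2 using only Python's standard library; the PBKDF2 parameters here are illustrative, and a production system would usually reach for a vetted library such as bcrypt or argon2 instead.

```python
import hashlib, hmac, os

def sign_up(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)                             # random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest                               # store these, never the password

def log_in(password: str, salt: bytes, stored: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, stored)     # constant-time comparison

salt, digest = sign_up("hunter2")
assert log_in("hunter2", salt, digest)
assert not log_in("wrong-password", salt, digest)
```

---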
## Proxies
### Forward Proxy and Reverse Proxy
- `Forward proxy` and a reverse proxy are two types of proxies used to provide additional security, anonymity, and
caching for web traffic. A forward proxy acts as an intermediary between clients and servers on the Internet. When a
client requests a web page or other resources from a server, the request is first sent to the forward proxy. The
forward proxy then retrieves the requested content on behalf of the client and returns it to the client. This can
provide additional security and privacy for clients by hiding their IP addresses and location information from the
server. Forward proxies are often used in corporate networks to control access to the Internet and to improve
performance by caching frequently accessed content. In short, a `forward proxy` provides security and caching for client
requests.
- `Reverse proxy`, on the other hand, sits between the client and the server, intercepting requests from the client
and forwarding them to the appropriate server. The client believes it is communicating directly with the server, but
in reality, it is communicating with the reverse proxy. The reverse proxy can provide additional security by filtering
requests and blocking malicious traffic. It can also improve performance by caching frequently accessed content and
distributing traffic across multiple servers. In short, a `reverse proxy` provides security, load balancing, and caching
for server requests.

![forward_proxy_vs_reverse_proxy-f_mobile.png](./images/system.design/forward_proxy_vs_reverse_proxy-f_mobile.png)
## Load Balancing strategies
- **Round Robin** - The simplest load balancing method. The load balancer cycles through the available servers, and
each server handles an equal number of requests. If one of the servers goes down, the load balancer stops sending
requests to that server.
- **Least Connections** - The load balancer tracks the number of active connections on each server. When a new
request comes in, it is sent to the server with the fewest open connections. This method adapts well when requests
vary in duration, because a server bogged down with long-running connections automatically receives fewer new
requests.
- **IP Hash** - Each request from a client is always sent to the same server, based on the client's IP
address. If a server goes down, the load balancer redistributes requests from the clients mapped to that server
across the remaining servers, which could overload those servers. If a new server is added, the load balancer may
remap some clients from existing servers to the new server, which can cause a performance hit.
- **Weighted Round Robin** - Similar to round robin, but some servers are weighted more heavily than others. This
allows you to balance the load between servers based on their capabilities. For example, you might have one
powerful server and three smaller servers. You could set the weights so that the powerful server handles 60% of
the requests and each of the smaller servers handles 20%.
- **Least Response Time** - The load balancer tracks an average response time for each server. When a new request
comes in, it is sent to the server with the lowest average response time. This method adapts well when servers
differ in speed, since slower servers automatically receive fewer requests.

![the-five-most-common-balancing-methods.jpg](./images/system.design/the-five-most-common-balancing-methods.jpg)
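A few of these strategies are small enough to sketch directly; the server names and weights below are made up for illustration.

```python
import hashlib
import itertools
import random

servers = ["s1", "s2", "s3"]

# Round Robin: cycle through the servers, one request each.
rr = itertools.cycle(servers)
print([next(rr) for _ in range(6)])        # ['s1', 's2', 's3', 's1', 's2', 's3']

# Weighted Round Robin (randomized variant): the powerful server takes 60%
# of the traffic, the two smaller ones 20% each.
print(random.choices(["big", "small1", "small2"], weights=[60, 20, 20], k=5))

# IP Hash: the same client IP always maps to the same server.
def ip_hash(client_ip: str) -> str:
    digest = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)   # stable across processes
    return servers[digest % len(servers)]

assert ip_hash("10.0.0.7") == ip_hash("10.0.0.7")
```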
## Consistent Hashing
- `Consistent hashing` is a special kind of hashing such that when a hash table is resized, only `K/n` keys need to
be remapped on average, where `K` is the number of keys, and `n` is the number of slots. `Consistent hashing` is
useful in situations where we need to add or remove nodes without significant reorganization of the data. It is
also useful for caching, where we need to store cached data on multiple machines, and we want to add or remove
machines without invalidating all the cached data.
- Servers are placed on a `circle` (hash ring) at positions given by the hash of their `IP address`. When we need to add a
new server, we simply place it on the circle at the hash of its IP address; when we need to remove a server, we just remove it from the circle.
- We move `clockwise` from the hash of the key we are looking for. The first server we come to is the server that
stores the value for that key.
- The idea of `consistent hashing` is to distribute the load evenly across the servers. When a new server is added,
only a small portion of the keys need to be remapped. When a server is removed, the keys that were assigned to
that server are evenly distributed among the remaining servers.

![Consistent Hashing.png](./images/system.design/Consistent_Hashing.png)
---
## Databases
### SQL Database
- `Relational database` is a database that stores and provides access to data points that are related to one another.
Relational databases are based on the relational model, an intuitive, straightforward way of representing data in
tables. Each row in the table is a record with a unique ID called the key. The columns of the table hold attributes
of the data, and each record usually has a value for each attribute, making it easy to establish the relationships
among data points.

- `Advantages of SQL Databases`:
- SQL offers more powerful querying out of the box
- SQL has stronger ACID guarantees out of the box
- `Disadvantages of SQL Databases`:
- B-Trees, used in SQL DBs, are slower to write into
- Strong consistency makes low latency expensive to achieve
- SQL does *not* work well for mixed-schema data
- SQL databases are hard to scale horizontally

#### ACID
- **Atomicity** - Atomicity guarantees that each transaction is "all or nothing": if one part of the transaction
fails, the entire transaction fails, and the database state is left unchanged. For example, if a transfer
transaction is supposed to debit $100 from one account and credit $100 to another, but the debit succeeds and the
credit fails, then the entire transaction fails, and the balance in both accounts remains unchanged.
- **Consistency** - Consistency ensures that any transaction will bring the database from one valid state to
another. Any data written to the database must be valid according to all defined rules, including constraints,
cascades, triggers, and any combination thereof.
- **Isolation** - Isolation ensures that the concurrent execution of transactions results in a system state that
would be obtained if transactions were executed serially, i.e., one after the other. The database guarantees
that the results of any transaction are as if transactions were executed one at a time. Transactions are
isolated from each other. The effects of any given transaction are not visible to any other transaction until
that transaction has been completed.
- **Durability** - Durability ensures that once a transaction has been committed, it will remain so, even in the
event of power loss, crashes, or errors.

#### Replication
- Some key terms to understand for `replication`:
- `Replica`: Copy of data
- `Leader`: Machine that handles write requests to the data store.
- `Followers`: Machines that are replicas of the leader node, and cater to read requests.
- `Replication` is the process of copying data from one database to another. Replication is used to increase
availability and reliability of data. It is also used to distribute data geographically so that it can be
accessed locally from various places.
- Replication strategies:
- **Master-slave replication** - In this strategy, all writes are done to the master database, and the slave
databases are updated with the changes. If the master database goes down, one of the slave databases can be
promoted to be the new master. This strategy is simple and easy to implement, but writes do not scale, since they all go through the single master.
- **Master-master replication** - In this strategy, there are two or more master databases, each of which accepts both
reads and writes and replicates its changes to the others. This strategy can scale well, but it is more complex to
implement, and it can lead to conflicts when the same data is modified concurrently on both masters.
- **Federation** - Federation (functional partitioning) splits up databases by function, so each database is treated
as a separate entity. Applications connect directly to the appropriate database, with no single master database.
This strategy can scale well, but it is more complex to implement, and queries that span multiple databases are hard.

![replication.png](./images/system.design/replication.png)
![replication_2.png](./images/system.design/replication_2.png)
- A `master database` generally only supports `write` operations. A `slave database` gets copies of
the data from the master database and only supports read operations. All the data-modifying
commands like insert, delete, or update must be sent to the master database. Most
applications require a much higher ratio of reads to writes; thus, the number of slave
databases in a system is usually larger than the number of master databases.

![replication_3.png](./images/system.design/replication_3.png)
![leader leader replication.png](./images/system.design.4/leader_leader_replication.png)
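A sketch of how the application layer typically routes queries under master-slave replication; the connection objects and their `run` method are stand-ins, not a real driver API.

```python
import random

class ReplicatedDB:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, sql: str):
        return self.master.run(sql)                 # all data-modifying commands hit the master

    def read(self, sql: str):
        return random.choice(self.slaves).run(sql)  # reads are spread across the replicas
```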
#### Sharding
![sharding.png](images/system.design.4/sharding.png)
- `Sharding` is the process of splitting one large data set across multiple databases, with each shard holding a
subset of the data. Sharding is used to increase the scalability and performance of databases. It can also be used to
distribute data geographically so that it can be accessed locally from various places.
- Sharding strategies:
- **Hash-based sharding** - Hash-based sharding is the most common sharding strategy. In this strategy, each
shard is responsible for a range of the hash values. For example, if we have three shards, one shard may be
responsible for hash values 0-33, the second shard may be responsible for hash values 34-66, and the third shard
may be responsible for hash values 67-99. When a new record is inserted into the database, the hash value of the
key is computed, and the record is inserted into the shard that is responsible for that hash value range.
- **Range-based sharding** - Range-based sharding is similar to hash-based sharding, except that the shards are
responsible for a range of the actual values of the keys. For example, if we have three shards, one shard may
be responsible for keys a-m, the second shard may be responsible for keys n-z, and the third shard may be
responsible for keys 0-9. When a new record is inserted into the database, the key is compared against the
ranges, and the record is inserted into the shard that is responsible for that key range.
- **List sharding** - List sharding is similar to range-based sharding, except that the shards are responsible
for a list of the actual values of the keys. For example, if we have three shards, one shard may be responsible
for keys a, d, g, j, m, p, s, v, y, and the second shard may be responsible for keys b, e, h, k, n, q, t, w, z,
and the third shard may be responsible for keys c, f, i, l, o, r, u, x. When a new record is inserted into the
database, the key is compared against the lists, and the record is inserted into the shard that is responsible
for that key.
- **Directory sharding** - Directory sharding uses a lookup table (the directory) that maps each key, or group of
keys, to the shard that holds it. For example, the directory may map keys starting with a-m to the first shard,
keys starting with n-z to the second shard, and keys starting with 0-9 to the third shard. When a new record is
inserted into the database, the directory is consulted, and the record is inserted into the shard that the
directory names for that key.

![DB_image_2_cropped.png](./images/system.design/DB_image_2_cropped.png)
![DB-Shard-Pic-1-e1513014455288.png](./images/system.design/DB-Shard-Pic-1-e1513014455288.png)
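A minimal hash-based shard router corresponding to the first strategy above; the shard count and key format are illustrative.

```python
import hashlib

NUM_SHARDS = 3

def shard_for(key: str) -> int:
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return digest % NUM_SHARDS   # stable mapping: the same key always lands on the same shard

assert shard_for("user:1001") == shard_for("user:1001")
print(shard_for("user:1001"), shard_for("user:1002"))
```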
#### Indexing
- `B-trees` are self-balancing trees that can achieve multilevel indexing. They are a generalized form of Binary Search Trees. The data is stored in sorted order in B-trees. A B-tree achieves efficient utilization of space in nodes, along with keeping the height of the tree small.
![b-tree.png](images/system.design.5/b-tree.png)
- `B+ trees` are an extension of B-trees. The major differences in the data structure are:
- Only the leaf nodes store the record or reference to the record.
- All the leaf nodes are connected to form a linked list. This enables sequential access along with direct access.

### NoSQL Database
- `NoSQL database` is a database that stores and provides access to data that does not have a predefined data
model. NoSQL databases are highly scalable and are designed to handle large amounts of data and high user
traffic. They are well suited for storing large sets of user data, such as social media profiles, product
catalogs, and inventory records. NoSQL databases are also well suited for storing large sets of time series data
such as clickstreams and location tracking data. NoSQL databases are not good at handling complex queries that
require joins.

- `Advantages of NoSQL Database`:
- NoSQL is faster for writes but slower to query.
- `Log-structured merge tree` is much faster for writes since you don't do anything to maintain structure when you add to it
- NoSQL has sharding and scaling out of the box
- Schema is easily changeable

- `Disadvantages of NoSQL Database`:
- NoSQL databases are more limited in the types of efficient queries that can be done
- They are less suitable for circumstances where strong consistency is required
- Transactions are limited or unsupported
- Joins are hard

#### BASE
- **Basically Available** - The system guarantees availability.
- **Soft state** - The state of the system may change over time.
- **Eventual consistency** - The system will become consistent over a period of time.

### SQL vs NoSQL
![asidvsbase.webp](./images/system.design/asidvsbase.webp)
![04fig04_alt.jpg](./images/system.design/04fig04_alt.jpg)
![photo_2023-02-15_09-50-11.jpg](./images/system.design/photo_2023-02-15_09-50-11.jpg)
![simple-map-for-db.png](images/system.design.4/simple-map-for-db.png)
### Quorum
- `Quorum` ensures that a sufficient number of nodes agree on a value to maintain consistency and allow the system to progress even if some nodes are unavailable.
- `Quorum` refers to a mechanism used to ensure consistency and availability of data in distributed systems. It is commonly used in databases with replication, such as Cassandra, MongoDB, or systems that implement consensus protocols like Paxos or Raft.
- The basic idea behind Quorum is that for an operation (such as a write or read) to be considered successful, a majority of replicas in the system must agree. The number of replicas required for the operation to succeed is known as the quorum.
- `Quorum Read`: If you have 5 replicas, and 3 of them must return the latest data for the read to be considered successful, this is called a quorum read.
- `Quorum Write`: For a write operation to be successful, it must be acknowledged by a majority of replicas.
- Overlapping quorums ensure that any read sees at least one replica that holds the most recent write. The general formula is:
- `W + R > N`:
- `W` โ the number of replicas to which a write operation must be applied (write quorum).
- `R` โ the number of replicas from which a read operation must be successful (read quorum).
- `N` โ the total number of replicas.
- Advantages of using Quorum:
- `High availability`: The system can continue to operate even if some replicas are unavailable.
- `Consistency`: Data remains consistent across the majority of replicas.
- `Scalability`: It is easier to scale the number of replicas in distributed systems.
- `Consensus` requires all (or a majority) of nodes to agree, while `Quorum` only requires agreement from a subset, leading to lower latency in Quorum.
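The `W + R > N` rule is easy to sanity-check in code; a tiny sketch:

```python
# W + R > N guarantees the read quorum overlaps the write quorum, so at
# least one replica in any read set holds the most recent write.

def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return w + r > n            # read set and write set must intersect

assert is_strongly_consistent(5, 3, 3)       # N=5, W=3, R=3: overlap guaranteed
assert not is_strongly_consistent(5, 2, 2)   # N=5, W=2, R=2: a read can miss the latest write
```

### Consensus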
- `Consensus` is a fundamental problem in distributed systems that requires multiple nodes to agree on a single value or a sequence of values.
- In distributed systems, consensus is critical to keep data consistent across nodes.
- For example:
- In a distributed database, consensus helps ensure that all replicas agree on updates or state changes, allowing users to see consistent data even if nodes go offline temporarily.
- Consensus algorithms ensure that all nodes agree on the leader or primary data source.
- `Consensus` requires all (or a majority) of nodes to agree, while `Quorum` only requires agreement from a subset, leading to lower latency in Quorum.

### CAP Theorem
- `CAP theorem` states that it is impossible for a distributed data store to simultaneously provide `more than two`
out of the following three guarantees:
- **Consistency** - Every read receives the most recent write or an error.
- **Availability** - Every request receives a (non-error) response, without the guarantee that it contains the
most recent write.
- **Partition tolerance** - The system continues to operate despite an arbitrary number of messages being
dropped (or delayed) by the network between nodes.

![cap_theorem.webp](./images/system.design/cap_theorem.webp)
- In a distributed system, `partitions` cannot be avoided, and when a partition occurs, we must
choose between `consistency` and `availability`. Suppose `n3` goes down and cannot
communicate with `n1` and `n2`. If clients write data to `n1` or `n2`, that data cannot be propagated to
`n3`. If data is written to `n3` but not yet propagated to `n1` and `n2`, then `n1` and `n2` would have stale
data.
- If we choose `consistency` over `availability` (`CP` system), we must block all write operations to
`n1` and `n2` to avoid data inconsistency among these three servers, which makes the system
unavailable. Bank systems usually have extremely high consistency requirements. For
example, it is crucial for a bank system to display the most up-to-date balance info. If
inconsistency occurs due to a network partition, the bank system returns an error until the
inconsistency is resolved.
- If we choose `availability` over `consistency` (`AP` system), the system keeps accepting
reads, even though it might return `stale` data. For writes, `n1` and `n2` will keep accepting writes,
and data will be synced to `n3` when the network partition is resolved.

For some systems, like financial systems, consistency is very important. For others, like TikTok, where it is OK if some users get access to certain videos later than the rest, we aim for availability over consistency. Even in these cases, we want our system to eventually have the same view of the data. That is called eventual consistency: systems become consistent eventually, if not immediately.
#### PACELC Theorem
![PACELC-theorem-diagram-1.webp](./images/system.design/PACELC-theorem-diagram-1.webp)
### Cache
- `Cache` is a temporary storage area for data that is accessed very frequently. It is used to increase the
performance of data retrieval operations. It is also used to reduce the load on the database by storing
frequently-accessed data in the cache.
- Types (_It's a good practice to include a brief point about cache invalidation during system design interviews_; a minimal LRU sketch follows below):
- `LRU Cache` - Least Recently Used Cache
- `LFU Cache` - Least Frequently Used Cache
- Generally, caching is used for read-heavy systems.

![cache.png](images/system.design.5/cache.png)
- https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one
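As an example of the first eviction policy, a minimal LRU cache built on `collections.OrderedDict`; the capacity of 2 is just for the demo.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                               # touch "a" so "b" becomes the LRU entry
cache.put("c", 3)                            # evicts "b"
assert cache.get("b") is None
```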
#### Cache-Aside
- The cache sits on the side and the application directly talks to both the cache and the database.
There is no connection between the cache and the primary database. All operations to cache and the database are
handled by the application. Cache-aside caches are usually general purpose and work best for `read-heavy`
workloads.
1. The application first checks the cache.
2. If the data is found in the cache, we have a cache hit. The data is read and returned to the client.
3. If the data is not found in the cache, we have a cache miss. The application has to do some extra work. It
queries the database to read the data, returns it to the client, and stores the data in the cache so that
subsequent reads for the same data result in a cache hit.

![cache_1.png](./images/system.design/cache_1.png)
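The read path above in a few lines of Python; `cache` is any dict-like store (e.g. a Redis client wrapper) and `db.query_user` is a hypothetical database call.

```python
def get_user(user_id, cache, db):
    user = cache.get(user_id)        # 1. check the cache first
    if user is not None:
        return user                  # 2. cache hit
    user = db.query_user(user_id)    # 3. cache miss: read from the database...
    cache[user_id] = user            #    ...and populate the cache for subsequent reads
    return user
```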
#### Read-Through
- `Read-through cache` sits in-line with the database. When there is a cache miss, it loads
missing data from database, populates the cache and returns it to the application. Read-through caches work best
for `read-heavy` workloads when the same data is requested many times. For example, a news story.
1. In cache-aside, the application is responsible for fetching data from the database and populating the cache.
In read-through, this logic is usually supported by the library or stand-alone cache provider.
2. Unlike cache-aside, the data model in a read-through cache cannot be different from that of the database.

![cache_2.png](./images/system.design/cache_2.png)
#### Write-Through
- `Write-Through Cache` - data is first written to the cache and then to the database. The cache sits in-line with
the database and writes always go through the cache to the main database. This helps cache maintain consistency
with the main database.
1. The application writes the data directly to the cache.
2. The cache updates the data in the main database. When the write is complete, both the cache and the database
have the same value and the cache always remains consistent.

![cache_3.png](./images/system.design/cache_3.png)
#### Write-Around
- `Write-Around` - Here, data is written directly to the database, and only the data that is read makes its way into
the cache.

#### Write-Back
- `Write-Back` or `Write-Behind` - Here, the application writes data to the cache, which stores the data and acknowledges to the application immediately. Later, the cache writes the data back to the database. Write-back caches improve write performance and are good for `write-heavy` workloads.

![cache_4.png](./images/system.design/cache_4.png)
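- A toy write-back (write-behind) sketch, assuming the same illustrative `cache` and `db` clients as above; a real implementation would batch writes, retry failures, and guard against losing acknowledged-but-unflushed data:

```python
import queue
import threading

pending_writes = queue.Queue()

def write(key, value, cache):
    """Acknowledge after the cache write; persistence happens later."""
    cache.set(key, value)
    pending_writes.put((key, value))  # durability risk until flushed

def flusher(db):
    """Background worker that drains queued writes to the database."""
    while True:
        key, value = pending_writes.get()
        db.upsert(key, value)         # illustrative database call
        pending_writes.task_done()

# threading.Thread(target=flusher, args=(db,), daemon=True).start()
```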
#### Deduplication cache
![3.png](./images/system.design.3/3.png)
---
## Numbers every programmer should know
- [Latency Numbers Every Programmer Should Know](https://colin-scott.github.io/personal_website/research/interactive_latency.html)
- `ns` = nanosecond, `µs` = microsecond, `ms` = millisecond
    - `1 ns` = 10^-9 seconds
    - `1 µs` = 10^-6 seconds = 1,000 ns
    - `1 ms` = 10^-3 seconds = 1,000 µs = 1,000,000 ns

![numbers_1.png](./images/system.design/numbers_1.png)
![numbers_3.png](./images/system.design/numbers_3.png)
![numbers_2.png](./images/system.design/numbers_2.png)
- By analyzing the numbers we get the following conclusions:
- Memory is fast but the disk is slow.
- Avoid disk seeks if possible.
- Simple compression algorithms are fast.
- Compress data before sending it over the internet if possible.
    - Data centers are usually in different regions, and it takes time to send data between them.
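- A back-of-envelope sketch built on commonly cited approximate figures (treat them as orders of magnitude, not exact measurements):

```python
# Approximate figures from the classic "latency numbers" tables (nanoseconds).
MEMORY_READ_1MB_NS = 250_000      # ~250 µs to read 1 MB sequentially from RAM
SSD_READ_1MB_NS = 1_000_000       # ~1 ms from SSD
DISK_READ_1MB_NS = 20_000_000     # ~20 ms from spinning disk

def read_1gb_ms(per_mb_ns: int) -> float:
    return per_mb_ns * 1024 / 1_000_000  # 1 GB = 1024 MB; ns -> ms

print(f"RAM:  {read_1gb_ms(MEMORY_READ_1MB_NS):8.0f} ms")  # ~256 ms
print(f"SSD:  {read_1gb_ms(SSD_READ_1MB_NS):8.0f} ms")     # ~1024 ms
print(f"Disk: {read_1gb_ms(DISK_READ_1MB_NS):8.0f} ms")    # ~20480 ms
```

---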
## Bloom filter
- `Bloom filter` is a space-efficient probabilistic data structure that is used to test whether an element is a member
of a set.
- `False positive` matches are possible
- `False negatives` are not โ in other words, a query returns either **possibly in set** or **definitely not in set**
- Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter);
- The more elements that are added to the set, the larger the probability of false positives.

![bloom_filter.png](./images/system.design/bloom_filter.png)
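- A minimal Bloom filter sketch (the bit-array size and double-hashing scheme are illustrative; real implementations size the filter from the expected element count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k positions from two 64-bit hashes (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Returns "possibly in set" or "definitely not in set".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("key1")
assert bf.might_contain("key1")   # added, so always True
print(bf.might_contain("key2"))   # almost certainly False, but may be True
```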
---
## Message delivery guarantees
- At-most-once guarantee
- At-least-once guarantee
- Exactly-once guarantee

![message-delivery-guarantees.png](./images/system.design.3/message-delivery-guarantees.png)
![At-most-once-guarantee.png](./images/system.design.3/At-most-once-guarantee.png)
![At-least-once guarantee.png](./images/system.design.3/At-least-once-guarantee.png)
![Exactly-once guarantee.png](./images/system.design.3/Exactly-once-guarantee.png)
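- A sketch of how acknowledgement ordering produces the first two guarantees (`DemoQueue` and its `consume`/`ack` methods are illustrative stand-ins for a real broker client):

```python
class DemoQueue:
    """Toy stand-in for a message broker client."""
    def __init__(self, messages):
        self.messages = list(messages)
    def consume(self):
        return self.messages[0]
    def ack(self, msg):
        self.messages.remove(msg)

def process(msg):
    print("processing", msg)  # must be idempotent for at-least-once

def at_most_once(q):
    msg = q.consume()
    q.ack(msg)        # acknowledge first: a crash here loses the message
    process(msg)

def at_least_once(q):
    msg = q.consume()
    process(msg)      # a crash here causes redelivery -> possible duplicates
    q.ack(msg)        # acknowledge only after successful processing

at_least_once(DemoQueue(["msg-1"]))
```
- Exactly-once is typically built as at-least-once delivery plus idempotent processing or deduplication on the consumer side.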
---
## Autoscaling
- Scaling policies (metric-based, schedule-based, predictive)
![autoscaling.png](./images/system.design.3/autoscaling.png)
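- For metric-based policies, a common scaling rule is the one documented for Kubernetes' Horizontal Pod Autoscaler; a sketch:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """desired = ceil(current * currentMetric / targetMetric)"""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 instances at 90% average CPU with a 60% target -> scale out to 6.
print(desired_replicas(4, 90.0, 60.0))  # 6
```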
---
## How to avoid cascading failures in a distributed system
![cascading_failures_in_distributed_system.png](./images/system.design.3/cascading_failures_in_distributed_system.png)
- `Caching` - key value store
- `Gradual deploy`

---
## Region vs Availability Zone
- `Region` is a geographical area. Each region consists of availability zones. Each region is completely independent.
- Example: us, uk, eu
- `Availability zone` is one or more data centers in a region:
    - Example: 3 data centers in one availability zone or 2 data centers in another availability zone

![region_availability_zone_1.png](./images/system.design.4/region_availability_zone_1.png)
![region_availability_zone_2.png](./images/system.design.4/region_availability_zone_2.png)
---
## Resource monitoring
### Metrics
###### Load average
- `LA` is a measure of the amount of computational work that a computer system performs. The load average represents the average system load over a period of time
- System Load average is the average number of processes either in a `runnable` or `uninterruptible` sleep state
- Normal when it is less than the number of CPU cores, e.g.:
    - an 8-core CPU with a load average of 8 is OK
    - a 1-core CPU with a load average of 8 is bad

![2.png](images/system.design.6/2.png)
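- A quick way to check this rule of thumb on a server (a small sketch; `os.getloadavg` is Unix-only):

```python
import os

cores = os.cpu_count()
load_1m, load_5m, load_15m = os.getloadavg()  # Unix-only

print(f"load {load_1m:.2f} on {cores} cores")
if load_1m > cores:
    print("Load average exceeds core count: processes are queuing for CPU")
```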
###### RAM usage
- Running out of RAM indicates that the server is under severe load, and the performance degradation will almost certainly be noticeable to end users.
- Databases use RAM to load `indexes` and cache data

![3.png](images/system.design.6/3.png)
###### SWAP usage
- Swap is a space on a disk that is used when the amount of physical RAM memory is full. When a Linux system runs out of RAM, inactive pages are moved from the RAM to the swap space.
- When available RAM is short or totally maxed out, Linux moves data from RAM to SWAP.
- High SWAP usage means that you don't have enough RAM
- The server might reboot if you run out of SWAP

###### Disk usage
- Track current disk space used
- Monitor `Docker` images (prune unused images)
- Monitor logs if they are stored on disk; the HDD is slow and often the slowest part of the system

![4.png](images/system.design.6/4.png)
###### CPU usage
- `Idle` - idle task - when nothing to do
- `User` - running user space processes
- `System` - running the kernel
- `Iowait` - idle while the system has an outstanding disk I/O request (_if this is high, the disk is a bottleneck; look at I/O requests to the HDD_)
- `Steal` - virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor
- `Softirq` - servicing software interrupts (the processor switches context to handle them)
- `Nice` - running user processes whose priority has been adjusted with `nice`

![1.png](images/system.design.6/1.png)
### Monitoring systems
- `Agents` - Software on every machine for collecting and reporting metrics and events
- Telegraf
- Zabbix
- `Storages` - Database with metrics and events
- InfluxDB
- ElasticSearch
- Graphite
- MySQL
- Zabbix Server
- Prometheus
- `Dashboards` - Visualization of metrics and events
- Grafana
- Kibana
- Zabbix Frontend
    - Datadog

### Systems
- `TIG` - Telegraf, InfluxDB, Grafana
- `ELK` - ElasticSearch, Logstash, Kibana

---
## Stress testing
- `Stress testing` is a type of performance testing that verifies the stability and reliability of the system under extreme conditions.
- The purpose of `stress testing` is to ensure that the system or application can handle the expected and unexpected load
- Stress testing can help to identify and fix performance issues, such as slow response times, high CPU or memory usage, database connectivity problems, or network latency.
- Can help find `bottlenecks` because the _system's speed depends on the slowest part of the system_
- `Tools`:
- Apache JMeter
- Siege
- ab (Apache Benchmark)
- Gatling
- `Approaches`:
- **Load Testing** - Load testing is the most common approach to stress testing, which involves simulating the `expected user load` on a system or application and measuring its performance
- **Performance Testing** - Performance testing is a stress testing approach that focuses on `measuring the response time of a system or application under different loads` and scenarios
- **Spike Testing** - Spike testing is a stress testing approach that involves `simulating sudden spikes` in user activity or load on a system or application. Spike testing is used to evaluate the behavior of the system `under sudden and unexpected increases` in traffic, such as during a marketing campaign or product launch
- **Endurance Testing** - Endurance testing is a stress testing approach that involves `measuring the performance` of a system or application `over an extended period of time`, usually several hours or days. Endurance testing is used to evaluate the system's ability to maintain its performance under a sustained load or stress
    - **Worsening Tests** - We create high load in part of the production environment; results affect real users. By observing the change in hardware metrics, we can judge how these changes affect user metrics. So instead of spending weeks improving something, it's easier to artificially slow down (worsen) that part and see if there is any impact.

---
# System Design Interview Questions
## Design Rate Limiter #1
### What is Rate Limiter?
- In a network system, a rate limiter is used to control the rate of traffic sent by a client or a
service. In the HTTP world, a rate limiter limits the number of client requests allowed to be
sent over a specified period. If the API request count exceeds the threshold defined by the
rate limiter, all the excess calls are blocked. Here are a few examples:
- A user can write no more than 2 posts per second.
- You can create a maximum of 10 accounts per day from the same IP address.
    - You can claim rewards no more than 5 times per week from the same device.

### Establish design scope for rate limiter
- `Candidate`: What kind of rate limiter are we going to design? Is it a client-side rate limiter or server-side API
rate limiter?
- `Interviewer`: Great question. We focus on the server-side API rate limiter.
- `Candidate`: Does the rate limiter throttle API requests based on IP, the user ID, or other properties?
- `Interviewer`: The rate limiter should be flexible enough to support different sets of throttle rules.
- `Candidate`: What is the scale of the system? Is it built for a startup or a big company with a large user base?
- `Interviewer`: The system must be able to handle a large number of requests.
- `Candidate`: Will the system work in a distributed environment?
- `Interviewer`: Yes.
- `Candidate`: Is the rate limiter a separate service or should it be implemented in application code?
- `Interviewer`: It is a design decision up to you.
- `Candidate`: Do we need to inform users who are throttled?
- `Interviewer`: Yes.

### Requirements for rate limiter
- Accurately limit excessive requests.
- Low latency. The rate limiter should not slow down HTTP response time.
- Use as little memory as possible.
- Distributed rate limiting. The rate limiter can be shared across multiple servers or
processes.
- Exception handling. Show clear exceptions to users when their requests are throttled.
- High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it
does not affect the entire system.

### Rate limiter types
- Client-side rate limiter
- Server-side rate limiter
- Middleware rate limiter

### Algorithms for rate limiting
- Token bucket
- Leaking bucket
- Fixed window counter
- Sliding window log
- Sliding window counter

### Token bucket algorithm
- This algorithm has a centralized bucket host where you take tokens on each request, and slowly drip more tokens into
the bucket. If the bucket is empty, reject the request
- A `token bucket` is a container that has a pre-defined capacity.
- Tokens are put in the bucket at preset rates periodically
- Once the bucket is full, no more tokens are added
- Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket.
- If there are enough tokens, we take one token out for each request, and the request goes through.
    - If there are not enough tokens, the request is dropped.

![rate_limiter_1.png](./images/system.design/rate_limiter_1.png)
![rate_limiter_2.png](./images/system.design/rate_limiter_2.png)
- The token bucket algorithm takes two parameters:
- `Bucket size`: the maximum number of tokens allowed in the bucket
- `Refill rate`: number of tokens put into the bucket every second
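- A minimal single-process token bucket sketch built from those two parameters (a distributed limiter would keep this state in something like Redis):

```python
import time

class TokenBucket:
    def __init__(self, bucket_size: int, refill_rate: float):
        self.bucket_size = bucket_size   # max tokens in the bucket
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(bucket_size)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazily add the tokens accrued since the last call, capped at bucket size.
        self.tokens = min(self.bucket_size,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(bucket_size=4, refill_rate=2)  # burst of 4; 2 req/s sustained
print([limiter.allow() for _ in range(6)])  # first 4 True, then False
```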
- Pros:
- The algorithm is easy to implement.
- Memory efficient.
- Token bucket allows a burst of traffic for short periods. A request can go through as long
as there are tokens left.
- Cons:
    - Two parameters in the algorithm are bucket size and token refill rate. However, it might be challenging to tune them properly.

### Leaking bucket algorithm
- The `leaking bucket algorithm` is similar to the token bucket except that requests are processed
at a fixed rate. It is usually implemented with a first-in-first-out (FIFO) `queue`.
The algorithm works as follows:
- When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the
queue.
- Otherwise, the request is dropped.
    - Requests are pulled from the queue and processed at regular intervals.

![rate_limiter_3.png](./images/system.design/rate_limiter_3.png)
- Leaking bucket algorithm takes the following two parameters:
- `Bucket size`: it is equal to the queue size. The queue holds the requests to be processed at
a fixed rate.
- `Outflow rate`: it defines how many requests can be processed at a fixed rate, usually in
seconds.
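- A leaking bucket sketch using a bounded FIFO queue (`process` is an illustrative hook for the real request handler):

```python
import queue
import time

bucket = queue.Queue(maxsize=10)   # bucket size = queue size

def handle_request(request) -> bool:
    """On arrival: enqueue if the bucket has room, otherwise drop."""
    try:
        bucket.put_nowait(request)
        return True
    except queue.Full:
        return False               # bucket full -> request dropped

def worker(process, outflow_rate_per_sec: float):
    """Drains the queue at a fixed rate, regardless of arrival bursts."""
    while True:
        request = bucket.get()
        process(request)           # caller supplies the real handler
        time.sleep(1 / outflow_rate_per_sec)
```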
- Pros:
- Memory efficient given the limited queue size.
    - Requests are processed at a fixed rate; therefore, it is suitable for use cases where a stable
outflow rate is needed.
- Cons:
- A burst of traffic fills up the queue with old requests, and if they are not processed in
time, recent requests will be rate limited.
    - There are two parameters in the algorithm. It might not be easy to tune them properly.

### Fixed window counter algorithm
- `Fixed window counter algorithm` works as follows:
- The algorithm divides the timeline into fix-sized time windows and assign a counter for
each window.
- Each request increments the counter by one.
    - Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts.
- Example:
- The time unit is 1 second
- The system allows a maximum of 3 requests per second.
    - In each second window, if more than 3 requests are received, extra requests are dropped.

![rate_limiter_4.png](./images/system.design/rate_limiter_4.png)
- A major problem with this algorithm is that a burst of traffic at the edges of time windows
could cause more requests than allowed quota to go through
- The system allows a maximum of 5 requests per minute, and the available quota
resets at the human-friendly round minute.
- As seen, there are five requests between 2:00:00 and 2:01:00 and five more requests between 2:01:00 and 2:02:00.
- For the one-minute window between 2:00:30 and 2:01:30, 10 requests go through.
    - That is twice as many as allowed requests.

![rate_limiter_5.png](./images/system.design/rate_limiter_5.png)
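- A fixed window counter sketch keyed by client and window start (an in-memory dict stands in for a store like Redis):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 1
MAX_REQUESTS = 3
counters = defaultdict(int)  # (client_id, window_start) -> request count

def allow(client_id: str) -> bool:
    window_start = int(time.time() // WINDOW_SECONDS)
    key = (client_id, window_start)
    counters[key] += 1
    return counters[key] <= MAX_REQUESTS
```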
### Sliding window log algorithm
- `Sliding window log algorithm` works as follows:
- The algorithm keeps track of request timestamps. Timestamp data is usually kept in
cache, such as sorted sets of Redis.
- When a new request comes in, remove all the outdated timestamps. Outdated timestamps are defined as those older
than the start of the current time window.
- Add timestamp of the new request to the log.
    - If the log size is the same or lower than the allowed count, a request is accepted. Otherwise, it is rejected.

![rate_limiter_6.png](./images/system.design/rate_limiter_6.png)
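- A sliding window log sketch using an in-memory list of timestamps per client (in practice, Redis sorted sets play this role):

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 2
request_log = defaultdict(list)  # client_id -> timestamps, oldest first

def allow(client_id: str) -> bool:
    now = time.time()
    log = request_log[client_id]
    cutoff = now - WINDOW_SECONDS
    while log and log[0] <= cutoff:  # drop outdated timestamps
        log.pop(0)
    log.append(now)                  # logged even if the request is rejected
    return len(log) <= MAX_REQUESTS
```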
- Example:
- In this example, the rate limiter allows 2 requests per minute. The time unit is 1 minute.
- The log is empty when a new request arrives at `1:00:01`. Thus, the request is allowed.
- A new request arrives at `1:00:30`, the timestamp `1:00:30` is inserted into the log. After the
insertion, the log size is `2`, not larger than the allowed count. Thus, the request is allowed.
- A new request arrives at `1:00:50`, and the timestamp is inserted into the log. After the
insertion, the log size is `3`, larger than the allowed size `2`. Therefore, this request is rejected
even though the timestamp remains in the log.
- A new request arrives at `1:01:40`. Requests in the range `[1:00:40,1:01:40)` are within the
latest time frame, but requests sent before `1:00:40` are outdated.
- Two outdated timestamps, `1:00:01` and `1:00:30`, are removed from the log.
    - After the remove operation, the log size becomes `2`; therefore, the request is accepted.
- Pros:
- Rate limiting implemented by this algorithm is very accurate. In any rolling window,
requests will not exceed the rate limit.
- Cons:
- The algorithm consumes a lot of memory because even if a request is rejected, its
timestamp might still be stored in memory.

### Sliding window counter algorithm
- The `sliding window counter algorithm` is a hybrid approach that combines the fixed window counter and sliding window
log.

![rate_limiter_7.png](./images/system.design/rate_limiter_7.png)
- Example:
    - Let's say I set a limit of `50` requests per minute on an API endpoint. The counter can be thought of like this:
- In this situation, I did `18` requests during the current minute, which started `15` seconds ago, and `42`
requests during the entire previous minute.
- Based on this information, the rate approximation is calculated like this:
- rate = `42 * ((60-15)/60) + 18` = `42 * 0.75 + 18` = `49.5` requests
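- The same approximation in code (illustrative; in practice the two counts would live in a store like Redis):

```python
WINDOW_SECONDS = 60
MAX_REQUESTS = 50

def approximate_rate(prev_window_count: int, curr_window_count: int,
                     elapsed_in_window: float) -> float:
    """Weight the previous window by the share of it still inside
    the rolling window, then add the current window's count."""
    overlap = (WINDOW_SECONDS - elapsed_in_window) / WINDOW_SECONDS
    return prev_window_count * overlap + curr_window_count

# 42 requests last minute, 18 so far this minute, 15 s into the window:
rate = approximate_rate(42, 18, 15)
print(rate)                  # 49.5
print(rate <= MAX_REQUESTS)  # True -> allow the request
```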
- Pros:
- Tiny memory usage: only two numbers per counter
- Incrementing a counter can be done by sending a single `INCR` command
- Calculating the rate is reasonably easy: one GET command and some very simple, fast math
- Cons:
    - It only works for a not-so-strict look-back window. It is an approximation of the actual rate
because it assumes requests in the previous window are evenly distributed. However, this
problem may not be as bad as it seems.

### Detailed design for rate limiter
![rate_limiter_8.png](./images/system.design/rate_limiter_8.png)
- The rate limiter is implemented as a `middleware`. It is placed in front of the API server.
- `Rules` are stored on the disk. Workers frequently pull rules from the disk and store them in the cache.
- When a client sends a request to the server, the request is sent to the rate limiter middleware first.
- Rate limiter `middleware` loads rules from the `cache`. It fetches counters and last request
timestamp from `Redis` cache. Based on the response, the rate limiter decides:
- if the request is not rate limited, it is forwarded to API servers.
- if the request is rate limited, the rate limiter returns `429` too many requests error to
the client. In the meantime, the request is either dropped or forwarded to the `queue`.

### Synchronization issue
![rate_limiter_9.png](./images/system.design/rate_limiter_9.png)
- `Synchronization issue` can be solved by using centralized data stores like `Redis`.
---
## Design Consistent hashing
- `Consistent hashing` is a special kind of hashing such that when a
hash table is re-sized and consistent hashing is used, only `k/n` keys need to be remapped on
average, where `k` is the number of keys, and `n` is the number of slots. In contrast, in most
traditional hash tables, a change in the number of array slots causes nearly all keys to be
remapped
- The basic steps are:
- Map servers and keys on to the ring using a uniformly distributed hash function.
- To find out which server a key is mapped to, go clockwise from the key position until the
first server on the ring is found.

### Server lookup in Consistent hashing
- To determine which server a key is stored on, we go clockwise from the key position on the ring until a server is
found.
- Going clockwise, `key0` is stored on server 0; `key1` is stored on server 1; `key2` is stored on server 2
and `key3` is stored on server 3.

![consistent_hashing_1.png](./images/system.design/consistent_hashing_1.png)
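- A minimal hash ring sketch with virtual nodes (the hash function and virtual-node count are illustrative choices):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes: int = 3):
        # Each server is placed on the ring at several virtual positions.
        self.ring = sorted(
            (_hash(f"{server}_{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the first virtual node."""
        idx = bisect.bisect_right(self.positions, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["s0", "s1", "s2", "s3"])
print(ring.lookup("key0"))  # stable assignment until servers change
```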
### Add a server in Consistent hashing
- After a new server 4 is added, only `key0` needs to be redistributed.
- `k1`, `k2`, and `k3` remain on the same servers. Let us take a close look at the logic.
- Before server 4 is added, `key0` is stored on server 0. Now, `key0` will be stored on server 4 because server 4 is
the first
server it encounters by going clockwise from `key0`'s position on the ring.
- The other keys are not redistributed, thanks to the consistent hashing algorithm.

![consistent_hashing_2.png](./images/system.design/consistent_hashing_2.png)
### Remove a server in Consistent hashing
- When a server is removed, only a small fraction of keys require redistribution with consistent hashing.
- When server 1 is removed, only `key1` must be remapped to server 2. The rest of the keys are unaffected.

![consistent_hashing_3.png](./images/system.design/consistent_hashing_3.png)
### Virtual nodes in Consistent hashing
- A `virtual node` refers to the real node, and each server is represented by multiple virtual nodes on the ring.
- Here, 3 virtual nodes per server is an arbitrary choice; in real-world systems, the number of virtual nodes is much larger.
- Instead of using `s0`, we have `s0_0`, `s0_1`, and `s0_2` to represent `server 0` on the ring.
- Similarly, `s1_0`, `s1_1`, and `s1_2` represent `server 1` on the ring.
- With virtual nodes, each server is responsible for multiple partitions.
- Partitions (edges) with label `s0` are managed by `server 0`. On the other hand, partitions with label `s1` are
managed by `server 1`.
- As the number of `virtual nodes` increases, the distribution of keys becomes more balanced.

![consistent_hashing_4.png](./images/system.design/consistent_hashing_4.png)
- To find which server a key is stored on, we go `clockwise` from the keyโs location and find the first `virtual node`
encountered on the ring.

![consistent_hashing_5.png](./images/system.design/consistent_hashing_5.png)
### Redistribution when a server is added in Consistent hashing
- When a server is added or removed, a fraction of data needs to be redistributed
- `server 4` is added onto the ring. The affected range starts from `s4` (newly added node) and
moves `anticlockwise`
around the ring until a server is found (`s3`).
- Thus, keys located between `s3` and `s4` need to be redistributed to `s4`.

![consistent_hashing_6.png](./images/system.design/consistent_hashing_6.png)
### Redistribution when a server is removed in Consistent hashing
- When a server (s1) is removed, the affected range starts from `s1` (removed node) and moves `anticlockwise` around
the ring until a server is found (s0).
- Thus, keys located between `s0` and `s1` must be redistributed to `s2`.

![consistent_hashing_7.png](./images/system.design/consistent_hashing_7.png)
### Data replication in Consistent hashing
- To achieve high availability and reliability, data must be replicated asynchronously over `N`
servers, where `N` is a configurable parameter.
- These `N` servers are chosen using the following logic:
- after a key is mapped to a position on the hash ring, `walk clockwise` from that position
and choose the first `N` servers on the ring to store data copies.
    - `key0` is replicated at `s1`, `s2`, and `s3`

![replication_in_consistent_hashing.png](./images/system.design/replication_in_consistent_hashing.png)
![replication_in_consistent_hashing_1.png](./images/system.design/replication_in_consistent_hashing_1.png)
---
## Design a key-value store
- A `key-value` store, also referred to as a key-value database, is a non-relational database. Each
unique identifier is stored as a key with its associated value.
- This data pairing is known as a `key-value` pair.
- `Cassandra`:
- `Reads` are more expensive than writes.
- `Writes` are appended to a commit log and written to an in memory structure called a memtable that is eventually
flushed to disk.
- `Reads`, however, need to query the memtable and potentially multiple SSTables (on-disk files), a more expensive
operation

![key_value_store_1.png](./images/system.design/key_value_store_1.png)
### Data partitioning in key-value store
- `Data partitioning` is a technique used to distribute data across multiple machines. It is used to
scale the storage and retrieval of data.
- `Data partitioning` can be achieved by `consistent hashing`

### Consistency in key-value store
- Since data is replicated at multiple nodes, it must be synchronized across replicas. `Quorum
consensus` can guarantee consistency for both read and write operations. Let us establish a
few definitions first.
- `N` = The number of `replicas`
- `W` = A `write quorum` of size `W`. For a `write` operation to be considered as successful, write operation must
be acknowledged from `W` replicas.
- `R` = A `read quorum` of size `R`. For a `read` operation to be considered as successful, read operation must wait
for responses from at least `R` replicas.
- The configuration of `W`, `R`, and `N` is a typical tradeoff between latency and consistency
- If `W = 1 or R = 1`, an `operation is returned quickly` because a coordinator only needs to wait for a
response from any of the replicas.
- If `W or R > 1`, the system offers `better consistency`; however, the `query will be slower` because the
coordinator must wait for the response from the slowest replica.
- If `W + R > N`, `strong consistency` is guaranteed because there must be at least one overlapping node that has
the latest data to ensure consistency.
- How to configure N, W, and R to fit our use cases? Here are some of the possible setups:
- If `R = 1 and W = N`, the system is optimized for a `fast read`.
- If `W = 1 and R = N`, the system is optimized for `fast write`.
    - If `W + R > N`, strong `consistency` is `guaranteed` (usually N = 3, W = R = 2; see the sketch below).
- If `W + R <= N`, strong `consistency` is `not guaranteed`.
- Consistency models:
- `Strong consistency`: any read operation returns a value corresponding to the result of the
most updated write data item. A client never sees out-of-date data.
- `Weak consistency`: subsequent read operations may not see the most updated value.
- `Eventual consistency`: this is a specific form of weak consistency. Given enough time, all
updates are propagated, and all replicas are consistent.

![key_value_store_2.png](./images/system.design/key_value_store_2.png)
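- A toy check of the quorum overlap rule above (any read quorum of size `R` must intersect any write quorum of size `W` whenever `W + R > N`):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """W + R > N guarantees every read quorum contains at least one
    replica that acknowledged the latest write."""
    return w + r > n

print(is_strongly_consistent(3, 2, 2))  # True  (typical setup)
print(is_strongly_consistent(3, 1, 1))  # False (fast, but only eventual)
```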
### Handling temporary failures for key-value store
- If a server is unavailable due to network or server failures, another server will process requests temporarily.
- When the down server is up, changes will be pushed back to achieve data consistency.
- This process is called `hinted handoff`.
- Since `s2` is unavailable, reads and writes will be handled by `s3` temporarily.
- When `s2` comes back online, `s3` will hand the data back to `s2`.

![key_value_store_7.png](./images/system.design/key_value_store_7.png)
### Handling permanent failures for key-value store
- What if a replica is permanently unavailable?
- To handle such a situation, we implement an `anti-entropy protocol` to keep replicas in sync.
- Anti-entropy involves comparing each piece of data on replicas and updating each replica to the newest version.
- A `Merkle tree` is used for inconsistency detection and minimizing the amount of data transferred.

### Handling data center outage for key-value store
- Data center outage could happen due to power outage, network outage, natural disaster, etc.
- To build a system capable of handling data center outage, it is important to replicate data across multiple data
centers.
- Even if a data center is completely offline, users can still access data through the other data centers.

### System architecture diagram for key-value store
![key_value_store_3.png](./images/system.design/key_value_store_3.png)
- Main features of the architecture are listed as follows:
- Clients communicate with the key-value store through simple APIs: get(key) and put(key,
value).
- A coordinator is a node that acts as a proxy between the client and the key-value store.
- Nodes are distributed on a ring using consistent hashing.
- The system is completely decentralized so adding and moving nodes can be automatic.
- Data is replicated at multiple nodes.
    - There is no single point of failure as every node has the same set of responsibilities.

### Write path in key-value store
![key_value_store_4.png](./images/system.design/key_value_store_4.png)
- When data is written to `Cassandra`, it is first written to the `commit log`. Subsequently, the data is asynchronously
written to the `memtable` (in-memory storage) and the `SSTable` (sorted file data on disk).
- The commit log provides data durability by retaining information about each write operation, even if the data hasn't
been completely written to the memtable and SSTable.
- During a Cassandra restart or in case of a failure, the commit log is used to recover data that may have been lost
during the failure. It allows Cassandra to replay the data from the commit log and update the memtable and SSTable
accordingly.
- In summary, the commit log is a crucial component of Cassandra's data durability mechanism, ensuring the reliability
of data in case of system failures.

![Write path in key-value store.png](./images/system.design.3/Write_path_in_key_value_store.png)
![Write path in key-value store 2.png](./images/system.design.3/Write_path_in_key_value_store_2.png)
![Write path in key-value store 3.png](./images/system.design.3/Write_path_in_key_value_store_3.png)
- The write request is persisted on a commit log file.
    - `log` - this is a **linked list** data structure, where appending a new entry is `O(1)` and reading is `O(n)`.
- Data is saved in the memory cache.
- When the memory cache is full or reaches a predefined threshold, data is flushed to SSTable on disk.
- A sorted-string table (SSTable) is a sorted list of key-value pairs, containing the sorted keys from the log.

### Read path in key-value store
![key_value_store_5.png](./images/system.design/key_value_store_5.png)
- The system first checks if data is in memory - `memtable`. If not, go to step 2.
- Data in Cassandra `memtables` is stored in a highly optimized, columnar format, which allows for efficient read
and
write operations. This format is designed to maximize performance and minimize disk I/O, which is essential for
Cassandra's high-throughput, low-latency, and distributed architecture.
- If data is not in memory, the system checks the `bloom filter`
- `Bloom filter` is a probabilistic data structure that is used to test whether an element is a member of a set.
- The bloom filter is used to figure out which SSTables might contain the key.
- `SSTables` return the result of the data set.
- The result of the data set is returned to the client.
- It's important to note that the in-memory data structures and the on-disk storage (SSTables) in Cassandra can have
different formats and structures. The on-disk format is designed for durability and efficient storage, while the
in-memory format (memtable) is optimized for quick access and updates.

### Techniques for a distributed key-value store
![key_value_store_6.png](./images/system.design/key_value_store_6.png)
---
## Design a unique ID generator in distributed systems
### Establish design scope for unique ID generator
- `Candidate`: What are the characteristics of unique IDs?
- `Interviewer`: IDs must be unique and sortable.
- `Candidate`: For each new record, does ID increment by 1?
- `Interviewer`: The ID increments by time but not necessarily only increments by 1. IDs created in the evening are
larger than those created in the morning on the same day.
- `Candidate`: Do IDs only contain numerical values?
- `Interviewer`: Yes, that is correct.
- `Candidate`: What is the ID length requirement?
- `Interviewer`: IDs should fit into 64-bit.
- `Candidate`: What is the scale of the system?
- `Interviewer`: The system should be able to generate 10,000 IDs per second.

### Requirements for unique ID generator
- IDs must be unique.
- IDs are numerical values only.
- IDs fit into 64-bit.
- IDs are ordered by date.
- Ability to generate over 10,000 unique IDs per second.

### Multi-master replication
- `Multi-master replication` is a replication strategy where all nodes are equal and can accept writes.
- This approach uses the databases' `auto_increment` feature. Instead of increasing the next ID by 1, we increase it by `k`, where `k` is the number of database servers in use.
- With two servers (`k = 2`), the next ID to be generated equals the previous ID on the same server plus `2`.
- This solves some scalability issues because IDs can scale with the number of database servers.
- However, this strategy has some major drawbacks:
- Hard to scale with multiple data centers
- IDs do not go up with time across multiple servers.
    - It does not scale well when a server is added or removed.

![unique_id.png](./images/system.design/unique_id.png)
### UUID
- `UUID` is another easy way to obtain unique IDs. UUID is a 128-bit number used to identify information in computer
systems.
- `UUID` has a very low probability of collision.
- In this design, each web server contains an ID generator, and a web server is responsible for generating IDs
independently.
- Pros:
- Generating `UUID` is simple. No coordination between servers is needed so there will not be any synchronization
issues.
- The system is easy to scale because each web server is responsible for generating IDs they consume. ID generator
can easily scale with web servers.
- Cons:
- IDs are 128 bits long, but our requirement is 64 bits.
- IDs do not go up with time.
    - IDs could be non-numeric.

![unique_id_1.png](./images/system.design/unique_id_1.png)
### Ticket Server
- `Ticket server` is a centralized service that generates unique IDs.
- The idea is to use a centralized `auto_increment` feature in a single database server.
- Pros:
- Numeric IDs.
- It is easy to implement, and it works for small to medium-scale applications.
- Cons:
- Single point of failure. Single ticket server means if the ticket server goes down, all systems that depend on it
will face issues.
- To avoid a single point of failure, we can set up multiple ticket servers.
    - However, this will introduce new challenges such as data synchronization.

![unique_id_2.png](./images/system.design/unique_id_2.png)
### Twitter snowflake approach
- Sections:
- `Sign bit`: 1 bit. It will always be 0. This is reserved for future uses. It can potentially be
used to distinguish between signed and unsigned numbers.
- `Timestamp`: 41 bits. Milliseconds since the epoch or custom epoch. We use Twitter
snowflake default epoch 1288834974657, equivalent to Nov 04, 2010, 01:42:54 UTC.
- `Datacenter ID`: 5 bits, which gives us 2 ^ 5 = 32 datacenters.
- `Machine ID`: 5 bits, which gives us 2 ^ 5 = 32 machines per datacenter.
- `Sequence number`: 12 bits. For every ID generated on that machine/process, the sequence
number is incremented by 1. The number is reset to 0 every millisecond.

![unique_id_3.png](./images/system.design/unique_id_3.png)
- `Datacenter IDs` and `machine IDs` are chosen at the startup time, generally fixed once the system is up running
- `Timestamp`:
- As timestamps grow with time, IDs are sortable by time.
- The maximum timestamp that can be represented in `41` bits is 2 ^ 41 - 1 = `2199023255551` milliseconds (ms), which gives us ~ `69` years:
    - `2199023255551` ms / `1000` (→ seconds) / `3600` (→ hours) / `24` (→ days) / `365` ≈ `69.7` years
- `Sequence number`
- Sequence number is `12` bits, which give us 2 ^ 12 = `4096` combinations.
- This field is `0` unless more than one ID is generated in a millisecond on the same server.
    - In theory, a machine can support a maximum of `4096` new IDs per millisecond.
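- A sketch of packing the sections above into a 64-bit ID (the bit layout and epoch constant follow the description above; a real generator would also track the per-millisecond sequence and reset it to 0):

```python
import time

TWITTER_EPOCH_MS = 1288834974657  # Nov 04, 2010, 01:42:54 UTC

def snowflake_id(datacenter_id: int, machine_id: int, sequence: int) -> int:
    """Pack: 41-bit timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence."""
    timestamp = int(time.time() * 1000) - TWITTER_EPOCH_MS
    return (timestamp << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence

print(snowflake_id(datacenter_id=3, machine_id=7, sequence=0))
```

---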
## Design a URL shortening service like TinyURL
### Establish design scope for URL shortening service
- `Candidate`: Can you give an example of how a URL shortener works?
- `Interviewer`: Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long is the original URL.
Your service creates an alias with shorter length: https://tinyurl.com/y7keocwj. If you click the alias, it redirects
you to the original URL.
- `Candidate`: What is the traffic volume?
- `Interviewer`: 100 million URLs are generated per day.
- `Candidate`: How long is the shortened URL?
- `Interviewer`: As short as possible.
- `Candidate`: What characters are allowed in the shortened URL?
- `Interviewer`: Shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).
- `Candidate`: Can shortened URLs be deleted or updated?
- `Interviewer`: For simplicity, let us assume shortened URLs cannot be deleted or updated.

### API Endpoints for URL shortening service
- POST `api/v1/data/shorten`
- request parameter: {`longUrl`: longURLString}
- return shortURL
- GET `api/v1/shortUrl`
    - Return longURL for HTTP redirection

### URL redirecting for URL shortening service
- Once the server receives a tinyurl request, it changes the short URL to the long URL with `301` redirect.
![tiny_url.png](./images/system.design/tiny_url.png)
![tiny_url_1.png](./images/system.design/tiny_url_1.png)
### 301 redirect vs 302 redirect for URL shortening service
- A `301` redirect shows that the requested URL is `permanently` moved to the
long URL. Since it is permanently redirected, the browser caches the response, and
subsequent requests for the same URL will not be sent to the URL shortening service.
- A `302` redirect means that the URL is `temporarily` moved to the long URL,
meaning that subsequent requests for the same URL will be sent to the URL shortening
service first.

### Hash value length for URL shortening service
- The `hashValue` consists of characters from [0-9, a-z, A-Z], containing `10 + 26 + 26` = `62` possible characters.
- When `n` = 7, `62^n` = `~3.5 trillion`, 3.5 trillion is more than enough to hold 365 billion URLs, so the length of
hashValue is `7`.

![tiny_url_2.png](./images/system.design/tiny_url_2.png)
### URL shortening deep dive
![tiny_url_3.png](./images/system.design/tiny_url_3.png)
1. `longURL` is the input.
2. The system checks if the `longURL` is in the database.
3. If it is, it means the `longURL` was converted to `shortURL` before. In this case, fetch the `shortURL` from the
database
and return it to the client.
4. If not, the `longURL` is new. A new unique ID (primary key) is generated by the unique ID generator.
5. Convert the ID to `shortURL` with `base 62` conversion.
6. Create a new database row with the `ID`, `shortURL`, and `longURL`.
- Concrete example:
- Assuming the input longURL is: https://en.wikipedia.org/wiki/Systems_design
- Unique ID generator returns ID: `2009215674938`.
- Convert the ID to shortURL using the `base 62` conversion. ID (2009215674938) is converted to `zn9edcu`
    - Save `ID`, `shortURL`, and `longURL` to the database.

![tiny_url_4.png](./images/system.design/tiny_url_4.png)
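- A base-62 conversion sketch that reproduces the example above (the alphabet order 0-9, a-z, A-Z is an assumption that happens to match `zn9edcu`):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

print(base62_encode(2009215674938))  # zn9edcu
```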
### URL redirecting deep dive
- As there are more reads than writes, the `<shortURL, longURL>` mapping is stored in a cache to improve performance.
![tiny_url_5.png](./images/system.design/tiny_url_5.png)
- The flow of URL redirecting is summarized as follows:
- A user clicks a short URL link: https://tinyurl.com/zn9edcu
- The load balancer forwards the request to web servers.
- If a `shortURL` is already in the cache, return the `longURL` directly.
- If a `shortURL` is not in the cache, fetch the `longURL` from the database.
- If it is not in the database, it is likely a user entered an invalid `shortURL`.
    - The `longURL` is returned to the user.

---
## Design a web crawler
- A web crawler is known as a robot or spider.
- It is widely used by search engines to discover new or updated content on the web.
- A crawler is used for many purposes:
- `Search engine indexing`: crawler collects web pages to create a local index for search engines
- `Web archiving`: This is the process of collecting information from the web to preserve data for future uses
- `Web mining`: Web mining helps to discover useful knowledge from the internet
- `Web monitoring`: The crawlers help to monitor copyright and trademark infringements over the Internet
- `Locality`:
- Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience
faster
download time
- `Timeouts`:
- Some web servers respond slowly or may not respond at all. To avoid long wait time, a
maximal wait time is specified.
    - If a host does not respond within a predefined time, the crawler will stop the job and crawl some other pages.

![web_crawler_1.png](./images/system.design/web_crawler_1.png)
### Establish design scope for web crawler
- `Candidate`: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something
else?
- `Interviewer`: Search engine indexing.
- `Candidate`: How many web pages does the web crawler collect per month?
- `Interviewer`: 1 billion pages.
- `Candidate`: What content types are included? HTML only or other content types such as PDFs and images as well?
- `Interviewer`: HTML only.
- `Candidate`: Shall we consider newly added or edited web pages?
- `Interviewer`: Yes, we should consider the newly added or edited web pages.
- `Candidate`: Do we need to store HTML pages crawled from the web?
- `Interviewer`: Yes, up to 5 years
- `Candidate`: How do we handle web pages with duplicate content?
- `Interviewer`: Pages with duplicate content should be ignored.

### Web crawler high level design
![web_crawler_2.png](./images/system.design/web_crawler_2.png)
### Seed URLs for web crawler
- Seed URLs are the URLs that are used to start the crawling process.
- A good seed URL serves as a good starting point that a crawler can utilize to traverse as many links as possible
- The first proposed approach is based on locality as different countries may have different popular websites.
- Another way is to choose seed URLs based on topics; for example, we can divide URL space into shopping, sports,
healthcare, etc.
- Seed URL selection is an open-ended question.

### URL Frontier for web crawler
- Most modern web crawlers split the crawl state into two: `to be downloaded` and `already downloaded`.
- The component that stores URLs `to be downloaded` is called the URL Frontier.
- You can refer to this as a First-in-First-out (FIFO) `queue`.

### HTML Downloader for web crawler
- The HTML downloader downloads web pages from the internet.
- Those URLs are provided by the `URL Frontier`
- To achieve high performance, crawl jobs are distributed into multiple servers
- The URL space is partitioned into smaller pieces, so each downloader is responsible for a subset of the URLs.

### DNS Resolver for web crawler
- To download a web page, a URL must be translated into an `IP address`.
- The `HTML Downloader` calls the `DNS Resolver` to get the corresponding `IP` address for the `URL`
- `DNS Resolver` is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many
DNS interfaces
- Maintaining our DNS cache to avoid calling DNS frequently is an effective technique for speed optimization.
- DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.

### Content Parser for web crawler
- After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and
waste storage space.
- Implementing a content parser in a crawl server will slow down the crawling process.

### Content Seen for web crawler
- It helps to detect whether newly downloaded content duplicates content already stored in the system.
- An efficient way to accomplish this task is to compare the hash values of the two web pages.

### Content Storage for web crawler
- It is a storage system for storing HTML content.
- The choice of storage system depends on factors such as data type, data size, access frequency, life span, etc.
- Both disk and memory are used:
- Most of the content is stored on disk because the data set is too big to fit in memory.
    - Popular content is kept in memory to reduce latency.

### URL Extractor for web crawler
- URL Extractor parses and extracts links from HTML pages
![web_crawler_3.png](./images/system.design/web_crawler_3.png)
### URL Filter for web crawler
- The URL filter excludes certain content types, file extensions, error links and URLs in `blacklisted` sites.
### URL Seen for web crawler
- `URL Seen` is a data structure that keeps track of URLs that are visited before or already in
the `Frontier`.
- `URL Seen` helps to avoid adding the same URL multiple times as this can increase server load and cause potential
infinite loops.
- Bloom filter and hash table are common techniques to implement the `URL Seen` component.

### URL Storage for web crawler
- URL Storage stores already visited URLs.
### Web crawler workflow
![web_crawler_4.png](./images/system.design/web_crawler_4.png)
- Step 1: Add `seed` URLs to the `URL Frontier`
- Step 2: `HTML Downloader` fetches a list of URLs from `URL Frontier`.
- Step 3: `HTML Downloader` gets `IP` addresses of `URLs` from `DNS resolver` and starts downloading.
- Step 4: `Content Parser` parses HTML pages and checks if pages are malformed.
- Step 5: After content is parsed and validated, it is passed to the `Content Seen` component.
- Step 6: `Content Seen` component checks if a HTML page is already in the storage.
- If it is in the storage, this means the same content in a different URL has already been
processed. In this case, the HTML page is discarded.
- If it is not in the storage, the system has not processed the same content before. The
content is passed to `Link Extractor`.
- Step 7: `Link extractor` extracts links from HTML pages.
- Step 8: Extracted links are passed to the `URL filter`.
- Step 9: After links are filtered, they are passed to the `URL Seen` component.
- Step 10: `URL Seen` component checks if a URL is already in the storage, if yes, it is processed before, and nothing
needs to be done.
- Step 11: If a URL has not been processed before, it is added to the `URL Frontier`.

### DFS vs BFS for web crawler
- `DFS` is usually not a good choice because the depth of DFS can be very deep.
- `BFS` is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs
are dequeued in the order they are enqueued.

### URL frontier for web crawler deep dive
- Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period.
- The general idea of enforcing politeness is to download one page at a time from the same host.
- A delay can be added between two download tasks.

![web_crawler_5.png](./images/system.design/web_crawler_5.png)
- Priority:
- `Prioritizer`: It takes URLs as input and computes the priorities.
- `Queue f1 to fn`: Each queue has an assigned priority. Queues with high priority are selected with higher
probability.
- `Front queue selector`: Randomly choose a queue with a bias towards queues with higher priority.
- Politeness:
    - `Back queue router`: It ensures that each queue (b1, b2, ... bn) only contains URLs from the same host.
- `Mapping table`: It maps each host to a queue.
- `FIFO queues b1, b2 to bn`: Each queue contains URLs from the same host.
- `Back queue selector`: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue.
The
queue selection logic is done by the Queue selector.
- `Worker thread 1 to N`. A worker thread downloads web pages one by one from the same host. A delay can be added
between two download tasks.

---
## Design a notification system #1
### Establish design scope for notification system
- `Candidate`: What types of notifications does the system support?
- `Interviewer`: Push notification, SMS message, and email.
- `Candidate`: Is it a real-time system?
- `Interviewer`: Let us say it is a soft real-time system. We want a user to receive notifications as soon as possible.
However, if the system is under a high workload, a slight delay is acceptable.
- `Candidate`: What are the supported devices?
- `Interviewer`: iOS devices, android devices, and laptop/desktop.
- `Candidate`: What triggers notifications?
- `Interviewer`: Notifications can be triggered by client applications. They can also be scheduled on the server-side.
- `Candidate`: Will users be able to opt-out?
- `Interviewer`: Yes, users who choose to opt-out will no longer receive notifications.
- `Candidate`: How many notifications are sent out each day?
- `Interviewer`: 10 million mobile push notifications, 1 million SMS messages, and 5 million emails.

### Types of notifications
- Push notification
- SMS message
- Email

### iOS push notification
![notification_system.png](./images/system.design/notification_system.png)
- `Provider`. A provider builds and sends notification requests to `Apple Push Notification Service` (APNS). To
construct a push notification, the provider provides the following data:
- `Device token`: This is a unique identifier used for sending push notifications.
    - `Payload`: This is a JSON dictionary that contains a notification's payload.
- `APNS`: This is a remote service provided by Apple to propagate push notifications to iOS devices.
- `iOS Device`: It is the end client, which receives push notifications.

### Android push notification
![notification_system_2.png](./images/system.design/notification_system_1.png)
- `Android` adopts a similar notification flow. Instead of using APNs, `Firebase Cloud Messaging` (FCM) is commonly used to send push notifications to Android devices.

### SMS notification
![notification_system_3.png](./images/system.design/notification_system_2.png)
- For `SMS` messages, third-party SMS services like `Twilio`, `Nexmo`, and many others are commonly used. Most of them are commercial services.

### Email notification
![notification_system_4.png](./images/system.design/notification_system_3.png)
- Although companies can set up their own email servers, many of them opt for commercial email services.
- `Sendgrid` and `Mailchimp` are among the most popular email services, which offer a better delivery rate and data analytics.

### High-level design of a notification system
![notification_system_4.png](./images/system.design/notification_system_4.png)
1. A `service` calls APIs provided by notification servers to send notifications.
2. `Notification servers` fetch metadata such as user info, device token, and notification setting from the cache or
database.
3. A notification event is sent to the corresponding queue for processing. For instance, an iOS push notification event
is sent to the iOS PN queue.
4. `Workers` pull notification events from message queues.
5. `Workers` send notifications to third party services.
6. `Third-party` services send notifications to user devices.

### Preventing data loss in a notification system
- To satisfy this requirement, the notification system persists notification data in a database and implements a retry
mechanism.

![notification_system_5.png](./images/system.design/notification_system_5.png)
### Will recipients receive a notification exactly once in a notification system?
- The short answer is no. Although notification is delivered exactly once most of the time, the
distributed nature could result in duplicate notifications.
- To reduce the duplication occurrence, we introduce a dedupe mechanism and handle each failure case carefully.
- Here is a simple dedupe logic:
- When a notification event first arrives, we check if it is seen before by checking the event ID.
    - If it is seen before, it is discarded. Otherwise, we will send out the notification.

### Notification template for a notification system
- A large notification system sends out millions of notifications per day, and many of these notifications follow a
similar format.
- Notification templates are introduced to avoid building every notification from scratch.
- A notification template is a preformatted notification to create your unique notification by customizing parameters,
styling, tracking links, etc.

### Notification setting for a notification system
- Users generally receive way too many notifications daily and they can easily feel overwhelmed.
- Thus, many websites and apps give users fine-grained control over notification settings.

### Rate limiting for a notification system
- To avoid overwhelming users with too many notifications, we can limit the number of notifications a user can receive.
- This is important because receivers could turn off notifications completely if we send them too often.

### Retry mechanism for a notification system
- When a third-party service fails to send a notification, the notification will be added to the message queue for
retrying.
- If the problem persists, an alert will be sent out to developers.

### Security in push notifications for a notification system
- For iOS or Android apps, `appKey` and `appSecret` are used to secure push notification APIs
- Only authenticated or verified clients are allowed to send push notifications using our APIs.

### Monitor queued notifications for a notification system
- A key metric to monitor is the total number of queued notifications.
- If the number is large, the notification events are not processed fast enough by workers.
- To avoid delay in the notification delivery, more workers are needed.

### Deep dive into notification system
![notification_system_6.png](./images/system.design/notification_system_6.png)
---
## Design a news feed system
### Establish design scope for a news feed system
- `Candidate`: Is this a mobile app? Or a web app? Or both?
- `Interviewer`: Both
- `Candidate`: What are the important features?
- `Interviewer`: A user can publish a post and see her friends' posts on the news feed page.
- `Candidate`: Is the news feed sorted by reverse chronological order or any particular order such as topic scores? For
instance, posts from your close friends have higher scores.
- `Interviewer`: To keep things simple, let us assume the feed is sorted by reverse chronological order.
- `Candidate`: How many friends can a user have?
- `Interviewer`: 5000
- `Candidate`: What is the traffic volume?
- `Interviewer`: 10 million DAU
- `Candidate`: Can feed contain images, videos, or just text?
- `Interviewer`: It can contain media files, including both images and videos.

### Feed publishing API for a news feed system
- To publish a post, a HTTP POST request will be sent to the server. The API is shown below:
- POST `/v1/me/feed`
- Params:
- `content`: content is the text of the post.
        - `auth_token`: it is used to authenticate API requests.

### Newsfeed retrieval API for a news feed system
- The API to retrieve news feed is shown below:
- GET `/v1/me/feed`
- Params:
        - `auth_token`: it is used to authenticate API requests.

### Feed publishing deep dive for a news feed system
![news_feed_system.png](./images/system.design/news_feed_system.png)
1. `User`: a user can view news feeds on a browser or mobile app. A user makes a post through
API: `/v1/me/feed?content=Hello&auth_token={auth_token}`
2. `Load balancer`: distribute traffic to web servers.
3. `Web servers`: web servers redirect traffic to different internal services.
4. `Post service`: persist post in the database and cache.
5. `Fanout service`: push new content to friends' news feed. Newsfeed data is stored in the
cache for fast retrieval.
6. `Notification service`: inform friends that new content is available and send out push
notifications.

### Fanout service for a news feed system
![news_feed_system_1.png](./images/system.design/news_feed_system_1.png)
The fanout service works as follows:
1. Fetch friend IDs from the graph database. `Graph databases` are suited for managing
friend relationship and friend recommendations.
2. Get friends info from the user cache. The system then filters out friends based on user
settings. For example, if you mute someone, her posts will not show up on your news feed
even though you are still friends. Another reason why posts may not show is that a user
could selectively share information with specific friends or hide it from other people.
3. Send friends list and new post ID to the `message queue`.
4. Fanout workers fetch data from the message queue and store news feed data in the news
feed cache. You can think of the news feed cache as a mapping table.
The memory consumption can become very large if we store the entire user
and post objects in the cache. Thus, only IDs are stored. To keep the memory size small,
we set a configurable limit. The chance of a user scrolling through thousands of posts in
news feed is slim. Most users are only interested in the latest content, so the cache miss
rate is low.
5. Store in news feed cache.

### Fanout on write for a news feed system
- With this approach, news feed is pre-computed during write time. A new
post is delivered to friends' cache immediately after it is published.
- Pros:
- The news feed is generated in real-time and can be pushed to friends immediately.
- Fetching news feed is fast because the news feed is pre-computed during write time.
- Cons:
- If a user has many friends, fetching the friend list and generating news feeds for all of
them is slow and time consuming. This is called the hotkey problem.
    - For inactive users or those who rarely log in, pre-computing news feeds wastes computing resources.

### Fanout on read for a news feed system
- The news feed is generated during read time. This is an on-demand model.
Recent posts are pulled when a user loads her home page.
- Pros:
- For inactive users or those who rarely log in, fanout on read works better because it will
not waste computing resources on them.
- Data is not pushed to friends so there is no hotkey problem.
- Cons:
    - Fetching the news feed is slow as the news feed is not pre-computed.

### Newsfeed retrieval deep dive for a news feed system
![news_feed_system_2.png](./images/system.design/news_feed_system_2.png)
1. Media content (images, videos, etc.) is stored in a CDN for fast retrieval.
2. A user sends a request to retrieve her news feed. The request looks like this: `/v1/me/feed`
3. The load balancer redistributes requests to web servers.
4. Web servers call the news feed service to fetch news feeds.
5. News feed service gets a list of post IDs from the news feed cache.
6. A user's news feed is more than just a list of feed IDs. It contains username, profile
picture, post content, post image, etc. Thus, the news feed service fetches the complete
user and post objects from caches (user cache and post cache) to construct the fully
hydrated news feed.
7. The fully hydrated news feed is returned in JSON format back to the client for
rendering.
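
A minimal sketch of steps 5-7 (feed hydration), with in-memory dictionaries standing in for the user cache, post cache, and news feed cache:

```python
import json

user_cache = {1: {"username": "alice", "profile_picture": "alice.png"}}
post_cache = {101: {"author_id": 1, "content": "hello", "post_image": None}}
news_feed_cache = {2: [101]}

def get_feed(user_id: int) -> str:
    """Turn the cached post IDs into a fully hydrated JSON feed."""
    hydrated = []
    for post_id in news_feed_cache.get(user_id, []):
        post = post_cache[post_id]
        author = user_cache[post["author_id"]]
        hydrated.append({"post_id": post_id, **post, **author})
    return json.dumps({"feed": hydrated})

print(get_feed(2))
```

---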
## Design a chat system
### Establish design scope for a chat system
- `Candidate`: What kind of chat app shall we design? 1 on 1 or group based?
- `Interviewer`: It should support both 1 on 1 and group chat.
- `Candidate`: Is this a mobile app? Or a web app? Or both?
- `Interviewer`: Both.
- `Candidate`: What is the scale of this app? A startup app or massive scale?
- `Interviewer`: It should support 50 million daily active users (DAU).
- `Candidate`: For group chat, what is the group member limit?
- `Interviewer`: A maximum of 100 people
- `Candidate`: What features are important for the chat app? Can it support attachment?
- `Interviewer`: 1 on 1 chat, group chat, online indicator. The system only supports text messages.
- `Candidate`: Is there a message size limit?
- `Interviewer`: Yes, text length should be less than 100,000 characters.
- `Candidate`: Is end-to-end encryption required?
- `Interviewer`: Not required for now but we will discuss that if time allows.
- `Candidate`: How long shall we store the chat history?
- `Interviewer`: Forever

### Techniques used to simulate a server-initiated connection
- `Polling`
- `Long polling`
- `WebSocket`

### Polling in a chat system
- Polling is a technique in which the client periodically asks the server whether there are new messages available.
- Depending on polling frequency, polling could be costly.
- It could consume precious server resources to answer a question whose answer is "no" most of the time.

![chat_system.png](./images/system.design/chat_system.png)
### Long polling in a chat system
- In long polling, a client holds the connection open until there are actually new messages available or a timeout
threshold has been reached.
- Once the client receives new messages, it immediately sends another request to the server, restarting the process.

![chat_system_1.png](./images/system.design/chat_system_1.png)
### WebSocket in a chat system
- `WebSocket` connection is initiated by the client. It is `bi-directional` and persistent.
- It starts its life as an HTTP connection and can be "upgraded" via a well-defined handshake to a WebSocket
connection.
- WebSocket is a `full-duplex protocol`. It allows both the client and the server to send messages at any time.
- By using `WebSocket` for both sending and receiving, it simplifies the design and makes implementation on both client
and server more straightforward.

![chat_system_2.png](./images/system.design/chat_system_2.png)
![chat_system_3.png](./images/system.design/chat_system_3.png)
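
A minimal full-duplex server sketch, assuming the third-party `websockets` package (the port and messages are arbitrary):

```python
# pip install websockets
import asyncio
import websockets

async def handler(ws):
    # Full duplex: the server can push without being asked...
    await ws.send("welcome")
    async for message in ws:              # ...and receive at any time,
        await ws.send(f"ack: {message}")  # replying whenever it wants.

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()            # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```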
### High-level design for a chat system
![chat_system_4.png](./images/system.design/chat_system_4.png)
- `Stateless services`
- Stateless services are traditional public-facing request/response services, used to manage the
login, signup, user profile, etc. These are common features among many websites and apps.
- `Stateful Service`
- The only stateful service is the chat service.
- The service is stateful because each client maintains a persistent network connection to a chat server.
- In this service, a client normally does not switch to another chat server as long as the server is still
available.
- The service discovery coordinates closely with the chat service to avoid server overloading.

### Adjusted high-level design for a chat system
![chat_system_5.png](./images/system.design/chat_system_5.png)
- The client maintains a persistent `WebSocket` connection to a chat server for real-time messaging.
- `Chat servers` facilitate message sending/receiving.
- `Presence servers` manage online/offline status.
- `API servers` handle everything including user login, signup, change profile, etc.
- `Notification servers` send push notifications.
- The `key-value store` is used to store chat history. When an offline user comes online, she will see all her previous
chat history.

### Storage for a chat system
- `SQL` databases:
- The first is generic data, such as user profile, setting, user friends list.
- These data are stored in robust and reliable relational databases.
- Replication and sharding are common techniques to satisfy availability and scalability
requirements.
- `NoSQL` databases:
- The second is chat history.
- Chat history is stored in a key-value store.

### Message table for 1 on 1 chat in a chat system
![chat_system_6.png](./images/system.design/chat_system_6.png)
- The primary key is `message_id`, which helps to decide message sequence
### Message table for group chat in a chat system
![chat_system_7.png](./images/system.design/chat_system_7.png)
- The composite primary key is (`channel_id`, `message_id`).
- `Channel` and `group` represent the same meaning here.
- `channel_id` is the partition key because all queries in a group chat operate in a channel.

### Service discovery for a chat system
- The primary role of `service discovery` is to recommend the best chat server for a client based on the criteria like
geographical location, server capacity, etc.

![chat_system_8.png](./images/system.design/chat_system_8.png)
1. `User A` tries to log in to the app.
2. The load balancer sends the login request to API servers.
3. After the backend authenticates the user, service discovery finds the best chat server for `User A`. In this
example, `server 2` is chosen and the server info is returned back to `User A`.
4. `User A` connects to chat server 2 through `WebSocket`.

### 1 on 1 chat flow in a chat system
![chat_system_9.png](./images/system.design/chat_system_9.png)
![chat_system_10.png](./images/system.design/chat_system_10.png)
### Message synchronization across multiple devices in a chat system
![chat_system_11.png](./images/system.design/chat_system_11.png)
- Each device maintains a variable called `cur_max_message_id`, which keeps track of the latest
message ID on the device.
- Messages that satisfy the following two conditions are considered new messages:
- The recipient ID is equal to the currently logged-in user ID.
- Message ID in the key-value store is larger than `cur_max_message_id`.
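
A minimal sketch of this filter; the message rows and IDs are made-up stand-ins for rows in the key-value store:

```python
messages = [
    {"message_id": 119, "recipient_id": 42, "text": "already synced"},
    {"message_id": 121, "recipient_id": 42, "text": "new"},
    {"message_id": 122, "recipient_id": 7,  "text": "someone else's"},
]

current_user_id = 42
cur_max_message_id = 120   # tracked per device

new_messages = [
    m for m in messages
    if m["recipient_id"] == current_user_id
    and m["message_id"] > cur_max_message_id
]
assert [m["message_id"] for m in new_messages] == [121]
# After syncing, the device advances cur_max_message_id to 121.
```

### Small group chat flow in a chat system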
![chat_system_12.png](./images/system.design/chat_system_12.png)
- First, the message from `User A` is copied to each group member's message sync queue: one for `User B` and the second
for `User C`.
- You can think of the message sync queue as an inbox for a recipient
- This design choice is good for small group chat because:
- it simplifies message sync flow as each client only needs to check its own inbox to get new messages.
- when the group number is small, storing a copy in each recipient's inbox is not too expensive.

### User login flow in a chat system
- After a WebSocket connection is built between the client and the real-time service, `User A's` online status
and `last_active_at` timestamp are saved in the KV store.
- Presence indicator shows the user is online after she logs in.

![chat_system_13.png](./images/system.design/chat_system_13.png)
### User logout flow in a chat system
- The online status is changed to offline in the KV store. The presence indicator shows a user is offline
![chat_system_14.png](./images/system.design/chat_system_14.png)
### User disconnection flow in a chat system
- We introduce a `heartbeat` mechanism to solve this problem.
- Periodically, an online client sends a heartbeat event to presence servers.
- If presence servers receive a heartbeat event from the client within a certain time, say x seconds, the user is
considered online.
- Otherwise, it is considered offline.

![chat_system_15.png](./images/system.design/chat_system_15.png)
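
A minimal sketch of the heartbeat bookkeeping; the 30-second timeout is an assumed value for x:

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds; an assumption for illustration

last_heartbeat: dict[int, float] = {}   # user_id -> time of last heartbeat

def on_heartbeat(user_id: int) -> None:
    """Called whenever a client's periodic heartbeat event arrives."""
    last_heartbeat[user_id] = time.monotonic()

def is_online(user_id: int) -> bool:
    ts = last_heartbeat.get(user_id)
    return ts is not None and time.monotonic() - ts < HEARTBEAT_TIMEOUT

on_heartbeat(42)
assert is_online(42)       # heartbeat just received
assert not is_online(7)    # no heartbeat seen -> considered offline
```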
### Online status fanout in a chat system
- Presence servers use a `publish-subscribe model`, in which each friend pair maintains a channel.
- When `User A's` online status changes, it publishes the event to three channels, channel `A-B`, `A-C`, and `A-D`.
- Those three channels are subscribed by User `B`, `C`, and `D`
- Thus, it is easy for friends to get online status updates.
- The communication between clients and servers is through real-time `WebSocket`.

![chat_system_16.png](./images/system.design/chat_system_16.png)
### End-to-end encryption in a chat system
- End-to-end encryption is a way to secure communications such that only the sender and the intended recipient(s) can
access the content of the message.
- This is achieved by encrypting the message on the sender's device and decrypting it
on the recipient's device, without any intermediate parties being able to access the content.
- Here's how end-to-end encryption works for messages:
- Sender and recipient(s) agree on a `secret key` that will be used for encryption and decryption.
- `The sender's device` encrypts the message using this `secret key`.
- This can be done using a variety of encryption algorithms such as:
- `AES` (Advanced Encryption Standard)
- `RSA` (Rivest-Shamir-Adleman)
- `ECC` (Elliptic Curve Cryptography)
- The encrypted message is then transmitted over a network or communication channel.
- `The recipient's device` receives the encrypted message and uses the same secret key to decrypt it.
- The decrypted message is then displayed to the recipient, who can read it.
- The key point to remember is that the message is encrypted and decrypted at the endpoints (i.e., the sender and
recipient devices), and the encryption key is not shared with any intermediaries or servers along the way.
- This means that even if the message is intercepted or hacked, it cannot be read without the secret key, which is known
only to the sender and recipient.

![chat_system_17.png](./images/system.design/chat_system_17.png)
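
A minimal sketch of the encrypt-at-sender / decrypt-at-recipient flow, using the `cryptography` package's Fernet recipe (AES-based) as a stand-in for the algorithms listed above; how the two devices agree on the secret key is out of scope here:

```python
# pip install cryptography
from cryptography.fernet import Fernet

# Both devices hold the same secret key; servers in between never see it.
secret_key = Fernet.generate_key()

# The sender's device encrypts...
token = Fernet(secret_key).encrypt(b"see you at 6pm")

# ...intermediaries only ever handle ciphertext...
assert b"6pm" not in token

# ...and the recipient's device decrypts with the same key.
assert Fernet(secret_key).decrypt(token) == b"see you at 6pm"
```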
---
## Design a search autocomplete system
### What is a search autocomplete system?
![autocomplete_system_9.png](./images/system.design/autocomplete_system_9.png)
### Establish design scope for a search autocomplete system
- `Candidate`: Is the matching only supported at the beginning of a search query or in the middle as well?
- `Interviewer`: Only at the beginning of a search query.
- `Candidate`: How many autocomplete suggestions should the system return?
- `Interviewer`: 5
- `Candidate`: How does the system know which 5 suggestions to return?
- `Interviewer`: This is determined by popularity, decided by the historical query frequency.
- `Candidate`: Does the system support spell check?
- `Interviewer`: No, spell check or autocorrect is not supported.
- `Candidate`: Are search queries in English?
- `Interviewer`: Yes. If time allows at the end, we can discuss multi-language support.
- `Candidate`: Do we allow capitalization and special characters?
- `Interviewer`: No, we assume all search queries have lowercase alphabetic characters.
- `Candidate`: How many users use the product?
- `Interviewer`: 10 million DAU.

### Requirements for a search autocomplete system
- `Fast response time`: As a user types a search query, autocomplete suggestions must show
up fast enough. An article about Facebook's autocomplete system reveals that the
system needs to return results within 100 milliseconds. Otherwise, it will cause stuttering.
- `Relevant`: Autocomplete suggestions should be relevant to the search term.
- `Sorted`: Results returned by the system must be sorted by popularity or other ranking models.
- `Scalable`: The system can handle high traffic volume.
- `Highly available`: The system should remain available and accessible when part of the
system is offline, slows down, or experiences unexpected network errors.

### Trie data structure for a search autocomplete system
![autocomplete_system.png](./images/system.design/autocomplete_system.png)
- To avoid traversing the whole trie, we store `top k most frequently` used queries at each node.
- Since 5 to 10 autocomplete suggestions are enough for users, `k` is a relatively small number.
- By caching top search queries at every node, we significantly reduce the time complexity to retrieve the top 5
queries.
- However, this design requires a lot of space to store top queries at every node.
- Trading space for time is well worth it as fast response time is very important.
- `Time complexity`: only `O(1)` to fetch the `top k` queries.
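
A minimal Python sketch of a trie that caches the top k queries at each node, so a suggestion lookup never traverses the subtree (class and method names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_k = []  # cached (query, frequency) pairs, highest first

class AutocompleteTrie:
    def __init__(self, k: int = 5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query: str, frequency: int) -> None:
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Refresh the cached top k on every node along the path.
            pairs = [p for p in node.top_k if p[0] != query]
            pairs.append((query, frequency))
            node.top_k = sorted(pairs, key=lambda p: -p[1])[: self.k]

    def suggest(self, prefix: str) -> list[str]:
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.top_k]  # no subtree traversal needed

trie = AutocompleteTrie()
trie.insert("best", 35)
trie.insert("bee", 20)
trie.insert("be", 15)
print(trie.suggest("be"))  # ['best', 'bee', 'be']
```

### Data gathering service for a search autocomplete system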
- To design a scalable data gathering service, we examine where data comes from and how data is used.
- Real-time applications like Twitter require up-to-date autocomplete suggestions.
- However, autocomplete suggestions for many Google keywords might not change much on a daily basis.
- Despite the differences in use cases, the underlying foundation for data gathering service
remains the same because data used to build the trie is usually from analytics or logging
services.

![autocomplete_system_1.png](./images/system.design/autocomplete_system_1.png)
### Analytics Logs for a search autocomplete system
- It stores raw data about search queries
- Logs are append-only and are not indexed.

![autocomplete_system_2.png](./images/system.design/autocomplete_system_2.png)
### Aggregators for a search autocomplete system
- The size of analytics logs is usually very large, and data is not in the right format.
- We need to aggregate data so it can be easily processed by our system.
- Depending on the use case, we may aggregate data `differently`.
- `For real-time applications` such as Twitter, we aggregate data in a shorter time interval as real-time results are
important.
- On the other hand, aggregating data less frequently, say once per week, might be good enough for many use cases.
- _During an interview session, verify whether real-time results are important_.

### Aggregated Data for a search autocomplete system
- `time` field represents the start time of a week.
- `frequency` field is the sum of the occurrences for the corresponding query in that week.

![autocomplete_system_3.png](./images/system.design/autocomplete_system_3.png)
### Workers for a search autocomplete system
- Workers are a set of servers that perform asynchronous jobs at regular intervals.
- They build the trie data structure and store it in `Trie DB`.

### Trie Cache for a search autocomplete system
- Trie Cache is a distributed cache system that keeps the trie in memory for fast reads.
- It takes a weekly snapshot of the DB.

### Trie DB for a search autocomplete system
- Trie DB is the persistent storage. Two options are available to store the data:
- `Document store`: Since a new trie is built weekly, we can periodically take a snapshot of it,
serialize it, and store the serialized data in the database. Document stores like `MongoDB`
are good fits for serialized data.
- `Key-value store`: A trie can be represented in a `hash table` form by applying the following logic:
- Every prefix in the trie is mapped to a key in a hash table
- Data on each trie node is mapped to a value in a hash table.

![autocomplete_system_4.png](./images/system.design/autocomplete_system_4.png)
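
A tiny illustration of that mapping (the prefixes and suggestion lists are made up):

```python
# Every prefix is a key; the node's cached top queries are the value.
trie_as_kv = {
    "b":   ["best", "bet", "bee"],
    "be":  ["best", "bet", "bee"],
    "bee": ["bee", "beer", "been"],
}

def suggest(prefix: str) -> list[str]:
    return trie_as_kv.get(prefix, [])   # one key-value lookup per request

assert suggest("be") == ["best", "bet", "bee"]
```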
### Create a trie for a search autocomplete system
- `Trie` is created by workers using aggregated data.
- The source of data is from `Analytics Log/DB`.### Update a trie for a search autocomplete system
- Update the trie `weekly`. Once a new trie is created, the new trie replaces the old one.
- Update individual trie nodes `directly` (slow).

![autocomplete_system_6.png](./images/system.design/autocomplete_system_6.png)
### Delete a trie for a search autocomplete system
- We have to remove hateful, violent, sexually explicit, or dangerous autocomplete suggestions.
- We add a `filter layer` in front of the `Trie Cache` to filter out unwanted suggestions.
- Having a filter layer gives us the flexibility of removing results based
on different filter rules.
- Unwanted suggestions are removed physically from the database asynchronously so the correct data set will be used to
build the trie in the next update cycle.

![autocomplete_system_7.png](./images/system.design/autocomplete_system_7.png)
### Query service for a search autocomplete system
![autocomplete_system_5.png](./images/system.design/autocomplete_system_5.png)
1. A search query is sent to the `load balancer`.
2. The load balancer routes the request to `API servers`.
3. API servers get trie data from Trie Cache and construct autocomplete suggestions for the client.
4. In case the data is not in `Trie Cache`, we replenish data back to the cache. This way, all
subsequent requests for the same prefix are returned from the cache. A cache miss can
happen when a cache server is out of memory or offline.
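
A minimal sketch of steps 3 and 4 as a cache-aside read path; `trie_cache` and `trie_db_lookup` are hypothetical stand-ins for the Trie Cache and Trie DB:

```python
trie_cache: dict[str, list[str]] = {}      # stand-in for the Trie Cache

def trie_db_lookup(prefix: str) -> list[str]:
    # Hypothetical Trie DB read returning the node's cached top queries.
    return {"be": ["best", "bet", "bee"]}.get(prefix, [])

def get_suggestions(prefix: str) -> list[str]:
    suggestions = trie_cache.get(prefix)       # step 3: read Trie Cache
    if suggestions is None:                    # step 4: cache miss
        suggestions = trie_db_lookup(prefix)   # fall back to Trie DB
        trie_cache[prefix] = suggestions       # replenish the cache
    return suggestions

print(get_suggestions("be"))  # miss -> DB -> cache
print(get_suggestions("be"))  # now served from the cache
```

### Optimizations for Query service for a search autocomplete system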
- `AJAX request` - no need to reload the page
- `Browser caching` - autocomplete suggestions can be saved in browser cache to allow subsequent requests to get results
from the cache directly.

### Scale the storage tier for a search autocomplete system
- The `shard map manager` maintains a lookup database for identifying where rows should be stored.
- For example, if there are a similar number of historical queries for "s" and for "u", "v", "w", "x", "y" and "z"
combined, we can maintain two shards: one for "s" and one for "u" to "z".

![autocomplete_system_8.png](./images/system.design/autocomplete_system_8.png)
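
A tiny sketch of such a shard map (the shard names are made up):

```python
# 's' gets its own shard; 'u'..'z' share one.
shard_map = {"s": "shard-1", **{c: "shard-2" for c in "uvwxyz"}}

def shard_for(query: str) -> str:
    return shard_map.get(query[0], "shard-default")

assert shard_for("system") == "shard-1"
assert shard_for("zebra") == "shard-2"
```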
### Follow-up questions for a search autocomplete system
- How do you extend your design to support multiple languages?
- To support other non-English queries, we store Unicode characters in trie nodes
- What if top search queries in one country are different from others?
- We might build different tries for different countries.
- To improve the response time, we can store tries in CDNs.
- How can we support the trending (real-time) search queries?
- Reduce the working data set by sharding.
- Change the ranking model and assign more weight to recent search queries.
- Data may come as streams, so we do not have access to all the data at once
- Stream processing requires a different set of systems:
- Apache Hadoop MapReduce
- Apache Spark Streaming
- Apache Storm
- Apache Kafka

---
## Design a video sharing service
### Establish design scope for a video sharing service
- `Candidate`: What features are important?
- `Interviewer`: Ability to upload a video and watch a video.
- `Candidate`: What clients do we need to support?
- `Interviewer`: Mobile apps, web browsers, and smart TV.
- `Candidate`: How many daily active users do we have?
- `Interviewer`: 5 million
- `Candidate`: What is the average daily time spent on the product?
- `Interviewer`: 30 minutes.
- `Candidate`: Do we need to support international users?
- `Interviewer`: Yes, a large percentage of users are international users.
- `Candidate`: What are the supported video resolutions?
- `Interviewer`: The system accepts most of the video resolutions and formats.
- `Candidate`: Is encryption required?
- `Interviewer`: Yes
- `Candidate`: Any file size requirement for videos?
- `Interviewer`: Our platform focuses on small and medium-sized videos. The maximum
allowed video size is 1GB.
- `Candidate`: Can we leverage some of the existing cloud infrastructures provided by Amazon,
Google, or Microsoft?
- `Interviewer`: That is a great question. Building everything from scratch is unrealistic for most
companies, it is recommended to leverage some of the existing cloud services.

### Video uploading flow for a video sharing service
![video_sharing_service.png](./images/system.design/video_sharing_service.png)
![video_sharing_service_1.png](./images/system.design/video_sharing_service_1.png)
### Popular streaming protocols for a video sharing service
- MPEG-DASH. MPEG stands for "Moving Picture Experts Group" and DASH stands for "Dynamic Adaptive Streaming over HTTP".
- Apple HLS. HLS stands for "HTTP Live Streaming".
- Microsoft Smooth Streaming.
- Adobe HTTP Dynamic Streaming (HDS).

### Pre-signed upload URL for a video sharing service
![video_sharing_service_2.png](./images/system.design/video_sharing_service_2.png)
- Pre-signed upload URL is a URL that can be used to upload an object to a bucket.
- The client makes an HTTP request to API servers to fetch the `pre-signed URL`, which
gives the access permission to the object identified in the URL.
- The term `pre-signed URL` is used when uploading files to Amazon S3.
- API servers respond with a `pre-signed URL`.
- Once the client receives the response, it uploads the video using the `pre-signed URL`.
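
A sketch of the server-side step using `boto3` against S3 (the bucket and key names are hypothetical):

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")

# API server: generate a URL granting temporary permission to upload one object.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "videos-bucket", "Key": "raw/video-123.mp4"},
    ExpiresIn=3600,  # the URL is valid for one hour
)

# Client: upload the video bytes directly to storage using that URL, e.g.:
#   import requests
#   requests.put(upload_url, data=open("video.mp4", "rb"))
```

### Protect videos for a video sharing service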
- To protect copyrighted videos, we can adopt one of the following three safety options:
- `Digital rights management` (DRM) systems: Three major DRM systems are Apple
FairPlay, Google Widevine, and Microsoft PlayReady.
- `AES encryption`: You can encrypt a video and configure an authorization policy. The
encrypted video will be decrypted upon playback. This ensures that only authorized users
can watch an encrypted video.
- `Visual watermarking`: This is an image overlay on top of your video that contains
identifying information for your video. It can be your company logo or company name.

### Cost-saving optimization for a video sharing service
- CDN is a crucial component of our system.
- It ensures fast video delivery on a global scale.
- However, from the back-of-the-envelope calculation, we know CDN is expensive, especially when the data size is large.

![video_sharing_service_3.png](./images/system.design/video_sharing_service_3.png)
- Only serve the most popular videos from CDN; serve other videos from our high-capacity storage servers.
- For less popular content, we may not need to store many encoded video versions. Short videos can be encoded on-demand.
- Some videos are popular only in certain regions. There is no need to distribute these videos to other regions.
- Build your own CDN.

### Error handling for a video sharing service
- `Upload error`: retry a few times.
- `Split video error`: if older versions of clients cannot split videos by GOP alignment, the
entire video is passed to the server. The job of splitting videos is done on the server-side.
- `Transcoding error`: retry.
- `Preprocessor error`: regenerate DAG diagram.
- `DAG scheduler error`: reschedule a task.
- `Resource manager queue down`: use a replica.
- `Task worker down`: retry the task on a new worker.
- `API server down`: API servers are stateless so requests will be directed to a different API server.
- `Metadata cache server down`: data is replicated multiple times. If one node goes down,
you can still access other nodes to fetch data. We can bring up a new cache server to
replace the dead one.
- `Metadata DB server down`:
- `Master is down`. If the master is down, promote one of the slaves to act as the new master.
- `Slave is down`. If a slave goes down, you can use another slave for reads and bring up another database server to
replace the dead one.

---
## Design Google Drive
![google_drive.png](./images/system.design/google_drive.png)
### Establish design scope for Google Drive
- `Candidate`: What are the most important features?
- `Interviewer`: Upload and download files, file sync, and notifications.
- `Candidate`: Is this a mobile app, a web app, or both?
- `Interviewer`: Both.
- `Candidate`: What are the supported file formats?
- `Interviewer`: Any file type.
- `Candidate`: Do files need to be encrypted?
- `Interviewer`: Yes, files in the storage must be encrypted.
- `Candidate`: Is there a file size limit?
- `Interviewer`: Yes, files must be 10 GB or smaller.
- `Candidate`: How many users does the product have?
- `Interviewer`: 10M DAU.

### Upload a file to Google Drive
- Two types of uploads are supported:
- `Simple upload`. Use this upload type when the file size is small.
- `Resumable upload`. Use this upload type when the file size is large and there is a high chance of network
interruption.
- Here is an example of resumable upload API:
- https://api.example.com/files/upload?uploadType=resumable
- Params:
- `uploadType`:resumable
- `data`: Local file to be uploaded.
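
A hypothetical client-side sketch of a resumable upload that sends the file in fixed-size chunks and can continue from the last committed offset; the chunk size and the `Content-Range` convention are assumptions, not part of the example API above:

```python
import requests  # pip install requests

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per request; an assumed value

def resumable_upload(path: str, session_uri: str, offset: int = 0) -> None:
    """Upload from `offset`; on interruption, re-ask the server for the
    last committed offset and call this function again with it."""
    with open(path, "rb") as f:
        f.seek(offset)
        while chunk := f.read(CHUNK_SIZE):
            end = offset + len(chunk) - 1
            resp = requests.put(
                session_uri,
                headers={"Content-Range": f"bytes {offset}-{end}/*"},
                data=chunk,
            )
            resp.raise_for_status()
            offset += len(chunk)
```

### Download a file from Google Drive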
- Example API: https://api.example.com/files/download
- Params:
- `path`: download file path.

### Sync conflicts for Google Drive
- The `first version` that gets processed `wins`, and the version that gets processed later receives a conflict
![google_drive_1.png](./images/system.design/google_drive_1.png)
### High-level design for Google Drive
![google_drive_2.png](./images/system.design/google_drive_2.png)
- `User`: A user uses the application either through a browser or mobile app.
- `Block servers`: Block servers upload blocks to cloud storage. Block storage, referred to as
block-level storage, is a technology to store data files on cloud-based environments. A file
can be split into several blocks, each with a unique hash value, stored in our metadata
database. Each block is treated as an independent object and stored in our storage system
(S3). To reconstruct a file, blocks are joined in a particular order. As for the block size, we
use Dropbox as a reference: it sets the maximal size of a block to 4MB [6].
- `Cloud storage`: A file is split into smaller blocks and stored in cloud storage.
- `Cold storage`: Cold storage is a computer system designed for storing inactive data, meaning
files are not accessed for a long time.
- `Load balancer`: A load balancer evenly distributes requests among API servers.
- `API servers`: These are responsible for almost everything other than the uploading flow. API
servers are used for user authentication, managing user profile, updating file metadata, etc.
- `Metadata database`: It stores metadata of users, files, blocks, versions, etc. Please note that
files are stored in the cloud and the metadata database only contains metadata.
- `Metadata cache`: Some of the metadata are cached for fast retrieval.
- `Notification service`: It is a publisher/subscriber system that allows data to be transferred
from notification service to clients as certain events happen. In our specific case, notification
service notifies relevant clients when a file is added/edited/removed elsewhere so they can
pull the latest changes.
- `Offline backup queue`: If a client is offline and cannot pull the latest file changes, the offline
backup queue stores the info so changes will be synced when the client is online.

### Block servers for Google Drive
- Block servers process files passed from clients by splitting a file into blocks, compressing each block, and
encrypting them.

![google_drive_3.png](./images/system.design/google_drive_3.png)
- A file is split into smaller blocks.
- Each block is compressed using compression algorithms.
- To ensure security, each block is encrypted before it is sent to cloud storage.
- Blocks are uploaded to the cloud storage.
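
A minimal sketch of the split/compress/hash pipeline using only the standard library; the encryption step is stubbed out with a comment:

```python
import hashlib
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, using Dropbox's block size as a reference

def split_into_blocks(data: bytes) -> list[dict]:
    """Split -> compress -> (encrypt) -> address each block by its hash."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i : i + BLOCK_SIZE]
        compressed = zlib.compress(block)
        # A real block server would encrypt `compressed` here before upload.
        blocks.append({
            "order": i // BLOCK_SIZE,                   # for reconstruction
            "hash": hashlib.sha256(block).hexdigest(),  # unique block ID
            "payload": compressed,
        })
    return blocks

blocks = split_into_blocks(b"x" * (9 * 1024 * 1024))
print(len(blocks), [b["hash"][:8] for b in blocks])  # 3 blocks
```

### Metadata database for Google Drive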
![google_drive_4.png](./images/system.design/google_drive_4.png)
- `User`: The user table contains basic information about the user such as username, email, profile photo, etc.
- `Device`: Device table stores device info. Push_id is used for sending and receiving mobile
push notifications. Please note a user can have multiple devices.
- `Namespace`: A namespace is the root directory of a user.
- `File`: File table stores everything related to the latest file.
- `File_version`: It stores version history of a file. Existing rows are read-only to keep the integrity of the file
revision history.
- `Block`: It stores everything related to a file block. A file of any version can be reconstructed by joining all the
blocks in the correct order.

### Failure Handling for Google Drive
- `Load balancer failure`: If a load balancer fails, the secondary would become active and
pick up the traffic. Load balancers usually monitor each other using a heartbeat, a periodic
signal sent between load balancers. A load balancer is considered as failed if it has not sent
a heartbeat for some time.
- `Block server failure`: If a block server fails, other servers pick up unfinished or pending jobs.
- `Cloud storage failure`: S3 buckets are replicated multiple times in different regions. If
files are not available in one region, they can be fetched from different regions.
- `API server failure`: It is a stateless service. If an API server fails, the traffic is redirected
to other API servers by a load balancer.
- `Metadata cache failure`: Metadata cache servers are replicated multiple times. If one node
goes down, you can still access other nodes to fetch data. We will bring up a new cache
server to replace the failed one.
- `Metadata DB failure`:
- `Master down`: If the master is down, promote one of the slaves to act as a new master and bring up a new slave
node.
- `Slave down`: If a slave is down, you can use another slave for read operations and bring another database server
to replace the failed one.
- `Notification service failure`: Every online user keeps a long poll connection with the
notification server. If a server goes down, all the long poll connections are lost so clients must reconnect to a
different server. Even though one server can keep many open connections, it cannot
reconnect all the lost connections at once. Reconnecting with all the lost clients is a
relatively slow process.
- `Offline backup queue failure`: Queues are replicated multiple times. If one queue fails,
consumers of the queue may need to re-subscribe to the backup queue.

---
## Distributed Message Queue
### Distributed Message Queue Requirements
![distribute_message_queue.png](./images/system.design/distribute_message_queue.png)
### High-level architecture for Distributed Message Queue
![distribute_message_queue_1.png](./images/system.design/distribute_message_queue_1.png)
### VIP and Load balancer for Distributed Message Queue
- `VIP` - virtual IP
![distribute_message_queue_2.png](./images/system.design/distribute_message_queue_2.png)
### FrontEnd service for Distributed Message Queue
![distribute_message_queue_3.png](./images/system.design/distribute_message_queue_3.png)
### Metadata service for Distributed Message Queue
![distribute_message_queue_4.png](./images/system.design/distribute_message_queue_4.png)
### Backend service for Distributed Message Queue
![distribute_message_queue_5.png](./images/system.design/distribute_message_queue_5.png)
![distribute_message_queue_6.png](./images/system.design/distribute_message_queue_6.png)
![distribute_message_queue_7.png](./images/system.design/distribute_message_queue_7.png)
![distribute_message_queue_8.png](./images/system.design/distribute_message_queue_8.png)
### Other discussion topics for Distributed Message Queue
![distribute_message_queue_9.png](./images/system.design/distribute_message_queue_9.png)
---
## Design a notification service #2
### Requirements for a notification service
![notification_system_0001.png](./images/system.design/notification_system_0001.png)
### High-level architecture for a notification service
![notification_system_0002.png](./images/system.design/notification_system_0002.png)
### Frontend service for a notification service
![notification_system_0003.png](./images/system.design/notification_system_0003.png)
![notification_system_0004.png](./images/system.design/notification_system_0004.png)
### Metadata service for a notification service
- Responsible for storing metadata such as topics and subscriptions in the database.
![notification_system_0005.png](./images/system.design/notification_system_0005.png)
### Temporary storage for a notification service
- Temporary storage is used to store messages that are not yet delivered to subscribers.
![notification_system_0006.png](./images/system.design/notification_system_0006.png)
### Sender service for a notification service
![notification_system_0007.png](./images/system.design/notification_system_0007.png)
### Other discussion topics for a notification service
![notification_system_0008.png](./images/system.design/notification_system_0008.png)
---
## Design Rate Limiter #2
### Requirements for a rate limiter
![design_rate_limiter.png](./images/system.design.2/design_rate_limiter.png)
### Simple solution for a rate limiter
![design_rate_limiter_2.png](./images/system.design.2/design_rate_limiter_2.png)
### Token bucket algorithm for a rate limiter
![design_rate_limiter_3.png](./images/system.design.2/design_rate_limiter_3.png)
### Token bucket algorithm code for a rate limiter
![design_rate_limiter_4.png](./images/system.design.2/design_rate_limiter_4.png)
![design_rate_limiter_5.png](./images/system.design.2/design_rate_limiter_5.png)
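
The code referenced here lives in the images above; a minimal Python sketch of the same token bucket idea:

```python
import time

class TokenBucket:
    """Bucket holds at most `capacity` tokens, refilled at a fixed rate."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate       # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                         # request is throttled

bucket = TokenBucket(capacity=10, refill_rate=5)  # 10 burst, 5 rps sustained
print([bucket.allow() for _ in range(12)])        # the last calls are throttled
```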
### Token bucket algorithm classes for a rate limiter
![design_rate_limiter_6.png](./images/system.design.2/design_rate_limiter_6.png)
### Message broadcasting for a rate limiter
![design_rate_limiter_7.png](./images/system.design.2/design_rate_limiter_7.png)
### How to integrate a rate limiter into a system
![design_rate_limiter_8.png](./images/system.design.2/design_rate_limiter_8.png)
---
## Design a distributed cache system
### Requirements for a distributed cache system
![distributed_cache_system.png](./images/system.design.2/distributed_cache_system.png)
### How to organise a distributed cache system
![distributed_cache_system_1.png](./images/system.design.2/distributed_cache_system_1.png)
### Choosing a cache host for a distributed cache system
- A significantly smaller fraction of keys is rehashed when a host is added or removed (compared to mod-N hashing).
![distributed_cache_system_2.png](./images/system.design.2/distributed_cache_system_2.png)
![distributed_cache_system_3.png](./images/system.design.2/distributed_cache_system_3.png)
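
A minimal consistent hashing sketch showing why so few keys move (host names and the virtual-node count are arbitrary):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Keys map to the next host point clockwise on a hash ring."""

    def __init__(self, hosts, vnodes: int = 100):
        # Virtual nodes spread each host around the ring for better balance.
        self.ring = sorted((_hash(f"{h}#{i}"), h)
                           for h in hosts for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def host_for(self, key: str) -> str:
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.host_for("user:42"))
# Adding or removing a host remaps only ~1/N of the keys, because only the
# ring segments adjacent to that host's points change owners.
```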
### Maintaining a list of caches for a distributed cache system
- It is OK for replicas to serve stale data during a partition: since we are building a `fast` cache, an inconsistency only results in a cache miss.

![distributed_cache_system_4.png](./images/system.design.2/distributed_cache_system_4.png)
![distributed_cache_system_5.png](./images/system.design.2/distributed_cache_system_5.png)
### What else to consider for a distributed cache system
![distributed_cache_system_6.png](./images/system.design.2/distributed_cache_system_6.png)
### High level architecture for a distributed cache system
![distributed_cache_system_7.png](./images/system.design.2/distributed_cache_system_7.png)
---
## Top K Problem #2
### Requirements for a top k problem
![top_k_problem.png](./images/system.design.2/top_k_problem.png)
### Single host solution for a top k problem
![top_k_problem_1.png](./images/system.design.2/top_k_problem_1.png)
### Multiple hosts solution for a top k problem
![top_k_problem_2.png](./images/system.design.2/top_k_problem_2.png)
### Count-min sketch for a top k problem
![top_k_problem_4.png](./images/system.design.2/top_k_problem_4.png)
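
A minimal count-min sketch implementation; the width, depth, and hashing scheme are illustrative choices:

```python
import hashlib

class CountMinSketch:
    """Frequency estimates in fixed memory; counts can only be overestimated."""

    def __init__(self, width: int = 1000, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item: str):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item: str) -> None:
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def estimate(self, item: str) -> int:
        # The minimum across rows limits the damage from hash collisions.
        return min(self.table[row][col] for row, col in self._indexes(item))

cms = CountMinSketch()
for video in ["a", "a", "b", "a"]:
    cms.add(video)
print(cms.estimate("a"))  # ~3; pair with a heap to keep the current top k
```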
### High level architecture for a top k problem
![top_k_problem_3.png](./images/system.design.2/top_k_problem_3.png)
### Data flow fast path for a top k problem
![top_k_problem_5.png](./images/system.design.2/top_k_problem_5.png)
### Data flow slow path for a top k problem
![top_k_problem_6.png](./images/system.design.2/top_k_problem_6.png)
### Map reduce jobs for a top k problem
![top_k_problem_7.png](./images/system.design.2/top_k_problem_7.png)
### Data retrieval for a top k problem
![top_k_problem_8.png](./images/system.design.2/top_k_problem_8.png)
---
## Design a system to count video views
### 1. Requirements clarifications for a system to count video views
![sbs_1.png](./images/system.design.2/sbs_1.png)
### 2. Functional requirements API for a system to count video views
![sbs_2.png](./images/system.design.2/sbs_2.png)
### 3. Non-functional requirements for a system to count video views
- The most important requirements (should be focused first):
- `scalability`
- `performance`
- `availability` or `consistency`
- Others (should be focused later):
- `consistency`
- `cost`

![sbs_3.png](./images/system.design.2/sbs_3.png)
### 4. High-level architecture diagram for a system to count video views
![sbs_4.png](./images/system.design.2/sbs_4.png)
### 5. Define data model for a system to count video views
- Clarify with the interviewer which approach is better for the system
- Or combine both approaches

![sbs_5.png](./images/system.design.2/sbs_5.png)
### 6. Define databases for a system to count video views
#### SQL
- `Normalization` - minimise data duplication across different tables
![sbs_6.png](./images/system.design.2/sbs_6.png)
![sbs_7.png](./images/system.design.2/sbs_7.png)
#### NoSQL
- `Gossip protocol` is a process of computer peer-to-peer communication that is based on the way epidemics
spread. It is a communication protocol in which nodes periodically `exchange state` information about themselves and
about other nodes they know about. Each node in the cluster knows about every other node in the cluster.
- `Quorum writes/reads` define a minimum number of nodes that must agree on the response
- NoSQL provides `availability` and `partition tolerance` (AP) guarantees, but no consistency guarantee.

![sbs_8.png](./images/system.design.2/sbs_8.png)
#### How to store data
![sbs_9.png](./images/system.design.2/sbs_9.png)
### 7. Processing service for a system to count video views
![sbs_10.png](./images/system.design.2/sbs_10.png)
![sbs_11.png](./images/system.design.2/sbs_11.png)
![sbs_12.png](./images/system.design.2/sbs_12.png)
![sbs_13.png](./images/system.design.2/sbs_13.png)
### 8. Design diagram for data ingestion path for a system to count video views
![sbs_14.png](./images/system.design.2/sbs_14.png)
![sbs_15.png](./images/system.design.2/sbs_15.png)
### 9. Design diagram for data retrieval path for a system to count video views
![sbs_16.png](./images/system.design.2/sbs_16.png)
### 10. Detailed design diagram for a system to count video views
![sbs_17.png](./images/system.design.2/sbs_17.png)
### 11. Technology stack for a system to count video views
![sbs_18.png](./images/system.design.2/sbs_18.png)
### 12. Summary for a system to count video views
![sbs_19.png](./images/system.design.2/sbs_19.png)
---
## URL shortening system: questions about the system
- How to ensure uniqueness of short URLs?
- Database will help. We first check if the generated short URL is already present in the database, and return a new
URL to the client only when the uniqueness constraint is honored. To check URL presence in the database quickly
and cheaply, we should utilize a Bloom filter, a memory-efficient probabilistic data structure to quickly (O(1))
check whether an element is present in a set (see the sketch after this list).
- SQL or NoSQL database to store the mapping between long and short URLs?
- The URL shortening system is read-heavy. Any database that scales well for reads will work. And as we already
know, both SQL databases (e.g. using solutions like Vitess) and NoSQL databases can scale reads very well. NoSQL
database (e.g. MongoDB) will be a better choice. Why? Let's discuss it in the second module of the course.
- How to retrieve long URLs quickly?
- The cache will store the mapping between short URLs and long URLs. We can use either the cache-aside
or the read-through pattern for updating the cache, with read-through being the better option. As for the eviction
policy, the Least Recently Used (LRU) eviction policy is a good choice.
- Should we have a separate component for handling read requests?
- Yes, we should introduce a separate web service for handling read requests. Let's call it a Read service. Having a
separate component will allow us to scale read requests independently of write requests, and we can make this service
act as a read-through cache.
- How to scale read requests?
- We are going to use horizontally scalable distributed cache solutions (Memcached, Redis) and add more read
replicas to the database when needed. As mentioned earlier, we can also utilize Bloom filters to reduce the load
on both the cache and the database.
- How to properly handle large spikes of read requests (popular URLs)?
- We should utilize the local cache on Read service machines to store the same hot URL on multiple machines. This
way, read requests can be handled by any of the Read service machines in the cluster, avoiding heavy load on one
particular machine.
- How to achieve high availability?
- As we already know, achieving high availability requires both architecture and processes. In our system, each
individual component should be highly available. This means we have redundancy for every component in the system,
rely heavily on load balancing, protect our components from atypical behaviors of clients and downstream
dependencies, and quickly detect and resolve failures.
- How to ensure high durability?
- By copying data at multiple different levels. Specifically, we rely on database replication and backups.
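
As referenced above, a minimal Bloom filter sketch for the short-URL uniqueness check (the size and hash count are illustrative):

```python
import hashlib

class BloomFilter:
    """No false negatives: if it says 'absent', the short URL is surely unused."""

    def __init__(self, size: int = 1_000_000, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("xK9f2A")
print(seen.might_contain("xK9f2A"))  # True -> verify against the database
print(seen.might_contain("zzzzzz"))  # False -> surely unused, skip the DB read
```

---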
## Fraud detection system
### Functional requirements
- I think accuracy and speed are important for such systems. Here are a few functional requirements for these
characteristics:
- The system should take no more than 5 seconds to evaluate each transaction.
- If a transaction is flagged as `potentially fraudulent`, it should take no more than several hours to make the
final decision on the transaction.
- No more than X% of transactions (configurable, for example, 1%) should be flagged as `potentially fraudulent` by the system.

### Non-functional requirements
- High scalability (to support both long-term and short-term increases in incoming transactions).
- High availability (to be always up and running so that all transactions are evaluated and fraudulent transactions are
not missed).
- High durability (transactions flagged as `fraudulent` should never be lost).
- High extensibility (we should be able to integrate the system with additional data sources (e.g. user's past payment
history) that allow us to more accurately evaluate transactions).

### Key actors
- `User`, a person who makes an online purchase.
- `Fraud analyst`, a person who evaluates potentially fraudulent transactions to make a final decision.

### Key components
![Fraud_detection_system.png](./images/system.design.3/Fraud_detection_system.png)
- `API gateway`, a web service that represents an entry point to the system.
- `Fraud analyst admin tool`, a web application that is used by fraud analysts to review potentially fraudulent
transactions and make a decision.
- `Data aggregator`, a web service that collects additional information about a transaction.
- `Scoring service`, a web service that calculates a score for a transaction (think of this score as the likelihood that
a transaction is fraudulent).
- `Messaging system`, middleware that helps transfer potentially fraudulent transactions to fraud analysts in an
asynchronous manner.
- `Database`, a persistent storage for potentially fraudulent transactions.
- `Machine learning model training pipeline`, a pipeline consisting of batch jobs that help train a machine learning
model used by the scoring service.
- `User profile service`, payment history service, geo service, a set of web services that provide additional
information about a transaction.

### High-level architecture
![Fraud_detection_system_2.png](./images/system.design.3/Fraud_detection_system_2.png)
1. `API gateway` is a single entry point for all requests. It is a common component of many modern architectures.
Classic API gateway performs many different functions: authentication, authorization, request routing, response
aggregation, protocol translation, load balancing, TLS termination, IP listing, rate limiting, response caching,
response compression, logging, monitoring, and a few more. We will cover API gateways in more detail in the second
module of the course.
2. `Data aggregator` calls a bunch of other services to retrieve additional information about the transaction. We want
to know more about who makes the transaction, how many transactions this user made in the past, whether the user's IP
address location matches shipping address, and so on. All auxiliary services are called in parallel to reduce total
latency of transaction processing. Retrieved information will be sent to the scoring service to calculate the fraud
score for this transaction.
3. `Scoring service uses machine learning models` to calculate a score for every transaction. If the score is below a
predefined threshold, the transaction is considered non-fraudulent and the payment processing is initiated. If the
score exceeds the threshold, the transaction is considered potentially fraudulent and is submitted to a fraud analyst
for review. Payment processing is also initiated in this case, but later, if the transaction is flagged as fraudulent
by the analyst, the payment will be canceled.
4. `Messaging system (e.g. Kafka)` is used to transfer potentially fraudulent transactions to persistent storage. This
model helps to deal with spikes in fraudulent transactions. The messaging system will help absorb these spikes and
the database will not be impacted.
5. All potentially fraudulent transactions are stored in the `database`. Both SQL and NoSQL databases can be used.
6. `Fraud analysts` use an admin tool (web application) to review transactions. They can either pass or fail every
transaction. The admin tools can be quite sophisticated, allowing fraud analysts to view much more additional
information about a transaction. It is important to implement a locking mechanism for transactions so that when a
transaction is under review, other analysts cannot see it.
7. `Model training pipeline` uses final decisions to train machine learning models. The pipeline consists of multiple
batch jobs that are used to train, test, and deploy models to production.

---
## Authentication and authorization system
### Functional requirements
- I think that such a system should be easy to use and integrate with. Here are a few functional requirements for these
characteristics:
- The system should make it easy to manage users, groups, roles, and policies.
- The system should provide an SDK for various programming languages so that other systems can easily integrate with
it.

### Non-functional requirements
- High availability (since every request must be authenticated/authorized).
- High performance (low latency when processing requests).
- High scalability (to support both long-term and short-term increases in incoming requests).

### Key actors
- Here are the key actors in the system:
- Resource admin, a person who manages policies.
- User, a person who tries to access a resource (e.g. read an S3 object).
- Services, programs (web services, applications, batch jobs, etc.) that try to access a resource (e.g. they produce
and write logs to an S3 bucket).

### Key components
![Authentication_and_authorization_system.png](./images/system.design.3/Authentication_and_authorization_system.png)
- `API gateway`, a web service that represents an entry point to the system.
- `Admin tool`, a web application that is used by resource administrators to perform CRUD operations on policies.
- `Poller service`, a service responsible for retrieving all the recent updates for policies.
- `Database`, persistent storage for policies.
- `Auth service`, a web service responsible for handling authentication and authorization requests.
- `Auth service client`, a client application for the auth service.
- `Messaging system`, middleware that helps asynchronously load recently updated policies into a cache.

### High-level architecture
![Authentication_and_authorization_system_2.png](./images/system.design.3/Authentication_and_authorization_system_2.png)
- `API gateway` is a single entry point for all write requests. It is a common component of many modern architectures.
Classic API gateway performs many different functions: authentication, authorization, request routing, response
aggregation, protocol translation, load balancing, TLS termination, IP listing, rate limiting, response caching,
response compression, logging, monitoring, and a few more. You might be wondering if read (is authorized) requests
should go through API gateway as well. It depends on whether the auth system is internal to target services or not. If
the auth system and target services all live within a trusted environment, we can ignore this extra hop to reduce
latency and overall complexity.
- `Admin tool` acts as a facade for write requests to the database. It is a stateless web service. This chain of
services (API gateway -> web service acting as a database facade -> database) is a common pattern in microservice
architectures.
- `Policies` are stored in the database. Both SQL and NoSQL databases can be used. If we decide to avoid an additional
caching layer and serve read requests directly by the database (which is fine for low to medium scale), we should
scale read requests by introducing more read replicas.
- `Poller service` is a component similar to Rule Checker. It polls
for all the new and modified policies and sends this data to the messaging system. Alternatively, we could avoid this
component entirely by having the Admin Tool write to both the database and the messaging system. A challenge with this
option though is how to keep both the database and the messaging system in sync, especially when one of these systems
becomes unavailable.
- We use the `messaging system` as a transport for policies to update the cache (auth service) asynchronously.
- `Auth service` represents a distributed cache. This allows the system to handle a large number of read requests. All policies are
parsed and stored in a way that guarantees O(1) lookup time complexity (think hashmaps). And we use servers with a lot of
CPU and memory.
- If the scale is huge (millions of requests per second) and/or latency requirements are very aggressive (single-digit
milliseconds), the server side alone (auth service) is not enough. We also need to cache aggressively on the client side.
In the case of multiple requests coming from the same user, we don't even need to call the auth service. A simple
read-through cache implemented in the auth client will be enough. A further optimization is to use the target service
fleet of machines as a consistent hashing ring, where all auth clients know about each other and every client's local
cache stores its own chunk of data. Basically, in this case, we have three layers of caching: a local cache in the
client, a distributed cache represented by all clients in the target service fleet, and a distributed cache
represented by the auth service.

---
## Monitoring system
### Functional requirements
- Here are a few functional requirements for these characteristics:
- The system should store metrics at one-minute granularity.
- Older data should be stored at lower granularity (e.g. 1 hour).
- The system should provide an SDK for various programming languages and/or a software agent so that other systems
can easily integrate with it.

### Non-functional requirements
- High scalability (the total number of metrics will grow rapidly over time).
- High availability (keep in mind that latest data is typically more important than old data).
- High read performance (especially important for high resolution alerts configured on top of metrics).
- Low dollar cost (especially for storing less relevant/old data).

### Key actors
- `Metric producer services`, `programs` (web services, applications, batch jobs, etc.) that emit performance metrics.
- `User`, typically an owner of the service, a person who watches the metrics.
- `Metric consumer services`, programs that read metrics on a regular basis to timely react to their changes (e.g. alert
generation systems, autoscaling systems, anomaly detection systems).

### Key components
![Monitoring_system.png](./images/system.design.3/Monitoring_system.png)
- `API gateway`, a web service that represents an entry point to the system.
- `Metric aggregator`, a service that aggregates metric values in memory over a short period of time and stores the
aggregated values in persistent storage.
- `Metric partitioner`, a service responsible for partitioning metrics. This allows for more efficient aggregation of
metric values.
- `Messaging system`, a temporary buffer for metric data that helps parallelise metric processing.
- `Monitoring client`, a client application for a monitoring system that helps to aggregate data on the client side and
send aggregated data for further aggregation on the server side.
- `Hot storage`, read-optimized storage.
- `Cold storage`, persistent storage for metric values.
- `Read service`, a web service responsible for serving read requests by retrieving metric data from multiple
locations (hot and cold storage).

### High-level architecture
![Monitoring_system_2.png](./images/system.design.3/Monitoring_system_2.png)
- `Monitoring client` is a daemon process that runs locally on every metric producer service machine. It reads data from
service logs and aggregates data in local memory. Think of it as a hashmap where a metric name represents a key
(e.g. "total request count") and the value represents a count (of requests, errors, etc.) or a sum (of response times,
request bytes, etc.). Periodically, for example, at the end of every minute, all accumulated values are sent to the
monitoring system (see the sketch at the end of this section).
- `API gateway` is a single entry point for all requests. It is a common component of many modern architectures. Classic
API gateway performs many different functions: authentication, authorization, request routing, response aggregation,
protocol translation, load balancing, TLS termination, IP listing, rate limiting, response caching, response
compression, logging, monitoring, and a few more. If the monitoring system and metric consumer services all live
within a trusted environment and read latency is critical, we can have metric consumer services call the read service
directly.
- Each `producer service` can produce thousands of different metrics, and there can be thousands of producer services,
which can lead to high metric cardinality. To aggregate data on the server side, which means calculating the total
count or the total sum for each metric for every 1-minute interval, we need to partition metrics first. This allows us
to split all metrics into several disjoint groups, and then aggregate metric values within each group. We can use a
unique metric name (metric name + service name) as a partition key. We can avoid partitioning metrics if metric
cardinality is low.
- Incoming metric data is buffered in a `messaging system`. Each metric goes to its own shard/partition inside the
messaging system. There will be a separate consumer (metric aggregator service instance) for each shard.
- `Metric aggregator service` aggregates metric values. Which basically means it sums up the incoming values for a
metric. It does it for a predefined duration (e.g. 5-10 seconds) and sends the accumulated value to persistent
storage. If we need to scale and/or speed up data aggregation, we first add more shards/partitions to the messaging
system and then add the same number of machines/instances to the service. So that each new shard gets its own
dedicated consumer.
- `Hot storage` represents a distributed cache (e.g. Redis) or a key-value database (e.g. DynamoDB) and stores the last
N days of data. In monitoring systems, the most recent data is more important and read frequently, while older data is
usually less important and read less frequently. Therefore, this storage has two key requirements: high read
throughput and low read latency.
- `Cold storage` represents persistent storage for data. We can store all data there (if our hot storage is not
persistent) or only data older than N days (if our hot storage is persistent). As discussed in the course, object
storage (e.g. S3) is a good option. Data in hot and cold storage is stored at some granularity, for example, 1 minute.
But on a large scale (from millions to billions of metrics), storing such a large amount of per-minute data is costly.
In addition, as we have already mentioned, the value of monitoring data decreases over time. For this reason, it makes
sense to aggregate data over time into 5-minute intervals, 1-hour intervals, and so on. The system should have a
separate component responsible for aggregating data from shorter time intervals to longer time intervals and purging
data for shorter time intervals from the storage.
- `Read service` has three primary functions: route read requests to the appropriate storage, stitch data from hot and
cold storage when both new and old data is requested, cache responses. The most recent monitoring data is not well
suited for caching because it changes frequently. But old data can be effectively cached. When there is a read request
that spans both storages, we will fetch new data from hot storage and old data from an internal cache of the read
service.
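
As referenced above, a minimal sketch of the monitoring client's local aggregation loop (the metric names are examples):

```python
from collections import defaultdict

counters: dict[str, float] = defaultdict(float)  # metric name -> count or sum

def record(metric: str, value: float = 1.0) -> None:
    counters[metric] += value        # aggregate locally, in memory

def flush() -> dict[str, float]:
    """At the end of every minute: ship the accumulated values, then reset."""
    snapshot = dict(counters)
    counters.clear()
    return snapshot                  # in reality, sent to the monitoring system

record("total request count")
record("total request count")
record("response time ms", 37.5)
print(flush())  # {'total request count': 2.0, 'response time ms': 37.5}
```

---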
## Design TikTok/Instagram Reels system
![tictok.png](./images/system.design.3/tictok.png)
---
## Online Judge for coding contests
![Online Judge for coding contests.png](./images/system.design.3/Online_Judge_for_coding_contests.png)
---
## Design an Amazon S3 or Object Storage
#### Upload
![Object Storage Upload.png](./images/system.design.3/Object_Storage_Upload.png)
#### Download
![Object Storage Download.png](./images/system.design.3/Object_Storage_Download.png)
---
## Design a Dropbox
![Design a Dropbox.png](./images/system.design.3/Design_a_Dropbox.png)
---
## Design Reddit home page feed
### Non functional requirements
![Design Reddit home page feed.png](./images/system.design.3/Design_Reddit_home_page_feed.png)
### Design
![Design Reddit home page feed 2.png](./images/system.design.3/Design_Reddit_home_page_feed_2.png)
![Design Reddit home page feed 3.png](./images/system.design.3/Design_Reddit_home_page_feed_3.png)
---
## Design Parking Garage
### Requirements
![Design Parking Garage.png](./images/system.design.3/Design_Parking_Garage.png)
### API
![Design Parking Garage 2.png](./images/system.design.3/Design_Parking_Garage_2.png)
### Data types
![Design Parking Garage 3.png](./images/system.design.3/Design_Parking_Garage_3.png)
![Design Parking Garage 4.png](./images/system.design.3/Design_Parking_Garage_4.png)
### Design
![Design Parking Garage 5.png](./images/system.design.3/Design_Parking_Garage_5.png)
---
## Design TikTok
### Requirements
![Design TicTok.png](./images/system.design.4/Design_TicTok.png)
### Design
![Design TicTok 2.png](./images/system.design.4/Design_TicTok_2.png)
---
## Design Facebook Messenger
### Requirements
![Design Facebook Messenger.png](./images/system.design.4/Design_Facebook_Messenger.png)
### Design
![Design Facebook Messenger 2.png](./images/system.design.4/Design_Facebook_Messenger_2.png)
---
## Design Instagram
### Requirements
![Design Instagram.png](./images/system.design.4/Design_Instagram.png)
![Design Instagram 2.png](./images/system.design.4/Design_Instagram_2.png)
### Data types
![Design Instagram 3.png](./images/system.design.4/Design_Instagram_3.png)
### Design
![Design Instagram 4.png](./images/system.design.4/Design_Instagram_4.png)
---
## Design Amazon Kindle Payments
### Requirements
![Design Amazon Kindle Payments.png](./images/system.design.4/Design_Amazon_Kindle_Payments.png)
### API
![Design Amazon Kindle Payments 3.png](./images/system.design.4/Design_Amazon_Kindle_Payments_3.png)
### Data model
![Design Amazon Kindle Payments 4.png](./images/system.design.4/Design_Amazon_Kindle_Payments_4.png)
### Design
![Design Amazon Kindle Payments 2.png](./images/system.design.4/Design_Amazon_Kindle_Payments_2.png)
---
## Design Amazon Prime Video
### Requirements
![Design Amazon Prime Video.png](./images/system.design.4/Design_Amazon_Prime_Video.png)
### Design
![Design Amazon Prime Video 2.png](./images/system.design.4/Design_Amazon_Prime_Video_2.png)
---
## Downloading User Data
### Design
![Downloading User Data.png](./images/system.design.4/Downloading_User_Data.png)
---
## Design Calendar data entities mapping
### Functional Requirements
- User can create a calendar
- User can create an event in a calendar
- User can invite other users to an event
- Notifications for events
- Timezone support
### Non Functional Requirements
- High availability
- Low latency
- Durability
- Scalability
### Database design
![Calendar data entities mapping.drawio.png](./images/system.design.4/Calendar_data_entities_mapping.drawio.png)
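As a rough companion to the diagram, here is one way the entities could be modeled in code. This is only a sketch; the field names and types are assumptions, not the exact schema from the diagram.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative entity mapping for the functional requirements above;
# all field names are assumptions, not the diagram's exact schema.

@dataclass
class Calendar:
    calendar_id: str
    owner_user_id: str
    name: str

@dataclass
class Event:
    event_id: str
    calendar_id: str
    title: str
    start_utc: datetime      # stored in UTC; rendered in the user's timezone
    end_utc: datetime
    timezone: str            # IANA name, e.g. "Europe/Kyiv"

@dataclass
class Invitation:
    event_id: str
    invitee_user_id: str
    status: str              # e.g. "pending" | "accepted" | "declined"

@dataclass
class Notification:
    event_id: str
    user_id: str
    notify_at_utc: datetime  # when to fire the reminder
```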
---
## Twitter API
![Twitter API.png](./images/system.design.4/Twitter_API.png)
---
## Design architecture for Instagram likes and comments
### Requirements:
### Functional
##### Need to implement:
- User can add comments to photos and videos
- User can like photos and videos
### Out of scope:
- Upload photo
- Upload video
- Watch photos, videos in feed
- Create stories
- Following other users
- Send and receive messages
### Non-functional
- Scalable
- Highly available
- Performant
- Durable
- Resilient
- Maintainable
- Secure
- Cost-efficient
### Estimates:
### Need to implement:
- Every 10th user leaves a comment
- Each comment is up to 5 KB
- We assume that comments are text only
- Every 5th user leaves a like, with a predefined size of 8 bytes
### Out of scope:
- 1B users
- 200M DAU
- Every 100th user uploads a photo or video
- Photo size is up to 5 MB; after encoding into several dimensions: 2 * 5 = 10 MB
- Video size is up to 100 MB; after encoding into several dimensions: 2 * 100 = 200 MB
- Every 100th user creates stories
- Story video size is up to 50 MB; after encoding into several dimensions: 2 * 50 = 100 MB
- Every 200th user sends and receives messages
- Each message is up to 5 KB
### Calculations:
- If every 10th user leaves a comment, we have 200M / 10 = 20M comments daily
- Each comment is up to 5 KB, so 20M comments * 5 KB = 100M KB = **100 GB of comments daily**
- If every 5th user leaves a like, we have 200M / 5 = 40M likes daily
- Each like is 8 bytes, so 40M likes * 8 bytes = 320M bytes = **320 MB of likes daily**

> Overall we have **100 GB of comments daily** and **320 MB of likes daily**
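The same back-of-the-envelope math, as a small runnable check (decimal units, matching the text):

```python
DAU = 200_000_000          # daily active users (from the estimates above)
COMMENT_RATIO = 10         # every 10th user leaves a comment
COMMENT_SIZE_KB = 5
LIKE_RATIO = 5             # every 5th user leaves a like
LIKE_SIZE_BYTES = 8

comments_per_day = DAU // COMMENT_RATIO                    # 20M comments
comments_bytes = comments_per_day * COMMENT_SIZE_KB * 1000 # ~100 GB
likes_per_day = DAU // LIKE_RATIO                          # 40M likes
likes_bytes = likes_per_day * LIKE_SIZE_BYTES              # ~320 MB

print(f"{comments_bytes / 1e9:.0f} GB of comments/day, "
      f"{likes_bytes / 1e6:.0f} MB of likes/day")
```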
### High Level Design:
![instagram_high_level_design.png](images/instagram_high_level_design.png)
### Detailed Design:
#### Use case: User leaves/removes/edits a comment or like on a photo or video
1. The `user` sends a request to `DNS` to get the IP address of the Instagram service
2. The request goes to the `Load Balancer`
3. The `Load Balancer` forwards the request to the `API Gateway` where security checks and throttling are done
4. The `API Gateway` forwards the request to the `Write API async` service
5. The `Write API async` service writes the data to `Queue`
6. The `Workers` read the data from the `Queue` and write the data to the `NoSQL` database and to the `Cache`
7. The `Analytics service` reads the data from the `Queue` and writes the data to the `Data Warehouse`
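A minimal sketch of the worker from step 6, assuming generic `queue`, `nosql`, and `cache` clients (all placeholder names). It invalidates the cache entry instead of writing to it directly; either approach fits the flow above.

```python
import json

def worker_loop(queue, nosql, cache):
    """Async write path (steps 5-7 above): drain the queue and persist each
    comment/like event. `queue`, `nosql`, and `cache` are placeholder clients;
    any message broker and key-value store would fit here."""
    while True:
        message = queue.receive()                        # blocking poll of the queue
        event = json.loads(message.body)
        nosql.put(event["photo_id"], event)              # durable write first
        cache.delete(f"comments:{event['photo_id']}")    # invalidate so reads re-fetch fresh data
        message.ack()                                    # ack only after the write succeeded
```
#### REST API: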
- **POST** https://instagram.com/api/v1/like
- **POST** https://instagram.com/api/v1/comment
```json
{
"user_id": "{user_id}",
"photo_id": "{photo_id}",
"comment": "{comment}"
}
```
Response: `comment_id`: **UUID**
- **DELETE** https://instagram.com/api/v1/comment?comment_id={comment_id}
- **PUT** https://instagram.com/api/v1/comment?comment_id={comment_id}
```json
{
"comment": "{new comment}"
}
```
#### Use case: User reads the comments or likes for a photo or video
1. The `user` sends a request to `DNS` to get the IP address of the Instagram service
2. The request goes to the `Load Balancer`
3. The `Load Balancer` forwards the request to the `API Gateway` where security checks and throttling are done
4. The `API Gateway` forwards the request to the `Read API` service
5. The `Read API` fetches the comments from a `Read replica` of the NoSQL database
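A minimal cache-aside sketch of the read path, with placeholder `cache` and `read_replica` clients:

```python
def get_comments(photo_id, cache, read_replica):
    """Read path (cache-aside): try the cache first, fall back to a read
    replica of the NoSQL store. Client objects are placeholders."""
    key = f"comments:{photo_id}"
    comments = cache.get(key)
    if comments is None:                    # cache miss
        comments = read_replica.query(photo_id=photo_id)
        cache.set(key, comments, ttl=300)   # short TTL: comments change often
    return comments
```
#### REST API: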
- **GET** https://instagram.com/api/v1/comment?photo_id={photo_id}
Response:
```json
{
"comments": [
{
"comment_id": "{comment_id}",
"user_id": "{user_id}",
"comment": "{comment}"
}
]
}
```
- **GET** https://instagram.com/api/v1/like?photo_id={photo_id}
Response:
```json
{
"likes": [
{
"user_id": "{user_id}"
}
]
}
```
![instagram_detailed_design.png](images/instagram_detailed_design.png)
### Bottlenecks:
1. The application is `Read Heavy` because people consume more content than they publish.
- `Read replica` might be a bottleneck if we have a lot of read requests
2. `Queue` might be a bottleneck if we have a lot of write requests and the workers can't handle the load
### SPOF:
1. `Load Balancer` is a SPOF: if it goes down, the whole system goes down. That is why we need several `Load Balancers` in different regions
2. `Database` is a SPOF: if it goes down, the whole system goes down. That is why we need several `Read replicas` in different regions
### 10x rise in reading and writing:
1. We can implement a throttling mechanism in the `API Gateway` to limit the number of requests and prevent the system from being overloaded (see the sketch after this list)
2. We need several `Read replicas` in different regions
3. If we encounter a 10x write increase, read replicas might lag behind the master database. With `Eventual Consistency`, users will eventually see the updated data.
4. In case of a 10x read increase, we can use a `Cache` to store the most frequently accessed data. This reduces the load on the database and improves response time.
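As a sketch of the throttling mechanism from point 1, here is a minimal token-bucket limiter of the kind an `API Gateway` could apply per user or per IP; the parameters are illustrative.

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle; one instance per user/IP/API key."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec       # steady-state allowed request rate
        self.capacity = burst          # short bursts above the rate are allowed
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                   # reject / respond with 429 Too Many Requests

# limiter = TokenBucket(rate_per_sec=100, burst=200)  # check limiter.allow() per request
```
---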
## Design Instagram #2
![instagram.png](images/system.design.5/instagram.png)
---
## Cloud design patterns
![sys_des_practices.png](./images/system.design/sys_des_practices.png)
---
## Five common system design interview mistakes
![Five common system design interview mistakes.png](./images/system.design.3/Five_common_system_design_interview_mistakes.png)
---