https://github.com/cheremin/link-imbalance-metastable-issue-simulation
Simulation of "Link imbalance metastable issue"
https://github.com/cheremin/link-imbalance-metastable-issue-simulation
kotlin simulation
Last synced: over 1 year ago
JSON representation
Simulation of "Link imbalance metastable issue"
- Host: GitHub
- URL: https://github.com/cheremin/link-imbalance-metastable-issue-simulation
- Owner: cheremin
- Created: 2022-05-04T16:26:22.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-05-14T06:41:20.000Z (about 4 years ago)
- Last Synced: 2025-01-29T00:49:33.031Z (over 1 year ago)
- Topics: kotlin, simulation
- Language: Kotlin
- Homepage: https://dev.cheremin.info/2022/05/mystery-of-link-imbalance-metastable.html
- Size: 83 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
README
Simulation of "link imbalance metastable issue" described in
* ["Metastable Failures in Distributed Systems"](https://doi.org/10.1145/3458336.3465286) Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, Timothy Zhu
* ["Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale"](https://engineering.fb.com/2014/11/14/production-engineering/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/)
Simulation is built on [kalasim](https://www.kalasim.org/) -- Kotlin discrete-events simulation engine.
### Link imbalance metastable issue
Basically, issue was happening in facebook datacenters due to interaction between:
* MRU strategy in DB connection pools, used by apps
* Multi-fiber network link between apps and DB server
* Periodical spikes of load of many small short queries
These 3 pieces together interact in such a way that almost all network activity is attracted
to a single fiber, which exceeds its capacity.
More details:
* ["Metastable Failures in Distributed Systems"](https://doi.org/10.1145/3458336.3465286) Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, Timothy Zhu
* ["Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale"](https://engineering.fb.com/2014/11/14/production-engineering/solving-the-mystery-of-link-imbalance-a-metastable-failure-state-at-scale/)
### Simulation structure
Simulation is built around `Pool`, `Fiber`, `Connection` and `Query` domain objects.
"Client" (which is the source of `Query`es) is modelled by `kalasim`'s `ComponentGenerator`
"DB" behavior in current simulation is just a delay, hence it is not modelled explicitly at all.
Both `Pool` implementations and `Fiber` are extending `kalasim` `Resource`, which gives
acquire/release occupation-controlling behavior for free. It is also quite natural choice
(both pools and fibers are indeed resources with limited capacity), and allows in
post-processing to use extended statistical data `kalasim` gathers for `Resource`s
Simulation domain objects are in `MultiFibersMRUPoolSimulation`.
Simulation scenarios are created as tests, e.g. `MultiFibersMRUPoolSimulationScenarios`.
Such tests are probabilistic by nature, i.e. success/failure is usually not guaranteed,
but happens with high enough probability, e.g. >90%. Many tests are used `RepeatedTest`
to give a sense of success/failure odds explicitly.