https://github.com/bethibande/messaging

Low-latency in-memory message broker

# Messaging
This is an implementation of a very simple, MQTT-like, high-throughput, low-latency, in-memory message broker.
This router can deliver messages to subscribers with sub-microsecond latency at hundreds of millions of messages per second.
The primary factor limiting performance at this point is the JVM's ability to iterate over and invoke the subscribers of each node.
Performance may also vary drastically depending on the JVM used.

Each key is broken into individual nodes which are assembled into a tree structure by the router.
A message must be posted using a concrete key like `entities/user/ID`, whereas a subscriber can subscribe to individual topics like `entities/user/ID`.
Subscribers may also use wildcards in their keys like `entities/user/*` or `entities/*/someValue`.
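
To illustrate how a key tree with wildcards can be matched, here is a minimal, self-contained sketch. It is not the library's actual implementation: the class `TopicTreeSketch`, its methods, and the use of plain strings as subscribers are all hypothetical, and it assumes that a subscription only matches concrete keys with the same number of tokens, which may differ from the router's real behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative topic tree (hypothetical, not the library's implementation). */
final class TopicTreeSketch {
    static final String WILDCARD = "*";

    /** One node per key token; children are found via hash map lookups. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        final List<String> subscribers = new ArrayList<>();
    }

    final Node root = new Node();

    /** Registers a subscriber name under a key such as "entities/user/*". */
    void subscribe(String key, String subscriber) {
        Node node = root;
        for (String token : key.split("/")) {
            node = node.children.computeIfAbsent(token, t -> new Node());
        }
        node.subscribers.add(subscriber);
    }

    /** Collects all subscribers whose key matches the given concrete key. */
    List<String> match(String concreteKey) {
        List<String> hits = new ArrayList<>();
        collect(root, concreteKey.split("/"), 0, hits);
        return hits;
    }

    private void collect(Node node, String[] tokens, int depth, List<String> hits) {
        if (depth == tokens.length) {
            hits.addAll(node.subscribers);
            return;
        }
        // Follow the exact token first, then the wildcard branch if present.
        Node exact = node.children.get(tokens[depth]);
        if (exact != null) collect(exact, tokens, depth + 1, hits);
        Node wild = node.children.get(WILDCARD);
        if (wild != null) collect(wild, tokens, depth + 1, hits);
    }

    public static void main(String[] args) {
        TopicTreeSketch tree = new TopicTreeSketch();
        tree.subscribe("entities/user/12345", "exact");
        tree.subscribe("entities/user/*", "anyUser");
        tree.subscribe("entities/*/12345", "anyType");
        tree.subscribe("entities/order/*", "orders");
        // Prints [exact, anyUser, anyType]
        System.out.println(tree.match("entities/user/12345"));
    }
}
```

In this sketch, each level costs at most two hash map lookups (the exact token and the wildcard), so the work per message is dominated by how many subscribers end up in the result list rather than by how many nodes the tree contains.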

### Example
```java
void main() {
    final MessageRouter router = new MessageRouter();

    // Create the subscription key once and register a subscriber for it.
    final PreComputedKey key = router.createKey("entities", "user");
    router.subscribe((subscription, actualRoute, message) -> {/* do something */}, key);

    // Post a message using a concrete key.
    final PreComputedKey postKey = router.createKey("entities", "user", "12345");
    router.post(postKey, "test");
}
```

## Benchmarks
These benchmarks are run on an AMD Ryzen 9 5950X with 64 GB of RAM.
Before the benchmark, a router with n subscribers is initialized. The subscribers are randomly generated using a hard-coded seed.
Each subscriber is subscribed to a random topic. Each generated topic is 2 or 3 tokens long. The 2nd and 3rd tokens
have a chance of being wildcards.
The benchmark then posts messages using a static key to ensure we don't perform any heap allocations during the benchmark.
With this setup, we ensure each posted message triggers a large portion of all registered subscribers.
There is no optimization for repeated posts to the same key, so we can assume that posting messages to any available topic
will be roughly the same in terms of throughput and latency as long as they are not the top-level keys.
Reusing a single static key makes good use of the CPU caches and avoids heap allocations. For a more realistic scenario, it may be necessary
to run these benchmarks again in the future with randomly chosen keys so that the results benefit as little as possible from CPU caching.
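
The subscriber setup described above could look roughly like the following sketch. It is not the benchmark's actual code: the seed, the vocabulary size, the wildcard probability, and the names `TopicGeneratorSketch` and `generate` are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative generator for random subscription topics (hypothetical). */
final class TopicGeneratorSketch {
    static final long SEED = 42L;               // assumed; the real seed is not stated
    static final int VOCABULARY = 16;           // assumed number of distinct tokens per level
    static final double WILDCARD_CHANCE = 0.25; // assumed wildcard probability

    /** Generates `count` topics of 2 or 3 tokens; only tokens 2 and 3 may be wildcards. */
    static List<String> generate(int count) {
        Random random = new Random(SEED); // fixed seed makes every run reproducible
        List<String> topics = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int length = 2 + random.nextInt(2); // 2 or 3 tokens
            StringBuilder topic = new StringBuilder("t" + random.nextInt(VOCABULARY));
            for (int level = 1; level < length; level++) {
                boolean wildcard = random.nextDouble() < WILDCARD_CHANCE;
                topic.append('/').append(wildcard ? "*" : "t" + random.nextInt(VOCABULARY));
            }
            topics.add(topic.toString());
        }
        return topics;
    }

    public static void main(String[] args) {
        generate(5).forEach(System.out::println);
    }
}
```

Because the seed is fixed, two runs produce the same set of topics, which keeps the subscriber, node, and hit counts stable between benchmark runs.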

All benchmarks are run on GraalVM 24.0.1 and 32 threads unless stated otherwise. Depending on the benchmark type, different GC settings are used.
See the description of each section for more details. GC settings can greatly affect throughput and latency
even though the benchmarks perform little to no heap allocations after warmup.

The table below indicates the number of subscribers along with the total number of generated nodes
and how many subscribers are hit for each message posted in our benchmark.

| subscribers | nodes | hits |
|-------------|-------|-------|
| 10 | 16 | 1 |
| 100 | 67 | 15 |
| 1k | 463 | 155 |
| 10k | 3385 | 1624 |
| 100k | 12121 | 16048 |

### Throughput
The tests below are all run using the default GC provided by GraalVM. Using ZGC yields ~50% less throughput.

| subscribers | messages per second |
|-------------|---------------------|
| 10 | 673.6 million |
| 100 | 302 million |
| 1k | 48 million |
| 10k | 3 million |
| 100k | 421.9 thousand |
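
As a rough way to read these figures, the message rate can be multiplied by the number of subscribers hit per message to estimate how many subscriber invocations happen per second. The sketch below does this with the values from the two tables above. This is a derived estimate rather than a measured result, it assumes the hit counts apply unchanged to the throughput runs, and the class and method names are hypothetical.

```java
import java.util.Locale;

/** Back-of-the-envelope estimate of subscriber invocations per second (hypothetical helper). */
final class DeliveryRateSketch {
    /** Messages per second multiplied by the subscribers hit per message. */
    static double deliveriesPerSecond(double messagesPerSecond, long hitsPerMessage) {
        return messagesPerSecond * hitsPerMessage;
    }

    public static void main(String[] args) {
        // Values copied from the subscriber/hit table and the throughput table.
        String[] subscribers = {"10", "100", "1k", "10k", "100k"};
        double[] messagesPerSecond = {673.6e6, 302e6, 48e6, 3e6, 421.9e3};
        long[] hits = {1, 15, 155, 1624, 16048};

        for (int i = 0; i < subscribers.length; i++) {
            double rate = deliveriesPerSecond(messagesPerSecond[i], hits[i]);
            System.out.printf(Locale.ROOT, "%-5s %.2f billion invocations/s%n",
                    subscribers[i], rate / 1e9);
        }
    }
}
```

Under these assumptions the estimate comes out at roughly 0.67, 4.53, 7.44, 4.87, and 6.77 billion invocations per second. From 100 subscribers upward the invocation rate stays within the same order of magnitude while the message rate drops by almost three orders of magnitude, which is consistent with subscriber invocation being the dominant cost.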

### Latency
The tests below are all run using ZGC, as it yields much better and more stable latencies.

| subscribers | p0.00 | p0.50 | p0.90 | p0.95 | p0.99 | p0.999 | p0.9999 | p1.00 |
|-------------|---------|----------|----------|----------|----------|----------|----------|---------|
| 10 | ? | 100 ns | 100 ns | 100 ns | 100 ns | 100 ns | 3700 ns | 14.8 ms |
| 100 | ? | 200 ns | 200 ns | 200 ns | 200 ns | 200 ns | 5000 ns | 31.5 ms |
| 1k | 300 ns | 700 ns | 800 ns | 800 ns | 800 ns | 900 ns | 16.6 µs | 23.2 ms |
| 10k | 5800 ns | 12800 ns | 12992 ns | 12992 ns | 14000 ns | 24800 ns | 129.2 µs | 35.3 ms |

For the best peak latencies, it is recommended not to run the benchmark on all available CPUs. The following benchmarks are run on 16 instead of 32 threads.

| subscribers | p0.00 | p0.50 | p0.90 | p0.95 | p0.99 | p0.999 | p0.9999 | p1.00 |
|-------------|---------|---------|---------|---------|---------|----------|---------|----------|
| 10 | 0 ns | 0 ns | 100 ns | 100 ns | 100 ns | 100 ns | 1200 ns | 108.3 µs |
| 100 | 0 ns | 100 ns | 100 ns | 100 ns | 100 ns | 200 ns | 2800 ns | 145.2 µs |
| 1k | 200 ns | 400 ns | 400 ns | 400 ns | 400 ns | 700 ns | 4896 ns | 1.7 ms |
| 10k | 4096 ns | 4896 ns | 5000 ns | 5096 ns | 8896 ns | 14688 ns | 27.1 µs | 2.7 ms |

### Conclusion
The key takeaway from these benchmarks is that the router is able to deliver messages to subscribers with sub-microsecond latency at hundreds of millions of messages per second under ideal conditions.
However, a larger routing table, and in particular a larger subscriber count, will diminish the throughput of the router.
One key reason for this is likely contention around the L1 and L2 CPU caches. That's why latency improves a lot when running only one thread per core instead of one per logical processor.
At the same time, invoking the listeners is quite expensive since they are interface method invocations, though this likely depends on the JVM used.
Based on some other benchmarks I've run, I think it is safe to say that subscriber count usually matters more than node count.
The routing table internally is a tree that makes heavy use of hash map lookups. As such, node count, even per parent node, does not matter nearly as much as listener count.