Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/nviennot/core-to-core-latency

Measures the latency between CPU cores
https://github.com/nviennot/core-to-core-latency

Last synced: 24 days ago
JSON representation

Measures the latency between CPU cores

Host: GitHub
URL: https://github.com/nviennot/core-to-core-latency
Owner: nviennot
License: mit
Created: 2022-09-17T07:38:34.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-01-18T01:33:50.000Z (over 1 year ago)
Last Synced: 2024-05-02T23:26:11.119Z (about 2 months ago)
Language: Jupyter Notebook
Homepage:
Size: 32 MB
Stars: 941
Watchers: 13
Forks: 56
Open Issues: 71
Metadata Files:
- Readme: README.md
- License: LICENSE-MIT

Lists

my-awesome-stars - nviennot/core-to-core-latency - Measures the latency between CPU cores (Jupyter Notebook)
awesome-stars - nviennot/core-to-core-latency - Measures the latency between CPU cores (Jupyter Notebook)
awesome-performance - Measuring CPU core-to-core latency
awesome-high-performance-computing - core-to-core-latency - A diagnostic tool designed to measure and report the latency between CPU cores, aiding in the optimization of parallel computing tasks. (Software / Trends)
awesome-stars - nviennot/core-to-core-latency - `★951` Measures the latency between CPU cores (Jupyter Notebook)

README

Measuring CPU core-to-core latency
==================================

![License](https://img.shields.io/badge/license-MIT-green.svg)
[![Cargo](https://img.shields.io/crates/v/core-to-core-latency.svg)](https://crates.io/crates/core-to-core-latency)
[![Rust 1.57+](https://img.shields.io/badge/rust-1.57+-lightgray.svg)](https://www.rust-lang.org/tools/install)

We measure the latency it takes for a CPU to send a message to another CPU via
its cache coherence protocol.

By pinning two threads on two different CPU cores, we can get them to do a bunch
of compare-exchange operation, and measure the latency.

How to run:

```
$ cargo install core-to-core-latency
$ core-to-core-latency
```

Single socket results
----------------------

CPU | Median Latency
-------------------------------------------------------------------------------| ------------------
AMD Ryzen 9 7950X, 16 Cores, zen4, 2022-Q3 | 68ns
AMD EPYC 7773X, 64 Cores, Milan-X, 2022-Q1 | 115ns
Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2 | 48ns
Intel Xeon Phi 7210, 64 Cores, Knights Landing, 2016-Q2 | 91ns
HiSilicon Kunpeng 920-6426, 64 cores, ARMv8.2-A, 2019-Q1 | 72ns
Intel Core i9-12900K, 8P+8E Cores, Alder Lake, 12th gen, 2021-Q4 | 35ns, 44ns, 50ns
Intel Core i9-9900K, 3.60GHz, 8 Cores, Coffee Lake, 9th gen, 2018-Q4 | 21ns
Intel Core i7-1165G7, 2.80GHz, 4 Cores, Tiger Lake, 11th gen, 2020-Q3 | 27ns
Intel Core i7-6700K, 4.00GHz, 4 Cores, Skylake, 6th gen, 2015-Q3 | 20ns
Intel Core i5-10310U, 4 Cores, Comet Lake, 10th gen, 2020-Q2 | 21ns
Intel Core i5-4590, 3.30GHz 4 Cores, Haswell, 4th gen, 2014-Q2 | 21ns
Apple M1 Pro, 6P+2E Cores, 2021-Q4 | 40ns, 53ns, 145ns
Intel Xeon Platinum 8375C, 2.90GHz, 32 Cores, Ice Lake, 3rd gen, 2021-Q2 | 51ns
Intel Xeon Platinum 8275CL, 3.00GHz, 24 Cores, Cascade Lake, 2nd gen, 2019-Q2 | 47ns
Intel Xeon E5-2695 v4, 2.10GHz, 18 Cores, Broadwell, 5th gen, 2016-Q1 | 44ns
AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1 | 23ns, 107ns
AMD Ryzen Threadripper 3960X, 3.80GHz, 24 Cores, Zen 2, 3rd Gen, 2019-Q4 | 24ns, 94ns
AMD Ryzen Threadripper 1950X, 3.40GHz, 16 Cores, Zen, 1st Gen, 2017-Q3 | 25ns, 154ns
AMD Ryzen 9 5950X, 3.40GHz, 16 Cores, Zen3, 4th gen, 2020-Q4 | 17ns, 85ns
AMD Ryzen 9 5900X, 3.40GHz, 12 Cores, Zen3, 4th gen, 2020-Q4 | 16ns, 84ns
AMD Ryzen 7 5800U, 1.9GHz up to 4.4GHz, 8 Cores, Zen3, 4th gen, 2021-Q4 | 19ns
AMD Ryzen 7 5700X, 3.40GHz, 8 Cores, Zen3, 4th gen, 2022-Q2 | 18ns
AMD Ryzen 7 2700X, 3.70GHz, 8 Cores, Zen+, 2nd gen, 2018-Q3 | 24ns, 92ns
AMD Ryzen 9 5900HX, 3.3GHz, 8 Cores, Zen3, 4th gen, 2021-Q1 | 8ns, 17ns, 18ns
AWS Graviton3, 64 Cores, Arm Neoverse, 3rd gen, 2021-Q4 | 46ns
AWS Graviton2, 64 Cores, Arm Neoverse, 2rd gen, 2020-Q1 | 47ns
Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3 | 98ns
IBM Power7, 3.3GHz, 8 Cores, 2010-Q1 | 173ns
IBM PowerPC 970, 1.8GHz, 2 Cores, 2003-Q2 | 576ns

## Intel Xeon Phi 7210, 64 Cores, Knights Landing, 2016-Q2

Data provided by [Concyclics](https://github.com/Concyclics).

![image](https://user-images.githubusercontent.com/33325023/212899823-9c9513c9-33a9-49a6-8e8a-b881811e4c2b.png)

## HiSilicon Kunpeng 920-6426, 64 cores, ARMv8.2-A, 2019-Q1

Data provided by [Concyclics](https://github.com/Concyclics).

![image](https://user-images.githubusercontent.com/33325023/212899952-fb6cee52-ba0d-4add-8896-4278d57d04db.png)

## Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2

Data provided by [Concyclics](https://github.com/Concyclics).

![image](https://user-images.githubusercontent.com/33325023/212899588-5ca08d1a-24bf-401b-8955-5052a6244aaf.png)

## AMD Ryzen 9 7950X, 16 Cores, zen4, 2022-Q3

Data provided by [zamadatix](https://github.com/zamadatix).

![image](https://user-images.githubusercontent.com/33325023/212899457-2126ad0b-f38f-4b56-b723-05bc1bb18172.png)

Data provided by [zamadatix](https://github.com/zamadatix).

## AMD EPYC 7773X, 64 Cores, Milan-X, 2022-Q1

Data provided by [SchrodingerZhu](https://github.com/SchrodingerZhu).

![image](https://user-images.githubusercontent.com/33325023/212899131-5599bdcc-e2f7-4c33-9f42-cabd34e15251.png)

## Loongson 3A5000HV, 2.5GHz, 4 Cores, 2021-Q3

Data provided by [Glavo](https://github.com/Glavo).

![image](https://user-images.githubusercontent.com/33325023/212898113-8605327a-ff34-4f57-a46a-ee488c3437c8.png)

## Intel Core i9-12900K, 8P+8E Cores, Alder Lake, 12th gen, 2021-Q4

Data provided by [bizude](https://github.com/bizude).

This CPU has 8 performance cores, and 2 groups of 4 efficient cores.
We see CPU=8 with fast access to all other cores.

## Intel Core i9-9900K, 3.60GHz, 8 Cores, Coffee Lake, 8th gen, 2018-Q4

Data provided by [nviennot](https://github.com/nviennot).

## Intel Core i7-1165G7, 2.80GHz, 4 Cores, Tiger Lake, 11th gen, 2020-Q3

Data provided by [Jonas Wunderlich](https://github.com/jonas-w).

## Intel Core i7-6700K, 4.00GHz, 4 Cores, Skylake, 6th gen, 2015-Q3

Data provided by [CanIGetaPR](https://github.com/CanIGetaPR).

## Intel Core i5-10310U, 4 Cores, Comet Lake, 10th gen, 2020-Q2

Data provided by [Ashley Sommer](https://github.com/ashleysommer).

## Intel Core i5-4590, 3.30GHz, 4 Cores, Haswell, 4th gen, 2014-Q2

Data provided by [Felipe Lube de Bragança](https://github.com/felubra).

## Apple M1 Pro, 6P+2E Cores, 2021-Q4

Data provided by [Aditya Sharma](https://github.com/epk).

We see the two efficent cores clustered together with a latency of 53ns, then two groups of 3
performance cores, with a latency of 40ns. Cross-group communication is slow at ~145ns, which is a
latency typically seen in multi-socket configurations.

## Intel Xeon Platinum 8375C, 2.90GHz 32 Cores, Ice Lake, 3rd gen, 2021-Q2

From an AWS `c6i.metal` machine.

## Intel Xeon Platinum 8275CL, 3.00GHz 24 Cores, Cascade Lake, 2nd gen, 2019-Q2

From an AWS `c5.metal` machine.

## Intel Xeon E5-2695 v4, 2.10GHz 18 Cores, Broadwell, 5th gen, 2016-Q1

From a machine provided by GTHost

## AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1

From an AWS `c6a.metal` machine.

We can see cores arranged in 6 groups of 8 in which latency is excellent within
(23ns). When data crosses groups, the latency jumps to around 110ns. Note, that
the last 3 groups have a better cross-group latency than the first 3 (~90ns).

## AMD Ryzen Threadripper 3960X, 3.80GHz, 24 Cores, Zen 2, 3rd Gen, 2019-Q4

Data provided by [Mathias Siegel](https://github.com/ToolsDevler).

We see the CPUs in 8 groups of 3, and better performance for CPUS in the group [13,24].

## AMD Ryzen Threadripper 1950X, 3.40GHz, 16 Cores, Zen, 1st Gen, 2017-Q3

Data provided by [Jakub Okoński](https://github.com/farnoy)

We see the CPUs in 4 groups of 4, and better performance for CPUS in the group [9,16].

## AMD Ryzen 9 5950X, 3.40GHz 16 Cores, Zen3, 4th gen, 2020-Q1

Data provided by [John Schoenick](https://github.com/Nephyrin).

We can see two groups of 8 cores with latencies of 17ns intra-group, and 85ns inter-group.

## AMD Ryzen 9 5900X, 3.40GHz, 12 Cores, Zen3, 4th gen, 2020-Q4

Data provided by [Scott Markwell](https://github.com/smarkwell).

We see two groups of 6 cores with latencies of 16ns intra-group and 84ns inter-group.

## AMD Ryzen 7 5800U, 1.9GHz up to 4.4GHz, 8 Cores, Zen3, 4th gen, 2021-Q4

Data provided by [George Melikov](https://github.com/gmelikov).

## AMD Ryzen 7 5700X, 3.40GHz, 8 Cores, Zen3, 4th gen, 2022-Q2

Data provided by [Ashley Sommer](https://github.com/ashleysommer).

## AMD Ryzen 7 2700X, 3.70GHz, 8 Cores, Zen+, 2nd gen, 2018-Q3

Data provided by [David Hoppenbrouwers](https://github.com/Demindiro).

We can see 2 groups of 4 cores with latencies of 24ns intra-group, and 92ns inter-group.

## AMD Ryzen 9 5900HX, 3.3GHz, 8 Cores, Zen3, 4th gen, 2021-Q1

Data provided by [r4nd0m1z3r](https://github.com/r4nd0m1z3r).

## AWS Graviton3, 64 Cores, Arm Neoverse, 3rd gen, 2021-Q4

From an AWS `c7g.16xlarge` machine.

## AWS Graviton2, 64 Cores, Arm Neoverse, 2nd gen, 2020-Q1

From an AWS `c6gd.metal` machine.

## Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3

Data provided by [Kokoa van Houten](https://github.com/koachan).

## IBM Power7, 3.3GHz, 8 Cores, 2010-Q1

Data provided by [Kokoa van Houten](https://github.com/koachan).

Dual sockets results
---------------------

The following shows dual-socket configuration latency where one CPU on the first socket sends a message to
another CPU on the second socket.
The number in parenthesis next to the latency denotes the slowdown compared to single socket.

CPU | Median Latency
-------------------------------------------------------------------------------| ------------------
Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2 | 136ns (2.8x)
Intel Xeon Platinum 8375C, 2.90GHz, 32 Cores, Ice Lake, 3rd gen, 2021-Q2 | 108ns (2.1x)
Intel Xeon Platinum 8275CL, 3.00GHz, 24 Cores, Cascade Lake, 2nd gen, 2019-Q2 | 134ns (2.8x)
Intel Xeon E5-2695 v4, 2.10GHz, 18 Cores, Broadwell, 5th gen, 2016-Q1 | 118ns (2.7x)
AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1 | 197ns
Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3 | 356ns (3.6x)
IBM Power7, 3.3GHz, 8 Cores, 2010-Q1 | 443ns (2.5x)

## Dual Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2

Data provided by [Concyclics](https://github.com/Concyclics).

![image](https://user-images.githubusercontent.com/33325023/212899689-4761dd65-5973-4a1a-8a39-5ebb1bf205a1.png)

## Dual Intel Xeon Platinum 8375C, 2.90GHz 32 Cores, Ice Lake, 3rd gen, 2021-Q2

From an AWS `c6i.metal` machine.

## Dual Intel Xeon Platinum 8275CL, 3.00GHz 24 Cores, Cascade Lake, 2nd gen, 2019-Q2

From an AWS `c5.metal` machine.

## Dual Intel Xeon E5-2695 v4, 2.10GHz 18 Cores, Broadwell, 5th gen, 2016-Q1

From a machine provided by GTHost

## Dual AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1

From an AWS `c6a.metal` machine.

This one is a bit odd. The single socket test for Socket 1 shows median latencies of 107ns
cross-groups, but Socket 2 shows 200ns. It's 2x slower, very odd. The other platforms don't behave this way.
In fact, the socket-to-socket latencies are than the core-to-core within Socket 2.

Anandtech have measured [similar results on a Dual-Socket AMD EPYC 7763 and 7742](https://www.anandtech.com/show/16529/amd-epyc-milan-review/4).

**Socket 2 does not behave similarly than Socket 1, it's twice as slow**.

## Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3

Data provided by [Kokoa van Houten](https://github.com/koachan).

## Dual IBM Power7, 3.3GHz, 8 Cores, 2010-Q1

Data provided by [Kokoa van Houten](https://github.com/koachan).

Hyper-threads
-------------

We measure the latency between two hyper-threads of the same core

CPU | Median Latency
-------------------------------------------------------------------------------| ------------------
AMD Ryzen 9 7950X, 16 Cores, zne4, 2022-Q3 | 5.3ns
AMD EPYC 7773X, 64 Cores, Milan-X, 2022-Q1 | 10ns
Intel Xeon Gold 6242, 16 Cores, Cascade Lake, 2019-Q2 | 7.4ns
Intel Core i9-12900K, 8+8 Cores, Alder Lake, 12th gen, 2021-Q4 | 4.3ns
Intel Core i9-9900K, 3.60GHz, 8 Cores, Coffee Lake, 9th gen, 2018-Q4 | 6.2ns
Intel Core i7-1165G7, 2.80GHz, 4 Cores, Tiger Lake, 11th gen, 2020-Q3 | 5.9ns
Intel Core i7-6700K, 4.00GHz, 4 Cores, Skylake, 6th gen, 2015-Q3 | 6.9ns
Intel Core i5-10310U, 4 Cores, Comet Lake, 10th gen, 2020-Q2 | 7.3ns
Intel Xeon Platinum 8375C, 2.90GHz, 32 Cores, Ice Lake, 3rd gen, 2021-Q2 | 8.1ns
Intel Xeon Platinum 8275CL, 3.00GHz, 24 Cores, Cascade Lake, 2nd gen, 2019-Q2 | 7.6ns
Intel Xeon E5-2695 v4, 2.10GHz, 18 Cores, Broadwell, 5th gen, 2016-Q1 | 7.6ns
AMD EPYC 7R13, 48 Cores, Milan, 3rd gen, 2021-Q1 | 9.8ns
AMD Ryzen Threadripper 3960X, 3.80GHz, 24 Cores, Zen 2, 3rd Gen, 2019-Q4 | 6.5ns
AMD Ryzen Threadripper 1950X, 3.40GHz, 16 Cores, Zen, 1st Gen, 2017-Q3 | 10ns
AMD Ryzen 9 5950X, 3.40GHz, 16 Cores, Zen3, 4th gen, 2020-Q4 | 7.8ns
AMD Ryzen 9 5900X, 3.40GHz, 12 Cores, Zen3, 4th gen, 2020-Q4 | 7.6ns
AMD Ryzen 7 5700X, 3.40GHz, 8 Cores, Zen3, 4th gen, 2022-Q2 | 7.8ns
AMD Ryzen 7 2700X, 3.70GHz, 8 Cores, Zen+, 2nd gen, 2018-Q3 | 9.7ns
Sun/Oracle SPARC T4, 2.85GHz, 8 cores, 2011-Q3 | 24ns
IBM Power7, 3.3GHz, 8 Cores, 2010-Q1 | 70ns

---

**The notebook [results/results.ipynb](results/results.ipynb) contains the code to generate these graphs**

How to use
----------

First [install Rust](https://www.rust-lang.org/tools/install) and `gcc` on linux, then:

```
$ cargo install core-to-core-latency
$ core-to-core-latency
Num cores: 10
Using RDTSC to measure time: false
Num round trips per samples: 1000
Num samples: 300
Showing latency=round-trip-time/2 in nanoseconds:

0 1 2 3 4 5 6 7 8 9
0
1 52±6
2 38±6 39±4
3 39±5 39±6 38±6
4 34±6 38±4 37±6 36±5
5 38±5 38±6 38±6 38±6 37±6
6 38±5 37±6 39±6 36±4 49±6 38±6
7 36±6 39±5 39±6 37±6 35±6 36±6 38±6
8 37±5 38±6 35±5 39±5 38±6 38±5 37±6 37±6
9 48±6 39±6 36±6 39±6 38±6 36±6 41±6 38±6 39±6

Min latency: 34.5ns ±6.1 cores: (4,0)
Max latency: 52.1ns ±9.4 cores: (1,0)
Mean latency: 38.4ns
```

Contribute
-----------

Use `core-to-core-latency 5000 --csv > output.csv` to instruct the program to use
5000 iterations per sample to reduce the noise, and save the results.

It can be used in the jupter notebook [results/results.ipynb](results/results.ipynb) for rendering graphs.

Create a GitHub issue with the generated `output.csv` file and I'll add your results.

License
-------

This software is licensed under the MIT license