Kernel patch enables compiler optimizations for additional CPUs.
https://github.com/graysky2/kernel_compiler_patch


# kernel_compiler_patch

## Why a specific patch?
The kernel uses its own set of CFLAGS, KCFLAGS. For example, see:
* [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile)
* [arch/x86/Makefile_32.cpu](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile_32.cpu)
* [arch/x86/Kconfig.cpu](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig.cpu)

### Alternative way to define a -march= option without this patch
As pointed out by codemac in [this topic](https://bbs.archlinux.org/viewtopic.php?id=281639), one can simply export values for `KCFLAGS` and `KCPPFLAGS` before calling `make` to achieve the same result; see [here](https://github.com/torvalds/linux/blob/88603b6dc419445847923fcb7fe5080067a30f98/Makefile#L1112).
```
export KCFLAGS=' -march=znver3'
export KCPPFLAGS=' -march=znver3'
make all
```

## New tunings
These patches add tunings to the Linux kernel via new x86-64 ISA levels and additional micro-architecture options, in three broad classes.

### 1. New generic x86-64 ISA levels

When compiling the `Generic x86-64` Processor family target, these are selectable under:
```
Processor type and features ---> x86-64 compiler ISA level
```

* x86-64: A value of (1) is the default and builds with the generic x86-64 ISA level.
* x86-64-v2: A value of (2) adds support for vector instructions up to Streaming SIMD Extensions 4.2 (SSE4.2) and Supplemental Streaming SIMD Extensions 3 (SSSE3), the POPCNT instruction, and CMPXCHG16B.
* x86-64-v3: A value of (3) adds vector instructions up to AVX2, MOVBE, and additional bit-manipulation instructions.

x86-64-v4 also exists, but it adds vector instructions from some of the AVX-512 variants, which the kernel does not use, so including it does not make much sense.

Users of glibc 2.33 and above can see which level is supported by running one of the following:
```
/lib/ld-linux-x86-64.so.2 --help | grep supported
/lib64/ld-linux-x86-64.so.2 --help | grep supported
```
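
On systems with an older glibc, the level can also be estimated from the CPU flags. The following is a minimal sketch and not part of this patch: the function names `has_all` and `isa_level` are hypothetical, the flag names assume the spelling used in `/proc/cpuinfo` (for example `pni` for SSE3 and `abm` for LZCNT), and the required-flag lists are an assumption based on the x86-64 psABI level definitions. It also assumes the CPU is x86-64 to begin with, so level 1 is treated as the baseline.

```shell
#!/bin/sh
# has_all PRESENT REQUIRED...
# Succeeds only if every REQUIRED flag appears in the space-separated PRESENT list.
has_all() {
    present=" $1 "
    shift
    for flag in "$@"; do
        case "$present" in
            *" $flag "*) ;;          # flag found, keep checking
            *) return 1 ;;           # flag missing
        esac
    done
    return 0
}

# isa_level FLAGS
# Prints 1, 2 or 3: the highest x86-64 ISA level whose flags are all present.
isa_level() {
    flags=$1
    level=1
    if has_all "$flags" cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3; then
        level=2
        if has_all "$flags" avx avx2 bmi1 bmi2 f16c fma abm movbe xsave; then
            level=3
        fi
    fi
    echo "$level"
}
```

On a Linux machine it could be called as `isa_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"`.
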
### 2. New micro-architecture levels

These are selectable under:
```
Processor type and features ---> Processor family
```


| CPU Family | -march= | Min GCC Ver | Min Clang Ver |
|-|-|-|-|
| AMD Improved K8-family | k8-sse3 | 9.3 | 9.0 |
| AMD K10-family | amdfam10 | 9.3 | 9.0 |
| AMD Family 10h (Barcelona) | barcelona | 9.3 | 9.0 |
| AMD Family 14h (Bobcat) | btver1 | 9.3 | 9.0 |
| AMD Family 16h (Jaguar) | btver2 | 9.3 | 9.0 |
| AMD Family 15h (Bulldozer) | bdver1 | 9.3 | 9.0 |
| AMD Family 15h (Piledriver) | bdver2 | 9.3 | 9.0 |
| AMD Family 15h (Steamroller) | bdver3 | 9.3 | 9.0 |
| AMD Family 15h (Excavator) | bdver4 | 9.3 | 9.0 |
| AMD Family 17h (Zen) | znver1 | 9.3 | 9.0 |
| AMD Family 17h (Zen 2) | znver2 | 9.3 | 9.0 |
| AMD Family 19h (Zen 3) | znver3 | 10.3 | 12.0 |
| AMD Family 19h (Zen 4) | znver4 | 13.0 | 17.0 |
| AMD Family 19h (Zen 5) | znver5 | 14.1 | 19.1 (speculated) |
| Intel Bonnell family Atom | bonnell | 9.3 | 9.0 |
| Intel Silvermont family Atom | silvermont | 9.3 | 9.0 |
| Intel Goldmont family Atom (Apollo Lake and Denverton) | goldmont | 9.3 | 9.0 |
| Intel Goldmont Plus family Atom (Gemini Lake) | goldmont-plus | 9.3 | 9.0 |
| Intel 1st Gen Core i3/i5/i7-family (Nehalem) | nehalem | 9.3 | 9.0 |
| Intel 1.5 Gen Core i3/i5/i7-family (Westmere) | westmere | 9.3 | 9.0 |
| Intel 2nd Gen Core i3/i5/i7-family (Sandybridge) | sandybridge | 9.3 | 9.0 |
| Intel 3rd Gen Core i3/i5/i7-family (Ivybridge) | ivybridge | 9.3 | 9.0 |
| Intel 4th Gen Core i3/i5/i7-family (Haswell) | haswell | 9.3 | 9.0 |
| Intel 5th Gen Core i3/i5/i7-family (Broadwell) | broadwell | 9.3 | 9.0 |
| Intel 6th Gen Core i3/i5/i7-family (Skylake) | skylake | 9.3 | 9.0 |
| Intel 6th Gen Core i7/i9-family (Skylake X) | skylake-avx512 | 9.3 | 9.0 |
| Intel 8th Gen Core i3/i5/i7-family (Cannon Lake) | cannonlake | 9.3 | 9.0 |
| Intel 10th Gen Core i7/i9-family (Ice Lake) | icelake-client | 9.3 | 9.0 |
| Intel Xeon (Cascade Lake) | cascadelake | 10.2 | 10.0 |
| Intel Xeon (Cooper Lake) | cooperlake | 10.2 | 10.0 |
| Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake) | tigerlake | 10.2 | 10.0 |
| Intel 4th Gen 10nm++ Xeon (Sapphire Rapids) | sapphirerapids | 11.1 | 12.0 |
| Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake) | rocketlake | 11.1 | 12.0 |
| Intel 12th Gen i3/i5/i7/i9-family (Alder Lake) | alderlake | 11.1 | 12.0 |
| Intel 13th Gen i3/i5/i7/i9-family (Raptor Lake) | raptorlake | 13.0 | 15.0.5 |
| Intel 5th Gen 10nm++ Xeon (Emerald Rapids) | emeraldrapids | 13.0 | ??? |

### 3. Auto-detected micro-architecture levels

These are also selectable under:
```
Processor type and features ---> Processor family
```
These options compile the kernel with `-march=native` which, according to the [GCC manual](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options), "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."

Users of Intel CPUs should select the 'Intel-Native' option and users of AMD CPUs should select the 'AMD-Native' option.
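
To see which micro-architecture `-march=native` resolves to on the build machine, GCC can print its effective target options with `gcc -march=native -Q --help=target`. The helper below is a small sketch for pulling the value out of that output; the name `resolved_march` is hypothetical, and it assumes the `-march=` line is laid out as option name followed by value, which may vary between GCC versions. It applies to GCC only.

```shell
#!/bin/sh
# resolved_march
# Reads the output of `gcc -march=native -Q --help=target` on stdin and
# prints the value found on the line whose first field is exactly "-march=".
resolved_march() {
    awk '$1 == "-march=" { print $2; exit }'
}
```

Typical use would be `gcc -march=native -Q --help=target | resolved_march`, which prints a name such as those in the `-march=` column of the table in section 2.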

## Benchmarks
### Setup

The test measured the time each machine took to run `make bzImage` on the Linux kernel source (with a `.config` generated beforehand by `make x86_64_defconfig`).

Three separate test machines were evaluated:
1. AMD Ryzen 9 5950X
2. Intel i7-4790K
3. Intel N100

Separate kernels were first compiled from source patched with [more-uarches-for-kernel-6.8-rc4+.patch](https://github.com/graysky2/kernel_compiler_patch/blob/master/more-uarches-for-kernel-6.8-rc4%2B.patch).
* Kernel 1 used the default menu config option for Processor family = `Generic x86-64`
* Kernel 2 used the menu config option for Processor family = `x86-64-v3`
* Kernel 3 used the menu config option for Processor family matching the machine: `AMD Zen 3`, `Intel Haswell`, or `Intel Alder Lake`

#### The make test
Each machine was booted into its respective kernel and the make test was conducted. Then the next kernel was installed and the machine was booted into it and the make test was again conducted.
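
A timing loop for this kind of test could look roughly like the following. This is only a sketch and not the `make_bench.sh` script from this repo: the function name `time_runs`, its arguments, and the CSV output are assumptions, and sub-second timestamps rely on GNU `date` supporting `%N`.

```shell
#!/bin/sh
# time_runs RUNS SETUP CMD...
# Runs CMD RUNS times and prints one "run,seconds" CSV line per run.
# SETUP is a shell snippet executed (untimed) before every run, e.g. "make clean".
time_runs() {
    runs=$1
    setup=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        sh -c "$setup" >/dev/null 2>&1
        start=$(date +%s.%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s.%N)
        # awk does the floating-point subtraction that plain sh cannot
        awk -v n="$i" -v s="$start" -v e="$end" \
            'BEGIN { printf "%d,%.3f\n", n, e - s }'
        i=$((i + 1))
    done
}
```

For example, `time_runs 12 'make clean' make -j"$(nproc)" bzImage > timings.csv`, run from a configured kernel tree, would record 12 replicates.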

#### The stress-ng benchmark
The AMD 5950X ran `stress-ng --taskset 0-1 --metrics-brief -t 30s --foo 2` 12 times where `foo` was one of: `af-alg`, `fork`, `mmap`, or `pipe` under Kernel 1 and then again under Kernel 3.

## Conclusion
Across all three test machines, kernels built with the optimized processor family options introduced by this patch completed the make test faster than the kernel built with the default processor family option. The difference was small (<1%) but statistically significant.

The stress-ng tests generally showed small improvements (roughly 1-3% faster), with one test showing no difference.

What does this mean for real-world usage? Maybe nothing. The intent was to see whether something easily automated could show some value in applying these micro-arch tunings. People have historically gravitated to compilation-based benchmarks; that, coupled with ease of use, is why I settled on it. If someone has a good kernel-centric benchmark, I am interested to see a controlled comparison.

## Discussion
1. All the assumptions for ANOVA are met:
* Data are normally distributed
* The population variances are fairly equal
2. The boxplots clearly show significance in the pairwise comparisons
* Pairwise analysis by Tukey-Kramer is shown for all pairs (see tables)

In other words, x86-64-v3 is significantly different from generic x86-64. The various subtargets are also significantly different from x86-64.

### The make test
#### Stats for Machine 1. AMD Ryzen 9 5950X


| Processor family option | Mean compile time | Std dev | # of replicates |
|-|-|-|-|
| Generic x86-64 | 79.800 sec | 0.1076 sec | 12 |
| x86-64-v3 | 79.456 sec | 0.0772 sec | 12 |
| AMD Zen 3 | 79.440 sec | 0.0912 sec | 12 |

![5950X](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/boxplot1.svg)


| Treatment pairs | Tukey HSD Q stat | Tukey HSD p-value | Tukey HSD inference |
|-|-|-|-|
| Generic x86-64 vs x86-64-v3 | 12.8771 | 0.0010053 | $${\color{green} \verb!**!p<0.01}$$ |
| Generic x86-64 vs AMD Zen 3 | 13.4675 | 0.0010053 | $${\color{green} \verb!**!p<0.01}$$ |
| x86-64-v3 vs AMD Zen 3 | 0.5904 | 0.8999947 | $${\color{red}insignificant}$$ |
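
The Q statistics can be sanity-checked from the summary statistics alone. With equal replicates per group, Q for a pair is the absolute difference of the two means divided by `sqrt(MSE / n)`, where MSE is the average of the group variances. The sketch below applies that formula; the function name `tukey_q` and its argument order are hypothetical, and it is not how the published values were produced (those came from the online calculator listed under Credit).

```shell
#!/bin/sh
# tukey_q MEAN_A MEAN_B N SD...
# Prints the Tukey HSD Q statistic for groups A and B, given the number of
# replicates per group (N, equal for all groups) and the std dev of every group.
tukey_q() {
    a=$1
    b=$2
    n=$3
    shift 3
    awk -v a="$a" -v b="$b" -v n="$n" -v sds="$*" 'BEGIN {
        k = split(sds, sd, " ")
        for (i = 1; i <= k; i++) mse += sd[i] * sd[i]
        mse /= k                        # pooled variance (valid for equal N)
        d = a - b; if (d < 0) d = -d    # absolute mean difference
        printf "%.4f\n", d / sqrt(mse / n)
    }'
}
```

For Machine 1, `tukey_q 79.800 79.456 12 0.1076 0.0772 0.0912` gives a value close to the reported 12.8771; small differences are expected because the means and standard deviations in the table are rounded.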

#### Stats for Machine 2. Intel i7-4790K


| Processor family option | Mean compile time | Std dev | # of replicates |
|-|-|-|-|
| Generic x86-64 | 344.280 sec | 0.6455 sec | 12 |
| x86-64-v3 | 342.035 sec | 0.4971 sec | 12 |
| Intel Haswell | 342.189 sec | 0.2415 sec | 12 |

![i7-4790k](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/boxplot2.svg)


| Treatment pairs | Tukey HSD Q stat | Tukey HSD p-value | Tukey HSD inference |
|-|-|-|-|
| Generic x86-64 vs x86-64-v3 | 28.9652 | 0.0010053 | $${\color{green} \verb!**!p<0.01}$$ |
| Generic x86-64 vs Intel Haswell | 24.8335 | 0.0010053 | $${\color{green} \verb!**!p<0.01}$$ |
| x86-64-v3 vs Intel Haswell | 4.1317 | 0.0167155 | $${\color{lightgreen} \verb!*!p<0.05}$$ |

#### Stats for Machine 3. Intel N100


| Processor family option | Mean compile time | Std dev | # of replicates |
|-|-|-|-|
| Generic x86-64 | 589.457 sec | 0.1596 sec | 12 |
| x86-64-v3 | 589.217 sec | 0.1382 sec | 12 |
| Intel Alder Lake | 588.797 sec | 0.1532 sec | 12 |

![N100](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/boxplot3.svg)


| Treatment pairs | Tukey HSD Q stat | Tukey HSD p-value | Tukey HSD inference |
|-|-|-|-|
| Generic x86-64 vs x86-64-v3 | 5.5076 | 0.0012818 | $${\color{green} \verb!**!p<0.01}$$ |
| Generic x86-64 vs Intel Alder Lake | 15.1600 | 0.0010053 | $${\color{green} \verb!**!p<0.01}$$ |
| x86-64-v3 vs Intel Alder Lake | 9.6524 | 0.0010053 | $${\color{green} \verb!**!p<0.01}$$ |

### Comparing GCC to Clang
The Ryzen 9 5950X was used to compare kernels built with GCC and Clang each with `Generic x86-64` and `x86-64-v3`. The results are consistent for both compilers.


| Processor family option | Compiler | Mean compile time | Std dev | # of replicates |
|-|-|-|-|-|
| Generic x86-64 | GCC | 79.4569 sec | 0.0664 sec | 5 |
| x86-64-v3 | GCC | 79.1403 sec | 0.0580 sec | 5 |
| Generic x86-64 | Clang | 79.8398 sec | 0.0629 sec | 5 |
| x86-64-v3 | Clang | 79.0975 sec | 0.0711 sec | 5 |

![5950X](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/boxplot4.svg)

### The stress-ng benchmarks
Here, stress-ng microbenchmark improvements or regressions (or neutral changes) were as follows (average from 12 x 30 sec runs):
```
af-alg: +2.7% (kernel AF_ALG crypto)
fork: * (process fork/exit)
mmap: +1.6% (memory mapping)
pipe: +1.3% (pipe + context switch)

*no statistically significant difference at p<0.05
```
| units | benchmark | optimization | mean | std dev |
|-|-|-|-|-|
|bogo ops/s (real time)|af-alg|x86-64|104,320.21|168.61|
|||x86-64-v3|107,154.54|127.73|
||pipe|x86-64|1,535,225.4|3,624.5|
|||x86-64-v3|1,555,824.2|4,212.6|
||fork|x86-64|3,964.14|21.02|
|||x86-64-v3|3,953.5|17.44|
||mmap|x86-64|35.72|0.28|
|||x86-64-v3|36.31|0.26|
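
The percentages quoted above follow from the means in this table. As a rough sketch (the helper name `pct_change` is hypothetical, and thousands separators must be removed from the numbers before passing them in):

```shell
#!/bin/sh
# pct_change BASELINE OPTIMIZED
# Prints the signed percent change of OPTIMIZED relative to BASELINE,
# rounded to one decimal place.
pct_change() {
    awk -v base="$1" -v opt="$2" \
        'BEGIN { printf "%+.1f\n", (opt - base) / base * 100 }'
}
```

For instance, `pct_change 104320.21 107154.54` reproduces the af-alg figure and `pct_change 1535225.4 1555824.2` the pipe figure. Values computed from the rounded table means can differ in the last digit from those derived from the raw data.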

![af-alg](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/af-alg.svg)

![fork](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/fork.svg)

![mmap](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/mmap.svg)

![pipe](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/pipe.svg)

## Software versions used

All machines ran Arch Linux with all stock repo packages except the kernel (see below). At the time of the work, the following toolchain versions were used:
* binutils 2.43+r4+g7999dae6961-1
* clang 18.0.1-1
* gcc 14.2.1+r134+gab884fffe3fc-1
* gcc-libs 14.2.1+r134+gab884fffe3fc-1
* glibc 2.40+r16+gaa533d58ff-2
* linux-api-headers 6.10-1
* stress-ng 0.18.04-1

The kernel packages were built from the official Arch Linux PKGBUILD for kernel version 6.10.10-arch1-1 using the distro config, differing only by the modifications introduced by the aforementioned patch from this repo.

The benchmark compiled the vanilla Linux kernel version 6.10.10 and, as mentioned above, the `.config` was generated by running `make x86_64_defconfig`.

## References
* Script to run the benchmark: [make_bench.sh](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/make_bench.sh)
* Data for three machines: [results.csv](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/results.csv)
* Data for GCC vs Clang: [results2.csv](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/results2.csv)
* Data for stress-ng tests: [stress-ng-data.csv](https://github.com/graysky2/kernel_compiler_patch/blob/master/benchmark/stress-ng-data.csv)

## Credit
* Original author: jeroen AT linuxforge DOT net
* Link to original version: http://www.linuxforge.net/docs/linux/linux-gcc.php
* Box plot generated with [statisty.app](https://statisty.app/anova-calculator)
* ANOVA stats generated with [astatsa.com](https://astatsa.com/OneWay_Anova_with_TukeyHSD/)

## Legacy support
Support for older versions of the Linux kernel and of GCC can be found in the `outdated_versions` directory.