MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX, AVX-512 and SVE (length specific).
- Host: GitHub
 - URL: https://github.com/aff3ct/MIPP
 - Owner: aff3ct
 - License: mit
 - Created: 2017-06-23T16:56:44.000Z (over 8 years ago)
 - Default Branch: master
 - Last Pushed: 2024-05-18T08:51:13.000Z (over 1 year ago)
 - Last Synced: 2024-05-18T09:38:39.651Z (over 1 year ago)
 - Topics: avx, avx-512, neon, portable, simd, sse, sve, vector, wrapper
 - Language: C++
 - Homepage:
 - Size: 2.01 MB
 - Stars: 465
 - Watchers: 23
 - Forks: 86
 - Open Issues: 16
# MyIntrinsics++ (MIPP)

## Purpose
MIPP is a portable, open-source wrapper (MIT license) for vector intrinsic
functions (SIMD) written in C++11. It works with SSE, AVX, AVX-512, ARM NEON
and SVE (work in progress) instructions. The MIPP wrapper supports single- and
double-precision floating-point numbers as well as signed and unsigned integer
arithmetic (64-bit, 32-bit, 16-bit and 8-bit).
With the MIPP wrapper you no longer need to write architecture-specific
intrinsic code. Just use the provided functions and the wrapper will
automatically generate the right intrinsic calls for your architecture.
If you are interested in the ARM SVE development status,
[please follow this link](#arm-sve).
## Short Documentation
### Supported Compilers
At this time, MIPP has been tested on the following compilers:
  - Intel: `icpc` >= `16`,
  - GNU: `g++` >= `4.8`,
  - Clang: `clang++` >= `3.6`,
  - Microsoft: `msvc` >= `14`.
With `msvc` `14.10` (Microsoft Visual Studio 2017), performance is lower than
with the other compilers because the compiler is not able to fully inline all
the MIPP methods. This has been fixed in `msvc` `14.21` (Microsoft Visual Studio
2019), where you can expect high performance.
### Install and Configure your Code
You don't have to install MIPP because it is a simple C++ header file. The
headers are located in the `include` folder (note that this location changed
in commit `6795891`; before that, they were located in the `src` folder).
Just include the header in your source files wherever the wrapper is needed.
```cpp
#include "mipp.h"
```
`mipp.h` uses a C++ `namespace` named `mipp`. If you do not want to prefix all
the MIPP calls with `mipp::`, you can write:
```cpp
#include "mipp.h"
using namespace mipp;
```
Before compiling, remember to tell the compiler which vector instructions you
want to use. For instance, with the GNU compiler (`g++`) you simply have to add
the `-march=native` option on SSE- and AVX-compatible CPUs. For ARMv7 CPUs with
NEON instructions you have to add the `-mfpu=neon` option (since most of the
current NEONv1 instructions are not IEEE-754 compliant). However, this is no
longer the case on ARMv8 processors, so the `-march=native` option works there
too. MIPP also uses some features provided by C++11, so you have to add the
`-std=c++11` flag to compile the code.
You are now ready to run your code with the MIPP wrapper.
If MIPP is installed on the system, it can be integrated into a CMake project
in the standard way. Example:
```sh
# install MIPP
cd MIPP/
export MIPP_ROOT=$PWD/build/install
cmake -B build -DCMAKE_INSTALL_PREFIX=$MIPP_ROOT
cmake --build build -j5
cmake --install build
```
In your `CMakeLists.txt`:
```cmake
# find the installation of MIPP on the system
find_package(MIPP REQUIRED)
# define your executable
add_executable(gemm gemm.cpp)
# link your executable to MIPP
target_link_libraries(gemm PRIVATE MIPP::mipp)
```
```sh
cd your_project/
# if MIPP is installed in a system standard path: MIPP will be found automatically with cmake
cmake -B build
# if MIPP is installed in a non-standard path: use CMAKE_PREFIX_PATH
cmake -B build -DCMAKE_PREFIX_PATH=$MIPP_ROOT
```
#### Generate Sources & Compile the Static Library
MIPP is mainly a header-only library. However, some macro operations require
compiling a small library. This is particularly true for the `compress`
operation, which relies on generated LUTs stored in the static library.
To generate the source files containing these LUTs you need to install Python 3
with the Jinja2 package:
```bash
sudo apt install python3 python3-pip
pip3 install --user -r codegen/requirements.txt
```
Then you can call the generator as follows:
```bash
python3 codegen/gen_compress.py
```
Finally, you can compile the MIPP static library:
```bash
cmake -B build -DMIPP_STATIC_LIB=ON
cmake --build build -j4
```
Note that **compiling the static library is optional**. If you choose not to
compile it, only some macro operations will be missing.
### Sequential Mode
By default, MIPP tries to recognize the instruction set from the preprocessor
definitions. If MIPP can't match the instruction set (for instance when MIPP
does not support the targeted instruction set), it falls back on standard
sequential instructions. In this mode, vectorization is no longer guaranteed,
but the compiler can still perform auto-vectorization.
It is possible to force MIPP to use the sequential mode with the following
compiler definition: `-DMIPP_NO_INTRINSICS`. This can be useful for debugging
or for benchmarking code.
If you want to check the MIPP mode configuration, you can print the following
global variable: `mipp::InstructionFullType` (`std::string`).
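To make the idea of "recognizing the instruction set from the preprocessor definitions" more concrete, here is a minimal sketch based on macros that common compilers predefine. It is not MIPP's actual detection code: the function name `detected_simd_isa` and the returned strings are hypothetical, and only the `MIPP_NO_INTRINSICS` definition comes from the text above.

```cpp
// Hypothetical helper (not part of MIPP): reports which SIMD instruction set
// the compiler advertises for this build through its predefined macros.
constexpr const char* detected_simd_isa()
{
#if defined(MIPP_NO_INTRINSICS)
	return "sequential (forced)"; // the user asked for the scalar fallback
#elif defined(__ARM_FEATURE_SVE)
	return "SVE";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
	return "NEON";
#elif defined(__AVX512F__)
	return "AVX-512";
#elif defined(__AVX__)
	return "AVX";
#elif defined(__SSE2__) || defined(_M_X64)
	return "SSE";
#else
	return "sequential"; // no known SIMD macro: scalar code only
#endif
}
```

Compiling the same file with and without `-march=native` (or with `-DMIPP_NO_INTRINSICS`) and printing the returned string is a quick way to see how the compiler flags drive the selection.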
### Vector Register Declaration
Just use the `mipp::Reg<T>` type.
```cpp
mipp::Reg<T> r1, r2, r3; // we have declared 3 vector registers
```
But we do not know the number of elements per register here. This number of 
elements can be obtained by calling the `mipp::N<T>()` function (`T` is a 
template parameter, it can be `double`, `float`, `int64_t`, `uint64_t`,
`int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t` or `uint8_t` type).
```cpp
for (int i = 0; i < n; i += mipp::N<float>()) {
	// ...
}
```
The number of elements in a register depends directly on the size of the data
type we are working on.
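As a worked illustration of that relationship, the sketch below computes how many elements of a given type fit in a register of a given width. The helper `elements_per_register` is hypothetical (it is not `mipp::N<T>()`, which takes the register width from the target architecture) and it assumes 8-bit bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not MIPP's mipp::N<T>()): number of elements of type T
// that fit in a SIMD register of 'register_bits' bits, assuming 8-bit bytes.
template <typename T>
constexpr std::size_t elements_per_register(std::size_t register_bits)
{
	return register_bits / (8 * sizeof(T));
}
```

For example, a 128-bit register holds 4 `float` or 2 `double` elements, while a 512-bit register holds 16 `float` or 64 `int8_t` elements.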
### Register `load` and `store` Instructions
Loading memory from a vector into a register:
```cpp
int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1;
r1.load(&myVector[i*mipp::N<float>()]);
```
The last two lines can be shortened as follows, where the `load` call becomes
implicit:
```cpp
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];
```
A store is done with the `store(...)` method:
```cpp
int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];
// do something with r1
r1.store(&myVector[(i+1)*mipp::N<float>()]);
```
By default the loads and stores work on **unaligned memory**.
It is possible to control this behavior with the `-DMIPP_ALIGNED_LOADS` 
definition: when specified, the loads and stores work on **aligned memory** by 
default. In the **aligned memory** mode, it is still possible to perform 
unaligned memory operations with the `mipp::loadu` and `mipp::storeu` functions.
However, it is not possible to perform aligned loads and stores in the 
**unaligned memory** mode.
To allocate aligned data you can use the MIPP aligned memory allocator wrapped 
into the `mipp::vector` class. `mipp::vector` is fully compatible with the 
standard `std::vector` class and it can be used everywhere you can use 
`std::vector`.
```cpp
mipp::vector<float> myVector(n);
```
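When debugging aligned loads and stores, it can help to check whether an address really has the alignment you expect. The following is a minimal sketch of such a check; `is_aligned` is a hypothetical helper, not a MIPP function, and the alignment you need (for instance 16, 32 or 64 bytes) depends on the targeted instruction set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of MIPP): returns true when 'ptr' is a
// multiple of 'alignment' bytes. 'alignment' must be a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

A buffer declared with `alignas(32)` passes `is_aligned(buf, 32)`, whereas an address 4 bytes further does not, which is exactly the situation where an aligned load would be invalid.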
### Register Initialization
You can initialize a vector register from a scalar value:
```cpp
mipp::Reg<float> r1; // r1 = | unknown | unknown | unknown | unknown |
r1 = 1.0;            // r1 = |    +1.0 |    +1.0 |    +1.0 |    +1.0 |
```
Or from an initializer list (`std::initializer_list`):
```cpp
mipp::Reg<float> r1;       // r1 = | unknown | unknown | unknown | unknown |
r1 = {1.0, 2.0, 3.0, 4.0}; // r1 = |    +1.0 |    +2.0 |    +3.0 |    +4.0 |
```
### Computational Instructions
**Add** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;
r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |
r3 = r1 + r2; // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |
```
**Subtract** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;
r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |
r3 = r1 - r2; // r3 = | -1.0 | -1.0 | -1.0 | -1.0 |
```
**Multiply** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;
r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |
r3 = r1 * r2; // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |
```
**Divide** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;
r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |
r3 = r1 / r2; // r3 = | +0.5 | +0.5 | +0.5 | +0.5 |
```
**Fused multiply and add** of three vector registers:
```cpp
mipp::Reg<float> r1, r2, r3, r4;
r1 = 2.0;                     // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;                     // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = 1.0;                     // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |
// r4 = (r1 * r2) + r3
r4 = mipp::fmadd(r1, r2, r3); // r4 = | +7.0 | +7.0 | +7.0 | +7.0 |
```
**Fused negative multiply and add** of three vector registers:
```cpp
mipp::Reg<float> r1, r2, r3, r4;
r1 = 2.0;                      // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;                      // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = 1.0;                      // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |
// r4 = -(r1 * r2) + r3
r4 = mipp::fnmadd(r1, r2, r3); // r4 = | -5.0 | -5.0 | -5.0 | -5.0 |
```
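To check the arithmetic of the two fused operations on a single element, a scalar reference model can be handy. The functions below are hypothetical names, not MIPP code, and they only model the formula: a true fused instruction rounds once, whereas this sketch may round after the multiplication and again after the addition.

```cpp
// Scalar reference model (not MIPP code) of the per-element formula.
// fmadd:  (a * b) + c
constexpr float fmadd_ref(float a, float b, float c)
{
	return (a * b) + c;
}

// fnmadd: -(a * b) + c
constexpr float fnmadd_ref(float a, float b, float c)
{
	return -(a * b) + c;
}
```

With the values used above (`a = 2`, `b = 3`, `c = 1`), `fmadd_ref` gives `7` and `fnmadd_ref` gives `-5`, matching the register contents shown in the comments.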
**Square root** of a vector register:
```cpp
mipp::Reg<float> r1, r2;
r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |
r2 = mipp::sqrt(r1);  // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
```
**Reciprocal square root** of a vector register (be careful: this intrinsic 
exists only for single-precision floating-point numbers):
```cpp
mipp::Reg<float> r1, r2;
r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |
r2 = mipp::rsqrt(r1); // r2 = | +0.3 | +0.3 | +0.3 | +0.3 |
```
### Selections
Select the **minimum** between two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;
r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = mipp::min(r1, r2); // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |
```
Select the **maximum** between two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;
r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = mipp::max(r1, r2); // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |
```
### Permutations
The `rrot(...)` function allows you to perform a **right rotation** (a cyclic 
permutation) of the elements inside the register:
```cpp
mipp::Reg<float> r1, r2;
r1 = {3.0, 2.0, 1.0, 0.0}; // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |
r2 = mipp::rrot(r1);       // r2 = | +0.0 | +3.0 | +2.0 | +1.0 |
r1 = mipp::rrot(r2);       // r1 = | +1.0 | +0.0 | +3.0 | +2.0 |
r2 = mipp::rrot(r1);       // r2 = | +2.0 | +1.0 | +0.0 | +3.0 |
r1 = mipp::rrot(r2);       // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |
```
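The right rotation shown above can be reproduced on a plain array to check the expected element order. This is a scalar reference model built on `std::rotate`, not MIPP's implementation, and `rrot_ref` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Scalar reference model (not MIPP code): right rotation by one element.
// The last element moves to the front, every other element shifts right.
inline std::vector<float> rrot_ref(std::vector<float> r)
{
	assert(!r.empty()); // a register always holds at least one element
	std::rotate(r.begin(), r.end() - 1, r.end());
	return r;
}
```

Applied to `{3, 2, 1, 0}` it returns `{0, 3, 2, 1}`, and applying it as many times as there are elements gives back the initial order, as in the four successive `rrot` calls above.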
Of course there are many more available instructions in the MIPP wrapper and you 
can find these instructions at the [end of this page](#list-of-mipp-functions).
### Addition of Two Vectors
```cpp
#include <cstdlib> // rand()
#include "mipp.h"
int main()
{
	// data allocation
	const int n = 32000; // size of the vA, vB, vC vectors
	mipp::vector<float> vA(n); // in
	mipp::vector<float> vB(n); // in
	mipp::vector<float> vC(n); // out
	// data initialization
	for (int i = 0; i < n; i++) vA[i] = rand() % 10;
	for (int i = 0; i < n; i++) vB[i] = rand() % 10;
	// declare 3 vector registers
	mipp::Reg<float> rA, rB, rC;
	// compute rC with the MIPP vectorized functions
	for (int i = 0; i < n; i += mipp::N<float>()) {
		rA.load(&vA[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS
		rB.load(&vB[i]); // macro definition to force aligned loads and stores).
		rC = rA + rB;
		rC.store(&vC[i]);
	}
	return 0;
}
```
### Vectorizing an Existing Code
#### Scalar Code
```cpp
// ...
for (int i = 0; i < n; i++) {
	out[i] = 0.75f * in1[i] * std::exp(in2[i]);
}
// ...
```
#### Vectorized Code
```cpp
// ...
// compute the vectorized loop size which is a multiple of 'mipp::N<float>()'.
auto vecLoopSize = (n / mipp::N<float>()) * mipp::N<float>();
mipp::Reg<float> rout, rin1, rin2;
for (int i = 0; i < vecLoopSize; i += mipp::N<float>()) {
	rin1.load(&in1[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS
	rin2.load(&in2[i]); // macro definition to force aligned loads and stores).
	// the '0.75f' constant will be broadcast in a vector but it has to be at
	// the right of a 'mipp::Reg', this is why it has been moved at the right
	// of the 'rin1' register. Notice that 'std::exp' has been replaced by
	// 'mipp::exp'.
	rout = rin1 * 0.75f * mipp::exp(rin2);
	rout.store(&out[i]);
}
// scalar tail loop: compute the remaining elements that can't be vectorized.
for (int i = vecLoopSize; i < n; i++) {
	out[i] = 0.75f * in1[i] * std::exp(in2[i]);
}
// ...
```
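The split between the vectorized loop and the scalar tail loop comes down to integer arithmetic, which the following sketch isolates. `vec_loop_size` and `tail_size` are hypothetical helpers, and `lanes` stands for the number of elements per register.

```cpp
#include <cstddef>

// Hypothetical helper (not part of MIPP): largest multiple of 'lanes' that is
// less than or equal to 'n', i.e. the upper bound of the vectorized loop.
constexpr std::size_t vec_loop_size(std::size_t n, std::size_t lanes)
{
	return (n / lanes) * lanes;
}

// Number of remaining elements that the scalar tail loop has to process.
constexpr std::size_t tail_size(std::size_t n, std::size_t lanes)
{
	return n - vec_loop_size(n, lanes);
}
```

For instance, with `n = 10` and 4 elements per register, the vectorized loop covers the first 8 elements and the tail loop handles the last 2. When `n` is smaller than the register size, the vectorized loop does not run at all.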
### Masked Instructions
MIPP comes with two generic, templated masked functions (`mask` and 
`maskz`). Those functions allow you to benefit from the AVX-512 and SVE masked 
instructions. The `mask` and `maskz` functions are backward compatible with
older instruction sets.
```cpp
mipp::Reg<        float   > ZMM1 = {   40,  -30,    60,    80};
mipp::Reg<        float   > ZMM2 = 0.1; // broadcast
mipp::Msk<mipp::N<float>()> k1   = {false, true, false, false};
// ZMM3 = k1 ? ZMM1 * ZMM2 : ZMM1;
auto ZMM3 = mipp::mask<float, mipp::mul>(k1, ZMM1, ZMM1, ZMM2);
std::cout << ZMM3 << std::endl; // output: "[40, -3, 60, 80]"
// ZMM4 = k1 ? ZMM1 * ZMM2 : 0;
auto ZMM4 = mipp::maskz<float, mipp::mul>(k1, ZMM1, ZMM2);
std::cout << ZMM4 << std::endl; // output: "[0, -3, 0, 0]"
```
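The difference between the two masked forms is easier to see element by element. Below is a scalar reference model for a masked multiplication. It is not MIPP's implementation, and the names `mask_mul_ref` and `maskz_mul_ref` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Scalar reference model (not MIPP code) of a masked multiplication:
// where the mask is true the result is a[i] * b[i], elsewhere it is src[i].
template <std::size_t N>
std::array<float, N> mask_mul_ref(const std::array<bool, N>& k,
                                  const std::array<float, N>& src,
                                  const std::array<float, N>& a,
                                  const std::array<float, N>& b)
{
	std::array<float, N> out{};
	for (std::size_t i = 0; i < N; i++)
		out[i] = k[i] ? a[i] * b[i] : src[i];
	return out;
}

// Zeroing variant: where the mask is false the result is 0.
template <std::size_t N>
std::array<float, N> maskz_mul_ref(const std::array<bool, N>& k,
                                   const std::array<float, N>& a,
                                   const std::array<float, N>& b)
{
	std::array<float, N> out{};
	for (std::size_t i = 0; i < N; i++)
		out[i] = k[i] ? a[i] * b[i] : 0.0f;
	return out;
}
```

With `a = {40, -30, 60, 80}`, `b` filled with `0.5` and only the second mask bit set, the first function keeps `40`, `60` and `80` from `src = a` and produces `-15` in the second position, while the zeroing variant returns `0` everywhere except that position.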
## List of MIPP Functions
This section presents an exhaustive list of all the available functions in MIPP.
Of course the MIPP wrapper does not cover all the possible intrinsics of each 
instruction set but it tries to give you the most important and useful ones.
In the following tables, `T`, `T1` and `T2` stand for data types (`double`, 
`float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`,
`int8_t` or `uint8_t`).
`N` stands for the number of elements in a mask or in a register.
`N` is a strictly positive integer and can easily be deduced from the data type: 
`constexpr int N = mipp::N<T>()`.
When `T` and `N` are mixed in a prototype, `N` has to satisfy the previous 
constraint (`N = mipp::N<T>()`).
Some terms used in the documentation need to be clarified:
  - **register element**: a SIMD register is composed of multiple scalar 
  elements; those elements are built-in data types (`double`, `float`, 
  `int64_t`, ...),
  - **register lane**: modern instruction sets can have multiple implicit sub 
  parts in an entire SIMD register; those sub parts are called lanes (SSE has 
  one lane of 128 bits, AVX has two lanes of 128 bits, AVX-512 has four lanes of 
  128 bits).
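The lane counts quoted in this list follow directly from the register width. As a small worked example, the sketch below derives the number of 128-bit lanes and the number of elements per lane. Both helpers are hypothetical names that are not part of MIPP, and 8-bit bytes are assumed.

```cpp
#include <cstddef>

// Hypothetical helpers (not part of MIPP), assuming 128-bit lanes.
// Number of lanes in a register of 'register_bits' bits.
constexpr std::size_t lanes_per_register(std::size_t register_bits)
{
	return register_bits / 128;
}

// Number of elements of type T stored in one 128-bit lane (8-bit bytes).
template <typename T>
constexpr std::size_t elements_per_lane()
{
	return 128 / (8 * sizeof(T));
}
```

A 256-bit AVX register therefore has 2 lanes of 4 `float` elements each, and a 512-bit AVX-512 register has 4 lanes of 2 `double` elements each.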
### Memory Operations
| **Short name**  | **Prototype**                                                                          | **Documentation**                                                                                                                                                   | **Supported types**                                                                                         |
| :---            | :---                                                                                   | :---                                                                                                                                                                | :---                                                                                                        |
| `load`          | `Reg   load          (const T* mem)`                                                | Loads aligned data from `mem` to a register.                                                                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `loadu`         | `Reg   loadu         (const T* mem)`                                                | Loads unaligned data from `mem` to a register.                                                                                                                      | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `store`         | `void     store         (T* mem, const Reg r)`                                      | Stores the `r` register in the `mem` aligned data.                                                                                                                  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `storeu`        | `void     storeu        (T* mem, const Reg r)`                                      | Stores the `r` register in the `mem` unaligned data.                                                                                                                | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `maskzld`       | `Reg   maskzld       (const Msk m, const T* mem)`                                | Loads elements according to the mask `m` (puts zero when the mask value is false).                                                                                  | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `maskzlds`      | `Reg   maskzlds      (const Msk m, const T* mem)`                                | Loads elements according to the mask `m` (puts zero when the mask value is false). Safe version, only reads masked elements in memory.                              | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `maskst`        | `void     maskst        (const Msk m, T* mem, const Reg r)`                      | Stores elements from the `r` register according to the mask `m` in the `mem` memory.                                                                                | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `masksts`       | `void     masksts       (const Msk m, T* mem, const Reg r)`                      | Stores elements from the `r` register according to the mask `m` in the `mem` memory. Safe version, only writes masked elements in memory.                           | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `gather`        | `Reg   gather    (const TD* mem, const Reg idx)`                            | Gathers elements from `mem` to a register. Selects elements according to the indices in `idx`.                                                                      | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `scatter`       | `void  scatter   (TD* mem, const Reg idx, const Reg r)`                 | Scatters elements into `mem` from the `r` register. Writes elements at the `idx` indices in `mem`.                                                                  | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `maskzgat`      | `Reg   gather    (const Msk m, const TD* mem, const Reg idx)`            | Gathers elements from `mem` to a register (according to the mask `m`). Selects elements according to the indices in `idx` (puts zero when the mask value is false). | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `masksca`       | `void  scatter   (const Msk m, TD* mem, const Reg idx, const Reg r)` | Scatters elements into `mem` from the `r` register (according to the mask `m`). Writes elements at the `idx` indices in `mem`.                                      | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |
| `set`           | `Reg   set           (const T[N] vals)`                                             | Sets a register from the values in `vals`.                                                                                                                          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set`           | `Msk   set           (const bool[N] bits)`                                          | Sets a mask from the bits in `bits`.                                                                                                                                |                                                                                                             |
| `set1`          | `Reg   set1          (const T val)`                                                 | Broadcasts `val` in a register.                                                                                                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set1`          | `Msk   set1          (const bool bit)`                                              | Broadcasts `bit` in a mask.                                                                                                                                         |                                                                                                             |
| `set0`          | `Reg   set0          ()`                                                            | Initializes a register to zero.                                                                                                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set0`          | `Msk   set0          ()`                                                            | Initializes a mask to false.                                                                                                                                        |                                                              |
| `get`           | `T        get           (const Reg r, const size_t index)`                          | Gets a specific element from the register `r` at the `index` position.                                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `get`           | `T        get           (const Reg_2 r, const size_t index)`                        | Gets a specific element from the register `r` at the `index` position.                                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `get`           | `bool     get           (const Msk m, const size_t index)`                          | Gets a specific element from the register `m` at the `index` position.                                                                                              |                                                                                                             |
| `getfirst`      | `T        getfirst      (const Reg r)`                                              | Gets the first element from the register `r`.                                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `getfirst`      | `T        getfirst      (const Reg_2 r)`                                            | Gets the first element from the register `r`.                                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `getfirst`      | `bool     getfirst      (const Msk m)`                                              | Gets the first element from the register `m`.                                                                                                                       |                                                                                                             |
| `low`           | `Reg_2 low           (const Reg r)`                                              | Gets the low part of the `r` register.                                                                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `high`          | `Reg_2 high          (const Reg r)`                                              | Gets the high part of the `r` register.                                                                                                                             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `combine`       | `Reg   combine       (const Reg_2 r1, const Reg_2 r2)`                        | Combine two half registers in a full register, `r1` will be the low part and `r2` the high part.                                                                    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `combine`       | `Reg   combine     (const Reg r1, const Reg r2)`                            | `S` elements of `r1` are shifted to the left, `(S - N) + N` elements of `r2` are shifted to the right. Shifted `r1` and `r2` are combined to give the result.       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `compress`      | `Reg   compress      (const Reg r1, const Msk m)`                             | Pack the elements of `r1` at the beginning of the register according to the bitmask `m` (if the bit is 1 then element is picked, otherwise it is not).              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask`         | `Reg   cmask         (const uint32_t[N  ] ids)`                                     | Creates a cmask from an indexes list (indexes have to be between 0 and N-1).                                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask2`        | `Reg   cmask2        (const uint32_t[N/2] ids)`                                     | Creates a cmask2 from an indexes list (indexes have to be between 0 and (N/2)-1).                                                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask4`        | `Reg   cmask4        (const uint32_t[N/4] ids)`                                     | Creates a cmask4 from an indexes list (indexes have to be between 0 and (N/4)-1).                                                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff`         | `Reg   shuff         (const Reg r, const Reg cm)`                             | Shuffles the elements of `r` according to the cmask `cm`.                                                                                                           | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff2`        | `Reg   shuff2        (const Reg r, const Reg cm2)`                            | Shuffles the elements of `r` according to the cmask2 `cm2` (same shuffle is applied in both lanes).                                                                 | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff4`        | `Reg   shuff4        (const Reg r, const Reg cm4)`                            | Shuffles the elements of `r` according to the cmask4 `cm4` (same shuffle is applied in the four lanes).                                                             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave`    | `Regx2 interleave    (const Reg r1, const Reg r2)`                            | Interleaves `r1` and `r2` : `[r1_1, r2_1, r1_2, r2_2, ..., r1_n, r2_n]`.                                                                                            | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `deinterleave`  | `Regx2 deinterleave  (const Reg r1, const Reg r2)`                            | Reverts the previous defined interleave operation.                                                                                                                  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave2`   | `Regx2 interleave2   (const Reg r1, const Reg r2)`                            | Interleaves `r1` and `r2` considering two lanes.                                                                                                                    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave4`   | `Regx2 interleave4   (const Reg r1, const Reg r2)`                            | Interleaves `r1` and `r2` considering four lanes.                                                                                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo`  | `Reg   interleavelo  (const Reg r1, const Reg r2)`                            | Interleaves the low part of `r1` with the low part of `r2`.                                                                                                         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo2` | `Reg   interleavelo2 (const Reg r1, const Reg r2)`                            | Interleaves the low part of `r1` with the low part of `r2` (considering two lanes).                                                                                 | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo4` | `Reg   interleavelo4 (const Reg r1, const Reg r2)`                            | Interleaves the low part of `r1` with the low part of `r2` (considering four lanes).                                                                                | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi`  | `Reg   interleavehi  (const Reg r1, const Reg r2)`                            | Interleaves the high part of `r1` with the high part of `r2`.                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi2` | `Reg   interleavehi2 (const Reg r1, const Reg r2)`                            | Interleaves the high part of `r1` with the high part of `r2` (considering two lanes).                                                                               | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi4` | `Reg   interleavehi4 (const Reg r1, const Reg r2)`                            | Interleaves the high part of `r1` with the high part of `r2` (considering four lanes).                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `lrot`          | `Reg   lrot          (const Reg r)`                                              | Rotates the `r` register from the left (cyclic permutation).                                                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `rrot`          | `Reg   rrot          (const Reg r)`                                              | Rotates the `r` register from the right (cyclic permutation).                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `blend`         | `Reg   blend         (const Reg r1, const Reg r2, const Msk m)`            | Combines the `r1` and `r2` registers according to the `m` mask values (`m_i ? r1_i : r2_i`).                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `select`        | `Reg   select        (const Msk m, const Reg r1, const Reg r2)`            | Alias for the previous `blend` function, with the mask as the first parameter.                                                                                      | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
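
The per-element rule given for `blend` (`m_i ? r1_i : r2_i`) can be illustrated with a scalar reference model. This is only a sketch of the semantics on plain `std::array` values, not MIPP code: the names `blend_ref` and `select_ref` are hypothetical and are not part of the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference model (hypothetical helper, NOT the MIPP API):
// for each element i, keeps r1[i] where m[i] is true and r2[i] otherwise.
template <typename T, std::size_t N>
std::array<T, N> blend_ref(const std::array<T, N>& r1,
                           const std::array<T, N>& r2,
                           const std::array<bool, N>& m)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = m[i] ? r1[i] : r2[i];
    return out;
}

// Same operation with the mask as the first argument,
// mirroring the parameter order documented for `select`.
template <typename T, std::size_t N>
std::array<T, N> select_ref(const std::array<bool, N>& m,
                            const std::array<T, N>& r1,
                            const std::array<T, N>& r2)
{
    return blend_ref(r1, r2, m);
}
```

With `r1 = {1, 2, 3, 4}`, `r2 = {10, 20, 30, 40}` and `m = {true, false, false, true}`, both helpers return `{1, 20, 30, 4}`.
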
### Bitwise Operations
The `pipe` keyword stands for the "|" binary operator.
| **Short name** | **Operator**       | **Prototype**                                       | **Documentation**                             | **Supported types**                                                                                         |
| :---           | :---               | :---                                                | :---                                          | :---                                                                                                        |
| `andb`         | `&` and `&=`       | `Reg andb    (const Reg r1, const Reg r2)` | Computes the bitwise AND: `r1 & r2`.          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `andb`         | `&` and `&=`       | `Msk andb    (const Msk m1, const Msk m2)` | Computes the bitwise AND: `m1 & m2`.          |                                                                                                             |
| `andnb`        |                    | `Reg andnb   (const Reg r1, const Reg r2)` | Computes the bitwise AND NOT: `(~r1) & r2`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `andnb`        |                    | `Msk andnb   (const Msk m1, const Msk m2)` | Computes the bitwise AND NOT: `(~m1) & m2`.   |                                                                                                             |
| `orb`          | `pipe` and `pipe=` | `Reg orb     (const Reg r1, const Reg r2)` | Computes the bitwise OR: `r1 pipe r2`.        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `orb`          | `pipe` and `pipe=` | `Msk orb     (const Msk m1, const Msk m2)` | Computes the bitwise OR: `m1 pipe m2`.        |                                                                                                             |
| `xorb`         | `^` and `^=`       | `Reg xorb    (const Reg r1, const Reg r2)` | Computes the bitwise XOR: `r1 ^ r2`.          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `xorb`         | `^` and `^=`       | `Msk xorb    (const Msk m1, const Msk m2)` | Computes the bitwise XOR: `m1 ^ m2`.          |                                                                                                             |
| `lshift`       | `<<` and `<<=`     | `Reg lshift  (const Reg r, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `r << n`.    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `lshiftr`      | `<<` and `<<=`     | `Reg lshiftr (const Reg r1, const Reg r2)` | Computes the bitwise LEFT SHIFT: `r1 << r2`.  | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                    |
| `lshift`       | `<<` and `<<=`     | `Msk lshift  (const Msk m, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `m << n`.    |                                                                                                             |
| `rshift`       | `>>` and `>>=`     | `Reg rshift  (const Reg r, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `r >> n`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `rshiftr`      | `>>` and `>>=`     | `Reg rshiftr (const Reg r1, const Reg r2)` | Computes the bitwise RIGHT SHIFT: `r1 >> r2`. | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                    |
| `rshift`       | `>>` and `>>=`     | `Msk rshift  (const Msk m, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `m >> n`.   |                                                                                                             |
| `notb`         | `~`                | `Reg notb    (const Reg r)`                   | Computes the bitwise NOT: `~r`.               | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `notb`         | `~`                | `Msk notb    (const Msk m)`                   | Computes the bitwise NOT: `~m`.               |                                                                                                             |
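
The operand order of `andnb` is easy to get wrong: the first operand is the one that is inverted. The scalar sketch below models a single 32-bit element. The helpers `andnb_u32` and `andnb_f32` are hypothetical names, not MIPP functions, and the floating-point variant assumes that the operation acts on the raw bit pattern of the value, as the underlying SIMD instructions usually do.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Scalar model of one element of `andnb`: (~a) & b.
// Note that the FIRST operand is the inverted one.
inline std::uint32_t andnb_u32(std::uint32_t a, std::uint32_t b)
{
    return (~a) & b;
}

// Same operation applied to the bit patterns of two floats
// (assumption: bitwise operations on floats work on the raw bits).
inline float andnb_f32(float a, float b)
{
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua); // reinterpret the bits without undefined behavior
    std::memcpy(&ub, &b, sizeof ub);
    const std::uint32_t ur = andnb_u32(ua, ub);
    float r;
    std::memcpy(&r, &ur, sizeof r);
    return r;
}
```

For instance, `andnb_u32(0xF0, 0xFF)` yields `0x0F`. On IEEE 754 floats, `andnb_f32(-0.0f, x)` clears the sign bit of `x`, because `-0.0f` has only the sign bit set; this is one reason bitwise operations can be useful on floating-point registers.
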
### Logical Comparisons
| **Short name** | **Operator** | **Prototype**                                      | **Documentation**                             | **Supported types**                                                                                         |
| :---           | :---         | :---                                               | :---                                          | :---                                                                                                        |
| `cmpeq`        | `==`         | `Msk cmpeq  (const Reg r1, const Reg r2)` | Compares if equal to: `r1 == r2`.             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmpneq`       | `!=`         | `Msk cmpneq (const Reg r1, const Reg r2)` | Compares if not equal to: `r1 != r2`.         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmpge`        | `>=`         | `Msk cmpge  (const Reg r1, const Reg r2)` | Compares if greater than or equal to: `r1 >= r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmpgt`        | `>`          | `Msk cmpgt  (const Reg r1, const Reg r2)` | Compares if strictly greater than: `r1 > r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmple`        | `<=`         | `Msk cmple  (const Reg r1, const Reg r2)` | Compares if less than or equal to: `r1 <= r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmplt`        | `<`          | `Msk cmplt  (const Reg r1, const Reg r2)` | Compares if strictly less than: `r1 < r2`.    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
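
Each comparison returns a mask with one boolean per element, which is typically fed to a mask-driven operation afterwards. The following scalar sketch models that pattern on `std::array`: a `cmplt`-style comparison against zero, followed by a per-element selection that zeroes the negative elements. `cmplt_ref` and `zero_negative_lanes` are hypothetical names used for illustration only, not MIPP functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference model (hypothetical helper, NOT the MIPP API):
// m[i] is true exactly when r1[i] < r2[i].
template <typename T, std::size_t N>
std::array<bool, N> cmplt_ref(const std::array<T, N>& r1,
                              const std::array<T, N>& r2)
{
    std::array<bool, N> m{};
    for (std::size_t i = 0; i < N; ++i)
        m[i] = r1[i] < r2[i];
    return m;
}

// Uses the mask to replace every negative element with zero.
template <typename T, std::size_t N>
std::array<T, N> zero_negative_lanes(const std::array<T, N>& r)
{
    std::array<T, N> zero;
    zero.fill(T(0));
    const std::array<bool, N> m = cmplt_ref(r, zero); // true where r[i] < 0
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = m[i] ? zero[i] : r[i];               // same rule as a mask-driven blend
    return out;
}
```

Given `{-1.5f, 0.0f, 2.0f, -3.0f}`, the mask is `{true, false, false, true}` and the result is `{0.0f, 0.0f, 2.0f, 0.0f}`.
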
### Conversions and Packing
| **Short name** | **Prototype**                                        | **Documentation**                                                                                                                                                                                                                                                                                                                                   | **Supported types**                                                                                                                                                    |
| :---           | :---                                                 | :---                                                                                                                                                                                                                                                                                                                                                | :---                                                                                                                                                                   |
| `toReg`        | `Reg  toReg (const Msk m)`                     | Converts the mask `m` into a register of type `T`; the mask and the register must have the same number of elements `N`. Where a mask element is `false`, all the bits of the corresponding register element are set to 0; where it is `true`, all the bits are set to 1 (be careful: for floating-point types, the all-ones bit pattern is a NaN!). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                                                            |
| `cvt`          | `Reg cvt   (const Reg r)`                    | Converts the elements of `r` into another representation (the new representation and the original one must have the same size).                                                                                                                                                                                                                      | `float -> int32_t`, `float -> uint32_t`, `int32_t -> float`, `uint32_t -> float`, `double -> int64_t`, `double -> uint64_t`, `int64_t -> double`, `uint64_t -> double` |
| `cvt`          | `Reg cvt   (const Reg_2 r)`                  | Converts elements of `r` into bigger elements (in bits).                                                                                                                                                                                                                                                                                            | `int8_t -> int16_t`, `uint8_t -> uint16_t`, `int16_t -> int32_t`, `uint16_t -> uint32_t`, `int32_t -> int64_t`, `uint32_t -> uint64_t`                                 |
| `pack`         | `Reg pack  (const Reg r1, const Reg r2)` | Packs elements of `r1` and `r2` into smaller elements (some information can be lost in the conversion).                                                                                                                                                                                                                                             | `int32_t -> int16_t`, `uint32_t -> uint16_t`, `int16_t -> int8_t`, `uint16_t -> uint8_t`                                                                               |
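
The NaN warning attached to `toReg` follows directly from the bit patterns involved. The scalar sketch below models one 32-bit element: a `true` mask element becomes all ones, which reads as `-1` for a two's complement `int32_t` and as a NaN for an IEEE 754 `float`. The helper names (`lane_bits`, `lane_as_float`, `lane_as_int32`) are hypothetical and not part of MIPP.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Scalar model of one element: false -> all bits 0, true -> all bits 1.
inline std::uint32_t lane_bits(bool m)
{
    return m ? 0xFFFFFFFFu : 0u;
}

// Reads the element bits as a float (exponent and mantissa all ones => NaN).
inline float lane_as_float(bool m)
{
    const std::uint32_t bits = lane_bits(m);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Reads the element bits as a signed 32-bit integer (all ones => -1).
inline std::int32_t lane_as_int32(bool m)
{
    const std::uint32_t bits = lane_bits(m);
    std::int32_t v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}
```

As a consequence, a register obtained from a mask is suited to bitwise operations on floating-point data rather than to arithmetic: multiplying by a "true" element would propagate NaN instead of acting like a multiplication by one.
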
### Arithmetic Operations
| **Short name** | **Operator** | **Prototype**                                                       | **Documentation**                                                                                   | **Supported types**                                                                                         |
| :---           | :---         | :---                                                                | :---                                                                                                | :---                                                                                                        |
| `add`          | `+` and `+=` | `Reg add    (const Reg r1, const Reg r2)`                  | Performs the arithmetic addition: `r1 + r2`.                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `sub`          | `-` and `-=` | `Reg sub    (const Reg r1, const Reg r2)`                  | Performs the arithmetic subtraction: `r1 - r2`.                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `mul`          | `*` and `*=` | `Reg mul    (const Reg r1, const Reg r2)`                  | Performs the arithmetic multiplication: `r1 * r2`.                                                  | `double`, `float`, `int32_t`, `int16_t`, `int8_t`                                                           |
| `div`          | `/` and `/=` | `Reg div    (const Reg r1, const Reg r2)`                  | Performs the arithmetic division: `r1 / r2`.                                                        | `double`, `float`                                                                                           |
| `fmadd`        |              | `Reg fmadd  (const Reg r1, const Reg r2, const Reg r3)` | Performs the fused multiplication and addition: `r1 * r2 + r3`.                                     | `double`, `float`                                                                                           |
| `fnmadd`       |              | `Reg fnmadd (const Reg r1, const Reg r2, const Reg r3)` | Performs the negative fused multiplication and addition: `-(r1 * r2) + r3`.                         | `double`, `float`                                                                                           |
| `fmsub`        |              | `Reg fmsub  (const Reg r1, const Reg r2, const Reg r3)` | Performs the fused multiplication and subtraction: `r1 * r2 - r3`.                                  | `double`, `float`                                                                                           |
| `fnmsub`       |              | `Reg fnmsub (const Reg r1, const Reg r2, const Reg r3)` | Performs the negative fused multiplication and subtraction: `-(r1 * r2) - r3`.                      | `double`, `float`                                                                                           |
| `min`          |              | `Reg min    (const Reg r1, const Reg r2)`                  | Selects the minimum: `r1_i < r2_i ? r1_i : r2_i`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `max`          |              | `Reg max    (const Reg r1, const Reg r2)`                  | Selects the maximum: `r1_i > r2_i ? r1_i : r2_i`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `div2`         |              | `Reg div2   (const Reg r)`                                    | Performs the arithmetic division by two: `r / 2`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `div4`         |              | `Reg div4   (const Reg r)`                                    | Performs the arithmetic division by four: `r / 4`.                                                  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `abs`          |              | `Reg