https://github.com/aff3ct/MIPP
MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX, AVX-512 and SVE (length specific).
- Host: GitHub
- URL: https://github.com/aff3ct/MIPP
- Owner: aff3ct
- License: mit
- Created: 2017-06-23T16:56:44.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-05-18T08:51:13.000Z (12 months ago)
- Last Synced: 2024-05-18T09:38:39.651Z (12 months ago)
- Topics: avx, avx-512, neon, portable, simd, sse, sve, vector, wrapper
- Language: C++
- Homepage:
- Size: 2.01 MB
- Stars: 465
- Watchers: 23
- Forks: 86
- Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- AwesomeCppGameDev - MIPP - 512. (Maths)
README
# MyIntrinsics++ (MIPP)
## Purpose
MIPP is a portable, open-source wrapper (MIT license) for vector intrinsic
functions (SIMD) written in C++11. It works for SSE, AVX, AVX-512, ARM NEON
and SVE (work in progress) instructions. The MIPP wrapper supports single- and
double-precision floating-point numbers as well as signed and unsigned integer
arithmetic (64-bit, 32-bit, 16-bit and 8-bit).

With the MIPP wrapper you no longer need to write architecture-specific
intrinsic code. Just use the provided functions and the wrapper will
automatically generate the right intrinsic calls for your architecture.

If you are interested in the ARM SVE development status,
[please follow this link](#arm-sve).

## Short Documentation
### Supported Compilers
At this time, MIPP has been tested on the following compilers:
- Intel: `icpc` >= `16`,
- GNU: `g++` >= `4.8`,
- Clang: `clang++` >= `3.6`,
- Microsoft: `msvc` >= `14`.

With `msvc` `14.10` (Microsoft Visual Studio 2017), performance is reduced
compared to the other compilers because the compiler is not able to fully inline
all the MIPP methods. This has been fixed in `msvc` `14.21` (Microsoft Visual
Studio 2019), where you can expect high performance.

### Install and Configure your Code
You don't have to install MIPP because it is a simple C++ header file. The
headers are located in the `include` folder (note that this location has changed
since commit `6795891`; before that they were located in the `src` folder).

Just include the header in your source files where the wrapper is needed.
```cpp
#include "mipp.h"
```

`mipp.h` uses a C++ `namespace`: `mipp`. If you do not want to prefix all the
MIPP calls with `mipp::`, you can do this:

```cpp
#include "mipp.h"
using namespace mipp;
```

Before compiling, tell the compiler which vector instructions you want to use.
For instance, with the GNU compiler (`g++`) you simply have to add the
`-march=native` option for SSE- and AVX-compatible CPUs. For ARMv7 CPUs with
NEON instructions you have to add the `-mfpu=neon` option (since most current
NEONv1 instructions are not IEEE-754 compliant). However, this is no longer the
case on ARMv8 processors, so the `-march=native` option will work too. MIPP also
uses some features provided by C++11, so the `-std=c++11` flag is required to
compile the code. You are now ready to run your code with the MIPP wrapper.

If MIPP is installed on the system, it can be integrated into a
CMake project in the standard way. Example:
```sh
# install MIPP
cd MIPP/
export MIPP_ROOT=$PWD/build/install
cmake -B build -DCMAKE_INSTALL_PREFIX=$MIPP_ROOT
cmake --build build -j5
cmake --install build
```

In your `CMakeLists.txt`:
```cmake
# find the installation of MIPP on the system
find_package(MIPP REQUIRED)

# define your executable
add_executable(gemm gemm.cpp)

# link your executable to MIPP
target_link_libraries(gemm PRIVATE MIPP::mipp)
```

```sh
cd your_project/
# if MIPP is installed in a system standard path: MIPP will be found automatically with cmake
cmake -B build
# if MIPP is installed in a non-standard path: use CMAKE_PREFIX_PATH
cmake -B build -DCMAKE_PREFIX_PATH=$MIPP_ROOT
```

#### Generate Sources & Compile the Static Library
MIPP is mainly a header-only library. However, some macro operations require
compiling a small library. This is particularly true for the `compress`
operation, which relies on generated LUTs stored in the static library.

To generate the source files containing these LUTs you need to install Python 3
with the Jinja2 package:
```bash
sudo apt install python3 python3-pip
pip3 install --user -r codegen/requirements.txt
```

Then you can call the generator as follows:
```bash
python3 codegen/gen_compress.py
```

Finally, you can compile the MIPP static library:
```bash
cmake -B build -DMIPP_STATIC_LIB=ON
cmake --build build -j4
```

Note that **the compilation of the static library is optional**. If you choose
not to compile the static library, only some macro operations will be
missing.

### Sequential Mode
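Compilers advertise the instruction sets enabled at build time through predefined macros, which is what makes the detection described in this section possible. The following is a minimal sketch of that idea in plain C++. It is not MIPP's actual detection code: the helper name `detected_simd`, the priority order and the returned strings are assumptions made for illustration. The macros themselves are the ones predefined by GCC and Clang (other compilers may define only some of them).

```cpp
// Returns the name of the widest SIMD extension advertised by the compiler.
// Hypothetical helper: the names and the priority order are illustrative only.
constexpr const char* detected_simd()
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE";
#elif defined(__ARM_FEATURE_SVE)
    return "SVE";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "NEON";
#else
    return "sequential"; // no known instruction set: scalar fallback
#endif
}

// The selection happens entirely at compile time.
static_assert(detected_simd()[0] != '\0', "a mode name is always selected");
```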
By default, MIPP tries to recognize the instruction set from the preprocessor
definitions. If MIPP can't match the instruction set (for instance when MIPP
does not support the targeted instruction set), MIPP falls back on standard
sequential instructions. In this mode, vectorization is no longer guaranteed,
but the compiler can still perform auto-vectorization.

It is possible to force MIPP to use the sequential mode with the following
compiler definition: `-DMIPP_NO_INTRINSICS`. This can be useful for
debugging or for benchmarking a code.

If you want to check the MIPP mode configuration, you can print the following
global variable: `mipp::InstructionFullType` (`std::string`).

### Vector Register Declaration
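How many elements fit in one register depends on the register width and on the element type. The sketch below works this out for a few common widths (128 bits for SSE and NEON, 256 bits for AVX, 512 bits for AVX-512). It is plain C++ and does not use MIPP: `elements_per_register` and its `RegisterBits` parameter are hypothetical names, whereas MIPP deduces the width from the target instruction set.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in a register of RegisterBits bits.
// Assumes 8-bit bytes; RegisterBits is an illustrative parameter only.
template <typename T, std::size_t RegisterBits>
constexpr std::size_t elements_per_register()
{
    return RegisterBits / (8 * sizeof(T));
}

static_assert(elements_per_register<float,       128>() ==  4, "128-bit register, 32-bit elements");
static_assert(elements_per_register<double,      128>() ==  2, "128-bit register, 64-bit elements");
static_assert(elements_per_register<float,       256>() ==  8, "256-bit register, 32-bit elements");
static_assert(elements_per_register<std::int8_t, 512>() == 64, "512-bit register, 8-bit elements");
```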
Just use the `mipp::Reg<T>` type.
```cpp
mipp::Reg<T> r1, r2, r3; // we have declared 3 vector registers
```

But we do not know the number of elements per register here. This number of
elements can be obtained by calling the `mipp::N<T>()` function (`T` is a
template parameter, it can be `double`, `float`, `int64_t`, `uint64_t`,
`int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t` or `uint8_t` type).

```cpp
for (int i = 0; i < n; i += mipp::N<T>()) {
    // ...
}
```

The number of elements in a register directly depends on the precision of the
data we are working on.

### Register `load` and `store` Instructions
Loading memory from a vector into a register:
```cpp
int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1;
r1.load(&myVector[i*mipp::N<float>()]);
```

The last two lines can be shortened as follows, where the `load` call becomes
implicit:

```cpp
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];
```

Store can be done with the `store(...)` method:
```cpp
int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];

// do something with r1
r1.store(&myVector[(i+1)*mipp::N<float>()]);
```

By default the loads and stores work on **unaligned memory**.
It is possible to control this behavior with the `-DMIPP_ALIGNED_LOADS`
definition: when specified, the loads and stores work on **aligned memory** by
default. In the **aligned memory** mode, it is still possible to perform
unaligned memory operations with the `mipp::loadu` and `mipp::storeu` functions.
However, it is not possible to perform aligned loads and stores in the
**unaligned memory** mode.

To allocate aligned data you can use the MIPP aligned memory allocator wrapped
into the `mipp::vector` class. `mipp::vector` is fully compatible with the
standard `std::vector` class and it can be used everywhere you can use
`std::vector`.

```cpp
mipp::vector<float> myVector(n);
```

### Register Initialization
You can initialize a vector register from a scalar value:
```cpp
mipp::Reg<float> r1; // r1 = | unknown | unknown | unknown | unknown |
r1 = 1.0;            // r1 = | +1.0    | +1.0    | +1.0    | +1.0    |
```

Or from an initializer list (`std::initializer_list`):
```cpp
mipp::Reg<float> r1;       // r1 = | unknown | unknown | unknown | unknown |
r1 = {1.0, 2.0, 3.0, 4.0}; // r1 = | +1.0    | +2.0    | +3.0    | +4.0    |
```

### Computational Instructions
**Add** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 + r2; // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |
```

**Subtract** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 - r2; // r3 = | -1.0 | -1.0 | -1.0 | -1.0 |
```

**Multiply** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 * r2; // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |
```

**Divide** two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |
r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 / r2; // r3 = | +0.5 | +0.5 | +0.5 | +0.5 |
```

**Fused multiply and add** of three vector registers:
```cpp
mipp::Reg<float> r1, r2, r3, r4;

r1 = 2.0; // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0; // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = 1.0; // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |

// r4 = (r1 * r2) + r3
r4 = mipp::fmadd(r1, r2, r3); // r4 = | +7.0 | +7.0 | +7.0 | +7.0 |
```

**Fused negative multiply and add** of three vector registers:
```cpp
mipp::Reg<float> r1, r2, r3, r4;

r1 = 2.0; // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0; // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
r3 = 1.0; // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |

// r4 = -(r1 * r2) + r3
r4 = mipp::fnmadd(r1, r2, r3); // r4 = | -5.0 | -5.0 | -5.0 | -5.0 |
```

**Square root** of a vector register:
```cpp
mipp::Reg<float> r1, r2;

r1 = 9.0;            // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |
r2 = mipp::sqrt(r1); // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |
```

**Reciprocal square root** of a vector register (be careful: this intrinsic
exists only for single-precision floating-point numbers):

```cpp
mipp::Reg<float> r1, r2;

r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |
r2 = mipp::rsqrt(r1); // r2 = | +0.3 | +0.3 | +0.3 | +0.3 |
```

### Selections
Select the **minimum** between two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;

r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

r3 = mipp::min(r1, r2); // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |
```

Select the **maximum** between two vector registers:
```cpp
mipp::Reg<float> r1, r2, r3;

r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |
r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

r3 = mipp::max(r1, r2); // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |
```

### Permutations
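A right rotation moves every element one position to the right and wraps the last element around to the first position. The sketch below is a scalar model of that behavior in plain C++, written to be consistent with the register contents listed in this section. It is not MIPP code: `rrot_model` is a hypothetical helper working on `std::array`.

```cpp
#include <array>
#include <cstddef>

// Scalar model of a right rotation (cyclic permutation):
// element i moves to position (i + 1) % N, so the last element wraps to 0.
template <typename T, std::size_t N>
std::array<T, N> rrot_model(const std::array<T, N>& in)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[(i + 1) % N] = in[i];
    return out;
}
```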
The `rrot(...)` function allows you to perform a **right rotation** (a cyclic
permutation) of the elements inside the register:

```cpp
mipp::Reg<float> r1, r2;
r1 = {3.0, 2.0, 1.0, 0.0}; // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |

r2 = mipp::rrot(r1);       // r2 = | +0.0 | +3.0 | +2.0 | +1.0 |
r1 = mipp::rrot(r2);       // r1 = | +1.0 | +0.0 | +3.0 | +2.0 |
r2 = mipp::rrot(r1);       // r2 = | +2.0 | +1.0 | +0.0 | +3.0 |
r1 = mipp::rrot(r2);       // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |
```

Of course there are many more instructions available in the MIPP wrapper; you
can find them at the [end of this page](#list-of-mipp-functions).

### Addition of Two Vectors
```cpp
#include <cstdlib> // rand()
#include "mipp.h"

int main()
{
    // data allocation
    const int n = 32000; // size of the vA, vB, vC vectors
    mipp::vector<float> vA(n); // in
    mipp::vector<float> vB(n); // in
    mipp::vector<float> vC(n); // out

    // data initialization
    for (int i = 0; i < n; i++) vA[i] = rand() % 10;
    for (int i = 0; i < n; i++) vB[i] = rand() % 10;

    // declare 3 vector registers
    mipp::Reg<float> rA, rB, rC;

    // compute rC with the MIPP vectorized functions
    for (int i = 0; i < n; i += mipp::N<float>()) {
        rA.load(&vA[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS
        rB.load(&vB[i]); // macro definition to force aligned loads and stores).
        rC = rA + rB;
        rC.store(&vC[i]);
    }

    return 0;
}
```

### Vectorizing an Existing Code
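The pattern used in this section splits a loop in two parts: a main loop that consumes one full register worth of elements per iteration, and a scalar tail loop for the remaining elements when the size is not a multiple of the register width. The sketch below shows only that index arithmetic in plain C++, without MIPP. The constant `kWidth` is a hypothetical stand-in for the number of elements per register, and the inner `j` loop stands in for what a single SIMD operation would do.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical chunk width standing in for the number of elements per register.
constexpr std::size_t kWidth = 4;

void scaled_exp(const std::vector<float>& in1,
                const std::vector<float>& in2,
                std::vector<float>& out)
{
    const std::size_t n = out.size();
    assert(in1.size() >= n && in2.size() >= n); // inputs must cover the output

    // largest multiple of kWidth that is <= n
    const std::size_t vecLoopSize = (n / kWidth) * kWidth;

    // main loop: one full chunk of kWidth elements per iteration
    for (std::size_t i = 0; i < vecLoopSize; i += kWidth)
        for (std::size_t j = 0; j < kWidth; j++)
            out[i + j] = 0.75f * in1[i + j] * std::exp(in2[i + j]);

    // scalar tail loop: the n % kWidth remaining elements
    for (std::size_t i = vecLoopSize; i < n; i++)
        out[i] = 0.75f * in1[i] * std::exp(in2[i]);
}
```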
#### Scalar Code
```cpp
// ...
for (int i = 0; i < n; i++) {
out[i] = 0.75f * in1[i] * std::exp(in2[i]);
}
// ...
```

#### Vectorized Code
```cpp
// ...
// compute the vectorized loop size which is a multiple of 'mipp::N<float>()'.
auto vecLoopSize = (n / mipp::N<float>()) * mipp::N<float>();
mipp::Reg<float> rout, rin1, rin2;
for (int i = 0; i < vecLoopSize; i += mipp::N<float>()) {
    rin1.load(&in1[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS
    rin2.load(&in2[i]); // macro definition to force aligned loads and stores).
    // the '0.75f' constant will be broadcast in a vector but it has to be at
    // the right of a 'mipp::Reg<T>', this is why it has been moved to the right
    // of the 'rin1' register. Notice that 'std::exp' has been replaced by
    // 'mipp::exp'.
    rout = rin1 * 0.75f * mipp::exp(rin2);
    rout.store(&out[i]);
}

// scalar tail loop: compute the remaining elements that can't be vectorized.
for (int i = vecLoopSize; i < n; i++) {
    out[i] = 0.75f * in1[i] * std::exp(in2[i]);
}
// ...
```

### Masked Instructions
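The two masked forms covered in this section differ only in what they write where the mask is false: `mask` keeps a value from a source register, `maskz` writes zero. The sketch below is a scalar model of both for a multiplication, in plain C++ on `std::array`. The helper names `mask_mul_model` and `maskz_mul_model` are hypothetical and this is not how MIPP implements them.

```cpp
#include <array>
#include <cstddef>

// mask model: out[i] = k[i] ? a[i] * b[i] : src[i]
template <typename T, std::size_t N>
std::array<T, N> mask_mul_model(const std::array<bool, N>& k,
                                const std::array<T, N>& src,
                                const std::array<T, N>& a,
                                const std::array<T, N>& b)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[i] = k[i] ? a[i] * b[i] : src[i];
    return out;
}

// maskz model: out[i] = k[i] ? a[i] * b[i] : 0
template <typename T, std::size_t N>
std::array<T, N> maskz_mul_model(const std::array<bool, N>& k,
                                 const std::array<T, N>& a,
                                 const std::array<T, N>& b)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[i] = k[i] ? a[i] * b[i] : T(0);
    return out;
}
```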
MIPP comes with two generic, templatized masked functions (`mask` and
`maskz`). These functions allow you to benefit from the AVX-512 and SVE masked
instructions. The `mask` and `maskz` functions are backward compatible with
older instruction sets.

```cpp
mipp::Reg< float > ZMM1 = { 40, -30, 60, 80};
mipp::Reg< float > ZMM2 = 0.1; // broadcast
mipp::Msk<mipp::N<float>()> k1 = {false, true, false, false};

// ZMM3 = k1 ? ZMM1 * ZMM2 : ZMM1;
auto ZMM3 = mipp::mask<float,mipp::mul>(k1, ZMM1, ZMM1, ZMM2);
std::cout << ZMM3 << std::endl; // output: "[40, -3, 60, 80]"

// ZMM4 = k1 ? ZMM1 * ZMM2 : 0;
auto ZMM4 = mipp::maskz<float,mipp::mul>(k1, ZMM1, ZMM2);
std::cout << ZMM4 << std::endl; // output: "[0, -3, 0, 0]"
```

## List of MIPP Functions
This section presents an exhaustive list of all the available functions in MIPP.
Of course the MIPP wrapper does not cover all the possible intrinsics of each
instruction set but it tries to give you the most important and useful ones.

In the following tables, `T`, `T1` and `T2` stand for data types (`double`,
`float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`,
`int8_t` or `uint8_t`).
`N` stands for the number of elements in a mask or in a register.
`N` is a strictly positive integer and can easily be deduced from the data type:
`constexpr int N = mipp::N<T>()`.
When `T` and `N` are mixed in a prototype, `N` has to satisfy the previous
constraint (`N = mipp::N<T>()`).

In the documentation there are some terms that need to be clarified:
- **register element**: a SIMD register is composed of multiple scalar
  elements, those elements are built-in data types (`double`, `float`,
  `int64_t`, ...),
- **register lane**: modern instruction sets can have multiple implicit sub
  parts in an entire SIMD register, those sub parts are called lanes (SSE has
  one lane of 128 bits, AVX has two lanes of 128 bits, AVX-512 has four lanes of
  128 bits).

### Memory Operations
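Three of the operations listed in this section (`gather`, `scatter` and `compress`) move elements between positions rather than computing on them, which is easier to see in scalar form. The sketch below models them in plain C++ on `std::array` and `std::vector`. The `*_model` helpers are hypothetical and are not MIPP code. For `compress`, filling the unused trailing positions with zero is an assumption of this sketch, since the table does not say what they contain.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// gather model: out[i] = mem[idx[i]]
template <typename T, std::size_t N>
std::array<T, N> gather_model(const std::vector<T>& mem,
                              const std::array<std::size_t, N>& idx)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; i++) {
        assert(idx[i] < mem.size()); // indices must stay inside the buffer
        out[i] = mem[idx[i]];
    }
    return out;
}

// scatter model: mem[idx[i]] = r[i]
template <typename T, std::size_t N>
void scatter_model(std::vector<T>& mem,
                   const std::array<std::size_t, N>& idx,
                   const std::array<T, N>& r)
{
    for (std::size_t i = 0; i < N; i++) {
        assert(idx[i] < mem.size());
        mem[idx[i]] = r[i];
    }
}

// compress model: selected elements are packed at the beginning, in order.
// Assumption of this sketch: the remaining positions are left at zero.
template <typename T, std::size_t N>
std::array<T, N> compress_model(const std::array<T, N>& r,
                                const std::array<bool, N>& m)
{
    std::array<T, N> out{}; // value-initialized (zeros)
    std::size_t pos = 0;
    for (std::size_t i = 0; i < N; i++)
        if (m[i]) out[pos++] = r[i];
    return out;
}
```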
| **Short name** | **Prototype** | **Documentation** | **Supported types** |
| :--- | :--- | :--- | :--- |
| `load` | `Reg<T> load (const T* mem)` | Loads aligned data from `mem` to a register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `loadu` | `Reg<T> loadu (const T* mem)` | Loads unaligned data from `mem` to a register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `store` | `void store (T* mem, const Reg<T> r)` | Stores the `r` register in the `mem` aligned data. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `storeu` | `void storeu (T* mem, const Reg<T> r)` | Stores the `r` register in the `mem` unaligned data. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `maskzld` | `Reg<T> maskzld (const Msk<N> m, const T* mem)` | Loads elements according to the mask `m` (puts zero when the mask value is false). | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `maskzlds` | `Reg<T> maskzlds (const Msk<N> m, const T* mem)` | Loads elements according to the mask `m` (puts zero when the mask value is false). Safe version, only reads masked elements in memory. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `maskst` | `void maskst (const Msk<N> m, T* mem, const Reg<T> r)` | Stores elements from the `r` register according to the mask `m` in the `mem` memory. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `masksts` | `void masksts (const Msk<N> m, T* mem, const Reg<T> r)` | Stores elements from the `r` register according to the mask `m` in the `mem` memory. Safe version, only writes masked elements in memory. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `gather` | `Reg<TD> gather (const TD* mem, const Reg<TI> idx)` | Gathers elements from `mem` to a register. Selects elements according to the indices in `idx`. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `scatter` | `void scatter (TD* mem, const Reg<TI> idx, const Reg<TD> r)` | Scatters elements into `mem` from the `r` register. Writes elements at the `idx` indices in `mem`. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `maskzgat` | `Reg<TD> maskzgat (const Msk<N> m, const TD* mem, const Reg<TI> idx)` | Gathers elements from `mem` to a register (according to the mask `m`). Selects elements according to the indices in `idx` (puts zero when the mask value is false). | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `masksca` | `void masksca (const Msk<N> m, TD* mem, const Reg<TI> idx, const Reg<TD> r)` | Scatters elements into `mem` from the `r` register (according to the mask `m`). Writes elements at the `idx` indices in `mem`. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `set` | `Reg<T> set (const T[N] vals)` | Sets a register from the values in `vals`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set` | `Msk<N> set (const bool[N] bits)` | Sets a mask from the bits in `bits`. | |
| `set1` | `Reg<T> set1 (const T val)` | Broadcasts `val` in a register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set1` | `Msk<N> set1 (const bool bit)` | Broadcasts `bit` in a mask. | |
| `set0` | `Reg<T> set0 ()` | Initializes a register to zero. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set0` | `Msk<N> set0 ()` | Initializes a mask to false. | |
| `get` | `T get (const Reg<T> r, const size_t index)` | Gets a specific element from the register `r` at the `index` position. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `get` | `T get (const Reg_2<T> r, const size_t index)` | Gets a specific element from the register `r` at the `index` position. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `get` | `bool get (const Msk<N> m, const size_t index)` | Gets a specific element from the mask `m` at the `index` position. | |
| `getfirst` | `T getfirst (const Reg<T> r)` | Gets the first element from the register `r`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `getfirst` | `T getfirst (const Reg_2<T> r)` | Gets the first element from the register `r`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `getfirst` | `bool getfirst (const Msk<N> m)` | Gets the first element from the mask `m`. | |
| `low` | `Reg_2<T> low (const Reg<T> r)` | Gets the low part of the `r` register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `high` | `Reg_2<T> high (const Reg<T> r)` | Gets the high part of the `r` register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `combine` | `Reg<T> combine (const Reg_2<T> r1, const Reg_2<T> r2)` | Combines two half registers into a full register, `r1` will be the low part and `r2` the high part. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `combine` | `Reg<T> combine <S> (const Reg<T> r1, const Reg<T> r2)` | `S` elements of `r1` are shifted to the left, `(S - N) + N` elements of `r2` are shifted to the right. Shifted `r1` and `r2` are combined to give the result. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `compress` | `Reg<T> compress (const Reg<T> r1, const Msk<N> m)` | Packs the elements of `r1` at the beginning of the register according to the bitmask `m` (if the bit is 1 then the element is picked, otherwise it is not). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask` | `Reg<T> cmask (const uint32_t[N] ids)` | Creates a cmask from a list of indices (indices have to be between 0 and N-1). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask2` | `Reg<T> cmask2 (const uint32_t[N/2] ids)` | Creates a cmask2 from a list of indices (indices have to be between 0 and (N/2)-1). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask4` | `Reg<T> cmask4 (const uint32_t[N/4] ids)` | Creates a cmask4 from a list of indices (indices have to be between 0 and (N/4)-1). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff` | `Reg<T> shuff (const Reg<T> r, const Reg<T> cm)` | Shuffles the elements of `r` according to the cmask `cm`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff2` | `Reg<T> shuff2 (const Reg<T> r, const Reg<T> cm2)` | Shuffles the elements of `r` according to the cmask2 `cm2` (same shuffle is applied in both lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff4` | `Reg<T> shuff4 (const Reg<T> r, const Reg<T> cm4)` | Shuffles the elements of `r` according to the cmask4 `cm4` (same shuffle is applied in the four lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave` | `Regx2<T> interleave (const Reg<T> r1, const Reg<T> r2)` | Interleaves `r1` and `r2`: `[r1_1, r2_1, r1_2, r2_2, ..., r1_n, r2_n]`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `deinterleave` | `Regx2<T> deinterleave (const Reg<T> r1, const Reg<T> r2)` | Reverts the previously defined interleave operation. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave2` | `Regx2<T> interleave2 (const Reg<T> r1, const Reg<T> r2)` | Interleaves `r1` and `r2` considering two lanes. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave4` | `Regx2<T> interleave4 (const Reg<T> r1, const Reg<T> r2)` | Interleaves `r1` and `r2` considering four lanes. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo` | `Reg<T> interleavelo (const Reg<T> r1, const Reg<T> r2)` | Interleaves the low part of `r1` with the low part of `r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo2` | `Reg<T> interleavelo2 (const Reg<T> r1, const Reg<T> r2)` | Interleaves the low part of `r1` with the low part of `r2` (considering two lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo4` | `Reg<T> interleavelo4 (const Reg<T> r1, const Reg<T> r2)` | Interleaves the low part of `r1` with the low part of `r2` (considering four lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi` | `Reg<T> interleavehi (const Reg<T> r1, const Reg<T> r2)` | Interleaves the high part of `r1` with the high part of `r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi2` | `Reg<T> interleavehi2 (const Reg<T> r1, const Reg<T> r2)` | Interleaves the high part of `r1` with the high part of `r2` (considering two lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi4` | `Reg<T> interleavehi4 (const Reg<T> r1, const Reg<T> r2)` | Interleaves the high part of `r1` with the high part of `r2` (considering four lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `lrot` | `Reg<T> lrot (const Reg<T> r)` | Rotates the `r` register from the left (cyclic permutation). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `rrot` | `Reg<T> rrot (const Reg<T> r)` | Rotates the `r` register from the right (cyclic permutation). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `blend` | `Reg<T> blend (const Reg<T> r1, const Reg<T> r2, const Msk<N> m)` | Combines the `r1` and `r2` registers following the `m` mask values (`m_i ? r1_i : r2_i`). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `select` | `Reg<T> select (const Msk<N> m, const Reg<T> r1, const Reg<T> r2)` | Alias for the previous `blend` function. The parameter order is slightly different. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

### Bitwise Operations
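Among the bitwise operations listed in this section, AND NOT is the one where operand order matters: the first operand is the one that gets negated. The sketch below is a scalar model of that element-wise behavior in plain C++ on 32-bit unsigned integers. `andnb_model` is a hypothetical helper, not MIPP code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// AND NOT model: out[i] = (~r1[i]) & r2[i]
// Note that r1 is negated, not r2.
template <std::size_t N>
std::array<std::uint32_t, N> andnb_model(const std::array<std::uint32_t, N>& r1,
                                         const std::array<std::uint32_t, N>& r2)
{
    std::array<std::uint32_t, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[i] = ~r1[i] & r2[i];
    return out;
}
```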
The `pipe` keyword stands for the "|" binary operator.
| **Short name** | **Operator** | **Prototype** | **Documentation** | **Supported types** |
| :--- | :--- | :--- | :--- | :--- |
| `andb` | `&` and `&=` | `Reg<T> andb (const Reg<T> r1, const Reg<T> r2)` | Computes the bitwise AND: `r1 & r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `andb` | `&` and `&=` | `Msk<N> andb (const Msk<N> m1, const Msk<N> m2)` | Computes the bitwise AND: `m1 & m2`. | |
| `andnb` | | `Reg<T> andnb (const Reg<T> r1, const Reg<T> r2)` | Computes the bitwise AND NOT: `(~r1) & r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `andnb` | | `Msk<N> andnb (const Msk<N> m1, const Msk<N> m2)` | Computes the bitwise AND NOT: `(~m1) & m2`. | |
| `orb` | `pipe` and `pipe=` | `Reg<T> orb (const Reg<T> r1, const Reg<T> r2)` | Computes the bitwise OR: `r1 pipe r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `orb` | `pipe` and `pipe=` | `Msk<N> orb (const Msk<N> m1, const Msk<N> m2)` | Computes the bitwise OR: `m1 pipe m2`. | |
| `xorb` | `^` and `^=` | `Reg<T> xorb (const Reg<T> r1, const Reg<T> r2)` | Computes the bitwise XOR: `r1 ^ r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `xorb` | `^` and `^=` | `Msk<N> xorb (const Msk<N> m1, const Msk<N> m2)` | Computes the bitwise XOR: `m1 ^ m2`. | |
| `lshift` | `<<` and `<<=` | `Reg<T> lshift (const Reg<T> r, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `r << n`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `lshiftr` | `<<` and `<<=` | `Reg<T> lshiftr (const Reg<T> r1, const Reg<T> r2)` | Computes the bitwise LEFT SHIFT: `r1 << r2`. | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `lshift` | `<<` and `<<=` | `Msk<N> lshift (const Msk<N> m, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `m << n`. | |
| `rshift` | `>>` and `>>=` | `Reg<T> rshift (const Reg<T> r, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `r >> n`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `rshiftr` | `>>` and `>>=` | `Reg<T> rshiftr (const Reg<T> r1, const Reg<T> r2)` | Computes the bitwise RIGHT SHIFT: `r1 >> r2`. | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `rshift` | `>>` and `>>=` | `Msk<N> rshift (const Msk<N> m, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `m >> n`. | |
| `notb` | `~` | `Reg<T> notb (const Reg<T> r)` | Computes the bitwise NOT: `~r`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `notb` | `~` | `Msk<N> notb (const Msk<N> m)` | Computes the bitwise NOT: `~m`. | |

### Logical Comparisons
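The comparisons in this section return a mask rather than a register, and a mask is typically consumed by a selection such as `blend` (`m_i ? r1_i : r2_i`). The sketch below models that two-step pattern in scalar C++: a strict less-than comparison produces a mask, and the mask then picks elements from one of two inputs. The `*_model` helpers are hypothetical and are not MIPP code.

```cpp
#include <array>
#include <cstddef>

// Comparison model: mask[i] = (r1[i] < r2[i])
template <typename T, std::size_t N>
std::array<bool, N> cmplt_model(const std::array<T, N>& r1,
                                const std::array<T, N>& r2)
{
    std::array<bool, N> m{};
    for (std::size_t i = 0; i < N; i++)
        m[i] = r1[i] < r2[i];
    return m;
}

// Blend model: out[i] = m[i] ? r1[i] : r2[i]
template <typename T, std::size_t N>
std::array<T, N> blend_model(const std::array<T, N>& r1,
                             const std::array<T, N>& r2,
                             const std::array<bool, N>& m)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[i] = m[i] ? r1[i] : r2[i];
    return out;
}
```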
| **Short name** | **Operator** | **Prototype** | **Documentation** | **Supported types** |
| :--- | :--- | :--- | :--- | :--- |
| `cmpeq` | `==` | `Msk<N> cmpeq (const Reg<T> r1, const Reg<T> r2)` | Compares if equal to: `r1 == r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmpneq` | `!=` | `Msk<N> cmpneq (const Reg<T> r1, const Reg<T> r2)` | Compares if not equal to: `r1 != r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmpge` | `>=` | `Msk<N> cmpge (const Reg<T> r1, const Reg<T> r2)` | Compares if greater or equal to: `r1 >= r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmpgt` | `>` | `Msk<N> cmpgt (const Reg<T> r1, const Reg<T> r2)` | Compares if strictly greater than: `r1 > r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmple` | `<=` | `Msk<N> cmple (const Reg<T> r1, const Reg<T> r2)` | Compares if lower or equal to: `r1 <= r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmplt` | `<` | `Msk<N> cmplt (const Reg<T> r1, const Reg<T> r2)` | Compares if strictly lower than: `r1 < r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

### Conversions and Packing
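The `toReg` entry in this section warns that a `true` mask element becomes NaN when the target type is a floating-point type. The reason is that `true` is encoded as all bits set, and in IEEE-754 that bit pattern is a NaN. The sketch below checks this for a 32-bit `float` in plain C++. It assumes an IEEE-754 platform, and `bits_to_float` is a hypothetical helper, not a MIPP function.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterprets a 32-bit pattern as a float (memcpy avoids aliasing issues).
inline float bits_to_float(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32-bit");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// "true" element: every bit set, which IEEE-754 interprets as NaN.
inline bool true_element_is_nan()  { return std::isnan(bits_to_float(0xFFFFFFFFu)); }

// "false" element: every bit cleared, which is +0.0f.
inline bool false_element_is_zero() { return bits_to_float(0x00000000u) == 0.0f; }
```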
| **Short name** | **Prototype** | **Documentation** | **Supported types** |
| :--- | :--- | :--- | :--- |
| `toReg` | `Reg<T> toReg (const Msk<N> m)` | Converts the mask `m` into a register of type `T`, the number of elements `N` has to be the same for the mask and the register. If the mask is `false` then all the bits of the corresponding element are set to 0, otherwise if the mask is `true` then all the bits are set to 1 (be careful, for float datatypes `true` is interpreted as NaN!). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cvt` | `Reg<T2> cvt (const Reg<T1> r)` | Converts the elements of `r` into another representation (the new representation and the original one have to have the same size). | `float -> int32_t`, `float -> uint32_t`, `int32_t -> float`, `uint32_t -> float`, `double -> int64_t`, `double -> uint64_t`, `int64_t -> double`, `uint64_t -> double` |
| `cvt` | `Reg<T2> cvt (const Reg_2<T1> r)` | Converts elements of `r` into bigger elements (in bits). | `int8_t -> int16_t`, `uint8_t -> uint16_t`, `int16_t -> int32_t`, `uint16_t -> uint32_t`, `int32_t -> int64_t`, `uint32_t -> uint64_t` |
| `pack` | `Reg<T2> pack (const Reg<T1> r1, const Reg<T1> r2)` | Packs elements of `r1` and `r2` into smaller elements (some information can be lost in the conversion). | `int32_t -> int16_t`, `uint32_t -> uint16_t`, `int16_t -> int8_t`, `uint16_t -> uint8_t` |

### Arithmetic Operations
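The four fused operations listed in this section differ only in the signs applied to the product and to the third operand. The sketch below spells out those sign conventions on scalars, using the standard `std::fma` from `<cmath>`. It is a model of the formulas given in the table, not MIPP code: the `*_model` names are hypothetical, and the rounding behavior of the vector versions on a given target is not covered here.

```cpp
#include <cmath>

// Scalar models of the four sign conventions (a, b, c play the role of r1, r2, r3).
inline float fmadd_model (float a, float b, float c) { return std::fma( a, b,  c); } //  (a * b) + c
inline float fnmadd_model(float a, float b, float c) { return std::fma(-a, b,  c); } // -(a * b) + c
inline float fmsub_model (float a, float b, float c) { return std::fma( a, b, -c); } //  (a * b) - c
inline float fnmsub_model(float a, float b, float c) { return std::fma(-a, b, -c); } // -(a * b) - c
```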
| **Short name** | **Operator** | **Prototype** | **Documentation** | **Supported types** |
| :--- | :--- | :--- | :--- | :--- |
| `add` | `+` and `+=` | `Reg<T> add (const Reg<T> r1, const Reg<T> r2)` | Performs the arithmetic addition: `r1 + r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `sub` | `-` and `-=` | `Reg<T> sub (const Reg<T> r1, const Reg<T> r2)` | Performs the arithmetic subtraction: `r1 - r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `mul` | `*` and `*=` | `Reg<T> mul (const Reg<T> r1, const Reg<T> r2)` | Performs the arithmetic multiplication: `r1 * r2`. | `double`, `float`, `int32_t`, `int16_t`, `int8_t` |
| `div` | `/` and `/=` | `Reg<T> div (const Reg<T> r1, const Reg<T> r2)` | Performs the arithmetic division: `r1 / r2`. | `double`, `float` |
| `fmadd` | | `Reg<T> fmadd (const Reg<T> r1, const Reg<T> r2, const Reg<T> r3)` | Performs the fused multiplication and addition: `r1 * r2 + r3`. | `double`, `float` |
| `fnmadd` | | `Reg<T> fnmadd (const Reg<T> r1, const Reg<T> r2, const Reg<T> r3)` | Performs the negative fused multiplication and addition: `-(r1 * r2) + r3`. | `double`, `float` |
| `fmsub` | | `Reg<T> fmsub (const Reg<T> r1, const Reg<T> r2, const Reg<T> r3)` | Performs the fused multiplication and subtraction: `r1 * r2 - r3`. | `double`, `float` |
| `fnmsub` | | `Reg<T> fnmsub (const Reg<T> r1, const Reg<T> r2, const Reg<T> r3)` | Performs the negative fused multiplication and subtraction: `-(r1 * r2) - r3`. | `double`, `float` |
| `min` | | `Reg<T> min (const Reg<T> r1, const Reg<T> r2)` | Selects the minimum: `r1_i < r2_i ? r1_i : r2_i`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `max` | | `Reg<T> max (const Reg<T> r1, const Reg<T> r2)` | Selects the maximum: `r1_i > r2_i ? r1_i : r2_i`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `div2` | | `Reg<T> div2 (const Reg<T> r)` | Performs the arithmetic division by two: `r / 2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `div4` | | `Reg<T> div4 (const Reg<T> r)` | Performs the arithmetic division by four: `r / 4`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `abs` | | `Reg