https://github.com/aff3ct/MIPP

MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX, AVX-512 and SVE (length specific).
Host: GitHub
URL: https://github.com/aff3ct/MIPP
Owner: aff3ct
License: mit
Created: 2017-06-23T16:56:44.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2024-05-18T08:51:13.000Z (almost 2 years ago)
Last Synced: 2024-05-18T09:38:39.651Z (almost 2 years ago)
Topics: avx, avx-512, neon, portable, simd, sse, sve, vector, wrapper
Language: C++
# MyIntrinsics++ (MIPP)

[![pipeline status](https://gitlab.com/aff3ct/MIPP/badges/master/pipeline.svg)](https://gitlab.com/aff3ct/MIPP/pipelines)

[![coverage report](https://gitlab.com/aff3ct/MIPP/badges/master/coverage.svg)](https://aff3ct.gitlab.io/MIPP/)

![](mipp.jpg)

## Purpose

MIPP is a portable, open-source (MIT license) wrapper for vector intrinsic functions (SIMD) written in C++11. It works with SSE, AVX, AVX-512, ARM NEON and SVE (work in progress) instructions. The MIPP wrapper supports single- and double-precision floating-point numbers as well as signed and unsigned integer arithmetic (64-bit, 32-bit, 16-bit and 8-bit).

With the MIPP wrapper you no longer need to write architecture-specific intrinsic code. Just use the provided functions and the wrapper will automatically generate the right intrinsic calls for your architecture.

If you are interested in the development status of ARM SVE, [please follow this link](#arm-sve).

## Short Documentation

### Supported Compilers

At this time, MIPP has been tested on the following compilers:

  - Intel: `icpc` >= `16`,

  - GNU: `g++` >= `4.8`,

  - Clang: `clang++` >= `3.6`,

  - Microsoft: `msvc` >= `14`.

On `msvc` `14.10` (Microsoft Visual Studio 2017), performance is lower than with the other compilers because the compiler is not able to fully inline all the MIPP methods. This has been fixed in `msvc` `14.21` (Microsoft Visual Studio 2019), where you can expect high performance.

### Install and Configure your Code

You don't have to install MIPP because it consists of simple C++ header files. The headers are located in the `include` folder (note that this is the case as of commit `6795891`; before that they were located in the `src` folder). Just include the header in your source files wherever the wrapper is needed.

```cpp

#include "mipp.h"

```

`mipp.h` uses a C++ `namespace` named `mipp`. If you do not want to prefix all the MIPP calls with `mipp::`, you can do this:

```cpp

#include "mipp.h"

using namespace mipp;

```

Before compiling, tell the compiler which vector instructions you want to use. For instance, with the GNU compiler (`g++`) you simply have to add the `-march=native` option on SSE- and AVX-compatible CPUs. For ARMv7 CPUs with NEON instructions you have to add the `-mfpu=neon` option (since most current NEONv1 instructions are not IEEE-754 compliant). This is no longer the case on ARMv8 processors, so the `-march=native` option works there too. MIPP also uses some features of C++11, so you have to add the `-std=c++11` flag to compile the code. You are now ready to run your code with the MIPP wrapper.

When MIPP is installed on the system, it can be integrated into a CMake project in the standard way. For example:

```sh

# install MIPP

cd MIPP/

export MIPP_ROOT=$PWD/build/install

cmake -B build -DCMAKE_INSTALL_PREFIX=$MIPP_ROOT

cmake --build build -j5

cmake --install build

```

In your `CMakeLists.txt`:

```cmake

# find the installation of MIPP on the system

find_package(MIPP REQUIRED)

# define your executable

add_executable(gemm gemm.cpp)

# link your executable to MIPP

target_link_libraries(gemm PRIVATE MIPP::mipp)

```

```sh

cd your_project/

# if MIPP is installed in a system standard path: MIPP will be found automatically with cmake

cmake -B build

# if MIPP is installed in a non-standard path: use CMAKE_PREFIX_PATH

cmake -B build -DCMAKE_PREFIX_PATH=$MIPP_ROOT

```

#### Generate Sources & Compile the Static Library

MIPP is mainly a header-only library. However, some macro operations require compiling a small library. This is particularly true for the `compress` operation, which relies on generated LUTs (lookup tables) stored in the static library.

To generate the source files containing these LUTs you need to install Python 3 with the Jinja2 package:

```bash

sudo apt install python3 python3-pip

pip3 install --user -r codegen/requirements.txt

```

Then you can call the generator as follows:

```bash

python3 codegen/gen_compress.py

```

Finally, you can compile the MIPP static library:

```bash

cmake -B build -DMIPP_STATIC_LIB=ON

cmake --build build -j4

```

Note that **the compilation of the static library is optional**. If you choose not to compile it, only some macro operations will be missing.
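
To make the role of `compress` more concrete, here is a scalar reference sketch of what such an operation computes. It is not MIPP's implementation: the name `compress_ref`, the use of `std::array` in place of registers and masks, and the zero-filled tail are assumptions made for illustration. A vectorized version can replace the loop with a precomputed shuffle pattern for each possible mask value, which is one reason a LUT is useful.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference for a "compress" operation: the elements whose mask entry
// is true are packed, in order, at the beginning of the result.
// Assumption of this sketch: the unused tail of the result is zero-filled.
template <typename T, std::size_t N>
std::array<T, N> compress_ref(const std::array<T, N>& r, const std::array<bool, N>& m)
{
	std::array<T, N> out{}; // value-initialized: all elements are zero
	std::size_t k = 0;      // next free slot in the packed result
	for (std::size_t i = 0; i < N; i++)
		if (m[i])
			out[k++] = r[i];
	assert(k <= N);
	return out;
}
```

For example, compressing `{10, 20, 30, 40}` with the mask `{false, true, false, true}` yields `{20, 40, 0, 0}` in this sketch.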

### Sequential Mode

By default, MIPP tries to recognize the instruction set from the preprocessor definitions. If MIPP can't match the instruction set (for instance when MIPP does not support the targeted instruction set), it falls back on standard sequential instructions. In this mode, vectorization is no longer guaranteed but the compiler can still perform auto-vectorization.

It is possible to force MIPP to use the sequential mode with the following compiler definition: `-DMIPP_NO_INTRINSICS`. This can be useful for debugging or for benchmarking code.

If you want to check the MIPP mode configuration, you can print the following global variable: `mipp::InstructionFullType` (`std::string`).
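
The general idea of recognizing an instruction set from preprocessor definitions can be sketched as follows. This is only an illustration, not MIPP's actual detection logic: the function names, the returned labels and the `FORCE_SEQUENTIAL` switch are hypothetical. The tested macros (`__AVX512F__`, `__AVX__`, `__SSE2__`, `__ARM_FEATURE_SVE`, `__ARM_NEON`) are the ones predefined by GCC and Clang when the corresponding instructions are enabled.

```cpp
#include <cassert>
#include <string>

// Returns a label for the widest SIMD instruction set enabled at compile time.
// The labels are hypothetical and are not the strings reported by MIPP.
inline std::string detected_simd()
{
#if defined(FORCE_SEQUENTIAL) // hypothetical switch, similar in spirit to MIPP_NO_INTRINSICS
	return "sequential";
#elif defined(__AVX512F__)
	return "AVX-512";
#elif defined(__AVX__)
	return "AVX";
#elif defined(__SSE2__)
	return "SSE";
#elif defined(__ARM_FEATURE_SVE)
	return "SVE";
#elif defined(__ARM_NEON)
	return "NEON";
#else
	return "sequential"; // no known instruction set: fall back on scalar code
#endif
}

// Small self-check: whatever the target, a label is always returned.
inline void check_detected_simd()
{
	assert(!detected_simd().empty());
}
```

Compiling the same file with and without `-march=native` (or with `-DFORCE_SEQUENTIAL`) changes the returned label, which mirrors how a compile-time fallback to sequential code can work.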

### Vector Register Declaration

Just use the `mipp::Reg<T>` type.

```cpp
mipp::Reg<T> r1, r2, r3; // we have declared 3 vector registers
```

But we do not know the number of elements per register here. This number can be obtained by calling the `mipp::N<T>()` function (`T` is a template parameter, it can be `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t` or `uint8_t` type).

```cpp
for (int i = 0; i < n; i += mipp::N<float>()) {
	// ...
}
```

The number of elements per register directly depends on the precision of the data we are working on.
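
The relation between register width, data type and number of elements can be sketched with plain C++. The helper `elements_per_register` below is hypothetical and only illustrates the arithmetic; in MIPP the register width is deduced from the targeted instruction set and the result is what `mipp::N<T>()` is described as returning.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in a register of `register_bits` bits.
// `register_bits` is a parameter of this sketch (for instance 128 for SSE and
// NEON, 256 for AVX, 512 for AVX-512).
template <typename T>
std::size_t elements_per_register(std::size_t register_bits)
{
	assert(register_bits % (8 * sizeof(T)) == 0); // the register holds whole elements
	return register_bits / (8 * sizeof(T));
}
```

With a 128-bit register this gives 4 elements for `float`, 2 for `double` and 16 for `int8_t`.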

### Register `load` and `store` Instructions

Loading memory from a vector into a register:

```cpp
int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1;
r1.load(&myVector[i*mipp::N<float>()]);
```

The last two lines can be shortened as follows, where the `load` call becomes implicit:

```cpp
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];
```

Store can be done with the `store(...)` method:

```cpp
int n = mipp::N<float>() * 10;
std::vector<float> myVector(n);
int i = 0;
mipp::Reg<float> r1 = &myVector[i*mipp::N<float>()];
// do something with r1
r1.store(&myVector[(i+1)*mipp::N<float>()]);
```

By default the loads and stores work on **unaligned memory**. It is possible to control this behavior with the `-DMIPP_ALIGNED_LOADS` definition: when specified, the loads and stores work on **aligned memory** by default. In the **aligned memory** mode, it is still possible to perform unaligned memory operations with the `mipp::loadu` and `mipp::storeu` functions. However, it is not possible to perform aligned loads and stores in the **unaligned memory** mode.

To allocate aligned data you can use the MIPP aligned memory allocator wrapped in the `mipp::vector` class. `mipp::vector` is fully compatible with the standard `std::vector` class and can be used everywhere you can use `std::vector`.

```cpp
mipp::vector<float> myVector(n);
```
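
When debugging aligned loads and stores, it can help to check whether a pointer really is aligned. The helper below is a hypothetical, MIPP-independent sketch; the alignment values mentioned in the comment (16, 32 and 64 bytes for 128-, 256- and 512-bit registers) are the usual requirements of aligned x86 loads and are stated here as an assumption about the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the address `ptr` is a multiple of `alignment` bytes.
// `alignment` must be a power of two (typically 16, 32 or 64 for 128-, 256-
// or 512-bit registers).
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

For instance, `is_aligned(myVector.data(), 32)` tells whether the beginning of a buffer is suitable for a 256-bit aligned load.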

### Register Initialization

You can initialize a vector register from a scalar value:

```cpp

mipp::Reg<float> r1; // r1 = | unknown | unknown | unknown | unknown |

r1 = 1.0;            // r1 = |    +1.0 |    +1.0 |    +1.0 |    +1.0 |

```

Or from an initializer list (`std::initializer_list`):

```cpp

mipp::Reg<float> r1;       // r1 = | unknown | unknown | unknown | unknown |

r1 = {1.0, 2.0, 3.0, 4.0}; // r1 = |    +1.0 |    +2.0 |    +3.0 |    +4.0 |

```

### Computational Instructions

**Add** two vector registers:

```cpp

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |

r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 + r2; // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |

```

**Subtract** two vector registers:

```cpp

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |

r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 - r2; // r3 = | -1.0 | -1.0 | -1.0 | -1.0 |

```

**Multiply** two vector registers:

```cpp

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |

r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 * r2; // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |

```

**Divide** two vector registers:

```cpp

mipp::Reg<float> r1, r2, r3;

r1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |

r2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |

r3 = r1 / r2; // r3 = | +0.5 | +0.5 | +0.5 | +0.5 |

```

**Fused multiply and add** of three vector registers:

```cpp

mipp::Reg<float> r1, r2, r3, r4;

r1 = 2.0;                     // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |

r2 = 3.0;                     // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

r3 = 1.0;                     // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |

// r4 = (r1 * r2) + r3

r4 = mipp::fmadd(r1, r2, r3); // r4 = | +7.0 | +7.0 | +7.0 | +7.0 |

```

**Fused negative multiply and add** of three vector registers:

```cpp

mipp::Reg<float> r1, r2, r3, r4;

r1 = 2.0;                      // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |

r2 = 3.0;                      // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

r3 = 1.0;                      // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |

// r4 = -(r1 * r2) + r3

r4 = mipp::fnmadd(r1, r2, r3); // r4 = | -5.0 | -5.0 | -5.0 | -5.0 |

```
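
The semantics of the two fused operations above can be checked against a scalar reference built on `std::fma` from the standard library. The function names `fmadd_ref` and `fnmadd_ref` and the use of `std::array` in place of registers are assumptions of this sketch, which only mirrors the formulas given in the comments above.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// fmadd: out[i] = (a[i] * b[i]) + c[i], computed with a single rounding
template <std::size_t N>
std::array<float, N> fmadd_ref(const std::array<float, N>& a,
                               const std::array<float, N>& b,
                               const std::array<float, N>& c)
{
	std::array<float, N> out;
	for (std::size_t i = 0; i < N; i++)
		out[i] = std::fma(a[i], b[i], c[i]);
	return out;
}

// fnmadd: out[i] = -(a[i] * b[i]) + c[i], computed with a single rounding
template <std::size_t N>
std::array<float, N> fnmadd_ref(const std::array<float, N>& a,
                                const std::array<float, N>& b,
                                const std::array<float, N>& c)
{
	std::array<float, N> out;
	for (std::size_t i = 0; i < N; i++)
		out[i] = std::fma(-a[i], b[i], c[i]);
	return out;
}
```

With `a = 2`, `b = 3` and `c = 1` in every element, `fmadd_ref` gives `7` and `fnmadd_ref` gives `-5`, matching the register contents shown above.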

**Square root** of a vector register:

```cpp

mipp::Reg<float> r1, r2;

r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |

r2 = mipp::sqrt(r1);  // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

```

**Reciprocal square root** of a vector register (be careful: this intrinsic exists only for single-precision floating-point numbers):

```cpp

mipp::Reg<float> r1, r2;

r1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |

r2 = mipp::rsqrt(r1); // r2 = | +0.3 | +0.3 | +0.3 | +0.3 |

```

### Selections

Select the **minimum** between two vector registers:

```cpp

mipp::Reg<float> r1, r2, r3;

r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |

r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

r3 = mipp::min(r1, r2); // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |

```

Select the **maximum** between two vector registers:

```cpp

mipp::Reg<float> r1, r2, r3;

r1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |

r2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |

r3 = mipp::max(r1, r2); // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |

```

### Permutations

The `rrot(...)` method allows you to perform a **right rotation** (a cyclic 

permutation) of the elements inside the register:

```cpp

mipp::Reg<float> r1, r2;

r1 = {3.0, 2.0, 1.0, 0.0}; // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |

r2 = mipp::rrot(r1);       // r2 = | +0.0 | +3.0 | +2.0 | +1.0 |

r1 = mipp::rrot(r2);       // r1 = | +1.0 | +0.0 | +3.0 | +2.0 |

r2 = mipp::rrot(r1);       // r2 = | +2.0 | +1.0 | +0.0 | +3.0 |

r1 = mipp::rrot(r2);       // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |

```
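
The element movement performed by the right rotation above can be reproduced on a plain array with `std::rotate`. This is a scalar model for illustration only: `rrot_ref` is a hypothetical name and `std::array` stands in for a register.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Right rotation by one position: the last element moves to the front and
// every other element shifts one slot to the right.
template <typename T, std::size_t N>
std::array<T, N> rrot_ref(std::array<T, N> r)
{
	static_assert(N > 0, "the register must hold at least one element");
	std::rotate(r.begin(), r.end() - 1, r.end());
	return r;
}
```

Applied to `{3, 2, 1, 0}` this gives `{0, 3, 2, 1}`, and applying it `N` times returns the initial contents, as in the sequence of `rrot` calls above.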

Of course there are many more available instructions in the MIPP wrapper and you 

can find these instructions at the [end of this page](#list-of-mipp-functions).

### Addition of Two Vectors

```cpp

#include <cstdlib> // rand()
#include "mipp.h"

int main()
{
	// data allocation
	const int n = 32000; // size of the vA, vB, vC vectors
	mipp::vector<float> vA(n); // in
	mipp::vector<float> vB(n); // in
	mipp::vector<float> vC(n); // out

	// data initialization
	for (int i = 0; i < n; i++) vA[i] = rand() % 10;
	for (int i = 0; i < n; i++) vB[i] = rand() % 10;

	// declare 3 vector registers
	mipp::Reg<float> rA, rB, rC;

	// compute rC with the MIPP vectorized functions
	for (int i = 0; i < n; i += mipp::N<float>()) {
		rA.load(&vA[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS
		rB.load(&vB[i]); // macro definition to force aligned loads and stores).
		rC = rA + rB;
		rC.store(&vC[i]);
	}

	return 0;
}

```

### Vectorizing an Existing Code

#### Scalar Code

```cpp

// ...

for (int i = 0; i < n; i++) {

	out[i] = 0.75f * in1[i] * std::exp(in2[i]);

}

// ...

```

#### Vectorized Code

```cpp

// ...

// compute the vectorized loop size which is a multiple of 'mipp::N<float>()'.

auto vecLoopSize = (n / mipp::N<float>()) * mipp::N<float>();

mipp::Reg<float> rout, rin1, rin2;

for (int i = 0; i < vecLoopSize; i += mipp::N<float>()) {

	rin1.load(&in1[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS

	rin2.load(&in2[i]); // macro definition to force aligned loads and stores).

	// the '0.75f' constant will be broadcast in a vector but it has to be at

	// the right of a 'mipp::Reg', this is why it has been moved at the right

	// of the 'rin1' register. Notice that 'std::exp' has been replaced by

	// 'mipp::exp'.

	rout = rin1 * 0.75f * mipp::exp(rin2);

	rout.store(&out[i]);

}

// scalar tail loop: compute the remaining elements that can't be vectorized.

for (int i = vecLoopSize; i < n; i++) {

	out[i] = 0.75f * in1[i] * std::exp(in2[i]);

}

// ...

```
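
The loop-splitting arithmetic used above (a main loop over full blocks followed by a scalar tail loop) does not depend on MIPP and can be tested in isolation. In the sketch below, the function `scaled_exp` and its `lanes` parameter are hypothetical: `lanes` plays the role of the number of elements per register, and the inner loop stands in for one SIMD operation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Computes out[i] = 0.75f * in1[i] * exp(in2[i]) with a blocked main loop
// followed by a scalar tail loop.
inline void scaled_exp(const std::vector<float>& in1, const std::vector<float>& in2,
                       std::vector<float>& out, std::size_t lanes)
{
	assert(lanes > 0 && in1.size() == in2.size() && in1.size() == out.size());
	const std::size_t n = in1.size();
	const std::size_t vecLoopSize = (n / lanes) * lanes; // largest multiple of `lanes` <= n

	// main loop: each iteration handles one full block of `lanes` elements
	for (std::size_t i = 0; i < vecLoopSize; i += lanes)
		for (std::size_t j = 0; j < lanes; j++) // stands in for one SIMD operation
			out[i + j] = 0.75f * in1[i + j] * std::exp(in2[i + j]);

	// tail loop: the (n % lanes) remaining elements
	for (std::size_t i = vecLoopSize; i < n; i++)
		out[i] = 0.75f * in1[i] * std::exp(in2[i]);
}
```

With `n = 10` and `lanes = 4`, the main loop covers indices 0 to 7 and the tail loop covers indices 8 and 9, so every element is written exactly once.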

### Masked Instructions

MIPP comes with two generic, templatized masked functions (`mask` and `maskz`). These functions allow you to benefit from the AVX-512 and SVE masked instructions. The `mask` and `maskz` functions are backward compatible with older instruction sets.

```cpp

mipp::Reg<        float   > ZMM1 = {   40,  -30,    60,    80};
mipp::Reg<        float   > ZMM2 = 0.1; // broadcast
mipp::Msk<mipp::N<float>()> k1   = {false, true, false, false};

// ZMM3 = k1 ? ZMM1 * ZMM2 : ZMM1;
auto ZMM3 = mipp::mask<float,mipp::mul>(k1, ZMM1, ZMM1, ZMM2);
std::cout << ZMM3 << std::endl; // output: "[40, -3, 60, 80]"

// ZMM4 = k1 ? ZMM1 * ZMM2 : 0;
auto ZMM4 = mipp::maskz<float,mipp::mul>(k1, ZMM1, ZMM2);
std::cout << ZMM4 << std::endl; // output: "[0, -3, 0, 0]"

```
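
The selection rule described in the comments above (`k1 ? ZMM1 * ZMM2 : ZMM1` for `mask`, `k1 ? ZMM1 * ZMM2 : 0` for `maskz`) can be written as a scalar reference. This sketch is specialized to a multiplication and uses hypothetical names (`mask_mul_ref`, `maskz_mul_ref`) with `std::array` in place of registers and masks; it is not how MIPP implements masking.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// mask: out[i] = m[i] ? a[i] * b[i] : src[i]
template <std::size_t N>
std::array<float, N> mask_mul_ref(const std::array<bool, N>& m,
                                  const std::array<float, N>& src,
                                  const std::array<float, N>& a,
                                  const std::array<float, N>& b)
{
	std::array<float, N> out;
	for (std::size_t i = 0; i < N; i++)
		out[i] = m[i] ? a[i] * b[i] : src[i];
	return out;
}

// maskz: out[i] = m[i] ? a[i] * b[i] : 0
template <std::size_t N>
std::array<float, N> maskz_mul_ref(const std::array<bool, N>& m,
                                   const std::array<float, N>& a,
                                   const std::array<float, N>& b)
{
	std::array<float, N> zero{}; // value-initialized: all elements are 0
	return mask_mul_ref(m, zero, a, b);
}
```

Here `maskz` is expressed as `mask` with an all-zero source, which makes the relation between the two functions explicit.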

## List of MIPP Functions

This section presents an exhaustive list of all the available functions in MIPP. Of course the MIPP wrapper does not cover all the possible intrinsics of each instruction set, but it tries to give you the most important and useful ones.

In the following tables, `T`, `T1` and `T2` stand for data types (`double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t` or `uint8_t`). `N` stands for the number of elements in a mask or in a register. `N` is a strictly positive integer and can easily be deduced from the data type: `constexpr int N = mipp::N<T>()`. When `T` and `N` are mixed in a prototype, `N` has to satisfy the previous constraint (`N = mipp::N<T>()`).

Some terms used in the documentation need to be clarified:

  - **register element**: a SIMD register is composed of multiple scalar elements; those elements are built-in data types (`double`, `float`, `int64_t`, ...),
  - **register lane**: modern instruction sets can have multiple implicit sub-parts in an entire SIMD register; those sub-parts are called lanes (SSE has one lane of 128 bits, AVX has two lanes of 128 bits, AVX-512 has four lanes of 128 bits).

### Memory Operations

| **Short name** | **Prototype** | **Documentation** | **Supported types** |
| :--- | :--- | :--- | :--- |
| `load` | `Reg<T> load(const T* mem)` | Loads aligned data from `mem` to a register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `loadu` | `Reg<T> loadu(const T* mem)` | Loads unaligned data from `mem` to a register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `store` | `void store(T* mem, const Reg<T> r)` | Stores the `r` register in the `mem` aligned data. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `storeu` | `void storeu(T* mem, const Reg<T> r)` | Stores the `r` register in the `mem` unaligned data. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `maskzld` | `Reg<T> maskzld(const Msk<N> m, const T* mem)` | Loads elements according to the mask `m` (puts zero when the mask value is false). | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `maskzlds` | `Reg<T> maskzlds(const Msk<N> m, const T* mem)` | Loads elements according to the mask `m` (puts zero when the mask value is false). Safe version, only reads masked elements in memory. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `maskst` | `void maskst(const Msk<N> m, T* mem, const Reg<T> r)` | Stores elements from the `r` register according to the mask `m` in the `mem` memory. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `masksts` | `void masksts(const Msk<N> m, T* mem, const Reg<T> r)` | Stores elements from the `r` register according to the mask `m` in the `mem` memory. Safe version, only writes masked elements in memory. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `gather` | `Reg<TD> gather<TD,TI>(const TD* mem, const Reg<TI> idx)` | Gathers elements from `mem` to a register. Selects elements according to the indices in `idx`. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `scatter` | `void scatter<TD,TI>(TD* mem, const Reg<TI> idx, const Reg<TD> r)` | Scatters elements into `mem` from the `r` register. Writes elements at the `idx` indices in `mem`. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `maskzgat` | `Reg<TD> gather<TD,TI>(const Msk<N> m, const TD* mem, const Reg<TI> idx)` | Gathers elements from `mem` to a register (according to the mask `m`). Selects elements according to the indices in `idx` (puts zero when the mask value is false). | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `masksca` | `void scatter<TD,TI>(const Msk<N> m, TD* mem, const Reg<TI> idx, const Reg<TD> r)` | Scatters elements into `mem` from the `r` register (according to the mask `m`). Writes elements at the `idx` indices in `mem`. | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t` |
| `set` | `Reg<T> set(const T[N] vals)` | Sets a register from the values in `vals`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set` | `Msk<N> set(const bool[N] bits)` | Sets a mask from the bits in `bits`. | |
| `set1` | `Reg<T> set1(const T val)` | Broadcasts `val` in a register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set1` | `Msk<N> set1(const bool bit)` | Broadcasts `bit` in a mask. | |
| `set0` | `Reg<T> set0()` | Initializes a register to zero. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `set0` | `Msk<N> set0()` | Initializes a mask to false. | |
| `get` | `T get(const Reg<T> r, const size_t index)` | Gets a specific element from the register `r` at the `index` position. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `get` | `T get(const Reg_2<T> r, const size_t index)` | Gets a specific element from the register `r` at the `index` position. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `get` | `bool get(const Msk<N> m, const size_t index)` | Gets a specific element from the register `m` at the `index` position. | |
| `getfirst` | `T getfirst(const Reg<T> r)` | Gets the first element from the register `r`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `getfirst` | `T getfirst(const Reg_2<T> r)` | Gets the first element from the register `r`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `getfirst` | `bool getfirst(const Msk<N> m)` | Gets the first element from the register `m`. | |
| `low` | `Reg_2<T> low(const Reg<T> r)` | Gets the low part of the `r` register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `high` | `Reg_2<T> high(const Reg<T> r)` | Gets the high part of the `r` register. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `combine` | `Reg<T> combine(const Reg_2<T> r1, const Reg_2<T> r2)` | Combine two half registers in a full register, `r1` will be the low part and `r2` the high part. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `combine` | `Reg<T> combine<S>(const Reg<T> r1, const Reg<T> r2)` | `S` elements of `r1` are shifted to the left, `(S - N) + N` elements of `r2` are shifted to the right. Shifted `r1` and `r2` are combined to give the result. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `compress` | `Reg<T> compress(const Reg<T> r1, const Msk<N> m)` | Pack the elements of `r1` at the beginning of the register according to the bitmask `m` (if the bit is 1 then element is picked, otherwise it is not). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask` | `Reg<T> cmask(const uint32_t[N] ids)` | Creates a cmask from an indexes list (indexes have to be between 0 and N-1). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask2` | `Reg<T> cmask2(const uint32_t[N/2] ids)` | Creates a cmask2 from an indexes list (indexes have to be between 0 and (N/2)-1). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `cmask4` | `Reg<T> cmask4(const uint32_t[N/4] ids)` | Creates a cmask4 from an indexes list (indexes have to be between 0 and (N/4)-1). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff` | `Reg<T> shuff(const Reg<T> r, const Reg<T> cm)` | Shuffles the elements of `r` according to the cmask `cm`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff2` | `Reg<T> shuff2(const Reg<T> r, const Reg<T> cm2)` | Shuffles the elements of `r` according to the cmask2 `cm2` (same shuffle is applied in both lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `shuff4` | `Reg<T> shuff4(const Reg<T> r, const Reg<T> cm4)` | Shuffles the elements of `r` according to the cmask4 `cm4` (same shuffle is applied in the four lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave` | `Regx2<T> interleave(const Reg<T> r1, const Reg<T> r2)` | Interleaves `r1` and `r2` : `[r1_1, r2_1, r1_2, r2_2, ..., r1_n, r2_n]`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `deinterleave` | `Regx2<T> deinterleave(const Reg<T> r1, const Reg<T> r2)` | Reverts the previous defined interleave operation. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave2` | `Regx2<T> interleave2(const Reg<T> r1, const Reg<T> r2)` | Interleaves `r1` and `r2` considering two lanes. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleave4` | `Regx2<T> interleave4(const Reg<T> r1, const Reg<T> r2)` | Interleaves `r1` and `r2` considering four lanes. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo` | `Reg<T> interleavelo(const Reg<T> r1, const Reg<T> r2)` | Interleaves the low part of `r1` with the low part of `r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo2` | `Reg<T> interleavelo2(const Reg<T> r1, const Reg<T> r2)` | Interleaves the low part of `r1` with the low part of `r2` (considering two lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavelo4` | `Reg<T> interleavelo4(const Reg<T> r1, const Reg<T> r2)` | Interleaves the low part of `r1` with the low part of `r2` (considering four lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi` | `Reg<T> interleavehi(const Reg<T> r1, const Reg<T> r2)` | Interleaves the high part of `r1` with the high part of `r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi2` | `Reg<T> interleavehi2(const Reg<T> r1, const Reg<T> r2)` | Interleaves the high part of `r1` with the high part of `r2` (considering two lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `interleavehi4` | `Reg<T> interleavehi4(const Reg<T> r1, const Reg<T> r2)` | Interleaves the high part of `r1` with the high part of `r2` (considering four lanes). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `lrot` | `Reg<T> lrot(const Reg<T> r)` | Rotates the `r` register from the left (cyclic permutation). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
| `rrot` | `Reg<T> rrot(const Reg<T> r)` | Rotates the `r` register from the right (cyclic permutation). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `blend`         | `Reg   blend         (const Reg r1, const Reg r2, const Msk m)`            | Combines the `r1` and `r2` registers according to the `m` mask values (`m_i ? r1_i : r2_i`).                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `select`        | `Reg   select        (const Msk m, const Reg r1, const Reg r2)`            | Alias for the previous `blend` function, with the mask as the first parameter.                                                                                      | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
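
The `interleave` and `blend` rows above can be read as the following scalar model. This is only a sketch of the documented element order, not MIPP code: `ScalarReg`, `ScalarMsk` and the `*_model` functions are hypothetical names, and storing the first half of the interleaved sequence in the first returned register is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical scalar stand-ins for a register of N elements and its mask.
template <typename T, std::size_t N>
using ScalarReg = std::array<T, N>;
template <std::size_t N>
using ScalarMsk = std::array<bool, N>;

// Builds [r1_1, r2_1, r1_2, r2_2, ..., r1_n, r2_n] and splits the 2n
// elements over two registers (assumed order: first half, then second half).
template <typename T, std::size_t N>
std::pair<ScalarReg<T, N>, ScalarReg<T, N>>
interleave_model(const ScalarReg<T, N>& r1, const ScalarReg<T, N>& r2)
{
    std::array<T, 2 * N> seq{};
    for (std::size_t i = 0; i < N; i++) {
        seq[2 * i]     = r1[i];
        seq[2 * i + 1] = r2[i];
    }
    ScalarReg<T, N> first{}, second{};
    for (std::size_t i = 0; i < N; i++) {
        first[i]  = seq[i];
        second[i] = seq[N + i];
    }
    return {first, second};
}

// Element-wise selection: m_i ? r1_i : r2_i.
template <typename T, std::size_t N>
ScalarReg<T, N> blend_model(const ScalarReg<T, N>& r1,
                            const ScalarReg<T, N>& r2,
                            const ScalarMsk<N>& m)
{
    ScalarReg<T, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[i] = m[i] ? r1[i] : r2[i];
    return out;
}

// Same operation as blend_model, with the mask passed first.
template <typename T, std::size_t N>
ScalarReg<T, N> select_model(const ScalarMsk<N>& m,
                             const ScalarReg<T, N>& r1,
                             const ScalarReg<T, N>& r2)
{
    return blend_model(r1, r2, m);
}

// Small self-check of the two argument orders.
inline void shuffle_models_example()
{
    const ScalarReg<int, 4> a = {{0, 1, 2, 3}};
    const ScalarReg<int, 4> b = {{10, 11, 12, 13}};
    const ScalarMsk<4> m = {{true, false, true, false}};
    assert(blend_model(a, b, m) == select_model(m, a, b));
}
```

With `a = [0, 1, 2, 3]` and `b = [10, 11, 12, 13]`, the model returns `[0, 10, 1, 11]` and `[2, 12, 3, 13]` for the interleave, and `[0, 11, 2, 13]` for a blend with the mask `[true, false, true, false]`.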

### Bitwise Operations

The `pipe` keyword stands for the "|" binary operator.

| **Short name** | **Operator**       | **Prototype**                                       | **Documentation**                             | **Supported types**                                                                                         |

| :---           | :---               | :---                                                | :---                                          | :---                                                                                                        |

| `andb`         | `&` and `&=`       | `Reg andb    (const Reg r1, const Reg r2)` | Computes the bitwise AND: `r1 & r2`.          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `andb`         | `&` and `&=`       | `Msk andb    (const Msk m1, const Msk m2)` | Computes the bitwise AND: `m1 & m2`.          |                                                                                                             |

| `andnb`        |                    | `Reg andnb   (const Reg r1, const Reg r2)` | Computes the bitwise AND NOT: `(~r1) & r2`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `andnb`        |                    | `Msk andnb   (const Msk m1, const Msk m2)` | Computes the bitwise AND NOT: `(~m1) & m2`.   |                                                                                                             |

| `orb`          | `pipe` and `pipe=` | `Reg orb     (const Reg r1, const Reg r2)` | Computes the bitwise OR: `r1 pipe r2`.        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `orb`          | `pipe` and `pipe=` | `Msk orb     (const Msk m1, const Msk m2)` | Computes the bitwise OR: `m1 pipe m2`.        |                                                                                                             |

| `xorb`         | `^` and `^=`       | `Reg xorb    (const Reg r1, const Reg r2)` | Computes the bitwise XOR: `r1 ^ r2`.          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `xorb`         | `^` and `^=`       | `Msk xorb    (const Msk m1, const Msk m2)` | Computes the bitwise XOR: `m1 ^ m2`.          |                                                                                                             |

| `lshift`       | `<<` and `<<=`     | `Reg lshift  (const Reg r, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `r << n`.    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `lshiftr`      | `<<` and `<<=`     | `Reg lshiftr (const Reg r1, const Reg r2)` | Computes the bitwise LEFT SHIFT: `r1 << r2`.  | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                    |

| `lshift`       | `<<` and `<<=`     | `Msk lshift  (const Msk m, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `m << n`.    |                                                                                                             |

| `rshift`       | `>>` and `>>=`     | `Reg rshift  (const Reg r, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `r >> n`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `rshiftr`      | `>>` and `>>=`     | `Reg rshiftr (const Reg r1, const Reg r2)` | Computes the bitwise RIGHT SHIFT: `r1 >> r2`. | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                    |

| `rshift`       | `>>` and `>>=`     | `Msk rshift  (const Msk m, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `m >> n`.   |                                                                                                             |

| `notb`         | `~`                | `Reg notb    (const Reg r)`                   | Computes the bitwise NOT: `~r`.               | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `notb`         | `~`                | `Msk notb    (const Msk m)`                   | Computes the bitwise NOT: `~m`.               |                                                                                                             |
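
Two points in the table above are easy to misread: `andnb` negates its first operand, and the bitwise operations are also listed for `float` and `double`. The scalar sketch below models one element, assuming that a bitwise operation on a floating-point element acts on its raw bit pattern and that `float` is a 32-bit IEEE-754 value. The helper names are hypothetical and are not part of MIPP.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Reinterpret the bits of a float as an integer, and back, without
// undefined behaviour.
inline std::uint32_t to_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float from_bits(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Scalar model of `andnb` on one element: (~a) & b.
// The FIRST operand is the one that is negated.
inline std::uint32_t andnb_model(std::uint32_t a, std::uint32_t b)
{
    return (~a) & b;
}

// Clearing the sign bit with AND NOT yields the absolute value
// (IEEE-754 assumption).
inline float abs_via_andnb(float x)
{
    const std::uint32_t sign_mask = 0x80000000u;
    return from_bits(andnb_model(sign_mask, to_bits(x)));
}

// Small self-check: swapping the operands changes the result.
inline void andnb_example()
{
    assert(andnb_model(0x0Fu, 0xFFu) == 0xF0u);
    assert(andnb_model(0xFFu, 0x0Fu) == 0x00u);
}
```

Swapping the two operands of `andnb_model` changes the result, which is why the operand order in the prototype matters.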

### Logical Comparisons

| **Short name** | **Operator** | **Prototype**                                      | **Documentation**                             | **Supported types**                                                                                         |

| :---           | :---         | :---                                               | :---                                          | :---                                                                                                        |

| `cmpeq`        | `==`         | `Msk cmpeq  (const Reg r1, const Reg r2)` | Compares if equal to: `r1 == r2`.             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `cmpneq`       | `!=`         | `Msk cmpneq (const Reg r1, const Reg r2)` | Compares if not equal to: `r1 != r2`.         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `cmpge`        | `>=`         | `Msk cmpge  (const Reg r1, const Reg r2)` | Compares if greater or equal to: `r1 >= r2`.  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `cmpgt`        | `>`          | `Msk cmpgt  (const Reg r1, const Reg r2)` | Compares if strictly greater than: `r1 > r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `cmple`        | `<=`         | `Msk cmple  (const Reg r1, const Reg r2)` | Compares if lower or equal to: `r1 <= r2`.    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `cmplt`        | `<`          | `Msk cmplt  (const Reg r1, const Reg r2)` | Compares if strictly lower than: `r1 < r2`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |
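
Each comparison returns a mask (`Msk`) rather than a register, which makes it a natural input for `blend`. The scalar sketch below shows that pattern by replacing negative elements with zero without a per-element branch in the calling code. It is a model under assumed semantics, not MIPP code, and `cmplt_model`, `blend_model` and `clamp_negative_to_zero` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar model of `cmplt`: one boolean per element, true where r1_i < r2_i.
template <typename T, std::size_t N>
std::array<bool, N> cmplt_model(const std::array<T, N>& r1,
                                const std::array<T, N>& r2)
{
    std::array<bool, N> m{};
    for (std::size_t i = 0; i < N; i++)
        m[i] = r1[i] < r2[i];
    return m;
}

// Scalar model of `blend`: m_i ? r1_i : r2_i.
template <typename T, std::size_t N>
std::array<T, N> blend_model(const std::array<T, N>& r1,
                             const std::array<T, N>& r2,
                             const std::array<bool, N>& m)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; i++)
        out[i] = m[i] ? r1[i] : r2[i];
    return out;
}

// Compare against zero, then use the mask to pick 0 or the original value.
template <typename T, std::size_t N>
std::array<T, N> clamp_negative_to_zero(const std::array<T, N>& r)
{
    const std::array<T, N> zero{};  // value-initialized: every element is 0
    const std::array<bool, N> is_negative = cmplt_model(r, zero);
    return blend_model(zero, r, is_negative);
}

// Small self-check on a two-element input.
inline void compare_then_blend_example()
{
    const std::array<int, 2> in = {{-4, 7}};
    const std::array<int, 2> expected = {{0, 7}};
    assert(clamp_negative_to_zero(in) == expected);
}
```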

### Conversions and Packing

| **Short name** | **Prototype**                                        | **Documentation**                                                                                                                                                                                                                                                                                                                                   | **Supported types**                                                                                                                                                    |

| :---           | :---                                                 | :---                                                                                                                                                                                                                                                                                                                                                | :---                                                                                                                                                                   |

| `toReg`        | `Reg  toReg (const Msk m)`                     | Converts the mask `m` into a register of type `T`; the number of elements `N` must be the same for the mask and the register. If a mask element is `false`, all the bits of the corresponding register element are set to 0; if it is `true`, all the bits are set to 1 (be careful: for floating-point types, this all-ones bit pattern is interpreted as NaN!). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                                                            |

| `cvt`          | `Reg cvt   (const Reg r)`                    | Converts the elements of `r` into another representation (the new representation and the original one must have the same size).                                                                                                                                                                                                                     | `float -> int32_t`, `float -> uint32_t`, `int32_t -> float`, `uint32_t -> float`, `double -> int64_t`, `double -> uint64_t`, `int64_t -> double`, `uint64_t -> double` |

| `cvt`          | `Reg cvt   (const Reg_2 r)`                  | Converts elements of `r` into bigger elements (in bits).                                                                                                                                                                                                                                                                                            | `int8_t -> int16_t`, `uint8_t -> uint16_t`, `int16_t -> int32_t`, `uint16_t -> uint32_t`, `int32_t -> int64_t`, `uint32_t -> uint64_t`                                 |

| `pack`         | `Reg pack  (const Reg r1, const Reg r2)` | Packs elements of `r1` and `r2` into smaller elements (some information can be lost in the conversion).                                                                                                                                                                                                                                             | `int32_t -> int16_t`, `uint32_t -> uint16_t`, `int16_t -> int8_t`, `uint16_t -> uint8_t`                                                                               |
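
The NaN warning in the `toReg` row follows directly from the all-ones bit pattern. The scalar sketch below models the conversion of a single mask element. `element_from_mask` is a hypothetical helper, not a MIPP function, and the sketch assumes IEEE-754 floating-point types and two's complement integers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of `toReg` for ONE element: every bit of the result is 1
// when the mask element is true, and 0 when it is false.
template <typename T>
T element_from_mask(bool m)
{
    static_assert(sizeof(T) <= sizeof(std::uint64_t),
                  "element types are at most 64 bits wide");
    const std::uint64_t bits = m ? ~std::uint64_t(0) : std::uint64_t(0);
    T out;
    // All source bytes are identical, so copying the first sizeof(T)
    // bytes does not depend on endianness.
    std::memcpy(&out, &bits, sizeof(T));
    return out;
}

// Small self-check: a NaN never compares equal to itself, so a `true`
// element of a float register cannot be tested with `==`.
inline void to_reg_example()
{
    const float t = element_from_mask<float>(true);
    assert(t != t);
    assert(element_from_mask<float>(false) == 0.0f);
}
```

In this model a `true` element becomes `-1` for `int32_t`, `255` for `uint8_t`, and NaN for `float` and `double`.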

### Arithmetic Operations

| **Short name** | **Operator** | **Prototype**                                                       | **Documentation**                                                                                   | **Supported types**                                                                                         |

| :---           | :---         | :---                                                                | :---                                                                                                | :---                                                                                                        |

| `add`          | `+` and `+=` | `Reg add    (const Reg r1, const Reg r2)`                  | Performs the arithmetic addition: `r1 + r2`.                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `sub`          | `-` and `-=` | `Reg sub    (const Reg r1, const Reg r2)`                  | Performs the arithmetic subtraction: `r1 - r2`.                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `mul`          | `*` and `*=` | `Reg mul    (const Reg r1, const Reg r2)`                  | Performs the arithmetic multiplication: `r1 * r2`.                                                  | `double`, `float`, `int32_t`, `int16_t`, `int8_t`                                                           |

| `div`          | `/` and `/=` | `Reg div    (const Reg r1, const Reg r2)`                  | Performs the arithmetic division: `r1 / r2`.                                                        | `double`, `float`                                                                                           |

| `fmadd`        |              | `Reg fmadd  (const Reg r1, const Reg r2, const Reg r3)` | Performs the fused multiplication and addition: `r1 * r2 + r3`.                                     | `double`, `float`                                                                                           |

| `fnmadd`       |              | `Reg fnmadd (const Reg r1, const Reg r2, const Reg r3)` | Performs the negative fused multiplication and addition: `-(r1 * r2) + r3`.                         | `double`, `float`                                                                                           |

| `fmsub`        |              | `Reg fmsub  (const Reg r1, const Reg r2, const Reg r3)` | Performs the fused multiplication and subtraction: `r1 * r2 - r3`.                                  | `double`, `float`                                                                                           |

| `fnmsub`       |              | `Reg fnmsub (const Reg r1, const Reg r2, const Reg r3)` | Performs the negative fused multiplication and subtraction: `-(r1 * r2) - r3`.                      | `double`, `float`                                                                                           |

| `min`          |              | `Reg min    (const Reg r1, const Reg r2)`                  | Selects the minimum: `r1_i < r2_i ? r1_i : r2_i`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `max`          |              | `Reg max    (const Reg r1, const Reg r2)`                  | Selects the maximum: `r1_i > r2_i ? r1_i : r2_i`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `div2`         |              | `Reg div2   (const Reg r)`                                    | Performs the arithmetic division by two: `r / 2`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `div4`         |              | `Reg div4   (const Reg r)`                                    | Performs the arithmetic division by four: `r / 4`.                                                  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |

| `abs`          |              | `Reg