{"id":13435265,"url":"https://github.com/aff3ct/MIPP","last_synced_at":"2025-03-18T02:31:32.885Z","repository":{"id":36073305,"uuid":"95239416","full_name":"aff3ct/MIPP","owner":"aff3ct","description":"MIPP is a portable wrapper for SIMD instructions written in C++11. It supports NEON, SSE, AVX, AVX-512 and SVE (length specific).","archived":false,"fork":false,"pushed_at":"2024-05-18T08:51:13.000Z","size":2112,"stargazers_count":465,"open_issues_count":16,"forks_count":86,"subscribers_count":23,"default_branch":"master","last_synced_at":"2024-05-18T09:38:39.651Z","etag":null,"topics":["avx","avx-512","neon","portable","simd","sse","sve","vector","wrapper"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aff3ct.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-23T16:56:44.000Z","updated_at":"2024-07-31T04:34:21.154Z","dependencies_parsed_at":"2023-12-18T16:45:03.419Z","dependency_job_id":"b8763c63-b496-4f73-bed9-446d572b4bf2","html_url":"https://github.com/aff3ct/MIPP","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aff3ct%2FMIPP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aff3ct%2FMIPP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aff3ct%2FMIPP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aff3ct%2FMIPP/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/aff3ct","download_url":"https://codeload.github.com/aff3ct/MIPP/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221704701,"owners_count":16866815,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avx","avx-512","neon","portable","simd","sse","sve","vector","wrapper"],"created_at":"2024-07-31T03:00:34.341Z","updated_at":"2025-03-18T02:31:32.875Z","avatar_url":"https://github.com/aff3ct.png","language":"C++","readme":"# MyIntrinsics++ (MIPP)\n\n[![pipeline status](https://gitlab.com/aff3ct/MIPP/badges/master/pipeline.svg)](https://gitlab.com/aff3ct/MIPP/pipelines)\n[![coverage report](https://gitlab.com/aff3ct/MIPP/badges/master/coverage.svg)](https://aff3ct.gitlab.io/MIPP/)\n\n![](mipp.jpg)\n\n## Purpose\n\nMIPP is a portable and Open-source wrapper (MIT license) for vector intrinsic\nfunctions (SIMD) written in C++11. It works for SSE, AVX, AVX-512, ARM NEON\nand SVE (work in progress) instructions. MIPP wrapper supports simple/double\nprecision floating-point numbers and also signed/unsigned integer arithmetic\n(64-bit, 32-bit, 16-bit and 8-bit).\n\nWith the MIPP wrapper you do not need to write a specific intrinsic code\nanymore. 
Just use the provided functions and the wrapper will automatically\ngenerate the right intrinsic calls for your specific architecture.\n\nIf you are interested in the ARM SVE development status, \n[please follow this link](#arm-sve).\n\n## Short Documentation\n\n### Supported Compilers\n\nAt this time, MIPP has been tested on the following compilers:\n\n  - Intel: `icpc` \u003e= `16`,\n  - GNU: `g++` \u003e= `4.8`,\n  - Clang: `clang++` \u003e= `3.6`,\n  - Microsoft: `msvc` \u003e= `14`.\n\nOn `msvc` `14.10` (Microsoft Visual Studio 2017), performance is reduced \ncompared to the other compilers because the compiler is not able to fully inline all \nthe MIPP methods. This has been fixed in `msvc` `14.21` (Microsoft Visual Studio \n2019) and you can now expect high performance.\n\n### Install and Configure your Code\n\nYou don't have to install MIPP because it is a simple C++ header file. The \nheaders are located in the `include` folder (note that this location has changed \nsince commit `6795891`; before that they were located in the `src` folder). \n\nJust include the header in your source files when the wrapper is needed.\n\n```cpp\n#include \"mipp.h\"\n```\n\n`mipp.h` uses a C++ `namespace`: `mipp`. If you do not want to prefix all the MIPP \ncalls with `mipp::`, you can do this:\n\n```cpp\n#include \"mipp.h\"\nusing namespace mipp;\n```\n\nBefore trying to compile, remember to tell the compiler what kind of vector \ninstructions you want to use. For instance, if you are using the GNU compiler \n(`g++`) you simply have to add the `-march=native` option for SSE- and AVX-compatible \nCPUs. For ARMv7 CPUs with NEON instructions you have to add the \n`-mfpu=neon` option (since most of the current NEONv1 instructions are not IEEE-754 \ncompliant). However, this is no longer the case on ARMv8 processors, so the\n`-march=native` option will work too. MIPP also uses some features provided \nby C++11, so the `-std=c++11` flag has to be added to compile the code. 
\nYou are now ready to run your code with the MIPP wrapper.\n\nIf MIPP is installed on the system, it can be integrated into a\nCMake project in the standard way. For example:\n```sh\n# install MIPP\ncd MIPP/\nexport MIPP_ROOT=$PWD/build/install\ncmake -B build -DCMAKE_INSTALL_PREFIX=$MIPP_ROOT\ncmake --build build -j5\ncmake --install build\n```\n\nIn your `CMakeLists.txt`:\n```cmake\n# find the installation of MIPP on the system\nfind_package(MIPP REQUIRED)\n\n# define your executable\nadd_executable(gemm gemm.cpp)\n\n# link your executable to MIPP\ntarget_link_libraries(gemm PRIVATE MIPP::mipp)\n```\n\n```sh\ncd your_project/\n# if MIPP is installed in a system standard path: MIPP will be found automatically with cmake\ncmake -B build\n# if MIPP is installed in a non-standard path: use CMAKE_PREFIX_PATH\ncmake -B build -DCMAKE_PREFIX_PATH=$MIPP_ROOT\n```\n\n#### Generate Sources \u0026 Compile the Static Library\n\nMIPP is mainly a header-only library. However, some macro operations require\ncompiling a small library. This is particularly true for the `compress` \noperation that relies on generated LUTs stored in the static library.\n\nTo generate the source files containing these LUTs you need to install Python3\nwith the Jinja2 package:\n```bash\nsudo apt install python3 python3-pip\npip3 install --user -r codegen/requirements.txt\n```\n\nThen you can call the generator as follows:\n```bash\npython3 codegen/gen_compress.py\n```\n\nFinally, you can compile the MIPP static library:\n```bash\ncmake -B build -DMIPP_STATIC_LIB=ON\ncmake --build build -j4\n```\n\nNote that **the compilation of the static library is optional**. You can choose \nnot to compile the static library; in that case only some macro operations will be \nmissing.\n\n### Sequential Mode\n\nBy default, MIPP tries to recognize the instruction set from the preprocessor \ndefinitions. 
If MIPP can't match the instruction set (for instance when MIPP \ndoes not support the targeted instruction set), MIPP falls back on standard \nsequential instructions. In this mode, vectorization is no longer guaranteed \nbut the compiler can still perform auto-vectorization.\n\nIt is possible to force MIPP to use the sequential mode with the following \ncompiler definition: `-DMIPP_NO_INTRINSICS`. This can sometimes be useful for \ndebugging or for benchmarking code.\n\nIf you want to check the MIPP mode configuration, you can print the following \nglobal variable: `mipp::InstructionFullType` (`std::string`).\n\n### Vector Register Declaration\n\nJust use the `mipp::Reg\u003cT\u003e` type.\n\n```cpp\nmipp::Reg\u003cT\u003e r1, r2, r3; // we have declared 3 vector registers\n```\n\nBut we do not know the number of elements per register here. This number of \nelements can be obtained by calling the `mipp::N\u003cT\u003e()` function (`T` is a \ntemplate parameter, it can be `double`, `float`, `int64_t`, `uint64_t`,\n`int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t` or `uint8_t` type).\n\n```cpp\nfor (int i = 0; i \u003c n; i += mipp::N\u003cfloat\u003e()) {\n\t// ...\n}\n```\n\nThe number of elements per register directly depends on the precision of the data we are working \non.\n\n### Register `load` and `store` Instructions\n\nLoading memory from a vector into a register:\n\n```cpp\nint n = mipp::N\u003cfloat\u003e() * 10;\nstd::vector\u003cfloat\u003e myVector(n);\nint i = 0;\nmipp::Reg\u003cfloat\u003e r1;\nr1.load(\u0026myVector[i*mipp::N\u003cfloat\u003e()]);\n```\n\nThe last two lines can be shortened as follows, where the `load` call becomes\nimplicit:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1 = \u0026myVector[i*mipp::N\u003cfloat\u003e()];\n```\n\nStore can be done with the `store(...)` method:\n\n```cpp\nint n = mipp::N\u003cfloat\u003e() * 10;\nstd::vector\u003cfloat\u003e myVector(n);\nint i = 0;\nmipp::Reg\u003cfloat\u003e r1 = 
\u0026myVector[i*mipp::N\u003cfloat\u003e()];\n\n// do something with r1\n\nr1.store(\u0026myVector[(i+1)*mipp::N\u003cfloat\u003e()]);\n```\n\nBy default the loads and stores work on **unaligned memory**.\nIt is possible to control this behavior with the `-DMIPP_ALIGNED_LOADS` \ndefinition: when specified, the loads and stores work on **aligned memory** by \ndefault. In the **aligned memory** mode, it is still possible to perform \nunaligned memory operations with the `mipp::loadu` and `mipp::storeu` functions.\nHowever, it is not possible to perform aligned loads and stores in the \n**unaligned memory** mode.\n\nTo allocate aligned data you can use the MIPP aligned memory allocator wrapped \ninto the `mipp::vector` class. `mipp::vector` is fully compatible with the \nstandard `std::vector` class and it can be used everywhere you can use \n`std::vector`.\n\n```cpp\nmipp::vector\u003cfloat\u003e myVector(n);\n```\n\n### Register Initialization\n\nYou can initialize a vector register from a scalar value:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1; // r1 = | unknown | unknown | unknown | unknown |\nr1 = 1.0;            // r1 = |    +1.0 |    +1.0 |    +1.0 |    +1.0 |\n```\n\nOr from an initializer list (`std::initializer_list`):\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1;       // r1 = | unknown | unknown | unknown | unknown |\nr1 = {1.0, 2.0, 3.0, 4.0}; // r1 = |    +1.0 |    +2.0 |    +3.0 |    +4.0 |\n```\n\n### Computational Instructions\n\n**Add** two vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3;\n\nr1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |\nr2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |\n\nr3 = r1 + r2; // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |\n```\n\n**Subtract** two vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3;\n\nr1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |\nr2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |\n\nr3 = r1 - r2; // r3 = | -1.0 | -1.0 | -1.0 | -1.0 |\n```\n\n**Multiply** two vector 
registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3;\n\nr1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |\nr2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |\n\nr3 = r1 * r2; // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |\n```\n\n**Divide** two vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3;\n\nr1 = 1.0;     // r1 = | +1.0 | +1.0 | +1.0 | +1.0 |\nr2 = 2.0;     // r2 = | +2.0 | +2.0 | +2.0 | +2.0 |\n\nr3 = r1 / r2; // r3 = | +0.5 | +0.5 | +0.5 | +0.5 |\n```\n\n**Fused multiply and add** of three vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3, r4;\n\nr1 = 2.0;                     // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |\nr2 = 3.0;                     // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |\nr3 = 1.0;                     // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |\n\n// r4 = (r1 * r2) + r3\nr4 = mipp::fmadd(r1, r2, r3); // r4 = | +7.0 | +7.0 | +7.0 | +7.0 |\n```\n\n**Fused negative multiply and add** of three vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3, r4;\n\nr1 = 2.0;                      // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |\nr2 = 3.0;                      // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |\nr3 = 1.0;                      // r3 = | +1.0 | +1.0 | +1.0 | +1.0 |\n\n// r4 = -(r1 * r2) + r3\nr4 = mipp::fnmadd(r1, r2, r3); // r4 = | -5.0 | -5.0 | -5.0 | -5.0 |\n```\n\n**Square root** of a vector register:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2;\n\nr1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |\n\nr2 = mipp::sqrt(r1);  // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |\n```\n\n**Reciprocal square root** of a vector register (be careful: this intrinsic \nexists only for single-precision floating-point numbers):\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2;\n\nr1 = 9.0;             // r1 = | +9.0 | +9.0 | +9.0 | +9.0 |\n\nr2 = mipp::rsqrt(r1); // r2 = | +0.3 | +0.3 | +0.3 | +0.3 |\n```\n\n### Selections\n\nSelect the **minimum** between two vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, 
r3;\n\nr1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |\nr2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |\n\nr3 = mipp::min(r1, r2); // r3 = | +2.0 | +2.0 | +2.0 | +2.0 |\n```\n\nSelect the **maximum** between two vector registers:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2, r3;\n\nr1 = 2.0;               // r1 = | +2.0 | +2.0 | +2.0 | +2.0 |\nr2 = 3.0;               // r2 = | +3.0 | +3.0 | +3.0 | +3.0 |\n\nr3 = mipp::max(r1, r2); // r3 = | +3.0 | +3.0 | +3.0 | +3.0 |\n```\n\n### Permutations\n\nThe `rrot(...)` method allows you to perform a **right rotation** (a cyclic \npermutation) of the elements inside the register:\n\n```cpp\nmipp::Reg\u003cfloat\u003e r1, r2;\nr1 = {3.0, 2.0, 1.0, 0.0}; // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |\n\nr2 = mipp::rrot(r1);       // r2 = | +0.0 | +3.0 | +2.0 | +1.0 |\nr1 = mipp::rrot(r2);       // r1 = | +1.0 | +0.0 | +3.0 | +2.0 |\nr2 = mipp::rrot(r1);       // r2 = | +2.0 | +1.0 | +0.0 | +3.0 |\nr1 = mipp::rrot(r2);       // r1 = | +3.0 | +2.0 | +1.0 | +0.0 |\n```\n\nOf course there are many more available instructions in the MIPP wrapper and you \ncan find these instructions at the [end of this page](#list-of-mipp-functions).\n\n### Addition of Two Vectors\n\n```cpp\n#include \u003ccstdlib\u003e // rand()\n#include \"mipp.h\"\n\nint main()\n{\n\t// data allocation\n\tconst int n = 32000; // size of the vA, vB, vC vectors\n\tmipp::vector\u003cfloat\u003e vA(n); // in\n\tmipp::vector\u003cfloat\u003e vB(n); // in\n\tmipp::vector\u003cfloat\u003e vC(n); // out\n\n\t// data initialization\n\tfor (int i = 0; i \u003c n; i++) vA[i] = rand() % 10;\n\tfor (int i = 0; i \u003c n; i++) vB[i] = rand() % 10;\n\n\t// declare 3 vector registers\n\tmipp::Reg\u003cfloat\u003e rA, rB, rC;\n\n\t// compute rC with the MIPP vectorized functions\n\tfor (int i = 0; i \u003c n; i += mipp::N\u003cfloat\u003e()) {\n\t\trA.load(\u0026vA[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS\n\t\trB.load(\u0026vB[i]); // 
macro definition to force aligned loads and stores).\n\t\trC = rA + rB;\n\t\trC.store(\u0026vC[i]);\n\t}\n\n\treturn 0;\n}\n```\n\n### Vectorizing an Existing Code\n\n#### Scalar Code\n\n```cpp\n// ...\nfor (int i = 0; i \u003c n; i++) {\n\tout[i] = 0.75f * in1[i] * std::exp(in2[i]);\n}\n// ...\n```\n\n#### Vectorized Code\n\n```cpp\n// ...\n// compute the vectorized loop size which is a multiple of 'mipp::N\u003cfloat\u003e()'.\nauto vecLoopSize = (n / mipp::N\u003cfloat\u003e()) * mipp::N\u003cfloat\u003e();\nmipp::Reg\u003cfloat\u003e rout, rin1, rin2;\nfor (int i = 0; i \u003c vecLoopSize; i += mipp::N\u003cfloat\u003e()) {\n\trin1.load(\u0026in1[i]); // unaligned load by default (use the -DMIPP_ALIGNED_LOADS\n\trin2.load(\u0026in2[i]); // macro definition to force aligned loads and stores).\n\t// the '0.75f' constant will be broadcast in a vector but it has to be at\n\t// the right of a 'mipp::Reg\u003cT\u003e', this is why it has been moved at the right\n\t// of the 'rin1' register. Notice that 'std::exp' has been replaced by\n\t// 'mipp::exp'.\n\trout = rin1 * 0.75f * mipp::exp(rin2);\n\trout.store(\u0026out[i]);\n}\n\n// scalar tail loop: compute the remaining elements that can't be vectorized.\nfor (int i = vecLoopSize; i \u003c n; i++) {\n\tout[i] = 0.75f * in1[i] * std::exp(in2[i]);\n}\n// ...\n```\n\n### Masked Instructions\n\nMIPP comes with two generic and templatized masked functions (`mask` and \n`maskz`). Those functions allow you to benefit from the AVX-512 and SVE masked \ninstructions. `mask` and `maskz` functions are retro compatible with older \ninstruction sets.\n\n```cpp\nmipp::Reg\u003c        float   \u003e ZMM1 = {   40,  -30,    60,    80};\nmipp::Reg\u003c        float   \u003e ZMM2 = 0.1; // broadcast\nmipp::Msk\u003cmipp::N\u003cfloat\u003e()\u003e k1   = {false, true, false, false};\n\n// ZMM3 = k1 ? 
ZMM1 * ZMM2 : ZMM1;\nauto ZMM3 = mipp::mask\u003cfloat, mipp::mul\u003e(k1, ZMM1, ZMM1, ZMM2);\nstd::cout \u003c\u003c ZMM3 \u003c\u003c std::endl; // output: \"[40, -3, 60, 80]\"\n\n// ZMM4 = k1 ? ZMM1 * ZMM2 : 0;\nauto ZMM4 = mipp::maskz\u003cfloat, mipp::mul\u003e(k1, ZMM1, ZMM2);\nstd::cout \u003c\u003c ZMM4 \u003c\u003c std::endl; // output: \"[0, -3, 0, 0]\"\n```\n\n## List of MIPP Functions\n\nThis section presents an exhaustive list of all the available functions in MIPP.\nOf course the MIPP wrapper does not cover all the possible intrinsics of each \ninstruction set but it tries to give you the most important and useful ones.\n\nIn the following tables, `T`, `T1` and `T2` stand for data types (`double`, \n`float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`,\n`int8_t` or `uint8_t`).\n`N` stands for the number of elements in a mask or in a register.\n`N` is a strictly positive integer and can easily be deduced from the data type: \n`constexpr int N = mipp::N\u003cT\u003e()`.\nWhen `T` and `N` are mixed in a prototype, `N` has to satisfy the previous \nconstraint (`N = mipp::N\u003cT\u003e()`).\n\nSome terms used in the documentation need to be clarified:\n\n  - **register element**: a SIMD register is composed of multiple scalar \n  elements, those elements are built-in data types (`double`, `float`, \n  `int64_t`, ...),\n  - **register lane**: modern instruction sets can have multiple implicit sub \n  parts in an entire SIMD register, those sub parts are called lanes (SSE has \n  one lane of 128 bits, AVX has two lanes of 128 bits, AVX-512 has four lanes of \n  128 bits).\n\n### Memory Operations\n\n| **Short name**  | **Prototype**                                                                          | **Documentation**                                                                                                                                                   | **Supported types**                                
                                                         |\n| :---            | :---                                                                                   | :---                                                                                                                                                                | :---                                                                                                        |\n| `load`          | `Reg  \u003cT\u003e load          (const T* mem)`                                                | Loads aligned data from `mem` to a register.                                                                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `loadu`         | `Reg  \u003cT\u003e loadu         (const T* mem)`                                                | Loads unaligned data from `mem` to a register.                                                                                                                      | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `store`         | `void     store         (T* mem, const Reg\u003cT\u003e r)`                                      | Stores the `r` register in the `mem` aligned data.                                                                                                                  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `storeu`        | `void     storeu        (T* mem, const Reg\u003cT\u003e r)`                                      | Stores the `r` register in the `mem` unaligned data.                                                                                                                
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `maskzld`       | `Reg  \u003cT\u003e maskzld       (const Msk\u003cN\u003e m, const T* mem)`                                | Loads elements according to the mask `m` (puts zero when the mask value is false).                                                                                  | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `maskzlds`      | `Reg  \u003cT\u003e maskzlds      (const Msk\u003cN\u003e m, const T* mem)`                                | Loads elements according to the mask `m` (puts zero when the mask value is false). Safe version, only reads masked elements in memory.                              | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `maskst`        | `void     maskst        (const Msk\u003cN\u003e m, T* mem, const Reg\u003cT\u003e r)`                      | Stores elements from the `r` register according to the mask `m` in the `mem` memory.                                                                                | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `masksts`       | `void     masksts       (const Msk\u003cN\u003e m, T* mem, const Reg\u003cT\u003e r)`                      | Stores elements from the `r` register according to the mask `m` in the `mem` memory. Safe version, only writes masked elements in memory.                           | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `gather`        | `Reg  \u003cTD,TI\u003e gather    (const TD* mem, const Reg\u003cTI\u003e idx)`                            | Gathers elements from `mem` to a register. Selects elements according to the indices in `idx`.                                        
                              | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `scatter`       | `void \u003cTD,TI\u003e scatter   (TD* mem, const Reg\u003cTI\u003e idx, const Reg\u003cTD\u003e r)`                 | Scatters elements into `mem` from the `r` register. Writes elements at the `idx` indices in `mem`.                                                                  | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `maskzgat`      | `Reg  \u003cTD,TI\u003e gather    (const Msk\u003cN\u003e m, const TD* mem, const Reg\u003cTI\u003e idx)`            | Gathers elements from `mem` to a register (according to the mask `m`). Selects elements according to the indices in `idx` (puts zero when the mask value is false). | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `masksca`       | `void \u003cTD,TI\u003e scatter   (const Msk\u003cN\u003e m, TD* mem, const Reg\u003cTI\u003e idx, const Reg\u003cTD\u003e r)` | Scatters elements into `mem` from the `r` register (according to the mask `m`). Writes elements at the `idx` indices in `mem`.                                      | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `set`           | `Reg  \u003cT\u003e set           (const T[N] vals)`                                             | Sets a register from the values in `vals`.                                                                                                                          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `set`           | `Msk  \u003cN\u003e set           (const bool[N] bits)`                                          | Sets a mask from the bits in `bits`.                                                
                                                                                |                                                                                                             |\n| `set1`          | `Reg  \u003cT\u003e set1          (const T val)`                                                 | Broadcasts `val` in a register.                                                                                                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `set1`          | `Msk  \u003cN\u003e set1          (const bool bit)`                                              | Broadcasts `bit` in a mask.                                                                                                                                         |                                                                                                             |\n| `set0`          | `Reg  \u003cT\u003e set0          ()`                                                            | Initializes a register to zero.                                                                                                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `set0`          | `Msk  \u003cN\u003e set0          ()`                                                            | Initializes a mask to false.                                                                                                                                        |                                                              |\n| `get`           | `T        get           (const Reg\u003cT\u003e r, const size_t index)`                          | Gets a specific element from the register `r` at the `index` position.                                                                                 
             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `get`           | `T        get           (const Reg_2\u003cT\u003e r, const size_t index)`                        | Gets a specific element from the register `r` at the `index` position.                                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `get`           | `bool     get           (const Msk\u003cN\u003e m, const size_t index)`                          | Gets a specific element from the register `m` at the `index` position.                                                                                              |                                                                                                             |\n| `getfirst`      | `T        getfirst      (const Reg\u003cT\u003e r)`                                              | Gets the first element from the register `r`.                                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `getfirst`      | `T        getfirst      (const Reg_2\u003cT\u003e r)`                                            | Gets the first element from the register `r`.                                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `getfirst`      | `bool     getfirst      (const Msk\u003cN\u003e m)`                                              | Gets the first element from the register `m`.                                                                                                                       
|                                                                                                             |\n| `low`           | `Reg_2\u003cT\u003e low           (const Reg\u003cT\u003e r)`                                              | Gets the low part of the `r` register.                                                                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `high`          | `Reg_2\u003cT\u003e high          (const Reg\u003cT\u003e r)`                                              | Gets the high part of the `r` register.                                                                                                                             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `combine`       | `Reg  \u003cT\u003e combine       (const Reg_2\u003cT\u003e r1, const Reg_2\u003cT\u003e r2)`                        | Combine two half registers in a full register, `r1` will be the low part and `r2` the high part.                                                                    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `combine`       | `Reg  \u003cS,T\u003e combine     (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | `S` elements of `r1` are shifted to the left, `(S - N) + N` elements of `r2` are shifted to the right. Shifted `r1` and `r2` are combined to give the result.       
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `compress`      | `Reg  \u003cT\u003e compress      (const Reg\u003cT\u003e r1, const Msk\u003cN\u003e m)`                             | Pack the elements of `r1` at the beginning of the register according to the bitmask `m` (if the bit is 1 then element is picked, otherwise it is not).              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmask`         | `Reg  \u003cT\u003e cmask         (const uint32_t[N  ] ids)`                                     | Creates a cmask from an indexes list (indexes have to be between 0 and N-1).                                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmask2`        | `Reg  \u003cT\u003e cmask2        (const uint32_t[N/2] ids)`                                     | Creates a cmask2 from an indexes list (indexes have to be between 0 and (N/2)-1).                                                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmask4`        | `Reg  \u003cT\u003e cmask4        (const uint32_t[N/4] ids)`                                     | Creates a cmask4 from an indexes list (indexes have to be between 0 and (N/4)-1).                                                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `shuff`         | `Reg  \u003cT\u003e shuff         (const Reg\u003cT\u003e r, const Reg\u003cT\u003e cm)`                             | Shuffles the elements of `r` according to the cmask `cm`.                                                                                       
                    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `shuff2`        | `Reg  \u003cT\u003e shuff2        (const Reg\u003cT\u003e r, const Reg\u003cT\u003e cm2)`                            | Shuffles the elements of `r` according to the cmask2 `cm2` (same shuffle is applied in both lanes).                                                                 | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `shuff4`        | `Reg  \u003cT\u003e shuff4        (const Reg\u003cT\u003e r, const Reg\u003cT\u003e cm4)`                            | Shuffles the elements of `r` according to the cmask4 `cm4` (same shuffle is applied in the four lanes).                                                             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleave`    | `Regx2\u003cT\u003e interleave    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves `r1` and `r2`: `[r1_1, r2_1, r1_2, r2_2, ..., r1_n, r2_n]`.                                                                                             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `deinterleave`  | `Regx2\u003cT\u003e deinterleave  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Reverts the previously defined interleave operation.                                                                                                                | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleave2`   | `Regx2\u003cT\u003e interleave2   (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves `r1` and `r2` considering two lanes.                
                                                                                                    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleave4`   | `Regx2\u003cT\u003e interleave4   (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves `r1` and `r2` considering four lanes.                                                                                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleavelo`  | `Reg  \u003cT\u003e interleavelo  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves the low part of `r1` with the low part of `r2`.                                                                                                         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleavelo2` | `Reg  \u003cT\u003e interleavelo2 (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves the low part of `r1` with the low part of `r2` (considering two lanes).                                                                                 | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleavelo4` | `Reg  \u003cT\u003e interleavelo4 (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves the low part of `r1` with the low part of `r2` (considering four lanes).                                                                                
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleavehi`  | `Reg  \u003cT\u003e interleavehi  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves the high part of `r1` with the high part of `r2`.                                                                                                       | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleavehi2` | `Reg  \u003cT\u003e interleavehi2 (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves the high part of `r1` with the high part of `r2` (considering two lanes).                                                                               | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `interleavehi4` | `Reg  \u003cT\u003e interleavehi4 (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                            | Interleaves the high part of `r1` with the high part of `r2` (considering four lanes).                                                                              | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `lrot`          | `Reg  \u003cT\u003e lrot          (const Reg\u003cT\u003e r)`                                              | Rotates the `r` register from the left (cyclic permutation).                                                                                                        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `rrot`          | `Reg  \u003cT\u003e rrot          (const Reg\u003cT\u003e r)`                                              | Rotates the `r` register from the right (cyclic permutation).                                           
                                                            | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `blend`         | `Reg  \u003cT\u003e blend         (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2, const Msk\u003cN\u003e m)`            | Combines the `r1` and `r2` registers following the `m` mask values (`m_i ? r1_i : r2_i`).                                                                           | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `select`        | `Reg  \u003cT\u003e select        (const Msk\u003cN\u003e m, const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`            | Alias for the previous `blend` function. The parameter order is slightly different.                                                                                 | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n\n### Bitwise Operations\n\nThe `pipe` keyword stands for the \"\u0026#124;\" binary operator.\n\n| **Short name** | **Operator**       | **Prototype**                                       | **Documentation**                             | **Supported types**                                                                                         |\n| :---           | :---               | :---                                                | :---                                          | :---                                                                                                        |\n| `andb`         | `\u0026` and `\u0026=`       | `Reg\u003cT\u003e andb    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Computes the bitwise AND: `r1 \u0026 r2`.          
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `andb`         | `\u0026` and `\u0026=`       | `Msk\u003cN\u003e andb    (const Msk\u003cN\u003e m1, const Msk\u003cN\u003e m2)` | Computes the bitwise AND: `m1 \u0026 m2`.          |                                                                                                             |\n| `andnb`        |                    | `Reg\u003cT\u003e andnb   (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Computes the bitwise AND NOT: `(~r1) \u0026 r2`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `andnb`        |                    | `Msk\u003cN\u003e andnb   (const Msk\u003cN\u003e m1, const Msk\u003cN\u003e m2)` | Computes the bitwise AND NOT: `(~m1) \u0026 m2`.   |                                                                                                             |\n| `orb`          | `pipe` and `pipe=` | `Reg\u003cT\u003e orb     (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Computes the bitwise OR: `r1 pipe r2`.        | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `orb`          | `pipe` and `pipe=` | `Msk\u003cN\u003e orb     (const Msk\u003cN\u003e m1, const Msk\u003cN\u003e m2)` | Computes the bitwise OR: `m1 pipe m2`.        |                                                                                                             |\n| `xorb`         | `^` and `^=`       | `Reg\u003cT\u003e xorb    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Computes the bitwise XOR: `r1 ^ r2`.          | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `xorb`         | `^` and `^=`       | `Msk\u003cN\u003e xorb    (const Msk\u003cN\u003e m1, const Msk\u003cN\u003e m2)` | Computes the bitwise XOR: `m1 ^ m2`.    
      |                                                                                                             |\n| `lshift`       | `\u003c\u003c` and `\u003c\u003c=`     | `Reg\u003cT\u003e lshift  (const Reg\u003cT\u003e r, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `r \u003c\u003c n`.    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `lshiftr`      | `\u003c\u003c` and `\u003c\u003c=`     | `Reg\u003cT\u003e lshiftr (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Computes the bitwise LEFT SHIFT: `r1 \u003c\u003c r2`.  | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                    |\n| `lshift`       | `\u003c\u003c` and `\u003c\u003c=`     | `Msk\u003cN\u003e lshift  (const Msk\u003cN\u003e m, const uint32_t n)` | Computes the bitwise LEFT SHIFT: `m \u003c\u003c n`.    |                                                                                                             |\n| `rshift`       | `\u003e\u003e` and `\u003e\u003e=`     | `Reg\u003cT\u003e rshift  (const Reg\u003cT\u003e r, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `r \u003e\u003e n`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `rshiftr`      | `\u003e\u003e` and `\u003e\u003e=`     | `Reg\u003cT\u003e rshiftr (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Computes the bitwise RIGHT SHIFT: `r1 \u003e\u003e r2`. | `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                    |\n| `rshift`       | `\u003e\u003e` and `\u003e\u003e=`     | `Msk\u003cN\u003e rshift  (const Msk\u003cN\u003e m, const uint32_t n)` | Computes the bitwise RIGHT SHIFT: `m \u003e\u003e n`.   
|                                                                                                             |\n| `notb`         | `~`                | `Reg\u003cT\u003e notb    (const Reg\u003cT\u003e r)`                   | Computes the bitwise NOT: `~r`.               | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `notb`         | `~`                | `Msk\u003cN\u003e notb    (const Msk\u003cN\u003e m)`                   | Computes the bitwise NOT: `~m`.               |                                                                                                             |\n\n### Logical Comparisons\n\n| **Short name** | **Operator** | **Prototype**                                      | **Documentation**                             | **Supported types**                                                                                         |\n| :---           | :---         | :---                                               | :---                                          | :---                                                                                                        |\n| `cmpeq`        | `==`         | `Msk\u003cN\u003e cmpeq  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Compares if equal to: `r1 == r2`.             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmpneq`       | `!=`         | `Msk\u003cN\u003e cmpneq (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Compares if not equal to: `r1 != r2`.         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmpge`        | `\u003e=`         | `Msk\u003cN\u003e cmpge  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Compares if greater or equal to: `r1 \u003e= r2`.  
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmpgt`        | `\u003e`          | `Msk\u003cN\u003e cmpgt  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Compares if strictly greater than: `r1 \u003e r2`. | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmple`        | `\u003c=`         | `Msk\u003cN\u003e cmple  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Compares if lower or equal to: `r1 \u003c= r2`.    | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmplt`        | `\u003c`          | `Msk\u003cN\u003e cmplt  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Compares if strictly lower than: `r1 \u003c r2`.   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n\n### Conversions and Packing\n\n| **Short name** | **Prototype**                                        | **Documentation**                                                                                                                                                                                                                                                                                                                                   | **Supported types**                                                                                                                                                    |\n| :---           | :---                                                 | :---                                                                                                                                                                                                                                                                                                                                            
    | :---                                                                                                                                                                   |\n| `toReg`        | `Reg\u003cT\u003e  toReg (const Msk\u003cN\u003e m)`                     | Converts the mask `m` into a register of type `T`, the number of elements `N` has to be the same for the mask and the register. If the mask is `false` then all the bits of the corresponding element are set to 0, otherwise if the mask is `true` then all the bits are set to 1 (be careful, for float datatypes `true` is interpreted as NaN!). | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t`                                                            |\n| `cvt`          | `Reg\u003cT2\u003e cvt   (const Reg\u003cT1\u003e r)`                    | Converts the elements of `r` into another representation (the new representation and the original one must have the same size).                                                                                                                                                                                                                     | `float -\u003e int32_t`, `float -\u003e uint32_t`, `int32_t -\u003e float`, `uint32_t -\u003e float`, `double -\u003e int64_t`, `double -\u003e uint64_t`, `int64_t -\u003e double`, `uint64_t -\u003e double` |\n| `cvt`          | `Reg\u003cT2\u003e cvt   (const Reg_2\u003cT1\u003e r)`                  | Converts elements of `r` into bigger elements (in bits).                                                                                                                                                                                                                                                                                            
| `int8_t -\u003e int16_t`, `uint8_t -\u003e uint16_t`, `int16_t -\u003e int32_t`, `uint16_t -\u003e uint32_t`, `int32_t -\u003e int64_t`, `uint32_t -\u003e uint64_t`                                 |\n| `pack`         | `Reg\u003cT2\u003e pack  (const Reg\u003cT1\u003e r1, const Reg\u003cT1\u003e r2)` | Packs elements of `r1` and `r2` into smaller elements (some information can be lost in the conversion).                                                                                                                                                                                                                                             | `int32_t -\u003e int16_t`, `uint32_t -\u003e uint16_t`, `int16_t -\u003e int8_t`, `uint16_t -\u003e uint8_t`                                                                               |\n\n### Arithmetic Operations\n\n| **Short name** | **Operator** | **Prototype**                                                       | **Documentation**                                                                                   | **Supported types**                                                                                         |\n| :---           | :---         | :---                                                                | :---                                                                                                | :---                                                                                                        |\n| `add`          | `+` and `+=` | `Reg\u003cT\u003e add    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Performs the arithmetic addition: `r1 + r2`.                                                        
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `sub`          | `-` and `-=` | `Reg\u003cT\u003e sub    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Performs the arithmetic subtraction: `r1 - r2`.                                                     | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `mul`          | `*` and `*=` | `Reg\u003cT\u003e mul    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Performs the arithmetic multiplication: `r1 * r2`.                                                  | `double`, `float`, `int32_t`, `int16_t`, `int8_t`                                                           |\n| `div`          | `/` and `/=` | `Reg\u003cT\u003e div    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Performs the arithmetic division: `r1 / r2`.                                                        | `double`, `float`                                                                                           |\n| `fmadd`        |              | `Reg\u003cT\u003e fmadd  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2, const Reg\u003cT\u003e r3)` | Performs the fused multiplication and addition: `r1 * r2 + r3`.                                     | `double`, `float`                                                                                           |\n| `fnmadd`       |              | `Reg\u003cT\u003e fnmadd (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2, const Reg\u003cT\u003e r3)` | Performs the negative fused multiplication and addition: `-(r1 * r2) + r3`.                         
| `double`, `float`                                                                                           |\n| `fmsub`        |              | `Reg\u003cT\u003e fmsub  (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2, const Reg\u003cT\u003e r3)` | Performs the fused multiplication and subtraction: `r1 * r2 - r3`.                                  | `double`, `float`                                                                                           |\n| `fnmsub`       |              | `Reg\u003cT\u003e fnmsub (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2, const Reg\u003cT\u003e r3)` | Performs the negative fused multiplication and subtraction: `-(r1 * r2) - r3`.                      | `double`, `float`                                                                                           |\n| `min`          |              | `Reg\u003cT\u003e min    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Selects the minimum: `r1_i \u003c r2_i ? r1_i : r2_i`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `max`          |              | `Reg\u003cT\u003e max    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Selects the maximum: `r1_i \u003e r2_i ? r1_i : r2_i`.                                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `div2`         |              | `Reg\u003cT\u003e div2   (const Reg\u003cT\u003e r)`                                    | Performs the arithmetic division by two: `r / 2`.                                                   
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `div4`         |              | `Reg\u003cT\u003e div4   (const Reg\u003cT\u003e r)`                                    | Performs the arithmetic division by four: `r / 4`.                                                  | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `abs`          |              | `Reg\u003cT\u003e abs    (const Reg\u003cT\u003e r)`                                    | Computes the absolute value of `r`.                                                                 | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `sqrt`         |              | `Reg\u003cT\u003e sqrt   (const Reg\u003cT\u003e r)`                                    | Computes the square root of `r`.                                                                    | `double`, `float`                                                                                           |\n| `rsqrt`        |              | `Reg\u003cT\u003e rsqrt  (const Reg\u003cT\u003e r)`                                    | Computes the reciprocal square root of `r`: `1 / sqrt(r)`.                                          | `double`, `float`                                                                                           |\n| `sat`          |              | `Reg\u003cT\u003e sat    (const Reg\u003cT\u003e r, const T minv, const T maxv)`        | Saturates the register values: `min(max(r, minv), maxv)`.                                           | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `neg`          |              | `Reg\u003cT\u003e neg    (const Reg\u003cT\u003e r, const Msk\u003cN\u003e m)`                    | Negates the register elements following the mask values: `m_i ? 
-r_i : r_i`.                        | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `neg`          |              | `Reg\u003cT\u003e neg    (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)`                  | Negates the register elements following the last register values: `r2_i \u003c 0 ? -r1_i : r1_i`.        | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `sign`         |              | `Msk\u003cN\u003e sign   (const Reg\u003cT\u003e r)`                                    | Returns the sign: `r \u003c 0`.                                                                          | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `round`        |              | `Reg\u003cT\u003e round  (const Reg\u003cT\u003e r)`                                    | Rounds the register values: `fractional_part(r) \u003e= 0.5 ? integral_part(r) + 1 : integral_part(r)`.  | `double`, `float`                                                                                           |\n| `trunc`        |              | `Reg\u003cT\u003e trunc  (const Reg\u003cT\u003e r)`                                    | Truncates the register values: `integral_part(r)`.                                                  | `double`, `float`                                                                                           |\n\n### Arithmetic Operations on Complex Numbers\n\nThe complex operations are exclusively performed on `Regx2\u003cT\u003e` objects (one \n`Regx2\u003cT\u003e` object contains two `Reg\u003cT\u003e` hardware registers). Each `Regx2\u003cT\u003e` \nobject contains `mipp::N\u003cT\u003e()` complex numbers. 
If we declare a `Regx2\u003cT\u003e cmplx` \nobject, the `cmplx[0]` register will contain the real part of the complex \nnumbers and `cmplx[1]` will contain the imaginary part. Depending on how you \nstored your complex numbers in memory, you may need to reorder them before \ncalling a complex operation. For instance, if you choose to store the complex \nnumbers in a mixed format like this: `r0, i0, r1, i1, r2, i2, ..., rn, in` you \nwill need to call the `mipp::deinterleave` operation before and the \n`mipp::interleave` operation after the complex operation.\n\n| **Short name** | **Operator** | **Prototype**                                              | **Documentation**                                                    | **Supported types**                                                                                         |\n| :---           | :---         | :---                                                       | :---                                                                 | :---                                                                                                        |\n| `cadd`         | `+` and `+=` | `Regx2\u003cT\u003e cadd     (const Regx2\u003cT\u003e r1, const Regx2\u003cT\u003e r2)` | Performs the complex addition: `r1 + r2`.                            | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `csub`         | `-` and `-=` | `Regx2\u003cT\u003e csub     (const Regx2\u003cT\u003e r1, const Regx2\u003cT\u003e r2)` | Performs the complex subtraction: `r1 - r2`.                         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `cmul`         | `*` and `*=` | `Regx2\u003cT\u003e cmul     (const Regx2\u003cT\u003e r1, const Regx2\u003cT\u003e r2)` | Performs the complex multiplication: `r1 * r2`.                      
| `double`, `float`, `int32_t`, `int16_t`, `int8_t`                                                           |\n| `cdiv`         | `/` and `/=` | `Regx2\u003cT\u003e cdiv     (const Regx2\u003cT\u003e r1, const Regx2\u003cT\u003e r2)` | Performs the complex division: `r1 / r2`.                            | `double`, `float`                                                                                           |\n| `cmulconj`     |              | `Regx2\u003cT\u003e cmulconj (const Regx2\u003cT\u003e r1, const Regx2\u003cT\u003e r2)` | Performs the complex multiplication with conjugate: `r1 * conj(r2)`. | `double`, `float`, `int32_t`, `int16_t`, `int8_t`                                                           |\n| `conj`         |              | `Regx2\u003cT\u003e conj     (const Regx2\u003cT\u003e r)`                     | Computes the conjugate: `conj(r)`.                                   | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `norm`         |              | `Reg  \u003cT\u003e norm     (const Regx2\u003cT\u003e r)`                     | Computes the squared magnitude: `norm(r)`.                           
| `double`, `float`, `int32_t`, `int16_t`, `int8_t`                                                           |\n\n### Reductions (Horizontal Functions)\n\n| **Short name**    | **Prototype**                                                     | **Documentation**                                                                                                  | **Supported types**                                                                                         |\n| :---              | :---                                                              | :---                                                                                                               | :---                                                                                                        |\n| `hadd` or `sum`   | `T    hadd                    (const Reg\u003cT\u003e r)`                   | Sums all the elements in the register `r`: `r_1 + r_2 + ... + r_n`.                                                | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `hmul`            | `T    hmul                    (const Reg\u003cT\u003e r)`                   | Multiplies all the elements in the register `r` : `r_1 * r_2 * ... * r_n`.                                         | `double`, `float`, `int64_t`, `int32_t`, `int16_t`, `int8_t`                                                |\n| `hmin`            | `T    hmin                    (const Reg\u003cT\u003e r)`                   | Selects the minimum element in the register `r` : `min(min(min(..., r_1), r_2), r_n)`.                             | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `hmax`            | `T    hmax                    (const Reg\u003cT\u003e r)`                   | Selects the maximum element in the register `r` : `max(max(max(..., r_1), r_2), r_n)`.                             
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `testz`           | `bool testz                   (const Reg\u003cT\u003e r1, const Reg\u003cT\u003e r2)` | Mainly tests if all the elements of the registers are zeros: `r = (r1 \u0026 r2); !(r_1 OR r_2 OR ... OR r_n)`.         | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `testz`           | `bool testz                   (const Msk\u003cN\u003e m1, const Msk\u003cN\u003e m2)` | Mainly tests if all the elements of the masks are zeros: `m = (m1 \u0026 m2); !(m_1 OR m_2 OR ... OR m_n)`.             |                                                                                                             |\n| `testz`           | `bool testz                   (const Reg\u003cT\u003e r)`                   | Tests if all the elements of the register are zeros: `!(r_1 OR r_2 OR ... OR r_n)`.                                | `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n| `testz`           | `bool testz                   (const Msk\u003cN\u003e m)`                   | Tests if all the elements of the mask are zeros: `!(m_1 OR m_2 OR ... OR m_n)`.                                    |                                                                                                             |\n| `Reduction\u003cT,OP\u003e` | `T    Reduction\u003cT,OP\u003e::sapply (const Reg\u003cT\u003e r)`                   | Generic reduction operation: takes a user-defined operator `OP` and performs the reduction with it on `r`.         
| `double`, `float`, `int64_t`, `uint64_t`, `int32_t`, `uint32_t`, `int16_t`, `uint16_t`, `int8_t`, `uint8_t` |\n\n### Math Functions\n\n| **Short name** | **Prototype**                                            | **Documentation**                                                    | **Supported types**                |\n| :---           | :---                                                     | :---                                                                 | :---                               |\n| `exp`          | `Reg\u003cT\u003e   exp    (const Reg\u003cT\u003e r)`                       | Computes the exponential of `r`.                                     | `double` (only on `icpc`), `float` |\n| `log`          | `Reg\u003cT\u003e   log    (const Reg\u003cT\u003e r)`                       | Computes the logarithm of `r`.                                       | `double` (only on `icpc`), `float` |\n| `sin`          | `Reg\u003cT\u003e   sin    (const Reg\u003cT\u003e r)`                       | Computes the sines of `r`.                                           | `double` (only on `icpc`), `float` |\n| `cos`          | `Reg\u003cT\u003e   cos    (const Reg\u003cT\u003e r)`                       | Computes the cosines of `r`.                                         | `double` (only on `icpc`), `float` |\n| `tan`          | `Reg\u003cT\u003e   tan    (const Reg\u003cT\u003e r)`                       | Computes the tangent of `r`.                                         | `double` (only on `icpc`), `float` |\n| `sincos`       | `void     sincos (const Reg\u003cT\u003e r, Reg\u003cT\u003e\u0026 s, Reg\u003cT\u003e\u0026 c)` | Computes at once the sines (in `s`) and the cosines (in `c`) of `r`. | `double` (only on `icpc`), `float` |\n| `sincos`       | `Regx2\u003cT\u003e sincos (const Reg\u003cT\u003e r)`                       | Computes and returns at once the sines and the cosines of `r`.       
| `double` (only on `icpc`), `float` |\n| `cossin`       | `Regx2\u003cT\u003e cossin (const Reg\u003cT\u003e r)`                       | Computes and returns at once the cosines and the sines of `r`.       | `double` (only on `icpc`), `float` |\n| `sinh`         | `Reg\u003cT\u003e   sinh   (const Reg\u003cT\u003e r)`                       | Computes the hyperbolic sines of `r`.                                | `double` (only on `icpc`), `float` |\n| `cosh`         | `Reg\u003cT\u003e   cosh   (const Reg\u003cT\u003e r)`                       | Computes the hyperbolic cosines of `r`.                              | `double` (only on `icpc`), `float` |\n| `tanh`         | `Reg\u003cT\u003e   tanh   (const Reg\u003cT\u003e r)`                       | Computes the hyperbolic tangent of `r`.                              | `double` (only on `icpc`), `float` |\n| `asinh`        | `Reg\u003cT\u003e   asinh  (const Reg\u003cT\u003e r)`                       | Computes the inverse hyperbolic sines of `r`.                        | `double` (only on `icpc`), `float` |\n| `acosh`        | `Reg\u003cT\u003e   acosh  (const Reg\u003cT\u003e r)`                       | Computes the inverse hyperbolic cosines of `r`.                      | `double` (only on `icpc`), `float` |\n| `atanh`        | `Reg\u003cT\u003e   atanh  (const Reg\u003cT\u003e r)`                       | Computes the inverse hyperbolic tangent of `r`.                      | `double` (only on `icpc`), `float` |\n\n## ARM SVE\n\n### SVE Length Specific\n\nAn ARM SVE version is under construction. This version uses the *SVE length\nspecific* mode, which is better suited to the MIPP architecture: the\nsize of the *MIPP registers* is fixed at compile time. As a reminder, \nthe SVE vector length can range from 128 bits to 2048\nbits, in 128-bit increments. 
With the GNU and Clang compilers, it is set at \ncompile time with the `-msve-vector-bits=\u003csize\u003e` flag.\n\n### Supported MIPP Operations\n\n- **Memory operations:** `load`, `store`, `blend`, `set`, `set1`, `gather`, \n  `scatter`, `maskzld`, `maskst`, `maskzgat`, `masksca`\n- **Logical comparisons:** `cmpeq`, `cmpneq`\n- **Bitwise operations:** `andb`, `notb` (msk)\n- **Arithmetic operations:** `fmadd`, `add`, `sub`, `mul`, `div`\n- **Reductions:** `testz` (msk), `Reduce\u003cT, add\u003e`\n\n*Byte* and *word* operations are not yet implemented.\n\n## How to cite MIPP\n\nWe recommend citing the following article:\n- Adrien Cassagne, Olivier Aumage, Denis Barthou, Camille Leroux and Christophe Jégo,  \n  [**MIPP: a Portable C++ SIMD Wrapper and its use for Error Correction Coding in 5G Standard**](https://doi.org/10.1145/3178433.3178435),  \n  *The 5th International Workshop on Programming Models for SIMD/Vector Processing (WPMVP 2018), February 2018.*\n","funding_links":[],"categories":["C++","Maths"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faff3ct%2FMIPP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faff3ct%2FMIPP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faff3ct%2FMIPP/lists"}