{"id":20796859,"url":"https://github.com/eatingtomatoes/pure_simd","last_synced_at":"2025-05-06T10:07:44.490Z","repository":{"id":90600621,"uuid":"262938851","full_name":"eatingtomatoes/pure_simd","owner":"eatingtomatoes","description":"A simple, extensible, portable, efficient and header-only SIMD library!","archived":false,"fork":false,"pushed_at":"2021-10-04T08:25:20.000Z","size":120,"stargazers_count":230,"open_issues_count":0,"forks_count":9,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-31T01:23:17.350Z","etag":null,"topics":["compile-time","simd-library"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eatingtomatoes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-11T04:32:54.000Z","updated_at":"2025-01-15T14:21:32.000Z","dependencies_parsed_at":"2023-04-11T05:31:30.851Z","dependency_job_id":null,"html_url":"https://github.com/eatingtomatoes/pure_simd","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eatingtomatoes%2Fpure_simd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eatingtomatoes%2Fpure_simd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eatingtomatoes%2Fpure_simd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eatingtomatoes%2Fpure_simd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/eatingtomatoes","download_url":"https://codeload.github.com/eatingtomatoes/pure_simd/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252663503,"owners_count":21784783,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compile-time","simd-library"],"created_at":"2024-11-17T16:29:16.194Z","updated_at":"2025-05-06T10:07:44.484Z","avatar_url":"https://github.com/eatingtomatoes.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pure SIMD \n\nA simple, extensible, portable, efficient and header-only SIMD library!\n\n- [Pure SIMD](#pure-simd)\n  \n  * [Introduction](#introduction)\n  * [Compiler Requirements](#compiler-requirements)\n  * [Interface](#interface)\n    + [Types](#types)\n    + [Basic Constructs](#basic-constructs)\n    + [High-level Operations](#high-level-operations) \n  * [Example](#example)\n  * [Test and Benchmark](#test-and-benchmark)\n  * [Development Status](#development-status)\n  * [To do](#to-do)\n\n\u003csmall\u003e\u003ci\u003e\u003ca href='http://ecotrust-canada.github.io/markdown-toc/'\u003eTable of contents generated with markdown-toc\u003c/a\u003e\u003c/i\u003e\u003c/small\u003e\n\n## Introduction\n\nThere are already tons of SIMD libraries, which usually export user-friendly interfaces by wrapping\nthe underlying messy SIMD intrinsics. \n\nPure SIMD also provides user-friendly interfaces, but by a different means. It simply unrolls the loops introduced by vector operations at compile time and leaves the rest of the work to the compiler. 
Modern compilers can easily generate SIMD instructions from the unrolled code, which completes the vectorization. \n\nThis simple idea brings the following **advantages** to Pure SIMD:\n\n* **Simplicity**. The implementation uses variadic templates to unroll the loops introduced by vector operations and to construct user-friendly interfaces. Neither intrinsics nor extra library dependencies are required. If you know variadic templates, you can write your own version very quickly.\n\n* **Extensibility**. You can use the basic constructs to easily implement new vector operations. Nothing ties your hands.\n\n* **Portability**. All code is written in standard C++17, with no extra dependencies and no intrinsics. Some high-level vector operations might have no corresponding low-level instructions on your machine, but that doesn't matter. Your program will still run correctly, and may even outperform the scalar version thanks to loop unrolling.\n\n* **Efficiency**. Pure SIMD depends on the compiler to generate SIMD instructions from unrolled loops. For compilers that support SLP (superword-level parallelism) vectorization, such as gcc and clang, this is not a problem. With such a compiler, you can get nearly the same assembly code as from manually vectorized code. Furthermore, intrinsics might get in the way of the compiler's optimizations, while Pure SIMD has no such problem, so it may even lead to better performance.\n\n* **Header-only**.  \n\n## Compiler Requirements\n\nC++17 \u0026 SLP vectorization.\n\n## Interface\n\nAll definitions of types and functions sit in the namespace `pure_simd`.\n\n### Types\n\n#### array\n\nPure SIMD uses `vector`, which is an aligned version of `std::array`, to model a sequence of values. 
\n\n```c++\n    template \u003ctypename T, std::size_t N, std::size_t Align = 32\u003e\n    struct alignas(Align) vector;\n```\n\n#### size_constant \n\nThis is just an alias for convenience.\n\n```c++\n    template \u003csize_t N\u003e\n    using size_constant = std::integral_constant\u003csize_t, N\u003e;\n```\n\n### Basic Constructs\n\nThe `unroll` function unrolls unary/binary operations on vectors. The type of the result depends on the operation. You can use it to implement other operations.\n\n```c++\n    template \u003ctypename F, typename V, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr auto unroll(F func, V xs);\n\n    template \u003c\n        typename F, typename V0, typename V1,\n        typename = must_be_vector\u003cV0\u003e,\n        typename = must_be_vector\u003cV1\u003e,\n        typename = assert_same_size\u003cV0, V1\u003e\u003e\n    constexpr auto unroll(F func, V0 xs, V1 ys);\n```\n\nTo make lambdas more convenient to pass, two variants that take the function as the last argument are provided.\n\n```c++\n    template \u003ctypename F, typename V, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr auto unroll(V xs, F func);\n\n    template \u003c\n        typename F, typename V0, typename V1,\n        typename = must_be_vector\u003cV0\u003e,\n        typename = must_be_vector\u003cV1\u003e,\n        typename = assert_same_size\u003cV0, V1\u003e\u003e\n    constexpr auto unroll(V0 xs, V1 ys, F func);\n```\n\nSo you can write code like this:\n\n```c++\n    auto zs = unroll(xs, ys, [](auto a, auto b) {\n        return a * b;\n    });\n```\n\n### High-level Operations\n\n#### Arithmetic \u0026 Conversion Operations\n\nCurrently Pure SIMD supports +, -, *, /, %, ^, \u0026, |, ~, !, \u003c, \u003e, \u003c\u003c, \u003e\u003e, ==, !=, \u003c=, \u003e=, \u0026\u0026, ||, max, min, and cast operations.\n\nNote that \u003c, \u003e, ==, !=, \u003c= and \u003e= are not defined for tuples, because they would conflict with the operators that the C++ standard library already defines.\n\n#### Load \u0026 Store 
Operations\n\n`store_to` writes a vector's elements to contiguous memory locations.\n\n```c++\n    template \u003ctypename V, typename T, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr void store_to(V xs, T* dst);\n```\n\n`load_from` reads values from contiguous memory locations into a vector.\n\n```c++\n    template \u003ctypename V, typename T, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr V load_from(const T* src);\n```\n\n`scalar` constructs a vector from a scalar value.\n\n```c++\n    template \u003ctypename V, typename T, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr V scalar(T x);\n```\n\n`iota` constructs a vector holding an ascending sequence, that is, V{ start + step * 0, start + step * 1, ... }.\n\nYou can choose the type used for the indices 0, 1, ... (the `I` parameter in the declaration below) to avoid unnecessary type conversions.\n\n```c++\n    template \u003c\n        typename V, typename I = size_t,\n        typename T, typename S,\n        typename = must_be_vector\u003cV\u003e\u003e\n    constexpr V iota(T start, S step);\n```\n\n#### Scatter \u0026 Gather Operations\n\n`scatter_bits` constructs a vector from all bits of a scalar value, least significant bit first. \n\nFor instance, scatter_bits(0b01010111) =\u003e vector { 1, 1, 1, 0, 1, 0, 1, 0 }.\n\n`gather_bits` does the opposite of `scatter_bits`.\n\n```c++\n    template \u003ctypename V, typename T, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr V scatter_bits(T bits);\n\n    template \u003ctypename T, typename V, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr T gather_bits(V xs);\n```\n\n\n#### Helpers for unrolling loops \n\nWhen the number of iterations is not a multiple of your vectors' size, extra code is needed to handle the tail end. 
`unroll_loop` can do that for you.\n\n`unroll_loop` decomposes an irregular loop into a series of subloops with successively halved steps and generates a separate loop body for each of them.\n\n```c++\n    template \u003ctypename S, S MaxStep, typename I, typename F\u003e\n    constexpr auto unroll_loop(I start, S iterations, F func)\n        -\u003e decltype(func(std::integral_constant\u003cS, MaxStep\u003e {}, start), void());\n```\n\n`func` should be a callable object or a generic lambda, because it is used to generate the bodies of subloops of different sizes at compile time. `func` is passed two arguments. The first one is the step of the current subloop, which is usually used as the vector size in that subloop. The second one is the iteration index in the overall loop, which you can use to index into your data.\n\nFor example, suppose a loop runs from 0 up to (but not including) 15 and you want to vectorize it with vectors of size 4. Then you can write:\n\n```c++\n    // Use 4 as the maximum step.\n    // `step` takes the values 4, 2 and 1 at compile time.\n    // `i` takes the values 0, 4, 8, 12 and 14 at runtime.\n    unroll_loop\u003cint, 4\u003e(0, 15, [\u0026](auto step, int i) {\n        constexpr std::size_t vector_size = decltype(step)::value;\n        using fvec = vector\u003cfloat, vector_size\u003e;\n        ...\n    });\n```\n\n`unroll_loop` then generates three loops: one from 0 to 12 with a step of 4, one from 12 to 14 with a step of 2, and one from 14 to 15 with a step of 1.\n\nThe following functions work in a way similar to the corresponding ones in the C++ standard library.\n\n```c++\n    template \u003ctypename V, typename T, typename = must_be_vector\u003cV\u003e\u003e\n    constexpr T sum(V x, T init);\n\n    template \u003csize_t VectorSize, typename F, typename T, typename S\u003e\n    constexpr void transform(const S* src, size_t n, T* dst, F func);\n\n    template \u003csize_t VectorSize, typename F, typename T, typename S0, typename S1\u003e\n    constexpr void 
transform(const S0* src0, size_t n, const S1* src1, T* dst, F func);\n\n    template \u003csize_t VectorSize, typename T, typename S, typename F\u003e\n    constexpr auto accumulate(const S* src, size_t n, T init, F func);\n\n    template \u003csize_t VectorSize, typename T, typename S\u003e\n    constexpr auto accumulate(const S* src, size_t n, T init);\n\n    template \u003csize_t VectorSize, typename T, typename S1, typename S2, typename FAdd, typename FMultiply\u003e\n    constexpr auto inner_product(const S1* src1, size_t n, const S2* src2, T init, FAdd f_add, FMultiply f_multiply);\n\n    template \u003csize_t VectorSize, typename T, typename S1, typename S2\u003e\n    constexpr auto inner_product(const S1* src1, size_t n, const S2* src2, T init);\n```\n\nThe set of supported operations is still small, but it's easy to add new ones.\n\n## Example\n\nThe following code comes from [Practical SIMD Programming](http://www.cs.uu.nl/docs/vakken/magr/2017-2018/files/SIMD%20Tutorial.pdf), with some modifications to keep it simple and to avoid numeric errors. It's already quite well optimized and very compute-intensive.\n\n```c++\nvoid scalar_shader(int t, int* screen)\n{\n    for (int y = 0; y \u003c SCRHEIGHT; ++y) {\n\n        for (int x = 0; x \u003c SCRWIDTH; ++x, ++t) {\n            int ox = 0;\n            int oy = 0;\n\n            for (int i = 0; i \u003c 99; ++i) {\n                int px = ox;\n                int py = oy;\n                oy = -(py * py - px * px + t) % 10000079;\n                ox = -(px * py + py * px - t) % 10000019;\n            }\n\n            screen[x + y * SCRHEIGHT] = ox + oy;\n        }\n    }\n}\n```\n\nThe following code is the version rewritten with Pure SIMD. 
It's nearly identical to \nthe original one, except for some type specifications.\n\n```c++\ntemplate \u003cstd::size_t MaxVectorSize\u003e\nvoid pure_simd_shader(int t, int* screen)\n{\n    namespace psd = pure_simd;\n\n    for (int y = 0; y \u003c SCRHEIGHT; ++y) {\n        // `unroll_loop` will handle the tail end.\n        psd::unroll_loop\u003cMaxVectorSize\u003e(0, SCRWIDTH, [\u0026](auto step, int x) {\n            constexpr std::size_t vector_size = decltype(step)::value;\n            // Assumed declarations (missing from the original snippet),\n            // mirroring `int ox = 0; int oy = 0;` in the scalar version.\n            using ivec = psd::vector\u003cint, vector_size\u003e;\n            ivec ox = psd::scalar\u003civec\u003e(0);\n            ivec oy = psd::scalar\u003civec\u003e(0);\n            ivec vt = psd::iota\u003civec, int\u003e(t, 1);\n\n            for (int i = 0; i \u003c 99; ++i) {\n                ivec px = ox;\n                ivec py = oy;\n\n                oy = -(py * py - px * px + vt) % psd::scalar\u003civec\u003e(10000079);\n                ox = -(px * py + py * px - vt) % psd::scalar\u003civec\u003e(10000019);\n            }\n\n            psd::store_to(ox + oy, screen + x + y * SCRHEIGHT);\n\n            t += vector_size;\n        });\n    }\n}\n```\n\nHere are the results of a benchmark compiled with clang++ 9.0 using -O3 and -march=native and run on Ubuntu 18.04 with an Intel Core i7-9750H CPU. 
`pure_simd_shader` was tested with vectors of size 1, 2, ..., 128.\n\n```\n--------------------------------------------------------------------------------\nBenchmark                                      Time             CPU   Iterations\n--------------------------------------------------------------------------------\nBM_shader_scalar_shader_mean                 103 ms          103 ms           10\nBM_shader_pure_simd_shader_1_mean            103 ms          103 ms           10\nBM_shader_pure_simd_shader_2_mean           61.3 ms         61.3 ms           10\nBM_shader_pure_simd_shader_4_mean           50.3 ms         50.3 ms           10\nBM_shader_pure_simd_shader_8_mean           28.0 ms         28.0 ms           10\nBM_shader_pure_simd_shader_16_mean          16.2 ms         16.2 ms           10\nBM_shader_pure_simd_shader_32_mean          12.5 ms         12.5 ms           10\nBM_shader_pure_simd_shader_64_mean          10.9 ms         10.9 ms           10\nBM_shader_pure_simd_shader_128_mean          158 ms          158 ms           10\n```\n\nYou can see that the code using Pure SIMD got faster and faster as the vector size increased up to 64, where it ran more than 9 times faster than the scalar version (10.9 ms versus 103 ms). \n\nGenerally speaking, the larger the vectors you use, the better the performance you will get. But it's not a silver bullet: an unrolling factor that is too large will hurt the instruction cache. In the benchmark above, vectors of size 128 were slower than the scalar version.\n\n## Test and Benchmark\n\n**Note** that the library is header-only, but Conan is needed to run the tests and benchmarks.\n\n```shell\ncd pure_simd\nmkdir build \u0026\u0026 cd build\nconan install ..\ncmake ..\ncmake --build .\n./bin/test_pure_simd\n./bin/benchmark_pure_simd\n```\n\n## Development Status\n\nThis library has just taken its first baby step. 
It's still in the experimental stage, so the interfaces often change drastically.\n\n## To Do\n\n* Add more operations \u0026 documentation\n* Add examples \u0026 benchmarks\n* Keep behavior consistent across compilers\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Featingtomatoes%2Fpure_simd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Featingtomatoes%2Fpure_simd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Featingtomatoes%2Fpure_simd/lists"}