{"id":13439039,"url":"https://github.com/google/highway","last_synced_at":"2025-05-14T21:02:21.896Z","repository":{"id":37395823,"uuid":"206791328","full_name":"google/highway","owner":"google","description":"Performance-portable, length-agnostic SIMD with runtime dispatch","archived":false,"fork":false,"pushed_at":"2025-05-07T18:10:00.000Z","size":27993,"stargazers_count":4610,"open_issues_count":63,"forks_count":348,"subscribers_count":49,"default_branch":"master","last_synced_at":"2025-05-07T19:47:42.457Z","etag":null,"topics":["avx","avx-512","avx-instructions","avx2","avx512","intrinsics","neon","simd","simd-instructions","simd-intrinsics","simd-library","simd-parallelism","simd-programming","sse42","wasm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-09-06T12:41:23.000Z","updated_at":"2025-05-07T18:06:18.000Z","dependencies_parsed_at":"2023-10-13T17:15:56.421Z","dependency_job_id":"edbb706b-cd5e-4dc4-baac-bdf4de0c91c8","html_url":"https://github.com/google/highway","commit_stats":{"total_commits":2218,"total_committers":71,"mean_commits":"31.239436619718308","dds":"0.35617673579801623","last_synced_commit":"5cde138f2eb5adc2c48b3965ade527276dade891"},"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fhighway","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/google%2Fhighway/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fhighway/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fhighway/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google","download_url":"https://codeload.github.com/google/highway/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254049588,"owners_count":22006097,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avx","avx-512","avx-instructions","avx2","avx512","intrinsics","neon","simd","simd-instructions","simd-intrinsics","simd-library","simd-parallelism","simd-programming","sse42","wasm"],"created_at":"2024-07-31T03:01:10.648Z","updated_at":"2025-05-14T21:02:21.398Z","avatar_url":"https://github.com/google.png","language":"C++","funding_links":[],"categories":["HarmonyOS","C++","Software","Repos","Tools"],"sub_categories":["Windows Manager","Trends"],"readme":"# Efficient and performance-portable vector software\n\n[//]: # (placeholder, do not remove)\n\nHighway is a C++ library that provides portable SIMD/vector intrinsics.\n\n[Documentation](https://google.github.io/highway/en/master/)\n\nPreviously licensed under Apache 2, now dual-licensed as Apache 2 / BSD-3.\n\n## Why\n\nWe are passionate about high-performance software. We see major untapped\npotential in CPUs (servers, mobile, desktops). 
Highway is for engineers who want\nto reliably and economically push the boundaries of what is possible in\nsoftware.\n\n## How\n\nCPUs provide SIMD/vector instructions that apply the same operation to multiple\ndata items. This can reduce energy usage e.g. *fivefold* because fewer\ninstructions are executed. We also often see *5-10x* speedups.\n\nHighway makes SIMD/vector programming practical and workable according to these\nguiding principles:\n\n**Does what you expect**: Highway is a C++ library with carefully-chosen\nfunctions that map well to CPU instructions without extensive compiler\ntransformations. The resulting code is more predictable and robust to code\nchanges/compiler updates than autovectorization.\n\n**Works on widely-used platforms**: Highway supports five architectures; the\nsame application code can target various instruction sets, including those with\n'scalable' vectors (size unknown at compile time). Highway only requires C++11\nand supports four families of compilers. If you would like to use Highway on\nother platforms, please raise an issue.\n\n**Flexible to deploy**: Applications using Highway can run on heterogeneous\nclouds or client devices, choosing the best available instruction set at\nruntime. Alternatively, developers may choose to target a single instruction set\nwithout any runtime overhead. In both cases, the application code is the same\nexcept for swapping `HWY_STATIC_DISPATCH` with `HWY_DYNAMIC_DISPATCH` plus one\nline of code. See also @kfjahnke's\n[introduction to dispatching](https://github.com/kfjahnke/zimt/blob/multi_isa/examples/multi_isa_example/multi_simd_isa.md).\n\n**Suitable for a variety of domains**: Highway provides an extensive set of\noperations, used for image processing (floating-point), compression, video\nanalysis, linear algebra, cryptography, sorting and random generation. We\nrecognise that new use-cases may require additional ops and are happy to add\nthem where it makes sense (e.g. 
no performance cliffs on some architectures). If\nyou would like to discuss, please file an issue.\n\n**Rewards data-parallel design**: Highway provides tools such as Gather,\nMaskedLoad, and FixedTag to enable speedups for legacy data structures. However,\nthe biggest gains are unlocked by designing algorithms and data structures for\nscalable vectors. Helpful techniques include batching, structure-of-array\nlayouts, and aligned/padded allocations.\n\nWe recommend these resources for getting started:\n\n-   [SIMD for C++ Developers](http://const.me/articles/simd/simd.pdf)\n-   [Algorithms for Modern Hardware](https://en.algorithmica.org/hpc/)\n-   [Optimizing software in C++](https://agner.org/optimize/optimizing_cpp.pdf)\n-   [Improving performance with SIMD intrinsics in three use cases](https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/)\n\n## Examples\n\nOnline demos using Compiler Explorer:\n\n-   [multiple targets with dynamic dispatch](https://gcc.godbolt.org/z/KM3ben7ET)\n    (more complicated, but flexible and uses best available SIMD)\n-   [single target using -m flags](https://gcc.godbolt.org/z/rGnjMevKG)\n    (simpler, but requires/only uses the instruction set enabled by compiler\n    flags)\n\nWe observe that Highway is referenced in the following open source projects,\nfound via sourcegraph.com. Most are GitHub repositories. 
If you would like to\nadd your project or link to it directly, feel free to raise an issue or contact\nus via the below email.\n\n*   Audio: [Zimtohrli perceptual metric](https://github.com/google/zimtohrli)\n*   Browsers: Chromium (+Vivaldi), Firefox (+floorp / foxhound / librewolf /\n    Waterfox)\n*   Computational biology: [RNA analysis](https://github.com/bnprks/BPCells)\n*   Computer graphics: [Sparse voxel renderer](https://github.com/rools/voxl)\n*   Cryptography: google/distributed_point_functions, google/shell-encryption\n*   Data structures: bkille/BitLib\n*   Image codecs: eustas/2im,\n    [Grok JPEG 2000](https://github.com/GrokImageCompression/grok),\n    [JPEG XL](https://github.com/libjxl/libjxl),\n    [JPEGenc](https://github.com/osamu620/JPEGenc),\n    [Jpegli](https://github.com/google/jpegli), OpenHTJ2K\n*   Image processing: cloudinary/ssimulacra2, m-ab-s/media-autobuild_suite,\n    [libvips](https://github.com/libvips/libvips)\n*   Image viewers: AlienCowEatCake/ImageViewer, diffractor/diffractor,\n    mirillis/jpegxl-wic,\n    [Lux panorama/image viewer](https://bitbucket.org/kfj/pv/)\n*   Information retrieval:\n    [iresearch database index](https://github.com/iresearch-toolkit/iresearch),\n    michaeljclark/zvec,\n    [nebula interactive analytics / OLAP](https://github.com/varchar-io/nebula),\n    [ScaNN Scalable Nearest Neighbors](https://github.com/google-research/google-research/tree/7a269cb2ce0ae1db591fe11b62cbc0be7d72532a/scann),\n    [vectorlite vector search](https://github.com/1yefuwang1/vectorlite/)\n*   Machine learning: [gemma.cpp](https://github.com/google/gemma.cpp),\n    Tensorflow, Numpy, zpye/SimpleInfer\n*   Robotics:\n    [MIT Model-Based Design and Verification](https://github.com/RobotLocomotion/drake)\n\nOther\n\n*   [Evaluation of C++ SIMD Libraries](https://www.mnm-team.org/pub/Fopras/rock23/):\n    \"Highway excelled with a strong performance across multiple SIMD extensions\n    [..]. 
Thus, Highway may currently be the most suitable SIMD library for many\n    software projects.\"\n*   [zimt](https://github.com/kfjahnke/zimt): C++11 template library to process n-dimensional arrays with multi-threaded SIMD code\n*   [vectorized Quicksort](https://github.com/google/highway/tree/master/hwy/contrib/sort) ([paper](https://arxiv.org/abs/2205.05982))\n\nIf you'd like to get Highway, in addition to cloning from this GitHub repository\nor using it as a Git submodule, you can also find it in the following package\nmanagers or repositories:\n\n*   alpinelinux\n*   conan-io\n*   conda-forge\n*   DragonFlyBSD\n*   fd00/yacp\n*   freebsd\n*   getsolus/packages\n*   ghostbsd\n*   microsoft/vcpkg\n*   MidnightBSD\n*   MSYS2\n*   NetBSD\n*   openSUSE\n*   opnsense\n*   Xilinx/Vitis_Libraries\n*   xmake-io/xmake-repo\n\nSee also the list at https://repology.org/project/highway-simd-library/versions\n.\n\n## Current status\n\n### Targets\n\nHighway supports 24 targets, listed in alphabetical order of platform:\n\n-   Any: `EMU128`, `SCALAR`;\n-   Armv7+: `NEON_WITHOUT_AES`, `NEON`, `NEON_BF16`, `SVE`, `SVE2`, `SVE_256`,\n    `SVE2_128`;\n-   IBM Z: `Z14`, `Z15`;\n-   POWER: `PPC8` (v2.07), `PPC9` (v3.0), `PPC10` (v3.1B, not yet supported due\n    to compiler bugs, see #1207; also requires QEMU 7.2);\n-   RISC-V: `RVV` (1.0);\n-   WebAssembly: `WASM`, `WASM_EMU256` (a 2x unrolled version of wasm128,\n    enabled if `HWY_WANT_WASM2` is defined. 
This will remain supported until it\n    is potentially superseded by a future version of WASM.);\n-   x86:\n    -   `SSE2`\n    -   `SSSE3` (~Intel Core)\n    -   `SSE4` (~Nehalem, also includes AES + CLMUL).\n    -   `AVX2` (~Haswell, also includes BMI2 + F16 + FMA)\n    -   `AVX3` (~Skylake, AVX-512F/BW/CD/DQ/VL)\n    -   `AVX3_DL` (~Icelake, includes `BitAlg` + `CLMUL` + `GFNI` + `VAES` +\n        `VBMI` + `VBMI2` + `VNNI` + `VPOPCNT`),\n    -   `AVX3_ZEN4` (AVX3_DL plus BF16, optimized for AMD Zen4; requires opt-in\n        by defining `HWY_WANT_AVX3_ZEN4` if compiling for static dispatch, but\n        enabled by default for runtime dispatch),\n    -   `AVX3_SPR` (~Sapphire Rapids, includes AVX-512FP16)\n\nOur policy is that unless otherwise specified, targets will remain supported as\nlong as they can be (cross-)compiled with currently supported Clang or GCC, and\ntested using QEMU. If the target can be compiled with LLVM trunk and tested\nusing our version of QEMU without extra flags, then it is eligible for inclusion\nin our continuous testing infrastructure. Otherwise, the target will be manually\ntested before releases with selected versions/configurations of Clang and GCC.\n\nSVE was initially tested using farm_sve (see acknowledgments).\n\n### Versioning\n\nHighway releases aim to follow the semver.org system (MAJOR.MINOR.PATCH),\nincrementing MINOR after backward-compatible additions and PATCH after\nbackward-compatible fixes. 
We recommend using releases (rather than the Git tip)\nbecause they are tested more extensively, see below.\n\nThe current version 1.0 signals an increased focus on backwards compatibility.\nApplications using documented functionality will remain compatible with future\nupdates that have the same major version number.\n\n### Testing\n\nContinuous integration tests build with a recent version of Clang (running on\nnative x86, or QEMU for RISC-V and Arm) and MSVC 2019 (v19.28, running on native\nx86).\n\nBefore releases, we also test on x86 with Clang and GCC, and Armv7/8 via GCC\ncross-compile. See the [testing process](g3doc/release_testing_process.md) for\ndetails.\n\n### Related modules\n\nThe `contrib` directory contains SIMD-related utilities: an image class with\naligned rows, a math library (16 functions already implemented, mostly\ntrigonometry), and functions for computing dot products and sorting.\n\n### Other libraries\n\nIf you only require x86 support, you may also use Agner Fog's\n[VCL vector class library](https://github.com/vectorclass). It includes many\nfunctions including a complete math library.\n\nIf you have existing code using x86/NEON intrinsics, you may be interested in\n[SIMDe](https://github.com/simd-everywhere/simde), which emulates those\nintrinsics using other platforms' intrinsics or autovectorization.\n\n## Installation\n\nThis project uses CMake to generate and build. 
In a Debian-based system you can\ninstall it via:\n\n```bash\nsudo apt install cmake\n```\n\nHighway's unit tests use [googletest](https://github.com/google/googletest).\nBy default, Highway's CMake downloads this dependency at configuration time.\nYou can avoid this by setting the `HWY_SYSTEM_GTEST` CMake variable to ON and\ninstalling gtest separately:\n\n```bash\nsudo apt install libgtest-dev\n```\n\nAlternatively, you can define `HWY_TEST_STANDALONE=1` and remove all occurrences\nof `gtest_main` in each BUILD file, then tests avoid the dependency on GUnit.\n\nRunning cross-compiled tests requires support from the OS, which on Debian is\nprovided by the `qemu-user-binfmt` package.\n\nTo build Highway as a shared or static library (depending on BUILD_SHARED_LIBS),\nthe standard CMake workflow can be used:\n\n```bash\nmkdir -p build \u0026\u0026 cd build\ncmake ..\nmake -j \u0026\u0026 make test\n```\n\nOr you can run `run_tests.sh` (`run_tests.bat` on Windows).\n\nBazel is also supported for building, but it is not as widely used/tested.\n\nWhen building for Armv7, a limitation of current compilers requires you to add\n`-DHWY_CMAKE_ARM7:BOOL=ON` to the CMake command line; see #834 and #1032. We\nunderstand that work is underway to remove this limitation.\n\nBuilding on 32-bit x86 is not officially supported, and AVX2/3 are disabled by\ndefault there. Note that johnplatts has successfully built and run the Highway\ntests on 32-bit x86, including AVX2/3, on GCC 7/8 and Clang 8/11/12. On Ubuntu\n22.04, Clang 11 and 12, but not later versions, require extra compiler flags\n`-m32 -isystem /usr/i686-linux-gnu/include`. Clang 10 and earlier require the\nabove plus `-isystem /usr/i686-linux-gnu/include/c++/12/i686-linux-gnu`. 
See\n#1279.\n\n## Building highway - Using vcpkg\n\nhighway is now available in [vcpkg](https://github.com/Microsoft/vcpkg)\n\n```bash\nvcpkg install highway\n```\n\nThe highway port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.\n\n## Quick start\n\nYou can use the `benchmark` inside examples/ as a starting point.\n\nA [quick-reference page](g3doc/quick_reference.md) briefly lists all operations\nand their parameters, and the [instruction_matrix](g3doc/instruction_matrix.pdf)\nindicates the number of instructions per operation.\n\nThe [FAQ](g3doc/faq.md) answers questions about portability, API design and\nwhere to find more information.\n\nWe recommend using full SIMD vectors whenever possible for maximum performance\nportability. To obtain them, pass a `ScalableTag\u003cfloat\u003e` (or equivalently\n`HWY_FULL(float)`) tag to functions such as `Zero/Set/Load`. There are two\nalternatives for use-cases requiring an upper bound on the lanes:\n\n-   For up to `N` lanes, specify `CappedTag\u003cT, N\u003e` or the equivalent\n    `HWY_CAPPED(T, N)`. The actual number of lanes will be `N` rounded down to\n    the nearest power of two, such as 4 if `N` is 5, or 8 if `N` is 8. This is\n    useful for data structures such as a narrow matrix. A loop is still required\n    because vectors may actually have fewer than `N` lanes.\n\n-   For exactly a power of two `N` lanes, specify `FixedTag\u003cT, N\u003e`. 
The largest\n    supported `N` depends on the target, but is guaranteed to be at least\n    `16/sizeof(T)`.\n\nDue to ADL restrictions, user code calling Highway ops must either:\n\n*   Reside inside `namespace hwy { namespace HWY_NAMESPACE {`; or\n*   prefix each op with an alias such as `namespace hn = hwy::HWY_NAMESPACE;\n    hn::Add()`; or\n*   add using-declarations for each op used: `using hwy::HWY_NAMESPACE::Add;`.\n\nAdditionally, each function that calls Highway ops (such as `Load`) must either\nbe prefixed with `HWY_ATTR`, OR reside between `HWY_BEFORE_NAMESPACE()` and\n`HWY_AFTER_NAMESPACE()`. Lambda functions currently require `HWY_ATTR` before\ntheir opening brace.\n\nDo not use namespace-scope nor `static` initializers for SIMD vectors because\nthis can cause SIGILL when using runtime dispatch and the compiler chooses an\ninitializer compiled for a target not supported by the current CPU. Instead,\nconstants initialized via `Set` should generally be local (const) variables.\n\nThe entry points into code using Highway differ slightly depending on whether\nthey use static or dynamic dispatch. In both cases, we recommend that the\ntop-level function receives one or more pointers to arrays, rather than\ntarget-specific vector types.\n\n*   For static dispatch, `HWY_TARGET` will be the best available target among\n    `HWY_BASELINE_TARGETS`, i.e. those allowed for use by the compiler (see\n    [quick-reference](g3doc/quick_reference.md)). Functions inside\n    `HWY_NAMESPACE` can be called using `HWY_STATIC_DISPATCH(func)(args)` within\n    the same module they are defined in. You can call the function from other\n    modules by wrapping it in a regular function and declaring the regular\n    function in a header.\n\n*   For dynamic dispatch, a table of function pointers is generated via the\n    `HWY_EXPORT` macro that is used by `HWY_DYNAMIC_DISPATCH(func)(args)` to\n    call the best function pointer for the current CPU's supported targets. 
A\n    module is automatically compiled for each target in `HWY_TARGETS` (see\n    [quick-reference](g3doc/quick_reference.md)) if `HWY_TARGET_INCLUDE` is\n    defined and `foreach_target.h` is included. Note that the first invocation\n    of `HWY_DYNAMIC_DISPATCH`, or each call to the pointer returned by the first\n    invocation of `HWY_DYNAMIC_POINTER`, involves some CPU detection overhead.\n    You can prevent this by calling the following before any invocation of\n    `HWY_DYNAMIC_*`: `hwy::GetChosenTarget().Update(hwy::SupportedTargets());`.\n\nSee also a separate\n[introduction to dynamic dispatch](https://github.com/kfjahnke/zimt/blob/multi_isa/examples/multi_isa_example/multi_simd_isa.md)\nby @kfjahnke.\n\nWhen using dynamic dispatch, `foreach_target.h` is included from translation\nunits (.cc files), not headers. Headers containing vector code shared between\nseveral translation units require a special include guard, for example the\nfollowing taken from `examples/skeleton-inl.h`:\n\n```\n#if defined(HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_) == defined(HWY_TARGET_TOGGLE)\n#ifdef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_\n#undef HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_\n#else\n#define HIGHWAY_HWY_EXAMPLES_SKELETON_INL_H_\n#endif\n\n#include \"hwy/highway.h\"\n// Your vector code\n#endif\n```\n\nBy convention, we name such headers `-inl.h` because their contents (often\nfunction templates) are usually inlined.\n\n## Compiler flags\n\nApplications should be compiled with optimizations enabled. Without inlining\nSIMD code may slow down by factors of 10 to 100. For clang and GCC, `-O2` is\ngenerally sufficient.\n\nFor MSVC, we recommend compiling with `/Gv` to allow non-inlined functions to\npass vector arguments in registers. If intending to use the AVX2 target together\nwith half-width vectors (e.g. for `PromoteTo`), it is also important to compile\nwith `/arch:AVX2`. This seems to be the only way to reliably generate\nVEX-encoded SSE instructions on MSVC. 
MSVC sometimes generates VEX-encoded SSE\ninstructions when they are mixed with AVX, but not always; see\n[DevCom-10618264](https://developercommunity.visualstudio.com/t/10618264).\nOtherwise, mixing VEX-encoded AVX2 instructions and non-VEX SSE may cause severe\nperformance degradation. Unfortunately, with the `/arch:AVX2` option, the resulting\nbinary will then require AVX2. Note that no such flag is needed for clang and\nGCC because they support target-specific attributes, which we use to ensure\nproper VEX code generation for AVX2 targets.\n\n## Strip-mining loops\n\nWhen vectorizing a loop, an important question is whether and how to deal with\na number of iterations ('trip count', denoted `count`) that is not a multiple\nof the vector size `N = Lanes(d)`. For example, it may be necessary to avoid\nwriting past the end of an array.\n\nIn this section, let `T` denote the element type and `d = ScalableTag\u003cT\u003e`.\nAssume the loop body is given as a function `template\u003cbool partial, class D\u003e\nvoid LoopBody(D d, size_t index, size_t max_n)`.\n\n\"Strip-mining\" is a technique for vectorizing a loop by transforming it into an\nouter loop and inner loop, such that the number of iterations in the inner loop\nmatches the vector width. Then, the inner loop is replaced with vector\noperations.\n\nHighway offers several strategies for loop vectorization:\n\n*   Ensure all inputs/outputs are padded. Then the (outer) loop is simply\n\n    ```\n    for (size_t i = 0; i \u003c count; i += N) LoopBody\u003cfalse\u003e(d, i, 0);\n    ```\n    Here, the template parameter and third function argument are not needed.\n\n    This is the preferred option, unless `N` is in the thousands and vector\n    operations are pipelined with long latencies. 
This was the case for\n    supercomputers in the 90s, but nowadays ALUs are cheap and we see most\n    implementations split vectors into 1, 2 or 4 parts, so there is little cost\n    to processing entire vectors even if we do not need all their lanes. Indeed,\n    this avoids the (potentially large) cost of predication or partial\n    loads/stores on older targets, and does not duplicate code.\n\n*   Process whole vectors and include previously processed elements\n    in the last vector:\n    ```\n    for (size_t i = 0; i \u003c count; i += N) LoopBody\u003cfalse\u003e(d, HWY_MIN(i, count - N), 0);\n    ```\n\n    This is the second preferred option provided that `count \u003e= N`\n    and `LoopBody` is idempotent. Some elements might be processed twice, but\n    a single code path and full vectorization are usually worth it. Even if\n    `count \u003c N`, it usually makes sense to pad inputs/outputs up to `N`.\n\n*   Use the `Transform*` functions in hwy/contrib/algo/transform-inl.h. This\n    takes care of the loop and remainder handling and you simply define a\n    generic lambda function (C++14) or functor which receives the current vector\n    from the input/output array, plus optionally vectors from up to two extra\n    input arrays, and returns the value to write to the input/output array.\n\n    Here is an example implementing the BLAS function SAXPY (`alpha * x + y`):\n\n    ```\n    Transform1(d, x, n, y, [](auto d, const auto v, const auto v1) HWY_ATTR {\n      return MulAdd(Set(d, alpha), v, v1);\n    });\n    ```\n\n*   Process whole vectors as above, followed by a scalar loop:\n\n    ```\n    size_t i = 0;\n    for (; i + N \u003c= count; i += N) LoopBody\u003cfalse\u003e(d, i, 0);\n    for (; i \u003c count; ++i) LoopBody\u003cfalse\u003e(CappedTag\u003cT, 1\u003e(), i, 0);\n    ```\n    The template parameter and third function argument are again not needed.\n\n    This avoids duplicating code, and is reasonable if `count` is large.\n    If `count` 
is small, the second loop may be slower than the next option.\n\n*   Process whole vectors as above, followed by a single call to a modified\n    `LoopBody` with masking:\n\n    ```\n    size_t i = 0;\n    for (; i + N \u003c= count; i += N) {\n      LoopBody\u003cfalse\u003e(d, i, 0);\n    }\n    if (i \u003c count) {\n      LoopBody\u003ctrue\u003e(d, i, count - i);\n    }\n    ```\n    Now the template parameter and third function argument can be used inside\n    `LoopBody` to non-atomically 'blend' the first `num_remaining` lanes of `v`\n    with the previous contents of memory at subsequent locations:\n    `BlendedStore(v, FirstN(d, num_remaining), d, pointer);`. Similarly,\n    `MaskedLoad(FirstN(d, num_remaining), d, pointer)` loads the first\n    `num_remaining` elements and returns zero in other lanes.\n\n    This is a good default when it is infeasible to ensure vectors are padded,\n    but is only safe `#if !HWY_MEM_OPS_MIGHT_FAULT`!\n    In contrast to the scalar loop, only a single final iteration is needed.\n    The increased code size from two loop bodies is expected to be worthwhile\n    because it avoids the cost of masking in all but the final iteration.\n\n## Additional resources\n\n*   [Highway introduction (slides)](g3doc/highway_intro.pdf)\n*   [Overview of instructions per operation on different architectures](g3doc/instruction_matrix.pdf)\n*   [Design philosophy and comparison](g3doc/design_philosophy.md)\n*   [Implementation details](g3doc/impl_details.md)\n\n## Acknowledgments\n\nWe have used [farm-sve](https://gitlab.inria.fr/bramas/farm-sve) by Berenger\nBramas; it has proved useful for checking the SVE port on an x86 development\nmachine.\n\nThis is not an officially supported Google product.\nContact: 
janwas@google.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fhighway","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle%2Fhighway","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fhighway/lists"}