{"id":13377120,"url":"https://github.com/google/XNNPACK","last_synced_at":"2025-03-13T03:30:38.832Z","repository":{"id":37213086,"uuid":"208364128","full_name":"google/XNNPACK","owner":"google","description":"High-efficiency floating-point neural network inference operators for mobile, server, and Web","archived":false,"fork":false,"pushed_at":"2025-03-11T10:05:43.000Z","size":170631,"stargazers_count":1978,"open_issues_count":194,"forks_count":401,"subscribers_count":52,"default_branch":"master","last_synced_at":"2025-03-11T10:41:07.246Z","etag":null,"topics":["convolutional-neural-network","convolutional-neural-networks","cpu","inference","inference-optimization","matrix-multiplication","mobile-inference","multithreading","neural-network","neural-networks","simd"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-13T23:48:37.000Z","updated_at":"2025-03-11T10:05:10.000Z","dependencies_parsed_at":"2023-10-12T07:04:09.687Z","dependency_job_id":"1c643888-a71e-4269-9986-d39b9721187c","html_url":"https://github.com/google/XNNPACK","commit_stats":{"total_commits":5472,"total_committers":66,"mean_commits":82.9090909090909,"dds":0.610014619883041,"last_synced_commit":"0caafc62db1e2c3a281ccf34499a4f822a2ca67b"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2FXNNPACK","tags_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/google%2FXNNPACK/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2FXNNPACK/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2FXNNPACK/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google","download_url":"https://codeload.github.com/google/XNNPACK/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243335017,"owners_count":20274895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convolutional-neural-network","convolutional-neural-networks","cpu","inference","inference-optimization","matrix-multiplication","mobile-inference","multithreading","neural-network","neural-networks","simd"],"created_at":"2024-07-30T06:01:16.000Z","updated_at":"2025-03-13T03:30:35.938Z","avatar_url":"https://github.com/google.png","language":"C","readme":"# XNNPACK\n\nXNNPACK is a highly optimized solution for neural network inference on ARM, x86, WebAssembly, and RISC-V platforms. 
XNNPACK is not intended for direct use by deep learning practitioners and researchers; instead it provides low-level performance primitives for accelerating high-level machine learning frameworks, such as [TensorFlow Lite](https://www.tensorflow.org/lite), [TensorFlow.js](https://www.tensorflow.org/js), [PyTorch](https://pytorch.org/), [ONNX Runtime](https://onnxruntime.ai), and [MediaPipe](https://mediapipe.dev).\n\n## Supported Architectures\n\n- ARM64 on Android, iOS, macOS, Linux, and Windows\n- ARMv7 (with NEON) on Android\n- ARMv6 (with VFPv2) on Linux\n- x86 and x86-64 (up to AVX512) on Windows, Linux, macOS, Android, and iOS simulator\n- WebAssembly MVP\n- WebAssembly SIMD\n- [WebAssembly Relaxed SIMD](https://github.com/WebAssembly/relaxed-simd) (experimental)\n- RISC-V (RV32GC and RV64GC)\n\n## Operator Coverage\n\nXNNPACK implements the following neural network operators:\n\n- 2D Convolution (including grouped and depthwise)\n- 2D Deconvolution (AKA Transposed Convolution)\n- 2D Average Pooling\n- 2D Max Pooling\n- 2D ArgMax Pooling (Max Pooling + indices)\n- 2D Unpooling\n- 2D Bilinear Resize\n- 2D Depth-to-Space (AKA Pixel Shuffle)\n- Add (including broadcasting, two inputs only)\n- Subtract (including broadcasting)\n- Divide (including broadcasting)\n- Maximum (including broadcasting)\n- Minimum (including broadcasting)\n- Multiply (including broadcasting)\n- Squared Difference (including broadcasting)\n- Global Average Pooling\n- Channel Shuffle\n- Fully Connected\n- Abs (absolute value)\n- Bankers' Rounding (rounding to nearest, ties to even)\n- Ceiling (rounding to integer above)\n- Clamp (includes ReLU and ReLU6)\n- Convert (includes fixed-point and half-precision quantization and\n  dequantization)\n- Copy\n- ELU\n- Floor (rounding to integer below)\n- HardSwish\n- Leaky ReLU\n- Negate\n- Sigmoid\n- Softmax\n- Square\n- Tanh\n- Transpose\n- Truncation (rounding to integer towards zero)\n- PReLU\n\nAll operators in XNNPACK support NHWC layout, but 
additionally allow custom stride along the **C**hannel dimension. Thus, operators can consume a subset of channels in the input tensor, and produce a subset of channels in the output tensor, providing zero-cost Channel Split and Channel Concatenation operations.\n\n## Performance\n\n### Mobile phones\n\nThe table below presents **single-threaded** performance of the XNNPACK library on three generations of MobileNet models and three generations of Pixel phones.\n\n| Model                   | Pixel, ms | Pixel 2, ms | Pixel 3a, ms |\n| ----------------------- | :-------: | :---------: | :----------: |\n| FP32 MobileNet v1 1.0X  |    82     |      86     |      88      |\n| FP32 MobileNet v2 1.0X  |    49     |      53     |      55      |\n| FP32 MobileNet v3 Large |    39     |      42     |      44      |\n| FP32 MobileNet v3 Small |    12     |      14     |      14      |\n\nThe following table presents **multi-threaded** (using as many threads as there are big cores) performance of the XNNPACK library on three generations of MobileNet models and three generations of Pixel phones.\n\n| Model                   | Pixel, ms | Pixel 2, ms | Pixel 3a, ms |\n| ----------------------- | :-------: | :---------: | :----------: |\n| FP32 MobileNet v1 1.0X  |    43     |      27     |      46      |\n| FP32 MobileNet v2 1.0X  |    26     |      18     |      28      |\n| FP32 MobileNet v3 Large |    22     |      16     |      24      |\n| FP32 MobileNet v3 Small |     7     |       6     |       8      |\n\nBenchmarked on March 27, 2020 with `end2end_bench --benchmark_min_time=5` on an Android/ARM64 build with Android NDK r21 (`bazel build -c opt --config android_arm64 :end2end_bench`) and neural network models with randomized weights and inputs.\n\n### Raspberry Pi\n\nThe table below presents **multi-threaded** performance of the XNNPACK library on three generations of MobileNet models and four generations of Raspberry Pi boards.\n\n| Model                   | RPi Zero W 
(BCM2835), ms | RPi 2 (BCM2836), ms | RPi 3+ (BCM2837B0), ms | RPi 4 (BCM2711), ms | RPi 4 (BCM2711, ARM64), ms |\n| ----------------------- | :----------------------: | :-----------------: | :--------------------: | :-----------------: | :------------------------: |\n| FP32 MobileNet v1 1.0X  |          3919            |         302         |          114           |          72         |             77             |\n| FP32 MobileNet v2 1.0X  |          1987            |         191         |           79           |          41         |             46             |\n| FP32 MobileNet v3 Large |          1658            |         161         |           67           |          38         |             40             |\n| FP32 MobileNet v3 Small |           474            |          50         |           22           |          13         |             15             |\n| INT8 MobileNet v1 1.0X  |          2589            |         128         |           46           |          29         |             24             |\n| INT8 MobileNet v2 1.0X  |          1495            |          82         |           30           |          20         |             17             |\n\nBenchmarked on Feb 8, 2022 with `end2end-bench --benchmark_min_time=5` on a Raspbian Buster build with CMake (`./scripts/build-local.sh`) and neural network models with randomized weights and inputs. INT8 inference was evaluated with a per-channel quantization scheme.\n\n## Minimum build requirements\n\n- C11\n- C++14\n- Python 3\n\n## Publications\n\n- Marat Dukhan \"The Indirect Convolution Algorithm\". 
Presented at the [Efficient Deep Learning for Computer Vision (ECV) 2019](https://sites.google.com/corp/view/ecv2019/) workshop ([slides](https://drive.google.com/file/d/1ZayB3By5ZxxQIRtN7UDq_JvPg1IYd3Ac/view), [paper on ArXiv](https://arxiv.org/abs/1907.02129)).\n- Erich Elsen, Marat Dukhan, Trevor Gale, Karen Simonyan \"Fast Sparse ConvNets\".\n  [Paper on ArXiv](https://arxiv.org/abs/1911.09723), [pre-trained sparse\n  models](https://github.com/google-research/google-research/tree/master/fastconvnets).\n- Marat Dukhan, Artsiom Ablavatski \"The Two-Pass Softmax Algorithm\".\n  [Paper on ArXiv](https://arxiv.org/abs/2001.04438).\n- Yury Pisarchyk, Juhyun Lee \"Efficient Memory Management for Deep Neural Net Inference\".\n  [Paper on ArXiv](https://arxiv.org/abs/2001.03288).\n\n## Ecosystem\n\n### Machine Learning Frameworks\n\n- [TensorFlow Lite](https://blog.tensorflow.org/2020/07/accelerating-tensorflow-lite-xnnpack-integration.html).\n- [TensorFlow.js WebAssembly backend](https://blog.tensorflow.org/2020/03/introducing-webassembly-backend-for-tensorflow-js.html).\n- [PyTorch Mobile](https://pytorch.org/mobile).\n- [ONNX Runtime Mobile](https://onnxruntime.ai/docs/execution-providers/Xnnpack-ExecutionProvider.html).\n- [MediaPipe for the Web](https://developers.googleblog.com/2020/01/mediapipe-on-web.html).\n- [Alibaba HALO (Heterogeneity-Aware Lowering and Optimization)](https://github.com/alibaba/heterogeneity-aware-lowering-and-optimization).\n- [Samsung ONE (On-device Neural Engine)](https://github.com/Samsung/ONE).\n\n## Acknowledgements\n\nXNNPACK is based on the [QNNPACK](https://github.com/pytorch/QNNPACK) library. 
Over time, its codebase has diverged significantly, and the XNNPACK API is no longer compatible with QNNPACK's.\n","funding_links":[],"categories":["C","Misc","⚙️ **Compilers \u0026 Low-Level Frameworks**"],"sub_categories":["⚡ **XNNPack** - Google"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2FXNNPACK","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle%2FXNNPACK","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2FXNNPACK/lists"}