Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/WojciechMula/sse-popcount
SIMD (SSE) population count --- http://0x80.pl/articles/sse-popcount.html
https://github.com/WojciechMula/sse-popcount
aarch64 arm-neon avx2 avx512 popcount sse
Last synced: 16 days ago
JSON representation
SIMD (SSE) population count --- http://0x80.pl/articles/sse-popcount.html
- Host: GitHub
- URL: https://github.com/WojciechMula/sse-popcount
- Owner: WojciechMula
- License: bsd-2-clause
- Created: 2015-04-04T12:32:50.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2024-04-01T19:14:51.000Z (8 months ago)
- Last Synced: 2024-05-31T19:54:45.706Z (6 months ago)
- Topics: aarch64, arm-neon, avx2, avx512, popcount, sse
- Language: C++
- Size: 299 KB
- Stars: 313
- Watchers: 30
- Forks: 48
- Open Issues: 0
-
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-simd - sse-popcount - SIMD (SSE) population count (Parsing)
README
========================================================================
SIMD popcount
========================================================================Sample programs for my article http://0x80.pl/articles/sse-popcount.html
.. image:: https://travis-ci.org/WojciechMula/sse-popcount.svg?branch=master
:target: https://travis-ci.org/WojciechMula/sse-popcountPaper
------------------------------------------------------------------------Daniel Lemire, Nathan Kurz and I published an article
`Faster Population Counts using AVX2 Instructions`__.__ https://arxiv.org/abs/1611.07612
Introduction
------------------------------------------------------------------------Subdirectory **original** contains code from 2008 --- it is 32-bit
and GCC-centric. The **root directory** contains fresh C++11 code,
written with intrinsics and tested on 64-bit machines.There are two programs:
* ``verify`` --- it tests if all non-lookup implementations counts
bits properly;
* ``speed`` --- benchmarks different implementations of popcount
procedure; please read help to find all options (run the program
without arguments).There are several targets:
* **default** --- builtin functions, SSE and popcnt instructions;
* **AVX2** --- all above plus AVX2 implementations;
* **AVX512BW** --- all above plus experimental AVX512BW code;
* **AVX512VBMI** --- all above plus experimental AVX512VBMI code;
* **AVX512 VPOPCNT** --- all above plus experimental AVX512 VPOPCNT
code (should be compilable with very recent GCC__, software emulator
doesn't support this extension yet);
* **arm** --- builtin and ARM Neon implementations.Type ``make help`` to find out details. To run the default target
benchmark simply type ``make``.__ https://github.com/gcc-mirror/gcc/commit/e0aa57d6b04908affdf4655a6b4a9f2d4d03483b
Available implementations
------------------------------------------------------------------------+---------------------------------------+------------------------------------------------------------------+
| procedure | description |
+=======================================+==================================================================+
| lookup-8 | lookup in std::uint8_t[256] LUT |
+---------------------------------------+------------------------------------------------------------------+
| lookup-64 | lookup in std::uint64_t[256] LUT |
+---------------------------------------+------------------------------------------------------------------+
| bit-parallel | naive bit parallel method |
+---------------------------------------+------------------------------------------------------------------+
| bit-parallel-optimized | a bit better bit parallel |
+---------------------------------------+------------------------------------------------------------------+
| bit-parallel-optimized2 | better utilization of 2- and 4-bit subwords |
+---------------------------------------+------------------------------------------------------------------+
| bit-parallel-mul | bit-parallel with fewer instructions |
+---------------------------------------+------------------------------------------------------------------+
| bit-parallel32 | naive bit parallel method (32 bit) |
+---------------------------------------+------------------------------------------------------------------+
| bit-parallel-optimized32 | a bit better bit parallel (32 bit) |
+---------------------------------------+------------------------------------------------------------------+
| harley-seal | Harley-Seal popcount (4th iteration) |
+---------------------------------------+------------------------------------------------------------------+
| sse-bit-parallel | SSE implementation of bit-parallel-optimized (unrolled) |
+---------------------------------------+------------------------------------------------------------------+
| sse-bit-parallel-original | SSE implementation of bit-parallel-optimized |
+---------------------------------------+------------------------------------------------------------------+
| sse-bit-parallel-better | SSE implementation of bit-parallel with fewer instructions |
+---------------------------------------+------------------------------------------------------------------+
| sse-harley-seal | SSE implementation of Harley-Seal |
+---------------------------------------+------------------------------------------------------------------+
| sse-lookup | SSSE3 variant using pshufb instruction (unrolled) |
+---------------------------------------+------------------------------------------------------------------+
| sse-lookup-original | SSSE3 variant using pshufb instruction |
+---------------------------------------+------------------------------------------------------------------+
| avx2-lookup | AVX2 variant using pshufb instruction (unrolled) |
+---------------------------------------+------------------------------------------------------------------+
| avx2-lookup-original | AVX2 variant using pshufb instruction |
+---------------------------------------+------------------------------------------------------------------+
| avx2-harley-seal | AVX2 implementation of Harley-Seal |
+---------------------------------------+------------------------------------------------------------------+
| cpu | CPU instruction popcnt (64-bit variant) |
+---------------------------------------+------------------------------------------------------------------+
| sse-cpu | load data with SSE, then count bits using popcnt |
+---------------------------------------+------------------------------------------------------------------+
| avx2-cpu | load data with AVX2, then count bits using popcnt |
+---------------------------------------+------------------------------------------------------------------+
| avx512-harley-seal | AVX512 implementation of Harley-Seal |
+---------------------------------------+------------------------------------------------------------------+
| avx512bw-shuf | AVX512BW implementation uses shuffle instruction |
+---------------------------------------+------------------------------------------------------------------+
| avx512vbmi-shuf | AVX512VBMI implementation uses shuffle instruction |
+---------------------------------------+------------------------------------------------------------------+
| avx512-vpopcnt | AVX512 VPOPCNT |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt | builtin for popcnt |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt32 | builtin for popcnt (32-bit variant) |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-unrolled | unrolled builtin-popcnt |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-unrolled32 | unrolled builtin-popcnt32 |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-unrolled-errata | unrolled builtin-popcnt avoiding false-dependency |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-unrolled-errata-manual | unrolled builtin-popcnt avoiding false-dependency (asembly code) |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-movdq | builtin-popcnt where data is loaded via SSE registers |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-movdq-unrolled | builtin-popcnt-movdq unrolled |
+---------------------------------------+------------------------------------------------------------------+
| builtin-popcnt-movdq-unrolled_manual | builtin-popcnt-movdq unrolled (assembly code) |
+---------------------------------------+------------------------------------------------------------------+
| neon-vcnt | ARM Neon using VCNT |
+---------------------------------------+------------------------------------------------------------------+
| neon-HS | Harley-Seal using Neon VCNT |
+---------------------------------------+------------------------------------------------------------------+
| aarch64-cnt | ARMv8 Neon using CNT |
+---------------------------------------+------------------------------------------------------------------+Performance results
------------------------------------------------------------------------The subdirectory results__ contains performance results from various
computers. If you can, please contribute.__ results/README.rst
Acknowledgments
------------------------------------------------------------------------* **Kim Walisch** (@kimwalisch) wrote Harley-Seal scalar implementation.
* **Simon Lindholm** (@simonlindholm) added unrolled versions of procedures.
* **Dan Luu** (@danluu) agreed to include his procedures (``builint-*``)
into this project. More details in Dan's article `Hand coded assembly
beats intrinsics in speed and simplicity`____ http://danluu.com/assembly-intrinsics/
See also
------------------------------------------------------------------------* libpopcnt__ --- library by Kim Walisch utilizing methods from our paper.
__ https://github.com/kimwalisch/libpopcnt
.. vim: nowrap