{"id":24616289,"url":"https://github.com/wunkolo/qaveragecolor","last_synced_at":"2025-05-07T02:25:38.874Z","repository":{"id":38238144,"uuid":"167591098","full_name":"Wunkolo/qAverageColor","owner":"Wunkolo","description":"SIMD accelerated method to get the average color of an RGBA8 image","archived":false,"fork":false,"pushed_at":"2025-01-08T16:25:52.000Z","size":4424,"stargazers_count":49,"open_issues_count":3,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-31T05:24:57.137Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Wunkolo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-25T18:06:31.000Z","updated_at":"2024-10-21T19:03:16.000Z","dependencies_parsed_at":"2024-11-24T21:25:13.604Z","dependency_job_id":"9a8fa817-692b-4ac0-9852-be77ef422668","html_url":"https://github.com/Wunkolo/qAverageColor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wunkolo%2FqAverageColor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wunkolo%2FqAverageColor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wunkolo%2FqAverageColor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wunkolo%2FqAverageColor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Wunkolo","download_url":"https:/
/codeload.github.com/Wunkolo/qAverageColor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252800147,"owners_count":21806103,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-24T22:16:46.449Z","updated_at":"2025-05-07T02:25:38.848Z","avatar_url":"https://github.com/Wunkolo.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# qAverageColor [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)\n\n||||||\n|:-:|:-:|:-:|:-:|:-:|\n||Serial|SSE4.2|AVX2|AVX512|\n||![Serial](media/Pat-Serial.gif)|![SSE](media/Pat-SSE.gif)|![AVX2](media/Pat-AVX2.gif)|![AVX512](media/Pat-AVX512.gif)|\n|Processor||Speedup|\n|[i7-7500u](https://en.wikichip.org/wiki/intel/core_i7/i7-7500u)|-|x2.8451|x4.4087|_N/A_|\n|[i3-6100](https://en.wikichip.org/wiki/intel/core_i3/i3-6100)|-|x2.7258|x4.2358|_N/A_|\n|[i5-8600k](https://en.wikichip.org/wiki/intel/core_i5/i5-8600k)|-|x2.4015|x2.6498|_N/A_|\n|[i9-7900x](https://en.wikichip.org/wiki/intel/core_i9/i9-7900x)|-|x2.0651|x2.6140|x4.2704|\n|[i7-1065G7](https://en.wikichip.org/wiki/intel/core_i7/i7-1065g7)|-|x3.9124|x4.6244|x5.4683|\n|[i9-11900k](https://ark.intel.com/content/www/us/en/ark/products/212325/intel-core-i9-11900k-processor-16m-cache-up-to-5-30-ghz.html)|-|x4.2406|x5.3535|x6.0925|\n\n\u003csup\u003eTested against a synthetic 10-megapixel image, GCC version 8.2.1\u003c/sup\u003e\n\nThis is a little snippet write-up of code that will find the average color of an image of RGBA8 pixels (32-bits per pixel, 8 
bits per channel) by utilizing the `psadbw` (`_mm_sad_epu8`) instruction to accumulate the sum of each individual channel into a (very overflow-safe) 64-bit accumulator.\n\nInspired by the [\"SIMDized sum of all bytes in the array\" write-up](http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html) by [Wojciech Muła](https://twitter.com/pshufb).\n\nThe usual method to get the statistical average of each color channel in an image is pretty trivial:\n\n 1. Load in a pixel\n 2. Unpack the individual color channels\n 3. Sum the channel values into an accumulator\n 4. When you've summed them all, divide these sums by the total number of pixels\n 5. Interleave these averages into a new color value\n\n![Serial](/media/Serial.gif)\n\nSomething like this:\n```cpp\nstd::uint32_t AverageColorRGBA8(\n\tstd::uint32_t Pixels[],\n\tstd::size_t Count\n)\n{\n\tstd::uint64_t RedSum, GreenSum, BlueSum, AlphaSum;\n\tRedSum = GreenSum = BlueSum = AlphaSum = 0;\n\tfor( std::size_t i = 0; i \u003c Count; ++i )\n\t{\n\t\tconst std::uint32_t\u0026 CurColor = Pixels[i];\n\t\tAlphaSum += static_cast\u003cstd::uint8_t\u003e( CurColor \u003e\u003e 24 );\n\t\tBlueSum  += static_cast\u003cstd::uint8_t\u003e( CurColor \u003e\u003e 16 );\n\t\tGreenSum += static_cast\u003cstd::uint8_t\u003e( CurColor \u003e\u003e  8 );\n\t\tRedSum   += static_cast\u003cstd::uint8_t\u003e( CurColor \u003e\u003e  0 );\n\t}\n\tRedSum   /= Count;\n\tGreenSum /= Count;\n\tBlueSum  /= Count;\n\tAlphaSum /= Count;\n\n\treturn\n\t\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t)AlphaSum ) \u003c\u003c 24 ) |\n\t\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t) BlueSum ) \u003c\u003c 16 ) |\n\t\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t)GreenSum ) \u003c\u003c  8 ) |\n\t\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t)  RedSum ) \u003c\u003c  0 );\n}\n```\n\nThis is a pretty serial way to do it. 
Pick up a pixel, unpack it, add it to a sum.\n\nEach of these unpacks and sums is independent of the others, so they can be parallelized with some SIMD trickery to process chunks of 4, 8, or even **16** pixels at once.\n\nThere is no dedicated instruction for a horizontal sum of 8-bit elements within a vector register in any of the [x86 SIMD variations](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Later_versions).\nThe closest analogue is `psadbw`, an instruction that computes the **S**um of **A**bsolute **D**ifferences of **8**-bit elements within 64-bit lanes: it takes the absolute difference of each pair of bytes and then horizontally adds these 8-bit differences into the lower 16 bits of the 64-bit lane. This is basically computing the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) between two vectors of eight 8-bit elements.\n```\nAD(A,B) = ABS(A - B) # Absolute difference\n\nX = ( 1, 2, 3, 4, 5, 6, 7, 8) # 8 byte vectors\nY = ( 8, 9,10,11,12,13,14,15)\n\n# Sum of absolute differences\nSAD(X,Y) =\n\t# Absolute difference of each of the pairs of 8-bit elements\n\tAD(X[0],Y[0]) + AD(X[1],Y[1]) + AD(X[2],Y[2]) + AD(X[3],Y[3]) +\n\tAD(X[4],Y[4]) + AD(X[5],Y[5]) + AD(X[6],Y[6]) + AD(X[7],Y[7])\n\t=\n\tABS(  1 -  8 ) + ABS(  2 -  9 ) + ABS(  3 - 10 ) + ABS(  4 - 11 ) +\n\tABS(  5 - 12 ) + ABS(  6 - 13 ) + ABS(  7 - 14 ) + ABS(  8 - 15 )\n\t=\n\tABS( -7 ) + ABS( -7 ) + ABS( -7 ) + ABS( -7 ) +\n\tABS( -7 ) + ABS( -7 ) + ABS( -7 ) + ABS( -7 )\n\t# Horizontally sum all these differences into a 16-bit value\n\t= 7 + 7 + 7 + 7 + 7 + 7 + 7 + 7\n\t= 56\n```\n\n`psadbw` may seem like a pretty niche instruction at first. 
You're probably wondering why such a specific series of operations is implemented as an official x86 instruction, but it has seen plenty of use since the original SSE days to aid in block-based [motion estimation](https://en.wikipedia.org/wiki/Sum_of_absolute_differences) for video encoding.\nThe trick here is recognizing that the absolute difference between an _unsigned_ number and _zero_ is just the unsigned number again. Taking the _sum_ of absolute differences between a vector of unsigned values and an all-zero vector is a way to extract just the horizontal-addition step of SAD for this particular use.\n\n```\n(A is unsigned)\nAD(A,0) = ABS(A - 0) = A\n\nX = ( 0, 1, 2, 3, 4, 5, 6, 7) # 8 byte vectors\nY = ( 0, 0, 0, 0, 0, 0, 0, 0)\n\nSAD(X,Y) =\n\tAD(X[0],0) + AD(X[1],0) + AD(X[2],0) + AD(X[3],0) +\n\tAD(X[4],0) + AD(X[5],0) + AD(X[6],0) + AD(X[7],0)\n\t=\n\tX[0] + X[1] + X[2] + X[3] + X[4] + X[5] + X[6] + X[7]\n\t=\n\t0 + 1 + 2 + 3 + 4 + 5 + 6 + 7\n\t= 28\n```\n\nUsing `psadbw` this way allows a vector of 8 consecutive bytes to be horizontally summed into the low 16 bits of a 64-bit lane, and this 16-bit value can then be directly added to a larger 64-bit accumulator. With this, a chunk of RGBA color values can be loaded into a vector and shuffled so that all their R,G,B,A bytes are grouped into 64-bit lanes; `psadbw` then reduces each 64-bit lane to a 16-bit sum, and these sums are added to 64-bit accumulators to later get their average.\n\nUsually, taking the average of a large number of values raises some worry about overflow. 
Because `psadbw` operates on 64-bit lanes, it lends itself to 64-bit accumulators, which are very resistant to overflow.\nAn individual channel would need `2^64 / 2^8 == 72057594037927936` pixels (almost 69 billion megapixels, counting a megapixel as `2^20` pixels) with a value of `0xFF` for that color channel to overflow its 64-bit accumulator.\nPretty resistant I'd say.\n\nAn SSE vector-register is 128 bits wide, so it can hold two 64-bit accumulators: one SSE register can be used to accumulate the `|Green|Red|` sums and another vector-register the `|Alpha|Blue|` sums.\n\nThe main loop would look something like this:\n\n 1. Load in a chunk of 4 32-bit pixels into a 128-bit register\n    * `|ABGR|ABGR|ABGR|ABGR|`\n 2. Shuffle the 8-bit channels within the vector so that the upper 64 bits of the register hold one channel and the lower 64 bits hold another.\n    * There is a bit of \"waste\" as you only have four bytes of a particular channel and 8 bytes within a lane. The unused bytes can be set to zero by passing a shuffle-index with the upper bit set when using `_mm_shuffle_epi8` (a value such as `-1` will get `pshufb` to write `0`). This way these bytes will not affect the sum.\n    * `|0G0G0G0G|0R0R0R0R|` or `|0A0A0A0A|0B0B0B0B|`\n    * `|0000GGGG|0000RRRR|` or `|0000AAAA|0000BBBB|` works too\n    * Any permutation works so long as the unused elements are 0 and so do not affect the horizontal sum\n 3. 
`_mm_sad_epu8` the vector against zero, getting one 16-bit sum within each of the two 64-bit lanes, and add these to the 64-bit accumulators\n    * `|0000ΣG16|0000ΣR16|` or `|0000ΣA16|0000ΣB16|` 16-bit sums, within the upper and lower 64-bit halves of the 128-bit register\n\n\nA `psadbw`-accelerated pixel-summing loop that handles four pixels at a time would look something like this:\n\n![](media/SAD.gif)\n\n```cpp\n// | 64-bit Green Sum | 64-bit Red Sum |\n__m128i RedGreenSum64  = _mm_setzero_si128();\n// | 64-bit Alpha Sum | 64-bit Blue Sum |\n__m128i BlueAlphaSum64 = _mm_setzero_si128();\n\n// 4 pixels at a time! (SSE)\nfor( std::size_t j = i/4; j \u003c Count/4; j++, i += 4 )\n{\n\t// Non-temporal load; requires a 16-byte aligned address\n\tconst __m128i QuadPixel = _mm_stream_load_si128((__m128i*)\u0026Pixels[i]);\n\tRedGreenSum64 = _mm_add_epi64( // Add it to the 64-bit accumulators\n\t\tRedGreenSum64,\n\t\t_mm_sad_epu8( // compute | 0+G+0+G+0+G+0+G | 0+R+0+R+0+R+0+R |\n\t\t\t_mm_shuffle_epi8( // Shuffle the bytes to | 0G0G0G0G | 0R0R0R0R |\n\t\t\t\tQuadPixel,\n\t\t\t\t_mm_set_epi8(\n\t\t\t\t\t// Green\n\t\t\t\t\t-1,13,-1, 5,\n\t\t\t\t\t-1, 9,-1, 1,\n\t\t\t\t\t// Red\n\t\t\t\t\t-1,12,-1, 4,\n\t\t\t\t\t-1, 8,-1, 0\n\t\t\t\t)\n\t\t\t),\n\t\t\t// SAD against 0, which just returns the original unsigned value\n\t\t\t_mm_setzero_si128()\n\t\t)\n\t);\n\tBlueAlphaSum64 = _mm_add_epi64( // Add it to the 64-bit accumulators\n\t\tBlueAlphaSum64,\n\t\t_mm_sad_epu8( // compute | 0+A+0+A+0+A+0+A | 0+B+0+B+0+B+0+B |\n\t\t\t_mm_shuffle_epi8( // Shuffle the bytes to | 0A0A0A0A | 0B0B0B0B |\n\t\t\t\tQuadPixel,\n\t\t\t\t_mm_set_epi8(\n\t\t\t\t\t// Alpha\n\t\t\t\t\t-1,15,-1, 7,\n\t\t\t\t\t-1,11,-1, 3,\n\t\t\t\t\t// Blue\n\t\t\t\t\t-1,14,-1, 6,\n\t\t\t\t\t-1,10,-1, 2\n\t\t\t\t)\n\t\t\t),\n\t\t\t// SAD against 0, which just returns the original unsigned value\n\t\t\t_mm_setzero_si128()\n\t\t)\n\t);\n}\n```\n\nAfter processing chunks of 4 at a time, the left-over pixels (there will only ever be 3 or fewer) can be handled by extracting the 64-bit accumulators 
from the vector-registers and falling back to the usual serial method.\nThere are some slight optimizations that can be done here too, though. `bextr` can extract contiguous bits a little quicker than a shift-and-mask when getting the upper color channels. x86 has [register aliasing for the lower two bytes of its general purpose registers](http://flint.cs.yale.edu/cs421/papers/x86-asm/x86-registers.png), though, so a `bextr` would probably be overkill for the color channels in the lower two bytes.\n\n```cpp\n// Extract the 64-bit accumulators from the vector registers of the previous loop\nstd::uint64_t RedSum64   = _mm_cvtsi128_si64(RedGreenSum64);\nstd::uint64_t GreenSum64 = _mm_extract_epi64(RedGreenSum64,1);\nstd::uint64_t BlueSum64  = _mm_cvtsi128_si64(BlueAlphaSum64);\nstd::uint64_t AlphaSum64 = _mm_extract_epi64(BlueAlphaSum64,1);\n\n// New serial method\nfor( ; i \u003c Count; ++i )\n{\n\tconst std::uint32_t CurColor = Pixels[i];\n\tAlphaSum64 += _bextr_u64( CurColor, 24, 8);\n\tBlueSum64  += _bextr_u64( CurColor, 16, 8);\n\t// I'm being oddly specific here to make it obvious for the\n\t// compiler to do some ah/bh/ch/dh register trickery\n\t//                                              V\n\tGreenSum64 += static_cast\u003cstd::uint8_t\u003e( CurColor \u003e\u003e  8 );\n\tRedSum64   += static_cast\u003cstd::uint8_t\u003e( CurColor       );\n}\n// Average\nRedSum64   /= Count;\nGreenSum64 /= Count;\nBlueSum64  /= Count;\nAlphaSum64 /= Count;\n\n// Interleave\nreturn\n\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t)AlphaSum64 ) \u003c\u003c 24 ) |\n\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t) BlueSum64 ) \u003c\u003c 16 ) |\n\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t)GreenSum64 ) \u003c\u003c  8 ) |\n\t(static_cast\u003cstd::uint32_t\u003e( (std::uint8_t)  RedSum64 ) \u003c\u003c  0 );\n```\n\nThis implementation so far with a 3840x2160 image on an [i7-7500U](https://en.wikichip.org/wiki/intel/core_i7/i7-7500u) shows 
an approximate **x2.6** increase in performance over the serial method. It now takes less than half the time to process an image.\n```\nSerial: #10121AFF |      7100551ns\nFast  : #10121AFF |      2701641ns\nSpeedup: 2.628236\n```\n\nWith [AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) this algorithm can be updated to process **8** pixels at a time, then **4** pixels at a time, before resorting to the serial algorithm for the left-over pixels (there will only be 7 or fewer, think [greedy algorithms](https://en.wikipedia.org/wiki/Greedy_algorithm)). With much larger **256-bit** vectors, all four 64-bit accumulators can reside within a single AVX2 register.\nThough, AVX2's massive 256-bit vector-registers are almost just an alias for two regular 128-bit SSE registers from before, with the additional benefit of being able to compactly handle two 128-bit registers with 1 instruction.\nThis also means that cross-lane arithmetic (shuffling elements across the full width of a 256-bit register, rather than staying within the upper and lower 128-bit halves) can be tricky, as crossing the 128-bit boundary needs some special attention. A solution to this is to shuffle first within the upper and lower 128-bit lanes, and then use a coarser cross-lane shuffle to further unpack the channels into contiguous values before computing a `_mm256_sad_epu8` on each of the four 64-bit lanes.\n\n```cpp\n// Vector of four 64-bit accumulators for Red,Green,Blue, and Alpha\n__m256i RGBASum64  = _mm256_setzero_si256();\n// 8 pixels at a time! 
(AVX/AVX2)\nfor( std::size_t j = i/8; j \u003c Count/8; j++, i += 8 )\n{\n\tconst __m256i OctaPixel = _mm256_loadu_si256((__m256i*)\u0026Pixels[i]);\n\t// Shuffle within 128-bit lanes\n\t// | ABGRABGRABGRABGR | ABGRABGRABGRABGR |\n\t// | AAAABBBBGGGGRRRR | AAAABBBBGGGGRRRR |\n\t// Setting up for 64-bit lane sad_epu8\n\t__m256i Deinterleave = _mm256_shuffle_epi8(\n\t\tOctaPixel,\n\t\t_mm256_broadcastsi128_si256(\n\t\t\t_mm_set_epi8(\n\t\t\t\t// Alpha\n\t\t\t\t15,11, 7, 3,\n\t\t\t\t// Blue\n\t\t\t\t14,10, 6, 2,\n\t\t\t\t// Green\n\t\t\t\t13, 9, 5, 1,\n\t\t\t\t// Red\n\t\t\t\t12, 8, 4, 0\n\t\t\t)\n\t\t)\n\t);\n\t// Cross-lane shuffle\n\t// | AAAABBBBGGGGRRRR | AAAABBBBGGGGRRRR |\n\t// | AAAAAAAA | BBBBBBBB | GGGGGGGG | RRRRRRRR |\n\tDeinterleave = _mm256_permutevar8x32_epi32(\n\t\tDeinterleave,\n\t\t_mm256_set_epi32(\n\t\t\t// Alpha\n\t\t\t7, 3,\n\t\t\t// Blue\n\t\t\t6, 2,\n\t\t\t// Green\n\t\t\t5, 1,\n\t\t\t// Red\n\t\t\t4, 0\n\t\t)\n\t);\n\t// | ASum64 | BSum64 | GSum64 | RSum64 |\n\tRGBASum64 = _mm256_add_epi64(\n\t\tRGBASum64,\n\t\t_mm256_sad_epu8(\n\t\t\tDeinterleave,\n\t\t\t_mm256_setzero_si256()\n\t\t)\n\t);\n}\n\n// Pass the accumulators on to the SSE loop from above\n// Upper 128 bits: | ASum64 | BSum64 |, lower 128 bits: | GSum64 | RSum64 |\n__m128i BlueAlphaSum64 = _mm256_extractf128_si256(RGBASum64,1);\n__m128i RedGreenSum64  = _mm256_castsi256_si128(RGBASum64);\nfor( std::size_t j = i/4; j \u003c Count/4; j++, i += 4 )\n{\n...\n```\n\n\nThis implementation so far (AVX2, SSE, and Serial) with the same 3840x2160 image as before on an [i7-7500U](https://en.wikichip.org/wiki/intel/core_i7/i7-7500u) shows an approximate **x4.1** increase in performance over the serial method. It now takes less than a **fourth** of the time of the serial version to calculate the color average!\n```\nSerial: #10121AFF |      7436508ns\nFast  : #10121AFF |      1802768ns\nSpeedup: 4.125050\n```\n\n### RGB8? RG8? 
R8?\n\nNot explored in this little write-up are other pixel formats such as the three-channel `RGB` with no alpha, a 2-channel `RG` image, or even a monochrome \"R8\" image of just one channel.\nOther formats will work on the same principle as the `RGBA` format so long as you account for the proper shuffling needed.\n\nDepending on where you are in your pixel processing, consider the different cases of how you can take a 16-byte chunk out of a stream of 3-byte `RGB` pixels.\nIf your bytes are organized `|RGB|RGB|RGB|RGB|...` and your vector-register is 16 bytes wide, then you'll always be taking `16/3 = 5.3333...` `RGB` pixels at once. Every **3**rd chunk (`0.333.. * 3 = 1.0`) is when the period repeats, so only 3 shuffle-masks are ever needed depending on where your index `i` lands in the array. Some ASCII art might help visualize it:\n```\nNote how differently the bytes align when taking regular 16-byte chunks out of a\nstream of 3-byte pixels.\nDifferent shuffle patterns must be used to account for each case:\n\n \u003eRGBRGBRGBRGBRGBRG\u003c\n0:|---SIMD Reg.---|\n  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...\n                  \u003eBRGBRGBRGBRGBRGBR\u003c\n1:                 |---SIMD Reg.---|\n  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...\n                                   \u003eGBRGBRGBRGBRGBRGB\u003c\n2:                                  |---SIMD Reg.---|\n  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...\n                   This is the same as iteration 0  \u003eRGBRGBRGBRGBRGBRG\u003c\n3:                                                   |---SIMD Reg.---|\n  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...\n```\n\nYou can also process much larger **48**-byte (`lcm(16,3)`) aligned chunks to have more productive iterations at a coarser granularity:\n\n```\nNow you don't have to worry about byte-level 
alignment and can do all the shuffles at once\n \u003eRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB\u003c\n0:|---SIMD Reg.---||---SIMD Reg.---||---SIMD Reg.---|\n  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...\n```\nOnce you have your sums, it's just a division and an interleave to turn these statistical averages into a new color value.\n\nFor RG8, the same principle applies, but it is much more trivial since 2-byte pixels naturally align themselves with power-of-two register widths.\n\nFor R8, the summing step reduces to just a sum-of-bytes, which is precisely the topic [covered by Wojciech Muła](http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html). After getting the sum, divide by the number of pixels to get the statistical average.\n\n# AVX512-VNNI\n\nThe upcoming Intel Icelake features **V**ector **N**eural **N**etwork **I**nstructions in consumer-level products.\nThe AVX512-VNNI subset is very small, featuring only 4 new instructions.\n\n\nInstruction|Description\n-|-\n`VPDPBUSD`\t| **Multiply and add unsigned and signed 8-bit integers**\n`VPDPBUSDS`\t| Multiply and add unsigned and signed 8-bit integers with saturation\n`VPDPWSSD`\t| Multiply and add signed 16-bit integers\n`VPDPWSSDS`\t| Multiply and add signed 16-bit integers with saturation\n\n \u003e ![](media/AVX512VNNI.jpg)\n \u003e \n \u003e [_Vector Neural Network Instructions Enable Int8 AI Inference on Intel Architecture_](https://www.intel.ai/vnni-enables-inference)\n\nThese instructions are [intended to accelerate convolutional neural network workloads](https://aidc.gallery.video/detail/videos/all-videos/video/5790616836001/understanding-new-vector-neural-network-instructions-vnni) which typically involve mixed-precision arithmetic and matrix multiplications. 
These four new instructions essentially implement an 8-bit or 16-bit dot product into 32-bit accumulators, which falls nicely into the domain of summing a large span of bytes together.\n\n[VPDPBUSD](https://github.com/HJLebbink/asm-dude/wiki/VPDPBUSD) calculates the dot products of sixteen pairs of 8-bit ℝ⁴ vectors and accumulates them onto a vector of sixteen 32-bit values, all in one instruction. It practically lends itself to the \"sum of bytes\" problem by allowing for large \"bites\" of data to be horizontally added and accumulated.\n\n \u003e ![](media/vpdpbusd.png)\n \u003e\n \u003e [WikiChip-AVX512VNNI](https://en.wikichip.org/wiki/x86/avx512vnni)\n\nThe [_mm512_dpbusd_epi32](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_dpbusd_epi32\u0026expand=2195) intrinsic is described as:\n\n```\nSynopsis\n__m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)\n#include \u003cimmintrin.h\u003e\nInstruction: vpdpbusd zmm {k}, zmm, zmm\nCPUID Flags: AVX512_VNNI\n\nDescription\nMultiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. 
Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.\n\nOperation\nFOR j := 0 to 15\n\ttmp1 := a.byte[4*j] * b.byte[4*j]\n\ttmp2 := a.byte[4*j+1] * b.byte[4*j+1]\n\ttmp3 := a.byte[4*j+2] * b.byte[4*j+2]\n\ttmp4 := a.byte[4*j+3] * b.byte[4*j+3]\n\tdst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4\nENDFOR\ndst[MAX:512] := 0\n```\n\nBy passing a vector of `1` values into the multiplication step, the implementation basically becomes:\n```\nFOR j := 0 to 15\n\ttmp1 := a.byte[4*j] * 1\n\ttmp2 := a.byte[4*j+1] * 1\n\ttmp3 := a.byte[4*j+2] * 1\n\ttmp4 := a.byte[4*j+3] * 1\n\tdst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4\nENDFOR\ndst[MAX:512] := 0\n```\n\nWhich is essentially:\n\n```\nFOR j := 0 to 15\n\tdst.dword[j] := src.dword[j] + a.byte[4*j] + a.byte[4*j+1] + a.byte[4*j+2] + a.byte[4*j+3]\nENDFOR\ndst[MAX:512] := 0\n```\n\nThis basically turns it into a \"sum 16 groups of 4 bytes and add each sum into one of 16 32-bit values\" instruction. Before, we would have had to shuffle our bytes into appropriate lanes, `_mm***_sad_epu8` them, and then use a separate `_mm***_add_epi64` to add the result to the accumulator. This saves us the additional add step. The accumulator, though, is only 32 bits rather than 64 bits. 
It is now far more susceptible to overflow (each iteration adds at most `4 * 255 = 1020` to a 32-bit lane, so a lane can wrap after as few as `2^32 / 1020`, roughly 4.2 million, iterations), but this can be mitigated by only running an inner loop of this instruction a number of times guaranteed to be safe from overflow, and then adding the result to the outer loop's 64-bit accumulator.\n\n```cpp\n// The usual shuffle pattern from the sad_epu8 method\n// Each \"R\" \"G\" \"B\" \"A\" value is an 8-bit channel-byte\n// | AAAA | AAAA | BBBB | BBBB | GGGG | GGGG | RRRR | RRRR | AAAA | AAAA | BBBB | BBBB | GGGG | GGGG | RRRR | RRRR |\n__m512i Deinterleave = _mm512_shuffle_epi8(\n\tHexadecaPixel,\n\t_mm512_set_epi32(\n\t\t// Alpha\n\t\t0x3C'38'34'30 + 0x03'03'03'03, 0x2C'28'24'20 + 0x03'03'03'03,\n\t\t// Blue\n\t\t0x3C'38'34'30 + 0x02'02'02'02, 0x2C'28'24'20 + 0x02'02'02'02,\n\t\t// Green\n\t\t0x3C'38'34'30 + 0x01'01'01'01, 0x2C'28'24'20 + 0x01'01'01'01,\n\t\t// Red\n\t\t0x3C'38'34'30 + 0x00'00'00'00, 0x2C'28'24'20 + 0x00'00'00'00,\n\t\t// Alpha\n\t\t0x1C'18'14'10 + 0x03'03'03'03, 0x0C'08'04'00 + 0x03'03'03'03,\n\t\t// Blue\n\t\t0x1C'18'14'10 + 0x02'02'02'02, 0x0C'08'04'00 + 0x02'02'02'02,\n\t\t// Green\n\t\t0x1C'18'14'10 + 0x01'01'01'01, 0x0C'08'04'00 + 0x01'01'01'01,\n\t\t// Red\n\t\t0x1C'18'14'10 + 0x00'00'00'00, 0x0C'08'04'00 + 0x00'00'00'00\n\t)\n);\n```\n\n```\nBasic pattern of the partial sums found within a 256-bit lane\n|                      256 bits                         |\n| AAAA | AAAA | BBBB | BBBB | GGGG | GGGG | RRRR | RRRR | \u003c\n| **** | **** | **** | **** | **** | **** | **** | **** | |\n| 1111 | 1111 | 1111 | 1111 | 1111 | 1111 | 1111 | 1111 | |\n| hadd | hadd | hadd | hadd | hadd | hadd | hadd | hadd | |\n|ASum32|ASum32|BSum32|BSum32|GSum32|GSum32|RSum32|RSum32| \u003c Inner loop, 32-bit sum\n|   \\  +  /   |   \\  +  /   |   \\  +  /   |   \\  +  /   | | Sum Adjacent pairs\n|   ASum64    |   BSum64    |   GSum64    |   RSum64    | \u003c Outer loop, 64-bit sum\n```\n\nAs of now (`Wed 19 Jun 2019 10:15:53 PM PDT`) there is no publicly available\nIcelake hardware to test this on but just in 
terms of uops this should tighten\nup the pipeline a bit more than the `sad_epu8` method.\n\nUpdate: As of now (`Wed 07 Aug 2019 04:49:23 PM PDT`) there is still no publicly\navailable hardware to test this on, but based on [some public benchmarks](https://www.anandtech.com/show/14664/testing-intel-ice-lake-10nm/3) there is now some [latency data](https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt) for the key instructions in the algorithm.\n\nOn Skylake-X:\n * `vpsadbw *mm, *mm, *mm` has a latency of **3**-cycles and a throughput of **1.0**\n * `vpaddq (x|y)mm, (x|y)mm, (x|y)mm` has a latency of **1**-cycle and a throughput of **0.33**\n * `vpaddq zmm, zmm, zmm` has a latency of **1**-cycle and a throughput of **0.5**\n\n**4 cycles** of latency in total (`vpsadbw` followed by `vpaddq`)\n\nOn Icelake:\n * `vpsadbw *mm, *mm, *mm` and `vpaddq *mm, *mm, *mm` have the same latencies and throughput as Skylake-X\n * `vpdpbusd (x|y)mm, (x|y)mm, (x|y)mm` has a latency of **5**-cycles and a throughput of **0.5**\n * `vpdpbusd zmm, zmm, zmm` has a latency of **5**-cycles and a throughput of **1.0**\n\n**5 cycles** of latency in total (`vpdpbusd` alone)\n\nIt looks like, ultimately, an instruction that fuses the previous two\ninstructions to achieve a horizontal byte addition costs an extra cycle of latency, though it saves the additional instruction decoding.\n\nThis doesn't consider the overhead of the outer-loops either. Once I get one of\nthe new Icelake laptops in my hands I can get some much harder benchmark numbers\nof how the two algorithms perform on the same Icelake hardware.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwunkolo%2Fqaveragecolor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwunkolo%2Fqaveragecolor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwunkolo%2Fqaveragecolor/lists"}