{"id":45920525,"url":"https://github.com/tphakala/simd","last_synced_at":"2026-02-28T08:41:23.082Z","repository":{"id":325543139,"uuid":"1101591388","full_name":"tphakala/simd","owner":"tphakala","description":"High-performance SIMD library for Go with AVX/NEON support","archived":false,"fork":false,"pushed_at":"2025-12-10T19:16:02.000Z","size":335,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-10T19:54:00.270Z","etag":null,"topics":["avx","float32","float64","go","golang","math","neon","performance","simd","vectorization"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tphakala.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-21T22:40:23.000Z","updated_at":"2025-12-10T19:16:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tphakala/simd","commit_stats":null,"previous_names":["tphakala/simd"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/tphakala/simd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tphakala%2Fsimd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tphakala%2Fsimd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tphakala%2Fsimd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/tphakala%2Fsimd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tphakala","download_url":"https://codeload.github.com/tphakala/simd/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tphakala%2Fsimd/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29929091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-27T19:37:42.220Z","status":"online","status_checked_at":"2026-02-28T02:00:07.010Z","response_time":90,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avx","float32","float64","go","golang","math","neon","performance","simd","vectorization"],"created_at":"2026-02-28T08:41:22.515Z","updated_at":"2026-02-28T08:41:23.059Z","avatar_url":"https://github.com/tphakala.png","language":"Go","readme":"# simd\n\n[![Go Reference](https://pkg.go.dev/badge/github.com/tphakala/simd.svg)](https://pkg.go.dev/github.com/tphakala/simd)\n[![Go Report Card](https://goreportcard.com/badge/github.com/tphakala/simd)](https://goreportcard.com/report/github.com/tphakala/simd)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nA high-performance SIMD (Single Instruction, Multiple Data) library for Go providing vectorized operations on float64, float32, float16, complex128, and complex64 slices.\n\n## Features\n\n- **Pure Go assembly** - Native Go assembler, 
simple cross-compilation\n- **Runtime CPU detection** - Automatically selects optimal implementation (AVX-512, AVX+FMA, SSE4.1, NEON, NEON+FP16, or pure Go)\n- **Zero allocations** - All operations work on pre-allocated slices\n- **80+ operations** - Arithmetic, reduction, statistical, vector, signal processing, activation functions, and complex number operations\n- **Multi-architecture** - AMD64 (AVX-512/AVX+FMA/SSE4.1) and ARM64 (NEON/NEON+FP16) with pure Go fallback\n- **Half-precision support** - Native FP16 SIMD on ARM64 with FP16 extension (Apple Silicon, Cortex-A55+)\n- **Thread-safe** - All functions are safe for concurrent use\n\n## Installation\n\n```bash\ngo get github.com/tphakala/simd\n```\n\nRequires Go 1.25+\n\n## Quick Start\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"github.com/tphakala/simd/cpu\"\n    \"github.com/tphakala/simd/f64\"\n)\n\nfunc main() {\n    fmt.Println(\"SIMD:\", cpu.Info())\n\n    // Vector operations\n    a := []float64{1, 2, 3, 4, 5, 6, 7, 8}\n    b := []float64{8, 7, 6, 5, 4, 3, 2, 1}\n\n    // Dot product\n    dot := f64.DotProduct(a, b)\n    fmt.Println(\"Dot product:\", dot) // 120\n\n    // Element-wise operations\n    dst := make([]float64, len(a))\n    f64.Add(dst, a, b)\n    fmt.Println(\"Sum:\", dst) // [9, 9, 9, 9, 9, 9, 9, 9]\n\n    // Statistical operations\n    mean := f64.Mean(a)\n    stddev := f64.StdDev(a)\n    fmt.Printf(\"Mean: %.2f, StdDev: %.2f\\n\", mean, stddev)\n\n    // Vector operations\n    f64.Normalize(dst, a)\n    fmt.Println(\"Normalized:\", dst)\n\n    // Distance calculation\n    dist := f64.EuclideanDistance(a, b)\n    fmt.Println(\"Distance:\", dist)\n}\n```\n\n## Packages\n\n### `cpu` - CPU Feature Detection\n\n```go\nimport \"github.com/tphakala/simd/cpu\"\n\nfmt.Println(cpu.Info())      // \"AMD64 AVX-512\", \"AMD64 AVX+FMA\", \"AMD64 SSE2\", or \"ARM64 NEON\"\nfmt.Println(cpu.HasAVX())    // true/false\nfmt.Println(cpu.HasAVX512()) // true/false\nfmt.Println(cpu.HasNEON())   // 
true/false\nfmt.Println(cpu.HasFP16())   // true/false (ARM64 half-precision SIMD)\n```\n\n### `f64` - float64 Operations\n\n| Category        | Function                            | Description                   | SIMD Width                          |\n| --------------- | ----------------------------------- | ----------------------------- | ----------------------------------- |\n| **Arithmetic**  | `Add(dst, a, b)`                    | Element-wise addition         | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |\n|                 | `Sub(dst, a, b)`                    | Element-wise subtraction      | 8x / 4x / 2x                        |\n|                 | `Mul(dst, a, b)`                    | Element-wise multiplication   | 8x / 4x / 2x                        |\n|                 | `Div(dst, a, b)`                    | Element-wise division         | 8x / 4x / 2x                        |\n|                 | `Scale(dst, a, s)`                  | Multiply by scalar            | 8x / 4x / 2x                        |\n|                 | `AddScalar(dst, a, s)`              | Add scalar                    | 8x / 4x / 2x                        |\n|                 | `FMA(dst, a, b, c)`                 | Fused multiply-add: a\\*b+c    | 8x / 4x / 2x                        |\n|                 | `AddScaled(dst, alpha, s)`          | dst += alpha\\*s (axpy)        | 8x / 4x / 2x                        |\n| **Unary**       | `Abs(dst, a)`                       | Absolute value                | 8x / 4x / 2x                        |\n|                 | `Neg(dst, a)`                       | Negation                      | 8x / 4x / 2x                        |\n|                 | `Sqrt(dst, a)`                      | Square root                   | 8x / 4x / 2x                        |\n|                 | `Reciprocal(dst, a)`                | Reciprocal (1/x)              | 8x / 4x / 2x                        |\n| **Reduction**   | `DotProduct(a, b)`                  | Dot 
product                   | 8x / 4x / 2x                        |\n|                 | `Sum(a)`                            | Sum of elements               | 8x / 4x / 2x                        |\n|                 | `Min(a)`                            | Minimum value                 | 8x / 4x / 2x                        |\n|                 | `Max(a)`                            | Maximum value                 | 8x / 4x / 2x                        |\n|                 | `MinIdx(a)`                         | Index of minimum value        | Pure Go                             |\n|                 | `MaxIdx(a)`                         | Index of maximum value        | Pure Go                             |\n| **Statistical** | `Mean(a)`                           | Arithmetic mean               | 8x / 4x / 2x                        |\n|                 | `Variance(a)`                       | Population variance           | 8x / 4x / 2x                        |\n|                 | `StdDev(a)`                         | Standard deviation            | 8x / 4x / 2x                        |\n| **Vector**      | `EuclideanDistance(a, b)`           | L2 distance                   | 8x / 4x / 2x                        |\n|                 | `Normalize(dst, a)`                 | Unit vector normalization     | 8x / 4x / 2x                        |\n|                 | `CumulativeSum(dst, a)`             | Running sum                   | Sequential                          |\n| **Range**       | `Clamp(dst, a, min, max)`           | Clamp to range                | 8x / 4x / 2x                        |\n| **Activation**  | `Sigmoid(dst, src)`                 | Sigmoid: 1/(1+e^-x)           | 4x (AVX) / 2x (NEON)                |\n|                 | `ReLU(dst, src)`                    | Rectified Linear Unit         | 8x / 4x / 2x                        |\n|                 | `Tanh(dst, src)`                    | Hyperbolic tangent            | 8x / 4x / 2x                        
|\n|                 | `Exp(dst, src)`                     | Exponential e^x               | Pure Go                             |\n|                 | `ClampScale(dst, src, min, max, s)` | Fused clamp and scale         | 8x / 4x / 2x                        |\n| **Batch**       | `DotProductBatch(r, rows, v)`       | Multiple dot products         | 8x / 4x / 2x                        |\n| **Signal**      | `ConvolveValid(dst, sig, k)`        | FIR filter / convolution      | 8x / 4x / 2x                        |\n|                 | `ConvolveValidMulti(dsts, sig, ks)` | Multi-kernel convolution      | 8x / 4x / 2x                        |\n|                 | `AccumulateAdd(dst, src, off)`      | Overlap-add: dst[off:] += src | 8x / 4x / 2x                        |\n| **Audio**       | `Interleave2(dst, a, b)`            | Pack stereo: [L,R,L,R,...]    | 4x / 2x                             |\n|                 | `Deinterleave2(a, b, src)`          | Unpack stereo to channels     | 4x / 2x                             |\n|                 | `CubicInterpDot(hist,a,b,c,d,x)`    | Fused cubic interp dot product| 4x / 2x                             |\n|                 | `Int32ToFloat32Scale(dst,src,s)`    | PCM int32 to normalized float | 8x / 4x                             |\n\n### `f32` - float32 Operations\n\nSame API as `f64` but for `float32` with wider SIMD:\n\n| Architecture    | SIMD Width  |\n| --------------- | ----------- |\n| AMD64 (AVX-512) | 16x float32 |\n| AMD64 (AVX+FMA) | 8x float32  |\n| AMD64 (SSE2)    | 4x float32  |\n| ARM64 (NEON)    | 4x float32  |\n\n**Additional split-format complex operations** (for FFT pipelines with separate real/imag arrays):\n\n| Category   | Function                              | Description                        | SIMD Width       |\n| ---------- | ------------------------------------- | ---------------------------------- | ---------------- |\n| **Complex**| `MulComplex(dstRe,dstIm,aRe,aIm,bRe,bIm)` | Split-format 
complex multiply  | 8x (AVX) / 4x (NEON) |\n|            | `MulConjComplex(dstRe,dstIm,aRe,aIm,bRe,bIm)` | Multiply by conjugate      | 8x / 4x          |\n|            | `AbsSqComplex(dst,aRe,aIm)`           | Magnitude squared                  | 8x / 4x          |\n|            | `ButterflyComplex(uRe,uIm,lRe,lIm,twRe,twIm)` | FFT butterfly with twiddle | 8x / 4x          |\n|            | `RealFFTUnpack(outRe,outIm,zRe,zIm,twRe,twIm)` | Real FFT unpack step     | 8x / 4x          |\n| **Utility**| `Reverse(dst, src)`                   | Reverse slice order                | 8x / 4x          |\n|            | `AddSub(sum, diff, a, b)`             | Fused sum and difference           | 8x / 4x          |\n\n### `f16` - float16 (Half-Precision) Operations\n\nIEEE 754 half-precision floating-point operations, optimized for ML inference, audio DSP, and memory-bandwidth-bound workloads.\n\n```go\nimport \"github.com/tphakala/simd/f16\"\n\n// Convert between float32 and float16\nh := f16.FromFloat32(3.14)\nf := f16.ToFloat32(h)\n\n// Vector operations (same API as f32/f64)\na := make([]f16.Float16, 1024)\nb := make([]f16.Float16, 1024)\ndst := make([]f16.Float16, 1024)\n\nf16.Add(dst, a, b)           // Element-wise addition\ndot := f16.DotProduct(a, b)  // Dot product (returns float32)\nf16.ReLU(dst, a)             // Activation functions\n```\n\n| Category        | Function                            | Description                   | SIMD Width       |\n| --------------- | ----------------------------------- | ----------------------------- | ---------------- |\n| **Conversion**  | `ToFloat32(h)`                      | FP16 → float32                | Scalar           |\n|                 | `FromFloat32(f)`                    | float32 → FP16                | Scalar           |\n|                 | `ToFloat32Slice(dst, src)`          | Batch FP16 → float32          | 8x (NEON+FP16)   |\n|                 | `FromFloat32Slice(dst, src)`        | Batch float32 → FP16       
   | 8x (NEON+FP16)   |\n| **Arithmetic**  | `Add(dst, a, b)`                    | Element-wise addition         | 8x (NEON+FP16)   |\n|                 | `Sub(dst, a, b)`                    | Element-wise subtraction      | 8x (NEON+FP16)   |\n|                 | `Mul(dst, a, b)`                    | Element-wise multiplication   | 8x (NEON+FP16)   |\n|                 | `Div(dst, a, b)`                    | Element-wise division         | 8x (NEON+FP16)   |\n|                 | `Scale(dst, a, s)`                  | Multiply by scalar            | 8x (NEON+FP16)   |\n|                 | `AddScalar(dst, a, s)`              | Add scalar                    | 8x (NEON+FP16)   |\n|                 | `FMA(dst, a, b, c)`                 | Fused multiply-add: a*b+c     | 8x (NEON+FP16)   |\n|                 | `AddScaled(dst, alpha, s)`          | dst += alpha*s (AXPY)         | 8x (NEON+FP16)   |\n| **Unary**       | `Abs(dst, a)`                       | Absolute value                | 8x (NEON+FP16)   |\n|                 | `Neg(dst, a)`                       | Negation                      | 8x (NEON+FP16)   |\n|                 | `Sqrt(dst, a)`                      | Square root                   | 8x (NEON+FP16)   |\n|                 | `Reciprocal(dst, a)`                | Reciprocal (1/x)              | 8x (NEON+FP16)   |\n| **Reduction**   | `DotProduct(a, b)` → float32        | Dot product                   | 8x (NEON+FP16)   |\n|                 | `Sum(a)` → float32                  | Sum of elements               | 8x (NEON+FP16)   |\n|                 | `Min(a)`                            | Minimum value                 | 8x (NEON+FP16)   |\n|                 | `Max(a)`                            | Maximum value                 | 8x (NEON+FP16)   |\n|                 | `MinIdx(a)`                         | Index of minimum              | Pure Go          |\n|                 | `MaxIdx(a)`                         | Index of maximum              | Pure Go        
  |\n| **Statistical** | `Mean(a)` → float32                 | Arithmetic mean               | 8x (NEON+FP16)   |\n|                 | `Variance(a)` → float32             | Population variance           | Pure Go          |\n|                 | `StdDev(a)` → float32               | Standard deviation            | Pure Go          |\n| **Vector**      | `EuclideanDistance(a, b)` → float32 | L2 distance                   | Pure Go          |\n|                 | `Normalize(dst, a)`                 | Unit vector normalization     | 8x (NEON+FP16)   |\n|                 | `CumulativeSum(dst, a)`             | Running sum                   | Sequential       |\n| **Range**       | `Clamp(dst, a, min, max)`           | Clamp to range                | 8x (NEON+FP16)   |\n|                 | `ClampScale(dst, src, min, max, s)` | Fused clamp and scale         | Pure Go          |\n| **Activation**  | `ReLU(dst, src)`                    | Rectified Linear Unit         | 8x (NEON+FP16)   |\n|                 | `Sigmoid(dst, src)`                 | Sigmoid: 1/(1+e^-x)           | Pure Go          |\n|                 | `Tanh(dst, src)`                    | Hyperbolic tangent            | Pure Go          |\n|                 | `Exp(dst, src)`                     | Exponential e^x               | Pure Go          |\n| **Batch**       | `DotProductBatch(r, rows, v)`       | Multiple dot products         | 8x (NEON+FP16)   |\n| **Signal**      | `ConvolveValid(dst, sig, k)`        | FIR filter / convolution      | Pure Go          |\n|                 | `AccumulateAdd(dst, src, off)`      | Overlap-add: dst[off:] += src | 8x (NEON+FP16)   |\n| **Audio**       | `Interleave2(dst, a, b)`            | Pack stereo: [L,R,L,R,...]    
| Pure Go          |\n|                 | `Deinterleave2(a, b, src)`          | Unpack stereo to channels     | Pure Go          |\n\n**Key characteristics:**\n\n- **Storage**: IEEE 754 half-precision (1 sign, 5 exponent, 10 mantissa bits)\n- **Precision**: ~3.3 decimal digits, range ~6×10⁻⁸ to 65504\n- **Reductions**: Accumulate in float32 for numerical stability\n- **Memory efficiency**: 2x bandwidth vs float32 (8 elements per 128-bit NEON vector)\n\n**Hardware requirements:**\n\n- **Native FP16 SIMD**: ARM64 with FEAT_FP16 (ARMv8.2-A+)\n  - Apple Silicon (M1/M2/M3/M4) ✅\n  - Cortex-A55, A75, A76, A77, A78, X1, X2, X3 ✅\n  - Raspberry Pi 5 (Cortex-A76) ✅\n- **Pure Go fallback**: All other platforms\n  - Raspberry Pi 3/4 (Cortex-A53/A72 - ARMv8.0) - works but no SIMD acceleration\n  - AMD64 - works but no SIMD acceleration\n\n### `c128` - complex128 Operations\n\nSIMD-accelerated complex number operations for FFT-based signal processing:\n\n| Category       | Function             | Description                        | SIMD Width              |\n| -------------- | -------------------- | ---------------------------------- | ----------------------- |\n| **Arithmetic** | `Mul(dst, a, b)`     | Complex multiplication             | 4x (AVX-512) / 2x (AVX) |\n|                | `MulConj(dst, a, b)` | Multiply by conjugate: a × conj(b) | 4x / 2x                 |\n|                | `Scale(dst, a, s)`   | Scale by complex scalar            | 4x / 2x                 |\n|                | `Add(dst, a, b)`     | Complex addition                   | 4x / 2x                 |\n|                | `Sub(dst, a, b)`     | Complex subtraction                | 4x / 2x                 |\n| **Unary**      | `Abs(dst, a)`        | Complex magnitude \\|a + bi\\|       | 4x (AVX-512) / 2x (AVX) |\n|                | `AbsSq(dst, a)`      | Magnitude squared \\|a + bi\\|²      | 4x / 2x                 |\n|                | `Conj(dst, a)`       | Complex conjugate: a - bi          | 4x / 
2x                 |\n\nThese operations are designed for FFT-based signal processing pipelines:\n\n```go\nimport \"github.com/tphakala/simd/c128\"\n\n// Frequency-domain multiplication (FFT convolution)\nsignalFFT := make([]complex128, n)\nkernelFFT := make([]complex128, n)\nresult := make([]complex128, n)\nmagnitude := make([]float64, n)\n\n// Frequency-domain filtering\nc128.Mul(result, signalFFT, kernelFFT)          // Complex multiply\nc128.MulConj(result, signalFFT, kernelFFT)      // Cross-correlation\n\n// Spectrogram and magnitude analysis\nc128.Abs(magnitude, signalFFT)                  // Extract magnitude for display\n```\n\n**Use Cases:**\n\n- **Abs/AbsSq**: Spectrograms, power spectral density, frequency analysis\n- **Conj**: Cross-correlation, frequency-domain filtering\n- **Mul/MulConj**: FFT-based convolution, filtering, correlation\n\n**Benchmark (1024 elements, Intel i7-1260P AVX+FMA):**\n\n| Operation | SIMD    | Pure Go | Speedup   |\n| --------- | ------- | ------- | --------- |\n| Mul       | 341 ns  | 757 ns  | **2.2x**  |\n| MulConj   | 340 ns  | 749 ns  | **2.2x**  |\n| Scale     | 253 ns  | 551 ns  | **2.2x**  |\n| Add       | 86 ns   | 189 ns  | **2.2x**  |\n| Abs       | 1326 ns | 2260 ns | **1.7x**  |\n| AbsSq     | 367 ns  | 504 ns  | **1.37x** |\n| Conj      | 304 ns  | 474 ns  | **1.56x** |\n\n### `c64` - complex64 Operations\n\nSIMD-accelerated single-precision complex number operations:\n\n| Category       | Function             | Description                        | SIMD Width                        |\n| -------------- | -------------------- | ---------------------------------- | --------------------------------- |\n| **Arithmetic** | `Mul(dst, a, b)`     | Complex multiplication             | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |\n|                | `MulConj(dst, a, b)` | Multiply by conjugate: a × conj(b) | 8x / 4x / 2x                      |\n|                | `Scale(dst, a, s)`   | Scale by complex scalar            | 8x / 
4x / 2x                      |\n|                | `Add(dst, a, b)`     | Complex addition                   | 8x / 4x / 2x                      |\n|                | `Sub(dst, a, b)`     | Complex subtraction                | 8x / 4x / 2x                      |\n| **Unary**      | `Abs(dst, a)`        | Complex magnitude \\|a + bi\\|       | 8x / 4x / 2x                      |\n|                | `AbsSq(dst, a)`      | Magnitude squared \\|a + bi\\|²      | 8x / 4x / 2x                      |\n|                | `Conj(dst, a)`       | Complex conjugate: a - bi          | 8x / 4x / 2x                      |\n| **Conversion** | `FromReal(dst, src)` | Real to complex: src → src+0i      | 8x / 4x / 2x                      |\n\nSame API as `c128` but for `complex64` with 2x wider SIMD (8 bytes vs 16 bytes per element):\n\n```go\nimport \"github.com/tphakala/simd/c64\"\n\n// Single-precision FFT processing\nsignalFFT := make([]complex64, n)\nkernelFFT := make([]complex64, n)\nresult := make([]complex64, n)\nmagnitude := make([]float32, n)\n\nc64.Mul(result, signalFFT, kernelFFT)     // Complex multiply\nc64.Abs(magnitude, signalFFT)              // Extract magnitude\n```\n\n## Performance\n\n### AMD64 (Intel Core i7-1260P, AVX+FMA)\n\n#### float64 Operations - SIMD vs Pure Go (1024 elements)\n\n| Category        | Operation         | SIMD (ns) | Go (ns) | Speedup  |\n| --------------- | ----------------- | --------- | ------- | -------- |\n| **Arithmetic**  | Add               | 84        | 446     | **5.3x** |\n|                 | Sub               | 84        | 335     | **4.0x** |\n|                 | Mul               | 86        | 436     | **5.1x** |\n|                 | Div               | 441       | 941     | **2.1x** |\n|                 | Scale             | 68        | 272     | **4.0x** |\n|                 | AddScalar         | 68        | 286     | **4.2x** |\n|                 | FMA               | 110       | 557     | **5.0x** |\n| **Unary**       | 
Abs               | 66        | 365     | **5.6x** |\n|                 | Neg               | 66        | 306     | **4.6x** |\n|                 | Sqrt              | 658       | 1323    | **2.0x** |\n|                 | Reciprocal        | 447       | 920     | **2.1x** |\n| **Reduction**   | DotProduct        | 162       | 859     | **5.3x** |\n|                 | Sum               | 82        | 184     | **2.3x** |\n|                 | Min               | 157       | 340     | **2.2x** |\n|                 | Max               | 154       | 352     | **2.3x** |\n| **Statistical** | Mean              | 82        | 184     | **2.3x** |\n|                 | Variance\\*        | 820       | 902     | **1.1x** |\n|                 | StdDev\\*          | 825       | 905     | **1.1x** |\n| **Vector**      | EuclideanDistance | 216       | 1071    | **5.0x** |\n|                 | Normalize         | 220       | 1080    | **4.9x** |\n|                 | CumulativeSum     | 428       | 425     | 1.0x     |\n| **Range**       | Clamp             | 81        | 640     | **7.9x** |\n\n\\*Variance/StdDev benchmarked at 4096 elements (SIMD benefits at larger sizes)\n\n#### float32 Operations - SIMD vs Pure Go (1024 elements)\n\n| Category       | Operation  | SIMD (ns) | Go (ns) | Speedup   |\n| -------------- | ---------- | --------- | ------- | --------- |\n| **Arithmetic** | Add        | 47        | 441     | **9.4x**  |\n|                | Sub        | 49        | 339     | **6.9x**  |\n|                | Mul        | 49        | 436     | **8.9x**  |\n|                | Div        | 138       | 655     | **4.8x**  |\n|                | Scale      | 40        | 299     | **7.4x**  |\n|                | AddScalar  | 39        | 272     | **7.0x**  |\n|                | FMA        | 64        | 444     | **6.9x**  |\n| **Unary**      | Abs        | 37        | 656     | **17.6x** |\n|                | Neg        | 40        | 273     | **6.9x**  |\n| **Reduction**  | 
DotProduct | 71        | 424     | **5.9x**  |\n|                | Sum        | 41        | 123     | **3.0x**  |\n|                | Min        | 65        | 340     | **5.2x**  |\n|                | Max        | 66        | 352     | **5.3x**  |\n| **Range**      | Clamp      | 47        | 701     | **14.8x** |\n\n#### Activation Functions - SIMD vs Pure Go\n\n**float32 (1024 elements):**\n\n| Function   | SIMD (ns) | Go (ns)  | Speedup    | SIMD Throughput |\n| ---------- | --------- | -------- | ---------- | --------------- |\n| Sigmoid    | 138       | 5906     | **43x**    | 59.3 GB/s       |\n| ReLU       | 39        | 662      | **17x**    | 211 GB/s        |\n| Tanh       | 138       | 28116    | **204x**   | 59.5 GB/s       |\n\n**float64 (1024 elements):**\n\n| Function   | SIMD (ns) | Go (ns)  | Speedup    | SIMD Throughput |\n| ---------- | --------- | -------- | ---------- | --------------- |\n| ReLU       | 68        | 646      | **9.5x**   | 240 GB/s        |\n| Tanh       | 445       | 6230     | **14x**    | 36.8 GB/s       |\n\n**Key Characteristics:**\n\n- **Tanh**: 200x+ speedup for f32 - fast approximation with saturation vs math.Tanh\n- **ReLU**: Highest throughput (211-240 GB/s) - simple max(0, x) operation\n- **Sigmoid**: 43x speedup for f32 - fast approximation with exponential\n\n#### Batch \u0026 Signal Processing (varied sizes)\n\n| Operation                | Config                | SIMD    | Go      | Speedup  |\n| ------------------------ | --------------------- | ------- | ------- | -------- |\n| DotProductBatch (f64)    | 256 vec × 100 rows    | 3.2 µs  | 20.5 µs | **6.4x** |\n| DotProductBatch (f32)    | 256 vec × 100 rows    | 1.5 µs  | 9.8 µs  | **6.7x** |\n| ConvolveValid (f64)      | 4096 sig × 64 ker     | 26.6 µs | 169 µs  | **6.3x** |\n| ConvolveValid (f32)      | 4096 sig × 64 ker     | 17.9 µs | 80 µs   | **4.5x** |\n| ConvolveValidMulti (f64) | 1000 sig × 64 ker × 2 | 13.4 µs | -       | -        |\n| CubicInterpDot (f64) 
    | 241 taps              | 47 ns   | 88 ns   | **1.9x** |\n| CubicInterpDot (f32)     | 241 taps              | 21 ns   | 66 ns   | **3.1x** |\n| Int32ToFloat32Scale      | 1024 elements         | 40 ns   | 364 ns  | **9.0x** |\n| Int32ToFloat32Scale      | 4096 elements         | 153 ns  | 1439 ns | **9.4x** |\n| Interleave2 (f64)        | 1000 pairs            | 216 ns  | -       | -        |\n| Deinterleave2 (f64)      | 1000 pairs            | 216 ns  | -       | -        |\n| Interleave2 (f32)        | 1000 pairs            | 109 ns  | -       | -        |\n| Deinterleave2 (f32)      | 1000 pairs            | 216 ns  | -       | -        |\n\n#### Performance Summary\n\n| Package  | Average Speedup | Best         | Operations   |\n| -------- | --------------- | ------------ | ------------ |\n| **f32**  | **6.5x**        | 21.8x (Abs)  | 35 functions |\n| **f64**  | **3.2x**        | 7.9x (Clamp) | 32 functions |\n| **c128** | **1.77x**       | 2.2x (Mul)   | 8 functions  |\n| **c64**  | **~2x**         | ~3x (Mul)    | 9 functions  |\n\n### ARM64 (Raspberry Pi 5, NEON)\n\n#### float64 Operations\n\n| Operation  | Size | Time   | Throughput |\n| ---------- | ---- | ------ | ---------- |\n| DotProduct | 277  | 151 ns | 29 GB/s    |\n| DotProduct | 1000 | 513 ns | 31 GB/s    |\n| Add        | 1000 | 775 ns | 31 GB/s    |\n| Mul        | 1000 | 727 ns | 33 GB/s    |\n| FMA        | 1000 | 890 ns | 36 GB/s    |\n| Sum        | 1000 | 635 ns | 13 GB/s    |\n| Mean       | 1000 | 677 ns | 12 GB/s    |\n\n#### float32 Operations\n\n| Operation  | Size  | Time    | Throughput |\n| ---------- | ----- | ------- | ---------- |\n| DotProduct | 100   | 37 ns   | 21 GB/s    |\n| DotProduct | 1000  | 263 ns  | 30 GB/s    |\n| DotProduct | 10000 | 2.78 µs | 29 GB/s    |\n| Add        | 1000  | 389 ns  | 31 GB/s    |\n| Mul        | 1000  | 390 ns  | 31 GB/s    |\n| FMA        | 1000  | 479 ns  | 33 GB/s    |\n\n#### Comparison vs Pure Go\n\n| Operation        | Size | SIMD  
 | Pure Go | Speedup  |\n| ---------------- | ---- | ------ | ------- | -------- |\n| DotProduct (f32) | 100  | 37 ns  | 137 ns  | **3.7x** |\n| DotProduct (f32) | 1000 | 262 ns | 1350 ns | **5.2x** |\n| DotProduct (f64) | 100  | 62 ns  | 138 ns  | **2.2x** |\n| DotProduct (f64) | 1000 | 513 ns | 1353 ns | **2.6x** |\n| Add (f32)        | 1000 | 389 ns | 2015 ns | **5.2x** |\n| Sum (f32)        | 1000 | 343 ns | 1327 ns | **3.9x** |\n\n### Performance Notes\n\n- **AMD64**: Explicit SIMD is roughly **2x to 9x** faster than pure Go for most operations in the tables above (averaging 3.2x for f64 and 6.5x for f32), with consistent high throughput across all vector sizes.\n\n- **ARM64**: NEON SIMD provides substantial speedups over pure Go in the benchmarks above:\n  - float32: **3.7x - 5.2x** faster (4 elements per 128-bit vector)\n  - float64: **2.2x - 2.6x** faster (2 elements per 128-bit vector)\n\n- **CumulativeSum** is inherently sequential (each element depends on the previous) and uses pure Go on all platforms.\n\n## Known Limitations\n\n### Small Slice Fallback for Min/Max (AMD64)\n\nOn AMD64, the `Min` and `Max` functions fall back to pure Go for small slices:\n\n- **float64**: slices with fewer than 4 elements\n- **float32**: slices with fewer than 8 elements\n\nThis is because the AVX assembly loads a full vector at once (4 float64s or 8 float32s), which would read out of bounds on slices shorter than one vector.\n\nThe Go fallback for small slices is intentional and likely optimal - SIMD setup overhead (register loading, masking, horizontal reduction) would exceed the cost of a simple comparison loop over a handful of elements.\n\n## Architecture Support\n\n| Architecture | Instruction Set | f64/f32/c128/c64  | f16               |\n| ------------ | --------------- | ----------------- | ----------------- |\n| AMD64        | AVX-512         | Full SIMD support | Pure Go fallback  |\n| AMD64        | AVX + FMA       | Full SIMD support | Pure Go fallback  |\n| AMD64        | SSE4.1          | Full SIMD support | Pure Go fallback  |\n| 
ARM64        | NEON + FP16     | Full SIMD support | Full SIMD support |\n| ARM64        | NEON only       | Full SIMD support | Pure Go fallback  |\n| Other        | -               | Pure Go fallback  | Pure Go fallback  |\n\n**ARM64 FP16 support by device:**\n\n| Device / SoC              | Core(s)       | Architecture | FP16 SIMD |\n| ------------------------- | ------------- | ------------ | --------- |\n| Apple Silicon (M1-M4)     | Firestorm+    | ARMv8.4-A    | ✅ Yes    |\n| Raspberry Pi 5            | Cortex-A76    | ARMv8.2-A    | ✅ Yes    |\n| Raspberry Pi 4            | Cortex-A72    | ARMv8.0-A    | ❌ No     |\n| Raspberry Pi 3            | Cortex-A53    | ARMv8.0-A    | ❌ No     |\n| AWS Graviton 2/3          | Neoverse N1/V1| ARMv8.2-A+   | ✅ Yes    |\n| Ampere Altra              | Neoverse N1   | ARMv8.2-A    | ✅ Yes    |\n\n## Design Principles\n\n1. **Pure Go assembly** - Native Go assembler for maximum portability and easy cross-compilation\n2. **Runtime dispatch** - CPU features detected once at init time, zero runtime overhead\n3. **Zero allocations** - No heap allocations in hot paths\n4. **Safe defaults** - Gracefully falls back to pure Go on unsupported CPUs\n5. **Boundary safe** - Handles any slice length, not just SIMD-aligned sizes\n\n## Testing\n\nThe library includes comprehensive tests with pure Go reference implementations for validation:\n\n```bash\n# Run all tests\ngo test ./...\n\n# Run tests with verbose output\ntask test\n\n# Run benchmarks\ntask bench\n\n# Compare SIMD vs pure Go performance\ntask bench:compare\n\n# Show CPU SIMD capabilities\ntask cpu\n```\n\nSee [Taskfile.yml](Taskfile.yml) for all available tasks.\n\n## Contributing\n\nContributions are welcome! 
Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftphakala%2Fsimd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftphakala%2Fsimd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftphakala%2Fsimd/lists"}