{"id":34728825,"url":"https://github.com/lim-james/gemm","last_synced_at":"2026-05-24T01:32:08.167Z","repository":{"id":326988650,"uuid":"1107354564","full_name":"lim-james/gemm","owner":"lim-james","description":"generalized (square) matrix multiplication w/ C++26 experimental::simd","archived":false,"fork":false,"pushed_at":"2025-12-25T09:31:33.000Z","size":888,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-26T09:51:03.609Z","etag":null,"topics":["cpp26","google-benchmark","matrix-multiplication","simd"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lim-james.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-01T02:45:02.000Z","updated_at":"2025-12-25T09:31:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lim-james/gemm","commit_stats":null,"previous_names":["lim-james/gemm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lim-james/gemm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lim-james%2Fgemm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lim-james%2Fgemm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lim-james%2Fgemm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lim-james%2Fgemm/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lim-james","download_url":"https://codeload.github.com/lim-james/gemm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lim-james%2Fgemm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33418547,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T22:14:44.296Z","status":"ssl_error","status_checked_at":"2026-05-23T22:14:43.778Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp26","google-benchmark","matrix-multiplication","simd"],"created_at":"2025-12-25T02:54:54.671Z","updated_at":"2026-05-24T01:32:08.156Z","avatar_url":"https://github.com/lim-james.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GEMM\n`experimental C++26`, `SIMD`, `matrix multiplication`\n\n\u003e Another low-latency experiment ported over because it deserves its own repository\n\n## Motivation\n\nWorking with [SWAR](https://github.com/lim-james/swar-stoi) wasn't enough; I\nwanted to experience the true SIMD benefits. 
It was also an opportunity to get exposure to\nC++26 documentation and experimental features.\n\n# Benchmark Results\n\nThe following tables present the performance metrics for different algorithms across various problem sizes.\n\n**Legend:**\n* **k, M, G:** Kilo ($10^3$), Mega ($10^6$), Giga ($10^9$).\n* **Scaling (x):** The value in parentheses compares the algorithm's performance to the **Naive** implementation. Where the Naive column shows `-`, the ratio is relative to **Transposed** instead.\n    * For **GOps** and **Bandwidth**, this is **Throughput Improvement** (Algo / Naive). Values $\u003e 1.0$x indicate higher throughput.\n    * For **L1D/LLC cache misses**, **CPU Cycles**, and **Instructions**, the ratio is also Algo / Naive, so values $\u003c 1.0$x indicate fewer events than the baseline.\n\n## SSE2 vs AVX2 comparison\n\n![GOps AVX2](img/comparison_GOps.png)\n\n#### 1. Execution Cycles (Lower is Better)\n\n\n| Matrix Size () | Method | SSE2 Cycles | AVX2 Cycles | Speedup |\n| --- | --- | --- | --- | --- |\n| **256** | SIMD | 12,392,340 | 4,325,296 | **2.86x** |\n| **512** | SIMD | 119,912,500 | 34,627,020 | **3.46x** |\n| **1024** | SIMD | 1,353,563,000 | 750,581,500 | **1.80x** |\n| **2048** | SIMD | 18,271,790,000 | 13,429,540,000 | **1.36x** |\n| **2048** | **TILING** | 7,979,022,000 | 4,565,528,000 | **1.75x** |\n\n\n#### 2. Instruction Retirement (Efficiency)\n\n\n| Matrix Size () | Method | SSE2 Instructions | AVX2 Instructions | Reduction Ratio |\n| --- | --- | --- | --- | --- |\n| **256** | SIMD | 59,574,530 | 13,633,790 | **4.37x fewer** |\n| **512** | SIMD | 473,173,800 | 104,862,000 | **4.51x fewer** |\n| **1024** | SIMD | 3,771,736,000 | 822,091,200 | **4.58x fewer** |\n| **2048** | SIMD | 30,119,320,000 | 6,509,578,000 | **4.62x fewer** |\n\n\n#### 3. 
Pipeline \u0026 Cache Telemetry\n\n\n| Matrix Size () | Variant | IPC (Avg) | L1D Misses | LLC Misses | Bottleneck |\n| --- | --- | --- | --- | --- | --- |\n| **512** | SSE2 SIMD | 3.95 | 8,495,677 | 208,755 | Compute |\n| **512** | AVX2 SIMD | 3.03 | 8,495,061 | 205,243 | Compute |\n| **2048** | SSE2 SIMD | 1.65 | 551,055,106 | 14,060,395 | Memory Latency |\n| **2048** | AVX2 SIMD | **0.48** | 549,728,274 | 13,891,965 | **DRAM Bandwidth** |\n| **2048** | **AVX2 TILING** | **1.63** | 292,034,791 | 29,730,258 | Cache Efficient |\n\n\n## Individual results (SSE2)\n\n![GOps SSE2](img/benchmark_graph_GOps_sse2.png)\n\n#### GOps (Billions of Operations per Second)\n\n|   Size | Naive   | Transposed      | Simd            | Tiling          |\n|-------:|:--------|:----------------|:----------------|:----------------|\n|      4 | 183.40  | 182.12 (0.99x)  | 182.93 (1.00x)  | 150.27 (0.82x)  |\n|      8 | 1.49k   | 1.46k (0.98x)   | 1.47k (0.99x)   | 1.47k (0.99x)   |\n|     16 | 11.12k  | 11.94k (1.07x)  | 11.56k (1.04x)  | 10.73k (0.96x)  |\n|     32 | 96.97k  | 92.94k (0.96x)  | 92.89k (0.96x)  | 93.50k (0.96x)  |\n|     64 | 739.54k | 749.09k (1.01x) | 1.04M (1.41x)   | 753.08k (1.02x) |\n|    128 | 5.97M   | 5.98M (1.00x)   | 5.54M (0.93x)   | 6.03M (1.01x)   |\n|    256 | 47.21M  | 48.61M (1.03x)  | 48.14M (1.02x)  | 48.01M (1.02x)  |\n|    512 | 449.29M | 396.31M (0.88x) | 375.03M (0.83x) | 381.22M (0.85x) |\n|   1024 | 274.70M | 3.64G (13.26x)  | 3.74G (13.61x)  | 3.14G (11.44x)  |\n|   2048 | 202.71M | 3.01G (14.83x)  | 4.45G (21.93x)  | 7.36G (36.33x)  |\n|   4096 | 205.20M | 3.40G (16.58x)  | 5.62G (27.39x)  | 8.98G (43.75x)  |\n\n#### Bandwidth\n\n|   Size | Naive   | Transposed      | Simd            | Tiling          |\n|-------:|:--------|:----------------|:----------------|:----------------|\n|      4 | 275.11  | 273.19 (0.99x)  | 274.40 (1.00x)  | 225.40 (0.82x)  |\n|      8 | 1.12k   | 1.10k (0.98x)   | 1.10k (0.99x)   | 1.10k (0.99x)   |\n|     16 | 4.17k 
  | 4.48k (1.07x)   | 4.33k (1.04x)   | 4.02k (0.96x)   |\n|     32 | 18.18k  | 17.43k (0.96x)  | 17.42k (0.96x)  | 17.53k (0.96x)  |\n|     64 | 69.33k  | 70.23k (1.01x)  | 97.86k (1.41x)  | 70.60k (1.02x)  |\n|    128 | 279.80k | 280.52k (1.00x) | 259.88k (0.93x) | 282.78k (1.01x) |\n|    256 | 1.11M   | 1.14M (1.03x)   | 1.13M (1.02x)   | 1.13M (1.02x)   |\n|    512 | 5.27M   | 4.64M (0.88x)   | 4.39M (0.83x)   | 4.47M (0.85x)   |\n|   1024 | 1.61M   | 21.35M (13.26x) | 21.90M (13.61x) | 18.41M (11.44x) |\n|   2048 | 593.86k | 8.81M (14.83x)  | 13.02M (21.93x) | 21.58M (36.33x) |\n|   4096 | 300.59k | 4.98M (16.58x)  | 8.23M (27.39x)  | 13.15M (43.75x) |\n\n#### L1D cache misses (min)\n|   SIZE | NAIVE   | TRANSPOSED      | SIMD            | TILING          |\n|-------:|:--------|:----------------|:----------------|:----------------|\n|      4 | 16.00   | 25.00 (1.56x)   | 21.00 (1.31x)   | 19.00 (1.19x)   |\n|      8 | 41.00   | 35.00 (0.85x)   | 28.00 (0.68x)   | 35.00 (0.85x)   |\n|     16 | 75.00   | 74.00 (0.99x)   | 73.00 (0.97x)   | 88.00 (1.17x)   |\n|     32 | 231.00  | 212.00 (0.92x)  | 214.00 (0.93x)  | 235.00 (1.02x)  |\n|     64 | 839.00  | 1.03k (1.22x)   | 936.00 (1.12x)  | 1.60k (1.91x)   |\n|    128 | 526.39k | 132.42k (0.25x) | 132.00k (0.25x) | 22.68k (0.04x)  |\n|    256 | 4.27M   | 1.06M (0.25x)   | 1.06M (0.25x)   | 447.18k (0.10x) |\n|    512 | 33.88M  | 8.49M (0.25x)   | 8.49M (0.25x)   | 5.06M (0.15x)   |\n|   1024 | -       | 68.08M          | 68.09M (1.00x)  | 57.14M (0.84x)  |\n|   2048 | -       | 547.22M         | 548.38M (1.00x) | 486.70M (0.89x) |\n\n#### LLC cache misses (avg)\n|   SIZE | NAIVE   | TRANSPOSED      | SIMD            | TILING          |\n|-------:|:--------|:----------------|:----------------|:----------------|\n|      4 | 12.20   | 131.80 (10.80x) | 23.00 (1.89x)   | 19.90 (1.63x)   |\n|      8 | 21.90   | 98.60 (4.50x)   | 28.60 (1.31x)   | 72.40 (3.31x)   |\n|     16 | 42.30   | 65.20 (1.54x)   | 24.50 (0.58x)   
| 113.90 (2.69x)  |\n|     32 | 55.10   | 97.40 (1.77x)   | 44.80 (0.81x)   | 106.40 (1.93x)  |\n|     64 | 190.00  | 128.40 (0.68x)  | 110.10 (0.58x)  | 210.20 (1.11x)  |\n|    128 | 613.90  | 289.30 (0.47x)  | 179.30 (0.29x)  | 1.94k (3.16x)   |\n|    256 | 231.29k | 9.52k (0.04x)   | 8.06k (0.03x)   | 17.84k (0.08x)  |\n|    512 | 32.20M  | 207.20k (0.01x) | 198.68k (0.01x) | 166.27k (0.01x) |\n|   1024 | -       | 1.60M           | 1.55M (0.97x)   | 1.57M (0.98x)   |\n|   2048 | -       | 13.94M          | 14.26M (1.02x)  | 19.89M (1.43x)  |\n\n#### CPU Cycles\n|   SIZE | NAIVE   | TRANSPOSED      | SIMD            | TILING          |\n|-------:|:--------|:----------------|:----------------|:----------------|\n|      4 | 890.20  | 5.91k (6.64x)   | 1.25k (1.41x)   | 1.50k (1.69x)   |\n|      8 | 1.56k   | 7.18k (4.62x)   | 2.48k (1.59x)   | 4.07k (2.62x)   |\n|     16 | 6.76k   | 7.45k (1.10x)   | 4.84k (0.72x)   | 9.12k (1.35x)   |\n|     32 | 30.99k  | 28.15k (0.91x)  | 25.88k (0.84x)  | 31.33k (1.01x)  |\n|     64 | 245.44k | 185.70k (0.76x) | 183.36k (0.75x) | 220.70k (0.90x) |\n|    128 | 2.34M   | 1.65M (0.70x)   | 1.66M (0.71x)   | 1.77M (0.76x)   |\n|    256 | 17.70M  | 13.13M (0.74x)  | 13.04M (0.74x)  | 13.79M (0.78x)  |\n|    512 | 292.10M | 108.75M (0.37x) | 108.97M (0.37x) | 110.13M (0.38x) |\n|   1024 | -       | 1.09G           | 1.10G (1.00x)   | 898.54M (0.82x) |\n|   2048 | -       | 12.64G          | 12.75G (1.01x)  | 7.33G (0.58x)   |\n\n#### Instructions\n|   SIZE | NAIVE   | TRANSPOSED      | SIMD            | TILING          |\n|-------:|:--------|:----------------|:----------------|:----------------|\n|      4 | 403.00  | 442.50 (1.10x)  | 484.00 (1.20x)  | 561.00 (1.39x)  |\n|      8 | 1.94k   | 2.58k (1.34x)   | 2.20k (1.13x)   | 2.07k (1.07x)   |\n|     16 | 17.78k  | 17.80k (1.00x)  | 14.54k (0.82x)  | 13.19k (0.74x)  |\n|     32 | 134.24k | 86.40k (0.64x)  | 86.40k (0.64x)  | 97.07k (0.72x)  |\n|     64 | 1.06M   | 639.86k (0.60x) | 
641.62k (0.61x) | 774.81k (0.73x) |\n|    128 | 8.43M   | 7.62M (0.90x)   | 8.11M (0.96x)   | 6.19M (0.73x)   |\n|    256 | 67.28M  | 59.97M (0.89x)  | 64.10M (0.95x)  | 48.79M (0.73x)  |\n|    512 | 537.53M | 474.22M (0.88x) | 507.25M (0.94x) | 390.81M (0.73x) |\n|   1024 | -       | 3.78G           | 4.04G (1.07x)   | 3.12G (0.83x)   |\n|   2048 | -       | 30.14G          | 32.28G (1.07x)  | 25.40G (0.84x)  |\n\n#### IPC\n|   SIZE | NAIVE   | TRANSPOSED   | SIMD         | TILING       |\n|-------:|:--------|:-------------|:-------------|:-------------|\n|      4 | 0.45    | 0.07 (0.16x) | 0.39 (0.87x) | 0.37 (0.82x) |\n|      8 | 1.24    | 0.36 (0.29x) | 0.89 (0.72x) | 0.51 (0.41x) |\n|     16 | 2.63    | 2.39 (0.91x) | 3.01 (1.14x) | 1.45 (0.55x) |\n|     32 | 4.33    | 3.07 (0.71x) | 3.34 (0.77x) | 3.10 (0.72x) |\n|     64 | 4.32    | 3.45 (0.80x) | 3.50 (0.81x) | 3.51 (0.81x) |\n|    128 | 3.60    | 4.63 (1.29x) | 4.88 (1.36x) | 3.49 (0.97x) |\n|    256 | 3.80    | 4.57 (1.20x) | 4.91 (1.29x) | 3.54 (0.93x) |\n|    512 | 1.84    | 4.36 (2.37x) | 4.66 (2.53x) | 3.55 (1.93x) |\n|   1024 | -       | 3.45         | 3.69 (1.07x) | 3.48 (1.01x) |\n|   2048 | -       | 2.38         | 2.53 (1.06x) | 3.47 (1.46x) |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flim-james%2Fgemm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flim-james%2Fgemm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flim-james%2Fgemm/lists"}