{"id":42260376,"url":"https://github.com/ooples/aidotnet.tensors","last_synced_at":"2026-05-29T07:18:24.836Z","repository":{"id":333900140,"uuid":"1139185768","full_name":"ooples/AiDotNet.Tensors","owner":"ooples","description":"High-performance tensor operations with SIMD and GPU acceleration for .NET","archived":false,"fork":false,"pushed_at":"2026-01-21T21:12:35.000Z","size":22401,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-22T05:05:43.783Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ooples.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-21T16:25:54.000Z","updated_at":"2026-01-21T21:07:19.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ooples/AiDotNet.Tensors","commit_stats":null,"previous_names":["ooples/aidotnet.tensors"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/ooples/AiDotNet.Tensors","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ooples%2FAiDotNet.Tensors","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ooples%2FAiDotNet.Tensors/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ooples%2FAiDotNet.Tensors/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ooples%2FAiDotNet.Tensors/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ooples","download_url":"https://codeload.github.com/ooples/AiDotNet.Tensors/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ooples%2FAiDotNet.Tensors/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28805362,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-27T05:43:52.625Z","status":"ssl_error","status_checked_at":"2026-01-27T05:43:48.957Z","response_time":168,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-27T06:04:04.280Z","updated_at":"2026-05-26T01:03:16.307Z","avatar_url":"https://github.com/ooples.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AiDotNet.Tensors\n\n[![NuGet](https://img.shields.io/nuget/v/AiDotNet.Tensors.svg)](https://www.nuget.org/packages/AiDotNet.Tensors/)\n[![Build](https://github.com/ooples/AiDotNet.Tensors/actions/workflows/build.yml/badge.svg)](https://github.com/ooples/AiDotNet.Tensors/actions/workflows/build.yml)\n[![License](https://img.shields.io/badge/license-BSL%201.1-blue.svg)](LICENSE)\n\nA high-performance .NET tensor library with hand-written AVX2/AVX-512 SIMD kernels in `SimdKernels.cs` / `SimdGemm.cs` / `SimdConvHelper.cs`. Every hot path runs through our own managed-C# kernels — we do NOT call into `System.Numerics.Tensors`, MKL.NET, or oneDNN through the standard wrappers. Beats ML.NET, TensorFlow.NET, MathNet, and NumSharp outright on every measured op. Against libtorch (TorchSharp's hand-tuned C++ kernels), wins on Mish 2.3×, Mish (double) 2.2×, **GELU (double) 1.6× ahead**, **Tanh (double) within noise**, Tanh (float) 1.4×, TensorMean/Min/Max, MaxPool2D, TensorAdd 100K, and TensorAdd 1M (vs single-thread torch) — all using pure managed C# with hand-tuned AVX2/FMA SIMD kernels and JIT-compiled machine code.\n\n\u003e **Note on dependencies.** The .nupkg ships with the following PackageReferences:\n\u003e `Microsoft.Extensions.Logging.Abstractions`, `System.Text.Json`,\n\u003e `System.Threading.Channels`, `K4os.Compression.LZ4` (LZ4 compression for\n\u003e serialized tensor blobs), `AiDotNet.Native.OpenBLAS` (transitive native\n\u003e OpenBLAS for fallback paths only — our SimdGemm beats it for d=128\n\u003e transformer hot paths), and **MKL via Microsoft.ML.Mkl.Redist (~66 MB on\n\u003e win-x64) + `intelmkl.redist.win-x64` (~500 MB on win-x64)** for the FP64\n\u003e kernels that haven't yet been ported to pure-managed AVX2 (Phase 0 remediation\n\u003e work tracks the port). For air-gapped / federal deployments we ship a custom\n\u003e build with MKL/OpenBLAS removed and the entire telemetry namespace compiled\n\u003e out — see [aidotnet.dev/enterprise](https://aidotnet.dev/enterprise) for\n\u003e the Enterprise tier including air-gapped builds.\n\u003e\n\u003e **Performance numbers above assume net8.0+.** On net471 the SIMD/intrinsics\n\u003e helpers are excluded (System.Runtime.Intrinsics is unavailable pre-net6); a\n\u003e custom net471 SIMD path that beats `System.Numerics.Vector\u003cT\u003e` is on the\n\u003e roadmap as Phase 5.\n\n## Features\n\n- **Zero Allocations**: In-place operations with `ArrayPool\u003cT\u003e` and `Span\u003cT\u003e` for hot paths\n- **Hand-Tuned SIMD**: Custom AVX2/FMA kernels with 4x loop unrolling, not just `Vector\u003cT\u003e` wrappers\n- **JIT-Compiled Kernels**: Runtime x86-64 machine code generation for size-specialized operations\n- **BLIS-Style GEMM**: Tiled matrix multiply with FMA micro-kernel, cache-aware panel packing\n- **GPU Acceleration**: Optional CUDA, HIP/ROCm, and OpenCL support via separate packages\n- **Multi-Target**: Supports .NET 10.0 and .NET Framework 4.7.1\n- **Generic Math**: Works with any numeric type via `INumericOperations\u003cT\u003e` interface\n\n## Installation\n\n```bash\n# Core package (CPU SIMD acceleration)\ndotnet add package AiDotNet.Tensors\n\n# Optional: OpenBLAS for optimized CPU BLAS operations\ndotnet add package AiDotNet.Native.OpenBLAS\n\n# Optional: CLBlast for OpenCL GPU acceleration (AMD/Intel/NVIDIA)\ndotnet add package AiDotNet.Native.CLBlast\n\n# Optional: CUDA for NVIDIA GPU acceleration (requires NVIDIA GPU)\ndotnet add package AiDotNet.Native.CUDA\n```\n\n## Quick Start\n\n```csharp\nusing AiDotNet.Tensors.LinearAlgebra;\n\n// Create vectors\nvar v1 = new Vector\u003cdouble\u003e(new[] { 1.0, 2.0, 3.0, 4.0 });\nvar v2 = new Vector\u003cdouble\u003e(new[] { 5.0, 6.0, 7.0, 8.0 });\n\n// SIMD-accelerated operations\nvar sum = v1 + v2;\nvar dot = v1.Dot(v2);\n\n// Create matrices\nvar m1 = new Matrix\u003cdouble\u003e(3, 3);\nvar m2 = Matrix\u003cdouble\u003e.Identity(3);\n\n// Matrix operations\nvar product = m1 * m2;\nvar transpose = m1.Transpose();\n```\n\n## CPU Benchmarks\n\nAll numbers from the latest BenchmarkDotNet run on AMD Ryzen 9 3950X (16 cores, AVX2/FMA, no AVX-512), .NET 10.0. Reproduce with:\n\n```bash\ndotnet run -c Release --project tests/AiDotNet.Tensors.Benchmarks --framework net10.0 -- --vs-all\n```\n\nThe full per-op result set with error bars lives in [`tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md`](tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md). The summary below is a hand-curated subset.\n\n### vs TorchSharp CPU (libtorch C++ backend)\n\nLatest BDN run, post-#209 perf fixes — captured **after** removing\n`System.Numerics.Tensors` entirely and routing every hot path through\nour in-house `SimdKernels`. **All comparisons are eager-vs-eager** —\nneither side uses `torch.compile` or AiDotNet compiled plans, so this\nis libtorch's hand-rolled C++ kernels against AiDotNet's pure managed\nC# + AVX2 SIMD. See\n[`tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md`](tests/AiDotNet.Tensors.Benchmarks/BENCHMARK_RESULTS.md)\nfor the full per-op table with error bars.\n\n**Big wins** — AiDotNet beats TorchSharp by 2× or more:\n\n| Operation | Size | AiDotNet | TorchSharp | Speedup |\n|-----------|------|---------:|-----------:|--------:|\n| Mish | 1M | **377 µs** | 884 µs | **2.3× faster** |\n| Mish (double) | 1M | **1,038 µs** | 2,313 µs | **2.2× faster** |\n\n**Wins** — AiDotNet beats TorchSharp:\n\n| Operation | Size | AiDotNet | TorchSharp | Speedup |\n|-----------|------|---------:|-----------:|--------:|\n| **GELU (double)** | 1M | **481 µs** | 753 µs | **1.6× faster** (was 3.6× behind!) |\n| **Tanh (double)** | 1M | **586 µs** | 627 µs | **1.07× faster** (was 3.3× behind!) |\n| Tanh (float) | 1M | **282 µs** | 406 µs | **1.4× faster** |\n| TensorAdd | 100K | **33 µs** | 42 µs | **1.3× faster** |\n| TensorMean | 1M | **189 µs** | 243 µs | **1.3× faster** |\n| TensorAdd | 1M (vs 1-thread torch) | **350 µs** | 468 µs | **1.3× vs 1-thread torch** |\n| MaxPool2D | — | **250 µs** | 285 µs | 1.1× faster |\n| TensorMin | 1M | **205 µs** | 215 µs | within noise (slight win) |\n| TensorMultiply | 100K | **37 µs** | 39 µs | within noise (slight win) |\n\n**Closer-to-parity** — AiDotNet within ~1.5× of libtorch:\n\n| Operation | Size | AiDotNet | TorchSharp | Ratio |\n|-----------|------|---------:|-----------:|------:|\n| ReLU | 1M | 261 µs | 191 µs | 1.4× |\n| Sigmoid | 1M | 326 µs | 223 µs | 1.5× |\n| TensorMaxValue | 1M | 195 µs | 189 µs | 1.03× |\n| TensorExp | 1M | 296 µs | 306 µs | within noise |\n| GELU (float) | 1M | 354 µs | 332 µs | 1.07× |\n| TensorSum | 1M | 229 µs | 212 µs | 1.08× |\n| TensorAbs | 1M | 362 µs | 221 µs | 1.6× |\n| LeakyReLU | 1M | 409 µs | 273 µs | 1.5× |\n| Exp (double) | 1M | 753 µs | 284 µs | 2.6× (was 4.3×) |\n| Log (double) | 1M | 612 µs | 355 µs | 1.7× (was 16×!) |\n\n**This PR's #209 close-parity wins** — validated against the pre-fix\nbaseline by fresh BDN re-runs and same-process micro-benchmarks:\n\n| Operation | Pre-fix | Post-fix | Improvement |\n|-----------|------:|--------:|------------:|\n| **Softmax_Double 512×1024** | 3,766 µs | **185 µs** (**slightly AHEAD** of torch's 206!) | **20× faster** |\n| GELU_Double 1M | 2,782 µs | **481 µs** (now **1.6× ahead** of torch!) | **5.8× faster** |\n| Tanh_Double 1M | 2,067 µs | **586 µs** (within noise of torch) | **3.5× faster** |\n| Log_Double 1M  | 5,785 µs | **612 µs** | **9.4× faster** |\n| Exp_Double 1M  | 1,634 µs | 753 µs | 2.2× faster |\n| LayerNorm 32k×64 | 1,347 µs | 890 µs | 1.5× faster |\n| TensorAdd 1M | 480 µs | 350 µs | 1.4× faster |\n| AttentionQKT 512×64 | 599 µs | **419 µs** (parallel-M pre-transpose) | **1.4× faster** |\n| AttentionQKT 512×128 | (not measured) | **451 µs** | (149 GFLOPS, parallel-M kernel) |\n| MatMul 256³ | 510 µs | **196 µs** (parallel-M SgemmDirect) | **2.6× faster** |\n| MatMul 512³ | 1,074 µs | **930 µs** | 1.15× faster |\n| Conv2D 1×16×64×64→32 | 458 µs (regressed to 764 with naive 4-oc) | **397 µs** (Auto policy picks PerChannel) | back to baseline + 13% |\n\n**Residual tracked gaps** — areas where libtorch's Intel MKL-DNN\n(with AVX-512 inner kernels on Intel hardware) still wins. These need\nmulti-day kernel rewrites (single-pass register-resident LayerNorm,\nfused QKᵀ attention kernel, BLIS-style 6×16 micro-kernel prefetch\ntuning) and are left as follow-up work:\n\n| Operation | Size | AiDotNet | TorchSharp | Ratio |\n|-----------|------|---------:|-----------:|------:|\n| TensorMatMul (float) | 256 | **196 µs** (parallel-M SgemmDirect) | 109 µs | 1.8× — was 4.7× |\n| TensorMatMul (float) | 512 | 930 µs | 534 µs | 1.7× — was 2.0× |\n| LayerNorm | 32k×64 | 890 µs | 303 µs | 2.9× |\n| BatchNorm | 32×64×32×32 | 2,201 µs | 745 µs | 3.0× |\n| Conv2D (float) | 1×16×64×64→32 | ~397 µs (Auto picks PerChannel) | 310 µs | 1.3× — was 2.3× before A/B fix |\n| Conv2D (double) | 4×3×32×32 | 438 µs | 115 µs | 3.8× — unchanged this PR |\n| AttentionQKT | 512×64 | **419 µs** (parallel-M pre-transpose) | 135 µs | 3.1× — was 4.3× |\n| AttentionQKT | 512×128 | **451 µs** (parallel-M) | — | 149 GFLOPS, was 1,102 µs |\n| Softmax_Double 512×1024 | — | **185 µs** | 206 µs | **slight win** ✓ closed |\n\n**Zero-external-dependency policy.** Every hot path runs through our\nhand-tuned `SimdKernels` AVX2/AVX-512 implementations. We deliberately\ndo NOT reference `System.Numerics.Tensors`, MKL, MKL.NET, or oneDNN —\nboth for supply-chain hygiene and because we measured several\nTensorPrimitives entry points to regress 4–20× vs our in-house kernels\non Ryzen 9 3950X (notably `Tanh(float)` 20× slower, `Sigmoid(double)`\n12× slower, `Log(double)` 4× slower). All double-precision and\nsingle-precision paths now go through the same hand-tuned SIMD\nkernels — no fallback to any external library.\n\n### vs ML.NET (Microsoft.ML, eager-vs-eager)\n\nLatest BDN run, validated post-#209-perf. Microsoft's general-purpose\nML framework — same Ryzen 9 3950X, same .NET 10.0.7.\n\n| Operation | Size | AiDotNet | ML.NET | Speedup |\n|-----------|------|---------:|-------:|--------:|\n| TensorMean | 1M | **80 µs** | 180 µs | **2.2× faster** |\n| TensorSum | 1M | **92 µs** | 104 µs | 1.1× faster |\n| TensorAdd | 100K | 106 µs | 55 µs | 0.5× (memory-bound — ML.NET stayed allocator-warm) |\n| TensorMultiply | 100K | 106 µs | 60 µs | 0.6× (memory-bound) |\n| TensorAdd | 1M | 800 µs | 601 µs | 0.75× (memory-bound) |\n| TensorMultiply | 1M | 782 µs | 595 µs | 0.76× (memory-bound) |\n\nThe 1M-element bulk ops are memory-bandwidth-bound: at ~50 GB/s\nsustained DRAM bandwidth on Zen 2, a 4 MB read + 4 MB read + 4 MB\nwrite = 12 MB of traffic per call → 240 µs theoretical floor before\nany allocator overhead. Both libraries are within 2× of that floor.\n\n### vs TensorFlow.NET CPU (eager-vs-eager)\n\nLatest BDN run, validated post-#209-perf. SciSharp's TensorFlow .NET\nbinding (eager mode, no graph compile). Same hardware. AiDotNet wins\noutright on every measured op except small-Conv2D and 256×256 MatMul.\n\n| Operation | Size | AiDotNet | TensorFlow.NET | Speedup |\n|-----------|------|---------:|---------------:|--------:|\n| TensorSum | 1M | **77 µs** | 259 µs | **3.4× faster** |\n| TensorMean | 1M | **76 µs** | 189 µs | **2.5× faster** |\n| TensorMultiply | 100K | **119 µs** | 202 µs | **1.7× faster** |\n| Sigmoid | 1M | **1,264 µs** | 1,941 µs | **1.5× faster** |\n| TensorAdd | 100K | **141 µs** | 211 µs | **1.5× faster** |\n| TensorMatMul | 512 | **1,286 µs** | 1,554 µs | **1.2× faster** |\n| TensorAdd | 1M | **1,340 µs** | 1,478 µs | 1.1× faster |\n| ReLU | 1M | 1,680 µs | 1,606 µs | within noise (high stddev 713 µs) |\n| TensorMultiply | 1M | 1,655 µs | 1,347 µs | 0.81× (memory-bound) |\n| TensorMatMul | 256 | 432 µs | 398 µs | 0.92× |\n| Conv2D | 4×3×32×32 | 719 µs | 428 µs | 0.6× |\n\nThe fresh validation run captured full data on bulk Add/Multiply +\n256/512 MatMul (the original `fcb7fea` baseline showed `NA` because\nSciSharp's TensorFlow.NET was crashing at those shapes; later runtime\nversions stabilized).\n\n### vs MathNet.Numerics (Linear Algebra, double, N=1000)\n\n| Operation | AiDotNet | MathNet | Speedup |\n|-----------|----------|---------|---------|\n| Matrix Multiply 1000×1000 | 8.3 ms | 49.2 ms | **6× faster** |\n| Matrix Add | 1.87 ms | 2.50 ms | **1.3× faster** |\n| Matrix Subtract | 2.08 ms | 2.47 ms | **1.2× faster** |\n| Matrix Scalar Multiply | 1.66 ms | 2.14 ms | **1.3× faster** |\n| Transpose | 2.85 ms | 3.68 ms | **1.3× faster** |\n| Dot Product | 97 ns | 817 ns | **8.4× faster** |\n| L2 Norm | 92 ns | 11,552 ns | **125× faster** |\n\n### vs NumSharp (N=1000)\n\n| Operation | AiDotNet | NumSharp | Speedup |\n|-----------|----------|----------|---------|\n| Matrix Multiply 1000×1000 | 8.3 ms | 26.5 s | **3,200× faster** |\n| Matrix Add | 1.87 ms | 1.98 ms | 1.1× faster |\n| Transpose | 2.85 ms | 13.7 ms | **4.8× faster** |\n| Vector Add | 1.47 us | 54.5 us | **37× faster** |\n\n### vs System.Numerics.Tensors.TensorPrimitives (historical — REMOVED)\n\nWe previously referenced `System.Numerics.Tensors` and benchmarked our\nkernels against `TensorPrimitives.*` directly. As of #209 the dependency\nis **removed entirely** — every elementwise op now runs through our\nin-house `SimdKernels`, both for supply-chain hygiene and because we\nmeasured several TensorPrimitives entry points to regress 4–20× vs our\nin-house kernels on Ryzen 9 3950X (notably `Tanh(float)` ~20× slower,\n`Sigmoid(double)` ~12× slower, `Log(double)` ~4× slower).\n\n| Operation | AiDotNet | TensorPrimitives (raw) | Speedup |\n|-----------|----------|------------------------|---------|\n| Sigmoid (1M, float) | **284 µs** | 7,295 µs | **25× faster** |\n| TensorAdd (100K, float) | **24 µs** | 138 µs | **5.7× faster** |\n| TensorAdd (1M, float) | **379 µs** | 614 µs | **1.6× faster** |\n| TensorSum (1M, float) | **196 µs** | 298 µs | **1.5× faster** |\n| Dot Product (1K, double, in-place) | 97 ns | 185 ns | **1.9× faster** |\n| L2 Norm (1K, double, in-place) | 92 ns | 187 ns | **2.0× faster** |\n\n### Small Matrix Multiply (double)\n\n| Size | AiDotNet | MathNet | NumSharp |\n|------|----------|---------|----------|\n| 4×4 | 172 ns | 165 ns | 2,198 ns |\n| 16×16 | 2.1 us | 2.9 us | 107.5 us |\n| 32×32 | 10.5 us | 36.2 us | 774.8 us |\n\nAiDotNet is **1.4× faster** at 16×16 and **3.4× faster** at 32×32 than MathNet.\n\n### SIMD Instruction Support\n\nThe library automatically detects and uses the best available SIMD instructions:\n\n| Instruction Set | Vector Width | Supported |\n|----------------|--------------|-----------|\n| AVX-512 | 512-bit (16 floats) | .NET 8+ |\n| AVX2 + FMA | 256-bit (8 floats) | .NET 6+ |\n| AVX | 256-bit (8 floats) | .NET 6+ |\n| SSE4.2 | 128-bit (4 floats) | .NET 6+ |\n| ARM NEON | 128-bit (4 floats) | .NET 6+ |\n\n### Check Available Acceleration\n\n```csharp\nusing AiDotNet.Tensors.Engines;\n\nvar caps = PlatformDetector.Capabilities;\n\n// SIMD capabilities\nConsole.WriteLine($\"AVX2: {caps.HasAVX2}\");\nConsole.WriteLine($\"AVX-512: {caps.HasAVX512F}\");\n\n// GPU support\nConsole.WriteLine($\"CUDA: {caps.HasCudaSupport}\");\nConsole.WriteLine($\"OpenCL: {caps.HasOpenCLSupport}\");\n\n// Native library availability\nConsole.WriteLine($\"OpenBLAS: {caps.HasOpenBlas}\");\nConsole.WriteLine($\"CLBlast: {caps.HasClBlast}\");\n\n// Or get a full status summary\nConsole.WriteLine(NativeLibraryDetector.GetStatusSummary());\n```\n\n## Optional Acceleration Packages\n\n### AiDotNet.Native.OpenBLAS\n\nProvides optimized CPU BLAS operations using OpenBLAS:\n\n```bash\ndotnet add package AiDotNet.Native.OpenBLAS\n```\n\n**Performance**: Accelerated BLAS operations for matrix multiply and decompositions.\n\n### AiDotNet.Native.CLBlast\n\nProvides GPU acceleration via OpenCL (works on AMD, Intel, and NVIDIA GPUs):\n\n```bash\ndotnet add package AiDotNet.Native.CLBlast\n```\n\n**Performance**: 10x+ faster for large matrix operations on GPU.\n\n### AiDotNet.Native.CUDA\n\nProvides GPU acceleration via NVIDIA CUDA (NVIDIA GPUs only):\n\n```bash\ndotnet add package AiDotNet.Native.CUDA\n```\n\n**Performance**: 30,000+ GFLOPS for matrix operations on modern NVIDIA GPUs.\n\n**Requirements**:\n- NVIDIA GPU (GeForce, Quadro, or Tesla)\n- NVIDIA display driver 525.60+ (includes CUDA driver)\n\n**Usage with helpful error messages**:\n\n```csharp\nusing AiDotNet.Tensors.Engines.DirectGpu.CUDA;\n\n// Recommended: throws beginner-friendly exception if CUDA unavailable\nusing var cuda = CudaBackend.CreateOrThrow();\n\n// Or check availability first\nif (CudaBackend.IsCudaAvailable)\n{\n    using var backend = new CudaBackend();\n    // Use CUDA acceleration\n}\n```\n\nIf CUDA is not available, you'll get detailed troubleshooting steps explaining exactly what's missing and how to fix it.\n\n## Requirements\n\n- .NET 10.0 or .NET Framework 4.7.1+\n- Windows x64, Linux x64, or macOS x64/arm64\n\n## License\n\nApache 2.0 - See [LICENSE](LICENSE) for details.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fooples%2Faidotnet.tensors","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fooples%2Faidotnet.tensors","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fooples%2Faidotnet.tensors/lists"}