{"id":31561007,"url":"https://github.com/lynncoleart/guda","last_synced_at":"2026-03-04T11:03:49.155Z","repository":{"id":308924266,"uuid":"1034308334","full_name":"LynnColeArt/guda","owner":"LynnColeArt","description":"A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library","archived":false,"fork":false,"pushed_at":"2025-08-09T15:49:24.000Z","size":4948,"stargazers_count":7,"open_issues_count":11,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-05T02:53:28.294Z","etag":null,"topics":["ai","blas","cuda","inference","llm-inference"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LynnColeArt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-08T07:22:26.000Z","updated_at":"2025-10-04T00:21:51.000Z","dependencies_parsed_at":"2025-08-08T18:49:04.403Z","dependency_job_id":null,"html_url":"https://github.com/LynnColeArt/guda","commit_stats":null,"previous_names":["lynncoleart/guda"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LynnColeArt/guda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LynnColeArt%2Fguda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LynnColeArt%2Fguda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LynnColeArt%2Fguda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LynnColeArt%2Fguda/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LynnColeArt","download_url":"https://codeload.github.com/LynnColeArt/guda/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LynnColeArt%2Fguda/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30078421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T08:01:56.766Z","status":"ssl_error","status_checked_at":"2026-03-04T08:00:42.919Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","blas","cuda","inference","llm-inference"],"created_at":"2025-10-05T02:46:35.905Z","updated_at":"2026-03-04T11:03:49.131Z","avatar_url":"https://github.com/LynnColeArt.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🧀 GUDA: A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library\n\n**[📚 Read the Full Manual](docs/manual/)** | **[🚀 Quick Start Guide](docs/manual/02-installation.md)** | **[🏗️ Architecture Overview](docs/manual/04-architecture.md)**\n\n## 🚀 Breaking News: 18x AI Inference Speedup!\n\nGUDA now includes **Fixed-Function Unit (FFU)** support, automatically leveraging specialized CPU instructions:\n- **AVX512-VNNI**: 37.5 GOPS for INT8 ops (18x speedup!)\n- **AES-NI**: 8.4 GB/s crypto throughput\n- **AMX**: 
2 TOPS potential (experimental)\n\n## Performance Highlights\n\nGUDA achieves **150+ GFLOPS** sustained performance on modern CPUs through aggressive optimization:\n\n| Operation | Performance | vs. Practical Peak | Cache Impact |\n|-----------|-------------|-------------------|--------------|\n| GEMM 1024×1024 | 154 GFLOPS | 154% | \u003c1% drop cold |\n| AXPY 16K | 40 GFLOPS | Memory limited | 0% drop cold |\n| DOT 1K | 68 GFLOPS | Memory limited | 0% drop cold |\n| Memory BW | 240+ GB/s | 90% of DDR5 | - |\n\n**[📊 Full Benchmark Results](BENCHMARK_RESULTS.md)** | **[🧊 Cold Cache Analysis](COLD_CACHE_ANALYSIS.md)** | **[📈 Benchmarking Guide](docs/BENCHMARKING_GUIDE.md)**\n\n## Abstract\n\nWe present GUDA (Go Unified Device Architecture), a novel implementation of CUDA-compatible APIs designed for CPU execution. Rather than simulating GPU hardware, GUDA provides a unified memory architecture and maps CUDA operations to highly optimized native CPU implementations. This library enables seamless deployment of CUDA applications on CPU-only infrastructure through aggressive SIMD optimization, native BLAS integration, and elimination of host-device memory transfers. Our implementation demonstrates that CPU-native approaches can provide a practical alternative for running CUDA applications where GPU hardware is unavailable.\n\n## 1. Introduction\n\nThe proliferation of GPU computing has created a dichotomy in the high-performance computing landscape, where applications are often tightly coupled to specific hardware accelerators. This coupling presents challenges for deployment flexibility, development workflows, and resource utilization in heterogeneous computing environments. 
\n\nGUDA addresses these challenges by providing a CPU-based implementation of core CUDA APIs, enabling:\n- Development and testing of CUDA applications without GPU hardware\n- Deployment flexibility in CPU-only environments\n- Performance portability across different architectures\n- A foundation for heterogeneous computing strategies\n\n### 1.1 Motivation\n\nThe primary motivations for this work include:\n\n1. **Development Accessibility**: Enabling CUDA development on systems without NVIDIA GPUs\n2. **Deployment Flexibility**: Running CUDA applications in CPU-only production environments\n3. **Performance Investigation**: Understanding the performance characteristics of GPU algorithms on modern CPUs\n4. **Architectural Research**: Exploring the convergence of CPU and GPU programming models\n\n### 1.2 Contributions\n\nThis work makes the following contributions:\n\n- A comprehensive CPU implementation of core CUDA runtime and cuBLAS APIs\n- Novel SIMD-optimized kernels for both x86-64 (AVX2/AVX-512) and ARM64 (NEON) architectures\n- Architecture-aware floating-point tolerance system handling platform-specific numerical differences\n- Extensive numerical validation framework ensuring compatibility within architecture constraints\n- Performance analysis demonstrating the viability of CPU execution for GPU-designed algorithms\n\n## 2. Background\n\n### 2.1 CUDA Programming Model\n\nCUDA (Compute Unified Device Architecture) provides a parallel computing platform and programming model for NVIDIA GPUs. 
Key abstractions include:\n- **Kernels**: Functions executed in parallel by many threads\n- **Thread Hierarchy**: Threads organized into blocks and grids\n- **Memory Hierarchy**: Global, shared, and local memory spaces\n- **Synchronization**: Barriers and atomic operations\n\n### 2.2 CPU SIMD Architecture\n\nModern CPUs provide SIMD (Single Instruction, Multiple Data) extensions:\n- **x86-64**: AVX2 (256-bit vectors, 8 float32 values) and AVX-512 (512-bit vectors, 16 float32 values)\n- **ARM64**: NEON (128-bit vectors, 4 float32 values) with architecture-aware floating-point tolerances\n- **Memory Bandwidth**: Increasingly important bottleneck for data-parallel algorithms\n\n### 2.3 Related Work\n\nPrevious efforts to bridge GPU and CPU computing include:\n- Intel's ISPC (Intel SPMD Program Compiler)\n- OpenCL implementations for CPUs\n- Various CUDA-to-CPU translation tools\n\nGUDA differs by providing direct API compatibility rather than source translation.\n\n## 3. Design and Implementation\n\n### 3.1 Architecture Overview\n\nGUDA consists of several key components:\n\n```\nguda/\n├── core.go          # Core CUDA runtime API implementation\n├── memory.go        # Memory management and allocation\n├── stream.go        # Stream and event management\n├── blas.go          # cuBLAS API implementation\n├── kernel.go        # Kernel execution framework\n└── simd/           # Platform-specific SIMD implementations\n```\n\n```mermaid\ngraph TB\n    subgraph \"Application Layer\"\n        APP[CUDA Application]\n    end\n    \n    subgraph \"GUDA API Layer\"\n        RT[Runtime API\u003cbr/\u003ecudaMalloc, cudaMemcpy]\n        BLAS[cuBLAS API\u003cbr/\u003eGEMM, AXPY, DOT]\n        KERNEL[Kernel API\u003cbr/\u003eLaunch, Synchronize]\n    end\n    \n    subgraph \"Execution Layer\"\n        MEM[Memory Manager\u003cbr/\u003eUnified Memory Model]\n        EXEC[Kernel Executor\u003cbr/\u003eThread→CPU Core Mapping]\n        STREAM[Stream Manager\u003cbr/\u003eAsync 
Operations]\n    end\n    \n    subgraph \"Optimization Layer\"\n        SIMD[SIMD Kernels\u003cbr/\u003eAVX2/AVX-512]\n        FUSED[Fused Operations\u003cbr/\u003eGEMM+Bias+Activation]\n        CACHE[Cache Optimization\u003cbr/\u003eTiling \u0026 Prefetch]\n    end\n    \n    subgraph \"Hardware\"\n        CPU[CPU Cores\u003cbr/\u003ex86-64 AVX2/AVX-512\u003cbr/\u003eARM64 NEON]\n    end\n    \n    APP --\u003e RT\n    APP --\u003e BLAS\n    APP --\u003e KERNEL\n    \n    RT --\u003e MEM\n    BLAS --\u003e EXEC\n    KERNEL --\u003e EXEC\n    \n    MEM --\u003e SIMD\n    EXEC --\u003e SIMD\n    EXEC --\u003e FUSED\n    STREAM --\u003e EXEC\n    \n    SIMD --\u003e CPU\n    FUSED --\u003e CPU\n    CACHE --\u003e CPU\n    \n    %% High contrast styling for accessibility\n    classDef appLayer fill:#2E86AB,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef apiLayer fill:#A23B72,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    classDef execLayer fill:#F18F01,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    classDef optLayer fill:#C73E1D,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    classDef hwLayer fill:#1B1B1B,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    \n    class APP appLayer\n    class RT,BLAS,KERNEL apiLayer\n    class MEM,EXEC,STREAM execLayer\n    class SIMD,FUSED,CACHE optLayer\n    class CPU hwLayer\n```\n\n### 3.2 Unified Memory Architecture\n\nGUDA fundamentally differs from traditional GPU computing by implementing a **unified memory model** where all memory is CPU RAM:\n\n```go\ntype DevicePtr struct {\n    ptr    unsafe.Pointer\n    size   int\n    offset int\n}\n```\n\nKey architectural decisions:\n- **No Separate Device Memory**: `cudaMalloc` allocates regular CPU RAM, not GPU memory\n- **Zero-Copy Operations**: `cudaMemcpy` operations are no-ops or simple `copy()` calls  \n- **Memory Pool Management**: Efficient allocation/deallocation with free list reuse\n- **Type-Safe Access**: DevicePtr provides `.Float32()`, `.Int32()`, 
`.Byte()` views of the same memory\n\nThis eliminates PCIe transfer overhead entirely and enables direct CPU access to all data.\n\n### 3.3 CPU-Native Execution Model\n\nRather than simulating GPU threads, GUDA maps CUDA execution patterns to CPU-native optimizations:\n\n1. **Native BLAS Integration**: GEMM and BLAS operations call highly optimized CPU libraries (assimilated Gonum)\n2. **SIMD-First Design**: GPU warps (32 threads) map to AVX2 vectors (8 float32 operations)  \n3. **Thread Block → Goroutine**: Grid/block structures become parallel goroutine work distribution\n4. **Cache-Aware Scheduling**: Work stealing and tiling optimize for CPU cache hierarchy\n5. **Assembly Kernels**: Hand-optimized AVX2/FMA assembly for critical mathematical operations\n\n**Key Insight**: GUDA is NOT a GPU simulator - it's a CPU-optimized implementation providing CUDA API compatibility.\n\n```mermaid\ngraph LR\n    subgraph \"GPU Model\"\n        GRID[\"Grid\u003cbr/\u003edim3(4,2,1)\"]\n        BLOCK1[\"Block(0,0)\"]\n        BLOCK2[\"Block(1,0)\"]\n        BLOCK3[\"Block(...)\"]\n        THREAD1[\"Thread(0)\"]\n        THREAD2[\"Thread(1)\"]\n        THREAD3[\"Thread(...)\"]\n        \n        GRID --\u003e BLOCK1\n        GRID --\u003e BLOCK2\n        GRID --\u003e BLOCK3\n        BLOCK1 --\u003e THREAD1\n        BLOCK1 --\u003e THREAD2\n        BLOCK1 --\u003e THREAD3\n    end\n    \n    subgraph \"GUDA CPU Mapping\"\n        CORES[\"CPU Cores\u003cbr/\u003e8 cores\"]\n        LOOP[\"Parallel Loops\u003cbr/\u003eOpenMP-style\"]\n        SIMD1[\"SIMD Lane 0-7\u003cbr/\u003eAVX2 256-bit\"]\n        SIMD2[\"SIMD Lane 8-15\u003cbr/\u003eAVX2 256-bit\"]\n        \n        CORES --\u003e LOOP\n        LOOP --\u003e SIMD1\n        LOOP --\u003e SIMD2\n    end\n    \n    BLOCK1 -.-\u003e|\"maps to\"| LOOP\n    THREAD1 -.-\u003e|\"maps to\"| SIMD1\n    \n    %% High contrast GPU styling\n    classDef gpuGrid fill:#76B900,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef 
gpuBlock fill:#4CAF50,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    classDef gpuThread fill:#2E7D32,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    \n    %% High contrast CPU styling  \n    classDef cpuCore fill:#FF6B35,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef cpuLoop fill:#E65100,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    classDef cpuSIMD fill:#BF360C,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    \n    class GRID gpuGrid\n    class BLOCK1,BLOCK2,BLOCK3 gpuBlock\n    class THREAD1,THREAD2,THREAD3 gpuThread\n    class CORES cpuCore\n    class LOOP cpuLoop\n    class SIMD1,SIMD2 cpuSIMD\n    \n    style GRID fill:#e3f2fd\n    style CORES fill:#000000\n    style SIMD1 fill:#000000\n    style SIMD2 fill:#000000\n```\n\n### 3.4 High-Performance BLAS Implementation\n\nGUDA's cuBLAS compatibility layer builds on the Gonum BLAS implementation incorporated into the project:\n\n**Architecture**:\n- **Native CPU BLAS**: Direct calls to optimized CPU BLAS routines, not GPU simulation\n- **SIMD Assembly Kernels**: Hand-written AVX2/FMA assembly for common matrix sizes (4x4, 8x8, 16x16)\n- **Parallel Execution**: Goroutine-based work distribution across all CPU cores\n- **Memory Hierarchy Optimization**: L1/L2/L3 cache-aware algorithms with prefetching\n\n**Performance Features**:\n- **Fused Operations**: GEMM+Bias+ReLU and other fused kernels in single operations\n- **Float16 Hardware Acceleration**: F16C instruction support for half-precision\n- **Adaptive Algorithms**: Different implementations chosen based on matrix size and cache characteristics\n- **Zero Memory Copy**: Unified memory eliminates host-device transfer overhead\n\nMeasured results for the test system are reported in Section 4.2; validation of these benchmarks is ongoing.\n\n## 4. 
Performance Characteristics\n\n### 4.1 Experimental Setup\n\nTesting environment:\n- CPU: AMD Ryzen 7 7700X (8 cores, 16 threads) \n- Memory: 32GB DDR5-5600\n- Architecture: x86-64 with AVX2 support\n- Compiler: Go 1.21 with CGO for SIMD intrinsics\n\n**Platform Support**: This proof-of-concept currently supports **x86-64 only**. ARM64 and other architectures are not yet implemented.\n\n### 4.2 Performance Results\n\nGUDA reaches the following throughput on the test system described in Section 4.1:\n\n#### GEMM Performance (Hot Cache)\n| Matrix Size | GFLOPS | Efficiency* | Arithmetic Intensity |\n|-------------|--------|-------------|---------------------|\n| 256×256     | 126.5  | 126%        | 42.7 FLOPS/byte    |\n| 512×512     | 148.7  | 149%        | 85.3 FLOPS/byte    |\n| 1024×1024   | 154.2  | 154%        | 170.7 FLOPS/byte   |\n| 2048×2048   | 153.5  | 154%        | 341.3 FLOPS/byte   |\n\n*Efficiency relative to an assumed practical peak of 100 GFLOPS (about 35% of the theoretical 288 GFLOPS); values above 100% mean the measured rate exceeds that assumed practical peak, not the theoretical one.\n\n#### Memory Bandwidth Operations\n| Operation | Size | Performance | Memory Bandwidth |\n|-----------|------|-------------|------------------|\n| AXPY      | 16K  | 40.4 GFLOPS | 242.3 GB/s      |\n| DOT       | 1K   | 68.4 GFLOPS | 273.7 GB/s      |\n\nThese results demonstrate:\n- **Compute throughput**: \u003e150 GFLOPS sustained on compute-bound operations, roughly half of the 288 GFLOPS theoretical peak\n- **Effective bandwidth**: \u003e240 GB/s on these small vectors; this exceeds the roughly 90 GB/s peak of dual-channel DDR5-5600, so the figures reflect cache-resident data rather than DRAM bandwidth\n- **Efficient vectorization**: Full AVX2 utilization with 8-wide float32 operations\n- **Cache optimization**: Hot cache performance with effective blocking\n\nSee [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for detailed performance analysis including cold cache results and performance counter validation.\n\nFor guidance on running and interpreting benchmarks, see [Benchmarking Guide](docs/BENCHMARKING_GUIDE.md).\n\n### 4.3 Convolution Implementation\n\nOur convolution implementation uses the im2col + GEMM approach:\n\n```mermaid\ngraph LR\n    
subgraph \"Input\"\n        IMG[Input Image\u003cbr/\u003eN×C×H×W]\n        KERNEL[Kernels\u003cbr/\u003eK×C×R×S]\n    end\n    \n    subgraph \"Transform\"\n        IM2COL[im2col Transform\u003cbr/\u003eUnfold patches]\n        RESHAPE[Reshape Kernels\u003cbr/\u003eK×CRS]\n    end\n    \n    subgraph \"Compute\"\n        GEMM[GEMM\u003cbr/\u003eK×CRS × CRS×NHW]\n    end\n    \n    subgraph \"Output\"\n        OUT[Output\u003cbr/\u003eN×K×H'×W']\n    end\n    \n    IMG --\u003e IM2COL\n    KERNEL --\u003e RESHAPE\n    IM2COL --\u003e GEMM\n    RESHAPE --\u003e GEMM\n    GEMM --\u003e OUT\n    \n    %% High contrast styling for accessibility\n    classDef inputData fill:#2E86AB,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef transform fill:#A23B72,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    classDef compute fill:#C73E1D,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef output fill:#F18F01,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    \n    class IMG,KERNEL inputData\n    class IM2COL,RESHAPE transform\n    class GEMM compute\n    class OUT output\n```\n\n### 4.4 Numerical Accuracy\n\nValidation testing indicates:\n- IEEE 754 compliance for floating-point operations\n- Bit-exact results for memory operations\n- Numerical parity with reference implementations for core operations\n\n## 5. 
Use Cases and Applications\n\n```mermaid\ngraph TB\n    subgraph \"Development Environment\"\n        DEV_LAPTOP[Developer Laptop\u003cbr/\u003eNo GPU Required]\n        CI_PIPELINE[CI/CD Pipeline\u003cbr/\u003eGitHub Actions]\n        DEBUG[CPU Debugging Tools\u003cbr/\u003eGDB, Valgrind, perf]\n    end\n    \n    subgraph \"Production Deployment\"\n        CLOUD[Cloud CPU Instance\u003cbr/\u003eCost-Optimized]\n        EDGE[Edge Device\u003cbr/\u003eARM/x86 CPUs]\n        CONTAINER[Container Platform\u003cbr/\u003eDocker/Kubernetes]\n    end\n    \n    subgraph \"Research \u0026 Education\"\n        EDUCATION[CUDA Learning\u003cbr/\u003eAcademic Environment]\n        PROTOTYPE[Algorithm Prototyping\u003cbr/\u003eRapid Iteration]\n        ANALYSIS[Performance Analysis\u003cbr/\u003eCPU vs GPU Studies]\n    end\n    \n    subgraph \"Application Types\"\n        INFERENCE[ML Inference\u003cbr/\u003eLow-Latency]\n        SIMULATION[Scientific Computing\u003cbr/\u003eBatch Processing]\n        TRAINING[Small Model Training\u003cbr/\u003eDevelopment Phase]\n    end\n    \n    DEV_LAPTOP --\u003e CLOUD\n    CI_PIPELINE --\u003e CONTAINER\n    DEBUG --\u003e ANALYSIS\n    \n    CLOUD --\u003e INFERENCE\n    EDGE --\u003e INFERENCE\n    CONTAINER --\u003e SIMULATION\n    \n    EDUCATION --\u003e PROTOTYPE\n    PROTOTYPE --\u003e TRAINING\n    ANALYSIS --\u003e SIMULATION\n    \n    %% High contrast deployment styling\n    classDef devEnv fill:#2E7D32,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef prodDeploy fill:#1565C0,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef research fill:#7B1FA2,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    classDef appTypes fill:#D84315,stroke:#ffffff,stroke-width:3px,color:#ffffff\n    \n    class DEV_LAPTOP,CI_PIPELINE,DEBUG devEnv\n    class CLOUD,EDGE,CONTAINER prodDeploy\n    class EDUCATION,PROTOTYPE,ANALYSIS research\n    class INFERENCE,SIMULATION,TRAINING appTypes\n```\n\n### 5.1 Development and 
Testing\n\nGUDA enables:\n- CI/CD pipelines without GPU infrastructure\n- Local development on laptops\n- Debugging with standard CPU tools\n\n### 5.2 Edge Deployment\n\nSuitable for:\n- Inference on edge devices without GPUs\n- Embedded systems with powerful CPUs\n- Cost-sensitive deployments\n\n### 5.3 Education and Research\n\nProvides:\n- Accessible CUDA learning environment\n- Algorithm prototyping platform\n- Performance analysis opportunities\n\n## 6. Limitations and Future Work\n\n### 6.1 Current Limitations\n\n- **Platform Support**: x86-64 only - ARM64, RISC-V, and other architectures not yet supported\n- **API Coverage**: Limited to CUDA runtime and cuBLAS APIs\n- **SIMD Requirements**: Requires AVX2 for optimal performance; fallback implementations may be slower\n- **No GPU Hardware Features**: No support for advanced GPU features (tensor cores, RT cores, etc.)\n\n### 6.2 Future Directions\n\n1. **Multi-Architecture Support**: ARM64 with NEON/SVE, RISC-V with vector extensions\n2. **API Coverage**: Implement cuDNN, cuFFT, cuSPARSE, and other CUDA libraries  \n3. **Advanced SIMD**: AVX-512 optimizations for Intel processors\n4. **Heterogeneous Execution**: CPU+GPU cooperative processing for hybrid workloads\n\n## 7. Conclusion\n\nGUDA demonstrates that CPU-native implementations of GPU APIs can provide practical deployment options for CUDA applications on CPU-only infrastructure. Through unified memory architecture, optimized BLAS integration, and elimination of host-device transfers, GUDA enables running CUDA applications where GPU hardware is unavailable or cost-prohibitive. 
This proof-of-concept validates the viability of CPU-first architectural approaches for CUDA compatibility and suggests future opportunities for heterogeneous computing strategies.\n\n## Installation\n\n### System Requirements\n- **Architecture**: x86-64 processor with AVX2 support (Intel Haswell/AMD Excavator or newer)\n- **OS**: Linux, macOS, or Windows\n- **Go**: Version 1.19 or later\n- **CGO**: Required for SIMD assembly optimizations\n\n### Install\n```bash\ngo get github.com/LynnColeArt/guda\n```\n\n**Note**: ARM64 and other architectures are not currently supported in this proof-of-concept.\n\n## Usage Example\n\n```go\npackage main\n\nimport (\n    \"github.com/LynnColeArt/guda\"\n)\n\nfunc main() {\n    // Initialize GUDA\n    guda.Init(0)\n    defer guda.Reset()\n    \n    // Allocate memory\n    d_a, _ := guda.Malloc(1024 * 1024 * 4)\n    d_b, _ := guda.Malloc(1024 * 1024 * 4)\n    d_c, _ := guda.Malloc(1024 * 1024 * 4)\n    defer guda.Free(d_a)\n    defer guda.Free(d_b)\n    defer guda.Free(d_c)\n    \n    // Perform SGEMM\n    guda.GEMM(false, false, 1024, 1024, 1024,\n        1.0, d_a, 1024, d_b, 1024,\n        0.0, d_c, 1024)\n}\n```\n\n## License\n\nBSD 3-Clause License - See LICENSE file for details\n\n## Citation\n\nIf you use GUDA in your research, please cite:\n```bibtex\n@software{guda2025,\n  author = {Lynn Cole},\n  title = {GUDA: A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library},\n  year = {2025},\n  url = {https://github.com/LynnColeArt/guda}\n}\n```\n\n## Acknowledgments\n\nThis work was inspired by the need for accessible high-performance computing and the convergence of CPU and GPU architectures.\n\n### Gonum Integration\n\nGUDA incorporates substantial portions of the Gonum project (https://github.com/gonum/gonum), a set of numeric libraries for the Go programming language. The Gonum BLAS implementation forms the foundation of GUDA's linear algebra operations. 
We are grateful to the Gonum authors and contributors for their excellent work in bringing high-performance numeric computing to Go. The Gonum code is used under the BSD 3-Clause License.\n\n### Additional Thanks\n\nSpecial thanks to:\n- The Go community for providing excellent tools for systems programming\n- The developers of CUDA and cuBLAS for establishing the programming model and APIs\n- The open-source community for fostering collaborative scientific computing","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flynncoleart%2Fguda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flynncoleart%2Fguda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flynncoleart%2Fguda/lists"}