https://github.com/lynncoleart/guda

A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library
https://github.com/lynncoleart/guda
ai blas cuda inference llm-inference
Last synced: 4 months ago
JSON representation
A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library
Host: GitHub
URL: https://github.com/lynncoleart/guda
Owner: LynnColeArt
License: bsd-3-clause
Created: 2025-08-08T07:22:26.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-09T15:49:24.000Z (11 months ago)
Last Synced: 2025-10-05T02:53:28.294Z (9 months ago)
Topics: ai, blas, cuda, inference, llm-inference
Language: Go
Homepage:
Size: 4.72 MB
Stars: 7
Watchers: 1
Forks: 2
Open Issues: 11
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
- Roadmap: docs/ROADMAP.md
- Authors: AUTHORS
Awesome Lists containing this project

README

          # 🧀 GUDA: A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library

**[📚 Read the Full Manual](docs/manual/)** | **[🚀 Quick Start Guide](docs/manual/02-installation.md)** | **[🏗️ Architecture Overview](docs/manual/04-architecture.md)**

## 🚀 Breaking News: 18x AI Inference Speedup!

GUDA now includes **Fixed-Function Unit (FFU)** support, automatically leveraging specialized CPU instructions:

- **AVX512-VNNI**: 37.5 GOPS for INT8 ops (18x speedup!)

- **AES-NI**: 8.4 GB/s crypto throughput

- **AMX**: 2 TOPS potential (experimental)

## Performance Highlights

GUDA achieves **150+ GFLOPS** sustained performance on modern CPUs through aggressive optimization:

| Operation | Performance | vs. Practical Peak | Cache Impact |

|-----------|-------------|-------------------|--------------|

| GEMM 1024×1024 | 154 GFLOPS | 154% | <1% drop cold |

| AXPY 16K | 40 GFLOPS | Memory limited | 0% drop cold |

| DOT 1K | 68 GFLOPS | Memory limited | 0% drop cold |

| Memory BW | 240+ GB/s | 90% of DDR5 | - |

**[📊 Full Benchmark Results](BENCHMARK_RESULTS.md)** | **[🧊 Cold Cache Analysis](COLD_CACHE_ANALYSIS.md)** | **[📈 Benchmarking Guide](docs/BENCHMARKING_GUIDE.md)**

## Abstract

We present GUDA (Go Unified Device Architecture), a novel implementation of CUDA-compatible APIs designed for CPU execution. Rather than simulating GPU hardware, GUDA provides a unified memory architecture and maps CUDA operations to highly optimized native CPU implementations. This library enables seamless deployment of CUDA applications on CPU-only infrastructure through aggressive SIMD optimization, native BLAS integration, and elimination of host-device memory transfers. Our implementation demonstrates that CPU-native approaches can provide a practical alternative for running CUDA applications where GPU hardware is unavailable.

## 1. Introduction

The proliferation of GPU computing has created a dichotomy in the high-performance computing landscape, where applications are often tightly coupled to specific hardware accelerators. This coupling presents challenges for deployment flexibility, development workflows, and resource utilization in heterogeneous computing environments. 

GUDA addresses these challenges by providing a CPU-based implementation of core CUDA APIs, enabling:

- Development and testing of CUDA applications without GPU hardware

- Deployment flexibility in CPU-only environments

- Performance portability across different architectures

- A foundation for heterogeneous computing strategies

### 1.1 Motivation

The primary motivations for this work include:

1. **Development Accessibility**: Enabling CUDA development on systems without NVIDIA GPUs

2. **Deployment Flexibility**: Running CUDA applications in CPU-only production environments

3. **Performance Investigation**: Understanding the performance characteristics of GPU algorithms on modern CPUs

4. **Architectural Research**: Exploring the convergence of CPU and GPU programming models

### 1.2 Contributions

This work makes the following contributions:

- A comprehensive CPU implementation of core CUDA runtime and cuBLAS APIs

- Novel SIMD-optimized kernels for both x86-64 (AVX2/AVX-512) and ARM64 (NEON) architectures

- Architecture-aware floating-point tolerance system handling platform-specific numerical differences

- Extensive numerical validation framework ensuring compatibility within architecture constraints

- Performance analysis demonstrating the viability of CPU execution for GPU-designed algorithms

## 2. Background

### 2.1 CUDA Programming Model

CUDA (Compute Unified Device Architecture) provides a parallel computing platform and programming model for NVIDIA GPUs. Key abstractions include:

- **Kernels**: Functions executed in parallel by many threads

- **Thread Hierarchy**: Threads organized into blocks and grids

- **Memory Hierarchy**: Global, shared, and local memory spaces

- **Synchronization**: Barriers and atomic operations

### 2.2 CPU SIMD Architecture

Modern CPUs provide SIMD (Single Instruction, Multiple Data) extensions:

- **x86-64**: AVX2 (256-bit vectors, 8 float32 values) and AVX-512 (512-bit vectors, 16 float32 values)

- **ARM64**: NEON (128-bit vectors, 4 float32 values) with architecture-aware floating-point tolerances

- **Memory Bandwidth**: Increasingly important bottleneck for data-parallel algorithms

### 2.3 Related Work

Previous efforts to bridge GPU and CPU computing include:

- Intel's ISPC (Intel SPMD Program Compiler)

- OpenCL implementations for CPUs

- Various CUDA-to-CPU translation tools

GUDA differs by providing direct API compatibility rather than source translation.

## 3. Design and Implementation

### 3.1 Architecture Overview

GUDA consists of several key components:

```

guda/

├── core.go          # Core CUDA runtime API implementation

├── memory.go        # Memory management and allocation

├── stream.go        # Stream and event management

├── blas.go          # cuBLAS API implementation

├── kernel.go        # Kernel execution framework

└── simd/           # Platform-specific SIMD implementations

```

```mermaid

graph TB

    subgraph "Application Layer"

        APP[CUDA Application]

    end

    

    subgraph "GUDA API Layer"

        RT[Runtime API
cudaMalloc, cudaMemcpy]

        BLAS[cuBLAS API
GEMM, AXPY, DOT]

        KERNEL[Kernel API
Launch, Synchronize]

    end

    

    subgraph "Execution Layer"

        MEM[Memory Manager
Unified Memory Model]

        EXEC[Kernel Executor
Thread→CPU Core Mapping]

        STREAM[Stream Manager
Async Operations]

    end

    

    subgraph "Optimization Layer"

        SIMD[SIMD Kernels
AVX2/AVX-512]

        FUSED[Fused Operations
GEMM+Bias+Activation]

        CACHE[Cache Optimization
Tiling & Prefetch]

    end

    

    subgraph "Hardware"

        CPU[CPU Cores
x86-64 AVX2/AVX-512
ARM64 NEON]

    end

    

    APP --> RT

    APP --> BLAS

    APP --> KERNEL

    

    RT --> MEM

    BLAS --> EXEC

    KERNEL --> EXEC

    

    MEM --> SIMD

    EXEC --> SIMD

    EXEC --> FUSED

    STREAM --> EXEC

    

    SIMD --> CPU

    FUSED --> CPU

    CACHE --> CPU

    

    %% High contrast styling for accessibility

    classDef appLayer fill:#2E86AB,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef apiLayer fill:#A23B72,stroke:#ffffff,stroke-width:2px,color:#ffffff

    classDef execLayer fill:#F18F01,stroke:#ffffff,stroke-width:2px,color:#ffffff

    classDef optLayer fill:#C73E1D,stroke:#ffffff,stroke-width:2px,color:#ffffff

    classDef hwLayer fill:#1B1B1B,stroke:#ffffff,stroke-width:3px,color:#ffffff

    

    class APP appLayer

    class RT,BLAS,KERNEL apiLayer

    class MEM,EXEC,STREAM execLayer

    class SIMD,FUSED,CACHE optLayer

    class CPU hwLayer

```

### 3.2 Unified Memory Architecture

GUDA fundamentally differs from traditional GPU computing by implementing a **unified memory model** where all memory is CPU RAM:

```go

type DevicePtr struct {

    ptr    unsafe.Pointer

    size   int

    offset int

}

```

Key architectural decisions:

- **No Separate Device Memory**: `cudaMalloc` allocates regular CPU RAM, not GPU memory

- **Zero-Copy Operations**: `cudaMemcpy` operations are no-ops or simple `copy()` calls  

- **Memory Pool Management**: Efficient allocation/deallocation with free list reuse

- **Type-Safe Access**: DevicePtr provides `.Float32()`, `.Int32()`, `.Byte()` views of the same memory

This eliminates PCIe transfer overhead entirely and enables direct CPU access to all data.

### 3.3 CPU-Native Execution Model

Rather than simulating GPU threads, GUDA maps CUDA execution patterns to CPU-native optimizations:

1. **Native BLAS Integration**: GEMM and BLAS operations call highly optimized CPU libraries (assimilated Gonum)

2. **SIMD-First Design**: GPU warps (32 threads) map to AVX2 vectors (8 float32 operations)  

3. **Thread Block → Goroutine**: Grid/block structures become parallel goroutine work distribution

4. **Cache-Aware Scheduling**: Work stealing and tiling optimize for CPU cache hierarchy

5. **Assembly Kernels**: Hand-optimized AVX2/FMA assembly for critical mathematical operations

**Key Insight**: GUDA is NOT a GPU simulator - it's a CPU-optimized implementation providing CUDA API compatibility.

```mermaid

graph LR

    subgraph "GPU Model"

        GRID["Grid
dim3(4,2,1)"]

        BLOCK1["Block(0,0)"]

        BLOCK2["Block(1,0)"]

        BLOCK3["Block(...)"]

        THREAD1["Thread(0)"]

        THREAD2["Thread(1)"]

        THREAD3["Thread(...)"]

        

        GRID --> BLOCK1

        GRID --> BLOCK2

        GRID --> BLOCK3

        BLOCK1 --> THREAD1

        BLOCK1 --> THREAD2

        BLOCK1 --> THREAD3

    end

    

    subgraph "GUDA CPU Mapping"

        CORES["CPU Cores
8 cores"]

        LOOP["Parallel Loops
OpenMP-style"]

        SIMD1["SIMD Lane 0-7
AVX2 256-bit"]

        SIMD2["SIMD Lane 8-15
AVX2 256-bit"]

        

        CORES --> LOOP

        LOOP --> SIMD1

        LOOP --> SIMD2

    end

    

    BLOCK1 -.->|"maps to"| LOOP

    THREAD1 -.->|"maps to"| SIMD1

    

    %% High contrast GPU styling

    classDef gpuGrid fill:#76B900,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef gpuBlock fill:#4CAF50,stroke:#ffffff,stroke-width:2px,color:#ffffff

    classDef gpuThread fill:#2E7D32,stroke:#ffffff,stroke-width:2px,color:#ffffff

    

    %% High contrast CPU styling  

    classDef cpuCore fill:#FF6B35,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef cpuLoop fill:#E65100,stroke:#ffffff,stroke-width:2px,color:#ffffff

    classDef cpuSIMD fill:#BF360C,stroke:#ffffff,stroke-width:2px,color:#ffffff

    

    class GRID gpuGrid

    class BLOCK1,BLOCK2,BLOCK3 gpuBlock

    class THREAD1,THREAD2,THREAD3 gpuThread

    class CORES cpuCore

    class LOOP cpuLoop

    class SIMD1,SIMD2 cpuSIMD

    

    style GRID fill:#e3f2fd

    style CORES fill:#000000

    style SIMD1 fill:#000000

    style SIMD2 fill:#000000

```

### 3.4 High-Performance BLAS Implementation

GUDA's cuBLAS compatibility layer leverages the fully assimilated Gonum mathematical computing library:

**Architecture**:

- **Native CPU BLAS**: Direct calls to optimized CPU BLAS routines, not GPU simulation

- **SIMD Assembly Kernels**: Hand-written AVX2/FMA assembly for common matrix sizes (4x4, 8x8, 16x16)

- **Parallel Execution**: Goroutine-based work distribution across all CPU cores

- **Memory Hierarchy Optimization**: L1/L2/L3 cache-aware algorithms with prefetching

**Performance Features**:

- **Fused Operations**: GEMM+Bias+ReLU and other fused kernels in single operations

- **Float16 Hardware Acceleration**: F16C instruction support for half-precision

- **Adaptive Algorithms**: Different implementations chosen based on matrix size and cache characteristics

- **Zero Memory Copy**: Unified memory eliminates host-device transfer overhead

Performance benchmarks are currently being validated and will be published in future releases.

## 4. Performance Characteristics

### 4.1 Experimental Setup

Testing environment:

- CPU: AMD Ryzen 7 7700X (8 cores, 16 threads) 

- Memory: 32GB DDR5-5600

- Architecture: x86-64 with AVX2 support

- Compiler: Go 1.21 with CGO for SIMD intrinsics

**Platform Support**: This proof-of-concept currently supports **x86-64 only**. ARM64 and other architectures are not yet implemented.

### 4.2 Performance Results

GUDA achieves exceptional performance through optimized CPU utilization:

#### GEMM Performance (Hot Cache)

| Matrix Size | GFLOPS | Efficiency* | Arithmetic Intensity |

|-------------|--------|-------------|---------------------|

| 256×256     | 126.5  | 126%        | 42.7 FLOPS/byte    |

| 512×512     | 148.7  | 149%        | 85.3 FLOPS/byte    |

| 1024×1024   | 154.2  | 154%        | 170.7 FLOPS/byte   |

| 2048×2048   | 153.5  | 154%        | 341.3 FLOPS/byte   |

*Efficiency relative to practical peak of 100 GFLOPS (40% of theoretical 288 GFLOPS)

#### Memory Bandwidth Operations

| Operation | Size | Performance | Memory Bandwidth |

|-----------|------|-------------|------------------|

| AXPY      | 16K  | 40.4 GFLOPS | 242.3 GB/s      |

| DOT       | 1K   | 68.4 GFLOPS | 273.7 GB/s      |

These results demonstrate:

- **Near-peak performance**: >150 GFLOPS sustained on compute-bound operations

- **Memory saturation**: >240 GB/s achieved (approaching DDR5 theoretical limits)

- **Efficient vectorization**: Full AVX2 utilization with 8-wide float32 operations

- **Cache optimization**: Hot cache performance with effective blocking

See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for detailed performance analysis including cold cache results and performance counter validation.

For guidance on running and interpreting benchmarks, see [Benchmarking Guide](docs/BENCHMARKING_GUIDE.md).

### 4.3 Convolution Implementation

Our convolution implementation uses the im2col + GEMM approach:

```mermaid

graph LR

    subgraph "Input"

        IMG[Input Image
N×C×H×W]

        KERNEL[Kernels
K×C×R×S]

    end

    

    subgraph "Transform"

        IM2COL[im2col Transform
Unfold patches]

        RESHAPE[Reshape Kernels
K×CRS]

    end

    

    subgraph "Compute"

        GEMM[GEMM
K×CRS × CRS×NHW]

    end

    

    subgraph "Output"

        OUT[Output
N×K×H'×W']

    end

    

    IMG --> IM2COL

    KERNEL --> RESHAPE

    IM2COL --> GEMM

    RESHAPE --> GEMM

    GEMM --> OUT

    

    %% High contrast styling for accessibility

    classDef inputData fill:#2E86AB,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef transform fill:#A23B72,stroke:#ffffff,stroke-width:2px,color:#ffffff

    classDef compute fill:#C73E1D,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef output fill:#F18F01,stroke:#ffffff,stroke-width:3px,color:#ffffff

    

    class IMG,KERNEL inputData

    class IM2COL,RESHAPE transform

    class GEMM compute

    class OUT output

```

### 4.4 Numerical Accuracy

Validation testing indicates:

- IEEE 754 compliance for floating-point operations

- Bit-exact results for memory operations

- Numerical parity with reference implementations for core operations

## 5. Use Cases and Applications

```mermaid

graph TB

    subgraph "Development Environment"

        DEV_LAPTOP[Developer Laptop
No GPU Required]

        CI_PIPELINE[CI/CD Pipeline
GitHub Actions]

        DEBUG[CPU Debugging Tools
GDB, Valgrind, perf]

    end

    

    subgraph "Production Deployment"

        CLOUD[Cloud CPU Instance
Cost-Optimized]

        EDGE[Edge Device
ARM/x86 CPUs]

        CONTAINER[Container Platform
Docker/Kubernetes]

    end

    

    subgraph "Research & Education"

        EDUCATION[CUDA Learning
Academic Environment]

        PROTOTYPE[Algorithm Prototyping
Rapid Iteration]

        ANALYSIS[Performance Analysis
CPU vs GPU Studies]

    end

    

    subgraph "Application Types"

        INFERENCE[ML Inference
Low-Latency]

        SIMULATION[Scientific Computing
Batch Processing]

        TRAINING[Small Model Training
Development Phase]

    end

    

    DEV_LAPTOP --> CLOUD

    CI_PIPELINE --> CONTAINER

    DEBUG --> ANALYSIS

    

    CLOUD --> INFERENCE

    EDGE --> INFERENCE

    CONTAINER --> SIMULATION

    

    EDUCATION --> PROTOTYPE

    PROTOTYPE --> TRAINING

    ANALYSIS --> SIMULATION

    

    %% High contrast deployment styling

    classDef devEnv fill:#2E7D32,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef prodDeploy fill:#1565C0,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef research fill:#7B1FA2,stroke:#ffffff,stroke-width:3px,color:#ffffff

    classDef appTypes fill:#D84315,stroke:#ffffff,stroke-width:3px,color:#ffffff

    

    class DEV_LAPTOP,CI_PIPELINE,DEBUG devEnv

    class CLOUD,EDGE,CONTAINER prodDeploy

    class EDUCATION,PROTOTYPE,ANALYSIS research

    class INFERENCE,SIMULATION,TRAINING appTypes

```

### 5.1 Development and Testing

GUDA enables:

- CI/CD pipelines without GPU infrastructure

- Local development on laptops

- Debugging with standard CPU tools

### 5.2 Edge Deployment

Suitable for:

- Inference on edge devices without GPUs

- Embedded systems with powerful CPUs

- Cost-sensitive deployments

### 5.3 Education and Research

Provides:

- Accessible CUDA learning environment

- Algorithm prototyping platform

- Performance analysis opportunities

## 6. Limitations and Future Work

### 6.1 Current Limitations

- **Platform Support**: x86-64 only - ARM64, RISC-V, and other architectures not yet supported

- **API Coverage**: Limited to CUDA runtime and cuBLAS APIs

- **SIMD Requirements**: Requires AVX2 for optimal performance; fallback implementations may be slower

- **No GPU Hardware Features**: No support for advanced GPU features (tensor cores, RT cores, etc.)

### 6.2 Future Directions

1. **Multi-Architecture Support**: ARM64 with NEON/SVE, RISC-V with vector extensions

2. **API Coverage**: Implement cuDNN, cuFFT, cuSPARSE, and other CUDA libraries  

3. **Advanced SIMD**: AVX-512 optimizations for Intel processors

4. **Heterogeneous Execution**: CPU+GPU cooperative processing for hybrid workloads

## 7. Conclusion

GUDA demonstrates that CPU-native implementations of GPU APIs can provide practical deployment options for CUDA applications on CPU-only infrastructure. Through unified memory architecture, optimized BLAS integration, and elimination of host-device transfers, GUDA enables running CUDA applications where GPU hardware is unavailable or cost-prohibitive. This proof-of-concept validates the viability of CPU-first architectural approaches for CUDA compatibility and suggests future opportunities for heterogeneous computing strategies.

## Installation

### System Requirements

- **Architecture**: x86-64 processor with AVX2 support (Intel Haswell/AMD Excavator or newer)

- **OS**: Linux, macOS, or Windows

- **Go**: Version 1.19 or later

- **CGO**: Required for SIMD assembly optimizations

### Install

```bash

go get github.com/LynnColeArt/guda

```

**Note**: ARM64 and other architectures are not currently supported in this proof-of-concept.

## Usage Example

```go

package main

import (

    "github.com/LynnColeArt/guda"

)

func main() {

    // Initialize GUDA

    guda.Init(0)

    defer guda.Reset()

    

    // Allocate memory

    d_a, _ := guda.Malloc(1024 * 1024 * 4)

    d_b, _ := guda.Malloc(1024 * 1024 * 4)

    d_c, _ := guda.Malloc(1024 * 1024 * 4)

    defer guda.Free(d_a)

    defer guda.Free(d_b)

    defer guda.Free(d_c)

    

    // Perform SGEMM

    guda.GEMM(false, false, 1024, 1024, 1024,

        1.0, d_a, 1024, d_b, 1024,

        0.0, d_c, 1024)

}

```

## License

MIT License - See LICENSE file for details

## Citation

If you use GUDA in your research, please cite:

```bibtex

@software{guda2025,

  author = {Lynn Cole},

  title = {GUDA: A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library},

  year = {2025},

  url = {https://github.com/LynnColeArt/guda}

}

```

## Acknowledgments

This work was inspired by the need for accessible high-performance computing and the convergence of CPU and GPU architectures.

### Gonum Integration

GUDA incorporates substantial portions of the Gonum project (https://github.com/gonum/gonum), a set of numeric libraries for the Go programming language. The Gonum BLAS implementation forms the foundation of GUDA's linear algebra operations. We are grateful to the Gonum authors and contributors for their excellent work in bringing high-performance numeric computing to Go. The Gonum code is used under the BSD 3-Clause License.

### Additional Thanks

Special thanks to:

- The Go community for providing excellent tools for systems programming

- The developers of CUDA and cuBLAS for establishing the programming model and APIs

- The open-source community for fostering collaborative scientific computing
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lynncoleart/guda

Awesome Lists containing this project

README