An open API service indexing awesome lists of open source software.

https://github.com/lynncoleart/guda

A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library
https://github.com/lynncoleart/guda

ai blas cuda inference llm-inference

Last synced: 4 months ago
JSON representation

A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library

Awesome Lists containing this project

README

          

# ๐Ÿง€ GUDA: A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library

**[๐Ÿ“š Read the Full Manual](docs/manual/)** | **[๐Ÿš€ Quick Start Guide](docs/manual/02-installation.md)** | **[๐Ÿ—๏ธ Architecture Overview](docs/manual/04-architecture.md)**

## ๐Ÿš€ Breaking News: 18x AI Inference Speedup!

GUDA now includes **Fixed-Function Unit (FFU)** support, automatically leveraging specialized CPU instructions:
- **AVX512-VNNI**: 37.5 GOPS for INT8 ops (18x speedup!)
- **AES-NI**: 8.4 GB/s crypto throughput
- **AMX**: 2 TOPS potential (experimental)

## Performance Highlights

GUDA achieves **150+ GFLOPS** sustained performance on modern CPUs through aggressive optimization:

| Operation | Performance | vs. Practical Peak | Cache Impact |
|-----------|-------------|-------------------|--------------|
| GEMM 1024ร—1024 | 154 GFLOPS | 154% | <1% drop cold |
| AXPY 16K | 40 GFLOPS | Memory limited | 0% drop cold |
| DOT 1K | 68 GFLOPS | Memory limited | 0% drop cold |
| Memory BW | 240+ GB/s | 90% of DDR5 | - |

**[๐Ÿ“Š Full Benchmark Results](BENCHMARK_RESULTS.md)** | **[๐ŸงŠ Cold Cache Analysis](COLD_CACHE_ANALYSIS.md)** | **[๐Ÿ“ˆ Benchmarking Guide](docs/BENCHMARKING_GUIDE.md)**

## Abstract

We present GUDA (Go Unified Device Architecture), a novel implementation of CUDA-compatible APIs designed for CPU execution. Rather than simulating GPU hardware, GUDA provides a unified memory architecture and maps CUDA operations to highly optimized native CPU implementations. This library enables seamless deployment of CUDA applications on CPU-only infrastructure through aggressive SIMD optimization, native BLAS integration, and elimination of host-device memory transfers. Our implementation demonstrates that CPU-native approaches can provide a practical alternative for running CUDA applications where GPU hardware is unavailable.

## 1. Introduction

The proliferation of GPU computing has created a dichotomy in the high-performance computing landscape, where applications are often tightly coupled to specific hardware accelerators. This coupling presents challenges for deployment flexibility, development workflows, and resource utilization in heterogeneous computing environments.

GUDA addresses these challenges by providing a CPU-based implementation of core CUDA APIs, enabling:
- Development and testing of CUDA applications without GPU hardware
- Deployment flexibility in CPU-only environments
- Performance portability across different architectures
- A foundation for heterogeneous computing strategies

### 1.1 Motivation

The primary motivations for this work include:

1. **Development Accessibility**: Enabling CUDA development on systems without NVIDIA GPUs
2. **Deployment Flexibility**: Running CUDA applications in CPU-only production environments
3. **Performance Investigation**: Understanding the performance characteristics of GPU algorithms on modern CPUs
4. **Architectural Research**: Exploring the convergence of CPU and GPU programming models

### 1.2 Contributions

This work makes the following contributions:

- A comprehensive CPU implementation of core CUDA runtime and cuBLAS APIs
- Novel SIMD-optimized kernels for both x86-64 (AVX2/AVX-512) and ARM64 (NEON) architectures
- Architecture-aware floating-point tolerance system handling platform-specific numerical differences
- Extensive numerical validation framework ensuring compatibility within architecture constraints
- Performance analysis demonstrating the viability of CPU execution for GPU-designed algorithms

## 2. Background

### 2.1 CUDA Programming Model

CUDA (Compute Unified Device Architecture) provides a parallel computing platform and programming model for NVIDIA GPUs. Key abstractions include:
- **Kernels**: Functions executed in parallel by many threads
- **Thread Hierarchy**: Threads organized into blocks and grids
- **Memory Hierarchy**: Global, shared, and local memory spaces
- **Synchronization**: Barriers and atomic operations

### 2.2 CPU SIMD Architecture

Modern CPUs provide SIMD (Single Instruction, Multiple Data) extensions:
- **x86-64**: AVX2 (256-bit vectors, 8 float32 values) and AVX-512 (512-bit vectors, 16 float32 values)
- **ARM64**: NEON (128-bit vectors, 4 float32 values) with architecture-aware floating-point tolerances
- **Memory Bandwidth**: Increasingly important bottleneck for data-parallel algorithms

### 2.3 Related Work

Previous efforts to bridge GPU and CPU computing include:
- Intel's ISPC (Intel SPMD Program Compiler)
- OpenCL implementations for CPUs
- Various CUDA-to-CPU translation tools

GUDA differs by providing direct API compatibility rather than source translation.

## 3. Design and Implementation

### 3.1 Architecture Overview

GUDA consists of several key components:

```
guda/
โ”œโ”€โ”€ core.go # Core CUDA runtime API implementation
โ”œโ”€โ”€ memory.go # Memory management and allocation
โ”œโ”€โ”€ stream.go # Stream and event management
โ”œโ”€โ”€ blas.go # cuBLAS API implementation
โ”œโ”€โ”€ kernel.go # Kernel execution framework
โ””โ”€โ”€ simd/ # Platform-specific SIMD implementations
```

```mermaid
graph TB
subgraph "Application Layer"
APP[CUDA Application]
end

subgraph "GUDA API Layer"
RT[Runtime API
cudaMalloc, cudaMemcpy]
BLAS[cuBLAS API
GEMM, AXPY, DOT]
KERNEL[Kernel API
Launch, Synchronize]
end

subgraph "Execution Layer"
MEM[Memory Manager
Unified Memory Model]
EXEC[Kernel Executor
Threadโ†’CPU Core Mapping]
STREAM[Stream Manager
Async Operations]
end

subgraph "Optimization Layer"
SIMD[SIMD Kernels
AVX2/AVX-512]
FUSED[Fused Operations
GEMM+Bias+Activation]
CACHE[Cache Optimization
Tiling & Prefetch]
end

subgraph "Hardware"
CPU[CPU Cores
x86-64 AVX2/AVX-512
ARM64 NEON]
end

APP --> RT
APP --> BLAS
APP --> KERNEL

RT --> MEM
BLAS --> EXEC
KERNEL --> EXEC

MEM --> SIMD
EXEC --> SIMD
EXEC --> FUSED
STREAM --> EXEC

SIMD --> CPU
FUSED --> CPU
CACHE --> CPU

%% High contrast styling for accessibility
classDef appLayer fill:#2E86AB,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef apiLayer fill:#A23B72,stroke:#ffffff,stroke-width:2px,color:#ffffff
classDef execLayer fill:#F18F01,stroke:#ffffff,stroke-width:2px,color:#ffffff
classDef optLayer fill:#C73E1D,stroke:#ffffff,stroke-width:2px,color:#ffffff
classDef hwLayer fill:#1B1B1B,stroke:#ffffff,stroke-width:3px,color:#ffffff

class APP appLayer
class RT,BLAS,KERNEL apiLayer
class MEM,EXEC,STREAM execLayer
class SIMD,FUSED,CACHE optLayer
class CPU hwLayer
```

### 3.2 Unified Memory Architecture

GUDA fundamentally differs from traditional GPU computing by implementing a **unified memory model** where all memory is CPU RAM:

```go
type DevicePtr struct {
ptr unsafe.Pointer
size int
offset int
}
```

Key architectural decisions:
- **No Separate Device Memory**: `cudaMalloc` allocates regular CPU RAM, not GPU memory
- **Zero-Copy Operations**: `cudaMemcpy` operations are no-ops or simple `copy()` calls
- **Memory Pool Management**: Efficient allocation/deallocation with free list reuse
- **Type-Safe Access**: DevicePtr provides `.Float32()`, `.Int32()`, `.Byte()` views of the same memory

This eliminates PCIe transfer overhead entirely and enables direct CPU access to all data.

### 3.3 CPU-Native Execution Model

Rather than simulating GPU threads, GUDA maps CUDA execution patterns to CPU-native optimizations:

1. **Native BLAS Integration**: GEMM and BLAS operations call highly optimized CPU libraries (assimilated Gonum)
2. **SIMD-First Design**: GPU warps (32 threads) map to AVX2 vectors (8 float32 operations)
3. **Thread Block โ†’ Goroutine**: Grid/block structures become parallel goroutine work distribution
4. **Cache-Aware Scheduling**: Work stealing and tiling optimize for CPU cache hierarchy
5. **Assembly Kernels**: Hand-optimized AVX2/FMA assembly for critical mathematical operations

**Key Insight**: GUDA is NOT a GPU simulator - it's a CPU-optimized implementation providing CUDA API compatibility.

```mermaid
graph LR
subgraph "GPU Model"
GRID["Grid
dim3(4,2,1)"]
BLOCK1["Block(0,0)"]
BLOCK2["Block(1,0)"]
BLOCK3["Block(...)"]
THREAD1["Thread(0)"]
THREAD2["Thread(1)"]
THREAD3["Thread(...)"]

GRID --> BLOCK1
GRID --> BLOCK2
GRID --> BLOCK3
BLOCK1 --> THREAD1
BLOCK1 --> THREAD2
BLOCK1 --> THREAD3
end

subgraph "GUDA CPU Mapping"
CORES["CPU Cores
8 cores"]
LOOP["Parallel Loops
OpenMP-style"]
SIMD1["SIMD Lane 0-7
AVX2 256-bit"]
SIMD2["SIMD Lane 8-15
AVX2 256-bit"]

CORES --> LOOP
LOOP --> SIMD1
LOOP --> SIMD2
end

BLOCK1 -.->|"maps to"| LOOP
THREAD1 -.->|"maps to"| SIMD1

%% High contrast GPU styling
classDef gpuGrid fill:#76B900,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef gpuBlock fill:#4CAF50,stroke:#ffffff,stroke-width:2px,color:#ffffff
classDef gpuThread fill:#2E7D32,stroke:#ffffff,stroke-width:2px,color:#ffffff

%% High contrast CPU styling
classDef cpuCore fill:#FF6B35,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef cpuLoop fill:#E65100,stroke:#ffffff,stroke-width:2px,color:#ffffff
classDef cpuSIMD fill:#BF360C,stroke:#ffffff,stroke-width:2px,color:#ffffff

class GRID gpuGrid
class BLOCK1,BLOCK2,BLOCK3 gpuBlock
class THREAD1,THREAD2,THREAD3 gpuThread
class CORES cpuCore
class LOOP cpuLoop
class SIMD1,SIMD2 cpuSIMD

style GRID fill:#e3f2fd
style CORES fill:#000000
style SIMD1 fill:#000000
style SIMD2 fill:#000000
```

### 3.4 High-Performance BLAS Implementation

GUDA's cuBLAS compatibility layer leverages the fully assimilated Gonum mathematical computing library:

**Architecture**:
- **Native CPU BLAS**: Direct calls to optimized CPU BLAS routines, not GPU simulation
- **SIMD Assembly Kernels**: Hand-written AVX2/FMA assembly for common matrix sizes (4x4, 8x8, 16x16)
- **Parallel Execution**: Goroutine-based work distribution across all CPU cores
- **Memory Hierarchy Optimization**: L1/L2/L3 cache-aware algorithms with prefetching

**Performance Features**:
- **Fused Operations**: GEMM+Bias+ReLU and other fused kernels in single operations
- **Float16 Hardware Acceleration**: F16C instruction support for half-precision
- **Adaptive Algorithms**: Different implementations chosen based on matrix size and cache characteristics
- **Zero Memory Copy**: Unified memory eliminates host-device transfer overhead

Performance benchmarks are currently being validated and will be published in future releases.

## 4. Performance Characteristics

### 4.1 Experimental Setup

Testing environment:
- CPU: AMD Ryzen 7 7700X (8 cores, 16 threads)
- Memory: 32GB DDR5-5600
- Architecture: x86-64 with AVX2 support
- Compiler: Go 1.21 with CGO for SIMD intrinsics

**Platform Support**: This proof-of-concept currently supports **x86-64 only**. ARM64 and other architectures are not yet implemented.

### 4.2 Performance Results

GUDA achieves exceptional performance through optimized CPU utilization:

#### GEMM Performance (Hot Cache)
| Matrix Size | GFLOPS | Efficiency* | Arithmetic Intensity |
|-------------|--------|-------------|---------------------|
| 256ร—256 | 126.5 | 126% | 42.7 FLOPS/byte |
| 512ร—512 | 148.7 | 149% | 85.3 FLOPS/byte |
| 1024ร—1024 | 154.2 | 154% | 170.7 FLOPS/byte |
| 2048ร—2048 | 153.5 | 154% | 341.3 FLOPS/byte |

*Efficiency relative to practical peak of 100 GFLOPS (40% of theoretical 288 GFLOPS)

#### Memory Bandwidth Operations
| Operation | Size | Performance | Memory Bandwidth |
|-----------|------|-------------|------------------|
| AXPY | 16K | 40.4 GFLOPS | 242.3 GB/s |
| DOT | 1K | 68.4 GFLOPS | 273.7 GB/s |

These results demonstrate:
- **Near-peak performance**: >150 GFLOPS sustained on compute-bound operations
- **Memory saturation**: >240 GB/s achieved (approaching DDR5 theoretical limits)
- **Efficient vectorization**: Full AVX2 utilization with 8-wide float32 operations
- **Cache optimization**: Hot cache performance with effective blocking

See [BENCHMARK_RESULTS.md](BENCHMARK_RESULTS.md) for detailed performance analysis including cold cache results and performance counter validation.

For guidance on running and interpreting benchmarks, see [Benchmarking Guide](docs/BENCHMARKING_GUIDE.md).

### 4.3 Convolution Implementation

Our convolution implementation uses the im2col + GEMM approach:

```mermaid
graph LR
subgraph "Input"
IMG[Input Image
Nร—Cร—Hร—W]
KERNEL[Kernels
Kร—Cร—Rร—S]
end

subgraph "Transform"
IM2COL[im2col Transform
Unfold patches]
RESHAPE[Reshape Kernels
Kร—CRS]
end

subgraph "Compute"
GEMM[GEMM
Kร—CRS ร— CRSร—NHW]
end

subgraph "Output"
OUT[Output
Nร—Kร—H'ร—W']
end

IMG --> IM2COL
KERNEL --> RESHAPE
IM2COL --> GEMM
RESHAPE --> GEMM
GEMM --> OUT

%% High contrast styling for accessibility
classDef inputData fill:#2E86AB,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef transform fill:#A23B72,stroke:#ffffff,stroke-width:2px,color:#ffffff
classDef compute fill:#C73E1D,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef output fill:#F18F01,stroke:#ffffff,stroke-width:3px,color:#ffffff

class IMG,KERNEL inputData
class IM2COL,RESHAPE transform
class GEMM compute
class OUT output
```

### 4.4 Numerical Accuracy

Validation testing indicates:
- IEEE 754 compliance for floating-point operations
- Bit-exact results for memory operations
- Numerical parity with reference implementations for core operations

## 5. Use Cases and Applications

```mermaid
graph TB
subgraph "Development Environment"
DEV_LAPTOP[Developer Laptop
No GPU Required]
CI_PIPELINE[CI/CD Pipeline
GitHub Actions]
DEBUG[CPU Debugging Tools
GDB, Valgrind, perf]
end

subgraph "Production Deployment"
CLOUD[Cloud CPU Instance
Cost-Optimized]
EDGE[Edge Device
ARM/x86 CPUs]
CONTAINER[Container Platform
Docker/Kubernetes]
end

subgraph "Research & Education"
EDUCATION[CUDA Learning
Academic Environment]
PROTOTYPE[Algorithm Prototyping
Rapid Iteration]
ANALYSIS[Performance Analysis
CPU vs GPU Studies]
end

subgraph "Application Types"
INFERENCE[ML Inference
Low-Latency]
SIMULATION[Scientific Computing
Batch Processing]
TRAINING[Small Model Training
Development Phase]
end

DEV_LAPTOP --> CLOUD
CI_PIPELINE --> CONTAINER
DEBUG --> ANALYSIS

CLOUD --> INFERENCE
EDGE --> INFERENCE
CONTAINER --> SIMULATION

EDUCATION --> PROTOTYPE
PROTOTYPE --> TRAINING
ANALYSIS --> SIMULATION

%% High contrast deployment styling
classDef devEnv fill:#2E7D32,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef prodDeploy fill:#1565C0,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef research fill:#7B1FA2,stroke:#ffffff,stroke-width:3px,color:#ffffff
classDef appTypes fill:#D84315,stroke:#ffffff,stroke-width:3px,color:#ffffff

class DEV_LAPTOP,CI_PIPELINE,DEBUG devEnv
class CLOUD,EDGE,CONTAINER prodDeploy
class EDUCATION,PROTOTYPE,ANALYSIS research
class INFERENCE,SIMULATION,TRAINING appTypes
```

### 5.1 Development and Testing

GUDA enables:
- CI/CD pipelines without GPU infrastructure
- Local development on laptops
- Debugging with standard CPU tools

### 5.2 Edge Deployment

Suitable for:
- Inference on edge devices without GPUs
- Embedded systems with powerful CPUs
- Cost-sensitive deployments

### 5.3 Education and Research

Provides:
- Accessible CUDA learning environment
- Algorithm prototyping platform
- Performance analysis opportunities

## 6. Limitations and Future Work

### 6.1 Current Limitations

- **Platform Support**: x86-64 only - ARM64, RISC-V, and other architectures not yet supported
- **API Coverage**: Limited to CUDA runtime and cuBLAS APIs
- **SIMD Requirements**: Requires AVX2 for optimal performance; fallback implementations may be slower
- **No GPU Hardware Features**: No support for advanced GPU features (tensor cores, RT cores, etc.)

### 6.2 Future Directions

1. **Multi-Architecture Support**: ARM64 with NEON/SVE, RISC-V with vector extensions
2. **API Coverage**: Implement cuDNN, cuFFT, cuSPARSE, and other CUDA libraries
3. **Advanced SIMD**: AVX-512 optimizations for Intel processors
4. **Heterogeneous Execution**: CPU+GPU cooperative processing for hybrid workloads

## 7. Conclusion

GUDA demonstrates that CPU-native implementations of GPU APIs can provide practical deployment options for CUDA applications on CPU-only infrastructure. Through unified memory architecture, optimized BLAS integration, and elimination of host-device transfers, GUDA enables running CUDA applications where GPU hardware is unavailable or cost-prohibitive. This proof-of-concept validates the viability of CPU-first architectural approaches for CUDA compatibility and suggests future opportunities for heterogeneous computing strategies.

## Installation

### System Requirements
- **Architecture**: x86-64 processor with AVX2 support (Intel Haswell/AMD Excavator or newer)
- **OS**: Linux, macOS, or Windows
- **Go**: Version 1.19 or later
- **CGO**: Required for SIMD assembly optimizations

### Install
```bash
go get github.com/LynnColeArt/guda
```

**Note**: ARM64 and other architectures are not currently supported in this proof-of-concept.

## Usage Example

```go
package main

import (
"github.com/LynnColeArt/guda"
)

func main() {
// Initialize GUDA
guda.Init(0)
defer guda.Reset()

// Allocate memory
d_a, _ := guda.Malloc(1024 * 1024 * 4)
d_b, _ := guda.Malloc(1024 * 1024 * 4)
d_c, _ := guda.Malloc(1024 * 1024 * 4)
defer guda.Free(d_a)
defer guda.Free(d_b)
defer guda.Free(d_c)

// Perform SGEMM
guda.GEMM(false, false, 1024, 1024, 1024,
1.0, d_a, 1024, d_b, 1024,
0.0, d_c, 1024)
}
```

## License

MIT License - See LICENSE file for details

## Citation

If you use GUDA in your research, please cite:
```bibtex
@software{guda2025,
author = {Lynn Cole},
title = {GUDA: A High-Performance CPU-Based CUDA-Compatible Linear Algebra Library},
year = {2025},
url = {https://github.com/LynnColeArt/guda}
}
```

## Acknowledgments

This work was inspired by the need for accessible high-performance computing and the convergence of CPU and GPU architectures.

### Gonum Integration

GUDA incorporates substantial portions of the Gonum project (https://github.com/gonum/gonum), a set of numeric libraries for the Go programming language. The Gonum BLAS implementation forms the foundation of GUDA's linear algebra operations. We are grateful to the Gonum authors and contributors for their excellent work in bringing high-performance numeric computing to Go. The Gonum code is used under the BSD 3-Clause License.

### Additional Thanks

Special thanks to:
- The Go community for providing excellent tools for systems programming
- The developers of CUDA and cuBLAS for establishing the programming model and APIs
- The open-source community for fostering collaborative scientific computing