An open API service indexing awesome lists of open source software.

https://github.com/certainly-param/garuda-accelerator

Garuda: Swift RISC-V INT8 accelerator for neural network inference. CVXIF coprocessor for CVA6 achieving 2-5x speedup.
https://github.com/certainly-param/garuda-accelerator

accelerator ai-hardware coprocessor cva6 cvxif edge-ai hardware-accelerator inference int8 neural-network quantization risc-v systemverilog

Last synced: 4 days ago
JSON representation

Garuda: Swift RISC-V INT8 accelerator for neural network inference. CVXIF coprocessor for CVA6 achieving 2-5x speedup.

Awesome Lists containing this project

README

          

# Garuda: RISC-V ML Accelerator

> *Swift as the divine eagle, Garuda accelerates RISC-V with specialized hardware for neural network inference.*

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![RISC-V](https://img.shields.io/badge/RISC--V-CVXIF-green.svg)](https://github.com/openhwgroup/core-v-xif)
[![Status](https://img.shields.io/badge/Status-Active%20Development-orange.svg)]()

---

## πŸš€ **What's New (October 2025)**

**Latest Updates:**
- βœ… **Bug Fix:** Corrected INT8 saturation values for proper two's complement representation
- βœ… **New Feature:** Overflow detection flag for debugging and profiling
- βœ… **Verification:** Added SystemVerilog assertions for protocol compliance
- βœ… **Coverage:** Added overflow tracking properties for better testing

---

## πŸ“– Project Overview

**Garuda** is a CVXIF coprocessor that extends RISC-V with custom INT8 multiply-accumulate (MAC) instructions for efficient neural network inference. The modular design integrates with CVA6 without CPU modifications, achieving 2-5Γ— speedup over software implementations.

**Key Features:**
- ⚑ **CVXIF Interface:** Standard coprocessor protocol (no CPU changes)
- 🎯 **Stateless Design:** Supports speculative execution
- πŸ”§ **Compact:** ~200 LUTs per MAC unit
- πŸš€ **Pipelined:** 3-4 cycle latency

### INT8 Quantization

Modern neural networks use INT8 quantization to reduce memory footprint (4x smaller than FP32), power consumption, bandwidth requirements, and hardware cost. INT8 inference achieves near-FP32 accuracy for most models with proper quantization techniques.

### CVXIF Interface

CVXIF provides a standard interface for RISC-V coprocessors, enabling modular accelerator design without CPU modifications. The interface handles instruction offloading, register access, and result writeback.

## Features

**Custom Instructions (Garuda 1.0):**
- `mac8` - INT8 MAC with 8-bit accumulator + saturation
- `mac8.acc` - INT8 MAC with 32-bit accumulator
- `mul8` - INT8 multiply without accumulation
- `clip8` - Saturate to INT8 range [-128, 127]

**Recent Improvements (Oct 2025):**
- βœ… Fixed saturation bug (invalid 8'sd128 β†’ correct -8'sd128)
- βœ… Added overflow detection output (tracks when saturation occurs)
- βœ… Added SystemVerilog assertions for verification
- βœ… Added coverage tracking for overflow events

**Architecture:**
- CVXIF coprocessor integration
- Stateless design for speculative execution
- Pipelined MAC unit (3-4 cycle latency)
- Overflow detection for debugging
- Efficient resource usage (~200 LUTs per MAC unit)

## Repository Structure

```
garuda/ # Garuda accelerator
β”œβ”€β”€ rtl/ # RTL source files
β”‚ β”œβ”€β”€ int8_mac_instr_pkg.sv # Instruction definitions
β”‚ β”œβ”€β”€ int8_mac_unit.sv # MAC execution unit
β”‚ β”œβ”€β”€ int8_mac_decoder.sv # Instruction decoder
β”‚ └── int8_mac_coprocessor.sv # Top-level module
β”œβ”€β”€ tb/ # Testbenches
β”‚ └── tb_int8_mac_unit.sv # MAC unit testbench
└── sw/ # Software tests

cva6/ # CVA6 RISC-V CPU core (upstream)
```

## Getting Started

### Prerequisites

- RISC-V GNU Toolchain (see `cva6/util/toolchain-builder`)
- Verilator, ModelSim/Questa, or VCS
- Python 3.7+

### Clone Repository

```bash
git clone https://github.com/yourusername/cva6-garuda.git
cd cva6-garuda
git submodule update --init --recursive
```

### Run Simulations

```bash
cd garuda
./run_sim.sh verilator
```

### Verify CVA6 Environment

```bash
cd cva6
export RISCV=/path/to/toolchain
export DV_SIMULATORS=veri-testharness,spike
bash verif/regress/smoke-tests.sh
```

## Example Usage

### Assembly Code

```asm
# Dot product: result = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]

dot_product:
lw t0, 0(a0) # Load a[3:0] (packed INT8s)
lw t1, 0(a1) # Load b[3:0] (packed INT8s)
li t2, 0 # Initialize accumulator

mac8.acc t2, t0, t1 # acc += a[0] * b[0]
srli t0, t0, 8
srli t1, t1, 8

mac8.acc t2, t0, t1 # acc += a[1] * b[1]
srli t0, t0, 8
srli t1, t1, 8

mac8.acc t2, t0, t1 # acc += a[2] * b[2]
srli t0, t0, 8
srli t1, t1, 8

mac8.acc t2, t0, t1 # acc += a[3] * b[3]

mv a0, t2 # Return result
ret
```

### C with Inline Assembly

```c
static inline int32_t mac8_acc(int32_t acc, int8_t a, int8_t b) {
int32_t result;
asm volatile (
"mac8.acc %0, %1, %2"
: "=r" (result)
: "r" (a), "r" (b), "0" (acc)
);
return result;
}

int32_t dot_product(int8_t* a, int8_t* b, int n) {
int32_t sum = 0;
for (int i = 0; i < n; i++) {
sum = mac8_acc(sum, a[i], b[i]);
}
return sum;
}
```

## Architecture

### System Overview

```
CVA6 CPU INT8 MAC Coprocessor
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Fetch β†’ Decode β†’ β”‚ β”‚ Instruction Decoder β”‚
β”‚ Issue β†’ Execute β†’ WB │◄────────►│ INT8 MAC Unit β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ Result Register β”‚
CVXIF Interface β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Datapath

```
rs1[7:0] rs2[7:0]
β”‚ β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
β”‚ 8x8 MUL β”‚ 16-bit product
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
β”‚ 32b ADD β”‚ Accumulate
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
β”‚ Pipelineβ”‚ 1 cycle
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
β”‚
rd[31:0]
```

### Resource Usage

- LUTs: ~200 per MAC unit
- 8x8 multiplier: ~100 LUTs
- 32-bit adder: ~32 LUTs
- Control logic: ~50 LUTs

## Performance

### Instruction Count

| Operation | Standard RISC-V | With MAC8.ACC | Speedup |
|-----------|----------------|---------------|---------|
| Single MAC | 2 (mul + add) | 1 | 2x |
| 4-elem dot product | 16 | 14 | 1.14x |
| 256-elem dot product | 1024 | ~770 | 1.3x |

### Cycle Count

| Operation | Standard RISC-V | MAC Coprocessor |
|-----------|----------------|-----------------|
| Single MAC | 5-8 cycles | 3-4 cycles |
| 256-elem dot product | ~2048 cycles | ~1500 cycles |

Performance depends on memory bandwidth and cache behavior.

## πŸ“š Documentation

**RTL Documentation:**
- See `garuda/README.md` for detailed RTL documentation
- Inline code comments in all source files
- Module hierarchy and integration guide

**External References:**
- [CV-X-IF Specification](https://github.com/openhwgroup/core-v-xif)
- [CVA6 Documentation](https://docs.openhwgroup.org/projects/cva6-user-manual/)

## 🎯 Quick Start

### 1. Clone Repository
```bash
git clone https://github.com/yourusername/garuda-accelerator.git
cd garuda-accelerator
git submodule update --init --recursive
```

### 2. Run Garuda 1.0 Simulation
```bash
cd garuda
./run_sim.sh verilator
```

### 3. Explore Documentation
```bash
# RTL documentation
cat garuda/README.md

# View instruction definitions
cat garuda/rtl/int8_mac_instr_pkg.sv
```

---

## πŸ“Š Performance

### Current Implementation
- **Peak Performance:** ~25 GOPS (INT8)
- **Power:** ~10W (estimated)
- **Latency:** 3-4 cycles per MAC operation
- **Resource Usage:** ~200 LUTs per MAC unit
- **Fmax:** 100+ MHz (FPGA), 1+ GHz (ASIC target)

### Use Cases
- Edge AI inference (resource-constrained devices)
- Embedded neural networks
- Educational projects
- RISC-V accelerator research

---

## πŸ“š References

**RISC-V:**
- [CV-X-IF Specification](https://github.com/openhwgroup/core-v-xif)
- [CVA6 Documentation](https://docs.openhwgroup.org/projects/cva6-user-manual/)
- [RISC-V ISA Manual](https://riscv.org/technical/specifications/)

**Neural Network Quantization:**
- [Quantization and Training of Neural Networks](https://arxiv.org/abs/1712.05877)
- [Survey of Quantization Methods](https://arxiv.org/abs/2103.13630)

---

## 🀝 Contributing

We welcome contributions! Areas of interest:
- RTL improvements and optimizations
- Testbench enhancements
- Software examples and benchmarks
- Documentation improvements
- Performance analysis and benchmarking

---

## πŸ“§ Contact & Community

- **GitHub Issues:** Bug reports and feature requests
- **RISC-V Slack:** #garuda channel (join the conversation)
- **OpenHW Group:** Contribute to RISC-V ecosystem

---

## πŸ“œ License

- **Garuda RTL:** Apache License 2.0
- **CVA6:** Solderpad Hardware License v0.51
- **Documentation:** Creative Commons BY 4.0