{"id":33374365,"url":"https://github.com/certainly-param/garuda-accelerator","last_synced_at":"2026-06-06T22:31:55.240Z","repository":{"id":318700169,"uuid":"1069860046","full_name":"certainly-param/garuda-accelerator","owner":"certainly-param","description":"   Garuda: Swift RISC-V INT8 accelerator for neural network inference. CVXIF coprocessor for CVA6 achieving 2-5x speedup.","archived":false,"fork":false,"pushed_at":"2025-10-04T19:07:44.000Z","size":12,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-08T18:52:56.115Z","etag":null,"topics":["accelerator","ai-hardware","coprocessor","cva6","cvxif","edge-ai","hardware-accelerator","inference","int8","neural-network","quantization","risc-v","systemverilog"],"latest_commit_sha":null,"homepage":"","language":"SystemVerilog","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/certainly-param.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-04T19:05:32.000Z","updated_at":"2025-10-04T19:15:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"6113320a-a8eb-4927-ae18-2915954a583e","html_url":"https://github.com/certainly-param/garuda-accelerator","commit_stats":null,"previous_names":["certainly-param/garuda-accelerator"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/certainly-param/garuda-accelerator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/certainly-param%2Fgaruda-accelerator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certainly-param%2Fgaruda-accelerator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certainly-param%2Fgaruda-accelerator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certainly-param%2Fgaruda-accelerator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/certainly-param","download_url":"https://codeload.github.com/certainly-param/garuda-accelerator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/certainly-param%2Fgaruda-accelerator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285873538,"owners_count":27246054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-22T02:00:05.934Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerator","ai-hardware","coprocessor","cva6","cvxif","edge-ai","hardware-accelerator","inference","int8","neural-network","quantization","risc-v","systemverilog"],"created_at":"2025-11-22T23:00:58.128Z","updated_at":"2025-11-22T23:01:18.900Z","avatar_url":"https://github.com/certainly-param.png","language":"SystemVerilog","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Garuda: RISC-V ML Accelerator\n\n\u003e *Swift 
as the divine eagle, Garuda accelerates RISC-V with specialized hardware for neural network inference.*\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)\n[![RISC-V](https://img.shields.io/badge/RISC--V-CVXIF-green.svg)](https://github.com/openhwgroup/core-v-xif)\n[![Status](https://img.shields.io/badge/Status-Active%20Development-orange.svg)]()\n\n---\n\n## 🚀 **What's New (October 2025)**\n\n**Latest Updates:**\n- ✅ **Bug Fix:** Corrected INT8 saturation values for proper two's complement representation\n- ✅ **New Feature:** Overflow detection flag for debugging and profiling\n- ✅ **Verification:** Added SystemVerilog assertions for protocol compliance\n- ✅ **Coverage:** Added overflow tracking properties for better testing\n\n---\n\n## 📖 Project Overview\n\n**Garuda** is a CVXIF coprocessor that extends RISC-V with custom INT8 multiply-accumulate (MAC) instructions for efficient neural network inference. The modular design integrates with CVA6 without CPU modifications, achieving 2-5× speedup over software implementations.\n\n**Key Features:**\n- ⚡ **CVXIF Interface:** Standard coprocessor protocol (no CPU changes)\n- 🎯 **Stateless Design:** Supports speculative execution\n- 🔧 **Compact:** ~200 LUTs per MAC unit\n- 🚀 **Pipelined:** 3-4 cycle latency\n\n### INT8 Quantization\n\nModern neural networks use INT8 quantization to reduce memory footprint (4x smaller than FP32), power consumption, bandwidth requirements, and hardware cost. INT8 inference achieves near-FP32 accuracy for most models with proper quantization techniques.\n\n### CVXIF Interface\n\nCVXIF provides a standard interface for RISC-V coprocessors, enabling modular accelerator design without CPU modifications. 
The interface handles instruction offloading, register access, and result writeback.\n\n## Features\n\n**Custom Instructions (Garuda 1.0):**\n- `mac8` - INT8 MAC with 8-bit accumulator + saturation\n- `mac8.acc` - INT8 MAC with 32-bit accumulator\n- `mul8` - INT8 multiply without accumulation\n- `clip8` - Saturate to INT8 range [-128, 127]\n\n**Recent Improvements (Oct 2025):**\n- ✅ Fixed saturation bug (invalid 8'sd128 → correct -8'sd128)\n- ✅ Added overflow detection output (tracks when saturation occurs)\n- ✅ Added SystemVerilog assertions for verification\n- ✅ Added coverage tracking for overflow events\n\n**Architecture:**\n- CVXIF coprocessor integration\n- Stateless design for speculative execution\n- Pipelined MAC unit (3-4 cycle latency)\n- Overflow detection for debugging\n- Efficient resource usage (~200 LUTs per MAC unit)\n\n## Repository Structure\n\n```\ngaruda/                          # Garuda accelerator\n├── rtl/                         # RTL source files\n│   ├── int8_mac_instr_pkg.sv   # Instruction definitions\n│   ├── int8_mac_unit.sv        # MAC execution unit\n│   ├── int8_mac_decoder.sv     # Instruction decoder\n│   └── int8_mac_coprocessor.sv # Top-level module\n├── tb/                          # Testbenches\n│   └── tb_int8_mac_unit.sv     # MAC unit testbench\n└── sw/                          # Software tests\n\ncva6/                            # CVA6 RISC-V CPU core (upstream)\n```\n\n## Getting Started\n\n### Prerequisites\n\n- RISC-V GNU Toolchain (see `cva6/util/toolchain-builder`)\n- Verilator, ModelSim/Questa, or VCS\n- Python 3.7+\n\n### Clone Repository\n\n```bash\ngit clone https://github.com/certainly-param/garuda-accelerator.git\ncd garuda-accelerator\ngit submodule update --init --recursive\n```\n\n### Run Simulations\n\n```bash\ncd garuda\n./run_sim.sh verilator\n```\n\n### Verify CVA6 Environment\n\n```bash\ncd cva6\nexport RISCV=/path/to/toolchain\nexport DV_SIMULATORS=veri-testharness,spike\nbash 
verif/regress/smoke-tests.sh\n```\n\n## Example Usage\n\n### Assembly Code\n\n```asm\n# Dot product: result = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]\n\ndot_product:\n    lw      t0, 0(a0)           # Load a[3:0] (packed INT8s)\n    lw      t1, 0(a1)           # Load b[3:0] (packed INT8s)\n    li      t2, 0               # Initialize accumulator\n    \n    mac8.acc t2, t0, t1         # acc += a[0] * b[0]\n    srli     t0, t0, 8\n    srli     t1, t1, 8\n    \n    mac8.acc t2, t0, t1         # acc += a[1] * b[1]\n    srli     t0, t0, 8\n    srli     t1, t1, 8\n    \n    mac8.acc t2, t0, t1         # acc += a[2] * b[2]\n    srli     t0, t0, 8\n    srli     t1, t1, 8\n    \n    mac8.acc t2, t0, t1         # acc += a[3] * b[3]\n    \n    mv       a0, t2             # Return result\n    ret\n```\n\n### C with Inline Assembly\n\n```c\nstatic inline int32_t mac8_acc(int32_t acc, int8_t a, int8_t b) {\n    int32_t result;\n    asm volatile (\n        \"mac8.acc %0, %1, %2\"\n        : \"=r\" (result)\n        : \"r\" (a), \"r\" (b), \"0\" (acc)\n    );\n    return result;\n}\n\nint32_t dot_product(int8_t* a, int8_t* b, int n) {\n    int32_t sum = 0;\n    for (int i = 0; i \u003c n; i++) {\n        sum = mac8_acc(sum, a[i], b[i]);\n    }\n    return sum;\n}\n```\n\n## Architecture\n\n### System Overview\n\n```\nCVA6 CPU                           INT8 MAC Coprocessor\n┌──────────────────────┐          ┌──────────────────────┐\n│ Fetch → Decode →     │          │ Instruction Decoder  │\n│ Issue → Execute → WB │◄────────►│ INT8 MAC Unit        │\n└──────────────────────┘          │ Result Register      │\n         CVXIF Interface           └──────────────────────┘\n```\n\n### Datapath\n\n```\nrs1[7:0]  rs2[7:0]\n   │         │\n   └────┬────┘\n        │\n   ┌────▼────┐\n   │ 8x8 MUL │  16-bit product\n   └────┬────┘\n        │\n   ┌────▼────┐\n   │ 32b ADD │  Accumulate\n   └────┬────┘\n        │\n   ┌────▼────┐\n   │ Pipeline│  1 cycle\n   └────┬────┘\n        │\n     
rd[31:0]\n```\n\n### Resource Usage\n\n- LUTs: ~200 per MAC unit\n- 8x8 multiplier: ~100 LUTs\n- 32-bit adder: ~32 LUTs\n- Control logic: ~50 LUTs\n\n## Performance\n\n### Instruction Count\n\n| Operation | Standard RISC-V | With MAC8.ACC | Speedup |\n|-----------|----------------|---------------|---------|\n| Single MAC | 2 (mul + add) | 1 | 2x |\n| 4-elem dot product | 16 | 14 | 1.14x |\n| 256-elem dot product | 1024 | ~770 | 1.3x |\n\n### Cycle Count\n\n| Operation | Standard RISC-V | MAC Coprocessor |\n|-----------|----------------|-----------------|\n| Single MAC | 5-8 cycles | 3-4 cycles |\n| 256-elem dot product | ~2048 cycles | ~1500 cycles |\n\nPerformance depends on memory bandwidth and cache behavior.\n\n## 📚 Documentation\n\n**RTL Documentation:**\n- See `garuda/README.md` for detailed RTL documentation\n- Inline code comments in all source files\n- Module hierarchy and integration guide\n\n**External References:**\n- [CV-X-IF Specification](https://github.com/openhwgroup/core-v-xif)\n- [CVA6 Documentation](https://docs.openhwgroup.org/projects/cva6-user-manual/)\n\n## 🎯 Quick Start\n\n### 1. Clone Repository\n```bash\ngit clone https://github.com/certainly-param/garuda-accelerator.git\ncd garuda-accelerator\ngit submodule update --init --recursive\n```\n\n### 2. Run Garuda 1.0 Simulation\n```bash\ncd garuda\n./run_sim.sh verilator\n```\n\n### 3. 
Explore Documentation\n```bash\n# RTL documentation\ncat garuda/README.md\n\n# View instruction definitions\ncat garuda/rtl/int8_mac_instr_pkg.sv\n```\n\n---\n\n## 📊 Performance\n\n### Current Implementation\n- **Peak Performance:** ~25 GOPS (INT8)\n- **Power:** ~10W (estimated)\n- **Latency:** 3-4 cycles per MAC operation\n- **Resource Usage:** ~200 LUTs per MAC unit\n- **Fmax:** 100+ MHz (FPGA), 1+ GHz (ASIC target)\n\n### Use Cases\n- Edge AI inference (resource-constrained devices)\n- Embedded neural networks\n- Educational projects\n- RISC-V accelerator research\n\n---\n\n## 📚 References\n\n**RISC-V:**\n- [CV-X-IF Specification](https://github.com/openhwgroup/core-v-xif)\n- [CVA6 Documentation](https://docs.openhwgroup.org/projects/cva6-user-manual/)\n- [RISC-V ISA Manual](https://riscv.org/technical/specifications/)\n\n**Neural Network Quantization:**\n- [Quantization and Training of Neural Networks](https://arxiv.org/abs/1712.05877)\n- [Survey of Quantization Methods](https://arxiv.org/abs/2103.13630)\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! Areas of interest:\n- RTL improvements and optimizations\n- Testbench enhancements\n- Software examples and benchmarks\n- Documentation improvements\n- Performance analysis and benchmarking\n\n---\n\n## 📧 Contact \u0026 Community\n\n- **GitHub Issues:** Bug reports and feature requests\n- **RISC-V Slack:** #garuda channel (join the conversation)\n- **OpenHW Group:** Contribute to RISC-V ecosystem\n\n---\n\n## 📜 License\n\n- **Garuda RTL:** Apache License 2.0\n- **CVA6:** Solderpad Hardware License v0.51\n- **Documentation:** Creative Commons BY 4.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcertainly-param%2Fgaruda-accelerator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcertainly-param%2Fgaruda-accelerator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcertainly-param%2Fgaruda-accelerator/lists"}