https://github.com/widgetii/orangepi5plus-npu

Open-source reverse engineering of the Rockchip RK3588 NPU — standalone driver, Mesa patches, QEMU emulator
https://github.com/widgetii/orangepi5plus-npu

Last synced: 3 months ago
JSON representation

Open-source reverse engineering of the Rockchip RK3588 NPU — standalone driver, Mesa patches, QEMU emulator

Host: GitHub
URL: https://github.com/widgetii/orangepi5plus-npu
Owner: widgetii
Created: 2026-03-24T11:12:52.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-04-03T12:22:12.000Z (3 months ago)
Last Synced: 2026-04-03T17:12:46.099Z (3 months ago)
Language: C
Size: 1.51 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # RK3588 NPU Research — Orange Pi 5 Plus

[![CI](https://github.com/widgetii/orangepi5plus-npu/actions/workflows/ci.yml/badge.svg)](https://github.com/widgetii/orangepi5plus-npu/actions/workflows/ci.yml)

Open-source reverse engineering of the Rockchip RK3588 Neural Processing Unit.

This project includes a standalone NPU driver (`librocketnpu`), Mesa Gallium

optimization patches, a QEMU emulator for the NPU hardware, and research into

the proprietary RKNN register programming model.

## Project Components

### librocketnpu — Standalone Open-Source NPU Driver

A zero-dependency C library that drives the RK3588 NPU directly via DRM IOCTLs,

without Mesa or the proprietary RKNN stack.

```

librocketnpu/

  src/

    rnpu_tflite.c    # TFLite FlatBuffer parser (zero deps)

    rnpu_onnx.c      # ONNX protobuf parser (protobuf-c)

    rnpu_model.c     # Graph analysis, per-channel grouping, scheduling

    rnpu_task.c      # CBUF bank allocation, spatial tiling

    rnpu_coefs.c     # Weight/bias quantization formatting

    rnpu_regcmd.c    # Hardware register command generation

    rnpu_drm.c       # DRM IOCTL submission (CREATE_BO, SUBMIT)

    rnpu_rknn.c      # RKNN binary parser (BRDMA extraction)

    rnpu_sw_ops.c    # CPU fallback: concat, maxpool, pad, resize, sigmoid, softmax

    rnpu_convert.c   # NPU tensor format conversion (NHWC <-> NCHW interleaved)

  include/

    rocketnpu.h      # Public C API

  tests/

    test_sw_ops.c    # 27 unit tests (CPU-only, no hardware)

    test_rknpu_abi.c # 29 ABI regression tests (CPU-only)

    test_onnx_parse.c # ONNX parser validation

  docs/

    compiler_architecture.md  # Research: open-source NPU compiler design

```

**Features:**

- Loads TFLite INT8 models directly — no conversion step

- ONNX frontend (protobuf-c) for RKNN-toolkit graph consumption

- Dual-driver: supports both Rocket (mainline) and RKNPU (vendor) kernel drivers

- Per-channel quantization via scale-sorted grouping + BRDMA DMA

- Multi-task batching for cross-operation chaining

- 56 unit tests, CI on GitHub Actions

**Build:**

```bash

# On board (aarch64, Armbian)

apt install libdrm-dev

cd librocketnpu && make

# With ONNX support

apt install libprotobuf-c-dev

make test_onnx_parse

# Run tests (no hardware needed)

make test

```

### Mesa Optimization Patches

Performance patches for the upstream Rocket Gallium driver:

| Patch | Description | Impact |

|-------|-------------|--------|

| `0003` | BO pool, cache sync reduction, NEON I/O conversion, cached submit | 12% avg latency reduction |

| `0004` | SW ops: concat, maxpool, pad, resize, logistic | YOLO mixed HW/SW execution |

| `0005` | Fix INT8 regression: batch tasks per operation | Correctness fix for upstream |

### QEMU NPU Emulator

Full-system emulation of the RK3588 NPU for development without hardware:

- Boots unmodified Armbian disk images

- Emulates CRU (Clock Reset Unit) for kernel driver probe

- NPU MMIO register model (PC, CNA, Core units)

- IOMMU stub for DMA address translation

## Benchmark Results

### MobileNetV1 224x224 INT8 (single core)

| Stack | Latency | Status |

|-------|---------|--------|

| RKNN proprietary (vendor kernel) | **2.6ms** | Bit-exact |

| librocketnpu (vendor kernel, BRDMA) | **10.2ms** | Bit-exact (max_diff=0) |

| Mesa Rocket + patches (mainline kernel) | **10.2ms** | Bit-exact |

| Mesa Rocket stock (mainline kernel) | 11.6ms | Bit-exact |

### YOLOv5s-relu 640x640 INT8

| Stack | Latency | Accuracy vs RKNN golden |

|-------|---------|------------------------|

| RKNN proprietary (3 cores) | **9.5ms** | Reference |

| RKNN simulator (x86, ONNX Runtime) | N/A | ~0.2 mean diff |

| librocketnpu (vendor, per-channel groups) | **292ms** | ~18-25 mean diff |

The accuracy gap is due to per-channel quantization hardware limitations — the

NVDLA-derived CNA applies one requantization scale per operation, while YOLO's

per-axis weights need per-channel scaling. librocketnpu approximates this with

scale-sorted channel grouping and BRDMA MUL correction.

## NPU Hardware Architecture

The RK3588 NPU has **3 independent cores** (6 TOPS total), each with:

| Offset | Unit | Function |

|--------|------|----------|

| +0x0000 | PC (Frontend) | DMA engine: reads register command buffers, writes to CNA |

| +0x1000 | CNA | Convolution Neural Accelerator (INT8 MAC array) |

| +0x2000 | DPU + RDMA | Data Processing: bias, batch norm, element-wise, output quantization |

| +0x3000 | Core | Power, clock, interrupt control |

The NPU is **register-programmed** — no instruction set. Each "instruction" is a

`(register_address, value)` pair packed into 64-bit entries, DMA'd from memory by

the PC unit. A typical convolution requires ~130 register writes across CNA, DPU,

and RDMA units.

## Research Documents

| Document | Description |

|----------|-------------|

| [`librocketnpu/docs/compiler_architecture.md`](librocketnpu/docs/compiler_architecture.md) | Open-source NPU compiler design (ONNC, TVM, MLIR comparison) |

| [`optimization_report.md`](optimization_report.md) | Mesa Rocket driver optimizations (12% latency reduction) |

| [`rocket_ioctl_analysis.md`](rocket_ioctl_analysis.md) | Decoded IOCTL protocols for Rocket and RKNPU drivers |

| [`npu_research_report.md`](npu_research_report.md) | Full research report: architecture, ftrace, driver comparison |

| [`per_axis_quantization_research.md`](per_axis_quantization_research.md) | Per-channel quantization hardware investigation |

## Board Setup

| | |

|---|---|

| Board | Orange Pi 5 Plus (RK3588, 16GB LPDDR4X, 233GB eMMC) |

| OS | Armbian 25.11.1 Noble (Ubuntu 24.04) |

| Kernels | 6.18.10-current-rockchip64 (mainline) / 6.1.115-vendor-rk35xx (vendor) |

## License

librocketnpu: MIT. Research documents and patches: as noted per file.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/widgetii/orangepi5plus-npu

Awesome Lists containing this project

README