https://github.com/cbiffle/hapenny

Small 32-bit RISC-V CPU with a half-width datapath inspired by the 68000

# `hapenny`: a half-width RISC-V

`hapenny` is a 32-bit RISC-V CPU implementation that operates internally on
16-bit chunks. This means it takes longer to do things, but uses less space.

This approach was inspired by the MC68000 (1979), which also implemented a
32-bit instruction set using a 16-bit datapath. (`hapenny` uses about half as
many cycles per instruction as the MC68000, after optimization.)

`hapenny` was written to evaluate the Amaranth HDL.

(The current `hapenny` was formerly version 2; once it became mature enough I
removed version 1.)

## Bullet points

- Over 12M inst/sec on iCE40 HX1K, while occupying under 800 LCs, or less than
63% of the chip. (Throughput compares favorably to some 32-bit implementations
occupying twice the area.)
- Native 16-bit bus allows for simpler peripherals and external RAMs. (Can run
out of external 16-bit SRAM with no penalty.)
- Parameterized with knobs for trading off size vs capability.
- Implements the RV32I unprivileged instruction set (currently missing FENCE and
SYSTEM).
- Interrupt support was optional in the older core; the revised core does not have it yet.
- Written in Python using Amaranth.

## But why

There are a bazillion open-source RISC-V CPU implementations out there, which is
what happens when you release a well-designed and free-to-implement instruction
set spec -- nerds like me will crank out implementations.

I wrote `hapenny` as an experiment to see if I could target the space between
the PicoRV32 core and the SERV core, in terms of size and performance. I
specifically wanted to produce a CPU with decent performance that could fit into
an iCE40 HX1K part (like on the Icestick evaluation board) with enough space
left over for useful logic. PicoRV32 doesn't quite fit on that chip; SERV fits
but takes 32-64 cycles per instruction.

| Property | PicoRV32-small | `hapenny` | SERV |
| ------------------------------ | -------------- | --------- | ---- |
| Datapath width (bits) | 32 | **16** | 1 |
| External data bus width | 32 | **16** | 32 |
| Average cycles per instruction | 5.429 | **5.525** | 40-ish |
| Minimal size on iCE40 (LCs) | 1500-ish | **796** | 200-ish |
| Typical MHz on iCE40 | 40s? | **72+** | 40s? |

(Cycles/instruction is measured on Dhrystone. Minimal size is the output
produced by the `icestick-smallest.py` script. I would appreciate help getting
apples-to-apples comparison numbers!)

So, basically,

- `hapenny` is significantly smaller than a similarly-configured PicoRV32 core
for only 1.7% less performance per clock. (Of course, PicoRV32 is a far more
general and well-tested processor, and in practice you'd configure it with
performance-enhancing features like a dual-port register file and faster
shifts.)

- `hapenny` is much faster than SERV, but also about 4x larger. (SERV is also
better tested than `hapenny`.)
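
The comparison figures above can be re-derived from the table with a few lines of arithmetic. This is only a sketch: the 72 MHz clock is the lower bound of the "72+" entry, and the 1280 logic-cell total for the iCE40 HX1K is an assumption taken from the vendor's published figures, not from this README.

```python
# Figures copied from the comparison table above.
CPI_PICORV32_SMALL = 5.429
CPI_HAPENNY = 5.525
HAPENNY_LCS = 796
HAPENNY_MHZ = 72          # lower bound of the "72+" table entry

# Assumption: total logic cells on an iCE40 HX1K (not stated in this README).
HX1K_TOTAL_LCS = 1280

# Performance per clock is the reciprocal of cycles per instruction.
relative_perf = CPI_PICORV32_SMALL / CPI_HAPENNY
perf_loss_percent = (1 - relative_perf) * 100

# Millions of instructions per second at the given clock.
minst_per_sec = HAPENNY_MHZ / CPI_HAPENNY

# Fraction of the chip occupied by the minimal configuration.
chip_fraction = HAPENNY_LCS / HX1K_TOTAL_LCS
```

With these inputs, `perf_loss_percent` comes out a little over 1.7, `minst_per_sec` a little over 13, and `chip_fraction` just over 0.62, which matches the "1.7% less performance per clock", "over 12M inst/sec", and "less than 63% of the chip" claims.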

`hapenny` is easy to interface to 16-bit peripherals and external memory with no
(additional) performance loss. This can result in smaller overall designs and
simpler boards. For instance, `hapenny` can run at full rate out of the 16-bit
SRAM on the Icoboard.

Independent from the datapath width, I also did some fairly aggressive manual
register retiming in the decoder and datapath, which means `hapenny` can often
close timing at higher Fmax than other simple RV32 cores. (I miss automatic
retiming from ASIC toolchains.)

## Details

`hapenny` executes (most of) the RV32I instruction set in 16-bit pieces. It uses
16-bit memory, a 16-bit (single-ported) register file, and a 16-bit ALU. To
perform 32-bit operations, it uses the same techniques a programmer might use in
software on a 16-bit computer, e.g. "chaining" operations using preserved
carry/zero bits.
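
As a software illustration of that chaining technique, here is a minimal Python sketch of a 32-bit add and a 32-bit equality test built from 16-bit steps. The function names are hypothetical, and this models the arithmetic only; it is not how the Amaranth source is structured.

```python
MASK16 = 0xFFFF

def add16(a, b, carry_in=0):
    """Add two 16-bit values plus a carry; return (16-bit sum, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16

def add32_in_halves(x, y):
    """32-bit wrapping add built from two chained 16-bit adds."""
    lo, carry = add16(x, y)                  # low halves first
    hi, _ = add16(x >> 16, y >> 16, carry)   # high halves consume the saved carry
    return (hi << 16) | lo

def eq32_in_halves(x, y):
    """32-bit equality using a zero flag preserved across both halves."""
    zero = ((x ^ y) & MASK16) == 0                           # low-half zero flag
    zero = zero and (((x >> 16) ^ (y >> 16)) & MASK16) == 0  # combine with high half
    return zero
```

For example, `add32_in_halves(0x0001FFFF, 1)` carries out of the low half and returns `0x00020000`.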

All memory interfaces in `hapenny` are synchronous, including the register file,
which is another reason why operations take more cycles. The RV32I register file
is comparatively large (at 1024 bits), and using a synchronous register file
ensures that it can be mapped into an FPGA block RAM if desired.

Here's what the CPU does during the timing of a typical instruction like `ADD`.
I've color/brightness-coded three different executions that are in flight during
this diagram.

![A timing diagram showing a typical instruction cycle.](doc/instruction-cycle.svg)

- The "FD-Box" is responsible for fetch and decode, and is always working on the
_next_ instruction. It requires three cycles to fetch both halfwords of an
instruction, and then uses the `DECODE` cycle to do initial instruction
decoding and start the read of rs1's low half. (It spends one cycle out of
four essentially idle to make the state machines line up conveniently.)
- The "EW-Box" is responsible for execute and writeback. It goes through at
least four states in every instruction:
  - `R2L` starts the load of the low half of rs2 from the register file.
  - `OPL` operates on the low halves of rs1 and rs2 (or rs1 and an immediate),
    and also starts the load of the high half of rs1.
  - `R2H` and `OPH` do the same thing for the high half.

Most instructions take four cycles, as shown in that diagram. Some take more if
they need to do additional things (by adding states), or if they change control
flow such that the FD-Box's speculative fetch was wrong. The CPU test bench
(`sim-cpu.py`) measures the cycle timing for every instruction; here's where
things currently stand:

| Instruction | Cycles | Notes |
| ------------ | ------ | ----- |
| AUIPC | 4 | |
| LUI | 4 | |
| JAL | 8 | Includes four-cycle re-fetch penalty |
| JALR | 8 | Includes four-cycle re-fetch penalty |
| Branch | 5/10 | Not Taken / Taken |
| Load | 6 | |
| SW | 5 | |
| SB/SH | 4 | |
| SLT(I)(U) | 6 | |
| Shift | 6 + N | N is number of bits shifted |
| Other ALU op | 4 | |

On the instruction mix in Dhrystone, this yields an average of 5.525
cycles/instruction.
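
To estimate the cycle cost of some other instruction mix, the table above can be turned into a small lookup. This is a rough sketch: the category names and the trace format are made up for illustration, and the example trace is not the Dhrystone mix.

```python
# Cycle counts copied from the table above; the keys are hypothetical labels.
CYCLES = {
    "AUIPC": 4, "LUI": 4,
    "JAL": 8, "JALR": 8,
    "BRANCH_NOT_TAKEN": 5, "BRANCH_TAKEN": 10,
    "LOAD": 6, "SW": 5, "SB": 4, "SH": 4,
    "SLT": 6,   # covers SLT, SLTI, SLTU, SLTIU
    "ALU": 4,   # "Other ALU op"
}

def cycles_for(kind, shift_amount=0):
    """Cycles for one instruction; shifts cost 6 plus the number of bits shifted."""
    if kind == "SHIFT":
        return 6 + shift_amount
    return CYCLES[kind]

def average_cpi(trace):
    """Average cycles per instruction over a list of (kind, shift_amount) pairs."""
    total = sum(cycles_for(kind, amount) for kind, amount in trace)
    return total / len(trace)

# A made-up four-instruction trace: ALU op, load, taken branch, word store.
example_trace = [("ALU", 0), ("LOAD", 0), ("BRANCH_TAKEN", 0), ("SW", 0)]
example_cpi = average_cpi(example_trace)   # (4 + 6 + 10 + 5) / 4
```

Note how quickly taken branches and long shifts pull the average up: a single 16-bit shift costs as much as five and a half ordinary ALU operations.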

## Interfaces

`hapenny` uses a very simple bus interface with up to 32-bit addressing. In
practice, applications will wire up fewer than 32 address lines, which saves
space.

| Signal | Driver | Width | Description |
| ---------- | ------ | -------- | ----------- |
| `addr` | CPU | up to 31 | addresses a halfword, i.e. LSB missing |
| `data_out` | CPU | 16 | carries data for a write |
| `lanes` | CPU | 2 | signals a write of either or both bytes in a halfword; zero means a load |
| `valid` | CPU | 1 | when high, indicates that the signals above are valid and starts a bus transaction. |
| `response` | device | 16 | on the cycle after a load, carries back data from the addressed device. |
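
To make the signal table concrete, here is a behavioral Python model of a 16-bit RAM sitting on this bus. It is a sketch under assumptions: the class name is hypothetical, and the mapping of `lanes` bit 0 to the low byte and bit 1 to the high byte is a guess, since the table does not specify it.

```python
class HalfwordRam:
    """Toy model of a 16-bit memory device on the bus described above."""

    def __init__(self, halfwords):
        self.mem = [0] * halfwords
        self.response = 0  # what the CPU sees on the cycle after a load

    def cycle(self, addr=0, data_out=0, lanes=0b00, valid=False):
        """Advance one clock with the signals the CPU drives this cycle."""
        if not valid:
            return                      # no transaction this cycle
        if lanes == 0:
            # Load: data is registered and visible on the following cycle.
            self.response = self.mem[addr]
            return
        # Store: build a byte mask from the lanes (bit assignment is assumed).
        mask = (0x00FF if lanes & 0b01 else 0) | (0xFF00 if lanes & 0b10 else 0)
        self.mem[addr] = (self.mem[addr] & ~mask & 0xFFFF) | (data_out & mask)

def halfword_addr(byte_address):
    """`addr` addresses halfwords, so the byte address's LSB is dropped."""
    return byte_address >> 1

ram = HalfwordRam(16)
ram.cycle(addr=halfword_addr(6), data_out=0xBEEF, lanes=0b11, valid=True)  # full store
ram.cycle(addr=halfword_addr(6), data_out=0x1234, lanes=0b01, valid=True)  # low byte only
ram.cycle(addr=halfword_addr(6), valid=True)                               # load
loaded = ram.response   # read by the CPU one cycle after the load was issued
```

Because there are no wait states, the device has exactly one cycle to produce `response`, which is why the model has no ready or acknowledge signal.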

The PC can be shrunk separately from the address bus if you know that all
program memory appears in e.g. the bottom half of the address space. This
further saves space.

The bus interface does not support wait states, to reduce complexity. This makes
it difficult to interface to things like XIP SPI Flash or SDRAM. `hapenny` is
really intended for applications that don't rely on such things.

`hapenny` exposes a fairly flexible debug interface capable of inspecting
processor state and reading and writing the register file. These features are
only available when the processor is halted, which can be achieved by holding
`halt_request` high until the processor confirms (at the next instruction
boundary) by asserting `halted`. Release `halt_request` to resume.

Finally, `hapenny` has an RVFI (RISC-V Formal Interface) trace port for
generating a trace of instruction effects, though I haven't wired up the actual
test suite.

## Interrupt options

Currently, `hapenny` does not support interrupts, but I'm planning on changing
this. (An earlier version did; support was removed when I rearchitected the core
for v2.)

## Drawbacks

- Written by someone who pretends to be an electrical engineer as a way to
procrastinate finishing his slides for a talk.

- Used for exactly one thing so far, so not exactly battle-hardened.

- Less general than more mature implementations like PicoRV32 -- e.g. no support
for wait states, hardware multiply, coprocessors, or (currently) interrupts.

- 16-bit external data bus means that, currently, 32-bit reads/writes are not
atomic -- a problem when interfacing with peripherals with 32-bit
memory-mapped registers. (Peripherals with 16-bit memory-mapped registers work
fine, however.)

- Not exactly well factored/commented.

- Written in Python, so chances are pretty good the code won't keep working
across OS updates / minor runtime versions.
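
The non-atomic 32-bit access mentioned in the list above can be illustrated with a toy model. Everything here is hypothetical: `Counter32` is an invented peripheral whose 32-bit value advances after every bus access, and the retry loop is a common software workaround for monotonic counters, not something this README prescribes.

```python
class Counter32:
    """Invented peripheral: a 32-bit counter read as two 16-bit halves."""

    def __init__(self, value):
        self.value = value

    def read_half(self, index):
        half = (self.value >> (16 * index)) & 0xFFFF
        # Model time passing: the counter advances after each bus access.
        self.value = (self.value + 1) & 0xFFFFFFFF
        return half

def read32_naive(dev):
    """Two back-to-back 16-bit reads; the halves may come from different values."""
    lo = dev.read_half(0)
    hi = dev.read_half(1)
    return (hi << 16) | lo

def read32_consistent(dev):
    """Read high, low, high again; retry if the high half changed in between."""
    while True:
        hi_before = dev.read_half(1)
        lo = dev.read_half(0)
        hi_after = dev.read_half(1)
        if hi_before == hi_after:
            return (hi_before << 16) | lo

torn = read32_naive(Counter32(0x0000FFFF))
stable = read32_consistent(Counter32(0x0000FFFF))
```

Starting from `0x0000FFFF`, the naive read returns `0x0001FFFF`, a value the counter never held, while the retrying read returns `0x00010003`, which the counter did hold at the moment its low half was sampled.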

## What's with the name

`hapenny` is implemented using about half the logic of other cheap RV32 cores.

The half-penny, or "ha'penny," is a historical English coin worth (as the name
implies) half a penny. So if the other cheap cores cost a penny, this is a
ha'penny.