{"id":13537095,"url":"https://github.com/ucb-bar/gemmini","last_synced_at":"2025-05-14T20:03:12.543Z","repository":{"id":40494321,"uuid":"155482388","full_name":"ucb-bar/gemmini","owner":"ucb-bar","description":"Berkeley's Spatial Array Generator","archived":false,"fork":false,"pushed_at":"2025-02-19T17:57:54.000Z","size":4430,"stargazers_count":915,"open_issues_count":96,"forks_count":189,"subscribers_count":30,"default_branch":"master","last_synced_at":"2025-04-06T13:04:12.584Z","etag":null,"topics":["accelerator","asic","dnn"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ucb-bar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-31T01:51:48.000Z","updated_at":"2025-04-06T07:30:22.000Z","dependencies_parsed_at":"2022-07-12T18:02:35.613Z","dependency_job_id":"cea823b9-ec03-49dc-84a7-806cd379d202","html_url":"https://github.com/ucb-bar/gemmini","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucb-bar%2Fgemmini","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucb-bar%2Fgemmini/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucb-bar%2Fgemmini/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucb-bar%2Fgemmini/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ucb-bar","download_url":"https://codeload.github.com/ucb-bar/gemmini
/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248741144,"owners_count":21154250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerator","asic","dnn"],"created_at":"2024-08-01T09:00:55.098Z","updated_at":"2025-05-14T20:03:12.503Z","avatar_url":"https://github.com/ucb-bar.png","language":"Scala","funding_links":[],"categories":["Accelerators"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg width=\"1000\" src=\"./img/full-logo.svg\"\u003e\n\u003c/p\u003e\n\nGemmini\n====================================\n\nThe Gemmini project is developing a full-system, full-stack DNN hardware exploration and evaluation platform.\nGemmini enables architects to make useful insights into how different components of the system and software stack (outside of just the accelerator itself) interact to affect overall DNN performance.\n\nGemmini is part of the [Chipyard](https://github.com/ucb-bar/chipyard) ecosystem, and was developed using the [Chisel](https://www.chisel-lang.org/) hardware description language.\n\nThis document is intended to provide information for beginners wanting to try out Gemmini, as well as more advanced in-depth information for those who might want to start hacking on Gemmini's source code.\n\n![Gemmini's high-level architecture](./img/gemmini-system.png)\n\nQuick Start\n==========\n\nWe provide here a quick guide to installing Gemmini's dependencies (Chipyard and Spike), building Gemmini hardware and software, and then running that software on our hardware 
simulators.\n\nDependencies\n---------\n\nBefore beginning, install the [Chipyard dependencies](https://chipyard.readthedocs.io/en/latest/Chipyard-Basics/Initial-Repo-Setup.html#default-requirements-installation).\n\nInstalling Chipyard and Spike\n-----------------------------\n\nRun these steps to install Chipyard and Spike (make sure to checkout the correct Chipyard and Spike commits as shown below):\n\n```shell\ngit clone https://github.com/ucb-bar/chipyard.git\ncd chipyard\n./build-setup.sh\n\nsource env.sh\n\ncd generators/gemmini\nmake -C software/libgemmini install\n```\n\n\nBuilding Gemmini Software\n-------------------------\n\nRun the steps below to compile Gemmini programs, including large DNN models like ResNet50, and small matrix-multiplication tests.\n\n```shell\ncd chipyard/generators/gemmini/software/gemmini-rocc-tests\n./build.sh\n```\n\nAfterwards, you'll find RISC-V binaries in `build/`, for \"baremetal\" environments, Linux environments, and \"proxy-kernel\" environments.\n\nLinux binaries are meant to be executed on SoCs that run Linux.\nThese binaries are dynamically linked, and support all syscalls.\nTypically, our users run them on [FireSim](https://fires.im/) simulators.\n\nBaremetal binaries are meant to be run in an environment without any operating system available.\nThey lack support for most syscalls, and do not support virtual memory either.\nOur users typically run them on cycle-accurate simulators like Verilator or VCS.\n\n\"Proxy-kernel\" binaries are meant to be run on a stripped down version of Linux, called the [\"RISC-V Proxy Kernel.\"](https://github.com/riscv-software-src/riscv-pk)\nThese binaries support virtual memory, and are typically run on cycle-accurate simulators like Verilator.\n\n**Warning:** Proxy-kernel binaries have limited heap space, so some Gemmini programs that work correctly in baremetal or Linux environments may fail on the proxy-kernel.\n\nBuilding Gemmini Hardware and Cycle-Accurate 
Simulators\n-----------------------------------------------\n\nRun the instructions below to build a cycle-accurate Gemmini simulator using Verilator.\n\n```shell\ncd chipyard/sims/verilator\nmake CONFIG=GemminiRocketConfig\n\n# Or, if you want a simulator that can generate waveforms, run this:\nmake debug CONFIG=GemminiRocketConfig\n```\n\nAfter running this, in addition to the cycle-accurate simulator, you will be able to find the Verilog description of your SoC in `generated-src/`.\n\nUsing Gemmini Functional Simulators\n---------------------------\n\nSpike typically runs _much_ faster than cycle-accurate simulators like Verilator or VCS.\nHowever, Spike can only verify functional correctness; it cannot give accurate performance metrics or profiling information.\n\nRun Simulators\n---------------\n\nRun the instructions below to run the Gemmini RISCV binaries that we built previously, using the simulators that we built above:\n\n```shell\ncd chipyard/sims/verilator\n\n# Run a large DNN workload in the functional simulator\nspike --extension=gemmini pk ../../generators/gemmini/software/gemmini-rocc-tests/build/imagenet/resnet50-pk\n\n# Run a small DNN workload in the functional simulator\nspike --extension=gemmini ../../generators/gemmini/software/gemmini-rocc-tests/build/imagenet/resnet50-baremetal\n\n# Run a smaller workload in baremetal mode, on a cycle-accurate simulator\nmake CONFIG=GemminiRocketConfig run-binary BINARY=../../generators/gemmini/software/gemmini-rocc-tests/build/bareMetalC/template-baremetal\n```\n\nNext steps\n--------\n\nCheck out our [MLSys 2022 tutorial](https://sites.google.com/berkeley.edu/gemmini-tutorial-mlsys-2022) (or our earlier but more out-of-date [IISWC 2021 tutorial](https://sites.google.com/berkeley.edu/gemminitutorialiiswc2021/)) to learn how to:\n* build different types of diverse accelerators using Gemmini.\n* add custom datatypes to Gemmini.\n* write your own Gemmini programs.\n* profile your workloads using Gemmini's 
performance counters.\n\nAlso, consider learning about [FireSim](https://fires.im/), a platform for FPGA-accelerated cycle-accurate simulation.\nWe use FireSim to run end-to-end DNN workloads that would take too long to run on Verilator/VCS.\nFireSim also allows users to check that their Gemmini hardware/software will work when running in a Linux environment.\n\nOr, continue reading the rest of this document for descriptions of Gemmini's architecture, ISA, and configuration parameters.\n\nArchitecture\n================\n\nGemmini is implemented as a RoCC accelerator with non-standard RISC-V custom instructions.\nThe Gemmini unit uses the RoCC port of a Rocket or BOOM _tile_, and by default connects to the memory system through the System Bus (i.e., directly to the L2 cache).\n\nAt the heart of the accelerator lies a systolic array which performs matrix multiplications.\nBy default, the systolic array supports both _output-stationary_ and _weight-stationary_ dataflows, which programmers can pick between at runtime.\nHowever, the dataflow can also be hardened at elaboration time.\n\nThe systolic array's inputs and outputs are stored in an explicitly managed scratchpad, made up of banked SRAMs.\nA DMA engine facilitates the transfer of data between main memory (which is visible to the host CPU) and the scratchpad.\n\nBecause weight-stationary dataflows require an accumulator outside the systolic array, we add a final SRAM bank, equipped with adder units, which can be conceptually considered an extension of the scratchpad memory space. The systolic array can store results to any address in the accumulator, and can also read new inputs from any address in the accumulator. 
The DMA engine can also transfer data directly between the accumulator and main memory, which is often necessary to load in biases.\n\nGemmini also includes peripheral circuitry to optionally apply activation functions such as ReLU or ReLU6, scale results down by powers-of-2 to support quantized workloads, or to transpose matrices before feeding them into the systolic array to support the output-stationary dataflow.\n\nGenerator Parameters\n--------------------------\n\nMajor parameters of interest include:\n\n* Systolic array dimensions (``tileRows``, ``tileColumns``, ``meshRows``, ``meshColumns``): The systolic array is composed of a 2-level hierarchy, in which each tile is fully combinational, while a mesh of tiles has pipeline registers between each tile.\n\n![Gemmini's systolic two-tiered hierarchy](./img/gemmini-systolic-array.png)\n\n* Dataflow parameters (``dataflow``): Determine whether the systolic array in Gemmini is output-stationary or weight-stationary, or whether it supports both dataflows so that programmers may choose between them at runtime.\n\n* Scratchpad and accumulator memory parameters (``sp_banks``, ``sp_capacity``, ``acc_capacity``): Determine the properties of the Gemmini scratchpad memory: overall capacity of the scratchpad or accumulators (in KiB), and the number of banks the scratchpad is divided into.\n\n* Type parameters (``inputType``, ``outputType``, ``accType``):\nDetermine the data-types flowing through different parts of a Gemmini accelerator.\nFor example, ``inputType`` may be an 8-bit fixed-point number, while ``accType``, which determines the type of partial accumulations in a matrix multiplication, may be a 32-bit integer.\n``outputType`` only determines the type of the data passed between two processing elements (PEs); for example, an 8-bit multiplication may produce a 16-bit result which must be shared between PEs in a systolic array.\n    - Examples of possible datatypes are:\n        - `SInt(8.W)` for a signed 8-bit 
integer\n        - `UInt(32.W)` for an unsigned 32-bit integer\n        - `Float(8, 24)` for a single-precision IEEE floating point number\n    - If your datatype is a floating-point number, then you might also want to change the ``pe_latency`` parameter, which specifies how many shift registers to add inside the PEs.\nThis might be necessary if your datatype cannot complete a multiply-accumulate operation within a single cycle.\n\n* Access-execute queue parameters (``ld_queue_length``, ``st_queue_length``, ``ex_queue_length``, ``rob_entries``): To implement access-execute decoupling, a Gemmini accelerator has a load instruction queue, a store instruction queue, and an execute instruction queue. The relative sizes of these queues determine the level of access-execute decoupling. Gemmini also implements a reorder buffer (ROB) - the number of entries in the ROB determines possible dependency management limitations.\n\n* DMA parameters (``dma_maxbytes``, ``dma_buswidth``, ``mem_pipeline``): Gemmini implements a DMA to move data from main memory to the Gemmini scratchpad, and from the Gemmini accumulators to main memory. The size of these DMA transactions is determined by the DMA parameters. 
These DMA parameters are tightly coupled with Rocket Chip SoC system parameters: in particular ``dma_buswidth`` is associated with the ``SystemBusKey`` ``beatBytes`` parameter, and ``dma_maxbytes`` is associated with the ``cacheblockbytes`` Rocket Chip parameter.\n\nThere are also optional features, which can be either enabled or left out of Gemmini at elaboration-time.\nFor example:\n\n* Scaling during \"move-in\" operations (``mvin_scale_args``, ``mvin_scale_acc_args``):\nWhen data is being moved in from DRAM or main memory into Gemmini's local scratchpad memory, it can optionally be multiplied by a scaling factor.\nThese parameters specify what the datatype of the scaling factor is, and how the scaling is actually done.\nIf these are set to ``None``, then this optional feature will be disabled at elaboration time.\nIf both the scratchpad inputs and the accumulator inputs are to be scaled in the same way, then the ``mvin_scale_shared`` parameter can be set to ``true`` so that the multipliers and functional units are shared.\n\nMajor Components\n----------------\n\nThis subsection is aimed towards those who wish to start hacking on Gemmini's RTL.\nHere, we briefly describe Gemmini's main hardware components, and how they fit together.\nIf you have no interest in changing Gemmini's hardware (besides just changing configuration parameters), then feel free to skip this section.\n\n### Decoupled Access/Execute\n\nGemmini is a decoupled access/execute architecture, which means that \"memory-access\" and \"execute\" instructions happen concurrently, in different regions of the hardware.\nWe divide the hardware broadly into three \"controllers\": one for \"execute\" instructions, another for \"load\" instructions, and a third for \"store\" instructions.\nEach of these controllers consumes direct ISA commands from the programmer, decodes these commands, and executes them, while sharing access to the scratchpad and accumulator SRAMs.\n\n* `ExecuteController`: This module is 
responsible for executing \"execute\"-type ISA commands, such as matrix multiplications.\nIt includes a systolic array for dot-products, and a transposer.\n\n* `LoadController`: This module is responsible for all instructions that move data from main memory into Gemmini's private scratchpad or accumulator.\n\n* `StoreController`: This module is responsible for all instructions that move data from Gemmini's private SRAMs into main memory.\nThis module is also responsible for \"max-pooling\" instructions, because Gemmini performs pooling when moving unpooled data from the private SRAMs into main memory.\n\n### Scratchpad and Accumulator\n\nGemmini stores inputs and outputs for the systolic array in a set of private SRAMs, which we call the \"scratchpad\" and the \"accumulator\".\nTypically, inputs are stored in the scratchpad, while partial sums and final results are stored in the accumulator.\n\nThe scratchpad and accumulator are both instantiated within `Scratchpad.scala`.\nThe scratchpad banks are implemented by the `ScratchpadBank` module, and the accumulator banks are implemented by the `AccumulatorMem` module.\n\nEach row of the scratchpad and accumulator SRAMs is `DIM` \"elements\" wide, where `DIM` is the number of PEs along the width of the systolic array.\nEach \"element\" represents a single scalar value that Gemmini operates upon.\n\nEach \"element\" in the scratchpad is of type `inputType` (which, in the default config, is an 8-bit integer).\nEach \"element\" in the accumulator is of type `accType` (which, in the default config, is a 32-bit integer).\n\nSo, for example, in the default config, which has a 16x16 systolic array, the scratchpad banks have a row-width of `16*bits(inputType)=128` bits, and the accumulator banks have a row-width of `16*bits(accType)=512` bits.\n\nBoth inputs and outputs to the scratchpad must be of type `inputType`. 
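
The row-width arithmetic above can be sketched in C as follows. This is a minimal illustration and not code from the Gemmini repository: the helper names `row_width_bits` and `rows_per_bank` are hypothetical, and the assumption that a bank's capacity is simply the total capacity divided evenly among the banks is ours.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: width in bits of one private-memory row,
 * i.e. DIM elements of elem_bits bits each. */
static size_t row_width_bits(size_t dim, size_t elem_bits) {
    assert(dim > 0 && elem_bits > 0);
    return dim * elem_bits;
}

/* Hypothetical helper: number of rows in each bank, assuming the total
 * capacity (given in KiB) is split evenly across all banks. */
static size_t rows_per_bank(size_t capacity_kib, size_t banks,
                            size_t dim, size_t elem_bits) {
    assert(banks > 0);
    size_t total_bits = capacity_kib * 1024 * 8;
    return total_bits / (banks * row_width_bits(dim, elem_bits));
}
```

With `DIM = 16`, `row_width_bits(16, 8)` gives the 128-bit scratchpad row and `row_width_bits(16, 32)` gives the 512-bit accumulator row quoted above. As an example with made-up capacity values, a 256 KiB scratchpad split into 4 banks would hold `rows_per_bank(256, 4, 16, 8)` rows per bank.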
\n\nBoth inputs and outputs from the accumulator can be either of type `accType` _or_ `inputType`.\nIf `inputType` values are input to the accumulator, they will be cast up to `accType`.\nIf `inputType` values are output from the accumulator, they will first be \"scaled\" down to be of type `inputType`.\nThe exact \"scaling\" function can be configured as the user wishes, but in the default config, the scaling function is a simple multiplication by a `float32` value that casts an `int32` down to an `int8`.\n\nThe scratchpad banks are very simple, comprising little more than an SRAM and a queue.\n\nThe accumulator banks are a bit more complex: in addition to the underlying SRAM, they also include a set of adders to support in-place accumulations.\nIn addition, they have a set of \"scalers\" (described above), and activation function units.\nThe scaling and activation functions are applied when the programmer wishes to transform `accType` values down to `inputType` values while reading data out of the accumulator.\nThis is typically done to transform the partial-sum outputs of one layer into the low-bitwidth quantized inputs of the next layer. 
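
As a rough software model of the default scaling step described above, the C sketch below multiplies an `int32` accumulator value by a `float32` factor and narrows the result to `int8`. The function names are hypothetical, and the rounding mode (half away from zero) and the saturation at the `int8` limits are assumptions made for illustration; the hardware's exact rounding behavior is not specified in this section.

```c
#include <stdint.h>

/* Hypothetical model of reading one scaled-down element out of the accumulator.
 * Assumptions: round half away from zero, then saturate to the int8 range. */
static int8_t scale_acc_to_input(int32_t acc, float scale) {
    float scaled = (float)acc * scale;
    /* Adding or subtracting 0.5 before truncation rounds half away from zero. */
    float rounded = scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f;
    if (rounded >= 128.0f) return INT8_MAX;   /* saturate high */
    if (rounded <= -129.0f) return INT8_MIN;  /* saturate low */
    return (int8_t)rounded;                   /* cast truncates toward zero */
}

/* ReLU applied to an already-narrowed element. */
static int8_t relu_i8(int8_t x) {
    return x < 0 ? 0 : x;
}
```

For instance, under these assumptions an accumulator value of 200 scaled by 0.25 becomes 50, while 1000 scaled by 0.25 saturates to 127. Whether the activation is applied before or after the scaling is not stated here, so the two helpers are kept separate.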
\n\n### Systolic Array and Transposer\n\n`MeshWithDelays`, which is instantiated within the `ExecuteController`, contains the systolic array (`Mesh`), a transposer (`Transposer`), and a set of delay registers which shift the inputs to the systolic array.\nThe `MeshWithDelays` module takes in three matrices one row at a time per cycle (`A`, `B`, and `D`), and outputs the result `C = A * B + D` one row at a time per cycle.\n\nIn the weight-stationary mode, the `B` values are \"preloaded\" into the systolic array, and `A` and `D` values are fed through.\nIn the output-stationary mode, the `D` values are \"preloaded\" into the systolic array, and `A` and `B` values are fed through.\n\n`A`, `B`, and `D` are all of type `inputType`, while `C` is of type `outputType`.\nIf the programmer wishes to write `C` into the scratchpad, then `C` is cast down to `inputType`.\nHowever, if the programmer instead wishes to write `C` into the accumulator, then `C` is cast up to `accType`.\n\nNote that in the weight-stationary mode, an `inputType` D usually has insufficient bitwidth to accurately represent partial sums.\nTherefore, in the weight-stationary mode, `D` is usually just the 0-matrix, while the `accType` accumulator SRAMs are used to accumulate partial sum outputs of the systolic array instead.\n\nThe inputs (`A`, `B`, and `D`) must be delayed with shift-registers so that each input from one matrix reaches the correct PE at exactly the right time to be multiplied-and-accumulated with the correct input from another matrix.\nThe diagram below shows an example of a 2x2 output-stationary matmul (ignoring `D`), with the appropriate delay registers at the inputs and outputs of the systolic array:\n\n![Systolic array with delay registers](./img/delay-registers.png)\n\nThe systolic array itself (implemented in `Mesh.scala`), is composed of a two-tier hierarchy of `Tiles` and `PEs`.\nThe `Mesh` is composed of a set of `Tiles`, separated by pipeline registers.\nEvery `Tile` is composed 
of a combinational set of `PEs`, where each PE performs a single matmul operation, with either the weight-stationary, or output-stationary dataflow.\n\n![Systolic array](./img/gemmini-systolic-array.png)\n\nThe `MeshWithDelays` module also includes a number of counters and configuration registers.\n`MeshWithDelays` assumes that every matmul operation will be exactly of size `DIM x DIM`, where `DIM` is the number of PEs across the width of the systolic array itself (16 in the default config).\nThese counters count up to `DIM`, and then update the configuration registers from the inputs to `MeshWithDelays`.\nThese configuration registers control which of `A` and `B` are to be transposed before being fed into the systolic array.\nThey also control whether the preloaded values in the systolic array are to be maintained for the next matmul, or whether they are to be overwritten and replaced.\n\nThe transposer itself is implemented as a very simple systolic array, which transports inputs from left-to-right for `DIM` cycles, and then down-to-up for another `DIM` cycles.\nThis is illustrated in the diagram below:\n\n![Transposer](./img/transposer.png)\n\nNote that for output-stationary matmuls, the transposer is used even when the programmer does not request a transposition.\nThis is because the systolic array expects inputs from the same row of `A` to enter the same PE in the output-stationary mode, but all values in a single row of `A` are stored within the same scratchpad SRAM row.\nTherefore, the rows have to be transposed after being read out of the scratchpad, so that elements on the same row can be fed into the same PE one-after-another, rather than being fed into adjacent PEs.\n\n### DMA\n\nGemmini includes two DMAs, one for reading data from main memory into Gemmini's private SRAMs, and another for moving data from Gemmini's private SRAMs into main memory.\nBoth these modules are implemented in `DMA.scala`.\n\nBoth DMAs operate on virtual addresses, and share 
access to a TLB to translate these into physical main memory addresses.\nIf the TLB misses, it transparently falls back to a PTW that is shared with Gemmini's host CPU.\n\nAfter physical addresses are obtained from Gemmini's private TLB, the DMAs break large memory requests up into smaller [TileLink](https://sifive.cdn.prismic.io/sifive%2Fcab05224-2df1-4af8-adee-8d9cba3378cd_tilelink-spec-1.8.0.pdf) read and write requests.\nTo satisfy the TileLink protocol, each memory request must be aligned to the number of bytes requested from/to main memory, and the size of each memory request (in bytes) must be a power of 2.\nThe DMAs generally attempt to minimize the number of TileLink requests as much as possible, even if this requires reading a larger total amount of data from main memory.\nEmpirically, we have found that an excessive number of TileLink requests can limit performance more than reading a small amount of extra data.\n\nThe DMAWriter, which writes data from private SRAMs to main memory, also includes a set of `\u003e` comparators that are used for max-pooling data during a memory-write operation.\n\n### ROB\n\nDue to Gemmini's decoupled access-execute architecture, instructions in the `LoadController`, `StoreController`, and `ExecuteController` may operate concurrently and out-of-order with respect to instructions in other controllers.\nGemmini includes an ROB which is meant to detect hazards between instructions in different controllers.\nThe instructions in the ROB are only issued to their respective controllers once they have no dependencies on instructions in other controllers.\n\nNote that instructions that are destined for the same controller are issued in-order.\nThe ROB does not check hazards between instructions within the same controller, because each controller is obligated to handle its own dependencies and hazards internally, assuming that it receives its own instructions in program-order.\n\n### Matmul and Conv Loop Unrollers\n\nGemmini's 
systolic array can only operate on matmuls that are up to `DIMxDIM` elements large.\nWhen performing matmuls and convolutions that are larger than this, programmers must tile their matmuls into a sequence of smaller `DIMxDIM` matmuls.\n\nHowever, tiling these operations efficiently can be difficult for programmers, due to CPU and loop overheads, and the difficulty of unrolling and pipelining software loops.\n\nTo alleviate this difficulty, Gemmini's ISA includes high-level CISC-type instructions, which automatically tile and unroll large matmuls and convolutions.\nThese are implemented in the `LoopMatmul` and `LoopConv` modules.\n\nThese modules are implemented as FSMs, which double-buffer matmul/conv tiles to maximize performance, and which monitor the proportion of load/store/execute instructions in the ROB to maximize overlap between memory accesses and dot-product computations.\nFor example, if the ROB is dominated by matmul instructions, without leaving any slots for incoming load instructions, then the FSMs will pause the issuing of matmul instructions to allow more space for concurrent load instructions in Gemmini's datapath.\n\nSoftware\n==========\n\nThe Gemmini ISA is specified in the `ISA` section below.\nThe ISA includes configuration instructions, data movement instructions (from main memory to/from Gemmini's private memory), and matrix multiplication execution instructions.\n\nSince Gemmini instructions are not exposed through the GNU binutils assembler, several C macros are provided in order to construct the instruction encodings to call these instructions.\n\nThe Gemmini generator includes a C library which wraps the calls to the custom Gemmini instructions into common DNN operators like matmuls, convolutions (with or without pooling), matrix-additions, etc.\nThe ``software`` directory of the generator includes the aforementioned library and macros, as well as baremetal tests, and some FireMarshal workloads to run the tests in a Linux environment. 
In particular, the C library can be found in the ``software/gemmini-rocc-tests/include/gemmini.h`` file.\n\nThe Gemmini generator generates a C header file based on the generator parameters. This header file gets compiled together with the C library to tune library performance. The generated header file can be found under ``software/gemmini-rocc-tests/include/gemmini_params.h``.\n\nGemmini can also be used to run ONNX-specified neural-networks through a port of Microsoft's ONNX-Runtime framework. The port is included as the [onnxruntime-riscv](https://github.com/pranav-prakash/onnxruntime-riscv) repository submoduled in the `software` directory.\nTo start using ONNX-Runtime, run `git submodule update --init --recursive software/onnxruntime-riscv`, and read the documentation [here](https://github.com/pranav-prakash/onnxruntime-riscv/blob/systolic/systolic_runner/docs).\n\n## Build and Run Gemmini Tests\n\nTo build the Gemmini tests:\n\n```shell\ncd software/gemmini-rocc-tests/\n./build.sh\n```\n\nAfterwards, the test binaries will be found in `software/gemmini-rocc-tests/build`.\nBinaries whose names end in `-baremetal` are meant to be run in a bare-metal environment, while binaries whose names end in `-linux` are meant to run in a Linux environment.\nYou can run the tests either on a cycle-accurate RTL simulator, or on a (much faster) functional ISA simulator called Spike.\n\nWe use a special extension of Spike, found [here](https://github.com/ucb-bar/libgemmini), which has support for Gemmini instructions.\nIf you are using Chipyard, you can easily build Spike by running `./scripts/build-toolchains.sh riscv-tools` from Chipyard's root directory, then by running `make -C software/libgemmini install` in the Gemmini directory.\nThen, to run the `mvin_mvout` test, which simply moves a matrix into Gemmini's scratchpad before moving it back out into main memory, run the following commands:\n\n```shell\ncd build/bareMetalC\nspike --extension=gemmini 
mvin_mvout-baremetal\n```\n\n## Writing Your Own Gemmini Tests\n`software/gemmini-rocc-tests/bareMetalC/template.c` is a template Gemmini test that you can base your own Gemmini tests off of. To write your own Gemmini test, run:\n\n```shell\ncd software/gemmini-rocc-tests/\ncp bareMetalC/template.c bareMetalC/my_test.c\n```\n\nThen, add `my_test` to the `tests` list at the top of `bareMetalC/Makefile`. Afterwards, running `./build.sh` will install `my_test-baremetal` in `build/bareMetalC`.\n\n## DNN Tests\n\nExample DNNs, such as ResNet50, can be found in `software/gemmini-rocc-tests/imagenet` and `software/gemmini-rocc-tests/mlps`.\nThese tests are built and run the same way as the other tests described above, but they typically take too long to run in a software simulator like VCS or Verilator.\nWe recommend instead that you run these tests through [FireSim](https://fires.im/), an FPGA-accelerated simulation platform, which will reduce your runtime from days to minutes.\n\nNote that the DNN tests rely upon our C library of common DNN operators (found in `gemmini.h`).\nThey call very few direct Gemmini ISA instructions, and mostly call the wrappers around them found in the C library.\n\n# Memory Addressing Scheme\n\nGemmini's private memory is \"row-addressed\", where each row is `DIM` elements wide, where `DIM` is the number of PEs across the width of the systolic array (16 in the default config).\nThese elements will be of type `inputType` in the scratchpad, and of type `accType` in the accumulator.\n\nEvery private Gemmini memory address is 32 bits long.\nThe three most significant bits are reserved, and have special meanings:\n* Bit 31 (the MSB) is 0 if we are addressing the scratchpad, and 1 if we are addressing the accumulator.\n* Bit 30 is ignored if we are addressing the scratchpad, or if we are reading from the accumulator. 
If, instead, we are writing to the accumulator, then bit 30 is 0 if we want to overwrite the data at that address, and 1 if we want to accumulate on top of the data already at that address.\n* Bit 29 is ignored if we are addressing the scratchpad, or if we are writing to the accumulator. If, instead, we are reading from the accumulator, then bit 29 is 0 if we want to read scaled-down `inputType` data from the accumulator, and 1 if we want to read `accType` data from the accumulator.\n    - If bit 29 is 1 for an accumulator read address, then we do not apply activation functions or scaling to the output of the accumulator.\n\nThe memory addressing scheme for a Gemmini config with a 2x2 systolic array is illustrated below:\n\n![Gemmini's memory addressing scheme](./img/memory-addressing.png)\n\nGemmini accesses main memory addresses (which are also visible to the CPU) through their software-visible virtual addresses.\nTranslation to physical addresses is handled by Gemmini, transparently to the programmer.\n\n# ISA\n\nThis section describes Gemmini's assembly-level ISA which is made up of custom RISC-V instructions.\n\n## Data Movement\n### `mvin` Move Data From Main Memory to Scratchpad\n**Format:** `mvin rs1, rs2`\n- `rs1` = virtual DRAM address (byte addressed) to load into scratchpad\n- `rs2[31:0]` = local scratchpad or accumulator address\n- `rs2[47:32]` = number of columns to load in\n- `rs2[63:48]` = number of rows to load in. 
Must be less than or equal to `DIM`.\n- `funct` = 2\n\n**Action:** Scratchpad[rs2] \u003c= DRAM[Translate[rs1]]\n- Loads a 2D matrix from main memory into Gemmini's private memory.\n- Load is sequential from the rs1/rs2 base addresses.\n- Main memory stride must be set by the `config_mvin` command.\n- If the number of columns we load in is greater than `DIM`, then multiple submatrices will be moved in.\nThe private-memory stride between these submatrices is set by the `config_mvin` command.\n\nThe figure below illustrates how the `mvin` command works:\n\n![Gemmini's mvin command](./img/mvin.png)\n\nIn addition, the figure below illustrates the special case where the number of columns moved-in is greater than `DIM`:\n\n![Gemmini's mvin command with many cols](./img/block-mvin.png)\n\n**Notes:**\n* There are actually **three** `mvin` instructions in Gemmini: `mvin`, `mvin2`, and `mvin3`.\n`mvin2` and `mvin3` are completely identical to `mvin`, except that they have their own independent set of configuration registers.\nWhen calling `config_mvin` (described below), the programmer can choose which `mvin` instruction they want to configure.\n* The reason we have three `mvin` instructions is so that the programmer can overlap loads for A, B, and D matrices (for a `A*B+D` matmul), where A, B, and D may all have different main-memory-strides. \n\n### `mvout` Move Data from Scratchpad to L2/DRAM\n**Format:** `mvout rs1, rs2`\n- `rs1` = virtual DRAM address (byte addressed) to write to from scratchpad\n- `rs2[31:0]` = local scratchpad address\n- `rs2[47:32]` = number of columns to store\n- `rs2[63:48]` = number of rows to store\n- `funct` = 3\n\n**Action:** DRAM[Translate[rs1]] \u003c= Scratchpad[rs2]\n- Stores a 2D matrix from the scratchpad to main-memory\n- Store is sequential from the rs1/rs2 base addresses. 
Stride must be set by the `config_mvout` command.\n\n## Configuration\n### `config_ex` configures the Execute pipeline\n**Format:** `config_ex rs1 rs2`\n- `rs1[1:0]` must be `00`\n- `rs1[2]` selects the dataflow: output-stationary (0) or weight-stationary (1)\n- `rs1[3]` = activation function: either relu (1) or no activation function (0)\n- `rs1[8]` = should A be transposed?\n- `rs1[9]` = should B be transposed?\n- `rs1[31:16]` = the stride (in scratchpad addresses) by which the rows of A are fed into the systolic array.\n\"A\" in this context refers to the left-hand matrix A in the matmul represented by A * B = C.\nIf this stride is 1, then we feed consecutive rows in the scratchpad, beginning at the starting address of A, into the systolic array as the A matrix.\nIf the stride is 2, then we feed every other row into the systolic array instead.\n- `rs1[63:32]` = the scalar value by which we scale the `accType` output of the accumulator down to `inputType` values when reading from the accumulator.\n    - In the default config, `rs1[63:32]` is of type `float32`\n- `rs2[31:0]` = the number of bits by which the accumulated result of a matmul is right-shifted when leaving the systolic array\n    - This parameter is only relevant in output-stationary mode, when partial sums must be accumulated within the systolic array itself, and scaled down when leaving the systolic array and being written into the scratchpad.\n- `funct` = 0\n\n**Action:** mode \u003c= rs1[2]; shift \u003c= rs2; A_stride \u003c= rs1[31:16]\n\n**Notes:**\n- As of now, certain combinations of transpose options cannot be performed unless the right dataflow is chosen.\nThis limitation may be lifted in the future.\n\n| Dataflow | Transpose A | Transpose B | Permitted? 
|\n| :---: | :---: | :---: | :---: | \n| OS | No | No | Yes |\n| OS | No | Yes | No |\n| OS | Yes | No | Yes |\n| OS | Yes | Yes | Yes |\n| WS | No | No | Yes |\n| WS | No | Yes | Yes |\n| WS | Yes | No | Yes |\n| WS | Yes | Yes | No |\n\n### `config_mvin` configures the Load pipeline\n**Format:** `config_mvin rs1 rs2`\n- `rs1[1:0]` must be `01`\n- `rs1[2]` is 0 if `mvin`s to the accumulator are of type `accType`, and 1 if they are `inputType`\n- `rs1[4:3]` is 0 if the stride is being set for `mvin`, 1 if the stride is being set for `mvin2`, and 2 if the stride is being set for `mvin3`\n- `rs1[31:16]` is the scratchpad-memory stride (also called the \"private-memory stride\" above)\n- `rs1[63:32]` is the \"scale\" by which to multiply data as it's being moved in to the scratchpad. This is ignored if Gemmini isn't configured to have the ability to scale values during `mvin`s.\n- `rs2` is the main-memory stride in bytes\n- `funct` = 0\n\n**Action:** stride \u003c= rs2; scale \u003c= rs1[63:32]\n\n### `config_mvout` configures the Store pipeline\n**Format:** `config_mvout rs1 rs2`\n- `rs1[1:0]` must be `10`\n- `rs2` = the stride in bytes \n- `funct` = 0\n\nDuring `mvout` operations, Gemmini can also perform max-pooling.\n**This is an experimental feature, and is subject to change.**\nThis feature assumes that data is stored in the scratchpad or accumulator in NHWC format.\nThe parameters controlling this feature are:\n\n- `rs1[5:4]` = max-pooling stride. 
If this is 0, then max-pooling is deactivated.\n- `rs1[7:6]` = max-pooling window size\n- `rs1[9:8]` = upper zero-padding\n- `rs1[11:10]` = left zero-padding\n- `rs1[31:24]` = output dimension of image after pooling\n- `rs1[39:32]` = number of pooled rows to output\n- `rs1[47:40]` = number of pooled columns to output\n- `rs1[55:48]` = number of unpooled rows to pool\n- `rs1[63:56]` = number of unpooled columns to pool\n\n**Action:** stride \u003c= rs2; max-pooling parameters \u003c= rs1\n\n### `config_norm` configures normalization commands\n**Format:** `config_norm rs1 rs2`\n\n`config_norm` is an **experimental** command added primarily to support an integer-only variant of BERT called [I-BERT](https://arxiv.org/abs/2101.01321) on Gemmini.\nThe command allows users to set scalar constants that are used by I-BERT's GELU, layernorm, and softmax variants.\n\n### `flush` flushes the TLB\n**Format:** `flush rs1`\n- `rs1` = If `rs1[0]` is 1, then the current TLB request is skipped (if it has hit a page-fault and is waiting for an interrupt).\nOtherwise, the current TLB request is repeated.\n\n**Notes:**\n\n- This instruction executes _as soon as it is received_ without waiting for other instructions which may be queued up.\nIt is the programmer's responsibility to insert fences if necessary.\n\n## Core Matmul Sequences\nEvery single matrix multiply operation is a combination of `matmul.preload` and `matmul.compute` (due to the length of a single instruction, it was split into two instructions).\n`matmul.preload` should precede the `matmul.compute`.\n\nExample:\n```\n//// OS matmul example ////\n// rs1 = InputD\n// rs2 = OutputC\n// rs3 = InputA\n// rs4 = InputB\n// matmul InputA InputB OutputC InputD\n1. matmul.preload $rs1 $rs2\n2. 
matmul.compute $rs3 $rs4\n```\n**Action:** Scratchpad[rs2] \u003c= Scratchpad[rs3] \\* Scratchpad[rs4] + Scratchpad[rs1]\n\n**Notes on addressing:**\n- For B or D, the address can be replaced with all high bits to input a 0 matrix instead.\n- For A, the address can be replaced with all high bits to input a matrix with undefined garbage data instead.\n\n### Preloading\n**Format:** `matmul.preload rs1, rs2`\n- `rs1[31:0]` = local scratchpad address of D matrix (when output-stationary), or B matrix (when weight-stationary)\n- `rs1[47:32]` = number of columns of D/B matrix\n- `rs1[63:48]` = number of rows of D/B matrix\n- `rs2[31:0]` = local scratchpad address of C matrix.\nIf this is set to all high bits, then C will not be written to the scratchpad or accumulator.\n- `rs2[47:32]` = number of columns of C matrix\n- `rs2[63:48]` = number of rows of C matrix\n- `funct` = 6\n\n**Commit Behavior:** This instruction commits on the cycle after the systolic array receives it. The systolic array remains idle until the subsequent OS/WS specific instructions are seen.\n\n### Computing\n#### Explicitly Preloaded\n**Format:** `matmul.compute.preloaded rs1, rs2`\n- `rs1[31:0]` = local scratchpad address (systolic array single-axis addressed) of A matrix\n- `rs1[47:32]` = number of columns of A matrix\n- `rs1[63:48]` = number of rows of A matrix\n- `rs2[31:0]` = local scratchpad address (systolic array single-axis addressed) of B matrix (when output-stationary), or D matrix (when weight-stationary)\n- `rs2[47:32]` = number of columns of B/D matrix\n- `rs2[63:48]` = number of rows of B/D matrix\n- `funct` = 4\n- This instruction will compute on the value preloaded (D if output-stationary, or B if weight-stationary)\n\n#### Re-use Previous Preloads\n**Format:** `matmul.compute.accumulated rs1, rs2`\n- `funct` = 5\n- `rs1` and `rs2` have the same encoding as the `matmul.compute.preloaded` encoding\n- If output-stationary, this instruction will compute on the previously computed result 
(C) in the systolic array, accumulating on top of it\n- If weight-stationary, this instruction will compute on the previously preloaded weights (B) in the systolic array\n\n## Loop Instructions\n\nGemmini includes CISC-type instructions which can perform matmuls and convolutions on data that is much larger than `DIMxDIM`.\n\nThere's nothing these CISC instructions do which a programmer couldn't do by tiling and looping through the other ISA instructions described above;\nhowever, these CISC instructions may achieve higher throughput than such tiled loops written by non-expert programmers.\nThe CISC instructions should be considered performance enhancers; they do not give the accelerator any new functionality that it wouldn't have otherwise.\n\nThe CISC instructions have too many operands to fit into a single RISC-V custom instruction.\nTherefore, they are implemented as a sequence of many RISC-V custom instructions which must be called consecutively by the programmer.\n\nThese instructions can be found in `software/gemmini-rocc-tests/include/gemmini.h`, together with example usages.\nTheir arguments are listed below.\n\n**These loop instructions are experimental and subject to change.**\n\n### `gemmini_loop_ws` Matmul Loop (WS Dataflow)\n\nThis instruction calculates `A * B + D = C`, but `A`, `B`, `D`, and `C` can all be larger than `DIMxDIM`.\n`A` and `B` must be of type `inputType`, but both `D` and `C` can be _either_ `inputType` or `accType`.\n\nThe sizes of these matrices are represented by `I`, `J`, and `K`:\n\n```\nscratchpad rows of A = I * K * DIM\nscratchpad rows of B = K * J * DIM\naccumulator rows of D = I * J * DIM\naccumulator rows of C = I * J * DIM\n```\n\nHowever, the total number of scratchpad rows taken up by a single `gemmini_loop_ws` must be at most **half** of the total scratchpad size, because Gemmini performs double-buffering during CISC instructions.\nTo compute larger matrix multiplies, the loop instructions must also be tiled within an outer 
loop.\n\nTo support outer-tiling of the `gemmini_loop_ws` instruction, we include an argument called `ex_accumulate`, which determines whether to perform a matmul on top of the partial sums that already exist within the accumulator (from previous calls to `gemmini_loop_ws` within the same outer-loop).\n\n### `gemmini_loop_conv_ws` Conv Loop (WS Dataflow)\n\nGemmini also includes a CISC instruction for convolutions, implemented similarly to the matmul CISC instruction.\n`gemmini_loop_conv_ws` will perform a convolution with the WS dataflow, and also supports features such as max-pooling, transpose convolutions, and various preprocessing transformations on the weight and input data.\n\nLike `gemmini_loop_ws`, the inputs to a single `gemmini_loop_conv_ws` call must fit within half of Gemmini's private memory, to support double-buffering.\nIf the programmer would like to perform larger convolutions, they must tile and wrap `gemmini_loop_conv_ws` within an outer-loop.\n\n# Citing Gemmini\nIf Gemmini helps you in your academic research, you are encouraged to cite our paper. Here is an example bibtex:\n```\n@INPROCEEDINGS{gemmini-dac,\n  author={Genc, Hasan and Kim, Seah and Amid, Alon and Haj-Ali, Ameer and Iyer, Vighnesh and Prakash, Pranav and Zhao, Jerry and Grubb, Daniel and Liew, Harrison and Mao, Howard and Ou, Albert and Schmidt, Colin and Steffl, Samuel and Wright, John and Stoica, Ion and Ragan-Kelley, Jonathan and Asanovic, Krste and Nikolic, Borivoje and Shao, Yakun Sophia},\n  booktitle={Proceedings of the 58th Annual Design Automation Conference (DAC)}, \n  title={Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration}, \n  year={2021},\n  volume={},\n  number={},\n  pages={}\n}\n```\n\n# Acknowledgements\n\n- This project was, in part, funded by the U.S. Government under the DARPA RTML program (contract FA8650-20-2-7006). 
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.\n- The Gemmini [logo](./img/full-logo.svg) was designed by Dima Nikiforov ([@CobbledSteel](https://github.com/CobbledSteel)).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucb-bar%2Fgemmini","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fucb-bar%2Fgemmini","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucb-bar%2Fgemmini/lists"}