{"id":13774820,"url":"https://github.com/OpenMachine-ai/tinyfive","last_synced_at":"2025-05-11T07:30:37.221Z","repository":{"id":63932720,"uuid":"570009086","full_name":"OpenMachine-ai/tinyfive","owner":"OpenMachine-ai","description":"TinyFive is a lightweight RISC-V emulator and assembler written in Python with neural network examples","archived":false,"fork":false,"pushed_at":"2023-11-01T23:04:08.000Z","size":364,"stargazers_count":58,"open_issues_count":1,"forks_count":8,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-15T15:01:59.618Z","etag":null,"topics":["ai","assembler","assembly","compiler","machine-learning","ml","risc-v","risc-v-32-simulation","risc-v-simulator","riscv","riscv-asm","riscv-assembler","riscv-assembly","riscv-emulator","riscv-simulator","riscv32"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenMachine-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-11-24T06:00:35.000Z","updated_at":"2025-04-14T15:55:57.000Z","dependencies_parsed_at":"2024-01-17T13:11:52.112Z","dependency_job_id":"26b5d91e-1787-4324-90ac-b5693ac63f68","html_url":"https://github.com/OpenMachine-ai/tinyfive","commit_stats":{"total_commits":115,"total_committers":5,"mean_commits":23.0,"dds":0.4695652173913043,"last_synced_commit":"956cb1b71b884a5dfbdffe385ac2cf5d8bc437c1"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMachine-ai%2Ftinyfive","tags_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/OpenMachine-ai%2Ftinyfive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMachine-ai%2Ftinyfive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMachine-ai%2Ftinyfive/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenMachine-ai","download_url":"https://codeload.github.com/OpenMachine-ai/tinyfive/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253532981,"owners_count":21923340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","assembler","assembly","compiler","machine-learning","ml","risc-v","risc-v-32-simulation","risc-v-simulator","riscv","riscv-asm","riscv-assembler","riscv-assembly","riscv-emulator","riscv-simulator","riscv32"],"created_at":"2024-08-03T17:01:30.678Z","updated_at":"2025-05-11T07:30:36.964Z","avatar_url":"https://github.com/OpenMachine-ai.png","language":"Python","funding_links":[],"categories":["Electronics Simulators","硬件_其他"],"sub_categories":["网络服务_其他"],"readme":"# TinyFive\n\n\u003ca href=\"https://colab.research.google.com/github/OpenMachine-ai/tinyfive/blob/main/misc/colab.ipynb\"\u003e \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Colab\" height=\"20\"\u003e \u003c/a\u003e\n[![Downloads](https://static.pepy.tech/badge/tinyfive)](https://pepy.tech/project/tinyfive)\n\n\u003c!--- view counter is currently commented 
out\n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FOpenMachine-ai%2Ftinyfive\u0026title_bg=%23555555\u0026icon=\u0026title=views+%28today+%2F+total%29\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n ---\u003e\n\nTinyFive is a lightweight RISC-V emulator and assembler written entirely in Python:\n- TinyFive brings the power of Python and NumPy to assembly code.\n- Useful for running neural networks on RISC-V: Simulate your RISC-V assembly code along with a neural network in Keras or PyTorch (and without relying on RISC-V toolchains).\n- Custom instructions can be added for easy HW/SW codesign in Python (without C++ and compiler toolchains).\n- If you want to learn how RISC-V works, TinyFive lets you play with instructions and assembly code in [this colab](https://colab.research.google.com/github/OpenMachine-ai/tinyfive/blob/main/misc/colab.ipynb).\n- TinyFive might also be useful for ML scientists who are using ML/RL for compiler optimizations (see e.g. 
[CompilerGym](https://github.com/facebookresearch/CompilerGym/blob/development/README.md)) or to replace compiler toolchains with AI.\n- Can be very fast if you only use the upper-case instructions defined in the [first ~200 lines of machine.py](machine.py#L1-L200).\n- [Fewer than 1000 lines](machine.py) of code (w/o tests and examples)\n- Uses NumPy for math\n\n## Contents\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Example 1: Multiply two numbers](#example-1-multiply-two-numbers)\n  - [Example 2: Add two vectors](#example-2-add-two-vectors)\n  - [Example 3: Multiply two matrices](#example-3-multiply-two-matrices)\n  - [Example 4: Neural network layers](#example-4-neural-network-layers)\n  - [Example 5: MobileNet](#example-5-mobilenet)\n- [Running in colab](#running-in-colab)\n- [Running without package](#running-without-package)\n- [Contribute](#contribute)\n- [Latest status](#latest-status)\n- [Speed](#speed)\n- [Comparison](#comparison)\n- [References](#references)\n- [Tiny Tech promise](#tiny-tech-promise)\n\n## Installation\n```\npip install tinyfive\n```\n\n## Usage\nTinyFive can be used in the following three ways:\n- **Option A:** Use upper-case instructions such as `ADD()` and `MUL()`, see examples 1.1, 1.2, 2.1, and 3.1 below.\n- **Option B:** Use `asm()` and `exe()` functions without branch instructions, see examples 1.3 and 2.2 below.\n- **Option C:** Use `asm()` and `exe()` functions with branch instructions, see examples 2.3, 3.2, and 3.3 below.\n\nFor the examples below, import and instantiate a RISC-V machine with at least 4KB of memory as follows:\n```python\nfrom tinyfive.machine import machine\nimport numpy as np\nm = machine(mem_size=4000)  # instantiate RISC-V machine with 4KB of memory\n```\n\n### Example 1: Multiply two numbers\n**Example 1.1:** Use upper-case instructions (option A) with back-door loading of registers.\n```python\nm.x[11] = 6        # manually load '6' into register x[11]\nm.x[12] = 7        # manually load '7' into register 
x[12]\nm.MUL(10, 11, 12)  # x[10] := x[11] * x[12]\nprint(m.x[10])\n# Output: 42\n```\n**Example 1.2:** Same as example 1.1, but now load the data from memory. Specifically, the data values are stored at addresses 0 and 4. Here, each value is 32 bits wide (i.e. 4 bytes wide), which occupies 4 addresses in the byte-wide memory.\n```python\nm.write_i32(6, 0)  # manually write '6' into mem[0] (memory @ address 0)\nm.write_i32(7, 4)  # manually write '7' into mem[4] (memory @ address 4)\nm.LW (11, 0,  0)   # load register x[11] from mem[0 + 0]\nm.LW (12, 4,  0)   # load register x[12] from mem[4 + 0]\nm.MUL(10, 11, 12)  # x[10] := x[11] * x[12]\nprint(m.x[10])\n# Output: 42\n```\n**Example 1.3:** Same as example 1.2, but now use `asm()` and `exe()` (option B). The assembler function `asm()` takes an instruction, converts it into machine code, and stores it in memory at address `m.pc`. Once the entire assembly program is written into memory `mem[]`, the `exe()` function (aka ISS) can execute the machine code stored in memory.\n```python\nm.write_i32(6, 0)  # manually write '6' into mem[0] (memory @ address 0)\nm.write_i32(7, 4)  # manually write '7' into mem[4] (memory @ address 4)\n\n# store assembly program in mem[] starting at address 4*20\nm.pc = 4*20\nm.asm('lw',  11, 0,  0)   # load register x[11] from mem[0 + 0]\nm.asm('lw',  12, 4,  0)   # load register x[12] from mem[4 + 0]\nm.asm('mul', 10, 11, 12)  # x[10] := x[11] * x[12]\n\n# execute program from address 4*20: execute 3 instructions and then stop\nm.exe(start=4*20, instructions=3)\nprint(m.x[10])\n# Output: 42\n```\n\n### Example 2: Add two vectors\nWe are using the following memory map for adding two 8-element vectors `res[] := a[] + b[]`, where each vector element is 32 bits wide (i.e. each element occupies 4 byte-addresses in memory).\n| Byte address | Contents |\n| ------------ | -------- |\n|  0   .. 4\\*7   | a-vector: `a[0]` is at address 0, `a[7]` is at address 4\\*7 |\n| 4\\*8  .. 
4\\*15 | b-vector: `b[0]` is at address 4\\*8, `b[7]` is at address 4\\*15 |\n| 4\\*16 .. 4\\*23 | result-vector: `res[0]` is at address 4\\*16, `res[7]` is at address 4\\*23 |\n\n**Example 2.1:** Use upper-case instructions (option A) with Python for-loop.\n```python\n# generate 8-element vectors a[] and b[] and store them in memory\na = np.random.randint(100, size=8)\nb = np.random.randint(100, size=8)\nm.write_i32_vec(a, 0)    # write vector a[] to mem[0]\nm.write_i32_vec(b, 4*8)  # write vector b[] to mem[4*8]\n\n# pseudo-assembly for adding vectors a[] and b[] using Python for-loop\nfor i in range(8):\n  m.LW (11, 4*i,      0)   # load x[11] with a[i] from mem[4*i + 0]\n  m.LW (12, 4*(i+8),  0)   # load x[12] with b[i] from mem[4*(i+8) + 0]\n  m.ADD(10, 11,       12)  # x[10] := x[11] + x[12]\n  m.SW (10, 4*(i+16), 0)   # store results in mem[], starting at address 4*16\n\n# compare results against golden reference\nres = m.read_i32_vec(4*16, size=8)  # read result vector from address 4*16\nref = a + b                         # golden reference: simply add a[] + b[]\nprint(res - ref)                    # print difference (should be all-zero)\n# Output: [0 0 0 0 0 0 0 0]\n```\n**Example 2.2**: Same as example 2.1, but now use `asm()` and `exe()` functions without branch instructions (option B).\n```python\n# generate 8-element vectors a[] and b[] and store them in memory\na = np.random.randint(100, size=8)\nb = np.random.randint(100, size=8)\nm.write_i32_vec(a, 0)    # write vector a[] to mem[0]\nm.write_i32_vec(b, 4*8)  # write vector b[] to mem[4*8]\n\n# store assembly program in mem[] starting at address 4*48\nm.pc = 4*48\nfor i in range(8):\n  m.asm('lw',  11, 4*i,      0)   # load x[11] with a[i] from mem[4*i + 0]\n  m.asm('lw',  12, 4*(i+8),  0)   # load x[12] with b[i] from mem[4*(i+8) + 0]\n  m.asm('add', 10, 11,       12)  # x[10] := x[11] + x[12]\n  m.asm('sw',  10, 4*(i+16), 0)   # store results in mem[], starting at address 4*16\n\n# execute program 
from address 4*48: execute 8*4 instructions and then stop\nm.exe(start=4*48, instructions=8*4)\n\n# compare results against golden reference\nres = m.read_i32_vec(4*16, size=8)  # read result vector from address 4*16\nref = a + b                         # golden reference: simply add a[] + b[]\nprint(res - ref)                    # print difference (should be all-zero)\n# Output: [0 0 0 0 0 0 0 0]\n```\n**Example 2.3:** Same as example 2.2, but now use `asm()` and `exe()` functions with branch instructions (option C). The `lbl()` function defines labels, which are symbolic names that represent memory addresses. These labels improve the readability of branch instructions and mark the start and end of the assembly code executed by the `exe()` function.\n```python\n# generate 8-element vectors a[] and b[] and store them in memory\na = np.random.randint(100, size=8)\nb = np.random.randint(100, size=8)\nm.write_i32_vec(a, 0)    # write vector a[] to mem[0]\nm.write_i32_vec(b, 4*8)  # write vector b[] to mem[4*8]\n\n# store assembly program starting at address 4*48\nm.pc = 4*48\n# x[13] is the loop-variable that is incremented by 4: 0, 4, .., 28\n# x[14] is the constant 28+4 = 32 for detecting the end of the for-loop\nm.lbl('start')                 # define label 'start'\nm.asm('add',  13, 0, 0)        # x[13] := x[0] + x[0] = 0 (because x[0] is always 0)\nm.asm('addi', 14, 0, 32)       # x[14] := x[0] + 32 = 32 (because x[0] is always 0)\nm.lbl('loop')                  # label 'loop'\nm.asm('lw',   11, 0,    13)    # load x[11] with a[] from mem[0 + x[13]]\nm.asm('lw',   12, 4*8,  13)    # load x[12] with b[] from mem[4*8 + x[13]]\nm.asm('add',  10, 11,   12)    # x[10] := x[11] + x[12]\nm.asm('sw',   10, 4*16, 13)    # store x[10] in mem[4*16 + x[13]]\nm.asm('addi', 13, 13,   4)     # x[13] := x[13] + 4 (increment x[13] by 4)\nm.asm('bne',  13, 14, 'loop')  # branch to 'loop' if x[13] != x[14]\nm.lbl('end')                   # label 'end'\n\n# execute program: start at 
label 'start', stop when label 'end' is reached\nm.exe(start='start', end='end')\n\n# compare results against golden reference\nres = m.read_i32_vec(4*16, size=8)  # read result vector from address 4*16\nref = a + b                         # golden reference: simply add a[] + b[]\nprint(res - ref)                    # print difference (should be all-zero)\n# Output: [0 0 0 0 0 0 0 0]\n```\nA slightly more efficient implementation would decrement the loop variable `x[13]` (instead of incrementing) so that the branch instruction compares against `x[0] = 0` (instead of the constant stored in `x[14]`), which frees up register `x[14]` and reduces the total number of instructions by 1.\n\nUse `print_perf()` to analyze performance and `dump_state()` to print out the current values of the register files and the program counter (PC) as follows:\n```python\n\u003e\u003e\u003e m.print_perf()\nOps counters: {'total': 50, 'load': 16, 'store': 8, 'mul': 0, 'add': 18, 'madd': 0, 'branch': 8}\nx[] regfile : 5 out of 31 x-registers are used\nf[] regfile : 0 out of 32 f-registers are used\nImage size  : 32 Bytes\n\n\u003e\u003e\u003e m.dump_state()\npc   :  224\nx[ 0]:    0, x[ 1]:    0, x[ 2]:    0, x[ 3]:    0\nx[ 4]:    0, x[ 5]:    0, x[ 6]:    0, x[ 7]:    0\nx[ 8]:    0, x[ 9]:    0, x[10]:   34, x[11]:   27\nx[12]:    7, x[13]:   32, x[14]:   32, x[15]:    0\nx[16]:    0, x[17]:    0, x[18]:    0, x[19]:    0\nx[20]:    0, x[21]:    0, x[22]:    0, x[23]:    0\nx[24]:    0, x[25]:    0, x[26]:    0, x[27]:    0\nx[28]:    0, x[29]:    0, x[30]:    0, x[31]:    0\n```\n\n### Example 3: Multiply two matrices\nWe are using the following memory map for multiplying two 4x4 matrices as `res := np.matmul(A, B)`, where each matrix element is 32 bits wide (i.e. each element occupies 4 byte-addresses in memory).\n| Byte address | Contents |\n| ------------ | -------- |\n|  0    .. 4\\*15 | A-matrix in row-major order: `A[0, 0], A[0, 1], ... A[3, 3]` |\n| 4\\*16 .. 
4\\*31 | B-matrix in row-major order: `B[i, j]` is at address `4*(16+i*4+j)` |\n| 4\\*32 .. 4\\*47 | result matrix `res[0, 0] ... res[3, 3]` |\n\n**Example 3.1:** Use upper-case instructions (option A) with Python for-loop.\n```python\n# generate 4x4 matrices A and B and store them in memory\nA = np.random.randint(100, size=(4, 4))\nB = np.random.randint(100, size=(4, 4))\nm.write_i32_vec(A.flatten(), 0)     # write matrix A to mem[0]\nm.write_i32_vec(B.flatten(), 4*16)  # write matrix B to mem[4*16]\n\n# pseudo-assembly for matmul(A, B) using Python for-loops\nfor i in range(4):\n  # load x[10] ... x[13] with row i of A\n  for k in range(4):\n    m.LW (10+k, 4*(4*i+k), 0)  # load x[10+k] with A[i, k]\n\n  for j in range(4):\n    # calculate dot product\n    m.LW (18, 4*(16+j), 0)        # load x[18] with B[0, j]\n    m.MUL(19, 10, 18)             # x[19] := x[10] * x[18] = A[i, 0] * B[0, j]\n    for k in range(1, 4):\n      m.LW (18, 4*(16+4*k+j), 0)  # load x[18] with B[k, j]\n      m.MUL(18, 10+k, 18)         # x[18] := x[10+k] * x[18] = A[i, k] * B[k, j]\n      m.ADD(19, 19, 18)           # x[19] := x[19] + x[18]\n    m.SW (19, 4*(32+i*4+j), 0)    # store res[i, j] from x[19]\n\n# compare results against golden reference\nres = m.read_i32_vec(4*32, size=4*4).reshape(4, 4)  # read result matrix\nref = np.matmul(A, B)            # golden reference\nprint(np.array_equal(res, ref))  # should return 'True'\n# Output: True\n```\n**Example 3.2:** Same as example 3.1, but now use `asm()` and `exe()` functions with branch instructions (option C).\n```python\n# generate 4x4 matrices A and B and store them in memory\nA = np.random.randint(100, size=(4, 4))\nB = np.random.randint(100, size=(4, 4))\nm.write_i32_vec(A.flatten(), 0)     # write matrix A to mem[0]\nm.write_i32_vec(B.flatten(), 4*16)  # write matrix B to mem[4*16]\n\n# store assembly program starting at address 4*128\nm.pc = 4*128\n# here, we decrement the loop variables down to 0 so that we don't need an\n# 
additional register to hold the constant for detecting the end of the loop:\n#  - x[20] is 4*4*i (i.e. the outer-loop variable) and is decremented by 16 from 64\n#  - x[21] is 4*j (i.e. the inner-loop variable) and is decremented by 4 from 16\nm.lbl('start')\nm.asm('addi', 20, 0, 64)          # x[20] := 0 + 64\n\nm.lbl('outer-loop')\nm.asm('addi', 20, 20, -16)        # decrement loop-variable: x[20] := x[20] - 16\nm.asm('lw',   10, 0,   20)        # load x[10] with A[i, 0] from mem[0 + x[20]]\nm.asm('lw',   11, 4,   20)        # load x[11] with A[i, 1] from mem[4 + x[20]]\nm.asm('lw',   12, 2*4, 20)        # load x[12] with A[i, 2] from mem[2*4 + x[20]]\nm.asm('lw',   13, 3*4, 20)        # load x[13] with A[i, 3] from mem[3*4 + x[20]]\nm.asm('addi', 21, 0, 16)          # reset loop-variable j: x[21] := 0 + 16\n\nm.lbl('inner-loop')\nm.asm('addi', 21, 21, -4)         # decrement j: x[21] := x[21] - 4\n\nm.asm('lw',  18, 4*16, 21)        # load x[18] with B[0, j] from mem[4*16 + x[21]]\nm.asm('mul', 19, 10, 18)          # x[19] := x[10] * x[18] = A[i, 0] * B[0, j]\n\nm.asm('lw',  18, 4*(16+4), 21)    # load x[18] with B[1, j]\nm.asm('mul', 18, 11, 18)          # x[18] := x[11] * x[18] = A[i, 1] * B[1, j]\nm.asm('add', 19, 19, 18)          # x[19] := x[19] + x[18]\n\nm.asm('lw',  18, 4*(16+2*4), 21)  # load x[18] with B[2, j]\nm.asm('mul', 18, 12, 18)          # x[18] := x[12] * x[18] = A[i, 2] * B[2, j]\nm.asm('add', 19, 19, 18)          # x[19] := x[19] + x[18]\n\nm.asm('lw',  18, 4*(16+3*4), 21)  # load x[18] with B[3, j]\nm.asm('mul', 18, 13, 18)          # x[18] := x[13] * x[18] = A[i, 3] * B[3, j]\nm.asm('add', 19, 19, 18)          # x[19] := x[19] + x[18]\n\nm.asm('add', 24, 20, 21)          # calculate base address for result-matrix\nm.asm('sw',  19, 4*32, 24)        # store res[i, j] from x[19]\n\nm.asm('bne', 21, 0, 'inner-loop') # branch to 'inner-loop' if x[21] != 0\nm.asm('bne', 20, 0, 'outer-loop') # branch to 'outer-loop' if x[20] != 
0\nm.lbl('end')\n\n# execute program from 'start' to 'end'\nm.exe(start='start', end='end')\n\n# compare results against golden reference\nres = m.read_i32_vec(4*32, size=4*4).reshape(4, 4)  # read result matrix\nref = np.matmul(A, B)            # golden reference\nprint(np.array_equal(res, ref))  # should return 'True'\n# Output: True\n```\n**Example 3.3:** Same as example 3.2,  but now use Python for-loops in the assembly code to improve readability.\n```python\n# generate 4x4 matrices A and B and store them in memory\nA = np.random.randint(100, size=(4, 4))\nB = np.random.randint(100, size=(4, 4))\nm.write_i32_vec(A.flatten(), 0)     # write matrix A to mem[0]\nm.write_i32_vec(B.flatten(), 4*16)  # write matrix B to mem[4*16]\n\n# store assembly program starting at address 4*128\nm.pc = 4*128\n# here, we decrement the loop variables down to 0 so that we don't need an\n# additional register to hold the constant for detecting the end of the loop:\n#  - x[20] is 4*4*i (i.e. the outer-loop variable) and is decremented by 16 from 64\n#  - x[21] is 4*j (i.e. 
the inner-loop variable) and is decremented by 4 from 16\nm.lbl('start')\nm.asm('addi', 20, 0, 64)            # x[20] := 0 + 64\nm.lbl('outer-loop')\nm.asm('addi', 20, 20, -16)          # decrement loop-variable: x[20] := x[20] - 16\nfor k in range(4):\n  m.asm('lw', 10+k, k*4, 20)        # load x[10+k] with A[i, k] from mem[k*4 + x[20]]\nm.asm('addi', 21, 0, 16)            # reset loop-variable j: x[21] := 0 + 16\nm.lbl('inner-loop')\nm.asm('addi', 21, 21, -4)           # decrement j: x[21] := x[21] - 4\nm.asm('lw',   18, 4*16, 21)         # load x[18] with B[0, j] from mem[4*16 + x[21]]\nm.asm('mul',  19, 10, 18)           # x[19] := x[10] * x[18] = A[i, 0] * B[0, j]\nfor k in range(1, 4):\n  m.asm('lw',  18, 4*(16+k*4), 21)  # load x[18] with B[k, j]\n  m.asm('mul', 18, 10+k, 18)        # x[18] := x[10+k] * x[18] = A[i, k] * B[k, j]\n  m.asm('add', 19, 19, 18)          # x[19] := x[19] + x[18]\nm.asm('add', 24, 20, 21)            # calculate base address for result-matrix\nm.asm('sw',  19, 4*32, 24)          # store res[i, j] from x[19]\nm.asm('bne', 21, 0, 'inner-loop')   # branch to 'inner-loop' if x[21] != 0\nm.asm('bne', 20, 0, 'outer-loop')   # branch to 'outer-loop' if x[20] != 0\nm.lbl('end')\n\n# execute program from 'start' to 'end'\nm.exe(start='start', end='end')\n\n# compare results against golden reference\nres = m.read_i32_vec(4*32, size=4*4).reshape(4, 4)  # read result matrix\nref = np.matmul(A, B)            # golden reference\nprint(np.array_equal(res, ref))  # should return 'True'\n# Output: True\n```\nPerformance numbers for example 3.3:\n```python\n\u003e\u003e\u003e m.print_perf()\nOps counters: {'total': 269, 'load': 80, 'store': 16, 'mul': 64, 'add': 89, 'madd': 0, 'branch': 20}\nx[] regfile : 9 out of 31 x-registers are used\nf[] regfile : 0 out of 32 f-registers are used\nImage size  : 92 Bytes\n```\n**Example 3.4:** 4x4 matrix multiplication optimized for runtime at the expense of image size and register file usage. 
Specifically, we first store the entire B matrix in the register file. And we fully unroll the for-loops to eliminate loop variables and branch instructions at the expense of a larger image size.\n```python\n# generate 4x4 matrices A and B and store them in memory\nA = np.random.randint(100, size=(4, 4))\nB = np.random.randint(100, size=(4, 4))\nm.write_i32_vec(A.flatten(), 0)     # write matrix A to mem[0]\nm.write_i32_vec(B.flatten(), 4*16)  # write matrix B to mem[4*16]\n\n# store assembly program starting at address 4*128\nm.pc = 4*128\nm.lbl('start')\n# load entire B matrix into registers x[16] ... x[31]\nfor i in range(4):\n  for j in range(4):\n    m.asm('lw', 16+4*i+j, 4*(16+4*i+j), 0)\n# perform matmul in row-major order\nfor i in range(4):\n  for k in range(4):                    # load x[10] ... x[13] with row i of A\n    m.asm('lw', 10+k, 4*(4*i+k), 0)     # load x[10+k] with A[i, k]\n  for j in range(4):\n    m.asm('mul', 15, 10, 16+j)          # x[15] := x[10] * x[16+j] = A[i, 0] * B[0, j]\n    for k in range(1, 4):\n      m.asm('mul', 14, 10+k, 16+4*k+j)  # x[14] := x[10+k] * x[16+4k+j] = A[i, k] * B[k, j]\n      m.asm('add', 15, 15, 14)          # x[15] := x[15] + x[14]\n    m.asm('sw', 15, 4*(32+i*4+j), 0)    # store res[i, j] from x[15]\nm.lbl('end')\n\n# execute program from 'start' to 'end'\nm.exe(start='start', end='end')\n\n# compare results against golden reference\nres = m.read_i32_vec(4*32, size=4*4).reshape(4, 4)  # read result matrix\nref = np.matmul(A, B)            # golden reference\nprint(np.array_equal(res, ref))  # should return 'True'\n# Output: True\n```\nThe table below shows a speedup of 1.7 with the following caveats:\n- The bit-widths don't make sense for fixed point (in general, multiplying two 32-bit integers produces a 64-bit product; and adding 4 of these products requires up to 66 bits).\n- For runtime calculations, we assume that our RISC-V CPU can only perform one instruction per cycle (while many RISC-V cores can 
perform multiple instructions per cycle).\n- We assume all 31 registers can be used, which is unrealistic because we ignore register allocation conventions such as the procedure\ncalling conventions specified [here](https://github.com/riscv-non-isa/riscv-elf-psabi-doc).\n\n|             | Image | Registers | Load | Store | Mul | Add | Branch | Total ops | Speedup |\n|:-----------:|:-----:|:---------:|:----:|:-----:|:---:|:---:|:------:|:---------:|:-------:|\n| Example 3.3 | 92B   | 9         | 80   | 16    | 64  | 89  | 20     | 269       | 1       |\n| Example 3.4 | 640B  | 22        | 32   | 16    | 64  | 48  | 0      | 160       | 1.7     |\n\n### Example 4: Neural network layers\nComing soon, see [file layer_examples.py](layer_examples.py) for now\n\n### Example 5: MobileNet\nComing soon-ish, see [file mobilenet_v1_0.25.py](mobilenet_v1_0.25.py) for now\n\n## Running in colab\n\u003ca href=\"https://colab.research.google.com/github/OpenMachine-ai/tinyfive/blob/main/misc/colab.ipynb\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Colab\" height=\"20\"\u003e\n\u003c/a\u003e  This is the quickest way to get started and should work on any machine.\n\nIf you have a free Google Drive account, you can make a copy of this colab via the menu `File` -\u003e `Save a copy in Drive`. Now you can edit the code.\n\nAlternatively, start a new colab in your Google Drive as follows: [Go here](https://drive.google.com/drive/my-drive) and click on `New` -\u003e `More` -\u003e `Google Colaboratory`. 
Then copy below lines into your colab:\n\n```python\n!pip install tinyfive\nfrom tinyfive.machine import machine\nimport numpy as np\n\nm = machine(mem_size=4000)  # instantiate RISC-V machine with 4KB of memory\n```\n\n## Running without package\nIf you don't want to use the TinyFive python package, then you can clone the latest repo and install numpy as follows:\n```bash\ngit clone https://github.com/OpenMachine-ai/tinyfive.git\ncd tinyfive\npip install numpy\n```\nTo run the examples, type:\n```bash\npython3 examples.py\n```\nTo run the test suite, type:\n```bash\npython3 tests.py\n```\n\nIf you don't want to run above steps on your local machine, you can run it in a colab as follows: Start a new colab in your Google Drive by [going here](https://drive.google.com/drive/my-drive) and clicking on `New` -\u003e `More` -\u003e `Google Colaboratory`. Then copy below lines into your colab:\n```python\n!git clone https://github.com/OpenMachine-ai/tinyfive.git\n%cd tinyfive\n\n# run examples\n!python3 examples.py\n\n# run test suite\n!python3 tests.py\n```\n## Contribute\nIf you like this project, give it a ⭐ and share it with friends!  And if you are interested in helping make TinyFive better,\nI highly welcome you to do so. I thank you in advance for your interest.  If you are unsure of what you could do to improve the project, you may have a look [here](https://github.com/OpenMachine-ai/tinyfive/issues/5).\n\n## Latest status\n- TinyFive is still under construction, many things haven't been implemented and tested yet.\n- 37 of the 40 base instructions (RV32I), all instructions of the M-extension (RV32M) and the F-extension (RV32F) with the default rounding mode are already implemented, and many of them are tested.  (The three missing RV32I instructions `fence`, `ebreak`, and `ecall` are not applicable here.)\n- Remaining work: improve testing, add more extensions. 
See TODOs in the code for more details.\n- Stay updated by following us on [Twitter](https://twitter.com/OpenMachine_AI), [Post.news](https://post.news/@/openmachine), and [LinkedIn](https://www.linkedin.com/in/nilsgraef/).\n\n## Speed\n- TinyFive is not optimized for speed (but for ease-of-use and [LOC](https://en.wikipedia.org/wiki/Source_lines_of_code)).\n- You might be able to use PyPy or [Codon](https://github.com/exaloop/codon) to speed up TinyFive (see e.g. the [Pydgin paper](https://www.csl.cornell.edu/~berkin/ilbeyi-pydgin-riscv2016.pdf) for details).\n- If you only use the upper-case instructions such as `ADD()`, then TinyFive is very fast because there is no instruction decoding. And you should be able to accelerate it on a GPU or TPU.\n- If you use the lower-case instructions with `asm()` and `exe()`, then execution of these functions is slow as they involve look-up and string matching with O(n) complexity where \"n\" is the total number of instructions. The current implementations of `asm()` and `dec()` are optimized for ease-of-use and readability. A faster implementation would collapse multiple look-ups into one look-up, optimize the pattern-matching for the instruction decoding (bits -\u003e instruction), and change the order of the instructions so that more frequently used instructions are at the top of the list. [Here is an older version](https://github.com/OpenMachine-ai/tinyfive/blob/2aa4987391561c9c6692602ed3fccdeaee333e0b/tinyfive.py) of TinyFive with a faster `dec()` function that collapses two look-ups (`bits -\u003e instruction` and `instruction -\u003e upper-case instruction`) and doesn't use `fnmatch`.\n\n## Comparison\nThe table below compares TinyFive with other [ISS](https://en.wikipedia.org/wiki/Instruction_set_simulator) and emulator projects.\n\n| ISS | Author | Language | Mature? 
| Extensions | LOC |\n| --- | ------ | -------- | ------- | ---------- | --- |\n| [TinyFive](https://github.com/OpenMachine-ai/tinyfive)             | OpenMachine          | Python    | No               | I, M, some F  | \u003c 1k |\n| [Pydgin](https://github.com/cornell-brg/pydgin)                    | Cornell University   | Python, C | Last update 2016 | A, D, F, I, M | |\n| [Spike](https://github.com/riscv-software-src/riscv-isa-sim)       | UC Berkeley          | C, C++    | Yes              | All           | |\n| [QEMU](https://www.qemu.org/) | [Fabrice Bellard](https://en.wikipedia.org/wiki/Fabrice_Bellard) | C  | Yes              | All           | |\n| [TinyEMU](https://bellard.org/tinyemu/) | [Fabrice Bellard](https://en.wikipedia.org/wiki/Fabrice_Bellard) | C  | Yes    | All           | |\n| [riscvOVPsim](https://github.com/riscv-ovpsim/imperas-riscv-tests) | Imperas              | C         | Yes              | All           | |\n| [Whisper](https://github.com/chipsalliance/SweRV-ISS)              | Western Digital      | C, C++    | Yes | Almost all                 | |\n| [Sail Model](https://github.com/riscv/sail-riscv)                  | Cambridge, Edinburgh | Sail, C   | Yes | All                        | |\n| [PiMaker/rvc](https://github.com/PiMaker/rvc)                      | PiMaker              | C         |     |                            | |\n| [mini-rv32ima](https://github.com/cnlohr/mini-rv32ima)             | Charles Lohr         | C         |     | A, I, M, Zifencei, Zicsr   | \u003c 1k |\n\n## References\n- [HuggingFive:raised_hand_with_fingers_splayed:](https://github.com/OpenMachine-ai/HuggingFive)\n- Official [RISC-V spec](https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf)\n- See [this RISC-V card](https://inst.eecs.berkeley.edu/~cs61c/fa18/img/riscvcard.pdf) for a brief description of most instructions. 
See also the [RISC-V reference card](http://riscvbook.com/greencard-20181213.pdf).\n- Book [The RISC-V Reader: An Open Architecture Atlas](https://www.abebooks.com/book-search/author/patterson-david-waterman-andrew/) by David Patterson and Andrew Waterman. Appendix A of this book defines all instructions. The Spanish version of this book is [available for free](http://riscvbook.com/spanish/guia-practica-de-risc-v-1.0.5.pdf),\nother free versions are [available here](http://riscvbook.com).\n- Pydgin [paper](https://www.csl.cornell.edu/~berkin/ilbeyi-pydgin-riscv2016.pdf) and [video](https://youtu.be/-p_AGki7Vsk)\n- [Online simulator](https://ascslab.org/research/briscv/simulator/simulator.html) for debug\n\n## Tiny Tech promise\nSimilar to [TinyEMU](https://bellard.org/tinyemu/), [tinygrad](https://github.com/geohot/tinygrad), and other “tiny tech” projects, we believe that core technology should be simple and small (in terms of LOC). Therefore, we will make sure that the core of TinyFive (without tests and examples) will always be below 1000 lines.\n\nSimplicity and size (in terms of number of instructions) is a key feature of [RISC](https://en.wikipedia.org/wiki/Reduced_instruction_set_computer): the \"R\" in RISC stands for \"reduced\" (as opposed to complex CISC). Specifically, the ISA manual of RISC-V has only ~200 pages while the ARM-32 manual is over 2000 pages long according to Fig. 1.6 of\nthe [RISC-V Reader](http://riscvbook.com/spanish/guia-practica-de-risc-v-1.0.5.pdf).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/OpenMachine-ai/tinyfive/blob/main/misc/logo.jpg\"\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenMachine-ai%2Ftinyfive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenMachine-ai%2Ftinyfive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenMachine-ai%2Ftinyfive/lists"}