{"id":13831390,"url":"https://github.com/mn416/QPULib","last_synced_at":"2025-07-09T13:33:42.322Z","repository":{"id":40685785,"uuid":"55894578","full_name":"mn416/QPULib","owner":"mn416","description":"Language and compiler for the Raspberry Pi GPU","archived":false,"fork":false,"pushed_at":"2020-12-09T13:57:00.000Z","size":1002,"stargazers_count":430,"open_issues_count":25,"forks_count":63,"subscribers_count":30,"default_branch":"master","last_synced_at":"2024-11-20T12:48:53.036Z","etag":null,"topics":["compiler","gpu","qpu","raspberry-pi","vector"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mn416.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-04-10T11:13:21.000Z","updated_at":"2024-10-06T11:30:19.000Z","dependencies_parsed_at":"2022-08-30T18:50:42.473Z","dependency_job_id":null,"html_url":"https://github.com/mn416/QPULib","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mn416/QPULib","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mn416%2FQPULib","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mn416%2FQPULib/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mn416%2FQPULib/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mn416%2FQPULib/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mn416","download_url":"https://codeload.github.com/mn416/QPULib/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/mn416%2FQPULib/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264468253,"owners_count":23613064,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compiler","gpu","qpu","raspberry-pi","vector"],"created_at":"2024-08-04T10:01:26.850Z","updated_at":"2025-07-09T13:33:41.581Z","avatar_url":"https://github.com/mn416.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# QPULib\n\nVersion 0.1.0.\n\nQPULib is a programming language and compiler for the [Raspberry\nPi](https://www.raspberrypi.org/)'s *Quad Processing Units* (QPUs).\nIt is implemented as a C++ library that runs on the Pi's ARM CPU,\ngenerating and offloading programs to the QPUs at runtime.  This page\nintroduces and documents QPULib.  
For build instructions, see the\n[Getting Started Guide](Doc/GettingStarted.md).\n\nNote that QPULib is an experimental library, no longer under\ndevelopment.\n\n## Contents\n\n* [Background](#background)\n* [Example 1: Euclid's Algorithm](#example-1-euclids-algorithm)\n    * [Scalar version](#scalar-version)\n    * [Vector version 1](#vector-version-1)\n    * [Invoking the QPUs](#invoking-the-qpus)\n    * [Vector version 2: loop unrolling](#vector-version-2-loop-unrolling)\n* [Example 2: 3D Rotation](#example-2-3d-rotation)\n    * [Scalar version](#scalar-version-1)\n    * [Vector version 1](#vector-version-1-1)\n    * [Vector version 2: non-blocking loads and stores](#vector-version-2-non-blocking-loads-and-stores)\n    * [Vector version 3: multiple QPUs](#vector-version-3-multiple-qpus)\n    * [Performance](#performance)\n* [Example 3: 2D Convolution (Heat Transfer)](#example-3-2d-convolution-heat-transfer)\n    * [Scalar version](#scalar-version-2)\n    * [Vector version](#vector-version)\n    * [Performance](#performance-1)\n* [References](#user-content-references)\n\n## Background\n\nThe\n[QPU](http://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf)\nis a [vector\nprocessor](https://en.wikipedia.org/wiki/Vector_processor) developed by\n[Broadcom](http://www.broadcom.com/) with\ninstructions that operate on 16-element vectors of 32-bit integer or\nfloating point values.\nFor example, given two 16-element vectors\n\n`10 11 12 13` `14 15 16 17` `18 19 20 21` `22 23 24 25`\n\nand\n\n`20 21 22 23` `24 25 26 27` `28 29 30 31` `32 33 34 35`\n\nthe QPU's *integer-add* instruction computes a third vector\n\n`30 32 34 36` `38 40 42 44` `46 48 50 52` `54 56 58 60`\n\nwhere each element in the output is the sum of the\ncorresponding two elements in the inputs.\n\nEach 16-element vector is comprised of four *quads*.  
This is where\nthe name \"Quad Processing Unit\" comes from: a QPU processes one quad\nper clock cycle, and a QPU instruction takes four consecutive clock\ncycles to deliver a full 16-element result vector.\n\nThe Pi contains 12 QPUs in total, each running at 250MHz.  That's a\nmax throughput of 750M vector instructions per second (250M cycles\ndivided by 4 cycles-per-instruction times 12 QPUs).  Or: 12B\noperations per second (750M instructions times 16 vector elements).\nQPU instructions can in some cases deliver two results at a\ntime, so the Pi's QPUs are often advertised at 24\n[GFLOPS](https://en.wikipedia.org/wiki/FLOPS).\n\nThe QPUs are part of the Raspberry Pi's graphics pipeline.  If you're\ninterested in doing efficient graphics on the Pi then you probably\nwant [OpenGL\nES](https://www.raspberrypi.org/documentation/usage/demos/hello-teapot.md).\nBut if you'd like to try accelerating a non-graphics part of your Pi\nproject then QPULib is worth a look.  (And so too are\n[these references](#user-content-references).)\n\n## Example 1: Euclid's Algorithm\n\nFollowing tradition, let's start by implementing [Euclid's\nalgorithm](https://en.wikipedia.org/wiki/Euclidean_algorithm).  Given\na pair of positive integers, Euclid's algorithm\ncomputes the largest integer that divides into both without a\nremainder, also known as the *greatest common divisor*, or GCD for\nshort.\n\nWe present two versions of the algorithm:\n\n  1. a **scalar** version that runs on the ARM CPU and computes a\n     single GCD; and\n\n  2. 
a **vector** version that runs on a single QPU and computes 16\n     different GCDs in parallel.\n\n### Scalar version\n\nIn plain C++, we can express the algorithm as follows.\n\n```C++\nvoid gcd(int* p, int* q, int* r)\n{\n  int a = *p;\n  int b = *q;\n  while (a != b) {\n    if (a \u003e b) \n      a = a-b;\n    else\n      b = b-a;\n  }\n  *r = a;\n}\n```\n\nAdmittedly, it's slightly odd to write `gcd` in this way, operating\non pointers to integers rather than integers directly.  However, it\nprepares the way for the vector version which operates on \n*arrays* of inputs and outputs.\n\n### Vector version 1\n\nUsing QPULib, the algorithm looks as follows.\n\n```c++\n#include \u003cQPULib.h\u003e\n\nvoid gcd(Ptr\u003cInt\u003e p, Ptr\u003cInt\u003e q, Ptr\u003cInt\u003e r)\n{\n  Int a = *p;\n  Int b = *q;\n  While (any(a != b))\n    Where (a \u003e b)\n      a = a-b;\n    End\n    Where (a \u003c b)\n      b = b-a;\n    End\n  End\n  *r = a;\n}\n```\n\nEven this simple example introduces a number of concepts:\n\n  * the `Int` type denotes a 16-element vector of 32-bit integers;\n\n  * the `Ptr\u003cInt\u003e` type denotes a 16-element vector of *addresses* of\n    `Int` vectors;\n\n  * the expression `*p` denotes the `Int` vector in memory starting at address\n    \u003ctt\u003ep\u003csub\u003e0\u003c/sub\u003e\u003c/tt\u003e, i.e. starting at the *first* address in the\n    vector `p`;\n\n  * the expression `a != b` computes a vector of booleans via a \n    pointwise comparison of vectors `a` and `b`;\n\n  * the condition `any(a != b)` is true when *any* of the booleans in the\n    vector `a != b` are true;\n\n  * the statement `Where (a \u003e b) a = a-b; End` is a conditional assignment:\n    only elements in vector `a` for which `a \u003e b` holds will be\n    modified.\n\nIt's worth reiterating that QPULib is just standard C++ code: there\nare no pre-processors being used other than the standard C\npre-processor.  
All the QPULib language constructs are simply\nclasses, functions, and macros exported by QPULib.  This kind of\nlanguage is sometimes known as a [Domain Specific Embedded\nLanguage](http://cs.yale.edu/c2/images/uploads/dsl.pdf).\n\n### Invoking the QPUs\n\nNow, to compute 16 GCDs on a single QPU, we write the following\nprogram.\n\n```c++\nint main()\n{\n  // Compile the gcd function to a QPU kernel k\n  auto k = compile(gcd);\n\n  // Allocate and initialise arrays shared between CPU and QPUs\n  SharedArray\u003cint\u003e a(16), b(16), r(16);\n\n  // Initialise inputs to random values in range 100..199\n  srand(0);\n  for (int i = 0; i \u003c 16; i++) {\n    a[i] = 100 + rand()%100;\n    b[i] = 100 + rand()%100;\n  }\n\n  // Set the number of QPUs to use\n  k.setNumQPUs(1);\n\n  // Invoke the kernel\n  k(\u0026a, \u0026b, \u0026r);\n\n  // Display the result\n  for (int i = 0; i \u003c 16; i++)\n    printf(\"gcd(%i, %i) = %i\\n\", a[i], b[i], r[i]);\n  \n  return 0;\n}\n```\n\nUnpacking this a bit:\n\n  * `compile` takes a function defining a QPU computation and returns a\n    CPU-side handle that can be used to invoke it;\n\n  * the handle `k` is of type `Kernel\u003cPtr\u003cInt\u003e, Ptr\u003cInt\u003e,\n    Ptr\u003cInt\u003e\u003e`, capturing the types of `gcd`'s parameters,\n    but we use the `auto` keyword to avoid clutter;\n\n  * when the kernel is invoked by writing `k(\u0026a, \u0026b, \u0026r)`, QPULib knows\n    how to automatically convert CPU values of type\n    `SharedArray\u003cint\u003e*` into QPU values of type `Ptr\u003cInt\u003e`;\n\n  * the \u003ctt\u003eSharedArray\u0026lt;\u0026alpha;\u0026gt;\u003c/tt\u003e type is used to allocate\n    memory that is accessed\n    by both the CPU and the QPUs: memory allocated with `new` and\n    `malloc()` will not be accessible from the QPUs.\n\nRunning this program, we get:\n\n```\ngcd(183, 186) = 3\ngcd(177, 115) = 1\ngcd(193, 135) = 1\ngcd(186, 192) = 6\ngcd(149, 121) = 1\ngcd(162, 127) = 1\ngcd(190, 
159) = 1\ngcd(163, 126) = 1\ngcd(140, 126) = 14\ngcd(172, 136) = 4\ngcd(111, 168) = 3\ngcd(167, 129) = 1\ngcd(182, 130) = 26\ngcd(162, 123) = 3\ngcd(167, 135) = 1\ngcd(129, 102) = 3\n```\n\n### Vector version 2: loop unrolling\n\n[Loop unrolling](https://en.wikipedia.org/wiki/Loop_unrolling) is a\ntechnique for improving performance by reducing the number of costly\nbranch instructions executed.\n\nThe QPU's branch instruction can indeed be costly: it requires three\n[delay slots](https://en.wikipedia.org/wiki/Delay_slot) (that's 12\nclock cycles), and QPULib currently makes no attempt to fill these\nslots with useful work.  Although QPULib doesn't do loop unrolling\nfor you, it does make it easy to express: we can simply\nuse a C++ loop to generate multiple QPU statements.\n\n```c++\nvoid gcd(Ptr\u003cInt\u003e p, Ptr\u003cInt\u003e q, Ptr\u003cInt\u003e r)\n{\n  Int a = *p;\n  Int b = *q;\n  While (any(a != b))\n    // Unroll the loop body 32 times\n    for (int i = 0; i \u003c 32; i++) {\n      Where (a \u003e b)\n        a = a-b;\n      End\n      Where (a \u003c b)\n        b = b-a;\n      End\n    }\n  End\n  *r = a;\n}\n```\n\nUsing C++ as a meta-language in this way is one of the attractions\nof QPULib.  
We will see lots more examples of this later!\n\n## Example 2: 3D Rotation\n\nLet's move to another simple example that helps to introduce\nideas: a routine to rotate 3D objects.\n\n(Of course, [OpenGL\nES](https://www.raspberrypi.org/documentation/usage/demos/hello-teapot.md)\nwould be a much better path for doing efficient graphics; this is just\nfor illustration purposes.)\n\n### Scalar version\n\nThe following function will rotate `n` vertices about the Z axis by\n\u0026theta; degrees.\n\n```c++\nvoid rot3D(int n, float cosTheta, float sinTheta, float* x, float* y)\n{\n  for (int i = 0; i \u003c n; i++) {\n    float xOld = x[i];\n    float yOld = y[i];\n    x[i] = xOld * cosTheta - yOld * sinTheta;\n    y[i] = yOld * cosTheta + xOld * sinTheta;\n  }\n}\n```\n\nIf we apply this to the vertices in [Newell's\nteapot](https://github.com/rm-hull/newell-teapot/blob/master/teapot)\n(rendered using [Richard Hull's\nwireframes](https://github.com/rm-hull/wireframes) tool)\n\n\u003cimg src=\"Doc/teapot.png\" alt=\"Newell's teapot\" width=30%\u003e\n\nwith \u0026theta; = 180 degrees, then we get\n\n\u003cimg src=\"Doc/teapot180.png\" alt=\"Newell's teapot\" width=30%\u003e\n\n### Vector version 1\n\nOur first vector version is almost identical to the scalar version\nabove: the only difference is that each loop iteration now processes\n16 vertices at a time rather than a single vertex.\n\n```c++\nvoid rot3D(Int n, Float cosTheta, Float sinTheta, Ptr\u003cFloat\u003e x, Ptr\u003cFloat\u003e y)\n{\n  For (Int i = 0, i \u003c n, i = i+16)\n    Float xOld = x[i];\n    Float yOld = y[i];\n    x[i] = xOld * cosTheta - yOld * sinTheta;\n    y[i] = yOld * cosTheta + xOld * sinTheta;\n  End\n}\n```\n\nUnfortunately, this simple solution is not the most efficient: it will\nspend a lot of time blocked on the memory subsystem, waiting for\nvector loads and stores to complete.  
To get good performance on a\nQPU, it is desirable to overlap memory access with computation, and\nthe current QPULib compiler is not clever enough to do this\nautomatically.  We can however solve the problem manually, using\n*non-blocking* load and store operations.\n\n### Vector version 2: non-blocking loads and stores\n\nQPULib supports non-blocking loads through two functions:\n\n  * Given a vector of addresses `p`, the\n    statement `gather(p)` will *request* \n    the value at each address in `p`.\n\n  * A subsequent call to `receive(x)`, where `x` is a vector,\n    will block until the value at each address in\n    `p` has been loaded into `x`.\n\nUnlike the statement `x = *p`, the statement `gather(p)` will request\nthe value *at each address* in `p`, not the vector beginning at the\nfirst address in `p`.  In addition, `gather(p)` does not\nblock until the loads have completed: between `gather(p)`\nand `receive(x)` the program is free to perform computation *in\nparallel* with the slow memory accesses.\n\nInside the QPU, a 4-element FIFO is used to hold `gather`\nrequests: each call to `gather` will enqueue a request on the FIFO, and each call\nto `receive` will dequeue one.  This means that a maximum of four\n`gather` calls may be issued before a `receive` must be called.\n\nNon-blocking stores are not as powerful, but they are\nstill useful:\n\n  * Given a vector of addresses `p` and a vector `x`,\n    the statement `store(x, p)` will write\n    vector `x` to memory beginning at the first address in `p`.\n\nUnlike the statement `*p = x`, the statement `store(x, p)` will not\nwait until `x` has been written.  However, any subsequent call to\n`store` will wait until the previous store has completed.  
(Future\nimprovements to QPULib could allow several outstanding stores instead of\njust one.)\n\nWe are now ready to implement a vectorised rotation routine that\noverlaps memory access with computation:\n\n```c++\nvoid rot3D(Int n, Float cosTheta, Float sinTheta, Ptr\u003cFloat\u003e x, Ptr\u003cFloat\u003e y)\n{\n  // Function index() returns vector \u003c0 1 2 ... 14 15\u003e\n  Ptr\u003cFloat\u003e p = x + index();\n  Ptr\u003cFloat\u003e q = y + index();\n  // Pre-fetch first two vectors\n  gather(p); gather(q);\n\n  Float xOld, yOld;\n  For (Int i = 0, i \u003c n, i = i+16)\n    // Pre-fetch two vectors for the *next* iteration\n    gather(p+16); gather(q+16);\n    // Receive vectors for *this* iteration\n    receive(xOld); receive(yOld);\n    // Store results\n    store(xOld * cosTheta - yOld * sinTheta, p);\n    store(yOld * cosTheta + xOld * sinTheta, q);\n    p = p+16; q = q+16;\n  End\n\n  // Discard pre-fetched vectors from final iteration\n  receive(xOld); receive(yOld);\n}\n```\n\nWhile the outputs from one iteration are being computed and written to\nmemory, the inputs for the *next* iteration are being loaded *in\nparallel*.\n\n### Vector version 3: multiple QPUs\n\nQPULib provides a simple mechanism to execute the same kernel on\nmultiple QPUs in parallel: before invoking a kernel `k`, call\n`k.setNumQPUs(n)` to use `n` QPUs.\nFor this to be useful the programmer needs a way to tell\neach QPU to compute a different part of the overall result.\nAccordingly,\nQPULib provides the `me()` function which returns the unique id of the\nQPU that called it.  More specifically, `me()` returns a vector of\ntype `Int` with all elements holding the QPU id.  In addition, the\n`numQPUs()` function returns the number of QPUs that are executing the\nkernel.  
A QPU id will always lie in the range `0` to `numQPUs()-1`.\n\nNow, to spread the `rot3D` computation across multiple QPUs we will\nuse a loop increment of `16*numQPUs()` instead of `16`, and offset the\ninitial pointers `x` and `y` by `16*me()`.\n\n```c++\nvoid rot3D(Int n, Float cosTheta, Float sinTheta, Ptr\u003cFloat\u003e x, Ptr\u003cFloat\u003e y)\n{\n  Int inc = numQPUs() \u003c\u003c 4;\n  Ptr\u003cFloat\u003e p = x + index() + (me() \u003c\u003c 4);\n  Ptr\u003cFloat\u003e q = y + index() + (me() \u003c\u003c 4);\n  gather(p); gather(q);\n\n  Float xOld, yOld;\n  For (Int i = 0, i \u003c n, i = i+inc)\n    gather(p+inc); gather(q+inc);\n    receive(xOld); receive(yOld);\n    store(xOld * cosTheta - yOld * sinTheta, p);\n    store(yOld * cosTheta + xOld * sinTheta, q);\n    p = p+inc; q = q+inc;\n  End\n\n  // Discard pre-fetched vectors from final iteration\n  receive(xOld); receive(yOld);\n}\n```\n\n### Performance\n\nTimes taken to rotate an object with 192,000 vertices:\n\n  Version  | Number of QPUs | Run-time (s) |\n  ---------| -------------: | -----------: |\n  Scalar   | 0              | 0.018        |\n  Vector 1 | 1              | 0.040        |\n  Vector 2 | 1              | 0.018        |\n  Vector 3 | 1              | 0.018        |\n  Vector 3 | 2              | 0.016        |\n\nNon-blocking loads and stores (vector version 2) give a\nsignificant performance boost: in this case a factor of 2.\n\nUnfortunately, the program does not scale well to multiple QPUs.  I'm\nnot entirely sure why, but my suspicion is that the compute-to-memory\nratio is too low: we do only 2 arithmetic operations for every memory\naccess, perhaps overwhelming the memory subsystem.  If there are\npossibilities for QPULib to generate better code here, hopefully they\nwill be discovered in due course.  
(Do let me know if you\nhave any suggestions.)\n\n## Example 3: 2D Convolution (Heat Transfer)\n\nLet's move to a somewhat more substantial example: modelling the heat\nflow across a 2D surface.  [Newton's law of\ncooling](https://en.wikipedia.org/wiki/Newton%27s_law_of_cooling)\nstates that an object cools at a rate proportional to the difference\nbetween its temperature `T` and the temperature of its environment (or\nambient temperature) `A`:\n\n```\ndT/dt = −k(T − A)\n```\n\nWhen simulating this equation below, we will consider each point on\nour 2D surface to be a separate object, and the ambient temperature of\neach object to be the average of the temperatures of the 8 surrounding\nobjects.  This is very similar to 2D convolution using a mean filter.\n\n### Scalar version\n\nThe following function simulates a single time-step of the\ndifferential equation, applied to each object in the 2D grid.\n\n```c++\nvoid step(float** grid, float** gridOut, int width, int height)\n{\n  for (int y = 1; y \u003c height-1; y++) {\n    for (int x = 1; x \u003c width-1; x++) {\n      float surroundings =\n        grid[y-1][x-1] + grid[y-1][x]   + grid[y-1][x+1] +\n        grid[y][x-1]   +                  grid[y][x+1]   +\n        grid[y+1][x-1] + grid[y+1][x]   + grid[y+1][x+1];\n      surroundings *= 0.125;\n      gridOut[y][x] = grid[y][x] - (K * (grid[y][x] - surroundings));\n    }\n  }\n}\n```\n\nIf we apply heat at the north and east edges of our 2D surface, and\ncold at the south and west edges, then after several simulation\nsteps we get:\n\n\u003cimg src=\"Doc/heat.png\" alt=\"Heat flow across 2D surface\" width=30%\u003e\n\n### Vector version\n\nBefore vectorising the simulation routine, we will introduce the idea\nof a **cursor** which is useful for implementing sliding window\nalgorithms.  
A cursor points to a window of three contiguous vectors\nin memory: `prev`, `current` and `next`.\n\n```\n  cursor  ------\u003e  +---------+---------+---------+\n                   |  prev   | current |  next   |\n                   +---------+---------+---------+\n                 +0:      +16:      +32:      +48:\n```\n\nIt supports three main operations:\n\n  1. **advance** the cursor by one vector, i.e. slide the window right\n     by one vector;\n\n  2. **shift-left** the `current` vector by one element,\n     using the value of the `next` vector;\n\n  3. **shift-right** the `current` vector by one element,\n     using the value of the `prev` vector.\n\nHere is a QPULib implementation of a cursor, using a C++ class.\n\n```c++\nclass Cursor {\n  Ptr\u003cFloat\u003e cursor;\n\n public:\n\n  // Window contents (public because the step kernel reads row[i].current)\n  Float prev, current, next;\n\n  // Initialise the cursor to a given pointer\n  // and fetch the first vector.\n  void init(Ptr\u003cFloat\u003e p) {\n    gather(p);\n    current = 0;\n    cursor = p+16;\n  }\n\n  // Receive the first vector and fetch the second.\n  // (prime the software pipeline)\n  void prime() {\n    receive(next);\n    gather(cursor);\n  }\n\n  // Receive the next vector and fetch another.\n  void advance() {\n    cursor = cursor+16;\n    prev = current;\n    gather(cursor);\n    current = next;\n    receive(next);\n  }\n\n  // Receive final vector and don't fetch any more.\n  void finish() {\n    receive(next);\n  }\n\n  // Shift the current vector left one element\n  void shiftLeft(Float\u0026 result) {\n    result = rotate(current, 15);\n    Float nextRot = rotate(next, 15);\n    Where (index() == 15)\n      result = nextRot;\n    End\n  }\n\n  // Shift the current vector right one element\n  void shiftRight(Float\u0026 result) {\n    result = rotate(current, 1);\n    Float prevRot = rotate(prev, 1);\n    Where (index() == 0)\n      result = prevRot;\n    End\n  }\n};\n```\n\nGiven a vector `x`, the QPULib operation `rotate(x, n)` will rotate\n`x` 
right by `n` places where `n` is an integer in the range 0 to 15.\nNotice that rotating right by 15 is the same as rotating left by 1.\n\nNow, using cursors the vectorised simulation step is expressed below.\nA slight structural difference from the scalar version is that we no\nlonger treat the grid as a 2D array: it is now a 1D array with a `pitch`\nparameter that gives the increment needed to get from the start of one\nrow to the start of the next.\n\n```C++\nvoid step(Ptr\u003cFloat\u003e grid, Ptr\u003cFloat\u003e gridOut, Int pitch, Int width, Int height)\n{\n  Cursor row[3];\n  grid = grid + pitch*me() + index();\n\n  // Skip first row of output grid\n  gridOut = gridOut + pitch;\n\n  For (Int y = me(), y \u003c height, y=y+numQPUs())\n    // Point p to the output row\n    Ptr\u003cFloat\u003e p = gridOut + y*pitch;\n\n    // Initialise three cursors for the three input rows\n    for (int i = 0; i \u003c 3; i++) row[i].init(grid + i*pitch);\n    for (int i = 0; i \u003c 3; i++) row[i].prime();\n\n    // Compute one output row\n    For (Int x = 0, x \u003c width, x=x+16)\n\n      for (int i = 0; i \u003c 3; i++) row[i].advance();\n\n      Float left[3], right[3];\n      for (int i = 0; i \u003c 3; i++) {\n        row[i].shiftLeft(right[i]);\n        row[i].shiftRight(left[i]);\n      }\n\n      Float sum = left[0] + row[0].current + right[0] +\n                  left[1] +                  right[1] +\n                  left[2] + row[2].current + right[2];\n\n      store(row[1].current - K * (row[1].current - sum * 0.125), p);\n      p = p + 16;\n\n    End\n\n    // Cursors are finished for this row\n    for (int i = 0; i \u003c 3; i++) row[i].finish();\n\n    // Move to the next input rows\n    grid = grid + pitch*numQPUs();\n  End\n}\n```\n\n### Performance\n\nTimes taken to simulate a 512x512 surface for 2000 steps:\n\n  Version | Number of QPUs | Run-time (s) |\n  --------| -------------: | -----------: |\n  Scalar  | 0              | 431.46       |\n  Vector  
| 1              | 49.34        |\n  Vector  | 2              | 24.91        |\n  Vector  | 4              | 20.36        |\n\n\n## References\n\nThe following works were *very* helpful in the development of\nQPULib.\n\n  * The [VideoCore IV manual](http://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf) by Broadcom.\n\n  * The [documentation, demos, and\n    assembler](https://github.com/hermanhermitage/videocoreiv-qpu)\n    by Herman Hermitage.\n\n  * The [FFT implementation](http://www.aholme.co.uk/GPU_FFT/Main.htm)\n    by Andrew Holme.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmn416%2FQPULib","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmn416%2FQPULib","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmn416%2FQPULib/lists"}