https://github.com/viathefalcon/vk_merkle_roots

A program to demonstrate Merkle root calculation on GPUs through Vulkan

# Vulkan Merkle Roots

## Summary
A program to demonstrate the calculation of the roots of [Merkle/hash trees](https://en.wikipedia.org/wiki/Merkle_tree) of arbitrary inputs on GPUs via the [Vulkan API](https://en.wikipedia.org/wiki/Vulkan).

## Components
### `vkmr`
This is the primary program; it reads inputs from `stdin` and then calculates their Merkle root, either serially on the CPU or in parallel on a selected compute-capable GPU reported by Vulkan.

### `strm`
This helper program accepts an arbitrary number of command-line arguments and writes them to a line-separated stream in `stdout`.

### `rndm`
This helper program generates a specified number of randomly-filled input strings and writes them to a line-separated stream in `stdout`.

## Building, Running
The Visual Studio Code project includes tasks which will build the programs; it assumes that either the Visual C++ compiler (on Windows) or Clang (elsewhere) is on the `PATH`.

### On (Steam) Deck
To build the project on Steam Deck, I run [VS Code in Flatpak](https://flathub.org/apps/com.visualstudio.code). For building and running on Steam Deck, the project has an implicit dependency on LLVM 18, which I satisfy with the [LLVM 18 extension for the Flatpak Freedesktop SDK](https://github.com/flathub/org.freedesktop.Sdk.Extension.llvm18). I installed VS Code via the Discover package manager, and the LLVM extension via the command line:
```
flatpak install org.freedesktop.Sdk.Extension.llvm18
```

#### Launching VS Code
(and enabling the LLVM 18 extension)
```
FLATPAK_ENABLE_SDK_EXT=llvm18 flatpak run --devel com.visualstudio.code
```

#### Running the Program
The program, once built, also needs to run in a Flatpak container in which the LLVM 18 extension is available. So, I open a shell in the VS Code sandbox:
```
FLATPAK_ENABLE_SDK_EXT=llvm18 flatpak run --command=sh --devel com.visualstudio.code
```

And, inside this shell:
```
cd bin
./rndm.app 1712489279 1024 127 | ./vkmr.app
```

## Background
Initially, the goal was to investigate the practicality of using a GPU, as an asynchronous co-processor, to accelerate the generation of blockchain block headers; these will typically contain a hash representation of the block data, and this is often generated by computing the Merkle root of the records in the block.

Pulling a proof-of-concept together, using OpenCL, really wasn't too difficult, but iterating on it proved .. if not impossible then certainly impractical. The main sticking point was overcoming the requirement for a synchronous round trip between the host (CPU) and the accelerator (GPU) for each level of the tree. The specs for OpenCL 2.x+ provide a mechanism to avoid this - [Device-Side Enqueue](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#device-side-enqueue) - but support for OpenCL across vendors and devices varied, to put it mildly, and I could only get device-side enqueueing to work on one (1) integrated Intel GPU. When I tried to code around this, using OpenCL ~1.2 and pushing as much of the conditional logic as I could into compute shaders, I ended up with [this hot mess](https://gist.github.com/viathefalcon/6d82a14214d6e4f7af29b75133ef6c16).

## Design

### Goals
* Be able to compute the Merkle roots of arbitrary data sets entirely on GPU via parallel reduction;
* Not have any dependencies beyond Vulkan and the C++ Standard Library.

### Non-Goals
* Performance: i.e., where there is a choice between doing something on the GPU and doing it more performantly on the CPU, prefer to do it on the GPU (the aim being, recall, that the Merkle root calculation runs entirely asynchronously with respect to whatever's happening on the CPU).

### Choices

#### SHA-256d

I chose [SHA-256d](https://bitcoinwiki.org/wiki/sha-256d), or double `SHA-256`, as the hash function because it is used in Bitcoin, sparing myself the effort of evaluating different hash functions. Additionally, it has the agreeable property that the 256-bit/32-byte output is naturally aligned in most places that such alignment matters.

`SHA-256d` outputs the result of applying the `SHA-256` algorithm to the result of applying the `SHA-256` algorithm to an input.
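
As a rough CPU-side illustration of that definition (not the project's shader code), the sketch below implements `SHA-256` as laid out in FIPS 180-4 and applies it twice. The names `Digest`, `sha256`, `sha256d` and `to_hex` are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Digest = std::array<std::uint8_t, 32>;  // 256 bits / 32 bytes

// Round constants from the SHA-256 specification (FIPS 180-4).
static const std::uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

static std::uint32_t rotr(std::uint32_t x, unsigned n) { return (x >> n) | (x << (32u - n)); }

// Mix one 64-byte block into the running state.
static void compress(std::uint32_t (&state)[8], const std::uint8_t* block) {
    std::uint32_t w[64];
    for (int i = 0; i < 16; ++i) {  // big-endian load of the message words
        w[i] = (std::uint32_t{block[4 * i]} << 24) | (std::uint32_t{block[4 * i + 1]} << 16) |
               (std::uint32_t{block[4 * i + 2]} << 8) | std::uint32_t{block[4 * i + 3]};
    }
    for (int i = 16; i < 64; ++i) {  // message schedule
        const std::uint32_t s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
        const std::uint32_t s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }
    std::uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    std::uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
    for (int i = 0; i < 64; ++i) {
        const std::uint32_t t1 =
            h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (~e & g)) + K[i] + w[i];
        const std::uint32_t t2 =
            (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}

Digest sha256(const std::uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t state[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
                              0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
    // Pad: 0x80, zeros up to 56 mod 64, then the bit length as 64-bit big-endian.
    std::vector<std::uint8_t> msg(data, data + len);
    msg.push_back(0x80);
    while (msg.size() % 64 != 56) msg.push_back(0x00);
    const std::uint64_t bits = static_cast<std::uint64_t>(len) * 8;
    for (int shift = 56; shift >= 0; shift -= 8) msg.push_back(static_cast<std::uint8_t>(bits >> shift));
    for (std::size_t off = 0; off < msg.size(); off += 64) compress(state, msg.data() + off);
    Digest out{};
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 4; ++j) out[4 * i + j] = static_cast<std::uint8_t>(state[i] >> (24 - 8 * j));
    return out;
}

// SHA-256d: SHA-256 applied to the raw 32-byte output of SHA-256.
Digest sha256d(const std::uint8_t* data, std::size_t len) {
    const Digest first = sha256(data, len);
    return sha256(first.data(), first.size());
}

Digest sha256d(const std::string& text) {
    return sha256d(reinterpret_cast<const std::uint8_t*>(text.data()), text.size());
}

std::string to_hex(const Digest& digest) {
    static const char* digits = "0123456789abcdef";
    std::string hex;
    for (std::uint8_t byte : digest) { hex += digits[byte >> 4]; hex += digits[byte & 0x0f]; }
    return hex;
}
```

Note that the second pass hashes the 32 raw bytes of the first digest, not its hex representation.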

#### Vulkan

It was when I got a Steam Deck that I decided to reboot the project based on Vulkan. In part, this was to have something for the Deck to do, but also for me to learn how to work with Vulkan.

#### Subgroups

Merkle root calculation generates a _lot_ of intermediate values, all of which are ultimately discarded. We avoid writing many of those values to memory by using [Subgroups](https://docs.vulkan.org/guide/latest/subgroups.html) (where supported): instead of each reduction invocation taking a pair of inputs and writing a single output, groups of invocations reduce whole sub-trees by sharing intermediate values through subgroup shuffle operations (cf. `#extension GL_KHR_shader_subgroup_shuffle_relative` in the [Vulkan Subgroup Tutorial](https://www.khronos.org/blog/vulkan-subgroup-tutorial)) and write a single output.

Using subgroups in this way also reduces the total number of dispatches needed to calculate the root of the sub-tree for any given slice.
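
To make the idea concrete, here is a small CPU simulation of one subgroup reducing a sub-tree in registers. It is a sketch under assumptions, not the project's shader: `combine` is a toy 64-bit stand-in for `SHA-256d` over a concatenated pair, each element of `lanes` plays the role of one invocation's value, and the inner copy imitates what `subgroupShuffleDown(value, delta)` would return in GLSL. The names `subgroup_reduce` and `pairwise_root` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::uint64_t;  // toy stand-in for a 32-byte digest

// Toy, non-cryptographic, order-sensitive mixer (FNV-1a style); NOT SHA-256d.
static Hash mix(Hash h, Hash v) {
    for (int i = 0; i < 8; ++i) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 0x100000001b3ULL;
    }
    return h;
}
Hash combine(Hash left, Hash right) { return mix(mix(0xcbf29ce484222325ULL, left), right); }

// Reference: one output per pair per pass, with every intermediate level stored.
Hash pairwise_root(std::vector<Hash> nodes) {
    assert(!nodes.empty() && (nodes.size() & (nodes.size() - 1)) == 0);
    while (nodes.size() > 1) {
        std::vector<Hash> next;
        for (std::size_t i = 0; i < nodes.size(); i += 2) next.push_back(combine(nodes[i], nodes[i + 1]));
        nodes.swap(next);
    }
    return nodes.front();
}

// Simulated subgroup: lanes[i] is the value held by invocation i.
// The size must be a power of two, as subgroup sizes are.
Hash subgroup_reduce(std::vector<Hash> lanes) {
    assert(!lanes.empty() && (lanes.size() & (lanes.size() - 1)) == 0);
    for (std::size_t delta = 1; delta < lanes.size(); delta *= 2) {
        // Each lane reads the value held `delta` lanes further along.
        // Out-of-range reads fall back to the lane's own value here; those lanes are idle anyway.
        std::vector<Hash> shuffled(lanes.size());
        for (std::size_t i = 0; i < lanes.size(); ++i)
            shuffled[i] = (i + delta < lanes.size()) ? lanes[i + delta] : lanes[i];
        // Only lanes at multiples of 2*delta still hold a live node of the sub-tree.
        for (std::size_t i = 0; i < lanes.size(); i += 2 * delta)
            lanes[i] = combine(lanes[i], shuffled[i]);
    }
    return lanes[0];  // lane 0 ends up holding the sub-tree root: the single value written out
}
```

In this model a subgroup of `S` lanes collapses `log2(S)` levels of the tree without storing any of the intermediate nodes, which is where the saving in memory writes and dispatches would come from.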

### Basic Flow

The program implements a kind of stream processor. Inputs are read from a stream and accumulated into _batches_; once a given batch is full, or the end of the input stream has been reached, the batch is sent to the GPU to be mapped.
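
The batching step might look something like the following sketch. It is an illustration only: `read_batches`, `Batch` and the `submit` callback are hypothetical names, and the batch capacity is counted in inputs here, whereas a real implementation may well bound batches by bytes instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Batch = std::vector<std::string>;

// Reads line-separated inputs from `in` and hands each batch to `submit`
// as soon as it is full; a final, partially filled batch is flushed at end of stream.
// Returns the number of batches submitted.
std::size_t read_batches(std::istream& in, std::size_t capacity,
                         const std::function<void(const Batch&)>& submit) {
    assert(capacity > 0);
    std::size_t submitted = 0;
    Batch batch;
    std::string line;
    while (std::getline(in, line)) {
        batch.push_back(line);
        if (batch.size() == capacity) {  // batch is full: send it off for mapping
            submit(batch);
            ++submitted;
            batch.clear();
        }
    }
    if (!batch.empty()) {  // end of stream: flush what is left
        submit(batch);
        ++submitted;
    }
    return submitted;
}
```

Fed five lines through a `std::istringstream` with a capacity of 2, this submits batches of 2, 2 and 1 inputs. In the real program `submit` would enqueue GPU work and return without waiting for it.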

_Mapping_ comprises two operations: applying the hash function to inputs and writing the outputs to "device local" memory, which is divided into _slices_. Each such slice holds up to some power of 2 number of hashes, which comprise the leaves of the tree whose root we are looking to calculate, and all slices are the same size.

Once a given slice is full, or the end of the input stream has been reached, the slice is sent for _reduction_. Each reduction calculates the root of the sub-tree of the slice to which the reduction is applied. Once all reductions have concluded, the outputs from each are used to calculate (on the CPU, in contravention of the goals outlined above..) the root of the tree for which they are the leaves.

Once each mapping and reduction concludes, the memory associated with the corresponding batch or slice is immediately returned to the system. Additionally, every mapping and reduction runs asynchronously with respect to every other mapping and reduction as well as reading of any subsequent inputs, and the program does not need to have read in the entire dataset before it can start calculating the Merkle root.

## Non-Functional Outputs

### The Power of the Powers of 2
The key insight which made the whole thing work was in part a by-product of working with different implementations of the Vulkan memory model: namely, the only way to reliably allocate on-device memory is in chunks-at-a-time, and getting shaders/shader invocations to span arbitrary numbers of such chunks is prohibitively difficult (if not impossible).

It is possible to query Vulkan for the amount of memory provided by implementations, the likely amount of such memory available for allocation and the maximum size of any such allocation supported by an implementation. But, as I have discovered, an implementation can reject any request to allocate that satisfies those constraints regardless.

So, the most reliable way appears to be to simply allocate smaller chunks until you have enough or the system runs out. This being the case, to reduce datasets with more than, for example, 8388608 ([256MB](https://gpuopen.com/learn/vulkan-device-memory/) worth of) elements, the program needed to handle datasets for which the corresponding Merkle tree is distributed across non-contiguous memory buffers.
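
The chunk-at-a-time strategy, and the arithmetic behind the 8388608 figure, can be sketched as below. This is an assumption-laden outline rather than the program's allocator: `allocate_chunks` is a hypothetical helper, and `try_allocate` stands in for whatever call actually requests device memory (for instance a wrapper around `vkAllocateMemory`) and reports success or failure.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kDigestBytes = 32;                            // one SHA-256d output
constexpr std::size_t kChunkBytes = std::size_t{256} * 1024 * 1024;  // 256MB
static_assert(kChunkBytes / kDigestBytes == 8388608, "a 256MB chunk holds 8388608 digests");

// Keeps requesting chunks of `chunk_bytes` until `wanted_bytes` is covered
// or the (hypothetical) allocator refuses. Returns the number of chunks obtained.
template <typename TryAllocate>
std::size_t allocate_chunks(std::size_t wanted_bytes, std::size_t chunk_bytes, TryAllocate try_allocate) {
    assert(chunk_bytes > 0);
    std::size_t chunks = 0;
    while (chunks * chunk_bytes < wanted_bytes && try_allocate(chunk_bytes)) ++chunks;
    return chunks;
}
```

The caller has to cope with getting back fewer chunks than it asked for, which is why the rest of the design works on independent, equally sized slices.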

Because Merkle trees are binary trees, they can be divided into `n` sub-trees spanning an equal number of elements, `m`, where `m` is some power of 2, plus up to one additional sub-tree for datasets whose cardinality is not a multiple of `m`. The Merkle root of each of these sub-trees can then be calculated, independently of one another, with special consideration only needing to be given to sub-tree `n+1`, if it exists: it needs to be reduced as if it had the same height as the other sub-trees. (This essentially means that we need to keep iterating on it even after it has been reduced to a single element.)

The Merkle root of the whole dataset can then be calculated as the Merkle root of the tree for which the sub-trees' Merkle roots form the leaves.
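
The decomposition can be checked on the CPU with a short sketch. Two assumptions are made loudly here: `combine` is a toy 64-bit stand-in for `SHA-256d` over a concatenated pair, and an unpaired node is hashed with itself (the Bitcoin convention), which is how I read "keep iterating on it even after it has been reduced to a single element". The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::uint64_t;  // toy stand-in for a 32-byte digest

// Toy, non-cryptographic, order-sensitive mixer (FNV-1a style); NOT SHA-256d.
static Hash mix(Hash h, Hash v) {
    for (int i = 0; i < 8; ++i) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 0x100000001b3ULL;
    }
    return h;
}
Hash combine(Hash left, Hash right) { return mix(mix(0xcbf29ce484222325ULL, left), right); }

// One level of reduction; an unpaired last node is combined with itself (assumption).
std::vector<Hash> reduce_level(const std::vector<Hash>& in) {
    std::vector<Hash> out;
    for (std::size_t i = 0; i < in.size(); i += 2)
        out.push_back(combine(in[i], (i + 1 < in.size()) ? in[i + 1] : in[i]));
    return out;
}

// Root of a whole tree held in one contiguous buffer.
Hash merkle_root(std::vector<Hash> nodes) {
    assert(!nodes.empty());
    while (nodes.size() > 1) nodes = reduce_level(nodes);
    return nodes.front();
}

// Root of one slice, always reduced through `height` levels, so a short
// final slice keeps iterating even once it is down to a single node.
Hash reduce_slice(std::vector<Hash> nodes, unsigned height) {
    assert(!nodes.empty());
    for (unsigned level = 0; level < height; ++level) nodes = reduce_level(nodes);
    return nodes.front();
}

// Root computed slice by slice, with m = 2^height leaves per slice.
Hash sliced_root(const std::vector<Hash>& leaves, unsigned height) {
    const std::size_t m = std::size_t{1} << height;
    if (leaves.size() <= m) return merkle_root(leaves);  // a single slice needs no padding
    std::vector<Hash> roots;
    for (std::size_t begin = 0; begin < leaves.size(); begin += m) {
        const std::size_t end = std::min(begin + m, leaves.size());
        std::vector<Hash> slice(leaves.begin() + static_cast<std::ptrdiff_t>(begin),
                                leaves.begin() + static_cast<std::ptrdiff_t>(end));
        roots.push_back(reduce_slice(slice, height));
    }
    return merkle_root(roots);  // the slice roots are the leaves of the top tree
}
```

Under these assumptions `sliced_root(leaves, height)` and `merkle_root(leaves)` agree for any non-empty input, so the slices never need to sit in one contiguous buffer.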

### Other Numbers

Overall, the program is not terribly quick, but I was curious to see how much time was being spent processing on the GPU, and:

| Platform | Mapping | Reduction (w/Subgroups) | Effective Hashrate | Reduction (w/o Subgroups) | Effective Hashrate |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| Steam Deck (LCD) | ~252MB in ~190ms = ~1.295GB/s | 256MB in ~800ms = ~320MB/s | ~20.9MH/s | 256MB in ~250ms = ~1GB/s | ~67.1MH/s |
| Intel Iris Xe Graphics (12th Gen, integrated) | ~252MB in ~80ms = ~3.039GB/s | 256MB in 300ms = ~853.33MB/s | ~55.9MH/s | 256MB in ~110ms = ~2.3GB/s | ~152.5MH/s |
| NVIDIA RTX 4070 Super (via Thunderbolt) | ~252MB in ~265ms = ~951MB/s | 256MB in _<16ms_ = ~16GB/s | ~1.082GH/s | 256MB in _<9ms_ = ~28GB/s | ~1.864GH/s |

I had only intended on including the numbers from the Steam Deck (since it is a relatively standardised platform) but the results of running the same tests - using the same dataset - against an RTX 4070 Super, even one connected as an eGPU, were .. eyebrow-raising.
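
The README does not spell out how the effective hashrates were derived. One reading that reproduces the table's figures, offered here as my own assumption, is that a 256MB slice holds 8388608 leaf hashes, that reducing it to a root takes one `SHA-256d` per internal node (8388607 of them), and that each `SHA-256d` counts as two hashes. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// SHA-256 applications needed to reduce `leaves` hashes to one root,
// counting each SHA-256d (one per internal node) as two.
double sha256_applications(std::size_t leaves) {
    assert(leaves > 0);
    return 2.0 * static_cast<double>(leaves - 1);
}

// Effective hashrate in hashes per second.
double hashrate(std::size_t leaves, double seconds) {
    assert(seconds > 0.0);
    return sha256_applications(leaves) / seconds;
}

// Throughput in MB/s for `megabytes` of leaf hashes reduced in `seconds`.
double megabytes_per_second(double megabytes, double seconds) {
    assert(seconds > 0.0);
    return megabytes / seconds;
}
```

For example, `hashrate(8388608, 0.25)` comes to roughly 67.1 million hashes per second, in line with the Steam Deck row.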

### To Do, Sometime. Maybe.
#### Improve Throughput
Had I the time/incentive, these are the improvements I would probably work on next:
* where there is at least one in-flight mapping (or reduction) operation and a request to allocate memory for a new batch (or slice) is rejected by the implementation, then block on the completion of the operation and re-use the associated batch (or slice), rather than halting;
* where calculating the Merkle root of a dataset requires more than one reduction, then feed the output of those reductions into a new reduction operation.

Together, these would allow the program to use GPUs to calculate the roots of datasets for which the corresponding Merkle tree would require more memory than is available to the target GPU at runtime, in cases where the dataset can be read in faster than it can be reduced on the GPU.

#### Generate, Output Merkle Proofs

It wouldn't be difficult to modify the program, shader(s), etc. to generate a Merkle proof for a given element of the dataset at the same time as calculating the root: for an indicated leaf, allocate a buffer to hold and write out the intermediate values during reduction.
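
A CPU-side sketch of what that could involve is given below: while reducing, record the sibling of the indicated leaf's ancestor at each level, and the recorded values are the proof. As before, `combine` is a toy 64-bit stand-in for `SHA-256d`, an unpaired node is assumed to be hashed with itself, and `root_with_proof` and `root_from_proof` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::uint64_t;  // toy stand-in for a 32-byte digest

// Toy, non-cryptographic, order-sensitive mixer (FNV-1a style); NOT SHA-256d.
static Hash mix(Hash h, Hash v) {
    for (int i = 0; i < 8; ++i) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 0x100000001b3ULL;
    }
    return h;
}
Hash combine(Hash left, Hash right) { return mix(mix(0xcbf29ce484222325ULL, left), right); }

// Reduces `nodes` to the root and, along the way, appends to `siblings`
// the sibling of the tracked node at every level (bottom to top).
Hash root_with_proof(std::vector<Hash> nodes, std::size_t index, std::vector<Hash>& siblings) {
    assert(index < nodes.size());
    while (nodes.size() > 1) {
        if (nodes.size() % 2 != 0) nodes.push_back(nodes.back());  // pair an odd node with itself
        siblings.push_back(nodes[index ^ 1]);                      // the tracked node's neighbour
        std::vector<Hash> next;
        for (std::size_t i = 0; i < nodes.size(); i += 2) next.push_back(combine(nodes[i], nodes[i + 1]));
        nodes.swap(next);
        index /= 2;  // position of the tracked node's parent
    }
    return nodes.front();
}

// Recomputes the root from a leaf, its position and its proof.
Hash root_from_proof(Hash leaf, std::size_t index, const std::vector<Hash>& siblings) {
    Hash node = leaf;
    for (Hash sibling : siblings) {
        node = (index % 2 == 0) ? combine(node, sibling) : combine(sibling, node);
        index /= 2;
    }
    return node;
}
```

The proof holds one value per level, so for a slice of 8388608 leaves the extra buffer would need room for only 23 hashes.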