{"id":15684533,"url":"https://github.com/itzmeanjan/blake3","last_synced_at":"2025-08-18T16:06:03.847Z","repository":{"id":44902257,"uuid":"445055439","full_name":"itzmeanjan/blake3","owner":"itzmeanjan","description":"SYCL accelerated BLAKE3 Hash Implementation","archived":false,"fork":false,"pushed_at":"2022-01-22T15:37:44.000Z","size":106,"stargazers_count":16,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-07T18:10:06.306Z","etag":null,"topics":["avx2","avx512","binary-merklization","blake3","cpu","cryptographic-hash-functions","cuda","dpcpp","gpu","gpu-computing","merkle-tree","sycl"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/itzmeanjan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-06T05:45:19.000Z","updated_at":"2025-03-14T14:53:05.000Z","dependencies_parsed_at":"2022-09-10T20:22:09.870Z","dependency_job_id":null,"html_url":"https://github.com/itzmeanjan/blake3","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/itzmeanjan/blake3","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/itzmeanjan","download_ur
l":"https://codeload.github.com/itzmeanjan/blake3/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271019447,"owners_count":24685688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avx2","avx512","binary-merklization","blake3","cpu","cryptographic-hash-functions","cuda","dpcpp","gpu","gpu-computing","merkle-tree","sycl"],"created_at":"2024-10-03T17:18:32.195Z","updated_at":"2025-08-18T16:06:03.332Z","avatar_url":"https://github.com/itzmeanjan.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# blake3\nSYCL accelerated BLAKE3 Hash Implementation\n\n## Motivation\n\nIn recent times I've been exploring data parallel programming domain using SYCL, which is a heterogeneous accelerator programming API. Few weeks back I completed writing Zk-STARK friendly [Rescue Prime Hash using SYCL](https://github.com/itzmeanjan/ff-gpu/), then I decided to take a look at BLAKE3, because blake3's algorithmic construction naturally lends itself for heavy parallelism. 
Compared to Rescue Prime Hash, BLAKE3 should harness an accelerator's compute capability much better when the input size is relatively large ( say \u003e= 1MB ).\n\nThe SYCL -backed Rescue Prime implementation shines when there are lots of (short) independent inputs and a Rescue Prime Hash can be executed independently on each of them, because Rescue Prime can be vectorized but doesn't inherently offer much scope for (multi-threaded/ OpenCL work-item based) parallelism.\n\nOn the other hand, the SYCL implementation of BLAKE3 performs well when the (single) input size is \u003e= 1MB, because each 1KB chunk of input can then be compressed in parallel, a very good fit for data parallel acceleration. After that, BLAKE3 is simply Binary Merkle Tree construction, which itself is highly parallelizable, _though it requires multi-phase kernel enqueue (increasing host-device interaction) due to the hierarchical structure of the Binary Merkle Tree, which results in data dependence_.\n\nIn the following implementation I heavily use SYCL2020's USM, which allows me to work with the much more familiar pointer arithmetic. I also use SYCL's vector intrinsics ( i.e. 4 -element array of type `sycl::uint4` ) for representing/ operating on the hash state of BLAKE3. Another way to accelerate BLAKE3 (as proposed in the specification) is compressing multiple chunks in parallel by distributing the hash state of those participating chunks across 16 vectors, each with N -lanes, where N = # -of chunks being compressed together. N can generally be {2, 4, 8, 16}. I've implemented that scheme under namespace `blake3::v2::*`, while the simpler variant is placed under namespace `blake3::v1::*`.\n\nI've also written a Binary Merklization implementation using the BLAKE3 2-to-1 hash function, which takes N -many leaf nodes of some binary tree and produces all intermediate nodes. Note, here `N = 2 ^ i | i = {1, 2, ...}`. For binary merklization, each BLAKE3 hash invocation takes 64 -bytes of input and produces 32 -bytes of output. 
Those 64 -bytes of input is nothing but two concatenated BLAKE3 digests.\n\n**I strongly suggest you go through (hyperlinked below) BLAKE3 specification's section 5.3 for understanding where I got this idea from.**\n\n\u003e I followed BLAKE3 [specification](https://github.com/BLAKE3-team/BLAKE3-specs/blob/ac78a717924dd9e6f16f547baa916c6f71470b1a/blake3.pdf) and used Rust reference [implementation](https://github.com/BLAKE3-team/BLAKE3/blob/da4c792d8094f35c05c41c9aeb5dfe4aa67ca1ac/reference_impl/reference_impl.rs) as my guide while writing SYCL implementation.\n\n\u003e **Note,** at this moment to keep Merkle Tree construction both easy and simple, this SYCL implementation can only generate BLAKE3 digest when input has power of 2 -many chunks, given each chunk of size 1KB. That means minimum input size should be 2KB, after that it can be increased as 4KB, 8KB ....\n\n\u003e If input size is not \u003e= 1MB, you probably don't want to use this implementation, because submitting job ( read enqueuing kernels ) to accelerator is not cheap and all those (required) ceremonies might defeat the whole purpose and essence of acceleration.\n\n## Prerequisites\n\n- Ensure you've Intel SYCL/ DPC++ compiler toolchain. 
See [here](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html) for downloading precompiled binaries.\n- If you happen to be interested in running on Nvidia GPU; you have to compile Intel's open-source llvm-based SYCL implementation from source; see [here](https://intel.github.io/llvm-docs/GetStartedGuide.html#prerequisites).\n- For running test cases, which uses Rust Blake3 [implementation](https://docs.rs/blake3/1.2.0/blake3) for assertion, you'll need to have Rust `cargo` toolchain installed; get that [here](https://rustup.rs/)\n- I'm on\n\n```bash\n$ lsb_release -d\nDescription:    Ubuntu 20.04.3 LTS\n```\n\n- Using Intel's SYCL/ DPC++ compiler version\n\n```bash\n$ dpcpp --version\nIntel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)\nTarget: x86_64-unknown-linux-gnu\nThread model: posix\nInstalledDir: /opt/intel/oneapi/compiler/2022.0.1/linux/bin-llvm\n```\n\n- For CUDA backend on Nvidia Tesla V100 GPU, I used Intel's `clang++` version\n\n```bash\n$ clang++ --version\nclang version 14.0.0 (https://github.com/intel/llvm dc9bd3fafdeacd28528eb4b1fef3ad9b76ef3b92)\nTarget: x86_64-unknown-linux-gnu\nThread model: posix\n```\n\n- I'm on `rustc` version\n\n```bash\n$ rustc --version\nrustc 1.59.0-nightly (efec54529 2021-12-04)\n```\n\n- You'll also need `make` utility for running test/ benchmark etc.\n- For formatting `C++` source consider using `clang-format` tool\n\n```bash\nmake format\n```\n\n## Usage\n\nThis is a header only library; so clone this repo and include [blake3.hpp](./include/blake3.hpp) in your SYCL project.\n\n```cpp\n// Find full example https://github.com/itzmeanjan/blake3/blob/1de036a/test/src/main.cpp\n\n#include \"blake3.hpp\"\n#include \u003ciostream\u003e\n\nint main() {\n    sycl::device d{ sycl::default_selector{} }; // choose sycl device\n    sycl::queue q{ d };                         // make sycl queue\n\n    // @note\n    // At this moment only power of 2 -many chunks are supported\n    
// meaning input size will be `chunk_count * chunk_size` -bytes\n    //\n    // chunk_size   = 1024 bytes\n    // chunk_count  = 2^i, where i = {1, 2, ...}\n\n    // allocate input/ output memory\n    // fill input with data\n    // see https://github.com/itzmeanjan/blake3/blob/095e80f/test/src/main.cpp#L15-L37\n\n    // invoke hasher; last argument denotes execution doesn't need to be timed\n    blake3::v1::hash(q, in_d, i_size, chunk_count, wg_size, out_d, nullptr); // either\n\n    blake3::v2::hash(q, in_d, i_size, chunk_count, wg_size, out_d, nullptr); // or\n    // see https://github.com/itzmeanjan/blake3/blob/095e80f/test/src/main.cpp#L40-L43\n\n    // deallocate heap memory\n\n    return 0;\n}\n```\n\nFor Binary Merklization implementation consider including [merklize.hpp](./include/merklize.hpp) into your SYCL project. You may want to see [this](https://github.com/itzmeanjan/blake3/blob/d4085fcbb77dbbb8ce2b0748e0b973889044a8ff/include/bench_merklize.hpp#L42-L55) for example.\n\n## Test\n\nFor executing accompanying test cases run\n\n```bash\nBLAKE3_SIMD_LANES=2 make; make clean\nBLAKE3_SIMD_LANES=4 make; make clean\nBLAKE3_SIMD_LANES=8 make; make clean\nBLAKE3_SIMD_LANES=16 make; make clean\n```\n\nwhich prepares random input of 1MB; then applies BLAKE3 using [Rust implementation](https://docs.rs/blake3/1.2.0/blake3) and both of my [SYCL implementations of BLAKE3](https://github.com/itzmeanjan/blake3/blob/b459e95539fbc203f48bccbccd356ff21c1a59b6/include/blake3.hpp). Finally both of these 32 -bytes digests are asserted. It also asserts BLAKE3 2-to-1 hashing implementation which is used for Binary Merklization. 
✅\n\nImplementation | Comment\n--- | ---\n`blake3::v1::hash(...)` | Each SYCL work-item compresses one and only one chunk\n`blake3::v2::hash(...)` | Each SYCL work-item can compress either 2/ 4/ 8/ 16 contiguous chunks; selectable using `BLAKE3_SIMD_LANES`\n`blake3::v1::merge(...)` | Takes 64 -bytes input ( two BLAKE3 digests ) and produces 32 -bytes output digest; this is called BLAKE3 2-to-1 hashing, which is used in Binary Merklization\n\n## Dockerised Testing\n\nFor running test cases inside a Docker container (without installing any dependencies on your host, except `docker` itself) consider using the provided Dockerfile.\n\nBuild image\n\n```bash\ndocker build -t blake3-test . # can be time consuming\n```\n\nThen run test cases inside container\n\n```bash\ndocker run blake3-test\n```\n\n## Benchmark\n\n### BLAKE3 Hash Function\n\nThe following benchmark results report the\n\n- kernel execution time\n- time required to transfer input bytes to device\n- time needed to transfer 32 -bytes digest back to host\n\nwhen computing the BLAKE3 hash ( v1 \u0026 v2 ) using the SYCL implementation, for the input size given in the first column. Input is generated on host; then explicitly transferred to accelerator because I'm using `sycl::malloc_host` and `sycl::malloc_device` for heap allocation; finally the computed BLAKE3 digest ( i.e. 32 -bytes ) is transferred back to host. *None of these data transfer costs are included in kernel execution time*. For benchmarking purposes, I enable profiling in the SYCL queue and take the sum of the differences between each kernel enqueue event's start and end times. 
I've also used a static SYCL work-group size of 32 for each of these execution rounds; a total of 8 rounds is executed for each row before taking the average of the obtained kernel execution time/ host \u003c-\u003e device data transfer time.\n\n- [On Nvidia GPU](./results/nvidia_gpu.md)\n- [On Intel GPU](./results/intel_gpu.md)\n- [On Intel CPU](./results/intel_cpu.md)\n\n### Binary Merklization using BLAKE3 2-to-1 Hash Function\n\nBelow I'm presenting benchmark results of Binary Merklization using BLAKE3 2-to-1 hashing. The four columns shown are as follows\n\n\nField | Description\n--- | ---\nleaf count | input binary tree's leaf count [ `note, this is always power of 2` ]\nexecution time | time spent executing all kernels which are enqueued for computing all intermediate nodes of specified binary tree with N -many leaf nodes\nhost-to-device data tx cost | time required to transfer (leaf_count \u003c\u003c 5) -bytes random input to accelerator [ `because each leaf node is a BLAKE3 digest` ]\ndevice-to-host data tx cost | time spent on transferring all (leaf_count - 1) -many intermediate nodes back to host\n\nI prepare random input of (leaf_count \u003c\u003c 5) -bytes on host, which is explicitly transferred to accelerator using SYCL USM API. As soon as input is ready to be operated on, binary merklization begins and computes all intermediate nodes of the Merkle Tree in multiple rounds. At the end, all these intermediate nodes are brought back to host. I've enabled SYCL queue profiling, which I use for timing all events I get after enqueuing commands, i.e. data transfer/ kernel execution etc.\n\n\u003e Note, this Binary Merklization implementation only works when the leaf count is a power of 2.\n\n\u003e For all these benchmarks, I'm using a static SYCL work-group size of 32. 
[ **changing it to runtime decision should be explored !** ]\n\n- [On Nvidia GPU](./results/merklization/nvidia_gpu.md)\n- [On Intel GPU](./results/merklization/intel_gpu.md)\n- [On Intel CPU](./results/merklization/intel_cpu.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitzmeanjan%2Fblake3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fitzmeanjan%2Fblake3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitzmeanjan%2Fblake3/lists"}