{"id":15833673,"url":"https://github.com/itzmeanjan/blake3-fpga","last_synced_at":"2025-03-15T07:32:04.534Z","repository":{"id":43740279,"uuid":"455214998","full_name":"itzmeanjan/blake3-fpga","owner":"itzmeanjan","description":"BLAKE3 on FPGA","archived":false,"fork":false,"pushed_at":"2022-02-21T00:55:27.000Z","size":1354,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-10-06T13:41:33.342Z","etag":null,"topics":["blake3","dpcpp","fpga","high-level-synthesis","sycl"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/itzmeanjan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-02-03T15:17:52.000Z","updated_at":"2024-04-16T09:55:59.000Z","dependencies_parsed_at":"2022-08-21T21:20:19.479Z","dependency_job_id":null,"html_url":"https://github.com/itzmeanjan/blake3-fpga","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3-fpga","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3-fpga/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3-fpga/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itzmeanjan%2Fblake3-fpga/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/itzmeanjan","download_url":"https://codeload.github.com/itzmeanjan/blake3-fpga/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"http
s://github.com","kind":"github","repositories_count":243701276,"owners_count":20333615,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blake3","dpcpp","fpga","high-level-synthesis","sycl"],"created_at":"2024-10-05T13:41:21.143Z","updated_at":"2025-03-15T07:32:04.197Z","avatar_url":"https://github.com/itzmeanjan.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# blake3-fpga\nBLAKE3 on FPGA\n\n## Design\n\nImagine input byte array of 2048 -bytes to be hashed using BLAKE3, meaning input has 2 chunks, because BLAKE3 chunk size is 1KB. Each chunk has 16 message blocks, each of length 64 -bytes ( read 16 message words, because BLAKE3 word size is 32 -bit ). Each chunk is required to be compressed 16 times sequentially ( because it consists of 16 message blocks ) --- output chaining value of i-th message block compression is used as input chaining value of (i + 1)-th message block, while first message block's input chaining value is constant initial hash values and 0 \u003c= i \u003c= 14. Due to this data dependency, in following FPGA design of BLAKE3, I compress j-th message block of i-th chunk, then j-th message block of (i + 1)-th chunk and it continues until we reach last chunk's j-th message block. All these N -many output chaining values of j-th message block compression are written to global memory. Now in next iteration it's time to compress (j + 1)-th message block for each of N -many chunks, while using j-th message block compression's output chaining values as input chaining values for respective chunk. 
This way all 16 message blocks are compressed for each of N -many chunks and those N -many output chaining values are considered leaf nodes of Binary Merkle Tree. Now computing BLAKE3 digest is simply finding root of Merkle Tree, while all intermediate nodes are computed by BLAKE3 `compress( ... )` function.\n\n![blake3-design-on-fpga](pic/blake3-fpga-design.png)\n\nIn above design diagram, you may want to follow the color coding to find out how 16 message blocks of each chunk are scheduled for compression in *chunk compression* phase.\n\n\u003e You may want to see BLAKE3 targeting multi-core CPU/ GPGPU, written using SYCL/ DPC++; see [here](https://github.com/itzmeanjan/blake3)\n\n\u003e BLAKE3 [specification](https://github.com/BLAKE3-team/BLAKE3-specs/blob/ac78a717924dd9e6f16f547baa916c6f71470b1a/blake3.pdf) which I followed during this design.\n\n\u003e BLAKE3 reference [implementation](https://github.com/BLAKE3-team/BLAKE3/blob/da4c792d8094f35c05c41c9aeb5dfe4aa67ca1ac/reference_impl/reference_impl.rs) was also helpful.\n\n## Prerequisite\n\nI'm on\n\n```bash\nlsb_release -d\n\nDescription:    Ubuntu 20.04.3 LTS\n```\n\nwhile using `dpcpp` as SYCL compiler\n\n```bash\ndpcpp --version\n\nIntel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)\nTarget: x86_64-unknown-linux-gnu\nThread model: posix\nInstalledDir: /opt/intel/oneapi/compiler/2022.0.2/linux/bin-llvm\n```\n\nYou'd probably like to get Intel oneAPI basekit, which has everything required for FPGA development. 
See [here](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).\n\nUse `sycl-ls` utility to see if you can emulate FPGA design to check for functional correctness.\n\n```bash\nsycl-ls\n\n[opencl:0] ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.13.11.0.23_160000]\n[opencl:0] CPU : Intel(R) OpenCL 3.0 [2021.13.11.0.23_160000]\n[host:0] HOST: SYCL host platform 1.2 [1.2]\n```\n\nFor running FPGA h/w synthesis and execution, you need to use Intel Devcloud.\n\n## Usage\n\nYou can check functional correctness of BLAKE3 implementation on CPU by emulation.\n\n```bash\nmake\n```\n\nYou probably would like to see optimization report, which can be generated on non-FPGA attached host; just installing oneAPI basekit allows doing this.\n\n```bash\nmake fpga_opt_test\n```\n\n*You also have the option of running benchmark on CPU emulation, but you don't want to use those numbers as actual benchmark.*\n\n```bash\nmake fpga_emu_bench # don't take them as actual benchmark !\n```\n\nFor running FPGA h/w test/ benchmark you'll need to go through **long** h/w synthesis phase, which can be executed on Intel Devcloud platform. 
See [here](https://devcloud.intel.com/oneapi/get_started/opencl).\n\n### Job Submission\n\nFor easing FPGA h/w compilation/ execution job submissions on Intel Devcloud platform, I use following scripts.\n\nAssuming you're in root of this project\n\n```bash\ngit clone https://github.com/itzmeanjan/blake3-fpga.git\ncd blake3-fpga\n```\n\n#### Compilation Flow\n\nCreate job submission bash script\n\n```bash\ntouch build_fpga_bench_hw.sh\n```\n\nAnd populate it with following content\n\n```bash\n#!/bin/bash\n\n# file name: build_fpga_hw.sh\n\n# env setup\nexport PATH=/glob/intel-python/python2/bin/:${PATH}\nsource /opt/intel/inteloneapi/setvars.sh \u003e /dev/null 2\u003e\u00261\n\n# hardware compilation\n#\n# or use `fpga_hw_test`\ntime make fpga_hw_bench\n```\n\nNow submit compilation job targeting Intel Arria 10 board, while noting down job id\n\n```bash\nqsub -l nodes=1:fpga_compile:ppn=2 -l walltime=24:00:00 -d . build_fpga_bench_hw.sh\n\n# note down job id e.g. 1850154\n```\n\n**Note :** If you happen to be interested in targeting Intel Stratix 10 board, consider using following compilation command instead of above Make build recipe.\n\n```bash\n# hardware compilation\ntime dpcpp -Wall -std=c++20 -I./include -O3 -DFPGA_HW -fintelfpga -Xshardware -Xsboard=intel_s10sx_pac:pac_s10 -reuse-exe=benchmark/fpga_hw.out benchmark/main.cpp -o benchmark/fpga_hw.out\n\n# or consider reading Makefile\n```\n\nAnd finally submit job on `fpga_compile` enabled VM with same command shown as above.\n\n#### Execution Flow\n\nCreate job submission shell script\n\n```bash\ntouch run_fpga_bench_hw.sh\n```\n\nAnd populate it with environment setup and binary execution commands\n\n```bash\n#!/bin/bash\n\n# file name: run_fpga_hw.sh\n\n# env setup\nexport PATH=/glob/intel-python/python2/bin/:${PATH}\nsource /opt/intel/inteloneapi/setvars.sh \u003e /dev/null 2\u003e\u00261\n\n# hardware image execution\n#\n# if testing using `fpga_hw_test` recipe,\n# consider using `pushd test`\npushd 
benchmark; ./fpga_hw.out; popd\n```\n\nNow submit execution job on VM, enabled with `fpga_runtime` capability \u0026 Intel Arria 10 board, while creating job dependency chain, which will ensure as soon as **long** FPGA h/w synthesis is completed, h/w image execution will start running\n\n```bash\nqsub -l nodes=1:fpga_runtime:arria10:ppn=2 -d . run_fpga_bench_hw.sh -W depend=afterok:1850154\n\n# use compilation flow job id ( e.g. 1850154 ) to create dependency chain\n```\n\n**Note :** If you compiled h/w image targeting Intel Stratix 10 board, consider using following job submission command\n\n```bash\nqsub -l nodes=1:fpga_runtime:stratix10:ppn=2 -d . run_fpga_bench_hw.sh -W depend=afterok:1850157\n\n# place proper compilation job id ( e.g. 1850157 ), to form dependency chain\n```\n\nAfter completion of compilation/ execution job submission, consider checking status using\n\n```bash\nwatch -n 1 qstat -n -1\n\n# or just `qstat -n -1`\n```\n\nWhen completed, following command(s) should reveal newly created files, having stdout/ stderr output of compilation/ execution flow in `{build|run}_fpga_bench_hw.sh.{o|e}1850157` files\n\n```bash\nls -lhrt   # created files shown towards end of list\ngit status # untracked, newly created files\n```\n\n\u003e Note, I found [this](https://devcloud.intel.com/oneapi/documentation/job-submission) guide on job submission helpful.\n\n## Benchmark\n\nI've h/w synthesized BLAKE3 design targeting Intel Arria 10 board on Intel Devcloud platform.\n\n```bash\nrunning on pac_a10 : Intel PAC Platform (pac_ee00000)\n\nBenchmarking BLAKE3 FPGA implementation\n\n              input size                  execution time                host-to-device tx time          device-to-host tx time\n                   1 MB                  138.487375 us                   421.993125 us                    61.492625 us\n                   2 MB                  259.962875 us                   696.308000 us                    85.265375 us\n                   4 MB  
                496.056500 us                     1.091515 ms                    84.755625 us\n                   8 MB                  985.286625 us                     1.674159 ms                    82.991875 us\n                  16 MB                    1.941844 ms                     3.048934 ms                    63.686000 us\n                  32 MB                    3.848572 ms                     6.395413 ms                    71.175375 us\n                  64 MB                    7.703389 ms                    12.999837 ms                    82.688000 us\n                 128 MB                   15.282095 ms                    23.696630 ms                    87.255250 us\n                 256 MB                   31.177414 ms                    44.916047 ms                    84.804875 us\n                 512 MB                   61.033350 ms                    86.641925 ms                    98.026375 us\n                1024 MB                  122.169668 ms                   170.085517 ms                    90.151250 us\n```\n\nNote, this design can benefit from replicating the data path which compresses message blocks many more times ( currently replication factor is set to 1 ), but that comes with increased resource consumption. This primary design, which interacts with global memory quite often, slows down due to high global memory access latency.\n\nFuture efforts to improve this design can go into reducing interaction with global memory system and increasing usage of on-chip ( stall-free ) BRAM for double buffering purposes, while synthesizing more ( power of 2 -many ) replicas of BLAKE3 `compress( ... 
)` function, at cost of higher resource usage.\n\n\u003e I've also experimented with SYCL pipe based design pattern ( in BLAKE3 context ) where producer ( read orchestrator ) \u003c-\u003e consumer ( read compressor ) pattern is utilized, reducing global memory access; but it turns out that due to hierarchical data dependency in BLAKE3 binary merkle tree, that pattern doesn't yield much useful results and pipe ends up slowing down due to stalling on both ends.\n\n**👇 are taken from final report generated after FPGA h/w synthesis, targeting Intel Arria 10 board**\n\n### Quartus Fitter Summary\n\n![quartus-fitter](pic/quartus-fitter.png)\n\n### Design Clocking at\n\n![clock-freq](pic/clock-freq.png)\n\n### Loop Pipelining/ II/ fMAX/ latency\n\n![loop-status](pic/loop-status.png)\n\n### Resource Usage\n\n![resource-usage](pic/resource-usage.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitzmeanjan%2Fblake3-fpga","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fitzmeanjan%2Fblake3-fpga","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitzmeanjan%2Fblake3-fpga/lists"}