{"id":13533273,"url":"https://github.com/Heteroflow/Heteroflow","last_synced_at":"2025-04-01T21:32:11.739Z","repository":{"id":80935827,"uuid":"200889030","full_name":"Heteroflow/Heteroflow","owner":"Heteroflow","description":"Concurrent CPU-GPU Programming using Task Models","archived":false,"fork":false,"pushed_at":"2019-12-19T01:43:18.000Z","size":1654,"stargazers_count":100,"open_issues_count":1,"forks_count":13,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-11-02T20:33:05.894Z","etag":null,"topics":["cpu-gpu-scheduling","cuda","gpu","gpu-acceleration","gpu-computing","gpu-programming","heterogeneous-computing","heterogeneous-parallel-programming","heterogeneous-systems","multithreaded","multithreading","task-parallelism"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Heteroflow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-08-06T16:39:06.000Z","updated_at":"2024-10-20T16:36:56.000Z","dependencies_parsed_at":"2023-03-26T12:50:23.462Z","dependency_job_id":null,"html_url":"https://github.com/Heteroflow/Heteroflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Heteroflow%2FHeteroflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Heteroflow%2FHeteroflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Heteroflow%2FHeteroflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/Heteroflow%2FHeteroflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Heteroflow","download_url":"https://codeload.github.com/Heteroflow/Heteroflow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246713340,"owners_count":20821875,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpu-gpu-scheduling","cuda","gpu","gpu-acceleration","gpu-computing","gpu-programming","heterogeneous-computing","heterogeneous-parallel-programming","heterogeneous-systems","multithreaded","multithreading","task-parallelism"],"created_at":"2024-08-01T07:01:18.254Z","updated_at":"2025-04-01T21:32:11.072Z","avatar_url":"https://github.com/Heteroflow.png","language":"C++","funding_links":[],"categories":["Software"],"sub_categories":["Trends"],"readme":"# Heteroflow \u003cimg align=\"right\" width=\"10%\" src=\"images/heteroflow-logo.png\"\u003e\n\nA header-only C++ library to help you quickly write\nconcurrent CPU-GPU programs using task models\n\n# Why Heteroflow?\n\nParallel CPU-GPU programming is never an easy job to begin with.\nHeteroflow helps you deal with this challenge \nthrough a new *task-based* programming model\nusing modern C++ and [Nvidia CUDA Toolkit][cuda-toolkit].\n\n# Table of Contents\n\n* [Write Your First Heteroflow Program](#write-your-first-heteroflow-program)\n* [Create a Heteroflow Application](#create-a-heteroflow-application)\n   * [Step 1: Create a Heteroflow Graph](#step-1-create-a-heteroflow-graph)\n   * [Step 2: Define Task 
Dependencies](#step-2-define-task-dependencies)\n   * [Step 3: Execute a Heteroflow](#step-3-execute-a-heteroflow)\n* [Visualize a Heteroflow Graph](#visualize-a-heteroflow-graph)\n* [Compile Unit Tests and Examples](#compile-unit-tests-and-examples)\n* [System Requirements](#system-requirements)\n* [Get Involved](#get-involved)\n\n# Write Your First Heteroflow Program\n\nThe code below ([saxpy.cu](./examples/saxpy.cu)) implements\nthe canonical single-precision A·X Plus Y (\"saxpy\") operation.\n\n\n```cpp\n#include \u003cheteroflow/heteroflow.hpp\u003e  // Heteroflow is header-only\n\n__global__ void saxpy(int n, float a, float *x, float *y) {\n  int i = blockIdx.x*blockDim.x + threadIdx.x;\n  if (i \u003c n) y[i] = a*x[i] + y[i];\n}\n\nint main(void) {\n\n  const int N = 1\u003c\u003c20;                    // total items\n  const int B = N*sizeof(float);          // total bytes\n  float* x {nullptr};\n  float* y {nullptr};\n\n  hf::Executor executor;                  // create an executor\n  hf::Heteroflow hf(\"saxpy\");             // create a task dependency graph \n  \n  auto host_x = hf.host([\u0026]{ x = create_vector(N, 1.0f); });\n  auto host_y = hf.host([\u0026]{ y = create_vector(N, 2.0f); }); \n  auto span_x = hf.span(std::ref(x), B);\n  auto span_y = hf.span(std::ref(y), B);\n  auto kernel = hf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, span_x, span_y);\n  auto copy_x = hf.copy(std::ref(x), span_x, B);\n  auto copy_y = hf.copy(std::ref(y), span_y, B);\n  auto verify = hf.host([\u0026]{ verify_result(x, y, N); });\n  auto kill_x = hf.host([\u0026]{ delete_vector(x); });\n  auto kill_y = hf.host([\u0026]{ delete_vector(y); });\n\n  host_x.precede(span_x);                 // host tasks run before span tasks\n  host_y.precede(span_y);\n  kernel.precede(copy_x, copy_y)          // kernel runs before/after copy/span tasks\n        .succeed(span_x, span_y); \n  verify.precede(kill_x, kill_y)          // verifier runs before/after kill/copy tasks\n        
.succeed(copy_x, copy_y); \n\n  executor.run(hf).wait();                // execute the task dependency graph\n}\n```\n\nThe saxpy task dependency graph is shown in the following figure:\n\n![SaxpyTaskGraph](images/saxpy.png)\n\n\nCompile and run the code with the following commands:\n\n```bash\n~$ nvcc saxpy.cu -std=c++14 -O2 -o saxpy -I path/to/Heteroflow/header\n~$ ./saxpy\n```\n\nHeteroflow is header-only. Simply copy the entire folder \n[heteroflow/](heteroflow/) to your project and add the include path accordingly.\nSee [System Requirements](#system-requirements)\nfor the detailed system specification and compilation environment.\n\n\n# Create a Heteroflow Application\n\nHeteroflow manages concurrent CPU-GPU programming \nusing a *task dependency graph* model.\nEach node in the graph represents either a CPU (host) task \nor a GPU (device) task.\nEach edge indicates\na dependency constraint between two tasks.\nMost applications are developed through the following steps:\n\n## Step 1: Create a Heteroflow Graph\n\nCreate a heteroflow object to start a task dependency graph:\n\n```cpp\nhf::Heteroflow hf;\nhf.name(\"MyHeteroflow\");  // assigns a name to the heteroflow object\n```\n\nEach task belongs to one of the following categories: \n*host*, *span*, *fill*, *copy*, and *kernel*.\n\n\n### Task Type #1: Host Task\n\nA host task is a callable for which [std::invoke][std::invoke] is applicable\non any CPU core.\n\n```cpp\nhf::HostTask host = hf.host([](){ std::cout \u003c\u003c \"my host task\\n\"; });\n```\n\n### Task Type #2: Span Task\n\nA span task allocates memory on a GPU device. 
\nThe code below creates a span task that allocates\n256 bytes of uninitialized storage on a GPU device.\n\n```cpp\nhf::SpanTask span = hf.span(256);\n```\n\nAlternatively, you can create a span task that allocates storage initialized\nfrom a host memory area.\nThe code below creates a span task that allocates a device memory block\nwith size and value equal to the data in `vec`.\n\n\n```cpp\nstd::vector\u003cint\u003e vec(256, 0);\nhf::SpanTask span = hf.span(vec.data(), 256*sizeof(int));\n```\n\nHeteroflow performs GPU memory operations through *span* tasks\nrather than raw pointers.\nThis layer of abstraction allows users to focus on building\nefficient task graphs with transparent scalability to manycore CPUs \nand multiple GPUs.\n\n \n### Task Type #3: Fill Task\n\nA fill task sets a GPU memory area managed by a span task to a given value\n*byte by byte*.\nThe code below creates fill tasks that set each byte\nin the specified range of a GPU memory block managed by a span task\nto zero.\n\n```cpp\n// sets each byte in [0, 1024) of span to 0\nhf::FillTask fill1 = hf.fill(span, 1024, 0);      \n\n// sets each byte in [1000, 1020) of span to 0\nhf::FillTask fill2 = hf.fill(span, 1000, 20, 0);  \n```\n\n### Task Type #4: Copy Task\n\nA copy task performs data transfers in one of three directions:\n*host to device* (H2D), *device to device* (D2D), or *device to host* (D2H).\nThe code below creates copy tasks that transfer\ndata from a host memory area to a GPU memory block managed by a span task.\n\n```cpp\nstd::string str(\"H2D data transfers\");\n\n// copies the entire string to the span\nhf::CopyTask h2d1 = hf.copy(span, str.data(), str.size());  \n\n// copies 3 bytes (characters) from the host string to [10, 13) of span\nhf::CopyTask h2d2 = hf.copy(span, 10, str.data(), 3);       \n```\n\nThe code below creates copy tasks that transfer\ndata from a GPU memory block managed by a span task to a host memory area.\n\n```cpp\nstd::string str(\"D2H data 
transfers\");\n\n// copies 10 bytes from span to the host string\nhf::CopyTask d2h1 = hf.copy(str.data(), span, 10);\n\n// copies 10 bytes from [5, 15) of span to the host string\nhf::CopyTask d2h2 = hf.copy(str.data(), span, 5, 10);\n```\n\nThe code below creates copy tasks that transfer data between\ntwo GPU memory blocks managed by two span tasks.\n\n```cpp\n// copies 100 bytes from src_span to tgt_span\nhf::CopyTask d2d1 = hf.copy(tgt_span, src_span, 100);\n\n// copies 100 bytes from [5, 105) of src_span to tgt_span\nhf::CopyTask d2d2 = hf.copy(tgt_span, src_span, 5, 100);\n\n// copies 100 bytes from src_span to [10, 110) of tgt_span\nhf::CopyTask d2d3 = hf.copy(tgt_span, 10, src_span, 100);\n\n// copies 100 bytes from [10, 110) of src_span to [20, 120) of tgt_span\nhf::CopyTask d2d4 = hf.copy(tgt_span, 20, src_span, 10, 100);\n```\n\n\n### Task Type #5: Kernel Task\n\nA kernel task offloads a kernel function to a GPU device.\nHeteroflow abstracts GPU memory through span tasks \nto facilitate the design of task scheduling with automatic GPU device mapping.\nEach span task manages a GPU memory pointer that\nimplicitly converts to the pointer type \nof the corresponding kernel parameter when a kernel task is bound to a kernel function.\nThe code below demonstrates the creation of a kernel task.\n\n```cpp\n// GPU kernel to set each entry of an integer array to a given value\n__global__ void gpu_set(int* data, size_t N, int value) {\n  size_t i = blockIdx.x*blockDim.x + threadIdx.x;\n  if (i \u003c N) {\n    data[i] = value;\n  }\n}\n\n// creates a span task that allocates raw storage for 65536 integers\nhf::SpanTask span = hf.span(65536*sizeof(int));\n\n// kernel execution configuration\ndim3 grid  {(65536+256-1)/256, 1, 1};\ndim3 block {256, 1, 1};\nsize_t Ns  {0};\n\n// creates a kernel task to offload gpu_set to a GPU device\nhf::KernelTask k1 = hf.kernel(\n  grid,           // dimension of the grid\n  block,          // dimension of the block\n  Ns,             // number of bytes in 
shared memory\n  gpu_set,        // kernel function to offload\n  span,           // 1st argument to pass to the kernel function\n  65536,          // 2nd argument to pass to the kernel function\n  1               // 3rd argument to pass to the kernel function\n); \n```\n\nHeteroflow gives users full privileges to \ncraft a [CUDA][cuda-zone] kernel \nthat is commensurate with their domain knowledge.\nUsers focus on developing high-performance kernel tasks using \nthe native CUDA programming toolkit,\nwhile leaving task parallelism to Heteroflow.\n\n### Access/Modify Task Attributes\n\nYou can query or modify the attributes of a task directly\nfrom its handle.\n\n```cpp\n// names a task and queries the task name\ntask.name(\"my task\");\nstd::cout \u003c\u003c task.name();\n\n// queries if a task is empty\nstd::cout \u003c\u003c \"task is empty? \" \u003c\u003c (task.empty() ? \"yes\" : \"no\");\n\n// queries the in/out degree of a task\nstd::cout \u003c\u003c task.num_dependents() \u003c\u003c '/' \u003c\u003c task.num_successors();\n```\n\n### Placeholder Tasks\n\nSometimes, you may need to initialize a task after its creation.\nHeteroflow allows users to create a *placeholder* for each task type\nwith storage allocated in advance.\n\n```cpp\n// creates a placeholder for host task\nhf::HostTask host = hf.placeholder\u003chf::HostTask\u003e();\n\n// creates a placeholder for span task\nhf::SpanTask span = hf.placeholder\u003chf::SpanTask\u003e();\n\n// creates a placeholder for fill task\nhf::FillTask fill = hf.placeholder\u003chf::FillTask\u003e();\n\n// creates a placeholder for copy task\nhf::CopyTask copy = hf.placeholder\u003chf::CopyTask\u003e();\n\n// creates a placeholder for kernel task\nhf::KernelTask kernel = hf.placeholder\u003chf::KernelTask\u003e();\n```\n\nEach task handle provides the same methods as the heteroflow object \nto initialize its content.\n\n```cpp\nhost.host([](){}).name(\"assign an empty lambda\");\nspan.span(256).name(\"allocate a 256-byte 
uninitialized storage\");\nfill.fill(span, 0).name(\"fill the span with 0\");\ncopy.copy(span, host_ptr, 256).name(\"copy 256 bytes from host_ptr to span\");\nkernel.kernel(1, 256, 0, my_kernel, span, 256).name(\"offload my_kernel onto a GPU\");\n\nhost.precede(span);     // span runs after host\nspan.precede(fill);     // fill runs after span\nfill.precede(copy);     // copy runs after fill\ncopy.precede(kernel);   // kernel runs after copy\n```\n\n\n## Step 2: Define Task Dependencies\n\nYou can add dependency links between tasks to force one task to run after another.\nThe dependency links must form a\n[Directed Acyclic Graph (DAG)](https://en.wikipedia.org/wiki/Directed_acyclic_graph).\nYou can add a preceding link to force one task to run before another.\n\n```cpp\nA.precede(B);        // A runs before B\nA.precede(C, D, E);  // A runs before C, D, and E\n```\n\nOr you can add a succeeding link to force one task to run after another.\n\n```cpp\nA.succeed(B);        // A runs after B\nA.succeed(C, D, E);  // A runs after C, D, and E\n```\n\n## Step 3: Execute a Heteroflow\n\nTo execute a heteroflow, you need to create an *executor*.\nAn executor manages a set of worker threads to execute \ndependent tasks in a heteroflow\nthrough an efficient *work-stealing* algorithm.\n\n```cpp\nhf::Executor executor;\n```\n\nYou can configure an executor to operate on a fixed degree of CPU-GPU \nparallelism.\nThe code below creates 32 worker threads to schedule and execute CPU tasks\nand 4 worker threads for the GPU counterpart.\n\n```cpp\nhf::Executor executor(32, 4);  // 32 and 4 threads to work on CPU and GPU tasks, respectively\n```\n\nThe executor provides many methods to run a heteroflow.\nYou can run a heteroflow one time, multiple times, or \nuntil a stopping criterion is met.\nThese methods are *non-blocking* and return a [std::future][std::future]\nto let you query the execution status.\nAll executor methods are *thread-safe*.\n\n```cpp\nstd::future\u003cvoid\u003e r1 = 
executor.run(heteroflow);       // run heteroflow once\nstd::future\u003cvoid\u003e r2 = executor.run_n(heteroflow, 2);  // run heteroflow twice\n\n// keep running heteroflow until the predicate becomes true (4 times in this example)\nexecutor.run_until(heteroflow, [counter=4]() mutable { return --counter == 0; } );\n```\n\nYou can call `wait_for_all` to block until all associated heteroflows complete.\n\n```cpp\nexecutor.wait_for_all();  // blocks until all running heteroflows finish\n```\n\nNote that an executor does not own any heteroflow. \nIt is your responsibility to keep a heteroflow alive during its execution;\notherwise, the result is undefined behavior.\nFor instance, the code below can lead to a crash.\n\n```cpp\nhf::Executor executor;\n{\n  hf::Heteroflow scoped_heteroflow;\n  scoped_heteroflow.span(256);\n  // ... build dependent tasks\n  executor.run(scoped_heteroflow);\n}  // scoped_heteroflow is destroyed here while executor might still be running its tasks\n```\n\nIn most applications, you need only one executor to run multiple heteroflows,\neach representing a specific part of your parallel decomposition.\n\n## Stateful Execution\n\nWhen you create a task, the heteroflow object marshals all arguments\nalong with a unique task execution function to form a \n*stateful closure* using C++ lambda and reference wrapper [std::ref][std::ref].\nAny changes to referenced variables will be visible to the execution\ncontext of the task.\nStateful execution enables flexible runtime controls\nfor *fine-grained* task parallelism.\nUsers can partition a large workload into small parallel blocks and append\ndependencies between tasks to keep variable states consistent.\nThe code snippet below demonstrates this concept.\n\n```cpp\n__global__ void my_kernel(int* ptr, size_t N);  // custom kernel\n\nint* data {nullptr};\nsize_t size{0};\ndim3 grid;\n\nauto host = heteroflow.host([\u0026] () {     // captures everything by reference\n  data = new int[1000];                  // 
changes data and size at runtime\n  size = 1000*sizeof(int);\n  grid = (1000+256-1)/256;               // changes the kernel execution shape\n});\n\n// new data and size values are visible to this span task's execution context\nauto span = heteroflow.span(std::ref(data), std::ref(size))\n                      .succeed(host);\n\n// new grid size is visible to this kernel task's execution context\nauto kernel = heteroflow.kernel(std::ref(grid), 256, 0, my_kernel, span, 1000)\n                        .succeed(span);\n```\n\nAll arguments forwarded to a task construction method, \nexcept `SpanTask`, which is always captured by copy,\ncan be made stateful through [std::ref][std::ref].\n\n\n\n\n# Visualize a Heteroflow Graph\n\nVisualization is a great way to inspect a task graph\nfor refinement or debugging purposes.\nYou can dump a heteroflow graph in [DOT format][dot-format]\nand visualize it through free online [GraphViz][GraphViz] tools.\n\n```cpp\nhf::Heteroflow hf;\n\nauto ha = hf.host([](){}).name(\"allocate_a\");\nauto hb = hf.host([](){}).name(\"allocate_b\");\nauto hc = hf.host([](){}).name(\"allocate_c\");\nauto sa = hf.span(1024).name(\"span_a\");\nauto sb = hf.span(1024).name(\"span_b\");\nauto sc = hf.span(1024).name(\"span_c\");\nauto op = hf.kernel({(1024+32-1)/32}, 32, 0, fn_kernel, sa, sb, sc).name(\"kernel\");\nauto cc = hf.copy(host_data, sc, 1024).name(\"copy_c\");\n  \nha.precede(sa);\nhb.precede(sb);\nop.succeed(sa, sb, sc).precede(cc);\ncc.succeed(hc);\n\nhf.dump(std::cout);  // dump the graph in DOT format to standard output\n```\n\nThe program generates the following graph drawn by \n[Graphviz Online](https://dreampuf.github.io/GraphvizOnline/):\n\n\u003cimg align=\"right\" src=\"images/visualization.png\" width=\"50%\"\u003e\n\n```bash\ndigraph p0x7ffc17d62b40 {\n  rankdir=\"TB\";\n  p0x510[label=\"allocate_a\"];\n  p0x510 -\u003e p0xdc0;\n  p0xc10[label=\"allocate_b\"];\n  p0xc10 -\u003e p0xe90;\n  
p0xcf0[label=\"allocate_c\"];\n  p0xcf0 -\u003e p0x100;\n  p0xdc0[label=\"span_a\"];\n  p0xdc0 -\u003e p0x030;\n  p0xe90[label=\"span_b\"];\n  p0xe90 -\u003e p0x030;\n  p0xf60[label=\"span_c\"];\n  p0xf60 -\u003e p0x030;\n  p0x030[label=\"kernel\" shape=\"box3d\"];\n  p0x030 -\u003e p0x100;\n  p0x100[label=\"copy_c\"];\n}\n```\n\n\n\n# Compile Unit Tests and Examples\n\nHeteroflow uses [CMake](https://cmake.org/) to build examples and unit tests.\nWe recommend an out-of-source build.\n\n```bash\n~$ cmake --version  # must be 3.9 or higher\n~$ mkdir build\n~$ cd build\n~$ cmake ../\n~$ make \n```\n\n## Unit Tests\n\nWe use CMake's testing framework to run all unit tests.\n\n```bash\n~$ make test\n```\n\n## Examples\n\nThe folder [examples/](./examples) contains a number of practical CPU-GPU applications and is a great place to learn to use Heteroflow.\n\n| Example |  Description |\n| ------- |  ----------- | \n| [saxpy.cu](./examples/saxpy.cu) | implements a saxpy (single-precision A·X Plus Y) task graph |\n| [matrix-multiplication.cu](./examples/matrix-multiplication.cu)| implements two matrix multiplication task graphs, with and without GPU |\n\n# System Requirements\n\nTo use Heteroflow, you need [Nvidia's CUDA Compiler (NVCC)][nvcc] \nversion 9.0 or later, which supports the C++14 standard.\n\n# Get Involved\n\n+ Report bugs/issues by submitting a [GitHub issue][GitHub issues]\n+ Submit contributions using [pull requests][GitHub pull requests]\n+ Visit a curated list of [awesome parallel computing resources](https://github.com/tsung-wei-huang/awesome-parallel-computing)\n\n# License\n\nHeteroflow is licensed under the [MIT License](./LICENSE).\n\n* * *\n\n[std::ref]:              https://en.cppreference.com/w/cpp/utility/functional/ref\n[span::data]:            https://en.cppreference.com/w/cpp/container/span/data\n[std::invoke]:           https://en.cppreference.com/w/cpp/utility/functional/invoke\n[std::future]:           
https://en.cppreference.com/w/cpp/thread/future\n[cuda-zone]:             https://developer.nvidia.com/cuda-zone\n[nvcc]:                  https://developer.nvidia.com/cuda-llvm-compiler\n[cuda-toolkit]:          https://developer.nvidia.com/cuda-toolkit\n\n[GitHub issues]:         https://github.com/heteroflow/heteroflow/issues\n[GitHub insights]:       https://github.com/heteroflow/heteroflow/pulse\n[GitHub pull requests]:  https://github.com/heteroflow/heteroflow/pulls\n\n[dot-format]:            https://en.wikipedia.org/wiki/DOT_(graph_description_language)\n[GraphViz]:              https://www.graphviz.org/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHeteroflow%2FHeteroflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHeteroflow%2FHeteroflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHeteroflow%2FHeteroflow/lists"}