nanothread — Minimal thread pool for task parallelism
https://github.com/mitsuba-renderer/nanothread
Host: GitHub
URL: https://github.com/mitsuba-renderer/nanothread
Owner: mitsuba-renderer
License: bsd-3-clause
Created: 2020-11-26T21:16:04.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2024-05-08T08:17:26.000Z (8 months ago)
Last Synced: 2024-05-08T09:32:17.788Z (8 months ago)
Language: C++
Homepage:
Size: 64.5 KB
Stars: 49
Watchers: 7
Forks: 6
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# nanothread — Minimal thread pool for task parallelism

## Introduction

This library provides a minimal cross-platform interface for task parallelism.
Given a computation that is partitioned into a set of interdependent tasks, the
library efficiently distributes this work to a thread pool using lock-free
queues, while respecting dependencies between tasks.

Each task is associated with a callback function that is potentially invoked
multiple times if the task consists of multiple work units. This whole
process is arbitrarily recursive: task callbacks can submit further jobs, wait
for their completion, etc. Parallel loops, reductions, and more complex
graph-based computations are easily realized using these abstractions.

This project is internally implemented in C++11, but exposes the main
functionality using a pure C99 API, along with a header-only C++11 convenience
wrapper. It has no dependencies other than CMake and a C++11-capable compiler.
The entire project requires fewer than 1000 lines of header and
implementation code (according to [cloc](http://cloc.sourceforge.net/)).

This library is part of the larger
[Dr.Jit](https://github.com/mitsuba-renderer/drjit) project and parallelizes
workloads generated by the
[Dr.Jit-Core](https://github.com/mitsuba-renderer/drjit-core) library. However,
this project has no dependencies on these parent projects and can be used in
any other context.

## Why?

Many of my previous projects have built on
[Intel's Threading Building Blocks](https://software.intel.com/content/www/us/en/develop/tools/threading-building-blocks.html)
for exactly this type of functionality. Unfortunately, large portions of TBB's
task interface were recently deprecated as part of the oneAPI / oneTBB
transition. Rather than struggling with this complex dependency, I decided to
build something minimal and stable that satisfies my requirements.

## Examples (C++11 interface)

The following examples showcase the C++11 interface, which is a thin header-only
layer over the C99 API.

### Parallel for loops (synchronous)

```cpp
template <typename Int, typename Func>
void parallel_for(const blocked_range<Int> &range, Func &&func, Pool *pool = nullptr);
```

This function submits a single task consisting of arbitrarily many work units
that are processed in blocks of a specified size, and waits for their
completion. If no thread pool ``Pool *`` is specified, the default pool will be
used (and created on the fly, if needed).

Example:

```cpp
#include <nanothread/nanothread.h>
#include <cstdint>
#include <cstdio>

namespace dr = drjit;

int main(int, char **) {
    int result[100];

    // Call the provided lambda function 100 times with blocks of size 1
    dr::parallel_for(
        dr::blocked_range<uint32_t>(/* begin = */ 0, /* end = */ 100, /* block_size = */ 1),

        // The callback is allowed to be a stateful lambda function
        [&](dr::blocked_range<uint32_t> range) {
            for (uint32_t i = range.begin(); i != range.end(); ++i) {
                printf("Worker thread %u is starting to process work unit %u\n",
                       pool_thread_id(), i);

                // Write to variables defined in the caller's frame
                result[i] = i;
            }
        }
    );
}
```

Small amounts of work that only consist of a single block will immediately be
executed on the calling thread instead of involving the thread pool. Exceptions
occurring during parallel execution will be captured and re-thrown by
``dr::parallel_for``.
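
To make the block terminology concrete, here is a minimal sketch of how a range
could be cut into blocks of a given size. It uses only the standard library and
does not call nanothread. `split_into_blocks` is a hypothetical helper, and the
library's internal partitioning may well differ.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of nanothread): split [begin, end) into
// consecutive half-open blocks of at most 'block_size' elements.
std::vector<std::pair<uint32_t, uint32_t>>
split_into_blocks(uint32_t begin, uint32_t end, uint32_t block_size) {
    assert(block_size > 0 && begin <= end);
    std::vector<std::pair<uint32_t, uint32_t>> blocks;
    for (uint32_t start = begin; start < end; ) {
        // The last block may be shorter than 'block_size'
        uint32_t remaining = end - start;
        uint32_t stop = start + (remaining < block_size ? remaining : block_size);
        blocks.emplace_back(start, stop);
        start = stop;
    }
    return blocks;
}
```

With `begin = 0`, `end = 100`, and `block_size = 1`, this yields 100 blocks, one
per element. A range that fits into a single block produces exactly one entry,
which is the case that would run directly on the calling thread.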

### Parallel for loops (asynchronous)

Parallel `for` loops can also run asynchronously. In that case, the function
immediately returns a ``Task *`` handle that can be used to wait for
completion, or to schedule *child tasks*, whose execution will be delayed until
all parents have completed.

```cpp
template <typename Int, typename Func>
Task *parallel_for_async(const blocked_range<Int> &range, Func &&func,
                         std::initializer_list<Task *> parents = { },
                         Pool *pool = nullptr);
```

The returned task handle must eventually be released using the functions
``task_release(Task *)`` (which is instantaneous) or
``task_wait_and_release(Task *)`` (which blocks until the task has terminated).
A failure to do so will leak memory.

Example:

```cpp
#include <nanothread/nanothread.h>

namespace dr = drjit;

int main(int, char **) {
    // Schedule task 1
    Task *task_1 = dr::parallel_for_async(
        dr::blocked_range<uint32_t>(/* ... */),
        [&](dr::blocked_range<uint32_t> range) { /* ... */ }
    );

    // Schedule task 2
    Task *task_2 = dr::parallel_for_async(
        dr::blocked_range<uint32_t>(/* ... */),
        [&](dr::blocked_range<uint32_t> range) { /* ... */ }
    );

    // Schedule task 3 ...
    Task *task_3 = dr::parallel_for_async(
        dr::blocked_range<uint32_t>(/* ... */),
        [&](dr::blocked_range<uint32_t> range) { /* ... */ },
        { task_1, task_2 } // ... <- but don't execute until these tasks are done
    );

    // The parent handles are no longer needed once task 3 has been scheduled
    task_release(task_1);
    task_release(task_2);

    // Block until task 3 has finished, then release its handle
    task_wait_and_release(task_3);
}
```

If a task only consists of single-threaded work that cannot easily be converted
into a parallel ``for`` loop, the function ``do_async`` provides a more
convenient interface that is analogous to ``parallel_for_async`` with a
``blocked_range`` of size 1.

```cpp
template <typename Func>
Task *do_async(Func &&func, std::initializer_list<Task *> parents = {},
               Pool *pool = nullptr);
```

## Examples (C99 interface)

The following code fragment submits a single task consisting of 100 work units
and waits for its completion.

```c
#include <nanothread/nanothread.h>
#include <stdint.h>
#include <stdio.h>

// Task callback function. Will be called with index = 0..99
void my_task(uint32_t index, void *payload) {
    printf("Worker thread %u is starting to process work unit %u\n",
           pool_thread_id(), index);

    // Use payload to communicate some data to the caller
    ((uint32_t *) payload)[index] = index;
}

int main(int argc, char** argv) {
    uint32_t temp[100];

    // Create a worker per CPU thread
    Pool *pool = pool_create(NANOTHREAD_AUTO);

    // Synchronous interface: submit a task and wait for it to complete
    task_submit_and_wait(
        pool,
        100,     // How many work units does this task contain?
        my_task, // Function to be executed
        temp     // Optional payload, will be passed to function
    );

    // .. contents of 'temp' are now ready ..

    // Clean up used resources
    pool_destroy(pool);
}
```

Tasks can also be executed *asynchronously*, in which case extra steps must be
added to wait for tasks, and to release task handles.

```c
/// Heap-allocate scratch space for inter-task communication
uint32_t *payload = malloc(100 * sizeof(uint32_t));

/// Submit a task and return immediately
Task *task_1 = task_submit(
    pool,
    100,       // How many work units does this task contain?
    my_task_1, // Function to be executed
    payload,   // Optional payload, will be passed to function
    0,         // Size of the payload (only relevant if it should be copied)
    NULL,      // Payload deletion callback (none: task_2 frees the payload)
    0          // Enforce asynchronous execution even if task is small?
);

/// Submit a task that is dependent on other tasks (specifically task_1)
Task *task_2 = task_submit_dep(
    pool,
    &task_1,   // Pointer to a list of parent tasks
    1,         // Number of parent tasks
    100,       // How many work units does this task contain?
    my_task_2, // Function to be executed
    payload,   // Optional payload, will be passed to function
    0,         // Size of the payload (only relevant if it should be copied)
    free,      // Call free(payload) once this task completes
    0          // Enforce asynchronous execution even if task is small?
);

/* Now that the parent-child relationship is specified,
   the handle of task 1 can be released */
task_release(task_1);

// Wait for the completion of task 2 and also release its handle
task_wait_and_release(task_2);
```
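
One common way to implement "run the child only after all parents are done" is
an atomic counter of outstanding parents. The following C++ sketch illustrates
that idea under stated assumptions: `DepNode` and `parent_finished` are
hypothetical names, and nothing here is taken from nanothread's actual `Task`
implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical node (not nanothread's Task structure): tracks how many
// parents still have to finish before this task may run.
struct DepNode {
    std::atomic<uint32_t> pending_parents;
    explicit DepNode(uint32_t parents) : pending_parents(parents) {}
};

// Called once for every parent that finishes. Returns true for exactly one
// caller, namely the one that completes the last outstanding parent; that
// caller would then move the child into the queue of ready tasks.
bool parent_finished(DepNode &child) {
    uint32_t before =
        child.pending_parents.fetch_sub(1, std::memory_order_acq_rel);
    assert(before > 0); // more completions than registered parents is a bug
    return before == 1;
}
```

Because the decrement is a single atomic operation, two parents finishing at
the same time cannot both conclude that they were the last one.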

## Documentation

The complete API is documented in the file
[nanothread/nanothread.h](https://github.com/mitsuba-renderer/nanothread/blob/master/include/nanothread/nanothread.h).

## Technical details

This library follows a lock-free design: tasks that are ready for execution are
stored in a [Michael-Scott
queue](https://www.cs.rochester.edu/u/scott/papers/1996_PODC_queues.pdf) that
is continuously polled by workers, and task submission/removal relies on atomic
compare-and-swap (CAS) operations. Workers that idle for more than roughly 50
milliseconds are put to sleep until more work becomes available.
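
The CAS retry pattern can be illustrated with a much simpler structure than a
Michael-Scott queue. The sketch below is not nanothread's implementation, and
`claim_unit` and `run_workers` are hypothetical names. Several threads claim
work-unit indices from a shared counter using `compare_exchange_weak`, so each
index is handed out exactly once without taking a mutex.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: try to claim the next unclaimed index in [0, total).
// Returns false once all indices have been handed out.
bool claim_unit(std::atomic<uint32_t> &next, uint32_t total, uint32_t &out) {
    uint32_t cur = next.load(std::memory_order_relaxed);
    while (cur < total) {
        // On failure, 'cur' is refreshed with the current value and we retry
        if (next.compare_exchange_weak(cur, cur + 1,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
            out = cur;
            return true;
        }
    }
    return false;
}

// Hypothetical driver: 'thread_count' threads process 'total' work units and
// record how often each unit was executed.
std::vector<uint32_t> run_workers(uint32_t thread_count, uint32_t total) {
    assert(thread_count > 0);
    std::vector<uint32_t> hits(total, 0);
    std::atomic<uint32_t> next(0);
    std::vector<std::thread> threads;
    for (uint32_t t = 0; t < thread_count; ++t)
        threads.emplace_back([&] {
            uint32_t index;
            while (claim_unit(next, total, index))
                ++hits[index]; // each index is owned by exactly one thread
        });
    for (std::thread &th : threads)
        th.join();
    return hits;
}
```

A thread that loses the race simply retries with the updated counter value
instead of being put to sleep.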

The lock-free design is important: the central data structures of a task
submission system are heavily contended, and traditional abstractions (e.g.
``std::mutex``) will immediately put contending threads to sleep to defer lock
resolution to the OS kernel. The associated context switches produce an
extremely large overhead that can make a parallel program orders of magnitude
slower than a single-threaded version.

The implementation catches exceptions that occur while executing parallel work
and re-throws them on the caller's thread (this part is of no relevance for
software written in C99).
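
In standard C++, this kind of cross-thread exception transport is typically
built on `std::exception_ptr`. The following sketch shows the general mechanism
with a plain `std::thread`; `run_and_rethrow` is a hypothetical helper, and the
library's own bookkeeping is not shown here and may differ.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical helper: run 'work' on another thread. If it throws, capture
// the exception and re-throw it on the calling thread after joining.
void run_and_rethrow(const std::function<void()> &work) {
    assert(work != nullptr);
    std::exception_ptr error;
    std::thread worker([&] {
        try {
            work();
        } catch (...) {
            // Letting the exception escape the thread would call std::terminate
            error = std::current_exception();
        }
    });
    worker.join(); // join() makes the write to 'error' visible here
    if (error)
        std::rethrow_exception(error);
}
```

The caller can then wrap `run_and_rethrow(...)` in an ordinary `try`/`catch`
block, as if the work had run on its own thread.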

The functions ``task_wait()`` and ``task_wait_and_release()`` do not just
wait: they spend the wait time fetching and executing work from the task
queue. This has two implications. First, it is not wasteful to wait for the
completion of another task while executing a task. Second, the thread pool can
be set to a size of zero via ``pool_create(0)`` or ``pool_set_size(pool, 0)``,
in which case the program will still run correctly without launching any
additional threads.
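
The idea of working while waiting can be sketched with a miniature queue that
has no worker threads at all. `HelpingQueue` is a hypothetical class written
for illustration only. For brevity it guards its queue with a mutex, unlike the
lock-free queue described above, and it waits for the whole queue to drain
rather than for one specific task.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical miniature pool with zero worker threads
class HelpingQueue {
public:
    void submit(std::function<void()> job) {
        assert(job != nullptr);
        std::lock_guard<std::mutex> guard(mutex_);
        jobs_.push_back(std::move(job));
    }

    // Instead of blocking, the waiting thread pops and runs jobs itself
    // until the queue is empty. Jobs may call submit() recursively.
    void wait_all() {
        for (;;) {
            std::function<void()> job;
            {
                std::lock_guard<std::mutex> guard(mutex_);
                if (jobs_.empty())
                    return;
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job(); // run outside the lock so the job can submit more work
        }
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> jobs_;
};
```

Since the waiting thread performs the work itself, all submitted jobs complete
even though no additional thread was ever started.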