An open API service indexing awesome lists of open source software.

https://github.com/bshoshany/thread-pool

BS::thread_pool: a fast, lightweight, modern, and easy-to-use C++17 / C++20 / C++23 thread pool library
https://github.com/bshoshany/thread-pool

concurrency cplusplus cplusplus-17 cplusplus-20 cplusplus-23 cpp cpp17 cpp20 cpp20-modules cpp23 easy-to-use high-performance high-performance-computing multithreading parallel scientific-computing thread-pool threading threadpool

Last synced: about 1 year ago
JSON representation

BS::thread_pool: a fast, lightweight, modern, and easy-to-use C++17 / C++20 / C++23 thread pool library

Awesome Lists containing this project

README

          

[![Author: Barak Shoshany](https://img.shields.io/badge/author-Barak_Shoshany-009933)](https://baraksh.com/)
[![DOI: 10.1016/j.softx.2024.101687](https://img.shields.io/badge/DOI-10.1016%2Fj.softx.2024.101687-b31b1b)](https://doi.org/10.1016/j.softx.2024.101687)
[![arXiv:2105.00613](https://img.shields.io/badge/arXiv-2105.00613-b31b1b)](https://arxiv.org/abs/2105.00613)
[![License: MIT](https://img.shields.io/github/license/bshoshany/thread-pool)](https://github.com/bshoshany/thread-pool/blob/master/LICENSE.txt)
[![Language: C++17 / C++20 / C++23](https://img.shields.io/badge/Language-C%2B%2B17%20%2F%20C%2B%2B20%20%2F%20C%2B%2B23-yellow)](https://cppreference.com/)
[![GitHub stars](https://img.shields.io/github/stars/bshoshany/thread-pool?style=flat&color=009999)](https://github.com/bshoshany/thread-pool/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/bshoshany/thread-pool?style=flat&color=009999)](https://github.com/bshoshany/thread-pool/forks)
[![GitHub release](https://img.shields.io/github/v/release/bshoshany/thread-pool?color=660099)](https://github.com/bshoshany/thread-pool/releases)
[![Vcpkg version](https://img.shields.io/vcpkg/v/bshoshany-thread-pool?color=6600ff)](https://vcpkg.io/)
[![Meson WrapDB](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fraw.githubusercontent.com%2Fmesonbuild%2Fwrapdb%2Fmaster%2Freleases.json&query=%24%5B%22bshoshany-thread-pool%22%5D.versions%5B0%5D&label=wrapdb&color=6600ff)](https://mesonbuild.com/Wrapdb-projects.html)
[![Conan version](https://img.shields.io/conan/v/bshoshany-thread-pool?color=6600ff)](https://conan.io/center/recipes/bshoshany-thread-pool)
[![Open in Visual Studio Code](https://img.shields.io/badge/Open_in_Visual_Studio_Code-007acc)](https://vscode.dev/github/bshoshany/thread-pool)

# `BS::thread_pool`: a fast, lightweight, modern, and easy-to-use C++17 / C++20 / C++23 thread pool library

By **Barak Shoshany**\
Email: \
Website: \
GitHub:

This is the complete documentation for **v5.0.0** of the library, released on **2024-12-19**.

* [Introduction](#introduction)
* [Motivation](#motivation)
* [Overview of features](#overview-of-features)
* [Getting started](#getting-started)
* [Installing the library](#installing-the-library)
* [Compiling and compatibility](#compiling-and-compatibility)
* [Constructors](#constructors)
* [Getting and resetting the number of threads in the pool](#getting-and-resetting-the-number-of-threads-in-the-pool)
* [Submitting tasks to the queue](#submitting-tasks-to-the-queue)
* [Submitting tasks with no arguments and receiving a future](#submitting-tasks-with-no-arguments-and-receiving-a-future)
* [Submitting tasks with arguments and receiving a future](#submitting-tasks-with-arguments-and-receiving-a-future)
* [Detaching and waiting for tasks](#detaching-and-waiting-for-tasks)
* [Waiting for submitted or detached tasks with a timeout](#waiting-for-submitted-or-detached-tasks-with-a-timeout)
* [Class member functions as tasks](#class-member-functions-as-tasks)
* [Parallelizing loops](#parallelizing-loops)
* [Automatic parallelization of loops](#automatic-parallelization-of-loops)
* [Optimizing the number of blocks](#optimizing-the-number-of-blocks)
* [Common index types](#common-index-types)
* [Parallelizing loops without futures](#parallelizing-loops-without-futures)
* [Parallelizing individual indices vs. blocks](#parallelizing-individual-indices-vs-blocks)
* [Loops with return values](#loops-with-return-values)
* [Parallelizing sequences](#parallelizing-sequences)
* [More about `BS::multi_future`](#more-about-bsmulti_future)
* [Utility classes](#utility-classes)
* [Synchronizing printing to a stream with `BS::synced_stream`](#synchronizing-printing-to-a-stream-with-bssynced_stream)
* [Synchronizing tasks with `BS::counting_semaphore` and `BS::binary_semaphore`](#synchronizing-tasks-with-bscounting_semaphore-and-bsbinary_semaphore)
* [Managing tasks](#managing-tasks)
* [Monitoring the tasks](#monitoring-the-tasks)
* [Purging tasks](#purging-tasks)
* [Exception handling](#exception-handling)
* [Getting information about the current thread](#getting-information-about-the-current-thread)
* [Thread initialization functions](#thread-initialization-functions)
* [Thread cleanup functions](#thread-cleanup-functions)
* [Passing task arguments by constant reference](#passing-task-arguments-by-constant-reference)
* [Optional features](#optional-features)
* [Enabling features](#enabling-features)
* [Setting task priority](#setting-task-priority)
* [Pausing the pool](#pausing-the-pool)
* [Avoiding wait deadlocks](#avoiding-wait-deadlocks)
* [Native extensions](#native-extensions)
* [Enabling the native extensions](#enabling-the-native-extensions)
* [Setting thread priority](#setting-thread-priority)
* [Setting thread affinity](#setting-thread-affinity)
* [Setting thread names](#setting-thread-names)
* [Setting process priority](#setting-process-priority)
* [Setting process affinity](#setting-process-affinity)
* [Accessing native thread handles](#accessing-native-thread-handles)
* [Testing the library](#testing-the-library)
* [Automated tests](#automated-tests)
* [Performance tests](#performance-tests)
* [Finding the version of the library](#finding-the-version-of-the-library)
* [Importing the library as a C++20 module](#importing-the-library-as-a-c20-module)
* [Compiling the module](#compiling-the-module)
* [Compiling with `compile_cpp.py` using `import BS.thread_pool`](#compiling-with-compile_cpppy-using-import-bsthread_pool)
* [Compiling with Clang using `import BS.thread_pool`](#compiling-with-clang-using-import-bsthread_pool)
* [Compiling with GCC using `import BS.thread_pool`](#compiling-with-gcc-using-import-bsthread_pool)
* [Compiling with MSVC using `import BS.thread_pool`](#compiling-with-msvc-using-import-bsthread_pool)
* [Compiling with CMake using `import BS.thread_pool`](#compiling-with-cmake-using-import-bsthread_pool)
* [Importing the C++23 Standard Library as a module](#importing-the-c23-standard-library-as-a-module)
* [Enabling `import std`](#enabling-import-std)
* [Compiling with `compile_cpp.py` using `import std`](#compiling-with-compile_cpppy-using-import-std)
* [Compiling with Clang and LLVM libc++ using `import std`](#compiling-with-clang-and-llvm-libc-using-import-std)
* [Compiling with MSVC and Microsoft STL using `import std`](#compiling-with-msvc-and-microsoft-stl-using-import-std)
* [Compiling with CMake using `import std`](#compiling-with-cmake-using-import-std)
* [Installing the library using package managers](#installing-the-library-using-package-managers)
* [Installing using vcpkg](#installing-using-vcpkg)
* [Installing using Conan](#installing-using-conan)
* [Installing using Meson](#installing-using-meson)
* [Installing using CMake with CPM](#installing-using-cmake-with-cpm)
* [Installing using CMake with `FetchContent`](#installing-using-cmake-with-fetchcontent)
* [Complete library reference](#complete-library-reference)
* [The `BS::thread_pool` class template](#the-bsthread_pool-class-template)
* [Optional features and the template parameter](#optional-features-and-the-template-parameter)
* [The `BS::this_thread` class](#the-bsthis_thread-class)
* [The native extensions](#the-native-extensions)
* [The `BS::multi_future` class](#the-bsmulti_future-class)
* [The `BS::synced_stream` class](#the-bssynced_stream-class)
* [The `BS::version` class](#the-bsversion-class)
* [Diagnostic variables](#diagnostic-variables)
* [All names exported by the C++20 module](#all-names-exported-by-the-c20-module)
* [Development tools](#development-tools)
* [The `compile_cpp.py` script](#the-compile_cpppy-script)
* [Other included tools](#other-included-tools)
* [About the project](#about-the-project)
* [Bug reports and feature requests](#bug-reports-and-feature-requests)
* [Contribution and pull request policy](#contribution-and-pull-request-policy)
* [Starring the repository](#starring-the-repository)
* [Acknowledgements](#acknowledgements)
* [Copyright and citing](#copyright-and-citing)
* [About the author](#about-the-author)
* [Learning more about C++](#learning-more-about-c)
* [Other projects to check out](#other-projects-to-check-out)

## Introduction

### Motivation

Multithreading is essential for modern high-performance computing. Since C++11, the C++ standard library has included built-in low-level multithreading support using constructs such as `std::thread`. However, `std::thread` creates a new thread each time it is called, which can have a significant performance overhead. Furthermore, it is possible to create more threads than the hardware can handle simultaneously, potentially resulting in a substantial slowdown.

The library presented here contains a C++ thread pool class, `BS::thread_pool`, which avoids these issues by creating a fixed pool of threads once and for all, and then continuously reusing the same threads to perform different tasks throughout the lifetime of the program. By default, the number of threads in the pool is equal to the maximum number of threads that the hardware can run in parallel.

The user submits tasks to be executed into a queue. Whenever a thread becomes available, it retrieves the next task from the queue and executes it. The pool optionally produces an `std::future` for each task, which allows the user to wait for the task to finish executing and/or obtain its eventual return value, if applicable. Threads and tasks are autonomously managed by the pool in the background, without requiring any input from the user aside from submitting the desired tasks.

The design of this library is guided by four important principles. First, *compactness*: the entire library consists of just one self-contained header file, with no other components or dependencies. Second, *portability*: the library only utilizes the C++ standard library, without relying on any compiler extensions or 3rd-party libraries, and is therefore compatible with any modern standards-conforming C++ compiler on any platform, as long as it supports C++17 or later. Third, *ease of use*: the library is extensively documented, and programmers of any level should be able to use it right out of the box.

The fourth and final guiding principle is *performance*: each and every line of code in this library was carefully designed with maximum performance in mind, and performance was tested and verified on a variety of compilers and platforms. Indeed, the library was originally designed for use in the author's own computationally-intensive scientific computing projects, running both on high-end desktop/laptop computers and high-performance computing nodes.

Among the available C++ thread pool libraries, `BS::thread_pool` occupies the crucial middle ground between small bare-bones thread pool classes that offer rudimentary functionality and are only suitable for simple programs, and very large libraries that offer many advanced features but consist of multiple components and dependencies and involve complex APIs that require a substantial time investment to learn. `BS::thread_pool` was designed for users who want a simple and lightweight header-only library that is easy to learn and use, and can be readily incorporated into existing or new projects, but do not want to compromise on performance or functionality.

Obtaining the library is quick and easy; it can be downloaded manually from [the GitHub repository](https://github.com/bshoshany/thread-pool), or installed automatically using a variety of package managers and build systems. The library can be imported either as a traditional [header-only library](#installing-the-library), or as a modern [C++20 module](#importing-the-library-as-a-c20-module). `BS::thread_pool` has undergone extensive testing on multiple platforms and is actively used by thousands of C++ developers worldwide for a wide range of applications, from scientific computing to game development.

### Overview of features

* **Fast:**
* Built from scratch with [maximum performance](#performance-tests) in mind.
* Suitable for use in high-performance computing nodes with a very large number of CPU cores.
* Reusing threads avoids the overhead of creating and destroying them for individual tasks.
* A task queue ensures that there are never more tasks running in parallel than is allowed by the hardware.
* All optional features can be selectively turned on to ensure minimal overhead.
* **Lightweight:**
* Single header file: simply [`#include "BS_thread_pool.hpp"`](#installing-the-library) and you're all set!
* Header-only: no need to install or build the library.
* Self-contained: no external requirements or dependencies.
* Portable: uses only the C++ standard library, and works with any C++17-compliant compiler on any platform.
* Only 487 lines of code, including all optional features and utility classes (excluding comments, blank lines, lines containing only a single brace, C++17 polyfills, and native extensions).
* **Modern:**
* Fully supports C++17, C++20, and C++23, taking advantage of the latest language features when available for maximum performance, reliability, and usability.
* In C++20, the library can an be imported as a C++20 module using [`import BS.thread_pool`](#importing-the-library-as-a-c20-module), with many benefits, such as faster compilation times and avoiding namespace pollution.
* In C++23, the library can import the C++ Standard Library as a module using [`import std`](#importing-the-c23-standard-library-as-a-module) on supported compilers and platforms.
* Makes use of modern C++ programming practices for readability, maintainability, performance, safety, portability, and reliability.
* **Easy to use:**
* Very simple operation, using only a handful of member functions for basic use, with many additional member functions, classes, and functions for more advanced use.
* Every task submitted to the queue using [`submit_task()`](#submitting-tasks-to-the-queue) automatically generates an `std::future`, which can be used to wait for the task to finish executing, obtain its eventual return value, and/or catch any thrown exceptions.
* Loops can be automatically parallelized into any number of tasks using [`submit_loop()`](#parallelizing-loops), which returns a [`BS::multi_future`](#more-about-bsmulti_future) that can be used to track the execution of all parallel tasks at once.
* If futures are not needed, tasks may be submitted using [`detach_task()`](#detaching-and-waiting-for-tasks), and loops can be parallelized using [`detach_loop()`](#parallelizing-loops-without-futures) - sacrificing convenience for even greater performance. In that case, `wait()`, `wait_for()`, and `wait_until()` can be used to wait for all the tasks in the queue to complete.
* Extremely thorough and detailed documentation, with numerous examples, is available in the library's [`README.md` file](https://github.com/bshoshany/thread-pool/blob/master/README.md), with a total of 3,359 lines and 25,506 words!
* The code is thoroughly documented using Doxygen comments - not only the interface, but also the implementation, in case the user would like to make modifications.
* Optionally, the included Python script [`compile_cpp.py`](#the-compile_cpppy-script) can be used to easily compile any programs that are using the library, with full support for C++20 modules and C++23 Standard Library modules where applicable.
* **Additional features:**
* Get the current thread count of the pool using [`get_thread_count()`](#getting-and-resetting-the-number-of-threads-in-the-pool).
* Change the number of threads in the pool safely and on-the-fly as needed using [`reset()`](#getting-and-resetting-the-number-of-threads-in-the-pool).
* Monitor the number of queued and/or running tasks using [`get_tasks_queued()`, `get_tasks_running()`, and `get_tasks_total()`](#monitoring-the-tasks).
* Purge all tasks currently waiting in the queue with [`purge()`](#purging-tasks).
* Run an [initialization function](#thread-initialization-functions) in each thread before it starts to execute any submitted tasks, by passing it to the `BS::thread_pool` constructor.
* Run a cleanup function in each thread right before it is destroyed, using [`set_cleanup_func()`](#thread-cleanup-functions).
* Assume lower-level control of parallelized loops using [`detach_blocks()` and `submit_blocks()`](#parallelizing-individual-indices-vs-blocks).
* Parallelize a sequence of tasks enumerated by indices to the queue using [`detach_sequence()` and `submit_sequence()`](#parallelizing-sequences).
* Get [information about the current thread](#getting-information-about-the-current-thread): the pool index using `BS::this_thread::get_index()` and a pointer to the owning pool using `BS::this_thread::get_pool()`.
* Get the unique thread IDs for all threads in the pool using [`get_thread_ids()`](#getting-and-resetting-the-number-of-threads-in-the-pool).
* Synchronize output to one or more streams from multiple threads in parallel using the [`BS::synced_stream`](#synchronizing-printing-to-a-stream-with-bssynced_stream) utility class.
* Access C++20 semaphores in C++17 using the [`BS::binary_semaphore` and `BS::counting_semaphore`](#synchronizing-tasks-with-bscounting_semaphore-and-bsbinary_semaphore) polyfill classes.
* **Optional features:**
* [Optional features](#enabling-features) can be enabled by passing a bitmask template parameter to the `BS::thread_pool` class template.
* Assign a priority to each task using the optional [task priority](#setting-task-priority) feature. The priority, in the range -128 to +127, is passed as the last argument to all `submit` and `detach` member functions. Tasks with higher priorities will be executed first.
* Freely pause and resume the pool using `pause()`, `unpause()`, and `is_paused()` with the optional [pausing](#pausing-the-pool) feature. When paused, threads do not retrieve new tasks out of the queue.
* Avoid deadlocks using the optional [wait deadlock checks](#avoiding-wait-deadlocks) feature. If a deadlock is detected while waiting for tasks, the pool will throw the exception `BS::wait_deadlock`.
* **Native extensions:**
* The library includes optional [native extensions](#native-extensions), which contain non-portable features using the operating system's native API, enabled by defining the macro `BS_THREAD_POOL_NATIVE_EXTENSIONS` at compilation time. This feature should work on most Windows, Linux, and macOS systems.
* Use [`BS::this_thread::get_os_thread_priority()` and `BS::this_thread::set_os_thread_priority()`](#setting-thread-priority) to get and set the priority of the current thread.
* Use [`BS::this_thread::get_os_thread_affinity()` and `BS::this_thread::set_os_thread_affinity()`](#setting-thread-affinity) to get and set the processor affinity of the current thread.
* Use [`BS::this_thread::get_os_thread_name()` and `BS::this_thread::set_os_thread_name()`](#setting-thread-names) to get and set the name of the current thread.
* Use [`BS::get_os_process_priority()` and `BS::set_os_process_priority()`](#setting-process-priority) to get and set the priority of the current process.
* Use [`BS::get_os_process_affinity()` and `BS::set_os_process_affinity()`](#setting-process-affinity) to get and set the processor affinity of the current process.
* Get the implementation-defined thread handles for all threads in the pool using [`get_native_handles()`](#accessing-native-thread-handles).
* **Well-tested:**
* The included test program [`BS_thread_pool_test.cpp`](#automated-tests) performs hundreds of automated tests, and also serves as a comprehensive example of how to properly use the library.
* The test program also performs [benchmarks](#performance-tests) using a highly-optimized multithreaded algorithm which generates a plot of the Mandelbrot set.
* The included Python script `test_all.py` provides a portable way to easily run the tests with multiple compilers.
* [Compatibility](#compiling-and-compatibility) is comprehensively tested on the latest versions of Windows, Ubuntu, and macOS, using Clang, GCC, and MSVC.
* Under continuous and active development. Bug reports and feature requests are welcome, and should be made via [GitHub issues](https://github.com/bshoshany/thread-pool/issues).

## Getting started

### Installing the library

To install `BS::thread_pool`, simply download the [latest release](https://github.com/bshoshany/thread-pool/releases) from [the GitHub repository](https://github.com/bshoshany/thread-pool), place the header file `BS_thread_pool.hpp` from the `include` folder in the desired folder, and include it in your program:

```cpp
#include "BS_thread_pool.hpp"
```

The thread pool will now be accessible via the `BS::thread_pool` class. For an even quicker installation, you can download the header file itself directly [at this URL](https://raw.githubusercontent.com/bshoshany/thread-pool/master/include/BS_thread_pool.hpp); no additional files are required, as the library is a single-header library.

This library is also available on various package managers and build system, including [vcpkg](https://vcpkg.io/), [Conan](https://conan.io/), [Meson](https://mesonbuild.com/), and [CMake](https://cmake.org/). Please [see below](#installing-the-library-using-package-managers) for more details.

If C++20 features are available, the library can also be imported as a C++20 module, in which case `#include "BS_thread_pool.hpp"` should be replaced with `import BS.thread_pool;`. This requires one additional file, and the module must be compiled before it can be used; please see detailed instructions [below](#importing-the-library-as-a-c20-module).

### Compiling and compatibility

This library officially supports C++17, C++20, and C++23. If compiled with C++20 and/or C++23 support, the library will make use of newly available features for maximum performance and usability. However, the library is fully compatible with C++17, and should successfully compile on any C++17 standard-compliant compiler, on all operating systems and architectures for which such a compiler is available.

Compatibility was verified using the bundled test program `BS_thread_pool_test.cpp`, compiled using the bundled Python scripts `test_all.py` and `compile_cpp.py` with native extensions enabled, importing the library [as a C++20 module](#importing-the-library-as-a-c20-module) where applicable, and importing the [C++23 Standard Library as a module](#importing-the-c23-standard-library-as-a-module) where applicable, on a 24-core (8P+16E) / 32-thread Intel i9-13900K CPU, using the following compilers, C++ standard libraries, and platforms:

* Windows 11 23H2 build 22631.4602:
* [Clang](https://clang.llvm.org/) v19.1.4 with LLVM libc++ v19.1.4 ([MSYS2 build](https://www.msys2.org/))
* [GCC](https://gcc.gnu.org/) v14.2.0 with GNU libstdc++ v14 (20240801) ([MSYS2 build](https://www.msys2.org/))
* [MSVC](https://docs.microsoft.com/en-us/cpp/) v19.42.34435 with Microsoft STL v143 (202408).
* Ubuntu 24.10:
* [Clang](https://clang.llvm.org/) v19.1.6 with LLVM libc++ v19.1.6
* [GCC](https://gcc.gnu.org/) v14.2.0 with GNU libstdc++ v14 (20240908)
* macOS 15.1 build 24B83:
* [Clang](https://clang.llvm.org/) v19.1.6 with LLVM libc++ v19.1.6 ([Homebrew build](https://formulae.brew.sh/formula/llvm))
* Note: Apple Clang is currently not officially supported, as it does not support C++20 modules.

As this library requires C++17 features, the code must be compiled with C++17 support:

* For Clang or GCC, use the `-std=c++17` flag. On Linux, you will also need to use the `-pthread` flag to enable the POSIX threads library.
* For MSVC, use `/std:c++17`, and also `/permissive-` to ensure standards conformance.

For maximum performance, it is recommended to compile with all available compiler optimizations:

* For Clang or GCC, use the `-O3` flag.
* For MSVC, use `/O2`.

As an example, to compile the test program `BS_thread_pool_test.cpp` with compiler optimizations, it is recommended to use the following commands:

* Windows:
* GCC: `g++ BS_thread_pool_test.cpp -std=c++17 -O3 -o BS_thread_pool_test.exe`
* Clang: `clang++ BS_thread_pool_test.cpp -std=c++17 -O3 -o BS_thread_pool_test.exe`
* MSVC: `cl BS_thread_pool_test.cpp /std:c++17 /permissive- /O2 /EHsc /Fo:BS_thread_pool_test.obj /Fe:BS_thread_pool_test.exe`
* Linux/macOS:
* GCC: `g++ BS_thread_pool_test.cpp -std=c++17 -O3 -pthread -o BS_thread_pool_test`
* Clang: `clang++ BS_thread_pool_test.cpp -std=c++17 -O3 -pthread -o BS_thread_pool_test`

If your compiler and codebase support C++20 and/or C++23, it is recommended to enable them in order to allow the library access to the latest features:

* For Clang or GCC, use the `-std=c++20` or `-std=c++23` flag.
* For MSVC, use `/std:c++20` for C++20 or `/std:c++latest` for C++23.

In addition, if C++20 features are available, the library can be imported as a module; instructions for doing so are provided [below](#importing-the-library-as-a-c20-module).

### Constructors

The default constructor creates a thread pool with as many threads as the hardware can handle concurrently, as reported by the implementation via `std::thread::hardware_concurrency()`. This is usually determined by the number of cores in the CPU. If a core is hyperthreaded, it will count as two threads. For example:

```cpp
// Constructs a thread pool with as many threads as are available in the hardware.
BS::thread_pool pool;
```

Optionally, a number of threads different from the hardware concurrency can be specified as an argument to the constructor. However, note that adding more threads than the hardware can handle will **not** improve performance, and in fact will most likely hinder it. This option exists in order to allow using **fewer** threads than the hardware concurrency, in cases where you wish to leave some threads available for other processes. For example:

```cpp
// Constructs a thread pool with only 12 threads.
BS::thread_pool pool(12);
```

Usually, when the thread pool is used, a program's main thread should only submit tasks to the thread pool and wait for them to finish, and should not perform any computationally intensive tasks on its own. If this is the case, it is recommended to use the default value for the number of threads. This ensures that all the threads available in the hardware will be put to work while the main thread waits.

However, if the main thread also performs computationally intensive tasks, it may be beneficial to use one fewer thread than the hardware concurrency, leaving one hardware thread available for the main thread. Furthermore, if more than one thread pool is used in the program simultaneously, the total number of thread across all pools should not exceed the hardware concurrency.

### Getting and resetting the number of threads in the pool

The member function `get_thread_count()` returns the number of threads in the pool. This will be equal to `std::thread::hardware_concurrency()` if the default constructor was used.

It is generally unnecessary to change the number of threads in the pool after it has been created, since the whole point of a thread pool is that you only create the threads once. However, if needed, this can be done, safely and on-the-fly, using the `reset()` member function.

`reset()` will wait for all currently running tasks to be completed, but will leave the rest of the tasks in the queue. Then it will destroy the thread pool and create a new one with the desired new number of threads, as specified in the function's argument (or the hardware concurrency if no argument is given). The new thread pool will then resume executing the tasks that remained in the queue and any newly submitted tasks.

The member function `get_thread_ids()` returns a vector containing the unique identifiers for each of the pool's threads, as obtained by `std::thread::get_id()`. These values are not so useful on their own, but can be used to identify and distinguish between threads, or for allocating resources.

## Submitting tasks to the queue

### Submitting tasks with no arguments and receiving a future

In this section we will learn how to submit a task with no arguments, but potentially with a return value, to the queue. Once a task has been submitted, it will be executed as soon as a thread becomes available. Tasks are executed in the order that they were submitted (first-in, first-out), unless task priority is enabled ([see below](#setting-task-priority)).

For example, if the pool has 8 threads and an empty queue, and we submitted 16 tasks, then we should expect the first 8 tasks to be executed in parallel, with the remaining tasks being picked up by the threads one by one as each thread finishes executing its first task, until no tasks are left in the queue.

The member function `submit_task()` is used to submit tasks to the queue. It takes exactly one input, the task to submit. This task must be a function with no arguments, but it can have a return value.

`submit_task()` returns an `std::future` associated to the task. If the submitted task has a return value of type `T`, then the future will be of type `std::future`, and will be set to the task's return value when the task finishes its execution. If the submitted task does not have a return value, then the future will be an `std::future`, which will not contain any value, but may still be used to wait for the task to finish.

To wait until the task finishes, use the member function `wait()` of the future. To obtain the return value, use the member function `get()`, which will also automatically wait for the task to finish if it hasn't yet. Here is a simple example:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::future
#include // std::cout

int the_answer()
{
return 42;
}

int main()
{
BS::thread_pool pool;
std::future my_future = pool.submit_task(the_answer);
std::cout << my_future.get() << '\n';
}
```

In this example we submitted the function `the_answer()`, which returns an `int`. The member function `submit_task()` of the pool therefore returned an `std::future`. We then used used the `get()` member function of the future to get the return value, and printed it out.

In addition to submitting a pre-defined function, we can also use a [lambda expression](https://en.cppreference.com/w/cpp/language/lambda) to quickly define the task on-the-fly. Rewriting the previous example in terms of a lambda expression, we get:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::future
#include // std::cout

int main()
{
BS::thread_pool pool;
std::future my_future = pool.submit_task([]{ return 42; });
std::cout << my_future.get() << '\n';
}
```

Here, the lambda expression `[]{ return 42; }` has two parts:

1. An empty capture clause, denoted by `[]`. This signifies to the compiler that a lambda expression is being defined.
2. A code block `{ return 42; }` that simply returns the value `42`.

It is generally simpler and faster to submit lambda expressions rather than pre-defined functions, especially due to the ability to capture local variables, which we will discuss in the next section.

Of course, tasks do not have to return values. In the following example, we submit a function with no return value and then using the future to wait for it to finish executing:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::chrono
#include // std::future
#include // std::cout
#include // std::this_thread

int main()
{
BS::thread_pool pool;
const std::future my_future = pool.submit_task(
[]
{
std::this_thread::sleep_for(std::chrono::milliseconds(500));
});
std::cout << "Waiting for the task to complete... ";
my_future.wait();
std::cout << "Done." << '\n';
}
```

Here we split the lambda into multiple lines to make it more readable. The command `std::this_thread::sleep_for(std::chrono::milliseconds(500))` instructs the task to simply sleep for 500 milliseconds, simulating a computationally-intensive task.

### Submitting tasks with arguments and receiving a future

As stated in the previous section, tasks submitted using `submit_task()` cannot have any arguments. However, it is easy to submit tasks with argument either by wrapping the function in a lambda or using lambda captures directly. The following is an example of submitting a pre-defined function with arguments by wrapping it in a lambda:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::future
#include // std::cout

double multiply(const double lhs, const double rhs)
{
return lhs * rhs;
}

int main()
{
BS::thread_pool pool;
std::future my_future = pool.submit_task(
[]
{
return multiply(6, 7);
});
std::cout << my_future.get() << '\n';
}
```

As you can see, to pass the arguments to `multiply()` we simply called `multiply(6, 7)` explicitly inside a lambda. If the arguments are not literals, we can use the lambda capture clause to capture the arguments from the local scope:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::future
#include // std::cout

double multiply(const double lhs, const double rhs)
{
return lhs * rhs;
}

int main()
{
BS::thread_pool pool;
constexpr double first = 6;
constexpr double second = 7;
std::future my_future = pool.submit_task(
[first, second]
{
return multiply(first, second);
});
std::cout << my_future.get() << '\n';
}
```

We could even get rid of the `multiply()` function entirely and just put everything inside a lambda, if desired:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::future
#include // std::cout

int main()
{
BS::thread_pool pool;
constexpr double first = 6;
constexpr double second = 7;
std::future my_future = pool.submit_task(
[first, second]
{
return first * second;
});
std::cout << my_future.get() << '\n';
}
```

### Detaching and waiting for tasks

Usually, it is best to submit a task to the queue using `submit_task()`. This allows you to wait for the task to finish and/or get its return value later. However, sometimes a future is not needed, for example when you just want to "set and forget" a certain task, or if the task already communicates with the main thread or with other tasks without using futures, such as via condition variables.

In such cases, you may wish to avoid the overhead involved in assigning a future to the task, in order to increase performance. This is called "detaching" the task, as the task detaches from the main thread and runs independently.

Detaching tasks is done using the `detach_task()` member function, which allows you to detach a task to the queue without generating a future for it. As with `submit_task()`, the task must have no arguments, but you can pass arguments by wrapping it in a lambda, as shown in the previous section. However, tasks executed via `detach_task()` cannot have a return value, as there would be no way for the main thread to retrieve that value.

Since `detach_task()` does not return a future, there is no built-in way for the user to know when the task finishes executing. You must manually ensure that the task finishes executing before trying to use anything that depends on its output. Otherwise, bad things will happen!

`BS::thread_pool` provides the member function `wait()` to facilitate waiting for all the tasks in the queue to complete, whether they were detached or submitted with a future. The `wait()` member function works similarly to the `wait()` member function of `std::future`. Consider, for example, the following code:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::chrono
#include // std::cout
#include // std::this_thread

int main()
{
BS::thread_pool pool;
int result = 0;
pool.detach_task(
[&result]
{
std::this_thread::sleep_for(std::chrono::milliseconds(100));
result = 42;
});
std::cout << result << '\n';
}
```

This program first defines a local variable named `result` and initializes it to `0`. It then detaches a task in the form of a lambda expression. Note that the lambda captures `result` **by reference**, as indicated by the `&` in front of it. This means that the task can modify `result`, and any such modification will be reflected in the main thread.

The task changes `result` to `42`, but it first sleeps for 100 milliseconds. When the main thread prints out the value of `result`, the task has not yet had time to modify its value, since it is still sleeping. Therefore, the program will actually print out the initial value `0`, which is not what we want.

To wait for the task to complete, we must use the `wait()` member function after detaching it:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::chrono
#include // std::cout
#include // std::this_thread

int main()
{
BS::thread_pool pool;
int result = 0;
pool.detach_task(
[&result]
{
std::this_thread::sleep_for(std::chrono::milliseconds(100));
result = 42;
});
pool.wait();
std::cout << result << '\n';
}
```

Now the program will print out the value `42`, as expected. Note, however, that `wait()` will wait for **all** the tasks in the queue, including any other tasks that were potentially submitted before or after the one we care about. If we want to wait just for **one** task, `submit_task()` would be a better choice.

### Waiting for submitted or detached tasks with a timeout

Sometimes you may wish to wait for the tasks to complete, but only for a certain amount of time, or until a specific point in time. For example, if the tasks have not yet completed after some time, you may wish to let the user know that there is a delay.

For tasks submitted with futures using `submit_task()`, this can be achieved using two member functions of `std::future`:

* `wait_for()` waits for the task to be completed, but stops waiting after the specified duration, given as an argument of type `std::chrono::duration`, has passed.
* `wait_until()` waits for the task to be completed, but stops waiting after the specified time point, given as an argument of type `std::chrono::time_point`, has been reached.

In both cases, the functions will return `std::future_status::ready` if the future is ready, meaning the task is finished and its return value, if any, has been obtained. However, they will return `std::future_status::timeout` if the future is not yet ready when the timeout has expired.

Here is an example:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::chrono
#include // std::future
#include // std::cout
#include // std::this_thread

int main()
{
BS::thread_pool pool;
const std::future my_future = pool.submit_task(
[]
{
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
std::cout << "Task done!\n";
});
while (true)
{
if (my_future.wait_for(std::chrono::milliseconds(200)) != std::future_status::ready)
std::cout << "Sorry, the task is not done yet.\n";
else
break;
}
}
```

The output should look similar to this:

```none
Sorry, the task is not done yet.
Sorry, the task is not done yet.
Sorry, the task is not done yet.
Sorry, the task is not done yet.
Task done!
```

For detached tasks, since we do not have futures for them, we cannot use this method. However, `BS::thread_pool` has two member functions, also named `wait_for()` and `wait_until()`, which similarly wait for a specified duration or until a specified time point, but do so for **all** tasks (whether submitted or detached). Instead of an `std::future_status`, the thread pool's wait functions returns `true` if all tasks finished running, or `false` if the duration expired or the time point was reached but some tasks are still running.

Here is the same example as above, using `detach_task()` and `pool.wait_for()`:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::chrono
#include // std::cout
#include // std::this_thread

int main()
{
BS::thread_pool pool;
pool.detach_task(
[]
{
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
std::cout << "Task done!\n";
});
while (true)
{
if (!pool.wait_for(std::chrono::milliseconds(200)))
std::cout << "Sorry, the task is not done yet.\n";
else
break;
}
}
```

### Class member functions as tasks

Let us consider the following program:

```cpp
#include // std::boolalpha, std::cout

class flag_class
{
public:
[[nodiscard]] bool get_flag() const
{
return flag;
}

void set_flag(const bool arg)
{
flag = arg;
}

private:
bool flag = false;
};

int main()
{
flag_class flag_object;
flag_object.set_flag(true);
std::cout << std::boolalpha << flag_object.get_flag() << '\n';
}
```

This program creates a new object `flag_object` of the class `flag_class`, sets the flag to `true` using the setter member function `set_flag()`, and then prints out the flag's value using the getter member function `get_flag()`.

What if we want to submit the member function `set_flag()` as a task to the thread pool? We can simply wrap the entire statement `flag_object.set_flag(true);` in a lambda, and pass `flag_object` to the lambda by reference, as in the following example:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::boolalpha, std::cout

class flag_class
{
public:
[[nodiscard]] bool get_flag() const
{
return flag;
}

void set_flag(const bool arg)
{
flag = arg;
}

private:
bool flag = false;
};

int main()
{
BS::thread_pool pool;
flag_class flag_object;
pool.submit_task(
[&flag_object]
{
flag_object.set_flag(true);
})
.wait();
std::cout << std::boolalpha << flag_object.get_flag() << '\n';
}
```

Of course, this will also work with `detach_task()`, if we call `wait()` on the pool itself instead of on the returned future.

Note that in this example, instead of getting a future from `submit_task()` and then waiting for that future, we simply called `wait()` on that future straight away. This is a common way of waiting for a task to complete if we have nothing else to do in the meantime. Note also that we passed `flag_object` by reference to the lambda, since we want to set the flag on that same object, not a copy of it.

Another thing you might want to do is call a member function from within the object itself, that is, from another member function. This follows a similar syntax, except that you must also capture `this` (i.e. a pointer to the current object) in the lambda. Here is an example:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::boolalpha, std::cout

BS::thread_pool pool;

class flag_class
{
public:
[[nodiscard]] bool get_flag() const
{
return flag;
}

void set_flag(const bool arg)
{
flag = arg;
}

void set_flag_to_true()
{
pool.submit_task(
[this]
{
set_flag(true);
})
.wait();
}

private:
bool flag = false;
};

int main()
{
flag_class flag_object;
flag_object.set_flag_to_true();
std::cout << std::boolalpha << flag_object.get_flag() << '\n';
}
```

Note that in this example we defined the thread pool as a global object, so that it is accessible outside the `main()` function. Although we could have, in theory, passed a reference to the thread pool in our call to `set_flag_to_true()`, that would be very cumbersome to do if multiple different functions need to use the same thread pool. Defining the thread pool as a global object is common practice, as it allows all functions to access the same thread pool without having to pass it around as an argument.

## Parallelizing loops

### Automatic parallelization of loops

One of the most common and effective methods of parallelization is splitting a loop into smaller sub-loops and running them in parallel. It is most effective in "embarrassingly parallel" computations, such as vector or matrix operations, where each iteration of the loop is completely independent of every other iteration.

For example, if we are summing up two vectors of 1000 elements each, and we have 10 threads, we could split the summation into 10 blocks of 100 elements each, and run all the blocks in parallel, potentially increasing performance by up to a factor of 10.

`BS::thread_pool` can automatically parallelize loops, making it very easy to implement many parallel algorithms without having to worry about the details. To see how this works, consider the following generic loop:

```cpp
for (T i = start; i < end; ++i)
loop(i);
```

where:

* `T` is any signed or unsigned integer type.
* The loop is over the range `[start, end)`, i.e. inclusive of `start` but exclusive of `end`.
* `loop()` is an operation performed for each loop index `i`, such as modifying an array with `end - start` elements.

This loop may be automatically parallelized and submitted to the thread pool's queue using the member function `submit_loop()`, which has the follows syntax:

```cpp
pool.submit_loop(start, end, loop, num_blocks);
```

where:

* `start` is the first index in the range.
* `end` is the index after the last index in the range, such that the full range is `[start, end)`. In other words, the loop will be equivalent to the generic loop above, but parallelized. Note that if `end <= start`, nothing will happen; the loop cannot go backwards.
* `loop()` is the function that should run in every iteration of the loop. It must take exactly one argument, the loop index. It cannot have a return value, as it will be executed multiple times by each task, so a return value would not make sense.
* `num_blocks` is the number of blocks of the form `[a, b)` to split the loop into. For example, if the range is `[0, 9)` and there are 3 blocks, then the blocks will be the ranges `[0, 3)`, `[3, 6)`, and `[6, 9)`. This argument can be omitted, in which case the number of blocks will be the number of threads in the pool.

The thread pool's internal algorithm ensures that each of the blocks has one of two sizes, differing by 1, with the larger blocks always first, so that the tasks are as evenly distributed as possible, to optimize performance. For example, if the range `[0, 100)` is split into 15 blocks, the result will be 10 blocks of size 7, which will be submitted first, and 5 blocks of size 6.

Each block will be submitted to the thread pool's queue as a separate task. Therefore, a loop that is split into 3 blocks will be split into 3 individual tasks, which may run in parallel. If there is only one block, then the entire loop will run as one task, and no parallelization will take place.

To parallelize the generic loop above, we use the following commands:

```cpp
BS::multi_future loop_future = pool.submit_loop(start, end, loop, num_blocks);
loop_future.wait();
```

`submit_loop()` returns an object of the helper class [`BS::multi_future`](#more-about-bsmulti_future). This is essentially a specialization of `std::vector>` with additional member functions. Each of the `num_blocks` blocks will have an `std::future` assigned to it, and all these futures will be stored inside the returned `BS::multi_future`. When `loop_future.wait()` is called, the main thread will wait until **all** tasks generated by `submit_loop()` finish executing, and **only** those tasks - not any other tasks that also happen to be in the queue. This is essentially the role of the `BS::multi_future` class: to wait for a specific **group of tasks**, in this case the tasks running the loop blocks.

As a simple example, the following code calculates and prints a table of squares of all integers from 0 to 99:

```cpp
#include // std::size_t
#include // std::setw
#include // std::cout

int main()
{
constexpr std::size_t max = 100;
std::size_t squares[max];
for (std::size_t i = 0; i < max; ++i)
squares[i] = i * i;
for (std::size_t i = 0; i < max; ++i)
std::cout << std::setw(2) << i << "^2 = " << std::setw(4) << squares[i] << ((i % 5 != 4) ? " | " : "\n");
}
```

We can parallelize it as follows:

```cpp
#include "BS_thread_pool.hpp" // BS::multi_future, BS::thread_pool
#include // std::size_t
#include // std::setw
#include // std::cout

int main()
{
BS::thread_pool pool(10);
constexpr std::size_t max = 100;
std::size_t squares[max];
const BS::multi_future loop_future = pool.submit_loop(0, max,
[&squares](const std::size_t i)
{
squares[i] = i * i;
});
loop_future.wait();
for (std::size_t i = 0; i < max; ++i)
std::cout << std::setw(2) << i << "^2 = " << std::setw(4) << squares[i] << ((i % 5 != 4) ? " | " : "\n");
}
```

Since there are 10 threads, and we omitted the `num_blocks` argument, the loop will be divided into 10 blocks, each calculating 10 squares.

As a side note, notice that here we parallelized the calculation of the squares, but we did not parallelize printing the results. This is for two reasons:

1. We want to print out the squares in ascending order, and we have no guarantee that the blocks will be executed in the correct order. This is very important; you must never expect that the parallelized loop will execute at the same order as the non-parallelized loop.
2. If we did print out the squares from within the parallel tasks, we would get a huge mess, since all 10 blocks would print to the standard output at once. [Later](#synchronizing-printing-to-a-stream-with-bssynced_stream) we will see how to synchronize printing to a stream from multiple tasks at the same time.

### Optimizing the number of blocks

The most important factor to consider when parallelizing loops is the number of blocks `num_blocks` to split the loop into. Naively, it may seem that the number of blocks should simply be equal to the number of threads in the pool, but that is usually **not** the optimal choice. Inevitably, some blocks will finish before other blocks; if there is only one block per thread, then any threads that have already finished executing their blocks will remain idle until the rest of the blocks are done, wasting many CPU cycles.

It is therefore generally better to use a larger number of blocks than the number of threads, to ensure that all threads work at maximum capacity. On the other hand, parallelization with too many blocks will eventually suffer from diminishing returns due to increased overhead. A good rule of thumb is to use a number of blocks equal to the square of the number of threads, but this is not necessarily the optimal number in all cases.

In the end, the optimal number of blocks will always depend on the specific algorithm being parallelized and the total number of indices in the loop, and may differ between different compilers, operating systems, and hardware configurations. For best performance, it is strongly recommended to do your own benchmarks to find the optimal number of blocks for your particular use case; see the [benchmarks code in the bundled test program](#performance-tests) for an example of how to do this.

Finally, note that the discussion here only pertains to situations where the parallelized loop is the only thing running in the pool. If there are many other tasks running in parallel from other sources, then you probably do not need to worry about idle time, since the threads will be kept busy by the other tasks anyway.

### Common index types

Let us now consider a subtlety regarding the types of the start and end indices. In the example [above](#automatic-parallelization-of-loops), the start index is `0`, which is of type `int`, while the end index is `max`, which is of type `std::size_t`. These two types are not compatible, as they are both of different signedness and (on a 64-bit system) of different bit width. In such cases, `submit_loop()` uses a custom type trait `BS::common_index_type` to determine the common type of the indices.

The common index type of two signed integers or two unsigned integers is the larger of the integers, while the common index type of a signed and an unsigned integer is a signed integer that can hold the full ranges of both integers. (This is in contrast to [`std::common_type`](https://en.cppreference.com/w/cpp/types/common_type), which would choose the unsigned integer in the latter case, causing a loop with a negative start index and an unsigned end index to fail due to integer overflow.)

The exception to this rule is when one of the integers is a 64-bit unsigned integer, and the other is a signed integer (of any bit width), since there is no fundamental signed type that can hold the full ranges of both integers. In this case, we choose a 64-bit unsigned integer as the common index type, since the most common scenario where this might happen is when the indices go from `0` to an index of type `std::size_t` - as in our example in the previous section.

However, it is important to note that this will fail if the first index is in fact negative. Therefore, **only** in the edge case where one index is a negative integer and the other is of an unsigned 64-bit integer type such as `std::size_t`, the user must cast both indices explicitly to the desired common type. In all other cases, this is handled automatically behind the scenes using `BS::common_index_type`.

### Parallelizing loops without futures

Just as in the case of [`detach_task()`](#detaching-and-waiting-for-tasks) vs. [`submit_task()`](#submitting-tasks-with-no-arguments-and-receiving-a-future), sometimes you may want to parallelize a loop, but you don't need it to return a `BS::multi_future`. In this case, you can save the overhead of generating the futures (which can be significant, depending on the number of blocks) by using `detach_loop()` instead of `submit_loop()`, with the same arguments.

For example, we could detach the loop of squares example above as follows:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::size_t
#include // std::setw
#include // std::cout

int main()
{
BS::thread_pool pool(10);
constexpr std::size_t max = 100;
std::size_t squares[max];
pool.detach_loop(0, max,
[&squares](const std::size_t i)
{
squares[i] = i * i;
});
pool.wait();
for (std::size_t i = 0; i < max; ++i)
std::cout << std::setw(2) << i << "^2 = " << std::setw(4) << squares[i] << ((i % 5 != 4) ? " | " : "\n");
}
```

**Warning:** Since `detach_loop()` does not return a `BS::multi_future`, there is no built-in way for the user to know when the loop finishes executing. You must use either [`wait()`](#detaching-and-waiting-for-tasks) as we did here, or some other method such as condition variables, to ensure that the loop finishes executing before trying to use anything that depends on its output. Otherwise, bad things will happen! If the loop is the only thing running in the pool, then generally `detach_loop()` followed by `wait()` is the optimal choice in terms of performance.

### Parallelizing individual indices vs. blocks

We have seen that `detach_loop()` and `submit_loop()` execute the function `loop(i)` for each index `i` in the loop. However, behind the scenes, the loop is split into blocks, and each block executes the `loop()` function multiple times. Each block has an internal loop of the form (where `T` is the type of the indices):

```cpp
for (T i = start; i < end; ++i)
loop(i);
```

The `start` and `end` indices of each block are determined automatically by the pool. For example, in the previous section, the loop from 0 to 100 was split into 10 blocks of 10 indices each: `start = 0` to `end = 10`, `start = 10` to `end = 20`, and so on; the blocks are not inclusive of the last index, since the `for` loop has the condition `i < end` and not `i <= end`.

However, this also means that the `loop()` function is executed multiple times per block. This generates additional overhead due to the multiple function calls. For short loops, this should not affect performance. However, for very long loops, with millions of indices, the performance cost may be significant.

For this reason, the thread pool library provides two additional member functions for parallelizing loops: `detach_blocks()` and `submit_blocks()`. While `detach_loop()` and `submit_loop()` execute a function `loop(i)` once per index but multiple times per block, `detach_blocks()` and `submit_blocks()` execute a function `block(start, end)` only once per block.

The main advantage of this method is increased performance, but the main disadvantage is slightly more complicated code. In particular, the user must define the loop from `start` to `end` manually within each block, ensuring that all the indices in the block are handled. Here is the previous example again, this time using `detach_blocks()`:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::size_t
#include // std::setw
#include // std::cout

int main()
{
BS::thread_pool pool(10);
constexpr std::size_t max = 100;
std::size_t squares[max];
pool.detach_blocks(0, max,
[&squares](const std::size_t start, const std::size_t end)
{
for (std::size_t i = start; i < end; ++i)
squares[i] = i * i;
});
pool.wait();
for (std::size_t i = 0; i < max; ++i)
std::cout << std::setw(2) << i << "^2 = " << std::setw(4) << squares[i] << ((i % 5 != 4) ? " | " : "\n");
}
```

Note how the block function takes two arguments, and includes the internal loop. Also, since we are using `detach_blocks()`, we must wait for the loop to finish executing using `wait()`. Alternatively, we could have used `submit_blocks()` and waited on the returned `BS::multi_future` object.

Generally, compiler optimizations should be able to make `detach_loop()` and `submit_loop()` perform roughly the same as `detach_blocks()` and `submit_blocks()`. However, `detach_blocks()` and `submit_blocks()` are always going to be inherently faster, at the cost of being slightly more complicated to use. In addition, having low-level control of each block can allow for further optimizations, such as allocating resources per block instead of per index. As usual, you should perform your own benchmarks to see which option works best for your particular use case.

### Loops with return values

As mentioned above, unlike `submit_task()`, the member function `submit_loop()` only takes loop functions with no return value. The reason is that each block is running the loop function multiple times, so a return value would not make sense. In contrast, `submit_blocks()` allows the block function to have a return value, as each block can return a unique value.

The block function will be executed once for each block, but the blocks are managed by the thread pool, with the user only able to select the number of blocks, but not the range of each block. Therefore, there is limited usability in returning one value per block. However, for cases where this is desired, such as for summation or some sorting algorithms, `submit_blocks()` does accept functions with return values, in which case it returns a `BS::multi_future` object where `T` is the type of the return value.

Here's an example of a function template summing all elements of type `T` in a given range:

```cpp
#include "BS_thread_pool.hpp" // BS::multi_future, BS::thread_pool
#include // std::uint64_t
#include // std::future
#include // std::cout

BS::thread_pool pool;

template
T sum(T min, T max)
{
BS::multi_future loop_future = pool.submit_blocks(
min, max + 1,
[](const T start, const T end)
{
T block_total = 0;
for (T i = start; i < end; ++i)
block_total += i;
return block_total;
},
100);
T result = 0;
for (std::future& future : loop_future)
result += future.get();
return result;
}

int main()
{
std::cout << sum(1, 1'000'000);
}
```

Note that we needed to specify the type `T` explicitly as `std::uint64_t`, that is, an unsigned 64-bit integer, as the result, 500,000,500,000, would not fit in a 32-bit integer.

Here we used the fact that `BS::multi_future` is a specialization of `std::vector>`, so we can use a range-based `for` loop to iterate over the futures, and use the `get()` member function of each future to get its value. The values of the futures will be the partial sums from each block, so when we add them up, we will get the total sum. Note that we divided the loop into 100 blocks, so there will be 100 futures in total, each with the partial sum of 10,000 numbers.

The range-based `for` loop will likely start before the loop finished executing, and each time it calls a future, it will get the value of that future if it is ready, or it will wait until the future is ready and then get the value. This increases performance, since we can start summing the results without waiting for the entire loop to finish executing first - we only need to wait for individual blocks.

If we did want to wait until the entire loop finishes before summing the results, we could have used the `get()` member function of the `BS::multi_future` object itself, which returns an `std::vector` with the values obtained from each future. In that case, the sum could be obtained after calling `submit_blocks()`, for example using `std::reduce`, as follows:

```cpp
#include "BS_thread_pool.hpp" // BS::multi_future, BS::thread_pool
#include // std::uint64_t
#include // std::cout
#include // std::reduce
#include // std::vector

BS::thread_pool pool;

template
T sum(T min, T max)
{
BS::multi_future loop_future = pool.submit_blocks(
min, max + 1,
[](const T start, const T end)
{
T block_total = 0;
for (T i = start; i < end; ++i)
block_total += i;
return block_total;
},
100);
std::vector partial_sums = loop_future.get();
T result = std::reduce(partial_sums.begin(), partial_sums.end());
return result;
}

int main()
{
std::cout << sum(1, 1'000'000);
}
```

### Parallelizing sequences

The member functions `detach_loop()`, `submit_loop()`, `detach_blocks()`, and `submit_blocks()` parallelize a loop by splitting it into blocks, and submitting each block as an individual task to the queue, with each such task iterating over all the indices in the corresponding block's range, which can be numerous. However, sometimes we have a loop with a small number of indices, or more generally, a sequence of tasks enumerated by some index. In such cases, we can avoid the overhead of splitting into blocks and simply submit each individual index as its own independent task to the pool's queue.

This can be done with `detach_sequence()` and `submit_sequence()`. The syntax of these functions is similar to `detach_loop()` and `submit_loop()`, except that they don't have the `num_blocks` argument at the end. The sequence function must take only one argument, the index.

As usual, `detach_sequence()` detaches the tasks and does not return a future, so you must use `wait()` if you need to wait for the entire sequence to finish executing, while `submit_sequence()` returns a `BS::multi_future`. If the tasks in the sequence return values, then the futures will contain those values, otherwise they will be `void` futures.

Here is a simple example, where each task in the sequence calculates the factorial of its index:

```cpp
#include "BS_thread_pool.hpp" // BS::multi_future, BS::thread_pool
#include // std::uint64_t
#include // std::cout
#include // std::vector

std::uint64_t factorial(const std::uint64_t n)
{
std::uint64_t result = 1;
for (std::uint64_t i = 2; i <= n; ++i)
result *= i;
return result;
}

int main()
{
BS::thread_pool pool;
constexpr std::uint64_t max = 20;
BS::multi_future sequence_future = pool.submit_sequence(0, max + 1, factorial);
std::vector factorials = sequence_future.get();
for (std::uint64_t i = 0; i < max + 1; ++i)
std::cout << i << "! = " << factorials[i] << '\n';
}
```

Note how the factorials of each index are stored in the `BS::multi_future`, and can be obtained as a vector using `get()`; each element of the vector is equal to the factorial of the element's index, calculated by its own individual task in the sequence.

**Warning:** Since each index in the sequence will be submitted as a separate task, `detach_sequence()` and `submit_sequence()` should only be used if the number of indices is small (say, within 1-2 orders of magnitude of the number of threads), and each index performs a substantial computation on its own. If you submit a sequence of 1 million indices, each performing a 1 ms calculation, the overhead of submitting each index as a separate task would far outweigh the benefits of parallelization.

### More about `BS::multi_future`

The helper class `BS::multi_future`, which we have been using throughout this section, provides a convenient way to collect and access groups of futures. While a `BS::multi_future` object is created automatically by the pool when parallelizing loops, you can also use it to store futures manually, such as those obtained from `submit_task()` or by other means. `BS::multi_future` is a specialization of `std::vector>`, so it should be used in a similar way:

* When you create a new `BS::multi_future` object, either use the default constructor to create an empty object and add futures to it later, or pass the desired number of futures to the constructor in advance.
* Use the `[]` operator to access the future at a specific index, or the `push_back()` member function to append a new future to the list. (If the number of futures is known in advance, you should use `reserve()` to allocate memory for all of them first, and only then `push_back()` the individual futures, otherwise memory will have to be reallocated multiple times, which is very inefficient.)
* The `size()` member function tells you how many futures are currently stored in the object.

However, `BS::multi_future` also has additional member functions that are aimed specifically at handling futures:

* Once all the futures are stored, you can use `wait()` to wait for all of them at once or `get()` to get an `std::vector` with the results from all of them.
* You can check how many futures are ready using `ready_count()`.
* You can check if all the stored futures are valid using `valid()`.
* You can wait for all the stored futures for a specific duration with `wait_for()` or wait until a specific time with `wait_until()`. These functions return `true` if all futures have been waited for before the duration expired or the time point was reached, and `false` otherwise.

Aside from using `BS::multi_future` to track the execution of parallelized loops, it can also be used, for example, whenever you have several different groups of tasks and you want to track the execution of each group individually.

## Utility classes

### Synchronizing printing to a stream with `BS::synced_stream`

When printing to an output stream from multiple threads in parallel, the output may become garbled. For example, try running this code:

```cpp
#include "BS_thread_pool.hpp" // BS::thread_pool
#include // std::cout

BS::thread_pool pool;

int main()
{
pool.submit_sequence(0, 5,
[](const unsigned int i)
{
std::cout << "Task no. " << i << " executing.\n";
})
.wait();
}
```

The output will be a mess similar to this:

```none
Task no. Task no. Task no. 3 executing.
0 executing.
Task no. 41 executing.
Task no. 2 executing.
executing.
```

The reason is that, although each **individual** insertion to `std::cout` is thread-safe, there is no mechanism in place to ensure subsequent insertions from the same thread are printed contiguously.

The thread pool utility class `BS::synced_stream` is designed to eliminate such synchronization issues. The stream to print to should be passed as a constructor argument. If no argument is supplied, `std::cout` will be used:

```cpp
// Construct a synced stream that will print to std::cout.
BS::synced_stream sync_out;
// Construct a synced stream that will print to the output stream my_stream.
BS::synced_stream sync_out(my_stream);
```

The member function `print()` takes an arbitrary number of arguments, which are inserted into the stream one by one, in the order they were given. `println()` does the same, but also prints a newline character `\n` at the end, for convenience. A mutex is used to synchronize this process, so that any other calls to `print()` or `println()` using the same `BS::synced_stream` object must wait until the previous call has finished.

As an example, this code:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool

BS::synced_stream sync_out;
BS::thread_pool pool;

int main()
{
pool.submit_sequence(0, 5,
[](const unsigned int i)
{
sync_out.println("Task no. ", i, " executing.");
})
.wait();
}
```

Will print out:

```none
Task no. 0 executing.
Task no. 1 executing.
Task no. 2 executing.
Task no. 3 executing.
Task no. 4 executing.
```

**Warning:** Always create the `BS::synced_stream` object **before** the `BS::thread_pool` object, as we did in this example. When the `BS::thread_pool` object goes out of scope, it waits for the remaining tasks to be executed. If the `BS::synced_stream` object goes out of scope before the `BS::thread_pool` object, then any tasks using the `BS::synced_stream` will crash. Since objects are destructed in the opposite order of construction, creating the `BS::synced_stream` object before the `BS::thread_pool` object ensures that the `BS::synced_stream` is always available to the tasks, even while the pool is destructing.

Most stream manipulators defined in the headers `` and ``, such as `std::setw` (set the character width of the next output), `std::setprecision` (set the precision of floating point numbers), and `std::fixed` (display floating point numbers with a fixed number of digits), can be passed as arguments to `print()` and `println()`, and will have the same effect as inserting them to the associated stream.

The only exceptions are the flushing manipulators `std::endl` and `std::flush`, which will not work because the compiler will not be able to figure out which template specializations to use. Instead, use `BS::synced_stream::endl` and `BS::synced_stream::flush`. Here is an example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::sqrt
#include // std::setprecision, std::setw
#include // std::fixed

BS::synced_stream sync_out;
BS::thread_pool pool;

int main()
{
sync_out.print(std::setprecision(10), std::fixed);
pool.submit_sequence(0, 16,
[](const unsigned int i)
{
sync_out.print("The square root of ", std::setw(2), i, " is ", std::sqrt(i), ".", BS::synced_stream::endl);
})
.wait();
}
```

Note, however, that `BS::synced_stream::endl` should only be used if flushing is desired; otherwise, a newline character should be used instead. As with `std::endl`, using `BS::synced_stream::endl` too often will cause a performance hit, as it will force the stream to flush the buffer every time it is called.

If desired, `BS::synced_stream` can also synchronize printing into more than one stream at a time. To facilitate this, we can pass a list of output streams to the constructor. For example, the following program will print the same output to both `std::cout` and a log file:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::ofstream
#include // std::cout

BS::thread_pool pool;

int main()
{
std::ofstream log_file("task.log");
BS::synced_stream sync_out(std::cout, log_file);
pool.submit_sequence(0, 5,
[&sync_out](const unsigned int i)
{
sync_out.println("Task no. ", i, " executing.");
})
.wait();
}
```

Note that we must wait on the future before the `main()` function ends, as otherwise the log file may be destructed before the tasks finish executing. If we used `detach_sequence()`, which does not return a future, we would have to call `pool.wait()` in the last line.

In this example we did not create the `BS::synced_stream` as a global object, since we wanted to pass the log file as a stream to the constructor. However, it is also possible to add streams to or remove streams from an existing `BS::synced_stream` object using the member functions `add_stream()` and `remove_stream()`. For example, in the following program, we create a `BS::synced_stream` global object with the default constructor, so that it prints to `std::cout`, but then we change out minds, remove `std::cout` from the list of streams, and add a log file instead:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::ofstream
#include // std::cout

BS::synced_stream sync_out;
BS::thread_pool pool;

int main()
{
std::ofstream log_file("task.log");
sync_out.remove_stream(std::cout);
sync_out.add_stream(log_file);
pool.submit_sequence(0, 5,
[](const unsigned int i)
{
sync_out.println("Task no. ", i, " executing.");
})
.wait();
}
```

It is common practice to create a global `BS::synced_stream` object, so that it can be accessed from anywhere in the program, without having to pass it to every function that might want to print something to the stream. However, if you also have a global `BS::thread_pool` object, you must always make sure to define the global `BS::synced_stream` object **before** the global `BS::thread_pool` object, for the reasons explained in the warning above.

Internally, `BS::synced_stream` keeps the streams in an `std::vector`. The order in which the streams are added is also the order in which they will be printed to. For more precise control, you can use the member function `get_streams()` to get a reference to this vector, and manipulate it directly as you see fit.

### Synchronizing tasks with `BS::counting_semaphore` and `BS::binary_semaphore`

The thread pool library provides two utility classes, `BS::counting_semaphore` and `BS::binary_semaphore`, which offer versatile synchronization primitives that can be used to synchronize tasks in a variety of ways. These classes are equivalent to the C&plus;&plus;20 `std::counting_semaphore` and `std::binary_semaphore`, respectively, but are offered in the library as convenience polyfills for projects based on C&plus;&plus;17. If C&plus;&plus;20 features are available, the polyfills are not used, and instead are just aliases for the standard library classes.

Since `BS::counting_semaphore` and `BS::binary_semaphore` are identical in functionality to their standard library counterparts, we will not explain how to use them here. Instead, the user is referred to [cppreference.com](https://en.cppreference.com/w/cpp/thread/counting_semaphore).

## Managing tasks

### Monitoring the tasks

Sometimes you may wish to monitor what is happening with the tasks you submitted to the pool. This may be done using these three member functions:

* `get_tasks_queued()` gets the number of tasks currently waiting in the queue to be executed.
* `get_tasks_running()` gets the number of tasks currently being executed by the threads.
* `get_tasks_total()` gets the total number of unfinished tasks: either still in the queue, or being executed by a thread.

Note that `get_tasks_total() == get_tasks_queued() + get_tasks_running()`. These functions are demonstrated in the following program:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::chrono
#include // std::this_thread

BS::synced_stream sync_out;
BS::thread_pool pool(4);

void sleep_half_second(const unsigned int i)
{
std::this_thread::sleep_for(std::chrono::milliseconds(500));
sync_out.println("Task ", i, " done.");
}

void monitor_tasks()
{
sync_out.println(pool.get_tasks_total(), " tasks total, ", pool.get_tasks_running(), " tasks running, ", pool.get_tasks_queued(), " tasks queued.");
}

int main()
{
pool.wait();
pool.detach_sequence(0, 12, sleep_half_second);
monitor_tasks();
std::this_thread::sleep_for(std::chrono::milliseconds(750));
monitor_tasks();
std::this_thread::sleep_for(std::chrono::milliseconds(500));
monitor_tasks();
std::this_thread::sleep_for(std::chrono::milliseconds(500));
monitor_tasks();
pool.wait();
}
```

Assuming you have at least 4 hardware threads (so that 4 tasks can run concurrently), the output should be similar to:

```none
12 tasks total, 0 tasks running, 12 tasks queued.
Task 0 done.
Task 1 done.
Task 2 done.
Task 3 done.
8 tasks total, 4 tasks running, 4 tasks queued.
Task 4 done.
Task 5 done.
Task 6 done.
Task 7 done.
4 tasks total, 4 tasks running, 0 tasks queued.
Task 8 done.
Task 9 done.
Task 10 done.
Task 11 done.
0 tasks total, 0 tasks running, 0 tasks queued.
```

The reason we called `pool.wait()` in the beginning is that when the thread pool is created, an initialization task runs in each thread, so if we don't wait, the first line would say there are 16 tasks in total, including the 4 initialization tasks. See [below](#thread-initialization-functions) for more details. Of course, we also called `pool.wait()` at the end to ensure that all tasks have finished executing before the program ends.

### Purging tasks

Consider a situation where the user cancels a multithreaded operation while it is still ongoing. Perhaps the operation was split into multiple tasks, and half of the tasks are currently being executed by the pool's threads, but the other half are still waiting in the queue.

The thread pool cannot terminate the tasks that are already running, as C&plus;&plus; does not provide that functionality (and in any case, abruptly terminating a task while it's running could have extremely bad consequences, such as memory leaks and data corruption). However, the tasks that are still waiting in the queue can be purged using the `purge()` member function.

Once `purge()` is called, any tasks still waiting in the queue will be discarded, and will never be executed by the threads. Please note that there is no way to restore the purged tasks; they are gone forever!

Consider for example the following program:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::chrono
#include // std::this_thread

BS::synced_stream sync_out;
BS::thread_pool pool(4);

int main()
{
pool.detach_sequence(0, 8,
[](const unsigned int i)
{
std::this_thread::sleep_for(std::chrono::milliseconds(100));
sync_out.println("Task ", i, " done.");
});
std::this_thread::sleep_for(std::chrono::milliseconds(50));
pool.purge();
pool.wait();
}
```

The program submit 8 tasks to the queue. Each task waits 100 milliseconds and then prints a message. The thread pool has 4 threads, so it will execute the first 4 tasks in parallel, and then the remaining 4. We wait 50 milliseconds, to ensure that the first 4 tasks have all started running. Then we call `purge()` to purge the remaining 4 tasks. As a result, these tasks never get executed. However, since the first 4 tasks are still running when `purge()` is called, they will finish uninterrupted; `purge()` only discards tasks that have not yet started running. The output of the program therefore only contains the messages from the first 4 tasks:

```none
Task 0 done.
Task 1 done.
Task 2 done.
Task 3 done.
```

Please note that, as explained above, the thread pool cannot terminate running tasks on its own. If you need to do that, you must incorporate a mechanism into the task itself that will terminate the task safely. For example, you could create an atomic flag that the task checks periodically, terminating itself if the flag is set. Here is a simple example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::chrono
#include // std::this_thread

BS::synced_stream sync_out;
BS::thread_pool pool(4);

int main()
{
std::atomic stop_flag = false;
pool.detach_sequence(0, 8,
[&stop_flag](const unsigned int i)
{
std::this_thread::sleep_for(std::chrono::milliseconds(100));
if (stop_flag)
return;
sync_out.println("Task ", i, " done.");
});
std::this_thread::sleep_for(std::chrono::milliseconds(50));
stop_flag = true;
pool.purge();
pool.wait();
}
```

This program will not print out any output, as the tasks will terminate themselves prematurely when `stop_flag` is set to `true`. In this case, we did not have to call `purge()`, but by doing so we prevented the other 4 tasks from being executed for no reason.

### Exception handling

`submit_task()` catches any exceptions thrown by the submitted task and forwards them to the corresponding future. They can then be caught when invoking the `get()` member function of the future. For example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::exception
#include // std::future
#include // std::runtime_error

BS::synced_stream sync_out;
BS::thread_pool pool;

double inverse(const double x)
{
if (x == 0)
throw std::runtime_error("Division by zero!");
return 1 / x;
}

int main()
{
constexpr double num = 0;
std::future my_future = pool.submit_task(
[num]
{
return inverse(num);
});
try
{
const double result = my_future.get();
sync_out.println("The inverse of ", num, " is ", result, ".");
}
catch (const std::exception& e)
{
sync_out.println("Caught exception: ", e.what());
}
}
```

The output will be:

```none
Caught exception: Division by zero!
```

However, if you change `num` to any non-zero number, no exceptions will be thrown and the inverse will be printed.

It is important to note that `wait()` does not throw any exceptions; only `get()` does. Therefore, even if your task does not return anything, i.e. your future is an `std::future`, you must still use `get()` on the future obtained from it if you want to catch exceptions thrown by it. Here is an example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::exception
#include // std::future
#include // std::runtime_error

BS::synced_stream sync_out;
BS::thread_pool pool;

void print_inverse(const double x)
{
if (x == 0)
throw std::runtime_error("Division by zero!");
sync_out.println("The inverse of ", x, " is ", 1 / x, ".");
}

int main()
{
constexpr double num = 0;
std::future my_future = pool.submit_task(
[num]
{
print_inverse(num);
});
try
{
my_future.get();
}
catch (const std::exception& e)
{
sync_out.println("Caught exception: ", e.what());
}
}
```

When using `BS::multi_future` to handle multiple futures at once, exception handling works the same way: if any of the futures may throw exceptions, you may catch these exceptions when calling `get()`, even in the case of `BS::multi_future`.

Note that if you use `detach_task()`, or any other `detach` member function, there is no way to catch exceptions thrown by the task, as a future will not be returned. In such cases, all exceptions thrown by the task will be silently ignored, to prevent program termination. If you need to catch exceptions in a detached task, you must do so within the task itself, as in this example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::exception
#include // std::runtime_error

BS::synced_stream sync_out;
BS::thread_pool pool;

double inverse(const double x)
{
if (x == 0)
throw std::runtime_error("Division by zero!");
return 1 / x;
}

int main()
{
constexpr double num = 0;
pool.detach_task(
[num]
{
try
{
const double result = inverse(num);
sync_out.println("The inverse of ", num, " is ", result, ".");
}
catch (const std::exception& e)
{
sync_out.println("Caught exception: ", e.what());
}
});
pool.wait();
}
```

If exceptions are explicitly disabled in your codebase, or if the feature-test macro `__cpp_exceptions` is undefined for any other reason, exception handling will be automatically disabled in the thread pool.

### Getting information about the current thread

The class `BS::this_thread` provides functionality analogous to `std::this_thread`, that is, it allows a thread to reference itself. It contains the following static member functions:

* `BS::this_thread::get_index()` can be used to get the index of the current thread as an `std::optional` object.
* If this thread belongs to a `BS::thread_pool` object, the return value will be an index in the range `[0, N)` where `N == BS::thread_pool::get_thread_count()`.
* Otherwise, for example if this thread is the main thread or an independent thread not in any pools, `std::nullopt` will be returned.
* `BS::this_thread::get_pool()` can be used to get a pointer to the thread pool that owns the current thread as an `std::optional` object.
* If this thread belongs to a `BS::thread_pool` object, the return value will be a `void` pointer to that object.
* Otherwise, `std::nullopt` will be returned.

An [`std::optional`](https://en.cppreference.com/w/cpp/utility/optional) is an object that may or may not have a value. [`std::nullopt`](https://en.cppreference.com/w/cpp/utility/optional/nullopt) is a placeholder which indicates that the object does not have a value. To access an `std::optional`, you should first use [`std::optional::has_value()`](https://en.cppreference.com/w/cpp/utility/optional/operator_bool) to check if it contains a value, and if so, use [`std::optional::value()`](https://en.cppreference.com/w/cpp/utility/optional/value) to obtain that value. A shortcut for `if (x.has_value())` is `if (x)`, and a shortcut for `x.value()` is `*x`.

The reason that `BS::this_thread::get_pool()` returns a `void*` is that `BS::thread_pool` is a template. Once you obtain the pool pointer, you must cast it to the desired instantiation of the template if you want to use any member functions. Note that you have to cast it to the correct type; if you cast a pointer to a `BS::light_thread_pool` into a pointer to a `BS::priority_thread_pool`, for example, your program will have undefined behavior. (Please see the [optional features](#optional-features) section for more information about the template parameters and aliases.)

Here is an example illustrating all of the above:

```cpp
#include "BS_thread_pool.hpp" // BS::light_thread_pool, BS::synced_stream, BS::this_thread
#include // std::atomic
#include // std::size_t
#include // std::optional
#include // std::thread

BS::synced_stream sync_out;
BS::light_thread_pool p1;
BS::light_thread_pool p2;
std::atomic ltr = 'A';

void check_this_thread(const char letter)
{
const std::optional my_pool = BS::this_thread::get_pool();
const std::optional my_index = BS::this_thread::get_index();

if (my_pool && my_index)
{
const std::size_t pool_number = *my_pool == &p1 ? 1 : 2;
sync_out.println("Task ", letter, " is being executed by thread #", *my_index, " of pool #", pool_number, '.');
static_cast(*my_pool)->detach_task(
[letter]
{
sync_out.println("-> Task ", ltr++, " was submitted by task ", letter, " using detach_task().");
});
}
else
{
sync_out.println("Task ", letter, " is being executed by an independent thread, not in any thread pools.");
std::thread(
[letter]
{
sync_out.println("-> Task ", ltr++, " was submitted by task ", letter, " using a detached std::thread.");
})
.detach();
}
}

int main()
{
p1.submit_task(
[]
{
check_this_thread(ltr++);
})
.wait();
p2.submit_task(
[]
{
check_this_thread(ltr++);
})
.wait();
std::thread(
[]
{
check_this_thread(ltr++);
})
.join();
}
```

The output of this program will be similar to:

```none
Task A is being executed by thread #3 of pool #1.
-> Task B was submitted by task A using detach_task().
Task C is being executed by thread #7 of pool #2.
-> Task D was submitted by task C using detach_task().
Task E is being executed by an independent thread, not in any thread pools.
-> Task F was submitted by task E using a detached std::thread.
```

In this example, we execute the task `check_this_thread()` in three different ways:

1. By submitting it from the thread pool `p1`.
2. By submitting it from the thread pool `p2`.
3. By submitting it from an independent `std::thread`.

The task calls `BS::this_thread::get_pool()` and `BS::this_thread::get_index()` and receives two `std::optional` objects, `my_pool` and `my_index`. If both have a value (that is, evaluate to `true`), then the task knows it is running in a thread pool. The actual values are then obtained by "dereferencing" them: the pool pointer is `*my_pool`, and the thread index is `*my_index`.

The task deduces which pool it is running in by comparing the pointer `*my_pool` to the addresses of the pools `p1` and `p2`. It also gets the index of the thread from `*my_index`. Finally, it detaches an additional task (without waiting for it, as that might cause a deadlock!) from its own pool by first casting the `void*` pointer to the correct type, which in this case is `BS::light_thread_pool*`, and then calling the `detach_task()` member function of that specific pool.

If `my_pool` and `my_index` do not have values (that is, evaluate to `false`), then the task knows it is running in an independent thread. In this case, it detaches the additional task using another independent thread.

### Thread initialization functions

Sometimes, it is necessary to initialize the threads before they run any tasks. This can be done by submitting a proper initialization function to the `BS::thread_pool` constructor or to `reset()`, either as the only argument or as the second argument after the desired number of threads.

The thread initialization function must have no return value. It can either take one argument, the thread index of type `std::size_t`, or zero arguments. In the latter case, the function can use `BS::this_thread::get_index()` to find the thread index. In addition, the function can use `BS::this_thread::get_pool()` to find which pool its thread belongs to.

The initialization functions are effectively submitted as a set of special tasks, one per thread, which bypass the queue, but still count towards the number of running tasks. This means `get_tasks_total()` and `get_tasks_running()` will report that these tasks are running if they are checked immediately after the pool is initialized.

This is done so that the user has the option to either wait for the initialization functions to finish, by calling `wait()` on the pool, or just keep going. In either case, the initialization functions will always finish running before any tasks are executed by the corresponding thread, so there is no reason to wait for them to finish unless they have some side-effects that affect the main thread, or if they must finish running on **all** the threads before the pool starts executing any tasks.

Here is a simple example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::thread_pool
#include // std::mt19937_64, std::random_device

BS::synced_stream sync_out;
thread_local std::mt19937_64 twister;

int main()
{
BS::thread_pool pool(
[]
{
twister.seed(std::random_device()());
});
pool.submit_sequence(0, 4,
[](int)
{
sync_out.println("I generated a random number: ", twister());
})
.wait();
}
```

In this example, we create a `thread_local` Mersenne twister engine, meaning that each thread has its own independent engine. However, if we do not seed the engine, each thread would generate the exact same sequence of pseudo-random numbers. To remedy this, we pass an initialization function to the `BS::thread_pool` constructor which seeds the twister in each thread with the (hopefully) non-deterministic random number generator `std::random_device`.

Note that the lambda function we passed to `submit_sequence()` has the signature `[](int)`, with an unnamed `int` argument, as it does not make use of the sequence index, which will be a number in the range `[0, 4)`. This is an easy way to simply submit the same task multiple times.

**Warning:** Exceptions thrown by thread initialization functions must not throw any exceptions, as that will result in program termination. Any exceptions must be handled explicitly within the function.

### Thread cleanup functions

Similarly to the thread initialization function, it is also possible to provide the pool with a cleanup function to run in each thread right before it is destroyed, which will happen when the pool is destructed or reset. Like the initialization function, the cleanup function must have no return value, and can either take one argument, the thread index of type `std::size_t`, or zero arguments. Each pool can have its own cleanup function, which is specified using the member function `set_cleanup_func()`. Here is an example:

```cpp
#include "BS_thread_pool.hpp" // BS::synced_stream, BS::this_thread, BS::thread_pool
#include // std::chrono
#include // std::size_t
#include // std::ofstream
#include // std::to_string
#include // std::this_thread

thread_local std::ofstream log_file;
thread_local BS::synced_stream sync_out(log_file);
constexpr std::size_t threads = 4;

int main()
{
BS::thread_pool pool(threads,
[](const std::size_t idx)
{
log_file.open("thread_" + std::to_string(idx) + ".log");
});
pool.set_cleanup_func(
[]
{
log_file.close();
});
pool.submit_sequence(0, threads * 10,
[](const std::size_t idx)
{
std::this_thread::sleep_for(std::chrono::milliseconds(50));
sync_out.println("Task ", idx, " is running on thread ", *BS::this_thread::get_index(), '.');
})
.wait();
}
```

In this example, we create 4 threads, each of which has a separate thread-local `BS::synced_stream` object writing to its own log file of the form `thread_N.log` where `N` is the thread index. The initialization function, passed as an argument to the constructor, opens the log file. The cleanup function, set using `set_cleanup_func()`, closes the log file.

We submit 40 tasks to the queue using `submit_sequence()`, each of which prints a message to the log file indicating which thread it is running on. When the `main()` function exits and `pool` is destroyed, the cleanup function is called for each thread, ensuring that the log files are closed properly.

**Warning:** As with initialization functions, exceptions thrown by thread cleanup functions must not throw any exceptions, as that will result in program termination. Any exceptions must be handled explicitly within the function.

### Passing task arguments by constant reference

In C&plus;&plus;, it is often crucial to pass function arguments by reference or constant reference, instead of by value. This allows the function to access the object being passed directly, rather than creating a new copy of the object. We have already seen [above](#detaching-and-waiting-for-tasks) that submitting an argument by reference is a simple matter of capturing it with a