{"id":19705889,"url":"https://github.com/llnl/b-mpi3","last_synced_at":"2025-07-15T15:43:09.699Z","repository":{"id":41070085,"uuid":"318444580","full_name":"LLNL/b-mpi3","owner":"LLNL","description":"  This aims to be an wrapper to C-MPI3 for C++, using the principles of simplicity, STL, RAII and Boost and enforcing type-safety. This is a mirror of https://gitlab.com/correaa/boost-mpi3.","archived":false,"fork":false,"pushed_at":"2024-10-11T08:50:11.000Z","size":1823,"stargazers_count":21,"open_issues_count":0,"forks_count":0,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-29T16:46:57.493Z","etag":null,"topics":["c-plus-plus","cpp","header-only","mpi","radiuss"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LLNL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-12-04T07:57:30.000Z","updated_at":"2025-04-14T13:45:15.000Z","dependencies_parsed_at":"2024-01-19T01:36:20.629Z","dependency_job_id":"d4831093-8c39-4020-9d2c-14a070f192d2","html_url":"https://github.com/LLNL/b-mpi3","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/LLNL/b-mpi3","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2Fb-mpi3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2Fb-mpi3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2Fb-mpi3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/LLNL%2Fb-mpi3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LLNL","download_url":"https://codeload.github.com/LLNL/b-mpi3/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LLNL%2Fb-mpi3/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265443556,"owners_count":23766417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-plus-plus","cpp","header-only","mpi","radiuss"],"created_at":"2024-11-11T21:31:28.756Z","updated_at":"2025-07-15T15:43:09.637Z","avatar_url":"https://github.com/LLNL.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!--- \u00262\u003e/dev/null\n    xdg-open $0; exit\n---\u003e\n[comment]: # (Comment)\n\n# B.MPI3\n*Alfredo A. 
Correa*\n\u003ccorreaa@llnl.gov\u003e\n\n[//]: \u003c\u003e (\u003calfredo.correa@gmail.com\u003e)\n\nB-MPI3 is a C++ library wrapper for version 3.1 of the MPI standard interface that simplifies the utilization and maintenance of MPI code.\nB-MPI3 aims to provide an interface that is more convenient, more powerful and less prone to errors than the standard C-based MPI interface.\n\nB-MPI3 simplifies the utilization of MPI without completely changing the communication model, allowing for a seamless transition from C-MPI.\nB-MPI3 also provides allocators and facilities to manipulate MPI-mediated Remote Access and shared memory.\n\nFor example, pointers are not utilized directly; they are replaced by an iterator-based interface, and most data, in particular objects of custom types, are serialized automatically into messages by the library.\nB-MPI3 interacts well with the C++ standard library, containers and custom data types (classes).\n\nB.MPI3 is written from [scratch](https://octo-repo-visualization.vercel.app/?repo=llnl%2Fb-mpi3) in C++17 and it has been tested with many standard-compliant MPI library implementations (OpenMPI +1.9, MPICH +3.2.1, MVAPICH, Spectrum MPI, and [ExaMPI](https://github.com/tonyskjellum/ExaMPI)), using the following compilers: gcc +5.4.1, clang +6.0, PGI 18.04.\n\nB.MPI3 is not an official Boost library, but is designed following the principles of Boost and the STL.\nB.MPI3 is not a derivative of Boost.MPI and it is unrelated to the [now deprecated](https://web.archive.org/web/20170421220544/http://blogs.cisco.com/performance/the-mpi-c-bindings-what-happened-and-why/) official MPI-C++ interface.\nIt adds features which were missing in Boost.MPI (which only covers MPI-1), with an iterator-based interface and MPI-3 features (RMA and Shared memory).\n\nB.MPI3 optionally depends on Boost +1.53 for automatic serialization.\n\n## Contents\n[[_TOC_]]\n\n## Introduction\n\nMPI is a large library for run-time parallelism where several paradigms 
coexist.\nIt was originally designed as a standardized and portable message-passing system to work on a wide variety of parallel computing architectures.\n\nThe MPI-3 standard uses a combination of techniques to achieve parallelism: Message Passing (MP), Remote Memory Access (RMA) and Shared Memory (SM).\nWe try here to give a uniform interface and abstractions for these features by means of wrapper function calls and concepts familiar from C++ and the STL.\n\n## Motivation: The problem with the standard interface\n\nA typical C-call for MP looks like this,\n\n```cpp\nint status_send = MPI_Send(\u0026numbers, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);\nassert(status_send == MPI_SUCCESS);\n... // concurrently with \nint status_recv = MPI_Recv(\u0026numbers, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);\nassert(status_recv == MPI_SUCCESS);\n```\n\nIn principle this call can be made from a C++ program.\nHowever there are obvious drawbacks to using this standard interface.\n\nHere we enumerate some of the problems:\n\n* Function calls have many arguments (e.g. 6 or 7 arguments on average).\n* Many mandatory arguments are redundant or could easily have a natural default value (e.g. message tags are not always necessary).\n* Use of raw pointers and sizes (e.g. `\u0026numbers` and `10`).\n* Data arguments are type-erased into `void*`.\n* Only primitive types (e.g. `MPI_INT`) can be passed.\n* Consistency between pointer types and data-types is the responsibility of the user.\n* Only contiguous memory blocks can be used with this interface.\n* Error codes are returned and have to be checked after each function call.\n* Use of handles (such as `MPI_COMM_WORLD`); handles do not have well-defined semantics.\n\nA call of this type would be an improvement:\n\n```cpp\nworld.send(numbers.begin(), numbers.end(), 1);\n... 
// concurrently with \nworld.receive(numbers.begin(), numbers.end(), 0); \n```\n\nFor other examples, see here: [http://mpitutorial.com/tutorials/mpi-send-and-receive/](http://mpitutorial.com/tutorials/mpi-send-and-receive/)\n\nMPI used to ship with a C++-style interface.\nIt turns out that this interface was a very minimal change over the C version, and for good reasons it was dropped.\n\nThe B.MPI3 library was designed to be used simultaneously (interleaved) with the standard C interface of MPI.\nIn this way, changes to existing code can be made incrementally.\n\n## Usage\n\nThe library is \"header-only\"; no separate compilation or configuration is necessary after downloading the library.\n\n\n```bash\ngit clone https://gitlab.com/correaa/boost-mpi3.git\n```\n\nIt requires an MPI distribution (e.g. OpenMPI or MPICH2), a C++14 compiler and Boost libraries installed.\nIn a system such as Ubuntu or Fedora, the dependencies can be installed with `sudo apt install g++ libmpich-dev libboost-test-dev` or `sudo dnf install gcc-c++ boost-devel openmpi-devel mpich-devel`.\n\nA typical compilation/run command looks like this:\n\n```bash\n$ mpic++ communicator_send.cpp -o communicator_send.x -lboost_serialization\n$ mpirun -n 8 ./communicator_send.x\n```\n\nAlternatively, the library can be fetched on demand by the CMake project:\n\n```cmake\ninclude(FetchContent)\nFetchContent_Declare(bmpi3 GIT_REPOSITORY https://gitlab.com/correaa/boost-mpi3.git)  # or git@gitlab.com:correaa/boost-mpi3.git\nFetchContent_MakeAvailable(bmpi3)\n\ntarget_link_libraries(your_executable PRIVATE bmpi3)\n```\n\nSome systems require loading the MPI module before compiling or using MPI programs, `module load mpi` (or `mpich`).\n\nThe library is tested frequently against the `openmpi` and `mpich` implementations of MPI.\n\n## Testing\n\nThe library has a basic `ctest` based testing system.\n\n```bash\n# module load mpi/mpich  # or mpi/openmpi  # needed in systems like Fedora\ncd mpi3\nmkdir build 
\u0026\u0026 cd build\ncmake ..\ncmake --build .\nctest\n```\n\n## Initialization\n\nLike MPI, B.MPI3 requires some global library initialization.\nThe library includes a convenience header `mpi3/main.hpp`, which provides a \"main\" function that does this initialization. \nIn this way, a parallel program looks very much like a normal program, except that the main function has a third argument with the default global communicator passed in.\n\n```cpp\n#include \"mpi3/version.hpp\"\n#include \"mpi3/main.hpp\"\n\n#include\u003ciostream\u003e\n\nnamespace mpi3 = boost::mpi3; \n\nint mpi3::main(int argc, char** argv, mpi3::communicator world) {\n\tif(world.rank() == 0) {std::cout \u003c\u003c mpi3::version() \u003c\u003c '\\n';}\n\treturn 0;\n}\n```\n\nHere `world` is a communicator object that is a wrapper over the MPI communicator handle.\n\nChanging the `main` program to this syntax in existing code can be too intrusive. \nFor this reason a more traditional initialization is also possible.\nThe alternative initialization is done by instantiating the `mpi3::environment` object (from which the global communicator `.world()` is extracted).\n\n```cpp\n#include \"mpi3/environment.hpp\"\n\nnamespace mpi3 = boost::mpi3;\n\nint main(int argc, char** argv){\n\tmpi3::environment env(argc, argv);\n\tauto world = env.world(); // communicator is extracted from the environment \n    // ... 
code here\n\treturn 0;\n}\n```\n\n## Communicators\n\nIn the last example, `world` is a global communicator (not necessarily the same as `MPI_COMM_WORLD`, but a copy of it).\nThere is no global communicator variable `world` that can be accessed directly in a nested function.\nThe idea behind this is to avoid using the global communicators in nested functions of the program unless they are explicitly passed in the function call.\nCommunicators are usually passed by reference to nested functions.\nEven in traditional MPI it is a mistake to assume that `MPI_COMM_WORLD` is the only available communicator.\n\n`mpi3::communicator` represents communicators with value-semantics.\nThis means that `mpi3::communicator` can be copied or passed by reference.\nA communicator and its copies are different entities that compare equal.\nCommunicators can be empty, in a state that is analogous to `MPI_COMM_NULL` but with proper value semantics.\n\nAs in MPI, communicators can be duplicated (copied into a new instance) or split.\nThey can also be compared. \n\n```cpp\nmpi3::communicator world2 = world;\nassert( world2 == world );\nmpi3::communicator hemisphere = world/2;\nmpi3::communicator interleaved = world%2;\n```\n\nThis program, for example, splits the global communicator into two sub-communicators, one of size 2 (including processes 0 and 1) and one of size 6 (including 2, 3, ... 7):\n\n```cpp\n#include \"mpi3/main.hpp\"\n#include \"mpi3/communicator.hpp\"\n\n#include\u003ccassert\u003e\n\nnamespace mpi3 = boost::mpi3;\nusing std::cout;\n\nint mpi3::main(int argc, char* argv[], mpi3::communicator world){\n    assert(world.size() == 8); // this program can only be run in 8 processes\n    mpi3::communicator comm = (world \u003c= 1);\n    assert(!comm || (comm \u0026\u0026 comm.size() == 2));\n    return 0;\n}\n```\n\nCommunicators also give index access to individual `mpi3::process`es, ranging from `0` to `comm.size() - 1`. 
\nFor example, `world[0]` refers to process 0 of the global communicator.\nAn `mpi3::process` is simply a rank inside a communicator.\nThis concept doesn't exist explicitly in the standard C interface, but it simplifies the syntax for message passing.\n\nSplitting communicators can be done more traditionally via the `communicator::split` member function. \n\nCommunicators are used to pass messages and to create memory windows.\nA special type of communicator is the shared communicator, `mpi3::shared_communicator`.\n\n## Message Passing\n\nThis section describes the features related to the message passing (MP) functions in the MPI library.\nIn C-MPI information is passed via pointers to memory.\nThis is expected in a C-based interface and it is also very efficient.\nIn Boost.MPI, information is passed exclusively by value semantics. \nAlthough there are optimizations that amortize the cost, we decided to generalize the pointer interface and leave the value-based message passing for a higher-level syntax. \n\nHere we replicate the design of the STL to process information, that is, aggregated data is passed mainly via iterators. (A pointer is a type of iterator.)\n\nFor example, in the STL data is copied between ranges in this way.\n```cpp\nstd::copy(origin.begin(), origin.end(), destination.begin());\n```\n\nThe caller of the function `copy` doesn't need to worry about the type of the `origin` and `destination` containers; it can mix pointers and iterators, and the function doesn't need any redundant information beyond what is passed. \nThe programmer is responsible for managing the memory and making sure that the design is such that the algorithm can access the data referred to by the passed iterators.\n\nContiguous iterators (to built-in types) are particularly efficient because they can be mapped to pointers at compile time. 
This in turn is translated into an MPI primitive function call.\nThe interface for other types of iterators, or for contiguous iterators to non-built-in types, is simulated, mainly via buffers and serialization.\nThe idea behind this is that generic message passing function calls can be made to work with arbitrary data types.\n\nThe main interface for message passing in B.MPI3 consists of member functions of the communicator.\nFor example `communicator::send`, `::receive` and `::barrier`. \nThe functions `::rank` and `::size` allow each process to determine its unique identity inside the communicator.\n\n```cpp\nint mpi3::main(int argc, char* argv[], mpi3::communicator world) {\n    assert(world.size() == 2);\n\tif(world.rank() == 0) {\n\t   std::vector\u003cdouble\u003e v = {1.,2.,3.};\n\t   world.send(v.begin(), v.end(), 1); // send to rank 1\n\t} else if(world.rank() == 1) {\n\t   std::vector\u003cdouble\u003e v(3);\n\t   world.receive(v.begin(), v.end(), 0); // receive from rank 0\n\t   assert( v == std::vector{1.,2.,3.} );\n\t}\n\tworld.barrier(); // synchronize execution here\n\treturn 0;\n}\n```\n\nOther important functions are `::gather`, `::broadcast` and `::accumulate`. 
\nThis syntax has a more or less obvious (but simplified) mapping to the standard C-MPI interface.\nIn B.MPI3, however, all these functions have reasonable defaults that make the function calls shorter and less prone to errors than with the C-MPI interface.\n\nFor more examples, look into `./mpi3/tests/`, `./mpi3/examples/` and `./mpi3/exercises/`.\n\nThe interface described above is iterator based and is a direct generalization of the C-interface, which works with pointers.\nIf the iterators are contiguous and the associated value types are primitive MPI types, the function is directly mapped to the C-MPI call.\n\nAlternatively, a value-based interface can be used.\nWe will show the terse syntax, using the process objects.\n\n```cpp\nint mpi3::main(int, char**, mpi3::communicator world) {\n    assert(world.size() == 2);\n\tif(world.rank() == 0) {\n\t   double v = 5.;\n\t   world[1] \u003c\u003c v;\n\t} else if(world.rank() == 1) {\n\t   double v = -1.;\n\t   world[0] \u003e\u003e v;\n\t   assert(v == 5.);\n\t}\n\treturn 0;\n}\n```\n\n## Remote Memory Access\n\nRemote Memory (RM) is handled by `mpi3::window` objects. \n`mpi3::window`s are created by `mpi3::communicator` via collective (member) functions.\nSince `mpi3::window`s represent memory, they cannot be copied (but can be moved). 
\n\n```cpp\nmpi3::window w = world.make_window(begin, end);\n```\n\nJust like in the MPI interface, local access and remote access are synchronized by a `window::fence` call.\nRead and write remote access is performed via put and get functions.\n\n```cpp\nw.fence();\nw.put(begin, end, rank);\nw.fence();\n```\n\nThis is a minimal example using the `put` and `get` functions.\n\n```cpp\n#include \"mpi3/main.hpp\"\n\n#include\u003ccassert\u003e\n#include\u003ciostream\u003e\n#include\u003cvector\u003e\n\nnamespace mpi3 = boost::mpi3; using std::cout;\n\nint mpi3::main(int, char*[], mpi3::communicator world) {\n\n\tstd::vector\u003cdouble\u003e darr(world.rank()?0:100);\n\tmpi3::window\u003cdouble\u003e w = world.make_window(darr.data(), darr.size());\n\tw.fence();\n\tif(world.rank() == 0) {\n\t\tstd::vector\u003cdouble\u003e a = {5., 6.};\n\t\tw.put(a.begin(), a.end(), 0);\n\t}\n\tworld.barrier();\n\tw.fence();\n\tstd::vector\u003cdouble\u003e b(2);\n\tw.get(b.begin(), b.end(), 0);\n\tw.fence();\n\tassert( b[0] == 5.);\n\tworld.barrier();\n\n\treturn 0;\n}\n```\n\nIn this example, memory from process 0 is shared across the communicator, and accessible through a common window.\nProcess 0 writes (`window::put`s) values in the memory (this can be done locally or remotely). \nLater all processes read from this memory. \nThe `put` and `get` functions take at least 3 arguments (and at most 4).\nThe first two form a range of iterators, while the third is the destination/source process rank (called \"target_rank\"). \n\nRelevant examples and tests are located in `./mpi3/tests/`, `./mpi3/examples/` and `./mpi3/exercises/`.\n\n`mpi3::window`s may carry type information (as `mpi3::window\u003cdouble\u003e`) or not (`mpi3::window\u003c\u003e`).\n\n## Shared Memory\n\nShared memory (SM) uses the underlying capability of the operating system to share memory between processes within the same node. 
\nHistorically shared memory has an interface similar to that of remote access.\nOnly communicators that comprise a single node can be used to create a shared memory window.\nA special type of communicator can be created by splitting a given communicator.\n\n`mpi3::shared_communicator node = world.split_shared();`\n\nIf the job is launched on a single node, `node` will be equal (congruent) to `world`.\nOtherwise the global communicator will be split into a number of (shared) communicators equal to the number of nodes.\n\n`mpi3::shared_communicator`s can create `mpi3::shared_window`s. \nThese are a special type of memory window.\n\n```cpp\n#include \"mpi3/main.hpp\"\n\nnamespace mpi3 = boost::mpi3; using std::cout;\n\nint mpi3::main(int argc, char* argv[], mpi3::communicator world) {\n\n\tmpi3::shared_communicator node = world.split_shared();\n\tmpi3::shared_window\u003cint\u003e win = node.make_shared_window\u003cint\u003e(node.rank()==0?1:0);\n\n\tassert(win.base() != nullptr and win.size\u003cint\u003e() == 1);\n\n\twin.lock_all();\n\tif(node.rank()==0) *win.base\u003cint\u003e(0) = 42;\n\tfor (int j=1; j != node.size(); ++j){\n\t\tif(node.rank()==0) node.send_n((int*)nullptr, 0, j);//, 666);\n\t    else if(node.rank()==j) node.receive_n((int*)nullptr, 0, 0);//, 666);\n\t}\n\twin.sync();\n\n\tint l = *win.base\u003cint\u003e(0);\n\twin.unlock_all();\n\n\tint minmax[2] = {-l,l};\n\tnode.all_reduce_n(\u0026minmax[0], 2, mpi3::max\u003c\u003e{});\n\tassert( -minmax[0] == minmax[1] );\n\tcout \u003c\u003c \"proc \" \u003c\u003c node.rank() \u003c\u003c \" \" \u003c\u003c l \u003c\u003c std::endl;\n\n\treturn 0;\n}\n```\n\nFor more examples, look into `./mpi3/tests/`, `./mpi3/examples/` and `./mpi3/exercises/`.\n\n# Beyond MP: RMA and SHM\n\nMPI provides a very low-level abstraction for inter-process communication.\nHigher levels of abstraction can be constructed on top of MPI, and using the wrapper simplifies this work considerably.\n\n## Mutex\n\nMutexes can be 
implemented fairly simply on top of RMA.\nMutexes are used much as in threaded code: \nthey prevent certain blocks of code from being executed by more than one process (rank) at a time.\n\n```cpp\n#include \"mpi3/main.hpp\"\n#include \"mpi3/mutex.hpp\"\n\n#include\u003ciostream\u003e\n\nnamespace mpi3 = boost::mpi3; using std::cout;\n\nint mpi3::main(int, char**, mpi3::communicator world) {\n\n\tmpi3::mutex m(world);\n\t{\n\t\tm.lock();\n\t\tcout \u003c\u003c \"locked from \" \u003c\u003c world.rank() \u003c\u003c '\\n';\n\t\tcout \u003c\u003c \"never interleaved \" \u003c\u003c world.rank() \u003c\u003c '\\n';\n\t\tcout \u003c\u003c \"forever blocked \" \u003c\u003c world.rank() \u003c\u003c '\\n';\n\t\tcout \u003c\u003c std::endl;\n\t\tm.unlock();\n\t}\n\treturn 0;\n}\n```\n\n(Recursive mutexes are not implemented yet.)\n\nMutexes themselves can be used to implement atomic operations on data.\n\n# Ongoing work\n\nWe are implementing memory allocators for remote memory, atomic classes and asynchronous remote function calls.\nHigher abstractions and use patterns will be implemented, especially those that fit into the patterns of the STL algorithms and containers.\n\n# Advanced Topics\n\n## Thread safety\n\nIf you are not using threads at all, you can skip this section; \nhowever here you can find some rationale behind design decisions taken by the library and learn how to use `mpi3::communicator` as a member of a class.\n\nThread-safety with MPI is extremely complicated, as there are various aspects to it, from the data communicated, to the communicator itself, to the order of operations, to asynchronous messaging, to the runtime system.\nThis library doesn't try to hide this fact; instead, it leverages the tools available to C++ to deal with this complication.\nAs we will see, there are certain steps to make the code _compatible_ with threads to different degrees.\n\nAbsolute thread-safety is a very strong guarantee and it would come at a very steep performance 
cost.\nAlmost no general purpose library guarantees complete thread safety.\nIn contrast to thread-safety, we will discuss thread-compatibility, which is a more reasonable goal.\nThread-compatibility refers to the property of a system that can be made thread-safe if extra steps are taken, with the option to take these steps only when needed.\n\nThe first condition for thread compatibility is to have an MPI environment that supports threads.\nIf your MPI system provides `thread_support` only at the level of `mpi3::thread::single`, it means that there is probably no way to make MPI calls from different threads and expect correct results.\nIf your program expects to call MPI in concurrent sections, your only option would be to change to a system that supports MPI threading.\n\nIn this small example, we assume that the program expects threading and MPI, and we completely reject the run if no level above `single` is provided. \nThis is not at all a terrible choice; _optionally_ supporting threading in a big program can be prohibitive from a design point of view.\n\n```cpp\nint main() {\n\tmpi3::environment env{mpi3::thread::multiple};\n\tswitch( env.thread_support() ) {\n\t\tcase mpi3::thread::single    : throw std::logic_error{\"threads not supported\"};\n\t\tcase mpi3::thread::funneled  : std::cout\u003c\u003c\"funneled\"  \u003c\u003cstd::endl; break;\n\t\tcase mpi3::thread::serialized: std::cout\u003c\u003c\"serialized\"\u003c\u003cstd::endl; break;\n\t\tcase mpi3::thread::multiple  : std::cout\u003c\u003c\"multiple\"  \u003c\u003cstd::endl; break;\n\t}\n\t...\n```\n\nAlternatively you can just check that `env.thread_support() \u003e mpi3::thread::single`, since the levels `multiple \u003e serialized \u003e funneled \u003e single` are ordered.\n\n### From C to C++\n\nThe MPI-C standard interface is expressed in the C language (and Fortran).\nThe C language doesn't have many ways to deal with threads except by thorough documentation.\nThis 
indicates that any level of thread assurance that we can express in a C++ interface cannot be derived from the C-interface syntax alone; \nit has to be derived, at best, from the documentation and, when documentation is lacking, from common sense and common practice in existing MPI implementations.\n\nThe modern C++ language has several tools to deal with thread safety: the C++11 memory model, the `const`, `mutable` and `thread_local` keywords and a few other standard types and functions, such as `std::mutex`, `std::call_once`, etc.\n\n### Data and threads\n\nEven if MPI operations are called outside concurrent sections it is still your responsibility to make sure that the *data* involved in communication is synchronized; this is always the case.\nClear ownership and scoping of *data* helps a lot towards thread safety.\nAvoiding mutable shared data between threads also helps.\nPerhaps as a last resort, data can be locked with mutex objects to be written or accessed one thread at a time.\n\n### Communicator and threads\n\nThe library doesn't control or own the communicated data for the most part; therefore the main concern of the library regarding threading is within the communicator class itself.\n\nThe C-MPI interface briefly mentions thread-safety; for example, most MPI operations are accompanied by the following note (e.g. https://www.mpich.org/static/docs/latest/www3/MPI_Send.html):\n\n\u003e **Thread and Interrupt Safety**\n\u003e\n\u003e This routine is thread-safe. This means that this routine may be safely used by multiple threads without the need for any  user-provided thread locks. However, the routine is not interrupt safe. Typically, this is due to the use of memory allocation routines such as malloc or other non-MPICH runtime routines that are themselves not interrupt-safe. 
\n\nThis doesn't mean that _all_ calls can be safely made from different threads concurrently; only some of them, those that refer to completely different arguments, can be safe.\n\nIn practice it is observable that for most MPI operations the \"state\" of the communicator can change in time.\nEven if after the operation the communicator seems to be in the same state as before the call, the operation itself changes, at least briefly, the state of the communicator object.\nThis internal state can be observed from another thread, even if only transiently and through undefined behavior.\nA plausible model to explain this behavior is that internal buffers are used by individual communicators during communication.\n\nIn modern C++, this is enough to mark communicator operations non-`const` (i.e. operations that can be applied only to a mutable instance of the communicator).\n\n(`MPI_Send` has \"tags\" to differentiate separate communications, which may help with concurrent calls, but this is still not enough since the tags are runtime variables whose origin the library doesn't know.\nBesides, the use of tags is not a general solution since collective operations do not use tags at all.\nIt has been known for a while that the identity of the communicator in some sense serves as a tag for collective communications.\nThis is why it is so useful to be able to duplicate communicators to distinguish between collective communication messages.)\n\nThis explains why most member functions of `mpi3::communicator` are non-`const`, and also why most of the time `mpi3::communicator`s must be passed either by non-`const` reference or by value (depending on the intended use, see below).\nBe aware that passing by `const`-reference `mpi3::communicator const\u0026` is not very productive because no communication operation can be performed with this instance (not even duplication to obtain a new instance).\n(This behavior is not unheard of in C++: standard \"printing\" streams 
generally need to be _mutable_ to be useful (e.g. `std::cout` or `std::ofstream`), even though they don't seem to have a changing state.)\n\nThis brings us to the important topic of communicator construction and assignment.\n\nMore material: [\"C++ and Beyond 2012: Herb Sutter - You don't know const and mutable\"](https://web.archive.org/web/20170119232617/https://channel9.msdn.com/posts/C-and-Beyond-2012-Herb-Sutter-You-dont-know-blank-and-blank) and [\"related\"](https://web.archive.org/web/20160924183715/https://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Herb-Sutter-Concurrency-and-Parallelism).\n\n### Duplication of communicator\n\nIn C, custom structures do not have special member functions that indicate copying.\nIn general this is provided by free functions operating on pointer or _handle_ types, whose signatures generally ignore `const`ness.\n\nIn C-MPI, the main function to duplicate a communicator is `int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)`.\nWhen translating from C to C++ we have to understand that `MPI_Comm` is a handle to a communicator, that is, it behaves like a pointer.\nIn a general library the source (first argument) is conceptually constant (unmodified) during a copy, so we could be tempted to mark it as `const` when translating to C++.\n\n```cpp\nstruct communicator {\n    ...\n    communicator bad_duplicate() const;  // returns a new communicator, congruent to the current communicator\n};\n```\nFurthermore, we could be tempted to call it `copy` or even to make it part of the copy-constructor.\n\nBut, alas, this is not the case according to the rules we delineated earlier.\nWe know that duplication is an operation that requires communication and it is observable (through concurrent threads) that the internal state of the _original_ communicator is changed *while* it is duplicated.\nTherefore, to be honest about the semantics of communicator duplication, we are forced to implement this function as 
non-`const`.\n\n```cpp\nstruct communicator {\n    ...\n    communicator duplicate();  // returns a new communicator, congruent to the current communicator\n};\n```\n\nThe consequence of this line of thought is that a `const` communicator (i.e. non-mutable) cannot be duplicated.\nThat is, not only can such a communicator not perform any communication operation, it cannot even be duplicated.\n\n```cpp\nmpi3::communicator new_comm{comm.duplicate()};\n```\n\nThis syntax also makes very explicit what the operation really does.\n\n### Pass-by-value or pass-by-reference\n\nAs indicated earlier, a useful communicator is one that is mutable.\nTherefore when passing a communicator to a function we have two main options: pass by reference (non-const reference) or by value.\n\n```cpp\nvoid f(mpi3::communicator\u0026 comm);\nvoid g(mpi3::communicator  comm);\n```\n\nThese two cases have different meanings and different things can be done with the corresponding communicators.\n\nCase `f` implies either that we are reusing a communicator, even if all communication operations are internal to the function, or that `f` can communicate messages with a communicator that is external to the function.\n\nAlthough reusing a communicator sounds reasonable (since duplicating communicators can be an expensive operation), even if all communication is contained in `f` there is a risk that some communication is mixed inadvertently with communication external to `f`.\nThe logic of collective operations would be distributed over different functions in the code, which is possible but difficult or impossible to reason about.\nIf `f` is running in a multi-threaded environment, it could be dealing with a communicator that is being shared with other threads.\n\nCase `g` is different in the sense that it knows that it has exclusive access to the communicator; send and receive operations cannot be captured by other functions, and collective operations only need to be fully contained inside 
`g`.\n\nFor completeness we can also imagine a function declared as pass-by-const-reference.\n\n```cpp\nvoid h(mpi3::communicator const\u0026 comm);\n```\n\nIn the current system, this is not very useful since only a handful of operations, which do not include communication or duplication, can be done.\n(An example is probing the `.size()` of the communicator.)\nTake into account that, inside the `h` function, it is also \"too late\" to produce a duplicate of the communicator.\n\n## Communicator as an implementation detail\n\nNote that so far we have not indicated how to use an `mpi3::communicator` member with threads; we are simply being transparent about what each MPI operation is likely to do behind the scenes regarding the (transient) state of the communicator.\n\nIn C++ it is very useful to include a communicator in each object that needs to perform communication to maintain its internal consistency.\nSuppose we have data that is distributed across many processes, and that we store an instance of the communicator containing these processes.\nSuch a class could have operations that modify its state and others that do not.\nThe correct design is to mark the functions in the latter category as `const`.\n\n```cpp\nstruct distributed_data {\n    void set() { ... }\n    void print() const { ... }\n\n private:\n    mpi3::communicator comm_;\n};\n```\n\nHowever, such a design doesn't work: for `print` to do any actual communication (e.g. to communicate some data to the root process) it would need access to a mutable communicator, and the `const` mark prevents that.\n\nOne option is to make `print` non-`const`; this is bad because we would lose any notion of mutability just because of an implementation detail.\nThe other option is to remove `const` by force,\n```cpp\n    void print() const { const_cast\u003cmpi3::communicator\u0026\u003e(comm_).do_something()... 
}\n```\nwhich would work but it is not very idiomatic.\nBesides, this class would now become **hostile** to threads: two simultaneous `print` calls (which are marked as `const`) on the same instance could overlap, their messages could get mixed, and to diagnose the resulting weird behavior we would need to look inside the implementation of `print`.\nEnding up with a hostile class is basically a showstopper for threading and must be avoided.\n\nNote that making the communicator member a pointer `mpi3::communicator* comm_;` doesn't solve any problem, it just kicks the can down the road.\n\nThis leads to a more modern design which would use the keyword `mutable`.\n\n```cpp\nstruct distributed_data {\n    void set() { ... }\n    void print() const { ... }\n\n public:\n    mutable mpi3::communicator comm_;\n};\n```\n\nThis will allow the use of the communicator from the internals of `print() const` without the use of `const_cast`.\nThis doesn't save us from the problem of using the communicator concurrently, but at least the issue is visible in the declaration of the class.\nAs a matter of fact, this `mutable` attribute is exactly what marks the class as thread-unsafe.\n(A mutable member without a synchronization mechanism is a red flag in itself.)\nIf a single instance of the class is never used across threads *or* the program is single-threaded, there is nothing else that one needs to do.\n\nNote also that different instances of the class can be used from different threads, since they don't share anything, neither internal data nor their internal communicator.\n\nWhat if you want to make your class that contains a communicator thread-safe, or at least safe for concurrently calling non-mutating (`const`) members?\nFor that you need to implement your own synchronization or locking mechanism.\nThere is no single recipe for that; you can use a single mutex to lock access to the communicator alone or to both the communicator and the data.\n\n```cpp\nstruct distributed_data {\n    void set() { 
... }\n    void print() const { std::lock_guard\u003cstd::mutex\u003e guard{mtx_}; ... use comm_ ... }\n\n private:\n    mutable std::mutex mtx_;\n    mutable mpi3::communicator comm_;\n};\n```\n\nI don't recommend doing this specifically; the code above is just to illustrate the point.\nI cannot give a general recipe beyond this point, because there are many possible choices on how to make a class thread-safe (e.g. data-safe) or thread-safe to some specific level (operation-safe).\nIdeally, a concurrent data structure should be able to do some of the work without the synchronization bottleneck.\nThe whole point is that the library gives you this option, to trade off safety and efficiency to the desired degree but no more.\n\nIn fact, a (bad) blanket way to make the library thread-safe could be to wrap every communicator in a class with a mutex and make most communication operations `const`.\nThis would force, from a design perspective, an unacceptable operation cost.\n\n### Not a copy-constructor, but a duplicate-constructor\n\nSo far we have shown the `duplicate` interface function as a mechanism for duplicating communicators (used as `auto new_comm{comm.duplicate()}`), which is nice because it makes the operation very explicit, but it also makes it difficult to integrate generically with other parts of C++.\n\nA reasonable copy constructor of the class containing a communicator would be:\n\n```cpp\nstruct distributed_data {\n    distributed_data(distributed_data const\u0026 other) : comm_{other.comm_.duplicate()} {}\n\n private:\n    ...\n    mutable mpi3::communicator comm_;\n};\n```\nNote that this code is valid because `comm_` is a mutable member of `other`.\nThe worst part of forcing us to use the \"non-standard\" `duplicate` function is that we can no longer \"default\" the copy constructor.\n\nCopying in C++ is usually delegated to special member functions such as the copy-constructor or copy-assignment.\nHowever, these functions take their source argument as 
`const` reference and as such it cannot call the `duplicate` member.\n(And even if we could we would be lying to the compiler in the sense that we could make the system crash by copying concurrently a single (supposedly) `const` communicator that is shared in two threads.)\n\nHowever the language is general enough to allow a constructor by non-const reference.\nThe signature of this constructor is this one:\n\n```cpp\ncommunicator::communicator(communicator      \u0026 other) {...}      // \"duplicate\" constructor?\ncommunicator::communicator(communicator const\u0026      ) = delete;  // no copy constructor\n```\n\nThere is no standard name for this type of constructor, I choose to call it here \"duplicate\"-constructor, or mutable-copy-constructor.\nThis function does internally call `MPI_Comm_dup`, and like `duplicate()` it can only be called with a source that is mutable.\nThis makes the copy constructor of the containing class more standard, or even can be  implemented as `= default;`.\n\n```cpp\nstruct distributed_data {\n    distributed_data(distributed_data const\u0026 other) : comm_{other.comm_} {}  // or = default;\n    ...\n private:\n    mutable mpi3::communicator comm_;\n};\n```\n\n**In summary**, \n1) all important communication operations are non-`const` because according to the rules and practice of modern C++ the internal state of the communicator is affected by these operations, \n2) ... including the `duplicate` operation; \n3) `mutable` is a good marker to indicate the _possible_ need for custom (thread) synchronization mechanism; it also makes possible the use of communicator as member of a class.\n4) the need may be critical or not (the user of the library decides),\n5) mutable instances of communicators (i.e. non-`const` variables or mutable members) can be duplicated using standard C++ syntax, via \"duplicate\"-constructor or via `duplicate` member functions. 
\n6) In general, it is likely to be a good idea to duplicate the communicator for specific threads *before* creating them; otherwise duplication will happen \"too late\" with a shared (non-const) communicator.\n\n(Thanks Arthur O'Dwyer for the critical reading of this section.)\n\n## NCCL (GPU communication)\n\nIf the underlying MPI distribution is GPU-aware, in principle you can pass GPU pointers to the communication routines. \nThis is generally faster than copying back and forth to CPU.\n\nNvidia's NCCL conceptually implements a subset of MPI operations and it might be faster than GPU-aware MPI.\nTo obtain an NCCL communicator you pass an MPI communicator.\n\n```cpp\n\tmpi3::nccl::communicator gpu_comm{mpi_comm};\n```\n\nThe original MPI communicator is assumed to be working with non-overlapping devices (e.g. one process per GPU).\nThis can be achieved by calling `cudaSetDevice(world.rank() % num_devices);`, generally at the start of the program, or automatically by using certain ways to run the MPI program (e.g. `lrun` tries to attach each MPI process to a different GPU device).\n\nWith some limitations, the NCCL communicator can be used to perform operations on GPU memory without the need to obtain raw pointers. 
\nBy default it works with `thrust[::cuda]::device_ptr` or `thrust[::cuda]::universal_ptr`.\nFor example, this performs a reduction on GPU memory across processes (even processes in different nodes):\n\n```cpp\n//  thrust::device_vector\u003cint64_t, thrust::cuda::universal_allocator\u003cint64_t\u003e\u003e A(1000, gpu_comm.rank());\n\tthrust::device_vector\u003cint64_t, thrust::cuda::allocator\u003cint64_t\u003e\u003e A(1000, gpu_comm.rank());\n\n\tgpu_comm.all_reduce_n(A.data(), A.size(), A.data());\n```\n\nLike the B-MPI3 communicator, the NCCL communicator is destroyed automatically when leaving the scope.\n\nThe implementation is preliminary; the NCCL communicator is moveable but not copyable (or duplicable).\nCongruent NCCL communicators can be constructed from the same (or congruent) B-MPI3 communicator (at the cost of a regular MPI broadcast).\nThere is no mechanism to create NCCL subcommunicators from other NCCL communicators, except using MPI subcommunicators as constructor arguments.\n\n# Conclusion\n\nThe goal is to provide a type-safe, efficient, generic interface for MPI.\nWe achieve this by leveraging template code and classes that C++ provides.\nTypical low-level use patterns become extremely simple, and that exposes higher-level patterns.\n\n# Mini tutorial\n\nThis section describes the process of bringing a C++ program that uses the original MPI interface to one that uses B.MPI3.\nBelow is a valid C++ MPI program using the send and receive functions.\nDue to the legacy nature of MPI, C and C++ idioms are mixed.\n\n```cpp\n#include\u003cmpi.h\u003e\n\n#include\u003ccassert\u003e\n#include\u003ciostream\u003e\n#include\u003cnumeric\u003e\n#include\u003cvector\u003e\n\nint main(int argc, char **argv) {\n\tMPI_Init(\u0026argc, \u0026argv);\n\tMPI_Comm comm = MPI_COMM_WORLD;\n\n\tint count = 10;\n\n\tstd::vector\u003cdouble\u003e xsend(count); iota(begin(xsend), end(xsend), 0);\n\tstd::vector\u003cdouble\u003e xrecv(count, -1);\n\n\tint rank = -1;\n\tint nprocs = -1;\n\tMPI_Comm_rank(comm, 
\u0026rank);\n\tMPI_Comm_size(comm, \u0026nprocs);\n\tif(nprocs%2 == 1) {\n\t   if(rank == 0) {std::cerr\u003c\u003c\"Must be called with an even number of processes\"\u003c\u003cstd::endl;}\n\t   return 1;\n\t}\n\n\tint partner_rank = (rank/2)*2 + (rank+1)%2;\n\n\tMPI_Send(xsend.data(), count, MPI_DOUBLE, partner_rank  , 0          , comm);\n\tMPI_Recv(xrecv.data(), count, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, MPI_STATUS_IGNORE);\n\tassert(xrecv[5] == 5);\n\n\tif(rank == 0) {std::cerr\u003c\u003c\"successfully completed\"\u003c\u003cstd::endl;}\n\n\tMPI_Finalize();\n\treturn 0;\n}\n```\n\nWe are going to work \"inward\", with the idea of mimicking the process of modernizing a code from the top (the opposite is also feasible).\nThis process is typical if the low-level code needs to stay untouched for historical reasons.\n\nThe first step is to include the wrapper library and, as a warm-up, replace the `Init` and `Finalize` calls.\nAt the same time we obtain the (global) world communicator from the library.\n\n\n```cpp\n#include \"../../mpi3/environment.hpp\"\n\n#include\u003ciostream\u003e\n#include\u003cnumeric\u003e\n#include\u003cvector\u003e\n\nnamespace bmpi3 = boost::mpi3;\n\nint main(int argc, char **argv) {\n\tbmpi3::environment::initialize(argc, argv);\n\tMPI_Comm comm = \u0026bmpi3::environment::get_world_instance(); assert(comm == MPI_COMM_WORLD);\n...\n\tbmpi3::environment::finalize();\n\treturn 0;\n}\n```\n\nNotice that we are getting a reference to the global communicator using `get_world_instance`; then, with the ampersand (`\u0026`) operator, we obtain an `MPI_Comm` handle that can be used with the rest of the code untouched.\n\nSince `finalize` will need to be executed in any path, it is preferable to use an RAII object to represent the environment.\nJust like in classic MPI, it is wrong to create more than one environment.\n\nAlso, accessing the global communicator directly is in general considered problematic.\nFor this reason it 
makes more sense to ask for a duplicate of the global communicator.\n\n```cpp\nint main(int argc, char **argv) {\n\tbmpi3::environment env(argc, argv);\n\tbmpi3::communicator world = env.world();\n\tMPI_Comm comm = \u0026world; assert(comm != MPI_COMM_WORLD);\n...\n\treturn 0;\n}\n```\n\nThis ensures that `finalize` is always called (by the destructor) and that we are not using the original global communicator, but a duplicate.\n\nSince this pattern is very common, a convenient replacement \"main\" function is declared by the library in the `mpi3/main.hpp` header.\n\n```cpp\n#include \"../../mpi3/main.hpp\"\n\n#include\u003ciostream\u003e\n#include\u003cnumeric\u003e\n#include\u003cvector\u003e\n\nnamespace bmpi3 = boost::mpi3;\n\nint bmpi3::main(int, char **, bmpi3::communicator world) {\n\tMPI_Comm comm = \u0026world; assert(comm != MPI_COMM_WORLD);\n...\n\treturn 0;\n}\n```\n\nThe next step is to replace the use of the MPI communicator handle by a proper `mpi3::communicator` object.\nSince `world` is already a duplicate of the communicator we can directly use it.\nThe `size` and `rank` are methods of this object which naturally return their values.\n\n```cpp\n...\n\tint rank = world.rank();\n\tint nprocs = world.size();\n...\n```\n\nSimilarly the calls to send and receive data can be transformed.\nNotice that all the irrelevant or redundant arguments (including the receive source) can be omitted.\n\n```cpp\n...\n\tworld.send_n   (xsend.data(), count, partner_rank);\n\tworld.receive_n(xrecv.data(), count);\n...\n```\n\n(We use the `_n` suffix interface to emphasize that we are using an element count (container size) as the argument.)\n\nThe condition `(rank == 0)` is so common that it can be replaced by the `communicator`'s method `is_root()`:\n\n```cpp\n\tif(world.is_root()) {std::cerr\u003c\u003c\"Must be called with an even number of processes\"\u003c\u003cstd::endl;}\n```\n\nPutting these changes together, the program becomes:\n\n```cpp\n#include 
\"../../mpi3/main.hpp\"\n\n#include\u003ccassert\u003e\n#include\u003ciostream\u003e\n#include\u003cnumeric\u003e\n#include\u003cvector\u003e\n\nnamespace bmpi3 = boost::mpi3;\n\nint bmpi3::main(int /*argc*/, char ** /*argv*/, bmpi3::communicator world) {\n\tint count = 10;\n\n\tstd::vector\u003cdouble\u003e xsend(count); iota(begin(xsend), end(xsend), 0);\n\tstd::vector\u003cdouble\u003e xrecv(count, -1);\n\n\tif(world.size()%2 == 1) {\n\t   if(world.is_root()) {std::cerr\u003c\u003c\"Must be called with an even number of processes\"\u003c\u003cstd::endl;}\n\t   return 1;\n\t}\n\n\tint partner_rank = (world.rank()/2)*2 + (world.rank()+1)%2;\n\n\tworld.send_n   (xsend.data(), count, partner_rank);\n\tworld.receive_n(xrecv.data(), count);\n\tassert(xrecv[5] == 5);\n\n\tif(world.is_root()) {std::cerr\u003c\u003c\"successfully completed\"\u003c\u003cstd::endl;}\n\treturn 0;\n}\n```\n\nThis completes the replacement of the original MPI interface.\nFurther steps can be taken to exploit the safety provided by the library. 
\nFor example, instead of using pointers from the dynamic arrays, we can use the iterators to describe the start of the sequences.\n\n```cpp\n...\n\tworld.send_n   (xsend.begin(), xsend.size(), partner_rank);\n\tworld.receive_n(xrecv.begin(), xrecv.size());\n...\n```\nor use the iterator range.\n\n```cpp\n...\n\tworld.send   (xsend.begin(), xsend.end(), partner_rank);\n\tworld.receive(xrecv.begin(), xrecv.end());\n...\n```\n\n(Note that `_n` was dropped from the method name because we are using iterator ranges now.)\n\nFinally, the end of the receiving sequence can be omitted in many cases since the information is contained in the message and the correctness can be ensured by the logic of the program.\n\n```cpp\n...\n\tworld.send(xsend.begin(), xsend.end(), partner_rank);\n\tauto last = world.receive(xrecv.begin());  assert(last == xrecv.end()); \n...\n```\n\nAfter some rearrangement we obtain the final code, which is listed below.\nWe also replace the separate calls by a single `send_receive` call, which is optimized by the MPI system and more correct in this case; we also ensure \"constness\" of the sent values (`cbegin`/`cend`).\nThere are no pointers being used in this final version.\n\n```cpp\n#include \"../../mpi3/main.hpp\"\n\n#include\u003ccassert\u003e\n#include\u003ciostream\u003e\n#include\u003cnumeric\u003e\n#include\u003cvector\u003e\n\nnamespace bmpi3 = boost::mpi3;\n\nint bmpi3::main(int /*argc*/, char ** /*argv*/, bmpi3::communicator world) {\n\tif(world.size()%2 == 1) {\n\t   if(world.is_root()) {std::cerr\u003c\u003c\"Must be called with an even number of processes\"\u003c\u003cstd::endl;}\n\t   return 1;\n\t}\n\n\tstd::vector\u003cdouble\u003e xsend(10); iota(begin(xsend), end(xsend), 0);\n\tstd::vector\u003cdouble\u003e xrecv(xsend.size(), -1);\n\n\tworld.send_receive(cbegin(xsend), cend(xsend), (world.rank()/2)*2 + (world.rank()+1)%2, begin(xrecv));\n\n\tassert(xrecv[5] == 5);\n\tif(world.is_root()) {std::cerr\u003c\u003c\"successfully completed\"\u003c\u003cstd::endl;}\n\treturn 
0;\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllnl%2Fb-mpi3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fllnl%2Fb-mpi3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllnl%2Fb-mpi3/lists"}