{"id":23973214,"url":"https://github.com/H-DNA/MPiSC","last_synced_at":"2025-09-13T05:30:37.642Z","repository":{"id":270289916,"uuid":"909884774","full_name":"H-DNA/MPiSC","owner":"H-DNA","description":"Investigation and porting of shared-memory MPSCs to distributed context using MPI-3","archived":false,"fork":false,"pushed_at":"2025-08-29T10:31:47.000Z","size":74320,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-08T04:51:50.141Z","etag":null,"topics":["algorithms-and-data-structures","atomic","cpp11","distributed-computing","mpi-3","mpi-rma","mpi-shm","mpsc"],"latest_commit_sha":null,"homepage":"https://h-dna.github.io/MPiSC/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/H-DNA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-30T01:18:51.000Z","updated_at":"2025-08-29T10:30:20.000Z","dependencies_parsed_at":"2025-02-06T16:41:51.463Z","dependency_job_id":"49e49b25-4d2a-40f6-9b60-e90f679eafb7","html_url":"https://github.com/H-DNA/MPiSC","commit_stats":null,"previous_names":["huy-dna/cluster-mpsc","huy-dna/distributed-mem-mpsc-mpi3-cpp11","huy-dna/distributed-hybrid-mpi-mpsc","huy-dna/distributed-mpsc-with-hybrid-mpi","h-dna/mpisc","huy-dna/mpisc"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/H-DNA/MPiSC","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/H-DNA%2FMPiSC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/H-DNA%2F
MPiSC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/H-DNA%2FMPiSC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/H-DNA%2FMPiSC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/H-DNA","download_url":"https://codeload.github.com/H-DNA/MPiSC/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/H-DNA%2FMPiSC/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274920118,"owners_count":25373956,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-13T02:00:10.085Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms-and-data-structures","atomic","cpp11","distributed-computing","mpi-3","mpi-rma","mpi-shm","mpsc"],"created_at":"2025-01-07T04:17:34.857Z","updated_at":"2025-09-13T05:30:37.619Z","avatar_url":"https://github.com/H-DNA.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Porting shared memory MPSC queues to distributed context using MPI-3 RMA\n\n## Objective\n\n- Examination of the *shared-memory* literature to find potential *lock-free*, *concurrent*, *multiple-producer single-consumer* queue algorithms.\n- Use the new MPI-3 RMA capabilities to port potential lock-free *shared-memory* queue algorithms to distributed context.\n- 
Potentially optimize MPI RMA ports using MPI-3 SHM + C++11 memory model. \n\n- Minimum required characteristics:\n\n| Dimension           | Desired property        |\n| ------------------- | ----------------------- |\n| Queue length        | Fixed length            |\n| Number of producers | Many                    |\n| Number of consumers | One                     |\n| Operations          | `enqueue`, `dequeue`    |\n| Concurrency         | Concurrent \u0026 Lock-free  |\n\n## Motivation\n\n- Queues are a backbone data structure in many applications: scheduling, event handling, message buffering. In these applications, the queue may be highly contended. For example, in event handling, there can be multiple sources of events \u0026 many consumers of events at the same time. If the queue has not been designed properly, it can become a bottleneck in a highly concurrent environment, adversely affecting the application's scalability. This concern also applies to queues in distributed contexts.\n- Within the context of shared memory, there has been plenty of research and testing going into efficient, scalable \u0026 lock-free queue algorithms. This presents an opportunity to port these high-quality algorithms to the distributed context, albeit with some inherent differences between the two contexts that need to be taken into consideration.\n- In the distributed literature, most of the proposed algorithms completely disregard the existing shared-memory algorithms, mostly due to the discrepancy between the programming model of shared memory and that of distributed computing. However, with MPI-3 RMA, the gap is bridged, and we can straightforwardly model shared-memory applications using MPI. This is why we investigate the porting approach \u0026 compare the resulting ports with existing distributed queue algorithms.\n\n## Approach\n\nThe porting approach we choose is to use MPI-3 RMA to port lock-free queue algorithms. 
We further optimize these ports using MPI SHM (the so-called MPI+MPI hybrid approach) and C++11 for shared memory synchronization.\n\n\u003cdetails\u003e\n  \u003csummary\u003eWhy MPI RMA?\u003c/summary\u003e\n  \n  MPSC queues belong to the class of \u003ci\u003eirregular\u003c/i\u003e applications, which means that:\n  \u003cul\u003e\n    \u003cli\u003eThe memory access pattern is not known.\u003c/li\u003e\n    \u003cli\u003eData locations cannot be known in advance; they can change during execution.\u003c/li\u003e\n  \u003c/ul\u003e\n  \n  In other words, we cannot statically analyze where the data may be stored - data can be stored anywhere and we can only determine its location at runtime. This means the traditional message passing interface using \u003ccode\u003eMPI_Send\u003c/code\u003e and \u003ccode\u003eMPI_Recv\u003c/code\u003e is ill-suited: Suppose at runtime, process \u003ccode\u003eA\u003c/code\u003e wants to access a piece of data that it knows resides at \u003ccode\u003eB\u003c/code\u003e. Then \u003ccode\u003eA\u003c/code\u003e must issue \u003ccode\u003eMPI_Recv(B)\u003c/code\u003e, but this requires \u003ccode\u003eB\u003c/code\u003e to anticipate that it should issue \u003ccode\u003eMPI_Send(A, data)\u003c/code\u003e and to know which data \u003ccode\u003eA\u003c/code\u003e actually wants. The latter issue can be worked around by having \u003ccode\u003eA\u003c/code\u003e issue \u003ccode\u003eMPI_Send(B, data_descriptor)\u003c/code\u003e first. \u003ccode\u003eB\u003c/code\u003e must then already be waiting in \u003ccode\u003eMPI_Recv(A)\u003c/code\u003e. However, because the memory access pattern is not known, \u003ccode\u003eB\u003c/code\u003e must anticipate that any other process may want to access its data. 
This is possible but cumbersome.\n   \n   MPI RMA is specifically designed to conveniently express irregular applications by letting one side specify everything about the access.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eWhy MPI-3 RMA?\u003c/summary\u003e\n\n  MPI-3 improves the RMA API, providing the non-collective \u003ccode\u003eMPI_Win_lock_all\u003c/code\u003e for a process to open an access epoch on all processes in a window's group. This allows for lock-free synchronization.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eHybrid MPI+MPI\u003c/summary\u003e\n  The pure MPI approach is oblivious to the fact that some MPI processes are on the same node, which causes some unnecessary overhead. MPI-3 introduces the MPI SHM API, allowing us to obtain a communicator containing processes on a single node. From this communicator, we can allocate a shared memory window using \u003ccode\u003eMPI_Win_allocate_shared\u003c/code\u003e. Hybrid MPI+MPI means that MPI is used for both intra-node and inter-node communication. This shared memory window follows the \u003cem\u003eunified memory model\u003c/em\u003e and can be synchronized using either MPI facilities or other alternatives. Hybrid MPI+MPI can take advantage of the many cores of current computer processors.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eHybrid MPI+MPI+C++11\u003c/summary\u003e\n  Within the shared memory window, C++11 synchronization facilities can be used, and they prove to be much more efficient than those of MPI. 
So incorporating C++11 can be thought of as an optimization step for intra-node communication.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eHow to perform an MPI port in a lock-free manner?\u003c/summary\u003e\n  \n  With MPI-3 RMA capabilities:\n  \u003cul\u003e\n    \u003cli\u003eUse \u003ccode\u003eMPI_Win_lock_all\u003c/code\u003e and \u003ccode\u003eMPI_Win_unlock_all\u003c/code\u003e to open and end access epochs.\u003c/li\u003e\n    \u003cli\u003eWithin an access epoch, use MPI atomic operations such as \u003ccode\u003eMPI_Fetch_and_op\u003c/code\u003e and \u003ccode\u003eMPI_Compare_and_swap\u003c/code\u003e.\u003c/li\u003e\n  \u003c/ul\u003e\n\u003c/details\u003e\n\n## Literature review\n\n### Known problems\n- ABA problem.\n\n  Possible solutions: Monotonic counter, hazard pointer.\n\n- Safe memory reclamation problem.\n\n  Possible solutions: Hazard pointer.\n\n- Special case: empty queue - Concurrent `enqueue` and `dequeue` can conflict with each other.\n\n  Possible solutions: Dummy node to decouple head and tail.\n\n- A slow process performing `enqueue` or `dequeue` could leave the queue in an intermediate state.\n\n  Possible solutions:\n  - Help mechanism: To be lock-free, the other processes can help patch up the queue instead of waiting.\n\n- A dead process performing `enqueue` or `dequeue` could leave the queue broken.\n  \n  Possible solutions:\n  - Help mechanism: The other processes can help patch up the queue.\n\n- Motivation for the help mechanism?\n\n  Why: If `enqueue` or `dequeue` needs to perform several updates on the queue to move it to a consistent state, then a suspended process may leave the queue in an intermediate state. Other `enqueue` and `dequeue` calls should not wait until they see a consistent state, or else the algorithm is blocking. Rather, they should help the suspended process complete the operation.\n\n  Solutions often involve (1) detecting the intermediate state and (2) trying to patch it.\n\n  Possible solutions:\n  - Typically, updates are performed using CAS. 
If CAS fails, some state change has occurred; we can detect whether the queue is in an intermediate state \u0026 try to perform another CAS to patch it up.\n    Note that the patching CAS may itself fail because another process has just patched up the queue, so looping until a successful CAS may not be necessary.\n\n### Trends\n\n- Speed up happy paths.\n  - The happy path can use a lock-free algorithm and fall back to a wait-free algorithm. As lock-free algorithms are typically more efficient, this can lead to speedups.\n  - Replace CAS with simpler operations like FAA or load/store in the fast path.\n- Avoid contention: Enqueuers or dequeuers operating on a shared data structure can harm each other's progress.\n  - Local buffers can be used at the enqueuers' side in an MPSC queue so that enqueuers do not contend with each other.\n  - Elimination + backoff techniques in MPMC queues.\n- Cache-aware solutions.\n\n## Evaluation strategy\n\nWe need to evaluate at least 3 levels:\n- Theory verification: Prove that the algorithm possesses the desired properties.\n- Implementation verification: Even if the theory is correct, implementation details and nuances can affect the desired properties.\n  - Static verification: *Verify* the source code + its dependencies.\n  - Dynamic verification: *Verify* its behavior at runtime \u0026 *Benchmark*.\n\n### Correctness\n- Linearizability\n- No harmful instances of the ABA problem\n- Memory safety:\n  - Safe memory reclamation\n\n### Performance\n- Performance: The less time it takes to serve common workloads on the target platform, the better.\n\n### Lock-freedom\n- Lock-freedom: A process suspended while using the queue should not prevent other processes from making progress using the queue.\n\n\u003cdetails\u003e\n  \u003csummary\u003eCaution - Lock-freedom of dependencies\u003c/summary\u003e\n  A lock-free algorithm often \u003cem\u003eassumes\u003c/em\u003e that some synchronization primitive is lock-free. 
This depends on the target platform and, during implementation, on the library used. Care must be taken to avoid accidentally using operations that are not lock-free.\n\u003c/details\u003e\n\n### Scalability\n- Scalability: The performance gain for `enqueue` and `dequeue` should scale with the number of threads on the target platform.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FH-DNA%2FMPiSC","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FH-DNA%2FMPiSC","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FH-DNA%2FMPiSC/lists"}