# madthreading

## This repository is no longer under development
## Functionality has been migrated to [Parallel Tasking Library (PTL)](https://github.com/jrmadsen/PTL)

A low-overhead, task-based threading API using a pool of C++11 threads (i.e., a thread pool).

Madthreading is a general multithreading API similar to Intel's Threading Building Blocks (TBB), but with a more flexible tasking system. For example, in TBB, task groups cannot return values and the function objects do not accept arguments; both features are available here. Additionally, the inner workings are more transparent and simpler, and the overhead has been reduced. However, advanced features such as flow graphs are not available. Is it better than TBB?
I am not sure, but I certainly like it better than OpenMP.

Side note about OpenMP:
  - There is a common misconception that all you need to do to multithread with OpenMP is add a `#pragma omp parallel`...
  - This is very misleading: although it will generally work, the performance is usually terrible because of problems such as false sharing.
  - To fix the false sharing and improve performance, roughly the same amount of coding work usually needs to be done as when dealing with raw threads or TBB.
    - At that point, you are doing the same amount of work but with the following added issues:
      - (1) OpenMP is only supported by certain compilers
      - (2) OpenMP injects "ghost" code that makes performance tuning and debugging difficult, because you never see that code
      - (3) Weird things happen, such as a lambda (which is inlined) not having the same performance as the equivalent non-lambda code, even though it absolutely should
        - I actually showed this to OpenMP developers; because of #2, they did not know why, but they confirmed there should not be a performance hit
      - (4) Bugs like #3 are compiler-specific, so performance is not portable
      - (5) Pragmas make your code look ugly
  - Example #6 (examples/ex6) demonstrates many of the different implementations for multithreading

Features:
  - Low overhead
  - Work-stealing (an inactive thread grabs work off the stack of tasks)
  - Thread pool (no overhead of thread creation)
  - Interface takes any function construct
  - Support for return types when joining (e.g., a summation across all threads)
  - Background tasks via pointer signaling

The primary benefit of using Madthreading is the creation of a thread pool. Threads are put to sleep when not doing work and do not consume compute cycles when idle.

The thread pool is created during the instantiation of the thread-manager.
Once the thread-manager has been created, you simply pass functions, with or without arguments, to the thread-manager, which creates tasks; these tasks are processed until the task stack is empty.

Passing tasks to the thread-manager is done through three primary interfaces:

```c++
 mad::thread_manager::exec(mad::task_group*, ...)       // pass one task to the stack
 mad::thread_manager::run(mad::task_group*, ...)        // similar to exec but runs N times,
                                                        // where N = mad::thread_manager::size()
 mad::thread_manager::run_loop(mad::task_group*, ...)   // generic loop construct
```

Tasks are not created explicitly. However, you are required to pass a pointer to a task-group; the task_group is the handle for joining/synchronization. Instead of explicitly creating tasks, you pass function pointers and arguments, where the arguments are for the function (exec, run) or for the loop that creates the tasks (run_loop). Examples are provided in the examples/ directory of the source code.

Currently, functions with any number of arguments are supported.

The number of threads is controlled via the environment variables FORCE_NUM_THREADS and MAD_NUM_THREADS, with the latter taking precedence.

Required dependencies:
  - GNU, Clang, or Intel compiler supporting C++11
  - CMake

Optional dependencies:
  - UnitTest++ (for unit testing)
    - `sudo apt-get install libunittest++-dev` (Ubuntu)
  - TBB (used for allocators and in examples)
  - OpenMP (used for SIMD and in examples)
  - SWIG version 3.0+ (some support for Python wrapping)

Madthreading provides a generic interface for using atomics and, when C++11 is not available, provides a mutex-based interface that behaves like an atomic.

There are two forms of tasks: a standard type and a tree-type design.
The tree-type design is intended to be closer to the TBB design.

 ##################################################

Examples:
  - ex1  : simple usage examples
  - ex2  : simple MT pi calculation using run_loop
  - ex3a : simple MT pi calculation using task_tree
  - ex3b : simple MT pi calculation using task_tree + grainsize
  - ex4  : a vectorization example using intrinsics
  - ex5  : demonstration of task_group usage cases
  - ex6  : pi calculations using different threading methods (for comparison)
    - serial
    - TBB (CXX98 body)
    - TBB (CXX11 lambda)
    - pure pthreads
    - C++11 threads
    - OpenMP (loop + reduction)
    - OpenMP (task)
    - OpenMP (parallel block - type 1)
    - OpenMP (parallel block - type 2)
    - mad thread-pool (run_loop)
    - mad thread-pool (task_tree)
    - mad thread-pool (task_tree w/ grainsize)

 ##################################################

Examples of OpenMP issues:

- On GCC 4.8.2 (possibly fixed since; I showed this to a developer during a tutorial)
  - Here is the code he gave me (included in examples/ex6/omp_pi_loop.cc). This code utilized ~400% of the CPU with 4 threads (on a 4-core machine), i.e., perfect speed-up:

```c++
#pragma omp parallel
{
    #pragma omp for reduction(+:sum)
    for(ulong_type i = 0; i < num_steps; ++i)
    {
        double_type x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
}
```

  - I made one change, because I like to try things, and it was completely valid C++11: I used a lambda (which would be inlined):

```c++
#pragma omp parallel
{
    auto calc_x = [step] (ulong_type i) { return (i-0.5)*step; };
    #pragma omp for reduction(+:sum)
    for(ulong_type i = 0; i < num_steps; ++i)
    {
        double_type x = calc_x(i);
        sum = sum + 4.0/(1.0+x*x);
    }
}
```

  - My CPU utilization dropped from ~400% to ~250%. I was shocked.
  - My first thought was not that it was OpenMP's fault, but mine.
  - I tried different captures and different placements of the lambda (inside and outside the parallel section, inside the loop, etc.), and none of that fixed the performance.
  - Then I thought that, since GCC had only implemented C++11 as of 4.7, the problem might be on the GCC side.
  - I ran the same comparison with TBB and found no difference.
  - So I brought it to the OpenMP developer and showed him; he verified everything and tried fixing the performance himself, to no avail.
  - He wrote down the compiler version and said he'd look into it.

This experience started me down the road to the opinion I have of OpenMP today:

- **OpenMP is convenient but far too opaque about what is being done "under the hood" to allow straightforward diagnosis of performance issues.**
- **False sharing is very easy to introduce, requires a lot of experience to diagnose quickly, and is a byproduct of the OpenMP pragma style. It is much less common with other models because of how you are forced to structure the code (functionally).**
- **Thus, you can end up spending far too much of your own time either (a) trying to fix something that shouldn't have to be fixed, (b) searching for a performance bottleneck that wouldn't exist in other threading models, or (c) both.**

- On GCC 5.4.1, with a very large amount of data.
The following code took 118 seconds at 8% CPU utilization with two threads (max = 200%):

```c++
// > [cxx] ctoast_cov_accumulate_zmap
// : 118.837 wall,   4.820 user +   4.650 system =   9.470 CPU [seconds] (  8.0%)
// (total # of laps: 32)
#pragma omp parallel default(shared)
{
    int64_t i, j, k;
    int64_t hpx;
    int64_t zpx;

    int threads = 1;
    int trank = 0;

    #ifdef _OPENMP
    threads = omp_get_num_threads();
    trank = omp_get_thread_num();
    #endif

    int tpix;

    for (i = 0; i < nsamp; ++i)
    {
        if ((indx_submap[i] >= 0) && (indx_pix[i] >= 0))
        {
            hpx = (indx_submap[i] * subsize) + indx_pix[i];
            tpix = hpx % threads;
            if (tpix == trank)
            {
                zpx = (indx_submap[i] * subsize * nnz) + (indx_pix[i] * nnz);

                for (j = 0; j < nnz; ++j)
                    zdata[zpx + j] += scale * signal[i] * weights[i * nnz + j];
            }
        }
    }
}
```

The fix is below. It ran in 1.474 seconds at 200% CPU utilization.
It was amazingly simple...

```c++
// > [cxx] accumulate_zmap_direct
// :   1.474 wall,   2.930 user +   0.020 system =   2.950 CPU [seconds] (200.1%)
// (total # of laps: 32)
#pragma omp parallel default(shared)
{
    int threads = 1;
    int trank = 0;

#ifdef _OPENMP
    threads = omp_get_num_threads();
    trank = omp_get_thread_num();
#endif

    for (int64_t i = 0; i < nsamp; ++i)
    {
        if ((indx_submap[i] >= 0) && (indx_pix[i] >= 0))
        {
            int64_t hpx = (indx_submap[i] * subsize) + indx_pix[i];
            int64_t tpix = hpx % threads;
            if (tpix == trank)
            {
                int64_t zpx = (indx_submap[i] * subsize * nnz)
                              + (indx_pix[i] * nnz);

                for (int64_t j = 0; j < nnz; ++j)
                    zdata[zpx + j] += scale * signal[i] * weights[i * nnz + j];
            }
        }
    }
}
```

Notice the difference? `i`, `j`, `k`, `hpx`, `tpix`, and `zpx` are declared locally inside the loop instead of at the start of the parallel block. How did this make such a big difference? I HAVE NO IDEA, BECAUSE I COULDN'T SEE WHAT OPENMP WAS DOING!
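For comparison, here is roughly what the same `hpx % threads` partitioning looks like with raw C++11 threads. This is a self-contained sketch (the function name `accumulate_zmap` and the use of `std::vector` arguments are choices made for this sketch, not the original TOAST code); the point is that with explicit threads, every per-iteration variable is necessarily local to the thread's lambda, so the shared-versus-local ambiguity that bit the OpenMP version cannot arise:

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Round-robin pixel partitioning with raw C++11 threads: each thread
// scans all samples but only accumulates the pixels it owns
// (hpx % nthreads == trank), so no two threads write the same zdata range.
void accumulate_zmap(int nthreads,
                     const std::vector<int64_t>& indx_submap,
                     const std::vector<int64_t>& indx_pix,
                     const std::vector<double>&  signal,
                     const std::vector<double>&  weights,
                     int64_t subsize, int64_t nnz, double scale,
                     std::vector<double>& zdata)
{
    const int64_t nsamp = static_cast<int64_t>(signal.size());
    std::vector<std::thread> workers;
    for (int trank = 0; trank < nthreads; ++trank)
    {
        workers.emplace_back([&, trank] {
            for (int64_t i = 0; i < nsamp; ++i)
            {
                if (indx_submap[i] >= 0 && indx_pix[i] >= 0)
                {
                    // all loop state is local to this thread by construction
                    int64_t hpx = indx_submap[i] * subsize + indx_pix[i];
                    if (hpx % nthreads == trank)
                    {
                        int64_t zpx = (indx_submap[i] * subsize * nnz)
                                      + (indx_pix[i] * nnz);
                        for (int64_t j = 0; j < nnz; ++j)
                            zdata[zpx + j] += scale * signal[i] * weights[i * nnz + j];
                    }
                }
            }
        });
    }
    for (auto& t : workers)
        t.join();
}
```

The partitioning makes this race-free without any locks: a given pixel index `hpx` always maps to the same thread, so each `zdata` slot is written by exactly one thread.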