{"id":45292064,"url":"https://github.com/tactcomplabs/circustent","last_synced_at":"2026-02-21T03:24:02.682Z","repository":{"id":41871033,"uuid":"218161285","full_name":"tactcomplabs/circustent","owner":"tactcomplabs","description":"Memory system characterization benchmarks using atomic operations","archived":false,"fork":false,"pushed_at":"2026-01-21T13:55:21.000Z","size":2186,"stargazers_count":16,"open_issues_count":1,"forks_count":11,"subscribers_count":8,"default_branch":"main","last_synced_at":"2026-01-22T01:40:42.907Z","etag":null,"topics":["atomic","benchmark","hpc","mpi","openmp","openshmem"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tactcomplabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2019-10-28T23:03:59.000Z","updated_at":"2026-01-21T13:49:43.000Z","dependencies_parsed_at":"2024-07-08T05:32:06.435Z","dependency_job_id":"88b959f1-373c-4897-bcda-9bac8d8084e2","html_url":"https://github.com/tactcomplabs/circustent","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tactcomplabs/circustent","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tactcomplabs%2Fcircustent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tactcomplabs%2Fcircustent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tactcomplabs%2Fci
rcustent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tactcomplabs%2Fcircustent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tactcomplabs","download_url":"https://codeload.github.com/tactcomplabs/circustent/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tactcomplabs%2Fcircustent/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29672703,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-21T03:11:15.450Z","status":"ssl_error","status_checked_at":"2026-02-21T03:10:34.920Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atomic","benchmark","hpc","mpi","openmp","openshmem"],"created_at":"2026-02-21T03:24:01.915Z","updated_at":"2026-02-21T03:24:02.672Z","avatar_url":"https://github.com/tactcomplabs.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CircusTent: Atomic Memory Operation System Benchmarks\n\n[![GitHub license](https://img.shields.io/badge/license-APACHE2-blue.svg)](https://raw.githubusercontent.com/tactcomplabs/circustent/main/LICENSE)\n\n![CircusTent](docs/imgs/circus_tent.png)\n\n## Overview\n\nThe CircusTent infrastructure is designed to provide users and architects\nthe ability to discover the 
relevant performance of a target system\narchitecture's memory subsystem using atomic memory operations.  Atomic\nmemory operations have traditionally been considered to be high latency or\nlow performance given the difficulty of implementing them.  \nHowever, atomic operations are widely utilized across parallel programming\nconstructs for synchronization primitives and to promote concurrency.  Prior\nto the creation of CircusTent, the architecture and programming\nmodel communities had little ability to quantify the performance of\natomics on varying scales of a system architecture.\n\nThe CircusTent infrastructure is designed to be a modular benchmark\nplatform consisting of a frontend and backend infrastructure.  \nThe frontend infrastructure defines the various benchmark types and\nstandard benchmark algorithms and provides the command line\nexecution interface.  The backend provides one or more implementations\nof the standard algorithms using various programming models.  \n\n## Building From Source\n\n### Prerequisites\n\nThe following packages/utilities are required to build CircusTent from source:\n* CMake 3.4.3+\n* C++ Compiler\n* C Compiler\n\nOptional packages include:\n* RPM tools to build RPMs\n* Debian package tools to build DEBs\n* Backend-specific libraries\n\n### Building\n\nThe following steps are generic build instructions.  You may need to\nmodify these steps for your target system and compiler.\n\n1. Clone the CircusTent repository\n```\ngit clone https://github.com/tactcomplabs/circustent.git\n```\n2. Set up your build tree\n```\ncd circustent\nmkdir build\ncd build\n```\n3. 
Execute CMake to generate the makefiles (where XXX refers to the backend that you want to enable)\n```\ncmake -DENABLE_XXX=ON -DCT_CFLAGS=\"...\" -DCT_CXXFLAGS=\"...\" -DCT_LINKER_FLAGS=\"...\" ../\n```\nNote that it will most often be necessary to pass the compiler specific flags needed for your chosen backend\nimplementation to the CMake infrastructure via the CT_CFLAGS, CT_CXXFLAGS, and CT_LINKER_FLAGS options as shown above.\n\n4. Execute the build\n```\nmake\n```\nThe `circustent` binary will reside in ./src/CircusTent/\n\n5. (Optional) Install the build\n```\nmake install\n```\n\n### Build Options\nThe following are additional build options supported by the CircusTent CMake script\n* CC : Utilize the target C compiler\n* CXX : Utilize the target C++ compiler\n* -DCMAKE\\_C\\_FLAGS : Set the standard C compiler flags\n* -DCMAKE\\_CXX\\_FLAGS : Set the standard C++ compiler flags\n* -DCMAKE\\_INSTALL\\_PREFIX : installation target (make install)\n* -DCIRCUSTENT\\_BUILD\\_RPM : Builds an RPM package\n* -DCIRCUSTENT\\_BUILD\\_DEB : Builds a DEB package\n* -DCIRCUSTENT\\_BUILD\\_TGZ : Builds a TGZ package\n* -DBUILD\\_ALL\\_TESTING : Builds the test infrastructure (developers only)\n\n## Algorithm Descriptions\n\nThe following contains brief descriptions of each candidate algorithm.  For each algorithm,\nwe apply one or more of the following atomics:\n* Fetch and Add (ADD)\n* Compare and Exchange (CAS)\n\nThe algorithmic descriptions below do not specify the size of the data values\nimplemented.  The CircusTent software does not derive bandwidth.  However,\nwe highly suggest that implementors utilize 64-bit values for the source\nand index portions of the benchmark.  
\n\nThe following table presents all the core benchmarks and the number of\natomic operations performed for each (which is vital to calculating\naccurate GAMs values across platforms).\n\n| Benchmark | Number of AMOs |\n| ------ | ------ |\n| RAND | 1 |\n| STRIDE1 | 1 |\n| STRIDEN | 1 |\n| PTRCHASE | 1 |\n| CENTRAL | 1 |\n| SG | 4 |\n| SCATTER | 3 |\n| GATHER | 3 |\n\n### RAND\nPerforms a stride-1 atomic update using an index array with randomly generated\nindices and a source value array.  The index array (IDX) must contain valid indices\nwithin the bounds of the source value array (ARRAY). In most cases, utilizing standard-C\nlinear congruential methods is sufficient.\n```\nfor( i=0; i\u003citers; i++ ){\n    AMO(ARRAY[IDX[i]])\n}\n```\n\n### STRIDE1\nPerforms a stride-1 atomic update using only a source array (ARRAY).  \n```\nfor( i=0; i\u003citers; i++ ){\n    AMO(ARRAY[i])\n}\n```\n\n### STRIDEN\nPerforms a stride-N atomic update using only a source array (ARRAY).  \nThe user must specify the respective stride of the operation\n```\nfor( i=0; i\u003citers; i+=stride ){\n    AMO(ARRAY[i])\n}\n```\n\n### PTRCHASE\nPerforms a pointer chase operation across an index array.  This implies\nthat the i'th+1 value is selected from the i'th operation.  This algorithm\nonly utilizes the index array (IDX).  All index values must be valid within the\nscope of the index array.  \n```\nfor( i=0; i\u003citers; i++ ){\n    start = AMO(IDX[start])\n}\n```\n\n### CENTRAL\nPerforms an atomic operation to a singular value from all PEs.  This is a deliberate\nhot-spot action that is designed to immediately stress system and network\ninterconnects.\n```\nfor( i=0; i\u003citers; i++ ){\n    AMO(ARRAY[0])\n}\n```\n\n### SG\nPerforms a scatter and a gather operation.  The source values for the scatter,\ngather and the final values are all fetched atomically.  
As with the other\nalgorithms, the source array and index array must be valid.\n```\nfor( i=0; i\u003citers; i++ ){\n    src = AMO(IDX[i])\n    dest = AMO(IDX[i+1])\n    val = AMO(ARRAY[src])\n    AMO(ARRAY[dest], val) // ARRAY[dest] = val\n}\n```\n\n### SCATTER\nPerforms the scatter portion of an SG operation.  As with the other\nalgorithms, the source array and index array must be valid.\n```\nfor( i=0; i\u003citers; i++ ){\n    dest = AMO(IDX[i+1])\n    val = AMO(ARRAY[i])\n    AMO(ARRAY[dest], val) // ARRAY[dest] = val\n}\n```\n\n### GATHER\nPerforms the gather portion of an SG operation.  As with the other\nalgorithms, the source array and index array must be valid.\n```\nfor( i=0; i\u003citers; i++ ){\n    dest = AMO(IDX[i+1])\n    val = AMO(ARRAY[dest])\n    AMO(ARRAY[i], val) // ARRAY[i] = val\n}\n```\n\n\n\n## Backends\n### OMP\n* CMake Build Flag: -DENABLE_OMP=ON\n* Implementation Language: C++ \u0026 C using GNU intrinsics\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Utilizes \\_\\_ATOMIC\\_RELAXED where appropriate\n* Intrinsic documentation: [GNU Atomics](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html)\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### OMP with Target Offloading\n* CMake Build Flags: -DENABLE_OMP_TARGET=ON\n* Implementation Language: C++ \u0026 C\n* Users may define a particular $OMP_DEFAULT_DEVICE, otherwise the default is utilized\n* Maps the provided PEs argument to OpenMP teams wherein the number of iterations specified are executed by each team. 
Iterations for a given team are workshared using thread and vector level parallelism based on the behavior of the user's OpenMP implementation and compiler.\n* In order to preserve the intended memory access pattern, the PTRCHASE kernels utilize only teams-level parallelism.\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | no |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | no |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | no |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | no |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | no |\n| SG_ADD | yes |\n| SG_CAS | no |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | no |\n| GATHER_ADD | yes |\n| GATHER_CAS | no |\n\n### OpenSHMEM\n* CMake Build Flag: -DENABLE_OPENSHMEM=ON\n* Users must specify the OpenSHMEM compiler wrapper alongside the CMake command as follows:\n```\nCC=oshcc CXX=oshcxx cmake -DENABLE_OPENSHMEM=ON ../\n```\n* Implementation Language: C++ and C using SHMEM functions and symmetric heap\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Target PEs for all benchmarks except PTRCHASE are initialized in a stride-1 ring pattern.  This implies\nthat for every N'th PE, the target PE is N+1.  All benchmarks except PTRCHASE target a single destination PE for each iteration\n* The PTRCHASE benchmark utilizes randomly generated target PEs for each iteration\n* For benchmark values that don't require atomic access to indices, we utilize SHMEM_GET operations to\nfetch the index for a given iteration (e.g., RAND_ADD, RAND_CAS)\n* Tested with OSSS-UCX: [OpenSHMEM Reference Implementation](https://github.com/openshmem-org/osss-ucx)\n\n| Benchmark | Supported? 
|\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### MPI\n* CMake Build Flag: -DENABLE_MPI=ON\n* Users must specify the MPI compiler wrapper alongside the CMake command as follows:\n```\nCC=mpicc CXX=mpicxx cmake -DENABLE_MPI=ON ../\n```\n* Implementation  Language: C++ and C using MPI-3 functions and one-sided operations\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Target PE's for all benchmarks except PTRCHASE are initialized in a stride-1 ring pattern.  This implies\nthat for every N'th PE, the target PE is N+1.  All benchmarks except PTRCHASE target a single destination PE for each iteration\n* The PTRCHASE benchmark utilizes randomly generated target PE's for each iteration\n* For benchmark values that don't require atomic access to indices, we utilize MPI_Get operations to\nfetch the index for a given iteration (ex, RAND_ADD, RAND_CAS)\n* Tested with OpenMPI\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### xBGAS\n* CMake Build Flag: -DENABLE_XBGAS=ON\n* Implementation  Language: C++ and C using xBGAS functions\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Target PE's for all benchmarks except PTRCHASE are initialized in a stride-1 ring pattern.  This implies\nthat for every N'th PE, the target PE is N+1.  
All benchmarks except PTRCHASE target a single destination PE for each iteration\n* The PTRCHASE benchmark utilizes randomly generated target PEs for each iteration\n* For benchmark values that don't require atomic access to indices, we utilize XBGAS_GET operations to\nfetch the index for a given iteration (e.g., RAND_ADD, RAND_CAS)\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### OpenACC\n* CMake Build Flags: -DENABLE_OPENACC=ON\n* Implementation Language: C++ \u0026 C\n* Users may define $ACC_DEVICE_TYPE and/or $ACC_DEVICE_ID to set\nthe target device type and ID, respectively. However, since these values\nmay be overridden or ignored by your OpenACC implementation, we recommend\nthat users verify that their desired device matches the one selected by checking\nthe CircusTent output messages printed during device initialization.\n* Maps the provided PEs argument to OpenACC gangs wherein the number of iterations specified are executed by each gang. Iterations for a given gang are workshared using worker and vector level parallelism based on the behavior of the user's OpenACC implementation and compiler.\n* In order to preserve the intended memory access pattern, the PTRCHASE kernels utilize only gang-level parallelism.\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n\n| Benchmark | Supported? 
|\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | no |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | no |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | no |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | no |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | no |\n| SG_ADD | yes |\n| SG_CAS | no |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | no |\n| GATHER_ADD | yes |\n| GATHER_CAS | no |\n\n### Pthreads\n* CMake Build Flags: -DENABLE_PTHREADS=ON\n* Implementation Language: C++ \u0026 C using GNU intrinsics\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Utilizes \\_\\_ATOMIC\\_RELAXED where appropriate\n* Intrinsic documentation: [GNU Atomics](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html)\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### OpenCL\n* CMake Build Flags: -DENABLE_OPENCL=ON\n* Implementation Language: C++ \u0026 C with OpenCL extensions\n* Users must define both $OCL_TARGET_PLATFORM_NAME and $OCL_TARGET_DEVICE_NAME to set\nthe OpenCL target platform and device, respectively\n* Utilizes unsigned 64-bit integers (cl_ulong) for the ARRAY and IDX values\n* Utilizes OpenCL API-level atomic operations\n\n| Benchmark | Supported? 
|\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### C++ Standard Threads \u0026 Atomics\n* CMake Build Flags: -DENABLE_CPP_STD=ON\n* Implementation Language: C++11\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Utilizes C++11 standard library threads and atomic operations\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### CUDA\n* CMake Build Flag: -DENABLE_CUDA=ON\n* Implementation Language: CUDA C/C++\n* Utilizes unsigned 64-bit integers\n* Utilizes [CUDA API-level atomic operations](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions)\n* The desired target device can be set with $CUDA_VISIBLE_DEVICES, otherwise the default CUDA-enabled device will be used\n* In lieu of a PEs parameter, requires specification of CUDA-specific parallel resources as used in the kernel launch configuration:\n    * `--blocks` : number of thread blocks\n    * `--threads`: number of threads per block\n* Sample Execution:\n```\ncircustent -b RAND_ADD -m 1024 -i 1000 --blocks 100 --threads 512\n```\n\n| Benchmark | Supported? 
|\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n### YGM\n* CMake Build Flag: -DENABLE_YGM=ON\n* Implementation Language: C++17\n* Utilizes unsigned 64-bit integers for the ARRAY and IDX values\n* Utilizes [YGM Communication Library](https://github.com/LLNL/ygm/tree/develop)\n* Target PE's for STRIDE1 and STRIDEN benchmarks are initialized in a stride-1 ring pattern. This implies\nthat for every N'th PE, the target PE is N+1.\n* The target PE during the CENTRAL benchmark is rank 0 for all ranks participating in the benchmark.\n* The PTRCHASE, RAND, SG, SCATTER, GATHER benchmarks utilize randomly generated target PE's for each iteration. This strays from some other implementations where all benchmarks except PTRCHASE are initialized in ring pattern.\n* While operations are atomic with respect to each rank executing remote function calls, they are not atomic with respect to the sender. Because there is no notion of this behavior in YGM asynchronous communication we do not use built in C/C++ atomics, but rather implement the same operation written as C++ lambdas executable as YGM remote procedure calls. \n* Tested with MVAPICH 2, GCC 12\n\nThere are two options available for specific testing of YGM features. These are not specified at execution time, but should be part of the CMake flags as they will change compilation of the benchmarks. These flags can be turned on or off independently, and by default they are both OFF. 
\n* -DSKIP_PROGRESS_PTRCHASE=ON : disables the ygm::comm::local_progress() call in the PTRCHASE implementations, leaving scheduling of buffer sends up to the YGM runtime.\n* -DSHORTCUT_RPC=ON : In applicable benchmarks, enables 'shortcut' versions of remote lambda calls that minimize message counts but are less informative as a distributed benchmark.\n\n| Benchmark | Supported? |\n| ------ | ------ |\n| RAND_ADD | yes |\n| RAND_CAS | yes |\n| STRIDE1_ADD | yes |\n| STRIDE1_CAS | yes |\n| STRIDEN_ADD | yes |\n| STRIDEN_CAS | yes |\n| PTRCHASE_ADD | yes |\n| PTRCHASE_CAS | yes |\n| CENTRAL_ADD | yes |\n| CENTRAL_CAS | yes |\n| SG_ADD | yes |\n| SG_CAS | yes |\n| SCATTER_ADD | yes |\n| SCATTER_CAS | yes |\n| GATHER_ADD | yes |\n| GATHER_CAS | yes |\n\n## Execution Parameters\n\n### Backend Independent Parameters\n\nThe following list details the current set of command line options common to all CircusTent backends:\n* --bench BENCH : specifies the target benchmark to run\n* --memsize BYTES : sets the size of the memory array to allocate in bytes (general rule is 1/2 of physical memory)\n* --iters ITERATIONS : sets the number of algorithmic iterations per PE. Total iterations = (PEs x ITERATIONS)\n* --stride STRIDE : sets the stride (in elements) for the target algorithm. Not all algorithms require the stride to be specified. If this value is not required, the algorithm will ignore it.\n* --help : prints the help menu\n* --list : prints a list of the target benchmarks\n\nIn addition to the options above, backends not explicitly listed below also utilize the \"pes\" command line option as shown. 
\n* --pes PEs : sets the number of parallel execution units (threads, ranks, etc...)\n\n### CUDA Parameters\n\nWhen utilizing the CUDA backend, users must explicitly define the number of thread blocks and threads per block to use during kernel execution as follows (Note that the CUDA backend does not accept a PEs argument):\n* --blocks THREAD_BLOCKS : Sets the number of thread blocks\n* --threads THREADS_PER_BLOCK : Sets the number of threads per block\n\n### Sample Execution\n\nThe following are various examples of utilizing CircusTent for benchmarks\n\n1. Print the help menu\n```\ncircustent --help\n```\n2. List the benchmark algorithms\n```\ncircustent --list\n```\n3. Execute the RAND\\_ADD algorithm using 1024 bytes of memory, 2 PE's and 1000 iterations\n```\ncircustent -b RAND_ADD -m 1024 -p 2 -i 1000\n```\n4. Execute the SCATTER\\_CAS algorithm using 16GB of memory, 24 PE's and 20,000,000 iterations\n```\ncircustent -b SCATTER_CAS -m 16488974000 -p 24 -i 20000000\n```\n\n## Interpreting the Results\nFor each of the target benchmarks, CircusTent prints two relevant\nperformance values.  First, the wallclock runtime of the target algorithm\nis printed in seconds.  Note that running very small problems with very small\nwallclock runtimes may exceed the lower bound of the timing variables.  If\nyou experience issues in printing the timing, increase the number of iterations\nper PE.  An example of the timing printout is as follows:\n\n```\nTiming (secs)        : 0.340783\n```\n\nThe second metric that is printed is the number of billions of atomic\noperations per second, or GAMS (Giga AMOs/sec).  This metric derives\nthe total, parallel number of atomic operations performed in the given\ntime window.  This value can be utilized to compare platforms based upon\nthe number of parallel atomics that can be realistically performed using the\ntarget algorithm.  
This is derived uniquely for each algorithm as the total\nnumber of atomics performed is equivalent to (NUM\\_PEs x NUM\\_ITERATIONS x NUM\\_AMOs\\_PER\\_ITER ).\nAn example of the GAMS printout is as follows:\n\n```\nGiga AMOs/sec (GAMS) : 4.22556\n```\n\nSample results from executing the OpenMP (OMP) implementation\non a modern, dual-socket Intel Xeon system are depicted as follows.\nFor each of these benchmarks, we utilized the following execution parameters:\n* Memsize = 16488974000\n* Iterations = 20000000\n* PEs = 1 - 24\n* Stride (StrideN) = 9\n\n![GAMS](docs/imgs/GAMS.png)\n![TIMING](docs/imgs/TIMING.png)\n\n## Adding New Atomic Implementations\n\nSee the developer documentation.\n\n## Contributing\n\nAll contributions must be made via documented pull requests.  Pull requests will be tested\nusing the CircusTent development infrastructure in order to ensure correctness and\ncode stability.  Pull requests may be initially denied for one or more of the following\nreasons (violations will be documented in pull request comments):\n* Code lacks sufficient documentation\n* Code inhibits/breaks existing functionality\n* Code does not follow existing stylistic guidelines\n* Benchmark implementation violates benchmark rules\n* Benchmark implementation cannot be proven to exist (no test systems exist)\n\n## License\nCircusTent is licensed under an Apache-style license; see the [LICENSE](LICENSE) file for details\n\n## Authors\n* *Brody Williams* - *PhD Student* - [Texas Tech University](https://discl.cs.ttu.edu/doku.php)\n* *Michael Beebe* - *PhD Student* - [Texas Tech University](https://discl.cs.ttu.edu/doku.php)\n* *Pedro Barros* - *Undergraduate Student* - [Instituto Militar de Engenharia](https://www.linkedin.com/in/pbbdasilva/)\n* *John Leidel* - *Chief Scientist* - [Tactical Computing Labs](http://www.tactcomplabs.com)\n* *David Donofrio* - *Chief Hardware Architect* - [Tactical Computing Labs](http://www.tactcomplabs.com)\n* *Preston Piercey* - 
*Research Engineer I* - [Tactical Computing Labs](http://www.tactcomplabs.com)\n\n## Acknowledgments\n* None at this time\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftactcomplabs%2Fcircustent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftactcomplabs%2Fcircustent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftactcomplabs%2Fcircustent/lists"}