{"id":20164083,"url":"https://github.com/southernmethodistuniversity/profiling_applications","last_synced_at":"2026-03-05T08:01:40.822Z","repository":{"id":110252020,"uuid":"257281733","full_name":"SouthernMethodistUniversity/profiling_applications","owner":"SouthernMethodistUniversity","description":"Profiling Applications on M2","archived":false,"fork":false,"pushed_at":"2020-04-21T19:07:31.000Z","size":75,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-03T03:13:22.255Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SouthernMethodistUniversity.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-20T12:59:26.000Z","updated_at":"2020-04-21T19:07:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"c32980cd-555c-4612-b3ec-b409b59a817b","html_url":"https://github.com/SouthernMethodistUniversity/profiling_applications","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SouthernMethodistUniversity/profiling_applications","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SouthernMethodistUniversity%2Fprofiling_applications","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SouthernMethodistUniversity%2Fprofiling_applications/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SouthernMethodistUniversity%2Fprofiling_applications
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SouthernMethodistUniversity%2Fprofiling_applications/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SouthernMethodistUniversity","download_url":"https://codeload.github.com/SouthernMethodistUniversity/profiling_applications/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SouthernMethodistUniversity%2Fprofiling_applications/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30115662,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T03:40:26.266Z","status":"ssl_error","status_checked_at":"2026-03-05T03:39:15.902Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T00:33:11.389Z","updated_at":"2026-03-05T08:01:40.768Z","avatar_url":"https://github.com/SouthernMethodistUniversity.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Profiling Applications on M2\n\n## Center for Research Computing (CRC)\n\n* Maintains our primary shared resource for research computing, ManeFrame II (M2),\n  in collaboration with OIT\n* Provides research computing tools, support, and training to all faculty, staff,\n  and students using research computing 
resources\n* [www.smu.edu/csc](https://www.smu.edu/csc) has documentation and news\n* [help@smu.edu](mailto:help@smu.edu) or\n  [rkalescky@smu.edu](mailto:rkalescky@smu.edu) for help\n\n## CSC Workshop Series\n\n|Date         |Workshop                                                     |\n|-------------|-------------------------------------------------------------|\n|January 21   |M2 Introduction                                              |\n|January 28   |Introduction to LAPACK and BLAS                              |\n|February 4   |Text Mining with Python on M2 (Led by Dr. Eric Godat)        |\n|February 11  |Using the New HPC Portal                                     |\n|February 18  |Using GitHub                                                 |\n|February 25  |Writing Portable Accelerator Code with KOKKOS, RAJA, and OCCA|\n|March 3      |M2 Introduction                                              |\n|March 10     |Introduction to Parallelization Using MPI                    |\n|March 17     |No Workshop Spring Break                                     |\n|March 24     |Writing High Performance Python Code                         |\n|March 31     |Creating Portable Environments with Docker and Singularity   |\n|April 7      |M2 Introduction                                              |\n|April 14     |Introduction to Parallelization Using OpenMP and OpenACC     |\n|April 21     |Profiling Applications on M2                                 |\n|April 28     |Improving Code Vectorization                                 |\n\n## Accessing ManeFrame II (M2) for this Workshop\n\n* Via Terminal or PuTTY as usual (see [here](http://faculty.smu.edu/csc/documentation/access.html) for details)\n* Via the HPC Portal (note that this doesn't support X11 forwarding)\n    1. Go to [hpc.smu.edu](https://hpc.smu.edu/).\n    2. Sign in using your SMU ID and SMU password.\n    3. 
Select \"ManeFrame II Shell Access\" from the \"Clusters\" drop-down menu.\n\n## Profiling and Performance Analysis with GCC\n\nThere are two primary mechanisms for profiling code: determining which\nroutines take the most time, and determining which specific lines of\ncode would be best to optimize. Thankfully, the [GNU compiler\ncollection](http://gcc.gnu.org/) includes utilities for both of these\ntasks, as will be illustrated below. Utilities with similar\nfunctionality are included with some other compilers.\n\n### Generating a profile\n\nIn the GNU compilers (and many others), you can enable profiling\ninformation by adding the `-p` compiler flag. Add this compiler\nflag to the commands in the `CMakeLists.txt` for the target `mmm`.\n\nProfiling information is generated by running the executable once to\ncompletion.\n\n```\n$ module load spack gcc-9.2 armadillo cmake\n$ cmake .\n$ cmake --build .\n$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 ./mmm 2000 2000 2000\n```\n\nWrite down the total runtime required for the program (you will use this\ninformation later on).\n\nWhen the program has finished, you should see a new file in the\ndirectory called `gmon.out`. This contains the relevant profiling data,\nand was written during the execution of the code.\n\nExamine the profiling information by using the program `gprof`. You use\nthis by calling `gprof`, followed by the executable name. It will\nautomatically look in the `gmon.out` file in that directory for the\nprofiling data that relates to the executable. Run the command\n\n```\n$ gprof mmm\n```\n\nWhen you run `gprof`, it outputs all of the profiling information to the\nscreen. To enable easier examination of these results, you should\ninstead send this data to a file. 
You can redirect this information to\nthe file `profiling_data.txt` with the command\n\n```\n$ gprof mmm \u003e profiling_data.txt\n```\n\nYou will then have the readable file `profiling_data.txt` with the\nrelevant profiling information.\n\n### Identifying bottlenecks\n\nRead through the first table of profiling information in this file. The\nfirst column of this table shows the percentage of time spent in each\nfunction called by the executable. Identify which one takes the vast\nmajority of the time. This bottleneck should be the first routine that\nyou investigate for optimization.\n\nLook through the routine identified in the previous step. The\nfunction may be contained in a file with a different name, so you can\nuse `grep` to find which file contains the routine:\n\n```\n$ grep -i \u003croutine_name\u003e *\n```\n\nwhere `\u003croutine_name\u003e` is the function that you identified in the\nprevious step.\n\nOnce you have determined the file that contains the culprit function,\nyou can use the second utility, `gcov`, to determine which lines\nin the file are executed the most. To use `gcov`, you must modify the\ncompile line once more, to use the compilation flags\n`-fprofile-arcs -ftest-coverage`.\n\nAdd these compiler flags to the commands in the `CMakeLists.txt` for the\ntarget `mmm`, recompile, and re-run the executable:\n\n```\n$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 ./mmm 2000 2000 2000\n$ mv ./CMakeFiles/mmm.dir/mmm.cpp.gcno mmm.gcno\n$ mv ./CMakeFiles/mmm.dir/mmm.cpp.gcda mmm.gcda\n```\n\nYou should now see additional files in the directory with extensions\n`.gcda` and `.gcno`. If you do not see these files, revisit the above\ninstructions to ensure that you haven't missed any steps.\n\nYou should now run `gcov` on the input file that held the function you\nidentified from the steps above. 
For example, if the source code file\nwas `mmm.cpp`, you would run\n\n```\n$ gcov mmm.cpp\n```\n\nThis will output some information to the screen, including the name of a\n`.gcov` file that it creates with information on the program. Open this\nnew file using `nano`, and you will see lines like the following:\n\n```\n      2001:   10:    for (unsigned long int k = 0; k \u003c p; ++k) {\n   4002000:   11:        for (unsigned long int j = 0; j \u003c n; ++j) {\n8004000000:   12:            for (unsigned long int i = 0; i \u003c m; ++i) {\n8000000000:   13:                C.at(i, k) += A.at(i, j) * B.at(j, k);\n```\n\nThe first column of numbers on the left signifies the number of times each\nline of code was executed within the program. The second column of\nnumbers corresponds to the line number within the source code file. The\nremainder of each line shows the source code itself. From the above\nsnippet, we see that lines 12 and 13 were executed roughly 8.004 billion\nand 8 billion times, respectively, indicating that these would be prime\nlocations for code optimization.\n\nFind the corresponding lines of code in the function that you identified\nin the preceding step. This is where you should focus your\noptimization efforts.\n\n### Optimizing code\n\nSave a copy of the source code file you plan to modify using the `cp`\ncommand, e.g.\n\n```\n$ cp file.cpp file_old.cpp\n```\n\nwhere `file` is the file that you have identified as containing the\nbottleneck routine (use the appropriate extension for your coding\nlanguage). We will use this original file again later in the session.\n\nNow that you know which lines are executed, and how often, you should\nremove the `gcov` compiler options, but keep the `-p` in your\n`CMakeLists.txt`.\n\nDetermine what, if anything, can be optimized in this routine. The topic\nof code optimization is bigger than we can cover in a single workshop\nsession, but here are some standard techniques.\n\n#### Code optimization techniques\n\n1.  
Is there a simpler way that the arithmetic could be accomplished?\n    Sometimes the most natural way of writing down a problem does not\n    result in the least amount of effort. For example, we may implement\n    a line of code to evaluate the polynomial $p(x) =\n    2x^4-3x^3+5x^2-8x+7$ using either\n\n    ```\n    p = 2.0*x*x*x*x - 3.0*x*x*x + 5.0*x*x - 8.0*x + 7.0;\n    ```\n\n    or\n\n    ```\n    p = (((2.0*x - 3.0)*x + 5.0)*x - 8.0)*x + 7.0;\n    ```\n\n    The first line requires 10 multiplications and 4 additions/subtractions,\n    while the second requires only 4 multiplications and 4\n    additions/subtractions.\n\n2.  Is the code accessing memory in an optimal manner? Computers store\n    and access memory from RAM one \"page\" at a time, meaning that if\n    you retrieve a single number, the numbers nearby that value are also\n    stored in fast-access cache memory. So, if each iteration of a loop\n    uses values that are stored in disparate portions of RAM, each value\n    could require retrieval of a separate page. Alternatively, if each\n    loop iteration uses values from memory that are stored near one\n    another, many numbers in a row can be retrieved using a single RAM\n    access. Since RAM access speeds are significantly slower than cache\n    access speeds, something as small as a difference in loop ordering\n    can make a huge difference in speed.\n\n3.  Is the code doing redundant computations? While modern computers can\n    perform many calculations in the time it takes to access one page of\n    RAM, some calculations are costly enough to warrant computing them\n    only once and storing the result for later reuse. This is especially\n    pertinent for things that are performed a large number of times. 
For\n    example, consider the following two algorithms:\n\n    ```\n    for (i=1; i\u003c10000; i++) {\n        d[i] = u[i-1]/h/h - 2.0*u[i]/h/h + u[i+1]/h/h;\n    }\n    ```\n\n    and\n\n    ```\n    double hinv2 = 1.0/h/h;\n    for (i=1; i\u003c10000; i++) {\n        d[i] = (u[i-1] - 2.0*u[i] + u[i+1])*hinv2;\n    }\n    ```\n\n    Since floating-point division is significantly more costly than\n    multiplication (roughly $10\\times$), and the division by $h^2$ is\n    done redundantly both within and between loop iterations, the second\n    of these algorithms is typically much faster than the first.\n\n4.  Is the code doing unnecessary data copies? In many programming\n    languages, a function can be written to use either *call-by-value*\n    or *call-by-reference*.\n\n    In call-by-value, all arguments to a function are copied from the\n    calling routine into a new set of variables that are local to the\n    called function. This allows the called function to modify the input\n    variables without concern about corrupting data in the calling\n    routine.\n\n    In call-by-reference, the called function only receives memory\n    references to the actual data held by the calling routine. This\n    allows the called function to directly modify the data held by the\n    calling routine.\n\n    While call-by-reference is obviously more \"dangerous,\" it avoids\n    unnecessary (and costly) memory allocation/copying/deallocation in\n    the executing code. As such, highly efficient code typically uses\n    call-by-reference, with the programmer responsible for ensuring that\n    data requiring protection in the calling program is manually copied\n    before function calls, or that the functions themselves are\n    constructed to avoid modifying the underlying data.\n\n    In C and C++, call-by-value is the default, whereas Fortran uses\n    call-by-reference. However, in C, pointers may be passed through\n    function calls to emulate call-by-reference. 
In C++, either pointers\n    can be sent through function calls, or arguments may be specified as\n    being passed by reference (using the `\u0026` symbol).\n\nFind what you can fix, so long as you do not change the mathematical\nresult. Delete the executable, re-compile, and re-run it.\n\nRe-examine the results using `gprof`, and repeat the optimization\nprocess until you are certain that the code has been sufficiently\noptimized. You should be able to achieve a significant performance\nimprovement (at least 40% faster than the original).\n\nWrite down the total runtime required for your hand-optimized program.\nCopy your updated code to the file `file_new.cpp` (again, use the\nappropriate extension for your coding language).\n\n## Profiling Python Scripts\n\nLike GCC, Python has built-in profiling support.\n\n```\n$ module purge\n$ module load python\n$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 python3 mmm.py\n```\n\nThis demonstrates the performance difference between an optimized matrix-matrix\nmultiplication implementation and a simple implementation.\n\n```\n$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 python3 -m cProfile -o profile_data.txt mmm.py\n$ python3 view_profile.py | grep gemm\n```\n\nThe `view_profile.py` script extracts line-level profile information from the\n`profile_data.txt` file.\n\n## Profiling R Scripts\n\nLike GCC and Python, R has the ability to profile at the function and line levels.\n\n```\n$ module purge\n$ module load r/3.6.2\n$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 Rscript mmm.R 1\n```\n\nThis demonstrates the performance difference between an optimized matrix-matrix\nmultiplication implementation and a simple implementation.\n\n```\n$ srun -p development,htc,standard-mem-s -c 1 --mem=6G -t 5 Rscript profile.R\n```\n\nThe `profile.R` script runs the profiler twice, once for function-level\ninformation and again for line-level 
information.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsouthernmethodistuniversity%2Fprofiling_applications","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsouthernmethodistuniversity%2Fprofiling_applications","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsouthernmethodistuniversity%2Fprofiling_applications/lists"}