{"id":16902691,"url":"https://github.com/dvhh/masscorrelation","last_synced_at":"2026-05-15T01:08:12.971Z","repository":{"id":28367202,"uuid":"31881122","full_name":"dvhh/massCorrelation","owner":"dvhh","description":"An exercise in writing an efficient correlation calculator","archived":false,"fork":false,"pushed_at":"2017-09-22T08:50:38.000Z","size":29,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-20T14:48:36.218Z","etag":null,"topics":["calculations","correlation-calculation","cuda","matrix","multi-threading","openmp"],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dvhh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-09T05:34:20.000Z","updated_at":"2022-01-10T08:12:51.000Z","dependencies_parsed_at":"2022-07-06T14:33:16.209Z","dependency_job_id":null,"html_url":"https://github.com/dvhh/massCorrelation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dvhh/massCorrelation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvhh%2FmassCorrelation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvhh%2FmassCorrelation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvhh%2FmassCorrelation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvhh%2FmassCorrelation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dvhh","download_url":"https://codeload.github.com/dvhh/mas
sCorrelation/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dvhh%2FmassCorrelation/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263335846,"owners_count":23450936,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["calculations","correlation-calculation","cuda","matrix","multi-threading","openmp"],"created_at":"2024-10-13T18:07:30.225Z","updated_at":"2025-10-26T18:31:49.727Z","avatar_url":"https://github.com/dvhh.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# massCorrelation\n\n[![Build Status](https://travis-ci.org/dvhh/massCorrelation.svg?branch=master)](https://travis-ci.org/dvhh/massCorrelation)\n[![Coverity Scan Build Status](https://scan.coverity.com/projects/5220/badge.svg)](https://scan.coverity.com/projects/5220)\n\nAn exercise in writing an efficient correlation calculator\n\n## Overview\n\nThis is primarily an exercise in implementing a correlation calculation.\n\nThe primary implementation is already quite fast, but I wanted to know how fast I could get it.\nIn chronological order:\n- The baseline algorithm came from the R project, with some optimisation for repeated pairing of vectors.\n- The matrix algorithm was described in a [stackoverflow post][4]\n- CUDA implementation of both algorithms\n- Multithreading implementation with pthread and OpenMP\n\nThis program is very easy to parallelize because no result depends on another, so very little synchronization is needed between threads (the only synchronization needed was to manage 
the task queue shared by the threads in a 1-producer, n-consumer setup).\n\nWe are of course assuming:\n- That there is enough memory to hold both the input and the result data.\n- The data type used for calculation is float.\n- The multi-threaded code makes no effort to adapt to the processor and uses a thread pool of 128 threads.\n- Measured timings depend heavily on I/O right now. I will attempt to reduce that dependency in the future.\n\n## Measured Timing\n\nTimes were measured on an [Intel E5606][1].\nCUDA code was run on one [Tesla C2075][2].\n\ninput : 601 x 45101 matrix\n\n- Baseline Implementation : 977.89\n- CUDA : 98.27\n- Matrix Implementation : 1008.80\n- Matrix CUDA : 142.23\n- Multi-threaded : 305.19\n- OpenMP : 475.90 (to be checked later when the node is less loaded)\n- OpenMP Nested : 349.13\n- No calculation : 56.64\n\nFor comparison, on an Intel [X5690][3]:\n\n- Baseline Implementation : 640.27\n- Matrix Implementation : 639.95\n- Multi-threaded : 82.10\n- OpenMP : 130.34\n- OpenMP Nested : 115.11\n- No calculation : 9.80\n\ninput : 100 x 10000 (generated with randomMatrix.pl)\n\nIntel [E5606][1]\n- Baseline : 11.40 (9.08)\n- CUDA : Unavailable at this time\n- Matrix : 11.18 (8.80)\n- Matrix CUDA : Unavailable at this time\n- Multi-threaded : 7.39 (5.22)\n- OpenMP : 6.26 (3.99)\n- OpenMP Nested : 6.04 (3.71)\n- No calculation : 2.63 (0.33)\n\nIntel [X5690][3]\n- Baseline : 5.67 (5.45)\n- CUDA : 3.27 (1.12)\n- Matrix : 5.53 (5.23)\n- Matrix CUDA : 3.21 (1.13)\n- Multi-threaded : 0.94 (0.72)\n- OpenMP : 1.25 (1.03)\n- OpenMP Nested : 2.97 (2.74)\n- No calculation : 0.43 (0.22)\n\nTiming on [Tegra2 T20][5]\n- Baseline : 124.26 (78.09)\n- Matrix : 119.02 (76.48)\n- Thread : 75.11 (41.43)\n- OpenMP : 95.44 (61.21)\n- OpenMP Nested : 93.57 (56.98)\n- No calculation : 47.44 (2.44)\n\n## Notes\n\nThe CUDA implementation was quite straightforward to write and provided benefits out of the box. Attempts to cleverly optimize the CUDA kernel 
by using shared memory were unsuccessful (no performance gained), and the trade-offs were unacceptable (reducing the number of threads to accommodate the size of the shared memory).\nThe multi-threaded code, by contrast, required some implementation work for the thread pool, and might be optimized further by using more specialized queues (which could reduce the time spent in allocation).\n\n[1]:http://ark.intel.com/products/52583/Intel-Xeon-Processor-E5606-8M-Cache-2_13-GHz-4_80-GTs-Intel-QPI\n[2]:http://www.nvidia.co.jp/docs/IO/43395/BD-05880-001_v02.pdf\n[3]:http://ark.intel.com/products/52576/Intel-Xeon-Processor-X5690-12M-Cache-3_46-GHz-6_40-GTs-Intel-QPI\n[4]:http://stackoverflow.com/a/18965892/105104\n[5]:http://www.nvidia.com/object/tegra-superchip.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdvhh%2Fmasscorrelation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdvhh%2Fmasscorrelation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdvhh%2Fmasscorrelation/lists"}