Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dfordivam/perf-check
Some code to check performance of mulit-threaded code in haskell and openMP
https://github.com/dfordivam/perf-check
Last synced: 6 days ago
JSON representation
Some code to check performance of mulit-threaded code in haskell and openMP
- Host: GitHub
- URL: https://github.com/dfordivam/perf-check
- Owner: dfordivam
- Created: 2015-12-10T10:00:02.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2015-12-14T10:36:19.000Z (about 9 years ago)
- Last Synced: 2024-12-23T11:13:13.664Z (11 days ago)
- Language: C
- Size: 50.8 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README
Awesome Lists containing this project
README
p1 : Naive implementation of parallelism in openMP and haskell
The tasks are really simple and therefore the single-threaded execution has an edge over multiple threads.
The benefits of parallelism happen after >1000 tasks per thread per iteration.
The haskell seems to do better than openMPp2 : Tasks operate on random data.
Tasks are still simple (same as p1), but each iteration operates on random set of data.
Here openMP does very bad for NUM_TASK ~1000 (9x slower, haskell 2x slower)
For NUM_TASK 10000 per iteration haskell does better than single threaded, openMP is still 3x slowerp3 : Unequal distribution of tasks in threads
For unequal distribution the number of tasks in each thread is same
But some threads will get additional heavy task randomly.
Here if we keep NUM_TASK 10000 then openMP performs slightly better than single thread
Haskell performs much better ~1.5x speedup on 3 coresp4 : Every Task does an atomic operation in the end (Increment a counter)
Here openMP is 2x slow with NUM_TASK 1000, 1x with 10000
Haskell has a callback routine with atomicModifyIORef
It is 4x slower with 1000, and 3.5x slower with 10000
So most of degradation is due to callback into haskell
Note: With MVar the haskell just stops running. (>100x slower?)p5 : More realistic example
1. Task reads/write only 10% of memory of the array
2. Number of arrays is 200k
(should be more but my system's memory is limited)
3. Array size is 1k
(should be more but my system's memory is limited)The openMP parallel pragma should come just before the execution
of parallel blocks. If we call generateExecTask inside the parallel
pragma region, then it can cause degradation of upto 1.5-2xHere performance of openMP and haskell are similar
Note : Here I removed the callback of haskell and used
openMP atomic instead. Haskell callback is an overkill
to do for each taskp6 : Task is more realistic. Its access of memory is non contiguous
Important finding: If the array of execTasks (for all iterations)
is known before the openMP iterations start, then there is a
significant speedup (upto 1.5x on 3 cores).
(Though this scenario is un-realistic as the next
set of execution tasks is not known beforehand)Other changes : Task access localization- There is a provision
to limit the task on either half, one-third, etc. arrays
This is essential as the real world tasks (in an iteration)
are limited to a part of heap memory.Result Comparison- Added some infrastructure to compare the result
with single threaded run. The openMP with single core gives correct result.
Haskell on the other hand always cause thread race and mismatch in result.p7 : Same as p6, but execTasks are 'created'(actually copied) for each iteration
p8 : Compare and swap mechanism in each task to check if values are valid
This resulted in 2.5x degradation in single thread performance
and with 3 cores 2x slowdown compared to single-thread version without CAS.p9 : Added sort on execution tasks just before the iteration starts
This effectively removes the result mismatch issue
Without sort the speedup on 3 core is 2.15x
with sort 1.5x