https://github.com/dfordivam/perf-check

Some code to check performance of mulit-threaded code in haskell and openMP
https://github.com/dfordivam/perf-check

Last synced: 18 days ago
JSON representation

Some code to check performance of mulit-threaded code in haskell and openMP

Host: GitHub
URL: https://github.com/dfordivam/perf-check
Owner: dfordivam
Created: 2015-12-10T10:00:02.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2015-12-14T10:36:19.000Z (almost 10 years ago)
Last Synced: 2025-02-14T21:39:55.925Z (10 months ago)
Language: C
Size: 50.8 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README

Awesome Lists containing this project

README

p1 : Naive implementation of parallelism in openMP and haskell
The tasks are really simple and therefore the single-threaded execution has an edge over multiple threads.
The benefits of parallelism happen after >1000 tasks per thread per iteration.
The haskell seems to do better than openMP

p2 : Tasks operate on random data.
Tasks are still simple (same as p1), but each iteration operates on random set of data.
Here openMP does very bad for NUM_TASK ~1000 (9x slower, haskell 2x slower)
For NUM_TASK 10000 per iteration haskell does better than single threaded, openMP is still 3x slower

p3 : Unequal distribution of tasks in threads
For unequal distribution the number of tasks in each thread is same
But some threads will get additional heavy task randomly.
Here if we keep NUM_TASK 10000 then openMP performs slightly better than single thread
Haskell performs much better ~1.5x speedup on 3 cores

p4 : Every Task does an atomic operation in the end (Increment a counter)
Here openMP is 2x slow with NUM_TASK 1000, 1x with 10000
Haskell has a callback routine with atomicModifyIORef
It is 4x slower with 1000, and 3.5x slower with 10000
So most of degradation is due to callback into haskell
Note: With MVar the haskell just stops running. (>100x slower?)

p5 : More realistic example
1. Task reads/write only 10% of memory of the array
2. Number of arrays is 200k
(should be more but my system's memory is limited)
3. Array size is 1k
(should be more but my system's memory is limited)

The openMP parallel pragma should come just before the execution
of parallel blocks. If we call generateExecTask inside the parallel
pragma region, then it can cause degradation of upto 1.5-2x

Here performance of openMP and haskell are similar

Note : Here I removed the callback of haskell and used
openMP atomic instead. Haskell callback is an overkill
to do for each task

p6 : Task is more realistic. Its access of memory is non contiguous
Important finding: If the array of execTasks (for all iterations)
is known before the openMP iterations start, then there is a
significant speedup (upto 1.5x on 3 cores).
(Though this scenario is un-realistic as the next
set of execution tasks is not known beforehand)

Other changes : Task access localization- There is a provision
to limit the task on either half, one-third, etc. arrays
This is essential as the real world tasks (in an iteration)
are limited to a part of heap memory.

Result Comparison- Added some infrastructure to compare the result
with single threaded run. The openMP with single core gives correct result.
Haskell on the other hand always cause thread race and mismatch in result.

p7 : Same as p6, but execTasks are 'created'(actually copied) for each iteration

p8 : Compare and swap mechanism in each task to check if values are valid
This resulted in 2.5x degradation in single thread performance
and with 3 cores 2x slowdown compared to single-thread version without CAS.

p9 : Added sort on execution tasks just before the iteration starts
This effectively removes the result mismatch issue
Without sort the speedup on 3 core is 2.15x
with sort 1.5x

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dfordivam/perf-check

Awesome Lists containing this project

README