https://github.com/pkestene/tsp
Traveling salesman problem solved with different programming models
cea cpp cuda kokkos nvidia-gpu openacc openmp performance-portability stdpar sycl
- Host: GitHub
- URL: https://github.com/pkestene/tsp
- Owner: pkestene
- Created: 2020-12-23T16:24:04.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2023-11-11T23:16:40.000Z (about 1 year ago)
- Last Synced: 2024-07-29T19:07:36.760Z (5 months ago)
- Topics: cea, cpp, cuda, kokkos, nvidia-gpu, openacc, openmp, performance-portability, stdpar, sycl
- Language: C++
- Homepage:
- Size: 56.6 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
Solving the Traveling Salesman Problem (TSP) with brute force and
parallelism, just for teaching and illustrative purposes.

# [Traveling Salesman Problem](https://en.wikipedia.org/wiki/Travelling_salesman_problem)
# Brute force parallelism to solve TSP
The code here is mostly adapted from the very nice talk by David Olsen, [Faster Code Through Parallelism on CPUs and GPUs](https://www.youtube.com/watch?v=cbbKEAWf1ow) at [CppCon 2019](https://cppcon.org/cppcon-2019-program/).
I've just slightly changed the way the permutations are computed, which turns out to be slightly faster than the original.
See also companion slides by D. Olsen (Nvidia) at GTC2019 [s9770-c++17-parallel-algorithms-for-nvidia-gpus-with-pgi-c++.pdf](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9770-c++17-parallel-algorithms-for-nvidia-gpus-with-pgi-c++.pdf)
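
To make the brute-force idea concrete, here is a minimal serial sketch. It assumes each permutation is decoded directly from its index (factorial number system), so every loop iteration is independent of the others. All names (`index_to_permutation`, `tour_length`, `brute_force_tsp`, `kMaxCities`) are illustrative, and this is not necessarily the decoding scheme used in this repository or in the original talk.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// All names below are illustrative; they are not taken from the repository.
constexpr int kMaxCities = 16;
using Permutation = std::array<int, kMaxCities>;

// n! as a 64-bit integer (exact for n <= 20).
inline std::uint64_t factorial(int n) {
    std::uint64_t f = 1;
    for (int i = 2; i <= n; ++i) f *= static_cast<std::uint64_t>(i);
    return f;
}

// Decode permutation number `index` (0 <= index < n!) of the cities
// 0..n-1 using the factorial number system. Index 0 is the identity.
inline Permutation index_to_permutation(std::uint64_t index, int n) {
    assert(n >= 1 && n <= kMaxCities);
    Permutation pool{};  // cities not yet placed, kept in increasing order
    Permutation perm{};
    for (int i = 0; i < n; ++i) pool[i] = i;
    for (int i = 0; i < n; ++i) {
        const std::uint64_t f = factorial(n - 1 - i);
        const int pick = static_cast<int>(index / f);  // which remaining city comes next
        index %= f;
        perm[i] = pool[pick];
        // Remove the chosen city from the pool.
        for (int j = pick; j < n - 1 - i; ++j) pool[j] = pool[j + 1];
    }
    return perm;
}

// Length of the closed tour perm[0] -> perm[1] -> ... -> perm[n-1] -> perm[0].
// `dist` is a row-major n x n distance matrix.
inline long tour_length(const Permutation& perm, int n, const std::vector<int>& dist) {
    long total = 0;
    for (int i = 0; i < n; ++i) {
        const int from = perm[i];
        const int to = perm[(i + 1) % n];
        total += dist[from * n + to];
    }
    return total;
}

// Serial reference: try every permutation and keep the shortest tour.
inline long brute_force_tsp(const std::vector<int>& dist, int n) {
    long best = std::numeric_limits<long>::max();
    const std::uint64_t count = factorial(n);
    for (std::uint64_t k = 0; k < count; ++k) {
        const long len = tour_length(index_to_permutation(k, n), n, dist);
        if (len < best) best = len;
    }
    return best;
}
```

Because the loop body depends only on `k`, the same loop can be handed to any of the programming models compared below.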
# Timings
Timings reported here are measured on my own desktop.
Hardware:
- CPU Intel(R) Core(TM) i5-9400F @ 2.90 GHz - 6 cores
- GPU Nvidia GeForce RTX 2060

Software:
- Ubuntu 20.04
- [Nvidia hpc sdk version 20.11](https://developer.nvidia.com/hpc-sdk) with cuda 11.2
- g++ 9.3

## serial version (reference)
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.064
11 | 0.66
12 | 8.48
13 | 130.7
14 | 1997.9

## OpenMP pragmas
### GNU compiler
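As a rough illustration of the pragma-based approach, the sketch below parallelizes the loop over permutation indices with an OpenMP `min` reduction. It assumes each tour is decoded from its index on the fly. The function names (`tour_length_from_index`, `brute_force_tsp_omp`) are hypothetical and not taken from the repository. Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative names, not the repository's.
constexpr int kMaxN = 16;

// n! as a signed 64-bit integer (exact for n <= 20).
inline std::int64_t factorial64(int n) {
    std::int64_t f = 1;
    for (int i = 2; i <= n; ++i) f *= i;
    return f;
}

// Length of the closed tour encoded by permutation number `index`,
// decoding one city at a time (factorial number system).
// `dist` points to a row-major n x n distance matrix.
inline long tour_length_from_index(std::int64_t index, int n, const int* dist) {
    int pool[kMaxN];  // cities not yet visited
    for (int i = 0; i < n; ++i) pool[i] = i;
    std::int64_t f = factorial64(n - 1);  // (n-1-i)! for the current step i
    int first = -1;
    int prev = -1;
    long total = 0;
    for (int i = 0; i < n; ++i) {
        const int pick = static_cast<int>(index / f);
        index %= f;
        const int city = pool[pick];
        for (int j = pick; j < n - 1 - i; ++j) pool[j] = pool[j + 1];
        if (i == 0) {
            first = city;
        } else {
            total += dist[prev * n + city];
        }
        prev = city;
        if (n - 1 - i > 0) f /= (n - 1 - i);  // (n-1-i)! -> (n-2-i)!
    }
    return total + dist[prev * n + first];  // close the tour
}

inline long brute_force_tsp_omp(const std::vector<int>& dist, int n) {
    assert(n >= 1 && n <= kMaxN);
    assert(dist.size() == static_cast<std::size_t>(n) * n);
    const int* d = dist.data();
    const std::int64_t count = factorial64(n);
    long best = std::numeric_limits<long>::max();
    // Iterations are independent, so the loop can be split across threads;
    // the min reduction merges the per-thread best values.
    #pragma omp parallel for reduction(min : best)
    for (std::int64_t k = 0; k < count; ++k) {
        const long len = tour_length_from_index(k, n, d);
        if (len < best) best = len;
    }
    return best;
}
```

With g++, OpenMP is enabled by the `-fopenmp` flag.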
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.044
11 | 0.152
12 | 1.64
13 | 24.2
14 | 368.3

## OpenMP target for Nvidia
We need an OpenMP 4.5 capable compiler. Just follow the steps mentioned here:
- [how-to-build-and-run-your-modern-parallel-code-in-c-17-and-openmp-4-5-library-on-nvidia-gpus](https://devmesh.intel.com/blog/724749/how-to-build-and-run-your-modern-parallel-code-in-c-17-and-openmp-4-5-library-on-nvidia-gpus)
- [Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs](https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs)

Performance is quite slow compared to PGI/OpenAcc or Kokkos::CUDA below: roughly 3 times slower when N >= 12.
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.077
11 | 0.104
12 | 0.39
13 | 3.56
14 | 59.5

I tried three compiler toolchains:
- clang 10.0 with cuda 10.1
- clang 11.0 with cuda 11.2
- clang 12.0 with cuda 11.3

## OpenAcc (PGI compiler)
### OpenAcc for multicore CPU
These results are strange, much too slow (maybe related to my host configuration?). I also see a strong performance drop when using nvhpc 20.11 versus 20.5 (??).
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.030
11 | 0.25
12 | 7.21
13 | XX
14 | XX

### OpenAcc for GPU
Performance looks good (similar to Kokkos::CUDA below).
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.107
11 | 0.128
12 | 0.25
13 | 1.51
14 | 20.5

## Kokkos
To build, edit `kokkos/Makefile` and set the variable `KOKKOS_PATH` on the first line to the full path where you cloned the [kokkos](https://github.com/kokkos/kokkos/) sources.
### Kokkos - OpenMP
Just run `make` in subdir `kokkos`.
Timings are similar to OpenMP pragma.
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.015
11 | 0.152
12 | 1.84
13 | 24.5
14 | 376.6

### Kokkos - Cuda
Just run `make KOKKOS_DEVICES=Cuda` in subdir `kokkos`.
Timings are very similar to OpenAcc (GPU).
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.01
11 | 0.01
12 | 0.106
13 | 1.41
14 | 22.4

## stdpar
### stdpar for multicore cpu (nvc++ compiler)
Again, the performance obtained here is odd: it is much slower than OpenMP with g++, although it should be similar.
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.027
11 | 0.128
12 | 5.10
13 | 114.0
14 | TODO

### stdpar for GPU (nvc++ compiler)
Performance is similar to OpenAcc and Kokkos/Cuda.
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.007
11 | 0.016
12 | 0.15
13 | 1.40
14 | 21.3

## SYCL
There are a lot of SYCL implementations available; here we tried:
- Intel OneAPI DPC++ (linux) for x86 multi-core
- Intel LLVM (linux) for Nvidia GPU (see https://github.com/intel/llvm)

### SYCL OneAPI for x86 multicore
`module load compiler/2021.1.1`
Number of cities | time (seconds)
----------------- | ---------------
10 | 1.29
11 | 1.40
12 | 2.62
13 | 21.3
14 | TODO

For large values of N, we recover the expected results (similar to OpenMP, Kokkos::OpenMP, ...).
### SYCL OneAPI for Nvidia GPU (LLVM/intel)
`module load llvm/12-intel-sycl-cuda`
These results should be optimized by tuning the number of thread blocks and the block size (à la CUDA).
Performance is reasonable, though maybe a bit slow for N=14.
Number of cities | time (seconds)
----------------- | ---------------
10 | 0.001
11 | 0.008
12 | 0.11
13 | 1.86
14 | 37.4