{"id":13314621,"url":"https://github.com/pkestene/tsp","last_synced_at":"2025-08-19T03:30:36.808Z","repository":{"id":86494737,"uuid":"323949651","full_name":"pkestene/tsp","owner":"pkestene","description":"traveling salesman problem solved with different programing models","archived":false,"fork":false,"pushed_at":"2023-11-11T23:16:40.000Z","size":58,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-07-29T19:07:36.760Z","etag":null,"topics":["cea","cpp","cuda","kokkos","nvidia-gpu","openacc","openmp","performance-portability","stdpar","sycl"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pkestene.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-12-23T16:24:04.000Z","updated_at":"2024-07-10T20:24:14.000Z","dependencies_parsed_at":"2023-11-12T00:22:25.386Z","dependency_job_id":"5bbcb531-bf6c-4849-8b4c-c93e26cec851","html_url":"https://github.com/pkestene/tsp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pkestene%2Ftsp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pkestene%2Ftsp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pkestene%2Ftsp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pkestene%2Ftsp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pkestene","download_url":"https://codeload.github.com/pkestene/tsp/tar.gz/
refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230312448,"owners_count":18206858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cea","cpp","cuda","kokkos","nvidia-gpu","openacc","openmp","performance-portability","stdpar","sycl"],"created_at":"2024-07-29T18:11:50.343Z","updated_at":"2024-12-18T17:26:38.562Z","avatar_url":"https://github.com/pkestene.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"Solving the Traveling Salesman Problem (TSP) with brute force and\nparallelism just for teaching and illustrative purpose.\n\n# [Traveling Salesman Problem](https://en.wikipedia.org/wiki/Travelling_salesman_problem)\n\n# Brute force parallelism to solve TSP\n\nThe code here is mostly adapted from the very nice talk by David Olsen, [Faster Code Through Parallelism on CPUs and GPUs](https://www.youtube.com/watch?v=cbbKEAWf1ow) at [CppCon 2019](https://cppcon.org/cppcon-2019-program/).\n\nI've just slightly changed the way the permutations are computed which turns out to be slightly faster than the original.\n\nSee also companion slides by D. 
Olsen (Nvidia) at GTC2019 [s9770-c++17-parallel-algorithms-for-nvidia-gpus-with-pgi-c++.pdf](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9770-c++17-parallel-algorithms-for-nvidia-gpus-with-pgi-c++.pdf)\n\n# Timings\n\nTimings reported here are measured on my own desktop.\n\nHardware:\n- CPU Intel(R) Core(TM) i5-9400F @ 2.90 GHz - 6 cores\n- GPU Nvidia GeForce RTX 2060\n\nSoftware:\n- Ubuntu 20.04\n- [Nvidia hpc sdk version 20.11](https://developer.nvidia.com/hpc-sdk) with cuda 11.2\n- g++ 9.3\n\n## serial version (reference)\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |    0.064\n11                |    0.66\n12                |    8.48\n13                |  130.7\n14                | 1997.9\n\n## OpenMP pragmas\n\n### GNU compiler\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.044\n11                |   0.152\n12                |   1.64\n13                |  24.2\n14                | 368.3\n\n## OpenMP target for Nvidia\n\nWe need an OpenMP 4.5 capable compiler.\n\nJust follow the steps mentioned here:\n- [how-to-build-and-run-your-modern-parallel-code-in-c-17-and-openmp-4-5-library-on-nvidia-gpus](https://devmesh.intel.com/blog/724749/how-to-build-and-run-your-modern-parallel-code-in-c-17-and-openmp-4-5-library-on-nvidia-gpus)\n- [Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs](https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs)\n\nPerformance is quite slow: about a factor of 3 slower than PGI/OpenAcc or Kokkos::CUDA below when N \u003e= 12.\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.077\n11                |   0.104\n12                |   0.39\n13                |   3.56\n14                |  59.5\n\nI tried three compiler toolchains:\n- clang 10.0 with cuda 10.1\n- clang 11.0 with cuda 11.2\n- clang 12.0 with cuda 11.3\n\n## OpenAcc (PGI 
compiler)\n\n### OpenAcc for multicore CPU\n\nThese results are strange, much too slow (maybe related to my host configuration?). I also see a strong performance drop when using nvhpc 20.11 versus 20.5 (??).\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.030\n11                |   0.25\n12                |   7.21\n13                |  XX\n14                |  XX\n\n### OpenAcc for GPU\n\nPerformance looks good (similar to Kokkos::CUDA below).\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.107\n11                |   0.128\n12                |   0.25\n13                |   1.51\n14                |  20.5\n\n## Kokkos\n\nTo build, you just need to edit the first line of `kokkos/Makefile` and set the variable `KOKKOS_PATH` to the full path where you cloned the [kokkos](https://github.com/kokkos/kokkos/) sources.\n\n### Kokkos - OpenMP\n\nJust run `make` in subdir `kokkos`.\n\nTimings are similar to the OpenMP pragma version.\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.015\n11                |   0.152\n12                |   1.84\n13                |  24.5\n14                | 376.6\n\n### Kokkos - Cuda\n\nJust run `make KOKKOS_DEVICES=Cuda` in subdir `kokkos`.\n\nTimings are very similar to OpenAcc (GPU).\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.01\n11                |   0.01\n12                |   0.106\n13                |   1.41\n14                |  22.4\n\n## stdpar\n\n### stdpar for multicore cpu (nvc++ compiler)\n\nAgain, the performance obtained here is odd: it is much slower than OpenMP with g++, although it should be similar.\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.027\n11                |   0.128\n12                |   5.10\n13                | 114.0\n14                |  TODO\n\n### stdpar for GPU 
(nvc++ compiler)\n\nSimilar performance to OpenAcc and Kokkos/Cuda.\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.007\n11                |   0.016\n12                |   0.15\n13                |   1.40\n14                |  21.3\n\n## SYCL\n\nThere are many SYCL implementations available; here we tried:\n- Intel OneAPI DPC++ (linux) for x86 multi-core\n- Intel LLVM (linux) for Nvidia GPU (see https://github.com/intel/llvm)\n\n### SYCL OneAPI for x86 multicore\n\n`module load compiler/2021.1.1`\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   1.29\n11                |   1.40\n12                |   2.62\n13                |  21.3\n14                |  TODO\n\nFor large values of N, we recover the expected results (similar to OpenMP, Kokkos::OpenMP, ...).\n\n### SYCL OneAPI for Nvidia GPU (LLVM/intel)\n\n`module load llvm/12-intel-sycl-cuda`\n\nThese results could likely be improved by tuning the number of thread blocks and the block size (à la CUDA).\n\nPerformance is reasonable, though maybe a bit slow for N=14.\n\nNumber of cities  | time (seconds)\n----------------- | ---------------\n10                |   0.001\n11                |   0.008\n12                |   0.11\n13                |   1.86\n14                |  37.4\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpkestene%2Ftsp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpkestene%2Ftsp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpkestene%2Ftsp/lists"}