{"id":23117935,"url":"https://github.com/bourbonut/lbm-gpu","last_synced_at":"2025-05-06T23:45:58.573Z","repository":{"id":127045691,"uuid":"487332488","full_name":"bourbonut/lbm-gpu","owner":"bourbonut","description":"The Lattice Boltzmann Method on GPU","archived":false,"fork":false,"pushed_at":"2025-04-13T13:09:33.000Z","size":8780,"stargazers_count":8,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T14:22:25.190Z","etag":null,"topics":["cuda","cupy","gpu","hpc","numba","nvidia","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-2.1","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bourbonut.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-30T16:55:25.000Z","updated_at":"2025-04-13T13:09:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"68172059-1965-48dc-be93-deae533a6c99","html_url":"https://github.com/bourbonut/lbm-gpu","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bourbonut%2Flbm-gpu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bourbonut%2Flbm-gpu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bourbonut%2Flbm-gpu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bourbonut%2Flbm-gpu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bourbonut","download_url":"https://codeload.github
.com/bourbonut/lbm-gpu/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252788404,"owners_count":21804280,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cupy","gpu","hpc","numba","nvidia","python"],"created_at":"2024-12-17T04:32:01.019Z","updated_at":"2025-05-06T23:45:58.553Z","avatar_url":"https://github.com/bourbonut.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Lattice Boltzmann Method on GPU\n\n![demo](./docs/demo.gif)\n\nThis example is fluid flow from left to right over a cylinder in top view.\n\n## Goal\n\nThe goal of this project is to make parallel the Lattice Boltzmann Method on GPU through `numba` and `cupy`.\n\n## Installation\n\nYou need to know your cuda version to install `cupy` correctly.\nMy version is `11.6` (you can see it in `setup.py`).\nCreate a virtual environment and install the requirements :\n```\npip install -r requirements.txt\n```\n\n## Usage\n\n### Parameters\n\nIn `utils/parameters.py`, you can change parameters of the simulation :\n```python\nmaxIter = 8 * 15 * 5 * 10  # Total number of time iterations.\n# 8 * 15 for frames per second (= 120)\n# 5 for seconds\n# 10 because every 10 steps, the program saves the state\nRe = 150.0  # Reynolds number.\nnx, ny = 1024, 22 * 32  # Number of lattice nodes.\n# 1024 because my GPU can use 1024 threads per block maximum\n# 22 for the number of Streaming Multiprocessors\n# 32 is a multiple of 2 (could be 64, 128, ...)\nly = ny - 1  # Height of the domain in lattice units.\ncx, cy, r = 
nx // 4, ny // 2, ny // 9  # Coordinates of the cylinder.\nuLB = 0.04  # Velocity in lattice units.\nnulb = uLB * r / Re\n# Viscosity in lattice units.\nomega = 1 / (3 * nulb + 0.5)\n```\n\n### Run programs\n\n```sh\n# numba\npython numba_lbmFlowAroundCylinder.py\n# cupy\npython kcupy_lbmFlowAroundCylinder.py\n# cupy without kernels (only functions already implemented)\n# it is less optimized\npython cupy_lbmFlowAroundCylinder.py\n# original method (sequential with numpy)\npython lbmFlowAroundCylinder.py\n```\n\n### Tests\n\nTo generate references for tests, you can save them in `pickle` files by running :\n```\npython alltests.py -p\n```\nThey will be saved in `tests/picklefiles`.\n\nNow, you can check that everything works :\n```sh\n# numba tests\npython alltests.py\n# cupy tests\npython alltests.py -c\n```\n\n## Profiling\n\n![nsight](./docs/nsight-analysis.png)\n\nYou can profile the programs to study the performance of the kernels.\nFirst, reduce the number of iterations in the parameters (`utils/parameters.py`):\n```python\nmaxIter = 3  # Total number of time iterations.\n```\nThen comment out the `cv2` steps in `numba_lbmFlowAroundCylinder.py` or `kcupy_lbmFlowAroundCylinder.py`, depending on whether you want to improve performance with `numba` or `cupy` :\nBefore commenting :\n\n```python\n# ...\nimport cv2\n# ...\nframeSize = (INTNX, INTNY)\npath_video = \"output_video.avi\"\nbin_loader = cv2.VideoWriter_fourcc(*\"DIVX\")\nout = cv2.VideoWriter(path_video, bin_loader, 120, frameSize)\n\ndef main():\n  # ...\n  for time in range(maxIter + 1):\n    # ...\n    if time % 10 == 0 and time != 0:\n          print(round(100 * time / maxIter, 3), \"%\")\n          u = d_u.get()\n          arr = np.sqrt(u[0] ** 2 + u[1] ** 2).transpose()\n          new_arr = ((arr / arr.max()) * 255).astype(\"uint8\")\n          img_colorized = cv2.applyColorMap(new_arr, cmapy.cmap(\"plasma\"))\n          out.write(img_colorized)\n\n    out.release()\n```\n\nAfter commenting :\n\n```python\n# ...\n# 
import cv2\n# ...\n# frameSize = (INTNX, INTNY)\n# path_video = \"output_video.avi\"\n# bin_loader = cv2.VideoWriter_fourcc(*\"DIVX\")\n# out = cv2.VideoWriter(path_video, bin_loader, 120, frameSize)\n\ndef main():\n  # ...\n  for time in range(maxIter + 1):\n    # ...\n    # if time % 10 == 0 and time != 0:\n    #       print(round(100 * time / maxIter, 3), \"%\")\n    #       u = d_u.get()\n    #       arr = np.sqrt(u[0] ** 2 + u[1] ** 2).transpose()\n    #       new_arr = ((arr / arr.max()) * 255).astype(\"uint8\")\n    #       img_colorized = cv2.applyColorMap(new_arr, cmapy.cmap(\"plasma\"))\n    #       out.write(img_colorized)\n    #\n    # out.release()\n```\nThen you can run :\n```sh\nsh ncu-profiler.sh numba_lbmFlowAroundCylinder.py # or kcupy_lbmFlowAroundCylinder.py\n```\nThis produces a report file containing all the collected data (`profile.ncu-rep`).\nThen, you can open it with the UI from Nvidia :\n```sh\nsh nsight-profiler-ui.sh profile.ncu-rep # or without an argument to only open the application\n```\n\n## Some results for kernels\n\nThe GPU used to obtain the following results is the [NVIDIA A100 TENSOR CORE GPU](https://www.nvidia.com/en-us/data-center/a100/), where :\n- The number of Streaming Multiprocessors is `108`.\n- The number of nodes is `nx, ny = 2048, 216 * 32`\n\nNote : the current parameters in the scripts are chosen for the [NVIDIA GEFORCE GTX 1660 Super](https://www.nvidia.com/en-us/geforce/news/nvidia-geforce-gtx-1660-super-1650-super/). 
Then the number of Streaming Multiprocessors is `22`.\n\n### Numba\n\n|   Kernel name  | Execution Duration | Compute Throughput | Memory Throughput | L1 Cache Throughput | L2 Cache Throughput |\n| :------------: | :----------------: | :----------------: | :---------------: | :-----------------: | :-----------------: |\n|   macroscopic  |       7.67 ms      |       6.79 %       |      90.08 %      |       90.52 %       |       51.71 %       |\n|   equilibrium  |       2.34 ms      |       19.05 %      |      88.02 %      |       88.49 %       |       58.85 %       |\n| streaming_step |       2.18 ms      |       57.37 %      |      85.31 %      |       85.75 %       |       83.36 %       |\n|    collision   |       4.56 ms      |       6.29 %       |      70.96 %      |       71.52 %       |       64.97 %       |\n|   bounce_back  |      345.09 µs     |       16.49 %      |       71.9 %      |        72.6 %       |       65.08 %       |\n|     inflow     |      17.95 µs      |       2.03 %       |       3.47 %      |        0.96 %       |        4.09 %       |\n|   update_fin   |      10.37 µs      |       0.73 %       |       5.26 %      |        2.19 %       |        7.31 %       |\n|     outflow    |       9.31 µs      |       0.72 %       |       3.12 %      |        2.13 %       |        4.58 %       |\n\n### Cupy\n\n|   Kernel name  | Execution Duration | Compute Throughput | Memory Throughput | L1 Cache Throughput | L2 Cache Throughput |\n| :------------: | :----------------: | :----------------: | :---------------: | :-----------------: | :-----------------: |\n|   macroscopic  |       8.05 ms      |       6.79 %       |      91.09 %      |       91.77 %       |       51.15 %       |\n| streaming_step |       2.14 ms      |       25.37 %      |       86.8 %      |       87.24 %       |        82.5 %       |\n|   equilibrium  |       2.33 ms      |       19.11 %      |      88.36 %      |       88.97 %       |       59.31 %       |\n|    collision   |       
4.74 ms      |        4.9 %       |      67.45 %      |        67.7 %       |       62.43 %       |\n|   bounce_back  |      340.45 µs     |       12.2 %       |      72.74 %      |       73.92 %       |       65.38 %       |\n|   update_fin   |      10.56 µs      |       0.65 %       |       6.21 %      |        2.28 %       |        7.87 %       |\n|     inflow     |      10.82 µs      |       0.82 %       |       5.15 %      |        1.98 %       |        7.88 %       |\n|     outflow    |       9.7 µs       |       0.52 %       |       3.44 %      |        2.26 %       |        4.86 %       |\n\n\n## Report\n\nThe [report](./docs/lattice-bolzmann-method-gpu.pdf) summarizes :\n1. Objective of the project\n2. More details on algorithms implemented\n3. Results of Numba and Cupy experiments\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"./docs/lattice-bolzmann-method-gpu.pdf\"\u003e\n        \u003cimg width=300px src=\"./docs/cover.png\"/\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbourbonut%2Flbm-gpu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbourbonut%2Flbm-gpu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbourbonut%2Flbm-gpu/lists"}