{"id":17646985,"url":"https://github.com/bencardoen/singularity_slurm_cuda","last_synced_at":"2025-10-15T20:34:18.289Z","repository":{"id":77438556,"uuid":"451212168","full_name":"bencardoen/singularity_slurm_cuda","owner":"bencardoen","description":"Example on how to get started with Singularity and CUDA on a SLURM cluster","archived":false,"fork":false,"pushed_at":"2023-05-18T11:49:34.000Z","size":219,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T07:13:10.217Z","etag":null,"topics":["cuda","nvidia","singularity-container","slurm-cluster","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bencardoen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-23T19:42:14.000Z","updated_at":"2025-03-20T04:50:52.000Z","dependencies_parsed_at":"2023-06-12T08:45:35.292Z","dependency_job_id":null,"html_url":"https://github.com/bencardoen/singularity_slurm_cuda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bencardoen/singularity_slurm_cuda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bencardoen%2Fsingularity_slurm_cuda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bencardoen%2Fsingularity_slurm_cuda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bencardoen%2Fsingularity_slurm_cuda/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/bencardoen%2Fsingularity_slurm_cuda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bencardoen","download_url":"https://codeload.github.com/bencardoen/singularity_slurm_cuda/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bencardoen%2Fsingularity_slurm_cuda/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269654009,"owners_count":24454317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-09T02:00:10.424Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","nvidia","singularity-container","slurm-cluster","tensorflow"],"created_at":"2024-10-23T11:09:24.388Z","updated_at":"2025-10-15T20:34:13.254Z","avatar_url":"https://github.com/bencardoen.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A quick example on how to get up and running with singularity on a cluster with CUDA\n\n**Note: if you copy paste these examples, at a minimum verify you know what they do. 
These are listed only as examples, without any warranty; you should know if and how they apply to your use case and cluster.**\n\nSee [slides.md](slides.md) for a slide deck, and the [pdf](bencardoen_20220913.pdf) version made with HackMD/Reveal.js.\n\n## Required\n- HPC cluster account\n  - You know your account/group info\n  - You've configured SSH key access\n- Basic Linux CLI interaction\n\nYou do not need Singularity on your own machine, though for more advanced use cases you will probably want it.\n\nIf you do not have Linux to work with Singularity on your home machine, try a VM using VirtualBox or similar software, or WSL2.\n\n## Walkthrough\n### Log in to the cluster\n```bash\nssh you@cluster.country\n```\n### Get the image\nWe'll use a TensorFlow image from NVIDIA.\nWe'll assume for now there's a temporary directory on a fast local disk at $SLURM_TMPDIR. This may not be the case, so please adjust to your setting.\nIf you don't set the SINGULARITY_TMPDIR and SINGULARITY_CACHEDIR variables shown below, Singularity will write to $HOME, which you never want.\n\n```bash\nmodule load singularity\nif [[ \"$SLURM_TMPDIR\" ]]; then export STMP=$SLURM_TMPDIR; else export STMP=\"/scratch/$USER\"; fi\n```\nThis ensures that on a compute node you use its fast local storage, and otherwise fall back to scratch space.\n```bash\nmkdir -p $STMP/singularity/{cache,tmp}\nexport SINGULARITY_TMPDIR=\"$STMP/singularity/tmp\"\nexport SINGULARITY_CACHEDIR=\"$STMP/singularity/cache\"\ncd $SINGULARITY_TMPDIR\n```\nNow pull (i.e. download) the image. This is a Docker image, so Singularity will convert it on the fly.\n```bash\nsingularity pull tensorflow-19.11-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3\n```\nPulling the image can take around 20 minutes, depending on network, disk, etc
.\n\n#### Pull is too slow ...\nIn that case, run the pull command locally, and copy the resulting image to the cluster.\n\n### Store the image where compute nodes can access it\nFor example:\n```bash\ncp tensorflow-19.11-tf1-py3.sif /scratch/$USER\n# or\ncp tensorflow-19.11-tf1-py3.sif /project/$USER\n```\n**Cluster filesystems usually specialize in one of two use cases: fast and temporary, or slow and permanent. Your cluster documentation will tell you which is which.**\n\n### Get an interactive node\n```bash\nsalloc --time=3:0:0 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --account=\u003cYOURGROUP\u003e --gres=gpu:1\n```\nAfter getting the node:\n```bash\n## Make sure environment is clean\nmodule purge\n\nmodule load singularity\nmodule load cuda\n\nif [[ \"$SLURM_TMPDIR\" ]]; then export STMP=$SLURM_TMPDIR; else export STMP=\"/scratch/$USER\"; fi\nmkdir -p $STMP/singularity/{cache,tmp}\nexport SINGULARITY_TMPDIR=\"$STMP/singularity/tmp\"\nexport SINGULARITY_CACHEDIR=\"$STMP/singularity/cache\"\ncd $SINGULARITY_TMPDIR\n\ncp /scratch/$USER/tensorflow-19.11-tf1-py3.sif .  # Change if needed\n\nsingularity shell --nv tensorflow-19.11-tf1-py3.sif\n```\nNow you can execute code inside the container:\n```\nSingularity\u003e python\n\u003e\u003e\u003e import tensorflow as tf\n\u003e\u003e\u003e tf.test.is_gpu_available()\n```\nThis should print a lot of information about the CUDA version, GPU type, etc., and evaluate to `True`.\n\n## SBATCH mode\nSee `singularitysbatch.sh` for an example. 
Make sure you modify the account, email, and image location entries.\n```bash\nsbatch singularitysbatch.sh\n```\n\n### Notes\n#### Creating your own images\nYou can create your own images in 2 x 2 ways:\n- local vs remote build\n- definition file vs stateful editing\n##### Local v remote\nFor most non-trivial images you will need sudo rights on the machine where you build the image.\nIf you do not have that on your current machine, fear not, you have these options:\n\n- Sylabs.io [Remote Builder](https://cloud.sylabs.io/builder)\n- [Azure](https://azure.microsoft.com/en-us/free/students/)\n- [AWS](https://aws.amazon.com/education/awseducate/)\n- Run a VM in [VirtualBox](https://www.virtualbox.org/)\n- On Windows, use WSL2, a VM, ...\n- Integrate with a pipeline using automated testing, e.g. [CircleCI](https://circleci.com/)\n\nWhen in doubt, go with the first option: all you need is your definition file, and the builder will even check its syntax, which you won't get if you build it yourself.\n\nBuilding an image shouldn't take longer than ~30 minutes, well within the free tier of cloud providers.\n\n##### Definition v stateful\nA definition file is a pristine, human-readable recipe: someone who wants to know what the image contains or how it is built only needs to read that file.\nSometimes you may need to 'edit' the image, that is, convert the image to a writable directory, open a shell, modify it, and rebuild.\nIn 99.99% of all cases, however, a definition file is the way to go.\nEditing an image is an option if you want to figure out how to improve it in a way that isn't working through the definition file: you figure out interactively which commands are needed, then rebuild the image. If it works, add your commands to the definition file.\nThe Singularity docs detail precisely how to achieve either case.\n\n### Recipe\nCreate this file, e.g. 
`recipe.def`\n```toml\nBootstrap: docker\nFrom: nvcr.io/nvidia/pytorch:21.12-py3\n\n%post\n    echo \"Hi\"\n    # Add post install instructions you need to customize\n\n%labels\n    Version v0.0.1\n\n%help\n    This is a demo container used to illustrate a def file.\n```\nThen build it:\n```bash\nsingularity build myimage.sif recipe.def\n```\n\n#### Accessing data\n```bash\nsingularity shell --nv -B \u003csomedir\u003e:\u003cmountpoint\u003e tensorflow-19.11-tf1-py3.sif\n```\nNow `\u003csomedir\u003e` will appear inside the container as `\u003cmountpoint\u003e`.\n\n\n## Extra resources\n[Compute Canada Wiki on Singularity](https://docs.computecanada.ca/wiki/Singularity)\n\n[Singularity documentation](https://sylabs.io/docs)\n\n[Sylabs cloud builder](https://cloud.sylabs.io/library)\n\n### But I want PyTorch\n```bash\nsingularity pull image.sif docker://nvcr.io/nvidia/pytorch:21.12-py3\n```\nMore tags at [NVidia NVCR](https://catalog.ngc.nvidia.com/containers)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbencardoen%2Fsingularity_slurm_cuda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbencardoen%2Fsingularity_slurm_cuda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbencardoen%2Fsingularity_slurm_cuda/lists"}