{"id":22051613,"url":"https://github.com/parlaynu/learn-slurm","last_synced_at":"2026-04-18T17:36:24.409Z","repository":{"id":182425649,"uuid":"639147910","full_name":"parlaynu/learn-slurm","owner":"parlaynu","description":"Build a simple slurm cluster in AWS. Automated build and setup - three commands and you have a working system.","archived":false,"fork":false,"pushed_at":"2023-07-19T22:44:44.000Z","size":106,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-23T15:34:14.421Z","etag":null,"topics":["ansible","aws","munge","nfs","slurm-wlm","terraform","ubuntu2204lts"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/parlaynu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-05-10T21:33:55.000Z","updated_at":"2024-01-29T10:27:54.000Z","dependencies_parsed_at":"2023-07-20T00:00:16.200Z","dependency_job_id":null,"html_url":"https://github.com/parlaynu/learn-slurm","commit_stats":null,"previous_names":["parlaynu/learn-slurm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/parlaynu/learn-slurm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parlaynu%2Flearn-slurm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parlaynu%2Flearn-slurm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parlaynu%2Flearn-slurm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parlaynu%2Flearn-slurm/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/parlaynu","download_url":"https://codeload.github.com/parlaynu/learn-slurm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/parlaynu%2Flearn-slurm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31978397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T17:30:12.329Z","status":"ssl_error","status_checked_at":"2026-04-18T17:29:59.069Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ansible","aws","munge","nfs","slurm-wlm","terraform","ubuntu2204lts"],"created_at":"2024-11-30T15:09:56.598Z","updated_at":"2026-04-18T17:36:24.372Z","avatar_url":"https://github.com/parlaynu.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Learning Slurm\n\nBuilds a [slurm cluster](https://slurm.schedmd.com/overview.html) in AWS for learning.\n\nThe built system looks like the following diagram. Each server is running Ubuntu 22.04 LTS.\n\n![System Architecture Diagram](docs/network.png)\n\nThis configuration is very basic at the moment - definitely just for learning, not production. 
\n\nIt's a work in progress with more to come at some point:\n\n* more sophisticated cluster configuration\n* users with home directories on the nfs server\n* adding support for slurmrestd\n* some tools to use the REST API\n\nAnd I'm sure there are a lot of things I don't know about yet that will get added as well.\n\n\n## Prerequisites\n\nYou need an AWS account for this to work. If you can get the AWS CLI working, then you will\nhave everything in place that you need to get this to run.\n\nTerraform is used to build the AWS infrastructure and also create the ansible scripts.\n\nAnsible is used to configure the servers.\n\n\n## Quickstart\n\nThe build and configuration are fully automated with the terraform and ansible files in the `build-aws`\ndirectory.\n\nFirst, within the `build-aws` directory, copy the `terraform.tfvars.example` file to `terraform.tfvars` and \nupdate the variables to:\n\n* set the aws profile and region to use\n* set a custom internal domain\n* set the number of workers to create\n\nBuild the infrastructure and ansible configurations with terraform:\n\n    terraform init\n    terraform apply\n\nThis creates a directory called `local` and in there you can see the ansible configs as well as an\nssh configuration file so you can log into all the machines that have been created. For example,\nto log into the slurm controller:\n\n    ssh -F local/ssh.cfg slurmctl\n\nBefore moving on to the next step, wait for the servers to be fully up and running. 
If you can log\ninto them and see that their hostname is different from the default AWS naming, then everything is\nready.\n\nConfigure the machines:\n\n    ./local/ansible/run-ansible.sh\n\nAt this point you should be able to do some basic testing as in the next section.\n\nTo shut it all down, run:\n\n    terraform destroy\n\n\n## Testing\n\nYou should be able to follow along \nwith this [tutorial video](https://youtu.be/U42qlYkzP9k) that's referenced from the SchedMD website.\nThe instructions below are basically the same as in the video.\n\nLog into the controller:\n\n    ssh -F local/ssh.cfg slurmctl\n\nGet some information about the cluster:\n\n    sinfo\n\n    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST\n    studio*      up   infinite      2   idle slurm-[00-01]\n\nTake a look at the details for one of the nodes:\n\n    scontrol show node slurm-00\n\n    NodeName=slurm-00 Arch=x86_64 CoresPerSocket=1 \n       CPUAlloc=0 CPUTot=1 CPULoad=0.06\n       AvailableFeatures=(null)\n       ActiveFeatures=(null)\n       Gres=(null)\n       NodeAddr=slurm-00 NodeHostName=slurm-00 Version=21.08.5\n       OS=Linux 5.19.0-1024-aws #25~22.04.1-Ubuntu SMP Tue Apr 18 23:41:58 UTC 2023 \n       RealMemory=1 AllocMem=0 FreeMem=201 Sockets=1 Boards=1\n       State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A\n       Partitions=studio \n       BootTime=2023-05-10T21:33:59 SlurmdStartTime=2023-05-10T22:11:46\n       LastBusyTime=2023-05-10T22:16:35\n       CfgTRES=cpu=1,mem=1M,billing=1\n       AllocTRES=\n       CapWatts=n/a\n       CurrentWatts=0 AveWatts=0\n       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s\n\nCreate a simple test script called hello-slurm.sh:\n\n    #!/usr/bin/env bash\n    echo hello $HOSTNAME\n    sleep 10\n    exit $?\n\nSubmit the job:\n\n    sbatch -N1 -n1 hello-slurm.sh\n\n    Submitted batch job 1\n\nCheck the status using the job id returned:\n\n    scontrol show job 1\n\n    JobId=1 JobName=hello-slurm.sh\n    
   UserId=ubuntu(1000) GroupId=ubuntu(1000) MCS_label=N/A\n       Priority=4294901759 Nice=0 Account=(null) QOS=normal\n       JobState=RUNNING Reason=None Dependency=(null)\n       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0\n       RunTime=00:00:06 TimeLimit=UNLIMITED TimeMin=N/A\n       SubmitTime=2023-05-10T22:16:25 EligibleTime=2023-05-10T22:16:25\n       AccrueTime=2023-05-10T22:16:25\n       StartTime=2023-05-10T22:16:25 EndTime=Unknown Deadline=N/A\n       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-10T22:16:25 Scheduler=Main\n       Partition=studio AllocNode:Sid=slurmctl:4722\n       ReqNodeList=(null) ExcNodeList=(null)\n       NodeList=slurm-00\n       BatchHost=slurm-00\n       NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*\n       TRES=cpu=1,node=1,billing=1\n       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*\n       MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0\n       Features=(null) DelayBoot=00:00:00\n       OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)\n       Command=/home/ubuntu/hello-slurm.sh\n       WorkDir=/home/ubuntu\n       StdErr=/home/ubuntu/slurm-1.out\n       StdIn=/dev/null\n       StdOut=/home/ubuntu/slurm-1.out\n       Power=\n\nSubmit a bunch of jobs in held state so they don't disappear too quickly:\n\n    sbatch -H -N1 -n1 hello-slurm.sh\n    sbatch -H -N1 -n1 hello-slurm.sh\n    sbatch -H -N1 -n1 hello-slurm.sh\n\nCheck the status of them all:\n\n    squeue\n\n    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n        4    studio hello-sl   ubuntu PD       0:00      1 (JobHeldUser)\n        3    studio hello-sl   ubuntu PD       0:00      1 (JobHeldUser)\n        2    studio hello-sl   ubuntu PD       0:00      1 (JobHeldUser)\n\nAnd cancel them:\n\n    scancel 2\n    squeue\n\n    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n        4    studio hello-sl   ubuntu PD       0:00      1 (JobHeldUser)\n        3    
studio hello-sl   ubuntu PD       0:00      1 (JobHeldUser)\n\n    scancel 3 4\n    squeue\n\n    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparlaynu%2Flearn-slurm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparlaynu%2Flearn-slurm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparlaynu%2Flearn-slurm/lists"}