{"id":26288848,"url":"https://github.com/clustercockpit/cc-slurm-adapter","last_synced_at":"2026-02-17T11:12:32.142Z","repository":{"id":282248389,"uuid":"939281556","full_name":"ClusterCockpit/cc-slurm-adapter","owner":"ClusterCockpit","description":"A Slurm job scheduler adapter to be used with ClusterCockpit","archived":false,"fork":false,"pushed_at":"2025-03-13T14:35:12.000Z","size":218,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-13T15:29:54.074Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ClusterCockpit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-26T09:42:58.000Z","updated_at":"2025-03-13T14:35:17.000Z","dependencies_parsed_at":"2025-03-13T15:42:06.291Z","dependency_job_id":null,"html_url":"https://github.com/ClusterCockpit/cc-slurm-adapter","commit_stats":null,"previous_names":["clustercockpit/cc-slurm-adapter"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClusterCockpit%2Fcc-slurm-adapter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClusterCockpit%2Fcc-slurm-adapter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClusterCockpit%2Fcc-slurm-adapter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClusterCockpit%2Fcc-slurm-adapter/manifests","owner_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/owners/ClusterCockpit","download_url":"https://codeload.github.com/ClusterCockpit/cc-slurm-adapter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243652685,"owners_count":20325611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-14T22:15:04.490Z","updated_at":"2026-02-17T11:12:32.136Z","avatar_url":"https://github.com/ClusterCockpit.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cc-slurm-adapter\n\n## Overview\n\ncc-slurm-adapter is a software daemon, which feeds [cc-backend](https://github.com/ClusterCockpit/cc-backend) with job information from [Slurm](https://slurm.schedmd.com/) in realtime.\nIt is designed to be reasonably fault tolerant and verbose (if enabled) about possible undesired conditions.\nFor example, should cc-backend go down or Slurm be unavailable, cc-slurm-adapter should handle everything as expected.\nThat means that no jobs are lost and they are submitted to cc-backend as soon as everything is running again.\nThis also includes cc-slurm-adapter itself.\nSo if cc-slurm-adapter should encounter an internal error (e.g. 
incompatibility due to new Slurm version), no jobs should be lost.\nHowever, there are exceptions.\nPlease consult the Limitations section for more details.\n\nThis daemon runs on the same node that [slurmctld](https://slurm.schedmd.com/slurmctld.html) runs on.\nData is obtained via Slurm's commands `sacct`, `squeue`, `sacctmgr`, and `scontrol`.\n`slurmrestd` is not used and thus not required.\nHowever, the usage of Slurm's slurmdbd is mandatory.\n\nWhen the daemon runs, a periodic timer (default 1 minute) triggers a synchronization between the state of Slurm and cc-backend.\nData is then submitted to cc-backend via REST API.\nIf the job was successfully started or stopped, a notification can optionally be sent over [NATS](https://nats.io/) to notify any kind of client about changes.\n\nIn addition to the regular periodic timer, cc-slurm-adapter can be set up to immediately trigger upon a [Slurm Prolog/Epilog](https://slurm.schedmd.com/prolog_epilog.html).\nThis allows the job to be submitted more or less immediately to cc-backend and significantly reduces the delay.\nHowever, the usage of slurmctld's prolog/epilog hook is optional and is only useful to reduce update delay.\n\n## Limitations\n\nBecause slurmdbd does not appear to store all information about a job, jobs submitted to cc-backend may lack specific information in certain cases.\nThis includes all information that is not obtainable via `sacct`.\nFor example, cc-slurm-adapter obtains resource allocation information via `scontrol --cluster XYZ show job XYZ --json`.\nHowever, it appears that this resource information becomes unavailable a few minutes after the job has stopped (regardless of success or failure).\nShould the resource allocation information be of importance, one must make sure the daemon is not stopped for too long.\nMost notably, should resource information be unavailable, cc-backend cannot associate the job with metrics anymore.\nThe jobs can still be listed in cc-backend, but 
showing CPU, GPU, and memory metrics won't work anymore.\n\n## Slurm Version Compatibility\n\n### Versions known to work\n\nThese versions are known to work:\n\n- 24.xx.x\n- 25.xx.x\n\n### Slurm Related Code\n\nWe try to keep all Slurm related code in `slurm.go`.\nThis most notably refers to the JSON structure, which is returned by the various Slurm commands.\nThe most likely error that you will encounter with a Slurm incompatibility is a nil pointer dereference.\nWhile this currently may happen outside of `slurm.go`, the line should more or less uniquely identify which field is missing.\n\n### Slurm JSON\n\nAll Slurm JSON structs are replicated using Go structs and members are declared as pointer types.\nThat way we actually know if a value was missing during JSON parsing.\nShould you encounter a nil dereference, use the following commands to check against Slurm's current JSON layout:\n\n- Get any job ID via either `squeue` or `sacct`.\n- Unfortunately `scontrol` returns a different JSON layout than `sacct`. 
Accordingly, we need both JSON layouts returned by the following two commands:\n  - `sacct -j 12345 --json`, where 12345 is the job's ID.\n  - `scontrol show job 12345 --json`, where 12345 is the job's ID.\n\n### SlurmInt and SlurmString in JSON\n\nAt some point Slurm started transitioning from plain integers to a struct, which contains whether the integer is infinite or actually set at all.\nIt appears that with every version change, more values in the JSON change from plain integers to this integer struct.\nWhen you encounter such a change, use the custom type `SlurmInt`, which can be parsed both from a plain integer and from the integer struct.\nThat way we can keep compatibility with old versions.\n\nA similar situation exists with strings.\nFor some reason, Slurm started replacing some strings in the API with arrays, which may contain an arbitrary number of strings (possibly none at all).\nBecause this makes handling the API rather nasty, the `SlurmString` type can be used.\nThis can be parsed both as a normal string and as such an array.\nShould the array be empty, it is interpreted as a blank string.\nShould the array have at least one value, the string is set to that particular value.\nWhile there may be the case of multiple values in an array, we have not encountered this under normal circumstances.\nHowever, it is possible this may change in newer Slurm versions.\n\n## Command Line Usage\n\nOptions/Argument | Description\n--- | ---\n`-config \u003cpath\u003e`        | Specify the path to the config file\n`-daemon`               | Run the cc-slurm-adapter in daemon mode\n`-debug \u003clog-level\u003e`    | Set the log level (default is 2)\n`-help`                 | Show help for all command line flags\n\nIf `-daemon` is not supplied, cc-slurm-adapter runs in Prolog/Epilog mode.\nThis only works when running from a Slurm Prolog/Epilog context.\n\n## Configuration\n\n### Example\n\nHere is an example of the configuration file.\nMost values are optional, see Reference to see 
which ones you really need.\n\n```json\n{\n    \"pidFilePath\": \"/run/cc-slurm-adapter/daemon.pid\",\n    \"prepSockListenPath\": \"/run/cc-slurm-adapter/daemon.sock\",\n    \"prepSockConnectPath\": \"/run/cc-slurm-adapter/daemon.sock\",\n    \"lastRunPath\": \"/var/lib/cc-slurm-adapter/last_run\",\n    \"slurmPollInterval\": 60,\n    \"slurmQueryDelay\": 1,\n    \"slurmQueryMaxSpan\": 604800,\n    \"slurmQueryMaxRetries\": 5,\n    \"ccPollInterval\" : 21600,\n    \"ccRestSubmitJobs\" : true,\n    \"ccRestUrl\": \"https://my-cc-backend-instance.example\",\n    \"ccRestJwt\": \"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\",\n    \"gpuPciAddrs\": {\n        \"^nodehostname0[0-9]$\": [\n            \"00000000:00:10.0\",\n            \"00000000:00:3F.0\"\n        ],\n        \"^nodehostname1[0-9]$\": [\n            \"00000000:00:10.0\",\n            \"00000000:00:3F.0\"\n        ]\n    },\n    \"ignoreHosts\": \"^nodehostname9\\\\w+$\",\n    \"natsServer\": \"mynatsserver.example\",\n    \"natsPort\": 4222,\n    \"natsSubject\": \"mysubject\",\n    \"natsUser\": \"myuser\",\n    \"natsPassword\": \"123456789\",\n    \"natsCredsFile\": \"/etc/cc-slurm-adapter/nats.creds\",\n    \"natsNKeySeedFile\": \"/etc/cc-slurm-adapter/nats.nkey\"\n}\n```\n\n### Reference\n\nFor example values of each config key, check the example above.\nThe non-optional values above show the default values.\n\nConfig Key | Optional | Description\n--- | --- | ---\n`pidFilePath`       | yes | Path to the PID file. cc-slurm-adapter manages its own PID file in order to avoid concurrent execution.\n`ipcSockPath`       | N/A | This option has been removed and is replaced by `prepSockListenPath` and `prepSockConnectPath`.\n`prepSockListenPath` | yes | Path to the PrEp socket. This socket is needed for cc-slurm-adapter running in daemon mode to receive prolog/epilog events. By default a UNIX socket is used, but it is possible to use a TCP socket for multi-slurmctld setups. 
Instead of using a plain path, you may use formats like the following: `tcp:127.0.0.1:12345`, `tcp:0.0.0.0:12345`, `tcp:[::1]:12345`, `tcp:[::]:12345`, `tcp::12345` (v4 + v6)\n`prepSockConnectPath` | yes | Path to the PrEp socket. This socket is needed for cc-slurm-adapter running in prolog/epilog mode. Same format as `prepSockListenPath`.\n`lastRunPath`       | yes | Path to the file which contains the time stamp of cc-slurm-adapter's last successful sync to cc-backend. Time is stored as a file timestamp, not in the file itself.\n`slurmPollInterval` | yes | Interval (seconds) in which a sync to cc-backend occurs, assuming no prolog/epilog event occurs.\n`slurmQueryDelay`   | yes | Time (seconds) to wait between a prolog/epilog event and the actual synchronization. This is just for good measure to give Slurm some time to react. There should usually be no need to change this.\n`slurmQueryMaxSpan` | yes | Maximum time (seconds) cc-slurm-adapter is allowed to synchronize jobs from the past. This is to avoid accidental flooding with millions of jobs from e.g. multiple years.\n`slurmQueryMaxRetries` | yes | Maximum number of attempts to query Slurm upon a Prolog/Epilog event. If Slurm is reacting slowly or isn't available at all, this limits the \"fast\" attempts to query Slurm about that job. Even if it should time out, a later synchronization should still catch this job, although the latency from Slurm to cc-backend is increased. There should usually be no need to change this.\n`ccPollInterval`    | yes | Interval (seconds) in which all jobs are queried from cc-backend. This is used to prevent stuck jobs; during normal operation it does not need to run often.\n`ccRestSubmitJobs`  | yes | Submit started/stopped jobs to cc-backend. You can set this to `false` to suppress submission of start/stop jobs via REST. This only makes sense if you have NATS enabled and cc-backend registers start/stop jobs via NATS.\n`ccRestUrl`         | no  | The URL to cc-backend's REST API. 
Must not contain a trailing slash.\n`ccRestJwt`         | no  | The JWT obtained from cc-backend, which allows access to the REST API.\n`gpuPciAddrs`       | yes | Dictionary of regexes mapping to a list of PCI addresses. If some of your nodes have GPUs, use this to map the hostnames via regex to a list of GPUs those nodes have. They have to be ordered like NVML shows them, which should hopefully be the same as `nvidia-smi` shows them, if all devices are visible.\n`ignoreHosts`       | yes | Regex of hostnames to ignore. If all hosts that are associated with a job match this regex, the job is discarded and not reported to cc-backend.\n`natsServer`        | yes | Hostname of the NATS server. Leave blank or omit to disable NATS.\n`natsPort`          | yes | Port of the NATS server.\n`natsSubject`       | yes | Subject to which job information is published.\n`natsUser`          | yes | If your NATS server requires [user auth](https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/username_password), specify NATS user.\n`natsPassword`      | yes | Password to be used with the NATS user.\n`natsCredsFile`     | yes | If your NATS server requires a [credentials file](https://docs.nats.io/using-nats/developer/connecting/creds), use this to set the file path.\n`natsNKeySeedFile`  | yes | If your NATS server requires plain [NKey auth](https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/nkey_auth), use this to specify the path to the file, which contains the NKey seed (private key).\n\n## Admin Guide\n\n### How to compile\n\n```\n$ make\n```\n\n### Daemon (required)\n\n#### Copy binary and configuration\n\nThis should be self-explanatory.\nYou can copy the binary and the configuration anywhere you like.\nSince the configuration file contains the JWT for cc-backend and possibly NATS credentials, make sure to use appropriate file permissions.\n\n#### Installing the service\n\nSee this example systemd service 
file:\n\n```ini\n[Unit]\nDescription=cc-slurm-adapter\n\nWants=network.target\nAfter=network.target\n\n[Service]\nUser=cc-slurm-adapter\nGroup=slurm\nExecStart=/opt/cc-slurm-adapter/cc-slurm-adapter -daemon -config /opt/cc-slurm-adapter/config.json\nWorkingDirectory=/opt/cc-slurm-adapter/\nRuntimeDirectory=cc-slurm-adapter\nRuntimeDirectoryMode=0750\nRestart=on-failure\nRestartSec=15s\n\n[Install]\nWantedBy=multi-user.target\n```\n\nThis service file runs the cc-slurm-adapter daemon as the user `cc-slurm-adapter`.\nA runtime directory /run/cc-slurm-adapter is created for the PID file and PrEp socket.\nThe group is set to `slurm` so that the RuntimeDirectoryMode=0750 will allow the group `slurm` to enter this directory.\nAccess from the `slurm` group is only necessary, if the slurmctld Prolog/Epilog hook is used.\nThat is because the Prolog/Epilog is executed as the slurm user/group and otherwise cc-slurm-adapter (in Prolog/Epilog mode) is unable to open the Unix socket in the runtime directory.\n\n#### Setting cc-slurm-adapter user Slurm permissions\n\nDepending on the Slurm configuration, an arbitrary user may not be allowed to access information from Slurm via `sacct` or `scontrol`.\nSince it is recommended to run cc-slurm-adapter as its own user, this user (e.g. 
`cc-slurm-adapter`) needs to be given permission.\nYou can do this like the following:\n\n```\n$ sacctmgr add user cc-slurm-adapter Account=root AdminLevel=operator\n```\n\nIf your Slurm instance is restricted and the permissions are not given, NO JOBS WILL BE REPORTED!\n\n#### Debugging the daemon\n\nOkay, so you have set up the daemon, but no jobs are running or it is crashing?\nIn that case you should first check the log.\ncc-slurm-adapter doesn't have its own log file.\nInstead it prints all errors and warnings to stderr.\n\nIf the log doesn't show anything useful, you can increase the log level from the default of 2 to 5 via `-debug 5`.\nWhile it may spam the console for many jobs, the debug messages may give insight into what exactly is going wrong.\nThat said, cc-slurm-adapter attempts to print anything of significance at the default log level.\n\n### slurmctld Prolog/Epilog Hook (optional)\n\nIf you want to make cc-slurm-adapter more responsive to new jobs, you can enable the slurmctld hook.\nIn your `slurm.conf`, add the following lines:\n\n```ini\nPrEpPlugins=prep/script\nPrologSlurmctld=/some_path/hook.sh\nEpilogSlurmctld=/some_path/hook.sh\n```\n\nWhere `hook.sh` is an executable shell script which may look like this:\n\n```bash\n#!/bin/sh\n\n/opt/cc-slurm-adapter/cc-slurm-adapter\n\nexit 0\n```\n\nGenerally speaking, it is not necessary to specify the config path during the Prolog/Epilog invocation.\nHowever, that is only the case if the default PrEp socket path `/run/cc-slurm-adapter/daemon.sock` is used.\nIf you want to change that, you have to add `-config /some_other_path/config.json` to your `hook.sh` script.\nIn that case the config also has to be readable by the user or group `slurm`.\n\nIt is important to exit the script with 0.\nThat is because the exit code of a Slurm Prolog/Epilog determines whether a job allocation should succeed or not.\nWhile this allocation failure can be useful for ensuring correct operation, it should not be used in 
production.\nIf for example the cc-slurm-adapter is restarted or stopped for some time, the Prolog/Epilog calls will fail and will immediately deny all job allocations.\nThis is most certainly undesirable in a production instance.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclustercockpit%2Fcc-slurm-adapter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclustercockpit%2Fcc-slurm-adapter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclustercockpit%2Fcc-slurm-adapter/lists"}