{"id":39618762,"url":"https://github.com/dcfidalgo/slurm_sweeps","last_synced_at":"2026-01-18T08:22:27.776Z","repository":{"id":178956878,"uuid":"659638472","full_name":"dcfidalgo/slurm_sweeps","owner":"dcfidalgo","description":"A simple tool to perform parameter sweeps on SLURM clusters.","archived":false,"fork":false,"pushed_at":"2024-12-09T19:27:46.000Z","size":336,"stargazers_count":3,"open_issues_count":5,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-04T02:04:10.279Z","etag":null,"topics":["asha","high-performance-computing","hpc","hpo","hyperparameter-optimization","slurm","slurm-cluster","slurm-workload-manager","sweeps"],"latest_commit_sha":null,"homepage":"https://gitlab.mpcdf.mpg.de/dcfidalgo/slurm_sweeps","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcfidalgo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-06-28T08:47:49.000Z","updated_at":"2024-12-09T01:01:03.000Z","dependencies_parsed_at":"2023-12-21T18:25:01.800Z","dependency_job_id":"c6875201-aaca-4141-a6d5-151063790c1a","html_url":"https://github.com/dcfidalgo/slurm_sweeps","commit_stats":null,"previous_names":["dcfidalgo/slurm_sweeps"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/dcfidalgo/slurm_sweeps","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcfidalgo%2Fslurm_sweeps","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcfidalgo%2Fslurm_sweeps/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcfidalgo%2Fslurm_sweeps/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcfidalgo%2Fslurm_sweeps/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcfidalgo","download_url":"https://codeload.github.com/dcfidalgo/slurm_sweeps/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcfidalgo%2Fslurm_sweeps/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28534143,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T00:39:45.795Z","status":"online","status_checked_at":"2026-01-18T02:00:07.578Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asha","high-performance-computing","hpc","hpo","hyperparameter-optimization","slurm","slurm-cluster","slurm-workload-manager","sweeps"],"created_at":"2026-01-18T08:22:27.183Z","updated_at":"2026-01-18T08:22:27.751Z","avatar_url":"https://github.com/dcfidalgo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n  \u003ca href=\"\"\u003e\u003cimg src=\"slurm_sweeps.png\" alt=\"slurm sweeps logo\" width=\"210\"\u003e\u003c/a\u003e\n  \u003cbr\u003e\n  slurm sweeps\n\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\u003cb\u003eA simple tool to perform parameter sweeps on SLURM clusters.\u003c/b\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca 
href=\"https://github.com/dcfidalgo/slurm_sweeps/blob/main/LICENSE\"\u003e\n    \u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/dcfidalgo/slurm_sweeps.svg?color=blue\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://app.codecov.io/gh/dcfidalgo/slurm_sweeps\"\u003e\n    \u003cimg alt=\"Codecov\" src=\"https://img.shields.io/codecov/c/gh/dcfidalgo/slurm_sweeps\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\nThe main motivation was to provide a lightweight [ASHA implementation](https://arxiv.org/abs/1810.05934) for\n[SLURM clusters](https://slurm.schedmd.com/overview.html) that is fully compatible with\n[pytorch-lightning's ddp](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel).\n\nIt is heavily inspired by tools like [Ray Tune](https://www.ray.io/ray-tune) and [Optuna](https://optuna.org/).\nHowever, on a SLURM cluster, these tools can be complicated to set up and introduce considerable overhead.\n\n*Slurm sweeps* is simple, lightweight, and has few dependencies.\nIt uses SLURM Job Steps to run the individual trials.\n\n## Installation\n\n```commandline\npip install slurm-sweeps\n```\n\n### Dependencies\n- cloudpickle\n- numpy\n- pandas\n- pyyaml\n\n## Usage\nYou can just run this example on your laptop.\nBy default, the maximum number of parallel trials equals the number of CPUs on your machine.\n\n```python\n\"\"\" Content of test_ss.py \"\"\"\nfrom time import sleep\nimport slurm_sweeps as ss\n\n\n# Define your train function\ndef train(cfg: dict):\n    for epoch in range(cfg[\"epochs\"]):\n        sleep(0.5)\n        loss = (cfg[\"parameter\"] - 1) ** 2 / (epoch + 1)\n        # log your metrics\n        ss.log({\"loss\": loss}, epoch)\n\n\n# Define your experiment\nexperiment = ss.Experiment(\n    train=train,\n    cfg={\n        \"epochs\": 10,\n        \"parameter\": ss.Uniform(0, 2),\n    },\n    asha=ss.ASHA(metric=\"loss\", mode=\"min\"),\n)\n\n\n# Run your experiment\nresult = 
experiment.run(n_trials=1000)\n\n# Show the best performing trial\nprint(result.best_trial())\n```\n\nOr submit it to a SLURM cluster.\nWrite a small SLURM script `test_ss.slurm` that runs the code above:\n```bash\n#!/bin/bash -l\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=18\n#SBATCH --cpus-per-task=4\n#SBATCH --mem-per-cpu=1GB\n\npython test_ss.py\n```\n\nBy default, this will run `$SLURM_NTASKS` trials in parallel.\nIn the case above: `2 nodes * 18 tasks = 36 trials`\n\nThen submit it to the queue:\n```commandline\nsbatch test_ss.slurm\n```\n\nSee the `tests` folder for an advanced example of training a PyTorch model with Lightning's DDP.\n\n## API Documentation\n\n### `CLASS slurm_sweeps.Experiment`\n\n```python\nclass Experiment(\n    train: Callable,\n    cfg: Dict,\n    name: str = \"MySweep\",\n    local_dir: Union[str, Path] = \"./slurm-sweeps\",\n    asha: Optional[ASHA] = None,\n    slurm_cfg: Optional[SlurmCfg] = None,\n    restore: bool = False,\n    overwrite: bool = False,\n)\n```\n\nSet up an HPO experiment.\n\n**Arguments**:\n\n- `train` - A train function that takes as input the `cfg` dict.\n- `cfg` - A dict passed on to the `train` function.\n  It must contain the search spaces via `slurm_sweeps.Uniform`, `slurm_sweeps.Choice`, etc.\n- `name` - The name of the experiment.\n- `local_dir` - Where to store and run the experiments. 
In this directory,\n  we will create the database `slurm_sweeps.db` and a folder with the experiment name.\n- `slurm_cfg` - The configuration of the Slurm backend responsible for running the trials.\n  We automatically choose this backend when slurm sweeps is used within an sbatch script.\n- `asha` - An optional ASHA instance to cancel less promising trials.\n- `restore` - Restore an experiment with the same name?\n- `overwrite` - Overwrite an existing experiment with the same name?\n\n#### `Experiment.name`\n\n```python\n@property\ndef name() -\u003e str\n```\n\nThe name of the experiment.\n\n#### `Experiment.local_dir`\n\n```python\n@property\ndef local_dir() -\u003e Path\n```\n\nThe local directory of the experiment.\n\n#### `Experiment.run`\n\n```python\ndef run(\n    n_trials: int = 1,\n    max_concurrent_trials: Optional[int] = None,\n    summary_interval_in_sec: float = 5.0,\n    nr_of_rows_in_summary: int = 10,\n    summarize_cfg_and_metrics: Union[bool, List[str]] = True\n) -\u003e pd.DataFrame\n```\n\nRun the experiment.\n\n**Arguments**:\n\n- `n_trials` - Number of trials to run. For grid searches, this parameter is ignored.\n- `max_concurrent_trials` - The maximum number of trials running concurrently. 
By default, we will set this to\n  the number of cpus available, or the number of total Slurm tasks divided by the number of tasks\n  requested per trial.\n- `summary_interval_in_sec` - Print a summary of the experiment every x seconds.\n- `nr_of_rows_in_summary` - How many rows of the summary table should we print?\n- `summarize_cfg_and_metrics` - Should we include the cfg and the metrics in the summary table?\n  You can also pass in a list of strings to only select a few cfg and metric keys.\n\n**Returns**:\n\n  A summary of the trials in a pandas DataFrame.\n\n### `CLASS slurm_sweeps.ASHA`\n\n```python\nclass ASHA(\n    metric: str,\n    mode: str,\n    reduction_factor: int = 4,\n    min_t: int = 1,\n    max_t: int = 50,\n)\n```\n\nBasic implementation of the Asynchronous Successive Halving Algorithm (ASHA) to prune unpromising trials.\n\n**Arguments**:\n\n- `metric` - The metric you want to optimize.\n- `mode` - Should the metric be minimized or maximized? Allowed values: [\"min\", \"max\"]\n- `reduction_factor` - The reduction factor of the algorithm\n- `min_t` - Minimum number of iterations before we consider pruning.\n- `max_t` - Maximum number of iterations.\n\n#### `ASHA.metric`\n\n```python\n@property\ndef metric() -\u003e str\n```\n\nThe metric to optimize.\n\n#### `ASHA.mode`\n\n```python\n@property\ndef mode() -\u003e str\n```\n\nThe 'mode' of the metric, either 'max' or 'min'.\n\n#### `ASHA.find_trials_to_prune`\n\n```python\ndef find_trials_to_prune(database: \"pd.DataFrame\") -\u003e List[str]\n```\n\nCheck the database and find trials to prune.\n\n**Arguments**:\n\n- `database` - The experiment's metrics table of the database as a pandas DataFrame.\n\n\n**Returns**:\n\n  List of trial ids that should be pruned.\n\n### CLASS `slurm_sweeps.SlurmCfg`\n\n```python\n@dataclass\nclass SlurmCfg:\n  exclusive: bool = True\n  nodes: int = 1\n  ntasks: int = 1\n  args: str = \"\"\n```\n\nA configuration class for the SlurmBackend.\n\n**Arguments**:\n\n- 
`exclusive` - Add the `--exclusive` switch.\n- `nodes` - How many nodes do you request for your srun?\n- `ntasks` - How many tasks do you request for your srun?\n- `args` - Additional command line arguments for srun, formatted as a string.\n\n### CLASS `slurm_sweeps.Result`\n\n```python\nclass Result(\n    experiment: str,\n    local_dir: Union[str, Path] = \"./slurm-sweeps\",\n)\n```\n\nThe result of an experiment.\n\n**Arguments**:\n\n- `experiment` - The name of the experiment.\n- `local_dir` - The directory where we find the `slurm-sweeps.db` database.\n\n#### `Result.experiment`\n\n```python\n@property\ndef experiment() -\u003e str\n```\n\nThe name of the experiment.\n\n#### `Result.trials`\n\n```python\n@property\ndef trials() -\u003e List[Trial]\n```\n\nA list of the trials of the experiment.\n\n#### `Result.best_trial`\n\n```python\ndef best_trial(\n    metric: Optional[str] = None,\n    mode: Optional[str] = None\n) -\u003e Trial\n```\n\nGet the best performing trial of the experiment.\n\n**Arguments**:\n\n- `metric` - The metric. By default, we take the one defined by ASHA.\n- `mode` - The mode of the metric, either 'min' or 'max'. By default, we take the one defined by ASHA.\n\n**Returns**:\n\n  The best trial.\n\n### CLASS `slurm_sweeps.trial.Trial`\n\n```python\n@dataclass\nclass Trial:\n    cfg: Dict\n    process: Optional[subprocess.Popen] = None\n    start_time: Optional[datetime] = None\n    end_time: Optional[datetime] = None\n    status: Optional[Union[str, Status]] = None\n    metrics: Optional[Dict[str, Dict[int, Union[int, float]]]] = None\n```\n\nA trial of an experiment.\n\n**Arguments**:\n\n- `cfg` - The config of the trial.\n- `process` - The subprocess that runs the trial.\n- `start_time` - The start time of the trial.\n- `end_time` - The end time of the trial.\n- `status` - Status of the trial. 
If `process` is not None, we will always query the process for the status.\n- `metrics` - Logged metrics of the trial.\n\n#### `Trial.trial_id`\n\n```python\n@property\ndef trial_id() -\u003e str\n```\n\nThe trial ID is a 6-digit hash from the config.\n\n#### `Trial.runtime`\n\n```python\n@property\ndef runtime() -\u003e Optional[timedelta]\n```\n\nThe runtime of the trial.\n\n#### `Trial.is_terminated`\n\n```python\ndef is_terminated() -\u003e bool\n```\n\nReturn True if the trial has completed or been pruned.\n\n### FUNCTION `slurm_sweeps.log`\n\n```python\ndef log(metrics: Dict[str, Union[float, int]], iteration: int)\n```\n\nLog metrics to the database.\n\nIf ASHA is configured, this also checks whether the trial needs to be pruned.\n\n**Arguments**:\n\n- `metrics` - A dictionary containing the metrics.\n- `iteration` - Iteration of the metrics. Most of the time this will be the epoch.\n\n**Raises**:\n\n- `TrialPruned` if the holy ASHA says so!\n- `TypeError` if a metric is not of type `float` or `int`.\n\n## Contact\nDavid Carreto Fidalgo (david.carreto.fidalgo@mpcdf.mpg.de)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcfidalgo%2Fslurm_sweeps","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcfidalgo%2Fslurm_sweeps","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcfidalgo%2Fslurm_sweeps/lists"}