{"id":18620511,"url":"https://github.com/simonsobs/mbatch","last_synced_at":"2025-10-18T13:05:47.374Z","repository":{"id":55127346,"uuid":"326519699","full_name":"simonsobs/mbatch","owner":"simonsobs","description":"A parallelized pipeline script plumbing tool","archived":false,"fork":false,"pushed_at":"2025-02-28T04:09:25.000Z","size":101,"stargazers_count":4,"open_issues_count":11,"forks_count":1,"subscribers_count":39,"default_branch":"main","last_synced_at":"2025-03-25T08:11:37.399Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonsobs.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.rst","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-03T23:17:18.000Z","updated_at":"2025-02-28T04:09:29.000Z","dependencies_parsed_at":"2024-06-05T15:24:02.806Z","dependency_job_id":null,"html_url":"https://github.com/simonsobs/mbatch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsobs%2Fmbatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsobs%2Fmbatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsobs%2Fmbatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonsobs%2Fmbatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonsobs","download_url":"https://codeload.github.com/simonsobs/mbatch/tar.gz/refs/heads/mai
n","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248329551,"owners_count":21085557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T04:06:38.092Z","updated_at":"2025-10-18T13:05:47.293Z","avatar_url":"https://github.com/simonsobs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"======\nmbatch\n======\n\n``mbatch`` is a parallelized pipeline script plumbing tool. It aims to be\nsimple; it does *not* aim to be powerful, flexible or automagical e.g. like\n``parsl``. It is intended to be of specialized use for SLURM-based hybrid\nMPI+OpenMP pipelines and emphasizes versioning, reproducibility and controlled\ncaching.  ``mbatch`` aims to provide a quick way to stitch together existing\npipeline scripts without requiring significant code changes. A pipeline can be\nput together using a YAML file that stitches together various stages, where each\nstage has its own script that outputs products to disk. Unlike more generic\npipeline tools (e.g. ``ceci``, ``BBpipe``), dependencies between stages have to\nbe specified manually, and are only used to specify dependencies between SLURM\nsubmissions; however, this also means far less boilerplate code is needed to make\nyour pipeline compatible with this tool. ``mbatch`` also does checks\nof the git cleanliness of specified modules, and logs this to aid future\nreproducibility and automatically decide whether to re-use previously run stages.\n\n* Free software: BSD license\n* OS support: Unix-like (e.g. 
Linux, Mac OS X, but not Windows)\n* Requires Python \u003e= 3.6\n\nFeatures\n--------\n\n* Separates projects and stages within projects by automatically creating\n  directory structures\n* Detects cluster computing site, composes appropriate SLURM ``sbatch`` scripts, assigns\n  dependencies and submits them\n* Logs all information on a per-stage basis, including arguments used for the\n  run, git and package version information, SLURM output and job completion status\n* Based on the logged information, automatically decides whether to re-use\n  certain stages (and not submit them to the queue)\n* Shows a summary of what stages will be re-used and what will be submitted, and\n  prompts user to confirm before proceeding\n\n\nInstalling\n----------\n\nFirst, you should pip install ``mbatch``, either off PyPI (currently not implemented, use git clone below):\n\n.. code-block:: console\n\t\t\n   $ pip install mbatch --user\n\nor by git cloning and then doing a local install:\n\n.. code-block:: console\n\t\t\n   $ git clone git@github.com:simonsobs/mbatch.git\n   $ cd mbatch\n   $ python setup.py install --user\n\nYou will likely need to do a small amount of configuration, as described below.\n\nConfiguration\n-------------\n   \nNext, you should make sure that there are appropriate configurations\nfor the sites you frequently use. You can find out the location\nof the default site configuration files by running:\n\n.. code-block:: console\n\t\t\n   $ mbatch --show-site-path\n\nThis will typically show a location like\n\n```\n~/.local/lib/python3.8/site-packages/mbatch/data/sites/\n```\n\nYou can edit the site files here e.g. ``niagara.yml`` for the ``niagara`` Scinet\nsupercomputing site, though note that the default provided\nones may end up overwritten when ``mbatch`` is updated. 
To guard against that,\nyou can copy the contents of ``~/.local/lib/python3.8/site-packages/mbatch/data/sites/``\n(or whatever was the result of the above command) to ``~/.mbatch/``, which is the\nfirst location where ``mbatch`` looks for site files.\n\nYou can also create a custom SLURM template `~/.mbatch/mysite.yml`. For example, a site with 8 cores per node and 8 GB of memory per node would look like:\n``\n\ndefault_constraint: None\ndefault_part: None\ndefault_qos: None\ndefault_account: None\narchitecture:\n  None:      # constraint name\n    None:    # partition name\n      cores_per_node: 8\n      threads_per_core: 2\n      memory_per_node_gb: 8\n\n\ntemplate: |\n  #!/bin/bash!CONSTRAINT!QOS!PARTITION!ACCOUNT\n  #SBATCH --nodes=!NODES\n  #SBATCH --time=!WALL\n  #SBATCH --ntasks-per-node=!TASKSPERNODE\n  #SBATCH --ntasks=!TASKS\n  #SBATCH --cpus-per-task=!THREADS\n  #SBATCH --job-name=!JOBNAME\n  #SBATCH --output=!OUT_%j.txt\n  #SBATCH --mail-type=FAIL\n  #SBATCH --mail-user=\n\n  export DISABLE_MPI=false\n  export OMP_NUM_THREADS=!THREADS\n  export NUMEXPR_MAX_THREADS=!THREADS\n\n  mpirun !CMD\n``\nUse ``--site mysite`` to select this template.\nTo enable hyper-threading, change ``!THREADS`` to ``!HYPERTHREADS``.\n\n\nPipeline requirements\n---------------------\n\n``mbatch`` works best with an existing pipeline structure that can be\nbroken down into stages. Each stage has its own script and outputs its\nproducts to disk. A stage may depend on the outputs of other stages.\n\nWhen writing a new pipeline or modifying an existing one to work with\n``mbatch``, we recommend using the ``argparse`` Python module. Only a few things need to be kept in mind:\n\n* The pipeline stage scripts do *not* need to do any versioning or tagging of individual runs. This is done through\n  the ``mbatch`` project name specified for each submission.\n* Every pipeline stage script should accept an argument ``--output-dir``. 
The user will not have\n  to set this argument; it is managed by ``mbatch``.\n* The script should accept at most one positional argument: ``mbatch`` allows you\n  to loop over different values of this argument when submitting jobs. Any\n  number of optional arguments can be provided.\n* All of the stage output products should then be stored in the directory pointed to by ``args.output_dir``.\n* If the stage needs products as input from a different stage, e.g. one named ``stage1``, they should be obtained from\n  ``{args.output_dir}/../stage1/``.\n\nThat's it! Once your pipeline scripts have been set up this way, you will need to write a configuration\nfile that specifies things like what MPI scheme to use for each stage, what\nother stages it depends on, etc.\n\n\nExample\n-------\n\nLet's go over the simple example in the `example/` directory of mbatch's GitHub\nrepository. To try out the example yourself, you will have to clone the\nrepository as explained earlier.\n\nWe change into the example directory, where there is a set of Python scripts\nstage1.py, stage2.py, stage3.py, stage4.py that contain rudimentary example\npipeline stages that may or may not read some inputs and save output data to disk.\n\nFor this example, we will create a directory called `output` that will hold\nany output data. `mbatch` works by submitting a set of scripts using SLURM's\n`sbatch` and asking for outputs from these scripts to be organized into\nseparate stage directories for each script, which are all under the same \"project\"\ndirectory. The `output` directory we make here will be the root (parent) directory\nfor any projects we submit for this example.\n\n.. 
code-block:: bash\n\n\t\t$ cd example\n\t\t$ mkdir output\n\t\t$ ls\n\t\t\n\t\texample\n\t\t├── output/\n\t\t├── stage1.py\n\t\t├── stage2.py\n\t\t├── stage3.py\n\t\t├── stage4.py\n\t\t└── example.yml\n\n\nWe also see an example configuration file example.yml which will\nbe the input for `mbatch` that stitches together these stage scripts.\n\nLet's examine example.yml closely. The YAML file includes the following:\n\n.. code-block:: bash\n\n\t\troot_dir: output/\n\n\nThis indicates that the root directory for any projects run with this configuration\nfile will be `output/`.  A project with name \"foo\", for example, will then go into\nthe directory `output/foo/` and outputs of pipeline stages of this project will go\ninto sub-directories of `output/foo/`.\n\nNext up in `example.yml` we see\n\n.. code-block:: bash\n\n\t\tglobals:\n\t\t    lmax: 3000\n\t\t    lmin: 100\n\n\nThis defines two arguments that are global to all pipeline stages. These\narguments can then be referenced by any pipeline stage that we wish to make\nthem accessible to. More on this later.\n\nFurther down in `example.yml` we see\n\n.. code-block:: bash\n\n\t\tgitcheck_pkgs:\n\t\t    - numpy\n\t\t    - scipy\n\n\t\tgitcheck_paths:\n\t\t    - ./\n\t\t      \n\n`gitcheck_pkgs`: This directs `mbatch` to log the git status (commit hash, branch, etc.)\nand/or package version of the listed Python packages. Whether these packages\nhave changed will subsequently influence whether previously completed stages\nare re-used. `gitcheck_paths` is similar, but instead of specifying\na package, you specify a path to a directory that is under git version control.\nIn this example `./` will refer to the `mbatch` repository itself.\n\n\nFinally, in example.yml we see the definition of the pipeline stages, which are\ndescribed in the comments below:\n\n\n.. code-block:: bash\n\n\t\t# This structure will contain all the pipeline stage\n\t\t# definitions. 
The order in which the stages are listed\n\t\t# below does not matter, but the `depends` section in\n\t\t# each stage will influence the order in which they are\n\t\t# actually queued.\n\t\tstages:\n\n\t\t    # This first stage named `stage1` uses the python executable to run\n\t\t    # stage1.py (in the same directory). It passes no arguments (no globals\n\t\t    # either). And since it doesn't have a `parallel` section, it uses\n\t\t    # default options, including requesting only a walltime of 15 minutes.\n\t\t    # It does not depend on any other stages either, so it won't wait in\n\t\t    # the queue for others to finish.\n\t\t    stage1:\n\t\t        exec: python\n\t\t        script: stage1.py\n\t\t\n\t\t    # This stage named `stage2` also doesn't depend on others and thus won't\n\t\t    # wait, but it (a) does specify that we should pass the global variables\n\t\t    # as optional arguments to stage2.py. It also passes a few other options\n\t\t    # to the script. It does not pass any positional arguments.\n\t\t    # It also explicitly says to use 8 OpenMP threads and\n\t\t    # requests 15 minutes of walltime.\n            # Note: If hyper-threading (2 threads per core) is enabled in SLURM\n            # template, the generated sbatch script will have\n            # OMP_NUM_THREADS=16 and --cpus-per-task=16\n\t\t    stage2:\n\t\t        exec: python\n\t\t        script: stage2.py\n\t\t        globals:\n\t\t            - lmin\n\t\t            - lmax\n\t\t        options:\n\t\t            arg1: 0\n\t\t            arg2: 1\n\t\t            flag1: true\n\t\t        parallel:\n\t\t            threads: 8\n\t\t            walltime: 00:15:00\n\t\t    \n\t\t    \n\t\t    # This stage named `stage3` depends on stage1 and stage2, so it will\n\t\t    # only start after stage1 and stage2 have successfully completed with\n\t\t    # exit code zero. 
In addition to passing globals and the optional argument\n\t\t    # \"nsims\", it also passes one positional argument \"TTTT\" specified through\n\t\t    # the \"arg\" keyword.\n\t\t    # In the ``parallel`` section we request nproc=4 MPI processes. As an\n\t\t    # alternative to specifying the exact number of OpenMP threads, we provide\n\t\t    # an estimate for the maximum memory each process will use memory_gb and\n\t\t    # the minimum number of threads to use. Based on the memory available on\n\t\t    # a single node at the computing site and the number of cores per node,\n\t\t    # mbatch will use an even number of threads = max(min_threads,\n\t\t    # cores_per_node/memory_per node * memory_gb). \n\t\t    stage3:\n\t\t         exec: python\n\t\t         script: stage3.py\n\t\t         depends:\n\t\t             - stage1\n\t\t             - stage2\n\t\t\t globals:\n\t\t\t     - lmin\n\t\t\t     - lmax\n\t\t\t options:\n\t\t\t     nsims: 32\n\t\t\t     arg: TTTT\n\t\t\t parallel:\n\t\t\t     nproc: 4\n\t\t\t     memory_gb: 4\n\t\t\t     min_threads: 8\n\t\t\t     walltime: 00:15:00\n\n\t\t    # This stage named `stage3loop` is similar to `stage3` but\n\t\t    # it provides a list for `arg`. 
This will create N copies\n\t\t    # of this stage, each of which loop the positional argument\n\t\t    # over the N elements of the list specified by `arg`.\n\t\t    stage3loop:\n\t\t        exec: python\n\t\t        script: stage3.py\n\t\t        depends:\n\t\t            - stage1\n\t\t\t    - stage2\n\t\t\tglobals:\n\t\t\t    - lmin\n\t\t\t    - lmax\n\t\t\toptions:\n\t\t\t    nsims: 32\n\t\t\targ:\n\t\t\t    - TTTT\n\t\t\t    - TTEE\n\t\t\t    - TETE\n\t\t\tparallel:\n\t\t\t    nproc: 4\n\t\t\t    memory_gb: 4\n\t\t\t    min_threads: 8\n\t\t\t    walltime: 00:15:00\n\n\t\t    # Another stage that depends on a previous one\n\t\t    stage4:\n \t\t        exec: python\n\t\t\tscript: stage4.py\n\t\t\tdepends:\n\t\t\t    - stage3\n\t\t\t    - stage3loop\n\t\t\tparallel:\n\t\t\t    nproc: 1\n\t\t\t    threads: 8\n\t\t\t    walltime: 00:15:00\n\t\t\t\t\t      \n\t\t     # Another stage that depends on stage4, but uses\n\t\t     # the same script as did stage4.\n\t\t     stage5:\n\t\t         exec: python\n\t\t\t script: stage4.py\n\t\t     depends:\n\t\t\t     - stage4\n\t\t\t parallel:\n\t\t\t     nproc: 1\n\t\t\t     threads: 8\n\t\t\t     walltime: 00:15:00\n\t\t\t\t\t\t\n\nWe can run this pipeline configuration with `mbatch`. Here is how it looks when run locally (not on a remote system that has SLURM installed):\n\n.. code-block:: bash\n\n\t\t$ mbatch foo example.yml\n\t\tNo SLURM detected. We will be locally executing commands serially.\n\t\tWe are doing a dry run, so we will just print to screen.\n\t\tSUMMARY FOR SUBMISSION OF PROJECT foo\n\t\tstage1     [SUBMIT]\n\t\tstage2     [SUBMIT]\n\t\tstage3     [SUBMIT]\n\t\tstage3loop_TTTT    [SUBMIT]\n\t\tstage3loop_TTEE    [SUBMIT]\n\t\tstage3loop_TETE    [SUBMIT]\n\t\tstage4     [SUBMIT]\n\t\tstage5     [SUBMIT]\n\t\tProceed with this? (Y/n)\n\n\t\nwhich shows a summary of the stages that will be reused or submitted (in a first run where no products exist, all will be submitted). 
You will receive a prompt to confirm the submission.\n\nHere, `mbatch` has detected that all stages need to be run (because no previous outputs exist),\nand asks us to confirm the submission. After proceeding and the commands have completed\n(in serial execution, since we are trying this locally), the directory structure now looks like:\n\n\n.. code-block:: bash\n\n\t\t$ tree .\n\t\t.\n\t\t├── example.yml\n\t\t├── output\n\t\t│   └── foo\n\t\t│       ├── stage1\n\t\t│       │   ├── stage1_result.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       ├── stage2\n\t\t│       │   ├── stage2_result.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       ├── stage3\n\t\t│       │   ├── stage3_result_TTTT.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       ├── stage3loop_TETE\n\t\t│       │   ├── stage3_result_TETE.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       ├── stage3loop_TTEE\n\t\t│       │   ├── stage3_result_TTEE.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       ├── stage3loop_TTTT\n\t\t│       │   ├── stage3_result_TTTT.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       ├── stage4\n\t\t│       │   ├── stage4_result.txt\n\t\t│       │   └── stage_config.yml\n\t\t│       └── stage5\n\t\t│           ├── stage4_result.txt\n\t\t│           └── stage_config.yml\n\t\t├── stage1.py\n\t\t├── stage2.py\n\t\t├── stage3.py\n\t\t└── stage4.py\n\n\nFor more information on running mbatch, use\n\n.. code-block:: bash\n\n\tmbatch -h\n\n\nWrapper for OpenMP+MPI jobs\n---------------------------\n\n``mbatch`` now includes a wrapper ``wmpi`` for hybrid OpenMP+MPI jobs that are not part\nof a pipeline. Here's how to use it:\n\n.. 
code-block:: bash\n\n\t\t\n\tusage: wmpi [-h] [-d DEPENDENCIES] [-o OUTPUT_DIR] [-t THREADS] [-s SITE]\n\t\t    [-n NAME] [-A ACCOUNT] [-q QOS] [-p PARTITION] [-c CONSTRAINT]\n\t\t    [-w WALLTIME] [--dry-run]\n\t\t    N Command\n\n\tSubmit hybrid OpenMP+MPI jobs\n\n\tpositional arguments:\n\t  N                     Number of MPI jobs\n\t  Command               Command\n\n\toptional arguments:\n\t  -h, --help            show this help message and exit\n\t  -d DEPENDENCIES, --dependencies DEPENDENCIES\n\t\t\t\tComma separated list of dependency JOBIDs\n\t  -o OUTPUT_DIR, --output-dir OUTPUT_DIR\n\t\t\t\tOutput directory\n\t  -t THREADS, --threads THREADS\n\t\t\t\tNumber of threads\n\t  -s SITE, --site SITE  Site name (optional; will auto-detect if not provided)\n\t  -n NAME, --name NAME  Job name\n\t  -A ACCOUNT, --account ACCOUNT\n\t\t\t\tAccount name\n\t  -q QOS, --qos QOS     QOS name\n\t  -p PARTITION, --partition PARTITION\n\t\t\t\tPartition name\n\t  -c CONSTRAINT, --constraint CONSTRAINT\n\t\t\t\tConstraint name\n\t  -w WALLTIME, --walltime WALLTIME\n\t\t\t\tWalltime\n\t  --dry-run             Only show submissions.\n\n`mbatch` will then pick a template for an `sbatch` configuration file by detecting what cluster computer you are using (only NERSC, niagara and Perimeter's Symmetry are currently supported), populate this template and submit it using `sbatch`. The idea behind this wrapper is that you won't have to think too much about which cluster you are on (beyond the core counts).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonsobs%2Fmbatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonsobs%2Fmbatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonsobs%2Fmbatch/lists"}