{"id":21721425,"url":"https://github.com/informaticsmatters/nextflow-pcluster","last_synced_at":"2025-09-11T02:32:23.053Z","repository":{"id":80605157,"uuid":"305705035","full_name":"InformaticsMatters/nextflow-pcluster","owner":"InformaticsMatters","description":"Nextflow AWS Parallel Cluster Configuration","archived":false,"fork":false,"pushed_at":"2023-04-04T08:02:14.000Z","size":14793,"stargazers_count":4,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"pcluster-v3","last_synced_at":"2025-04-12T21:36:32.144Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InformaticsMatters.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-10-20T12:45:31.000Z","updated_at":"2024-03-25T13:38:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"36b5d296-a931-4933-b16e-1ed93e38aab1","html_url":"https://github.com/InformaticsMatters/nextflow-pcluster","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/InformaticsMatters/nextflow-pcluster","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fnextflow-pcluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fnextflow-pcluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fnextflow-pcluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/InformaticsMatters%2Fnextflow-pcluster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InformaticsMatters","download_url":"https://codeload.github.com/InformaticsMatters/nextflow-pcluster/tar.gz/refs/heads/pcluster-v3","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InformaticsMatters%2Fnextflow-pcluster/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272801248,"owners_count":24995247,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-26T02:16:44.342Z","updated_at":"2025-08-30T04:12:27.404Z","avatar_url":"https://github.com/InformaticsMatters.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Nextflow AWS ParallelCluster Configuration\nMaterial for the formation and use of a v3 ParallelCluster (Slurm-based) compute\nenvironment.\n\nYou'll need: -\n\n-   Python\n-   [jq]\n-   An AWS user with an [AdministratorAccess] managed policy\n\n## Overview\nThese materials create a compute cluster on AWS running a [Slurm] workload\nmanager and set up [Nextflow] to execute workflows.\n\nThe cluster is created using [AWS Parallel Cluster], a tool from AWS that\nautomates the creation of a number of types of cluster on AWS.\n\nStandard 
usage of these materials results in creating:\n\n-   A single master node in the public subnet\n-   An autoscaling group of worker nodes in the private subnet\n-   A shared EFS volume mounted at `/efs` on all master and worker nodes\n-   A node post installation script that\n    -   Installs [Singularity] on all nodes when they are started\n    -   Installs and configures Nextflow on the master node\n\nThe installation process is highly configurable and can create variations\nof the standard usage. Parallel Cluster creates a config file\nwhere these configuration changes can be made.\n\n\u003e   Consult the Parallel Cluster docs for full details.\n\nAfter you've satisfied the instructions in the **Getting Started** section\nbelow you typically: -\n\n1.  Configure a cluster\n2.  Create a cluster\n3.  Connect (SSH) to the cluster head node and run your Nextflow workflow\n4.  Repeat step 3 until done\n5.  Delete the cluster to avoid AWS charges \n\n## Getting started\nStart from a suitable virtual environment\n(ideally Python 3.8 host or better): -\n\n    $ python -m venv venv\n \n    $ source venv/bin/activate\n    (venv) $ pip install --upgrade pip\n    (venv) $ pip install -r requirements.txt --upgrade\n    \n    $ aws --version\n    aws-cli/2.9.1 Python/3.11.0 Darwin/21.6.0 source/x86_64 prompt/off\n\n    $ jq --version\n    jq-1.5\n\n### EC2 key-pair\nIf you have an existing SSH keypair on the AWS account you can skip this step.\n\nIf you do not have a pre-existing keypair, as an AWS user with\n*AdministratorAccess*, set the user credentials and default region environment\nvariables for your intended cluster: -\n\n    $ export AWS_ACCESS_KEY_ID=????\n    $ export AWS_SECRET_ACCESS_KEY=??????\n    $ export AWS_DEFAULT_REGION=eu-central-1\n\n...and create a keypair on the account, which can be easily done\nwith the `aws` CLI and `jq` to conveniently extract and write the\nprivate key block: -\n\n    $ KEYPAIR_NAME=nextflow-pcluster\n    $ aws ec2 create-key-pair 
--key-name ${KEYPAIR_NAME} \\\n        | jq -r .KeyMaterial \u003e ~/.ssh/${KEYPAIR_NAME} \u0026\u0026 \\\n        chmod 0600 ~/.ssh/${KEYPAIR_NAME} \n\n### IAM Role and Policies\nUsing the AWS console (or CLI) create a **Role** for use with the cluster.\nThis will typically be an **EC2** role. Later we'll be attaching policies to\nthis role. For now you do not need to add any additional policies.\nJust continue to **Create role** and give it a name (like `nextflow-pcluster`).\n\n...and set some convenient variables that we'll use later,\nnamely the created role name and your AWS account ID: -\n\n    $ CLUSTER_ROLE_NAME=nextflow-pcluster\n    $ CLUSTER_ACCOUNT_ID=000000000000\n\nWe now create **Policies** in AWS and then attach them to the role.\n\n\u003e   The [ParallelCluster policies] for v3 are numerous and complex\n    but we've extracted what we found to be essential and placed them\n    in the project `iam` directory.\n\n\u003e   Copies of the policies exist in this repository along with a shell-script\n    to rapidly adapt them for the user and cluster you're going to create.\n\n\u003e   The `EVERYTHING-policy` is a combination of all the other policy files,\n    which might be useful if you reach an IAM policy limit.\n\nGiven a region (like `eu-central-1`), user account ID, cluster name and a role name\nyou can render the repository's copy of the reference policy files \nusing the following command: -\n\n    $ ./render-policies.sh \\\n        ${AWS_DEFAULT_REGION} \\\n        ${CLUSTER_ACCOUNT_ID}\n\nNow install each of the policies using the AWS CLI. 
The policy names\nyou choose must be unique for your account: -\n    \n    $ aws iam create-policy \\\n        --policy-name NextflowClusterInstancePolicy \\\n        --policy-document file://v3-instance-policy.json\n    \n    $ aws iam create-policy \\\n        --policy-name NextflowClusterUserPolicy \\\n        --policy-document file://v3-user-policy.json\n\n    $ aws iam create-policy \\\n        --policy-name NextflowClusterOperatorPolicy \\\n        --policy-document file://v3-operator-policy.json\n\nNow, again using the AWS CLI, attach the policies to your chosen AWS role: -\n\n    $ aws iam attach-role-policy \\\n        --policy-arn arn:aws:iam::${CLUSTER_ACCOUNT_ID}:policy/NextflowClusterInstancePolicy \\\n        --role-name ${CLUSTER_ROLE_NAME}\n        \n    $ aws iam attach-role-policy \\\n        --policy-arn arn:aws:iam::${CLUSTER_ACCOUNT_ID}:policy/NextflowClusterUserPolicy \\\n        --role-name ${CLUSTER_ROLE_NAME}\n        \n    $ aws iam attach-role-policy \\\n        --policy-arn arn:aws:iam::${CLUSTER_ACCOUNT_ID}:policy/NextflowClusterOperatorPolicy \\\n        --role-name ${CLUSTER_ROLE_NAME}\n\n### Upload installation scripts\nPart of cluster formation permits the execution of installation scripts\nthat are pulled from AWS S3 as cluster compute instances are created. 
Example\n_post-installation_ scripts that prepare directories, Singularity and\na default configuration file for Nextflow can be found in this repository's\n`installation-scripts` directory.\n\nNote that you might want to further customise the file that gets created at\n`/home/centos/.nextflow/config`.\n\nUse one of these scripts unless you have one of your own.\n\n\u003e   At the time of writing there are post-installation scripts for amazon\n    (Amazon Linux 2) and centos (CentOS 7).\n\n\u003e   For this example we're going to create a cluster based on the\n    **Amazon Linux 2** machine image.\n\nCreate an S3 bucket and upload the post-installation script for your\nchosen image to it (the bucket's called `nextflow-pcluster` in this example).\nHere we ensure that the file's `acl` (Access Control List)\npermits `public-read`: -\n\n    $ CLUSTER_BUCKET=nextflow-pcluster\n    $ CLUSTER_OS=amazon\n    $ aws s3 cp installation-scripts/${CLUSTER_OS}-post-install.sh \\\n        s3://${CLUSTER_BUCKET}/${CLUSTER_OS}-post-install.sh \\\n        --acl public-read\n\n## Creating a cluster configuration user\nFrom here we will be running the `pcluster` command-line utility\nto configure and manage the actual cluster. 
All we've done so far is\nprepare the ground for the formation of the cluster.\n\nIf you have an AWS IAM User with *AdministratorAccess* and you are happy\nto use that user then there's nothing more to do except move on to the next\nsection - **Creating a cluster configuration**.\n\n\u003e   You will still need a user with *AdministratorAccess* in this step.\n\nBut, if you do not want to use a user with *AdministratorAccess* to\ncreate the cluster then you need to create a new user and attach suitable\npolicies.\n\nFirstly, in the AWS console, create a new user with **Programmatic access**.\nName it something like `nextflow-pcluster` (or select an existing user).\n \n\u003e   There is no need to add any policies to the user but you must record\n    the newly assigned **Access key ID** and **Secret access key** before\n    closing the final window. If you forget you can always create another\n    access key later.\n\nNow, attach the previously rendered **NextflowClusterUserPolicy** policy\nto our user: -\n\n    $ CLUSTER_USER_NAME=nextflow-pcluster\n\n    $ aws iam attach-user-policy \\\n        --policy-arn arn:aws:iam::${CLUSTER_ACCOUNT_ID}:policy/NextflowClusterUserPolicy \\\n        --user-name ${CLUSTER_USER_NAME}\n\nThe user's credentials rather than an admin user's credentials\ncan now be used in the next step to configure the cluster.\n\n## Creating a cluster configuration\nWith the preparation work done we're all set to configure and create a cluster.\n\nWe use the `pcluster configure` command's interactive wizard to define our\ncluster.\n\n\u003e   Here we're using a pre-created EFS filesystem\n    (see Amazon's [Creating EFS] documentation)\n    rather than relying on ParallelCluster to do this for us. 
By doing this\n    we can preserve workflow data between cluster instantiations.\n\nHere's a typical configuration file we end up with (with redacted data).\nRather than use `pcluster configure` you can simply craft your own file.\nIn the following we are using a pre-assigned EFS, one created using the AWS\nconsole: -\n\n```yaml\nRegion: eu-central-1\nImage:\n  Os: alinux2\nTags:\n  - Key: Dept\n    Value: 'XYZ'\nSharedStorage:\n- Name: cluster-one\n  StorageType: Efs\n  MountDir: efs\n  EfsSettings:\n    FileSystemId: fs-00000000000000\nHeadNode:\n  InstanceType: t3a.large\n  Networking:\n    SubnetId: subnet-00000000000000000\n    ElasticIp: false\n  Ssh:\n    KeyName: im-pc3\n  CustomActions:\n    OnNodeConfigured:\n      Script: https://im-aws-parallel-cluster.s3.amazonaws.com/amazon-post-install.sh\nScheduling:\n  Scheduler: slurm\n  SlurmSettings:\n    ScaledownIdletime: 15\n  SlurmQueues:\n    - Name: compute\n      CapacityType: SPOT\n      ComputeResources:\n        - Name: cluster-one\n          InstanceType: c6a.4xlarge\n          MinCount: 1\n          MaxCount: 25\n          Efa:\n            Enabled: false\n      CustomActions:\n        OnNodeConfigured:\n          Script: https://im-aws-parallel-cluster.s3.amazonaws.com/amazon-post-install.sh\n      Networking:\n        SubnetIds:\n        - subnet-00000000000000000\n```\n\n## Create the cluster\nWith configuration edited you can create the cluster: - \n\n    $ CLUSTER_NAME=cluster-one\n    $ pcluster create-cluster -c ./config.yaml -n ${CLUSTER_NAME}\n\nAnd list clusters with: -\n\n    $ pcluster list-clusters\n\n\u003e   Allow 10 to 15 minutes for cluster formation to finish\n\n## Mounting a pre-configured EFS on the bastion\nAssuming you've created a suitable EFS (see Amazon's [Creating EFS] documentation)\nyou can mount it on the bastion with the following commands,\nreplacing `???` with values relevant to you: -\n\n    $ sudo yum install -y amazon-efs-utils\n    $ sudo yum -y install nfs-utils\n 
   $ sudo service nfs start\n\n    $ sudo mkdir /efs\n    $ sudo mount -t nfs \\\n        -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \\\n        fs-????????????.efs.????.amazonaws.com:/ \\\n        /efs\n\n\u003e   Refer to the [EFS] documentation for further details.\n\n## Connect to the cluster\nYour cluster's created (well, the _head node_ is). You can now use the CLI to\nconnect to the head node using the SSH key you created earlier. Here we just\nmake sure Nextflow is correctly installed by running the classic _hello_\nworkflow, which will create compute instances to run the workflow processes.\n\nAssuming you've put your private key file in `~/.ssh/id_rsa` you can connect\nwith: -\n\n    $ pcluster ssh -n ${CLUSTER_NAME}\n    [...]\n    \n    [centos@ip-0-0-0-0 ~]$ nextflow run hello\n    [...]\n    N E X T F L O W  ~  version 20.07.1\n    [...]\n    [4e/8c5c13] process \u003e sayHello (3) [100%] 4 of 4 ✔\n    Ciao world!\n\n    Bonjour world!\n\n    Hola world!\n\n    Hello world!\n\n    Completed at: 20-Oct-2020 15:06:36\n    Duration    : 4m 46s\n    CPU hours   : (a few seconds)\n    Succeeded   : 4\n\n\u003e   Initial execution of Nextflow will take some time: compute instances\n    need to be instantiated (compute instances are created\n    on-demand and, in our configuration, retired automatically when idle\n    for 10 minutes), Nextflow's dependent modules need to be downloaded,\n    and any required Docker container images need to be converted to Singularity. \n\nCongratulations! You can now run Slurm-based Nextflow workflows!\n\n\u003e   To execute our [fragmentation workflow] you may need the private key of\n    the keypair used to create the cluster in the Master node's\n    `~/.ssh/${KEYPAIR_NAME}` file. This will allow you to create the\n    database server (a separate EC2 instance) using the keypair you used to\n    create the cluster, remembering to set the Master node's file permissions\n    correctly (i.e. 
`chmod 0600 ~/.ssh/${KEYPAIR_NAME}`)\n\n\u003e   An alternative (non-config) SSH connection mechanism, armed with the\n    Master's address and private key-pair, is\n    `ssh -i ~/.ssh/nextflow-pcluster \u003cUSER\u003e@\u003cMASTER_ADDR\u003e` where `\u003cUSER\u003e`\n    is `ec2-user` for an Amazon Linux 2 master and `centos` for CentOS.\n\n## Deleting the cluster\nOnce you're done, if you no longer need the cluster, delete it: -\n\n    $ pcluster delete-cluster -n ${CLUSTER_NAME}\n\n\u003e   Be careful with this command - it does not ask \"Are you sure?\".\n\n\u003e   We've noticed that tearing down the cluster may not always be successful\n    (observed October 2020) and that manual intervention in the AWS CloudFormation\n    console may be required. It is always worth checking the AWS CloudFormation\n    console to make sure the stack responsible for the cluster has been\n    deleted.\n\n## A custom cluster image\nParallelCluster's [ImageBuilder] is a tool to create custom images (AMIs) that you\ncan use as the basis of your cluster's head and compute instances. This is especially\nuseful if you find you're installing a lot of custom packages, which can slow down\nthe formation of new compute nodes. By creating a custom image with all your\napplication packages you can reduce the time taken for new nodes to become available.\n\n\u003e   To build custom images your chosen IAM user will need the\n    **image build pcluster user policy** described in the [ParallelCluster Policies]\n    section of the AWS documentation.\n\nTo do this you simply put your package configuration into a shell-script and store this\nin an Amazon S3 bucket. You then refer to this script in the ImageBuilder YAML-based\nconfiguration file.\n\nWe've put our ParallelCluster v3 configuration script (which installs Nextflow and\nSingularity) into our public S3 bucket. 
We can then create a simple image builder\nfile that refers to this script to create a custom image.\n\nOurs looks like this...\n\n```yaml\n---\n# A ParallelCluster v3 ImageBuilder configuration.\n# Used to compile custom images.\n#\n# This file is a TEMPLATE file, replace the `000[...]000` IDs with\n# values suitable for your environment.\n#\n# See https://docs.aws.amazon.com/parallelcluster/latest/ug/building-custom-ami-v3.html\nBuild:\n  InstanceType: c6a.4xlarge\n  # A Parent Image to base this one on.\n  # Here we're using a suitable Amazon Linux.\n  # You can use 'pcluster list-official-images' to find some.\n  ParentImage: ami-00000000000000000\n  # If you don't have a 'default VPC'\n  # you will need to provide a Subnet (and a SecurityGroup)\n  SubnetId: subnet-00000000000000000\n  SecurityGroupIds:\n  - sg-00000000000000000\n  # Components to add to the image.\n  # Here we're running our custom script (on S3)\n  # that installs nextflow and apptainer (singularity)\n  Components:\n  - Type: script\n    Value: s3://im-aws-parallel-cluster/imagebuilder-amazon.sh\n  # Allow the builder to access S3\n  Iam:\n    AdditionalIamPolicies:\n    - Policy: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\n  # Other stuff...\n  UpdateOsPackages:\n    Enabled: true\n``` \n\n\u003e   You can find a copy of the ImageBuilder YAML file in the `imagebuilder` directory\n    of this repository.\n    \nThen, if the above configuration is placed in the file `imagebuilder-nextflow.yaml`\nwe can run the image builder and create a custom image: -\n\n    $ pcluster build-image \\\n        --image-configuration imagebuilder-nextflow.yaml \\\n        --image-id nextflow \\\n        --region eu-central-1\n\nBuilding an image can take a substantial length of time (an hour or so)\nbut you can track image build status using the following command: -\n\n    $ pcluster describe-image --image-id nextflow --region eu-central-1\n\nWhen the `imageBuildStatus` from the above command is 
`BUILD_COMPLETE` you should\nalso find the image AMI under `ec2AmiInfo -\u003e amiId`.\n\nYou can now use this AMI in your cluster configuration and remove the\ncorresponding `CustomActions`, which are no longer required, by placing the AMI\nin the `Image` block of your cluster configuration: -\n\n```yaml\nImage:\n  Os: alinux2\n  CustomAmi: ami-00000000000000000\n```\n\nNow, clusters built using this configuration should become available a little more\nquickly.\n\n---\n\n[administratoraccess]: https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html#aws-managed-policies\n[aws parallel cluster]: https://docs.aws.amazon.com/parallelcluster/index.html\n[creating efs]: https://docs.aws.amazon.com/efs/latest/ug/gs-step-two-create-efs-resources.html\n[documentation for the configuration file]: https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-configuration-file-v3.html\n[efs]: https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html\n[fragmentation workflow]: https://github.com/InformaticsMatters/fragmentor\n[imagebuilder]: https://docs.aws.amazon.com/parallelcluster/latest/ug/building-custom-ami-v3.html\n[jq]: https://stedolan.github.io/jq/\n[nextflow]: https://www.nextflow.io/\n[parallelcluster policies]: https://docs.aws.amazon.com/parallelcluster/latest/ug/iam-roles-in-parallelcluster-v3.html\n[singularity]: https://sylabs.io/docs/\n[slurm]: https://slurm.schedmd.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finformaticsmatters%2Fnextflow-pcluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finformaticsmatters%2Fnextflow-pcluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finformaticsmatters%2Fnextflow-pcluster/lists"}