# Azure CycleCloud OpenPBS project

OpenPBS is a highly configurable open source workload manager. See the
[OpenPBS project site](http://www.openpbs.org/) for an overview and the [PBSpro
documentation](https://www.pbsworks.com/PBSProductGT.aspx?n=Altair-PBS-Professional&c=Overview-and-Capabilities&d=Altair-PBS-Professional,-Documentation)
for more information on using, configuring, and troubleshooting OpenPBS
in general.

## Versions

OpenPBS (formerly PBS Professional OSS) is shipped as of version `20.0.0`. PBSPro OSS is still available
in CycleCloud by specifying the PBSPro OSS version:

```ini
   [[[configuration]]]
   pbspro.version = 18.1.4-0
```

## Installing Manually

Note: When using the cluster that is shipped with CycleCloud, the autoscaler and default queues are already installed.

First, download the installer package from GitHub.
For example, you can download the [2.0.25 release here](https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.25/cyclecloud-pbspro-pkg-2.0.25.tar.gz)

```bash
# Prerequisite: python3, 3.6 or newer, must be installed and in the PATH
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.25/cyclecloud-pbspro-pkg-2.0.25.tar.gz
tar xzf cyclecloud-pbspro-pkg-2.0.25.tar.gz
cd cyclecloud-pbspro
# Optional, but recommended. Adds relevant resources and enables strict placement
./initialize_pbs.sh
# Optional. Sets up workq as a colocated, MPI focused queue and creates htcq for non-MPI workloads.
./initialize_default_queues.sh

# Creates the azpbs autoscaler
./install.sh --venv /opt/cycle/pbspro/venv

# If you have jetpack available, you may use the following:
# ./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
#                              --username $(jetpack config cyclecloud.config.username) \
#                              --password $(jetpack config cyclecloud.config.password) \
#                              --url $(jetpack config cyclecloud.config.web_server) \
#                              --cluster-name $(jetpack config cyclecloud.cluster.name)

# Otherwise insert your username, password, url, and cluster name here.
./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
                             --username user \
                             --password password \
                             --url https://fqdn:port \
                             --cluster-name cluster_name

# Lastly, run this to understand any changes that may be required.
# For example, you typically have to add the ungrouped and group_id resources
# to the /var/spool/pbs/sched_priv/sched_config file and restart.
## [root@scheduler cyclecloud-pbspro]# azpbs validate
## ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config.
##   Please add this and restart PBS
## group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config.
##   Please add this and restart PBS
azpbs validate
```

## Autoscale and scalesets

To ensure that the correct VMs are provisioned for different types of jobs, CycleCloud treats autoscale of MPI and serial jobs differently in OpenPBS clusters.

For serial jobs, multiple VM scalesets (VMSS) are used in order to scale as quickly as possible. For MPI jobs to use the InfiniBand fabric on instances that support it, all of the nodes allocated to the job have to be deployed in the same VMSS. CycleCloud
handles this by using a `PlacementGroupId` that groups nodes with the same id into the same VMSS. By default, the `workq` appends
the equivalent of `-l place=scatter:group=group_id` by using native queue defaults.

## Hooks

Our PBS integration uses three different PBS hooks. `autoscale` does the bulk of the work required to scale the cluster up and down. All relevant log messages can be seen in `/opt/cycle/pbspro/autoscale.log`. `cycle_sub_hook` validates jobs unless they use `-l nodes` syntax, in which case those jobs are held and later processed by our last hook, `cycle_sub_hook_periodic`.

### Autoscale Hook
The most important hook is `autoscale`, which runs by default on a 15 second interval. You can adjust this frequency by running
```bash
qmgr -c "set hook autoscale freq=NUM_SECONDS"
```

### Submission Hooks
`cycle_sub_hook` validates that your job has the proper placement restrictions set. If it encounters a problem, it outputs a detailed message explaining why the job was rejected and how to resolve the issue.
For example:

```bash
$> echo sleep 300 | qsub -l select=2 -l place=scatter
```
```
qsub: Job uses more than one node and does not place the jobs based on group_id, which may cause issues with tightly coupled jobs.
Please do one of the following
    1) Ensure this placement is set by adding group=group_id to your -l place= statement
        Note: Queue workq's resource_defaults.place=group=group_id
    2) Add -l skipcyclesubhook=true on this job
        Note: If the resource does not exist, create it -> qmgr -c 'create resource skipcyclesubhook type=boolean'
    3) Disable this hook for this queue via queue defaults -> qmgr -c 'set queue workq resources_default.skipcyclesubhook=true'
    4) Disable this plugin -> qmgr -c 'set hook cycle_sub_hook enabled=false'
        Note: Disabling this plugin may prevent -l nodes= style submissions from working properly.
```

One important note: if you are using `Torque` style submissions, i.e. those that use `-l nodes` instead of `-l select`, PBS will simply convert that submission into an equivalent `-l select` style submission. However, the default placement defined for the queue is not respected by PBS when converting the job. To get around this, we `hold` the job, and our last hook, `cycle_sub_hook_periodic`, periodically updates the job's placement and releases it.


## Configuring Resources
The cyclecloud-pbspro application matches PBS resources to Azure cloud resources
to provide rich autoscaling and cluster configuration tools. The application will be deployed
automatically for clusters created via the CycleCloud UI, or it can be installed on any
PBS admin host on an existing cluster.
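The Torque-style conversion described under *Submission Hooks* can be illustrated with two submissions that request the same layout. This is a sketch; the resource values are arbitrary, and the exact `select` statement PBS generates may differ by version:

```shell
# Torque-style request: 2 nodes, 4 processors per node. PBS converts this to a
# select statement, but queue placement defaults are not applied during the
# conversion, so the job is held and fixed up by cycle_sub_hook_periodic.
echo sleep 60 | qsub -l nodes=2:ppn=4

# Explicit select-style equivalent with the placement stated up front.
echo sleep 60 | qsub -l select=2:ncpus=4 -l place=scatter:group=group_id
```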
For more information on defining resources in _autoscale.json_, see [ScaleLib's documentation](https://github.com/Azure/cyclecloud-scalelib/blob/master/README.md).

The default resources defined in the cluster template we ship are:

```json
{"default_resources": [
   {
      "select": {},
      "name": "ncpus",
      "value": "node.vcpu_count"
   },
   {
      "select": {},
      "name": "group_id",
      "value": "node.placement_group"
   },
   {
      "select": {},
      "name": "host",
      "value": "node.hostname"
   },
   {
      "select": {},
      "name": "mem",
      "value": "node.memory"
   },
   {
      "select": {},
      "name": "vm_size",
      "value": "node.vm_size"
   },
   {
      "select": {},
      "name": "disk",
      "value": "size::20g"
   }]
}
```

Note that disk is currently hardcoded to `size::20g` because of platform limitations in determining how much disk a node will
have. Here is an example of handling VM-size-specific disk sizes:
```json
   {
      "select": {"node.vm_size": "Standard_F2"},
      "name": "disk",
      "value": "size::20g"
   },
   {
      "select": {"node.vm_size": "Standard_H44rs"},
      "name": "disk",
      "value": "size::2t"
   }
```

# azpbs cli
The `azpbs` cli is the main interface for all autoscaling behavior. Note that it has fairly powerful autocomplete capabilities. For example, after typing `azpbs create_nodes --vm-size `, you can tab-complete the list of possible VM sizes. Autocomplete information is updated every `azpbs autoscale` cycle, but can also be refreshed manually by running `azpbs refresh_autocomplete`.

| Command | Description |
| :---    | :---        |
| autoscale            | End-to-end autoscale process, including creation, deletion and joining of nodes. |
| buckets              | Prints out autoscale bucket information, such as limits |
| config               | Writes the effective autoscale config, after any preprocessing, to stdout |
| create_nodes         | Creates a set of nodes given various constraints. A CLI version of the NodeManager interface. |
| default_output_columns | Outputs the default output columns for an optionally specified command. |
| delete_nodes         | Deletes nodes, including drain and post-delete handling |
| demand               | Dry-run version of autoscale. |
| initconfig           | Creates an initial autoscale config. Writes to stdout |
| jobs                 | Writes out autoscale jobs as json. Note: running jobs are excluded. |
| join_nodes           | Adds selected nodes to the scheduler |
| limits               | Writes a detailed set of limits for each bucket. Defaults to json due to the number of fields. |
| nodes                | Queries nodes |
| refresh_autocomplete | Refreshes local autocomplete information for cluster-specific resources and nodes. |
| remove_nodes         | Removes the node from the scheduler without terminating the actual instance. |
| retry_failed_nodes   | Retries all nodes in a failed state. |
| shell                | Interactive Python shell with relevant objects in local scope. Use --script to run Python scripts |
| validate             | Runs basic validation of the environment |
| validate_constraint  | Validates then outputs as json one or more constraints. |
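On a fresh installation, a quick way to orient yourself is to chain a few of the read-only commands from the table above (a sketch; `buckets` and `demand` are covered in more detail in the sections that follow):

```shell
azpbs validate   # basic environment validation; surfaces sched_config issues
azpbs buckets    # which buckets of compute are available, and their resources
azpbs demand     # dry run of autoscale: what would be allocated right now
```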
## azpbs buckets
Use the `azpbs buckets` command to see which buckets of compute are available, how many are available, and what resources they have.
```bash
azpbs buckets --output-columns nodearray,placement_group,vm_size,ncpus,mem,available_count
```
```
NODEARRAY PLACEMENT_GROUP     VM_SIZE         NCPUS MEM     AVAILABLE_COUNT
execute                       Standard_F2s_v2 1     4.00g   50
execute                       Standard_D2_v4  1     8.00g   50
execute                       Standard_E2s_v4 1     16.00g  50
execute                       Standard_NC6    6     56.00g  16
execute                       Standard_A11    16    112.00g 6
execute   Standard_F2s_v2_pg0 Standard_F2s_v2 1     4.00g   50
execute   Standard_F2s_v2_pg1 Standard_F2s_v2 1     4.00g   50
execute   Standard_D2_v4_pg0  Standard_D2_v4  1     8.00g   50
execute   Standard_D2_v4_pg1  Standard_D2_v4  1     8.00g   50
execute   Standard_E2s_v4_pg0 Standard_E2s_v4 1     16.00g  50
execute   Standard_E2s_v4_pg1 Standard_E2s_v4 1     16.00g  50
execute   Standard_NC6_pg0    Standard_NC6    6     56.00g  16
execute   Standard_NC6_pg1    Standard_NC6    6     56.00g  16
execute   Standard_A11_pg0    Standard_A11    16    112.00g 6
execute   Standard_A11_pg1    Standard_A11    16    112.00g 6
```


## azpbs demand
It is common to want to test autoscaling without actually allocating anything. `azpbs demand` is a dry-run
version of `azpbs autoscale`. Here is a simple example where we allocate two machines for a simple `-l select=2` submission. As
you can see, job id `1` is using one `ncpus` on two different nodes.

```bash
azpbs demand --output-columns name,job_id,/ncpus
```
```
NAME      JOB_IDS NCPUS
execute-1 1       0/1
execute-2 1       0/1
```

## azpbs create_nodes
Manually creating nodes via `azpbs create_nodes` is also quite powerful. Note that it also has a `--dry-run` mode.

Here is an example of allocating 100 `slots` of `mem=memory::1g`, i.e. 1gb partitions. Since our nodes have 4gb each, we expect 25 nodes to be created.
```bash
azpbs create_nodes --keep-alive --vm-size Standard_F2s_v2 --slots 100 --constraint-expr mem=memory::1g --dry-run --output-columns name,/mem
```
```
NAME       MEM
execute-1  0.00g/4.00g
...
execute-25 0.00g/4.00g
```

## azpbs delete_nodes / remove_nodes
`azpbs` supports safely removing a node from PBS. The difference between `delete_nodes` and `remove_nodes` is simply that `delete_nodes`, on top of removing the node from PBS, also terminates the node. You may delete by hostname or node name. Pass in `*` to delete/remove all nodes.


## azpbs shell
`azpbs shell` is a more advanced command that can be quite powerful. This command fully constructs the in-memory structures used by `azpbs autoscale` so that the user can interact with them dynamically. All of the objects are passed in to the local scope, and can be listed by calling `pbsprohelp()`. This is a powerful debugging tool.


```bash
[root@pbsserver ~] azpbs shell
CycleCloud Autoscale Shell
>>> pbsprohelp()
config               - dict representing autoscale configuration.
cli                  - object representing the CLI commands
pbs_env              - object that contains data structures for queues, resources etc
queues               - dict of queue name -> PBSProQueue object
jobs                 - dict of job id -> Autoscale Job
scheduler_nodes      - dict of hostname -> node objects.
                       These represent purely what the scheduler sees, without additional booting nodes / information from CycleCloud
resource_definitions - dict of resource name -> PBSProResourceDefinition objects.
default_scheduler    - PBSProScheduler object representing the default scheduler.
pbs_driver           - PBSProDriver object that interacts directly with PBS and implements PBS specific behavior for scalelib.
demand_calc          - ScaleLib DemandCalculator - pseudo-scheduler that determines what nodes are unnecessary
node_mgr             - ScaleLib NodeManager - interacts with CycleCloud for all node related activities - creation, deletion, limits, buckets etc.
pbsprohelp           - This help function
>>> queues.workq.resources_default
{'place': 'scatter:group=group_id'}
>>> jobs["0"].node_count
2
```

`azpbs shell` can also take as an argument `--script path/to/python_file.py`, giving the user full access to the in-memory structures, again passed in through the local scope, to customize the autoscale behavior.

```bash
[root@pbsserver ~] cat example.py
for bucket in node_mgr.get_buckets():
    print(bucket.nodearray, bucket.vm_size, bucket.available_count)

[root@pbsserver ~] azpbs shell -s example.py
execute Standard_F2s_v2 50
execute Standard_D2_v4 50
execute Standard_E2s_v4 50
```
## Timeouts
By default, we set idle and boot timeouts across all nodes.
```json
"idle_timeout": 300,
"boot_timeout": 3600
```
You can also set these per nodearray.
```json
"idle_timeout": {"default": 300, "nodearray1": 600, "nodearray2": 900},
"boot_timeout": {"default": 3600, "nodearray1": 7200, "nodearray2": 900}
```
## Logging
By default, `azpbs` uses `/opt/cycle/pbspro/logging.conf`, as defined in `/opt/cycle/pbspro/autoscale.json`.
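To adjust verbosity or destinations, `logging.conf` can be edited in place. As a rough sketch, assuming it follows Python's standard `fileConfig` format (check the shipped file before replacing anything), a minimal configuration writing to `autoscale.log` might look like:

```ini
[loggers]
keys = root

[handlers]
keys = file

[formatters]
keys = default

[logger_root]
level = INFO
handlers = file

[handler_file]
class = FileHandler
formatter = default
args = ('/opt/cycle/pbspro/autoscale.log', 'a')

[formatter_default]
format = %(asctime)s %(levelname)s: %(message)s
```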
This will create the following logs.

### /opt/cycle/pbspro/autoscale.log

`autoscale.log` is the main log for all `azpbs` invocations.

### /opt/cycle/pbspro/qcmd.log
`qcmd.log` records every PBS executable invocation and its response, so you can see exactly which commands are being run.

### /opt/cycle/pbspro/demand.log
Every `autoscale` iteration, `azpbs` prints out a table of all of the nodes, their resources, their assigned jobs and more. This log
contains these values and nothing else.


# Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.