{"id":18802117,"url":"https://github.com/oracle-quickstart/oci-hpc","last_synced_at":"2025-04-07T07:17:41.350Z","repository":{"id":40005224,"uuid":"160391506","full_name":"oracle-quickstart/oci-hpc","owner":"oracle-quickstart","description":"Terraform examples for deploying HPC clusters on OCI","archived":false,"fork":false,"pushed_at":"2025-03-29T06:47:34.000Z","size":6723,"stargazers_count":44,"open_issues_count":20,"forks_count":32,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-31T06:06:00.114Z","etag":null,"topics":["architecture","cloud","gluster","hpc","hpc-cluster","oci","oracle","oracle-led","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"upl-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oracle-quickstart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-04T17:03:18.000Z","updated_at":"2025-03-29T06:46:15.000Z","dependencies_parsed_at":"2024-01-05T20:38:22.527Z","dependency_job_id":"69a33ad7-167b-4109-b5ff-d2132ccb8025","html_url":"https://github.com/oracle-quickstart/oci-hpc","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oracle-quickstart%2Foci-hpc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oracle-quickstart%2Foci-hpc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oracle-quickstart%2Foci-hpc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/oracle-quickstart%2Foci-hpc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oracle-quickstart","download_url":"https://codeload.github.com/oracle-quickstart/oci-hpc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247608160,"owners_count":20965953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["architecture","cloud","gluster","hpc","hpc-cluster","oci","oracle","oracle-led","terraform"],"created_at":"2024-11-07T22:26:37.291Z","updated_at":"2025-04-07T07:17:41.327Z","avatar_url":"https://github.com/oracle-quickstart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Stack to create an HPC cluster. 
\n\n[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc/archive/refs/heads/master.zip)\n\n\n## Policies to deploy the stack: \n```\nallow service compute_management to use tag-namespace in tenancy\nallow service compute_management to manage compute-management-family in tenancy\nallow service compute_management to read app-catalog-listing in tenancy\nallow group user to manage all-resources in compartment compartmentName\n```\n## Policies for autoscaling or resizing:\nAs described when you specify your variables, if you select instance-principal as the way of authenticating your node, make sure you generate a dynamic group and give it the following policies: \n```\nAllow dynamic-group instance_principal to read app-catalog-listing in tenancy\nAllow dynamic-group instance_principal to use tag-namespace in tenancy\n```\nAnd also either:\n\n```\nAllow dynamic-group instance_principal to manage compute-management-family in compartment compartmentName\nAllow dynamic-group instance_principal to manage instance-family in compartment compartmentName\nAllow dynamic-group instance_principal to use virtual-network-family in compartment compartmentName\nAllow dynamic-group instance_principal to use volumes in compartment compartmentName\nAllow dynamic-group instance_principal to manage dns in compartment compartmentName\n```\nor:\n\n```\nAllow dynamic-group instance_principal to manage all-resources in compartment compartmentName\n```\n\n\n## Supported OS: \nThe stack allows various combinations of OS. Here is a list of what has been tested. 
We can't guarantee any of the other combinations.\n\n|   Controller  |    Compute   |\n|---------------|--------------|\n|      OL8      |      OL8     |\n|      OL8      |      OL7     |\n| Ubuntu  22.04 | Ubuntu 22.04 |\n\nWhen switching to Ubuntu, make sure the username is changed from opc to Ubuntu in the ORM for both the controller and compute nodes. \n## How is resizing different from autoscaling ?\nAutoscaling is the idea of launching new clusters for jobs in the queue. \nResizing a cluster is changing the size of a cluster. In some cases growing your cluster may be a better idea, but be aware that this may lead to capacity errors. Because Oracle Cloud RDMA is non-virtualized, you get much better performance but it also means that we had to build HPC islands and split our capacity across different network blocks.\nSo while there may be capacity available in the DC, you may not be able to grow your current cluster.  \n\n# Cluster Network Resizing (via resize.sh)\n\nCluster resizing refers to the ability to add or remove nodes from an existing cluster network. Apart from add/remove, the resize.py script can also be used to reconfigure the nodes. \n\nResizing of an HPC cluster with Cluster Network consists of 2 major sub-steps:\n- Add/Remove node (IaaS provisioning) to cluster – uses OCI Python SDK \n- Configure the nodes (uses Ansible)\n  -  Configures newly added nodes to be ready to run the jobs\n  -  Reconfigure services like Slurm to recognize new nodes on all nodes\n  -  Update rest of the nodes, when any node/s are removed (eg: Slurm config, /etc/hosts, etc.)\n\n  Clusters created by the autoscaling script can also be resized by using the flag --cluster_name cluster-1-hpc\n \n## resize.sh usage \n\nThe resize.sh is deployed on the controller node as part of the HPC cluster Stack deployment. Unreachable nodes have been causing issues. 
If nodes in the inventory are unreachable, we will not do cluster modification to the cluster unless --remove_unreachable is also specified. That will terminate the unreachable nodes before running the action that was requested (Example Adding a node) \n\n```\n/opt/oci-hpc/bin/resize.sh -h\nusage: resize.sh [-h] [--compartment_ocid COMPARTMENT_OCID]\n                 [--cluster_name CLUSTER_NAME] [--nodes NODES [NODES ...]]\n                 [--no_reconfigure] [--user_logging] [--force] [--remove_unreachable]\n                 [{add,remove,remove_unreachable,list,reconfigure}] [number] [--quiet]\nScript to resize the CN\n\npositional arguments:\n  {add,remove,remove_unreachable,list,reconfigure}\n                              Mode type. add/remove node options, implicitly\n                              configures newly added nodes. Also implicitly\n                              reconfigure/restart services like Slurm to recognize\n                              new nodes. Similarly for remove option, terminates\n                              nodes and implicitly reconfigure/restart services like\n                              Slurm on rest of the cluster nodes to remove reference\n                              to deleted nodes. IMPORTANT: remove or remove_unreachable \n                              means delete the node from the cluster which means terminate \n                              the node. remove_unreachable should be used to remove specific \n                              nodes which are no longer reachable via ssh. 
It gives you control \n                              on which nodes will be terminated by passing the --nodes parameter.\nnumber                        Number of nodes to add or delete if a list of\n                              hostnames is not defined.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --compartment_ocid COMPARTMENT_OCID\n                        OCID of the compartment, defaults to the Compartment\n                        OCID of the localhost\n  --cluster_name CLUSTER_NAME\n                        Name of the cluster to resize. Defaults to the name\n                        included in the controller\n  --nodes NODES [NODES ...]\n                        List of nodes to delete\n  --no_reconfigure      If present. Does not rerun the playbooks\n  --user_logging        If present. Use the default settings in ~/.oci/config\n                        to connect to the API. Default is using\n                        instance_principal\n  --force               If present. Nodes will be removed even if the destroy\n                        playbook failed\n  --ansible_crucial     If present during reconfiguration, only crucial\n                        ansible playbooks will be executed on the live nodes.\n                        Non live nodes will be removed\n  --remove_unreachable  If present, ALL nodes that are not sshable will be terminated \n                        before running the action that was requested (Example Adding a node). \n                        CAUTION: Use this only if you want to remove ALL nodes that \n                        are unreachable. Instead, remove specific nodes that are \n                        unreachable by using positional argument remove_unreachable. 
\n  --quiet               If present, the script will not prompt for a response when \n                        removing nodes and will not give a reminder to save data \n                        from nodes that are being removed\n```\n\n**Add nodes** \n\nConsists of the following sub-steps:\n- Add node (IaaS provisioning) to cluster – uses OCI Python SDK \n- Configure the nodes (uses Ansible)\n  -  Configures newly added nodes to be ready to run the jobs\n  -  Reconfigure services like Slurm to recognize new nodes on all nodes\n\nAdd one node \n```\n/opt/oci-hpc/bin/resize.sh add 1\n\n```\n\nAdd three nodes to cluster compute-1-hpc\n```\n/opt/oci-hpc/bin/resize.sh add 3 --cluster_name compute-1-hpc\n\n```\n\n\n**Remove nodes** \n\nConsists of the following sub-steps:\n- Remove node/s (IaaS termination) from cluster – uses OCI Python SDK \n- Reconfigure rest of the nodes in the cluster  (uses Ansible)\n  -  Remove reference to removed node/s on rest of the nodes (eg: update /etc/hosts, slurm configs, etc.)\n \n\nRemove specific node:  \n```\n/opt/oci-hpc/bin/resize.sh remove --nodes inst-dpi8e-assuring-woodcock\n```\nor \n\nRemove a list of nodes (space separated):  \n```\n/opt/oci-hpc/bin/resize.sh remove --nodes inst-dpi8e-assuring-woodcock inst-ed5yh-assuring-woodcock\n```\nor \nRemove one node randomly:  \n```\n/opt/oci-hpc/bin/resize.sh remove 1\n```\nor \nRemove 3 nodes randomly from compute-1-hpc:  \n```\n/opt/oci-hpc/bin/resize.sh remove 3 --cluster_name compute-1-hpc\n\n```\nor \nRemove 3 nodes randomly from compute-1-hpc but do not prompt for a response when removing the nodes and do not give a reminder to save data \nfrom nodes that are being removed:  \n```\n/opt/oci-hpc/bin/resize.sh remove 3 --cluster_name compute-1-hpc --quiet\n\n```\n\n**Reconfigure nodes** \n\nThis allows users to reconfigure nodes (Ansible tasks) of the cluster.  \n\nFull reconfiguration of all nodes of the cluster.   
This will run the same steps that are run when a new cluster is created.   If you manually updated configs which are created/updated as part of cluster configuration, then this command will overwrite your manual changes.   \n\n```\n/opt/oci-hpc/bin/resize.sh reconfigure\n```\n\n\n\n## Resizing (via OCI console)\n**Things to consider:**  \n- If you resize from OCI console to reduce cluster network/instance pool size (scale down), the OCI platform decides which node to terminate (oldest node first)\n- OCI console only resizes the Cluster Network/Instance Pool, but it doesn't execute the ansible tasks (HPC Cluster Stack) required to configure the newly added nodes or to update the existing nodes when a node is removed (eg: updating /etc/hosts, slurm config, etc).   \n\n\n# Autoscaling\n\nThe autoscaling will work in a “cluster per job” approach. This means that for each job waiting in the queue, we will launch a new cluster specifically for that job. Autoscaling will also take care of spinning down clusters. By default, a cluster is left idle for 10 minutes before shutting down. Autoscaling is achieved with a cronjob to be able to quickly switch from one scheduler to the next.\n\nSmaller jobs can run on large clusters and the clusters will be resized down after the grace period to only the running nodes. Clusters will NOT be resized up. We will spin up a new larger cluster and spin down the smaller cluster to avoid capacity issues in the HPC island. \n\nThe initial cluster deployed through the stack will never be spun down.\n\nThere is a configuration file at `/opt/oci-hpc/conf/queues.conf` with an example at `/opt/oci-hpc/conf/queues.conf.example` to show how to add multiple queues and multiple instance types. Examples are included for HPC, GPU or Flex VMs. \n\nYou will be able to use the instance type name as a feature in the job definition to make sure it runs on, and if needed creates, the right kind of node. \n\nYou can only have one default instance-type per queue and one default queue. 
To submit to a non-default queue, either add this line to the SBATCH file: `#SBATCH --partition compute` or in the command line: `sbatch -p queuename job.sh`\n\nThe keyword `permanent` will spin up clusters but not delete them until it is set to false. You do not need to reconfigure Slurm after you change that value. \n\nAfter a modification of the `/opt/oci-hpc/conf/queues.conf`, you need to run \n`/opt/oci-hpc/bin/slurm_config.sh`\n\nIf you have some state that is messing with Slurm, you can make sure it is put back in the initial state with \n`/opt/oci-hpc/bin/slurm_config.sh --initial`\n\nTo turn on autoscaling: \nUncomment the line in `crontab -e`:\n```\n* * * * * /opt/oci-hpc/autoscaling/crontab/autoscale_slurm.sh \u003e\u003e /opt/oci-hpc/logs/crontab_slurm.log 2\u003e\u00261\n```\nAnd in /etc/ansible/hosts, the value below should be true\n```\nautoscaling = true\n```\n\n# Submit\nHow to submit jobs: \nSlurm jobs can be submitted as always but a few more constraints can be set: \nExample in `/opt/oci-hpc/samples/submit/`: \n\n```\n#!/bin/sh\n#SBATCH -n 72\n#SBATCH --ntasks-per-node 36\n#SBATCH --exclusive\n#SBATCH --job-name sleep_job\n#SBATCH --constraint hpc-default\n\ncd /nfs/scratch\nmkdir $SLURM_JOB_ID\ncd $SLURM_JOB_ID\nMACHINEFILE=\"hostfile\"\n\n# Generate Machinefile for mpi such that hosts are in the same\n#  order as if run via srun\n#\nscontrol show hostnames $SLURM_JOB_NODELIST \u003e $MACHINEFILE\nsed -i \"s/$/:${SLURM_NTASKS_PER_NODE}/\" $MACHINEFILE\n\ncat $MACHINEFILE\n# Run using generated Machine file:\nsleep 1000\n```\n \n- Instance Type: You can specify the OCI instance type that you’d like to run on as a constraint. This will make sure that you run on the right shape and also generate the right cluster. Instance types are defined in the `/opt/oci-hpc/conf/queues.conf` file in YAML format. Leave all of the fields in there even if they are not used. You can define multiple queues and multiple instance types in each queue. 
If you do not select an instance type when creating your job, it will use the default one.\n\n- cpu-bind: On Ubuntu 22.04, we are switching to Cgroup v2 and we noticed that when hyperthreading is turned off, the default cpu-bind may cause issues. If you get an error like `error: task_g_set_affinity: Invalid argument`, you can try running your job with --cpu-bind=none or --cpu-bind=sockets\n## Clusters folders: \n```\n/opt/oci-hpc/autoscaling/clusters/clustername\n```\n\n## Logs: \n```\n/opt/oci-hpc/logs\n```\n\nEach cluster will have its own logs, named `create_clustername_date.log` and `delete_clustername_date.log`\nThe log of the crontab will be in `crontab_slurm.log`\n\n\n## Manual clusters: \nYou can create and delete your clusters manually. \n### Cluster Creation\n```\n/opt/oci-hpc/bin/create_cluster.sh NodeNumber clustername instance_type queue_name\n```\nExample: \n```\n/opt/oci-hpc/bin/create_cluster.sh 4 compute2-1-hpc HPC_instance compute2\n```\n\n### Cluster Deletion: \n```\n/opt/oci-hpc/bin/delete_cluster.sh clustername\n```\n\nIn case something goes wrong during the deletion, you can force the deletion with \n```\n/opt/oci-hpc/bin/delete_cluster.sh clustername FORCE\n```\nWhen the cluster is already being destroyed, it will have a file `/opt/oci-hpc/autoscaling/clusters/clustername/currently_destroying` \n\n## Autoscaling Monitoring\nIf you selected the autoscaling monitoring, you can see what nodes are spinning up and down as well as running and queued jobs. Everything will run automatically except the import of the Dashboard in Grafana due to a problem in the Grafana API. \n\nTo do it manually, in your browser of choice, navigate to controllerIP:3000. Username and password are admin/admin; you can change those during your first login. Go to Configuration -\u003e Data Sources. Select autoscaling. Enter Password as Monitor1234! and click on 'Save \u0026 test'. Now click on the + sign on the left menu bar and select import. 
Click on Upload JSON file and upload the file that is located at `/opt/oci-hpc/playbooks/roles/autoscaling_mon/files/dashboard.json`. Select autoscaling (MySQL) as your datasource. \n\nYou will now see the dashboard. \n\n\n# LDAP \nIf selected, the controller host will act as an LDAP server for the cluster. It's strongly recommended to leave the default, shared home directory. \nUser management can be performed from the controller using the ``` cluster ``` command. \nExample of cluster command to add a new user: \n```cluster user add name```\nBy default, a `privilege` group is created that has access to the NFS and can have sudo access on all nodes (defined at stack creation; this group has ID 9876). The group name can be modified.\n```cluster user add name --gid 9876```\nTo avoid generating a user-specific key for passwordless ssh between nodes, use --nossh. \n```cluster user add name --nossh --gid 9876```\n\n# Shared home folder\n\nBy default, the home folder is an NFS directory shared between all nodes from the controller. You can also use an FSS to share it, so that you can keep working if the controller goes down. You can either create the FSS from the GUI (be aware that it will get destroyed when you destroy the stack) or pass an existing FSS IP and path. If you share an existing FSS, do not use /home as mountpoint. The stack will take care of creating a $nfsshare/home directory and mounting it at /home after copying all the appropriate files. \n\n# Deploy within a private subnet\n\nIf \"true\", this will create a private endpoint in order for Oracle Resource Manager to configure the controller VM and the future nodes in private subnet(s). \n* If \"Use Existing Subnet\" is false, Terraform will create 2 private subnets, one for the controller and one for the compute nodes.  \n* If \"Use Existing Subnet\" is also true, the user must indicate a private subnet for the controller VM. 
The compute nodes can reside in another private subnet or the same private subnet as the controller VM. \n\nThe controller VM will reside in a private subnet. Therefore, the creation of a \"controller service\" (https://docs.oracle.com/en-us/iaas/Content/controller/Concepts/controlleroverview.htm), a VPN or FastConnect connection is required. If a public subnet exists in the VCN, adapting the security lists and creating a jump host can also work. Finally, a Peering can also be established between the private subnet and another VCN reachable by the user.\n\n\n\n## max_nodes_partition.py usage \n\nUse the alias \"max_nodes\" to run the Python script max_nodes_partition.py. You can run this script only from the controller.\n\n$ max_nodes --\u003e Information about all the partitions and their respective clusters, and maximum number of nodes distributed evenly per partition\n\n$ max_nodes --include_cluster_names xxx yyy zzz --\u003e where xxx, yyy, zzz are cluster names. Provide a space separated list of cluster names to be considered for displaying the information about clusters and maximum number of nodes distributed evenly per partition\n\n\n## validation.py usage\n\nUse the alias \"validate\" to run the Python script validation.py. You can run this script only from the controller. \n\nThe script performs these checks. \n-\u003e Check the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files.\n-\u003e PCIe bandwidth check \n-\u003e GPU Throttle check \n-\u003e Check whether md5 sum of /etc/hosts file on nodes matches that on controller\n\nProvide at least one argument: [-n NUM_NODES] [-p PCIE] [-g GPU_THROTTLE] [-e ETC_HOSTS]\n\nOptional argument with [-n NUM_NODES] [-p PCIE] [-g GPU_THROTTLE] [-e ETC_HOSTS]: [-cn CLUSTER_NAMES]\nProvide a file that lists each cluster on a separate line for which you want to validate the number of nodes and/or pcie check and/or gpu throttle check and/or /etc/hosts md5 sum. 
\n\nFor pcie, gpu throttle, and /etc/hosts md5 sum check, you can either provide y or Y along with -cn or you can give the hostfile path (each host on a separate line) for each argument. For number of nodes check, either provide y or give y along with -cn.\n\nBelow are some examples for running this script.\n\nvalidate -n y --\u003e This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. The clusters considered will be the default cluster if any and cluster(s) found in /opt/oci-hpc/autoscaling/clusters directory. The number of nodes considered will be from the resize script using the clusters we got before. \n\nvalidate -n y -cn \u003ccluster name file\u003e --\u003e This will validate that the number of nodes is consistent across resize, /etc/hosts, slurm, topology.conf, OCI console, inventory files. It will also check whether md5 sum of /etc/hosts file on all nodes matches that on controller. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file. \n\nvalidate -p y -cn \u003ccluster name file\u003e --\u003e This will run the pcie bandwidth check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file. \n\nvalidate -p \u003cpcie host file\u003e --\u003e This will run the pcie bandwidth check on the hosts provided in the file given. The pcie host file should have a host name on each line.\n\nvalidate -g y -cn \u003ccluster name file\u003e --\u003e This will run the GPU throttle check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file. 
\n\nvalidate -g \u003cgpu check host file\u003e --\u003e This will run the GPU throttle check on the hosts provided in the file given. The gpu check host file should have a host name on each line.\n\nvalidate -e y -cn \u003ccluster name file\u003e --\u003e This will run the /etc/hosts md5 sum check. The clusters considered will be from the file specified by -cn option. The number of nodes considered will be from the resize script using the clusters from the file. \n\nvalidate -e \u003cmd5 sum check host file\u003e --\u003e This will run the /etc/hosts md5 sum check on the hosts provided in the file given. The md5 sum check host file should have a host name on each line.\n\nYou can combine all the options together such as:\nvalidate -n y -p y -g y -e y -cn \u003ccluster name file\u003e\n\n\n## /opt/oci-hpc/scripts/collect_logs.py\nThis is a script to collect nvidia bug report, sosreport, console history logs. \n\nThe script needs to be run from the controller. In the case where the host is not ssh-able, it will get only  console history logs for the same.\n\nIt requires the below argument.\n--hostname \u003cHOSTNAME\u003e\n\nAnd --compartment-id \u003cCOMPARTMENT_ID\u003e is optional (i.e. assumption is the host is in the same compartment as the controller). \n\nWhere HOSTNAME is the node name for which you need the above logs and COMPARTMENT_ID is the OCID of the compartment where the node is.\n\nThe script will get all the above logs and put them in a folder specific to each node in /home/{user}. 
It will give the folder name as the output.\n\nAssumption: For getting the console history logs, the script expects to have the node name in /etc/hosts file.\n\nExamples:\n\npython3 collect_logs.py --hostname compute-permanent-node-467\nThe nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_06132023191024\n\npython3 collect_logs.py --hostname inst-jxwf6-keen-drake\nThe nvidia bug report, sosreport, and console history logs for inst-jxwf6-keen-drake are at /home/ubuntu/inst-jxwf6-keen-drake_11112022001138\n\nfor x in `less /home/opc/hostlist` ; do echo $x ; python3 collect_logs.py --hostname $x; done ;\ncompute-permanent-node-467\nThe nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_11112022011318\ncompute-permanent-node-787\nThe nvidia bug report, sosreport, and console history logs for compute-permanent-node-787 are at /home/ubuntu/compute-permanent-node-787_11112022011835\n\nWhere hostlist had the below contents\ncompute-permanent-node-467\ncompute-permanent-node-787\n\n\n## Collect RDMA NIC Metrics and Upload to Object Storage\n\nOCI-HPC is deployed in the customer tenancy, so OCI service teams cannot access metrics from these OCI-HPC stack clusters. To overcome this issue, in release,\nwe introduce a feature to collect RDMA NIC Metrics and upload those metrics to Object Storage. Later on, that Object Storage URL could be shared with OCI service\nteams. With that URL, OCI service teams could access the metrics and use them for debugging purposes.\n\nTo collect RDMA NIC Metrics and upload them to Object Storage, the user needs to follow these steps:\n\nStep 1: Create a PAR (PreAuthenticated Request)\nTo create a PAR, the user needs to select the check-box \"Create Object Storage PAR\" during Resource Manager's stack creation.\nBy default, this check box is enabled. 
By selecting this check-box, a PAR will be created.\n\nStep 2: Use shell script: upload_rdma_nic_metrics.sh to collect metrics and upload to object storage.\nThe user needs to use the shell script upload_rdma_nic_metrics.sh to collect metrics and upload them to object storage. The user can configure the metrics\ncollection limit and interval through the config file rdma_metrics_collection_config.conf.\n\n## Meshpinger\n\nMeshpinger is a tool for validating network layer connectivity between RDMA NICs on a cluster network in OCI. The tool is capable of initiating ICMP ping from every RDMA NIC port on the cluster network to every other RDMA NIC port on the same cluster network and\nreporting back the success/failure status of the pings performed in the form of logs.\n\nRunning the tool before starting a workload on a cluster network should serve as a good precheck step to gain confidence in the network reachability between RDMA NICs. Typical causes for reachability failures that the tool can help pinpoint are:\n1. Link down on the RDMA NIC\n2. RDMA interface initialization or configuration issues including IP address assignment to\nthe interface\n3. Insufficient ARP table size on the node to store all needed peer MAC addresses","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foracle-quickstart%2Foci-hpc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foracle-quickstart%2Foci-hpc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foracle-quickstart%2Foci-hpc/lists"}