{"id":25649771,"url":"https://github.com/openchami/mini-bootcamp","last_synced_at":"2026-06-17T22:31:03.152Z","repository":{"id":263638570,"uuid":"888151341","full_name":"OpenCHAMI/mini-bootcamp","owner":"OpenCHAMI","description":"OpenCHAMI mini-bootcamp for learning how to use OpenCHAMI","archived":false,"fork":false,"pushed_at":"2025-02-18T17:00:22.000Z","size":793,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-18T18:21:29.934Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenCHAMI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-13T22:46:06.000Z","updated_at":"2025-02-18T17:00:27.000Z","dependencies_parsed_at":"2024-12-11T15:28:35.275Z","dependency_job_id":"8bd191bc-5501-45c7-84c2-b8280d6d626d","html_url":"https://github.com/OpenCHAMI/mini-bootcamp","commit_stats":null,"previous_names":["openchami/mini-bootcamp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenCHAMI/mini-bootcamp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCHAMI%2Fmini-bootcamp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCHAMI%2Fmini-bootcamp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCHAMI%2Fmini-bootcamp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCHAMI%2Fmini-bootcamp/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenCHAMI","download_url":"https://codeload.github.com/OpenCHAMI/mini-bootcamp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenCHAMI%2Fmini-bootcamp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34468766,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-17T02:00:05.408Z","response_time":127,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-23T14:33:44.320Z","updated_at":"2026-06-17T22:31:03.136Z","avatar_url":"https://github.com/OpenCHAMI.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ochami Bootcamp\nThis Doc is a very brief tutorial on how to deploy OpenCHAMI\n\n## Assumptions\n\n### A running OS\nI think we all know how to install a linux OS on a machine at this point.\n\n### Config Management\nWe'll go over some config management, but it will not be a full system deployment. Anything beyond basic booting functions will not be covered\n\n### Cluster Images\nOpenCHAMI doesn't provide an image build system. It relies on external images being available.  
\nWe'll go over how we are building images locally but they won't be full production-like images\n\n## Prep\nSome stuff we need before we start deploying OpenCHAMI\n\n### Package installs\n```bash\ndnf install -y ansible git podman jq\n```\n### Setup hosts\nClusters generally have names. This cluster is named `demo` and the shortname for our nodes is `nid`. Feel free to be creative on your own time.  \nThe BMCs are named `\u003cshortname\u003e-bmc`. \nMake your `/etc/hosts` look something like\n```bash\n172.16.0.254    demo.openchami.cluster\n172.16.0.1      nid001\n172.16.0.2      nid002\n172.16.0.3      nid003\n172.16.0.4      nid004\n172.16.0.5      nid005\n172.16.0.6      nid006\n172.16.0.7      nid007\n172.16.0.8      nid008\n172.16.0.9      nid009\n172.16.0.101    nid-bmc001\n172.16.0.102    nid-bmc002\n172.16.0.103    nid-bmc003\n172.16.0.104    nid-bmc004\n172.16.0.105    nid-bmc005\n172.16.0.106    nid-bmc006\n172.16.0.107    nid-bmc007\n172.16.0.108    nid-bmc008\n172.16.0.109    nid-bmc009\n```\n\n### powerman + conman\nInstall the things\n```bash\ndnf install -y powerman conman jq\n```\nConfigure `/etc/powerman/powerman.conf`, remember your cluster shortnames. User/Password should be the same on all systems\n```bash\ninclude \"/etc/powerman/ipmipower.dev\"\n\ndevice \"ipmi0\" \"ipmipower\" \"/usr/sbin/ipmipower -D lanplus -u admin -p Password123! -h nid-bmc[001-009] -I 17 -W ipmiping |\u0026\"\nnode \"nid[001-009]\" \"ipmi0\" \"nid-bmc[001-009]\"\n```\nStart and enable powerman:\n```bash\nsystemctl start powerman\nsystemctl enable powerman\n```\nThen Check to make sure you can see the power state of the nodes\n```bash\npm -q\n```\n\nConman is next. Configure your `/etc/conman.conf`. 
You may have to zero out that file first.\nShould look something like the below, with your cluster shortname in place.\n```bash\nSERVER keepalive=ON\nSERVER logdir=\"/var/log/conman\"\nSERVER logfile=\"/var/log/conman.log\"\nSERVER loopback=ON\nSERVER pidfile=\"/var/run/conman.pid\"\nSERVER resetcmd=\"/usr/bin/powerman -0 %N; sleep 5; /usr/bin/powerman -1 %N\"\nSERVER tcpwrappers=ON\n\nGLOBAL seropts=\"115200,8n1\"\nGLOBAL log=\"/var/log/conman/console.%N\"\nGLOBAL logopts=\"sanitize,timestamp\"\n\n# Compute nodes\nCONSOLE name=\"nid001\" dev=\"ipmi:nid-bmc001\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid002\" dev=\"ipmi:nid-bmc002\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid003\" dev=\"ipmi:nid-bmc003\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid004\" dev=\"ipmi:nid-bmc004\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid005\" dev=\"ipmi:nid-bmc005\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid006\" dev=\"ipmi:nid-bmc006\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid007\" dev=\"ipmi:nid-bmc007\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid008\" dev=\"ipmi:nid-bmc008\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\nCONSOLE name=\"nid009\" dev=\"ipmi:nid-bmc009\" ipmiopts=\"U:admin,P:Password123!,C:17,W:solpayloadsize\"\n```\nThen start and enable `conman`\n```bash\nsystemctl start conman\nsystemctl enable conman\n```\n\nAt this point you can test powering on a node and check that conman is working\n```bash\npm -1 nid001\nconman nid001\n```\nYou should at least see console output, but it won't boot just yet...\n\n\n## OpenCHAMI microservices\nOpenCHAMI is a long acronym for something that is probably a lot more simple than you would expect. 
OpenCHAMI is ostensibly based on CSM but really we took SMD and BSS and that's about it. \n\n### SMD\nState Management Database (SMD), at least that is what I think SMD stands for, is a set of APIs that sit in front of a Postgres database. SMD does a lot more in CSM than it does in OpenCHAMI. There is no hardware discovery happening in SMD and we don't use it for holding the state of anything. SMD is simply an API that talks to a database that holds component information. The components here are Nodes, BMCs, and Interface data. \nIn OpenCHAMI SMD does not actively do anything; it is a repository of information on the system hardware. \n### BSS\nBootScript Service (BSS) is a service that provides on-demand iPXE scripts to nodes during the netboot process. It talks to SMD to confirm the requesting node exists and, if so, returns a generated iPXE script based on the data it holds about that node. \n### Cloud-init\nWe wrote a custom cloud-init server that does some things similar to BSS. It will process the requesting node's IP and find the component and/or group information, then build the cloud-init configs from there. Cloud-init data is populated externally. OpenCHAMI does not provide the actual configs, only a way to push out the configs. \n\nThe server has two endpoints: `/cloud-init` and `/cloud-init-secure`. Aptly named, the secure endpoint functions like the regular one but requires a JWT to read from it. This is how we are providing secret data to the cluster nodes. \n### TPM-manager\nNot so aptly named, this service is a weird one. Its initial function was to experiment with configuring TPMs during boot. But some of the test systems did not have TPMs, so its basic function is to generate a JWT and push it to the nodes during boot. \n\n### opaal and Hydra\n#### Hydra\n[Hydra](https://github.com/ory/hydra) is an OAuth provider but it does not manage logins or user accounts etc. We use Hydra to create and hand out JWTs.\n\n#### opaal\nOpaal is a toy OIDC provider. 
You make a request to opaal and it makes a JWT request to Hydra, then hands that back to the \"user\". It's a pretend login service.\n\nHydra is something that will probably stick around for a while as we use it as the authorization server. opaal is a stand-in service that will probably get replaced, hopefully soon.\nSo I wouldn't worry too much about opaal.\n### ACME and step-ca\nAutomatic Certificate Management Environment, or ACME, is what we use to automate CA cert renewals. This is so you don't have that special day every year when all your certificates expire and you have to go renew them and it's annoying. Now you have to renew them every day! But it should be \"automatic\" and much easier. I say that but we only issue a single cert at the moment, so time will tell. We use [acme.sh](https://github.com/acmesh-official/acme.sh) to generate certs from a certificate authority. \n\n[step-ca](https://smallstep.com/docs/step-ca/) is the certificate authority we use to generate CA certs. \n### haproxy\nHAProxy acts as our API gateway. It's what allows outside requests to reach into the container network and talk to various OpenCHAMI services. \n### postgres\nWe use postgres as the backend for BSS, SMD, and Hydra. It's just a postgres database in a container. \n\n## OpenCHAMI adjacent technologies\nOpenCHAMI doesn't exist in a vacuum. There are parts of deploying OpenCHAMI that are not managed by OpenCHAMI. \nWe'll cover some of these briefly. Very Briefly. \n\n### DHCP and iPXE and Dracut\nThese are all important parts of the boot process. \n\n#### DHCP\nDHCP is all over the place so I'm not gonna go over what DHCP is. OpenCHAMI provides a [CoreDHCP](https://github.com/coredhcp/coredhcp) plugin called [coresmd](https://github.com/OpenCHAMI/coresmd). This links up with SMD to build out the config files and also provides TFTP based on the node's architecture. This allows us to boot many types of systems.\n\n#### iPXE\niPXE is also something we should all be familiar with. 
OpenCHAMI interacts with iPXE via BSS, as explained above, but does not control the entire workflow.\n\nWe continue to use iPXE because it is in all firmware at this point. HTTP booting is becoming more popular but not all vendors are building that into their firmware just yet. \n\n#### Dracut\nOpenCHAMI doesn't directly interact with the dracut init stage, but we can insert parameters into BSS that can have an effect here. \nOne example is an NFS-provided rootfs. \n\n#### Boot process\nA summary of the boot process can be seen here:\n![OpenCHAMI Network Boot](images/ochami-netboot-dark.jpg)\n\n\n### Containers and Microservices\nMicroservices, in very few words, are defined by two concepts:\n- Independently deployable\n- Loosely coupled\n\nThink of it as building a castle with Lego pieces instead of carving it out of a single piece of wood or something. \nA good, in-depth overview of microservices can be found [here](https://microservices.io/)\n\nContainers are pretty ubiquitous now and I'm sure we've all had some ratio of positive and negative experiences.\nThere are a lot of container orchestrators ranging from fairly simple like Docker and Podman to more complicated like Kubernetes. \nOpenCHAMI DOES NOT CARE about whatever orchestrator you choose to use. The strategy recommended is to follow the opt-in complexity model.\nIn other words, don't start with the most complicated deployment, start simple.\n\nThe OpenCHAMI microservices are distributed as containers, which allows for a flexible deployment model. \nYou can see all the available OpenCHAMI containers [here](https://github.com/orgs/OpenCHAMI/packages)\n\nThe deployment we'll use in this guide will leverage container volumes and networks to hold persistent data and route traffic accordingly. \n\n## Deploying OpenCHAMI\nWe have a set of [Deployment Recipes](https://github.com/OpenCHAMI/deployment-recipes.git) available on the [OpenCHAMI GitHub](https://github.com/OpenCHAMI). 
\nWe are going to use a specific one, the LANL [podman-quadlets](https://github.com/OpenCHAMI/deployment-recipes/tree/trcotton/podman-quadlets/lanl/podman-quadlets) recipe. We will have to modify some of the configs to match our cluster, but we'll get to that.  \nFirst pull down the deployment-recipes repo from the OpenCHAMI GitHub.\n```bash\ngit clone https://github.com/OpenCHAMI/deployment-recipes.git\n```\nGo to the cloned repo and the LANL podman-quadlets recipes\n```bash\ncd deployment-recipes/lanl/podman-quadlets\n```\nHere you will have to make some local changes that match your system\n\n### Setup the inventory\nThe inventory is a single node so just change `inventory/01-ochami` and set\n```ini\n[ochami]\ndemo-head.si.usrc\n```\nto be the value of `hostname` (demo-head.si.usrc in this case).\n\n### Set cluster names\nPick a cluster name and shortname. These examples use `demo` and `nid` respectively.  \nThese are set in `inventory/group_vars/ochami/cluster.yaml`\n```yaml\ncluster_name: \"demo\"\ncluster_shortname: \"nid\"\n```\n\n### Setup a private SSH key pair\nGenerate an SSH key pair if one doesn't exist\n```bash\nssh-keygen\n```\nJust hit enter 'til you get the prompt back.  \nNow we take the contents of `~/.ssh/id_rsa.pub` and set it in our inventory.  \nIn `inventory/group_vars/ochami/cluster.yaml`\n```yaml\ncluster_boot_ssh_pub_key: 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDZW66ja\u003csnip\u003e = root@st-head'\n```\nReplace what is there with what `ssh-keygen` created. Make sure it is the pub key. \n\n### Populate nodes\nNow we need to populate `inventory/group_vars/ochami/nodes.yaml`. This describes your cluster in a flat yaml file. \nIt will look something like:\n```yaml\nnodes:\n  - bmc_ipaddr: 172.16.0.101\n    ipaddr: 172.16.0.1\n    mac: ec:e7:a7:05:a1:fc\n    nid: 1\n    xname: x1000c1s7b0n0\n    group: compute\n    name: nid001\n```\nYour cluster has 9 computes, so you will have 9 entries. \nThe really important bits here are the MACs. 
Everything else is made up and you can mostly leave it alone except for the `name`, which you should change to match your `cluster_shortname`. \n\n#### Getting the MACs\nWe are gonna grab the MACs from redfish. \nMake a script `gen_nodes_file.sh` (and you guys are gonna be so impressed)\n```bash\n#!/bin/bash\nnid=1\nSN=${SN:-nid}\nif [ -z \"$rf_pass\" ]\nthen\n        \u003e\u00262 echo 'ERROR: rf_pass not set, needed for BMC credentials'\n        exit 1\nfi\necho \"nodes:\"\nfor i in {1..9}\ndo\n        # NIC MAC Address\n        NDATA=$(curl -sk -u \"$rf_pass\" https://172.16.0.10${i}/redfish/v1/Chassis/FCP_Baseboard/NetworkAdapters/Nic259/NetworkPorts/NICChannel0)\n        if [[ $? -ne 0 ]]\n        then\n                \u003e\u00262 echo \"172.16.0.10${i} unreachable, generating a random MAC\"\n                NRMAC=$(printf '02:00:00:%02x:%02x:%02x\\n' $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256)))\n                NDATA=\"{\\\"AssociatedNetworkAddresses\\\": [\\\"$NRMAC\\\"]}\"\n        fi\n        NIC_MAC=$(echo $NDATA | jq -r '.AssociatedNetworkAddresses|.[]')\n\n        # BMC MAC Address\n        BDATA=$(curl -sk -u \"$rf_pass\" https://172.16.0.10${i}/redfish/v1/Managers/bmc/EthernetInterfaces/eth0)\n        if [[ $? 
-ne 0 ]]\n        then\n                \u003e\u00262 echo \"Could not find BMC MAC address for node with IP 172.16.0.${i}, generating a random one\"\n                BRMAC=$(printf '02:00:00:%02x:%02x:%02x\\n' $((RANDOM%256)) $((RANDOM%256)) $((RANDOM%256)))\n                BDATA=\"{\\\"MACAddress\\\": \\\"$BRMAC\\\"}\"\n        fi\n        BMC_MAC=$(echo $BDATA | jq .MACAddress | tr -d '\"')\n\n        # Print node config\n        echo \"- name: ${SN}00${i}\n  xname: x1000c1s7b${i}n0\n  nid: ${nid}\n  group: compute\n  bmc_mac: ${BMC_MAC}\n  bmc_ip: 172.16.0.10${i}\n  interfaces:\n  - mac_addr: ${NIC_MAC}\n    ip_addrs:\n    - name: management\n      ip_addr: 172.16.0.${i}\"\n\n        nid=$((nid+1))\ndone\n```\nSet the following variables\n```bash\nexport SN=\u003ccluster-shortname\u003e\nexport rf_pass=\"admin:Password123!\"\n```\nThen `chmod +x gen_nodes_file.sh` and run it\n```bash\n./gen_nodes_file.sh \u003e nodes.yaml\n```\nIf a node's BMC does not respond, the script will generate a random MAC address. You can fix it later. \nYou can then copy that to your ansible inventory (and replace the nodes.yaml that is there).\n\n### Running the OpenCHAMI playbook\nAlmost done. Run the provided playbook:\n```bash\nansible-playbook -l $HOSTNAME -c local -i inventory ochami_playbook.yaml\n```\n\nShould take a minute or two to start everything and populate the services.  \nAt the end you should have these containers running:\n```bash\n# podman ps --noheading | awk '{print $NF}' | sort\nbss\ncloud-init-server\ncoresmd\nhaproxy\nhydra\nimage-server\nopaal\nopaal-idp\npostgres\nsmd\nstep-ca\ntpm-manager\n```\n\n### Verifying things look OK\nThe playbook created a profile script `/etc/profile.d/ochami.sh`. So unless you log out and back in you'll be missing some ENV settings. You can also just `source /etc/profile.d/ochami.sh` without logging out. 
\n\nCreate a CA cert\n```bash\nget_ca_cert \u003e /etc/pki/ca-trust/source/anchors/ochami.pem\nupdate-ca-trust \n```\nThe cert will expire in 24 hours. You can regenerate certs with\n```\nsystemctl restart acme-deploy\nsystemctl restart acme-register\nsystemctl restart haproxy\n```\nThis would go great in a cron job.\n\nWe are going to use `ochami` as a CLI tool to interact with the OpenCHAMI\nservices. We can get the latest RPM from GitHub:\n```\nlatest_release_url=$(curl -s https://api.github.com/repos/OpenCHAMI/ochami/releases/latest | jq -r '.assets[] | select(.name | endswith(\"amd64.rpm\")) | .browser_download_url')\ncurl -L \"${latest_release_url}\" -o ochami.rpm\n```\nNow, we can install it:\n```\ndnf install -y ./ochami.rpm\n```\nMake sure it works:\n```\nochami --help\n```\n\nThis tool comes with manual pages. See **ochami**(1) for more.\n\nNow, we will need to generate a config file for `ochami`. Generate a system-wide\nconfig file by running (press **y** to confirm creation):\n```\nochami config --system cluster set --default --base-uri https://demo.openchami.cluster:8443 demo\n```\nThis creates a cluster called \"demo\", sets its base URI, and sets it as the\ndefault cluster (i.e. the cluster to use when none is specified when running\n`ochami`). `ochami` appends the paths for the services and endpoints it\ncommunicates with to the base URI.\n\nLet's also change the logging format to be a nicer format other than JSON:\n```\nochami config --system set log.format basic\n```\nLet's take a look at our config to make sure things are set correctly:\n```\nochami config show\n```\nIt should look like this:\n```yaml\nlog:\n    format: basic\n    level: warning\ndefault-cluster: demo\nclusters:\n    - name: demo\n      cluster:\n        base-uri: https://demo.openchami.cluster:8443\n```\n\nNow, we need to generate a token for the \"demo\" cluster. 
`ochami` reads this\nfrom `\u003cCLUSTER_NAME\u003e_ACCESS_TOKEN` where `\u003cCLUSTER_NAME\u003e` is the configured name\nof the cluster in all capitals. This is `DEMO` in our case. Let's set the token:\n```bash\nexport DEMO_ACCESS_TOKEN=$(gen_access_token)\n```\n\nCheck SMD is populated with `ochami smd component get | jq`\n```json\n{\n  \"Components\": [\n    {\n      \"Enabled\": true,\n      \"ID\": \"x1000c1s7b1\",\n      \"Type\": \"Node\"\n    },\n    {\n      \"Enabled\": true,\n      \"Flag\": \"OK\",\n      \"ID\": \"x1000c1s7b1n0\",\n      \"NID\": 1,\n      \"Role\": \"Compute\",\n      \"State\": \"On\",\n      \"Type\": \"Node\"\n    },\n    {\n      \"Enabled\": true,\n      \"ID\": \"x1000c1s7b2\",\n      \"Type\": \"Node\"\n    },\n    {\n      \"Enabled\": true,\n      \"Flag\": \"OK\",\n      \"ID\": \"x1000c1s7b2n0\",\n      \"NID\": 2,\n      \"Role\": \"Compute\",\n      \"State\": \"On\",\n      \"Type\": \"Node\"\n    },\n    ...\n]\n```\nYou should see:\n```json\n    {\n      \"Enabled\": true,\n      \"ID\": \"x1000c1s7bN\",\n      \"Type\": \"Node\"\n    },\n    {\n      \"Enabled\": true,\n      \"Flag\": \"OK\",\n      \"ID\": \"x1000c1s7bNn0\",\n      \"NID\": 1,\n      \"Role\": \"Compute\",\n      \"State\": \"On\",\n      \"Type\": \"Node\"\n    },\n```\nfor each `N` (in the xname) from 1-9, inclusive.\n\nCheck BSS is populated with `ochami bss boot params get | jq`\n```json\n[\n  {\n    \"cloud-init\": {\n      \"meta-data\": null,\n      \"phone-home\": {\n        \"fqdn\": \"\",\n        \"hostname\": \"\",\n        \"instance_id\": \"\",\n        \"pub_key_dsa\": \"\",\n        \"pub_key_ecdsa\": \"\",\n        \"pub_key_rsa\": \"\"\n      },\n      \"user-data\": null\n    },\n    \"initrd\": \"http://172.16.0.254:8080/openchami/compute-slurm/latest/initramfs-4.18.0-553.27.1.el8_10.x86_64.img\",\n    \"kernel\": \"http://172.16.0.254:8080/openchami/compute-slurm/latest/vmlinuz-4.18.0-553.27.1.el8_10.x86_64\",\n    \"macs\": 
[\n      \"ec:e7:a7:05:a1:fc\",\n      \"ec:e7:a7:05:a2:28\",\n      \"ec:e7:a7:05:93:84\",\n      \"ec:e7:a7:02:d9:90\",\n      \"02:00:00:a8:4f:04\",\n      \"ec:e7:a7:05:96:74\",\n      \"02:00:00:97:c4:2e\",\n      \"ec:e7:a7:05:93:48\",\n      \"ec:e7:a7:05:9f:50\"\n    ],\n    \"params\": \"root=live:http://172.16.0.254:8080/openchami/compute-slurm/latest/rootfs-4.18.0-553.27.1.el8_10.x86_64 ochami_ci_url=http://172.16.0.254:8081/cloud-init/ ochami_ci_url_secure=http://172.16.0.254:8081/cloud-init-secure/ overlayroot=tmpfs overlayroot_cfgdisk=disabled nomodeset ro ip=dhcp apparmor=0 selinux=0 console=ttyS0,115200 ip6=off network-config=disabled rd.shell\"\n  }\n]\n```\nWe'll have to update these values later when we build a test image. But for now we can see that it is at least working...\n\nCheck cloud-init is populated with `ochami cloud-init data get compute`\n```yaml\n#cloud-config\nruncmd:\n- setenforce 0\n- systemctl disable firewalld\nwrite_files:\n- content: ssh-rsa AAAAB3Nz\u003csnip\u003e root@st-head.si.usrc\n  path: /root/.ssh/authorized_keys\n```\nWe only setup authorized keys on the computes for now. \n\n### Building a test image\nWe'll build a test image real quick to boot into. 
Won't be anything special.\n\nFirst install `buildah`\n```bash\ndnf install -y buildah\n```\nCreate a blank container\n```bash\nCNAME=$(buildah from scratch)\n```\nMount it\n```bash\nMNAME=$(buildah mount $CNAME)\n```\nInstall some base packages\n```bash\ndnf groupinstall -y --installroot=$MNAME --releasever=8 \"Minimal Install\"\n```\nInstall the kernel and some needed dracut stuff:\n```bash\ndnf install -y --installroot=$MNAME kernel dracut-live fuse-overlayfs cloud-init\n```\nThen rebuild the initrd so that during dracut it will download the image and mount the rootfs as an in-memory overlay\n```bash\nbuildah run --tty $CNAME bash -c ' \\\n    dracut \\\n    --add \"dmsquash-live livenet network-manager\" \\\n    --kver $(basename /lib/modules/*) \\\n    -N \\\n    -f \\\n    --logfile /tmp/dracut.log 2\u003e/dev/null \\\n    '\n```\nThen commit it\n```bash\nbuildah commit $CNAME test-image:v1\n```\nWhile we're here we'll get the initrd, vmlinuz, and build a rootfs to boot from. \nWe have a container that holds all three of these items; we just need to pull them out. \n\nSet up a directory to store these. 
We'll use an nginx container to serve these out later on.\n```bash\nmkdir -p /data/domain-images/openchami/rocky/test\n```\n\nGet the kernel version of the image\n```bash\nKVER=$(ls $MNAME/lib/modules)\n```\nIf you have more than one kernel installed then something went very wrong\n\nGet the initrd and vmlinuz\n```bash\ncp $MNAME/boot/initramfs-$KVER.img /data/domain-images/openchami/rocky/test\nchmod o+r /data/domain-images/openchami/rocky/test/initramfs-$KVER.img\ncp $MNAME/boot/vmlinuz-$KVER /data/domain-images/openchami/rocky/test\n```\n\nNow let's make a squashfs of the rootfs\n```bash\nmksquashfs $MNAME /data/domain-images/openchami/rocky/test/rootfs-$KVER -noappend -no-progress\n```\n\nAfter all this you should have something that looks like so\n```bash\n[root@st-head ~]# ls -l /data/domain-images/openchami/rocky/test/\ntotal 1244104\n-rw----r-- 1 root root  102142693 Oct 16 09:04 initramfs-4.18.0-553.22.1.el8_10.x86_64.img\n-rw-r--r-- 1 root root 1160933376 Oct 16 09:07 rootfs-4.18.0-553.22.1.el8_10.x86_64\n-rwxr-xr-x 1 root root   10881352 Oct 16 09:04 vmlinuz-4.18.0-553.22.1.el8_10.x86_64\n```\nWe'll use these later. \n\nClean up the container stuff\n```bash\nbuildah umount $CNAME\nbuildah rm $CNAME\n```\n### Configure BSS\nWe need to update BSS to use this image.  \nModify `inventory/group_vars/ochami/bss.yaml` and set\n```yaml\nbss_kernel_version: '4.18.0-553.22.1.el8_10.x86_64'\nbss_image_version: 'rocky/test'\n```\nThe `bss_kernel_version` should match `echo $KVER` if that is still set or you can check `/data/domain-images/openchami/rocky/test/`. 
\n\nUpdate BSS to use these new settings:\n```bash\nansible-playbook -l $HOSTNAME -c local -i inventory -t bss ochami_playbook.yaml\n```\nYou can check to make sure it got set correctly with\n```bash\nochami bss boot params get | jq\n```\n\n## Booting nodes\nLet's open like, I don't know, 4-5 windows.\nYou should be able to boot nodes now, but lets start with just one\n```bash\npm -1 nid001\n```\nand watch the console\n```bash\nconman nid001\n```\n\nChecking the logs will help debug boot issues and/or see the nodes interacting with the OpenCHAMI services.\nRun all these in separate windows...\n\nWatch incoming DHCP requests. \n```bash\npodman logs -f coresmd\n```\n\nCheck BSS requests.\n```bash\npodman logs -f bss\n```\n\nCheck cloud-init requests:\n```bash\npodman logs -f cloud-init-server\n```\n\n## Digging in\nAt this point you should be able to boot the test image and have all the fancy OpenCHAMI services running.\nNow we can dive into things and get a better picture of what is going on\n\n### SMD\nWe haven't really poked at SMD yet. There are a lot of endpoints but we are only really using these:\n\n| **Endpoint**                  | **`ochami` Command**   |\n| ----------------------------- | ---------------------- |\n| /State/Components             | `ochami smd component` |\n| /Inventory/ComponentEndpoints | `ochami smd compep`    |\n| /Inventory/RedfishEndpoints   | `ochami smd rfe`       |\n| /Inventory/EthernetInterfaces | `ochami smd iface`     |\n| /groups                       | `ochami smd group`     |\n\nAs shown in the table, the `ochami` command can be used to deal with these\nendpoints directly. Feel free to play around with it. For those that want to dig\naround using `curl`, you'll need the `DEMO_ACCESS_TOKEN` we created earlier. 
If\nit expired, regenerate it with:\n```bash\nexport DEMO_ACCESS_TOKEN=$(gen_access_token)\n```\n`SMD_URL` should be set already but confirm with `echo $SMD_URL`\n\nYou can use:\n```bash\ncurl -sH \"Authorization: Bearer $DEMO_ACCESS_TOKEN\" $SMD_URL/\u003cendpoint\u003e\n```\nto see all the fun data.\n\n- `/State/Components` holds all the Components. You should see your nodes and BMCs here. The xnames are pointless in this context but SMD REQUIRES THEM. I hate it.  \n- `/Inventory/ComponentEndpoints` is an intermediary endpoint. You don't directly interact with this endpoint.  \n- `/Inventory/RedfishEndpoints` is where the BMC data is stored. If you DELETE `/Inventory/RedfishEndpoints` then `/Inventory/ComponentEndpoints` will also get deleted.  \n- `/Inventory/EthernetInterfaces` is where all the interfaces are stored. IPs and MACs are mapped to Component IDs\n- `/groups` is where the group information is stored\n\n### BSS\nBSS only has two endpoints we care about.\n\n| **Endpoint**    | **`ochami` Command**     |\n| --------------- | ------------------------ |\n| /bootparameters | `ochami bss boot params` |\n| /bootscript     | `ochami bss boot script` |\n\nYou'll need `DEMO_ACCESS_TOKEN` for one of these and `BSS_URL` will need to be\nset (which it should be already).\n\n- `/bootparameters` requires a token. Running `curl -sH \"Authorization: Bearer $DEMO_ACCESS_TOKEN\" $BSS_URL/bootparameters` should show you all your bootparams with the associated MACs.\n- `/bootscript` can be accessed via HTTP (so nodes can get things during iPXE) and doesn't require a token. But you'll need to pick a valid MAC (pick one from the previous command output).\n`curl \"$BSS_URL/bootscript?mac=ec:e7:a7:05:a1:fc\"` should show this node's iPXE chain. \n\n### cloud-init\nCloud-init is a little strange at the moment and is still being worked on. 
\nThis is how it works right now:\n- During systemd-init, cloud-init will start and it will try to use a data source\n- We are using the [NoCloud](https://cloudinit.readthedocs.io/en/latest/reference/datasources/nocloud.html) datasource. \n- This can take a URL in the kernel parameters as its remote source (`ds=nocloud;s=http://172.16.0.254:8081/cloud-init/`)\n- The node does not have to specify which node it is when making the request\n- The cloud-init-server will inspect the IP that is making the request, then try to find it in the `/Inventory/EthernetInterfaces`\n- If a match is found then the `ComponentID` is returned.\n- The server then checks to see if the `ComponentID` is a member of any SMD groups\n- Then the server will see if the group (if found) has any associated cloud-init data\n- Then the server will see if the node has any node-specific cloud-init data\n- It will merge the two data sources where the node-specific entries \"win\" over the group sources\n- Then it will return the generated `user-data` and `meta-data`\n\nPopulating the cloud-init-server is relatively straightforward.\nHere is an example:\n```yaml\nname: compute\ncloud-init:\n  userdata:\n    write_files:\n      - path: /etc/test123\n        content: 'blah blah blah'\n    runcmd:\n      - echo hello\n  metadata:\n    instance-id: test\n```\n- The `name` is called the `IDENTIFIER` and it can be an xname or a group name (it can be whatever you want actually; it doesn't check at all right now).\n- `cloud-init` is the top-level structure and it's where you store the `userdata` and `metadata` content. \n- `userdata` and `metadata` are cloud-init specific directives. \n  - `userdata` is where you use cloud-init [modules](https://cloudinit.readthedocs.io/en/latest/reference/modules.html) to perform tasks at boot time. The example above is using two modules: `write_files` and `runcmd`, and I think you can figure out what they do. \n  - `metadata` is just a dictionary of key-value pairs. 
You can add whatever you want here. Cloud-init does support jinja2 templating but the cloud-init-server isn't working with that just yet. \n\nTo post data to the endpoint your payload needs to be in JSON, so you'll have to convert it. Save the above example to a file called `test.yaml`\n```bash\npython3 -c 'import sys, yaml, json; print(json.dumps(yaml.safe_load(sys.stdin)))' \u003c test.yaml | jq \u003e test.json\n```\n\nThen you can \n```bash\ncurl -X PUT -H \"Content-Type: application/json\" $CLOUD_INIT_URL/compute -d @test.json\n```\nThen\n```bash\ncurl -s $CLOUD_INIT_URL/compute | jq\n```\n\nThe `ochami` tool makes it a little bit easier to add things. However, it\nexpects an array of cloud-init configs since it can add/update many configs at\nonce. We can make this conversion easily:\n```bash\necho \"[$(cat test.json)]\" | python3 -c 'import sys, yaml, json; print(yaml.dump(json.load(sys.stdin)))' \u003e test2.yaml\n```\nThen, pass it to the tool:\n```bash\nochami cloud-init config update --payload-format yaml --payload test2.yaml\nochami cloud-init data get compute\n```\n\nYou can also get the exact cloud-init payloads that a node will get when booting by hitting the `/cloud-init/\u003cname\u003e/{user-data, meta-data}`\nFor example:\n```bash\ncurl -s $CLOUD_INIT_URL/compute/user-data\ncurl -s $CLOUD_INIT_URL/compute/meta-data\n\ncurl -s $CLOUD_INIT_URL/x1000c1s7b0n0/user-data\ncurl -s $CLOUD_INIT_URL/x1000c1s7b0n0/meta-data\n```\nThe `ochami` equivalents of the above commands are (note that `--user` is\noptional for fetching user-data):\n```bash\nochami cloud-init data get --user compute\nochami cloud-init data get --meta compute\n\nochami cloud-init data get x1000c1s7b0n0\nochami cloud-init data get --meta x1000c1s7b0n0\n```\n\nThe response you get will depend on `x1000c1s7b0n0` having node specific cloud-init data.\nLet's try something. 
Copy `test2.yaml` to `x1000c1s7b0n0.yaml` and add something different:\n```yaml\nname: x1000c1s7b0n0\ncloud-init:\n  userdata:\n    write_files:\n      - path: /etc/test123\n        content: 'blah blah blah but different'\n  metadata:\n    instance-id: test\n```\n\nThen add it to cloud-init\n```bash\nochami cloud-init config add --payload-format yaml -f x1000c1s7b0n0.yaml\n```\nThen get the data\n```bash\ncurl -s $CLOUD_INIT_URL/x1000c1s7b0n0/user-data\n```\nor use the `ochami` equivalent:\n```bash\nochami cloud-init data get x1000c1s7b0n0\n```\nWhat does it look like?\n\n### CoreDHCP\nWe currently use CoreDHCP as our DHCP provider. CoreDHCP is useful because it is\nplugin-based. All incoming DHCP packets are filtered through a list of plugins,\neach of which can optionally modify the response and either pass it through to\nthe next plugin or return the response to the client. This is very useful for\ncustomizing functionality.\n\nThe version of CoreDHCP that OpenCHAMI uses is built with a plugin called\n\"coresmd\" that checks if MAC addresses requesting an IP address exist in SMD and\nserves their corresponding IP address and BSS boot script URL. There is also\nanother plugin called \"bootloop\" that is optional and can be used as a catch-all\nto continuously reboot requesting nodes whose MAC address is unknown to\nSMD.[^bootloop]\n\n[^bootloop]: The reason for rebooting continuously is so that unknown nodes\n  continuously try to get a new IP address so that in the case these nodes are\n  added to SMD, they can get their IP address with a longer lease. Rebooting is\n  the default behavior, but the bootloop plugin allows customization of the\n  behavior.\n\nAnsible will place the CoreDHCP config file at\n`/etc/ochami/configs/coredhcp.yaml`. Feel free to take a look. 
See\n[here](https://github.com/OpenCHAMI/deployment-recipes/blob/main/quickstart/DHCP.md)\nfor a more in-depth description of how to configure CoreDHCP for OpenCHAMI on a\n\"real\" system.\n\nThe \"coresmd\" plugin contains its own TFTP server that serves out the iPXE\nbootloader binary matching the system CPU architecture. You can see these here:\n```\npodman exec coresmd ls /tftpboot\n```\nFor more advanced \"bootloop\" plugin config (if used), one can put a custom iPXE\nscript in this directory and then replace `default` in the bootloop config line\nwith the name of the script file to have that script execute instead of\nrebooting.\n\nCoreDHCP, as OpenCHAMI has it, does not handle DNS itself, but rather outsources\nto other DNS servers (see the `dns` directive in the config file).\n\nFinally, if the static mapping of MAC addresses to IP addresses is required for\nunknown nodes, the CoreDHCP \"file\" plugin can be added below the coresmd line in\nthe config file. See the DHCP.md document linked above for more details.\n\n### podman volumes, networks, and quadlets oh my\nThis is not related to OpenCHAMI specifically, but this deployment recipe uses lots of podman concepts.\nThese pretty much apply to docker too, though maybe not a perfect 1:1.\n\n#### Volumes\nVolumes can be pretty flexible.\nThe most commonly seen volume is a bind mount: `podman run -v \u003chost-dir\u003e:\u003ccontainer-dir\u003e ...`. The `-v` flag is for volume, and in this case you are mapping a directory that exists on the host to a directory inside the container. There are a lot of mount options; the default is `rw` (read-write), and you can append `:ro` to make the mount read-only. 
\n\nThe second way (and the one the quadlets use, mostly) is to create a volume with podman.\n```bash\npodman volume create test-volume\n```\nThen you can list all volumes\n```bash\npodman volume ls\n```\nand inspect one\n```bash\npodman volume inspect test-volume\n```\nwhich shows you some data about the volume (`Mountpoint` is interesting).\n\nYou can also mount the volume (this just returns that `Mountpoint` value)\n```bash\nMNAME=$(podman volume mount test-volume)\ntouch $MNAME/test\n```\nSo let's use this volume with a container\n```bash\npodman run --name test1 --replace -it -v test-volume:/data docker.io/bash\n```\nWe named this container `test1` with the `--name` flag, referenced the volume by name, and ran a bash container. You should have a shell in a container.\nWhat does `ls /data/` show?\n\nStart a different container in a separate terminal\n```bash\npodman run --name test2 --replace -it -v test-volume:/data docker.io/bash\ntouch /data/test2\n```\n\nWe mounted the same volume. Then we touched a new file in the volume.\nWhat does `ls /data/` show in the `test2` container?\n\nYou can see how volumes allow containers to share files and keep those files in a persistent volume.\n\n#### Networks\nWhen a podman container is started it is by default added to a `podman` network.\nStart a test container again:\n```bash\npodman run --name test1 --replace -it docker.io/bash\n```\nThen in another window\n```bash\npodman inspect test1 | jq -r '.[] | .NetworkSettings.Networks'\n```\nYou should see some fun things like IP and MAC. This is for the podman container. 
\nYou can also see it is a part of the `podman` network.\n\nIn your `test1` container run\n```bash\nip a\n```\nDoes it match what you saw in the inspect?\n\nYou can also create your own networks\n```bash\npodman network create ext-network\n```\nYou can view all the networks with\n```bash\npodman network ls\n```\nLet's use this network, but this time we'll start an nginx container.\nWe'll also use that volume we created earlier\n```bash\npodman run --name test-webserver --replace -d --network ext-network -v test-volume:/usr/share/nginx/html docker.io/nginx\n```\nNow run the inspect again. What do you see? Can you ping its IP?\nSSH to a compute node. Can you still ping it?\n\nWhat if you try to `curl` a file from this container? (Substitute the IP you got from the inspect.)\n```bash\ncurl -O http://10.89.5.5/test\n```\n\nBy default, when you create a podman network it gets set to be `external`.\nWhat this means is that podman will create firewall rules to forward traffic to this container.\n\nWhat if we don't want that? `podman network create` has an `--internal` flag that will stop podman from setting up these rules.\nCreate a new network\n```bash\npodman network create --internal int-network\n```\nNow let's repeat the steps from before\n```bash\npodman run --name test-webserver --replace -d --network int-network -v test-volume:/usr/share/nginx/html docker.io/nginx\n```\nNote the network in use changed. Inspect it again to get the IP.\nThen, substituting that IP:\n```bash\ncurl -O http://10.89.4.3/test\n```\nDid it work?\n\nLeave that container running and start another bash container\n```bash\npodman run --name test-curl --replace -it --network int-network docker.io/bash\n```\nThe bash container doesn't have `curl`, so use `wget` against the same webserver IP\n```bash\nwget http://10.89.4.3/test\n```\nDid it work?\n\nPodman networks let you isolate traffic between sets of containers and let you hide containers running on the host. We use a variety of them in this deployment. 
\n\n#### Quadlets\nNow that you know how all the pieces work and how you can combine them, we can look at how we manage all these pieces.  \nThere are a lot of ways to manage containers. \nDocker has `docker-compose`.\nThere's Kubernetes. \nPodman even has a `podman-compose` (but at the moment it is not great). Quadlets are another way to manage containers but are specifically meant to work with systemd. \n\nQuadlets have all the functionality of running with `podman run`. And there are a LOT of options (`podman run --help` to see for yourself). \nThe difference is that we write these options to files that get generated into systemd services. This works great with something like Ansible because we can template out our container files.  \n\nThe quadlet files are located in `/etc/containers/systemd`.  \nIn that folder create a file called `test-webserver.container` with the following\n```ini\n[Unit]\nDescription=The test-webserver container\n\n[Container]\nContainerName=test-webserver\nHostName=test-webserver\nImage=docker.io/library/nginx:latest\n\n[Service]\nRestart=always\n```\nThen run `systemctl daemon-reload`.\nYou should now be able to control the container with systemd.\nSee the status\n```bash\nsystemctl status test-webserver\n```\nLooks like it is not running... let's start it\n```bash\nsystemctl start test-webserver\n```\nCheck the status again. \nWe should also be able to see it running with `podman ps`.\n\nWe didn't attach any volumes or networks though...\nLet's create a volume to house our webserver data.  
\nIn `/etc/containers/systemd` create a file called `webserver-data.volume` with the following\n```ini\n[Unit]\nDescription=test-webserver data volume\n[Volume]\nVolumeName=webserver-data\n```\nNow, edit the `test-webserver.container` file and in the `[Container]` section add\n```\nVolume=webserver-data.volume:/usr/share/nginx/html:ro\n```\nRun the following\n```bash\nsystemctl daemon-reload\nsystemctl restart test-webserver\n```\nLet's check if it is using this new volume...\n```bash\npodman inspect test-webserver | jq -r '.[]|.Mounts'\n```\nLooks like we have our volume in place. \nWe created an empty volume, so there is nothing for our webserver to... serve, but you can add data in a variety of ways:\n- have another container populate it\n- mount from the host (`Volume=\u003chost-dir\u003e:\u003ccontainer-dir\u003e`)\n- create the volume from an image. This is a fun one\n  ```bash\n  CNAME=$(buildah from scratch)\n  MNAME=$(buildah mount $CNAME)\n  echo \"HELLO\" \u003e $MNAME/test\n  buildah commit $CNAME test-volume-image\n  podman volume create --driver image --opt image=test-volume-image:latest fun-volume\n  podman run --name test-curl --replace -it -v fun-volume:/data  docker.io/bash cat /data/test\n  ```\n\nNetworks with quadlets are pretty straightforward.  \nMake a file in `/etc/containers/systemd` named `webserver-net.network` with the following\n```ini\n[Unit]\nDescription=webserver network\n\n[Network]\nNetworkName=webserver-net\nInternal=True\n```\nThis will create an internal network.  
\nTo use it, add the following to the `[Container]` section\n```\nNetwork=webserver-net.network\n```\nOnce you reload systemd and restart the webserver\n```bash\nsystemctl daemon-reload\nsystemctl restart test-webserver\n```\nyou should be able to see the container is now using this network\n```bash\npodman inspect test-webserver | jq -r '.[] | .NetworkSettings.Networks'\n```\nNow your webserver will only work on that podman network.\n\nQuadlets make it easy to interface containers, and their volumes and networks, with systemd features. \nThere is a lot we won't cover, but you should be able to look at the quadlet files dropped by Ansible and get a clearer picture of how it all works. \n\n### Ansible\nThere's not much to cover here; this is more of an informational topic. This deployment uses Ansible to create all the quadlet files, alongside a number of other utility roles that drop config files or populate the OpenCHAMI services.  \nThe meat of the deployment comes from the `roles/quadlet` role. In the `templates` directory you find:\n- container.j2\n- network.j2\n- volume.j2\n\nThese are aptly named and are templates to create Container, Network, and Volume quadlet files. \n\nThe variables for these templates are in `inventory/group_vars/quadlets.yaml`. They are pretty verbose so I will let you go through them on your own, but most of the variables should be readable. Once you have a good idea of how the templating works, the rest of it is pretty easy:\n- Drop Network templates\n- Drop Volume templates\n- Drop Container templates\n- Start containers\n\nThe only thing we haven't really covered is the dependencies of the OpenCHAMI services. We leverage systemd functionality to determine the startup order of the containers by setting\n```ini\n[Unit]\nRequires=\nAfter=\n```\nfor a container's dependency. 
The whole startup chain looks like this:\n![OpenCHAMI Network Boot](images/podman_quadlets_flow_dark.png)\n\n## Image-Build Tool\nIn the Prep section we built a test image using Buildah. The image-build tool does pretty much the same thing but is a fancy Python wrapper around Buildah.  \nWe can build more complicated images and layer them with the `image-build` tool using simple config files. \n\nGet the tool source:\n```bash\ngit clone https://github.com/OpenCHAMI/image-builder\n```\n\nThen, from inside the cloned `image-builder` directory, build the DNF version:\n```bash\npodman build -t image-builder:test -f dockerfiles/dnf/Dockerfile .\n```\nNow we have a container that will build other containers. Yay  \nLet's grab the image configs\n```bash\ngit clone https://github.com/OpenCHAMI/mini-bootcamp.git\ncd mini-bootcamp\n```\n\nAll the configs are in `image-configs` and are YAML-based. \nYou shouldn't have to make many changes, but go over the yaml files and make sure `parent` and `publish_registry` both point to `registry.dist.si.usrc:5000/\u003ccluster_name\u003e`.\n\nLet's build a base image.\n```bash\npodman run --device /dev/fuse -it --name image-builder --rm -v $PWD:/data image-builder:test 'image-build --log-level INFO --config /data/image-configs/base.yaml'\n```\nThis will push to the `registry.dist.si.usrc:5000/stratus`, which you can then pull from and build on top of.\nLet's do that...\n```bash\npodman run --device /dev/fuse -it --name image-builder --rm -v $PWD:/data image-builder:test 'image-build --log-level INFO --config /data/image-configs/compute-base.yaml'\n```\nLet's keep going...\n```bash\npodman run --device /dev/fuse -it --name image-builder --rm -v $PWD:/data image-builder:test 'mkdir -p /tmp/dnf_test/log; image-build --log-level INFO --config /data/image-configs/compute-mlnx.yaml'\n```\nand then install slurm\n```bash\npodman run --device /dev/fuse -it --name image-builder --rm -v $PWD:/data image-builder:test 'mkdir -p /tmp/dnf_test/log; image-build --log-level INFO --config 
/data/image-configs/compute-slurm.yaml'\n```\n\nLook at us, we built a bunch of layers and now we have an image we can boot. \n\nThe layers are all sitting in the `registry.dist.si.usrc:5000` registry, which means we can pull them and create a bootable image just like before\n```bash\npodman pull --tls-verify=false registry.dist.si.usrc:5000/stratus/compute-slurm:latest\n```\nAgain, make sure you are using your endpoint.\n\nMake a directory to hold our kernel, initrd, and rootfs squash\n```bash\nmkdir -p /data/domain-images/openchami/compute-slurm/latest\n```\n\nThen get all the things\n```bash\nMNAME=$(podman image mount registry.dist.si.usrc:5000/stratus/compute-slurm)\nKVER=$(ls $MNAME/lib/modules)\ncp -f $MNAME/boot/vmlinuz-$KVER /data/domain-images/openchami/compute-slurm/latest/\ncp -f $MNAME/boot/initramfs-$KVER.img /data/domain-images/openchami/compute-slurm/latest/\nchmod o+r /data/domain-images/openchami/compute-slurm/latest/initramfs-$KVER.img\nmksquashfs $MNAME /data/domain-images/openchami/compute-slurm/latest/rootfs-$KVER -noappend -no-progress\n```\n\nNow we have an image we can use. Let's do that.\nUpdate your BSS inventory in the deployment recipe: `inventory/group_vars/ochami/bss.yaml`\n```yaml\nbss_kernel_version: '4.18.0-553.22.1.el8_10.x86_64'\nbss_image_version: 'compute-slurm/latest'\n```\nYour kernel version may be different so pay attention...  \n\nWe also added a bunch of cloud-init configs we did not really cover. \nIn the `mini-bootcamp` repo there is an `image-configs/files` directory. These files get added to the base image and they control how cloud-init is run. 
\nThese files enable a two-stage cloud-init: one stage for the regular insecure configs and another for the 'secrets' or secure configs.\n\nThe short story: update this variable in your inventory to look like this:\n```yaml\nbss_params_cloud_init: 'ochami_ci_url=http://{{ cluster_boot_ip }}:8081/cloud-init/ ochami_ci_url_secure=http://{{ cluster_boot_ip }}:8081/cloud-init-secure/'\n```\n\nThen rerun the BSS role in Ansible:\n```bash\nansible-playbook -l $HOSTNAME -c local -i inventory -t bss ochami_playbook.yaml\n```\nand check to make sure your BSS settings look good\n```bash\nochami bss boot params get | jq\n```\n\nSo now we are using a two-step cloud-init process. The current configs you have in cloud-init will still work as they are right now.\nBut we can now add \"secret\" data to cloud-init that requires a JWT to access during the boot process. \n\nWe haven't covered it very much, but the `tpm-manager` will drop a JWT on a compute node during its boot process. It requires that the first cloud-init run drop an SSH key that allows `tpm-manager` to SSH to the node.  \nThis allows the node to get its secret data. \n\nThe data is populated the same way the insecure cloud-init is. 
\nFor example, to drop a munge key let's first generate one\n```bash\ndnf install -y munge\ncreate-munge-key\ncat /etc/munge/munge.key | base64\n```\n\nMake a cloud-init payload file called `compute-secure.yaml` that looks something like\n```yaml\nname: compute\ncloud-init:\n  userdata:\n    ssh_deletekeys: false\n    write_files:\n      - content: |\n          w7MwDvqASzXqq8pRk2K4Vd8Hs0/sdyEMs4S0BHn1AOU6PAkXSRO3dnomOLX+15IIR7DFzGyUpyBS\n          EZN1mG2tB8aeosVGn8MZ9uLtYrQT4Nbb1aiPvpxEuZsFcrzGogS+TRs8NmbC4HMyUwJtxFpw5Q==\n        path: /etc/munge/munge.key\n        permissions: '0400'\n        owner: 'munge:munge'\n        encoding: base64\n  metadata:\n    instance-id: ochami-compute-secure\n```\n\nThen add it to the secure cloud-init endpoint\n```bash\nochami cloud-init --secure config add --payload-format yaml -f compute-secure.yaml\n```\n\nWhen the node next boots, it will attempt to get the secure data after the JWT is dropped. If it gets something (like the munge key), it will apply it just like the first cloud-init run. This would be a great place to put the ssh host keys...\n\n\n## Slurm\nSlurm is outside the scope of OpenCHAMI, but we wouldn't have a complete(ish) system without some kind of resource management. \nWe already built an image with the slurm client so all we need to do is configure the slurm controller and we're done. Right?\n\n### Slurm Controller\nInstall the controller\n```bash\ndnf install -y slurm-slurmctld-ohpc\n```\nThen create the config file. 
You'll have to set the `ControlMachine` to your head node's hostname and update the compute node hardware specifics...\n```\nClusterName=demo\nControlMachine=\u003chead-node-hostname\u003e\nSlurmUser=slurm\nSlurmctldPort=6817\nSlurmdPort=6818\nAuthType=auth/munge\nStateSaveLocation=/var/spool/slurmctld\nSlurmdSpoolDir=/var/spool/slurmd\nSwitchType=switch/none\nMpiDefault=none\nSlurmctldPidFile=/var/run/slurmctld.pid\nSlurmdPidFile=/var/run/slurmd.pid\nProctrackType=proctrack/pgid\nLaunchParameters=use_interactive_step\nInteractiveStepOptions=\"-n1 -N1 --mem-per-cpu=0 --interactive --pty --preserve-env --mpi=none $SHELL\"\nSlurmctldTimeout=300\nSlurmdTimeout=300\nInactiveLimit=0\nMinJobAge=300\nKillWait=30\nWaittime=0\nSchedulerType=sched/backfill\nSelectType=select/cons_tres\nSelectTypeParameters=CR_Core\nSlurmctldDebug=info\nSlurmctldLogFile=/var/log/slurmctld.log\nSlurmdDebug=info\nSlurmdLogFile=/var/log/slurmd.log\nTaskPlugin=task/affinity\nPropagateResourceLimitsExcept=MEMLOCK\nJobCompType=jobcomp/filetxt\nEpilog=/etc/slurm/slurm.epilog.clean\nNodeName=nid[001-009] Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN\nPartitionName=cluster Nodes=nid[001-009] Default=YES MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE\nSlurmctldParameters=enable_configless\nReturnToService=1\nRebootProgram=/sbin/reboot\nResumeTimeout=600\n```\nSetup logging for `slurmctld`\n```bash\ntouch /var/log/slurmctld.log\nchown slurm:slurm /var/log/slurmctld.log\n```\nThen start munge and the controller\n```bash\nsystemctl start munge slurmctld\n```\n\nWe have munge and slurmctld running, but we'll also need chrony\n```bash\ndnf install -y chrony\n```\nThen set the `/etc/chrony.conf` to something like...\n```ini\npool 2.rocky.pool.ntp.org iburst\ndriftfile /var/lib/chrony/drift\nmakestep 1.0 3\nrtcsync\nallow 172.16.0.0/24\nkeyfile /etc/chrony.keys\nleapsectz right/UTC\nlogdir /var/log/chrony\n```\nThen start and enable chronyd\n```bash\nsystemctl enable --now chronyd\n```\n\nWe 
have our controller running, but what about the compute settings?\n\n### Unsecure Cloud-init\nLet's use cloud-init! We'll need a couple of things:\n- slurm configs\n- chrony configs\n- a populated `/etc/hosts`\n\nSince we'll be using cloud-init, let's start with an unsecure cloud-init config file, `compute.yaml`.  \nWe'll start by populating it with an SSH pub key. Get the contents of your pub key.\n```bash\ncat ~/.ssh/id_rsa.pub | base64\n```\nAnd update `compute.yaml` to look something like\n```yaml\n- name: compute\n  cloud-init:\n    userdata:\n      ssh_deletekeys: false\n      write_files:\n        - content: |\n            c3NoLXJzYSBBQUFBQjNOemFDMXljMkVBQUFBREFRQUJBQUFCZ1FDWXJ2bkI3TmlVaWovZGM0M214\n            d1JnWUttUDhUdUF0dG5TK0ptRXZ0OU9xeGxxclVLaHVHSWU1Zk16b0pRM2VaTVcveW96bUZYQmlV\n            Wlk2dzJPQUtWOFNJaUJTQ0xkTTk1K0RMOGdvNER1dldqNE1RdXkweFB6ZHpuR0FMY1UralVjZHow\n            MUt6aDFhUFJOWkJZRFBNVy9sQlJRa2w2MzNHamVZRU5KOG1UcVpFSkNKeFJPZ2VPbGFxOE11TkZO\n            aGVyVjZDeHhtMmF2R1VuYTlNOU55ZEJTZE9lR2ZPTjRjd3R5ZktpSXFXVUpETEppYkw3dVN0Nnd2\n            V3dGR2ZPaHdHZ3Yvc2ZYQmZlSG5oRTYyL245ejhEcDRLdDJnallqRWlNVTVZM0paR1A0ZWpQUnB5\n            eHN6alRQZmJydE9QK2RlYmphMlFSTi9nTDZDNTFDOUcwd2ZxQkNLUkRwa1RkK0cwd1FJbGZrKy9x\n            NUU2blgzR3FxYllFbFBCVEU2NHlRQUpkT0ROUFltK2Q3SmEwUkN1dW45RytuK3ExNmtGUGdhak5y\n            S1VXTmkzZUVhaVU0OVk0WEdHZlBWK1h1ZENydUdSNXExSGZMaUZCcjdOTWVJK3pGcUNVbmdlOHFB\n            ZjN5Vll4dnJXL1VjdGx0S1d1ckVNSmM2c1luS1hnYU1QbWxwZGs9IHJvb3RAc3QtaGVhZC5zaS51\n            c3JjCg==\n          path: /root/.ssh/authorized_keys\n          encoding: base64\n    metadata:\n      instance-id: ochami-compute\n```\nReplace the `content` section with your pub key output.  \nWe'll use this to build on going forward.\n\nThe slurm configs are easy: we'll take advantage of Slurm's configless setting. 
Update the `write_files` section of `compute.yaml` to include\n```yaml\n- content: |\n    SLURMD_OPTIONS=--conf-server 172.16.0.254:6817\n  path: /etc/sysconfig/slurmd\n```\n\nNow let's do chrony, another pretty easy one. Update `write_files` again to include:\n```yaml\n- content: |\n    server 172.16.0.254 iburst\n    driftfile /var/lib/chrony/drift\n    makestep 1.0 3\n    rtcsync\n    keyfile /etc/chrony.keys\n    leapsectz right/UTC\n    logdir /var/log/chrony\n  path: /etc/chrony.conf\n```\nNow let's drop an `/etc/hosts` file on our nodes. There are probably better ways to do this, but we'll keep it simple. Add the following to the `write_files` section, and keep in mind the hostname of your head node:\n```yaml\n- content: |\n    172.16.0.254    demo-head\n    172.16.0.1      nid001\n    172.16.0.2      nid002\n    172.16.0.3      nid003\n    172.16.0.4      nid004\n    172.16.0.5      nid005\n    172.16.0.6      nid006\n    172.16.0.7      nid007\n    172.16.0.8      nid008\n    172.16.0.9      nid009\n    172.16.0.101    nid-bmc001\n    172.16.0.102    nid-bmc002\n    172.16.0.103    nid-bmc003\n    172.16.0.104    nid-bmc004\n    172.16.0.105    nid-bmc005\n    172.16.0.106    nid-bmc006\n    172.16.0.107    nid-bmc007\n    172.16.0.108    nid-bmc008\n    172.16.0.109    nid-bmc009\n  path: /etc/hosts\n```\nOk... we have all the files we need, so let's add some commands to run post-boot. 
Add a section called `runcmd` on the same level as `write_files` under the `userdata` section with\n```yaml\nruncmd:\n  - setenforce 0\n  - systemctl stop firewalld\n  - systemctl restart chronyd\n```\nWe're setting SELinux to permissive mode and stopping the firewall, plus telling chrony to restart after its config is dropped.\n\nThe entire `compute.yaml` should look something like this:\n```yaml\n- name: compute\n  cloud-init:\n    userdata:\n      ssh_deletekeys: false\n      write_files:\n        - content: |\n            c3NoLXJzYSBBQUFBQjNOemFDMXljMkVBQUFBREFRQUJBQUFCZ1FDWXJ2bkI3TmlVaWovZGM0M214\n            d1JnWUttUDhUdUF0dG5TK0ptRXZ0OU9xeGxxclVLaHVHSWU1Zk16b0pRM2VaTVcveW96bUZYQmlV\n            Wlk2dzJPQUtWOFNJaUJTQ0xkTTk1K0RMOGdvNER1dldqNE1RdXkweFB6ZHpuR0FMY1UralVjZHow\n            MUt6aDFhUFJOWkJZRFBNVy9sQlJRa2w2MzNHamVZRU5KOG1UcVpFSkNKeFJPZ2VPbGFxOE11TkZO\n            aGVyVjZDeHhtMmF2R1VuYTlNOU55ZEJTZE9lR2ZPTjRjd3R5ZktpSXFXVUpETEppYkw3dVN0Nnd2\n            V3dGR2ZPaHdHZ3Yvc2ZYQmZlSG5oRTYyL245ejhEcDRLdDJnallqRWlNVTVZM0paR1A0ZWpQUnB5\n            eHN6alRQZmJydE9QK2RlYmphMlFSTi9nTDZDNTFDOUcwd2ZxQkNLUkRwa1RkK0cwd1FJbGZrKy9x\n            NUU2blgzR3FxYllFbFBCVEU2NHlRQUpkT0ROUFltK2Q3SmEwUkN1dW45RytuK3ExNmtGUGdhak5y\n            S1VXTmkzZUVhaVU0OVk0WEdHZlBWK1h1ZENydUdSNXExSGZMaUZCcjdOTWVJK3pGcUNVbmdlOHFB\n            ZjN5Vll4dnJXL1VjdGx0S1d1ckVNSmM2c1luS1hnYU1QbWxwZGs9IHJvb3RAc3QtaGVhZC5zaS51\n            c3JjCg==\n          path: /root/.ssh/authorized_keys\n          encoding: base64\n        - content: |\n            SLURMD_OPTIONS=--conf-server 172.16.0.254:6817\n          path: /etc/sysconfig/slurmd\n        - content: |\n            server 172.16.0.254 iburst\n            driftfile /var/lib/chrony/drift\n            makestep 1.0 3\n            rtcsync\n            keyfile /etc/chrony.keys\n            leapsectz right/UTC\n            logdir /var/log/chrony\n          path: /etc/chrony.conf\n        - content: |\n            172.16.0.254    demo-head\n            172.16.0.1      
nid001\n            172.16.0.2      nid002\n            172.16.0.3      nid003\n            172.16.0.4      nid004\n            172.16.0.5      nid005\n            172.16.0.6      nid006\n            172.16.0.7      nid007\n            172.16.0.8      nid008\n            172.16.0.9      nid009\n            172.16.0.101    nid-bmc001\n            172.16.0.102    nid-bmc002\n            172.16.0.103    nid-bmc003\n            172.16.0.104    nid-bmc004\n            172.16.0.105    nid-bmc005\n            172.16.0.106    nid-bmc006\n            172.16.0.107    nid-bmc007\n            172.16.0.108    nid-bmc008\n            172.16.0.109    nid-bmc009\n          path: /etc/hosts\n      runcmd:\n        - setenforce 0\n        - systemctl stop firewalld\n        - systemctl restart chronyd\n    metadata:\n      instance-id: ochami-compute\n```\n### Secure cloud-init\nNow let's update our secure stuff. We already have a munge key from before and it should look something like\n```yaml\nname: compute\ncloud-init:\n  userdata:\n    ssh_deletekeys: false\n    write_files:\n      - content: |\n          w7MwDvqASzXqq8pRk2K4Vd8Hs0/sdyEMs4S0BHn1AOU6PAkXSRO3dnomOLX+15IIR7DFzGyUpyBS\n          EZN1mG2tB8aeosVGn8MZ9uLtYrQT4Nbb1aiPvpxEuZsFcrzGogS+TRs8NmbC4HMyUwJtxFpw5Q==\n        path: /etc/munge/munge.key\n        permissions: '0400'\n        owner: 'munge:munge'\n        encoding: base64\n  metadata:\n    instance-id: ochami-compute-secure\n```\nAll we need to add here is a `runcmd` to start munge and slurmd. These commands live here because, although the config files are dropped in a previous step, munge and slurmd require the `munge.key`, which isn't in place until the second cloud-init has run.  
\nAdd this to the `userdata` section\n```yaml\nruncmd:\n  - systemctl start munge\n  - systemctl start slurmd\n```\n\nNow let's apply these new configs\n```bash\nochami cloud-init config update --payload-format yaml -f compute.yaml\nochami cloud-init --secure config update --payload-format yaml -f compute-secure.yaml\n```\nThe next time you reboot the nodes, slurm should (hopefully) be working!\n\n## The Rest...\nNow you know how to \n- Update boot parameters with BSS\n- Build images with the image-build tool\n- Update and use cloud-init\n\nWhat else would you need to make this a system that can run jobs?\n\n- accounts? on the compute?\n- some kind of PE?\n- what about network mounted filesystems?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenchami%2Fmini-bootcamp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenchami%2Fmini-bootcamp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenchami%2Fmini-bootcamp/lists"}