{"id":36688037,"url":"https://github.com/converged-computing/flux-apps-helm","last_synced_at":"2026-01-12T11:16:37.966Z","repository":{"id":276592799,"uuid":"929724513","full_name":"converged-computing/flux-apps-helm","owner":"converged-computing","description":"Deploy HPC applications to Kubernetes using helm charts","archived":false,"fork":false,"pushed_at":"2025-09-22T22:23:23.000Z","size":5494,"stargazers_count":1,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-23T00:19:32.256Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/converged-computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":"COPYRIGHT","agents":null,"dco":null,"cla":null}},"created_at":"2025-02-09T08:33:06.000Z","updated_at":"2025-09-22T22:23:25.000Z","dependencies_parsed_at":"2025-04-18T02:16:19.302Z","dependency_job_id":"5cdc471b-97ca-4e5a-aae4-50d59db39e70","html_url":"https://github.com/converged-computing/flux-apps-helm","commit_stats":null,"previous_names":["converged-computing/flux-apps-helm"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/converged-computing/flux-apps-helm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Fflux-apps-helm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Fflux-apps-helm/tags","releases_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Fflux-apps-helm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Fflux-apps-helm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/converged-computing","download_url":"https://codeload.github.com/converged-computing/flux-apps-helm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/converged-computing%2Fflux-apps-helm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28338970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T10:58:46.209Z","status":"ssl_error","status_checked_at":"2026-01-12T10:58:42.742Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-12T11:16:37.908Z","updated_at":"2026-01-12T11:16:37.957Z","avatar_url":"https://github.com/converged-computing.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Flux Operator Apps\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15665233.svg)](https://doi.org/10.5281/zenodo.15665233)\n\nThese are simple helm charts to run HPC applications in Kubernetes using the Flux Operator. 
You can customize each application to your needs, from the container, to size, to iterations, etc. We have a simple strategy that uses:\n\n - [base-template](base-template): A base template MiniCluster that is used across apps.\n - Applications: each is included in a subdirectory here. Usage is consistent across applications, with the exception of the application-specific parameters. For each application, those are documented in the respective README.\n\n## Overview\n\nEach application can be customized for anything related to the MiniCluster (e.g., size, flux view, logging, TBA resources), and anything related to the application itself (parameters, containers, etc). Given the use of a common template, the actual definition of the application is fairly small (and thus they are easy to write). This is a nice approach because:\n\n- We don't require extra software installed into the MiniCluster\n- An application definition is simple (and can be written easily / quickly)\n- Changing logic for the MiniCluster only needs to be done in one place!\n- Applications can be programmatically built and tested (when possible)\n- Experiments can be orchestrated by using these helm charts with a custom values.yaml for each application (see our example runs [in the Google Performance Study](https://github.com/converged-computing/google-performance-study/tree/main/experiments/gke/cpu/size-128)).\n\n## Variables\n\nThe following variables are available for every experiment, and already part of the template.  
Variables with a default will have the default set, otherwise the flag (or similar) is usually left out.\n\n| Name  | Description | Default | Options |\n|-------|-------------|---------|---------|\n| nodes | Number of nodes `-N` for each job | 1 | |\n| tasks | Number of tasks `-n` for each job | unset | |\n| cpu_affinity | Set `--cpu-affinity` | `per-task` | `(off,per-task,map:LIST,on)` | \n| gpu_affinity | Set `--gpu-affinity` | `off` | `(off,per-task,map:LIST,on)` |\n| run_threads | sets `OMP_NUM_THREADS` | unset | |\n| cores_per_task | Set `--cores-per-task` | unset | |\n| exclusive | Add the `--exclusive` flag | unset | |\n\nYou define them via `--set experiment.\u003cname\u003e=\u003cvalue\u003e` or in a values.yaml to create the experiment from:\n\n```yaml\nexperiment:\n  nodes: 5\n```\n\nExperiment specific variables are defined in the values.yaml files associated with the experiment.\n\n## Usage\n\nThis example will walk through running lammps. Other example runs are [also provided below](#examples).\n\n### 1. Setup the Cluster\n\nFor simple local development:\n\n```bash\n# Create the cluster\nkind create cluster --config ./kind-config.yaml\n```\n\nFor ebpf (that requires mounting the host) I recommend a cloud:\n\n```bash\nNODES=1\nGOOGLE_PROJECT=llnl-flux\nINSTANCE=h3-standard-88\nctime gcloud container clusters create test-cluster  --threads-per-core=1  --num-nodes=$NODES --machine-type=$INSTANCE  --placement-type=COMPACT --image-type=UBUNTU_CONTAINERD --region=us-central1-a --project=${GOOGLE_PROJECT}\n\n# When time to delete\ngcloud container clusters delete test-cluster --region=us-central1-a\n```\n\nFinally, install the Flux Operator\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml\n```\n\n### 2. 
View Values\n\nHere are the values we can customize (any value can be exposed, it's very simple).\n\n```bash\nhelm show values ./lammps-reax\n```\n```console\n# Default values for lammps experiment\n# This is a YAML-formatted file.\n# Declare variables to be passed into your templates.\n\n# Logging (quiet will hide flux setup)\nlogging:\n  quiet: true\n\nexperiment:\n  iterations: 1\n  nodes: 1\n  tasks: 2\n\nenv:\n  app: lammps\n\nlammps:\n  binary: /usr/bin/lmp\n  input: in.reaxff.hns\n  x: 2\n  y: 2\n  z: 2\n  \nflux:\n  image: ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy\n\nminicluster:\n  # Container image\n  image: \"ghcr.io/converged-computing/lammps-reax:ubuntu2204\"\n\n  # Interactive MiniCluster?\n  interactive: false\n  \n  # MiniCluster size\n  size: 1\n  \n  # Number of NVIDIA gpus\n  gpus: 0\n\n  # Add flux on the fly (set to false if Flux is already in the container)\n  addFlux: false\n```\n\nIf there are changes to the base template:\n\n```bash\nhelm dependency update lammps-reax/\nhelm install lammps lammps-reax/ --debug --dry-run\n```\n\n### 3. Install LAMMPS Chart\n\nThen install the chart. This will deploy the Flux MiniCluster and run lammps for some number of iterations. All variables have defaults, so you don't strictly need any `--set`; the flags below are overrides.\n\n```bash\ncontainer=$(ocifit ghcr.io/converged-computing/lammps-reax --instance)\nhelm install \\\n  --set minicluster.size=1 \\\n  --set minicluster.image=${container} \\\n  --set minicluster.addFlux=true \\\n  lammps ./lammps-reax\n```\n```console\nNAME: lammps\nLAST DEPLOYED: Sun May 11 13:10:50 2025\nNAMESPACE: default\nSTATUS: deployed\nREVISION: 1\nTEST SUITE: None\n```\n\nTo see the defaults directly, look at [the chart](./lammps-reax/values.yaml).\n\nIf you want to debug or otherwise print to the console:\n\n```bash\nhelm template --debug \\\n  --set minicluster.size=4 \\\n  --set minicluster.image=ghcr.io/converged-computing/metric-lammps-cpu:zen4-reax \\\n  ./lammps-reax\n```\n\n### 4. 
View Output\n\nThe output can be seen in the lead broker pod!\n\n```bash\nkubectl logs lammps-0-xxxx -f\n```\n\n\u003cdetails\u003e\n\n\u003csummary\u003e LAMMPS output \u003c/summary\u003e\n\n```console\nDefaulted container \"lammps\" out of: lammps, flux-view (init)\n#!/bin/bash\nset -euo pipefail\nmkdir -p /tmp/output\nflux resource list\n\nfor i in {1..1}\ndo\n  echo \"FLUX-RUN START lammps-iter-$i\"\n  flux run --setattr=user.study_id=lammps-iter-$i -N1 -n 2 -o cpu-affinity=per-task -o gpu-affinity=off      /usr/bin/lmp -v x 2 -v y 2 -v z 2 -in in.reaxff.hns -nocite |\u0026 tee /tmp/lammps.out\n  \n   echo \"FLUX-RUN END lammps-iter-$i\"\ndone\n\n\n     STATE NNODES   NCORES    NGPUS NODELIST\n      free      1        8        0 lammps-0\n allocated      0        0        0 \n      down      0        0        0 \nFLUX-RUN START lammps-iter-1\nLAMMPS (17 Apr 2024 - Development - a8687b5)\nOMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)\n  using 1 OpenMP thread(s) per MPI task\nReading data file ...\n  triclinic box = (0 0 0) to (22.326 11.1412 13.778966) with tilt (0 -5.02603 0)\n  2 by 1 by 1 MPI processor grid\n  reading atoms ...\n  304 atoms\n  reading velocities ...\n  304 velocities\n  read_data CPU = 0.001 seconds\nReplication is creating a 2x2x2 = 8 times larger system...\n  triclinic box = (0 0 0) to (44.652 22.2824 27.557932) with tilt (0 -10.05206 0)\n  2 by 1 by 1 MPI processor grid\n  bounding box image = (0 -1 -1) to (0 1 1)\n  bounding box extra memory = 0.03 MB\n  average # of replicas added to proc = 5.00 out of 8 (62.50%)\n  2432 atoms\n  replicate CPU = 0.000 seconds\nNeighbor list info ...\n  update: every = 20 steps, delay = 0 steps, check = no\n  max neighbors/atom: 2000, page size: 100000\n  master list distance cutoff = 11\n  ghost atom cutoff = 11\n  binsize = 5.5, bins = 10 5 6\n  2 neighbor lists, perpetual/occasional/extra = 2 0 0\n  (1) pair reaxff, perpetual\n      attributes: half, newton off, 
ghost\n      pair build: half/bin/ghost/newtoff\n      stencil: full/ghost/bin/3d\n      bin: standard\n  (2) fix qeq/reax, perpetual, copy from (1)\n      attributes: half, newton off\n      pair build: copy\n      stencil: none\n      bin: none\nSetting up Verlet run ...\n  Unit style    : real\n  Current step  : 0\n  Time step     : 0.1\nPer MPI rank memory allocation (min/avg/max) = 143.9 | 143.9 | 143.9 Mbytes\n   Step          Temp          PotEng         Press          E_vdwl         E_coul         Volume    \n         0   300           -113.27833      437.52134     -111.57687     -1.7014647      27418.867    \n        10   299.38517     -113.27631      1439.2511     -111.57492     -1.7013814      27418.867    \n        20   300.27107     -113.27884      3764.3921     -111.57762     -1.7012246      27418.867    \n        30   302.21063     -113.28428      7007.6315     -111.58335     -1.7009364      27418.867    \n        40   303.52265     -113.28799      9844.7899     -111.58747     -1.7005187      27418.867    \n        50   301.87059     -113.28324      9663.0837     -111.58318     -1.7000523      27418.867    \n        60   296.67807     -113.26777      7273.8688     -111.56815     -1.6996136      27418.867    \n        70   292.2         -113.25435      5533.5999     -111.55514     -1.6992157      27418.867    \n        80   293.58679     -113.25831      5993.3978     -111.55946     -1.6988534      27418.867    \n        90   300.62637     -113.27925      7202.8885     -111.58069     -1.6985591      27418.867    \n       100   305.38276     -113.29357      10085.741     -111.59518     -1.6983875      27418.867    \nLoop time of 9.43821 on 2 procs for 100 steps with 2432 atoms\n\nPerformance: 0.092 ns/day, 262.173 hours/ns, 10.595 timesteps/s, 25.768 katom-step/s\n99.8% CPU use with 2 MPI tasks x 1 OpenMP threads\n\nMPI task timing breakdown:\nSection |  min time  |  avg time  |  max time  |%varavg| 
%total\n---------------------------------------------------------------\nPair    | 6.9119     | 7.0673     | 7.2228     |   5.8 | 74.88\nNeigh   | 0.11603    | 0.11763    | 0.11922    |   0.5 |  1.25\nComm    | 0.013927   | 0.16934    | 0.32476    |  37.8 |  1.79\nOutput  | 0.00029069 | 0.00029232 | 0.00029395 |   0.0 |  0.00\nModify  | 2.0813     | 2.0829     | 2.0845     |   0.1 | 22.07\nOther   |            | 0.0006819  |            |       |  0.01\n\nNlocal:           1216 ave        1216 max        1216 min\nHistogram: 2 0 0 0 0 0 0 0 0 0\nNghost:         7591.5 ave        7597 max        7586 min\nHistogram: 1 0 0 0 0 0 0 0 0 1\nNeighs:         432912 ave      432942 max      432882 min\nHistogram: 1 0 0 0 0 0 0 0 0 1\n\nTotal # of neighbors = 865824\nAve neighs/atom = 356.01316\nNeighbor list builds = 5\nDangerous builds not checked\nTotal wall time: 0:00:09\n```\n\n\u003c/details\u003e\n\nTo clean up the run, you need to uninstall:\n\n```bash\nhelm uninstall lammps\n```\n\nIf you specify more than one iteration (what we often do for running experiments) each will be done.\n\n```bash\nhelm install \\\n  --set minicluster.size=1 \\\n  --set minicluster.image=ghcr.io/converged-computing/metric-lammps-cpu:zen4-reax \\\n  --set experiment.iterations=2 \\\n  --set minicluster.addFlux=true \\\n  lammps ./lammps-reax\n```\n\n### 5. Features Supported\n\n#### Flux Metadata\n\nFor actual experiments, we usually want to capture the total wrapped duration and other events from the workload manager, and be able to pipe the entire kubectl logs to file that we can parse later. 
That's easy to add:\n\n```bash\nhelm install \\\n  --set minicluster.save_logs=true \\\n  lammps ./lammps-reax\n```\n\nHere is how the output changes:\n\n\u003cdetails\u003e\n\n\u003csummary\u003e LAMMPS output with flux events\u003c/summary\u003e\n\n```console\nDefaulted container \"lammps\" out of: lammps, flux-view (init)\n#!/bin/bash\nset -euo pipefail\nmkdir -p /tmp/output\nflux resource list\n\nfor i in {1..1}\ndo\n  echo \"FLUX-RUN START lammps-iter-$i\"\n  flux run --setattr=user.study_id=lammps-iter-$i -N1 -n 2 -o cpu-affinity=per-task -o gpu-affinity=off      /usr/bin/lmp -v x 2 -v y 2 -v z 2 -in in.reaxff.hns -nocite |\u0026 tee /tmp/lammps.out\n  \n   echo \"FLUX-RUN END lammps-iter-$i\"\ndone\n\n\noutput=./results/${app}\n(apt-get update \u0026\u0026 apt-get install -y jq) || (yum update -y \u0026\u0026 yum install -y jq)\nmkdir -p $output\nfor jobid in $(flux jobs -a --json | jq -r .jobs[].id); do\n    echo\n    study_id=$(flux job info $jobid jobspec | jq -r \".attributes.user.study_id\")\n    echo \"FLUX-JOB START ${jobid} ${study_id}\"\n    echo \"FLUX-JOB-JOBSPEC START\"\n    flux job info $jobid jobspec\n    echo \"FLUX-JOB-JOBSPEC END\" \n    \n    echo \"FLUX-JOB-RESOURCES START\"\n    flux job info ${jobid} R\n    echo \"FLUX-JOB-RESOURCES END\"\n    echo \"FLUX-JOB-EVENTLOG START\" \n    flux job info $jobid guest.exec.eventlog\n    echo \"FLUX-JOB-EVENTLOG END\" \n    echo \"FLUX-JOB END ${jobid} ${study_id}\"\ndone\necho \"FLUX JOB STATS\"\nflux job stats         \n\n     STATE NNODES   NCORES    NGPUS NODELIST\n      free      1        8        0 lammps-0\n allocated      0        0        0 \n      down      0        0        0 \nFLUX-RUN START lammps-iter-1\nLAMMPS (17 Apr 2024 - Development - a8687b5)\nOMP_NUM_THREADS environment is not set. Defaulting to 1 thread. 
(src/comm.cpp:98)\n  using 1 OpenMP thread(s) per MPI task\nReading data file ...\n  triclinic box = (0 0 0) to (22.326 11.1412 13.778966) with tilt (0 -5.02603 0)\n  2 by 1 by 1 MPI processor grid\n  reading atoms ...\n  304 atoms\n  reading velocities ...\n  304 velocities\n  read_data CPU = 0.002 seconds\nReplication is creating a 2x2x2 = 8 times larger system...\n  triclinic box = (0 0 0) to (44.652 22.2824 27.557932) with tilt (0 -10.05206 0)\n  2 by 1 by 1 MPI processor grid\n  bounding box image = (0 -1 -1) to (0 1 1)\n  bounding box extra memory = 0.03 MB\n  average # of replicas added to proc = 5.00 out of 8 (62.50%)\n  2432 atoms\n  replicate CPU = 0.000 seconds\nNeighbor list info ...\n  update: every = 20 steps, delay = 0 steps, check = no\n  max neighbors/atom: 2000, page size: 100000\n  master list distance cutoff = 11\n  ghost atom cutoff = 11\n  binsize = 5.5, bins = 10 5 6\n  2 neighbor lists, perpetual/occasional/extra = 2 0 0\n  (1) pair reaxff, perpetual\n      attributes: half, newton off, ghost\n      pair build: half/bin/ghost/newtoff\n      stencil: full/ghost/bin/3d\n      bin: standard\n  (2) fix qeq/reax, perpetual, copy from (1)\n      attributes: half, newton off\n      pair build: copy\n      stencil: none\n      bin: none\nSetting up Verlet run ...\n  Unit style    : real\n  Current step  : 0\n  Time step     : 0.1\nPer MPI rank memory allocation (min/avg/max) = 143.9 | 143.9 | 143.9 Mbytes\n   Step          Temp          PotEng         Press          E_vdwl         E_coul         Volume    \n         0   300           -113.27833      437.52134     -111.57687     -1.7014647      27418.867    \n        10   299.38517     -113.27631      1439.2511     -111.57492     -1.7013814      27418.867    \n        20   300.27107     -113.27884      3764.3921     -111.57762     -1.7012246      27418.867    \n        30   302.21063     -113.28428      7007.6315     -111.58335     -1.7009364      27418.867    \n        40   303.52265     -113.28799  
    9844.7899     -111.58747     -1.7005187      27418.867    \n        50   301.87059     -113.28324      9663.0837     -111.58318     -1.7000523      27418.867    \n        60   296.67807     -113.26777      7273.8688     -111.56815     -1.6996136      27418.867    \n        70   292.2         -113.25435      5533.5999     -111.55514     -1.6992157      27418.867    \n        80   293.58679     -113.25831      5993.3978     -111.55946     -1.6988534      27418.867    \n        90   300.62637     -113.27925      7202.8885     -111.58069     -1.6985591      27418.867    \n       100   305.38276     -113.29357      10085.741     -111.59518     -1.6983875      27418.867    \nLoop time of 9.48714 on 2 procs for 100 steps with 2432 atoms\n\nPerformance: 0.091 ns/day, 263.532 hours/ns, 10.541 timesteps/s, 25.635 katom-step/s\n99.8% CPU use with 2 MPI tasks x 1 OpenMP threads\n\nMPI task timing breakdown:\nSection |  min time  |  avg time  |  max time  |%varavg| %total\n---------------------------------------------------------------\nPair    | 6.8829     | 7.0529     | 7.2229     |   6.4 | 74.34\nNeigh   | 0.11578    | 0.11587    | 0.11596    |   0.0 |  1.22\nComm    | 0.010545   | 0.18042    | 0.35029    |  40.0 |  1.90\nOutput  | 0.00031558 | 0.00032584 | 0.0003361  |   0.0 |  0.00\nModify  | 2.1369     | 2.137      | 2.137      |   0.0 | 22.52\nOther   |            | 0.0006946  |            |       |  0.01\n\nNlocal:           1216 ave        1216 max        1216 min\nHistogram: 2 0 0 0 0 0 0 0 0 0\nNghost:         7591.5 ave        7597 max        7586 min\nHistogram: 1 0 0 0 0 0 0 0 0 1\nNeighs:         432912 ave      432942 max      432882 min\nHistogram: 1 0 0 0 0 0 0 0 0 1\n\nTotal # of neighbors = 865824\nAve neighs/atom = 356.01316\nNeighbor list builds = 5\nDangerous builds not checked\nTotal wall time: 0:00:09\nFLUX-RUN END lammps-iter-1\nGet:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]\nHit:2 http://archive.ubuntu.com/ubuntu jammy 
InRelease              \nGet:3 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2901 kB]\nGet:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]\nGet:5 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1245 kB]\nGet:6 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [4282 kB]\nGet:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]      \nGet:8 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3211 kB]\nGet:9 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1546 kB]\nGet:10 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [4436 kB]\nGet:11 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [83.2 kB]\nFetched 18.1 MB in 2s (8642 kB/s)                            \nReading package lists... Done\nReading package lists... Done\nBuilding dependency tree... Done\nReading state information... Done\njq is already the newest version (1.6-2.1ubuntu3).\n0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.\n\nFLUX-JOB START 6660554752 lammps-iter-1\nFLUX-JOB-JOBSPEC START\n{\"resources\": [{\"type\": \"node\", \"count\": 1, \"with\": [{\"type\": \"slot\", \"count\": 2, \"with\": [{\"type\": \"core\", \"count\": 1}], \"label\": \"task\"}]}], \"tasks\": [{\"command\": [\"/usr/bin/lmp\", \"-v\", \"x\", \"2\", \"-v\", \"y\", \"2\", \"-v\", \"z\", \"2\", \"-in\", \"in.reaxff.hns\", \"-nocite\"], \"slot\": \"task\", \"count\": {\"per_slot\": 1}}], \"attributes\": {\"system\": {\"duration\": 0, \"cwd\": \"/opt/lammps/examples/reaxff/HNS\", \"shell\": {\"options\": {\"rlimit\": {\"cpu\": -1, \"fsize\": -1, \"data\": -1, \"stack\": 8388608, \"core\": -1, \"nofile\": 1048576, \"as\": -1, \"rss\": -1, \"nproc\": -1}, \"cpu-affinity\": \"per-task\", \"gpu-affinity\": \"off\"}}}, \"user\": {\"study_id\": \"lammps-iter-1\"}}, \"version\": 1}\nFLUX-JOB-JOBSPEC END\nFLUX-JOB-RESOURCES 
START\n{\"version\": 1, \"execution\": {\"R_lite\": [{\"rank\": \"0\", \"children\": {\"core\": \"6-7\"}}], \"nodelist\": [\"lammps-0\"], \"starttime\": 1746991421, \"expiration\": 4900591421}}\nFLUX-JOB-RESOURCES END\nFLUX-JOB-EVENTLOG START\n{\"timestamp\":1746991421.5843747,\"name\":\"init\"}\n{\"timestamp\":1746991421.5915587,\"name\":\"shell.init\",\"context\":{\"service\":\"0-shell-fB9a5su\",\"leader-rank\":0,\"size\":1}}\n{\"timestamp\":1746991421.5945508,\"name\":\"shell.start\",\"context\":{\"taskmap\":{\"version\":1,\"map\":[[0,1,2,1]]}}}\n{\"timestamp\":1746991421.5846651,\"name\":\"starting\"}\n{\"timestamp\":1746991432.7146101,\"name\":\"shell.task-exit\",\"context\":{\"localid\":1,\"rank\":1,\"state\":\"Exited\",\"pid\":107,\"wait_status\":0,\"signaled\":0,\"exitcode\":0}}\n{\"timestamp\":1746991432.7171538,\"name\":\"complete\",\"context\":{\"status\":0}}\n{\"timestamp\":1746991432.7171805,\"name\":\"done\"}\n\nFLUX-JOB-EVENTLOG END\nFLUX-JOB END 6660554752 lammps-iter-1\nFLUX JOB STATS\n{\"job_states\":{\"depend\":0,\"priority\":0,\"sched\":0,\"run\":0,\"cleanup\":0,\"inactive\":1,\"total\":1},\"successful\":1,\"failed\":0,\"canceled\":0,\"timeout\":0,\"inactive_purged\":0,\"queues\":[]}\n```\n\n\u003c/details\u003e\n\nWe have functions for parsing this metadata out of the log, which we will provide in an associated library.\n\n#### Running Modes\n\nThe normal running mode assumes a distributed application (across nodes) and simply runs iterations and prints application output from the lead broker. However, we have a few custom modes for different cases.\n\n##### 1. Select combinations of pairs\n\nFor paired runs (between pairs of nodes) you might want to run something that selects samples from pairs. We support that with `experiment.pairs`. Here is how to select all 28 combinations of 8 nodes (the pairs parameter) taken 2 at a time, in a loop over two OSU benchmarks. 
This is sized to run in kind on a local machine, so you'd want to adjust the sizes for your cluster.\n\n```bash\nhelm dependency update osu-benchmarks\nfor app in osu_latency osu_bw\n  do\n  helm install \\\n  --set experiment.nodes=8 \\\n  --set minicluster.size=8 \\\n  --set minicluster.tasks=12 \\\n  --set minicluster.save_logs=true \\\n  --set experiment.pairs=8 \\\n  --set osu.binary=/opt/osu-benchmark/build.openmpi/mpi/pt2pt/$app \\\n  --set experiment.tasks=2 \\\n  osu osu-benchmarks/\n  sleep 5\n  time kubectl wait --for=condition=ready pod -l job-name=osu --timeout=600s\n  pod=$(kubectl get pods -o json | jq -r '.items[0].metadata.name')\n  kubectl logs ${pod} -f\n  helm uninstall osu\ndone\n```\n\nThe `sleep` isn't strictly necessary, but occasionally the deployment is slow enough that `kubectl wait` finds no pods yet and errors on the next line. To save to an output file, you would change the second-to-last line in the loop:\n\n```bash\n  kubectl logs ${pod} -f |\u0026 tee ./logs/$app.out\n```\n\n##### 2. Single Node Execution\n\nIf you have a single node benchmark, you might want to run one instance on each node in the cluster.  Here is an example of doing that with our single node benchmark.\n\n```bash\nhelm dependency update ./single-node\nhelm install \\\n  --set experiment.nodes=2 \\\n  --set minicluster.size=2 \\\n  --set minicluster.tasks=8 \\\n  --set experiment.tasks=1 \\\n  --set minicluster.save_logs=true \\\n  --set minicluster.show_logs=true \\\n  --set experiment.foreach=true \\\n  --set experiment.iterations=1 \\\n  single-node ./single-node\n\ntime kubectl wait --for=condition=ready pod -l job-name=single-node --timeout=600s\npod=$(kubectl get pods -o json | jq -r '.items[0].metadata.name')\nkubectl logs ${pod} -f\nhelm uninstall single-node\n```\n\nFor this setup, you'll see `flux submit` used, so the jobs run at the same time, one per node. The output is then presented later, with the flux events. 
This is why you want to set `minicluster.show_logs=true` to see that output.\n\n##### 3. Monitor with BCC\n\nThis setup will deploy a sidecar and monitor different interactions with bcc. We have several programs that help to understand tcp, file opens/closes, cpu, shared memory, or futex wait times. There are two approaches:\n\n- Multiple sidecars per pod: adds overhead (acceptable given what the HPC community already does), with the benefit of measuring the same thing between applications.\n- Single sidecar per pod (with metrics distributed across the cluster): low to zero overhead, and better for summary metrics or models. We use an algorithm to select from the set of programs you requested.\n\nAlthough for both approaches you can filter to a cgroup or command, by default we allow all containers in the pod to be seen. It generates a lot more data, but is interesting. Here is how to select a metric for the single sidecar per pod approach:\n\n```bash\nhelm install \\\n  --set monitor.programs=open_close \\\n  --set minicluster.save_logs=true \\\n  --dry-run lammps ./lammps-reax\n```\n\n\u003cdetails\u003e\n\n```console\nLooking for /opt/programs/open-close/ebpf-collect.c\nStarting eBPF (Tracepoint for open entry).\n\nStart Indicator file defined '/mnt/flux/start_ebpf_collection'. 
Waiting.\n{\"event\": \"OPEN\", \"command\": \"python3\", \"retval\": 12, \"ts_sec\": 779.540036095, \"tgid\": 0, \"tid\": 14554, \"ppid\": 14554, \"cgroup_id\": 0, \"filename\": \"/sys/bus/event_source/devices/kprobe/type\"}\n{\"event\": \"OPEN\", \"command\": \"python3\", \"retval\": 12, \"ts_sec\": 779.540051234, \"tgid\": 0, \"tid\": 14554, \"ppid\": 14554, \"cgroup_id\": 0, \"filename\": \"/sys/bus/event_source/devices/kprobe/format/retprobe\"}\n{\"event\": \"OPEN\", \"command\": \"containerd\", \"retval\": 193, \"ts_sec\": 779.629315113, \"tgid\": 0, \"tid\": 3600, \"ppid\": 3618, \"cgroup_id\": 0, \"filename\": \"/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/204/fs\"}\n{\"event\": \"CLOSE\", \"command\": \"containerd\", \"retval\": 0, \"ts_sec\": 779.629342785, \"tgid\": 1, \"tid\": 3600, \"ppid\": 3618, \"cgroup_id\": 6520}\n...\n{\"event\": \"OPEN\", \"command\": \"touch\", \"retval\": 3, \"ts_sec\": 803.043308743, \"tgid\": 0, \"tid\": 14883, \"ppid\": 14883, \"cgroup_id\": 3257288213055174703, \"filename\": \"/usr/lib/locale/C.utf8/LC_NUMERIC\"}\n{\"event\": \"CLOSE\", \"command\": \"touch\", \"retval\": 0, \"ts_sec\": 803.043310733, \"tgid\": 14414, \"tid\": 14883, \"ppid\": 14883, \"cgroup_id\": 13176}\n{\"event\": \"OPEN\", \"command\": \"touch\", \"retval\": 3, \"ts_sec\": 803.043316595, \"tgid\": 0, \"tid\": 14883, \"ppid\": 14883, \"cgroup_id\": 3257288213055174703, \"filename\": \"/usr/lib/locale/C.utf8/LC_CTYPE\"}\n{\"event\": \"CLOSE\", \"command\": \"touch\", \"retval\": 0, \"ts_sec\": 803.043318514, \"tgid\": 14414, \"tid\": 14883, \"ppid\": 14883, \"cgroup_id\": 13176}\n{\"event\": \"OPEN\", \"command\": \"touch\", \"retval\": 3, \"ts_sec\": 803.043359627, \"tgid\": 0, \"tid\": 14883, \"ppid\": 14883, \"cgroup_id\": 3257288213055174703, \"filename\": \"/mnt/flux/stop_ebpf_collection\"}\n{\"event\": \"CLOSE\", \"command\": \"touch\", \"retval\": 0, \"ts_sec\": 803.043360931, \"tgid\": 14414, \"tid\": 14883, \"ppid\": 
14883, \"cgroup_id\": 13176}\n\nIndicator file '/mnt/flux/stop_ebpf_collection' found. Stopping.\nCleaning up BPF resources...\n```\n\n\u003c/details\u003e\n\nHere is how to do multiple at once (each still a single sidecar):\n\n```bash\nhelm install \\\n  --set monitor.programs=\"cpu|shmem|tcp|futex|open_close\" \\\n  --set minicluster.save_logs=true \\\n  lammps ./lammps-reax\n```\n\nHere is how to deploy multiple sidecars:\n\n```bash\nhelm install \\\n  --set monitor.multiple=\"flamegraph|open_close\" \\\n  --set monitor.sleep=true \\\n  --set minicluster.save_logs=true \\\n  --dry-run lammps ./lammps-reax\n```\n\nFor the flamegraph, set `monitor.sleep=true` so the monitor container stays alive and you can copy the svg and folded files out afterward.\n\n### 6. Delete\n\nTo clean up:\n\n```bash\nhelm uninstall lammps\n```\n\n## Examples\n\nHere are all the examples.  For any example, you need to update dependencies before you run:\n\n```bash\nhelm dependency update ./\u003capp\u003e\n```\n```bash\nhelm install amg ./amg2023\nhelm install kripke ./kripke\nhelm install lammps ./lammps-reax\nhelm install laghos ./laghos\nhelm install minife ./minife\nhelm install mixbench ./mixbench\nhelm install mtgemm ./mt-gemm\nhelm install osu ./osu-benchmarks\nhelm install quicksilver ./quicksilver\nhelm install single-node ./single-node\nhelm install stream ./stream\n```\n\nAnd here is an example that uses a custom yaml file (better for reproducible experiments):\n\n```bash\nhelm install amg -f ./examples/amg2023/flux-minicluster.yaml ./amg2023\n```\n\n## License\n\nHPCIC DevTools is distributed under the terms of the MIT license.\nAll new contributions must be made under this license.\n\nSee [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),\n[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and\n[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.\n\nSPDX-License-Identifier: (MIT)\n\nLLNL-CODE- 
842614\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconverged-computing%2Fflux-apps-helm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconverged-computing%2Fflux-apps-helm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconverged-computing%2Fflux-apps-helm/lists"}