{"id":24714306,"url":"https://github.com/nvsl/cfiddle-cluster","last_synced_at":"2026-05-09T00:46:10.848Z","repository":{"id":190325999,"uuid":"681037360","full_name":"NVSL/cfiddle-cluster","owner":"NVSL","description":null,"archived":false,"fork":false,"pushed_at":"2024-05-02T06:28:32.000Z","size":313,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-27T08:16:12.092Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVSL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-21T06:10:37.000Z","updated_at":"2024-05-02T06:28:35.000Z","dependencies_parsed_at":"2024-05-02T07:41:27.265Z","dependency_job_id":"c26f6a89-899a-4bd8-b34c-d0024bd940af","html_url":"https://github.com/NVSL/cfiddle-cluster","commit_stats":null,"previous_names":["nvsl/cfiddle-cluster"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVSL%2Fcfiddle-cluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVSL%2Fcfiddle-cluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVSL%2Fcfiddle-cluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVSL%2Fcfiddle-cluster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVSL","download_url":"https://codeload.github.com/NVSL/cfiddle-cluster/tar.gz/refs/heads/m
ain","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244924992,"owners_count":20532898,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-27T08:16:16.139Z","updated_at":"2026-05-09T00:46:05.760Z","avatar_url":"https://github.com/NVSL.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A Containerized CFiddle Cluster Deployment\n\nThis repo describes/embodies how to deploy a cluster of machines for\nuse with CFiddle so that multiple users (e.g., students) can share\naccess to the machines while ensuring that their cfiddle experiments\nrun alone on the machine so their measurements are accurate.  It also\nensures that the users can't mess up the machines or interfere with\neach other's jobs.\n\nThe approach it takes is to use the [Slurm job\nscheduler](https://slurm.schedmd.com/documentation.html) to schedule\nthe execution of the cfiddle experiments.  It makes extensive use of\nDocker containers and Docker's Swarm and Stack facilities.\n\nThere are many other ways to accomplish the same goal.  The approach\nhere is meant to be the simplest, although it is not the most elegant.  It\nis the one the author is currently using in production.\n\nThe last section of the document describes other configurations that\nmight be desirable in your situation.\n\n## What This Repo Builds\n\n![Diagram of the system](.images/Cfiddle-Cluster.png)\n\nThe instructions and code in this repo will let you assemble the following:\n\n1.  A group of _worker nodes_ to run cfiddle jobs\n2.  
A _head node_ to coordinate access to the cluster.\n3.  An appropriate configuration for cfiddle so it'll run jobs on the cluster.\n4.  An example _user node_ that is not part of the slurm cluster but serves Jupyter Notebooks that can submit cfiddle jobs to the cluster.\n5.  Three docker images:\n    * One for the head node and worker nodes (`cfiddle-cluster`)\n    * One for the user node (`cfiddle-user`)\n    * A sandbox docker image to run the user's code (`cfiddle-sandbox`)\n6.  A couple of test user accounts to show that everything works (`test_cfiddle.py`)\n\nWe will assume that the cluster is dedicated to running CFiddle jobs\nand nothing else.\n\n### Theory of Operation\n\n\nWe will configure CFiddle so that when the user's CFiddle code\n(running on the \"user node\") requests execution of an experiment,\nCFiddle will use the facilities provided by `delegate_function` to run\nthe code someplace other than the local machine.\n\nThat \"someplace\" will be a single-use \"sandbox\" docker container\nrunning on one of the worker nodes in our Slurm cluster.\n`delegate_function` will accomplish this by submitting a Slurm job to\nthe cluster.  That job will then spawn the sandbox container on the\nworker node.\n\nThe user node and the worker nodes will all share a single `/home/`\ndirectory so the users' files are available to their jobs running on the\nslurm cluster.\n\nHow, exactly, `delegate_function` bundles up the CFiddle code and its\ninputs and then collects its outputs is outside the scope of this\ndocument.  But you can read about it in the `delegate_function` source\ncode.\n\n### What We Provide and What You Need to Provide\n\nThe implementation embodied in this git repo is meant to make it as easy\nas possible to set up a CFiddle cluster, so most of the implementation\nthis document describes is suitable for use in deployment (although\nyou need to customize some configuration files).\n\nHowever, there are a few places that you will need to customize:\n\n1.  
The shared file storage\n2.  The user image\n3.  The sand box image (maybe)\n\nFor shared user directories, the instructions below set up a simple NFS\nserver, which will work fine for simple, stand-alone installations,\nbut if you want to integrate with an existing system, it'll take some\ntweaking.\n\nThe user image is a stand-in for whatever environment your users will\nbe using.  The version provided is based on the standard jupyter\nnotebook image but adds what's necessary to make slurm work.\n\nThe sand box image just includes `cfiddle` and `delegate_function`, so\nit should be sufficient for running standard `cfiddle` experiments.\nIf you're doing anything fancy with `cfiddle`, you'll need to modify\nthe sand box to match what's in the user's environment.\n\nYou will also need an account on [Docker Hub](https://hub.docker.com) and you'll need to\nknow your username.\n\n### Implementation Roadmap\n\nThis deployment is fully containerized so that we don't have to\ninstall anything other than `git` (to clone this repo) and `docker` on\nmachines to get things working.\n\nWe will build this system in layers.\n\nFirst, we will acquire a head node and set of worker nodes and install\ndocker on them.\n\nSecond, we will create a docker swarm from those nodes to facilitate\ntheir management.\n\nThird, we will instantiate a set of docker \"services\" using docker\n\"stacks\".  This will start several containers on the head node to run\nSlurm and a container on each worker node.\n\nFourth, we will test the cluster by running some cfiddle experiments\nfrom the user node.\n\n## What this Repo Does not Build\n\nThis repo is not a great resource for deploying a general purpose\nSlurm cluster.\n\nIt is also not instructions for using an existing Slurm cluster to run\nCFiddle jobs.  This is certainly possible and the tools support it.  If\nyou have a suitable Slurm cluster available, it's probably easier to\ngo that route.\n\n## But I Want to Do Something Slightly Different\n\nThat's great!  
This repo probably provides helpful hints.\n\nAlso, the maintainers are excited to help people use CFiddle, so\nplease email sjswanson@ucsd.edu if you have questions or need help.\n\n##  What You Will Need\n\n### Hardware\n\nYou will need at least two machines: One worker node and one head node.\n\nThe head node can be pretty much any server (or virtual machine) that\nruns a recent version of Docker.  You'll need root access to\nit.\n\nThe worker machines are where the CFiddle experiments will run.  They\nshould be dedicated to this task and not be running anything else.\n\nFor testing this guide, we used bare metal cloud servers from\nhttps://deploy.equinix.com/.  Any x86 instance type will do for\ntesting.  Pick something cheap.\n\nYou need to be familiar with how to provision servers and install the OS,\nand be able to ssh into them.\n\n### Software\n\nWe built this guide with the following software:\n\n1. Ubuntu 22.04 (the `ubuntu:jammy` docker image)\n2. Docker version 24.0.5\n3. The latest version of `cfiddle`\n4. The latest version of `delegate_function`\n5. Python 3.10\n6. The `jupyter/scipy-notebook:python-3.10` image.\n\nIn principle, the version of Linux shouldn't matter much, but some of\nthe scripts will probably need to be adjusted.\n\nWe are using some newish docker features.  In particular, we need\ndocker-compose.yml 3.8 (which requires at least Docker 19.03), and we\nuse `CAP_PERFMON` which was broken in docker until recently.  It works\nin the version listed above, but I'm not sure when it started working.\n\nThe versions of python _on all the images_ must match, since\n`delegate_function` runs everywhere.  Changing to a different version\nis complicated.  I chose 3.10 because it's what got installed under\n`ubuntu:jammy`.  Then I selected the jupyter image to match.\n\n### Docker.com Account\n\nYou'll need an account on docker.com, and you'll need to be logged in\nbefore you run the script to build the cluster. 
You can do that with:\n\n```\ndocker login -u $DOCKERHUB_USERNAME\n```\n\nBe sure to do this!  Otherwise, the build script will fail part way through.\n\n## The Actual Implementation\n\nThe actual implementation for all of this is in `build_cluster.sh`.\nIt's a heavily commented shell script that actually builds and tests\nthe cluster.  _You will need to read the comments for the first steps to do the initial setup_.\n\nYou can build the cluster with:\n\n```\n./build_cluster.sh\n```\n\n## Testing\n\nOnce `build_cluster.sh` runs successfully, you can test basic slurm functions with\n\n```\n. config.sh\n./test_cluster.sh\n```\n\nThis will show you a live-updating view of the job queue.  When it's\nempty (or you get bored), Control-C to quit, and then `exit` to get\nout of the user node container.\n\nTo make sure cfiddle is working as root and as a normal user:\n\n```\n. config.sh\ndocker exec -it $userhost pytest -s test_cfiddle.py\ndocker exec -u jovyan  -w /home/jovyan -it $userhost pytest -s /slurm/test_cfiddle.py\n```\n\n## Step 10: Access Jupyter\n\nJupyter should also now be running on the usernode container.  You should be able to access it at:\n\n```\nhttp://$HEAD_ADDR:8888/lab?token=slurmify\n```\n\nNavigate to somewhere and run a cfiddle command...\n\n## Changing the Cluster\n\nOnce the cluster is running, you can rebuild the images and restart the cluster with\n\n```\nupdate_cluster.sh\n. config.sh\n```\n\nYou will need to re-source `config.sh` to reload some environment variables.\n\n\n## Alternative Configurations\n\nThe system we just built has some drawbacks:\n\n1. The proxy container mechanism means that the Slurm cluster is not suitable for general use.\n2. All the CFiddle jobs run as the same user -- `cfiddle`.\n\nThese decisions are both driven by the desire for simplicity.  
In\nparticular, proxy containers avoid the need to maintain user accounts or\ndirectories across the worker nodes.\n\nA more elegant installation would keep user accounts synchronized\nacross the head node and the workers, provide unified home directories\nacross the machines, and then have the worker nodes themselves be\nmembers of the Slurm cluster.\n\n## Possibly useful\n\nhttps://github.com/nateGeorge/slurm_gpu_ubuntu\n\n## Munge Key\n\nGrab the munge uid and group we will use, and build the munge user and group\n\n```\n. config.sh\ngroupadd munge -g $MUNGE_GID\nuseradd munge -u $MUNGE_UID -g $MUNGE_GID\n```\n\nInstall `munge` which will generate `/etc/munge/munge.key` as a side effect.\n\n```\napt-get install -y munge \n```\n\nLater, if you need to change it:\n\n```\nmungekey\nchown munge:munge /etc/munge/munge.key\n```\n\n## Notes\n\n### Where To Get Servers\n\nFor testing we use [Equinix Metal](https://deploy.equinix.com/).\nTheir `c3.small.x86` instances are reasonably cheap ($0.75/hour) and\nwork well.  They are not available in all zones, so you might have to\nhunt for them when provisioning machines.\n\nAWS has bare metal servers but they are huge (e.g., 100s\nof cores) and very expensive.\n\nFor courses we use a cluster of 12 small Intel blade servers provided\nby our institution.  We used to use Equinix for courses too, but it is a bit pricey.\n\n### How Many Servers Do You Need?\n\nWe typically have ~215 students and at peak times (e.g., right\nbefore a homework is due), the 12 servers are saturated.\n\nWhen we were using Equinix, we would manually scale up the cluster\nsize when load spiked.  The problem is that Equinix sometimes runs out\nof a particular instance type, which can be a problem because results\nneed to match across machines.  
We just told students to work early\nand hoped instances were available.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvsl%2Fcfiddle-cluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvsl%2Fcfiddle-cluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvsl%2Fcfiddle-cluster/lists"}