{"id":19066129,"url":"https://github.com/epfml/getting-started","last_synced_at":"2026-03-09T08:03:02.289Z","repository":{"id":213709185,"uuid":"731964023","full_name":"epfml/getting-started","owner":"epfml","description":null,"archived":false,"fork":false,"pushed_at":"2026-01-12T10:01:48.000Z","size":179,"stargazers_count":26,"open_issues_count":1,"forks_count":16,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-01-12T18:50:56.618Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/epfml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-12-15T09:53:46.000Z","updated_at":"2026-01-12T10:01:54.000Z","dependencies_parsed_at":"2026-01-12T13:01:14.651Z","dependency_job_id":null,"html_url":"https://github.com/epfml/getting-started","commit_stats":null,"previous_names":["epfml/getting-started"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/epfml/getting-started","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fgetting-started","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fgetting-started/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fgetting-started/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fgetting-started/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/epfml","download_url":"https://codeload.github.com/epfml/getting-started/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/epfml%2Fgetting-started/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30287449,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T02:57:19.223Z","status":"ssl_error","status_checked_at":"2026-03-09T02:56:26.373Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T00:54:32.885Z","updated_at":"2026-03-09T08:03:02.278Z","avatar_url":"https://github.com/epfml.png","language":"Python","readme":"# MLO: Getting Started with the EPFL RCP Cluster\n\nThis repository contains the basic steps to start running scripts and notebooks on the EPFL RCP cluster. We provide scripts that make your life easier by automating most of the boilerplate. 
The setup is loosely based on infrastructure from TML/CLAIRE and earlier scripts by Atli.\n\n## Overview\n\nThe RCP cluster provides:\n- **GPUs**: A100 (40GB/80GB), H100 (80GB), H200 (140GB), V100\n- **Stack**: [Docker](https://www.docker.com) (containers), [Kubernetes](https://kubernetes.io) (orchestration), [run:ai](https://run.ai) (scheduler)\n\n## Getting Help\n\n- **FAQ**: Check the [frequently asked questions page](docs/faq.md)\n- **Slack**: Reach out on `#-cluster` or `#-it` channels\n- **Resources**: See [quick links](#quick-links) below\n\n\u003e [!TIP]\n\u003e If you have little prior experience with ML workflows, the setup below may seem daunting at first. You can copy‑paste the commands in order; the scripts are designed to hide most of the complexity. The only requirement is that you have a basic understanding of how to use a terminal and git.\n\n\u003e [!CAUTION]\n\u003e Using the cluster incurs costs. Please be mindful of the resources you use. **Do not forget to stop your jobs when they are not in use!**\n\nContent overview:\n- [Quick Start](#quick-start)\n- [Setup Guide](#setup-guide)\n  - [1. Pre-setup (Access \\\u0026 Repository)](#1-pre-setup-access--repository)\n  - [2. Setup Tools on Your Machine](#2-setup-tools-on-your-machine)\n  - [3. Login to the Cluster](#3-login-to-the-cluster)\n  - [4. Configure Your `.env` File](#4-configure-your-env-file)\n  - [5. 
Start Your First Job](#5-start-your-first-job)\n- [Using VS Code](#using-vs-code)\n- [Recommended Workflow](#recommended-workflow)\n- [`csub.py` Usage and Arguments](#csubpy-usage-and-arguments)\n- [Advanced Topics](#advanced-topics)\n- [Reference](#reference)\n\n\n---\n\n## Quick Start\n\n\u003e [!TIP] \n\u003e **TL;DR** – After completing the setup, interaction with the cluster looks like this:\n\u003e\n\u003e ```bash\n\u003e # Start an interactive job with 1 GPU\n\u003e python csub.py -n sandbox\n\u003e \n\u003e # Connect to your job\n\u003e runai exec sandbox -it -- zsh\n\u003e \n\u003e # Run your code\n\u003e cd /mloscratch/homes/\u003cyour_username\u003e\n\u003e python main.py\n\u003e \n\u003e # Or start a training job in one command\n\u003e python csub.py -n experiment --train --command \"cd /mloscratch/homes/\u003cyour_username\u003e/\u003cyour_code\u003e; python main.py\"\n\u003e ```\n\n---\n\n## Setup Guide\n\n\u003e [!IMPORTANT]\n\u003e **Network requirement**: You must be on the EPFL WiFi or connected to the VPN. The cluster is not accessible otherwise.\n\n### 1. Pre-setup (Access \u0026 Repository)\n\n**1. Request cluster access**\n\nAsk Jennifer or Martin to add you to the `runai-mlo` group: https://groups.epfl.ch/\n\n**2. Prepare your code repository**\n\nWhile waiting for access, create a GitHub repository for your code. This is best practice regardless of our cluster setup.\n\n**3. Set up experiment tracking (optional)**\n\n- **Weights \u0026 Biases**: Create an account at [wandb.ai](https://wandb.ai/) and get your API key\n- **Hugging Face**: Create an account at [huggingface.co](https://huggingface.co/) and get your token (if using their models)\n\n### 2. Setup Tools on Your Machine\n\n\u003e [!IMPORTANT]\n\u003e **Platform note**: The setup below was tested on macOS with Apple Silicon. 
For other systems, adapt the commands accordingly.\n\u003e - **Linux**: Replace `darwin/arm64` with `linux/amd64` in URLs\n\u003e - **Windows**: Use WSL (Windows Subsystem for Linux)\n\n#### Install kubectl\n\nDownload and install kubectl v1.30.11 (matching the cluster version):\n\n```bash\n# macOS with Apple Silicon\ncurl -LO \"https://dl.k8s.io/release/v1.30.11/bin/darwin/arm64/kubectl\"\n\n# Linux (AMD64)\n# curl -LO \"https://dl.k8s.io/release/v1.30.11/bin/linux/amd64/kubectl\"\n\n# Install\nchmod +x ./kubectl\nsudo mv ./kubectl /usr/local/bin/kubectl\nsudo chown root: /usr/local/bin/kubectl\n``` \n\nSee https://kubernetes.io/docs/tasks/tools/install-kubectl/ for other platforms.\n\n#### Setup kubeconfig\n\nDownload the kube config file to `~/.kube/config`:\n\n```bash\ncurl -o ~/.kube/config https://raw.githubusercontent.com/epfml/getting-started/main/kubeconfig.yaml\n```\n\n#### Install run:ai CLI\n\nDownload and install the run:ai CLI:\n\n```bash\n# macOS with Apple Silicon\nwget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin\n\n# Linux (replace 'darwin' with 'linux')\n# wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux\n\n# Install\nchmod +x ./runai\nsudo mv ./runai /usr/local/bin/runai\nsudo chown root: /usr/local/bin/runai\n```\n\n### 3. Login to the Cluster\n\n#### Login to run:ai\n\n```bash\nrunai login\n```\n\n#### Verify access\n\n```bash\n# List available projects\nrunai list projects\n\n# Set your default project\nrunai config project mlo-$GASPAR_USERNAME\n```\n\n#### Verify Kubernetes connection\n\n```bash\nkubectl get nodes\n```\n\nYou should see the RCP cluster nodes listed.\n\n### 4. 
Configure Your `.env` File\n\nThis setup keeps **all personal configuration and secrets** in a local `.env` file (never committed to git).\n\n#### Clone and create `.env`\n\n```bash\ngit clone https://github.com/epfml/getting-started.git\ncd getting-started\ncp user.env.example .env\n```\n\n#### Fill in required fields\n\nOpen `.env` in an editor and configure:\n\n| Variable | Description | Example |\n|----------|-------------|---------|\n| `LDAP_USERNAME` | Your EPFL/Gaspar username | `jdoe` |\n| `LDAP_UID` | Your numeric LDAP user ID | `123456` |\n| `LDAP_GROUPNAME` | For MLO | `MLO-unit` |\n| `LDAP_GID` | For MLO: `83070` | `83070` |\n| `RUNAI_PROJECT` | Your project | `mlo-\u003cusername\u003e` |\n| `K8S_NAMESPACE` | Your namespace | `runai-mlo-\u003cusername\u003e` |\n| `RUNAI_IMAGE` | Docker image | `ic-registry.epfl.ch/mlo/mlo-base:uv1` |\n| `RUNAI_SECRET_NAME` | Secret name | `runai-mlo-\u003cusername\u003e-env` |\n| `WORKING_DIR` | Working directory | `/mloscratch/homes/\u003cusername\u003e` |\n\n#### Find your LDAP UID\n\nTo ensure correct file permissions:\n\n```bash\n# SSH into HaaS machine (use your Gaspar password)\nssh \u003cyour_gaspar_username\u003e@haas001.rcp.epfl.ch\n\n# Get your UID\nid\n```\n\nCopy the number after `uid=` (e.g., `uid=123456`) into `LDAP_UID` in your `.env` file.\n\n#### Optional: Add secrets and tokens\n\nOptionally configure in `.env`:\n\n- `WANDB_API_KEY` – Weights \u0026 Biases API key\n- `HF_TOKEN` – Hugging Face token\n- `GIT_USER_NAME` / `GIT_USER_EMAIL` – Git identity for commits\n- GitHub SSH keys (auto-loaded from `~/.ssh/github` if empty):\n  - `GITHUB_SSH_KEY_PATH` / `GITHUB_SSH_PUBLIC_KEY_PATH` (to override default paths)\n\n#### Sync your secret\n\nThe secret is automatically synced when starting a job. To manually sync:\n\n```bash\npython csub.py --sync-secret-only\n```\n\n### 5. 
Start Your First Job\n\n#### Start an interactive pod\n\n```bash\npython csub.py -n sandbox\n```\n\n#### Wait for the pod to start\n\nThis can take a few minutes. Monitor the status:\n\n```bash\n# List all jobs\nrunai list\n\n# Check specific job status\nrunai describe job sandbox\n```\n\n#### Connect to your pod\n\nOnce the status shows `Running`:\n\n```bash\nrunai exec sandbox -it -- zsh\n```\n\nYou should now be inside a terminal on the cluster! 🎉\n\n### 6. Clone and Run Your Code\n\n#### Clone your repository\n\nInside the pod, clone your code into your scratch home folder:\n\n```bash\ncd /mloscratch/homes/\u003cyour_username\u003e\ngit clone https://github.com/\u003cyour_username\u003e/\u003cyour_repo\u003e.git\ncd \u003cyour_repo\u003e\n```\n\n#### Set up your Python environment\n\nThe default image includes [uv](https://github.com/astral-sh/uv) as the recommended package manager (pip also works):\n\n```bash\n# Create and activate virtual environment\nuv venv .venv\nsource .venv/bin/activate\n\n# Install dependencies\nuv pip install -r requirements.txt\n```\n\n#### Run your code\n\n```bash\npython main.py\n```\n\nIf you configured `WANDB_API_KEY` or `HF_TOKEN` in `.env`, authentication should work automatically.\n\n---\n\n## Using VS Code\n\nFor remote development on the cluster:\n\n1. **Install extensions**\n   - [Kubernetes](https://marketplace.visualstudio.com/items?itemName=ms-kubernetes-tools.vscode-kubernetes-tools)\n   - [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)\n\n2. 
**Attach to your pod**\n   - Navigate to: **Kubernetes** → **rcp-cluster** → **Workloads** → **Pods**\n   - Right-click your pod → **Attach Visual Studio Code**\n   - Open `/mloscratch/homes/\u003cyour_username\u003e` in the remote session\n\nFor detailed instructions, see the [Managing Workflows guide](docs/managing_workflows.md#using-vs-code).\n\n---\n\n## Recommended Workflow\n\n\u003e [!TIP]\n\u003e **Development cycle**:\n\u003e 1. Develop code locally or on the cluster (using VS Code)\n\u003e 2. Push changes to GitHub\n\u003e 3. Run experiments on the cluster via `runai exec sandbox -it -- zsh`\n\u003e 4. Keep code and experiments organized and reproducible\n\n\u003e [!IMPORTANT]\n\u003e **Critical reminders**:\n\u003e - **Pods can be killed anytime** – Implement checkpointing and recovery\n\u003e - **Store files on scratch** – Everything in `~/` is lost when pods restart\n\u003e - **Use `/mloscratch/homes/\u003cusername\u003e`** – Shell config and VS Code settings persist here\n\u003e - **Delete failed jobs** – Run `runai delete job \u003cname\u003e` before restarting\n\u003e - **Background jobs** – Use training mode: `python csub.py -n exp --train --command \"...\"`\n\n\u003e [!CAUTION]\n\u003e **Using the cluster incurs costs.** Always stop your jobs when not in use!\n\nFor detailed workflow guidance, see the [Managing Workflows guide](docs/managing_workflows.md).\n\n---\n\n\n## `csub.py` Usage and Arguments\n\nThe `csub.py` script is a thin wrapper around the run:ai CLI that simplifies job submission by:\n- Reading configuration and secrets from `.env`\n- Syncing Kubernetes secrets automatically\n- Constructing and executing `runai submit` commands\n\n### Basic Usage\n\n```bash\npython csub.py -n \u003cjob_name\u003e -g \u003cnum_gpus\u003e -t \u003ctime\u003e --command \"\u003ccmd\u003e\" [--train]\n```\n\n### Common Examples\n\n```bash\n# CPU-only pod for development\npython csub.py -n dev-cpu\n\n# Interactive development pod with 1 GPU\npython csub.py 
-n dev-gpu -g 1\n\n# Training job with 4 A100 GPUs\npython csub.py -n experiment --train -g 4 --command \"cd /mloscratch/homes/user/code; python train.py\"\n\n# Use specific GPU type\npython csub.py -n my-job -g 2 --node-type h100 --train --command \"...\"\n\n# Dry run (see command without executing)\npython csub.py -n test --dry --command \"...\"\n```\n\n### Available Arguments\n\n| Argument | Description | Default |\n|----------|-------------|---------|\n| `-n`, `--name` | Job name | Auto-generated (username + timestamp) |\n| `-g`, `--gpus` | Number of GPUs | `0` (CPU-only) |\n| `-t`, `--time` | Maximum runtime (e.g., `12h`, `2d6h30m`) | `12h` |\n| `-c`, `--command` | Command to run | `sleep \u003cduration\u003e` |\n| `--train` | Submit as training workload (non-interactive) | Interactive mode |\n| `-i`, `--image` | Docker image | From `RUNAI_IMAGE` in `.env` |\n| `--node-type` | GPU type: `v100`, `h100`, `h200`, `default`, `a100-40g` | `default` (A100) |\n| `--cpus` | Number of CPUs | Platform default |\n| `--memory` | CPU memory request | Platform default |\n| `-p`, `--port` | Expose container port (for Jupyter, etc.) 
| None |\n| `--large-shm` | Request larger `/dev/shm` | False |\n| `--host-ipc` | Share host IPC namespace | False |\n| `--backofflimit` | Retries before marking training job failed | `0` |\n\n### Secret Management\n\n| Argument | Description |\n|----------|-------------|\n| `--sync-secret-only` | Only sync `.env` to Kubernetes secret, don't submit job |\n| `--skip-secret-sync` | Don't sync secret before submission |\n| `--secret-name` | Override `RUNAI_SECRET_NAME` from `.env` |\n| `--env-file` | Path to `.env` file (default: `.env`) |\n\n### Advanced Options\n\n| Argument | Description |\n|----------|-------------|\n| `--uid` | Override `LDAP_UID` from `.env` |\n| `--gid` | Override `LDAP_GID` from `.env` |\n| `--pvc` | Override `SCRATCH_PVC` from `.env` |\n| `--dry` | Print command without executing |\n\n### After Submission\n\nAfter submitting, `csub.py` prints useful follow-up commands:\n\n```bash\nrunai describe job \u003cname\u003e  # Check job status\nrunai logs \u003cname\u003e          # View logs\nrunai exec \u003cname\u003e -it -- zsh  # Connect to pod\nrunai delete job \u003cname\u003e    # Delete job\n```\n\nRun `python csub.py -h` for the complete help text.\n\n---\n\n## Advanced Topics\n\n### Managing Workflows\n\nFor detailed guides on day-to-day operations, see the [Managing Workflows guide](docs/managing_workflows.md):\n\n- [Pod management](docs/managing_workflows.md#managing-pods) – Commands to list, describe, delete jobs\n- [Important workflow notes](docs/managing_workflows.md#important-notes-and-workflow) – Job types, GPU selection, best practices\n- [HaaS machine](docs/managing_workflows.md#the-haas-machine) – File transfer between storage systems\n- [File management](docs/managing_workflows.md#file-management) – Understanding storage (mloscratch, mlodata1, mloraw1)\n\n### Alternative Workflows\n\n- **Run:ai CLI directly**: See [`docs/runai_cli.md`](docs/runai_cli.md) for using run:ai without `csub.py`\n- **Custom Docker images**: See [Creating Custom 
Images](#creating-custom-docker-images)\n- **Distributed training**: See [`docs/multinode.md`](docs/multinode.md) for multi-node jobs\n\n### Creating Custom Docker Images\n\nIf you need custom dependencies:\n\n1. **Get registry access**\n   - Login at https://ic-registry.epfl.ch/ and verify you see the MLO project\n   - The `runai-mlo` group should already have access\n\n2. **Install Docker**\n   ```bash\n   brew install --cask docker  # macOS\n   ```\n   If you get \"Cannot connect to the Docker daemon\", run Docker Desktop GUI first.\n\n3. **Login to registry**\n   ```bash\n   docker login ic-registry.epfl.ch  # Use GASPAR credentials\n   ```\n\n4. **Modify and publish**\n   - Edit `docker/Dockerfile` as needed\n   - Use `docker/publish.sh` to build and push\n   - **Important**: Rename your image (e.g., `mlo/\u003cyour-username\u003e:tag`) to avoid overwriting the default\n\n**Example workflow:**\n```bash\ndocker build . -t \u003cyour-tag\u003e\ndocker tag \u003cyour-tag\u003e ic-registry.epfl.ch/mlo/\u003cyour-tag\u003e\ndocker push ic-registry.epfl.ch/mlo/\u003cyour-tag\u003e\n```\n\nSee also [Matteo's custom Docker example](https://gist.github.com/mpagli/6d0667654bf8342eb4923fedf731660e).\n\n### Port Forwarding\n\nTo access services running in your pod (e.g., Jupyter):\n\n```bash\nkubectl get pods\nkubectl port-forward \u003cpod_name\u003e 8888:8888\n```\n\nThen access at `http://localhost:8888`\n\n### Distributed Training\n\nFor multi-node training across several compute nodes, see the detailed guide:\n\n- **Documentation**: [`docs/multinode.md`](docs/multinode.md)\n- **Official docs**: https://docs.run.ai/v2.13/Researcher/cli-reference/runai-submit-dist-pytorch/\n\n---\n\n## Reference\n\n### File Overview\n\n```\n├── csub.py                # Job submission wrapper (wraps runai submit)\n├── utils.py               # Python helpers for csub.py\n├── user.env.example       # Template for .env (copy and configure)\n├── docker/\n│   ├── Dockerfile         # 
uv-enabled base image (RCP template)\n│   ├── entrypoint.sh      # Runtime bootstrap script\n│   └── publish.sh         # Build and push Docker images\n├── kubeconfig.yaml        # Kubeconfig template for ~/.kube/config\n└── docs/\n    ├── faq.md             # Frequently asked questions\n    ├── managing_workflows.md  # Day-to-day operations guide\n    ├── README.md          # Architecture deep dive\n    ├── runai_cli.md       # Alternative run:ai CLI workflows\n    ├── multinode.md       # Multi-node/distributed training\n    └── how_to_use_k8s_secret.md  # Kubernetes secrets reference\n```\n\n### Deep Dive: How This Setup Works\n\nFor technical details about the Docker image, entrypoint script, environment variables, and secret management:\n\n**Read the architecture explainer**: [`docs/README.md`](docs/README.md)\n\nTopics covered:\n- Runtime environment and entrypoint\n- Permissions model and shared caches\n- uv-based Python workflow\n- Images and publishing\n- Secrets, SSH, and Kubernetes integration\n\n### Quick Links\n\n**RCP Resources**\n- [RCP Main Page](https://www.epfl.ch/research/facilities/rcp/)\n- [Documentation](https://wiki.rcp.epfl.ch)\n- [Dashboard](https://portal.rcp.epfl.ch/)\n- [Docker Registry](https://ic-registry.epfl.ch/)\n- [Quick Start Guide](https://wiki.rcp.epfl.ch/en/home/CaaS/Quick_Start)\n\n**run:ai Documentation**\n- [Official run:ai docs](https://docs.run.ai)\n\n**Related Resources**\n- [Compute and Storage @ CLAIRE](https://prickly-lip-484.notion.site/Compute-and-Storage-CLAIRE-91b4eddcc16c4a95a5ab32a83f3a8294) – Similar setup by colleagues\n\n**MLO Cluster Repositories (OUTDATED)**\n\nThese repositories contain shared tooling and infrastructure (by previous PhD students). Contact Martin for editor access. 
**They are outdated and no longer maintained.**\n\n- [epfml/epfml-utils](https://github.com/epfml/epfml-utils) – Python package for shared tooling (`pip install epfml-utils`)\n- [epfml/mlocluster-setup](https://github.com/epfml/mlocluster-setup) – Base images and setup for semi-permanent machines\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Fgetting-started","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepfml%2Fgetting-started","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepfml%2Fgetting-started/lists"}