{"id":17981000,"url":"https://github.com/nvidia/cuda-checkpoint","last_synced_at":"2025-05-16T18:10:44.348Z","repository":{"id":233999516,"uuid":"788109695","full_name":"NVIDIA/cuda-checkpoint","owner":"NVIDIA","description":"CUDA checkpoint and restore utility","archived":false,"fork":false,"pushed_at":"2025-01-27T17:27:19.000Z","size":255,"stargazers_count":335,"open_issues_count":18,"forks_count":16,"subscribers_count":26,"default_branch":"main","last_synced_at":"2025-05-15T00:28:54.798Z","etag":null,"topics":["checkpoint","cuda"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-17T19:45:28.000Z","updated_at":"2025-05-14T17:02:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"d0e7a87a-bf80-49b4-a4a8-d521226caa2d","html_url":"https://github.com/NVIDIA/cuda-checkpoint","commit_stats":null,"previous_names":["nvidia/cuda-checkpoint"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fcuda-checkpoint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fcuda-checkpoint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fcuda-checkpoint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fcuda-checkpoint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://c
odeload.github.com/NVIDIA/cuda-checkpoint/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254582907,"owners_count":22095518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["checkpoint","cuda"],"created_at":"2024-10-29T18:07:17.640Z","updated_at":"2025-05-16T18:10:44.323Z","avatar_url":"https://github.com/NVIDIA.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The cuda-checkpoint Utility\n\nCheckpoint and restore functionality for CUDA is exposed through a command-line utility called `cuda-checkpoint`\nwhich is available in the [bin](bin) directory of this repo.\nThis utility can be used to transparently checkpoint and restore CUDA state within a running Linux process,\nand can be combined with [CRIU](https://criu.org/Main_Page) (described below) to fully checkpoint CUDA applications.\n\n## 570 Features\nDisplay driver version 570 includes these features not present in 550:\n* NVML support\n* integration with CRIU 4.0 or higher, providing process tree support\n* [CUDA Driver interfaces](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CHECKPOINT.html) at feature parity with the `cuda-checkpoint` utility\n* a separate lock command which can take a timeout to avoid deadlocks\n\nThis [demo program](src/r570-features.c) shows many of these features in action!\n\n## Background\nTransparent, per-process checkpointing offers a middle ground between virtual machine checkpointing and application-driven checkpointing.\nPer-process checkpointing can be used in combination with 
containers to checkpoint the state of a complex application,\nfacilitating use cases such as:\n\n* fault tolerance (with periodic checkpoints),\n* preemption of lower-priority work on a single node (by checkpointing the preempted task), and\n* cluster scheduling (with migration)\n\nVirtual Machine|Per-Process|Application Driven\n---|---|---\n\u003cimg src=\"images/vm.png\"\u003e|\u003cimg src=\"images/proc.png\"\u003e|\u003cimg src=\"images/app.png\"\u003e\n\nThe most popular utility for transparent per-process checkpointing is [CRIU](https://criu.org/Main_Page).\n\n## CRIU\n[CRIU](https://criu.org/Main_Page) (Checkpoint/Restore in Userspace) is an open source checkpointing utility\n(maintained outside of NVIDIA) for Linux which can checkpoint and restore process trees.\nCRIU exposes its functionality through a command line program called `criu`\nand operates by checkpointing and restoring every kernel mode resource associated with a process. These resources include:\n\n* anonymous memory,\n* threads,\n* regular files,\n* sockets, and\n* pipes between checkpointed processes.\n\nSince the behavior of these resources is specified by Linux and is independent of the underlying hardware,\nCRIU knows how to checkpoint and restore them.\nIn contrast, NVIDIA GPUs provide functionality beyond that of a standard Linux kernel, and thus CRIU is not able to manage them.\n`cuda-checkpoint` adds this capability, and can therefore be used with CRIU to checkpoint and restore a CUDA application.\n\n\n## The Utility\n`cuda-checkpoint` checkpoints and restores the CUDA state of a single Linux process.\nThe `cuda-checkpoint` utility supports display driver version 550 and higher and is located in the [bin](bin) directory of this repo.\n\n```bash\nlocalhost$ cuda-checkpoint --help\nCUDA checkpoint and restore utility.\nVersion 570.86.10. Copyright (C) 2025 NVIDIA Corporation. 
All rights reserved.\n\nOperations:\n--get-state --pid \u003cpid\u003e\n        Prints the current checkpoint state of the process specified by \u003cpid\u003e\n\n--action lock | checkpoint | restore | unlock --pid \u003cpid\u003e [--timeout \u003cms\u003e]\n        Performs the specified action on \u003cpid\u003e.\n        For the lock action a timeout can be provided, the lock operation will wait up to \u003cms\u003e milliseconds for the operation to succeed.\n\n--toggle --pid \u003cpid\u003e\n        Toggles the CUDA state in the specified process between the running and checkpointed states\n\n--get-restore-tid --pid \u003cpid\u003e\n        Retrieves the CUDA restore thread ID of the process specified by \u003cpid\u003e\n\nOptions:\n--pid|-p \u003cpid\u003e\n        The pid upon which to perform the operation\n\n--timeout|-t \u003ctimeout\u003e\n        Optional timeout that can be specified for the lock action in milliseconds\n\n--help|-h\n        Print this help message\n```\n\nThe `cuda-checkpoint` binary can toggle the CUDA state of a process (specified by PID) between suspended and running.\nA running-to-suspended transition is called a suspend and the opposite transition is called a resume.\n\nA process's CUDA state is initially running.\nWhen `cuda-checkpoint` is used to suspend CUDA in a process:\n\n1. \u003cimg src=\"images/lock.png\" width=\"64\"\u003e  any CUDA driver APIs which launch work, manage resources, or otherwise impact GPU state are locked;\n2. \u003cimg src=\"images/moon.png\" width=\"64\"\u003e already-submitted CUDA work, including stream callbacks, is completed;\n3. \u003cimg src=\"images/copy-out.png\" width=\"64\"\u003e device memory is copied to the host, into allocations managed by the CUDA driver; and\n4. 
\u003cimg src=\"images/off.png\" width=\"64\"\u003e  all of CUDA’s GPU resources are released.\n\n``cuda-checkpoint`` does not suspend CPU threads, which may continue to safely interact with CUDA by:\ncalling runtime or driver APIs (which may block until CUDA is resumed), and\naccessing host memory (allocated by cudaMallocHost and similar APIs) which remains valid.\n\nA suspended CUDA process no longer directly refers to any GPU hardware at the OS level\nand may therefore be checkpointed by a CPU checkpointing utility such as CRIU.\n\nWhen a process’s CUDA state is resumed using ``cuda-checkpoint``:\n\n1. \u003cimg src=\"images/on.png\" width=\"64\"\u003e GPUs are re-acquired by the process;\n2. \u003cimg src=\"images/copy-in.png\" width=\"64\"\u003e device memory is copied back to the GPU, and GPU memory mappings are restored at their original addresses;\n3. \u003cimg src=\"images/sun.png\" width=\"64\"\u003e CUDA objects such as streams and contexts are restored; and\n4. \u003cimg src=\"images/unlock.png\" width=\"64\"\u003e CUDA driver APIs are unlocked.\n\nAt this point, CUDA calls will unblock and CUDA may begin running on the GPU again.\n\n## Example\nThis example will use `cuda-checkpoint` and `criu` to checkpoint a CUDA application called *counter*.\nEvery time *counter* receives a packet, it increments GPU memory and replies with the updated value.\nThe [source code](src/counter.cu) for *counter* is shown below.\n\n```cuda\n#include \u003cstdio.h\u003e\n#include \u003csys/types.h\u003e\n#include \u003csys/socket.h\u003e\n#include \u003cnetinet/in.h\u003e\n#include \u003carpa/inet.h\u003e\n\n#define PORT 10000\n\n__device__ int counter = 100;\n__global__ void increment()\n{\n    counter++;\n}\n\nint main(void)\n{\n    cudaFree(0);\n\n    int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);\n    sockaddr_in addr = {AF_INET, htons(PORT), inet_addr(\"127.0.0.1\")};\n    bind(sock, (sockaddr *)\u0026addr, sizeof addr);\n\n    while (true) {\n        char 
buffer[16] = {0};\n        sockaddr_in peer = {0};\n        socklen_t inetSize = sizeof peer;\n        int hCounter = 0;\n\n        recvfrom(sock, buffer, sizeof buffer, 0, (sockaddr *)\u0026peer, \u0026inetSize);\n\n        increment\u003c\u003c\u003c1,1\u003e\u003e\u003e();\n        cudaMemcpyFromSymbol(\u0026hCounter, counter, sizeof counter);\n\n        size_t bytes = sprintf(buffer, \"%d\\n\", hCounter);\n        sendto(sock, buffer, bytes, 0, (sockaddr *)\u0026peer, inetSize);\n    }\n    return 0;\n}\n```\n\nThe *counter* application can be built using `nvcc`.\n\n```bash\nlocalhost$ nvcc counter.cu -o counter\n```\n\nNext, launch *counter* and wait to be sure that it is listening on its socket\n(which is important if this demo is being launched as [a single script](src/example.sh)).\n\n```bash\nlocalhost# ./counter \u0026\n[1] 298027\nlocalhost# sleep 1\n```\n\nSave *counter*’s PID for reference in subsequent commands.\n\n```bash\nlocalhost# PID=$!\n```\n\nSend *counter* a packet and observe the returned value.\nThe initial value was 100 but the response is 101, showing that the GPU memory has changed since initialization.\n\n```bash\nlocalhost# echo hello | nc -u localhost 10000 -W 1\n101\n```\n\nUse `nvidia-smi` to confirm that *counter* is running on a GPU.\n\n```bash\nlocalhost# nvidia-smi --query --display=PIDS | grep $PID\n        Process ID                        : 298027\n```\n\nUse `cuda-checkpoint` to suspend *counter*’s CUDA state.\n\n```bash\nlocalhost# cuda-checkpoint --toggle --pid $PID\n```\n\nUse `nvidia-smi` to confirm that *counter* is no longer running on a GPU\n\n```bash\nlocalhost# nvidia-smi --query --display=PIDS | grep $PID\n```\n\nCreate a directory to hold the checkpoint image\n\n```bash\nlocalhost# mkdir -p demo\n```\n\nUse `criu` to checkpoint *counter*\n\n```bash\nlocalhost# criu dump --shell-job --images-dir demo --tree $PID\n[1]+  Killed                  ./counter\n```\n\nConfirm that *counter* is no longer 
running\n\n```bash\nlocalhost# ps --pid $PID\n    PID TTY          TIME CMD\n```\n\nUse `criu` to restore *counter*\n\n```bash\nlocalhost# criu restore --shell-job --restore-detached --images-dir demo\n```\n\nUse `cuda-checkpoint` to resume *counter*’s CUDA state\n\n```bash\nlocalhost# cuda-checkpoint --toggle --pid $PID\n```\n\nNow that *counter* is fully restored, send it another packet.\nThe response is 102, showing that earlier GPU operations were persisted correctly!\n\n```bash\nlocalhost# echo hello | nc -u localhost 10000 -W 1\n102\n```\n\n## Functionality\nAs of display driver version 570, checkpoint and restore functionality is still being actively developed.\nIn particular, `cuda-checkpoint`:\n\n* is x64 only,\n* does not support UVM or IPC memory,\n* does not support GPU migration,\n* waits for already-submitted CUDA work to finish before completing a checkpoint,\n* does not attempt to keep the process in a good state if an error (such as the presence of a UVM allocation) is encountered during checkpoint or restore.\n\nThese limitations will be addressed in subsequent display driver releases,\nand will not require an update to the `cuda-checkpoint` utility itself.\nThe `cuda-checkpoint` utility simply exposes functionality that is contained in the driver.\n\n## License\nBy downloading or using the software, you agree to the terms of the [License Agreement for NVIDIA Software Development Kits — EULA](https://docs.nvidia.com/cuda/eula/index.html).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fcuda-checkpoint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia%2Fcuda-checkpoint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fcuda-checkpoint/lists"}