{"id":18517421,"url":"https://github.com/sjtu-ipads/phoenixos","last_synced_at":"2025-04-05T00:05:02.109Z","repository":{"id":261419425,"uuid":"876630817","full_name":"SJTU-IPADS/PhoenixOS","owner":"SJTU-IPADS","description":"Fast OS-level support for GPU checkpoint and restore ","archived":false,"fork":false,"pushed_at":"2025-03-30T20:17:40.000Z","size":39613,"stargazers_count":171,"open_issues_count":8,"forks_count":15,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-04-05T00:04:54.725Z","etag":null,"topics":["checkpoint-restore","criu","cuda","gpu"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SJTU-IPADS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-22T09:48:43.000Z","updated_at":"2025-03-31T03:39:27.000Z","dependencies_parsed_at":"2025-02-28T13:14:56.794Z","dependency_job_id":"0c6c3236-3650-42df-95b7-c3c139440a36","html_url":"https://github.com/SJTU-IPADS/PhoenixOS","commit_stats":{"total_commits":219,"total_committers":4,"mean_commits":54.75,"dds":"0.013698630136986356","last_synced_commit":"798e48e2f023c42344a6fe0ec96933db2a31a3a2"},"previous_names":["sjtu-ipads/phoenixos"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SJTU-IPADS%2FPhoenixOS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SJTU-IPADS%2FPhoenixOS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SJTU-IPADS%2FPhoenixOS/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SJTU-IPADS%2FPhoenixOS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SJTU-IPADS","download_url":"https://codeload.github.com/SJTU-IPADS/PhoenixOS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247266562,"owners_count":20910836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["checkpoint-restore","criu","cuda","gpu"],"created_at":"2024-11-06T17:03:24.240Z","updated_at":"2025-04-05T00:05:02.100Z","avatar_url":"https://github.com/SJTU-IPADS.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PhoenixOS\n[![cuda](https://img.shields.io/badge/CUDA-supported-brightgreen.svg?logo=nvidia)](https://phoenixos.readthedocs.io/en/latest/cuda_gsg/index.html#)\n[![rocm](https://img.shields.io/badge/ROCm-Developing-lightgrey.svg?logo=amd)](https://phoenixos.readthedocs.io/en/latest/rocm_gsg/index.html)\n[![ascend](https://img.shields.io/badge/Ascend-Developing-lightgrey.svg?logo=huawei)]()\n[![slack](https://img.shields.io/badge/slack-PhoenixOS-brightgreen.svg?logo=slack)](https://join.slack.com/t/phoenixoshq/shared_invite/zt-2tkievevq-xaQ3sctxs7bLnTaYeMyBBg)\n[![docs](https://img.shields.io/badge/Docs-passed-brightgreen.svg?logo=readthedocs)](https://phoenixos.readthedocs.io/en/latest/)\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./docs/docs/source/_static/images/home/logo.jpg\" height=\"200px\" /\u003e\n\u003c/div\u003e\n\n\u003cdiv\u003e\n    \u003cp\u003e\n    
\u003cb\u003ePhoenixOS\u003c/b\u003e (PhOS) is an OS-level GPU checkpoint/restore (C/R) system. It can \u003cb\u003etransparently\u003c/b\u003e C/R processes that use the GPU, without requiring any cooperation from the application, a key feature required by modern systems like the cloud. Most importantly, PhOS is the first OS-level C/R system that can \u003cb\u003econcurrently execute C/R without stopping the execution of the application\u003c/b\u003e.\n    \u003cp\u003e\n    On the CUDA platform, we compared the C/R performance of PhOS with \u003ca href=\"https://github.com/NVIDIA/cuda-checkpoint\"\u003envidia/cuda-checkpoint\u003c/a\u003e:\n    \u003ctable\u003e\n        \u003ctr\u003e\u003cth align=\"center\"\u003eCheckpointing Llama2-13b-chat\u003c/th\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd align=\"center\"\u003e\u003cimg src=\"./docs/docs/source/_static/images/home/llama2_ckpt.gif\" /\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003c/table\u003e\n    \u003ctable\u003e\n        \u003ctr\u003e\u003cth align=\"center\"\u003eRestoring Llama2-13b-chat\u003c/th\u003e\u003c/tr\u003e\n        \u003ctr\u003e\u003ctd align=\"center\"\u003e\u003cimg src=\"./docs/docs/source/_static/images/home/llama2_restore.gif\" /\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003c/table\u003e\n    \u003cp\u003e\n    Note that PhOS aims to be a generic design that targets various hardware platforms from different vendors, by providing a set of interfaces that each hardware platform should implement. 
We currently provide the C/R implementation on the CUDA platform; support for ROCm and Ascend is under development.\n    \u003cdiv style=\"padding: 0px 10px;\"\u003e\n        \u003cp\u003e\n        \u003ch3 style=\"margin:0px; margin-bottom:5px;\"\u003e📑 Latest News\u003c/h3\u003e\n        \u003cul\u003e\n            \u003cli style=\"margin:0px; margin-bottom:8px;\"\u003e\n                \u003cp style=\"margin:0px; margin-bottom:1px;\"\u003e\n                    \u003cb\u003e[Nov. 6, 2024]\u003c/b\u003e PhOS is open sourced 🎉 [\u003ca href=\"https://github.com/PhoenixOS-IPADS/PhoenixOS\"\u003eRepo\u003c/a\u003e] [\u003ca href=\"https://phoenixos.readthedocs.io/en/latest/index.html\"\u003eDocumentation\u003c/a\u003e]\n                \u003c/p\u003e\n                \u003cp style=\"margin:0px; margin-bottom:1px;\"\u003e\n                    👉 PhOS currently fully supports single-GPU checkpoint and restore\n                \u003c/p\u003e\n                \u003cp style=\"margin:0px; margin-bottom:1px;\"\u003e\n                    👉 We will soon release code for cross-node live migration and multi-GPU support :)\n                \u003c/p\u003e                \n            \u003c/li\u003e\n            \u003cli\u003e\n                \u003cp style=\"margin:0px; margin-bottom:5px;\"\u003e\n                    \u003cb\u003e[May 20, 2024]\u003c/b\u003e The PhOS paper is now available on arXiv [\u003ca href=\"https://arxiv.org/abs/2405.12079\"\u003ePaper\u003c/a\u003e]\n                \u003c/p\u003e       \n            \u003c/li\u003e\n        \u003c/ul\u003e\n    \u003c/div\u003e\n    \u003ctable style=\"margin:20px 0px;\"\u003e\n        \u003ctr\u003e\u003ctd\u003e\u003cb\u003e\n        PhOS is currently under heavy development. 
If you're interested in contributing to this project, please join our \u003ca href=\"https://join.slack.com/t/phoenixoshq/shared_invite/zt-2tkievevq-xaQ3sctxs7bLnTaYeMyBBg\"\u003eSlack workspace\u003c/a\u003e for more upcoming cool features on PhOS.\n        \u003c/b\u003e\u003c/td\u003e\u003c/tr\u003e\n    \u003c/table\u003e\n\u003c/div\u003e\n\n\u003cbr /\u003e\n\n## I. Build and Install PhOS\n\n### 💡 Option 1: Build and Install From Source\n\n1. **[Clone Repository]**\n    First of all, clone this repository **recursively**:\n\n    ```bash\n    git clone --recursive https://github.com/SJTU-IPADS/PhoenixOS.git\n    ```\n\n2. **[Start Container]**\n    PhOS can be built and installed on an official vendor image.\n\n    \u003e NOTE: PhOS requires libc6 \u003e= 2.29 for compiling CRIU from source.\n\n    For example, to run PhOS with CUDA 11.3,\n    one can build on the official CUDA images\n    (e.g., [`nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04`](https://hub.docker.com/layers/nvidia/cuda/11.3.1-cudnn8-devel-ubuntu20.04/images/sha256-459c130c94363099b02706b9b25d9fe5822ea233203ce9fbf8dfd276a55e7e95)):\n\n\n    ```bash\n    # enter repository\n    cd PhoenixOS\n\n    # start container\n    sudo docker run -dit --gpus all                                         \\\n                -v.:/root                                                   \\\n                --privileged --network=host --ipc=host                      \\\n                --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04\n\n    # enter container\n    sudo docker exec -it phos /bin/bash\n    ```\n\n    Note that it's important to run the Docker container with root privileges, as CRIU needs permission to C/R kernel-space memory pages.\n\n3. 
**[Download Necessary Assets]**\n    PhOS relies on some assets to build and test;\n    download them by running the following commands:\n\n    ```bash\n    # inside container\n\n    # install basic dependencies from OS pkg manager\n    apt-get update\n    apt-get install git-lfs\n    \n    # download assets\n    cd /root/scripts/build_scripts\n    bash download_assets.sh\n    ```\n\n\n4. **[Build]**\n    Building PhOS is simple!\n\n    PhOS provides a convenient build system, which covers compiling, linking and installing all PhOS components:\n\n    \u003ctable\u003e\n        \u003ctr\u003e\n            \u003cth width=\"25%\"\u003eComponent\u003c/th\u003e\n            \u003cth width=\"75%\"\u003eDescription\u003c/th\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003ephos-autogen\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eAutogen Engine\u003c/b\u003e for generating most of the Parser and Worker code for a specific hardware platform, based on lightweight notation.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003ephosd\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003ePhOS Daemon\u003c/b\u003e, which continuously runs in the background, taking over control of all GPU devices on the node.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003elibphos.so\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003ePhOS Hijacker\u003c/b\u003e, which hijacks all GPU API calls on the client side and forwards them to the PhOS Daemon.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003elibpccl.so\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003ePhOS Checkpoint Communication Library\u003c/b\u003e (PCCL), which provides highly optimized device-to-device state migration. 
Note that this library is not included in the current release.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003eunit-testing\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eUnit Tests\u003c/b\u003e for PhOS, which are based on GoogleTest.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003ephos-cli\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eCommand Line Interface\u003c/b\u003e (CLI) for interacting with PhOS.\u003c/td\u003e\n        \u003c/tr\u003e\n        \u003ctr\u003e\n            \u003ctd\u003e\u003ccode\u003ephos-remoting\u003c/code\u003e\u003c/td\u003e\n            \u003ctd\u003e\u003cb\u003eRemoting Framework\u003c/b\u003e, which provides highly optimized GPU API remoting. See more details at \u003ca href=\"https://github.com/SJTU-IPADS/PhoenixOS-Remoting\"\u003eSJTU-IPADS/PhoenixOS-Remoting\u003c/a\u003e.\u003c/td\u003e\n        \u003c/tr\u003e\n    \u003c/table\u003e\n\n    To build and install all of the above components and other dependencies, simply run the build script in the container:\n\n    ```bash\n    # inside container\n    cd /root/scripts/build_scripts\n\n    # clear old build cache\n    #   -c: clear previous build\n    #   -3: the clean process involves all third-parties\n    bash build.sh -c -3\n\n    # start building\n    #   -3: the build process involves all third-parties\n    #   -i: install after successful building\n    bash build.sh -3 -i\n    ```\n\n    To customize build options, refer to and modify the available options under `scripts/build_scripts/build_config.yaml`.\n\n    If you encounter any build issues, you can find the build logs under `build_log`. Please open a new issue if things are stuck :-|\n\n### 💡 Option 2: Install From Pre-built Binaries\n\n    Will soon be updated :)\n\n\n\u003cbr /\u003e\n\n## II. 
Usage\n\nOnce PhOS is successfully installed, you can try running your program with PhOS support!\n\n\u003ctable style=\"margin:20px 0px;\"\u003e\n    \u003ctr\u003e\u003ctd\u003e\u003cb\u003e\n    For more details, you can refer to \u003ca href=\"https://github.com/SJTU-IPADS/PhoenixOS/tree/main/examples\"\u003e\u003ccode\u003eexamples\u003c/code\u003e\u003c/a\u003e for step-by-step tutorials to run PhOS.\n    \u003c/b\u003e\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n### (1) Start `phosd` and your program\n\n1. Start the PhOS daemon (`phosd`), which takes over all GPU resources on the node:\n\n    ```bash\n    pos_cli --start --target daemon\n    ```\n\n2. To run your program with PhOS support, you need to put a `yaml` configuration file under the directory that your program regards as `$PWD`.\nThis file contains all the information PhOS needs to hijack your program. An example file looks like:\n\n    ```yaml\n    # [Field]   name of the job\n    # [Note]    jobs with the same name share some resources in posd, e.g., CUModule, etc.\n    job_name: \"llama2-13b-chat-hf\"\n\n    # [Field]   remote address of posd, default is local\n    daemon_addr: \"127.0.0.1\"\n    ```\n\n3. You are ready to launch now! 
Try running your program with the `env $phos` prefix, for example:\n\n    ```bash\n    env $phos python3 train.py\n    ```\n\n### (2) Pre-dump your program\n\nTo pre-dump your program, which saves the CPU \u0026 GPU state without stopping execution, simply run:\n\n```bash\n# create directory to store checkpoint files\nmkdir -p /root/ckpt\n\n# pre-dump command\npos_cli --pre-dump --dir /root/ckpt --pid [your program's pid]\n```\n\n### (3) Dump your program\n\nTo dump your program, which saves the CPU \u0026 GPU state and stops execution, simply run:\n\n```bash\n# create directory to store checkpoint files\nmkdir -p /root/ckpt\n\n# dump command\npos_cli --dump --dir /root/ckpt --pid [your program's pid]\n```\n\n\n### (4) Restore your program\n\nTo restore your program, simply run:\n\n```bash\n# restore command\npos_cli --restore --dir /root/ckpt\n```\n\n\n\u003cbr /\u003e\n\n## III. How PhOS Works\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./docs/docs/source/_static/images/pos_mechanism.jpg\" width=\"80%\" /\u003e\n\u003c/div\u003e\n\nFor more details, please check our [paper](https://arxiv.org/abs/2405.12079).\n\n\n\u003cbr /\u003e\n\n## IV. Paper\n\nIf you use PhOS in your research, please cite our paper:\n\n```bibtex\n@article{huang2024parallelgpuos,\n  title={PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation},\n  author={Huang, Zhuobin and Wei, Xingda and Hao, Yingyi and Chen, Rong and Han, Mingcong and Gu, Jinyu and Chen, Haibo},\n  journal={arXiv preprint arXiv:2405.12079},\n  year={2024}\n}\n```\n\n\n\u003cbr /\u003e\n\n## V. 
Contributors\n\nPlease check \u003ca href=\"https://github.com/SJTU-IPADS/PhoenixOS/blob/main/.mailmap\"\u003emailmap\u003c/a\u003e for all contributors.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsjtu-ipads%2Fphoenixos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsjtu-ipads%2Fphoenixos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsjtu-ipads%2Fphoenixos/lists"}