Advanced Guide: Real‑Time GPU Process Migration in Kubernetes with CRIU
This article explains how os‑criu provides transparent, OS‑level GPU checkpoint/restore, compares its performance with NVIDIA's cuda‑checkpoint, walks through building and installing the PhOS framework, demonstrates migration of a Llama2‑13b‑chat workload in Docker, and discusses current limitations and future Kubernetes integration plans.
Overview
os‑criu is an OS‑level GPU checkpoint/restore (C/R) system that transparently checkpoints and restores GPU‑using processes without application modification. It is the first OS‑level C/R solution that can perform concurrent checkpoint/restore without stopping the application.
On CUDA, os‑criu was compared with NVIDIA's cuda‑checkpoint project [1]. In a benchmark with the Llama2‑13b‑chat model, os‑criu showed lower checkpoint and restore latency (figures omitted).
Building and Installing PhOS
Clone the repository recursively (see the upstream build instructions [2]):
git clone --recursive https://github.com/SJTU-IPADS/PhoenixOS.git
From inside the cloned PhoenixOS directory, start a privileged Docker container based on the CUDA image matching the target version (e.g., nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04):
sudo docker run -dit --gpus all \
-v "$(pwd)":/root \
--privileged --network=host --ipc=host \
--name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
sudo docker exec -it phos /bin/bash
Install basic dependencies and download assets inside the container:
apt-get update
apt-get install -y git wget
cd /root/scripts/build_scripts
bash download_assets.sh
Build and install all components:
# Clear previous build and third‑party caches
bash build.sh -c -3
# Build and install
bash build.sh -3 -i
PhOS components:
phos-autogen: Autogen Engine that generates parser and worker code for a hardware platform from a lightweight description.
phosd: Daemon that runs continuously and controls all GPU devices on the node.
libphos.so: Hijacker that intercepts GPU API calls on the client and forwards them to the daemon.
libpccl.so: Checkpoint Communication Library (PCCL) for optimized device-to-device state migration (not included in the current release).
unit-testing: GoogleTest-based unit-test framework.
phos-cli: Command-line interface for interacting with PhOS.
Running the Llama2‑13b‑chat Example
Environment
pytorch=1.13.0a0+git2263262
transformers==4.30.0
accelerate==0.20.1
sentencepiece==0.2.0
pandas==2.0.3
CUDA 11.3
Pull the ready-made Docker image phoenixos/pytorch:11.3-ubuntu20.04 and start a container:
docker run -dit --gpus all --privileged \
--ipc=host --network=host \
-v "$(pwd)":/root --name phos_example phoenixos/pytorch:11.3-ubuntu20.04
docker exec -it phos_example /bin/bash
Execution Steps
Install required Python packages:
pip3 install transformers==4.30.0 accelerate==0.20.1 sentencepiece==0.2.0 pandas==2.0.3
Download model weights and tokenizer (replace your_huggingface_token with a valid token):
export HF_TOKEN=your_huggingface_token
python3 ./download.py
Start the PhOS daemon:
pos_cli --start --target daemon
Run training or inference with the PhOS environment variables, which the commands below expand from the shell variable $phos:
# Train
env $phos python3 ./train.py
# Inference
env $phos python3 ./inference.py
The first run may be slower because PhOS detects and parses all registered .fatbin/.cubin files.
Checkpoint / Restore
# Optional pre-dump
mkdir -p /root/ckpt
pos_cli --pre-dump --dir /root/ckpt --pid [your_program_pid]
# Dump (-p keeps mkdir from failing if the pre-dump step already created the directory)
mkdir -p /root/ckpt
pos_cli --dump --dir /root/ckpt --pid [your_program_pid]
# Restore
pos_cli --restore --dir /root/ckpt
How PhOS Works
See the detailed design in the paper [3].
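At the usage level, and independent of the internal design, the pre-dump, dump, and restore commands from the Checkpoint / Restore section can be wrapped in small shell helpers. The following is a minimal sketch, not part of PhOS: the function names phos_checkpoint and phos_restore and the POS_CLI override variable are hypothetical, and only the pos_cli flags already shown in this article are used.

```shell
#!/usr/bin/env bash
# Sketch only: phos_checkpoint, phos_restore and POS_CLI are hypothetical
# names introduced for illustration; they are not provided by PhOS.
# Set POS_CLI="echo pos_cli" to print the commands instead of running them.
POS_CLI="${POS_CLI:-pos_cli}"

# phos_checkpoint PID DIR [yes|no]
# Dumps process PID into DIR, optionally running a pre-dump first.
phos_checkpoint() {
    local pid="$1"
    local dir="$2"
    local predump="${3:-no}"
    if [ -z "$pid" ] || [ -z "$dir" ]; then
        echo "usage: phos_checkpoint PID DIR [yes|no]" >&2
        return 2
    fi
    # -p creates the directory only when it is missing.
    mkdir -p "$dir" || return 1
    if [ "$predump" = "yes" ]; then
        $POS_CLI --pre-dump --dir "$dir" --pid "$pid" || return 1
    fi
    $POS_CLI --dump --dir "$dir" --pid "$pid"
}

# phos_restore DIR
# Restores from DIR, refusing to run if the directory does not exist.
phos_restore() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "no checkpoint directory: $dir" >&2
        return 1
    fi
    $POS_CLI --restore --dir "$dir"
}
```

For example, with the workload's PID in $pid, phos_checkpoint "$pid" /root/ckpt yes runs a pre-dump followed by a dump, and setting POS_CLI="echo pos_cli" turns both helpers into a dry run that only prints the commands.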
Limitations
Supports checkpoint and restore for a single GPU only.
Cannot be integrated directly with Kubernetes at present.
Future Plans
Kubernetes Integration
The native Kubernetes + CRIU solution cannot automatically trigger a restore. The crik project [4] can react to node‑failure signals and perform CPU‑level checkpoint/restore, but it lacks GPU support. Extending crik with PhOS capabilities is proposed to achieve true GPU‑level migration on Kubernetes.
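The control flow such an extension would need can be sketched in a few lines of shell. This is a hypothetical wrapper, not how crik or PhOS is implemented: the function name run_with_checkpoint, the POS_CLI and CKPT_DIR variables, and the assumption that the checkpoint directory sits on storage that outlives the pod are all illustrative. It relies only on the pos_cli flags shown earlier and on the fact that Kubernetes sends SIGTERM to a container before stopping it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of crik or PhOS.
POS_CLI="${POS_CLI:-pos_cli}"        # override with "echo pos_cli" for a dry run
CKPT_DIR="${CKPT_DIR:-/root/ckpt}"   # assumed to be on storage that survives the pod

# run_with_checkpoint COMMAND [ARGS...]
run_with_checkpoint() {
    # A non-empty checkpoint directory is taken to mean that an earlier
    # instance was dumped: restore it instead of starting from scratch.
    # (Whether pos_cli --restore blocks until the workload exits is not
    # stated in this article.)
    if [ -d "$CKPT_DIR" ] && [ -n "$(ls -A "$CKPT_DIR")" ]; then
        $POS_CLI --restore --dir "$CKPT_DIR"
        return $?
    fi

    "$@" &                      # start the workload in the background
    _phos_child=$!

    # On SIGTERM, dump the workload and then stop it.
    trap 'mkdir -p "$CKPT_DIR"
          $POS_CLI --dump --dir "$CKPT_DIR" --pid "$_phos_child"
          kill "$_phos_child" 2>/dev/null' TERM

    wait "$_phos_child"
    _phos_status=$?
    trap - TERM                 # remove the handler once wait has returned
    return "$_phos_status"
}
```

A container entrypoint could then call run_with_checkpoint env $phos python3 ./inference.py. The sketch leaves open points that this article does not answer, such as whether the dumped process keeps running after pos_cli --dump and whether a dump fits within the pod's termination grace period.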
References
[1] https://github.com/NVIDIA/cuda-checkpoint
[2] https://github.com/SJTU-IPADS/PhoenixOS/tree/zhuobin/fix_cli?tab=readme-ov-file#i-build-and-install-phos
[3] https://arxiv.org/abs/2405.12079
[4] https://github.com/qawolf/crik
Infra Learning Club
Infra Learning Club shares study notes, cutting-edge technology, and career discussions.
