
Advanced Guide: Real‑Time GPU Process Migration in Kubernetes with CRIU

This article explains how os‑criu provides transparent, OS‑level GPU checkpoint/restore, compares its performance with NVIDIA's cuda‑checkpoint, walks through building and installing the PhOS framework, demonstrates migration of a Llama2‑13b‑chat workload in Docker, and discusses current limitations and future Kubernetes integration plans.

Infra Learning Club

Overview

os‑criu is an OS‑level GPU checkpoint/restore (C/R) system that transparently checkpoints and restores GPU‑using processes without application modification. It is the first OS‑level C/R solution that can perform concurrent checkpoint/restore without stopping the application.

On CUDA, os‑criu was compared with NVIDIA’s cuda-checkpoint project [1]. In a benchmark using the Llama2‑13b‑chat model, os‑criu showed lower checkpoint and restore latency than cuda-checkpoint (figures omitted).

Building and Installing PhOS

Clone the repository recursively (the steps in this section follow the PhOS README [2]):

git clone --recursive https://github.com/SJTU-IPADS/PhoenixOS.git

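From the cloned PhoenixOS directory, start a privileged Docker container based on the CUDA image that matches the target version (e.g., nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04). The bind mount exposes the source tree at /root inside the container, which the later build steps rely on:

cd PhoenixOS
sudo docker run -dit --gpus all \
    -v "$(pwd)":/root \
    --privileged --network=host --ipc=host \
    --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04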
sudo docker exec -it phos /bin/bash
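
Before installing anything, it can help to confirm from the host that the container is running and that the GPU is visible inside it. The following is a minimal sketch and not part of PhOS: check_container is a hypothetical helper name, it assumes the invoking user may run docker without sudo, and it assumes nvidia-smi is available inside the container (typically the case when it is started with --gpus all).

```shell
# Sketch: confirm a container is running and can see at least one GPU.
# check_container is a hypothetical helper, not part of PhOS or Docker.
check_container() {
    name="$1"
    # The name filter matches substrings, so compare the listed name exactly.
    if ! docker ps --filter "name=${name}" --format '{{.Names}}' | grep -qx "${name}"; then
        echo "container ${name} is not running" >&2
        return 1
    fi
    # nvidia-smi -L prints one line per GPU visible to the container.
    if ! docker exec "${name}" nvidia-smi -L; then
        echo "no GPU visible inside ${name}" >&2
        return 1
    fi
}
```

Calling check_container phos prints the GPU list on success and returns a non-zero status otherwise.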

Install basic dependencies and download assets inside the container:

apt-get update
apt-get install -y git wget
cd /root/scripts/build_scripts
bash download_assets.sh

Build and install all components:

# Clear previous build and third‑party caches
bash build.sh -c -3
# Build and install
bash build.sh -3 -i
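
Once the build finishes, a quick sanity check is to confirm that the command-line tool used in the later steps resolves on PATH. This is a minimal sketch: missing_tools is a hypothetical helper, and it is an assumption that the installer places pos_cli (the name used in the commands later in this article) on PATH.

```shell
# Sketch: print the names of commands that do not resolve on PATH.
# missing_tools is a hypothetical helper; pass the names your install provides.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command resolves.
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="${missing}${missing:+ }${tool}"
        fi
    done
    printf '%s\n' "$missing"
    # Succeed only when nothing is missing.
    [ -z "$missing" ]
}
```

For example, missing_tools pos_cli || echo "install looks incomplete" reports a problem before any workload is started.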

PhOS components:

phos-autogen – Autogen Engine that generates parser and worker code for a hardware platform from a lightweight description.
phosd – Daemon that runs continuously and controls all GPU devices on the node.
libphos.so – Hijacker that intercepts GPU API calls on the client and forwards them to the daemon.
libpccl.so – Checkpoint Communication Library (PCCL) for optimized device‑to‑device state migration (not included in the current release).
unit-testing – GoogleTest‑based unit‑test framework.
phos-cli – Command‑line interface for interacting with PhOS.

Running the Llama2‑13b‑chat Example

Environment

pytorch=1.13.0a0+git2263262
transformers==4.30.0
accelerate==0.20.1
sentencepiece==0.2.0
pandas==2.0.3
CUDA 11.3

Pull the ready‑made Docker image phoenixos/pytorch:11.3-ubuntu20.04 and start a container:

docker run -dit --gpus all --privileged \
    --ipc=host --network=host \
    -v "$(pwd)":/root --name phos_example phoenixos/pytorch:11.3-ubuntu20.04
docker exec -it phos_example /bin/bash

Execution Steps

Install required Python packages:

pip3 install transformers==4.30.0 accelerate==0.20.1 sentencepiece==0.2.0 pandas==2.0.3

Download model weights and tokenizer (replace your_huggingface_token with a valid token):

export HF_TOKEN=your_huggingface_token
python3 ./download.py
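
Since the download step depends on the exported token, a small guard can catch the case where HF_TOKEN is unset or still holds the placeholder. This is only a sketch: require_hf_token is a hypothetical helper, and how download.py consumes the token is not shown in this article.

```shell
# Sketch: fail early if HF_TOKEN is missing or still the placeholder value.
# require_hf_token is a hypothetical helper, not part of PhOS.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your_huggingface_token" ]; then
        echo "HF_TOKEN is unset or still the placeholder" >&2
        return 1
    fi
}
```

It can be chained in front of the download, for example require_hf_token && python3 ./download.py.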

Start the PhOS daemon:

pos_cli --start --target daemon

Run training or inference with the PhOS environment variables (held here in the shell variable $phos):

# Train
env $phos python3 ./train.py
# Inference
env $phos python3 ./inference.py
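
The checkpoint commands need the PID of the running program. One way to obtain it is to start the workload in the background and record $!. The following is a sketch under that assumption; run_and_record_pid and the PID file path are hypothetical and not part of PhOS.

```shell
# Sketch: start a command in the background and save its PID to a file.
# run_and_record_pid is a hypothetical helper, not part of PhOS.
run_and_record_pid() {
    pid_file="$1"
    shift
    # Run the remaining arguments as a background job.
    "$@" &
    # $! holds the PID of the most recently started background job.
    echo "$!" > "$pid_file"
}
```

For example, run_and_record_pid /tmp/train.pid env $phos python3 ./train.py stores the PID in /tmp/train.pid. Because env replaces itself with the command it launches, the recorded PID is that of the Python process.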

The first run may be slower because PhOS parses and detects all registered .fatbin/.cubin files.

Checkpoint / Restore

# Optional pre‑dump
mkdir -p /root/ckpt
pos_cli --pre-dump --dir /root/ckpt --pid [your_program_pid]

# Dump (mkdir -p succeeds even if the pre‑dump step already created the directory)
mkdir -p /root/ckpt
pos_cli --dump --dir /root/ckpt --pid [your_program_pid]

# Restore
pos_cli --restore --dir /root/ckpt
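
The pre-dump and dump steps above can be wrapped in a small function that first checks that the target process exists. This is a sketch that uses only the pos_cli flags shown in this article; checkpoint_process and the PRE_DUMP switch are hypothetical names introduced for illustration.

```shell
# Sketch: optionally pre-dump, then dump, a running process with pos_cli.
# checkpoint_process and PRE_DUMP are hypothetical names, not part of PhOS.
checkpoint_process() {
    pid="$1"
    dir="$2"
    # kill -0 sends no signal; it only tests that the PID exists and can be signaled.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is not running" >&2
        return 1
    fi
    # -p makes directory creation safe to repeat.
    mkdir -p "$dir" || return 1
    if [ "${PRE_DUMP:-0}" = "1" ]; then
        pos_cli --pre-dump --dir "$dir" --pid "$pid" || return 1
    fi
    pos_cli --dump --dir "$dir" --pid "$pid"
}
```

With the PID recorded in a shell variable, checkpoint_process "$pid" /root/ckpt performs a plain dump; setting PRE_DUMP=1 beforehand adds the optional pre-dump pass.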

How PhOS Works

See the detailed design in the paper [3].

Limitations

Supports checkpoint and restore for a single GPU only.

Cannot be integrated directly with Kubernetes at present.

Future Plans

Kubernetes Integration

The native Kubernetes + CRIU solution cannot automatically trigger a restore. The crik project [4] can react to node‑failure signals and perform CPU‑level checkpoint/restore, but it lacks GPU support. Extending crik with PhOS capabilities is proposed to achieve true GPU‑level migration on Kubernetes.

References

[1] https://github.com/NVIDIA/cuda-checkpoint

[2] https://github.com/SJTU-IPADS/PhoenixOS/tree/zhuobin/fix_cli?tab=readme-ov-file#i-build-and-install-phos

[3] https://arxiv.org/abs/2405.12079

[4] https://github.com/qawolf/crik
