Fine-grained Profiling of Online AI Workloads on Kubernetes Using ACK AI Profiling
This article demonstrates how to use ACK AI Profiling, built on eBPF and dynamic process injection, to perform non-intrusive, low‑overhead profiling of Kubernetes‑deployed large‑language‑model inference services, identify GPU memory growth causes, and apply optimization recommendations to prevent OOM issues.
Kubernetes has become the primary operating system for AI workloads, especially large‑language‑model (LLM) training and inference, creating a demand for fine‑grained performance profiling. ACK AI Profiling leverages eBPF and dynamic process injection to provide non‑intrusive, low‑overhead analysis of AI applications running in Kubernetes, covering Python processes, CPU calls, system calls, CUDA libraries, and CUDA kernels.
The case study focuses on a vLLM inference service started with the parameter --gpu-memory-utilization 0.95, which pre-allocates 95% of GPU memory. After a period of operation the service encounters GPU OOM, raising three questions: who is consuming the additional memory, whether the behavior is normal, and how the memory can be released.
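As a point of reference, a launch command along these lines would produce the setup described above. Only the --gpu-memory-utilization 0.95 flag comes from the article; the model identifier and tensor-parallel size are illustrative assumptions (two L20 GPUs suggest tensor parallelism of 2):

```python
# Sketch of a vLLM OpenAI-compatible server launch for this scenario.
# Model id and --tensor-parallel-size are assumptions for illustration;
# --gpu-memory-utilization 0.95 is the flag discussed in the article.
launch_cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/QwQ-32B",           # assumed Hugging Face model id
    "--tensor-parallel-size", "2",       # assumed: shard across two L20 GPUs
    "--gpu-memory-utilization", "0.95",  # pre-allocate 95% of GPU memory
]
print(" ".join(launch_cmd))
```

With this flag, vLLM reserves most of the GPU up front for model weights and KV cache, which is why any growth beyond the initial footprint is worth investigating.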
In the pre‑check environment, the QwQ-32B model runs on two NVIDIA L20 GPUs. Using nvidia-smi, the initial memory allocation is observed at roughly 43161 MiB per GPU. After a single inference request, memory usage rises by 26 MiB on one GPU and 14 MiB on the other, and the increase persists.
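The before/after comparison can be reproduced by capturing `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` around a request and diffing the readings. A minimal helper (not part of ACK AI Profiling), fed with the figures from this case study:

```python
# Illustrative helper: diff two sets of per-GPU "memory.used" readings (MiB),
# e.g. captured with:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
def memory_delta(before_mib, after_mib):
    """Return per-GPU memory growth in MiB."""
    return [a - b for b, a in zip(before_mib, after_mib)]

# Readings from the article: ~43161 MiB baseline on each GPU,
# then +26 MiB and +14 MiB after a single inference request.
before = [43161, 43161]
after = [43187, 43175]
print(memory_delta(before, after))  # → [26, 14]
```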
DCGM monitoring confirms a steady memory growth after the request, while KV‑Cache utilization remains below 10%, indicating the cache is not the source of the extra allocation.
Using ACK AI Profiling, the profiling items Python, CPU, and CUDA Kernel are enabled and results are visualized with SysOM. The CUDA kernel timeline reveals specific cudaMalloc calls; the allocated byte size matches the observed memory increase, pinpointing the allocation point.
The Python call stack during the allocation is traced to thread.run() → llm_engine.step() → worker.execute_model() → model_runner.execute_model() → decorate_context(). The decorate_context function invokes PyTorch's ctx_factory(), a context manager that reserves GPU memory. Consequently, the memory growth consists of reserved memory and the PyTorch context, which are not released until the process ends, explaining the persistent increase.
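To make concrete why this growth persists, here is a toy model of a caching allocator. It is an analogy for PyTorch's CUDA caching allocator, not its actual implementation: freed blocks return to an internal pool held by the process, so the driver-level footprint that nvidia-smi reports only grows, even while the bytes in live tensors shrink:

```python
# Toy caching allocator (analogy only, not PyTorch's real implementation):
# "free" returns blocks to an internal pool instead of the OS/driver, so
# the process-level footprint never shrinks during the process lifetime.
class CachingAllocator:
    def __init__(self):
        self.reserved = 0   # bytes held by the process (what nvidia-smi sees)
        self.allocated = 0  # bytes currently handed out to live tensors
        self.pool = 0       # freed bytes cached for reuse

    def malloc(self, n):
        reuse = min(n, self.pool)
        self.pool -= reuse
        self.reserved += n - reuse  # footprint grows only when pool falls short
        self.allocated += n

    def free(self, n):
        self.allocated -= n
        self.pool += n  # cached for reuse, not released to the driver

alloc = CachingAllocator()
alloc.malloc(100)
alloc.free(100)   # tensor is gone, but the footprint is unchanged
alloc.malloc(60)  # served entirely from the cached pool
print(alloc.reserved, alloc.allocated)  # → 100 60
```

This mirrors the distinction PyTorch itself exposes between torch.cuda.memory_allocated() and torch.cuda.memory_reserved(): the reserved figure is what the OOM investigation above keeps seeing grow.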
Recommendations to mitigate the issue include reducing memory fragmentation by setting the environment variable PYTORCH_CUDA_ALLOC_CONF with a smaller max_split_size_mb, and optionally invoking torch.cuda.empty_cache() or adjusting model and data loading strategies.
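A minimal sketch of the allocator-tuning step. The value 128 MiB is an illustrative starting point, not a tuned recommendation, and the variable must be set before the process imports torch (or before vLLM starts):

```python
import os

# Must be set before torch is imported / before the vLLM process starts.
# 128 MiB is an illustrative value, not a tuned recommendation: it caps the
# block size the allocator will split, reducing fragmentation of large blocks.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Note that torch.cuda.empty_cache() only returns cached, currently unused blocks to the driver; it does not free memory still held by live tensors or the CUDA context itself.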
The overall inference timeline shows the typical steps: input preprocessing → model forward computation → result generation. GPU kernel profiling highlights dominant operations such as gemm matrix multiplications, paged_attention kernels, and NCCL all‑reduce communications. Small idle gaps correspond to HTTP response handling and metrics reporting.
In conclusion, ACK AI Profiling equips AI infrastructure teams with detailed observability, enabling precise problem localization and performance optimization for online AI services.