iQIYI Container Practice – Cloud‑Native Deployment Exploration and Implementation
At a technical salon on April 10, iQIYI’s Zhao Wei described the company’s move from Mesos‑Marathon to Kubernetes. The talk covered container application scenarios, the evolution of the network stack from bridge/NAT to Cilium/BGP, the adoption of Containerd with RunC and optional Kata runtimes, the associated performance trade‑offs, and a hybrid scheduling approach that improves resource utilization across offline and online services.
On the afternoon of April 10, iQIYI's technology product team hosted an offline technical salon titled “Exploration and Practice of Cloud‑Native Implementation”. Experts from Kuaishou, Baidu and ByteDance were invited to share their experience alongside iQIYI engineers.
iQIYI technical expert Zhao Wei presented the company’s container practice, covering application scenarios, network solutions and runtime choices.
1. Container Application Scenarios
More than half of iQIYI’s internal services now run as containers on physical‑machine clusters. The company has migrated from a Mesos‑Marathon‑based stack to Kubernetes and plans to provide higher‑level application engines such as Serverless, FaaS and workflow on top of K8s.
2. Container Network Practice
• Bridge + NAT (Docker local bridge) was the initial solution on Mesos; it proved reliable but caused operational pain (e.g., RPC address exposure, Nginx keep‑alive failures).
• Bridge/CNI + VXLAN was tried after moving to Kubernetes; one issue was pod‑to‑service IP routing when both ends were on the same node, which was resolved by disabling net.bridge.bridge-nf-call-iptables or switching to Containerd.
• Cilium/CNI + BGP introduced a more aggressive design, requiring IPAM planning and BGP configuration across switches and hosts.
• Mixed deployment (Bridge/CNI + Cilium) addressed migration challenges and required unified network planning and CNI‑Agent DaemonSets.
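The IPAM planning mentioned for the Cilium/BGP design can be illustrated with a minimal sketch. It carves one non‑overlapping pod subnet per node out of a cluster‑wide range, which is the kind of per‑host block that would then be announced over BGP. The CIDR, node names, prefix length and the function name plan_node_pod_cidrs are all hypothetical; the talk does not state iQIYI's actual address plan.

```python
import ipaddress


def plan_node_pod_cidrs(cluster_cidr, node_names, node_prefix=24):
    """Assign each node one non-overlapping pod subnet carved from cluster_cidr.

    All values passed in are illustrative assumptions, not iQIYI's real plan.
    """
    cluster = ipaddress.ip_network(cluster_cidr)
    subnets = cluster.subnets(new_prefix=node_prefix)  # lazy generator of blocks
    plan = {}
    for node in node_names:
        try:
            plan[node] = next(subnets)
        except StopIteration:
            # The cluster range is too small for the requested number of nodes.
            raise ValueError(
                f"{cluster_cidr} cannot hold {len(node_names)} /{node_prefix} blocks"
            ) from None
    return plan


# Hypothetical cluster range and node names.
plan = plan_node_pod_cidrs("10.64.0.0/16", ["node-a", "node-b", "node-c"])
for node, cidr in plan.items():
    print(node, cidr)
```

Allocating fixed‑size blocks per node keeps the routes announced by each host aggregated into a single prefix, at the cost of some unused addresses on lightly loaded nodes.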
3. Container Runtime Practice
iQIYI experimented with Docker and the Mesos Unified Containerizer before settling on Containerd + RunC/Kata for production. The talk highlighted common problems such as insufficient isolation, resource contention, and security concerns.
• Kata Containers provide stronger isolation but add overhead (≈1 s startup vs. 0.5 s for RunC) and have limitations (no host network, no checkpoint/restore, etc.).
• gVisor was evaluated but not adopted due to missing tooling support (e.g., ip, sshd, netstat).
Performance benchmarks showed Kata’s CPU and memory overhead to be modest, with a 5‑10 % slowdown in deep‑learning inference workloads.
Integration steps for using Kata in Kubernetes were demonstrated, including creating a RuntimeClass, configuring Containerd, and specifying the runtime in Pod specs.
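The three integration steps can be sketched as follows. This is an illustration under stated assumptions, not the configuration shown in the talk: the RuntimeClass name "kata", the pod name and the image are hypothetical, the node.k8s.io/v1 API version assumes a reasonably recent Kubernetes release, and the Containerd fragment is the form commonly used with the Kata 2.x shim in Containerd's CRI plugin configuration, which may differ between versions.

```python
import json

# Assumed Containerd CRI plugin fragment registering a runtime named "kata".
# Section path and runtime_type follow the commonly documented Kata 2.x setup;
# verify against the Containerd and Kata versions actually deployed.
CONTAINERD_FRAGMENT = """\
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
"""


def runtime_class(name, handler):
    """Build a RuntimeClass manifest; handler must match the Containerd runtime name."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def pod(name, image, runtime_class_name=None):
    """Build a Pod manifest; omitting runtime_class_name keeps the default runtime."""
    spec = {"containers": [{"name": name, "image": image}]}
    if runtime_class_name:
        spec["runtimeClassName"] = runtime_class_name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


rc = runtime_class("kata", handler="kata")           # hypothetical names
kata_pod = pod("transcode-job", "example/transcoder:latest", "kata")

# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(rc, indent=2))
print(json.dumps(kata_pod, indent=2))
```

Pods that do not set runtimeClassName continue to use the node's default runtime, so Kata can be introduced per workload without touching existing RunC services.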
4. Application Scenarios and Resource Utilization
The presentation also discussed low server utilization in the internet industry and described a hybrid Mesos‑KVM‑Docker scheduling approach that runs idle VMs at night and leverages Mesos oversubscription for Docker workloads. In the K8s + Kata/RunC environment, resource scheduling becomes more precise, allowing offline tasks (e.g., transcoding) to run on Kata while online services stay on RunC.
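A minimal sketch of the split between offline and online workloads is given here, assuming a RuntimeClass named "kata" already exists and that workloads carry a simple kind string. The set of offline kinds, the function names and the "kata" name are hypothetical; the talk does not describe how iQIYI's scheduler makes this decision.

```python
# Hypothetical workload kinds treated as offline; only transcoding is named in the talk.
OFFLINE_KINDS = {"transcoding", "batch"}


def choose_runtime_class(workload_kind):
    """Return the RuntimeClass name for a workload, or None for the default RunC runtime."""
    return "kata" if workload_kind in OFFLINE_KINDS else None


def apply_runtime(pod_spec, workload_kind):
    """Return a copy of pod_spec with runtimeClassName set or cleared for the workload."""
    spec = dict(pod_spec)  # shallow copy so the caller's spec is not mutated
    runtime = choose_runtime_class(workload_kind)
    if runtime is None:
        spec.pop("runtimeClassName", None)
    else:
        spec["runtimeClassName"] = runtime
    return spec


base_spec = {"containers": [{"name": "app", "image": "example/app:latest"}]}
offline_spec = apply_runtime(base_spec, "transcoding")
online_spec = apply_runtime(base_spec, "web")
print(offline_spec.get("runtimeClassName"), online_spec.get("runtimeClassName"))
```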
Overall, iQIYI’s cloud‑native journey illustrates the evolution from legacy Mesos orchestration to modern Kubernetes‑based platforms, the trade‑offs of different networking and runtime technologies, and practical solutions to operational challenges.
iQIYI Technical Product Team