
Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica is a cloud‑native, one‑stop platform that unifies Baidu's content‑understanding inference services. It automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time in half and reclaiming hundreds of GPUs.

Baidu Geek Talk

Background

Content understanding services at Baidu process text, images, and video from multiple publishing platforms (Baidu Baijiahao, Haokan, Quanmin, Live, etc.) and tag them for downstream feeds and search. Over time the team has accumulated nearly a thousand inference services, driving up the operational complexity of service maintenance, testing, and deployment.

Challenges

The existing workflow suffers from heavy configuration overhead during development, costly test‑data preparation, insufficient test resources, lengthy onboarding (about two weeks), low resource utilization, and high manual effort for capacity assessment and service upgrades.

Overall Approach

The team proposes a “one‑stop + unattended” strategy implemented by the Punica system, which aims to streamline business integration, iteration, and operation while improving resource efficiency.

1. One‑Stop Platform

Punica unifies multiple platforms into a single entry point, eliminating the need for users to learn and create separate PaaS services. Key capabilities include:

Parameter‑configuration portal with recommended defaults, enabling rapid testing and verification.

Automatic test‑resource provisioning without creating extra PaaS instances.

Automated test jobs that schedule idle resources, adjust load‑testing parameters, and present only final results to users.

Small‑traffic validation using spare resources before full rollout.

Automatic addition of monitoring and alerting rules for new services.

Support for both Python and high‑performance C++ inference micro‑services, as well as diverse GPU cards (T4, P4, A10, A30, Kunlun) and mixed‑GPU scheduling.
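The article does not describe how mixed‑GPU scheduling decides where a service lands, but the idea of matching a service's requirements against heterogeneous card pools can be sketched as below. The pool names match the cards listed above; the free‑slot counts, memory figures, and function names are assumptions made for illustration, not Punica's actual scheduler API.

```python
# Hypothetical sketch: choosing a GPU pool for a new inference service
# across heterogeneous card types. Numbers are illustrative only.
GPU_POOLS = {
    "T4":     {"memory_gb": 16, "free_slots": 4},
    "P4":     {"memory_gb": 8,  "free_slots": 0},
    "A10":    {"memory_gb": 24, "free_slots": 2},
    "A30":    {"memory_gb": 24, "free_slots": 1},
    "Kunlun": {"memory_gb": 16, "free_slots": 3},
}

def pick_gpu_pool(required_memory_gb, preferred=None):
    """Return the first pool that fits, trying the preferred card type first."""
    candidates = [preferred] if preferred in GPU_POOLS else []
    candidates += [name for name in GPU_POOLS if name not in candidates]
    for name in candidates:
        pool = GPU_POOLS[name]
        if pool["free_slots"] > 0 and pool["memory_gb"] >= required_memory_gb:
            return name
    return None  # no capacity anywhere: queue the request or trigger reclamation
```

With this shape, a service that prefers P4 cards but finds the P4 pool exhausted transparently falls back to another card type with enough memory, which is the essence of mixed‑GPU scheduling.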

2. Unattended Operations

Punica decouples business code from the underlying PaaS environment, allowing pre‑download of libraries (Python, CUDA, cuDNN, etc.) so that deployment only needs to fetch the model package. This yields high elasticity for FaaS‑style inference services.
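The cold‑start win described above comes from the fact that only the model package crosses the network, since the interpreter and CUDA/cuDNN libraries are already on the node. A minimal sketch, assuming a node‑local model cache and an injected download function (all paths and names here are illustrative, not Punica's API):

```python
import os
import shutil

def cold_start(service_name, model_uri, fetch, model_cache, instance_root):
    """Prepare an instance directory; only the model package is downloaded.

    The base environment (interpreter, CUDA, cuDNN) is assumed to be
    pre-provisioned on the node and shared across instances, so it is
    never copied or fetched here.
    """
    os.makedirs(model_cache, exist_ok=True)
    instance_dir = os.path.join(instance_root, service_name)
    os.makedirs(instance_dir, exist_ok=True)

    cached = os.path.join(model_cache, os.path.basename(model_uri))
    if not os.path.exists(cached):   # node-local cache hit skips the fetch
        fetch(model_uri, cached)     # e.g. an object-store download
    shutil.copy(cached, instance_dir)
    return instance_dir
```

Because repeat deployments of the same model hit the local cache, scaling out an existing service is close to instantaneous, which is what makes FaaS‑style elasticity practical for inference workloads.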

Key unattended features include:

Resource Auto‑Scheduling: Periodic reclamation of low‑utilization services, time‑slice sharing, and automatic scaling to meet high‑priority demand.

Self‑Healing Inspection System: Instance‑level health checks, automated fault diagnosis via decision trees, and automatic remediation of single‑instance failures.

Capacity Governance: Automatic reclamation of idle services, dynamic scaling based on load, and time‑slice resource pools for background tasks.
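The reclamation loop named in the list above can be sketched as a periodic scan that scales under‑utilized services down to a floor and returns the freed replicas to a shared pool. The utilization threshold, replica floor, and service record shape below are assumptions for the example, not figures from the article.

```python
MIN_REPLICAS = 1          # assumed floor: never scale a live service to zero
LOW_UTIL_THRESHOLD = 0.15 # assumed cutoff: reclaim below 15% avg GPU utilization

def reclaim_idle(services):
    """Scale down low-utilization services; return (updated services, GPUs freed)."""
    reclaimed = 0
    updated = []
    for svc in services:
        replicas = svc["replicas"]
        if svc["gpu_util"] < LOW_UTIL_THRESHOLD and replicas > MIN_REPLICAS:
            reclaimed += replicas - MIN_REPLICAS
            svc = {**svc, "replicas": MIN_REPLICAS}  # freed GPUs go to the pool
        updated.append(svc)
    return updated, reclaimed
```

Run on a schedule, a loop like this is how "over 400 GPU cards reclaimed" becomes an automatic outcome rather than a manual audit.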

Technical Architecture

The system consists of four main components:

FaaS System: Container framework, scheduler, and proxy gateway that provide elastic deployment and routing for inference services.

Platform Front‑End: Unified UI for service registration, testing, release, and resource reporting.

Platform Back‑End: OpenAPI for service management, testing, and resource governance.

Service Governance: Self‑healing, capacity scaling, low‑utilization reclamation, and a resource market.

Service Deployment

Punica’s container framework separates business code from the base environment and supports heartbeats, deployment, manual intervention, debugging, and offline testing. Nodes are pre‑provisioned for specific hardware (CPU, GPU types), enabling fast instance creation and version rollout.
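The heartbeat support mentioned above is the raw signal that the self‑healing inspection system acts on. A toy sketch of the detection side, assuming each instance reports a last‑heartbeat timestamp (the timeout value and record layout are made up for illustration):

```python
HEARTBEAT_TIMEOUT = 15.0  # assumed: seconds of silence before an instance is stale

def check_instances(last_heartbeat, now):
    """Split instance IDs into healthy and stale by last-heartbeat age.

    Stale instances would then be handed to the fault-diagnosis and
    remediation pipeline rather than killed outright.
    """
    healthy, stale = [], []
    for instance_id, ts in last_heartbeat.items():
        (healthy if now - ts < HEARTBEAT_TIMEOUT else stale).append(instance_id)
    return healthy, stale
```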

Service Scheduling & Discovery

The scheduler maps inference replicas to PaaS nodes, allowing capacity expansion, instance migration, and version updates. A unified gateway performs authentication, rate‑limiting, and routing based on user, token, and feature ID.
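The gateway's decision path — authenticate the token, apply a per‑user rate limit, then route on feature ID — can be sketched as below. The token table, route table, limit, and sliding‑window scheme are all assumptions for the example, not Punica's implementation.

```python
import time
from collections import defaultdict, deque

TOKENS = {"tok-123": "user-a"}                              # token -> user (illustrative)
ROUTES = {"ocr": "http://ocr-svc", "nsfw": "http://nsfw-svc"}  # feature ID -> backend
RATE_LIMIT = 5                                              # requests/second per user
_windows = defaultdict(deque)                               # per-user request timestamps

def route(token, feature_id, now=None):
    """Return (status, backend): 401 bad token, 429 rate limited, 404 unknown feature."""
    now = time.time() if now is None else now
    user = TOKENS.get(token)
    if user is None:
        return 401, None                        # authentication failed
    window = _windows[user]
    while window and now - window[0] >= 1.0:    # drop timestamps older than 1s
        window.popleft()
    if len(window) >= RATE_LIMIT:
        return 429, None                        # over the per-user limit
    window.append(now)
    backend = ROUTES.get(feature_id)
    return (200, backend) if backend else (404, None)
```

Routing on feature ID rather than hostname is what lets one gateway front hundreds of inference services behind a single entry point.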

Results

By adopting the one‑stop, unattended model, onboarding time was cut from two weeks to one week, service iteration speed doubled, scaling speed improved five‑fold, and more than 400 GPU cards were reclaimed, significantly reducing operational cost.

Tags: cloud native, AI inference, resource scheduling, self-healing, inference platform, service orchestration