Cloud Native · 13 min read

Design and Implementation of Punica: A One‑Stop, Unattended AI Inference Platform

This article describes Punica, a cloud‑native, function‑as‑a‑service (FaaS) platform that unifies content‑understanding inference services behind a one‑stop portal with unattended operations, improving deployment speed and resource utilization while reducing the manual effort of AI model serving.


Background
Content understanding services tag and recommend text, image, and video content from multiple publishing platforms for Baidu Feed and Search. They face high configuration costs, low resource efficiency, and complex multi‑platform onboarding.

Overall Idea
To improve the efficiency of business integration, iteration, and maintenance while raising resource utilization, Punica combines a one‑stop platform with unattended operation mechanisms.

One‑Stop Platform
The Punica platform consolidates multiple PaaS platforms, so users no longer need to create separate PaaS services. It provides unified interfaces for inference service registration, testing, deployment, and operation, cutting onboarding time from two weeks to one.
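As a rough illustration of what "one‑stop" means in practice, the sketch below models registration and deployment behind a single portal object. Punica's real API is not public, so the class names, fields, and the `models://` URI scheme here are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical record for a registered inference service; field names
# are assumptions, not Punica's actual schema.
@dataclass
class InferenceService:
    name: str
    model_uri: str
    hardware: str = "cpu"      # e.g. "cpu", "gpu-t4", "kunlun" (illustrative)
    replicas: int = 1
    status: str = "registered"

class Portal:
    """Single entry point standing in for register/test/deploy steps."""
    def __init__(self):
        self._services: dict[str, InferenceService] = {}

    def register(self, name: str, model_uri: str, **opts) -> InferenceService:
        svc = InferenceService(name=name, model_uri=model_uri, **opts)
        self._services[name] = svc
        return svc

    def deploy(self, name: str) -> InferenceService:
        svc = self._services[name]
        svc.status = "deployed"   # one call instead of per-PaaS setup
        return svc

portal = Portal()
portal.register("image-tagger", "models://image-tagger/v3", hardware="gpu-t4")
svc = portal.deploy("image-tagger")
print(svc.status)  # deployed
```

The point of the sketch is the shape of the workflow: one object owns the whole lifecycle, rather than each PaaS platform exposing its own registration and deployment steps.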

Unattended Operations
By decoupling inference packages from heavy base libraries (Python, CUDA, cuDNN) and pre‑downloading those libraries to nodes, deployment becomes fast and highly elastic. Automated resource scheduling, self‑healing inspection, and capacity scaling further minimize manual intervention.
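A back‑of‑the‑envelope model shows why pre‑downloading the base layers matters: a deploy then only ships the small business package. The layer sizes and bandwidth below are invented numbers, not measurements from Punica.

```python
# Assumed sizes (MB) for heavy base layers pre-downloaded to every node.
BASE_LAYERS = {"python-runtime": 500, "cuda": 2000, "cudnn": 800}
node_cache = set(BASE_LAYERS)  # node already has all base layers warmed

def deploy_seconds(package_mb: int, bandwidth_mb_s: int = 100) -> float:
    """Transfer time for one deployment: fetch only layers missing on the node."""
    to_fetch = sum(size for layer, size in BASE_LAYERS.items()
                   if layer not in node_cache)
    return (to_fetch + package_mb) / bandwidth_mb_s

# Cold path: everything (base layers + 50 MB package) downloaded at deploy time.
cold = (sum(BASE_LAYERS.values()) + 50) / 100   # 33.5 s
# Warm path: base layers cached, only the 50 MB business package moves.
warm = deploy_seconds(50)                        # 0.5 s
print(cold, warm)
```

Under these assumed numbers the warm deploy moves two orders of magnitude less data, which is the mechanism behind the fast, elastic scaling the article describes.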

Technical Solution
Punica applies cloud‑native concepts (IaaS, PaaS, FaaS) to build a FaaS‑style inference service. The architecture consists of a FaaS system (container framework, scheduler, and proxy gateway), a unified front‑end portal, and a back end providing OpenAPI access, service governance, and capacity management.

Service Deployment
A container framework separates the business environment from the base environment and supports heartbeat, deployment, manual intervention, debugging, and offline deployment. The scheduler maps inference resources to PaaS apps, enabling capacity scaling, instance migration, and version updates across CPU and GPU (T4, P4, A10, A30, Kunlun) nodes.
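The scheduler's core job of mapping replicas onto heterogeneous nodes can be sketched as a greedy placement over node kinds. This is a minimal stand‑in, not Punica's actual scheduler; node names, the `free_slots` model, and the greedy policy are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    kind: str        # "cpu", "t4", "p4", "a10", "a30", "kunlun", ...
    free_slots: int  # simplistic capacity model (assumed)

def place(nodes: list[Node], kind: str, replicas: int) -> list[str]:
    """Greedy placement: fill matching nodes until all replicas land."""
    placement: list[str] = []
    for node in nodes:
        if node.kind != kind:
            continue
        take = min(node.free_slots, replicas - len(placement))
        placement += [node.name] * take
        node.free_slots -= take
        if len(placement) == replicas:
            return placement
    raise RuntimeError(f"not enough {kind} capacity for {replicas} replicas")

nodes = [Node("gpu-01", "t4", 2), Node("gpu-02", "t4", 3), Node("cpu-01", "cpu", 8)]
result = place(nodes, "t4", 4)
print(result)  # ['gpu-01', 'gpu-01', 'gpu-02', 'gpu-02']
```

A real scheduler would also weigh migration cost, version rollout, and bin‑packing quality, but the kind‑matching step above is the part that lets one scheduler span CPU, multiple GPU generations, and Kunlun accelerators.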

Service Discovery & Gateway
The scheduler exposes the deployment topology, allowing the gateway to perform routing, authentication, and rate limiting based on user, token, and feature ID.
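To make the gateway's admission logic concrete, here is a hedged sketch of token authentication followed by per‑(user, feature ID) rate limiting with a sliding window. The token store, window length, and limit are invented; Punica's actual policy is not described at this level of detail.

```python
from collections import defaultdict

TOKENS = {"tok-abc": "user-1"}   # token -> user (hypothetical store)
LIMIT_PER_WINDOW, WINDOW_S = 2, 60
_hits: dict = defaultdict(list)  # (user, feature_id) -> request timestamps

def admit(token: str, feature_id: str, now: float) -> bool:
    """Authenticate the token, then rate-limit per (user, feature_id)."""
    user = TOKENS.get(token)
    if user is None:
        return False                      # authentication failed
    key = (user, feature_id)
    _hits[key] = [t for t in _hits[key] if now - t < WINDOW_S]
    if len(_hits[key]) >= LIMIT_PER_WINDOW:
        return False                      # rate limited
    _hits[key].append(now)
    return True

a = admit("tok-abc", "ocr", now=0.0)     # True: first request
b = admit("tok-abc", "ocr", now=1.0)     # True: second request in window
c = admit("tok-abc", "ocr", now=2.0)     # False: over the limit
d = admit("bad-token", "ocr", now=3.0)   # False: unknown token
print(a, b, c, d)
```

Keying the limiter on (user, feature ID) rather than user alone means one noisy feature cannot starve a customer's other inference traffic.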

Strategy Controller
Governance policies automate self‑healing, auto‑scaling, low‑utilization reclamation, resource‑market balancing, and idle‑task handling, reducing operational costs and improving service stability.
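A strategy controller of this sort can be thought of as a pure decision function evaluated per service, with higher‑priority policies (self‑healing) checked before elasticity ones. The thresholds and metric names below are invented for illustration, not Punica's actual tuning.

```python
def decide(metrics: dict) -> str:
    """Pick one governance action for a service, in priority order.
    Thresholds (0.80, 0.30, 0.10, 60 min) are assumed values."""
    if not metrics["healthy"]:
        return "self-heal"                 # restart or migrate the instance
    util = metrics["gpu_util"]
    if util > 0.80:
        return "scale-out"                 # add capacity under load
    if util < 0.10 and metrics["idle_minutes"] > 60:
        return "reclaim"                   # low-utilization reclamation
    if util < 0.30:
        return "scale-in"                  # shrink over-provisioned service
    return "noop"

print(decide({"healthy": False, "gpu_util": 0.50, "idle_minutes": 0}))   # self-heal
print(decide({"healthy": True,  "gpu_util": 0.90, "idle_minutes": 0}))   # scale-out
print(decide({"healthy": True,  "gpu_util": 0.05, "idle_minutes": 90}))  # reclaim
```

Running such a function continuously over every service's metrics is what makes operations "unattended": humans set the policy once, and the controller applies it.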

Conclusion
By re‑architecting inference services from a FaaS perspective, Punica achieves 100% faster onboarding, 5× faster scaling, and significant GPU resource savings, demonstrating the benefits of a one‑stop, unattended cloud‑native platform for AI workloads.

FaaS · cloud native · platform engineering · AI inference · resource scheduling · service orchestration
Written by High Availability Architecture
Official account for High Availability Architecture.
