Cloud Native · 13 min read

Design and Implementation of Punica: A One‑Stop, Unattended AI Inference Platform

This article describes Punica, a cloud‑native, function‑as‑a‑service (FaaS) platform that unifies content‑understanding inference services behind a one‑stop portal with unattended operations, improving deployment speed and resource utilization while reducing the manual effort of AI model serving.


Background
Content understanding services tag and recommend text, image, and video content from multiple publishing platforms for Baidu Feed and Search. They face high configuration costs, low resource efficiency, and complex multi‑platform onboarding.

Overall Idea
To improve the efficiency of business integration, iteration, and maintenance while raising resource utilization, Punica combines a one‑stop platform with unattended operation mechanisms.

One‑Stop Platform
The Punica platform consolidates multiple PaaS platforms, so users no longer need to create separate PaaS services. It provides unified interfaces for inference service registration, testing, deployment, and operation, cutting onboarding time from two weeks to one.
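As a rough illustration of what "one‑stop" means in practice, the sketch below models registration and deployment behind a single portal object. Punica's real API is not public, so the class names, fields, and the `models://` URI scheme here are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical record for a registered inference service; field names
# are assumptions, not Punica's actual schema.
@dataclass
class InferenceService:
    name: str
    model_uri: str
    hardware: str = "cpu"      # e.g. "cpu", "gpu-t4", "kunlun" (illustrative)
    replicas: int = 1
    status: str = "registered"

class Portal:
    """Single entry point standing in for register/test/deploy steps."""
    def __init__(self):
        self._services: dict[str, InferenceService] = {}

    def register(self, name: str, model_uri: str, **opts) -> InferenceService:
        svc = InferenceService(name=name, model_uri=model_uri, **opts)
        self._services[name] = svc
        return svc

    def deploy(self, name: str) -> InferenceService:
        svc = self._services[name]
        svc.status = "deployed"   # one call instead of per-PaaS setup
        return svc

portal = Portal()
portal.register("image-tagger", "models://image-tagger/v3", hardware="gpu-t4")
svc = portal.deploy("image-tagger")
print(svc.status)  # deployed
```

The point of the sketch is the shape of the workflow: one object owns the whole lifecycle, rather than each PaaS platform exposing its own registration and deployment steps.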

Unattended Operations
By decoupling inference packages from heavy base libraries (Python, CUDA, cuDNN) and pre‑downloading those libraries to nodes, deployment becomes fast and highly elastic. Automated resource scheduling, self‑healing inspection, and capacity scaling further minimize manual intervention.
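A back‑of‑the‑envelope model shows why pre‑downloading the base layers matters: a deploy then only ships the small business package. The layer sizes and bandwidth below are invented numbers, not measurements from Punica.

```python
# Assumed sizes (MB) for heavy base layers pre-downloaded to every node.
BASE_LAYERS = {"python-runtime": 500, "cuda": 2000, "cudnn": 800}
node_cache = set(BASE_LAYERS)  # node already has all base layers warmed

def deploy_seconds(package_mb: int, bandwidth_mb_s: int = 100) -> float:
    """Transfer time for one deployment: fetch only layers missing on the node."""
    to_fetch = sum(size for layer, size in BASE_LAYERS.items()
                   if layer not in node_cache)
    return (to_fetch + package_mb) / bandwidth_mb_s

# Cold path: everything (base layers + 50 MB package) downloaded at deploy time.
cold = (sum(BASE_LAYERS.values()) + 50) / 100   # 33.5 s
# Warm path: base layers cached, only the 50 MB business package moves.
warm = deploy_seconds(50)                        # 0.5 s
print(cold, warm)
```

Under these assumed numbers the warm deploy moves two orders of magnitude less data, which is the mechanism behind the fast, elastic scaling the article describes.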

Technical Solution
Punica applies cloud‑native concepts (IaaS, PaaS, FaaS) to build a FaaS‑style inference service. The architecture consists of a FaaS system (container framework, scheduler, and proxy gateway), a unified front‑end portal, and a back end providing OpenAPI access, service governance, and capacity management.

Service Deployment
A container framework separates the business environment from the base environment and supports heartbeat, deployment, manual intervention, debugging, and offline deployment. The scheduler maps inference resources to PaaS apps, enabling capacity scaling, instance migration, and version updates across CPU and GPU (T4, P4, A10, A30, Kunlun) nodes.
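The scheduler's core job of mapping replicas onto heterogeneous nodes can be sketched as a greedy placement over node kinds. This is a minimal stand‑in, not Punica's actual scheduler; node names, the `free_slots` model, and the greedy policy are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    kind: str        # "cpu", "t4", "p4", "a10", "a30", "kunlun", ...
    free_slots: int  # simplistic capacity model (assumed)

def place(nodes: list[Node], kind: str, replicas: int) -> list[str]:
    """Greedy placement: fill matching nodes until all replicas land."""
    placement: list[str] = []
    for node in nodes:
        if node.kind != kind:
            continue
        take = min(node.free_slots, replicas - len(placement))
        placement += [node.name] * take
        node.free_slots -= take
        if len(placement) == replicas:
            return placement
    raise RuntimeError(f"not enough {kind} capacity for {replicas} replicas")

nodes = [Node("gpu-01", "t4", 2), Node("gpu-02", "t4", 3), Node("cpu-01", "cpu", 8)]
result = place(nodes, "t4", 4)
print(result)  # ['gpu-01', 'gpu-01', 'gpu-02', 'gpu-02']
```

A real scheduler would also weigh migration cost, version rollout, and bin‑packing quality, but the kind‑matching step above is the part that lets one scheduler span CPU, multiple GPU generations, and Kunlun accelerators.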

Service Discovery & Gateway
The scheduler exposes the deployment topology, allowing the gateway to perform routing, authentication, and rate limiting based on user, token, and feature ID.
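To make the gateway's admission logic concrete, here is a hedged sketch of token authentication followed by per‑(user, feature ID) rate limiting with a sliding window. The token store, window length, and limit are invented; Punica's actual policy is not described at this level of detail.

```python
from collections import defaultdict

TOKENS = {"tok-abc": "user-1"}   # token -> user (hypothetical store)
LIMIT_PER_WINDOW, WINDOW_S = 2, 60
_hits: dict = defaultdict(list)  # (user, feature_id) -> request timestamps

def admit(token: str, feature_id: str, now: float) -> bool:
    """Authenticate the token, then rate-limit per (user, feature_id)."""
    user = TOKENS.get(token)
    if user is None:
        return False                      # authentication failed
    key = (user, feature_id)
    _hits[key] = [t for t in _hits[key] if now - t < WINDOW_S]
    if len(_hits[key]) >= LIMIT_PER_WINDOW:
        return False                      # rate limited
    _hits[key].append(now)
    return True

a = admit("tok-abc", "ocr", now=0.0)     # True: first request
b = admit("tok-abc", "ocr", now=1.0)     # True: second request in window
c = admit("tok-abc", "ocr", now=2.0)     # False: over the limit
d = admit("bad-token", "ocr", now=3.0)   # False: unknown token
print(a, b, c, d)
```

Keying the limiter on (user, feature ID) rather than user alone means one noisy feature cannot starve a customer's other inference traffic.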

Strategy Controller
Governance policies automate self‑healing, auto‑scaling, low‑utilization reclamation, resource‑market balancing, and idle‑task handling, reducing operational costs and improving service stability.
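A strategy controller of this sort can be thought of as a pure decision function evaluated per service, with higher‑priority policies (self‑healing) checked before elasticity ones. The thresholds and metric names below are invented for illustration, not Punica's actual tuning.

```python
def decide(metrics: dict) -> str:
    """Pick one governance action for a service, in priority order.
    Thresholds (0.80, 0.30, 0.10, 60 min) are assumed values."""
    if not metrics["healthy"]:
        return "self-heal"                 # restart or migrate the instance
    util = metrics["gpu_util"]
    if util > 0.80:
        return "scale-out"                 # add capacity under load
    if util < 0.10 and metrics["idle_minutes"] > 60:
        return "reclaim"                   # low-utilization reclamation
    if util < 0.30:
        return "scale-in"                  # shrink over-provisioned service
    return "noop"

print(decide({"healthy": False, "gpu_util": 0.50, "idle_minutes": 0}))   # self-heal
print(decide({"healthy": True,  "gpu_util": 0.90, "idle_minutes": 0}))   # scale-out
print(decide({"healthy": True,  "gpu_util": 0.05, "idle_minutes": 90}))  # reclaim
```

Running such a function continuously over every service's metrics is what makes operations "unattended": humans set the policy once, and the controller applies it.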

Conclusion
By re‑architecting inference services from a FaaS perspective, Punica achieves 100% faster onboarding, 5× faster scaling, and significant GPU resource savings, demonstrating the benefits of a one‑stop, unattended cloud‑native platform for AI workloads.

FaaS · cloud native · platform engineering · AI inference · resource scheduling · service orchestration
Written by High Availability Architecture
Official account for High Availability Architecture.
