
Punica System: Enhancing AI Inference Service Efficiency Through FaaS Architecture

The Punica system unifies AI inference development, testing, deployment, and maintenance on a FaaS‑based one‑stop platform that automates resource scheduling, self‑healing, and monitoring and supports multiple frameworks and GPUs, thereby doubling onboarding speed, quintupling scaling efficiency, and reclaiming hundreds of GPU cards.

Baidu Tech Salon

This article introduces the Punica system, a comprehensive platform designed to improve the efficiency of AI inference services. The system addresses challenges in service development, testing, deployment, and maintenance by implementing a one-stop platform and autonomous operation mechanisms.

The content understanding inference service processes text, images, and videos from multiple sources like Baijiahao, Haokan, Quanmin, and live broadcasts, marking content tags for Baidu's Feed information stream and search services. With nearly a thousand inference services accumulated over time, managing existing services and efficiently onboarding new ones has become increasingly challenging.

The article outlines key problems in the current workflow: complex framework parameters and high configuration costs during development; expensive test data construction and insufficient test resources; lengthy learning curves for multiple PaaS platforms and complicated monitoring setup during deployment; and slow deployment times (20+ minutes) with low elasticity and poor resource utilization.

The solution focuses on two main aspects: one-stop platform and autonomous operation. The one-stop platform unifies multiple platforms, reduces user learning costs, provides recommended parameter configurations, enables rapid test resource allocation, offers automated testing tasks, supports quick small-scale verification, and automates monitoring and alerting for new services. It also lowers business resource costs by supporting both Python and high-performance C++ microservice frameworks, multiple GPU types (T4, P4, A10, A30, Kunlun), and model performance optimization through PaddlePaddle integration.
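As an illustration of the recommended-parameter idea, the sketch below shows how a platform might fill in sensible per-GPU defaults when the user omits them. All names and numbers here (`InferenceServiceSpec`, `RECOMMENDED_BATCH`, the batch sizes) are hypothetical assumptions for illustration; the article does not describe Punica's actual recommendation tables.

```python
from dataclasses import dataclass

# Hypothetical per-GPU-type recommended batch sizes; Punica's real
# recommendation data is not public.
RECOMMENDED_BATCH = {"T4": 8, "P4": 4, "A10": 16, "A30": 16, "Kunlun": 8}

@dataclass
class InferenceServiceSpec:
    name: str
    gpu_type: str
    framework: str = "paddle"   # PaddlePaddle integration is noted in the article
    runtime: str = "python"     # or "cpp" for the high-performance framework
    batch_size: int = 0         # 0 means "use the recommended default"

    def __post_init__(self) -> None:
        # Fill in a recommended default instead of forcing users to tune it.
        if self.batch_size == 0:
            self.batch_size = RECOMMENDED_BATCH.get(self.gpu_type, 4)

# A user only has to name the service and GPU type; the rest is defaulted.
spec = InferenceServiceSpec(name="content-tagger", gpu_type="A10")
```

The point of the pattern is that configuration cost drops to the two or three fields a service owner actually knows, which matches the "recommended parameter configurations" claim above.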

The autonomous operation component includes resource auto-scheduling (regular resource recovery, time-based resource reuse, and capacity expansion/contraction), self-healing inspection systems (instance-level and service-level health monitoring, fault root cause analysis), and reduced maintenance costs (PaaS App reduction by one order of magnitude, simplified cloud-native transformation).
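A minimal sketch of instance-level self-healing inspection, assuming a heartbeat-timeout model: a replica whose heartbeats have stopped for several consecutive inspection rounds is flagged for restart. The class names and threshold values are illustrative assumptions, not Punica's actual implementation.

```python
import time

HEARTBEAT_TIMEOUT_S = 30  # assumption: real thresholds are not published
MAX_MISSED_ROUNDS = 3     # flag only after repeated misses, to avoid flapping

class Instance:
    def __init__(self, instance_id: str):
        self.instance_id = instance_id
        self.last_heartbeat = time.time()
        self.missed = 0

def inspect(instances, now=None):
    """Instance-level health check: return the IDs of replicas whose
    heartbeats have stopped, so a controller can restart or reschedule them."""
    now = now if now is not None else time.time()
    to_heal = []
    for inst in instances:
        if now - inst.last_heartbeat > HEARTBEAT_TIMEOUT_S if False else now - inst.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            inst.missed += 1
        else:
            inst.missed = 0  # a fresh heartbeat clears the counter
        if inst.missed >= MAX_MISSED_ROUNDS:
            to_heal.append(inst.instance_id)
    return to_heal
```

Service-level checks and fault root-cause analysis would layer on top of this, aggregating instance signals per service before deciding on a repair action.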

The technical architecture separates business environments from basic environments, achieving high elasticity through FaaS (Function as a Service) principles. The system consists of four main parts: FaaS system (container framework, scheduling system, proxy gateway), platform frontend (unified service access, testing, deployment management), platform backend (OpenAPI support, service management), and service governance (self-healing inspection, capacity scaling).

Key technical implementations include: container framework with heartbeat mechanisms, deployment capabilities, intervention interfaces, and debug mechanisms; service scheduling with Resource-to-App and Replica-to-Node mapping; service discovery through a unified gateway with authentication and rate limiting; and strategy controllers for autonomous operations including service inspection, capacity scaling, resource recycling, and idle task support.
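The two scheduling mappings named above can be sketched as follows: Resource-to-App splits a resource pool among applications, and Replica-to-Node places each replica on a concrete node. The allocation policies and all sizing numbers here are illustrative assumptions, not Punica's actual scheduler.

```python
def resource_to_app(pool_gpus: int, app_quotas: dict) -> dict:
    """Resource-to-App: split a pool of GPUs among apps in proportion
    to their quota weights (illustrative policy)."""
    total = sum(app_quotas.values())
    return {app: pool_gpus * quota // total for app, quota in app_quotas.items()}

def replica_to_node(replicas: int, node_free_gpus: dict) -> dict:
    """Replica-to-Node: greedily place each replica on the node with
    the most free GPUs (illustrative policy)."""
    placement = {}
    free = dict(node_free_gpus)
    for r in range(replicas):
        node = max(free, key=free.get)
        if free[node] == 0:
            raise RuntimeError("no free GPU capacity")
        placement[f"replica-{r}"] = node
        free[node] -= 1
    return placement

alloc = resource_to_app(8, {"tagger": 3, "ocr": 1})          # {'tagger': 6, 'ocr': 2}
plan = replica_to_node(alloc["ocr"], {"node-a": 4, "node-b": 4})
```

Splitting scheduling into these two levels keeps pool-level policy (quotas, reclamation) independent of placement details (bin-packing onto nodes), which is what makes the regular resource recovery and expansion/contraction described earlier tractable.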

The system has achieved significant improvements: 100% increase in new service onboarding efficiency, 5x faster service scaling, and recovery of over 400 GPU cards. It has been successfully deployed for new business iterations and development.

Tags: cloud native, resource optimization, AI inference, GPU scheduling, service governance, autonomous operations, container framework, FaaS architecture
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
