
Evolution and Architecture of Baidu's Fengjing APM System

Since its 2016 debut, Baidu's Fengjing APM system has gone through four major releases. It moved from invasive jar-based probes to non-invasive bytecode agents, added modular hot-swap plugins, and scaled to thousands of containers. It now handles billions of daily metrics via Kafka, Doris, and SIA TSDB. Along the way the team had to address probe-upgrade downtime, data-ingestion volume, and call-graph query latency.

Baidu Geek Talk

Fengjing is Baidu's commercial APM system focusing on Java applications, covering thousands of services and containers. It automatically instruments mainstream middleware (Spring Web, RPC, databases, caches) to provide full‑stack performance metrics, health status, and alerts.

Data collection is performed by the Fengjing probe, which injects into business processes without affecting them. Collected metrics are stored in Baidu's SIA TSDB for time‑series visualization and in the Doris (Palo) data warehouse for call‑graph analysis.

The article outlines the evolution of Fengjing from its inception in 2016 (Version 1.0) through successive releases 2.0, 3.0, and 4.0, describing the architectural changes, challenges faced, and technical solutions adopted at each stage:

Version 1.0: an invasive probe that required manual jar dependencies and hard-coded data enrichment.

Version 2.0: introduced Java agent + CGLIB AOP to reduce integration cost, sent protobuf + gzip payloads over HTTP to Kafka, and switched storage to Doris.

Version 3.0: moved to non-invasive bytecode-enhancement probes, added modular plugin classloaders for hot-swap, and addressed the scalability of data ingestion and query latency.

Version 4.0: integrated with micro-service platforms, scaled to thousands of containers, and tackled probe hot-upgrade and massive data volume.
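The compression step of the Version 2.0 transport can be sketched as follows. This is a minimal illustration, not Fengjing's code: the class name `MetricBatchCodec` is hypothetical, the input is assumed to be an already serialized batch (plain bytes stand in for protobuf output here), and the HTTP upload and Kafka hand-off are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of Fengjing. */
final class MetricBatchCodec {
    private MetricBatchCodec() {}

    /** Gzip-compresses an already serialized batch before it is sent. */
    static byte[] compress(byte[] serialized) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so read the bytes only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(serialized);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reverses compress(); this is what a receiving service would do. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A probe would call `MetricBatchCodec.compress(batchBytes)` and send the result as the HTTP request body. Metric batches tend to be repetitive (the same service and endpoint names recur), which is the kind of input gzip compresses well.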

Key technical challenges discussed include long service-side debugging cycles, high cost of log collection, probe upgrade downtime, massive data ingestion (150 billion rows per day), and slow call-graph queries. Solutions involve hot-plug classloader bridges, custom plugin isolation, roll-up tables, and streaming ingestion to Doris.
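To make the roll-up idea concrete, here is a small in-memory sketch of pre-aggregation: detailed rows are summed into a coarser table with fewer dimensions, so a query over the coarse dimensions scans far fewer rows. In Doris this is done by the storage engine. The row layout, field names, and the `RollupSketch` class are assumptions made for illustration and do not come from the Fengjing schema.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of roll-up pre-aggregation; not Doris or Fengjing code. */
final class RollupSketch {
    private RollupSketch() {}

    /** One detailed metric row; all field names are assumed. */
    static final class Row {
        final String service;
        final String endpoint;
        final long minute;
        final long calls;
        final long totalLatencyMs;

        Row(String service, String endpoint, long minute, long calls, long totalLatencyMs) {
            this.service = service;
            this.endpoint = endpoint;
            this.minute = minute;
            this.calls = calls;
            this.totalLatencyMs = totalLatencyMs;
        }
    }

    /**
     * Aggregates rows by (service, minute), dropping the endpoint dimension.
     * Each value holds {sum of calls, sum of latency in ms}.
     */
    static Map<String, long[]> rollUp(List<Row> rows) {
        Map<String, long[]> rolled = new TreeMap<>();
        for (Row row : rows) {
            String key = row.service + "@" + row.minute;
            long[] sums = rolled.computeIfAbsent(key, k -> new long[2]);
            sums[0] += row.calls;
            sums[1] += row.totalLatencyMs;
        }
        return rolled;
    }
}
```

The trade-off is extra write and storage cost for the aggregated copy in exchange for cheaper reads, which suits dashboards that mostly query per-service totals rather than per-endpoint detail.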

The system also relies on Baidu's SIA TSDB for time‑series storage and leverages its multi‑dimensional visualization capabilities. The article concludes with a summary of the current architecture and future directions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
