Evolution and Architecture of Baidu's Fengjing APM System

This article traces the four-year evolution of Fengjing, Baidu's performance-monitoring platform. It covers data collection and processing pipelines, the successive architecture versions (1.0 to 4.0), challenges such as probe intrusiveness and massive data volume, and the engineering solutions that enabled large-scale, low-cost, cloud-native observability for thousands of Java services.

Baidu Intelligent Testing

Fengjing is Baidu's commercial APM system focused on Java applications, automatically instrumenting mainstream middleware (Spring Web, RPC, databases, caches) to provide full‑stack performance metrics, health status, and alerting for thousands of business services.

The platform collects data via probes that embed into business processes without affecting them, stores time‑series data in Baidu's TSDB and call‑chain data in the Doris (formerly Palo) data warehouse, and offers visual reports, anomaly alerts, error stack analysis, and service latency insights.

Since its inception in 2016, Fengjing has undergone several architectural milestones:

1.0 used an invasive Java-agent probe requiring explicit dependency jars and hard-coded business data, with data written to disk, shipped via Kafka, processed by Storm, and stored in HBase.

2.0 reduced probe integration cost by adopting a Java agent with CGLIB-based AOP and consolidating dependencies into a single jar. It also stopped writing data to disk: the probe sent protobuf-encoded, gzip-compressed payloads over HTTP into Kafka, which lowered I/O overhead.
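As a rough illustration of why compressing batches before sending them over HTTP cuts I/O, the sketch below gzips a batch of repetitive metric lines and compares sizes. It is not Fengjing's code: the class name `ProbeBatchCodec`, the line format, and the field names are hypothetical, and plain text stands in for the protobuf encoding so the example needs only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compresses a serialized probe batch before upload.
class ProbeBatchCodec {

    // Gzip-compress an already serialized payload (protobuf bytes in the real system).
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Inverse operation, as the receiving side would run it before parsing.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: monitoring batches tend to repeat service and method names.
        StringBuilder batch = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            batch.append("service=order-api;method=GET /orders;latencyMs=")
                 .append(i % 50)
                 .append('\n');
        }
        byte[] raw = batch.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw=" + raw.length + " bytes, gzip=" + packed.length + " bytes");
    }
}
```

Monitoring payloads repeat the same service and method names many times, which is the kind of redundancy gzip removes well. The actual ratio depends on the data.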

3.0 eliminated invasive instrumentation by adopting bytecode enhancement, so applications could be onboarded without code changes. It introduced a plugin-based architecture in which each plugin runs in an isolated classloader and can be hot-swapped, and it migrated backend storage to the Doris MPP SQL warehouse, which simplified analytics and removed the need for Spark or Storm jobs.
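The classloader isolation behind hot-swappable plugins can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not Fengjing's implementation: the class `PluginIsolationSketch` and its methods are hypothetical, and a real probe would pass actual plugin jar URLs instead of an empty list.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch of per-plugin classloader isolation and hot swap.
class PluginIsolationSketch {

    // Creates a loader that sees only the given plugin jars plus JDK classes.
    // The parent is the platform loader rather than the application loader,
    // so plugin dependencies cannot clash with the business application's classpath.
    static URLClassLoader newPluginLoader(URL... pluginJars) {
        return new URLClassLoader(pluginJars, ClassLoader.getPlatformClassLoader());
    }

    // Reports whether a class name resolves through the given loader.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // Hot swap: close the old loader and build a fresh one over the new jars.
    // Classes from the old loader become collectable once nothing references them.
    static URLClassLoader swap(URLClassLoader old, URL... newJars) {
        try {
            old.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return newPluginLoader(newJars);
    }

    public static void main(String[] args) {
        URLClassLoader loader = newPluginLoader(); // no jars: enough to show visibility rules
        System.out.println("JDK class visible: " + canLoad(loader, "java.lang.String"));
        System.out.println("application class visible: "
                + canLoad(loader, PluginIsolationSketch.class.getName()));
        loader = swap(loader); // stands in for replacing a plugin with a newer version
    }
}
```

Because each plugin gets its own loader, replacing a plugin only requires dropping that loader and creating a new one. The host process does not need to restart.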

4.0 scaled with the micro‑service and containerization wave, integrating Fengjing into Baidu's micro‑service hosting platform, expanding deployment from hundreds to thousands of applications and containers.

Key challenges addressed include reducing probe restart impact, handling 150 billion daily data points, improving call‑chain query latency, and separating data processing layers to support real‑time visualization and low‑cost operations.
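To put the 150 billion daily data points in perspective, a quick back-of-the-envelope conversion gives the average ingest rate. This assumes traffic is spread evenly across the day, which the source does not state; real peaks would be higher.

```latex
\frac{150 \times 10^{9}\ \text{points/day}}{86\,400\ \text{s/day}}
  \approx 1.74 \times 10^{6}\ \text{points/s}
```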

The article concludes that after four years of continuous innovation, Fengjing has become a cloud‑native, highly scalable observability solution, with future posts planned to dive deeper into implementation details.
