Evolution and Architecture of Baidu's Fengjing APM System
Since its 2016 debut, Baidu's Fengjing APM system has gone through four major releases. Along the way it moved from invasive jar‑based probes to non‑invasive bytecode agents, added modular hot‑swap plugins, scaled to thousands of containers, and came to handle billions of daily metrics via Kafka, Doris, and SIA TSDB, while addressing probe upgrade downtime, data‑ingestion volume, and call‑graph query latency.
Fengjing is Baidu's commercial APM system focusing on Java applications, covering thousands of services and containers. It automatically instruments mainstream middleware (Spring Web, RPC, databases, caches) to provide full‑stack performance metrics, health status, and alerts.
Data collection is performed by the Fengjing probe, which injects into business processes without affecting them. Collected metrics are stored in Baidu's SIA TSDB for time‑series visualization and in the Doris (Palo) data warehouse for call‑graph analysis.
The article outlines the evolution of Fengjing from its inception in 2016 (Version 1.0) through successive releases 2.0, 3.0, and 4.0, describing the architectural changes, challenges faced, and technical solutions adopted at each stage:
Version 1.0: an invasive probe that required manual jar dependencies and hard‑coded data enrichment.
Version 2.0: introduced a Java agent with CGLIB AOP to reduce integration cost, sent protobuf‑encoded, gzip‑compressed data over HTTP into Kafka, and switched storage to Doris.
Version 3.0: moved to non‑invasive bytecode‑enhancement probes, added modular plugin classloaders for hot‑swap, and addressed the scalability of data ingestion and query latency.
Version 4.0: integrated with micro‑service platforms, scaled to thousands of containers, and tackled probe hot‑upgrade and massive data volume.
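As a rough illustration of the Java‑agent approach mentioned for Versions 2.0 and 3.0, the sketch below shows the standard `java.lang.instrument` entry point. It is not Fengjing's probe: the class names `ProbeAgentSketch` and `MatchingTransformer` and the `com/example/` prefix are hypothetical, and the transformer only records which classes a real probe might enhance instead of rewriting bytecode.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical agent class; not part of Fengjing.
final class ProbeAgentSketch {

    /** Records the classes a real probe would instrument. It never changes bytecode. */
    static final class MatchingTransformer implements ClassFileTransformer {
        private final String internalPrefix; // JVM internal form, e.g. "com/example/"
        private final Set<String> matched = ConcurrentHashMap.newKeySet();

        MatchingTransformer(String internalPrefix) {
            this.internalPrefix = internalPrefix;
        }

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null && className.startsWith(internalPrefix)) {
                matched.add(className);
            }
            // Returning null tells the JVM to keep the original class bytes.
            return null;
        }

        Set<String> matched() {
            return matched;
        }
    }

    /** Called by the JVM before main() when started with -javaagent:<jar>=<args>. */
    public static void premain(String agentArgs, Instrumentation inst) {
        // Assumption for this sketch: the agent argument is a package prefix.
        String prefix = (agentArgs == null || agentArgs.isEmpty())
                ? "com/example/" : agentArgs;
        inst.addTransformer(new MatchingTransformer(prefix));
    }
}
```

To load such a class as an agent, it would normally be declared public, packaged in a jar whose manifest names it in `Premain-Class`, and passed to the JVM with `-javaagent`. The application's own code and build stay unchanged, which is the sense in which this style of probe is non‑invasive.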
Key technical challenges discussed include long service‑side debugging cycles, the high cost of log collection, probe upgrade downtime, massive data ingestion (150 billion rows per day), and slow call‑graph queries. Solutions involve hot‑plug classloader bridges, custom plugin isolation, roll‑up tables, and streaming ingestion to Doris.
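The idea behind a roll‑up table can be sketched in plain Java: raw call records are folded into one pre‑aggregated row per key, so queries read far fewer rows. This is only a conceptual illustration under assumed names (`RollupSketch`, `Bucket`, a per‑service, per‑minute key). In Doris the roll‑up is defined on the table and maintained by the database itself, not by application code like this.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of roll-up aggregation.
final class RollupSketch {

    /** One pre-aggregated row: call count and total latency for a (service, minute) key. */
    static final class Bucket {
        long calls;
        long totalLatencyMs;
    }

    private final Map<String, Bucket> buckets = new TreeMap<>();

    private static String key(String service, long epochMillis) {
        long minute = epochMillis / 60_000L; // truncate the timestamp to its minute
        return service + "@" + minute;
    }

    /** Folds one raw call record into its per-minute bucket. */
    void add(String service, long epochMillis, long latencyMs) {
        Bucket b = buckets.computeIfAbsent(key(service, epochMillis), k -> new Bucket());
        b.calls++;
        b.totalLatencyMs += latencyMs;
    }

    /** Returns the bucket covering the given timestamp, or null if none exists. */
    Bucket get(String service, long epochMillis) {
        return buckets.get(key(service, epochMillis));
    }

    /** Number of roll-up rows, which is at most the number of raw records added. */
    int size() {
        return buckets.size();
    }
}
```

With this shape, a dashboard query for average latency per minute reads one row per service and minute and divides `totalLatencyMs` by `calls`, instead of scanning every raw record.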
The system also relies on Baidu's SIA TSDB for time‑series storage and leverages its multi‑dimensional visualization capabilities. The article concludes with a summary of the current architecture and future directions.
