How Baidu’s Fengjing Uses Holographic Logs to Debug Massive Microservices
Baidu’s Fengjing monitoring platform pinpoints failures in a massive Java‑based microservice ecosystem without centralizing logs. A non‑intrusive probe captures log metadata and stores it in a database, and Fengjing uses that metadata to reconstruct full request‑level logs on demand with minimal storage overhead.
Background
Baidu's commercial products serve advertisers across search, feed, brand, and other channels, and are built on a complex Java microservice ecosystem. The sheer number of services, intricate call relationships, and heavy component dependencies make troubleshooting difficult, yet any outage directly affects advertisers' ability to launch or modify campaigns.
To reduce the time needed to locate the root cause of incidents, the Baidu Commercial Platform team created a large‑scale distributed microservice monitoring system called Fengjing.
When an alarm is triggered, on‑call engineers must identify the faulty module, the failing service interface, and the exact code line. Fengjing provides call‑chain data (status codes, latency) and captures error stack traces, enabling rapid diagnosis.
Technical Principle
Traditional solutions ship all logs to Elasticsearch for search, but the volume (approaching petabytes per day) makes this prohibitively expensive for a platform‑level system. Fengjing instead uses a non‑intrusive probe that records only log metadata (file name, offset, timestamps, rotation policy); the log content itself stays on the container and is not persisted centrally.
The probe runs on thousands of microservice containers managed by Baidu’s Jarvis platform. When a request is processed, the probe records the associated log file name and offset, storing this metadata in a database. During a query, Fengjing retrieves the container address, log file name, and offset via the trace ID, then fetches the exact log segment from the container and presents it to the user.
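The query path described above can be illustrated with a minimal sketch. It assumes the metadata record carries a start and end byte offset, and it reads from a local file where the real system would fetch the segment from the container's address. The class name `SegmentReader` and the fields of `Meta` are hypothetical and not taken from Fengjing.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SegmentReader {
    // Hypothetical metadata row, looked up by trace ID in the metadata database.
    static final class Meta {
        final String traceId;
        final String containerAddress; // where the log file lives (unused in this local sketch)
        final String fileName;
        final long startOffset;        // byte offset before the log call
        final long endOffset;          // byte offset after the log call

        Meta(String traceId, String containerAddress, String fileName,
             long startOffset, long endOffset) {
            this.traceId = traceId;
            this.containerAddress = containerAddress;
            this.fileName = fileName;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Reads exactly the bytes in [startOffset, endOffset) instead of scanning the file.
    static String readSegment(Meta meta) {
        int length = Math.toIntExact(meta.endOffset - meta.startOffset);
        byte[] buffer = new byte[length];
        try (RandomAccessFile file = new RandomAccessFile(meta.fileName, "r")) {
            file.seek(meta.startOffset);
            file.readFully(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer, StandardCharsets.UTF_8);
    }
}
```

Because the read seeks directly to the recorded offset, its cost depends on the size of the segment rather than the size of the log file, which is what makes storing only metadata workable.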
Algorithm Implementation
The holographic log technique consists of two main components:
Metadata collection: intercept log‑printing operations before and after execution to capture timestamps, file descriptors, rotation policies, log levels, and a unique trace ID.
Metadata parsing: when a user searches by trace ID, the system aggregates all related metadata records, determines the current log file and its rotation state, simulates a log writer to compute the exact file position, reads the relevant log segment, and aligns the content using the trace ID.
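The "simulate a log writer" step can be sketched for one simple case. The example assumes a daily, rename‑based rotation in which the active file (for example `app.log`) is renamed to `app.log.yyyy-MM-dd` at midnight. That policy, the class name `RotationSimulator`, and the method name are illustrative assumptions; the source does not state which rotation policies Fengjing supports.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class RotationSimulator {
    // Given when a record was written and the current time, work out which file
    // holds it now under the assumed daily rename-based rotation.
    static String resolveFileName(String activeFile, Instant writtenAt,
                                  Instant now, ZoneId zone) {
        LocalDate writtenDay = writtenAt.atZone(zone).toLocalDate();
        LocalDate today = now.atZone(zone).toLocalDate();
        if (!writtenDay.isBefore(today)) {
            // No rollover since the write: the record is still in the active file.
            return activeFile;
        }
        // At least one rollover happened: the record moved to that day's archive.
        return activeFile + "." + writtenDay.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }
}
```

Under a rename‑based policy the byte offset recorded at write time stays valid after rotation, since only the file name changes. Policies that compress or truncate archives would need extra handling that this sketch does not cover.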
Concretely, collection and retrieval proceed in four steps:
Insert bytecode before the original log call to record the start time, the current file offset, and the rotation parameters.
Insert bytecode after the call to record the end time, the post‑write offset, and the file that was actually written.
Read file descriptors directly for performance, and embed the trace ID in the log content for precise correlation.
During retrieval, query all metadata with the same trace ID, extract timestamps, file names, and rotation policies, simulate the log writer to locate the exact position, read the surrounding log lines, and verify them against the trace ID.
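The before/after capture can be sketched without bytecode instrumentation by using a plain wrapper method in place of the injected code. The sketch assumes a single writer per file and uses the file size as the offset. `OffsetCapture`, `WriteMeta`, and the `[trace=...]` line format are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OffsetCapture {
    // Hypothetical metadata captured around one log call.
    static final class WriteMeta {
        final String traceId;
        final long startMillis, endMillis;
        final long startOffset, endOffset;

        WriteMeta(String traceId, long startMillis, long endMillis,
                  long startOffset, long endOffset) {
            this.traceId = traceId;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Stands in for the injected code: record state, perform the write, record state again.
    static WriteMeta tracedAppend(Path logFile, String traceId, String message) {
        try {
            long startMillis = System.currentTimeMillis();
            long startOffset = Files.exists(logFile) ? Files.size(logFile) : 0L;
            String line = "[trace=" + traceId + "] " + message + "\n";
            Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            long endOffset = Files.size(logFile);
            long endMillis = System.currentTimeMillis();
            return new WriteMeta(traceId, startMillis, endMillis, startOffset, endOffset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Retrieval-side check: the recorded byte range must contain the embedded trace ID.
    static boolean segmentMatches(Path logFile, WriteMeta meta) {
        try {
            byte[] all = Files.readAllBytes(logFile);
            String segment = new String(all, (int) meta.startOffset,
                    (int) (meta.endOffset - meta.startOffset), StandardCharsets.UTF_8);
            return segment.contains("[trace=" + meta.traceId + "]");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With concurrent writers, the recorded range may bracket lines from other requests, which is one reason the embedded trace ID is useful as a final check on the recovered text.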
Conclusion
Fengjing’s in‑house holographic log technology lets engineers retrieve complete request‑level logs with only a fraction of the storage and compute resources required by traditional log‑centralization solutions. Because the log content stays on the containers, retrieval works only within the containers' log retention window. Even so, the approach satisfies the majority of real‑time debugging scenarios and fills a critical gap in distributed tracing for Baidu’s massive microservice landscape.
