Design and Implementation of the Next‑Generation Cloud‑Native Monitoring System at Autohome
The article describes Autohome's third‑generation cloud‑native monitoring platform, detailing its background, strategic goals for R&D efficiency, mobile‑first design, Prometheus‑based architecture with multi‑replica storage and InfluxDB remote storage, its operational impact, and future directions such as AI‑driven noise reduction.
Autohome's monitoring system has evolved through three generations: the first used distributed Zabbix for basic availability, the second was built on Open‑Falcon for reliability, and the third aims to be "easy to use" by building on the company's accumulated monitoring expertise.
In 2020, three pressures drove the planning of a next‑generation system: the need for higher R&D efficiency, continuous technical breakthroughs, and business transformation. The planning focused on how monitoring could further improve development productivity, take advantage of cloud‑native advances, and differentiate the platform within a competitive cloud‑computing market.
The strategy, named AutoCMP, centers on three pillars: R&D efficiency through a mobile‑first design (real‑time alerts and fault handling on mobile), technical breakthroughs through adoption of widely used cloud‑native and distributed stacks, and business transformation through an application‑centric, zero‑setup monitoring experience for developers.
Implementation highlights include a mobile monitoring feature integrated into the "AutoMan" mini‑program, a Prometheus‑centric architecture that unifies time‑series storage, a gateway‑to‑dual‑Prometheus design that resolves multi‑replica storage and synchronization issues, and the addition of InfluxDB as remote storage to retain a year of monitoring data. Prometheus runs both on physical nodes and as containers within a Kubernetes cluster, providing stable, replicated time‑series databases.
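The summary does not say how the gateway reconciles the two Prometheus replicas, so the following is only a minimal sketch of one plausible approach: at query time, merge the samples each replica holds for the same series so that a scrape missed by one replica is filled in from the other. The function name merge_replica_samples and the (timestamp, value) sample layout are hypothetical and are not taken from AutoCMP. Matching on exact timestamps is also a simplification, since independent replicas usually scrape at slightly different instants.

```python
from typing import Dict, List, Tuple

# Hypothetical sample layout: (unix timestamp in seconds, value).
Sample = Tuple[int, float]


def merge_replica_samples(primary: List[Sample], secondary: List[Sample]) -> List[Sample]:
    """Merge the samples of one series as seen by two replicas.

    Timestamps present on only one replica fill gaps in the other.
    When both replicas hold a sample for the same timestamp, the
    primary's value wins. The result is sorted by timestamp.
    """
    merged: Dict[int, float] = dict(secondary)
    merged.update(dict(primary))  # primary overrides secondary on conflicts
    return sorted(merged.items())


if __name__ == "__main__":
    replica_a = [(0, 1.0), (15, 1.5), (45, 2.5)]  # missed the scrape at t=30
    replica_b = [(0, 1.0), (30, 2.0), (45, 2.5)]  # missed the scrape at t=15
    print(merge_replica_samples(replica_a, replica_b))
    # [(0, 1.0), (15, 1.5), (30, 2.0), (45, 2.5)]
```

A real gateway would more likely deduplicate within a tolerance window or simply fail over to the healthier replica; the point here is only that a layer in front of two independent replicas can hide gaps in either one.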
Since AutoCMP went live, it has been adopted across all Autohome applications. It has significantly reduced off‑site troubleshooting time, shortened application configuration time, and extended the query window for monitoring data, making faults easier to detect and locate.
Future work includes enriching upstream data correlation for better root‑cause analysis and applying AI techniques to filter noise and pinpoint container‑level failures that do not impact business services.
