Microservice Governance and Stability Platform at Haodf.com: Architecture, Monitoring, and SLO Design
The article presents a comprehensive case study of Haodf.com's transition to a micro‑service architecture, detailing the challenges of service stability and observability; the design of a unified governance platform with log holographic analysis, real‑time alerting, application profiling, and SLO/SLA definition; and a future roadmap for capacity and reliability improvements.
Abstract
Haodf.com has accumulated over 690,000 doctor profiles and serves 67 million users, prompting a shift from a monolithic PHP system to a Java‑based micro‑service architecture with more than 350 services. Rapid growth exposed governance issues such as service coupling, long release cycles, and difficulty locating failures.
Micro‑service Governance Pain Points
The main concerns are service discovery, registration, orchestration, configuration, gateway, monitoring, alerting, and log analysis. Stability and availability are critical, requiring fast fault isolation and impact assessment.
Solution Overview
A platform was built that integrates full‑stack log analysis, real‑time alerts, application profiling, and unified configuration management. The platform aims to reduce investigation time, move risk detection forward, and provide a single pane of glass for developers, testers, and operators.
Platform Evolution
The governance platform evolved from manual troubleshooting to tool‑assisted, then to a systematic platform covering research, design, implementation, and operation phases. Key components include:
Application runtime profiling (Prometheus, Elasticsearch, ClickHouse, Grafana)
Machine‑resource profiling (node‑exporter, Prometheus, migration from Zabbix)
Log holographic analysis (Golang‑based Snow, Kafka ingestion, MySQL storage)
APM link tracing (AOP instrumentation for PHP/Node/Java, side‑car alternatives)
Real‑time alerting (email, WeChat bot, SMS, phone, on‑call rotation)
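The real‑time alerting component fans a single alert out to several channels. The sketch below shows one plausible severity‑based routing policy; the severity levels, channel names, and escalation rules are illustrative assumptions, not the platform's documented design.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical escalation policy: which channels fire at each severity.
# The article lists the channels but not the actual routing rules.
CHANNEL_POLICY = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat_bot"],
    Severity.CRITICAL: ["email", "wechat_bot", "sms", "phone"],
}


@dataclass
class Alert:
    service: str
    message: str
    severity: Severity


def route(alert: Alert) -> list:
    """Return the notification channels this alert should fan out to."""
    return CHANNEL_POLICY[alert.severity]
```

A CRITICAL alert would reach all four channels, while routine INFO alerts stay in email, which keeps the noisier channels (SMS, phone) reserved for incidents that need the on‑call rotation.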
Key Technical Choices
Frontend uses Vue + Element‑Admin; backend services are written in Go and Python. Monitoring migrated from Zabbix to Prometheus for better cloud‑native support. Data storage shifted from Elasticsearch to ClickHouse, achieving a 1:4.2 compression ratio and sub‑second dashboard rendering.
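ClickHouse achieves that kind of compression and dashboard latency largely through its columnar MergeTree engine with sorted, partitioned storage. Below is a plausible log-table schema held as a DDL string; the table name, columns, sort key, and retention period are assumptions for illustration, not Haodf.com's actual schema.

```python
# Illustrative ClickHouse DDL for service logs. LowCardinality columns and a
# (service, level, ts) sort key are what give columnar compression its leverage;
# every identifier here is a hypothetical example, not the platform's real table.
LOG_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS service_logs
(
    ts        DateTime64(3),
    service   LowCardinality(String),
    level     LowCardinality(String),
    trace_id  String,
    message   String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, level, ts)
TTL toDateTime(ts) + INTERVAL 14 DAY
"""
```

Sorting by service and level first means a typical dashboard query ("errors for service X in the last hour") scans a narrow, well-compressed slice of each daily partition rather than the whole day's logs.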
Challenges Encountered
Massive log volume (≈30 billion entries per day, 800 GB daily) required efficient storage and analysis; migration to ClickHouse solved performance and cost issues. Deciding between Zabbix and Prometheus highlighted the need for fine‑grained alert grouping and low‑latency metrics.
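Taking the article's figures at face value, a back-of-envelope calculation shows what the 1:4.2 compression ratio buys at this volume; the 10 TB retention figure below is an assumed budget for illustration only.

```python
# Article figures: ~30 billion log entries and 800 GB ingested per day,
# compressed at 1:4.2 after the move to ClickHouse.
ENTRIES_PER_DAY = 30e9
RAW_GB_PER_DAY = 800
COMPRESSION_RATIO = 4.2

compressed_gb_per_day = RAW_GB_PER_DAY / COMPRESSION_RATIO   # ~190 GB/day on disk
bytes_per_entry = RAW_GB_PER_DAY * 1e9 / ENTRIES_PER_DAY     # ~27 bytes per raw entry
days_per_10tb = 10_000 / compressed_gb_per_day               # ~52 days per 10 TB (assumed budget)
```

In other words, compression shrinks the daily footprint from 800 GB to roughly 190 GB, so a fixed storage budget stretches to about four times the retention window it would cover uncompressed.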
SLO Design
Following Google SRE practices and industry guidance, five golden metrics were selected: capacity, availability, latency (p95), error rate, and manual‑intervention count. These metrics drive SLO thresholds and alerting policies.
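Two of those metrics, p95 latency and error rate, can be evaluated directly from a window of request samples. The sketch below is a minimal evaluation, assuming illustrative SLO thresholds (300 ms for p95, 0.1% for error rate); the article does not state the platform's actual thresholds.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate_slo(latencies_ms, errors, total, *, p95_budget_ms=300, error_budget=0.001):
    """Check one service window against hypothetical SLO thresholds.

    p95_budget_ms and error_budget are assumed example values, not the
    platform's real policy.
    """
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / total if total else 0.0
    return {
        "p95_ms": p95,
        "p95_ok": p95 <= p95_budget_ms,
        "error_rate": error_rate,
        "error_ok": error_rate <= error_budget,
    }
```

An SLO breach on either dimension would then feed the alerting policies described earlier, so the thresholds double as the boundary between routine monitoring and paging the on-call.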
Future Plans
Upcoming work includes integrating profiling and log analysis into a single UI, platform‑wide circuit‑breaker management, full‑link load testing, and intelligent capacity auto‑scaling.
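Platform-wide circuit-breaker management implies each service client carrying breaker state that the platform can configure centrally. A minimal failure-count breaker, as an illustrative sketch only (the article does not describe the planned implementation):

```python
import time


class CircuitBreaker:
    """Minimal failure-count circuit breaker (illustrative, not the platform's design).

    Closed: requests pass. After max_failures consecutive failures the circuit
    opens and requests are rejected until reset_after_s elapses, when a single
    trial request is allowed through (half-open).
    """

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.reset_after_s:
            # Half-open: permit one trial; a failure re-opens immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic() if now is None else now
```

Managing these breakers platform-wide would mean pushing `max_failures` and `reset_after_s` style parameters from central configuration rather than hard-coding them per service.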
HaoDF Tech Team
HaoDF Online technical practice and sharing. Join us to discuss, and to help create quality healthcare through technology.