
Microservice Governance and Stability Platform at Haodf.com: Architecture, Monitoring, and SLO Design

This article is a case study of Haodf.com's transition to a micro‑service architecture. It covers the service stability and observability challenges the team faced, the design of a unified governance platform (holographic log analysis, real‑time alerting, application profiling, and SLO/SLA definition), and the roadmap for capacity and reliability improvements.

HaoDF Tech Team

Abstract

Haodf.com has accumulated over 690,000 doctor profiles and serves 67 million users, prompting a shift from a monolithic PHP system to a Java‑based micro‑service architecture with more than 350 services. Rapid growth exposed governance issues such as service coupling, long release cycles, and difficulty locating failures.

Micro‑service Governance Pain Points

The main concerns are service discovery, registration, orchestration, configuration, gateway, monitoring, alerting, and log analysis. Stability and availability are critical, requiring fast fault isolation and impact assessment.

Solution Overview

The team built a platform that integrates full‑stack log analysis, real‑time alerts, application profiling, and unified configuration management. It aims to shorten investigation time, detect risks earlier, and give developers, testers, and operators a single pane of glass.

Platform Evolution

The governance platform evolved from manual troubleshooting to tool‑assisted, then to a systematic platform covering research, design, implementation, and operation phases. Key components include:

Application runtime profiling (Prometheus, Elasticsearch, ClickHouse, Grafana)

Machine‑resource profiling (node‑exporter, Prometheus, migration from Zabbix)

Log holographic analysis (Golang‑based Snow, Kafka ingestion, MySQL storage)

APM link tracing (AOP instrumentation for PHP/Node/Java, side‑car alternatives)

Real‑time alerting (email, WeChat bot, SMS, phone, on‑call rotation)
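
As a rough illustration of how alerts might be fanned out across the channels listed above, the sketch below maps a severity level to a set of notification channels. The severity names, the ordering of channels by intrusiveness, and the `route_alert` function are all hypothetical; the article does not describe how the platform actually routes alerts.

```python
from dataclasses import dataclass
from typing import List

# Channels named in the article; ordering them by intrusiveness is an assumption.
CHANNELS = ["email", "wechat_bot", "sms", "phone"]

# Hypothetical severity levels: a higher severity reaches more intrusive channels.
SEVERITY_TO_CHANNEL_COUNT = {"info": 1, "warning": 2, "critical": 3, "fatal": 4}


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str
    message: str


def route_alert(alert: Alert) -> List[str]:
    """Return the channels an alert should be sent to (illustrative only)."""
    count = SEVERITY_TO_CHANNEL_COUNT.get(alert.severity)
    if count is None:
        raise ValueError(f"unknown severity: {alert.severity!r}")
    return CHANNELS[:count]


if __name__ == "__main__":
    alert = Alert(service="demo-service", severity="critical", message="error rate above threshold")
    print(route_alert(alert))
```

A table-driven mapping like this keeps the escalation policy in data rather than in branching code, which is one plausible way to make it editable from a governance UI.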

Key Technical Choices

Frontend uses Vue + Element‑Admin; backend services are written in Go and Python. Monitoring migrated from Zabbix to Prometheus for better cloud‑native support. Data storage shifted from Elasticsearch to ClickHouse, achieving a 1:4.2 compression ratio and sub‑second dashboard rendering.

Challenges Encountered

Massive log volume (≈30 billion entries per day, 800 GB daily) required efficient storage and analysis; migration to ClickHouse solved performance and cost issues. Deciding between Zabbix and Prometheus highlighted the need for fine‑grained alert grouping and low‑latency metrics.
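
For a sense of scale, the arithmetic below combines the 800 GB daily volume with the 1:4.2 compression ratio quoted for ClickHouse. It assumes that the 800 GB figure is the uncompressed size and that the ratio reads as compressed:raw; the article states neither, and the 30-day retention window is purely illustrative.

```python
RAW_GB_PER_DAY = 800.0    # from the article; assumed here to be uncompressed
COMPRESSION_RATIO = 4.2   # the article's 1:4.2, read as compressed:raw (assumption)
RETENTION_DAYS = 30       # hypothetical retention window, not from the article


def compressed_gb(raw_gb: float, ratio: float) -> float:
    """Estimate on-disk size in GB for a given raw size and compression ratio."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_gb / ratio


daily_gb = compressed_gb(RAW_GB_PER_DAY, COMPRESSION_RATIO)
retained_gb = daily_gb * RETENTION_DAYS

if __name__ == "__main__":
    print(f"{daily_gb:.1f} GB/day on disk, {retained_gb:.0f} GB over {RETENTION_DAYS} days")
```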

SLO Design

Following Google SRE practices and industry guidance, the team selected five key metrics, which it calls golden metrics: capacity, availability, p95 latency, error rate, and manual‑intervention count. These metrics drive the SLO thresholds and alerting policies.
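
To make two of these metrics concrete, the sketch below computes p95 latency and error rate over a window of requests and checks them against SLO thresholds. The `Request` type, the `check_slo` function, and the threshold values are hypothetical; the article does not publish its actual thresholds or evaluation logic. The percentile uses the nearest-rank method, which is one common definition among several.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # ceil(0.95 * n) computed in integers to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rate(requests: List[Request]) -> float:
    """Fraction of requests in the window that failed."""
    if not requests:
        raise ValueError("no requests")
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def check_slo(requests: List[Request], max_p95_ms: float, max_error_rate: float) -> Dict[str, object]:
    """Evaluate one window of requests against hypothetical SLO thresholds."""
    latency = p95([r.latency_ms for r in requests])
    errors = error_rate(requests)
    return {
        "p95_ms": latency,
        "p95_ok": latency <= max_p95_ms,
        "error_rate": errors,
        "error_rate_ok": errors <= max_error_rate,
    }


if __name__ == "__main__":
    # Synthetic window: 100 requests, the first two of which fail.
    window = [Request(latency_ms=float(i), ok=i > 2) for i in range(1, 101)]
    print(check_slo(window, max_p95_ms=200.0, max_error_rate=0.01))
```

Returning both the measured value and the pass/fail flag lets the same result feed a dashboard and an alert rule.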

Future Plans

Upcoming work includes integrating profiling and log analysis into a single UI, platform‑wide circuit‑breaker management, full‑link load testing, and intelligent capacity auto‑scaling.
