Microservice Governance and Stability Platform at Haodf.com: Architecture, Monitoring, and SLO Design
The article presents a comprehensive case study of Haodf.com's transition to a micro‑service architecture, detailing the challenges of service stability and observability; the design of a unified governance platform with log holographic analysis, real‑time alerting, application profiling, and SLO/SLA definition; and a future roadmap for capacity and reliability improvements.
Abstract
Haodf.com has accumulated over 690,000 doctor profiles and serves 67 million users, prompting a shift from a monolithic PHP system to a Java‑based micro‑service architecture with more than 350 services. Rapid growth exposed governance issues such as service coupling, long release cycles, and difficulty locating failures.
Micro‑service Governance Pain Points
The main concerns are service discovery, registration, orchestration, configuration, gateway, monitoring, alerting, and log analysis. Stability and availability are critical, requiring fast fault isolation and impact assessment.
Solution Overview
A platform was built that integrates full‑stack log analysis, real‑time alerts, application profiling, and unified configuration management. The platform aims to reduce investigation time, move risk detection forward, and provide a single pane of glass for developers, testers, and operators.
Platform Evolution
The governance platform evolved from manual troubleshooting to tool‑assisted, then to a systematic platform covering research, design, implementation, and operation phases. Key components include:
Application runtime profiling (Prometheus, Elasticsearch, ClickHouse, Grafana)
Machine‑resource profiling (node‑exporter, Prometheus, migration from Zabbix)
Log holographic analysis (Golang‑based Snow, Kafka ingestion, MySQL storage)
APM link tracing (AOP instrumentation for PHP/Node/Java, side‑car alternatives)
Real‑time alerting (email, WeChat bot, SMS, phone, on‑call rotation)
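The real‑time alerting component fans a single alert out to several channels. The sketch below shows one plausible severity‑based routing policy; the severity levels, channel names, and escalation rules are illustrative assumptions, not the platform's documented design.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical escalation policy: which channels fire at each severity.
# The article lists the channels but not the actual routing rules.
CHANNEL_POLICY = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat_bot"],
    Severity.CRITICAL: ["email", "wechat_bot", "sms", "phone"],
}


@dataclass
class Alert:
    service: str
    message: str
    severity: Severity


def route(alert: Alert) -> list:
    """Return the notification channels this alert should fan out to."""
    return CHANNEL_POLICY[alert.severity]
```

A CRITICAL alert would reach all four channels, while routine INFO alerts stay in email, which keeps the noisier channels (SMS, phone) reserved for incidents that need the on‑call rotation.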
Key Technical Choices
Frontend uses Vue + Element‑Admin; backend services are written in Go and Python. Monitoring migrated from Zabbix to Prometheus for better cloud‑native support. Data storage shifted from Elasticsearch to ClickHouse, achieving a 1:4.2 compression ratio and sub‑second dashboard rendering.
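ClickHouse achieves that kind of compression and dashboard latency largely through its columnar MergeTree engine with sorted, partitioned storage. Below is a plausible log-table schema held as a DDL string; the table name, columns, sort key, and retention period are assumptions for illustration, not Haodf.com's actual schema.

```python
# Illustrative ClickHouse DDL for service logs. LowCardinality columns and a
# (service, level, ts) sort key are what give columnar compression its leverage;
# every identifier here is a hypothetical example, not the platform's real table.
LOG_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS service_logs
(
    ts        DateTime64(3),
    service   LowCardinality(String),
    level     LowCardinality(String),
    trace_id  String,
    message   String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (service, level, ts)
TTL toDateTime(ts) + INTERVAL 14 DAY
"""
```

Sorting by service and level first means a typical dashboard query ("errors for service X in the last hour") scans a narrow, well-compressed slice of each daily partition rather than the whole day's logs.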
Challenges Encountered
Massive log volume (≈30 billion entries per day, 800 GB daily) required efficient storage and analysis; migration to ClickHouse solved performance and cost issues. Deciding between Zabbix and Prometheus highlighted the need for fine‑grained alert grouping and low‑latency metrics.
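Taking the article's figures at face value, a back-of-envelope calculation shows what the 1:4.2 compression ratio buys at this volume; the 10 TB retention figure below is an assumed budget for illustration only.

```python
# Article figures: ~30 billion log entries and 800 GB ingested per day,
# compressed at 1:4.2 after the move to ClickHouse.
ENTRIES_PER_DAY = 30e9
RAW_GB_PER_DAY = 800
COMPRESSION_RATIO = 4.2

compressed_gb_per_day = RAW_GB_PER_DAY / COMPRESSION_RATIO   # ~190 GB/day on disk
bytes_per_entry = RAW_GB_PER_DAY * 1e9 / ENTRIES_PER_DAY     # ~27 bytes per raw entry
days_per_10tb = 10_000 / compressed_gb_per_day               # ~52 days per 10 TB (assumed budget)
```

In other words, compression shrinks the daily footprint from 800 GB to roughly 190 GB, so a fixed storage budget stretches to about four times the retention window it would cover uncompressed.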
SLO Design
Following Google SRE practices and industry guidance, five golden metrics were selected: capacity, availability, latency (p95), error rate, and manual‑intervention count. These metrics drive SLO thresholds and alerting policies.
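Two of those metrics, p95 latency and error rate, can be evaluated directly from a window of request samples. The sketch below is a minimal evaluation, assuming illustrative SLO thresholds (300 ms for p95, 0.1% for error rate); the article does not state the platform's actual thresholds.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate_slo(latencies_ms, errors, total, *, p95_budget_ms=300, error_budget=0.001):
    """Check one service window against hypothetical SLO thresholds.

    p95_budget_ms and error_budget are assumed example values, not the
    platform's real policy.
    """
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / total if total else 0.0
    return {
        "p95_ms": p95,
        "p95_ok": p95 <= p95_budget_ms,
        "error_rate": error_rate,
        "error_ok": error_rate <= error_budget,
    }
```

An SLO breach on either dimension would then feed the alerting policies described earlier, so the thresholds double as the boundary between routine monitoring and paging the on-call.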
Future Plans
Upcoming work includes integrating profiling and log analysis into a single UI, platform‑wide circuit‑breaker management, full‑link load testing, and intelligent capacity auto‑scaling.
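Platform-wide circuit-breaker management implies each service client carrying breaker state that the platform can configure centrally. A minimal failure-count breaker, as an illustrative sketch only (the article does not describe the planned implementation):

```python
import time


class CircuitBreaker:
    """Minimal failure-count circuit breaker (illustrative, not the platform's design).

    Closed: requests pass. After max_failures consecutive failures the circuit
    opens and requests are rejected until reset_after_s elapses, when a single
    trial request is allowed through (half-open).
    """

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.reset_after_s:
            # Half-open: permit one trial; a failure re-opens immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic() if now is None else now
```

Managing these breakers platform-wide would mean pushing `max_failures` and `reset_after_s` style parameters from central configuration rather than hard-coding them per service.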
HaoDF Tech Team
HaoDF Online technical practice and sharing. Join us to discuss, and to help create quality healthcare through technology.