Inside Zhaozhuan’s Service Governance: Architecture, Registration, and Monitoring
To manage the complexity of microservice adoption, Zhaozhuan’s service governance platform integrates service registration, discovery, configuration, monitoring, authentication, and rate limiting through an SDK-driven architecture. This article walks through its AP-style registration, node grouping, gray-release discovery, real-time metrics aggregation, and method-level access control.
Overview
As companies scale, many migrate from monolithic to microservice architectures, splitting applications into independent services to improve isolation, manageability, and deployment speed. However, microservices introduce new challenges: inter‑service communication now relies on RPC, requiring robust service governance for registration, discovery, monitoring, authentication, and rate limiting.
Overall Architecture
The Zhaozhuan Service Management Platform combines service registration & discovery, a configuration center, monitoring, alarm, authentication, and rate limiting into a single governance solution. The RPC framework interacts with the platform through an SDK.
When a service starts, it registers with the platform and subscribes to call relationships for authentication and throttling. Callers receive node up/down events and pull the latest node list via the SDK.
During calls, both caller and callee report metrics to the platform, including latency, timeouts, exceptions, latency distribution, and latency percentiles.
The platform also acts as a configuration center, allowing real‑time updates of RPC parameters like timeout and serialization protocol.
Service Registration and Discovery
AP vs CP Model
High-availability registries run multiple nodes. According to the CAP theorem, a distributed system cannot simultaneously guarantee all three of consistency, availability, and partition tolerance. Because network partitions cannot be ruled out, a registry must in practice choose between AP (favoring availability) and CP (favoring consistency). ZooKeeper and etcd are CP; Eureka is AP; Nacos can operate in either mode.
Consider a scenario with two registry nodes R1 and R2 and callers C1 and C2.
Service S1 registers with R1. R1 notifies C1 and then syncs the registration to R2, which notifies C2. Both callers receive S1’s up event and can call S1.
Service S2 registers with R2. R2 notifies C2 and then tries to sync the registration to R1. If that sync fails, R1 never learns about S2, so C1 misses S2’s up event and can only call S1, while C2 can call both S1 and S2.
A CP registry would avoid this inconsistency by blocking new registrations until the registry nodes agree, which effectively halts service registration whenever a sync fails. In the scenario above, by contrast, RPC calls still proceed: C1 simply sends all of its traffic to S1 for a while. An AP design is therefore acceptable.
To resolve temporary inconsistencies, the SDK also pulls the latest node list on a schedule, so a caller that missed a notification picks up the missing nodes at the next pull. This achieves eventual consistency.
Conclusion: Service registries should be designed as AP systems; short‑lived inconsistencies are tolerable as long as eventual consistency is achieved.
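The push-plus-periodic-pull idea can be sketched as follows. This is a minimal illustration, not the platform's SDK: `FakeRegistry`, `DiscoveryClient`, and `reconcile` are hypothetical names, and the service name "demo" is made up. A real SDK would call `reconcile` from a timer.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRegistry:
    """Hypothetical stand-in for one registry node: service name -> node names."""
    nodes: dict[str, set[str]] = field(default_factory=dict)

    def register(self, service: str, node: str) -> None:
        self.nodes.setdefault(service, set()).add(node)

    def list_nodes(self, service: str) -> set[str]:
        # Return a copy so callers cannot mutate registry state.
        return set(self.nodes.get(service, set()))


@dataclass
class DiscoveryClient:
    """Caller-side node cache: updated by pushes, repaired by periodic pulls."""
    registry: FakeRegistry
    service: str
    cache: set[str] = field(default_factory=set)

    def on_node_up(self, node: str) -> None:
        # Push path: in an AP registry this event may never arrive.
        self.cache.add(node)

    def on_node_down(self, node: str) -> None:
        self.cache.discard(node)

    def reconcile(self) -> set[str]:
        # Pull path: replace the cache with the registry's current view and
        # return the nodes whose membership differed.
        latest = self.registry.list_nodes(self.service)
        changed = latest ^ self.cache
        self.cache = latest
        return changed


registry = FakeRegistry()
client = DiscoveryClient(registry, "demo")
registry.register("demo", "S1")
client.on_node_up("S1")          # notification delivered
registry.register("demo", "S2")  # notification lost: no on_node_up call
missed = client.reconcile()      # the scheduled pull repairs the cache
```

After the pull, `client.cache` contains both nodes even though the S2 event was never delivered.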
Node Grouping
When a service has multiple callers with different importance levels, node isolation is needed. Nodes can be grouped so that high‑priority callers receive a stable subset.
Example: Service A has nodes A‑S1…A‑S4. Caller B is high‑priority, while C and D are lower. Group A‑S1 and A‑S2 into a “B‑only” group, and A‑S3 and A‑S4 into a “default” group.
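As a rough sketch, the grouping in this example amounts to a lookup from caller to group and from group to nodes. The table layout and the `nodes_for` helper are assumptions made for illustration; the article does not describe how the platform stores groups.

```python
# Group name -> nodes of service A (from the example above).
GROUPS = {
    "B-only": ["A-S1", "A-S2"],
    "default": ["A-S3", "A-S4"],
}

# Caller -> group. Callers that are not listed fall back to "default".
CALLER_GROUP = {"B": "B-only"}


def nodes_for(caller: str) -> list[str]:
    """Return the nodes of service A that the given caller may discover."""
    group = CALLER_GROUP.get(caller, "default")
    return list(GROUPS[group])
```

With this mapping, load from C and D lands only on A-S3 and A-S4, so it cannot crowd out B's traffic on A-S1 and A-S2.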
Gray Release Discovery
Gray discovery allows specific callers to see only selected service nodes, enabling staged rollouts. Without it, upgrading services B, C, D from v1 to v2 would require a full rollout, risking large rollback costs.
With gray discovery, deploy one node of each new version (B‑v2, C‑v2, D‑v2). B‑v2 discovers only C‑v2, which discovers only D‑v2. Adjust B‑v2’s traffic weight to a low value for limited exposure. If issues arise, remove B‑v2 from its group or set weight to zero without a full rollback.
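One way to picture the discovery rule is the sketch below. It rests on assumptions the article does not state: nodes carry a `gray` flag and an integer `weight`, gray callers see only gray downstream nodes, non-gray callers see only non-gray ones, and entry traffic (no caller node) is spread over all nodes by weight. The `Node`, `visible_nodes`, and `pick` names are hypothetical.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    service: str
    gray: bool = False
    weight: int = 100


def visible_nodes(caller: Optional[Node], service: str, nodes: list[Node]) -> list[Node]:
    """Nodes of `service` that `caller` may discover; weight 0 hides a node."""
    candidates = [n for n in nodes if n.service == service and n.weight > 0]
    if caller is None:
        # Entry traffic: every live node is visible, weights decide the split.
        return candidates
    # Inside the call chain: gray callers stay on gray nodes, and vice versa.
    return [n for n in candidates if n.gray == caller.gray]


def pick(candidates: list[Node], rng: random.Random) -> Node:
    """Weighted random choice among the visible nodes."""
    return rng.choices(candidates, weights=[n.weight for n in candidates], k=1)[0]


NODES = [
    Node("B-v1", "B"), Node("B-v2", "B", gray=True, weight=5),
    Node("C-v1", "C"), Node("C-v2", "C", gray=True),
    Node("D-v1", "D"), Node("D-v2", "D", gray=True),
]

# Pulling B-v2 out of rotation: set its weight to zero instead of redeploying.
DRAINED = [replace(n, weight=0) if n.name == "B-v2" else n for n in NODES]
```

With the weights above, B-v2 receives roughly 5 of every 105 entry requests, and everything it calls downstream stays on the v2 chain.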
Configuration Center
RPC calls expose many configurable parameters such as TCP timeout, request timeout (service‑level and method‑level), serialization protocol, and RPC version.
When a method’s latency grows after a release, callers may need a higher timeout. Embedding the timeout in code would require redeployment; instead, the platform supports hot‑updating RPC parameters, allowing real‑time adjustments without code changes.
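A hot-updatable timeout could look roughly like this. The `RpcConfig` class, its method names, and the 1000 ms default are illustrative assumptions, not the platform's API; the sketch only shows that the caller reads the timeout at call time, so a pushed update takes effect without redeployment, and that a method-level value overrides the service-level one.

```python
import threading
from typing import Optional


class RpcConfig:
    """Caller-side timeout settings that the platform can update at runtime."""

    DEFAULT_TIMEOUT_MS = 1000  # assumed default, for illustration only

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._service_timeout: dict[str, int] = {}
        self._method_timeout: dict[tuple[str, str], int] = {}

    def apply_update(self, service: str, timeout_ms: int,
                     method: Optional[str] = None) -> None:
        """Called when the configuration center pushes a change."""
        with self._lock:
            if method is None:
                self._service_timeout[service] = timeout_ms
            else:
                self._method_timeout[(service, method)] = timeout_ms

    def timeout_ms(self, service: str, method: str) -> int:
        """Resolve the timeout at call time: method, then service, then default."""
        with self._lock:
            service_level = self._service_timeout.get(service, self.DEFAULT_TIMEOUT_MS)
            return self._method_timeout.get((service, method), service_level)
```

For example, after `apply_update("UserService", 5000, method="saveUser")`, the next call to `saveUser` uses 5000 ms while other methods of the service keep their previous timeout.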
Monitoring Center
Monitoring is critical for thousands of services and their call graphs. Key metrics include call volume, error count, latency, latency distribution, and percentile latency.
SDK Aggregation
The SDK pre-computes per-minute aggregates (total, average, and maximum latency, percentiles, and the latency distribution) before reporting. It therefore sends one summary per minute instead of one record per call, which greatly reduces data volume, bandwidth usage, and resource consumption.
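The per-minute summary can be sketched as below. The bucket boundaries, the field names, and the nearest-rank percentile method are assumptions chosen for illustration; the article does not say how the SDK computes percentiles or which buckets it uses.

```python
import math
from bisect import bisect_left

# Hypothetical upper bounds (ms) of the latency buckets; the last bucket is "> 1000".
BUCKET_BOUNDS_MS = [10, 50, 100, 500, 1000]


def aggregate_minute(latencies_ms: list[float]) -> dict:
    """Collapse one minute of raw call latencies into a single summary record."""
    if not latencies_ms:
        return {"count": 0}
    ordered = sorted(latencies_ms)
    count = len(ordered)
    total = sum(ordered)

    def percentile(p: int) -> float:
        # Nearest-rank method: the smallest value with at least p% of samples <= it.
        rank = max(1, math.ceil(p * count / 100))
        return ordered[rank - 1]

    # distribution[i] counts values v with bounds[i-1] < v <= bounds[i].
    distribution = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for value in ordered:
        distribution[bisect_left(BUCKET_BOUNDS_MS, value)] += 1

    return {
        "count": count,
        "total_ms": total,
        "avg_ms": total / count,
        "max_ms": ordered[-1],
        "p90_ms": percentile(90),
        "p99_ms": percentile(99),
        "distribution": distribution,
    }


summary = aggregate_minute([5, 10, 20, 80, 120, 600, 1500, 30, 40, 9])
```

However many calls happen in the minute, the reported record keeps the same fixed size.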
Backend Pre‑Aggregation
Raw SDK data contains per-node metrics, e.g., <C, CIP, S, SIP, Data>. To support queries by client (<C>), service (<S>), or client-service pair (<C,S>), the platform aggregates these dimensions before storage, enabling millisecond-level query performance.
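A minimal sketch of this roll-up is shown below, assuming each raw record is a tuple of client, client IP, service, service IP, and a data dictionary. The field names `calls`, `errors`, and `total_ms` are hypothetical. Only additive metrics are merged here; percentiles cannot simply be summed and would need a mergeable representation such as the latency distribution.

```python
from collections import defaultdict

ADDITIVE_FIELDS = ("calls", "errors", "total_ms")


def pre_aggregate(records: list[tuple]) -> dict:
    """Roll per-node records up to the <C>, <S>, and <C,S> dimensions."""
    rollups = defaultdict(lambda: {name: 0 for name in ADDITIVE_FIELDS})
    for client, _client_ip, service, _service_ip, data in records:
        keys = (("C", client), ("S", service), ("C,S", client, service))
        for key in keys:
            for name in ADDITIVE_FIELDS:
                rollups[key][name] += data[name]
    return dict(rollups)


records = [
    ("C1", "10.0.0.1", "S1", "10.0.1.1", {"calls": 10, "errors": 1, "total_ms": 200}),
    ("C1", "10.0.0.2", "S1", "10.0.1.2", {"calls": 5, "errors": 0, "total_ms": 50}),
    ("C2", "10.0.0.3", "S1", "10.0.1.1", {"calls": 1, "errors": 1, "total_ms": 30}),
]
rollups = pre_aggregate(records)
```

A query for client C1 then reads the single pre-computed row `rollups[("C", "C1")]` instead of scanning every per-node record.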
Authentication & Rate Limiting
Authentication controls which services may call others; rate limiting controls call volume. The platform provides service‑level and method‑level policies.
Method‑level control requires a unique identifier called methodKey with the format (${ServiceImpl})${ServiceInterface}.$method($parameterTypes). Example: for UserService.saveUser(User) implemented by UserServiceImpl, the methodKey is (UserServiceImpl)UserService.saveUser(User). The RPC framework validates uniqueness of methodKeys.
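The methodKey format can be reproduced with a small helper. The function names and the comma used to join multiple parameter types are assumptions, since the article only shows a single-parameter example; the duplicate check stands in for the uniqueness validation the RPC framework performs.

```python
def method_key(impl: str, interface: str, method: str, param_types: list[str]) -> str:
    """Build a key of the form (${ServiceImpl})${ServiceInterface}.$method($parameterTypes)."""
    # Assumption: multiple parameter types are joined with a comma.
    return f"({impl}){interface}.{method}({','.join(param_types)})"


def find_duplicates(keys: list[str]) -> list[str]:
    """Return keys that appear more than once, in first-seen order."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for key in keys:
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


key = method_key("UserServiceImpl", "UserService", "saveUser", ["User"])
```

Including the parameter types keeps overloaded methods distinct, and including the implementation class separates two implementations of the same interface.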
When a service starts, the SDK uploads the list of exposed RPC methods to the platform. Before invoking a method, callers must request permission and rate‑limit settings from the platform, which are enforced via SDK‑based filters.
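A filter of this kind might be sketched as follows. The article does not describe the rate-limiting algorithm, so this uses a simple fixed one-second window as a stand-in; the `CallFilter` class, its return values, and the per-second limit format are all hypothetical.

```python
import time
from typing import Callable


class CallFilter:
    """Checks permission first, then a per-(caller, methodKey) rate limit."""

    def __init__(self, allowed: set[tuple[str, str]],
                 limits_per_second: dict[tuple[str, str], int],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.allowed = allowed            # permitted (caller, methodKey) pairs
        self.limits = limits_per_second   # optional limit per pair
        self.clock = clock                # injectable so the window can be tested
        self._windows: dict[tuple[str, str], tuple[int, int]] = {}

    def check(self, caller: str, method_key: str) -> str:
        key = (caller, method_key)
        if key not in self.allowed:
            return "DENIED"
        limit = self.limits.get(key)
        if limit is None:
            return "OK"  # authorized and not rate limited
        window = int(self.clock())  # index of the current one-second window
        start, count = self._windows.get(key, (window, 0))
        if start != window:
            start, count = window, 0  # a new window begins: reset the counter
        if count >= limit:
            return "THROTTLED"
        self._windows[key] = (start, count + 1)
        return "OK"
```

In this sketch the permission table and limits are passed in directly; in the flow described above they would be the settings the SDK obtains from the platform.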
Alarm
Alarms act as sentinels, notifying owners of exceptions, timeouts, or throttling events. Users can configure alarm intervals, channels, and target services or callers.
Summary
This article presented Zhaozhuan’s end-to-end service governance architecture, covering registration, discovery, node grouping, gray release, configuration hot-updates, metric aggregation, authentication, rate limiting, and alarm mechanisms. While the platform addresses many challenges, open issues remain, such as the high availability of notification mechanisms, consistency guarantees, and the scalability of monitoring data storage.
Senior Brother's Insights
A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
