
Internal Resource Governance Practices for High‑Availability Systems

This article outlines comprehensive internal resource governance techniques—including degradation, circuit breaking, isolation, async conversion, thread‑pool management, JVM and hardware metric monitoring, and daily operational practices—to enhance system stability and high availability in large‑scale backend services.

Qunar Tech Salon

Author Introduction

Zheng Jiming joined the domestic hotel quotation center team in August 2019, where he is responsible for quotation system development and architecture optimization. He has a strong interest in high concurrency and high availability, experience with distributed systems handling tens of millions of daily orders, and a background in algorithm research, including ACM‑ICPC participation and a Qunar Hackathon win.

Background

Previously we introduced system‑dependency governance across services, covering flow control, caching, Dubbo, HTTP, DB, MQ, etc. However, governing only inter‑service dependencies is insufficient; we also need to analyze and manage internal resources.

This article focuses on internal resource governance, including degradation, circuit breaking, isolation, converting synchronous to asynchronous, and managing core resources such as thread pools.

Governance Methods and Solutions

Degradation and Circuit Breaking

These techniques address scenarios where an external interface or resource fails. Plan the handling in advance so that the main flow is never interrupted: for example, if a P1 application calls a P3 application, a P3 failure must not affect the P1 application's core processing.

1) For core interfaces, investigate and implement degradation, preferring lossless degradation, with lossy degradation as a fallback where lossless is not possible.

2) For non‑core scenarios, apply circuit breaking based on failure rate and latency, defining default responses or exceptions.

3) Prepare alternative interfaces or resources for circuit‑break fallback and allow dynamic threshold adjustment.

4) (Optional) New features and refactored code should support being switched off before release, enabling safe rollbacks.
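The circuit-breaking steps above can be sketched with a minimal failure-rate breaker. This is an illustrative toy, not the implementation used in production (real deployments would typically use a library such as Sentinel or Hystrix); the class and method names are assumptions:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Minimal failure-rate circuit breaker (illustrative sketch, not a real library).
// When the failure rate over a window of calls crosses the threshold, the
// breaker opens and every call returns the fallback immediately.
public class SimpleCircuitBreaker {
    private final int windowSize;
    private final double failureThreshold;   // e.g. 0.5 = open at 50% failures
    private final AtomicInteger calls = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean open = false;

    public SimpleCircuitBreaker(int windowSize, double failureThreshold) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
    }

    // Run the call; on an open circuit (or a failure) return the fallback.
    public <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (open) {
            return fallback.get();
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback.get();
        }
    }

    private void record(boolean failed) {
        if (failed) failures.incrementAndGet();
        if (calls.incrementAndGet() >= windowSize) {
            // Evaluate the window, then reset counters for the next window.
            open = (double) failures.get() / calls.get() >= failureThreshold;
            calls.set(0);
            failures.set(0);
        }
    }

    public boolean isOpen() { return open; }
}
```

A production breaker would also add a half-open state and time-based recovery; dynamic threshold adjustment (point 3 above) would come from a config center updating `failureThreshold` at runtime.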

Isolation

This handles cases where a subset of resources fails and impacts others.

1) Thread‑pool isolation: this covers Dubbo thread pools and custom business pools; note that JDK 8 parallel streams all share the JVM‑wide common ForkJoinPool by default, so core work should run in dedicated pools.

2) Data storage isolation: separate core and non‑core data, possibly sharding core data.

3) Core vs non‑core interface isolation: use different applications, groups, thread pools, or clients.

4) Any distinction between core and non‑core warrants isolation.
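One concrete instance of point 1: running a parallel stream inside its own ForkJoinPool instead of the JVM-wide common pool, so a slow non-core computation cannot starve core work. A minimal sketch (class name and pool size are illustrative):

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;

// Sketch: execute a parallel stream inside a dedicated ForkJoinPool rather
// than the common pool shared by all parallel streams in the JVM.
public class IsolatedParallelStream {
    public static int sumInDedicatedPool(List<Integer> values) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(4);  // pool size is illustrative
        try {
            // Work submitted to this pool is forked within it, keeping the
            // parallel stream's subtasks off the common ForkJoinPool.
            return pool.submit(
                    () -> values.parallelStream().mapToInt(Integer::intValue).sum()
            ).get();
        } finally {
            pool.shutdown();
        }
    }
}
```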

Synchronous to Asynchronous, Serial to Parallel

This converts blocking synchronous operations into asynchronous ones so that a single slow step does not block the whole process.

1) Main flow remains synchronous; auxiliary flows become asynchronous with proper exception handling.

2) Parallelize core interface processing to reduce response time.

3) Apply asynchronous calls for Dubbo, HTTP, etc.
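The pattern in points 1 and 2 can be sketched with CompletableFuture: two independent lookups run in parallel on a dedicated pool, and the auxiliary one degrades gracefully on failure. The `fetchBasePrice` and `fetchDiscount` methods are hypothetical stand-ins for real Dubbo/HTTP calls:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// Sketch: fetch two independent quote components in parallel, then combine.
public class ParallelQuote {
    // Daemon threads so the pool does not keep the JVM alive; in a real
    // service this would be a managed, monitored business pool.
    private static final Executor POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    static int fetchBasePrice() { return 100; }  // stand-in for a core RPC call
    static int fetchDiscount()  { return 20; }   // stand-in for an auxiliary call

    public static int quote() {
        CompletableFuture<Integer> base =
                CompletableFuture.supplyAsync(ParallelQuote::fetchBasePrice, POOL);
        CompletableFuture<Integer> discount =
                CompletableFuture.supplyAsync(ParallelQuote::fetchDiscount, POOL)
                        .exceptionally(e -> 0);  // auxiliary failure degrades to "no discount"
        // Total latency is roughly max(base, discount) instead of their sum.
        return base.join() - discount.join();
    }
}
```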

Thread‑Pool Governance

Thread pools are precious resources that require monitoring and control.

1) All custom thread pools should have monitoring (active threads, queue size, completed tasks) to assess usage and plan resources.

2) Allow dynamic adjustment of core parameters (core size, max size, queue length) without redeployment.

3) Recommend separate thread pools for core business logic, following isolation strategies.
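Points 1 and 2 map directly onto the JDK's `ThreadPoolExecutor`, which exposes its metrics and supports resizing at runtime. A minimal sketch (the class name and fixed sizes are illustrative; in practice the resize would be triggered by a config-center callback):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of a pool whose key metrics are readable for monitoring and whose
// sizes can be changed at runtime without redeployment.
public class TunablePool {
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            2, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));

    // Metrics to export to the monitoring system.
    public int activeThreads()   { return executor.getActiveCount(); }
    public int queueSize()       { return executor.getQueue().size(); }
    public long completedTasks() { return executor.getCompletedTaskCount(); }

    // Dynamic resize: ThreadPoolExecutor supports this natively.
    public void resize(int core, int max) {
        executor.setMaximumPoolSize(max);  // raise max first so core <= max holds
        executor.setCorePoolSize(core);
    }

    public int corePoolSize() { return executor.getCorePoolSize(); }
    public int maxPoolSize()  { return executor.getMaximumPoolSize(); }
}
```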

JVM and Hardware Metric Governance

1) Monitor full GC (FGC) frequency; e.g., no more than 2 times per 5 minutes.

2) Monitor young GC (YGC) frequency; e.g., no more than 10 times per minute.

3) Monitor GC pause time; e.g., YGC pause should not exceed 0.7 s.

4) Limit active Tomcat connections on VMs to 300 (adjust per app).

5) Keep CPU usage for I/O‑intensive apps below 60% (adjust per app).

6) Monitor blocked threads, set alerts, and capture stack traces.
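For point 6, the JVM's own management API is enough to find blocked threads and capture their stacks; a probe like the following (illustrative class name; alert thresholds and wiring are application-specific) could run periodically and feed the alerting system:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Sketch: list threads currently in the BLOCKED state, as the basis for an
// alert plus a stack-trace capture.
public class BlockedThreadProbe {
    public static List<ThreadInfo> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<ThreadInfo> blocked = new ArrayList<>();
        // dumpAllThreads(false, false): skip lock/synchronizer details, which
        // keeps the probe cheap enough to run on a schedule.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked.add(info);  // info.getStackTrace() gives the capture
            }
        }
        return blocked;
    }
}
```

The same `java.lang.management` package exposes `GarbageCollectorMXBean`, whose collection counts over time yield the FGC/YGC frequencies from points 1 and 2.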

Daily Operations

1) Long‑term alarm governance: keep the daily alarm count under control and ensure every alert indicates a real issue that needs manual handling.

2) Long‑term exception governance: continuously monitor the top‑5 alert sources in core applications as well as occasional RuntimeExceptions, and improve iteratively.

3) Service inspection: after peak periods and each release, verify service health and address abnormal metrics promptly.

4) Automated fault localization: leverage monitoring, alerts, and trace analysis to automatically pinpoint failures, reducing impact.

5) Periodic stress testing of core systems to evaluate scaling needs; consider internal traffic‑shaping testing tools.

Conclusion

The stability governance effort has reached a milestone. From daily operations to governance techniques, we reviewed cross‑region deployment, high‑availability components, post‑release inspections, regular stress tests, and AIOps‑driven fault auto‑location.

Governance methods such as degradation, circuit breaking, flow control, isolation, multi‑channel, and multi‑replica were employed to manage core resources.

Achieving zero failures is unrealistic, but many high‑availability practices exist; proper use greatly improves stability. Continuous awareness and implementation of these practices are essential.

We hope these practices can serve as a useful reference for others.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
