How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies
This guide explains how to improve system stability and reduce production incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and gray releases, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.
Introduction
A colleague recently asked how to improve system stability and reduce production incidents. This article walks through practical ways to do both.
1. Build a Complete Observability System
Observability is the foundation of system stability. A familiar scenario: the system slows down, yet CPU and memory usage look normal, and without deeper visibility the cause is hard to find.
1.1 Multi‑dimensional Monitoring
// Using Spring Boot Actuator to implement health checks
@Configuration
public class HealthConfig {
// ... configuration beans ...
}
@Component
public class CustomHealthIndicator implements HealthIndicator {
@Autowired
private DataSource dataSource;
@Autowired
private RedisTemplate redisTemplate;
@Override
public Health health() {
// Check database connection
if (!checkDatabase()) {
return Health.down().withDetail("database", "connection failed").build();
}
// Check Redis connection
if (!checkRedis()) {
return Health.down().withDetail("redis", "connection error").build();
}
// Check disk space
if (!checkDiskSpace()) {
return Health.down().withDetail("disk", "insufficient free space").build();
}
return Health.up().build();
}
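private boolean checkDatabase() {
// try-with-resources returns the connection to the pool after each check
try (java.sql.Connection connection = dataSource.getConnection()) {
return connection.isValid(5); // timeout in seconds
} catch (Exception e) {
return false;
}
}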
private boolean checkRedis() {
try {
redisTemplate.opsForValue().get("health_check");
return true;
} catch (Exception e) {
return false;
}
}
private boolean checkDiskSpace() {
File file = new File(".");
return file.getFreeSpace() > 1024 * 1024 * 1024; // more than 1GB free
}
}
Code logic analysis:
Implement HealthIndicator to customize health‑check logic.
Check whether the database connection is normal.
Check Redis and other middleware connection status.
Check system resources such as disk space.
When any component fails, immediately return a down status to enable timely alerts.
1.2 Distributed Tracing
// Using Spring Cloud Sleuth for tracing
@RestController
public class OrderController {
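private final Tracer tracer;
private final OrderService orderService;
public OrderController(Tracer tracer, OrderService orderService) {
this.tracer = tracer;
this.orderService = orderService;
}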
@PostMapping("/orders")
public ResponseEntity<Order> createOrder(@RequestBody OrderRequest request) {
// Start a new Span (method names follow the Spring Cloud Sleuth 3.x Tracer API)
Span span = tracer.nextSpan().name("createOrder").start();
try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
// Record business operation
span.tag("user.id", request.getUserId());
span.tag("order.amount", String.valueOf(request.getAmount()));
// Execute order creation logic
Order order = orderService.createOrder(request);
// Record success
span.event("order.created");
return ResponseEntity.ok(order);
} catch (Exception e) {
// Record error
span.error(e);
throw e;
} finally {
// Always end the span so it is reported even when the request fails
span.end();
}
}
}
Value of tracing:
Quickly locate performance bottlenecks.
Visualize service call relationships.
Analyze cross‑service exceptions.
2. Build Resilient Architecture: Rate Limiting, Degradation, Circuit Breaking
During traffic spikes, systems can be overwhelmed, requiring resilience mechanisms.
2.1 Intelligent Rate‑Limiting Strategies
// Using Resilience4j for rate limiting
@Configuration
public class RateLimitConfig {
@Bean
public RateLimiterRegistry rateLimiterRegistry() {
return RateLimiterRegistry.of(
RateLimiterConfig.custom()
.limitForPeriod(100) // 100 requests per second
.limitRefreshPeriod(Duration.ofSeconds(1))
.timeoutDuration(Duration.ofMillis(100))
.build()
);
}
}
@Service
public class OrderService {
// The annotation resolves the "orderService" limiter from the registry, so no RateLimiter field is needed.
// (A field of type RateLimiter would also clash with the annotation's simple name.)
@RateLimiter(name = "orderService", fallbackMethod = "createOrderFallback")
public Order createOrder(OrderRequest request) {
// Normal order creation logic
return processOrderCreation(request);
}
// Degradation method: same parameters as createOrder plus the triggering exception
public Order createOrderFallback(OrderRequest request, Exception e) {
// Log degradation
log.warn("Order service rate limit triggered, falling back, userId: {}", request.getUserId());
// Return fallback data or throw business exception
throw new BusinessException("System busy, please try again later");
}
}
Rate‑limiting details:
Fixed window : simple but has edge cases.
Sliding window : more precise but complex.
Token bucket : allows burst traffic.
Leaky bucket : smooths traffic to a stable rate.
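To make the token bucket entry concrete, here is a minimal plain-Java sketch. It is not how Resilience4j implements its limiter; the class name `TokenBucket`, its constructor parameters, and the injectable nanosecond clock are assumptions made for illustration and testing.

```java
import java.util.function.LongSupplier;

/** Minimal token-bucket sketch (hypothetical class, not part of Resilience4j). */
final class TokenBucket {
    private final long capacity;         // maximum burst size
    private final double refillPerNano;  // tokens added per nanosecond
    private final LongSupplier clock;    // injectable clock returning nanoseconds
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.clock = clock;
        this.tokens = capacity;          // start full so an initial burst is allowed
        this.lastRefill = clock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Refill in proportion to elapsed time, never exceeding the capacity
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the clock would typically be `System::nanoTime`. Because the bucket starts full, a burst of up to `capacity` requests passes immediately, after which throughput settles at the refill rate.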
2.2 Service Circuit‑Breaker Mechanism
// Circuit breaker configuration
@Configuration
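// Not named CircuitBreakerConfig: that would shadow Resilience4j's CircuitBreakerConfig used below
public class CircuitBreakerConfiguration {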
@Bean
public CircuitBreakerRegistry circuitBreakerRegistry() {
return CircuitBreakerRegistry.of(
CircuitBreakerConfig.custom()
.failureRateThreshold(50) // 50% failure threshold
.slowCallRateThreshold(50)
.slowCallDurationThreshold(Duration.ofSeconds(2))
.waitDurationInOpenState(Duration.ofSeconds(60))
.permittedNumberOfCallsInHalfOpenState(10)
.minimumNumberOfCalls(10)
.slidingWindowType(SlidingWindowType.COUNT_BASED)
.slidingWindowSize(10)
.build()
);
}
}
@Service
public class PaymentService {
private final CircuitBreaker circuitBreaker;
public PaymentService(CircuitBreakerRegistry registry) {
this.circuitBreaker = registry.circuitBreaker("paymentService");
}
public PaymentResult processPayment(PaymentRequest request) {
return circuitBreaker.executeSupplier(() -> paymentClient.pay(request));
}
}
Circuit‑breaker states:
CLOSED : normal operation, requests pass through.
OPEN : all requests are rejected immediately, without calling the downstream service, until the wait duration elapses.
HALF_OPEN : partial requests allowed for probing.
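The three states can be sketched as a small state machine in plain Java. This is a simplified illustration, not Resilience4j's implementation: it trips on consecutive failures rather than on a failure rate over a sliding window, and it does not limit the number of probe calls in HALF_OPEN. The class name `SimpleCircuitBreaker` and its parameters are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Simplified circuit breaker (hypothetical class, consecutive-failure based). */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final long openMillis;       // how long to stay OPEN before probing
    private final LongSupplier clock;    // injectable clock returning milliseconds
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; OPEN becomes HALF_OPEN once the wait time has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the action, or the fallback when the breaker is open or the action fails. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();       // fail fast without touching the dependency
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;            // a successful probe closes the breaker
    }

    private synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;          // a failed probe reopens immediately
            openedAt = clock.getAsLong();
        }
    }
}
```

The fail-fast branch is what protects a struggling dependency: while the breaker is OPEN, callers get the fallback at once instead of queueing behind slow calls.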
3. High‑Availability Deployment Strategies
3.1 Blue‑Green Deployment
# Kubernetes blue‑green deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
      version: v2
  template:
    metadata:
      labels:
        app: order-service
        version: v2
    spec:
      containers:
        - name: order-service
          image: order-service:v2.0.0
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
    version: v2
  ports:
    - port: 80
      targetPort: 8080
Blue‑green deployment advantages:
Fast rollback by modifying the Service selector.
Zero‑downtime releases.
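Avoids serving two versions behind the same Service at once, which sidesteps mixed‑version issues between instances (shared databases and message formats must still work with both versions).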
3.2 Gray Release
// Traffic‑weight based gray release
@Component
public class GrayReleaseRouter {
@Value("${gray.release.ratio:0.1}")
private double grayRatio;
@Autowired
private HttpServletRequest request;
public boolean shouldRouteToNewVersion() {
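String userId = getUserIdFromRequest();
// Requests without a user ID stay on the old version
if (userId == null || userId.isEmpty()) {
return false;
}
// Hash routing based on user ID; floorMod avoids the negative value that
// Math.abs(Integer.MIN_VALUE) would produce
int bucket = Math.floorMod(userId.hashCode(), 100);
return bucket / 100.0 < grayRatio;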
}
public Object routeRequest(Object request) {
if (shouldRouteToNewVersion()) {
// Forward to new version service
return callNewVersion(request);
} else {
// Use old version service
return callOldVersion(request);
}
}
private String getUserIdFromRequest() {
// Extract user ID from request header
return request.getHeader("X-User-Id");
}
}
Gray‑release policies:
Route by user ID.
Route by traffic proportion.
Route by business parameters (e.g., city, user level).
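Routing by user ID and routing by traffic proportion can be combined by hashing each user into a stable bucket. The sketch below is a plain-Java illustration; `GrayBucketRouter`, `bucketOf`, and `inGray` are hypothetical names, and sending anonymous traffic to the old version is an assumption rather than a rule.

```java
/** Stable percentage-based gray routing (hypothetical helper). */
final class GrayBucketRouter {
    private GrayBucketRouter() {}

    /** Maps a user ID to a stable bucket in the range [0, 100). */
    static int bucketOf(String userId) {
        // floorMod never returns a negative value, even for negative hash codes
        return Math.floorMod(userId.hashCode(), 100);
    }

    /** True if the user falls inside the gray percentage (0 to 100). */
    static boolean inGray(String userId, int grayPercent) {
        if (userId == null || userId.isEmpty()) {
            return false;                // assumption: anonymous traffic stays on the old version
        }
        return bucketOf(userId) < grayPercent;
    }
}
```

Because the bucket depends only on the user ID, a given user keeps seeing the same version, and raising the percentage only adds users to the gray group without moving anyone back.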
4. Data Consistency and Transaction Management
4.1 Distributed Transaction Solutions
// Using Seata for distributed transactions
@Service
public class OrderServiceImpl implements OrderService {
@GlobalTransactional
@Override
public Order createOrder(OrderRequest request) {
// 1. Create order (local transaction)
Order order = orderMapper.insert(request);
// 2. Deduct inventory (remote service)
inventoryFeignClient.deduct(request.getProductId(), request.getQuantity());
// 3. Add points (remote service)
pointsFeignClient.addPoints(request.getUserId(), request.getAmount());
return order;
}
}
@Service
public class InventoryServiceImpl implements InventoryService {
@Transactional
@Override
public void deduct(String productId, Integer quantity) {
// Check inventory
Inventory inventory = inventoryMapper.selectByProductId(productId);
if (inventory.getStock() < quantity) {
throw new BusinessException("Insufficient stock");
}
// Deduct stock
inventoryMapper.deductStock(productId, quantity);
// Log inventory change
inventoryLogMapper.insert(new InventoryLog(productId, quantity));
}
}
Distributed transaction models:
2PC: strong consistency, lower performance.
TCC: high performance, complex implementation.
SAGA: long‑running transactions, eventual consistency.
Local message table: simple and reliable, widely used.
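The SAGA model is easiest to see in code: each step carries a compensating action, and a failure triggers the compensations of the completed steps in reverse order. The following plain-Java sketch is illustrative only and is not Seata's saga mode; `SagaSketch` and `Step` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Minimal saga orchestration sketch (hypothetical, in-memory, single-threaded). */
final class SagaSketch {
    private SagaSketch() {}

    /** One saga step: a forward action plus the compensation that undoes it. */
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    /**
     * Runs the steps in order. If one fails, the completed steps are
     * compensated in reverse order. Returns true only if every step succeeded.
     */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);    // most recently completed step sits on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

A real saga also has to persist its progress and retry failed compensations, which is where most of the implementation effort goes.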
4.2 Reliable Message Delivery
// Local message table for eventual consistency
@Service
@Transactional
public class OrderServiceWithLocalMessage {
@Autowired
private OrderMapper orderMapper;
@Autowired
private MessageLogMapper messageLogMapper;
@Autowired
private RabbitTemplate rabbitTemplate;
public void createOrder(OrderRequest request) {
// 1. Create order
Order order = orderMapper.insert(request);
// 2. Record local message
MessageLog messageLog = new MessageLog();
messageLog.setMessageId(UUID.randomUUID().toString());
messageLog.setContent(buildMessageContent(order));
messageLog.setStatus(MessageStatus.PENDING);
messageLogMapper.insert(messageLog);
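// 3. Send message to MQ (best effort; the scheduled retry below covers failures).
// Note: this runs before the transaction commits, so a consumer may receive the
// message before the order row is visible; sending after commit avoids that.
try {
rabbitTemplate.convertAndSend("order.exchange", "order.created", messageLog.getContent());
// 4. Update status to SENT
messageLogMapper.updateStatus(messageLog.getMessageId(), MessageStatus.SENT);
} catch (Exception e) {
log.error("Failed to send message", e);
}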
}
// Retry failed messages every minute
@Scheduled(fixedDelay = 60000)
public void retryFailedMessages() {
List<MessageLog> failedMessages = messageLogMapper.selectByStatus(MessageStatus.PENDING);
for (MessageLog message : failedMessages) {
try {
rabbitTemplate.convertAndSend("order.exchange", "order.created", message.getContent());
messageLogMapper.updateStatus(message.getMessageId(), MessageStatus.SENT);
} catch (Exception e) {
log.error("Message retry failed: {}", message.getMessageId(), e);
}
}
}
}
Key points of reliable messaging:
Persist the message before sending to MQ.
Use scheduled tasks to retry unsent messages.
Consumers must implement idempotency.
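Because the retry job can deliver the same message more than once, the consumer side needs a deduplication check. A minimal sketch follows, assuming each message carries a unique ID. The class `IdempotentConsumer` is hypothetical, and the in-memory set stands in for what would normally be a durable store such as a table with a unique key on the message ID.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Deduplicating consumer wrapper (hypothetical; in-memory store for illustration only). */
final class IdempotentConsumer {
    // Assumption: a durable, shared store would replace this set in production
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the message was processed, false if it was a duplicate. */
    boolean onMessage(String messageId, String payload) {
        // add() is atomic: only the first delivery of an ID gets through
        if (!processedIds.add(messageId)) {
            return false;
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            processedIds.remove(messageId); // let a redelivery retry the work
            throw e;
        }
    }
}
```

Recording the ID before running the handler keeps two concurrent deliveries of the same message from both doing the work; removing it on failure keeps a failed attempt retryable.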
5. Capacity Planning and Performance Optimization
5.1 Stress Testing and Capacity Assessment
// Using JMH for benchmark testing
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class OrderServiceBenchmark {
private OrderService orderService;
private OrderRequest request;
@Setup
public void setup() {
orderService = new OrderService();
request = new OrderRequest("user123", "product456", 2, 100.0);
}
@Benchmark
public Order createOrderBenchmark() {
// Returning the result lets JMH consume it, which prevents dead-code elimination
return orderService.createOrder(request);
}
public static void main(String[] args) throws RunnerException {
Options options = new OptionsBuilder()
.include(OrderServiceBenchmark.class.getSimpleName())
.forks(1)
.warmupIterations(2)
.measurementIterations(3)
.build();
new Runner(options).run();
}
}
Capacity planning steps:
Benchmark testing : obtain single‑node performance metrics.
Stress testing : identify system bottlenecks.
Capacity calculation : compute required resources based on business goals.
Reserve buffer : typically keep 30%‑50% redundancy.
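The capacity calculation step comes down to simple arithmetic. The sketch below assumes the buffer is applied as extra headroom on top of peak traffic, so nodes = ceil(peakQps × (1 + buffer) / perNodeQps); that interpretation, the class name `CapacityEstimate`, and the sample figures are assumptions for illustration.

```java
/** Node-count estimate from peak load and single-node throughput (hypothetical helper). */
final class CapacityEstimate {
    private CapacityEstimate() {}

    /**
     * @param peakQps     expected peak requests per second (business goal)
     * @param perNodeQps  sustainable throughput of one node (from benchmarking)
     * @param bufferRatio redundancy on top of peak, e.g. 0.3 for 30%
     */
    static int requiredNodes(double peakQps, double perNodeQps, double bufferRatio) {
        if (perNodeQps <= 0) {
            throw new IllegalArgumentException("perNodeQps must be positive");
        }
        // Round up: a fractional node still requires a whole machine
        return (int) Math.ceil(peakQps * (1.0 + bufferRatio) / perNodeQps);
    }
}
```

For example, a peak of 10,000 QPS, 1,200 QPS per node, and a 30% buffer gives 13,000 / 1,200 ≈ 10.8, which rounds up to 11 nodes.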
5.2 Database Performance Optimization
-- Slow query analysis
EXPLAIN ANALYZE
SELECT o.*, u.username, p.product_name
FROM orders o
LEFT JOIN users u ON o.user_id = u.user_id
LEFT JOIN products p ON o.product_id = p.product_id
WHERE o.create_time BETWEEN '2023-01-01' AND '2023-12-31'
AND o.status = 'COMPLETED'
AND u.city = 'Beijing'
ORDER BY o.amount DESC
LIMIT 100;
-- Index optimization suggestions
-- 1. Composite index covering common query conditions
CREATE INDEX idx_orders_user_time ON orders(user_id, create_time);
-- 2. Covering index to avoid table lookups (the INCLUDE clause requires PostgreSQL 11 or later)
CREATE INDEX idx_orders_covering ON orders(status, create_time, amount) INCLUDE (user_id, product_id);
-- 3. Functional index for complex conditions
CREATE INDEX idx_orders_month ON orders(EXTRACT(MONTH FROM create_time));
Database optimization strategies:
Read‑write separation: send writes to the primary and reads to replicas, keeping replication lag in mind for read‑after‑write paths.
Sharding tables: horizontal partitioning of large tables.
Index optimization: avoid full table scans.
Query optimization: reduce JOINs, avoid SELECT *.
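For the sharding strategy, the core of horizontal partitioning is a deterministic mapping from a sharding key to a physical table. Here is a minimal sketch, assuming the `orders` table is split by user ID into tables named `orders_0` through `orders_{n-1}`; the naming scheme and the class `ShardRouter` are hypothetical.

```java
/** Maps a sharding key to a physical table name (hypothetical naming scheme). */
final class ShardRouter {
    private ShardRouter() {}

    static String tableFor(long userId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the shard index non-negative even for negative IDs
        long shard = Math.floorMod(userId, (long) shardCount);
        return "orders_" + shard;
    }
}
```

Plain modulo routing is simple, but changing the shard count remaps most keys, so the count is usually chosen with future growth in mind or replaced by a scheme such as consistent hashing.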
6. Emergency Plans and Fault Handling
6.1 Fault‑Handling Library
// Automatic failover handling
@Component
public class AutoFailoverHandler {
@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;
@Autowired
private RedisTemplate redisTemplate;
@EventListener
public void handleDatabaseFailure(DatabaseDownEvent event) {
log.warn("Database failure detected, enabling degradation strategy");
// 1. Open circuit breaker to prevent request pile‑up
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("databaseService");
circuitBreaker.transitionToOpenState();
// 2. Enable local cache mode
redisTemplate.opsForValue().set("degradation.mode", "true");
// 3. Send alert notification
alertService.sendCriticalAlert("Database failure, degradation mode enabled");
}
// Periodically check database recovery
@Scheduled(fixedRate = 30000)
public void checkDatabaseRecovery() {
if (isDatabaseRecovered()) {
log.info("Database recovered, disabling degradation mode");
redisTemplate.delete("degradation.mode");
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("databaseService");
circuitBreaker.transitionToClosedState();
}
}
}
6.2 Fault Injection (Chaos Engineering)
// Chaos engineering fault injection
@RestController
public class ChaosController {
@PostMapping("/chaos/inject")
public String injectChaos(@RequestBody ChaosConfig config) {
switch (config.getFaultType()) {
case "latency":
// Inject latency
injectLatency(config.getDuration(), config.getLatencyMs());
break;
case "exception":
// Inject exception
injectException(config.getDuration(), config.getExceptionRate());
break;
case "memory":
// Consume memory
consumeMemory(config.getMemoryMb());
break;
default:
throw new IllegalArgumentException("Unsupported fault type");
}
return "Fault injected";
}
// Shared switch that a LatencyAspect bean (registered separately, not shown) is assumed to
// read on every intercepted call; a proxy built around new Object() would intercept nothing.
private final AtomicBoolean latencyEnabled = new AtomicBoolean(false);
private final AtomicLong latencyMillis = new AtomicLong();
private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
private void injectLatency(Duration duration, long latencyMs) {
latencyMillis.set(latencyMs);
latencyEnabled.set(true);
// Switch the fault off automatically once the requested duration has elapsed
scheduler.schedule(() -> latencyEnabled.set(false), duration.toMillis(), TimeUnit.MILLISECONDS);
}
// Additional methods for exception and memory injection omitted for brevity
}
Conclusion
Key points for improving system stability include proactive monitoring, capacity planning, code quality, intelligent alerts, tracing, log analysis, resilient architecture (rate limiting, degradation, circuit breaking), automated operations (one‑click rollback, auto‑scaling), emergency response, continuous improvement, post‑mortem analysis, chaos engineering, and technical‑debt management.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
