
How to Boost System Stability: Observability, Resilience, and High‑Availability Strategies

This comprehensive guide explains how to improve system stability and reduce online incidents by building observability, implementing distributed tracing, applying rate‑limiting and circuit‑breaker patterns, adopting blue‑green and gray deployments, managing data consistency with distributed transactions, planning capacity, optimizing performance, and preparing emergency response plans.

Su San Talks Tech

Introduction

Recently a colleague asked how to improve system stability and reduce online incidents. System stability is crucial, and this article discusses practical ways to achieve it.

1. Build a Complete Observability System

Observability is the foundation of system stability. Many engineers have seen a system slow down while CPU and memory look normal, which is hard to diagnose without finer-grained signals.

1.1 Multi‑dimensional Monitoring

// Using Spring Boot Actuator to implement health checks
@Configuration
public class HealthConfig {
    // ... configuration beans ...
}

@Component
public class CustomHealthIndicator implements HealthIndicator {
    @Autowired
    private DataSource dataSource;
    @Autowired
    private RedisTemplate redisTemplate;

    @Override
    public Health health() {
        // Check database connection
        if (!checkDatabase()) {
            return Health.down().withDetail("database", "connection failed").build();
        }
        // Check Redis connection
        if (!checkRedis()) {
            return Health.down().withDetail("redis", "connection error").build();
        }
        // Check disk space
        if (!checkDiskSpace()) {
            return Health.down().withDetail("disk", "insufficient space").build();
        }
        return Health.up().build();
    }

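    private boolean checkDatabase() {
        // try-with-resources returns the connection to the pool after the check
        try (Connection connection = dataSource.getConnection()) {
            return connection.isValid(5); // timeout in seconds
        } catch (Exception e) {
            return false;
        }
    }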

    private boolean checkRedis() {
        try {
            redisTemplate.opsForValue().get("health_check");
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    private boolean checkDiskSpace() {
        File file = new File(".");
        return file.getFreeSpace() > 1024 * 1024 * 1024; // more than 1GB free
    }
}

Code logic analysis:

Implement HealthIndicator to customize health‑check logic.

Check whether the database connection is normal.

Check Redis and other middleware connection status.

Check system resources such as disk space.

When any component fails, immediately return a down status to enable timely alerts.

1.2 Distributed Tracing

// Using Spring Cloud Sleuth for tracing
@RestController
public class OrderController {
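    private final Tracer tracer;
    private final OrderService orderService;

    public OrderController(Tracer tracer, OrderService orderService) {
        this.tracer = tracer;
        this.orderService = orderService;
    }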

    @PostMapping("/orders")
    public ResponseEntity<Order> createOrder(@RequestBody OrderRequest request) {
        // Start a new Span
        Span span = tracer.nextSpan().name("createOrder").start();
        try (Tracer.SpanInScope ws = tracer.withSpanInScope(span)) {
            // Record business operation
            span.tag("user.id", request.getUserId());
            span.tag("order.amount", String.valueOf(request.getAmount()));
            // Execute order creation logic
            Order order = orderService.createOrder(request);
            // Record success (with Brave's Span API, timestamped events are added via annotate())
            span.annotate("order.created");
            return ResponseEntity.ok(order);
        } catch (Exception e) {
            // Record error
            span.error(e);
            throw e;
        } finally {
            span.finish();
        }
    }
}

Value of tracing:

Quickly locate performance bottlenecks.

Visualize service call relationships.

Analyze cross‑service exceptions.
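
Tracing can tie these cross-service views together because every hop carries the same trace ID. As a rough sketch of that idea, the plain-Java class below builds and parses a `traceparent` header in the W3C Trace Context format (`version-traceId-spanId-flags`). The class name `TraceContext` and its methods are hypothetical and are not part of Sleuth, which generates and propagates these IDs for you.

```java
import java.security.SecureRandom;

/** Hypothetical helper illustrating trace-context propagation; not a Sleuth API. */
final class TraceContext {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId; // 32 hex chars, shared by every span in one request
    final String spanId;  // 16 hex chars, unique per operation

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    /** Starts a new trace at the edge of the system. */
    static TraceContext newRoot() {
        return new TraceContext(randomHex(16), randomHex(8));
    }

    /** Creates a child span: same trace ID, fresh span ID. */
    TraceContext child() {
        return new TraceContext(traceId, randomHex(8));
    }

    /** Formats the header; the trailing "01" flag marks the trace as sampled. */
    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-01";
    }

    /** Parses a traceparent header received from an upstream service. */
    static TraceContext parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        return new TraceContext(parts[1], parts[2]);
    }

    private static String randomHex(int byteCount) {
        byte[] bytes = new byte[byteCount];
        RANDOM.nextBytes(bytes);
        StringBuilder sb = new StringBuilder(byteCount * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TraceContext root = newRoot();
        // A downstream service parses the incoming header and opens its own span.
        TraceContext downstream = parse(root.toTraceparent()).child();
        System.out.println(root.toTraceparent());
        System.out.println(downstream.toTraceparent());
    }
}
```

Both printed headers share the same trace ID, which is what lets a tracing backend stitch spans from different services into one call graph.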

2. Build Resilient Architecture: Rate Limiting, Degradation, Circuit Breaking

During traffic spikes, systems can be overwhelmed, requiring resilience mechanisms.

2.1 Intelligent Rate‑Limiting Strategies

// Using Resilience4j for rate limiting
@Configuration
public class RateLimitConfig {
    @Bean
    public RateLimiterRegistry rateLimiterRegistry() {
        return RateLimiterRegistry.of(
            RateLimiterConfig.custom()
                .limitForPeriod(100) // 100 requests per second
                .limitRefreshPeriod(Duration.ofSeconds(1))
                .timeoutDuration(Duration.ofMillis(100))
                .build()
        );
    }
}

@Service
public class OrderService {
    // The @RateLimiter annotation looks up the "orderService" limiter by name,
    // so no RateLimiter field is needed here. (A field of type RateLimiter would
    // also clash with the annotation's simple name.)
    @RateLimiter(name = "orderService", fallbackMethod = "createOrderFallback")
    public Order createOrder(OrderRequest request) {
        // Normal order creation logic
        return processOrderCreation(request);
    }

    // Fallback invoked when the limiter rejects the call
    public Order createOrderFallback(OrderRequest request, Exception e) {
        // Log the degradation
        log.warn("Order service rate limit triggered, falling back, userId: {}", request.getUserId());
        // Return fallback data or throw a business exception
        throw new BusinessException("System busy, please try again later");
    }
}

Rate‑limiting details:

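Fixed window: simple, but a burst that straddles a window boundary can briefly pass up to twice the limit.

Sliding window: more precise, but more complex to implement.

Token bucket: allows bursts up to the bucket capacity while capping the average rate.

Leaky bucket: smooths traffic to a stable output rate.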
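
To make the token-bucket idea concrete, here is a minimal plain-Java sketch. It is not how Resilience4j implements its limiter; the class name `TokenBucket` and the injected nanosecond clock are assumptions made for illustration and easy testing.

```java
import java.util.function.LongSupplier;

/** Minimal token-bucket sketch; an illustration, not the Resilience4j implementation. */
final class TokenBucket {
    private final long capacity;        // maximum burst size
    private final double refillPerNano; // tokens added per nanosecond
    private final LongSupplier clock;   // time source in nanoseconds (injected for testing)
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double tokensPerSecond, LongSupplier clock) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.clock = clock;
        this.tokens = capacity;         // start full so an initial burst is allowed
        this.lastRefill = clock.getAsLong();
    }

    /** Consumes one token and returns true if one is available; otherwise returns false. */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Refill lazily based on elapsed time, never exceeding the capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Burst of 5, then a sustained 10 requests per second.
        TokenBucket bucket = new TokenBucket(5, 10.0, System::nanoTime);
        int allowed = 0;
        for (int i = 0; i < 20; i++) {
            if (bucket.tryAcquire()) {
                allowed++;
            }
        }
        System.out.println("allowed in initial burst: " + allowed);
    }
}
```

The capacity controls how large a burst is tolerated, while the refill rate sets the long-run average; tuning the two separately is the main reason to prefer this over a fixed window.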

2.2 Service Circuit‑Breaker Mechanism

// Circuit breaker configuration
@Configuration
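// Named CircuitBreakerConfiguration so it does not shadow Resilience4j's CircuitBreakerConfig
public class CircuitBreakerConfiguration {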
    @Bean
    public CircuitBreakerRegistry circuitBreakerRegistry() {
        return CircuitBreakerRegistry.of(
            CircuitBreakerConfig.custom()
                .failureRateThreshold(50) // 50% failure threshold
                .slowCallRateThreshold(50)
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .waitDurationInOpenState(Duration.ofSeconds(60))
                .permittedNumberOfCallsInHalfOpenState(10)
                .minimumNumberOfCalls(10)
                .slidingWindowType(SlidingWindowType.COUNT_BASED)
                .slidingWindowSize(10)
                .build()
        );
    }
}

@Service
public class PaymentService {
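    private final CircuitBreaker circuitBreaker;
    private final PaymentClient paymentClient; // remote payment client, assumed to be a Spring bean

    public PaymentService(CircuitBreakerRegistry registry, PaymentClient paymentClient) {
        this.circuitBreaker = registry.circuitBreaker("paymentService");
        this.paymentClient = paymentClient;
    }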

    public PaymentResult processPayment(PaymentRequest request) {
        return circuitBreaker.executeSupplier(() -> paymentClient.pay(request));
    }
}

Circuit‑breaker states:

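CLOSED: normal operation; requests pass through while failures are recorded.

OPEN: requests are rejected immediately, without calling the downstream service.

HALF_OPEN: a limited number of probe requests are allowed through; their results decide whether the breaker closes or reopens.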
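
The three states form a small state machine. The sketch below is a deliberately simplified illustration rather than Resilience4j's algorithm: it opens after a number of consecutive failures instead of tracking a failure rate over a sliding window, and it does not cap the number of probes in HALF_OPEN. The class name `SimpleCircuitBreaker` and the injected millisecond clock are assumptions.

```java
import java.util.function.LongSupplier;

/** Simplified circuit-breaker state machine; an illustration only. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the breaker
    private final long openMillis;      // how long to stay OPEN before probing
    private final LongSupplier clock;   // time source in milliseconds (injected for testing)
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may attempt the protected call. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // wait time elapsed: let probe requests through
        }
        return state != State.OPEN;
    }

    /** A successful call closes the breaker and resets the failure count. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed probe, or too many consecutive failures, opens the breaker. */
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000, System::currentTimeMillis);
        breaker.recordFailure();
        breaker.recordFailure();
        System.out.println("state after two failures: " + breaker.state());
        System.out.println("request allowed: " + breaker.allowRequest());
    }
}
```

A caller would check `allowRequest()` before the remote call and then report the outcome with `recordSuccess()` or `recordFailure()`.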

3. High‑Availability Deployment Strategies

3.1 Blue‑Green Deployment

# Kubernetes blue‑green deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
      version: v2
  template:
    metadata:
      labels:
        app: order-service
        version: v2
    spec:
      containers:
        - name: order-service
          image: order-service:v2.0.0
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
    version: v2
  ports:
    - port: 80
      targetPort: 8080

Blue‑green deployment advantages:

Fast rollback by modifying the Service selector.

Zero‑downtime releases.

Avoids mixed-version traffic, since only one version serves requests at a time.

3.2 Gray Release

// Traffic‑weight based gray release
@Component
public class GrayReleaseRouter {
    @Value("${gray.release.ratio:0.1}")
    private double grayRatio;
    @Autowired
    private HttpServletRequest request;

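    public boolean shouldRouteToNewVersion() {
        String userId = getUserIdFromRequest();
        if (userId == null) {
            // No user ID on the request: stay on the old version
            return false;
        }
        // Hash routing based on user ID. Math.floorMod keeps the bucket in [0, 100)
        // even when hashCode() is negative (Math.abs(Integer.MIN_VALUE) is still negative).
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket / 100.0 < grayRatio;
    }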

    public Object routeRequest(Object request) {
        if (shouldRouteToNewVersion()) {
            // Forward to new version service
            return callNewVersion(request);
        } else {
            // Use old version service
            return callOldVersion(request);
        }
    }

    private String getUserIdFromRequest() {
        // Extract user ID from request header
        return request.getHeader("X-User-Id");
    }
}

Gray‑release policies:

Route by user ID.

Route by traffic proportion.

Route by business parameters (e.g., city, user level).
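
Routing by user ID and routing by traffic proportion can be combined: hash the user ID into a fixed bucket and compare the bucket with the rollout percentage. The sketch below shows this under the assumption of 100 buckets; `GrayBucket`, `bucketOf`, and `inGray` are hypothetical names.

```java
/** Hypothetical helper: stable percentage-based bucketing by user ID. */
final class GrayBucket {
    /** Maps a user ID to a stable bucket in [0, 100). */
    static int bucketOf(String userId) {
        // floorMod never returns a negative value for a positive divisor,
        // so negative hash codes are handled correctly.
        return Math.floorMod(userId.hashCode(), 100);
    }

    /** Returns true if the user falls inside the first grayPercent buckets. */
    static boolean inGray(String userId, int grayPercent) {
        return bucketOf(userId) < grayPercent;
    }

    public static void main(String[] args) {
        int grayPercent = 10;
        int hits = 0;
        int total = 10_000;
        for (int i = 0; i < total; i++) {
            if (inGray("user-" + i, grayPercent)) {
                hits++;
            }
        }
        System.out.println("users routed to the new version: " + hits + " of " + total);
    }
}
```

Because the bucket depends only on the user ID, a given user keeps seeing the same version as the percentage grows, which avoids users flipping between versions from one request to the next.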

4. Data Consistency and Transaction Management

4.1 Distributed Transaction Solutions

// Using Seata for distributed transactions
@Service
public class OrderServiceImpl implements OrderService {
    @GlobalTransactional
    @Override
    public Order createOrder(OrderRequest request) {
        // 1. Create order (local transaction)
        Order order = orderMapper.insert(request);
        // 2. Deduct inventory (remote service)
        inventoryFeignClient.deduct(request.getProductId(), request.getQuantity());
        // 3. Add points (remote service)
        pointsFeignClient.addPoints(request.getUserId(), request.getAmount());
        return order;
    }
}

@Service
public class InventoryServiceImpl implements InventoryService {
    @Transactional
    @Override
    public void deduct(String productId, Integer quantity) {
        // Check inventory
        Inventory inventory = inventoryMapper.selectByProductId(productId);
        if (inventory.getStock() < quantity) {
            throw new BusinessException("Insufficient stock");
        }
        // Deduct stock
        inventoryMapper.deductStock(productId, quantity);
        // Log inventory change
        inventoryLogMapper.insert(new InventoryLog(productId, quantity));
    }
}

Distributed transaction models:

2PC: strong consistency, lower performance.

TCC: high performance, complex implementation.

SAGA: long‑running transactions, eventual consistency.

Local message table: simple and reliable, widely used.
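
Of these models, SAGA is perhaps the easiest to illustrate in a few lines: each step has a compensating action, and when a step fails, the steps that already succeeded are undone in reverse order. The sketch below is a minimal in-process illustration with hypothetical names (`SagaRunner`, `Step`); it is not Seata's Saga mode, and a real implementation would also persist saga state and retry compensations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal in-process saga sketch; an illustration only. */
final class SagaRunner {
    /** One saga step: a forward action plus the compensation that undoes it. */
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    /** Runs steps in order; on failure, compensates completed steps in reverse order. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException e) {
                // Undo the most recently completed step first.
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Step> steps = new ArrayList<>();
        steps.add(new Step("createOrder",
                () -> System.out.println("order created"),
                () -> System.out.println("order cancelled")));
        steps.add(new Step("deductInventory",
                () -> { throw new IllegalStateException("insufficient stock"); },
                () -> System.out.println("inventory restored")));
        System.out.println("saga succeeded: " + run(steps));
    }
}
```

In this run the inventory step fails, so only the order step is compensated; the failed step itself never completed and has nothing to undo.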

4.2 Reliable Message Delivery

// Local message table for eventual consistency
@Service
@Transactional
public class OrderServiceWithLocalMessage {
    @Autowired
    private OrderMapper orderMapper;
    @Autowired
    private MessageLogMapper messageLogMapper;
    @Autowired
    private RabbitTemplate rabbitTemplate;

    public void createOrder(OrderRequest request) {
        // 1. Create order
        Order order = orderMapper.insert(request);
        // 2. Record local message
        MessageLog messageLog = new MessageLog();
        messageLog.setMessageId(UUID.randomUUID().toString());
        messageLog.setContent(buildMessageContent(order));
        messageLog.setStatus(MessageStatus.PENDING);
        messageLogMapper.insert(messageLog);
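        // 3. Try to send to MQ right away. A failure leaves the row PENDING for the retry job.
        //    Caveat: this send happens before the transaction commits, so consumers
        //    must tolerate a message whose order was later rolled back.
        try {
            rabbitTemplate.convertAndSend("order.exchange", "order.created", messageLog.getContent());
            // 4. Update status to SENT
            messageLogMapper.updateStatus(messageLog.getMessageId(), MessageStatus.SENT);
        } catch (Exception e) {
            log.error("Failed to send message", e);
        }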
    }

    // Retry failed messages every minute
    @Scheduled(fixedDelay = 60000)
    public void retryFailedMessages() {
        List<MessageLog> failedMessages = messageLogMapper.selectByStatus(MessageStatus.PENDING);
        for (MessageLog message : failedMessages) {
            try {
                rabbitTemplate.convertAndSend("order.exchange", "order.created", message.getContent());
                messageLogMapper.updateStatus(message.getMessageId(), MessageStatus.SENT);
            } catch (Exception e) {
                log.error("Failed to resend message: {}", message.getMessageId(), e);
            }
        }
    }
}

Key points of reliable messaging:

Persist the message before sending to MQ.

Use scheduled tasks to retry unsent messages.

Consumers must implement idempotency.
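
Because the retry job can deliver the same message more than once, the consumer side needs a duplicate check. The sketch below shows one simple way to do that, assuming each message carries a unique ID. `IdempotentConsumer` is a hypothetical class, and the in-memory set stands in for what would normally be a durable store such as a database table with a unique key.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical idempotent consumer: processes each message ID at most once. */
final class IdempotentConsumer {
    // In-memory record of handled IDs; an assumption for illustration only.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the message was processed, false if it was a duplicate. */
    boolean onMessage(String messageId, String payload) {
        // add() returns false when the ID is already present, i.e. a redelivery.
        if (!processedIds.add(messageId)) {
            return false;
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Forget the ID so a later redelivery can retry the work.
            processedIds.remove(messageId);
            throw e;
        }
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer =
                new IdempotentConsumer(payload -> System.out.println("handling " + payload));
        System.out.println(consumer.onMessage("msg-1", "order created"));
        System.out.println(consumer.onMessage("msg-1", "order created")); // duplicate delivery
    }
}
```

The second call returns false without invoking the handler, so a retried message does not create a second side effect.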

5. Capacity Planning and Performance Optimization

5.1 Stress Testing and Capacity Assessment

// Using JMH for benchmark testing
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class OrderServiceBenchmark {
    private OrderService orderService;
    private OrderRequest request;

    @Setup
    public void setup() {
        orderService = new OrderService();
        request = new OrderRequest("user123", "product456", 2, 100.0);
    }

    @Benchmark
    public Order createOrderBenchmark() {
        // Return the result so the JIT cannot eliminate the call as dead code
        return orderService.createOrder(request);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(OrderServiceBenchmark.class.getSimpleName())
                .forks(1)
                .warmupIterations(2)
                .measurementIterations(3)
                .build();
        new Runner(options).run();
    }
}

Capacity planning steps:

Benchmark testing: obtain single‑node performance metrics.

Stress testing: identify system bottlenecks.

Capacity calculation: compute required resources based on business goals.

Reserve buffer: typically keep 30%‑50% redundancy.
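
The capacity calculation step can be written as a one-line formula: nodes = ceil(peak QPS × (1 + buffer) / single-node QPS). The sketch below applies it; the class name `CapacityPlan` and the sample numbers (5,000 peak QPS, 800 QPS per node, 30% buffer) are illustrative assumptions, not measurements.

```java
/** Hypothetical helper for the capacity calculation step. */
final class CapacityPlan {
    /**
     * Number of nodes needed to serve peakQps with the given headroom.
     *
     * @param peakQps     expected peak requests per second (business goal)
     * @param perNodeQps  sustainable throughput of one node (from benchmarking)
     * @param bufferRatio reserved headroom, e.g. 0.3 for 30%
     */
    static int requiredNodes(double peakQps, double perNodeQps, double bufferRatio) {
        if (perNodeQps <= 0) {
            throw new IllegalArgumentException("perNodeQps must be positive");
        }
        return (int) Math.ceil(peakQps * (1.0 + bufferRatio) / perNodeQps);
    }

    public static void main(String[] args) {
        // 5000 * 1.3 / 800 = 8.125, rounded up to 9 nodes
        System.out.println(requiredNodes(5000, 800, 0.3)); // prints 9
    }
}
```

Rounding up matters here: 8 nodes would leave the system just short of the target once the buffer is included.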

5.2 Database Performance Optimization

-- Slow query analysis
EXPLAIN ANALYZE
SELECT o.*, u.username, p.product_name
FROM orders o
LEFT JOIN users u ON o.user_id = u.user_id
LEFT JOIN products p ON o.product_id = p.product_id
WHERE o.create_time >= '2023-01-01' AND o.create_time < '2024-01-01' -- half-open range covers all of Dec 31
  AND o.status = 'COMPLETED'
  AND u.city = 'Beijing'
ORDER BY o.amount DESC
LIMIT 100;

-- Index optimization suggestions
-- 1. Composite index covering common query conditions
CREATE INDEX idx_orders_user_time ON orders(user_id, create_time);
-- 2. Covering index to avoid table lookups
CREATE INDEX idx_orders_covering ON orders(status, create_time, amount) INCLUDE (user_id, product_id);
-- 3. Functional index for complex conditions
CREATE INDEX idx_orders_month ON orders(EXTRACT(MONTH FROM create_time));

Database optimization strategies:

Read‑write separation: the primary handles writes, replicas handle reads.

Sharding tables: horizontal partitioning of large tables.

Index optimization: avoid full table scans.

Query optimization: reduce JOINs, avoid SELECT *.

6. Emergency Plans and Fault Handling

6.1 Fault‑Handling Playbook

// Automatic failover handling
@Component
public class AutoFailoverHandler {
    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;
    @Autowired
    private RedisTemplate redisTemplate;

    @EventListener
    public void handleDatabaseFailure(DatabaseDownEvent event) {
        log.warn("Database failure detected, enabling degradation strategy");
        // 1. Open circuit breaker to prevent request pile‑up
        CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("databaseService");
        circuitBreaker.transitionToOpenState();
        // 2. Set the degradation flag so that readers switch to cached data
        redisTemplate.opsForValue().set("degradation.mode", "true");
        // 3. Send alert notification
        alertService.sendCriticalAlert("Database failure, degradation mode enabled");
    }

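    // Periodically check database recovery
    @Scheduled(fixedRate = 30000)
    public void checkDatabaseRecovery() {
        // Only act while the degradation flag is set, so a healthy system is left alone
        if (Boolean.TRUE.equals(redisTemplate.hasKey("degradation.mode")) && isDatabaseRecovered()) {
            log.info("Database recovered, disabling degradation mode");
            redisTemplate.delete("degradation.mode");
            CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("databaseService");
            if (circuitBreaker.getState() != CircuitBreaker.State.CLOSED) {
                circuitBreaker.transitionToClosedState();
            }
        }
    }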
}

6.2 Fault Injection (Chaos Engineering)

// Chaos engineering fault injection
@RestController
public class ChaosController {
    @PostMapping("/chaos/inject")
    public String injectChaos(@RequestBody ChaosConfig config) {
        switch (config.getFaultType()) {
            case "latency":
                // Inject latency
                injectLatency(config.getDuration(), config.getLatencyMs());
                break;
            case "exception":
                // Inject exception
                injectException(config.getDuration(), config.getExceptionRate());
                break;
            case "memory":
                // Consume memory
                consumeMemory(config.getMemoryMb());
                break;
            default:
                throw new IllegalArgumentException("Unsupported fault type");
        }
        return "Fault injected successfully";
    }

    // Latency (in ms) currently being injected; 0 means off. A request-level aspect
    // or filter (not shown) is assumed to read this value and sleep accordingly.
    private final AtomicLong injectedLatencyMs = new AtomicLong(0);
    // One shared scheduler instead of a new, never-closed executor per call
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    private void injectLatency(Duration duration, long latencyMs) {
        injectedLatencyMs.set(latencyMs);
        // Switch the fault off automatically once the requested duration has elapsed
        scheduler.schedule(() -> injectedLatencyMs.set(0), duration.toMillis(), TimeUnit.MILLISECONDS);
    }

    // Additional methods for exception and memory injection omitted for brevity
}

Conclusion

Key points for improving system stability include proactive monitoring, capacity planning, code quality, intelligent alerts, tracing, log analysis, resilient architecture (rate limiting, degradation, circuit breaking), automated operations (one‑click rollback, auto‑scaling), emergency response, continuous improvement, post‑mortem analysis, chaos engineering, and technical‑debt management.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
