
Improving System Availability: Fault Prevention, Real‑time Detection, and Rapid Recovery

The article examines how a payment platform improves its 24/7 availability by preventing failures, detecting incidents in real time, and implementing rapid recovery measures such as dynamic routing, resource limits, monitoring, logging, and service degradation, while sharing practical Q&A insights.

Java Architect Essentials

Apr 8, 2024


1. Background

Internet and enterprise applications are expected to be available 24/7. Availability is usually expressed in "nines" (99.9%, 99.99%, 99.999%); the table below shows the downtime per year that each level allows.

Availability | Calculation              | Downtime per year (minutes)
99.9%        | 0.1% * 365 * 24 * 60     | 525.6
99.99%       | 0.01% * 365 * 24 * 60    | 52.56
99.999%      | 0.001% * 365 * 24 * 60   | 5.256
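The arithmetic in the table can be reproduced with a few lines of Java. This is only an illustrative sketch: the class and method names are made up here, and a 365-day year is assumed, as in the table.

```java
final class DowntimeBudget {
    // 365 * 24 * 60 = 525,600 minutes; leap years are ignored, as in the table.
    private static final double MINUTES_PER_YEAR = 365 * 24 * 60;

    /** Allowed downtime per year, in minutes, for an availability target given in percent. */
    static double minutesPerYear(double availabilityPercent) {
        if (availabilityPercent < 0 || availabilityPercent > 100) {
            throw new IllegalArgumentException("availability must be within 0..100");
        }
        return (100.0 - availabilityPercent) / 100.0 * MINUTES_PER_YEAR;
    }
}
```

Calling `DowntimeBudget.minutesPerYear(99.99)` yields roughly 52.56 minutes, matching the second row (up to floating-point rounding).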

Maintaining high availability is challenging as functionality and data volume grow. "PayNow" (a fictional payment platform) has explored ways to avoid single points of failure, ensure service resilience, and handle traffic spikes.

In ideal conditions, the service can achieve 99.999% availability.

This article focuses on improving the availability of the application itself; avoiding single points of failure and handling traffic growth are covered elsewhere.

The first step is to avoid failures, but they can never be eliminated completely, and even a low‑probability incident can cascade.

For example, RabbitMQ was considered highly reliable, yet a hardware failure on its host still caused a service outage.

When a failure does occur, rapid detection and resolution are essential. The system aims to detect faults within seconds, diagnose them quickly, and mitigate them promptly.

2. Problems

Typical issues encountered include:

Missing timeout settings for new third‑party channels, causing queue blockage.

Insufficient database connections after adding a new module.

Worker thread exhaustion due to third‑party timeout.

DDoS limits triggered by a third‑party network operator.

Sequence field overflow when transaction volume exceeds limits.

These hidden problems are common in internet systems.

3. Solutions

3.1 Prevent Failures

3.1.1 Design a Fault‑Tolerant System

Example: dynamic routing among 30+ payment channels. If channel A fails, traffic is rerouted to B or C, ensuring payment success.
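A minimal sketch of this kind of routing is shown below. It is not PayNow's implementation: the `ChannelRouter` class, its methods, and the idea of a fixed preference order are assumptions made for illustration. Channels marked as down are skipped, and the first remaining channel is selected.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical router: picks the first channel in preference order that is not marked down.
final class ChannelRouter {
    private final List<String> preferenceOrder;
    private final Set<String> unavailable = ConcurrentHashMap.newKeySet();

    ChannelRouter(List<String> preferenceOrder) {
        this.preferenceOrder = List.copyOf(preferenceOrder);
    }

    void markDown(String channel) { unavailable.add(channel); }

    void markUp(String channel) { unavailable.remove(channel); }

    /** Returns the preferred available channel, or empty if every channel is down. */
    Optional<String> select() {
        return preferenceOrder.stream()
                .filter(channel -> !unavailable.contains(channel))
                .findFirst();
    }
}
```

With channels A, B, and C in that order, `select()` returns A until `markDown("A")` is called, after which it returns B. A real router would likely also weigh cost, success rate, and merchant configuration.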

Out‑of‑memory (OOM) protection similar to Tomcat's: reserve memory up front and catch OutOfMemoryError.
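The reserve-and-catch idea can be sketched as follows. The `OomGuard` class is hypothetical and does not reproduce Tomcat's code; it only shows the pattern of holding a spare buffer and releasing it when an `OutOfMemoryError` is caught, so that cleanup and logging have some headroom. Catching this error is best-effort only, because the JVM may already be in an inconsistent state.

```java
// Hypothetical guard: keeps a spare buffer and frees it when an OutOfMemoryError is caught.
final class OomGuard {
    private volatile byte[] reserve;

    OomGuard(int reserveBytes) {
        this.reserve = new byte[reserveBytes];
    }

    /** Runs the task; returns false if it failed with an OutOfMemoryError. */
    boolean runGuarded(Runnable task) {
        try {
            task.run();
            return true;
        } catch (OutOfMemoryError e) {
            // Drop the reference so the garbage collector can reclaim the reserved memory.
            reserve = null;
            return false;
        }
    }

    boolean reserveReleased() {
        return reserve == null;
    }
}
```

After the reserve has been released, the caller would typically stop accepting new work and raise an alert rather than continue as normal.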

3.1.2 Fail‑Fast Principle

Terminate the main flow immediately when an error is detected.

Abort startup if queue configuration fails.

Abort transaction processing after 40 s and notify the merchant.

Skip Redis operations that take longer than 50 ms.
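The time-budget rules above can be sketched with a `Future` and a timeout. The `FailFastCall` class is an assumption for illustration, and the `Callable` stands in for any slow dependency such as a Redis lookup; no real Redis client is used here.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: gives a call a fixed time budget and skips it when the budget is exceeded.
final class FailFastCall {
    private final ExecutorService pool;
    private final long budgetMillis;

    FailFastCall(ExecutorService pool, long budgetMillis) {
        this.pool = pool;
        this.budgetMillis = budgetMillis;
    }

    /** Returns the result, or empty if the call timed out, failed, or was interrupted. */
    <T> Optional<T> tryCall(Callable<T> call) {
        Future<T> future = pool.submit(call);
        try {
            return Optional.ofNullable(future.get(budgetMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow call so the worker thread is freed
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();
        }
    }
}
```

For the 50 ms rule, a caller would construct `new FailFastCall(pool, 50)` and fall back to the primary data source whenever the result is empty.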

3.1.3 Self‑Protection Mechanisms

Isolate third‑party dependencies, split message queues, and limit resource usage.

Split message queues per business, channel, and merchant.

Limit connection count, memory usage, thread creation, and concurrency.
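One way to limit concurrency per channel is a semaphore per channel, so that a slow third party can only occupy a bounded number of worker threads. The sketch below is illustrative; the `ChannelBulkhead` name and the choice to reject immediately rather than queue are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical bulkhead: caps the number of in-flight calls per channel.
final class ChannelBulkhead {
    private final int maxConcurrentPerChannel;
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    ChannelBulkhead(int maxConcurrentPerChannel) {
        this.maxConcurrentPerChannel = maxConcurrentPerChannel;
    }

    /** Runs the action if a permit is free; otherwise returns rejectedValue at once. */
    <T> T call(String channel, Supplier<T> action, T rejectedValue) {
        Semaphore semaphore =
                permits.computeIfAbsent(channel, k -> new Semaphore(maxConcurrentPerChannel));
        if (!semaphore.tryAcquire()) {
            return rejectedValue; // the channel is saturated; do not block the caller
        }
        try {
            return action.get();
        } finally {
            semaphore.release();
        }
    }
}
```

Because each channel has its own semaphore, saturation of one channel does not consume capacity that other channels need.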

3.2 Real‑time Fault Detection

3.2.1 Real‑time Alerting System

Alerts are delivered via SMS, email, and dashboards with multiple severity levels.

3.2.2 Instrumentation (Metric Collection Points)

Each module records metrics to Redis; a central analysis system evaluates rules and triggers alerts.
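The record-then-evaluate flow might look like the sketch below. It is a simplification under stated assumptions: an in-memory map stands in for Redis, the `SuccessRateRule` class is hypothetical, and the only rule shown is a minimum success rate per channel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical rule: alert when a channel's success rate drops below a threshold.
// An in-memory map stands in for the Redis counters described in the text.
final class SuccessRateRule {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final double minSuccessRate;
    private final long minSamples;

    SuccessRateRule(double minSuccessRate, long minSamples) {
        this.minSuccessRate = minSuccessRate;
        this.minSamples = minSamples;
    }

    /** Called by a business module at its instrumentation point. */
    void record(String channel, boolean success) {
        String key = channel + (success ? ":ok" : ":fail");
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    /** Called by the analysis side; ignores channels with too few samples. */
    boolean shouldAlert(String channel) {
        long ok = count(channel + ":ok");
        long total = ok + count(channel + ":fail");
        if (total < minSamples) {
            return false;
        }
        return (double) ok / total < minSuccessRate;
    }

    private long count(String key) {
        LongAdder adder = counters.get(key);
        return adder == null ? 0 : adder.sum();
    }
}
```

A production version would also reset or window the counters over time; that is omitted here to keep the sketch short.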

3.2.3 Analysis System

[Figures: architecture and workflow of the analysis system]

Business monitoring points are divided into alarm‑type and attention‑type.

Alarm‑type: network anomalies, order timeout, transaction success rate, etc.

Attention‑type: abnormal traffic volume, large transaction amount, illegal IP, etc.

3.2.4 Non‑business Monitoring

Includes host, network, storage, and log monitoring via Zabbix, rsyslog, and plugins.

3.2.5 Log Recording and Analysis

About 2 million orders per day generate a massive volume of logs; the logs are aggregated with rsyslog, parsed, and visualized. A sample log line:

2016-07-22 18:15:00.512||pool-73-thread-4||ChannelAdapter||ChannelAdapter-PostThirdParty||CEX16XXXXXXX5751||16201XXXX337||||||04||9000||【Settlement Platform Message】Processing||0000105||98XX543210||GHT||03||11||2016-07-22 18:15:00.512||ZhangZhang||||01||tunnelQuery||true||||Pending||||10.100.140.101||8cff785d-0d01-4ed4-b771-cb0b1faa7f95||10.999.140.101||O001||||0.01||||||||http://10.100.444.59:8080/regression/notice||||240||2016-07-20 19:06:13.000xxxxxxx
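A log line in this format can be split on the `||` delimiter before further analysis. The sketch below is an assumption about how such parsing could be done, not the platform's actual parser. The `PipeLogParser` name is hypothetical, and treating the first field as a timestamp is inferred only from the sample line.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for "||"-delimited log lines like the sample in the text.
final class PipeLogParser {
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Splits a line into fields; the limit of -1 keeps empty and trailing fields. */
    static List<String> fields(String line) {
        return Arrays.asList(line.split("\\|\\|", -1));
    }

    /** Parses the first field as a timestamp (an assumption based on the sample line). */
    static LocalDateTime timestamp(List<String> fields) {
        return LocalDateTime.parse(fields.get(0), TIMESTAMP);
    }
}
```

Keeping empty fields matters because the sample contains many consecutive delimiters; dropping them would shift every later field to the wrong position.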

[Figure: visualization of the log trajectory]

3.2.6 24/7 Monitoring Room

A dedicated monitoring team operates around the clock to ensure service stability.

3.3 Rapid Fault Recovery

3.3.1 Automatic Repair

Unstable third‑party channels are automatically rerouted.
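Automatic rerouting needs some definition of "unstable". The sketch below assumes one possible rule, which the article does not specify: a channel is taken out of rotation for a cool-down period after a number of consecutive failures, then becomes routable again. The `ChannelHealth` class, the threshold, and the injected clock are all illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical health tracker: disables a channel for a cool-down period
// after a number of consecutive failures.
final class ChannelHealth {
    private static final class State {
        int consecutiveFailures;
        long disabledUntilMillis;
    }

    private final Map<String, State> states = new HashMap<>();
    private final int failureThreshold;
    private final long coolDownMillis;
    private final LongSupplier clock; // injected so that time can be controlled in tests

    ChannelHealth(int failureThreshold, long coolDownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clock = clock;
    }

    synchronized void recordFailure(String channel) {
        State state = states.computeIfAbsent(channel, k -> new State());
        state.consecutiveFailures++;
        if (state.consecutiveFailures >= failureThreshold) {
            state.disabledUntilMillis = clock.getAsLong() + coolDownMillis;
            state.consecutiveFailures = 0;
        }
    }

    synchronized void recordSuccess(String channel) {
        State state = states.get(channel);
        if (state != null) {
            state.consecutiveFailures = 0;
        }
    }

    /** True if the channel has never been disabled or its cool-down has elapsed. */
    synchronized boolean isRoutable(String channel) {
        State state = states.get(channel);
        return state == null || clock.getAsLong() >= state.disabledUntilMillis;
    }
}
```

In production the clock would be `System::currentTimeMillis`, and the router would consult `isRoutable` before choosing a channel.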

3.3.2 Service Degradation

When a fault cannot be fixed quickly, non‑core functions are disabled to preserve core services.
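Degradation is often implemented with runtime switches around non-core features. The following sketch assumes that approach; the `DegradationSwitch` class and the feature names used with it are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical switchboard: non-core features can be turned off at runtime.
final class DegradationSwitch {
    private final Set<String> degraded = ConcurrentHashMap.newKeySet();

    void degrade(String feature) { degraded.add(feature); }

    void restore(String feature) { degraded.remove(feature); }

    /** Runs the normal path unless the feature is degraded, in which case the fallback is returned. */
    <T> T run(String feature, Supplier<T> normal, T fallback) {
        return degraded.contains(feature) ? fallback : normal.get();
    }
}
```

For example, an operator could call `degrade("reporting")` during an incident so that report queries return an empty placeholder while payment processing keeps its full capacity.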

4. Q&A

Answers to common questions about RabbitMQ failure, dev‑ops separation, language choices, third‑party skepticism, data consistency, routing strategies, automatic repair, promotion traffic handling, log storage, and monitoring integration.

