Mastering High‑Availability: JD Real‑World Pitfalls & Fixes for Apps, DBs, Cache & MQ

This article shares JD's practical high‑availability architecture lessons, detailing common pitfalls across applications, databases, caches, RPC frameworks, containers, data centers, GC, and message queues, and provides concrete troubleshooting steps and optimization techniques to help engineers design more resilient, fault‑tolerant systems.

JD Cloud Developers

Sep 4, 2025

When building high‑availability systems, engineers face challenges across applications, databases, caches, message queues, and more. This guide, based on real JD technical scenarios, systematically outlines typical HA pitfalls and solutions, offering a practical checklist to avoid risks and improve system stability and fault tolerance.

1. Introduction

High availability is defined as achieving at least four 9s (99.99%) uptime, i.e., less than 8.64 seconds of downtime per day, or five 9s (99.999%) for stricter standards. The article explores how to meet these goals by addressing unreliable components, rapid traffic growth, and the need for comprehensive monitoring.
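The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic is shown here; the class and method names are illustrative and do not come from the original article.

```java
import java.math.BigDecimal;

// Illustrative helper (not from the original article): converts an
// availability target into the downtime it allows per day.
final class DowntimeBudget {
    private static final BigDecimal SECONDS_PER_DAY = new BigDecimal("86400");

    // availabilityPercent is a decimal string such as "99.99".
    // BigDecimal keeps the result free of floating-point noise.
    static BigDecimal allowedSecondsPerDay(String availabilityPercent) {
        BigDecimal unavailableFraction = new BigDecimal("100")
                .subtract(new BigDecimal(availabilityPercent))
                .movePointLeft(2); // percent -> fraction
        return SECONDS_PER_DAY.multiply(unavailableFraction).stripTrailingZeros();
    }

    public static void main(String[] args) {
        System.out.println(allowedSecondsPerDay("99.99").toPlainString());  // 8.64
        System.out.println(allowedSecondsPerDay("99.999").toPlainString()); // 0.864
    }
}
```

Four 9s therefore leaves a budget of 8.64 seconds per day (roughly 52.6 minutes per year), and five 9s leaves a tenth of that.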

2. Application HA

2.1 Code Faults

Code faults are divided into application‑level and platform‑level issues.

2.1.1 Application‑level Faults

Integer overflow when parsing strings to int.

String length mismatches between code and database.

Division errors (divide‑by‑zero or non‑terminating division).

General logic bugs that can cause complete service outage.

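Example of a rounding failure (BigDecimal throws this exception when an exact result is requested but the value cannot be represented without rounding):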

Exception in thread "main" java.lang.ArithmeticException: Rounding necessary
    at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
    ...
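The parsing and division faults listed above can be guarded against with standard JDK calls. The following is a minimal sketch, assuming numeric strings arrive from untrusted input and that two decimal places are acceptable for division results; the class name, the empty-result fallback, and the chosen scale are illustrative assumptions, not JD's actual code.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.OptionalInt;

final class SafeNumbers {
    // Integer.parseInt throws NumberFormatException for malformed input and
    // for values outside the int range (e.g. "2147483648"), so the caller
    // gets an empty result instead of an uncaught exception.
    static OptionalInt parseIntSafely(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    // BigDecimal.divide(divisor) without a scale throws ArithmeticException
    // for non-terminating results such as 1/3. Supplying a scale and a
    // rounding mode makes the result well defined.
    static BigDecimal divide(BigDecimal dividend, BigDecimal divisor) {
        if (divisor.signum() == 0) {
            throw new IllegalArgumentException("divisor must not be zero");
        }
        return dividend.divide(divisor, 2, RoundingMode.HALF_UP);
    }
}
```

Rejecting a zero divisor with an explicit, named exception keeps the failure at the call site where it can be handled, rather than surfacing as an ArithmeticException deep in a request path.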

2.1.2 Platform‑level Faults

These include JDK bugs, RPC framework issues, cache SDK problems, and outdated third‑party jars.

JDK ArrayIndexOutOfBoundsException

A bug in JDK 8u311 caused frequent exceptions during JSON parsing. Upgrading the JDK resolves the issue, but many JD services still run older versions, leaving a hidden risk.

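import com.fasterxml.jackson.core.type.TypeReference;

import java.io.IOException;
import java.util.Map;

public class ATest {
    public static void main(String[] args) {
        String str = "{\"aa\":\"bb\"}";
        try {
            // JacksonMapper is a project-internal helper that is not shown in the
            // article; it is assumed to return a Jackson ObjectMapper.
            Object o = JacksonMapper.getInstance()
                .readValue(str, new TypeReference<Map<String, String>>() {});
        } catch (IOException e) {
            // Only parse and I/O errors are caught here; runtime exceptions propagate.
            e.printStackTrace();
        }
    }
}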

RPC Framework (JSF) – Method‑Not‑Found

Using the msgpack protocol with BigDecimal parameters can trigger method‑not‑found exceptions. Switching to the Hessian protocol resolves the problem, albeit with a slight performance impact.

Cache Framework (jimdb) – Buffer Overflow

Versions below 2.1.12 of jimdb may throw buffer‑overflow exceptions during read/write operations. Upgrading to the latest version eliminates the issue.

Cache Framework – NullPointerException

Old titan‑profiler‑sdk jars cause NPEs under high load. Upgrading to the 2025‑07‑18 hotfix version resolves it.

2.2 Single‑Container Faults

Even a single container failure can affect web services, RPC providers, MQ consumers, and workflow engines, especially when automatic failover or load‑balancing is not configured.
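Client-side failover is one way to keep a single failed container from failing the whole request. The sketch below is a minimal illustration, assuming the caller holds a list of equivalent instances; the class name and the endpoint strings are hypothetical, and real RPC frameworks typically provide this behavior through their own configuration.

```java
import java.util.List;
import java.util.function.Function;

final class FailoverCaller {
    // Tries each endpoint in order and returns the first successful result.
    // If every endpoint fails, the last failure is attached as the cause.
    static <T> T callWithFailover(List<String> endpoints, Function<String, T> call) {
        RuntimeException last = null;
        for (String endpoint : endpoints) {
            try {
                return call.apply(endpoint);
            } catch (RuntimeException e) {
                last = e; // treat this instance as unhealthy and try the next one
            }
        }
        throw new IllegalStateException("all endpoints failed", last);
    }
}
```

With two instances where the first is down, the call still succeeds against the second; the request fails only when every instance is unavailable.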

2.3 Data‑Center Faults

Failure of an entire data center impacts traffic entry points, RPC services, MQ, DB, and cache layers, potentially causing widespread outages.

2.4 GC Faults

Frequent young‑GC pauses and mis‑configured connection pools degrade performance. Optimizations such as increasing container resources, tuning GC threads, and adjusting pool parameters reduced average latency from 176.7 ms to 17.2 ms.

String interning in Jackson caused memory bloat; disabling the cache fixed the issue.

3. Database HA

3.1 JED Single‑Shard Query Failures

JED routes queries based on shard keys. Missing shard keys trigger cross‑shard queries, increasing latency and risking total outage if any shard is down.
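The difference between keyed and keyless queries can be sketched with a toy router. Hash-modulo routing is an assumption made for illustration only; the article does not describe JED's actual routing algorithm, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    // With a shard key the query targets exactly one shard.
    // Without one (null) it has to fan out to every shard.
    List<Integer> targetShards(String shardKey) {
        List<Integer> targets = new ArrayList<>();
        if (shardKey != null) {
            // floorMod keeps the index non-negative for negative hash codes
            targets.add(Math.floorMod(shardKey.hashCode(), shardCount));
        } else {
            for (int i = 0; i < shardCount; i++) {
                targets.add(i);
            }
        }
        return targets;
    }
}
```

With four shards, a keyed query depends on one shard being healthy, while a keyless query depends on all four, which is why a single shard outage can fail every keyless request.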

3.2 JED Transaction Failures

Default transaction queries lack shard keys, causing random shard scans. Setting useLocalSessionState=true in the JDBC URL mitigates the risk.
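As a configuration sketch, the property is appended to the JDBC URL as a query parameter. Only useLocalSessionState=true comes from the text; the jdbc:mysql scheme (assuming a MySQL-compatible driver), the helper class, and the host, port, and schema values are placeholders.

```java
final class JedJdbcUrl {
    // Builds a JDBC URL with useLocalSessionState enabled.
    // Host, port and schema are placeholders supplied by the caller.
    static String build(String host, int port, String schema) {
        return "jdbc:mysql://" + host + ":" + port + "/" + schema
                + "?useLocalSessionState=true";
    }

    public static void main(String[] args) {
        System.out.println(build("db.example.internal", 3306, "orders"));
    }
}
```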

3.3 Global‑ID Failures

Generating global auto‑increment IDs on the first shard creates a single point of failure: when that shard fails, new IDs can no longer be issued and the whole system becomes unavailable.

3.4 Slow SQL, Large Transactions, Traffic Amplification, Field Length, and Storage Issues

Other common DB problems include slow queries, oversized transactions, excessive SQL traffic, insufficient column lengths, and storage capacity limits. Mitigations include sharding, avoiding large transactions, and monitoring storage thresholds.

4. Redis (jimdb) HA

4.1 Timeout & Hot‑Key Management

Improper timeout settings, typically timeouts that are too long, prevent circuit breakers from tripping quickly when jimdb fails. Hot‑key patterns should be avoided or mitigated.
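The effect of a tight timeout can be illustrated from the caller's side. This is a minimal sketch that bounds a cache read with a deadline and falls back when the deadline passes; it does not use jimdb's own client timeout options, which the article does not show, and the class and parameter names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCacheCall {
    // Runs the cache read on the given pool and waits at most timeoutMillis.
    // On timeout, failure, or interruption the fallback value is returned.
    static <T> T getOrFallback(ExecutorService pool, Callable<T> cacheRead,
                               long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(cacheRead);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting on the slow node
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the read itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        }
    }
}
```

The shorter the deadline relative to normal latency, the sooner a failing cache node is detected and the sooner callers switch to the fallback path.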

4.2 Dangerous Commands

Lua script uploads block when a node is unavailable. Upload scripts once at initialization and re‑upload only on ScriptNotFoundException.
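The upload-once pattern can be sketched as follows. ScriptClient and the nested ScriptNotFoundException are hypothetical stand-ins written for this example; they are not the real jimdb SDK types, whose signatures the article does not show.

```java
final class LuaScriptRunner {
    // Hypothetical stand-in for the cache client (not the real jimdb API).
    interface ScriptClient {
        String scriptLoad(String script);            // uploads the script, returns its SHA
        Object evalsha(String sha, String... keys);  // may throw ScriptNotFoundException
    }

    // Hypothetical exception type mirroring the name used in the text.
    static final class ScriptNotFoundException extends RuntimeException {
        ScriptNotFoundException(String sha) {
            super("script not found: " + sha);
        }
    }

    private final ScriptClient client;
    private final String script;
    private volatile String sha;

    LuaScriptRunner(ScriptClient client, String script) {
        this.client = client;
        this.script = script;
        this.sha = client.scriptLoad(script); // upload once at initialization
    }

    Object run(String... keys) {
        try {
            return client.evalsha(sha, keys);
        } catch (ScriptNotFoundException e) {
            sha = client.scriptLoad(script);  // re-upload only when the script is missing
            return client.evalsha(sha, keys); // retry once with the fresh SHA
        }
    }
}
```

Because the upload happens outside the normal request path, an unavailable node cannot block every call on a script upload; the only extra upload occurs after the server reports the script as missing.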

5. MQ HA

5.1 JMQ Acknowledgement Timeout

When a consumer crashes, its partition lock remains, causing message backlog. Reducing the acknowledgement timeout (e.g., to ten times the recent TP99) and supporting graceful shutdown in the SDK alleviate the issue.
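The "ten times the recent TP99" rule can be computed from a window of recent processing latencies. The sketch below uses the nearest-rank percentile method, which is an assumption; the article does not say how JD computes TP99, and the class and method names are illustrative.

```java
import java.util.Arrays;

final class AckTimeoutCalculator {
    // Nearest-rank percentile: the smallest sample such that at least
    // `percent` percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, int percent) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percent * n / 100 avoids floating-point rounding.
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    // Acknowledgement timeout as suggested in the text: 10 x recent TP99.
    static long ackTimeoutMillis(long[] recentLatenciesMillis) {
        return 10 * percentile(recentLatenciesMillis, 99);
    }
}
```

For 100 samples of 1 ms through 100 ms, TP99 is 99 ms and the resulting acknowledgement timeout is 990 ms, so a crashed consumer releases its partition lock in under a second.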

5.2 Message Size Limits

Large messages may fail even with compression; keep payloads small.
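A producer-side size check makes this failure visible before the send. The following is a minimal sketch using GZIP from the JDK; the size limit is a caller-supplied placeholder because the article does not state JMQ's actual limit or compression codec, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class MessageSizeGuard {
    // Compresses the payload in memory with GZIP.
    static byte[] gzip(byte[] payload) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns true only if the compressed payload fits within maxBytes.
    // maxBytes is a placeholder for whatever limit the broker enforces.
    static boolean fitsAfterCompression(byte[] payload, int maxBytes) {
        return gzip(payload).length <= maxBytes;
    }
}
```

Highly repetitive payloads shrink dramatically, but already-compressed or random-looking data does not, so compression alone cannot be relied on to bring a large message under the limit.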

5.3 Storage Bottlenecks

High‑volume large messages can saturate the broker’s network bandwidth, leading to failures.

The content above reflects the collective experience of JD’s engineering teams and will be continuously updated with new case studies and best‑practice recommendations.
