
Mastering High Availability: Real-World Pitfalls and Solutions from JD's Production Systems

This article walks through the challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—using JD’s production experiences to highlight common pitfalls, root‑cause analyses, and practical mitigation strategies for engineers seeking resilient architecture.

JD Retail Technology

01 Introduction

When building high‑availability (HA) systems, developers face multi‑dimensional challenges across applications, databases, caches, and message queues. Drawing from real JD technical scenarios, this guide systematically outlines common HA pitfalls and solutions, offering a practical checklist to help teams avoid risks during design and improve stability and fault tolerance.

02 Application HA

2.1 Code Faults

2.1.1 Application‑level faults

Typical deterministic faults, ones that trigger on every hit, include integer overflow, string length overflow, division by zero, and null‑pointer exceptions; any of these can render a service completely unavailable.

2.1.1.1 Integer overflow

Calling Integer.parseInt on a string that exceeds the int range or cannot be parsed as a number throws a NumberFormatException; if this happens on a hot path, the impact can be catastrophic.
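
A minimal sketch of the failure mode and a defensive fix; the oversized numeric string is illustrative:

public class ParseDemo {
    public static void main(String[] args) {
        String fromUpstream = "4294967296"; // exceeds Integer.MAX_VALUE (2147483647)

        // Integer.parseInt throws NumberFormatException for out-of-range or
        // non-numeric input; unguarded on a hot path, this takes the service down.
        try {
            int id = Integer.parseInt(fromUpstream);
        } catch (NumberFormatException e) {
            System.out.println("parse failed: " + e.getMessage());
        }

        // Safer: parse into long (or validate first) when upstream values may outgrow int.
        long wideId = Long.parseLong(fromUpstream);
        System.out.println("parsed as long: " + wideId);
    }
}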


2.1.1.2 String length overflow

Issues arise when application string lengths do not match database column lengths, or when business logic assumes fixed‑length substrings; changes in data size can break these assumptions.
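
A minimal defensive sketch, assuming a hypothetical varchar(64) remark column; the class and constant names are illustrative:

public final class FieldGuard {
    // Hypothetical column width; keep this in sync with the actual DDL.
    private static final int REMARK_MAX_LEN = 64;

    public static String safeRemark(String remark) {
        if (remark == null) {
            return "";
        }
        // Truncate (or reject) at the service boundary instead of letting the
        // INSERT fail deep in the DAO once upstream data grows; the same check
        // protects substring(0, n) calls that assume a minimum length.
        return remark.length() <= REMARK_MAX_LEN
                ? remark
                : remark.substring(0, REMARK_MAX_LEN);
    }
}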


2.1.1.3 Division faults

Common causes are division without an explicit scale or rounding mode (a non‑terminating result throws ArithmeticException) and division by zero; either can cause a total service outage.

Exception in thread "main" java.lang.ArithmeticException: Rounding necessary
    at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
    ...
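
A minimal reproduction and fix, using only java.math:

import java.math.BigDecimal;
import java.math.RoundingMode;

public class DivideDemo {
    public static void main(String[] args) {
        BigDecimal ten = new BigDecimal("10");
        BigDecimal three = new BigDecimal("3");

        // 10/3 has a non-terminating decimal expansion; divide() without an
        // explicit scale and rounding mode throws ArithmeticException.
        try {
            ten.divide(three);
        } catch (ArithmeticException e) {
            System.out.println("no scale set: " + e.getMessage());
        }

        // Fix: always state scale and rounding mode explicitly.
        System.out.println(ten.divide(three, 2, RoundingMode.HALF_UP)); // 3.33

        // Division by zero must be guarded before the call.
        BigDecimal divisor = BigDecimal.ZERO;
        if (divisor.signum() == 0) {
            System.out.println("divisor is zero, skipping division");
        }
    }
}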

2.1.1.4 Logical code faults

Various unpredictable bugs can arise; thorough testing and shared knowledge are essential to avoid them.

2.1.2 Platform‑level faults

These are hidden deep in dependencies such as JDK, RPC frameworks, or cache libraries. Upgrading to fixed versions is often required, but seamless migration can be difficult.

2.1.2.1 JDK array‑index‑out‑of‑bounds

During a large‑scale promotion, an array‑index bug in the JDK's generic‑signature parser caused high‑frequency ArrayIndexOutOfBoundsException throws, degrading GC behavior and performance; the issue is fixed in JDK 8u311, so upgrading resolves it.

java.lang.ArrayIndexOutOfBoundsException
    at sun.reflect.generics.parser.SignatureParser.current(SignatureParser.java:95)
    ...

2.1.2.2 RPC framework method‑not‑found

JD’s internal JSF RPC framework using the msgpack protocol can throw MethodNotFoundException when handling BigDecimal parameters; switching to the Hessian protocol resolves the issue.


2.1.2.3 Cache framework buffer overflow

Older versions of JD’s internal jimdb cache (pre‑2.1.12) can overflow buffers during Redis commands; the SDK catches the exception and doubles the buffer size, but the pattern is inefficient.


2.1.2.4 Cache null‑pointer

Using an outdated titan-profiler-sdk jar with newer jimdb versions leads to NPEs that degrade performance.


2.2 Single‑container faults

Even a single node failure can affect web services, RPC providers, MQ consumers, or workflow engines, especially when automatic failover or load‑balancing is not configured.

2.3 Data‑center faults

Whole‑zone outages impact traffic entry points, RPC services, MQ, and databases; mitigation requires rapid traffic isolation and graceful degradation.

2.4 GC faults

Frequent young‑GC pauses (e.g., 400 ms) can throttle throughput. Remedies include increasing container resources, tuning ParallelGCThreads, and optimizing connection‑pool parameters; in one case study this combination reduced average latency from 176 ms to 17 ms.
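
Illustrative JVM options only; the case study's exact values are not given, so these numbers are assumptions:

# Size the heap and cap GC worker threads to the container's CPU quota; an
# inflated default ParallelGCThreads (derived from the host's core count)
# is a common cause of long young-GC pauses inside containers.
java -Xms4g -Xmx4g -XX:ParallelGCThreads=4 -jar app.jar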

03 Database HA

3.1 JED single‑shard query fault

JED (JD’s MySQL‑compatible gateway) routes queries based on shard keys; missing shard keys cause cross‑shard scans, increasing latency and risking total outage if any shard fails.
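
An illustrative contrast; the table and column names are assumptions, with order_id as the shard key:

-- Fans out to every shard: no shard key in the WHERE clause, so the gateway
-- must scatter-gather, and one failed shard fails the whole query.
SELECT * FROM t_order WHERE customer_id = 42;

-- Routes to exactly one shard: the shard key pins the query to a single node.
SELECT * FROM t_order WHERE order_id = 1001 AND customer_id = 42;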

3.2 JED transaction fault

By default, the JDBC driver issues select @@session.tx_read_only checks that JED routes to a random shard, roughly doubling the failure probability; setting useLocalSessionState=true in the JDBC URL lets the driver answer these checks from locally cached session state instead.
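
A sketch of the mitigation; host, port, and schema are placeholders:

// useLocalSessionState=true lets MySQL Connector/J answer session-state checks
// (autocommit, read-only flag, isolation level) from its local cache instead of
// issuing "select @@session.tx_read_only" queries that land on a random shard.
String url = "jdbc:mysql://jed-gateway-host:3306/order_db"
        + "?useLocalSessionState=true"
        + "&useUnicode=true&characterEncoding=utf8";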

3.3 JED global‑id fault

Global auto‑increment IDs rely on the first shard; if that shard is down, all inserts fail.

3.4 Large transaction fault

Monolithic transactions across multiple RPC writes cause DB lock contention; splitting into idempotent insert‑then‑update steps improves throughput.
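
A sketch of the split pattern; the DAO and RPC names are hypothetical:

// Instead of one monolithic transaction spanning several RPC calls, persist a
// pending row first, then apply idempotent updates step by step, so DB locks
// are held only for short single-row transactions.
void createOrder(Order order) {
    orderDao.insertPending(order);                    // step 1: short INSERT transaction

    PriceResult price = pricingRpc.calculate(order);  // RPC outside any DB transaction
    orderDao.updatePrice(order.getId(), price);       // step 2: idempotent UPDATE by id

    // A retry/compensation job re-drives incomplete orders; because each update
    // is keyed by order id and idempotent, replays are safe.
}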

3.5 Traffic amplification fault

Orders that should generate ~10 SQL statements sometimes generate 10‑100× more, leading to hidden performance bottlenecks.

3.6 Field‑length insufficiency

Schema mismatches between upstream and downstream services cause write failures when fields grow beyond original limits.

3.7 Single‑cluster storage shortage

Assess storage headroom under 10× or 100× traffic growth; plan archiving, sharding, or scaling before capacity is exhausted.

04 Redis HA

4.1 JIMDB timeout & hot‑key governance

Overly long timeouts prevent fast circuit‑breaking, and hot‑key patterns (all traffic hitting a fixed, constant key) amplify failures. Tighten read/write timeouts and avoid concentrating writes on constant keys.
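
A sketch using the open-source Jedis client rather than JD's internal jimdb SDK; the host, timeout, and shard count are illustrative:

import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class CacheClientDemo {
    // Tight connect/read timeout (ms) so a failing node circuit-breaks fast
    // instead of hanging caller threads for a multi-second default.
    private static final JedisPool POOL =
            new JedisPool(new JedisPoolConfig(), "redis-host", 6379, 200);

    // Spread a would-be hot key across N sub-keys so a single node does not
    // absorb all traffic for one constant key (readers must aggregate shards).
    static String shardedKey(String logicalKey, int shards) {
        return logicalKey + "_" + ThreadLocalRandom.current().nextInt(shards);
    }

    public static void main(String[] args) {
        try (Jedis jedis = POOL.getResource()) {
            jedis.incr(shardedKey("order:counter", 16));
        }
    }
}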

4.2 JIMDB high‑risk command governance

Lua script uploads block on unavailable nodes; upload once during initialization and handle ScriptNotFoundException on node recovery.
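
A sketch of the upload-once pattern, again using Jedis (where the jimdb SDK's ScriptNotFoundException corresponds to Redis's NOSCRIPT error); the script and key are illustrative:

import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.exceptions.JedisNoScriptException;

public class LuaOnce {
    private static final String SCRIPT = "return redis.call('INCRBY', KEYS[1], ARGV[1])";
    private static final JedisPool POOL = new JedisPool("redis-host", 6379);
    private static volatile String sha;

    static {
        try (Jedis jedis = POOL.getResource()) {
            sha = jedis.scriptLoad(SCRIPT); // upload exactly once at startup
        }
    }

    static Object incrBy(String key, long delta) {
        try (Jedis jedis = POOL.getResource()) {
            try {
                return jedis.evalsha(sha, List.of(key), List.of(String.valueOf(delta)));
            } catch (JedisNoScriptException e) {
                // A recovered or replaced node lost its script cache:
                // re-upload once and retry instead of blocking on every call.
                sha = jedis.scriptLoad(SCRIPT);
                return jedis.evalsha(sha, List.of(key), List.of(String.valueOf(delta)));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(incrBy("order:counter", 1));
    }
}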

Lua script upload
Lua script upload

05 MQ HA

5.1 JMQ acknowledgment timeout fault

When a consumer instance crashes after pulling messages, its partition lock remains held and messages back up. Reducing the default 120 s acknowledgment timeout to roughly 10× the consumer's recent tp999 latency (e.g., a 5 s tp999 gives a 50 s timeout) shortens the window before the partition is released.

5.2 JMQ oversized‑message fault

Messages exceeding size limits may fail even with compression; keep payloads minimal.

5.3 JMQ storage fault

High‑volume consumption can saturate a broker’s 1 Gbps downstream bandwidth, leading to send failures; scaling broker count or network capacity is required.

Conclusion

The content is distilled from frontline engineering experiences at JD, offering a continuously updated compendium of HA best practices across applications, databases, caches, and messaging systems.

Tags: distributed systems, performance optimization, cache, high availability, system design, fault tolerance, JDK
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.