Mastering High Availability: Real-World Pitfalls and Solutions from JD's Production Systems
This article walks through the challenges of building high‑availability systems, covering applications, databases, caches, message queues, containers, GC, and more. It uses JD’s production experience to highlight common pitfalls, root‑cause analyses, and practical mitigation strategies for engineers seeking resilient architectures.
01 Introduction
When building high‑availability (HA) systems, developers face multi‑dimensional challenges across applications, databases, caches, and message queues. Drawing from real JD technical scenarios, this guide systematically outlines common HA pitfalls and solutions, offering a practical checklist to help teams avoid risks during design and improve stability and fault tolerance.
02 Application HA
2.1 Code Faults
2.1.1 Application‑level faults
Typical faults that trigger on every affected request (a 100% hit rate) include integer overflow, string length overflow, division by zero, and null‑pointer exceptions, any of which can render a service completely unavailable.
2.1.1.1 Integer overflow
Calling Integer.parseInt on a value outside the int range, or on a string that is not a valid integer, throws NumberFormatException; if this occurs on a hot path, the impact can be catastrophic.
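One way to keep a bad value from failing the whole request is to parse defensively and let the caller decide what to do when parsing fails. The sketch below is illustrative only; the class and method names (SafeIntParser, tryParseInt) are hypothetical and not taken from JD's code.

```java
import java.util.OptionalInt;

final class SafeIntParser {
    // Returns an empty OptionalInt instead of throwing when the input is
    // null, not a valid integer, or outside the int range.
    static OptionalInt tryParseInt(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Integer.parseInt reports both malformed and out-of-range input this way.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseInt("42").isPresent());
        System.out.println(tryParseInt("2147483648").isPresent()); // one above Integer.MAX_VALUE
        System.out.println(tryParseInt("12a").isPresent());
    }
}
```

If the field can legitimately exceed the int range (order IDs are a common case), parsing with Long.parseLong or keeping the value as a string is usually a better fix than catching the exception.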
2.1.1.2 String length overflow
Issues arise when application string lengths do not match database column lengths, or when business logic assumes fixed‑length substrings; changes in data size can break these assumptions.
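A length check at the application boundary can catch both cases before they reach the database or a substring call. This is a minimal sketch, assuming the column limit is expressed in characters (as with MySQL VARCHAR(n)); ColumnLengthGuard and its methods are hypothetical names.

```java
final class ColumnLengthGuard {
    // True when the value fits a column limited to maxChars characters.
    // Counts Unicode code points rather than UTF-16 units, so characters
    // outside the Basic Multilingual Plane count once.
    static boolean fits(String value, int maxChars) {
        if (value == null) {
            return true;
        }
        return value.codePointCount(0, value.length()) <= maxChars;
    }

    // Returns at most maxChars characters without assuming the input is
    // long enough and without splitting a surrogate pair.
    // Assumes maxChars is non-negative.
    static String safePrefix(String value, int maxChars) {
        if (value == null) {
            return "";
        }
        if (value.codePointCount(0, value.length()) <= maxChars) {
            return value;
        }
        return value.substring(0, value.offsetByCodePoints(0, maxChars));
    }
}
```

Whether to reject, truncate, or alert on an oversized value is a business decision; the sketch only makes the mismatch visible instead of letting it surface as a write failure or StringIndexOutOfBoundsException.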
2.1.1.3 Division faults
Common causes are BigDecimal division without an explicit scale and rounding mode, which fails when the quotient cannot be represented exactly (for example, a non‑terminating decimal), and division by zero; either can cause a total service outage.
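Both causes can be guarded in one helper. The sketch below assumes the values are java.math.BigDecimal, as the stack trace suggests; the helper name, the choice of HALF_UP, and the fallback parameter are illustrative assumptions rather than the original fix.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class SafeDivision {
    // Divides with an explicit scale and rounding mode, and returns the
    // caller-supplied fallback instead of throwing when the divisor is zero.
    static BigDecimal divide(BigDecimal dividend, BigDecimal divisor,
                             int scale, BigDecimal fallback) {
        if (divisor.signum() == 0) {
            return fallback;
        }
        return dividend.divide(divisor, scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal one = BigDecimal.ONE;
        BigDecimal three = new BigDecimal("3");
        try {
            one.divide(three); // no scale given: 1/3 has no exact decimal result
        } catch (ArithmeticException e) {
            System.out.println("unscaled divide failed: " + e.getMessage());
        }
        System.out.println(divide(one, three, 2, BigDecimal.ZERO));
        System.out.println(divide(one, BigDecimal.ZERO, 2, BigDecimal.ZERO));
    }
}
```

The trace below, with its "Rounding necessary" message, is what BigDecimal reports when a result would need rounding but the operation does not permit any.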
Exception in thread "main" java.lang.ArithmeticException: Rounding necessary
    at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
    ...
2.1.1.4 Logical code faults
Various unpredictable bugs can arise; thorough testing and shared knowledge are essential to avoid them.
2.1.2 Platform‑level faults
These are hidden deep in dependencies such as JDK, RPC frameworks, or cache libraries. Upgrading to fixed versions is often required, but seamless migration can be difficult.
2.1.2.1 JDK array‑index‑out‑of‑bounds
During a large‑scale promotion, an array‑index bug in the JDK caused exceptions to be thrown at high frequency, which hurt GC and overall performance; the issue is fixed in JDK 8u311.
java.lang.ArrayIndexOutOfBoundsException
    at sun.reflect.generics.parser.SignatureParser.current(SignatureParser.java:95)
    ...
2.1.2.2 RPC framework method‑not‑found
JD’s internal JSF RPC framework using the msgpack protocol can throw MethodNotFoundException when handling BigDecimal parameters; switching to the Hessian protocol resolves the issue.
2.1.2.3 Cache framework buffer overflow
Older versions of JD’s internal jimdb cache (pre‑2.1.12) can overflow buffers during Redis commands; the SDK catches the exception and doubles the buffer size, but the pattern is inefficient.
2.1.2.4 Cache null‑pointer
Using an outdated titan-profiler-sdk jar with newer jimdb versions leads to NPEs that degrade performance.
2.2 Single‑container faults
Even a single node failure can affect web services, RPC providers, MQ consumers, or workflow engines, especially when automatic failover or load‑balancing is not configured.
2.3 Data‑center faults
Whole‑zone outages impact traffic entry points, RPC services, MQ, and databases; mitigation requires rapid traffic isolation and graceful degradation.
2.4 GC faults
Frequent young‑GC pauses (e.g., 400 ms) can throttle throughput; solutions include increasing container resources, tuning ParallelGCThreads, and optimizing connection‑pool parameters, which reduced average latency from 176 ms to 17 ms in a case study.
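Before changing ParallelGCThreads, it helps to check how many CPUs the JVM believes it has and how much time the collectors have actually spent, since HotSpot derives its default GC thread count from the visible CPU count. The sketch below uses only the standard java.lang.management API; the class name GcSnapshot is hypothetical, and the original case study's tooling is not described in the source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    // CPUs visible to this JVM.
    static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Approximate total milliseconds spent in all collectors since JVM start.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long elapsed = gc.getCollectionTime(); // -1 when the collector does not report it
            if (elapsed > 0) {
                total += elapsed;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("visible CPUs: " + visibleCpus());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

If the visible CPU count is much larger than the container's CPU quota, setting the thread count explicitly (for example -XX:ParallelGCThreads=4, where 4 is a placeholder to be sized to the quota) is one of the adjustments the text refers to.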
03 Database HA
3.1 JED single‑shard query fault
JED (JD’s MySQL‑compatible gateway) routes queries based on shard keys; missing shard keys cause cross‑shard scans, increasing latency and risking total outage if any shard fails.
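The effect of a missing shard key can be illustrated with a toy router. This is a sketch under stated assumptions: hash-modulo routing and the ShardRouter class are hypothetical and do not describe JED's actual routing algorithm.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // With a shard key the query goes to exactly one shard;
    // without one (null) it must be sent to every shard.
    List<Integer> targetsFor(String shardKey) {
        List<Integer> targets = new ArrayList<>();
        if (shardKey != null) {
            targets.add(Math.floorMod(shardKey.hashCode(), shardCount));
        } else {
            for (int shard = 0; shard < shardCount; shard++) {
                targets.add(shard);
            }
        }
        return targets;
    }

    // Probability that a cross-shard query succeeds when each shard is
    // independently available with probability perShardAvailability.
    double scatterAvailability(double perShardAvailability) {
        return Math.pow(perShardAvailability, shardCount);
    }
}
```

Because a cross-shard query needs every shard to answer, its availability is the product of the per-shard availabilities, which is why a single failed shard takes down every query that omits the shard key.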
3.2 JED transaction fault
By default, the JDBC driver issues select @@session.tx_read_only statements, which trigger random shard scans and double the failure probability; setting useLocalSessionState=true in the JDBC URL mitigates this.
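The property is a standard MySQL Connector/J connection option and can be appended to the URL. The helper below is a small sketch; the class name and the host, port, and database in the usage note are placeholders, not real JED endpoints.

```java
final class JdbcUrlOptions {
    // Appends useLocalSessionState=true unless the URL already sets the property.
    static String withLocalSessionState(String jdbcUrl) {
        if (jdbcUrl.contains("useLocalSessionState=")) {
            return jdbcUrl; // already set explicitly; leave the caller's choice alone
        }
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "useLocalSessionState=true";
    }
}
```

For example, withLocalSessionState("jdbc:mysql://jed-proxy.example:3306/orders") yields a URL ending in ?useLocalSessionState=true, so the driver relies on its locally tracked session state instead of querying the server for it.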
3.3 JED global‑id fault
Global auto‑increment IDs rely on the first shard; if that shard is down, all inserts fail.
3.4 Large transaction fault
Monolithic transactions across multiple RPC writes cause DB lock contention; splitting into idempotent insert‑then‑update steps improves throughput.
3.5 Traffic amplification fault
Orders that should generate ~10 SQL statements sometimes generate 10‑100× more, leading to hidden performance bottlenecks.
3.6 Field‑length insufficiency
Schema mismatches between upstream and downstream services cause write failures when fields grow beyond original limits.
3.7 Single‑cluster storage shortage
Assess storage headroom under 10× or 100× traffic growth; plan archiving, sharding, or scaling before capacity is exhausted.
04 Redis HA
4.1 JIMDB timeout & hot‑key governance
Improper timeout settings prevent fast circuit‑break; hot‑key patterns (fixed constant keys) amplify failures. Adjust read/write timeouts and avoid constant‑key writes.
4.2 JIMDB high‑risk command governance
Lua script uploads block on unavailable nodes; upload once during initialization and handle ScriptNotFoundException on node recovery.
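The upload-once, reload-on-miss pattern can be sketched as follows. Everything here is a stand-in: ScriptClient, FakeClient, and the local ScriptNotFoundException class are hypothetical and only mimic the behavior described above, not the real jimdb SDK.

```java
import java.util.HashSet;
import java.util.Set;

final class LuaScriptRunner {
    // Hypothetical stand-in for the exception named in the text.
    static class ScriptNotFoundException extends RuntimeException {}

    // Hypothetical minimal client interface.
    interface ScriptClient {
        String load(String script);             // uploads the script, returns a handle
        Object eval(String handle, String key); // runs a previously uploaded script
    }

    // In-memory fake used only to demonstrate the pattern.
    static final class FakeClient implements ScriptClient {
        private final Set<String> loaded = new HashSet<>();
        int loadCalls = 0;

        public String load(String script) {
            loadCalls++;
            String handle = Integer.toHexString(script.hashCode());
            loaded.add(handle);
            return handle;
        }

        public Object eval(String handle, String key) {
            if (!loaded.contains(handle)) {
                throw new ScriptNotFoundException();
            }
            return "ok:" + key;
        }

        void simulateNodeRecovery() {
            loaded.clear(); // a recovered node has lost its script cache
        }
    }

    private final ScriptClient client;
    private final String script;
    private volatile String handle;

    LuaScriptRunner(ScriptClient client, String script) {
        this.client = client;
        this.script = script;
        this.handle = client.load(script); // upload once, during initialization
    }

    Object run(String key) {
        try {
            return client.eval(handle, key);
        } catch (ScriptNotFoundException e) {
            // Re-upload once and retry once; a second failure propagates.
            handle = client.load(script);
            return client.eval(handle, key);
        }
    }
}
```

Keeping the upload out of the per-request path means an unavailable node no longer blocks every call, and the single retry covers the case where a recovered node has forgotten the script.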
05 MQ HA
5.1 JMQ acknowledgment timeout fault
When a consumer instance crashes after pulling messages, the partition lock remains held, causing message backlog; the default 120 s ack timeout can be reduced to 10× the recent tp999 value (e.g., 50 s) to mitigate.
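The sizing rule above (10× the recent tp999, never above the 120 s default) can be written down as a small calculation. This is a sketch: the nearest-rank percentile method and all names are assumptions, and how the timeout is actually applied in JMQ is not shown.

```java
import java.util.Arrays;

final class AckTimeoutCalculator {
    static final long DEFAULT_ACK_TIMEOUT_MS = 120_000; // default from the text

    // Nearest-rank percentile over the given samples, p in (0, 1].
    static long percentile(long[] samplesMs, double p) {
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // 10x the tp999 of recent processing times, capped at the default.
    static long suggestedAckTimeoutMs(long[] recentProcessingMs) {
        if (recentProcessingMs.length == 0) {
            return DEFAULT_ACK_TIMEOUT_MS; // no data yet: keep the default
        }
        long tp999 = percentile(recentProcessingMs, 0.999);
        return Math.min(DEFAULT_ACK_TIMEOUT_MS, tp999 * 10);
    }
}
```

With a tp999 of 5 s this yields the 50 s figure from the example, so a crashed consumer releases its partition lock in well under half the default time.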
5.2 JMQ oversized‑message fault
Messages exceeding size limits may fail even with compression; keep payloads minimal.
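A producer-side check can reject an oversized payload before it is sent. The sketch below assumes gzip compression and a caller-supplied byte limit; JMQ's actual compression codec and size limit are not stated in the source, and MessageSizeGuard is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class MessageSizeGuard {
    // Gzip-compresses the bytes in memory.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // read after the stream is closed and flushed
    }

    // True when the UTF-8 payload, once compressed, is within maxBytes.
    static boolean fitsAfterCompression(String payload, int maxBytes) {
        return gzip(payload.getBytes(StandardCharsets.UTF_8)).length <= maxBytes;
    }
}
```

Compression ratios depend heavily on the data, so a payload that fits today may not fit after a format change; sending an identifier and letting the consumer fetch the details keeps the message small regardless.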
5.3 JMQ storage fault
High‑volume consumption can saturate a broker’s 1 Gbps downstream bandwidth, leading to send failures; scaling broker count or network capacity is required.
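A rough estimate shows how quickly a 1 Gbps link fills up. The model below is a simplifying assumption (each consumer group pulls its own copy of every message, and protocol overhead is ignored); the class name and the traffic figures in the usage note are made up for illustration.

```java
final class BrokerBandwidthEstimate {
    // 1 Gbps expressed in bytes per second.
    static final double LINK_BYTES_PER_SEC = 1_000_000_000.0 / 8;

    // Fraction of the link consumed by outbound consumption traffic.
    // Values above 1.0 mean the link is oversubscribed.
    static double utilization(double messagesPerSec, double avgMessageBytes, int consumerGroups) {
        return messagesPerSec * avgMessageBytes * consumerGroups / LINK_BYTES_PER_SEC;
    }
}
```

For instance, 10,000 messages per second at 5,000 bytes each, read by 3 consumer groups, is 150 MB/s against a 125 MB/s link, a utilization of 1.2, which is already past saturation.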
Conclusion
The content is distilled from frontline engineering experiences at JD, offering a continuously updated compendium of HA best practices across applications, databases, caches, and messaging systems.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
