Mastering High‑Availability: JD Real‑World Pitfalls & Fixes for Apps, DBs, Cache & MQ
This article shares JD's practical high‑availability architecture lessons, detailing common pitfalls across applications, databases, caches, RPC frameworks, containers, data centers, GC, and message queues, and provides concrete troubleshooting steps and optimization techniques to help engineers design more resilient, fault‑tolerant systems.
When building high‑availability systems, engineers face challenges across applications, databases, caches, message queues, and more. This guide, based on real JD technical scenarios, systematically outlines typical HA pitfalls and solutions, offering a practical checklist to avoid risks and improve system stability and fault tolerance.
1. Introduction
High availability is commonly defined as at least four 9s (99.99%) of uptime, i.e., no more than 8.64 seconds of downtime per day (about 52.6 minutes per year); stricter standards call for five 9s (99.999%), or roughly 0.86 seconds per day. The article explores how to meet these goals by addressing unreliable components, rapid traffic growth, and the need for comprehensive monitoring.
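The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic (the class and method names are illustrative, not from the original article):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.time.Duration;

final class DowntimeBudget {
    /** Allowed downtime in milliseconds over a period, for an availability given in percent, e.g. "99.99". */
    static long allowedDowntimeMillis(String availabilityPercent, Duration period) {
        // BigDecimal keeps 1 - 0.9999 exact; double arithmetic would add rounding noise.
        BigDecimal unavailableFraction =
                BigDecimal.ONE.subtract(new BigDecimal(availabilityPercent).movePointLeft(2));
        return unavailableFraction
                .multiply(BigDecimal.valueOf(period.toMillis()))
                .setScale(0, RoundingMode.HALF_UP)
                .longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(allowedDowntimeMillis("99.99", Duration.ofDays(1)));   // 8640 ms = 8.64 s
        System.out.println(allowedDowntimeMillis("99.999", Duration.ofDays(1)));  // 864 ms
    }
}
```

The same function can be used with a longer period (for example `Duration.ofDays(365)`) to express the budget per year.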
2. Application HA
2.1 Code Faults
Code faults are divided into application‑level and platform‑level issues.
2.1.1 Application‑level Faults
Integer overflow when parsing strings to int.
String length mismatches between code and database.
Division errors (divide‑by‑zero or non‑terminating division).
General logic bugs that can cause complete service outage.
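The numeric faults in this list can be reproduced with the standard library alone. The sketch below is illustrative (the helper names are assumptions, not JD code) and shows one defensive way to handle an out-of-range string and a non-terminating division:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.OptionalInt;

final class NumericPitfalls {
    /** Parses a string to int; returns empty instead of throwing when the value is malformed or out of int range. */
    static OptionalInt tryParseInt(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            // Integer.parseInt throws NumberFormatException for values such as "3000000000".
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** Divides with an explicit scale and rounding mode so that 1/3 cannot throw ArithmeticException. */
    static BigDecimal safeDivide(BigDecimal dividend, BigDecimal divisor, int scale) {
        if (divisor.signum() == 0) {
            throw new IllegalArgumentException("divisor must not be zero");
        }
        return dividend.divide(divisor, scale, RoundingMode.HALF_UP);
    }
}
```

Calling `BigDecimal.divide` without a scale or `MathContext` throws `ArithmeticException` when the quotient has a non-terminating decimal expansion, which is why the sketch always passes a rounding mode. Whether an empty result or an exception is the right reaction to bad input depends on the caller.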
Example of a BigDecimal rounding error:
Exception in thread "main" java.lang.ArithmeticException: Rounding necessary
    at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
    ...
2.1.2 Platform‑level Faults
These include JDK bugs, RPC framework issues, cache SDK problems, and outdated third‑party jars.
JDK – ArrayIndexOutOfBoundsException
A bug in JDK 8u311 caused frequent exceptions during JSON parsing. Upgrading the JDK resolves the issue, but many JD services still run older versions, leaving a hidden risk.
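// Imports assume Jackson 2.x; the source does not state the Jackson version.
import com.fasterxml.jackson.core.type.TypeReference;
import java.io.IOException;
import java.util.Map;

public class ATest {
    public static void main(String[] args) {
        String str = "{\"aa\":\"bb\"}";
        try {
            // JacksonMapper is the article's own ObjectMapper wrapper; its definition is not shown in the source.
            Object o = JacksonMapper.getInstance()
                    .readValue(str, new TypeReference<Map<String, String>>() {});
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
RPC Framework (JSF) – Method‑Not‑Found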
Using the msgpack protocol with BigDecimal parameters can trigger method‑not‑found exceptions. Switching to the Hessian protocol resolves the problem, albeit with a slight performance impact.
Cache Framework (jimdb) – Buffer Overflow
Versions below 2.1.12 of jimdb may throw buffer‑overflow exceptions during read/write operations. Upgrading to the latest version eliminates the issue.
Cache Framework – NullPointerException
Old titan‑profiler‑sdk jars cause NPEs under high load. Upgrading to the 2025‑07‑18 hotfix version resolves it.
2.2 Single‑Container Faults
Even a single container failure can affect web services, RPC providers, MQ consumers, and workflow engines, especially when automatic failover or load‑balancing is not configured.
2.3 Data‑Center Faults
Failure of an entire data center impacts traffic entry points, RPC services, MQ, DB, and cache layers, potentially causing widespread outages.
2.4 GC Faults
Frequent young‑GC pauses and mis‑configured connection pools degrade performance. Optimizations such as increasing container resources, tuning GC threads, and adjusting pool parameters reduced average latency from 176.7 ms to 17.2 ms.
Jackson's string interning (field names are interned by default) caused memory bloat; disabling the interning cache fixed the issue.
3. Database HA
3.1 JED Single‑Shard Query Failures
JED routes queries based on shard keys. Missing shard keys trigger cross‑shard queries, increasing latency and risking total outage if any shard is down.
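The availability cost of a missing shard key can be sketched as follows. JED's actual routing algorithm is not described in the article, so the hash-modulo rule and the class name below are assumptions made purely for illustration, as is the premise that shards fail independently:

```java
import java.util.ArrayList;
import java.util.List;

final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Shards a query must touch: exactly one when the shard key is known, all of them otherwise. */
    List<Integer> targetShards(String shardKey) {
        List<Integer> targets = new ArrayList<>();
        if (shardKey != null) {
            // Hypothetical routing rule: hash of the key modulo the shard count.
            targets.add(Math.floorMod(shardKey.hashCode(), shardCount));
            return targets;
        }
        for (int shard = 0; shard < shardCount; shard++) {
            targets.add(shard); // scatter-gather: every shard must answer
        }
        return targets;
    }

    /** Probability that every touched shard is up, assuming independent shard failures. */
    static double successProbability(double perShardAvailability, int shardsTouched) {
        return Math.pow(perShardAvailability, shardsTouched);
    }
}
```

Under these assumptions, a keyed query depends on one shard, while a query without a key on a 16-shard cluster with 99.9% per-shard availability succeeds only about 98.4% of the time, and its latency is bounded by the slowest shard.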
3.2 JED Transaction Failures
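The queries issued by default around a transaction carry no shard key, so they are routed to a random shard. Setting useLocalSessionState=true in the JDBC URL mitigates the risk: with MySQL Connector/J, the driver then relies on its locally tracked autocommit and isolation state instead of querying the server.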
3.3 Global‑ID Failures
Using a global auto‑increment ID on the first shard makes the whole system unavailable when that shard fails.
3.4 Slow SQL, Large Transactions, Traffic Amplification, Field Length, and Storage Issues
Slow queries, oversized transactions, excessive SQL traffic, insufficient column lengths, and storage capacity limits are further common causes of database incidents. Mitigations include sharding, avoiding large transactions, and monitoring storage thresholds.
4. Redis (jimdb) HA
4.1 Timeout & Hot‑Key Management
Improper timeout settings (typically timeouts that are too long) keep circuit breakers from tripping quickly during jimdb failures. Hot‑key patterns should be avoided or mitigated.
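The interaction between timeouts and circuit breaking can be illustrated with a minimal breaker. This is a sketch under stated assumptions, not jimdb's or JD's implementation: the class name, the consecutive-failure rule, and the injected clock are all hypothetical, and the cache call is expected to throw a RuntimeException when its own timeout expires.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the cool-down can be tested without sleeping
    private int consecutiveFailures;
    private long openedAt = -1;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True while the breaker is rejecting calls during its cool-down window. */
    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    /** Runs the cache call unless the breaker is open; falls back on failure or when open. */
    <T> T call(Supplier<T> cacheCall, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // fail fast without touching the cache
        }
        try {
            T value = cacheCall.get();
            onSuccess();
            return value;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();
        }
    }
}
```

The breaker only learns about a failure once the cache call returns or throws. If the client timeout is several seconds, each of the first `failureThreshold` calls blocks a thread for that long before the breaker opens, which is why short, explicit timeouts matter.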
4.2 Dangerous Commands
Lua script uploads block when a node is unavailable. Upload scripts once at initialization and re‑upload only on ScriptNotFoundException.
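The load-once, reload-on-miss pattern might look like the sketch below. The `ScriptClient` interface and the local `ScriptNotFoundException` class are hypothetical stand-ins; the article does not show jimdb's real client API, so the method names here are assumptions.

```java
/** Hypothetical stand-in for the cache client's script API. */
interface ScriptClient {
    String scriptLoad(String script);           // uploads the script and returns its SHA
    Object evalsha(String sha, String... keys); // throws ScriptNotFoundException if the node lost the script
}

/** Hypothetical stand-in for the client's "script not cached on this node" error. */
class ScriptNotFoundException extends RuntimeException {
}

final class CachedLuaScript {
    private final ScriptClient client;
    private final String source;
    private volatile String sha;

    CachedLuaScript(ScriptClient client, String source) {
        this.client = client;
        this.source = source;
        this.sha = client.scriptLoad(source); // upload once, at initialization
    }

    Object execute(String... keys) {
        try {
            return client.evalsha(sha, keys);
        } catch (ScriptNotFoundException e) {
            // The node restarted or flushed its script cache: re-upload once and retry.
            sha = client.scriptLoad(source);
            return client.evalsha(sha, keys);
        }
    }
}
```

With this shape the request path never uploads a script unless the server reports it missing, so an unavailable node cannot block every call on an upload.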
5. MQ HA
5.1 JMQ Acknowledgement Timeout
When a consumer crashes, its partition lock remains, causing message backlog. Reducing the acknowledgement timeout (e.g., to ten times the recent TP99) and supporting graceful shutdown in the SDK alleviate the issue.
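Deriving the acknowledgement timeout from recent latency can be sketched as below. The nearest-rank percentile, the class name, and the lower bound `floorMillis` are assumptions added for illustration; the article only states the "ten times the recent TP99" rule of thumb.

```java
import java.util.Arrays;

final class AckTimeoutCalculator {
    /** Nearest-rank percentile (e.g. 99 for TP99) of latency samples in milliseconds. */
    static long percentile(long[] samplesMillis, int percentile) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        if (percentile < 1 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in [1, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percentile * n / 100 avoids floating-point rounding surprises.
        int rank = (percentile * sorted.length + 99) / 100;
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Ack timeout = multiplier * TP99, but never below floorMillis (the floor is an assumption). */
    static long ackTimeoutMillis(long[] samplesMillis, int multiplier, long floorMillis) {
        return Math.max(floorMillis, percentile(samplesMillis, 99) * multiplier);
    }
}
```

A floor keeps the timeout from collapsing when the sample window happens to contain only very fast messages; the right floor and sampling window depend on the workload.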
5.2 Message Size Limits
Large messages may fail even with compression; keep payloads small.
5.3 Storage Bottlenecks
High‑volume large messages can saturate the broker’s network bandwidth, leading to failures.
The content above reflects the collective experience of JD’s engineering teams and will be continuously updated with new case studies and best‑practice recommendations.
JD Cloud Developers
