Avoid Common High‑Availability Pitfalls: Real‑World JD Practices and Solutions
This article analyzes the multi‑dimensional challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—by sharing real JD engineering scenarios, common failure patterns, and concrete mitigation strategies to help engineers design more resilient services.
2. Application High Availability
2.1 Code Faults
2.1.1 Application‑Level Faults
Typical issues include integer overflow, string length overflow, division‑by‑zero, and null‑pointer exceptions, any of which can cause a complete service outage when it occurs in a high‑traffic path.
2.1.1.1 Integer Overflow
Calling Integer.parseInt on a value that exceeds the int range throws a NumberFormatException; if this happens in a critical flow, the impact can be catastrophic.
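A minimal sketch of a defensive parse, assuming the caller can handle a missing value. The class and method names are hypothetical, not taken from JD code.

```java
import java.util.OptionalInt;

final class SafeIntParser {
    private SafeIntParser() {}

    /** Parses a decimal int; returns empty instead of throwing on bad or out-of-range input. */
    static OptionalInt tryParseInt(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Thrown for non-numeric text and for values outside the int range.
            return OptionalInt.empty();
        }
    }
}
```

A caller can then decide explicitly what an unparseable value means, for example `SafeIntParser.tryParseInt(input).orElse(0)`, rather than letting the exception propagate through a critical flow.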
2.1.1.2 String Length Overflow
Mismatched lengths between application strings and database columns or hard‑coded length checks can cause write failures or logic errors.
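One way to avoid scattered hard-coded checks is to validate against a single constant that mirrors the column definition. The sketch below assumes the column limit is expressed in characters; the class name and the limit of 64 are hypothetical.

```java
final class ColumnLengthGuard {
    private ColumnLengthGuard() {}

    // Hypothetical limit; it must be kept in sync with the actual column definition.
    static final int REMARK_MAX_CHARS = 64;

    /** Returns true when the value fits a column limited to maxChars characters. */
    static boolean fits(String value, int maxChars) {
        if (value == null) {
            return true;
        }
        // Count code points rather than UTF-16 units so supplementary characters count once.
        return value.codePointCount(0, value.length()) <= maxChars;
    }
}
```

Checking `ColumnLengthGuard.fits(remark, ColumnLengthGuard.REMARK_MAX_CHARS)` before the write turns a database error into a validation result the application can handle.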
2.1.1.3 Division Faults
A BigDecimal operation that omits the scale or rounding mode, or a division by zero, throws ArithmeticException and can make the service unavailable.
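A minimal sketch of a guarded division, assuming the business logic can accept a fallback value when the divisor is zero. The class name and the fallback convention are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class SafeDivision {
    private SafeDivision() {}

    /** Divides with an explicit scale and rounding mode; returns fallback for a zero divisor. */
    static BigDecimal divide(BigDecimal dividend, BigDecimal divisor, int scale, BigDecimal fallback) {
        if (divisor.signum() == 0) {
            return fallback; // avoids ArithmeticException on division by zero
        }
        // An explicit scale and rounding mode avoid ArithmeticException on inexact results.
        return dividend.divide(divisor, scale, RoundingMode.HALF_UP);
    }
}
```

By contrast, unguarded forms such as `BigDecimal.ONE.divide(new BigDecimal("3"))` or `new BigDecimal("1.005").setScale(2)` throw ArithmeticException; the second fails with the "Rounding necessary" message.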
Exception in thread "main" java.lang.ArithmeticException: Rounding necessary
at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
...
2.1.1.4 Logic Faults
Complex business flows (e.g., order fulfillment) require differentiated thread‑pool configurations to avoid thread starvation and avalanche effects.
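The idea of differentiated pools can be sketched as one bounded pool per business flow, so that a slow flow fills only its own queue. The class name, pool sizes, and rejection policy below are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class FlowPools {
    private FlowPools() {}

    /** Builds a fixed-size pool with a bounded queue that rejects work once it is full. */
    static ThreadPoolExecutor boundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                // Fail fast instead of queueing without limit when the flow is saturated.
                new ThreadPoolExecutor.AbortPolicy());
    }
}
```

A service might create, for example, `FlowPools.boundedPool(16, 200)` for order creation and a separate `FlowPools.boundedPool(8, 100)` for queries. When one pool saturates, its callers receive a RejectedExecutionException and can degrade, while the other flow keeps its threads.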
2.1.2 Platform‑Level Faults
These faults are hidden in underlying platforms such as JDK, RPC frameworks, cache libraries, etc., and often require version upgrades or configuration changes.
2.1.2.1 JDK Array‑Index‑Out‑Of‑Bounds
Older JDK versions (pre‑8u311) may throw high‑frequency ArrayIndexOutOfBoundsException during JSON parsing, increasing GC pressure.
"main"@1" prio=5 tid=0x1 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at java.lang.ArrayIndexOutOfBoundsException.<init>(ArrayIndexOutOfBoundsException.java:65)
...
2.1.2.2 RPC Framework – Method‑Not‑Found
JSF (JD’s internal RPC framework) using the msgpack protocol can throw “method not found” when BigDecimal parameters are involved; switching to the Hessian protocol resolves the issue at the cost of a slight performance loss.
2.1.2.3 Cache Framework – Buffer Overflow
Older jimdb versions (< 2.1.12) may trigger buffer‑overflow exceptions during read/write operations.
2.1.2.4 Cache Framework – Null Pointer
Using an outdated titan-profiler-sdk jar can cause NPEs after upgrading jimdb.
2.2 Single‑Container Faults
Even a single node failure can affect web services, RPC providers, MQ consumers, or workflow engines, especially when automatic failover or load‑balancing is not configured.
2.3 Data‑Center Faults
Failure of an entire data‑center impacts traffic entry points, RPC services, MQ, DB, and cache layers, requiring rapid traffic isolation and fallback strategies.
2.4 GC Faults
Improper GC settings (e.g., insufficient ParallelGCThreads) and connection‑pool misconfigurations can cause long pauses. In one case, tuning container resources and pool parameters reduced average latency from 176 ms to 17 ms.
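Because the JVM derives its default GC thread count from the CPUs it detects, a useful first step when investigating pauses is to log what the JVM actually sees inside the container. The sketch below uses only standard management beans; the class name is hypothetical and the output format is arbitrary.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    private GcSnapshot() {}

    /** Summarizes the visible CPU count and cumulative GC activity of this JVM. */
    static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("cpus=").append(Runtime.getRuntime().availableProcessors());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(' ').append(gc.getName())
              .append("[count=").append(gc.getCollectionCount())
              .append(",timeMs=").append(gc.getCollectionTime()).append(']');
        }
        return sb.toString();
    }
}
```

Logging `GcSnapshot.describe()` at startup and periodically gives a baseline to compare against before and after changing flags such as ParallelGCThreads.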
3. DB High Availability
3.1 JED Single‑Shard Failure
JED routes queries based on a shard key; missing the key forces cross‑shard scans, increasing latency and risking total outage if any shard is down.
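The effect of a missing shard key can be illustrated with a toy router. Hash-based routing here is an assumption for illustration only and is not JED's actual routing algorithm; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** With a shard key the query touches one shard; without it (null), every shard. */
    List<Integer> targetShards(String shardKey) {
        List<Integer> targets = new ArrayList<>();
        if (shardKey != null) {
            // floorMod keeps the index non-negative even for negative hash codes.
            targets.add(Math.floorMod(shardKey.hashCode(), shardCount));
        } else {
            for (int i = 0; i < shardCount; i++) {
                targets.add(i);
            }
        }
        return targets;
    }
}
```

With four shards, a keyed query depends on one shard being healthy, while a keyless query depends on all four, so a single shard failure breaks it.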
3.2 JED Transaction Fault
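By default, an extra select @@session.tx_read_only statement is sent and lands on a random shard, so a request depends on two shards instead of one and its failure probability roughly doubles; setting useLocalSessionState=true avoids the extra query.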
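A minimal sketch of applying the setting, assuming the database is reached through a MySQL-compatible JDBC driver where useLocalSessionState is a connection URL property. The helper class and the host name in the usage note are hypothetical.

```java
final class JdbcUrls {
    private JdbcUrls() {}

    /** Appends useLocalSessionState=true to a JDBC URL, respecting existing parameters. */
    static String withLocalSessionState(String baseUrl) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "useLocalSessionState=true";
    }
}
```

For example, `JdbcUrls.withLocalSessionState("jdbc:mysql://db.example.internal:3306/orders")` yields a URL ending in `?useLocalSessionState=true`, which tells the driver to rely on its locally tracked session state.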
3.3 Global‑ID Failure
Using a global auto‑increment ID on the first shard makes the whole service unavailable when that shard fails.
3.4 Slow SQL
Large or unoptimized queries hold database resources longer and degrade performance for other requests; identify them early and optimize or remove them.
3.5 Large Transactions
Big transactions cause DB lock contention; redesign to use idempotent inserts/updates.
3.6 Traffic Amplification
Some orders generate 10‑100× more SQL statements than expected, hidden until DB pressure surfaces.
3.7 Field Length Insufficiency
Schema mismatches across services lead to write failures when field lengths are too short.
3.8 Cluster Storage Exhaustion
Plan storage capacity for 10× or 100× traffic growth; monitor disk usage and QPS thresholds.
4. Redis (JIMDB) High Availability
4.1 Timeout and Hot‑Key Governance
Set reasonable timeouts to enable fast circuit‑break; avoid fixed hot‑key patterns.
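The timeout-plus-fallback idea can be sketched with the standard library alone (Java 9 or later). This is an assumption-laden illustration: the class name is hypothetical, and a real cache client would normally also set its own socket-level timeout, since the wrapped call below keeps running in the background after the caller gives up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class TimedCacheRead {
    private TimedCacheRead() {}

    /** Runs the cache call asynchronously and returns fallback on timeout or failure. */
    static <T> T getOrFallback(Supplier<T> cacheCall, long timeoutMs, T fallback) {
        return CompletableFuture.supplyAsync(cacheCall)
                // Complete with the fallback if no result arrives in time.
                .completeOnTimeout(fallback, timeoutMs, TimeUnit.MILLISECONDS)
                // Treat any exception from the cache call as a miss.
                .exceptionally(e -> fallback)
                .join();
    }
}
```

A caller bounds its own latency with something like `TimedCacheRead.getOrFallback(() -> cache.get(key), 50, null)`, where `cache` stands for whatever client is in use, and then falls through to the degraded path.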
4.2 Dangerous Commands Governance
Uploading a Lua script blocks when a node is down, so upload scripts once at startup rather than on every call, and re-upload and retry when a ScriptNotFoundException is thrown.
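The upload-once-and-retry pattern can be sketched as follows. The ScriptClient interface and the nested ScriptNotFoundException are hypothetical stand-ins defined only for this example; they are not the real jimdb client API.

```java
final class LuaScriptRunner {
    /** Hypothetical stand-in for the cache client; not the real jimdb API. */
    interface ScriptClient {
        String scriptLoad(String script);
        Object evalsha(String sha, String key);
    }

    /** Hypothetical stand-in for the exception named in the text. */
    static final class ScriptNotFoundException extends RuntimeException {}

    private final ScriptClient client;
    private final String script;
    private volatile String sha;

    LuaScriptRunner(ScriptClient client, String script) {
        this.client = client;
        this.script = script;
        this.sha = client.scriptLoad(script); // upload once at startup
    }

    Object run(String key) {
        try {
            return client.evalsha(sha, key);
        } catch (ScriptNotFoundException e) {
            // The node lost the script (e.g., after a restart): re-upload and retry once.
            sha = client.scriptLoad(script);
            return client.evalsha(sha, key);
        }
    }
}
```

The hot path only calls `evalsha`, so a node outage no longer blocks on an upload for every request; the upload happens again only when a node reports that the script is missing.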
5. MQ High Availability
5.1 JMQ Acknowledgement Timeout
When a consumer crashes, its partitions stay locked until the acknowledgement timeout expires; reduce the ack timeout (e.g., to 10× the recent TP99) and add graceful‑shutdown support.
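The "10× recent TP99" rule of thumb can be computed from recent consumption latencies. The sketch below uses the nearest-rank percentile and a configurable lower bound; the class name and the floor parameter are assumptions, and how the value is applied to JMQ is not shown.

```java
import java.util.Arrays;

final class AckTimeout {
    private AckTimeout() {}

    /** Nearest-rank TP99 of the given latency samples, in milliseconds. */
    static long tp99(long[] samplesMs) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(0.99 * n), computed with integer arithmetic.
        int rank = (int) ((99L * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    /** Suggested ack timeout: ten times TP99, but never below floorMs. */
    static long ackTimeoutMs(long[] samplesMs, long floorMs) {
        return Math.max(floorMs, 10 * tp99(samplesMs));
    }
}
```

The floor guards against setting an unrealistically small timeout when the sample window happens to contain only very fast messages.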
5.2 Message Size Fault
Large messages trigger compression failures; keep payloads small.
5.3 Storage Fault
High‑volume large messages can saturate broker network bandwidth, causing send failures.
This content is derived from frontline engineering experience at JD and will be continuously updated with more real‑world cases and technical reflections.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
