
Avoid Common High‑Availability Pitfalls: Real‑World JD Practices and Solutions

This article analyzes the multi‑dimensional challenges of building high‑availability systems—covering applications, databases, caches, message queues, containers, GC, and more—by sharing real JD engineering scenarios, common failure patterns, and concrete mitigation strategies to help engineers design more resilient services.

JD Tech Talk

2. Application High Availability

2.1 Code Faults

2.1.1 Application‑Level Faults

Typical issues include integer overflow, string length overflow, division‑by‑zero, and null‑pointer exceptions, which can cause complete service outage when they occur in high‑traffic paths.

2.1.1.1 Integer Overflow

Calling Integer.parseInt on a value that exceeds the int range throws a NumberFormatException; when this happens on a critical path, the impact can be catastrophic.

Integer overflow example
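A minimal defensive-parsing sketch for the pitfall above; the class and method names are illustrative, not from the original incident. The idea is to parse into the wider long range and reject out-of-range values explicitly, so a bad upstream value degrades gracefully instead of throwing deep inside a critical flow.

```java
// Illustrative sketch: defensive parsing of numeric input that may
// exceed the int range. Names are hypothetical.
public class SafeParse {

    // Parse as long (wider range) and check int bounds explicitly,
    // instead of letting Integer.parseInt throw in a hot path.
    public static int parseIntOrDefault(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            long value = Long.parseLong(raw.trim());
            if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
                return fallback; // out of int range: degrade, don't crash
            }
            return (int) value;
        } catch (NumberFormatException e) {
            return fallback; // non-numeric input: degrade, don't crash
        }
    }
}
```

In real code the fallback branch would also log and alert, so the bad input is visible rather than silently swallowed.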

2.1.1.2 String Length Overflow

Mismatched lengths between application strings and database columns or hard‑coded length checks can cause write failures or logic errors.

String length overflow
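One mitigation is to validate or truncate strings against the column limit before the write, instead of letting the insert fail. A minimal sketch, assuming a hypothetical VARCHAR(64) column; in production the limit should come from one shared schema definition, and truncation should be logged:

```java
// Illustrative sketch: fit a string to a DB column limit before writing.
// The 64-character limit and names are assumptions for illustration.
public class ColumnGuard {
    // Hypothetical column limit, e.g. remark VARCHAR(64)
    public static final int REMARK_MAX_CHARS = 64;

    // Truncate rather than fail the whole write; real code should also log.
    public static String fitToColumn(String value) {
        if (value == null || value.length() <= REMARK_MAX_CHARS) {
            return value;
        }
        return value.substring(0, REMARK_MAX_CHARS);
    }
}
```

Note that length() counts UTF-16 units; for columns that store supplementary characters, a code-point-aware truncation would be safer.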

2.1.1.3 Division Faults

A BigDecimal division without an explicit scale and rounding mode, or a division by zero, throws ArithmeticException and can make the service unavailable.

Exception in thread "main" java.lang.ArithmeticException: Rounding necessary
    at java.math.BigDecimal.commonNeedIncrement(BigDecimal.java:4179)
    ...
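The "Rounding necessary" trace above comes from asking BigDecimal to round without telling it how. A hedged sketch of the safe pattern: always pass a scale and RoundingMode, and guard the zero divisor explicitly (the scale of 2 and the zero fallback are illustrative choices, not the original fix):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class SafeDivide {
    // BigDecimal.divide without scale/rounding throws ArithmeticException
    // on non-terminating quotients, and setScale without a RoundingMode
    // throws "Rounding necessary". Always supply both.
    public static BigDecimal divide(BigDecimal a, BigDecimal b) {
        if (b.signum() == 0) {
            return BigDecimal.ZERO; // domain-specific fallback, not universal
        }
        return a.divide(b, 2, RoundingMode.HALF_UP);
    }
}
```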

2.1.1.4 Logic Faults

Complex business flows (e.g., order fulfillment) require differentiated thread‑pool configurations to avoid thread starvation and avalanche effects.
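The isolation idea can be sketched with one bounded pool per business flow, so a backlog in fulfillment cannot exhaust the threads serving order creation. Pool sizes, queue depth, and the fail-fast rejection policy below are illustrative assumptions, not JD's actual configuration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class IsolatedPools {
    // One bounded pool per flow; a slow dependency in one flow cannot
    // starve the others.
    public static ThreadPoolExecutor newPool(String name, int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads,
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueSize),
                r -> new Thread(r, name + "-worker"),
                // Reject fast when saturated instead of queueing unboundedly
                // and cascading latency into an avalanche.
                new ThreadPoolExecutor.AbortPolicy());
    }
}
```

Named threads also make thread dumps attributable to a flow, which shortens diagnosis when one pool does saturate.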

2.1.2 Platform‑Level Faults

These faults are hidden in underlying platforms such as JDK, RPC frameworks, cache libraries, etc., and often require version upgrades or configuration changes.

2.1.2.1 JDK Array‑Index‑Out‑Of‑Bounds

Older JDK versions (pre‑8u311) may throw high‑frequency ArrayIndexOutOfBoundsException during JSON parsing, increasing GC pressure.

"main@1" prio=5 tid=0x1 nid=NA runnable
    java.lang.Thread.State: RUNNABLE
    at java.lang.ArrayIndexOutOfBoundsException.<init>(ArrayIndexOutOfBoundsException.java:65)
    ...

2.1.2.2 RPC Framework – Method‑Not‑Found

JSF (JD’s internal RPC framework) can throw “method not found” when BigDecimal parameters are serialized with the msgpack protocol; switching to the Hessian protocol resolves the issue at the cost of a slight performance penalty.

RPC method not found

2.1.2.3 Cache Framework – Buffer Overflow

Older jimdb versions (< 2.1.12) may trigger buffer‑overflow exceptions during read/write operations.

jimdb buffer overflow

2.1.2.4 Cache Framework – Null Pointer

Using an outdated titan-profiler-sdk jar can cause NPEs after upgrading jimdb.

Cache NPE

2.2 Single‑Container Faults

Even a single node failure can affect web services, RPC providers, MQ consumers, or workflow engines, especially when automatic failover or load‑balancing is not configured.

2.3 Data‑Center Faults

Failure of an entire data‑center impacts traffic entry points, RPC services, MQ, DB, and cache layers, requiring rapid traffic isolation and fallback strategies.

2.4 GC Faults

Improper GC settings (e.g., insufficient ParallelGCThreads) and connection‑pool misconfigurations can cause long pauses; tuning container resources and pool parameters reduced average latency from 176 ms to 17 ms.

3. DB High Availability

3.1 JED Single‑Shard Failure

JED routes queries based on a shard key; missing the key forces cross‑shard scans, increasing latency and risking total outage if any shard is down.

3.2 JED Transaction Fault

By default the MySQL JDBC driver issues select @@session.tx_read_only round trips, which JED routes to a random shard, roughly doubling the failure probability; setting useLocalSessionState=true lets the driver answer such session-state queries locally.
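For illustration, the property goes into the JDBC URL (useLocalSessionState is a real MySQL Connector/J option that makes the driver track session state client-side; the host and database names below are placeholders):

```java
public class JedJdbcConfig {
    // Hypothetical endpoint. With useLocalSessionState=true, Connector/J
    // answers state probes such as "select @@session.tx_read_only" from
    // its local cache instead of sending them to the server, so they
    // never land on a random shard.
    public static final String URL =
            "jdbc:mysql://jed-proxy.example.internal:3306/orders"
            + "?useLocalSessionState=true";
}
```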

3.3 Global‑ID Failure

Using a global auto‑increment ID on the first shard makes the whole service unavailable when that shard fails.

3.4 Slow SQL

Large result sets and unindexed queries degrade the whole cluster; review execution plans and keep slow SQL off hot paths.

3.5 Large Transactions

Large transactions cause DB lock contention; redesign them as small, idempotent inserts and updates that are safe to retry.
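One common shape for such idempotent writes is a MySQL upsert, so a retried statement converges to the same row state instead of needing one big all-or-nothing transaction. The table, columns, and SQL below are illustrative, not the original schema:

```java
public class IdempotentWrite {
    // Sketch: each row is written idempotently, so a retry after a
    // partial failure is safe. MySQL upsert syntax; names hypothetical.
    public static final String UPSERT_SQL =
            "INSERT INTO order_status (order_id, status, updated_at) "
          + "VALUES (?, ?, NOW()) "
          + "ON DUPLICATE KEY UPDATE status = VALUES(status), updated_at = NOW()";
}
```

The trade-off is that each statement commits independently, so cross-row invariants must be enforced by the application (e.g., via state-machine checks) rather than by one enclosing transaction.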

3.6 Traffic Amplification

Some orders generate 10‑100× more SQL statements than expected, hidden until DB pressure surfaces.

3.7 Field Length Insufficiency

Schema mismatches across services lead to write failures when field lengths are too short.

3.8 Cluster Storage Exhaustion

Plan storage capacity for 10× or 100× traffic growth; monitor disk usage and QPS thresholds.

4. Redis (JIMDB) High Availability

4.1 Timeout and Hot‑Key Governance

Set reasonable client timeouts so calls fail fast and circuit‑break quickly; avoid fixed hot‑key access patterns that concentrate load on a single shard.
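The fail-fast idea can be sketched with nothing but the JDK: bound every cache call with a short deadline and fall back, so a degraded node returns a degraded answer quickly instead of tying up request threads. This is a generic sketch, not jimdb's client API; the timeout value is illustrative:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class FastFailCache {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Run the cache call with a hard deadline; on timeout or error,
    // return the fallback (e.g., read-through to DB or a default).
    public static <T> T getWithTimeout(Supplier<T> cacheCall, T fallback, long timeoutMs) {
        Future<T> f = POOL.submit(cacheCall::get);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            f.cancel(true); // stop waiting on the slow node
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            f.cancel(true);
            return fallback;
        }
    }
}
```

Real clients expose connect/read timeouts directly, which is cheaper than wrapping every call; this sketch only shows the behavioral goal.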

4.2 Dangerous Commands Governance

Lua script uploads block when a node is down; upload scripts once at startup and retry on ScriptNotFoundException.
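The load-once-then-retry pattern can be sketched against a minimal client abstraction (the interface and exception below are stand-ins; adapt them to your SDK's actual types): execute by SHA, and only re-upload the script when a failed-over node reports it missing.

```java
import java.util.List;

public class ScriptWithRetry {
    // Minimal stand-in for a Redis client's scripting API.
    public interface RedisScripting {
        Object evalsha(String sha, List<String> keys, List<String> args);
        String scriptLoad(String luaSource);
    }

    // Stand-in for the SDK's script-missing error (NOSCRIPT).
    public static class ScriptNotFoundException extends RuntimeException {}

    // Load the script once at startup, keep its SHA, and re-load only
    // when a node (e.g., after failover) no longer has it cached.
    public static Object evalWithReload(RedisScripting redis, String luaSource,
                                        String sha, List<String> keys, List<String> args) {
        try {
            return redis.evalsha(sha, keys, args);
        } catch (ScriptNotFoundException e) {
            String newSha = redis.scriptLoad(luaSource); // node lost the script cache
            return redis.evalsha(newSha, keys, args);
        }
    }
}
```

This keeps SCRIPT LOAD off the hot path entirely: uploads happen once at startup and once per recovered node, never per request.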

5. MQ High Availability

5.1 JMQ Acknowledgement Timeout

Consumer crashes keep partitions locked; reduce ack timeout (e.g., 10× recent TP99) and add graceful‑shutdown support.
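The sizing rule above is simple arithmetic; a hedged sketch, with the clamp bounds being illustrative additions (the 10× multiplier is from the text):

```java
public class AckTimeout {
    // Size the consumer ack timeout from recently observed processing
    // latency: roughly 10x TP99, clamped to sane bounds so a noisy
    // sample cannot produce an absurd value. Bounds are illustrative.
    public static long ackTimeoutMs(long tp99Ms) {
        long candidate = tp99Ms * 10;
        return Math.max(1_000L, Math.min(candidate, 120_000L));
    }
}
```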

5.2 Message Size Fault

Large messages trigger compression failures; keep payloads small.

5.3 Storage Fault

High‑volume large messages can saturate broker network bandwidth, causing send failures.

This content is derived from frontline engineering experience at JD and will be continuously updated with more real‑world cases and technical reflections.
