
Why Did One Aerospike Node’s XDR Lag So High? Diagnosis and Fix

A single Aerospike node experienced extreme XDR replication lag caused by a misconfigured forward setting. This article walks through the terminology, the monitoring data, the root‑cause analysis, and the step‑by‑step commands used to resolve the issue and restore cluster performance.

Xiaolei Talks DB

1. Terminology

What is Aerospike? Aerospike is a distributed NoSQL database:

Hybrid storage architecture (primary index kept in memory, record data on SSD)

Scalable and stable for large‑scale data

High performance plus strong consistency

Significantly lower cost compared with pure‑memory Redis

Reduced operational cost: easy maintenance and automatic rebalance

What is XDR? Cross‑datacenter replication (XDR) copies data asynchronously between Aerospike clusters. Critical data usually has a standby cluster; Aerospike uses XDR to sync data from the primary cluster to a downstream cluster.

2. Sync lag problem description

Monitoring showed that a specific node (78) constantly had a large amount of data to sync and a high lag, while other nodes did not.

3. Analysis and investigation

1) Since only one node showed lag, we first checked whether a "write offset" caused it. Monitoring revealed that node 78’s average writes were ten times higher than other nodes.

Using aql's show sets command, we saw that the set esc_join (which corresponds to a MySQL table) had a high write rate (≈20‑30 k/s).

<code>+------------------+----------+-------------+-------------------+----------------------------------------+-------------------+-------------------+--------------+------------+
| disable-eviction | ns       | objects     | stop-writes-count | set                                    | memory_data_bytes | device_data_bytes | truncate_lut | tombstones |
+------------------+----------+-------------+-------------------+----------------------------------------+-------------------+-------------------+--------------+------------+
| "false"          | "mediav" | "115437494" | "0"               | "esc_join"                             | "0"               | "140764973360"    | "0"          | "0"        |
| "false"          | "mediav" | "37096386"  | "0"               | "unionad_xxxxx_imei"                   | "0"               | "66129833424"     | "0"          | "0"        |
... (additional rows omitted) ...
+------------------+----------+-------------+-------------------+----------------------------------------+-------------------+-------------------+--------------+------------+
</code>

Business owners confirmed the write pattern; the high write rate was not due to a few hot keys. Hot‑key statistics on node 78 matched other nodes, so write offset was ruled out.

<code>Mar 03 2022 18:04:59 GMT+0800: INFO (info): (ticker.c:725) {mediav} xdr-from-proxy: write (155545,0,0) delete (0,0,0,0)
Mar 03 2022 18:04:59 GMT+0800: INFO (info): (dc.c:1469) xdr-dc shyc-queryad: nodes 9 lag 26026 throughput 5477 latency-ms 3 in-queue 2306245 in-progress 36 complete (233214491,0,132723,0) retries (0,0,0) recoveries (437,292) hot-keys 43788323
... (additional log lines omitted) ...
</code>
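When eyeballing the aerospike.log ticker by hand, a small awk one-liner can pull the lag value out of lines like the one above. A minimal sketch (the field layout follows the dc.c ticker line shown here; the sample line is copied from it, not re-captured):

```shell
# Extract the per-DC lag from an XDR ticker line
line='xdr-dc shyc-queryad: nodes 9 lag 26026 throughput 5477 latency-ms 3 in-queue 2306245'
echo "$line" | awk '{for (i = 1; i <= NF; i++) if ($i == "lag") print $(i + 1)}'
# prints 26026
```

The same loop works for in-queue or throughput by changing the key it matches on.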

We then examined XDR configuration differences and found that node 78 had forward=true while other nodes had false.

The official meaning: with forward=false (the default), a node does not re‑ship records it received via XDR from another cluster; setting forward=true makes the node forward XDR‑received records on to its destination datacenters, which is what enables a chain‑replication architecture (A→B→C).

In this case, node 78’s forward=true caused writes to loop A→B→A→B, producing both the inflated write rate and the observed lag.
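A quick way to spot this kind of single-node drift is to dump each node's XDR config and compare the forward flag. The sketch below simulates the comparison on two hypothetical config strings; on a real cluster the strings would come from something like asinfo -v 'get-config:context=xdr;dc=shyc-queryad;namespace=mediav' run against each node:

```shell
# Hypothetical per-node config strings (illustrative, not captured from the real cluster)
node77='enabled=true;forward=false;transaction-queue-limit=16384'
node78='enabled=true;forward=true;transaction-queue-limit=16384'

for cfg in "$node77" "$node78"; do
  # split the semicolon-separated key=value list and keep only the forward flag
  echo "$cfg" | tr ';' '\n' | grep '^forward='
done
# prints forward=false, then forward=true
```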

4. Solution

(1) As a stopgap, adjust transaction-queue-limit to alleviate the lag. This parameter controls the maximum number of elements in XDR’s in‑memory transaction queue per partition. The command below doubles the default value:

<code>asinfo -v 'set-config:context=xdr;dc=DataCenter1;namespace=someNameSpace;transaction-queue-limit=32768'</code>

Official description:

<code>transaction-queue-limit
Maximum number of elements allowed in XDR's in-memory transaction queue per partition, per namespace, per datacenter. Each element is 25 bytes.

Value must be a power of 2 and must be expressed as an integer, not an exponent.

Default: 16*1024 = 16384.
Minimum: 1024.
Maximum: 1048576.</code>
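Since the docs require a power-of-2 value, a quick sanity check before issuing set-config avoids a rejected command. A sketch using the standard bit trick:

```shell
# transaction-queue-limit must be a power of 2 (see official description above);
# (v & (v - 1)) == 0 holds exactly for powers of 2 when v > 0
v=32768
if (( v > 0 && (v & (v - 1)) == 0 )); then
  echo "ok: $v is a power of 2"
else
  echo "rejecting $v: not a power of 2"
fi
# prints: ok: 32768 is a power of 2
```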

(2) Real fix: set forward=false on node 78 to stop the replication loop.

<code>asinfo -v "set-config:context=xdr;dc=shyc-queryad;namespace=mediav;forward=false"</code>

After applying the change, lag disappeared and write rates returned to normal.
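To confirm the fix took hold, the ticker lines can be scanned for the most recent lag value. A sketch over sample lines (illustrative, not real post-fix log output):

```shell
# Scan ticker lines (newest last) and report the most recent lag value
printf '%s\n' \
  'xdr-dc shyc-queryad: nodes 9 lag 1200 throughput 5477' \
  'xdr-dc shyc-queryad: nodes 9 lag 0 throughput 5512' |
awk '{for (i = 1; i <= NF; i++) if ($i == "lag") last = $(i + 1)} END {print "latest lag:", last}'
# prints: latest lag: 0
```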

5. Takeaways

(1) When only one node misbehaves, look for configuration or environment differences between that node and the rest.

(2) Communicate with business owners; hot‑spot traffic may stem from upstream data skew (the 80/20 rule).

(3) Packet capture (tcpdump) can help investigate excessive write requests.

(4) Monitoring should be granular to the set level to detect anomalies early.

Written by

Xiaolei Talks DB

Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.
