
Analysis of Memcached Failure and Consistent Hashing Mechanism in XMemcached Client

This article presents a detailed failure analysis of a memcached service outage: how the XMemcached client uses consistent hashing and heartbeat mechanisms to manage sessions, how the loss of a server affects request latency, and the verification steps and configuration changes that shorten the failover window and improve recovery.

Qunar Tech Salon

In April 2018, a migration of memcached hosts caused one cache server (cache1) to go offline, leading to gateway storage timeouts and an 8‑minute drop in payment success rate.

1. Fault Background

When cache1 was taken down, the client experienced connection errors such as "Cannot connect to Hessian remote service" and socket timeout exceptions.

2. Fault Analysis

2.1 Basic Analysis

The application configured a 2-second memcached operation timeout. Cache1 served about 2.7 QPS and cache2 about 2 QPS. Requests stopped being routed to cache1 only once all 50 of a machine's sessions to it had been closed, and because sessions are closed only after the heartbeat mechanism detects them as idle and fails repeatedly, failover took 6-8 minutes.

memcached.server2.host=192.*.*.44
memcached.server2.port=6666
memcached.server2.weight=2
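To see why detection is slow, a rough back-of-envelope model helps. It uses the 2 s operation timeout configured above plus the heartbeat defaults described later in this article (a 5 s idle trigger and a 10-failure close threshold); it deliberately ignores regular traffic, which resets the idle clock and stretches the observed window well past this lower bound:

```java
// Rough back-of-envelope model of per-session failover detection time.
// The 10-failure threshold and 5 s idle trigger are the heartbeat values
// cited in this article; the 2 s value is the configured operation timeout.
// Real traffic keeps resetting the idle clock, which is one reason the
// observed window across 50 sessions stretched to 6-8 minutes.
public class FailoverEstimate {
    public static void main(String[] args) {
        int failureThreshold = 10;    // heartbeat failures before a session is closed
        long idleTimeoutMs = 5_000L;  // default idle time before a heartbeat is sent
        long opTimeoutMs = 2_000L;    // configured memcached operation timeout

        long perSessionMs = failureThreshold * (idleTimeoutMs + opTimeoutMs);
        System.out.println(perSessionMs / 1000 + " s"); // prints "70 s"
    }
}
```

Even this optimistic lower bound is 70 seconds per session, which makes the multi-minute failover window unsurprising.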

2.2 Problem Analysis

2.2.1 Source Dependency: the application depends on xmemcached-2.0.0.jar and configures a 2-second operation timeout.

<property name="sessionLocator">
  <bean class="net.rubyeye.xmemcached.impl.KetamaMemcachedSessionLocator"/>
</property>

2.2.2 Source Code: the client maps keys to sessions with a Ketama consistent-hash ring. Each server receives virtual nodes (NUM_REPS = 160) scaled by its weight, so the weights above yield 480 ring entries (320 for cache1 at weight 2, 160 for cache2 at weight 1). When a server disappears, the ring is rebuilt; until then, the same key continues to map to the same server, and the server is removed only after all of its sessions are closed.

static final int NUM_REPS = 160;
TreeMap<Long, List<Session>> ketamaSessions = new TreeMap<Long, List<Session>>();
...
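The lookup on that ring can be sketched as follows. This is an illustrative, self-contained version rather than the xmemcached source: the real locator stores `List<Session>` values and derives its hashes slightly differently, but the core idea is the same, i.e. hash every virtual node onto a `TreeMap` and route each key to the first ring entry at or above its hash, wrapping around at the end.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal Ketama-style sketch (not the xmemcached source).
public class KetamaSketch {
    static final int NUM_REPS = 160;            // virtual nodes per unit of weight
    final TreeMap<Long, String> ring = new TreeMap<>();

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);        // MD5 is guaranteed by the JDK
        }
    }

    // Extract the h-th unsigned 32-bit slice of an MD5 digest.
    static long sliceHash(byte[] d, int h) {
        return ((long) (d[3 + h * 4] & 0xFF) << 24) | ((long) (d[2 + h * 4] & 0xFF) << 16)
             | ((long) (d[1 + h * 4] & 0xFF) << 8) | (d[h * 4] & 0xFF);
    }

    // Four 32-bit slices per digest, so NUM_REPS * weight virtual nodes total.
    void addServer(String server, int weight) {
        for (int i = 0; i < NUM_REPS * weight / 4; i++) {
            byte[] d = md5(server + "-" + i);
            for (int h = 0; h < 4; h++) {
                ring.put(sliceHash(d, h), server);
            }
        }
    }

    // First entry at or above the key's hash owns the key; wrap to the start.
    String locate(String key) {
        long h = sliceHash(md5(key), 0);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        KetamaSketch ks = new KetamaSketch();
        ks.addServer("cache1", 2);  // weight 2 -> 320 virtual nodes
        ks.addServer("cache2", 1);  // weight 1 -> 160 virtual nodes
        System.out.println(ks.ring.size());       // ~480, barring rare 32-bit collisions
        System.out.println(ks.locate("order:12345"));
    }
}
```

With the weights from the configuration above, this reproduces the 320 + 160 virtual-node split the article describes.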

The heartbeat mechanism periodically checks idle sessions. If a session is idle for more than the configured timeout (default 5 s), a heartbeat command is sent. Failure counts are tracked, and after a configurable number of failures the session is closed.

public void checkIdle(Session session) {
    if (controller.getSessionIdleTimeout() > 0 && session.isIdle()) {
        ((NioSession) session).onEvent(EventType.IDLE, selector);
    }
}
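The close-after-N-failures behavior can be sketched with a small standalone model. This is illustrative rather than the xmemcached source; the 10-failure threshold matches the behavior described in this article, and the reset-on-success rule is an assumption of the sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model of heartbeat failure counting (not the xmemcached source):
// each failed heartbeat bumps a counter and the session is closed once the
// threshold is reached; a successful heartbeat resets the counter.
class HeartbeatTracker {
    static final int MAX_FAIL = 10;            // threshold cited in this article
    private final AtomicInteger fails = new AtomicInteger();
    private volatile boolean closed = false;

    /** Record one heartbeat result; returns true once the session is closed. */
    boolean onHeartbeat(boolean success) {
        if (success) {
            fails.set(0);                      // assumed: success resets the count
            return closed;
        }
        if (fails.incrementAndGet() >= MAX_FAIL) {
            closed = true;                     // real client closes the session
        }                                      // and rebuilds the Ketama ring
        return closed;
    }

    boolean isClosed() {
        return closed;
    }
}
```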

Reactor threads (equal to the number of CPU cores) manage selectors; when a selector returns zero selected keys, idle events are triggered, leading to heartbeat execution via a dedicated thread pool.

ThreadPoolExecutor heartBeatThreadPool = new ThreadPoolExecutor(
    1, MAX_HEARTBEAT_THREADS, keepAliveTime, TimeUnit.MILLISECONDS,
    new SynchronousQueue<>(), new ThreadFactory(){...}, new DiscardPolicy());

When a session fails heartbeat 10 times, it is closed and the ring is rebuilt, allowing traffic to shift to the remaining server.
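The effect of that rebuild can be shown with a toy ring. The hash values here are invented for illustration; the point is that once a dead server's virtual nodes are dropped, every key it owned falls through to the next surviving node on the ring.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy demonstration of the ring rebuild (hash values are made up).
public class RingRebuildSketch {
    // First ring entry at or above the hash owns the key; wrap to the start.
    static String owner(TreeMap<Long, String> ring, long hash) {
        SortedMap<Long, String> tail = ring.tailMap(hash);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(100L, "cache1");
        ring.put(200L, "cache2");
        ring.put(300L, "cache1");

        long keyHash = 250L;                          // a key owned by cache1
        System.out.println(owner(ring, keyHash));     // prints "cache1"

        ring.values().removeIf("cache1"::equals);     // all cache1 sessions closed
        System.out.println(owner(ring, keyHash));     // prints "cache2"
    }
}
```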

3. Verification

Offline tests with iptables dropping traffic to cache1 showed that increasing cache2 QPS accelerated session closure on cache1, confirming that higher load on the surviving server speeds up failover.

4. Solution

Adjust the memcached operation timeout from 2 s to 200 ms, lower the heartbeat failure threshold, and reduce the session pool size from 50 to two per machine. This shortens the failover window and prevents prolonged payment latency.
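As a sketch of how two of these settings map onto the XMemcached builder API: the host addresses below are placeholders, and the heartbeat failure threshold is not exposed as a simple public setter in the 2.0.x client, so only the operation timeout and pool size are shown.

```java
import java.io.IOException;
import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.MemcachedClientBuilder;
import net.rubyeye.xmemcached.XMemcachedClientBuilder;
import net.rubyeye.xmemcached.impl.KetamaMemcachedSessionLocator;
import net.rubyeye.xmemcached.utils.AddrUtil;

public class TunedClientFactory {
    public static MemcachedClient build() throws IOException {
        // Placeholder addresses; substitute the real cache hosts.
        MemcachedClientBuilder builder = new XMemcachedClientBuilder(
                AddrUtil.getAddresses("cache1.example:6666 cache2.example:6666"));
        builder.setSessionLocator(new KetamaMemcachedSessionLocator());
        builder.setConnectionPoolSize(2);   // down from 50 sessions per machine
        MemcachedClient client = builder.build();
        client.setOpTimeout(200L);          // operation timeout: 2 s -> 200 ms
        return client;
    }
}
```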

Key configuration snippets:

memcached.server1.host=12.22.12.233
memcached.server1.port=6666
memcached.server1.weight=2
memcached.server2.host=12.22.12.233
memcached.server2.port=6667
memcached.server2.weight=1
memcached.connection.pool.size=50

After applying these changes, the system recovered within seconds during subsequent failure simulations.

Tags: heartbeat, consistent hashing, Memcached, failure analysis, xmemcached
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.