
Uncovering Hidden Cache Failures: A Robustness Test of Memcached with spymemcached

This article presents a comprehensive robustness test of a core public service system's caching layer. It shows how taking Memcached (MC) instances offline dramatically degrades TPS and latency, traces the cause to the Ketama consistent-hash algorithm in the spymemcached client, and proposes concrete improvements to mitigate such failures.

Vipshop Quality Engineering

Robustness Test

The system, a core public service platform, normally handles 330,000 QPS with sub-millisecond latency. That performance, however, assumes every component is healthy. This article explores what happens when components fail, presenting a full robustness test that uncovered an issue in a middleware algorithm.

Interface Cache Mechanism

The interface layer uses a distributed cache (Memcached) and a local cache. The cache lookup flow is:

When a request arrives, look up in Memcached (MC); if hit, return the value.

If MC miss, look up the local cache; if hit, return the value and asynchronously write it back to MC.

If both caches miss, fetch from the DB, return the result, and asynchronously write to the local cache then to MC.
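The three steps above can be sketched as a simplified two-level lookup. This is an illustration only: class and field names are hypothetical, plain maps stand in for Memcached and MySQL, and the write-backs are shown synchronously, whereas the real service performs them asynchronously.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified two-level cache lookup: distributed MC first, then the local
// cache, then the DB, with write-back on the slower paths. Write-backs are
// synchronous here for clarity; the real service does them asynchronously.
public class TwoLevelCacheSketch {
    static final Map<String, String> mc = new HashMap<>();     // stands in for Memcached
    static final Map<String, String> local = new HashMap<>();  // local in-process cache
    static final Map<String, String> db = new HashMap<>();     // stands in for MySQL

    public static String lookup(String key) {
        String v = mc.get(key);                // 1. try Memcached
        if (v != null) return v;
        v = local.get(key);                    // 2. try the local cache
        if (v != null) {
            mc.put(key, v);                    //    write back to MC
            return v;
        }
        v = db.get(key);                       // 3. fall back to the DB
        if (v != null) {
            local.put(key, v);                 //    write back to the local cache,
            mc.put(key, v);                    //    then to MC
        }
        return v;
    }
}
```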

Test Plan

100 SKUs are mapped to keys in MC, distributed evenly across 15 MC instances. Each request carries 10 or 100 SKUs and runs on 10 load generators with 10 threads each. MC instances are shut down one by one until only one remains, then restarted, to observe how the service degrades and recovers.

Test Results

Scenario 1: 10 SKUs per request

Scenario 2: 100 SKUs per request

Observed Degradation When Alive MC Instances Drop Below 10

When the number of alive MC instances falls from 15 to 10, results match expectations. With only 9 instances alive, however, TPS drops sharply and response time climbs; under the 100-SKU load, TPS falls to less than one-tenth of the original and latency rises more than 20-fold.

Problem Discovery Process

According to the original design, when an MC instance goes down, its key‑value pairs should be rehashed to surviving nodes, and subsequent requests should retrieve the data from those nodes. The test confirmed that rehashing occurs, but the interface service sometimes reports the key as missing, contradicting the design.

Memcached

When TPS declines sharply, the hit rate of the remaining MC instances is examined. It fluctuates widely, which is inconsistent with the expected behavior.

MySQL

Database QPS remains around 50 throughout the test, indicating that most interface responses are served from the cache rather than the DB.

Mechanism and Algorithm

The system communicates with the Memcached cluster via the Java client spymemcached, which relies on Java NIO for asynchronous, single-threaded operation.

spymemcached Overview

Features

Efficient storage

Strong tolerance to server and network interruptions

Full asynchronous API

Single‑threaded simplicity and efficiency

Optimized for high‑throughput processing

Core Components

MemcachedClient: API façade exposed to users.

MemcachedConnection: Manages connections to MC instances.

NodeLocator: Finds the MC node for a given key; supports algorithms such as ArrayModNodeLocator and KetamaNodeLocator. The system uses the Ketama consistent-hash implementation.

MemcachedNode: Handles the actual NIO-based communication with a single MC instance.

Operation: Abstract representation of any MC operation passed between MemcachedConnection and MemcachedNode.

Ketama Algorithm

Each MC instance is represented by 128 virtual nodes on the hash ring. When a key is hashed, it maps to a virtual node; if the corresponding MC instance is alive, the value is returned. If not, the next virtual node is tried, up to seven attempts before the request fails.

public final class KetamaNodeLocator extends SpyObject implements NodeLocator {
    ...
    public Iterator<MemcachedNode> getSequence(String k){
        // Seven searches gives us a 1 in 2^7 chance of hitting the same dead node all of the time.
        return new KetamaIterator(k, 7, getKetamaNodes(), hashAlg);
    }
    ...
}

If after seven attempts no alive MC is found, the request waits 400 ms before returning.
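The ring lookup and the seven-attempt limit can be sketched as follows. This is a simplified illustration, not spymemcached's code: names are hypothetical, 128 virtual nodes per instance are placed via MD5, and each failed attempt simply steps to the next virtual node, whereas the real KetamaIterator rehashes the key on each attempt.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative consistent-hash ring: 128 virtual nodes per MC instance,
// clockwise search, and at most 7 attempts to find a live instance
// (mirroring the retry limit in spymemcached's KetamaIterator).
public class KetamaRingSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Set<String> dead = new HashSet<>();

    public KetamaRingSketch(List<String> instances) {
        for (String inst : instances)
            for (int i = 0; i < 128; i++)
                ring.put(hash(inst + "-" + i), inst);
    }

    public void markDead(String inst) { dead.add(inst); }

    // Returns the instance serving the key, or null if all 7 attempts land
    // on dead nodes (the caller then waits ~400 ms and fails the request).
    public String locate(String key) {
        long h = hash(key);
        for (int attempt = 0; attempt < 7; attempt++) {
            Map.Entry<Long, String> e = ring.ceilingEntry(h);
            if (e == null) e = ring.firstEntry();   // wrap around the ring
            if (!dead.contains(e.getValue())) return e.getValue();
            h = e.getKey() + 1;                     // probe the next virtual node
        }
        return null;
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                 | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (Exception ex) { throw new RuntimeException(ex); }
    }
}
```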

Impact of Ketama on Test Scenario

In this scenario each SKU maps to a distinct MC instance; with 15 × 128 virtual nodes on the ring, a higher proportion of downed MC instances dramatically increases the chance of missing a live node, especially when many keys are requested simultaneously.
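To put rough numbers on this, assume each of a key's seven attempts lands on an instance uniformly and independently (an approximation; consecutive virtual nodes are not truly independent). Then with d of 15 instances down, a single key fails all seven attempts with probability about (d/15)^7, and a multi-SKU request stalls if any of its keys fails:

```java
// Rough failure-probability estimate under a uniform-independence assumption:
// p(key fails) ≈ (dead/total)^7, and a request of n SKUs stalls with
// probability 1 - (1 - p)^n. Each stall costs roughly the 400 ms timeout.
public class FailureOdds {
    public static double keyFailProb(int dead, int total) {
        return Math.pow((double) dead / total, 7);
    }

    public static double requestStallProb(int dead, int total, int skus) {
        return 1.0 - Math.pow(1.0 - keyFailProb(dead, total), skus);
    }

    public static void main(String[] args) {
        // With 6 of 15 instances down (9 alive): each key misses only ~0.16%
        // of the time, but a 100-SKU request stalls roughly 15% of the time.
        System.out.printf("per-key: %.4f%n", keyFailProb(6, 15));
        System.out.printf("per-request (100 SKUs): %.4f%n", requestStallProb(6, 15, 100));
    }
}
```

Under this estimate, a 10-SKU request stalls far less often than a 100-SKU one at the same death rate, which matches the gap between the two test scenarios.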

Improvement Suggestions

Paginate requests so that each batch contains no more than 10 SKUs; with one‑third of MC instances still alive, TPS only drops to about half of the original, which is acceptable.
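Pagination can be as simple as splitting the SKU list into fixed-size batches before hitting the cache, so that one slow or failed lookup delays only a small batch rather than the whole request. A minimal sketch (the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Split a list of SKUs into batches of at most pageSize items so a single
// stalled cache lookup affects only one small batch, not the entire request.
public class SkuPaginator {
    public static <T> List<List<T>> paginate(List<T> items, int pageSize) {
        List<List<T>> pages = new ArrayList<>();
        for (int i = 0; i < items.size(); i += pageSize)
            pages.add(new ArrayList<>(items.subList(i, Math.min(i + pageSize, items.size()))));
        return pages;
    }
}
```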

Introduce a custom NodeLocator subclass that exposes the retry count, e.g.:

public Iterator<MemcachedNode> getSequence(String k) {
    return new KetamaIterator(k, retryTimes, getKetamaNodes(), hashAlg);
}

Expose the default 400 ms timeout for possible intervention.

Takeaways

The root cause is now clear, and the issue likely exists in any system using spymemcached. Immediate cache‑logic audits and business/system‑logic refinements are recommended to prevent similar failures.

Final Note

All middleware clusters (e.g., MC, Redis, Zookeeper) can be evaluated with similar single‑instance/multi‑instance fault simulations to assess failover behavior and performance during degradation.

Tags: backend testing, consistent hashing, distributed caching, Memcached, cache robustness, Ketama algorithm, spymemcached