How to Master In‑Memory Caching: Strategies, Pitfalls, and Performance Boosts

This article covers common caching scenarios, selection criteria, and pitfalls. It walks through a Go demo with configurable cache modes, compares byte and struct caches, discusses concurrency, expiration, failover, cache transfer, lock contention, and memory management, and presents benchmark results to guide high-performance backend development.

MaGe Linux Operations

Feb 1, 2024

Origin

Caching is one of the simplest ways to improve performance, but it is not a panacea. In some cases, transactional or consistency constraints prevent reuse of computed results, and cache invalidation is famously counted among the two hard problems in computer science.

If operations are limited to immutable data, cache invalidation is unnecessary and caching merely reduces network overhead. When mutable data must be synchronized, cache invalidation becomes critical.

The simplest invalidation method is TTL‑based expiration. Although event‑driven invalidation can be more precise, TTL is simpler and more portable, especially when events may be delayed or lost.

Short TTLs often represent a trade‑off between performance and consistency, acting as a barrier to reduce load on the data source under high traffic.

Demo Application

The demo service accepts a URL with query parameters and returns a JSON object. Because the data is stored in a database, each uncached request is relatively slow.

It uses the plt tool for load testing with the following parameters:

cardinality: number of unique URLs generated, affecting the cache hit rate

group: number of similar requests sent concurrently, simulating concurrent access to the same key

go run ./cmd/cplt --cardinality 10000 --group 100 --live-ui --duration 10h --rate-limit 5000 curl --concurrency 200 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'

The command launches a client that cycles through 10,000 distinct URLs at up to 5,000 requests per second, with at most 200 concurrent connections, sending each URL in groups of 100 similar requests. The --live-ui flag displays real-time metrics while the test runs.

The demo defines three cache modes via the CACHE environment variable:

none: no caching; every request hits the database

naive: a simple map with a 3-minute TTL

advanced: the github.com/bool64/cache library, with many performance-enhancing features and a 3-minute TTL

The code resides at github.com/vearutop/cache-story and can be started with make start-deps run.

Without caching, the system reaches a maximum of 500 RPS; once concurrent connections exceed 130, the database starts failing with “Too many connections”.

Using the advanced cache increases throughput by 60×, reduces latency, and eases DB pressure.

go run ./cmd/cplt --cardinality 10000 --group 1 --live-ui --duration 10h curl --concurrency 100 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'

Requests per second: 25064.03
Successful requests: 15692019
Time spent: 10m26.078s
Request latency percentiles:
99%: 28.22ms
95%: 13.87ms
90%: 9.77ms
50%: 2.29ms

Byte vs. Struct Cache

Byte cache ([]byte) advantages:

Immutable data, because every read decodes a fresh copy

Less memory fragmentation

GC‑friendly (no traversable objects)

Easy to transmit over the wire

Precise memory limiting

Its main drawback is the overhead of encoding/decoding, which can be significant in hot loops.

Struct cache advantages:

No encoding/decoding on read

Richer expressiveness; can cache non‑serializable content

Struct cache disadvantages:

Values can be mutated accidentally by callers

Higher memory sparsity

GC pressure from long-lived structs

Total memory usage cannot be tightly bounded

This article uses a struct cache.
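The mutation risk is easiest to see side by side. The following is a minimal sketch, not code from the demo: the Greeting type, the two package-level maps, and the helper names are assumptions made for illustration, JSON stands in for whatever encoding a real byte cache would use, and the maps are not safe for concurrent use.

```go
package main

import "encoding/json"

// Greeting is a hypothetical cached value used only for illustration.
type Greeting struct {
	Message string
	Tags    []string
}

// structCache returns the same pointer to every reader.
var structCache = map[string]*Greeting{}

// byteCache stores the encoded form, so every read decodes a fresh copy.
var byteCache = map[string][]byte{}

// put stores g in both caches.
func put(key string, g Greeting) error {
	b, err := json.Marshal(g)
	if err != nil {
		return err
	}
	byteCache[key] = b
	structCache[key] = &g
	return nil
}

// getStruct is cheap, but all callers share one mutable value.
func getStruct(key string) *Greeting {
	return structCache[key]
}

// getBytes pays for decoding on every read, but callers get private copies.
func getBytes(key string) (*Greeting, error) {
	var g Greeting
	if err := json.Unmarshal(byteCache[key], &g); err != nil {
		return nil, err
	}
	return &g, nil
}
```

A caller that modifies the result of getStruct changes what every later reader sees, while a caller that modifies the result of getBytes only changes its own copy.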

Naive Cache

The naive implementation protects a map with a mutex. On a cache miss, it builds the value from the data source, stores it, and returns it. The logic is simple, but it suffers from the issues described in the following sections.

Concurrent Updates

When multiple callers miss the same key simultaneously, they all try to build the value, which causes lock contention or a cache stampede against the data source. If a build fails, callers also fail, even if a usable stale value is still in the cache.

Using low cardinality and high group values can simulate this problem:

go run ./cmd/cplt --cardinality 100 --group 1000 --live-ui --duration 10h --rate-limit 5000 curl --concurrency 150 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'

In this scenario, the naive cache suffers from severe latency and DB-operation spikes, while the advanced cache mitigates these issues.
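One common way to avoid duplicate builds is to let the first caller build the value while later callers for the same key wait for its result. The sketch below is a hand-rolled illustration of that idea and is not how the advanced cache is necessarily implemented; the Coalescer type and its fields are assumptions. The golang.org/x/sync/singleflight package offers a maintained implementation of the same technique.

```go
package main

import "sync"

// call tracks one in-flight build shared by all waiters for a key.
type call struct {
	done chan struct{}
	val  string
	err  error
	dups int // number of callers that joined this build instead of starting one
}

// Coalescer ensures that at most one build per key runs at a time.
type Coalescer struct {
	mu       sync.Mutex
	inFlight map[string]*call
}

func NewCoalescer() *Coalescer {
	return &Coalescer{inFlight: make(map[string]*call)}
}

// Do runs build for key unless a build for that key is already running,
// in which case it waits for that build and returns its result.
// A panic inside build would leave waiters blocked; real code should guard
// against that.
func (g *Coalescer) Do(key string, build func() (string, error)) (string, error) {
	g.mu.Lock()
	if c, ok := g.inFlight[key]; ok {
		c.dups++
		g.mu.Unlock()
		<-c.done
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	g.inFlight[key] = c
	g.mu.Unlock()

	c.val, c.err = build()

	g.mu.Lock()
	delete(g.inFlight, key)
	g.mu.Unlock()
	close(c.done) // publishes val and err to all waiters

	return c.val, c.err
}
```

Placed between the cache lookup and the data source, such a wrapper turns a burst of identical misses into a single database query.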

Background Updates

When a cache entry expires, rebuilding it may be slow. Synchronous rebuilding increases tail latency; pre-building hot items, or serving stale data while the entry is rebuilt in the background, improves responsiveness. A background rebuild must not inherit cancellation from the request's context, because that context is cancelled as soon as the request finishes.

Synchronized Expiration

TTL-based caches can cause a “thundering herd” when many entries expire at the same moment, leading to a sudden load spike on the data source. Adding jitter to each entry's TTL (for example 10%, so that entries expire between 0.95 × TTL and 1.05 × TTL) spreads expirations over time and mitigates the problem.

A simulation with high cardinality and high concurrency illustrates the issue and the benefit of jittered expiration:

go run ./cmd/cplt --cardinality 10000 --group 1 --live-ui --duration 10h --rate-limit 5000 curl --concurrency 200 -X 'GET' 'http://127.0.0.1:8008/hello?name=World&locale=ru-RU' -H 'accept: application/json'

Caching Errors

If value construction fails and the error is returned without being cached, every subsequent request for that key retries the build, so all traffic bypasses the cache and can overwhelm the data source. Caching errors with a short TTL is crucial for high-load systems.

Failover Mode

Serving slightly stale data is often preferable to returning an error, especially when the data has just expired. This trade‑off improves resilience in distributed systems.

Cache Transfer

A new instance that starts with an empty cache suffers a cold start. Pre-warming by traversing recent data, or transferring cache state from an active instance, reduces this penalty. In the demo, transfer is exposed through the HTTP endpoint /debug/transfer-cache; such an endpoint must not be publicly reachable, because the dump may contain sensitive data.

CACHE_TRANSFER_URL=http://127.0.0.1:8008/debug/transfer-cache HTTP_LISTEN_ADDR=127.0.0.1:8009 go run main.go

2022-05-09T02:33:42.871+0200 INFO cache/http.go:282 cache restored {"processed":10000,"elapsed":"12.963942ms","speed":"39.564084 MB/s","bytes":537846}
2022-05-09T02:33:42.874+0200 INFO brick/http.go:66 starting server, Swagger UI at http://127.0.0.1:8009/docs
2022-05-09T02:34:01.162+0200 INFO cache/http.go:175 cache dump finished {"processed":10000,"elapsed":"12.654621ms","bytes":537846,"speed":"40.530944 MB/s","name":"greetings","trace.id":"31aeeb8e9e622b3cd3e1aa29fa3334af","transaction.id":"a0e8d90542325ab4"}

Transfer eliminates the warm‑up penalty and also enables developers to copy production cache data locally for debugging.

Lock Contention and Underlying Performance

Most cache implementations use a key-value map protected by a mutex. For read-heavy workloads, a simple sync.Map or a sharded map suffices; for write-heavy workloads, more advanced structures such as github.com/puzpuzpuz/xsync.Map (based on a Cache-Line Hash Table, CLHT) or map sharding reduce contention.
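Map sharding splits one contended lock into many. The sketch below is an assumption-laden illustration rather than the shardedMap used in the benchmark: the shard count of 32 and the FNV-1a hash are arbitrary choices.

```go
package main

import (
	"hash/fnv"
	"sync"
)

const shardCount = 32

// shard is one independently locked slice of the key space.
type shard struct {
	mu    sync.RWMutex
	items map[string]string
}

// ShardedMap spreads keys over shardCount maps so that operations on
// different shards do not contend for the same lock.
type ShardedMap struct {
	shards [shardCount]shard
}

func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i].items = make(map[string]string)
	}
	return m
}

// shardFor picks a shard from the FNV-1a hash of the key.
func (m *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return &m.shards[h.Sum32()%shardCount]
}

func (m *ShardedMap) Get(key string) (string, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.items[key]
	return v, ok
}

func (m *ShardedMap) Set(key, value string) {
	s := m.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = value
}
```

Writers to different shards proceed in parallel, at the cost of hashing the key on every operation.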

Memory Management

Cache size must be bounded. Eviction strategies (LFU, LRU, FIFO, random) balance CPU/memory usage against hit/miss rates. For byte caches, memory usage can be precisely controlled; for struct caches, estimating heap impact is harder, so eviction may be based on a percentage of used memory.
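As one example of an eviction strategy, the sketch below bounds a cache by entry count and evicts the least recently used item. The LRU type is an assumption for illustration, it is not safe for concurrent use without an external lock, and an entry count is only a rough proxy for memory usage.

```go
package main

import "container/list"

// lruEntry is the payload stored in each list element.
type lruEntry struct {
	key   string
	value string
}

// LRU holds at most capacity entries and evicts the least recently used.
type LRU struct {
	capacity int
	order    *list.List // front is the most recently used entry
	items    map[string]*list.Element
}

func NewLRU(capacity int) *LRU {
	return &LRU{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the value and marks the entry as recently used.
func (c *LRU) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*lruEntry).value, true
}

// Set stores the value and evicts the oldest entry when over capacity.
func (c *LRU) Set(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&lruEntry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
	}
}
```

LRU keeps bookkeeping on every read, which is part of the CPU cost that the choice of eviction strategy trades against hit rate.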

Benchmark

A benchmark that stores 1 M small structs, run once with 10% writes and once with 0.1% writes, compares several implementations:

Implementation   Memory    Time/op (10% writes)   Time/op (0.1% writes)
sync.Map         192 MB    142 ns                 29.8 ns
shardedMap       196 MB    53.3 ns                28.4 ns
mutexMap         182 MB    226 ns                 207 ns
rwMutexMap       182 MB    233 ns                 67.8 ns
ristretto        346 MB    167 ns                 54.1 ns
xsync.Map        380 MB    31.4 ns                22.0 ns
bigcache         340 MB    75.8 ns                72.9 ns
freecache        333 MB    98.1 ns                77.8 ns
fastcache        44.9 MB   60.6 ns                64.1 ns

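Byte caches are not always more memory-efficient: bigcache (340 MB) and freecache (333 MB) use more memory here than the plain mutex-protected maps (182 MB). For CPU-intensive workloads, xsync.Map is the fastest, while fastcache has by far the lowest memory usage (44.9 MB) when serialization is possible.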

Developer Friendliness

Caches can exacerbate bugs, so safe invalidation is essential. Deleting all entries at once can overload the data source; it is safer to mark entries as expired and update them in the background, serving stale data during rebuilds. Selective cache bypass via a request header can aid debugging but must be used cautiously.

Conclusion

The article compares byte-based and struct-based caches, discusses cache stampedes, error caching, pre-warming, transfer, failover, and eviction, and presents benchmark results for several popular Go cache libraries, providing guidance for building robust, high-performance backend systems.

