How False Sharing in CPU Cache Slowed Down Our Database Table Scans—and How We Fixed It

During a client's upgrade test, full-table scans on compressed tables slowed down severely as the number of concurrent scans grew. We traced the slowdown to CPU cache line false sharing in the decompression code: Linux perf tools pinpointed the hotspot, and aligning the affected memory to a cache-line boundary restored performance.

dbaplus Community

Background

A financial‑industry client upgraded to a newer version of our database and observed that full‑table scans on compressed tables became dramatically slower as the number of concurrent scans increased, while uncompressed tables showed no such degradation. The issue was unexpected because the compression feature had been stable for years.

Investigation Approach

Because we could not access the client’s production system, we reproduced the problem locally. Initial hypotheses focused on configuration parameters, but disabling a compression‑related setting only yielded modest improvement. The next suspect was lock contention, yet system‑wide lock statistics showed no significant contention. The investigation then turned to CPU‑cache behavior.

Using Linux perf to Locate Hotspots

We employed the Linux perf suite:

perf stat -e task-clock -e cycles -e stalled-cycles-frontend -e stalled-cycles-backend -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e L1-dcache-prefetches -e L1-dcache-prefetch-misses -e LLC-loads -e LLC-load-misses -p 29261 sleep 50

The output showed that stalled CPU cycles exceeded 80 %.

perf record -p 29261 -g sleep 50
perf report > perf_record.out

Analyzing the report revealed two hotspot functions involved in data decompression. Using perf annotate we drilled down to a single line that consumed 67 % of CPU time:

perf annotate --symbol=decompress_xxx_by_yyy

The corresponding source line (variables renamed for IP protection) was:

new_index = decompressinfo->xxx_arrayC[decompressinfo->xxx_arrayB[index]];

And the array access itself:

xxx_arrayC[temp_index]
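
The hotspot line is a chain of dependent loads: the value read from xxx_arrayB is the index into xxx_arrayC, so the second load cannot start until the first completes. The sketch below spells that out. The struct name `decompress_info_lookup`, the pointer fields, the `uint32_t` element type, and the helper `next_index` are all assumptions made for illustration; the article does not give the real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the two field names come from the article.
 * The pointer fields and uint32_t element type are assumptions. */
struct decompress_info_lookup {
    const uint32_t *xxx_arrayB;
    const uint32_t *xxx_arrayC;
};

/* Same shape as the hotspot line, split into its two dependent loads. */
static uint32_t next_index(const struct decompress_info_lookup *decompressinfo,
                           uint32_t index)
{
    /* Load 1: fetch the intermediate index from xxx_arrayB. */
    uint32_t temp_index = decompressinfo->xxx_arrayB[index];

    /* Load 2: cannot be issued until load 1 has returned temp_index. */
    return decompressinfo->xxx_arrayC[temp_index];
}
```

Because the loads are serialized, any extra latency on one of them (for example, a cache line that has just been invalidated by another core) shows up directly as stalled cycles on this line.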

Root Cause: False Sharing

The decompressinfo structure (≈48 bytes) is allocated from a memory pool and contains several pointer arrays. Because the structure is smaller than a cache line (64 bytes on most x86 CPUs) and is not aligned to one, its layout can cause the arrayC region to share a cache line with adjacent allocations. On a NUMA system with multiple cores, concurrent tasks writing to neighboring memory cause the shared cache line to bounce between cores, even though the tasks never touch the same data. This is a classic false-sharing scenario that inflates latency and stalls CPU cycles.
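
The arithmetic behind this can be checked with a small sketch. It assumes 64-byte cache lines, a pool that packs slots back to back from a line-aligned base, and a hypothetical 48-byte stand-in struct `pool_slot_sketch`; none of these details are confirmed by the source, and the helper names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, typical of current x86-64 CPUs. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical stand-in for the real structure: six 8-byte fields,
 * 48 bytes in total, matching the approximate size quoted above. */
struct pool_slot_sketch {
    uint64_t fields[6];
};

static_assert(sizeof(struct pool_slot_sketch) == 48,
              "sketch assumes a 48-byte structure");

/* Cache line index of a byte offset measured from a line-aligned pool base. */
static size_t cache_line_of(size_t byte_offset)
{
    return byte_offset / CACHE_LINE_SIZE;
}

/* True if slots i and j of a tightly packed pool touch a common cache line. */
static bool slots_share_line(size_t i, size_t j, size_t slot_size)
{
    size_t i_first = cache_line_of(i * slot_size);
    size_t i_last  = cache_line_of(i * slot_size + slot_size - 1);
    size_t j_first = cache_line_of(j * slot_size);
    size_t j_last  = cache_line_of(j * slot_size + slot_size - 1);

    /* Two closed ranges of line indices overlap. */
    return i_first <= j_last && j_first <= i_last;
}
```

With 48-byte slots, slot 0 covers bytes 0 to 47 and slot 1 covers bytes 48 to 95, so both touch line 0 and `slots_share_line(0, 1, 48)` is true. With 64-byte slots every slot owns exactly one line and neighbors never overlap.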

[Figure: memory layout and cache-line interaction]

Fix Implementation

We realigned the decompressinfo allocation to a cache‑line boundary, ensuring that its fields no longer share a line with other concurrently written data. After rebuilding and redeploying the database, we re‑ran the perf suite.
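
One common way to get this kind of alignment in C11 is to pad the type with alignas and allocate it with aligned_alloc. The sketch below is a generic illustration of that technique, not the actual patch: the struct `decompress_info_aligned`, its fields, the helper `alloc_infos`, and the 64-byte line size are assumptions, and the real database allocates from its own memory pool.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines, typical of current x86-64 CPUs. */
#define CACHE_LINE_SIZE 64

/* Hypothetical stand-in for the real structure. alignas on the first
 * member raises the struct's alignment to 64 bytes, and the compiler
 * pads sizeof from 48 up to 64 so array elements never share a line. */
struct decompress_info_aligned {
    alignas(CACHE_LINE_SIZE) uint64_t fields[6];
};

static_assert(sizeof(struct decompress_info_aligned) == CACHE_LINE_SIZE,
              "each element should occupy exactly one cache line");
static_assert(alignof(struct decompress_info_aligned) == CACHE_LINE_SIZE,
              "each element should start on a cache-line boundary");

/* Allocate count elements starting on a cache-line boundary.
 * Returns NULL on failure; release with free(). The multiplication is
 * not overflow-checked, so callers should keep count reasonable. */
static struct decompress_info_aligned *alloc_infos(size_t count)
{
    /* aligned_alloc (C11) expects the size to be a multiple of the
     * alignment, which holds here because sizeof is exactly 64. */
    return aligned_alloc(CACHE_LINE_SIZE,
                         count * sizeof(struct decompress_info_aligned));
}
```

The trade-off is memory: each element grows from 48 to 64 bytes in this sketch, which is usually acceptable for a small, frequently accessed structure.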

Results

[Figures: perf measurements before and after the optimization]

The hotspot line’s CPU consumption dropped dramatically, and overall metrics improved:

Stalled CPU cycles decreased markedly, indicating better CPU utilization.

L1‑dcache loads increased, showing more data fetched from low‑latency cache.

LLC loads fell, confirming reduced reliance on slower L3 cache.

Conclusion

The investigation demonstrates that subtle false‑sharing issues can cause severe performance degradation in high‑concurrency database workloads. Careful use of profiling tools, understanding of CPU cache architecture, and proper memory alignment are essential for diagnosing and fixing such problems.
