Why a Tiny Memory‑Intensive Process Caused 100× Latency Spikes After Pinterest’s Search Migration to Kubernetes

During Pinterest's migration of its high-traffic Manas search platform to the PinCompute Kubernetes environment, engineers observed an extremely rare latency outlier: about one request in a million took 100 times longer than normal. A deep investigation traced the root cause to cAdvisor's memory-intensive smaps scans interfering with leaf-node processing.

DevOps Coach

1. Migration Overview

Pinterest’s core search service, Manas, powers millions of daily queries across the home feed, search bar, and recommendation widgets. To improve stability and maintainability, the team rewrote Manas for PinCompute, Pinterest’s internal Kubernetes platform, adding a custom operator, GitOps configuration, Envoy, and Spinnaker.

2. Symptom Discovered

In early 2025, performance validation on the new cluster revealed a rare but critical anomaly: roughly one request in a million took up to five seconds, on the order of 100× the normal 60 ms target. The outliers appeared only on leaf nodes and showed up as sharp spikes in the P100 (maximum) latency metric, while P99 remained within expected bounds.
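A small sketch shows why such an outlier is invisible at P99 yet dominates P100. The sample is hypothetical: one million requests at exactly the 60 ms target and a single request at five seconds. The nearest-rank percentile helper is an illustration, not Pinterest's metrics pipeline.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list; p is in (0, 100]."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]

# Hypothetical sample: 999,999 requests at the 60 ms target, one at 5 s.
latencies_ms = sorted([60.0] * 999_999 + [5000.0])

p99 = percentile(latencies_ms, 99)
p9999 = percentile(latencies_ms, 99.99)
p100 = percentile(latencies_ms, 100)  # P100 is simply the maximum

print(p99, p9999, p100)  # 60.0 60.0 5000.0
```

Any percentile below 99.9999 ignores a one-in-a-million event, so only the maximum exposes it.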

3. Initial Investigation

The team first broke down the Manas request flow. Manas uses a two‑layer fan‑out architecture: a root node receives external queries and dispatches them to many leaf nodes, each handling a shard of the index. The leaf‑node processing consists of four stages—query parsing, candidate retrieval via memory‑mapped structures, sorting/hydration, and result return.
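Fan-out is the reason a rare leaf-level outlier matters. The sketch below makes two assumptions the article does not state: the root waits for every leaf before answering, and leaf slowdowns are independent. The fan-out of 100 is a hypothetical figure.

```python
def root_slow_probability(leaf_slow_probability: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent leaf calls is slow."""
    return 1.0 - (1.0 - leaf_slow_probability) ** fan_out

def root_latency_ms(leaf_latencies_ms):
    """A root that waits for all leaves is as slow as its slowest leaf."""
    return max(leaf_latencies_ms)

# Hypothetical: a one-in-a-million slow leaf call, fanned out to 100 leaves.
print(root_slow_probability(1e-6, 100))
print(root_latency_ms([58.0, 61.0, 5000.0, 59.0]))  # 5000.0
```

Under these assumptions the slow-request rate seen at the root grows roughly linearly with the fan-out, so a one-in-a-million leaf event becomes about a one-in-ten-thousand root event.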

Aggregated latency charts showed the P100 metric far exceeding the normal range, while lower percentiles stayed stable. Zooming into a single leaf node revealed periodic spikes in both index‑retrieval and sorting phases, occurring every few minutes and lasting only a few seconds.

4. Systematic Variable Elimination

To isolate the cause, the engineers created a stripped‑down test environment:

Reserved dedicated nodes and network resources for the Manas pods.

Deployed the test cluster on larger EC2 instances so that the entire index fit in memory, eliminating page faults.

Aligned the Kubernetes node AMI with the production version.

Removed all cgroup CPU and memory limits, and even ran Manas directly on the host, bypassing containers.

These changes did not improve latency, indicating the issue was not a resource‑limit problem.

5. Dual‑Track Debugging

Clearbox (OS‑level) debugging: Collected CPU, memory, and network metrics with perf, comparing Kubernetes nodes to bare‑metal production nodes. No significant CPU pre‑emption or lock contention was observed.

Blackbox (process‑level) debugging: Incrementally disabled non‑essential processes using taskset and cpusets, and even ran the Manas binary outside the container. Again, no latency improvement was seen.
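The CPU-pinning part of this step can be approximated from Python with the same Linux scheduler call that taskset uses. This is a minimal, Linux-only sketch; the helper name pin_to_cpus is hypothetical, and the article does not say which CPUs or processes the team pinned.

```python
import os

def pin_to_cpus(pid: int, cpus: set) -> set:
    """Restrict `pid` to `cpus` and return its previous affinity set (Linux only)."""
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, cpus)
    return previous

# Demo on the current process (pid 0 means "the caller").
allowed = os.sched_getaffinity(0)
one_cpu = {min(allowed)}
previous = pin_to_cpus(0, one_cpu)
print("pinned to", os.sched_getaffinity(0))
os.sched_setaffinity(0, previous)  # restore the original affinity
```

Pinning only keeps a process off certain cores. It does not stop that process from contending for kernel-level resources shared across cores.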

Both tracks converged on the conclusion that the problem was not a user‑space resource bottleneck but something introduced by the Kubernetes environment itself.

6. Identifying the Culprit

The team performed a “hard‑stop” experiment: they sequentially sent SIGSTOP to every non‑core process (log collectors, metrics pipelines, security agents, kubelet, etc.). When cAdvisor was paused, the latency spikes vanished instantly.
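The pause-and-resume mechanic behind this experiment can be sketched with a small context manager. The demo freezes a throwaway child process that stands in for a node agent. On a real node the body of the with block would be a latency observation window, which is outside this sketch.

```python
import os
import signal
import subprocess
import sys
from contextlib import contextmanager

@contextmanager
def paused(pid: int):
    """Freeze `pid` with SIGSTOP for the duration of the block, then resume it."""
    os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        os.kill(pid, signal.SIGCONT)

# Stand-in for a node agent: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
with paused(child.pid):
    # WUNTRACED makes waitpid report a stopped (not exited) child.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status)
child.kill()
child.wait()
print("child was stopped:", stopped)
```

The finally clause always sends SIGCONT, so an exception during the observation window cannot leave an agent such as kubelet frozen.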

Further perf analysis showed that the majority of CPU time on the leaf node was spent in kernel smaps handling, confirming that cAdvisor’s memory‑usage collection was the source of contention.

7. Root‑Cause Explanation

cAdvisor periodically reads the Linux kernel's smaps interface to compute the container_referenced_bytes metric, a working-set-size (WSS) estimate. The operation walks the process's entire page table and clears the accessed bits. On a leaf node with hundreds of gigabytes of memory-mapped index data, the page table can contain tens of millions of entries. cAdvisor runs this scan every 30 seconds, and each pass causes heavy kernel-level lock contention that interferes with the leaf node's own memory-intensive retrieval and sorting phases.
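The scale of the scan follows from simple arithmetic, sketched below with two assumptions: 4 KiB base pages (no huge pages) and a hypothetical 200 GiB of mapped index data. The second helper shows the kind of work a WSS estimate involves, summing the Referenced: fields of a /proc/<pid>/smaps dump. The sample text and the /data/index.bin path are made up and abbreviated. The kernel documents /proc/<pid>/clear_refs as the interface for resetting these bits; the article does not spell out cAdvisor's exact file accesses.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB base pages, no huge pages

def page_table_entries(mapped_bytes: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Number of leaf page-table entries needed to map `mapped_bytes`."""
    return -(-mapped_bytes // page_size)  # ceiling division

def referenced_kb(smaps_text: str) -> int:
    """Sum the 'Referenced:' fields of a /proc/<pid>/smaps dump, in kB."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Referenced:"):
            total += int(line.split()[1])
    return total

# Hypothetical 200 GiB of memory-mapped index data.
print(page_table_entries(200 * 1024**3))  # 52428800

# Abbreviated, made-up smaps excerpt (real entries carry more fields).
sample = """\
7f00a0000000-7f00a0400000 r--s 00000000 103:02 1234 /data/index.bin
Size:               4096 kB
Rss:                2048 kB
Referenced:         1024 kB
7f00a0400000-7f00a0800000 r--s 00400000 103:02 1234 /data/index.bin
Size:               4096 kB
Rss:                4096 kB
Referenced:          512 kB
"""
print(referenced_kb(sample))  # 1536
```

At roughly fifty million entries per pass, even a cheap per-entry operation adds up to a long stretch of kernel work on the same address space the leaf is actively reading.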

8. Mitigation

The immediate fix was to disable the WSS (working‑set‑size) estimation in cAdvisor across all PinCompute nodes. This single configuration change eliminated the 100× latency outliers without sacrificing essential monitoring.

The team also opened an issue on the cAdvisor GitHub repository to document the finding and suggest making the WSS metric optional for memory‑heavy workloads.

9. Lessons Learned

Resource isolation is subtle: CPU shielding alone cannot fully prevent cross‑process interference.

Rapid problem‑space convergence: Confirming the AMI was not at fault saved countless dead‑end experiments.

Black‑box tactics are valuable: Systematically stopping services provided a practical binary‑search method for root‑cause identification.
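The binary-search idea in the last lesson can be sketched as a bisection over suspect processes. It assumes a non-empty list and exactly one culprit. The callback spikes_with_paused is hypothetical: in practice it would pause a group, watch latency for a while, resume the group, and report whether spikes still occurred. The process names mirror the categories listed in section 6.

```python
from typing import Callable, Sequence

def find_culprit(suspects: Sequence[str],
                 spikes_with_paused: Callable[[Sequence[str]], bool]) -> str:
    """Bisect `suspects`, assuming exactly one of them causes the spikes."""
    candidates = list(suspects)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if spikes_with_paused(half):
            # Spikes persist with `half` paused, so the culprit is in the rest.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]

suspects = ["log-collector", "metrics-pipeline", "security-agent", "kubelet", "cadvisor"]

# Simulated oracle: spikes continue unless cadvisor is among the paused group.
def simulated_oracle(group):
    return "cadvisor" not in group

print(find_culprit(suspects, simulated_oracle))  # cadvisor
```

Bisection needs about log2(n) observation windows instead of n, which matters when each window has to be long enough to catch a one-in-a-million event.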

Disabling cAdvisor altogether is not a viable permanent fix, because it provides critical metrics for autoscaling. Turning off only the WSS estimation restores stable performance, and the incident highlights a hidden risk for any memory-intensive service migrating to Kubernetes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
