Why One in a Million Searches Slowed 100× After Moving to Kubernetes
During Pinterest’s migration of its custom search platform Manas to the PinCompute Kubernetes environment, a rare latency spike—one request per million taking 100 times longer—was traced to cAdvisor’s memory‑intensive smaps scans, revealing hidden resource contention and prompting a targeted fix.
1. Migrating Manas to Kubernetes
Pinterest’s search service, built on the proprietary Manas system, powers millions of user interactions every day. Since 2017 the system has grown steadily more complex, and it now spans more than 100 clusters across thousands of hosts. To improve reliability, Manas was moved to Pinterest’s in‑house Kubernetes platform, PinCompute, which incorporates Envoy, Spinnaker, a GitOps-based configuration system, and a custom Kubernetes Operator.
In early 2025, performance tests on the new cluster revealed intermittent timeouts: a few per minute, each causing a brief dip in recommendation quality. Although the aggregate metrics remained stable, the risk of a larger outage forced the team to halt the migration.
2. Investigating Request Latency
Manas uses a two‑level fan‑out architecture: a root node distributes each request to leaf nodes, each of which handles a shard of the search index. Leaf‑node processing consists of four stages: query parsing, candidate retrieval via a memory‑mapped index lookup, ranking/hydration via another lookup, and returning the results.
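A minimal sketch of that fan‑out shape, with hypothetical names (`queryLeaf`, `shardResult`); the real Manas root also handles timeouts, retries, and result merging that are omitted here:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// shardResult is a hypothetical stand-in for one leaf node's response.
type shardResult struct {
	shard   int
	elapsed time.Duration
}

// queryLeaf stands in for a leaf's four stages: parse the query, retrieve
// candidates via a memory-mapped lookup, rank/hydrate, and return results.
func queryLeaf(ctx context.Context, shard int, query string) shardResult {
	start := time.Now()
	_ = query // parsing, retrieval, and ranking would happen here
	return shardResult{shard: shard, elapsed: time.Since(start)}
}

// fanOut sends the query to every leaf shard concurrently and gathers the
// responses, mirroring the root node's role in the two-level architecture.
func fanOut(ctx context.Context, query string, shards int) []shardResult {
	results := make([]shardResult, shards)
	var wg sync.WaitGroup
	for i := 0; i < shards; i++ {
		wg.Add(1)
		go func(shard int) {
			defer wg.Done()
			results[shard] = queryLeaf(ctx, shard, query)
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(fanOut(context.Background(), "pins about hiking", 4))
}
```

Because the root waits for every shard, a single slow leaf drags the latency of the whole request, which is why a rare per‑leaf hiccup surfaces as a request‑level tail spike.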
Latency aggregation showed normal p99.9 values (<60 ms) but occasional p100 spikes of up to 5 s, roughly 100× the expected latency. Detailed per‑leaf‑node traces revealed sharp p100 peaks every few minutes, in both the retrieval and ranking phases.
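This is the classic tail‑of‑the‑tail pattern: with a handful of spikes per minute across millions of requests, the affected fraction sits beyond the 99.9th percentile, so p99.9 looks healthy while p100 explodes. A small illustration using the nearest‑rank percentile method (hypothetical helper, not Pinterest’s metrics code):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (p in [0,100]) of samples.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// One million requests at ~40 ms with a single 5 s outlier: p99.9 stays
	// normal while p100 is ~100x higher, matching the observed pattern.
	samples := make([]time.Duration, 1_000_000)
	for i := range samples {
		samples[i] = 40 * time.Millisecond
	}
	samples[123456] = 5 * time.Second

	fmt.Println("p99.9:", percentile(samples, 99.9)) // 40ms
	fmt.Println("p100: ", percentile(samples, 100))  // 5s
}
```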
Because each request runs on a single thread, the team narrowed the hypothesis to two possibilities: interference from something external to the process (beyond simple CPU contention), or an internal operation that intermittently slowed down.
3. Narrowing the Scope
The team first simplified the test environment: they moved the search cluster to larger EC2 instances, loaded the full index into memory, used the same AMI as production, and removed cgroup CPU/memory limits. They also ran Manas directly on the host instead of inside containers.
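One of those steps, loading the full index into memory, can be approximated in a few lines; a hedged sketch using `golang.org/x/sys/unix` with a hypothetical shard path (Manas’s actual file layout is not public):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical index shard path, for illustration only.
	f, err := os.Open("/data/index/shard-000.idx")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the file read-only, as in the leaf's memory-mapped lookups.
	data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
		unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data)

	// Lock the mapping into RAM so lookups never fault to disk, removing
	// page-fault latency as a variable in the test environment.
	if err := unix.Mlock(data); err != nil {
		panic(err)
	}
	fmt.Printf("mapped and locked %d bytes\n", len(data))
}
```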
None of these changes had any effect: the spikes persisted even outside containers and without resource limits, ruling out the container runtime, cgroup configuration, and instance sizing, and pointing instead at something running on the Kubernetes nodes themselves.
Two parallel investigation tracks were then launched:
White‑box testing: Collected CPU, memory, and network metrics; used perf to record scheduling events; compared kernel lock contention and cache usage between Kubernetes nodes and the legacy environment.
Black‑box testing: Isolated the pod with taskset and CPU shielding, and also ran the binary directly on the host to eliminate container‑level interference.
Neither approach revealed obvious CPU saturation or scheduling anomalies.
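For reference, the CPU‑pinning side of the black‑box track can be reproduced programmatically; a sketch of pinning the current process to a reserved CPU set, the same effect as `taskset` (uses `golang.org/x/sys/unix`; the CPU numbers are illustrative):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Pin this process to CPUs 4-7, leaving the rest for neighboring
	// daemons; equivalent to `taskset -c 4-7` in the black-box tests.
	var set unix.CPUSet
	set.Zero()
	for cpu := 4; cpu <= 7; cpu++ {
		set.Set(cpu)
	}
	// pid 0 means the calling process.
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		panic(err)
	}
	fmt.Println("pinned to CPUs 4-7; count =", set.Count())
}
```

As the team would later learn, this kind of shielding isolates CPU time but not kernel‑level contention such as page‑table locks, which is exactly where the real problem lived.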
4. Pinpointing the Root Cause
To reduce the number of variables, the team stopped all non‑essential processes on the nodes (logging agents, metrics pipelines, security daemons, kubelet, and Pinterest‑specific daemons), keeping the root node intact as a healthy reference point for the leaf nodes.
Killing these processes one at a time showed that the latency spikes vanished only when cAdvisor was stopped.
5. Deep Dive into the Root Cause
cAdvisor gathers container metrics through the Linux smaps interface (/proc/&lt;pid&gt;/smaps), which it uses to compute container_referenced_bytes, an intrusive Working Set Size (WSS) estimate. To produce it, the kernel walks the process’s entire page table and then clears the access bits, holding page‑table locks while it does so; for memory‑heavy workloads this is an expensive operation.
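In outline, the WSS estimate works like the sketch below: sum the `Referenced:` fields from /proc/&lt;pid&gt;/smaps, then write to /proc/&lt;pid&gt;/clear_refs so the kernel resets the accessed bits before the next sample. This is a simplified approximation of what cAdvisor actually does, not its real code; both steps force a full page‑table walk in the kernel:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// referencedBytes sums the "Referenced:" fields in /proc/<pid>/smaps,
// the kernel's report of pages accessed since the last reset.
func referencedBytes(pid int) (uint64, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/smaps", pid))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var totalKB uint64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like: "Referenced:        1234 kB"
		fields := strings.Fields(sc.Text())
		if len(fields) == 3 && fields[0] == "Referenced:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			totalKB += kb
		}
	}
	return totalKB * 1024, sc.Err()
}

// resetAccessedBits writes "1" to /proc/<pid>/clear_refs, asking the kernel
// to walk the entire page table and clear every accessed bit -- the
// expensive, lock-holding step.
func resetAccessedBits(pid int) error {
	return os.WriteFile(fmt.Sprintf("/proc/%d/clear_refs", pid), []byte("1"), 0)
}

func main() {
	pid := os.Getpid()
	b, err := referencedBytes(pid)
	if err != nil {
		panic(err)
	}
	fmt.Printf("referenced: %d bytes\n", b)
	if err := resetAccessedBits(pid); err != nil {
		panic(err)
	}
}
```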
Manas leaf nodes map hundreds of gigabytes (up to >1 TB) of index data into memory. On a host with ~100 GB used memory, the page table can contain ~25 million entries. cAdvisor runs every 30 seconds, meaning the kernel lock is held twice per minute while traversing and clearing all those entries.
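The ~25 million figure is consistent with the default x86‑64 page size of 4 KiB:

$$\frac{100\ \text{GB}}{4\ \text{KiB/page}} \approx \frac{10^{11}\ \text{B}}{4096\ \text{B/page}} \approx 2.4 \times 10^{7}\ \text{page-table entries}$$

and every 30‑second sample traverses and resets all of them.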
The contention between cAdvisor’s smaps scans and Manas’s own memory‑intensive processing caused the rare but severe latency spikes observed (≈1 in a million requests).
6. Fix Implemented
The team disabled cAdvisor’s WSS estimation on all PinCompute nodes by changing a single configuration line, eliminating the intrusive smaps calls. They also opened a GitHub issue in the cAdvisor repository to share the findings.
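The exact configuration depends on how cAdvisor is deployed; in standalone cAdvisor, the WSS estimate belongs to the `referenced_memory` metric group, which the documented `--disable_metrics` flag can turn off (flag support varies by version, and kubelet‑embedded cAdvisor exposes its own configuration). Something along these lines:

```
cadvisor --disable_metrics=referenced_memory
```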
With cAdvisor’s heavy memory scans turned off, p100 latency collapsed to the p99 level, and the migration could proceed safely.
7. Lessons Learned
Resource isolation is hard: CPU shielding alone does not fully isolate processes from neighboring workloads.
Narrowing the scope is powerful: Confirming the issue was not caused by the PinCompute AMI helped focus the investigation.
Black‑box testing remains valuable: Systematic disabling of components can quickly surface hidden interactions.
