Evolution of Pinterest's Monitoring System: From Time-Series Metrics to Distributed Tracing
Over seven years, Pinterest’s monitoring team built and refined a three‑pronged observability platform—time‑series metrics, log search, and distributed tracing—scaling from a single‑machine system to handling millions of data points per second across tens of thousands of AWS VMs, while addressing reliability, cost, and usability challenges.
Pinterest, a Silicon Valley startup founded in 2010, grew to 190 million monthly active users and a cloud infrastructure of tens of thousands of virtual machines. To support this growth, its SRE team progressively evolved its monitoring platform over seven years.
The platform now consists of three tightly integrated subsystems: a time‑series metrics and alerting system, a log‑search service, and a distributed tracing system. Together they provide a unified observability stack capable of processing millions of data points per second.
Different business scenarios place different demands on operations systems. Pinterest, a startup out of Silicon Valley, improved and upgraded its operations tooling step by step as it grew; today its monitoring system delivers alerting, log search, and distributed tracing as a unified, three-in-one capability.
The metrics system ingests high‑frequency data via a Kafka‑based pub/sub pipeline, normalizes and samples it, then stores it on disk. It currently handles about 250 data points per second and serves roughly 35 k queries per second, 90 % of which originate from the alerting component. To improve query latency, the team shards data by type and age, moving hot recent data to SSD‑backed clusters and cold older data to HDD clusters, and is developing an in‑memory time‑series database for sub‑second access.
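The normalize-then-sample stage of the ingestion pipeline can be sketched as follows. This is a minimal illustration, not Pinterest's actual code: the field names and the 10 % sampling ratio are assumptions made for the example.

```python
import random
import time

def normalize(point):
    """Normalize a raw data point: trim and lowercase the metric name and
    round the timestamp down to one-second resolution (hypothetical rules;
    the article does not spell out Pinterest's exact normalization scheme)."""
    return {
        "metric": point["metric"].strip().lower(),
        "timestamp": int(point["timestamp"]),
        "value": float(point["value"]),
    }

def sample(points, keep_ratio=0.1):
    """Down-sample high-frequency points, keeping roughly keep_ratio of them."""
    return [p for p in points if random.random() < keep_ratio]

# Simulate a burst of raw points arriving from the Kafka-based pipeline.
raw = [{"metric": " CPU.Load ", "timestamp": time.time(), "value": i}
       for i in range(1000)]
normalized = [normalize(p) for p in raw]
kept = sample(normalized, keep_ratio=0.1)
print(len(kept))  # count varies run to run, around 10% of the input
```

In the real system this stage would sit between the Kafka consumers and the storage tier, so that only the reduced stream reaches disk.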
The log‑search subsystem aggregates 500‑800 GB of logs daily using a combination of Sumo Logic and Elasticsearch. Logs are emitted through a standardized API that enriches each entry with context (file, line, process ID, request ID) and outputs JSON, facilitating indexing and fast retrieval. Users can define alert rules that trigger on pattern matches.
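A standardized logging API of the kind described above might look like the sketch below. The function name and JSON field names are illustrative, not Pinterest's actual schema; the point is that every entry is enriched with call-site context and serialized as JSON.

```python
import inspect
import json
import os

def log_event(message, request_id=None):
    """Emit a JSON log line enriched with call-site context
    (file, line, process ID, request ID)."""
    caller = inspect.stack()[1]  # frame of the code that called us
    entry = {
        "message": message,
        "file": caller.filename,
        "line": caller.lineno,
        "pid": os.getpid(),
        "request_id": request_id,
    }
    return json.dumps(entry)

line = log_event("cache miss on board lookup", request_id="req-42")
print(line)
```

Because every field is a top-level JSON key, Elasticsearch can index each one directly, which is what makes pattern-match alert rules cheap to evaluate.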
The distributed tracing system records end‑to‑end request flows across more than a hundred microservices. Language‑specific libraries (Java, Python) capture spans, store them locally, and forward them via Kafka to Elasticsearch. The collected traces form waterfall visualizations that reveal service‑to‑service latency, identify bottlenecks, and enable per‑service CPU usage calculations.
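A minimal span record, in the spirit of the language-specific tracing libraries mentioned above, could be sketched like this. The class and field names are assumptions for illustration, not Pinterest's actual library; in production the finished span dictionaries would be written locally and shipped to Elasticsearch via Kafka.

```python
import time
import uuid

class Span:
    """One timed unit of work in a trace. Spans sharing a trace_id and
    linked by parent_id form the waterfall view described in the article."""

    def __init__(self, service, operation, trace_id=None, parent_id=None):
        self.trace_id = trace_id or uuid.uuid4().hex  # new trace if none given
        self.span_id = uuid.uuid4().hex
        self.parent_id = parent_id
        self.service = service
        self.operation = operation
        self.start = time.time()

    def finish(self):
        """Close the span and return the record to forward downstream."""
        return {
            "trace_id": self.trace_id,
            "span_id": self.span_id,
            "parent_id": self.parent_id,
            "service": self.service,
            "operation": self.operation,
            "duration_ms": (time.time() - self.start) * 1000,
        }

# A request enters the web tier, which calls a downstream service.
root = Span("web", "GET /pin")
child = Span("feed-svc", "fetch_feed",
             trace_id=root.trace_id, parent_id=root.span_id)
child_record = child.finish()
root_record = root.finish()
```

Stitching the records together by `trace_id` and ordering them by start time yields the service-to-service latency waterfall; summing durations per service supports the per-service CPU usage calculations the article mentions.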
Key challenges encountered include massive data volume (≈100 TB per day), reliability requirements (99.9 % uptime), and query performance (0.5‑5 s chart load times). Solutions involved aggressive data reduction, tiered storage, redundancy across monitoring clusters, and proactive health checks.
Operational hurdles, including unpredictable read/write patterns, the fact that 90 % of stored data is never accessed, and the difficulty of promoting new tools internally, were mitigated through elastic capacity scaling, cost-awareness education for engineers, and polished user interfaces that integrate metrics, logs, and traces on a single dashboard.
Future work focuses on deeper integration of the three data types into a unified UI, richer automation via intelligent alert deduplication and routing, and expanding the platform’s capabilities with AI‑driven anomaly detection.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.