How Tencent Built a 10 TB‑Per‑Day Full‑Link Log Monitoring Platform
This article explains how Tencent's ZhiYun full‑link log monitoring platform ingests more than 10 TB of logs per day, copes with diverse log formats and high throughput, tolerates node failures, and provides scalable storage, query, and alerting for distributed micro‑service environments.
Background
Full‑link log monitoring is essential in modern micro‑service and distributed environments because it makes problems faster to locate. Open‑source solutions such as Zipkin (based on Google’s Dapper) exist, but ZhiYun’s platform faces more complex real‑world scenarios, which led the team to move from open‑source components to self‑developed ones.
Case Scenario
On 31 August 2017, a metric anomaly appeared in module X: its success rate dropped from 99.988% to 97.325%. Multi‑dimensional monitoring narrowed the issue down to the iPhone client of the Space‑On‑Demand service (return code -310110004). Full‑link logs were then examined to trace the affected users' actions and pinpoint the root cause.
Usage Scenarios
Individual analysis: handling user complaints and point‑to‑point anomaly analysis.
Development debugging: viewing related module logs during development and testing.
Monitoring alerts: extracting dimensions from log data for anomaly detection and root‑cause analysis.
Challenges
Business diversity: multiple services (QQ, Space, Live, Video‑On‑Demand, Membership) generate heterogeneous log formats, and there is no unified RPC middleware across them.
Massive data volume: over 2 billion online users, more than 10 TB of logs stored per day, and bandwidth above 30 GB/s, all of which call for high‑performance, low‑cost storage and processing.
Solutions
Log Diversity
The platform supports four data formats: delimiter, regex, JSON, and API reporting. Delimiter, regex, and JSON are non‑intrusive, since services keep writing logs as they already do, but they parse more slowly (≈40 k records/s). API reporting reaches ≈100 k records/s and is recommended for internal services.
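As a rough illustration of the three non‑intrusive formats, the sketch below normalizes a delimiter line, a regex‑matched line, and a JSON line into the same record shape. The field names, the `|` separator, and the regex layout are hypothetical and chosen only for the example; the article does not describe the platform's actual schema or parser.

```python
import json
import re

# Hypothetical schema, for illustration only.
FIELDS = ["timestamp", "user_id", "module", "return_code"]

# Hypothetical line layout for the regex format.
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] user=(?P<user_id>\S+) "
    r"module=(?P<module>\S+) ret=(?P<return_code>-?\d+)$"
)


def parse_delimited(line, sep="|"):
    """Split on a separator and map columns to field names by position."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def parse_regex(line):
    """Extract fields through named capture groups."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def parse_json(line):
    """Decode a JSON object and keep the known fields as strings."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or any(f not in record for f in FIELDS):
        return None
    return {f: str(record[f]) for f in FIELDS}


PARSERS = {"delimiter": parse_delimited, "regex": parse_regex, "json": parse_json}


def parse(fmt, line):
    """Dispatch to the parser configured for a log source; None means unparseable."""
    return PARSERS[fmt](line)
```

Whatever the input format, downstream stages see one record shape, which is what lets heterogeneous services share a single pipeline.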
Automatic Disaster Recovery and Scaling
The ingest, parse, and process modules are stateless, so each can be deployed and scaled independently. A heartbeat mechanism detects node failures: downstream nodes report their status every 6 seconds, and upstream nodes disable links to nodes that have become unavailable. Stateful services (e.g., storage) use master‑slave election via ZooKeeper.
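The heartbeat idea can be sketched as a small table kept by an upstream node. The 6‑second interval comes from the text; the number of missed heartbeats tolerated before a link is disabled, the class name `LinkTable`, and the CRC‑based routing are assumptions made for this example, not details from the article.

```python
import time
import zlib

HEARTBEAT_INTERVAL = 6.0  # seconds, as stated in the text
MISSED_BEATS_ALLOWED = 3  # assumption: the article does not give a threshold


class LinkTable:
    """Upstream view of which downstream nodes are currently usable."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_seen = {}

    def heartbeat(self, node):
        """Record a status report from a downstream node."""
        self._last_seen[node] = self._clock()

    def available(self):
        """Return nodes whose latest heartbeat is recent enough."""
        deadline = self._clock() - HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
        return sorted(n for n, seen in self._last_seen.items() if seen >= deadline)

    def pick(self, key):
        """Route a key to one of the nodes that are still available."""
        nodes = self.available()
        if not nodes:
            raise RuntimeError("no downstream node available")
        return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]
```

Because the modules hold no state, a node that resumes sending heartbeats can simply rejoin the routing set without any recovery step.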
Data Channel Resilience
Two mechanisms are employed: dual‑write for high‑quality monitoring data and message queues (Kafka or RabbitMQ + MongoDB) for log data. Dual‑write provides low latency and high throughput; message queues handle peak loads but may introduce latency and require monitoring for backlog.
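A minimal sketch of the dual‑write idea, under the assumption that each record carries a unique `id` and that the receiving side deduplicates on it. The `Channel` class is an in‑memory stand‑in for a network path; none of these names come from the article.

```python
class Channel:
    """In-memory stand-in for one path to the monitoring backend (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def send(self, record):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.received.append(record)


def dual_write(record, channels):
    """Send the record down every channel; return how many accepted it."""
    delivered = 0
    for channel in channels:
        try:
            channel.send(record)
            delivered += 1
        except ConnectionError:
            continue  # the other channel still carries the record
    return delivered


def merge(channels):
    """Combine what all channels received, keeping one copy per record id."""
    seen = {}
    for channel in channels:
        for record in channel.received:
            seen.setdefault(record["id"], record)
    return list(seen.values())
```

The trade‑off is that every record is sent twice, which is affordable for compact monitoring data but would be costly for the much larger raw log volume, where a queue absorbs peaks instead.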
Query Capability
Logs are hashed by user or request ID and assigned to shards. Each shard is buffered until it is 1 minute old or 1 MB in size, then written to a file server cluster, and the resulting file paths are indexed in Elasticsearch. Two query modes are offered: (1) primary‑key lookup, which hashes the key and retrieves the matching file paths from Elasticsearch, and (2) keyword search, which scans files incrementally to balance performance and completeness.
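The sharding and buffering step might look roughly like the sketch below. The 1‑minute and 1 MB limits come from the text; treating them as "whichever comes first", using CRC32 as the hash, and the `flush` callback (where a real system would write the file and index its path) are assumptions for illustration.

```python
import time
import zlib

FLUSH_BYTES = 1024 * 1024  # 1 MB, from the text
FLUSH_SECONDS = 60.0       # 1 minute, from the text


def shard_for(key, num_shards):
    """Map a user or request ID to a stable shard number."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ShardBuffer:
    """Collects log lines for one shard and flushes on size or age."""

    def __init__(self, flush, clock=time.monotonic):
        self._flush = flush  # hypothetical callback: receives a list of bytes lines
        self._clock = clock
        self._lines = []
        self._size = 0
        self._opened_at = None

    def append(self, line):
        if self._opened_at is None:
            self._opened_at = self._clock()
        self._lines.append(line)
        self._size += len(line)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if a limit is reached; meant to be called on a timer as well."""
        if not self._lines:
            return False
        age = self._clock() - self._opened_at
        if self._size >= FLUSH_BYTES or age >= FLUSH_SECONDS:
            self._flush(self._lines)
            self._lines, self._size, self._opened_at = [], 0, None
            return True
        return False
```

Since the same key always lands in the same shard, a primary‑key query only has to look at the files written for that shard rather than the whole cluster.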
Summary
Key takeaways include leveraging mature open‑source components for initial functionality, iteratively replacing them with self‑developed modules to improve performance and stability, adopting stateless design with routing and load balancing, abstracting ETL capabilities for extensibility, and building a platform‑wide, fault‑tolerant architecture to support diverse business needs.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
