Big Data 10 min read

How Tencent Built a 10 TB‑Per‑Day Full‑Link Log Monitoring Platform

This article explains how Tencent's ZhiYun full‑link log monitoring platform handles massive daily logs, overcomes challenges of diverse log formats, high throughput, fault‑tolerant design, and provides scalable storage, query, and alerting capabilities for distributed micro‑service environments.

Efficient Ops

Nov 15, 2017

How Tencent Built a 10 TB‑Per‑Day Full‑Link Log Monitoring Platform

Background

Full‑link log monitoring is essential in modern micro‑service and distributed environments to improve issue localization efficiency. Open‑source solutions like Zipkin (based on Google’s Dapper) exist, but ZhiYun’s platform faces more complex real‑world scenarios, requiring a shift from open‑source components to self‑developed solutions.

Case Scenario

On 31 August 2017, a metric anomaly in module X dropped success rate from 99.988% to 97.325%. Multi‑dimensional monitoring identified the issue in the iPhone client of the Space‑On‑Demand service (return code -310110004). Full‑link logs were then examined to trace user actions and pinpoint the root cause.

Usage Scenarios

Individual analysis : handling user complaints and point‑to‑point anomaly analysis.

Development debugging : viewing related module logs during development and testing.

Monitoring alerts : extracting dimensions from log data for anomaly detection and root‑cause analysis.

Challenges

Business diversity : multiple services (QQ, Space, Live, Video‑On‑Demand, Membership) generate heterogeneous log formats without a unified RPC middleware.

Massive data volume : over 2 billion online users, daily log storage >10 TB, bandwidth >30 GB/s, requiring high‑performance, low‑cost storage and processing.

Solutions

Log Diversity

The platform supports four data formats—delimiter, regex, JSON, and API reporting. Delimiter, regex, and JSON are non‑intrusive but have lower parsing performance (≈40 k records/s). API reporting achieves ≈100 k records/s and is recommended for internal services.

Automatic Disaster Recovery and Scaling

Modules are designed statelessly (ingest, parse, process) for independent deployment. A heartbeat mechanism detects node failures: downstream nodes report status every 6 seconds, and upstream nodes disable unavailable links. Stateful services (e.g., storage) use master‑slave election via ZooKeeper.

Data Channel Resilience

Two mechanisms are employed: dual‑write for high‑quality monitoring data and message queues (Kafka or RabbitMQ + MongoDB) for log data. Dual‑write provides low latency and high throughput; message queues handle peak loads but may introduce latency and require monitoring for backlog.

Query Capability

Logs are hashed by user or request ID, sharded, cached for 1 minute or 1 MB, then written to a file server cluster. File paths are indexed in Elasticsearch. Two query modes are offered: (1) primary‑key lookup using hash‑based ES retrieval, and (2) keyword search with incremental file scanning to balance performance and completeness.

Summary

Key takeaways include leveraging mature open‑source components for initial functionality, iteratively replacing them with self‑developed modules to improve performance and stability, adopting stateless design with routing and load balancing, abstracting ETL capabilities for extensibility, and building a platform‑wide, fault‑tolerant architecture to support diverse business needs.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Big Data data pipeline scalability fault tolerance Log Monitoring

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.