How Tencent Analytics Achieves TB-Scale Real-Time Data Processing
This article examines Tencent Analytics' architecture, detailing its real-time data collection, in‑memory computation, and NoSQL storage strategies that enable second‑level updates for terabyte‑scale web traffic.
Abstract: TA (Tencent Analytics) is a free website analytics system for third‑party webmasters, praised for data stability, timeliness, and second‑level real‑time updates. This article explores its system architecture and implementation principles.
Web analytics examines user browsing behavior to monitor site performance and guide optimization. Popular products include Google Analytics, CNZZ, and Baidu Tongji. TA distinguishes itself with community analysis, user profiling, and tools, while handling terabytes of data per day with high availability.
Basic Principles and System Architecture
TA collects user behavior via a JavaScript snippet embedded on sites. The snippet sends data to a collection cluster, which filters, encodes, and formats it before distribution. A processing cluster then computes the business logic and writes the results to a storage cluster, from which they are presented to webmasters.
The backend consists of real‑time and offline parts: the real‑time component updates data every second, while the offline part handles complex cross‑day analyses. Its main modules are:
Http Access: parses HTTP, cleans and formats data.
ESC (Event Streaming Coder): encodes non‑enumerable types into integers.
ESP (Event Streaming Processor): reorganizes data by site/UID and calculates metrics such as PV, UV, dwell time, and bounce rate.
ESA (Event Streaming Aggregator): aggregates ESP results per site and writes to Redis.
Center: manages configuration, data routing, and failover.
Logserver: writes raw access data to files and uploads them to TDCP.
TDCP (Tencent Distributed Computing Platform): processes offline data and writes results to MySQL.
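The metrics attributed to ESP above can be illustrated with a minimal sketch. It is not TA's implementation: the event tuple layout, the function name site_metrics, and the definition of bounce rate as the share of sessions with exactly one page view are all assumptions made for illustration.

```python
from collections import defaultdict


def site_metrics(events):
    """Compute PV, UV, and bounce rate per site.

    events: iterable of (site_id, uid, session_id) tuples, one per page view.
    This layout is a hypothetical stand-in for TA's real event format.
    """
    pv = defaultdict(int)                                   # page views per site
    visitors = defaultdict(set)                             # distinct UIDs per site
    views_per_session = defaultdict(lambda: defaultdict(int))

    for site, uid, session in events:
        pv[site] += 1
        visitors[site].add(uid)
        views_per_session[site][session] += 1

    result = {}
    for site in pv:
        sessions = views_per_session[site]
        # Assumed definition: a bounce is a session with a single page view.
        bounces = sum(1 for n in sessions.values() if n == 1)
        result[site] = {
            "pv": pv[site],
            "uv": len(visitors[site]),
            "bounce_rate": bounces / len(sessions),
        }
    return result
```

Grouping by site first mirrors the "reorganizes data by site/UID" step: once events for one site land on one worker, every metric here can be computed without cross-machine coordination.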
Real‑Time Solution
TA processes TB‑scale data from hundreds of thousands of sites, storing billions of keys. The solution emphasizes full binary data, in‑memory computation, and NoSQL storage.
Real‑Time Computation
Inspired by Hadoop, S4, and Storm, TA implements a generic, extensible in‑memory event processing system.
Data Organization. All non‑int types are converted to ints to reduce memory usage: enumerable types are mapped to unique ints, while non‑enumerable types are hashed with MD5 into ints that are only approximately unique (collisions are possible but rare).
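A minimal sketch of the two encodings, under stated assumptions: the class EnumEncoder and the function hash_to_int are hypothetical names, and truncating the MD5 digest to its first 8 bytes is an illustrative choice, since the source does not say how wide TA's integers are.

```python
import hashlib


class EnumEncoder:
    """Assigns a small unique int to each distinct enumerable value."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # First sighting gets the next free id; later sightings reuse it.
        return self._ids.setdefault(value, len(self._ids) + 1)


def hash_to_int(value: str) -> int:
    """Map an arbitrary string (e.g. a URL) to a 64-bit int via MD5.

    Taking 8 of the 16 digest bytes is an assumption; distinct inputs
    can collide, which is why the result is only approximately unique.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

The dictionary approach suits small, closed value sets such as browser names, while hashing avoids keeping a lookup table for open-ended values such as URLs.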
Protocol. An extensible Event structure supports semi‑automatic serialization (similar to msgpack) and compact binary encoding (Zigzag, akin to Protobuf), enabling high‑performance I/O.
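For readers unfamiliar with ZigZag encoding, the sketch below shows the general technique as used by Protobuf for signed integers, paired with base-128 varints. It illustrates the encoding family the text refers to, not TA's actual wire format; the function names are illustrative and 64-bit signed inputs are assumed.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit int to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_encode(u: int) -> bytes:
    """Encode an unsigned int in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

ZigZag keeps small negative numbers small after the mapping, so the varint stage can store them in one or two bytes instead of the ten bytes a plain two's-complement varint would need.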
Incremental Computation Model. The model consists of three parts:
Processor: handles specific business calculations.
Data Holder: retains incremental results and intermediate data.
Emitter: triggers periodic output and clears results.
The model reduces per‑machine transaction state, simplifying distributed implementation and boosting performance.
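The three roles can be sketched as follows. This is a simplified single-process illustration, not TA's code: the class names mirror the roles named in the text, but the method names, the event dictionary shape, and the sink callback are assumptions. The clock is passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict


class DataHolder:
    """Holds incremental results accumulated since the last emit."""

    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, key, delta):
        self.counts[key] += delta

    def drain(self):
        # Hand back the accumulated deltas and start from empty again.
        snapshot, self.counts = dict(self.counts), defaultdict(int)
        return snapshot


class PageViewProcessor:
    """Business logic: count one page view per event (illustrative)."""

    def __init__(self, holder):
        self.holder = holder

    def process(self, event):
        self.holder.add(event["site"], 1)


class Emitter:
    """Periodically outputs the holder's contents and clears them."""

    def __init__(self, holder, sink, interval=1.0, start=0.0):
        self.holder = holder
        self.sink = sink          # hypothetical callback receiving a dict of deltas
        self.interval = interval
        self.last_emit = start

    def tick(self, now):
        if now - self.last_emit >= self.interval:
            delta = self.holder.drain()
            if delta:
                self.sink(delta)
            self.last_emit = now
```

Because each emit carries only the deltas since the previous one, a downstream aggregator can simply add them up, and a worker holds at most one interval's worth of state.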
Real‑Time Storage
Real‑time statistics are read by the web layer and have two characteristics: frequent writes (up to once per second) and relatively infrequent reads. Data is split into fixed (e.g., URLs, keywords) and dynamic (e.g., per‑site PV/UV) categories.
TA uses NoSQL solutions: Redis for dynamic data and LevelDB for fixed data.
Redis
Redis serves as the primary real‑time storage component. TA scales it by sharding data across independent instances rather than running it as a cluster. Its rich data structures (hashes, sets, etc.) fit analytics needs. TA also extends the Redis command set with arithmetic operations, so that a result which would otherwise need several queries is computed in a single one, improving CPU utilization and throughput.
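The effect of sharding and of a server-side arithmetic command can be illustrated with an in-memory stand-in. Nothing here is a real Redis client: FakeShard is a toy class, hincrby and hget only loosely mimic the Redis commands of the same name, and hratio is a hypothetical extended command invented for this sketch, since the source does not list TA's actual extensions. Routing by CRC32 of the site ID is likewise an assumption.

```python
import zlib


class FakeShard:
    """In-memory stand-in for one Redis instance (not a real client)."""

    def __init__(self):
        self.hashes = {}
        self.round_trips = 0  # counts simulated requests to this shard

    def hincrby(self, key, field, amount):
        self.round_trips += 1
        h = self.hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
        return h[field]

    def hget(self, key, field):
        self.round_trips += 1
        return self.hashes.get(key, {}).get(field, 0)

    def hratio(self, key, numerator, denominator):
        """Hypothetical extended command: divide two fields in one request."""
        self.round_trips += 1
        h = self.hashes.get(key, {})
        d = h.get(denominator, 0)
        return h.get(numerator, 0) / d if d else 0.0


def shard_for(site_id: int, shards):
    """Pick a shard deterministically from the site ID (assumed scheme)."""
    return shards[zlib.crc32(str(site_id).encode()) % len(shards)]
```

With plain reads, a bounce rate would cost two hget calls plus a division on the client; the combined command returns the same number in one request, which is the kind of saving the text describes.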
LevelDB
LevelDB complements Redis by storing immutable data on disk with high write performance. TA shards the data by domain and replicates it by writing every record twice (double‑write). Shard assignments are adjusted dynamically based on load, without moving existing data.
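One way such a scheme could work is sketched below. It is an interpretation, not a description of TA's router: plain dictionaries stand in for LevelDB instances, the class DomainRouter is hypothetical, and the policy of pinning each domain to the least-loaded shard pair at first write is an assumption about how load-based adjustment can avoid moving data.

```python
class DomainRouter:
    """Routes each domain to a (primary, replica) pair and double-writes."""

    def __init__(self, shard_pairs):
        # Each pair is (primary, replica); dicts stand in for LevelDB stores.
        self.shard_pairs = shard_pairs
        self.routes = {}                      # domain -> index of its pair
        self.load = [0] * len(shard_pairs)    # writes seen per pair

    def _pair_index(self, domain):
        # Only new domains are placed by load; existing routes never
        # change, so no stored data has to move (assumed policy).
        if domain not in self.routes:
            self.routes[domain] = self.load.index(min(self.load))
        return self.routes[domain]

    def put(self, domain, key, value):
        idx = self._pair_index(domain)
        primary, replica = self.shard_pairs[idx]
        primary[(domain, key)] = value        # double-write: both copies
        replica[(domain, key)] = value
        self.load[idx] += 1

    def get(self, domain, key):
        idx = self.routes.get(domain)
        if idx is None:
            return None
        primary, replica = self.shard_pairs[idx]
        # Fall back to the replica if the primary lacks the record.
        return primary.get((domain, key), replica.get((domain, key)))
```

Since fixed data such as URLs is written once and never updated, pinning a domain to its pair is cheap: rebalancing only affects where future domains go.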
Query components include a Redis Protocol Stack for direct client access, a Query Rule Engine for intelligent multi‑source queries, and a Query Compute Engine for real‑time calculation, reducing Redis storage usage.
Future Outlook
While TA already provides second‑level data updates, future work will focus on dynamic data refreshes to give webmasters immediate insight into marketing effectiveness.
Source: http://blog.csdn.net/guolong1983811/article/details/50393093
