Architecture and Real‑Time Processing Design of Tencent Analytics (TA)
This article explains the architecture, real‑time computation framework, and storage solutions of Tencent Analytics, detailing how massive TB‑level web‑traffic data are collected via JavaScript, processed in memory‑centric streaming components, and stored using Redis and LevelDB to achieve second‑level updates.
TA (Tencent Analytics) is a free web‑analytics platform for third‑party site owners that has gained praise for its data stability, timeliness, and second‑level real‑time updates. The article explores TA’s system architecture and implementation principles, covering real‑time data processing, storage, and overall design.
TA collects user behavior data through a JavaScript snippet embedded on websites, sends the data to a collection cluster for filtering, encoding, and formatting, then forwards it to processing clusters that compute business metrics and write results to storage clusters for presentation to site owners. The basic workflow is illustrated in the accompanying diagram.
The backend consists of both offline and real‑time components: the real‑time part updates metrics every second, while the offline part handles complex cross‑day analyses and updates daily. The main modules are:
HTTP Access: parses the HTTP protocol, then cleans and formats the data.
ESC (Event Streaming Coder): encodes non‑enumerable data types into integers and persists the mapping.
ESP (Event Streaming Processor): reorganizes data by site and UID, and calculates PV, UV, dwell time, bounce rate, etc.
ESA (Event Streaming Aggregator): aggregates ESP results per site and writes them to Redis.
Center: central node for configuration, data routing, and disaster‑recovery switching.
Logserver: writes collected access data as strings to files and uploads them to TDCP.
TDCP (Tencent Distributed Computing Platform): performs offline calculations and writes the results to MySQL.
The real‑time solution has to process TB‑scale daily data from hundreds of thousands of sites, with billions of URLs and over a billion keys. The approach rests on three choices: a fully binary data representation, in‑memory computation, and NoSQL storage.
The real‑time computation layer draws on ideas from Hadoop, S4, and Storm to build a generic, highly extensible, in‑memory event‑processing system. For data organization, all non‑integer values are converted to integers: enumerable types through a configured mapping, and non‑enumerable types through MD5 hashing. The protocol defines an extensible Event structure with semi‑automatic serialization and deserialization, a compact binary encoding (ZigZag, similar to Protobuf), and support for arbitrary Event implementations.
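The integer conversion and compact encoding described above can be sketched roughly as follows. This is a minimal illustration, not TA's actual code: the `BROWSER_CODES` table, the function names, and the choice to keep only the first 8 bytes of the MD5 digest are all assumptions made for the example. The ZigZag and varint routines follow the scheme Protobuf uses for signed integers.

```python
import hashlib

# Hypothetical configuration mapping for one enumerable field (browser family).
BROWSER_CODES = {"chrome": 1, "firefox": 2, "safari": 3}


def encode_enumerable(value, mapping):
    """Map an enumerable value to its configured integer code (0 = unknown)."""
    return mapping.get(value.lower(), 0)


def encode_non_enumerable(value):
    """Derive an integer from the MD5 digest of an open-ended value such as a URL.

    Keeping only 8 bytes (64 bits) is an assumption of this sketch; the
    article does not say how many digest bits are retained.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def zigzag_encode(n):
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_encode(u):
    """Encode an unsigned integer in 7-bit groups, least significant group first."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

With this layout a small counter delta such as -1 becomes the unsigned value 1 and occupies a single byte on the wire, which is the point of combining ZigZag with a variable-length encoding.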
The incremental computation model consists of three parts:
Processor: executes business‑logic calculations.
Data Holder: stores incremental results and intermediate state.
Emitter: periodically outputs and clears incremental results.
The processing flow follows three steps: an Event is received and the Processor computes on it; the Data Holder saves the results and intermediate data; the Emitter periodically outputs the results and clears the state. Because each node holds only the increments accumulated since the last output, this model reduces per‑machine transaction state and improves overall performance.
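The three roles above might look like this in a stripped-down form. It is a sketch under assumptions: events are reduced to hypothetical `(site_id, uid)` tuples, the sink is a plain list standing in for the downstream aggregator, and `flush()` is called by hand where a real system would use a timer.

```python
from collections import defaultdict


class DataHolder:
    """Holds incremental results (page views) and intermediate state (UIDs seen)."""

    def __init__(self):
        self.pv = defaultdict(int)
        self.uids = defaultdict(set)

    def clear(self):
        self.pv.clear()
        self.uids.clear()


class Processor:
    """Business logic: count page views and record visitors per site."""

    def __init__(self, holder):
        self.holder = holder

    def process(self, event):
        site_id, uid = event
        self.holder.pv[site_id] += 1
        self.holder.uids[site_id].add(uid)


class Emitter:
    """Outputs the increments gathered since the last flush, then clears them."""

    def __init__(self, holder, sink):
        self.holder = holder
        self.sink = sink

    def flush(self):
        for site_id, pv in self.holder.pv.items():
            # The visitor count is distinct UIDs within this interval only;
            # a true daily UV needs cross-interval deduplication downstream.
            visitors = len(self.holder.uids[site_id])
            self.sink.append((site_id, pv, visitors))
        self.holder.clear()
```

Feeding the events `(1, 100)`, `(1, 100)`, `(1, 101)`, `(2, 200)` through a `Processor` and then calling `flush()` leaves two records in the sink, one per site, and an empty `DataHolder` ready for the next interval.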
Real‑time storage must support two typical access patterns: frequent writes (updates as often as once per second) and comparatively infrequent reads. Fixed, immutable data such as URLs and keywords are stored in LevelDB, while dynamic, frequently updated metrics are stored in Redis.
Redis is chosen for its high performance, rich data structures (hashes, sets, etc.), and ease of extension. Custom and extended commands (e.g., an extended sort, hmget, and hmincrby) enable arithmetic operations and batch field updates, reducing both query latency and CPU usage.
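The article does not give the signature of the custom hmincrby command, so the following only models its likely intent: incrementing several fields of one hash in a single call, where stock Redis HINCRBY increments one field per call. `HashStore` is an in-memory stand-in written for this example, not a Redis client, and the batch method's shape is an assumption.

```python
class HashStore:
    """Tiny in-memory stand-in for Redis hashes, used only for illustration."""

    def __init__(self):
        self._hashes = {}

    def hincrby(self, key, field, delta):
        """Mirror stock HINCRBY: increment one field and return its new value."""
        h = self._hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + delta
        return h[field]

    def hmincrby(self, key, deltas):
        """Hypothetical batch form: apply several field increments in one call."""
        return {field: self.hincrby(key, field, d) for field, d in deltas.items()}

    def hmget(self, key, *fields):
        """Mirror HMGET: return values in field order, None for missing fields."""
        h = self._hashes.get(key, {})
        return [h.get(f) for f in fields]
```

In a networked setting the benefit of the batch form is that one request carries all increments for a site, so a one-second flush of many metrics costs one round trip per key instead of one per field.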
LevelDB complements Redis by handling disk‑based storage of immutable data. The system employs double‑write replication and sharding by domain name to ensure high availability and balanced load. Sharding strategies can be adjusted dynamically without moving data.
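How TA changes its sharding strategy without moving data is not spelled out. One plausible way, shown here purely as an assumption, is to pin each domain to a shard the first time it is seen, so that adding shards affects only domains that arrive later. The `Router` class, its method names, and the use of dictionaries as storage nodes are all hypothetical.

```python
import hashlib


class Router:
    """Routes each domain to a shard; every shard is a (primary, replica) pair."""

    def __init__(self, shards):
        self.shards = list(shards)  # list of (primary_dict, replica_dict)
        self.assigned = {}          # domain -> shard index, fixed once set

    def shard_for(self, domain):
        if domain not in self.assigned:
            digest = hashlib.md5(domain.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % len(self.shards)
            self.assigned[domain] = index  # pin so later changes move nothing
        return self.assigned[domain]

    def write(self, domain, key, value):
        primary, replica = self.shards[self.shard_for(domain)]
        primary[key] = value  # double write: both copies receive every update
        replica[key] = value

    def read(self, domain, key):
        primary, replica = self.shards[self.shard_for(domain)]
        if key in primary:
            return primary[key]
        return replica.get(key)  # fall back to the second copy

    def add_shard(self, shard):
        """Extend capacity; only domains not yet assigned can land on the new shard."""
        self.shards.append(shard)
```

The cost of this design is that the assignment table itself must be stored and replicated, which fits the role the article gives to the Center node for data routing.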
Query capabilities are built on three layers: the Redis Protocol Stack provides a universal query interface; the Query Rule Engine performs intelligent multi‑source queries and join‑like operations across Redis, LevelDB, and optionally MySQL/HBase; the Query Compute Engine adds real‑time computation on top of basic query results, reducing Redis storage pressure.
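A join-like query followed by query-time computation could be sketched as below. Plain dictionaries stand in for the two stores: `metrics` plays the role of Redis counters keyed by integer URL ID, and `url_dict` plays the role of the LevelDB ID-to-URL dictionary. The field names (`pv`, `visits`, `bounces`) and the function name are assumptions for the example.

```python
def top_pages(metrics, url_dict, limit=10):
    """Join per-URL counters with the ID -> URL dictionary, ranked by page views."""
    rows = []
    for url_id, counters in metrics.items():
        visits = counters.get("visits", 0)
        bounces = counters.get("bounces", 0)
        rows.append({
            # Join step: resolve the integer ID back to the original URL.
            "url": url_dict.get(url_id, "(unknown)"),
            "pv": counters.get("pv", 0),
            # Compute step: derived at query time rather than stored.
            "bounce_rate": bounces / visits if visits else 0.0,
        })
    rows.sort(key=lambda row: row["pv"], reverse=True)
    return rows[:limit]
```

Deriving ratios such as bounce rate when the query runs means only the raw counters need to be kept up to date, which matches the stated goal of reducing Redis storage pressure.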
Looking ahead, although TA already achieves second‑level data updates, the presentation layer remains static. Future work will focus on dynamic data refreshes to give site owners immediate insight into marketing performance.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.