How Tencent Analytics Achieves TB-Scale Real-Time Data Processing

This article examines Tencent Analytics' architecture, detailing its real-time data collection, in‑memory computation, and NoSQL storage strategies that enable second‑level updates for terabyte‑scale web traffic.

ITFLY8 Architecture Home

Nov 22, 2016

Abstract: TA (Tencent Analytics) is a free website analytics system for third‑party webmasters, praised for data stability, timeliness, and second‑level real‑time updates. This article explores its system architecture and implementation principles.

Web analytics studies user browsing behavior to monitor site performance and guide optimization. Popular products include Google Analytics, CNZZ, and Baidu Tongji. TA distinguishes itself with community analysis, user profiling, and tools, and it handles TB-level data every day with high availability.

Basic Principles and System Architecture

TA collects user behavior through a JavaScript snippet embedded in each site. The snippet sends data to a collection cluster, which filters, encodes, and formats it before distributing it. A processing cluster then applies the business logic and writes the results to a storage cluster, from which they are presented to webmasters.
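To make the filter-and-format step concrete, here is a minimal sketch of how one tracking request might be turned into a flat record. The host `collector.example`, the query parameters `sid`, `uid`, and `url`, and the function name `parse_beacon` are all assumptions made for illustration; the source does not describe TA's actual request format.

```python
from urllib.parse import urlsplit, parse_qs

def parse_beacon(request_url):
    """Turn one tracking request into a flat record, or None if it is invalid.

    The parameter names (sid, uid, url) are hypothetical, not TA's real ones.
    """
    params = parse_qs(urlsplit(request_url).query)
    try:
        return {
            "site_id": int(params["sid"][0]),
            "uid": params["uid"][0],
            "page": params["url"][0],   # parse_qs already percent-decodes values
        }
    except (KeyError, ValueError):
        return None   # filtered out: missing field or non-numeric site id

record = parse_beacon("http://collector.example/t?sid=42&uid=u1&url=%2Fhome")
rejected = parse_beacon("http://collector.example/t?uid=u1")
```

Malformed requests are dropped at the edge here, so later stages only see well-formed records.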

The backend consists of real‑time and offline parts; the real‑time component updates data every second, while the offline part handles complex cross‑day analyses.

Http Access: parses HTTP, cleans and formats data.

ESC (Event Streaming Coder): encodes non‑enumerable types into integers.

ESP (Event Streaming Processor): reorganizes data by site/UID and calculates metrics such as PV, UV, dwell time, and bounce rate.

ESA (Event Streaming Aggregator): aggregates ESP results per site and writes to Redis.

Center: manages configuration, data routing, and failover.

Logserver: writes raw access data to files and uploads to TDCP.

TDCP (Tencent Distributed Computing Platform): processes offline data and writes results to MySQL.
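As a rough illustration of the per-site metrics ESP is described as computing, the sketch below derives PV, UV, and bounce rate from a small batch of events. The event layout `(site_id, uid, session_id)` and the function name `site_metrics` are assumptions, not TA's schema, and bounce rate is taken here to mean the share of sessions with exactly one page view.

```python
from collections import defaultdict

def site_metrics(events):
    """Compute PV, UV, and bounce rate per site.

    events: iterable of (site_id, uid, session_id) tuples, one per page view.
    This layout is hypothetical, not TA's actual event schema.
    """
    pv = defaultdict(int)
    uids = defaultdict(set)
    session_views = defaultdict(lambda: defaultdict(int))
    for site_id, uid, session_id in events:
        pv[site_id] += 1
        uids[site_id].add(uid)
        session_views[site_id][session_id] += 1

    result = {}
    for site_id in pv:
        sessions = session_views[site_id]
        # A session "bounces" if it contains exactly one page view.
        bounces = sum(1 for views in sessions.values() if views == 1)
        result[site_id] = {
            "pv": pv[site_id],
            "uv": len(uids[site_id]),
            "bounce_rate": bounces / len(sessions),
        }
    return result

events = [
    (1, 100, "a"), (1, 100, "a"),  # one visitor, two page views in one session
    (1, 200, "b"),                 # a second visitor who views a single page
]
metrics = site_metrics(events)
```

Grouping by site and UID first, as the ESP description suggests, lets each site's counters be updated without coordinating with other sites.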

Real‑Time Solution

TA processes TB-scale data from hundreds of thousands of sites and stores billions of keys. The solution rests on three ideas: an all-binary data representation, in-memory computation, and NoSQL storage.

Real‑Time Computation

Inspired by Hadoop, S4, and Storm, TA implements a generic, extensible in‑memory event processing system.

Data Organization. All non-int values are converted to ints to reduce memory usage. Enumerable types are mapped to unique ints. Non-enumerable types are hashed with MD5 to produce ints that are only approximately unique, since distinct values can collide.
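The two conversions can be sketched as follows. The class `EnumEncoder`, the function `hash_to_int`, and the choice to keep the first 8 bytes of the MD5 digest are assumptions for illustration; the source only states that MD5 is used to derive an approximate int.

```python
import hashlib

class EnumEncoder:
    """Assigns each distinct value of an enumerable field a small unique int."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # setdefault stores the next free id only the first time a value is seen.
        return self._ids.setdefault(value, len(self._ids))

def hash_to_int(text, num_bytes=8):
    """Map an arbitrary string to an int by truncating its MD5 digest.

    Truncation to num_bytes is an assumption; distinct strings can collide,
    which is why the result is only approximately unique.
    """
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:num_bytes], "big")

browsers = EnumEncoder()
chrome_id = browsers.encode("chrome")
firefox_id = browsers.encode("firefox")
url_id = hash_to_int("http://example.com/some/long/path?x=1")
```

A fixed-width int replaces a variable-length string in every in-memory structure, which is where the memory saving comes from.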

Protocol. An extensible Event structure supports semi‑automatic serialization (similar to msgpack) and compact binary encoding (Zigzag, akin to Protobuf), enabling high‑performance I/O.
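For readers unfamiliar with Zigzag encoding, the sketch below shows the Protobuf-style scheme: signed ints are mapped to unsigned ones so that small magnitudes stay small, then written as variable-length bytes. TA's actual wire format is not described in the source, so this illustrates the general technique only, and all function names are illustrative.

```python
def zigzag_encode(n):
    """Map a signed int to an unsigned one: 0->0, -1->1, 1->2, -2->3, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def varint_encode(u):
    """Encode an unsigned int in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data):
    """Decode one varint; return (value, number of bytes consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")

encoded = varint_encode(zigzag_encode(-3))
value, consumed = varint_decode(encoded)
decoded = zigzag_decode(value)
```

With this scheme a small negative number such as -3 occupies one byte instead of a full fixed-width integer.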

Incremental Computation Model. The model consists of three components: Processor, Data Holder, and Emitter.

Processor: handles specific business calculations.

Data Holder: retains incremental results and intermediate data.

Emitter: triggers periodic output and clears results.

The model reduces per‑machine transaction state, simplifying distributed implementation and boosting performance.
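A minimal sketch of how the three roles might fit together is shown below, using a per-site page-view counter as the business logic. The class names follow the article, but the method names, the event dictionary, and the `sink` callback are assumptions; in a real system a timer rather than explicit calls would drive the emitter.

```python
from collections import defaultdict

class DataHolder:
    """Keeps incremental results accumulated since the last emit."""

    def __init__(self):
        self.pv = defaultdict(int)

class PVProcessor:
    """Business logic: count page views per site."""

    def __init__(self, holder):
        self.holder = holder

    def process(self, event):
        self.holder.pv[event["site_id"]] += 1

class Emitter:
    """Outputs the accumulated increments, then clears them."""

    def __init__(self, holder, sink):
        self.holder = holder
        self.sink = sink

    def emit(self):
        if self.holder.pv:
            self.sink(dict(self.holder.pv))   # hand a snapshot downstream
            self.holder.pv.clear()            # start the next interval empty

holder = DataHolder()
processor = PVProcessor(holder)
outputs = []
emitter = Emitter(holder, outputs.append)

for site_id in (1, 1, 2):
    processor.process({"site_id": site_id})
emitter.emit()                     # first interval
processor.process({"site_id": 1})
emitter.emit()                     # second interval
```

Because each emit carries only the increments since the previous one, a node holds little long-lived state, which is consistent with the article's point about reduced per-machine state.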

Real‑Time Storage

Real-time statistics are read by the web layer and have two characteristics: writes are frequent (up to once per second) and reads are relatively infrequent. Data is split into two categories: fixed (e.g., URLs, keywords) and dynamic (e.g., per-site PV/UV).

TA uses NoSQL solutions: Redis for dynamic data and LevelDB for fixed data.

Redis

Redis serves as the primary real-time storage component, deployed with sharding rather than clustering. Its rich data structures (hashes, sets, etc.) fit analytics needs. TA extends the Redis command set with arithmetic operations, so that a result which would otherwise need several queries can be obtained with a single one, improving CPU utilization and throughput.
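The sketch below illustrates both ideas with an in-memory stand-in rather than a real Redis client, so it runs without a server. `InMemoryShard`, `ShardedStore`, and especially `hsum` are hypothetical: `hsum` is not a real Redis command and only stands in for the kind of arithmetic extension the article mentions. Routing by CRC32 of the site id is likewise an assumption.

```python
import zlib

class InMemoryShard:
    """Stand-in for one Redis instance; holds hashes of integer fields."""

    def __init__(self):
        self.hashes = {}
        self.calls = 0   # counts round trips, to compare access patterns

    def hincrby(self, key, field, amount):
        self.calls += 1
        h = self.hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
        return h[field]

    def hsum(self, key, fields):
        """Hypothetical extended command: sum several fields server-side."""
        self.calls += 1
        h = self.hashes.get(key, {})
        return sum(h.get(f, 0) for f in fields)

class ShardedStore:
    """Routes each site to one shard by hashing its id."""

    def __init__(self, num_shards):
        self.shards = [InMemoryShard() for _ in range(num_shards)]

    def shard_for(self, site_id):
        # crc32 is stable across processes, unlike Python's built-in str hash.
        index = zlib.crc32(str(site_id).encode()) % len(self.shards)
        return self.shards[index]

store = ShardedStore(4)
shard = store.shard_for(42)
shard.hincrby("site:42", "pv_mobile", 3)
shard.hincrby("site:42", "pv_desktop", 5)
before = shard.calls
total = shard.hsum("site:42", ["pv_mobile", "pv_desktop"])
round_trips = shard.calls - before
```

Summing on the server side costs one round trip, where reading each field separately and adding on the client would cost one per field.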

LevelDB

LevelDB complements Redis by storing immutable data on disk with high write performance. TA writes every record twice for replication and shards the data by domain. Shard assignments are adjusted dynamically based on load, without moving existing data.
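One way such a scheme could work is sketched below, with plain dictionaries standing in for LevelDB stores. The source does not explain how TA rebalances without moving data; the approach here, where a routing table pins each domain to a shard on first write and sends new domains to the least-loaded shard, is purely an assumption, as are the names `DomainRouter`, `put`, and `get`.

```python
class DomainRouter:
    """Shards records by domain and writes each record to two copies."""

    def __init__(self, num_shards):
        # Each shard is a (primary, replica) pair of dicts standing in
        # for two LevelDB stores.
        self.shards = [({}, {}) for _ in range(num_shards)]
        self.route = {}   # domain -> shard index, fixed once assigned

    def _assign(self, domain):
        if domain not in self.route:
            # New domains go to the shard holding the fewest keys; existing
            # domains keep their shard, so no stored data has to move.
            self.route[domain] = min(
                range(len(self.shards)),
                key=lambda i: len(self.shards[i][0]),
            )
        return self.route[domain]

    def put(self, domain, key, value):
        primary, replica = self.shards[self._assign(domain)]
        primary[(domain, key)] = value   # double write: both copies get it
        replica[(domain, key)] = value

    def get(self, domain, key):
        if domain not in self.route:
            return None
        primary, replica = self.shards[self.route[domain]]
        if (domain, key) in primary:
            return primary[(domain, key)]
        return replica.get((domain, key))   # fall back to the second copy

router = DomainRouter(2)
router.put("a.example", "url:1", "/index")
router.put("a.example", "url:2", "/about")
router.put("b.example", "url:1", "/home")
```

Because the stored data is immutable, the double write needs no conflict resolution: either copy can serve a read.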

Query components include a Redis Protocol Stack for direct client access, a Query Rule Engine for intelligent multi‑source queries, and a Query Compute Engine for real‑time calculation, reducing Redis storage usage.
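The idea behind computing at query time can be shown with a small sketch: only raw counters are stored, and ratios are derived when a report is requested. The counter names and the two derived metrics below are assumptions chosen for illustration; the source does not list which values TA's Query Compute Engine derives.

```python
def derived_metrics(counters):
    """Compute derived values at query time from stored raw counters.

    counters: dict of raw per-site totals; the field names are hypothetical.
    """
    visits = counters.get("visits", 0)
    pv = counters.get("pv", 0)
    dwell_seconds = counters.get("dwell_seconds", 0)
    return {
        "pages_per_visit": pv / visits if visits else 0.0,
        "avg_dwell_seconds": dwell_seconds / visits if visits else 0.0,
    }

stored = {"pv": 120, "visits": 40, "dwell_seconds": 3600}
report = derived_metrics(stored)
```

Since the ratios are never written back, they take no space in Redis and cannot drift out of sync with the counters they depend on.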

Future Outlook

While TA already provides second‑level data updates, future work will focus on dynamic data refreshes to give webmasters immediate insight into marketing effectiveness.

Source: http://blog.csdn.net/guolong1983811/article/details/50393093

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
