Design and Evolution of Zhihu's Event Tracking (埋点) System
This article presents a comprehensive overview of Zhihu's event‑tracking system, covering its motivation, toolset, demand‑management platform, verification workflow, data‑collection pipeline, query service architecture, cloud‑native data service design, and practical Q&A on best practices and optimization strategies.
With the continuous development of big data, DT (data technology), and AI, event tracking (埋点) has become a crucial data source for analysis and decision‑making, especially in the AI era, where massive volumes of data are required to train models.
The talk is organized into eight parts: an introduction, an overview of tracking tools, demand‑management, verification, data collection, data query, data service, and a Q&A session.
Event‑tracking tools include SDKs and web‑request sniffers that help developers design, implement, and validate tracking points, improving efficiency and data quality.
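To make the SDK side concrete, here is a minimal sketch of what event emission might look like. The envelope fields, function names, and required-key set are illustrative assumptions, not Zhihu's actual SDK; the point is that every event carries a standard envelope and is validated before it leaves the client.

```python
import json
import time
import uuid

# Hypothetical envelope keys every tracking event must carry (assumption
# for this sketch, not Zhihu's real schema).
REQUIRED_KEYS = {"event_id", "event_name", "timestamp_ms", "user_id"}

def build_event(event_name: str, user_id: str, properties: dict) -> dict:
    """Wrap business properties in a standard tracking envelope."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_name": event_name,
        "timestamp_ms": int(time.time() * 1000),
        "user_id": user_id,
        "properties": properties,
    }

def serialize(event: dict) -> str:
    """Fail fast on malformed events before they ever leave the client."""
    missing = REQUIRED_KEYS - event.keys()
    if missing:
        raise ValueError(f"event missing required keys: {sorted(missing)}")
    return json.dumps(event, ensure_ascii=False)

event = build_event("card_click", "u_123", {"card_type": "answer", "index": 3})
payload = serialize(event)
```

Validating at build time is what lets the later verification stage focus on semantic checks (is this event registered? are its properties right?) rather than basic well-formedness.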
The tracking‑demand management platform at Zhihu evolved from version 1.0 to 2.0, focusing on cost reduction and efficiency. The new version consolidates multiple configuration steps into a single streamlined workflow, lowering the learning curve and speeding up design.
Verification moved from manual single‑point packet capture to a cloud‑native, high‑availability platform that uses message‑queue middleware, enabling stateless multi‑node deployment and faster, more reliable testing.
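The statelessness described above can be sketched as follows: each worker pulls captured events off a queue and checks them against the registered tracking plan, holding no local state, so any node can process any event. Here `queue.Queue` stands in for the real message-queue middleware, and the spec format is an assumption for illustration.

```python
import json
import queue

# Hypothetical tracking plan: event_name -> required property keys.
SPECS = {
    "card_click": {"card_type", "index"},
}

def verify(raw: str) -> tuple:
    """Return (passed, reason) for one captured tracking event."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    spec = SPECS.get(event.get("event_name"))
    if spec is None:
        return False, "event not registered in the tracking plan"
    missing = spec - set(event.get("properties", {}))
    if missing:
        return False, f"missing properties: {sorted(missing)}"
    return True, "ok"

def worker(mq: queue.Queue, results: list) -> None:
    """Drain the queue; no per-node state, so nodes scale horizontally."""
    while not mq.empty():
        results.append(verify(mq.get()))

mq = queue.Queue()
mq.put(json.dumps({"event_name": "card_click",
                   "properties": {"card_type": "answer", "index": 3}}))
mq.put(json.dumps({"event_name": "card_click", "properties": {}}))
results = []
worker(mq, results)
```

Because `verify` depends only on its input and the shared spec, adding nodes is just a matter of pointing more consumers at the same queue.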
Data collection in version 1.0 relied on a Python‑based pipeline with local buffers and Kafka, which suffered from high latency and carried maintenance risk. Version 2.0 redesigns the pipeline with multi‑path message backup, reducing end‑to‑end latency to ~30 ms (about 1/15 of the previous time) and allowing horizontal scaling.
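One way to read "multi-path message backup" is that an event has more than one delivery path, so an outage on one path does not lose data. The sketch below uses in-memory sinks as stand-ins for Kafka clusters or log channels; the names and the failover policy are assumptions, not the talk's actual design.

```python
class Sink:
    """In-memory stand-in for one delivery path (e.g. a Kafka cluster)."""
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.messages = []

    def send(self, msg: str) -> bool:
        if not self.healthy:
            return False  # simulate a path outage
        self.messages.append(msg)
        return True

def dispatch(msg: str, paths: list) -> str:
    """Try each path in priority order; return the path that accepted."""
    for sink in paths:
        if sink.send(msg):
            return sink.name
    raise RuntimeError("all delivery paths failed")

primary = Sink("kafka-primary", healthy=False)  # simulate primary outage
backup = Sink("kafka-backup")
used = dispatch('{"event_name": "pageview"}', [primary, backup])
```

The same structure supports fan-out (write to every healthy path) when duplication downstream is acceptable; the failover variant shown here trades that redundancy for lower write amplification.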
Data query is provided via a web API that abstracts the underlying storage. It uses Doris for high‑throughput dimensional queries and Presto on Hive for both batch and real‑time analytics, delivering fast and accurate results to product, operations, and other business stakeholders.
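The value of the abstraction is that callers describe a workload, not an engine. A routing layer like the following could sit behind the web API; the request shape, threshold, and engine names here are illustrative assumptions about how such routing might work, not the talk's documented rules.

```python
from dataclasses import dataclass, field

@dataclass
class QueryRequest:
    # Hypothetical request shape the web API might accept.
    metric: str
    dimensions: list = field(default_factory=list)
    days_of_history: int = 1

def choose_engine(req: QueryRequest) -> str:
    """Route by workload shape rather than exposing storage to callers."""
    # Short windows with dimensional slicing -> Doris (interactive OLAP).
    if req.days_of_history <= 31 and req.dimensions:
        return "doris"
    # Long historical scans or no slicing -> Presto over Hive (batch).
    return "presto"

interactive = choose_engine(QueryRequest("dau", ["platform"], 7))
batch = choose_engine(QueryRequest("dau", [], 365))
```

Keeping this decision server-side means the storage mix can change (new engine, migrated table) without breaking a single API consumer.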
The data service layer integrates three core designs: data integration to lower heterogeneous source costs, logical models to avoid duplicated physical schemas and enable API‑driven access, and cloud‑native architecture to ensure high availability and seamless field‑change handling.
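The "logical model" point deserves a concrete illustration: the API exposes stable logical field names, and a mapping layer translates them to whatever physical columns currently back them, so a physical rename or table move only touches the mapping. All table and column names below are hypothetical.

```python
# Hypothetical mapping from stable logical fields to current physical
# columns. A field change in the warehouse edits this dict, not callers.
LOGICAL_TO_PHYSICAL = {
    "user_id": "dwd_events.uid",
    "event_name": "dwd_events.evt",
}

def to_physical_select(logical_fields: list) -> str:
    """Translate a logical field list into a physical SELECT clause."""
    unknown = [f for f in logical_fields if f not in LOGICAL_TO_PHYSICAL]
    if unknown:
        raise KeyError(f"unknown logical fields: {unknown}")
    cols = ", ".join(f"{LOGICAL_TO_PHYSICAL[f]} AS {f}"
                     for f in logical_fields)
    return f"SELECT {cols}"

sql = to_physical_select(["user_id", "event_name"])
```

Aliasing back to the logical name (`AS user_id`) is what makes the indirection invisible to API consumers: they see the same result schema before and after a physical change.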
The Q&A covers topics such as which roles own parameter design, client‑side versus server‑side session reporting, the characteristics of a good tracking system, aligning tracking versions with product releases, and cost optimization through lifecycle management of tracking points and warehouse tables.
Overall, the presentation demonstrates how a modern, cloud‑native event‑tracking platform can support large‑scale data collection, high‑quality verification, and efficient querying, thereby empowering data‑driven product and operation decisions.
DataFunSummit