Real-Time Data and User Profiling Practices at Zhihu: Architecture, Challenges, and Solutions
This article is a case study of Zhihu's data empowerment team: the design of its real‑time data platform and user‑profiling system, the scalability, latency, and data‑quality challenges the team faced, and the practical solutions and architectural choices that delivered business value.
In August 2021, Zhihu's platform team established a data empowerment group to address growing demands for user profiling and real‑time data across multiple business lines. They selected Baidu Intelligent Cloud's Palo as the real‑time data warehouse and built layers for data integration, scheduling, quality, and application.
The core challenges included delivering timely business metrics, providing complex real‑time calculations, supporting multi‑dimensional user segmentation, and ensuring data freshness within minutes for algorithmic features. Specific pain points involved high‑frequency data deduplication, complex joins, and the need for rapid data ingestion and processing.
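High‑frequency deduplication is the kind of pain point that bitmap structures address well. As a minimal illustrative sketch (not Zhihu's actual code), the following counts distinct integer user IDs with a Python integer used as a bitset, avoiding a per‑ID hash‑set entry; the event data is invented for the example.

```python
def count_distinct(user_ids):
    """Count distinct non-negative integer IDs using an int as a bitset."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid          # set the bit for this user ID
    return bin(bitmap).count("1")   # popcount = number of distinct IDs

# Duplicate exposure events from the same users within one window.
events = [42, 7, 42, 13, 7, 7, 99]
print(count_distinct(events))       # → 4
```

Production systems typically use compressed structures such as Roaring bitmaps (which Palo/Doris supports natively) rather than a raw bitset, but the set‑bits‑then‑popcount idea is the same.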
To tackle these, the team adopted a Lambda architecture combining minute‑level batch processing in Palo with second‑level stream processing using Flink. They constructed a modular stack consisting of an application layer, business model layer, tool layer, and infrastructure layer, each responsible for different aspects of data handling and business logic.
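The serving side of a Lambda architecture can be reduced to one idea: answer a query by merging a periodically recomputed batch view with an incremental streaming delta. The sketch below uses hypothetical metric names and in‑memory dicts to stand in for the Palo batch layer and the Flink speed layer.

```python
# Batch layer (e.g. minute-level recomputation in Palo): authoritative totals.
batch_view = {"exposures": 1_000_000}

# Speed layer (e.g. second-level Flink aggregation): counts since the last batch run.
speed_delta = {"exposures": 1_250}

def serve(metric):
    """Combine both layers so readers see fresh, complete totals."""
    return batch_view.get(metric, 0) + speed_delta.get(metric, 0)

print(serve("exposures"))  # → 1001250
```

When the next batch run lands, the delta for its window is discarded, which is how the architecture bounds the error introduced by the approximate streaming path.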
Key technical solutions included:
Building a real‑time data integration system that abstracts source configurations and supports various ingestion methods.
Implementing a real‑time scheduling framework that coordinates Kafka offsets and task dependencies to avoid premature execution.
Deploying a data quality center with monitoring, alerting, and automated remediation to detect and fix anomalies quickly.
Optimizing Palo performance by tuning routine loads, leveraging runtime filters, and adjusting parallel execution parameters.
Scaling the DMP (Data Management Platform) for user profiling by partitioning tags, parallelizing bitmap operations, and reducing file sizes.
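The offset‑coordination idea in the scheduling point above can be sketched simply: a downstream batch task is released only after every upstream Kafka partition's committed offset has passed the offset that closes the data window, so the task never runs on partial data. Partition numbers and offsets below are hypothetical.

```python
def ready_to_run(committed_offsets, required_offsets):
    """True only when each partition's committed offset has reached the
    offset corresponding to the end of the scheduled data window."""
    return all(
        committed_offsets.get(partition, -1) >= needed
        for partition, needed in required_offsets.items()
    )

# Offsets that mark the end of the current window, per partition.
required = {0: 5000, 1: 4800}

print(ready_to_run({0: 5100, 1: 4700}, required))  # → False (partition 1 lags)
print(ready_to_run({0: 5100, 1: 4900}, required))  # → True
```

A scheduler polling this predicate (rather than firing on wall‑clock time alone) is one way to avoid the premature‑execution problem the article describes.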
These efforts resulted in significant performance gains: daily ingestion of over 900 billion rows completed within three hours, user segment estimation in under a second, and full user‑profile analysis completed within five minutes. Business metrics such as exposure, conversion rates, and content creation quality also improved.
Looking forward, the team plans to further strengthen the toolchain, enhance data‑quality coverage, explore sub‑minute real‑time capabilities, and deepen user‑understanding tools to continue delivering value.
IT Architects Alliance