
Real-Time Data and User Profiling Practices at Zhihu: Architecture, Challenges, and Solutions

This article presents a comprehensive case study of Zhihu's data empowerment team, detailing the design of a real‑time data platform and user profiling system, the challenges faced in scalability, latency, and data quality, and the practical solutions and architectural choices implemented to drive business value.

IT Architects Alliance

In August 2021, Zhihu's platform team established a data empowerment group to meet growing demand for user profiling and real‑time data across multiple business lines. The group selected Baidu Intelligent Cloud's Palo as its real‑time data warehouse and built layers for data integration, scheduling, data quality, and applications on top of it.

The core challenges included delivering timely business metrics, providing complex real‑time calculations, supporting multi‑dimensional user segmentation, and ensuring data freshness within minutes for algorithmic features. Specific pain points involved high‑frequency data deduplication, complex joins, and the need for rapid data ingestion and processing.

To tackle these, the team adopted a Lambda architecture combining minute‑level batch processing in Palo with second‑level stream processing using Flink. They constructed a modular stack consisting of an application layer, business model layer, tool layer, and infrastructure layer, each responsible for different aspects of data handling and business logic.
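The Lambda pattern described above can be sketched as a serving view that adds a periodically refreshed batch result to a small delta of recent stream events. This is a minimal illustration, not Zhihu's implementation: the class `LambdaView`, its methods, and the in-memory event list are all hypothetical stand-ins for the Palo batch tables and the Flink stream jobs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LambdaView:
    """Hypothetical serving view: batch result plus a speed-layer delta."""

    # Event time (epoch seconds) up to which the batch layer has caught up.
    batch_watermark: int = 0
    # Per-key counts produced by the batch layer.
    batch_counts: dict = field(default_factory=lambda: defaultdict(int))
    # Recent events held by the speed layer as (timestamp, key) pairs.
    recent_events: list = field(default_factory=list)

    def on_stream_event(self, ts: int, key: str) -> None:
        """Speed layer: record an event as soon as it arrives."""
        self.recent_events.append((ts, key))

    def run_batch(self, up_to_ts: int) -> None:
        """Batch layer: fold events with ts <= up_to_ts into the batch view."""
        remaining = []
        for ts, key in self.recent_events:
            if ts <= up_to_ts:
                self.batch_counts[key] += 1
            else:
                remaining.append((ts, key))
        self.recent_events = remaining
        self.batch_watermark = up_to_ts

    def query(self, key: str) -> int:
        """Serve batch count plus events newer than the batch watermark."""
        speed = sum(
            1
            for ts, k in self.recent_events
            if k == key and ts > self.batch_watermark
        )
        return self.batch_counts.get(key, 0) + speed


view = LambdaView()
for ts in (10, 20, 70):
    view.on_stream_event(ts, "page_a")
view.run_batch(up_to_ts=60)   # the minute-level batch catches up to t=60
print(view.query("page_a"))   # batch part (2) plus stream part (1)
```

The point of the split is that a query never waits for the batch job: whatever the batch layer has not yet covered is answered from the stream side, and each batch run shrinks that delta again.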

Key technical solutions included:

Building a real‑time data integration system that abstracts source configurations and supports various ingestion methods.

Implementing a real‑time scheduling framework that coordinates Kafka offsets and task dependencies to avoid premature execution.

Deploying a data quality center with monitoring, alerting, and automated remediation to detect and fix anomalies quickly.

Optimizing Palo performance by tuning routine loads, leveraging runtime filters, and adjusting parallel execution parameters.

Scaling the DMP (Data Management Platform) for user profiling by partitioning tags, parallelizing bitmap operations, and reducing file sizes.
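To make the last item concrete, the sketch below shows one way tag bitmaps can be split into shards so that intersections run independently per shard and only the counts are summed. It is an assumption-laden illustration rather than the DMP's actual design: Python integers stand in for compressed bitmaps, and the shard count, the modulo sharding rule, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def build_tag_bitmaps(tag_to_users, num_shards=NUM_SHARDS):
    """Return {tag: [bitmap per shard]}; each bitmap is an int used as a bit set."""
    bitmaps = {}
    for tag, users in tag_to_users.items():
        shards = [0] * num_shards
        for uid in users:
            # Assumed sharding rule: user id modulo the shard count.
            shard, offset = uid % num_shards, uid // num_shards
            shards[shard] |= 1 << offset
        bitmaps[tag] = shards
    return bitmaps


def _count_shard(shard_bitmaps):
    """Intersect one shard's bitmaps across all tags and count the set bits."""
    acc = shard_bitmaps[0]
    for bitmap in shard_bitmaps[1:]:
        acc &= bitmap
    return bin(acc).count("1")


def estimate_segment_size(bitmaps, tags, num_shards=NUM_SHARDS):
    """Count users carrying every tag in `tags`, one task per shard."""
    per_shard = [[bitmaps[t][s] for t in tags] for s in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return sum(pool.map(_count_shard, per_shard))


tags = {
    "likes_tech": {1, 2, 3, 4, 5, 8},
    "active_7d": {2, 4, 5, 9},
    "creator": {4, 5, 10},
}
bitmaps = build_tag_bitmaps(tags)
print(estimate_segment_size(bitmaps, ["likes_tech", "active_7d"]))
```

Because a user id always lands in the same shard for every tag, each shard can be intersected without looking at the others. In CPython the thread pool only illustrates the task structure; a real engine would run the shards on separate cores or nodes.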

These efforts produced measurable gains: the daily ingestion of over 900 billion rows finishes within three hours, user‑segment size estimation returns in under a second, and a full user‑profile analysis completes within five minutes. The team also reports improvements in business metrics such as exposure, conversion rates, and content‑creation quality.

Looking forward, the team plans to further strengthen the toolchain, enhance data‑quality coverage, explore sub‑minute real‑time capabilities, and deepen user‑understanding tools to continue delivering value.

Tags: big data, data pipeline, data quality, real-time data, user profiling, lambda architecture
