Real-Time Data and User Profiling Practices at Zhihu: Architecture, Challenges, and Solutions
This article is a case study of Zhihu's data empowerment team: the design of its real‑time data platform and user‑profiling system, the scalability, latency, and data‑quality challenges the team faced, and the practical solutions and architectural choices that delivered business value.
In August 2021, Zhihu's platform team established a data empowerment group to address growing demands for user profiling and real‑time data across multiple business lines. They selected Baidu Intelligent Cloud's Palo as the real‑time data warehouse and built layers for data integration, scheduling, quality, and application.
The core challenges included delivering timely business metrics, providing complex real‑time calculations, supporting multi‑dimensional user segmentation, and ensuring data freshness within minutes for algorithmic features. Specific pain points involved high‑frequency data deduplication, complex joins, and the need for rapid data ingestion and processing.
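High‑frequency deduplication is the kind of pain point that bitmap structures address well. As a minimal illustrative sketch (not Zhihu's actual code), the following counts distinct integer user IDs with a Python integer used as a bitset, avoiding a per‑ID hash‑set entry; the event data is invented for the example.

```python
def count_distinct(user_ids):
    """Count distinct non-negative integer IDs using an int as a bitset."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid          # set the bit for this user ID
    return bin(bitmap).count("1")   # popcount = number of distinct IDs

# Duplicate exposure events from the same users within one window.
events = [42, 7, 42, 13, 7, 7, 99]
print(count_distinct(events))       # → 4
```

Production systems typically use compressed structures such as Roaring bitmaps (which Palo/Doris supports natively) rather than a raw bitset, but the set‑bits‑then‑popcount idea is the same.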
To tackle these, the team adopted a Lambda architecture combining minute‑level batch processing in Palo with second‑level stream processing using Flink. They constructed a modular stack consisting of an application layer, business model layer, tool layer, and infrastructure layer, each responsible for different aspects of data handling and business logic.
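The serving side of a Lambda architecture can be reduced to one idea: answer a query by merging a periodically recomputed batch view with an incremental streaming delta. The sketch below uses hypothetical metric names and in‑memory dicts to stand in for the Palo batch layer and the Flink speed layer.

```python
# Batch layer (e.g. minute-level recomputation in Palo): authoritative totals.
batch_view = {"exposures": 1_000_000}

# Speed layer (e.g. second-level Flink aggregation): counts since the last batch run.
speed_delta = {"exposures": 1_250}

def serve(metric):
    """Combine both layers so readers see fresh, complete totals."""
    return batch_view.get(metric, 0) + speed_delta.get(metric, 0)

print(serve("exposures"))  # → 1001250
```

When the next batch run lands, the delta for its window is discarded, which is how the architecture bounds the error introduced by the approximate streaming path.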
Key technical solutions included:
Building a real‑time data integration system that abstracts source configurations and supports various ingestion methods.
Implementing a real‑time scheduling framework that coordinates Kafka offsets and task dependencies to avoid premature execution.
Deploying a data quality center with monitoring, alerting, and automated remediation to detect and fix anomalies quickly.
Optimizing Palo performance by tuning routine loads, leveraging runtime filters, and adjusting parallel execution parameters.
Scaling the DMP (Data Management Platform) for user profiling by partitioning tags, parallelizing bitmap operations, and reducing file sizes.
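The offset‑coordination idea in the scheduling point above can be sketched simply: a downstream batch task is released only after every upstream Kafka partition's committed offset has passed the offset that closes the data window, so the task never runs on partial data. Partition numbers and offsets below are hypothetical.

```python
def ready_to_run(committed_offsets, required_offsets):
    """True only when each partition's committed offset has reached the
    offset corresponding to the end of the scheduled data window."""
    return all(
        committed_offsets.get(partition, -1) >= needed
        for partition, needed in required_offsets.items()
    )

# Offsets that mark the end of the current window, per partition.
required = {0: 5000, 1: 4800}

print(ready_to_run({0: 5100, 1: 4700}, required))  # → False (partition 1 lags)
print(ready_to_run({0: 5100, 1: 4900}, required))  # → True
```

A scheduler polling this predicate (rather than firing on wall‑clock time alone) is one way to avoid the premature‑execution problem the article describes.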
These efforts resulted in significant performance gains: daily ingestion of over 900 billion rows completed within three hours, user segment estimation in under a second, and full user‑profile analysis completed within five minutes. Business metrics such as exposure, conversion rates, and content creation quality also improved.
Looking forward, the team plans to further strengthen the toolchain, enhance data‑quality coverage, explore sub‑minute real‑time capabilities, and deepen user‑understanding tools to continue delivering value.
IT Architects Alliance