User Segmentation and Growth Practices for Mini‑Programs Based on Doris

This article presents a case study of how Zhao Yuyang, a senior R&D engineer at Baidu, built a Doris‑based user‑segmentation system for mini‑programs. It covers the product’s private‑domain fine‑grained operation capabilities, the four technical challenges, and the architecture and solutions (global dictionaries, bitmap storage, partitioning, tag optimization, dynamic‑static query handling, and rapid user‑package generation), along with the future roadmap.

DataFunTalk

The presentation introduces the private‑domain fine‑grained operation capabilities of Baidu’s mini‑program platform and highlights two main pain points: private‑domain users deliver little value, and developers lack a proactive way to reach them.

To address these, a two‑part layered operation solution is proposed, combining targeted outreach with precise user segmentation based on profile and behavior data.

Benefits for developers include better utilization of private‑domain traffic and increased user activation; for the ecosystem, the solution improves traffic utilization, developer engagement, and promotes a healthy growth loop.

The B‑side (developer‑facing) view shows a custom filter UI in which developers select dimensions such as interests, coupons, transactions, activity, gender, and age, with the estimated audience size updated in real time. After filtering, the resulting users can be managed and reached via private messages, group messages, or other channels, and delivery metrics can be analyzed.

The C‑side (end‑user) view shows how messages sent to the segmented audience appear in the Baidu App as notifications or private messages.

The architecture is divided into online and offline parts. The online part consists of service, parsing, computation, and storage layers, plus scheduling and monitoring platforms. The service layer handles permission control, segment management, metadata, and task management. The parsing layer optimizes DSL queries and routes them to appropriate SQL templates. The computation layer uses Spark for offline tasks and Doris for real‑time queries. Storage includes MySQL for segment metadata, Redis for caching, Doris for profile and behavior data, and AFS for user‑package files.

The offline part performs ETL on raw data sources, builds a global dictionary, and writes cleaned data into Doris, ultimately producing user packages for the mini‑program B‑side and Baidu analytics.

Technical challenges and solutions:

TB‑scale data: solved by compressing storage with bitmap and a global dictionary that maps sparse user IDs to dense sequential IDs.
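
The dictionary-plus-bitmap idea can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory `GlobalDictionary` class (the real dictionary is built offline during ETL, and its implementation is not described here), made-up user IDs, and a plain Python integer as a stand-in for a real compressed bitmap.

```python
from typing import Dict, Iterable


class GlobalDictionary:
    """Hypothetical in-memory dictionary: sparse user ID -> dense sequential ID."""

    def __init__(self) -> None:
        self._dense_by_user: Dict[str, int] = {}

    def encode(self, user_id: str) -> int:
        # Assign the next unused dense ID the first time a user is seen;
        # repeated lookups return the same ID.
        if user_id not in self._dense_by_user:
            self._dense_by_user[user_id] = len(self._dense_by_user)
        return self._dense_by_user[user_id]


def build_bitmap(dense_ids: Iterable[int]) -> int:
    """Set one bit per dense ID; a Python int stands in for a real bitmap."""
    bitmap = 0
    for dense_id in dense_ids:
        bitmap |= 1 << dense_id
    return bitmap


dictionary = GlobalDictionary()
sparse_ids = ["u-9000000017", "u-42", "u-7781203355", "u-42"]  # illustrative IDs
dense_ids = [dictionary.encode(uid) for uid in sparse_ids]
tag_bitmap = build_bitmap(dense_ids)
```

Because the dense IDs are contiguous, the set bits stay within a small range, which is generally where bitmap representations compress best.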

Millisecond‑level query latency: addressed by Doris’s MPP architecture and a custom partitioning strategy using a hidden bucket ID, hid = floor(V / (M / N)), where V is the dense ID, M is the estimated total user count, and N is the number of buckets.
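
As a worked example of the bucket formula, the sketch below computes hid in Python. The function name `hidden_bucket_id` and the sample values of M and N are assumptions for illustration. Since V / (M / N) equals V * N / M, integer arithmetic is used to avoid floating-point rounding.

```python
def hidden_bucket_id(dense_id: int, estimated_users: int, num_buckets: int) -> int:
    """Compute hid = floor(V / (M / N)) as floor(V * N / M) with integers."""
    if estimated_users <= 0 or num_buckets <= 0:
        raise ValueError("estimated_users and num_buckets must be positive")
    if dense_id < 0:
        raise ValueError("dense_id must be non-negative")
    return dense_id * num_buckets // estimated_users


# Illustrative values: M = 1,000,000 estimated users, N = 10 buckets.
M, N = 1_000_000, 10
first_bucket = hidden_bucket_id(99_999, M, N)   # last ID in bucket 0
second_bucket = hidden_bucket_id(100_000, M, N)  # first ID in bucket 1
last_bucket = hidden_bucket_id(999_999, M, N)   # last ID in bucket 9
```

With this formula each bucket covers a contiguous range of M / N dense IDs. If the actual dense ID exceeds the estimate M, the result is N or larger, so the estimate has to leave headroom for growth.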

Complex calculations: combined static bitmap queries with dynamic behavior queries using Doris’s to_bitmap function to convert IDs to bitmaps and then compute intersections.
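
The static-plus-dynamic combination can be sketched in Python as follows. This is an illustration of the idea only, not the Doris implementation: a Python integer stands in for a bitmap, `ids_to_bitmap` plays the role that converting IDs with to_bitmap plays in Doris, and the tag and behavior data are made up.

```python
from typing import Iterable, List


def ids_to_bitmap(dense_ids: Iterable[int]) -> int:
    """Fold dense IDs into one bitmap; duplicates collapse to a single bit."""
    bitmap = 0
    for dense_id in dense_ids:
        bitmap |= 1 << dense_id
    return bitmap


def bitmap_to_ids(bitmap: int) -> List[int]:
    """List the positions of the set bits in ascending order."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Static side: a precomputed tag bitmap (illustrative profile tag).
static_tag_bitmap = ids_to_bitmap([1, 2, 3, 5, 8])

# Dynamic side: dense IDs returned by a behavior query over detail rows;
# the same user can appear in several rows.
behavior_rows = [2, 3, 3, 4, 8]
dynamic_bitmap = ids_to_bitmap(behavior_rows)

# The segment is the intersection of the two bitmaps.
segment = static_tag_bitmap & dynamic_bitmap
segment_ids = bitmap_to_ids(segment)
audience_size = bin(segment).count("1")
```

Once both sides are bitmaps, the audience size is simply the number of set bits in the intersection, which is what allows an estimated count to be shown while filters are being adjusted.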

Fast user‑package generation: replaced Spark‑based batch jobs with Doris’s SELECT INTO OUTFILE to export results directly to AFS, reducing generation time to under three minutes for millions of users.

Performance gains include sub‑second 95th‑percentile query latency, 9.67× storage reduction, and eight‑fold row count reduction.

Future plans focus on enriching segmentation scenarios, adding more profile dimensions, real‑time behavior ingestion, and modularizing the global dictionary for broader reuse across services.
