Big Data 11 min read

Design and Implementation of a Scalable User Profiling System Using Hive and SQL Templates

To meet the growing demands of precise, cost‑effective user operations, the article outlines a lightweight, flexible profiling system built on Hive that uses SQL templates, custom UDFs, and set‑operation logic to enable attribute‑based user segmentation, batch processing, and seamless integration with downstream services.

Zhuanzhuan Tech
Zhuanzhuan Tech
Zhuanzhuan Tech
Design and Implementation of a Scalable User Profiling System Using Hive and SQL Templates

Background: Zhuanzhuan, a leading second‑hand e‑commerce platform, has experienced rapid growth in users and transactions. To improve service quality and sustain growth, the product operation strategy shifts from coarse‑grained acquisition to fine‑grained, data‑driven operations, focusing on targeted push notifications, coupon distribution, and low‑cost user recall.

Overall Architecture: The solution adopts Hive as the computation engine because it allows developers to add new attributes by simply creating SQL templates on a web page and to express user logic directly in SQL, making the system easy to extend and maintain.

Core Module 3.1 – SQL Templates: Two template types are defined. The attribute‑filter template selects users with specific attributes (e.g., gender, age range). The attribute‑output template returns additional properties (e.g., purchase count) together with the user ID. Each template must output three string fields: xxid (user ID), tag (label ID for set operations), and p (optional parameters). Placeholders such as ${stat_date_s} , ${reg_start_date_omg_} , ${reg_end_date_omg_} , and ${platform_array} allow dynamic parameter injection.

Core Module 3.2 – Set Operations: Each attribute template represents a user set. The system supports full set operations (intersection, union, complement) with multi‑level nesting. Users define logical expressions using tag IDs; each node has at most two children and is enclosed in parentheses. A custom UDF parses the expression into a tree, caches it, and evaluates it with short‑circuit logic for efficiency.

Core Module 3.3 – Similar Tag Merging: Frequently used tags (e.g., by city, category, frequency) are grouped and computed as a single task. The previously unused p field now stores the grouping key, enabling batch creation of tags and reducing the number of daily jobs by an order of magnitude.

Core Module 3.4 – Upstream/Downstream Integration: The profiling system ingests data via the configured SQL templates, pulling from data warehouses or event models. Machine‑learning results can also be imported through SQL, enriching rule‑based tags. Downstream systems (e.g., operation dashboards) can consume the generated profiles for real‑time, self‑service push campaigns.

Extension: Building on the user‑profile foundation, the team later added product‑profile and order‑profile modules, as well as retention and funnel analysis tools, all configured through the same SQL‑template mechanism.

Conclusion: The lightweight yet flexible architecture has supported tens of thousands of profiling requests, cutting development cycles from days to half an hour and freeing engineering resources. Future work includes adopting a new computation engine and optimizing the data model to further accelerate profiling tasks.

Data EngineeringHiveuser profilingset operationsSQL templates
Zhuanzhuan Tech
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.