
Construction and Evaluation of User Profiles: Identification, Tagging, Storage, and Quality Assessment

This article explains how to build user profiles by distinguishing persona from profile, describing the evolution of ID‑mapping techniques, designing a multi‑layer tag system, implementing statistical, interest, and model tags, storing the data in Hive, HBase, Codis and Elasticsearch, and finally evaluating profile timeliness, coverage and accuracy.

HomeTech

When discussing user profiling, two terms are often used: "persona" and "profile". A persona (user role) describes an abstract individual for product and UX discussions, while a profile is data‑driven, built from tags for operations and analytics; this article focuses on profile construction.

The company has been building its own user profiles for five years, evolving through several attempts. The overall logical architecture is illustrated in the diagram below.

The architecture consists of five parts: user identification, tag system, profile construction, profile storage, and profile quality assessment.

1. User Identification

In the early PC-centric era, companies wanted a "god view" of each user's full online journey, but weak account systems made it hard to tie activity to a single user. ID-mapping technology evolved in two phases: from assigning a UUID to each maximal connected sub-graph (connected component) of identifiers, to a strong-relationship + independent-account model.

Phase 1 (v1.0) used an iterative coloring algorithm (distributed union‑find) to connect accounts, but suffered low linkage rates and erroneous connections caused by shared devices.
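
The connected-component idea behind v1.0 can be sketched on a single machine with a plain union-find. This is only an illustration: the production version was distributed, and the identifier names, the sample edges, and the use of `uuid5` for deterministic IDs are assumptions made for the example.

```python
from collections import defaultdict
import uuid


class UnionFind:
    """Minimal union-find with path compression (single-machine sketch)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def assign_uuids(edges):
    """Map every identifier to one UUID per connected component."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups = defaultdict(list)
    for ident in list(uf.parent):
        groups[uf.find(ident)].append(ident)
    mapping = {}
    for members in groups.values():
        # uuid5 over the sorted members keeps the example deterministic.
        key = "|".join(sorted(members))
        component_id = str(uuid.uuid5(uuid.NAMESPACE_OID, key))
        for ident in members:
            mapping[ident] = component_id
    return mapping


# Hypothetical co-occurrence edges observed in logs.
edges = [
    ("userid:1001", "pc-cookie:abc"),
    ("pc-cookie:abc", "deviceid:d1"),
    ("userid:2002", "m-cookie:xyz"),
    # A shared device links two different people: the erroneous
    # connection described for v1.0.
    ("userid:3003", "deviceid:shared"),
    ("userid:4004", "deviceid:shared"),
]
id_map = assign_uuids(edges)
```

In this sketch `userid:3003` and `userid:4004` end up with the same UUID purely because they touched the same device, which is the kind of false merge that motivated the move to v2.0.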

Phase 2 (v2.0) adopts a strong-relationship + independent-account approach: primary accounts (userid, phone) are linked to secondary accounts (pc-cookie, m-cookie, deviceid). Each secondary identifier also keeps its own independent profile, and together these form a complete picture of the user's online activity.
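
One way to picture the v2.0 model is a lookup that follows a strong link when one exists and otherwise falls back to the identifier's own profile. The sketch below uses in-memory dictionaries with hypothetical data; the merge policy (primary tags first, then the tags of every bound secondary identifier) is an assumption, since the article does not spell it out.

```python
# Strong relationships: secondary identifier -> primary account (hypothetical).
bindings = {
    "deviceid:d1": "userid:1001",
    "pc-cookie:abc": "userid:1001",
}

# Every identifier, primary or secondary, keeps its own tags (hypothetical).
profiles = {
    "userid:1001": {"city": "Beijing"},
    "deviceid:d1": {"os": "android"},
    "pc-cookie:abc": {"browser": "chrome"},
    "m-cookie:zzz": {"preferred_brand": "BrandX"},
}


def resolve_profile(identifier):
    """Return the tags visible for an identifier.

    With a strong link, merge the primary account's tags with the tags of
    all secondary identifiers bound to it. Without one, return only the
    identifier's independent profile.
    """
    primary = bindings.get(identifier)
    if primary is None:
        # No strong link: fall back to the independent profile.
        return dict(profiles.get(identifier, {}))
    merged = dict(profiles.get(primary, {}))
    for secondary, owner in bindings.items():
        if owner == primary:
            merged.update(profiles.get(secondary, {}))
    return merged


linked = resolve_profile("deviceid:d1")
unlinked = resolve_profile("m-cookie:zzz")
```

Because unlinked identifiers are never merged by inference, a shared device no longer pulls two people into one profile.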

2. Tag System

The tag system abstracts raw tags into a logical hierarchy, grouping them into categories such as demographic, network, geographic, interest, commercial, and business attributes.
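
A logical hierarchy like this can be represented as a small category-to-tags mapping. In the sketch below the top-level categories are the ones named above, while the leaf tag names are hypothetical placeholders.

```python
# Top-level categories come from the text; leaf tag names are hypothetical.
TAG_TREE = {
    "demographic": ["gender", "age_band"],
    "network": ["device_type"],
    "geographic": ["resident_city"],
    "interest": ["preferred_brand"],
    "commercial": ["purchase_intent"],
    "business": ["insurance_expiry_month"],
}


def tag_paths(tree):
    """Flatten the hierarchy into 'category/tag' paths."""
    return [f"{category}/{tag}" for category, tags in tree.items() for tag in tags]


def category_of(tag, tree=TAG_TREE):
    """Return the category a raw tag belongs to, or None if unregistered."""
    for category, tags in tree.items():
        if tag in tags:
            return category
    return None


all_paths = tag_paths(TAG_TREE)
```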

Tag construction proceeds in two stages: a planning‑driven stage that defines a universal tag schema for the enterprise, and a demand‑driven stage that refines tags for specific scenarios (e.g., car‑hunting, mini‑programs, youth channel, growth operations, finance, intelligent recommendation).

3. Tag Construction

Tags are divided into three methodological groups:

Statistical tags : derived from business rules (e.g., favorite list, search keywords, insurance expiry, visit counts).

Interest tags : built using an interest-migration model. The formula is InterestTag = BehaviorWeight * TimeDecay * BehaviorCount, where BehaviorWeight reflects the cost of the behavior to the user, TimeDecay follows a curve modeled on Newton's law of cooling (exponential decay with elapsed time), and BehaviorCount is the number of times the behavior occurred within a fixed window.
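
The interest formula can be sketched in a few lines. The weight table and the 7-day half-life below are hypothetical values chosen for illustration; the article only states the three factors and that decay follows a Newton-cooling-style curve.

```python
import math

# Hypothetical weights: costlier actions count for more.
BEHAVIOR_WEIGHT = {"view": 1.0, "search": 2.0, "favorite": 3.0, "inquiry": 5.0}


def time_decay(age_days, half_life_days=7.0):
    """Exponential decay in the style of Newton's law of cooling."""
    decay_rate = math.log(2) / half_life_days
    return math.exp(-decay_rate * age_days)


def interest_score(behavior, count, age_days, half_life_days=7.0):
    """InterestTag = BehaviorWeight * TimeDecay * BehaviorCount."""
    weight = BEHAVIOR_WEIGHT[behavior]
    return weight * time_decay(age_days, half_life_days) * count


fresh = interest_score("favorite", count=2, age_days=0)  # 3.0 * 1.0 * 2
stale = interest_score("favorite", count=2, age_days=7)  # one half-life later
```

With these assumed parameters, the same two "favorite" actions contribute half as much after one week, so recent behavior dominates the tag.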

Model tags : generated by machine‑learning models such as RF+LR for car‑ownership prediction, DBSCAN for residence clustering, GBDT for purchase conversion, and K‑means for user segmentation.

Interest tags are updated incrementally every day. Model tags make up a smaller share of the overall tag set.
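
The article does not describe how the daily incremental update is carried out. One plausible scheme, assuming exponential decay, is to decay yesterday's score by a single one-day factor and add today's weighted behavior, which avoids rescanning the full history. The weights and half-life below are hypothetical.

```python
import math


def daily_update(previous_score, todays_events, weights, half_life_days=7.0):
    """Fold one day of behavior into a running interest score.

    previous_score: score as of yesterday
    todays_events:  mapping of behavior name -> count observed today
    weights:        mapping of behavior name -> behavior weight
    """
    one_day_decay = math.exp(-math.log(2) / half_life_days)
    todays_contribution = sum(weights[b] * n for b, n in todays_events.items())
    return previous_score * one_day_decay + todays_contribution


weights = {"view": 1.0, "favorite": 3.0}  # hypothetical
score = 0.0
score = daily_update(score, {"view": 4}, weights)      # day 0
score = daily_update(score, {}, weights)               # day 1: no activity
score = daily_update(score, {"favorite": 1}, weights)  # day 2
```

Because exponential decay is memoryless, this running update gives the same result as recomputing the weighted, decayed sum over every past day.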

4. Profile Storage

Profile data can be stored in three main kinds of systems: relational databases, NoSQL stores, and data warehouses. In practice the company combines Hive, HBase, Elasticsearch (ES), and Codis. All storage follows an ontology-based model to represent user attributes and relationships.

Hive : hosts the profile data mart, decouples tag relationships, and spreads tags across multiple Hive tables so they are easy to query analytically.

HBase & Codis : merge the dispersed tag data into one complete profile per ID and provide fast ID-based lookups.

Elasticsearch : serves scenarios such as audience segmentation, insight analysis, and user outreach.
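
The division of labor between the stores can be illustrated with in-memory stand-ins. The sketch below does not call Hive, HBase, Codis, or Elasticsearch; lists of rows play the role of per-topic tables, a dictionary plays the role of the key-value store, and a filter function plays the role of an audience query. All table names, tags, and values are hypothetical.

```python
from collections import defaultdict

# Stand-ins for per-topic tables: (user_key, tag, value) rows.
demographic_rows = [("u1", "gender", "F"), ("u2", "gender", "M")]
interest_rows = [
    ("u1", "preferred_brand", "BrandX"),
    ("u2", "preferred_brand", "BrandY"),
]
model_rows = [("u1", "owns_car", True)]


def merge_rows(*tables):
    """Merge scattered tag rows into one wide record per user key."""
    merged = defaultdict(dict)
    for table in tables:
        for user_key, tag, value in table:
            merged[user_key][tag] = value
    return dict(merged)


def select_audience(store, **conditions):
    """Return user keys whose record matches every tag=value condition."""
    return sorted(
        key
        for key, tags in store.items()
        if all(tags.get(tag) == value for tag, value in conditions.items())
    )


kv_store = merge_rows(demographic_rows, interest_rows, model_rows)
profile_u1 = kv_store["u1"]  # ID-based lookup of the merged profile
car_owners = select_audience(kv_store, owns_car=True)  # audience segmentation
```

Keeping the narrow tables separate from the merged record mirrors the split described above: analytical queries run against the tables, while online services read one record per ID.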

5. Profile Evaluation

Following the maxim often attributed to Peter Drucker that “what gets measured gets managed,” the quality of user profiles is assessed along three dimensions: timeliness, coverage, and accuracy.

Timeliness : critical for real‑time recommendation and conversion scenarios; SLA targets are set for near‑real‑time construction.

Coverage : high coverage benefits marketing and outreach, but may trade off against accuracy; balance is decided per business need.

Accuracy : evaluated differently for each tag type: statistical tags via correctness checks, interest tags via reasonableness tests, and model tags via metrics such as AUC and F1. Periodic sampling and cross-validation improve overall confidence.
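
Coverage and the model-tag metrics mentioned above are simple to compute on a labeled sample. The sketch below uses plain Python rather than a metrics library, and the sample profiles, labels, and scores are made up for illustration.

```python
def coverage(profiles, tag):
    """Share of users that carry a value for the given tag."""
    if not profiles:
        return 0.0
    covered = sum(1 for tags in profiles.values() if tags.get(tag) is not None)
    return covered / len(profiles)


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical sample for a car-ownership model tag.
sample_profiles = {
    "u1": {"owns_car": True},
    "u2": {"owns_car": False},
    "u3": {},
    "u4": {"owns_car": True},
}
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6]

tag_coverage = coverage(sample_profiles, "owns_car")
tag_f1 = f1_score(y_true, y_pred)
tag_auc = auc(y_true, y_score)
```

Tracking coverage next to F1 or AUC for the same tag makes the coverage-versus-accuracy trade-off visible when a threshold is loosened to tag more users.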
