How Zhihu Built a Unified OneID System to Consolidate Fragmented User Identities

Zhihu created a unified OneID framework that merges scattered account, device, and behavior data into a global unique identifier, using strong and weak IDs, graph‑based connectivity, device governance, and a device half‑life model to improve recommendation, push, and advertising effectiveness.

Zhihu Tech Column

Background and Motivation

As Zhihu’s community services grew, user identities became increasingly fragmented across products, and the existing account system lacked a stable, globally unique identifier. A unified OneID was needed to improve user recognition, to link accounts, devices, and behavior chains, and to support fine‑grained operations and algorithmic scenarios such as interest recommendation, profile recall, push, and behavior analysis.

Traffic Domain Mapping

The overall traffic domain was split into external flows (commercial and organic) and internal flows (the main site, the fast version, and relatively independent services such as YanYan). This mapping clarified sources, boundaries, and platforms, providing a data map for subsequent identity unification, cross‑device linking, behavior attribution, and profile consolidation.

User Identifier Construction

The goal is to build a unified identity system for natural persons by standardizing and mapping multi‑source identifiers from different platforms, terminals, and business chains. More than ten categories of IDs have been identified so far, divided into strong IDs and weak IDs.

Strong IDs

Strong IDs have high certainty and can be directly used for identity recognition or stable mapping. Current strong IDs include:

Entity IDs such as MemberID

Device IDs: browser Cookie, IDFA, OAID, Android ID, CAID, etc.

Attribute IDs: phone number, email address

External IDs: GeTui ID, Shumei ID, and other third‑party ecosystem identifiers

The system has completed recognition, mapping, and unified merging for more than ten ID types.

Weak IDs

Weak IDs are used as auxiliary signals and include:

Network environment IDs: IP address, Wi‑Fi name

Spatial location IDs: resident city, AOI, latitude/longitude

User profile IDs: nickname, industry, occupation

Connectivity Computation

Traditional connectivity approaches connect every pair of IDs that appear on the same row, so the edge count grows quadratically with the number of associated IDs and becomes costly once a row carries more than three, leading to massive resource consumption and unstable execution. For example, with a daily device table of 4 billion rows and up to seven IDs per row, a single row can produce at most C(7,2)=21 edges; at an average of about 12 edges per row, that comes to roughly 48 billion edges (4 billion × 12).
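
The per‑row arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the counting argument; the constant names are hypothetical and the row count and average are the figures quoted in the example.

```python
from math import comb

ROWS = 4_000_000_000      # daily device table size quoted in the example
MAX_IDS_PER_ROW = 7       # IDs carried by one row
AVG_EDGES_PER_ROW = 12    # average quoted in the example

# Fully connecting n IDs needs C(n, 2) = n * (n - 1) / 2 edges.
edges_by_id_count = {n: comb(n, 2) for n in range(2, MAX_IDS_PER_ROW + 1)}
max_edges_per_row = comb(MAX_IDS_PER_ROW, 2)   # 21
total_edges = ROWS * AVG_EDGES_PER_ROW

for n, edges in edges_by_id_count.items():
    print(f"{n} IDs -> {edges} pairwise edges")
print(f"estimated total edges: {total_edges:,}")
```

The table printed by the loop shows why rows with more than three IDs are the problem: three IDs need three edges, but seven IDs need twenty‑one.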

To address this, a mid‑as‑bridge local‑cluster pre‑aggregation strategy was designed. The simplified workflow (shown in the diagram) uses partial clustering, isolated‑node handling, and identical‑vertex construction to dramatically reduce edge scale and vertex merge depth, keeping the execution time of hundred‑billion‑edge graph tasks under 20 minutes and ensuring downstream data timeliness.
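
The article does not publish the details of the pre‑aggregation pipeline, so the following is only a minimal sketch of the core idea as it can be inferred from the description: link each ID on a row to a single bridge vertex (here the mid) instead of linking every pair, then compute connected components. The helper names, the sample rows, and the use of an in‑memory union‑find are all assumptions for illustration; a production job at this scale would run on a distributed engine.

```python
from itertools import combinations

def clique_edges(vertices):
    """Traditional approach: one edge for every pair of IDs on a row."""
    return list(combinations(vertices, 2))

def bridge_edges(bridge, ids):
    """Bridge approach: each ID is linked only to the row's bridge vertex."""
    return [(bridge, i) for i in ids if i != bridge]

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited vertex at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def components(edges):
    """Return connected components as a sorted list of sorted vertex lists."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups = {}
    for v in list(uf.parent):
        groups.setdefault(uf.find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical sample rows: (mid, device IDs observed with that mid).
rows = [
    ("mid:1", ["idfa:A", "oaid:B", "cookie:C"]),
    ("mid:1", ["cookie:C", "android:D"]),
    ("mid:2", ["idfa:E"]),
]

pairwise = [e for mid, ids in rows for e in clique_edges([mid] + ids)]
bridged = [e for mid, ids in rows for e in bridge_edges(mid, ids)]
print(len(pairwise), "pairwise edges vs", len(bridged), "bridge edges")
print(components(bridged))
```

Both edge sets yield the same components, but the bridge form needs only n edges for a row with n device IDs, which is where the reduction in edge scale comes from.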

Gray‑Market (Abnormal) Identification

After the initial OneID connectivity run, a super‑connected component containing billions of mids was discovered, caused by abnormal devices and gray‑market (fraud‑related) data. In addition to standard null/empty checks, the following cleaning rules were added:

Validate device ID length ranges; values outside are marked abnormal.

Detect repeated‑character IDs (e.g., 0000‑0000‑0000‑0000‑0000, aaaa‑aaaa‑aaaa‑aaaa‑aaaa).

Identify high‑risk mapping scenarios and analyze the statistical distribution of mappings per device and per account across time cycles to set reasonable thresholds, building libraries of abnormal devices and abnormal mids.

Applying these checks reduced the abnormal super‑connected component to the million‑level, markedly improving stability and accuracy.
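
The first two cleaning rules are simple enough to sketch directly. The length bounds below are hypothetical placeholders, since the article does not state the actual ranges, and real bounds would differ per ID type.

```python
import re
from typing import Optional

# Hypothetical bounds; the real ranges depend on the ID type.
MIN_LEN, MAX_LEN = 8, 64

def is_repeated_char(device_id: str) -> bool:
    """True when the ID, ignoring separators and case, is one repeated character."""
    core = re.sub(r"[^0-9A-Za-z]", "", device_id).lower()
    return len(core) > 0 and len(set(core)) == 1

def is_abnormal(device_id: Optional[str]) -> bool:
    """Apply the null/empty, length-range, and repeated-character checks."""
    if device_id is None or not device_id.strip():
        return True
    value = device_id.strip()
    if not (MIN_LEN <= len(value) <= MAX_LEN):
        return True
    return is_repeated_char(value)

for candidate in ["0000-0000-0000-0000-0000", "3F2504E0-4F89-11D3-9A0C-0305E82C3301", "abc"]:
    print(candidate, is_abnormal(candidate))
```

The third rule, threshold setting from per‑device and per‑account distributions, depends on data the article does not show and is not sketched here.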

Device Relationship Governance

Device IDs serve as crucial bridges but do not always reflect natural‑person relationships. Scenarios such as device swaps, shared devices, and temporary borrowing affect stability. The governance mechanism handles these cases differently:

For device swaps, historical interests and base attributes are inherited for a limited period, after which they decay to avoid long‑term influence.

For shared or borrowed devices, metrics like the number of accounts per device, switch frequency, and cross‑cycle stability are used to flag high‑risk devices. Clearly shared devices are excluded from strong connectivity bridges, and low‑frequency, short‑term, or long‑inactive relationships are down‑weighted or filtered.
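
One way to picture the shared‑device rule is a small classifier over per‑device statistics. This is an illustrative sketch only: the field names, the three outcome labels, and every threshold are assumptions, not values from Zhihu's system.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example concrete.
MAX_ACCOUNTS_PER_DEVICE = 3
MAX_SWITCHES_PER_WEEK = 5.0
MIN_STABLE_CYCLES = 2
MAX_INACTIVE_DAYS = 180

@dataclass
class DeviceStats:
    accounts_seen: int          # distinct accounts observed on the device
    switches_per_week: float    # how often the logged-in account changes
    stable_cycles: int          # consecutive cycles with the same main account
    days_since_last_seen: int   # inactivity of the device-account relationship

def classify_device(stats: DeviceStats) -> str:
    """Return 'excluded', 'down_weighted', or 'bridge' for a device."""
    if (stats.accounts_seen > MAX_ACCOUNTS_PER_DEVICE
            or stats.switches_per_week > MAX_SWITCHES_PER_WEEK):
        return "excluded"        # clearly shared: not used as a strong bridge
    if (stats.stable_cycles < MIN_STABLE_CYCLES
            or stats.days_since_last_seen > MAX_INACTIVE_DAYS):
        return "down_weighted"   # short-term or long-inactive relationship
    return "bridge"

print(classify_device(DeviceStats(1, 0.0, 6, 3)))
print(classify_device(DeviceStats(8, 12.0, 0, 1)))
```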

Device Half‑Life

Device signals decay over time due to swaps, reinstallations, multi‑user sharing, and account migrations. The system introduces a “device half‑life” concept so that the effective period of device signals can be managed differently across scenarios. In sparse‑behavior contexts (e.g., returning or low‑activity users), longer device history is retained to supplement recommendation and push; in highly active or real‑time intent scenarios, device history decays faster to prevent stale interests from affecting decisions.
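
The article names the half‑life concept without giving a formula. A natural reading, assumed here, is exponential decay in which a signal's weight halves every half‑life period. The scenario names and the half‑life values of 90 and 7 days are hypothetical.

```python
# Hypothetical half-lives in days for two kinds of scenario.
HALF_LIFE_DAYS = {
    "sparse_behavior": 90.0,   # returning or low-activity users: keep history longer
    "real_time_intent": 7.0,   # highly active users: let history fade quickly
}

def device_signal_weight(age_days: float, scenario: str) -> float:
    """Weight of a device signal that is age_days old, halving every half-life."""
    half_life = HALF_LIFE_DAYS[scenario]
    return 0.5 ** (age_days / half_life)

for scenario in HALF_LIFE_DAYS:
    print(scenario, round(device_signal_weight(30, scenario), 3))
```

With these assumed values, a 30‑day‑old signal keeps most of its weight in the sparse‑behavior scenario and almost none in the real‑time one, which matches the behavior described above.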

Weak ID Usage

Weak IDs are not used as primary connectivity edges but serve as auxiliary evidence in the identity association process. For cases where multiple registered users are merged, both strong edges and sufficient weak‑edge evidence are required, reducing erroneous merges. While weak IDs have limited power for unique person identification, they are valuable for group identification and fine‑grained tagging in contexts such as university or workplace populations.
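
The merge rule for multiple registered users can be expressed as a simple guard. The sketch below assumes that "sufficient" weak evidence means a minimum number of matching weak signals; the function name, the signal keys, and the default threshold of two are hypothetical.

```python
from typing import Mapping

def should_merge(strong_edges: int,
                 weak_matches: Mapping[str, bool],
                 min_weak_matches: int = 2) -> bool:
    """Merge two registered users only with a strong edge AND enough weak evidence."""
    weak_hits = sum(1 for matched in weak_matches.values() if matched)
    return strong_edges >= 1 and weak_hits >= min_weak_matches

# Two accounts share a device ID and also match on city and Wi-Fi name.
print(should_merge(1, {"resident_city": True, "wifi_name": True, "industry": False}))
# Weak signals alone never trigger a merge.
print(should_merge(0, {"resident_city": True, "wifi_name": True, "industry": True}))
```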

Application Scenarios

OneID integrates user and visitor behavior data from PC, APP, and mobile sites, exposing unified identity capabilities to advertising engines, feature platforms, and profile services. Beyond merely linking IDs, it transforms scattered behavior data into a stable user asset that powers recommendation optimization, push targeting, ad delivery, profile construction, and new‑user cold‑start.
