
Core Techniques of ID Mapping for Data Integration in Big Data Platforms

This article explains why ID mapping is essential for breaking data silos, describes traditional address cleaning and DSP scenarios, and details a graph‑based, six‑step process that builds a unified One‑ID dictionary to enable comprehensive user profiling and analytics in big‑data environments.

Big Data Technology & Architecture

Nov 28, 2021


ID mapping is a core capability of Alibaba's data middle platform. It addresses the data-silo ("data island") problem by unifying the disparate identifiers that different systems assign to the same entity.

The author shares personal experience with traditional address cleaning and highlights the limitations of manual, rule-based matching. The article then introduces the need for ID mapping in internet advertising on demand-side platforms (DSPs), where cookies are the usual way to recognize a user.

Because cookies are domain‑specific, cross‑domain ID mapping is required to link user activity on different sites, enabling personalized recommendations.
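The cross-domain linkage described above can be illustrated with a small match table. This is a minimal sketch, assuming a cookie-sync style setup in which one hub domain sees the cookies of the other sites; the domain names, cookie values, and the `build_match_table` and `same_user` helpers are hypothetical and do not come from the original article.

```python
# Hypothetical sync records: each row states that two domain-scoped cookies
# were observed together in the same browser.
sync_events = [
    (("shop.example", "c-111"), ("ads.example", "a-900")),
    (("ads.example", "a-900"), ("news.example", "n-42")),
    (("shop.example", "c-222"), ("ads.example", "a-901")),
]

def build_match_table(events, hub_domain):
    """Map each non-hub (domain, cookie) pair to the hub domain's cookie ID."""
    table = {}
    for left, right in events:
        if right[0] != hub_domain:
            left, right = right, left  # put the hub cookie on the right
        if right[0] != hub_domain:
            continue  # neither side belongs to the hub: nothing to link
        table[left] = right[1]
    return table

match_table = build_match_table(sync_events, hub_domain="ads.example")

def same_user(a, b):
    """Two site cookies match when they resolve to the same hub cookie."""
    hub_a, hub_b = match_table.get(a), match_table.get(b)
    return hub_a is not None and hub_a == hub_b

print(same_user(("shop.example", "c-111"), ("news.example", "n-42")))
print(same_user(("shop.example", "c-111"), ("shop.example", "c-222")))
```

A single hub table only links cookies that the hub has seen directly, which is one reason a more general graph model is attractive.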

For modern big-data environments, the author advocates graph databases and graph computation: data is modeled as nodes and edges, so that records belonging to the same entity can be identified automatically with connected-component (connected subgraph) algorithms.
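The connected-subgraph idea can be sketched with a union-find structure. The example below is illustrative only: it assumes that each node is a raw identifier and each edge an observed association between two identifiers, and the sample IDs are invented. The article does not say which algorithm or graph engine is used in practice.

```python
class UnionFind:
    """Disjoint-set structure used to group nodes that are linked by edges."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def connected_components(edges):
    """Return one set of nodes per connected subgraph."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups = {}
    for node in list(uf.parent):
        groups.setdefault(uf.find(node), set()).add(node)
    return list(groups.values())

# Hypothetical associations between raw identifiers.
edges = [
    ("phone:13800000001", "device:imei-A"),
    ("device:imei-A", "email:a@example.com"),
    ("phone:13800000002", "cookie:xyz"),
]

components = connected_components(edges)
for group in components:
    print(sorted(group))
```

Each resulting set is a candidate "same entity" group. On very large graphs this would typically run on a distributed graph engine rather than in a single process.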

The ID-mapping workflow consists of six steps: (1) recognize the elements and raw IDs in each source; (2) abstract them into nodes and edges, filtering weak links by a threshold; (3) build a graph model and run connectivity algorithms; (4) assign a new unified ID to each connected group; (5) deduplicate and merge the data; and (6) repeat steps 3 to 5 as new data arrives, reusing IDs that have already been assigned.
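The workflow above can be sketched end to end in a few functions. This is a simplified illustration under stated assumptions, not the author's implementation: the edge weight is assumed to be a co-occurrence count, `min_weight` stands in for the threshold, the `oneid-N` format is invented, and step 5 (merging the underlying data) is left out. When a group contains members with different existing IDs, the sketch simply keeps the smallest one; a real system would also have to remap members that are absent from the current batch.

```python
import itertools
from collections import Counter

def extract(records):
    """Steps 1-2: collect ID nodes and count co-occurrences as edge weights."""
    nodes, weights = set(), Counter()
    for rec in records:
        ids = sorted(f"{kind}:{value}" for kind, value in rec.items() if value)
        nodes.update(ids)
        for a, b in itertools.combinations(ids, 2):
            weights[(a, b)] += 1
    return nodes, weights

def components(nodes, edges):
    """Step 3: connected components via an iterative depth-first traversal."""
    adj = {n: set() for n in sorted(nodes)}  # sorted for a stable ID order
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                comp.add(node)
                stack.extend(adj[node] - seen)
        result.append(comp)
    return result

def assign_one_ids(comps, existing, counter):
    """Steps 4 and 6: reuse an ID held by any member, else mint a new one."""
    mapping = dict(existing)
    for comp in comps:
        known = sorted({existing[n] for n in comp if n in existing})
        one_id = known[0] if known else f"oneid-{next(counter)}"
        for node in comp:
            mapping[node] = one_id
    return mapping

def run(records, counter, existing=None, min_weight=1):
    nodes, weights = extract(records)
    kept = [edge for edge, w in weights.items() if w >= min_weight]
    return assign_one_ids(components(nodes, kept), existing or {}, counter)

counter = itertools.count(1)
batch1 = [
    {"phone": "13800000001", "imei": "A"},
    {"phone": "13800000001", "imei": "A"},
    {"imei": "A", "email": "a@example.com"},
    {"imei": "A", "email": "a@example.com"},
    {"phone": "13800000002", "imei": "A"},  # seen once: below the threshold
]
first = run(batch1, counter, min_weight=2)

# A later batch links a new cookie to a phone number that already has an ID.
batch2 = [{"phone": "13800000001", "cookie": "xyz"}] * 2
second = run(batch2, counter, existing=first, min_weight=2)
print(second["cookie:xyz"] == first["phone:13800000001"])
```

Passing the previous mapping as `existing` is what keeps IDs stable across iterations, so downstream tables keyed by the unified ID do not need to be rewritten every run.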

The resulting ID‑mapping dictionary acts as a bridge between data islands, enabling the construction of comprehensive user profiles, and can be stored in fast‑query stores such as Elasticsearch for external One‑ID lookup services.
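How the dictionary bridges two silos can be shown with a small in-memory example. This is a sketch only: a plain Python dict stands in for the fast-query store, and the table names, fields, and `oneid-N` values are hypothetical.

```python
from collections import defaultdict

# Hypothetical ID-mapping dictionary: raw ID -> unified One-ID.
id_dictionary = {
    "phone:13800000001": "oneid-1",
    "cookie:xyz": "oneid-1",
    "email:b@example.com": "oneid-2",
}

# Two hypothetical silos that key their rows by different raw IDs.
crm_rows = [{"id": "phone:13800000001", "city": "Hangzhou"}]
web_rows = [
    {"id": "cookie:xyz", "last_category": "laptops"},
    {"id": "email:b@example.com", "last_category": "books"},
    {"id": "cookie:unknown", "last_category": "toys"},
]

def lookup(raw_id):
    """One-ID lookup; returns None when the raw ID is not in the dictionary."""
    return id_dictionary.get(raw_id)

def build_profiles(*sources):
    """Merge attributes from every source under the unified One-ID."""
    profiles = defaultdict(dict)
    for rows in sources:
        for row in rows:
            one_id = lookup(row["id"])
            if one_id is None:
                continue  # unmapped IDs stay out of the unified profile
            attrs = {k: v for k, v in row.items() if k != "id"}
            profiles[one_id].update(attrs)
    return dict(profiles)

profiles = build_profiles(crm_rows, web_rows)
print(profiles["oneid-1"])
```

In a lookup service the `lookup` function would query the external store instead of a local dict, while the merge logic would stay the same.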

Finally, the article emphasizes the importance of a data warehouse (e.g., Hive on Hadoop) for storing tags and user features, and notes practical considerations such as handling many‑to‑many matches, noise filtering, and the need for ongoing engineering practice.
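One common way to approach the noise problem is to drop nodes that link to suspiciously many others before running the connectivity step, since a single shared identifier can glue unrelated users into one group. The degree cap below is an assumption chosen for illustration; the article does not specify which filtering rule is used, and the sample IDs are invented.

```python
from collections import defaultdict

def drop_hub_nodes(edges, max_degree):
    """Remove edges that touch any node with more than max_degree neighbors."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    hubs = {n for n, nbrs in neighbors.items() if len(nbrs) > max_degree}
    kept = [(a, b) for a, b in edges if a not in hubs and b not in hubs]
    return kept, hubs

# Hypothetical edges: one shared kiosk device is linked to four phone numbers.
edges = [
    ("phone:1", "device:shared-kiosk"),
    ("phone:2", "device:shared-kiosk"),
    ("phone:3", "device:shared-kiosk"),
    ("phone:4", "device:shared-kiosk"),
    ("phone:1", "email:a@example.com"),
]

kept, hubs = drop_hub_nodes(edges, max_degree=3)
print(hubs)
print(kept)
```

The right cap depends on the ID type: a household device may legitimately serve a few people, while a placeholder value such as a default phone number can appear on thousands of records.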




Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
