
Building a User Profile Data Warehouse at 58.com: Architecture, Modeling, and Practices

This article details the design and implementation of a user‑profile data warehouse at 58.com, covering data‑warehouse fundamentals, user‑profile tag generation, layered architecture, dimensional‑modeling choices, ETL migration from Hive to Spark, data‑quality safeguards, and the resulting scale of tables, metrics, and tags.

DataFunTalk

The speaker, Bao Lei, a senior data R&D engineer at 58.com, organizes the talk into three parts: an overview of data warehouses and user profiling, the construction process of the user‑profile data warehouse, and the final outcomes and summary.

Data Warehouse & User Profile Overview – A data warehouse is described as an integrated, subject‑oriented, relatively stable collection of data that captures historical changes. Its value includes fast data access, high‑quality output, rapid response to business changes, data security, timely data services, and improved decision‑making. User profiling tags are categorized into statistical tags (e.g., visit city, visit days) and algorithmic tags (e.g., predicted gender, age) and serve to turn raw behavior data into personalized services.
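A statistical tag of the kind described (e.g., visit days per user) is essentially an aggregation over a behavior log. The sketch below illustrates the idea with SQLite; the table and column names (`user_visit_log`, `user_id`, `visit_city`, `dt`) are illustrative assumptions, not 58.com's actual schema, and the production pipeline would run equivalent SQL on Hive/Spark.

```python
import sqlite3

# Hypothetical behavior log: one row per user visit event, partitioned by date.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_visit_log (user_id TEXT, visit_city TEXT, dt TEXT);
INSERT INTO user_visit_log VALUES
  ('u1', 'Beijing',  '2020-01-01'),
  ('u1', 'Beijing',  '2020-01-02'),
  ('u1', 'Shanghai', '2020-01-02'),
  ('u2', 'Chengdu',  '2020-01-01');
""")

# Statistical tag: number of distinct active days per user.
rows = conn.execute("""
SELECT user_id, COUNT(DISTINCT dt) AS visit_days
FROM user_visit_log
GROUP BY user_id
ORDER BY user_id
""").fetchall()
print(rows)  # [('u1', 2), ('u2', 1)]
```

Algorithmic tags such as predicted gender would instead be produced by a model scoring step downstream of aggregations like this one.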

Construction Process – The workflow is divided into six stages: requirement communication, design (including source analysis, domain division, metric definition, and ETL design), technical review, development (SQL‑centric), testing, and release & acceptance. The warehouse follows a four‑layer architecture (ODS → DWD → DWS → APP) and adopts a dimensional‑modeling approach, using fact tables (transaction, periodic snapshot, cumulative snapshot, aggregate) and various dimension types (degenerate, snowflake, role‑playing, junk, bridge, mini, slowly changing). Modeling standards emphasize high cohesion, low coupling, separation of core and extension models, and consistent naming.
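Of the dimension types listed above, the slowly changing dimension is the one with non-obvious mechanics, so here is a minimal Type-2 sketch: instead of overwriting an attribute, the current row is closed and a new row is opened, preserving history. All names (`dim_user`, `start_dt`, `end_dt`, `is_current`) are illustrative, not 58.com's actual schema.

```python
import sqlite3

# Hypothetical user dimension with Type-2 validity columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (
  user_id TEXT, city TEXT,
  start_dt TEXT, end_dt TEXT, is_current INTEGER
);
INSERT INTO dim_user VALUES ('u1', 'Beijing', '2020-01-01', '9999-12-31', 1);
""")

def update_city(conn, user_id, new_city, change_dt):
    """Close the current row and insert a new one instead of overwriting."""
    cur = conn.execute(
        "SELECT city FROM dim_user WHERE user_id=? AND is_current=1",
        (user_id,)).fetchone()
    if cur and cur[0] != new_city:
        conn.execute(
            "UPDATE dim_user SET end_dt=?, is_current=0 "
            "WHERE user_id=? AND is_current=1", (change_dt, user_id))
        conn.execute(
            "INSERT INTO dim_user VALUES (?, ?, ?, '9999-12-31', 1)",
            (user_id, new_city, change_dt))

update_city(conn, 'u1', 'Shanghai', '2020-06-01')
rows = conn.execute(
    "SELECT city, start_dt, end_dt, is_current FROM dim_user ORDER BY start_dt"
).fetchall()
print(rows)
# [('Beijing', '2020-01-01', '2020-06-01', 0),
#  ('Shanghai', '2020-06-01', '9999-12-31', 1)]
```

Fact tables in the DWD/DWS layers can then join against this dimension as of any historical date, which is what "capturing historical changes" means in practice.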

To improve performance, the ETL pipeline was migrated from Hive‑SQL (MapReduce‑based) to Spark, reducing Yarn container requests by ~90% and shortening job runtimes. Data‑quality assurance includes source‑level profiling, SLA monitoring, and coverage/accuracy metrics for generated tags.
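A coverage metric of the kind mentioned for generated tags can be sketched as the share of users whose tag value is non-null. The function and field names below (`tag_coverage`, `predicted_gender`) are illustrative assumptions, not the article's actual implementation.

```python
def tag_coverage(tags, tag_name):
    """Fraction of user records with a non-null value for `tag_name`."""
    if not tags:
        return 0.0
    filled = sum(1 for t in tags if t.get(tag_name) is not None)
    return filled / len(tags)

# Toy tag table: two of four users have a predicted gender.
tags = [
    {"user_id": "u1", "predicted_gender": "F"},
    {"user_id": "u2", "predicted_gender": None},
    {"user_id": "u3", "predicted_gender": "M"},
    {"user_id": "u4"},
]
print(tag_coverage(tags, "predicted_gender"))  # 0.5
```

Accuracy would additionally require labeled ground truth to compare against, which is why it is typically monitored only for a sampled subset of users.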

Results & Summary – After applying the standards, the warehouse now spans eight subject domains, contains 102 tables, defines 312 metrics, and runs 256 tag‑generation ETL tasks. The system supports large‑scale user‑profile analytics and enables downstream personalized services.

Tags: Big Data, Data Quality, Data Warehouse, User Profiling, ETL, Dimensional Modeling
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
