
How 58 Daojia Leverages User Portraits to Boost Operations and Fight Fraud

This article details 58 Daojia's data‑driven approach to building user‑portrait tags, covering tag construction, evaluation, and practical applications such as personalized recommendations, anti‑fraud measures, coupon distribution, and dynamic pricing, while outlining the underlying big‑data architecture and technical challenges.

ITPUB

Jun 11, 2016

Overview

58 Daojia built a user‑portrait system that generates roughly 200 tags per user, covering basic attributes, location, interests, intent, and business‑specific dimensions. The tags fall into three layers: factual tags derived directly from transaction and log data, model‑based tags built on top of them (e.g., purchase preference), and predictive tags (e.g., churn probability).

Tag Construction Pipeline

Data ingestion: raw transaction tables and click‑stream logs are loaded into Hive partitions by day.

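Fact tag generation: SQL scripts in Hive compute deterministic attributes (e.g., device count, order frequency, average order value). The example below aggregates one day's partition of the orders table per user:

-- Rebuild the fact-tag partition for ${date} from that day's orders.
INSERT OVERWRITE TABLE user_fact_tags PARTITION(dt='${date}')
SELECT uid,
       COUNT(DISTINCT device_id) AS device_cnt,    -- distinct devices seen on the user's orders
       SUM(order_amount) AS total_spent,           -- total spend within the partition
       AVG(order_amount) AS avg_order,             -- average order value
       MAX(order_time) AS last_order_ts            -- most recent order time
FROM orders
WHERE dt='${date}'                                 -- only this day's partition is scanned
GROUP BY uid;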

Model tag generation: Spark (or Mahout) jobs train supervised models on labeled data for targets such as purchase preference or fraud risk. Each trained model is then applied to the fact‑tag table to produce a probability score per user.

spark-submit --class com.daojia.tags.PurchasePrefModel \
  --master yarn \
  purchase_pref.jar --input hive://user_fact_tags --output hive://user_model_tags

Predictive tag generation: the model output is stored back to Hive and later merged into HBase for low‑latency lookup.

Storage and Retrieval Architecture

Batch computation runs in Hive; the resulting tag rows are periodically bulk‑loaded into HBase. HBase column families separate fact, model, and predict tags, enabling fast point queries by UID. A RESTful portal service exposes endpoints such as /tags/{uid} for downstream recommendation, anti‑fraud, and coupon‑targeting modules.
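To make the row layout concrete, here is a minimal sketch of a point lookup by UID across the three column families. It uses an in-memory dictionary as a stand-in for HBase; the family names (fact, model, predict), the sample tags, and the get_tags function are illustrative assumptions, not the actual 58 Daojia schema or service code.

```python
from typing import Dict, Optional

# In-memory stand-in for an HBase table:
# row key (UID) -> column family -> qualifier -> value.
# Family names and sample tags are hypothetical.
TagRow = Dict[str, Dict[str, object]]

TAG_TABLE: Dict[str, TagRow] = {
    "u1001": {
        "fact": {"device_cnt": 2, "avg_order": 86.5},
        "model": {"purchase_pref": "cleaning"},
        "predict": {"churn_prob": 0.12},
    }
}


def get_tags(uid: str, family: Optional[str] = None) -> Dict[str, object]:
    """Point lookup by UID, flattened to 'family:qualifier' keys.

    Roughly what a handler behind /tags/{uid} could return; passing a
    family restricts the read to one column family.
    """
    row = TAG_TABLE.get(uid, {})
    families = [family] if family else list(row)
    return {
        f"{fam}:{qual}": value
        for fam in families
        for qual, value in row.get(fam, {}).items()
    }
```

Keeping the families separate means a consumer that only needs, say, predictive scores can read one family instead of the whole row.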

Identity Resolution

Because the same user may appear under different identifiers (IMEI, cookie, UID), the system maintains a mapping table that records how the identifiers relate to each other, with a timestamp for each. When a tag query hits several linked identifiers, the service merges their tag values using rules such as:

Majority‑vote across sources.

Any‑non‑null rule.

Weighted average based on source reliability.
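The three merge rules above can be sketched as small functions. This is an illustration under assumptions: the source names (imei, cookie, uid), the reliability weights, and the choice of which rule applies to which tag are hypothetical and not taken from the article.

```python
from collections import Counter
from typing import Dict, Optional


def majority_vote(values: Dict[str, Optional[str]]) -> Optional[str]:
    """Most common non-null value across sources (ties go to the first seen)."""
    present = [v for v in values.values() if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def any_non_null(values: Dict[str, Optional[str]]) -> Optional[str]:
    """First non-null value in source order."""
    for v in values.values():
        if v is not None:
            return v
    return None


def weighted_average(values: Dict[str, Optional[float]],
                     reliability: Dict[str, float]) -> Optional[float]:
    """Average of numeric values, weighted by a per-source reliability score."""
    num = 0.0
    den = 0.0
    for source, v in values.items():
        w = reliability.get(source, 0.0)
        if v is not None and w > 0:
            num += w * v
            den += w
    return num / den if den else None
```

Majority vote and any-non-null suit categorical tags such as city, while the weighted average only makes sense for numeric tags such as an estimated age.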

Gender Inference

Three weak signals are combined:

Name parsing from order contacts (e.g., “张女士”).

Installed app categories that exhibit gender bias.

Text analysis of user comments for gender‑specific keywords.

The signals are weighted and the final gender tag is stored as gender=male/female/unknown.
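One way such a weighted combination might look is sketched below. The article does not give the weights, the score scale, or the decision threshold, so SIGNAL_WEIGHTS, the [-1, 1] score convention, and the 0.2 threshold are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical weights per signal; the article does not publish real values.
SIGNAL_WEIGHTS = {"name": 0.5, "apps": 0.3, "comments": 0.2}


def infer_gender(signals: Dict[str, float], threshold: float = 0.2) -> str:
    """Combine weak signals into male / female / unknown.

    Each signal is a score in [-1, 1]: negative leans female, positive
    leans male. Signals that are unavailable are simply left out, and
    the remaining weights are renormalized.
    """
    used = {k: v for k, v in signals.items() if k in SIGNAL_WEIGHTS}
    total_weight = sum(SIGNAL_WEIGHTS[k] for k in used)
    if total_weight == 0:
        return "unknown"
    score = sum(SIGNAL_WEIGHTS[k] * v for k, v in used.items()) / total_weight
    if score >= threshold:
        return "male"
    if score <= -threshold:
        return "female"
    return "unknown"
```

For example, a contact name such as “张女士” could map to a name score of -1.0. When signals conflict and the combined score stays inside the threshold band, the tag falls back to unknown rather than guessing.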

Fraud‑Model Training Workflow

Training data are collected through a manual review UI that displays driver profile, order statistics, and historical fraud flags. Analysts label each record as fraudulent or benign. The labeled set feeds a binary classification model (e.g., Gradient Boosted Trees in Spark MLlib). Model iteration continues until validation AUC exceeds a predefined threshold (e.g., 0.92). The final model outputs a fraud score that becomes a predictive tag fraud_score.
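The stopping rule in this workflow can be illustrated with a small AUC check. The sketch computes AUC directly from its pairwise definition in plain Python rather than through Spark MLlib; the function names are hypothetical, and 0.92 is the example threshold quoted above.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random fraudulent record (label 1)
    scores higher than a random benign one (label 0); ties count as 0.5.

    The pairwise form is O(n^2), which is fine for a small validation
    sample but not for large data sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one record of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def passes_validation(labels: Sequence[int], scores: Sequence[float],
                      threshold: float = 0.92) -> bool:
    """True once validation AUC exceeds the predefined threshold."""
    return auc(labels, scores) > threshold
```

AUC is a common choice for fraud models because it does not depend on a single score cutoff and is less distorted by class imbalance than plain accuracy.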

Group Portraits and Merchant Portraits

Individual user tags are aggregated using K‑means clustering on selected dimensions (age, gender, income, service frequency). Each cluster forms a group portrait with representative tag values. Merchant portraits are derived similarly but focus on revenue, credit rating, and service‑quality metrics.
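As a rough illustration of the clustering step, the following is a minimal K-means (Lloyd's algorithm) in plain Python. It is a sketch, not the production job: the initialization simply takes the first k points, and the two-dimensional sample data in the usage note is made up. Real inputs would need categorical tags such as gender encoded numerically and all dimensions scaled to comparable ranges, since K-means is sensitive to feature scale.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int,
           iterations: int = 20) -> Tuple[List[Point], List[int]]:
    """Return (centroids, assignment) after a fixed number of Lloyd iterations."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: _dist2(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignment
```

Calling kmeans([(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)], k=2) separates the two low points from the two high ones; each resulting centroid plays the role of the "representative tag values" of a group portrait.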

Evaluation Metrics

Tag coverage: proportion of active users that have a non‑null value for a given tag.

Accuracy: measured against a held‑out labeled set (e.g., gender inference 94% accuracy).

Business impact: an A/B test on a personalized nail‑service list showed a 15% reduction in browsing time and an 8% lift in conversion rate.

Anti‑fraud effectiveness: detection of cross‑device coupon abuse reduced fraudulent coupon redemption by ~20%.
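The first three metrics above reduce to simple ratios, sketched here. The function names and input shapes are assumptions, and relative_lift treats the reported lift as a relative change over the control group, which the article does not state explicitly.

```python
from typing import Dict, Sequence


def tag_coverage(user_tags: Dict[str, Dict[str, object]], tag: str) -> float:
    """Share of active users whose value for `tag` is non-null."""
    if not user_tags:
        return 0.0
    covered = sum(1 for tags in user_tags.values() if tags.get(tag) is not None)
    return covered / len(user_tags)


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Share of held-out labeled records where the predicted tag matches the label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def relative_lift(control_rate: float, treatment_rate: float) -> float:
    """Relative change of the treatment group over the control group."""
    return (treatment_rate - control_rate) / control_rate
```

Coverage and accuracy should be read together: a tag can be highly accurate on the users it covers while still being missing for most of the active base.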

System Evolution

The initial prototype was rule‑based and was released after about three months. Subsequent iterations introduced machine‑learning pipelines (Mahout, Spark) and real‑time scoring. The architecture now supports:

Real‑time recommendation via tag lookup in HBase.

Dynamic coupon targeting through the portal service.

Anti‑fraud scoring integrated into order‑validation pipelines.

Key Takeaways

The portrait system demonstrates how a unified tag store, combined with batch big‑data processing and low‑latency key‑value retrieval, can drive multiple product functions (personalization, fraud detection, pricing, and driver‑merchant matching) while providing a feedback loop for continuous model improvement.
