How Multi‑Task Multi‑Scene Modeling Powers ZhiZhuan’s Search: Algorithms, Industry Practices, and Lessons

This article analyzes the challenges of multi‑task and multi‑scene recommendation for large‑scale consumer‑facing (C‑end) services, reviews key academic and industry solutions such as Shared‑Bottom, MMoE, PLE, ESMM, LHUC, PEPNet, MTMS and HiNet, and details ZhiZhuan’s end‑to‑end architecture, which achieved over 6% click‑through and 2% conversion improvements.

Architect

1. Overview of Multi‑Task & Multi‑Scene Challenges

Large‑scale consumer‑facing applications often need to optimize several user‑experience metrics (e.g., CTR, CVR, collection rate) across many usage scenarios (feed, search, etc.). Training a separate model per scenario is costly and slows iteration, while a single unified model can suffer from data‑distribution mismatches, leading to a "seesaw" effect in which gains on dominant scenarios come at the expense of the others. Tasks also differ in sample sparsity: a CVR model trained only on clicked samples must still score every impression at serving time, which creates a gap between the training and inference distributions.

1.1 Background

Multi‑task learning aims to jointly learn related objectives, whereas multi‑scene modeling seeks a shared representation that can adapt to the diverse user behaviors and item supply ("material") of each scenario.

1.2 Multi‑Task Solutions

The evolution starts with Shared‑Bottom (a shared bottom network with task‑specific heads), which benefits correlated tasks but can hurt unrelated ones. MoE introduces a set of expert networks whose outputs are combined by a learned gate, mitigating negative transfer. MMoE extends MoE with one gate per task, allowing each task to weight the experts differently. PLE further separates shared experts from task‑specific experts and stacks them in a progressive layered architecture, achieving strong empirical results.
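The difference between a single MoE gate and MMoE's per‑task gates can be sketched in a few lines. This is a minimal, standard‑library illustration rather than a production implementation: the dimensions, the random weights, and the helper names (`linear`, `rand_matrix`, `mmoe_forward`) are all assumptions made for the example.

```python
import math
import random

def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, expert_ws, gate_ws):
    """Return one mixed representation (and its gate) per task.

    expert_ws: one weight matrix per shared expert.
    gate_ws:   one gating matrix per task (input_dim x num_experts).
    """
    # Every expert sees the same input; ReLU keeps the sketch non-linear.
    expert_outs = [[max(0.0, v) for v in linear(x, w)] for w in expert_ws]
    task_reprs, task_gates = [], []
    for gw in gate_ws:
        gate = softmax(linear(x, gw))  # task-specific weights over the experts
        mixed = [sum(g * out[d] for g, out in zip(gate, expert_outs))
                 for d in range(len(expert_outs[0]))]
        task_reprs.append(mixed)
        task_gates.append(gate)
    return task_reprs, task_gates

# Hypothetical sizes: 4 input features, 3 experts of width 3, 2 tasks (e.g. CTR and CVR).
rng = random.Random(0)
input_dim, hidden_dim, num_experts, num_tasks = 4, 3, 3, 2
expert_ws = [rand_matrix(rng, input_dim, hidden_dim) for _ in range(num_experts)]
gate_ws = [rand_matrix(rng, input_dim, num_experts) for _ in range(num_tasks)]
x = [0.2, -0.1, 0.7, 0.4]
task_reprs, task_gates = mmoe_forward(x, expert_ws, gate_ws)
```

Each entry of `task_reprs` would then feed its own task tower. Replacing `gate_ws` with a single matrix shared by all tasks reduces the sketch to plain MoE.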

Shared‑Bottom, MoE, MMoE, PLE diagram

Alibaba’s ESMM addresses conditional relationships between tasks (e.g., click → conversion) by modeling the entire space and aligning training and inference sample distributions, reporting notable accuracy gains in e‑commerce settings.
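The entire‑space idea can be written down compactly: the CVR tower is never supervised directly, only through the product pCTCVR = pCTR × pCVR, which is defined on every impression. The sketch below assumes the two towers already produced logits; the function names and the unweighted sum of the two losses are illustrative choices, not details taken from this article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one sample, clipped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def esmm_loss(ctr_logit, cvr_logit, clicked, converted):
    """Loss for one impression; both terms are defined on all impressions."""
    p_ctr = sigmoid(ctr_logit)
    p_cvr = sigmoid(cvr_logit)   # intermediate output, no direct label
    p_ctcvr = p_ctr * p_cvr      # probability of click AND conversion
    ctcvr_label = clicked * converted
    return bce(p_ctr, clicked) + bce(p_ctcvr, ctcvr_label)

# An unclicked impression still contributes a training signal to the CVR tower.
loss_unclicked = esmm_loss(0.0, 0.0, clicked=0, converted=0)
loss_converted = esmm_loss(2.0, 1.0, clicked=1, converted=1)
```

Because the CVR tower is trained through impressions rather than clicks, it sees the same sample space at training time that it scores at serving time.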

ESMM architecture

1.3 Multi‑Scene Solutions

LHUC (Learning Hidden Unit Contributions), originally proposed for speaker adaptation, is repurposed to rescale the hidden units of the dense network per scene, preventing representation collapse when feature engineering is insufficient.
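A minimal sketch of the LHUC idea, transplanted from speakers to scenes: each scene owns a small vector of parameters, and a hidden layer's activations are multiplied element‑wise by 2·sigmoid of that vector, so a freshly initialized scene (all zeros) leaves the layer unchanged. The scene names come from the scenarios mentioned earlier; the parameter values and function names are made up for illustration.

```python
import math

def lhuc_scale(r):
    """Map unconstrained parameters to per-unit amplitudes in (0, 2)."""
    return [2.0 / (1.0 + math.exp(-v)) for v in r]

def apply_lhuc(hidden, scene_id, scene_params):
    """Rescale one hidden layer's activations with the scene's amplitudes."""
    scale = lhuc_scale(scene_params[scene_id])
    return [h * s for h, s in zip(hidden, scale)]

# Hypothetical learned parameters: "search" is still at its zero initialization,
# "feed" has learned to amplify unit 0 and damp unit 1.
scene_params = {"search": [0.0, 0.0, 0.0], "feed": [1.5, -1.5, 0.0]}
hidden = [0.8, 0.8, 0.8]

search_hidden = apply_lhuc(hidden, "search", scene_params)
feed_hidden = apply_lhuc(hidden, "feed", scene_params)
```

The shared weights stay identical across scenes; only the lightweight per‑scene vectors differ.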

LHUC applied to recommendation

Dynamic‑weight gating further yields algorithms such as PEPNet (Kuaishou), M2M, AdaSparse, and STAR (Alibaba), all of which rely on gating networks to filter or recombine expert outputs for each scene or task.

In summary, most multi‑task‑multi‑scene models can be viewed as variations of gating‑based information selection and re‑composition.

2. Industry Solutions Overview

PEPNet (Parameter and Embedding Personalized Network) uses a gating module, the Gate Neural Unit (Gate NU), in two places: EPNet personalizes the embedding layer and PPNet personalizes the parameters of the task networks, aligning scene information with task‑specific embeddings.
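The gating step can be sketched as follows, assuming the Gate NU is a two‑layer network (ReLU, then a sigmoid scaled by a factor of 2) and that EPNet multiplies the shared embedding element‑wise by the gate output. That form is my reading of the PEPNet paper, not something this article specifies, and the weights and helper names below are hypothetical. A real implementation would also stop gradients from the gate flowing back into the shared embedding, which plain Python cannot express.

```python
import math

def linear(x, w, b):
    """Affine map; w has one row per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def gate_nu(x, w1, b1, w2, b2, gamma=2.0):
    """Two-layer gate: ReLU hidden layer, then gamma * sigmoid, giving values in (0, gamma)."""
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]
    return [gamma / (1.0 + math.exp(-v)) for v in linear(hidden, w2, b2)]

def epnet(scene_emb, shared_emb, gate_params):
    """Personalize the shared embedding with a gate driven by scene features."""
    gate_input = scene_emb + shared_emb        # list concatenation
    delta = gate_nu(gate_input, *gate_params)  # one gate value per embedding dim
    return [d * e for d, e in zip(delta, shared_emb)]

scene_emb = [1.0, 0.0]
shared_emb = [0.5, -0.3, 0.9]
# Hypothetical weights: 5 inputs -> 2 hidden units -> 3 gate values.
w1 = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b2 = [0.0, 0.0, 0.0]
personalized = epnet(scene_emb, shared_emb, (w1, b1, w2, b2))
```

With these weights the first embedding dimension is amplified, the second is passed through unchanged, and the third is damped. PPNet applies the same kind of gate to the hidden layers of the task towers instead of to the embedding.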

PEPNet architecture

MTMS (Multi‑Task and Multi‑Scene), from Baidu, adopts a multi‑tower design: independent embeddings per scene/task and a two‑stage training pipeline (representation learning → fine‑tuning). Unlike ESMM’s end‑to‑end training, MTMS first learns separate embeddings, then concatenates them and trains only the top MLP.

MTMS two‑stage training diagram

HiNet (Hierarchical Information Extraction Network), from Meituan, builds on MMoE with a hierarchical scene‑extraction module (shared experts, scene‑specific experts, scene‑sensitive attention) and a task‑extraction module that re‑uses MMoE’s gating to produce task‑specific embeddings.

HiNet modules

3. ZhiZhuan’s Multi‑Business Multi‑Scene Solution

3.1 Problem & Solution

ZhiZhuan expanded from mobile 3C products to a broad catalog (electronics, appliances, etc.), introducing multiple business lines and scenarios (search, recommendation, group‑buy, etc.). Directly applying MTMS‑style independent embeddings would suffer from data imbalance in small scenes, and a unified pretrained embedding would miss business‑specific material features.

The adopted architecture combines EPNet with feature‑level dynamic weighting. The model consists of:

Scene representation derived from the category set of the item.

SparseFeatures and DenseFeatures that encode user, query, and material (including business‑specific attributes).

DomainNet that processes all features, outputs weights applied to non‑scene features, and aggregates them into a global vector.

A prediction head that re‑uses DCN (Deep & Cross Network) for CTR (or other task) prediction.
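The components above can be sketched roughly as follows. The article does not give DomainNet's internals, so this assumes the simplest form consistent with the description: one linear gate over the concatenation of all features, producing one weight per non‑scene feature field, followed by a single DCN‑style cross layer as a stand‑in for the prediction head. The field names, weights, and function names are hypothetical.

```python
import math

def domain_net(fields, scene_key, w, b):
    """Weight each non-scene field by a gate computed from all features.

    fields: dict mapping field name -> embedding (list of floats).
    w, b:   one gate row and bias per non-scene field (assumed linear gate).
    Returns (per-field weights, concatenated global vector).
    """
    names = [n for n in fields if n != scene_key]
    gate_input = [v for n in fields for v in fields[n]]  # includes the scene
    weights = {}
    for i, name in enumerate(names):
        z = sum(wi * xi for wi, xi in zip(w[i], gate_input)) + b[i]
        weights[name] = 2.0 / (1.0 + math.exp(-z))       # in (0, 2), 1.0 is neutral
    global_vec = [weights[n] * v for n in names for v in fields[n]]
    return weights, global_vec

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    s = sum(wi * xi for wi, xi in zip(w, xl))
    return [x0i * s + bi + xli for x0i, bi, xli in zip(x0, b, xl)]

# Hypothetical input: a one-hot-like scene vector plus three feature fields.
fields = {
    "scene": [1.0, 0.0],
    "user": [0.3, -0.2],
    "query": [0.5, 0.1],
    "item": [-0.4, 0.6],
}
# Gate rows that only look at the scene dimensions (first two inputs).
w = [
    [1.0, -1.0] + [0.0] * 6,   # user field: amplified in scene 0
    [0.0, 0.0] + [0.0] * 6,    # query field: scene-independent
    [-1.0, 1.0] + [0.0] * 6,   # item field: damped in scene 0
]
b = [0.0, 0.0, 0.0]
weights, global_vec = domain_net(fields, "scene", w, b)
crossed = cross_layer(global_vec, global_vec, [0.1] * 6, [0.0] * 6)
```

Switching the scene vector to `[0.0, 1.0]` flips which fields are amplified, which is the feature‑level adaptation the architecture relies on; a full model would stack several cross layers and a deep branch before the final sigmoid.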

ZhiZhuan EPNet architecture

The model is trained end‑to‑end (unlike MTMS’s two‑stage approach), with the representation module handling multi‑business, multi‑scene, material, user, and query signals, and the prediction module delivering task outputs.

Online results show a +6% lift in overall click‑through rate and a +2% increase in purchase conversion, with above‑average gains in low‑traffic categories.

3.2 Future Plans

While the solution proves effective for CTR, extending it to CVR and other recommendation tasks is planned. A current limitation is cold‑start handling for new scenes or material attributes, which may hinder full‑site rollout; future work will focus on alleviating this bottleneck.

References

[1] MMoE: Modeling Task Relationships in Multi‑task Learning with Multi‑gate Mixture‑of‑Experts.

[2] PLE: Progressive Layered Extraction (PLE): A Novel Multi‑task Learning Model for Personalized Recommendations.

[3] MoE: Adaptive Mixtures of Local Experts.

[4] ESMM: Entire Space Multi‑Task Model: An Effective Approach for Estimating Post‑Click Conversion Rate.

[5] LHUC: Learning Hidden Unit Contribution for Unsupervised Speaker Adaptation of Neural Network Acoustic Models.

[6] PEPNet: Parameter and Embedding Personalized Network for Infusing with Personalized Prior Information.

[7] M2M: A Multi‑Scenario Multi‑Task Meta‑Learning Approach for Advertiser Modeling.

[8] AdaSparse: Learning Adaptively Sparse Structures for Multi‑Domain Click‑Through Rate Prediction.

[9] STAR: One Model to Serve All: Star Topology Adaptive Recommender for Multi‑Domain CTR Prediction.

[10] MTMS: Multi‑Task and Multi‑Scene Unified Ranking Model for Online Advertising.

[11] HiNet: Novel Multi‑Scenario & Multi‑Task Learning with Hierarchical Information Extraction.

[12] DCN: Deep & Cross Network for Ad Click Predictions.
