Mastering Multi-Task Learning: Network Designs & Loss Balancing
This article reviews the challenges of multi‑task learning, compares network architectures such as hard‑parameter sharing, MMoE, CGC, and PLE, and examines loss‑balancing techniques such as GradNorm, Dynamic Weight Average (DWA), and Dynamic Task Prioritization (DTP), with guidance on mitigating the “seesaw” effect and improving overall performance.
What is Multi‑Task Learning?
Multi‑Task Learning (MTL) trains a single model to solve several related sub‑tasks simultaneously, such as multiple content‑safety tags or joint semantic segmentation and depth estimation in autonomous driving.
Why Use MTL?
Only one forward pass is needed to obtain predictions for all tasks, improving inference efficiency.
Sharing data and network parameters across tasks can boost overall learning and alleviate data scarcity for individual tasks.
Key Bottlenecks in MTL
Despite its benefits, MTL faces two major challenges:
Network Architecture Design – The model must learn both shared representations and task‑specific features. Over‑fitting to a single task or under‑fitting due to insufficient specialization can occur.
Loss Function Design – Different tasks converge at different speeds and have varying loss magnitudes. Simple summation of task losses often lets one dominant task dictate training.
Network Structure Optimizations
Early MTL models used hard parameter sharing (a single shared bottom network with per‑task towers on top), which can cause negative transfer when tasks conflict. Subsequent designs introduced more flexible sharing:
Asymmetric Sharing Network: Tasks share parameters asymmetrically — one task’s tower consumes another task’s intermediate output, so knowledge flows in only one direction.
Customized Sharing Network: Decouples shared expert parameters from task‑specific expert parameters, giving each task a private expert alongside the shared ones.
Multi‑gate Mixture‑of‑Experts (MMoE): Adds a per‑task gating network that computes a softmax‑weighted linear combination of the expert outputs.
Customized Gate Control (CGC): Splits experts into shared and task‑exclusive groups; each task’s gate aggregates its own experts together with the shared ones.
Progressive Layered Extraction (PLE): Stacks multiple CGC layers, progressively fusing shared and exclusive expert outputs to alleviate the “seesaw” effect.
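To make the gating mechanism concrete, here is a minimal NumPy sketch of an MMoE forward pass. All dimensions, weights, and names are illustrative (random toy parameters, not a trained model); each task's gate produces a softmax mixture over the shared experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_experts, n_tasks, d_in, d_hid = 4, 2, 8, 16
x = rng.normal(size=(1, d_in))  # one input example

# Each expert is a small linear map; each task has its own gate.
experts = [rng.normal(size=(d_in, d_hid)) for _ in range(n_experts)]
gates = [rng.normal(size=(d_in, n_experts)) for _ in range(n_tasks)]

# Run every expert once: shape (1, n_experts, d_hid).
expert_out = np.stack([x @ W for W in experts], axis=1)

task_reprs = []
for g in gates:
    w = softmax(x @ g)  # (1, n_experts) per-task mixing weights
    # Weighted sum of expert outputs -> task-specific representation.
    task_reprs.append((w[:, :, None] * expert_out).sum(axis=1))
```

CGC follows the same pattern, except each task's gate only sees the shared experts plus that task's exclusive experts, and PLE stacks several such layers.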
Loss Function Optimizations
When tasks have different loss scales, a naïve sum can cause one task to dominate training. Several strategies have been proposed:
Fixed Weighted Sum: Assign a constant weight w_i to each task’s loss and minimize L = Σ w_i · L_i, with the weights hand‑tuned on validation data.
Dynamic Weighting: Adjust w_i during training based on task difficulty or convergence speed.
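A fixed weighted sum is straightforward; the sketch below (task names and weight values are illustrative, not from any paper) shows how a dominant task can be scaled down so it no longer dictates the total loss.

```python
# Per-task losses at some training step (illustrative values).
losses = {"click": 0.7, "convert": 2.3}

# Fixed weights w_i, hand-tuned on validation data; the larger-scale
# "convert" loss is down-weighted so it does not dominate training.
weights = {"click": 1.0, "convert": 0.3}

total = sum(weights[t] * losses[t] for t in losses)  # 0.7 + 0.69 = 1.39
```

The obvious limitation is that good weights depend on loss scales that change during training, which motivates the dynamic schemes below.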
Notable dynamic‑weighting works include:
GradNorm: Learns the task weights with an auxiliary loss that pulls each task’s gradient norm toward a common scale, adjusted by that task’s relative training rate, encouraging synchronized convergence.
Dynamic Weight Average (DWA): Computes each task’s weight w_i from the ratio of its two most recent losses (its relative rate of descent), so slower‑converging tasks receive larger weights, without any extra gradient computation.
Dynamic Task Prioritization (DTP): Tracks a per‑task performance KPI (e.g., accuracy), smoothed with an exponential moving average, and assigns larger weights to tasks with lower KPIs, i.e., harder tasks.
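Of the methods above, DWA is the cheapest to implement. A minimal sketch, assuming we keep the last two epoch‑average losses per task (task names and the temperature value are illustrative): each weight is proportional to exp(r_i / T), where r_i is the ratio of the task's two most recent losses, normalized so the weights sum to the number of tasks.

```python
import math

def dwa_weights(loss_hist, T=2.0):
    """Dynamic Weight Average: weight tasks by their recent loss ratios.

    loss_hist: dict mapping task name -> [loss at t-2, loss at t-1].
    Returns weights that sum to the number of tasks.
    """
    # Relative descent rate: close to 1 means the loss has stalled.
    ratios = {k: v[-1] / v[-2] for k, v in loss_hist.items()}
    exps = {k: math.exp(r / T) for k, r in ratios.items()}
    z = sum(exps.values())
    K = len(loss_hist)
    return {k: K * e / z for k, e in exps.items()}

w = dwa_weights({"seg": [1.0, 0.9], "depth": [0.5, 0.5]})
# "depth" stalled (ratio 1.0) while "seg" improved (ratio 0.9),
# so "depth" receives the larger weight.
```

Note there is no backward pass here: unlike GradNorm, DWA needs only the recorded loss values, which is why it adds essentially no training overhead.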
Overall Summary
Current research identifies two root causes of MTL performance bottlenecks: (1) conflicting shared‑vs‑task‑specific features in network design, and (2) imbalance in loss scales and convergence speeds across tasks. Combining expert‑based architectures (e.g., CGC/PLE) with dynamic loss‑balancing methods (e.g., GradNorm, DWA, DTP) can mitigate both issues and lead to more robust multi‑task models.