
How Policy Regularization Boosts Deep Reinforcement Learning for Large‑Scale Inventory Management

This article presents DeepStock, a deep reinforcement learning framework with policy regularization that integrates classic inventory heuristics, achieving a 7% reduction in turnover and multi‑million‑yuan cost savings across millions of SKU‑warehouse pairs in Alibaba's self‑operated ecosystem.

DaTaobao Tech

In October 2025, the Taotian Group’s self‑operated technology and algorithm team received the prestigious Daniel H. Wagner Prize for Excellence in Advanced Analytics and Operations Research for their paper "DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management," marking the team’s second honor after a 2022 nomination.

Problem background: Traditional two‑stage inventory optimization—first forecasting demand, then solving a replenishment model—suffers from error propagation, especially under high‑dimensional contextual data (promotions, seasonality, social trends). Classic inventory models (e.g., the newsvendor) rely on strong demand‑distribution assumptions that rarely hold in practice.
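
For context, the textbook newsvendor solution orders a fixed quantile of an assumed demand distribution. A quick statement of that classical result (standard theory, not from the paper) shows why it needs the full demand CDF, which contextual demand rarely supplies:

```latex
% Newsvendor critical-fractile solution (standard result):
% c_u = underage (stock-out) cost, c_o = overage (holding) cost, F = demand CDF
q^{*} = F^{-1}\!\left(\frac{c_u}{c_u + c_o}\right)
```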

Proposed solution: The authors embed well‑known inventory heuristics directly into the deep reinforcement learning (DRL) policy via policy regularization. Two regularization forms are introduced:

Base‑stock regularization: The order quantity is constrained to follow a base‑stock structure, order = BaseTarget − CurrentInventory, where the base target is produced by a neural network conditioned on exogenous features.

Coefficients regularization: Order quantities are expressed as a linear combination of key demand features with coefficients generated by a neural network, encouraging interpretable and stable policies.

These regularizations differ from generic entropy or trust‑region penalties; they encode domain‑specific knowledge into the policy architecture.
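
To make the two structures concrete, here is a minimal PyTorch sketch of what such regularized policy heads could look like. All class and variable names, network sizes, and the max(0, ·) clipping are our illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class BaseStockPolicy(nn.Module):
    """Base-stock head: a network maps exogenous features to an
    order-up-to target S_t; the order is BaseTarget - CurrentInventory,
    clipped at zero (the clipping is our assumption)."""

    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.target_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep the target non-negative
        )

    def forward(self, features: torch.Tensor, inventory: torch.Tensor) -> torch.Tensor:
        base_target = self.target_net(features).squeeze(-1)
        return torch.clamp(base_target - inventory, min=0.0)


class CoefficientPolicy(nn.Module):
    """Coefficients head: a network emits weights for a handful of key
    demand features; the order is their linear combination."""

    def __init__(self, feat_dim: int, n_key_feats: int, hidden: int = 64):
        super().__init__()
        self.coef_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_key_feats),
        )

    def forward(self, features: torch.Tensor, key_feats: torch.Tensor) -> torch.Tensor:
        coefs = self.coef_net(features)  # (batch, n_key_feats)
        return torch.clamp((coefs * key_feats).sum(dim=-1), min=0.0)
```

Because the structure lives in the policy head rather than in the loss, either head can in principle be trained with DDPG, PPO, or a differentiable simulator unchanged.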

Modeling details: The inventory system is modeled as a discrete‑time Markov decision process with horizon T, replenishment lead time L, and review period P. State vectors include static SKU attributes (category, supplier, lead time, profit margin) and dynamic contextual signals (upcoming promotions, seasonal factors). The reward combines two operational metrics: service rate (stock‑out penalty) and turnover time (holding‑cost penalty), weighted to reflect business priorities.
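
As a sketch of how such a two‑term reward might be computed per review period (the weights and the linear penalty forms are illustrative assumptions, not the paper's values):

```python
def step_reward(demand: float, on_hand: float,
                w_service: float = 1.0, w_holding: float = 0.1) -> float:
    """Two-term reward: penalize unmet demand (service rate) and
    leftover stock (turnover / holding cost). Weights reflect
    business priorities and are illustrative."""
    unmet = max(demand - on_hand, 0.0)      # stock-out -> service-rate penalty
    leftover = max(on_hand - demand, 0.0)   # excess inventory -> holding penalty
    return -(w_service * unmet + w_holding * leftover)
```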

Training methodology: Trajectories are split into training, validation, and test sets. The authors evaluate three DRL algorithms—DDPG, PPO, and a differentiable simulator (DS) approach—both with and without the proposed regularizations. Hyper‑parameter search is performed extensively, and the final policy is selected based on a weighted sum of service‑rate and turnover losses.
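
A minimal sketch of that selection step, assuming a hypothetical `validate` function that replays held‑out trajectories and returns the two losses (the weights here are also assumptions):

```python
def select_policy(checkpoints, validate, w_service=1.0, w_turnover=0.5):
    """Return the checkpoint minimizing the weighted validation loss.
    `validate(policy)` is assumed to return (service_loss, turnover_loss)
    on held-out trajectories; weights are illustrative, not the paper's."""
    def score(policy):
        service_loss, turnover_loss = validate(policy)
        return w_service * service_loss + w_turnover * turnover_loss
    return min(checkpoints, key=score)
```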

Experimental results:

On synthetic datasets (IID and AR(1) demand), base‑stock regularization consistently reduces test loss compared with unregularized DRL.
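
For readers who want to reproduce this kind of synthetic setting, a minimal AR(1) demand generator might look like the following (all parameter values are our assumptions; the paper's synthetic configuration may differ):

```python
import numpy as np

def ar1_demand(T: int, mu: float = 10.0, rho: float = 0.7,
               sigma: float = 2.0, seed: int = 0) -> np.ndarray:
    """Sample a demand trajectory d_t = mu + rho * (d_{t-1} - mu) + eps_t,
    with eps_t ~ N(0, sigma^2), truncated at zero when recorded.
    Parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    demand = np.empty(T)
    prev = mu
    for t in range(T):
        prev = mu + rho * (prev - mu) + rng.normal(0.0, sigma)
        demand[t] = max(prev, 0.0)
    return demand
```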

DS achieves lower validation loss but overfits, leading to higher test loss, especially when trajectory count is limited.

In large‑scale offline experiments, regularized DRL outperforms DS in both sample efficiency and generalization.

Online deployment on Alibaba’s Tmall Supermarket and International Direct channels covered 100% of over 1 million SKU‑warehouse pairs. Service rate remained stable while turnover days dropped 7%, translating to an annual inventory value reduction of ¥3.5 billion (7% of the ¥50 billion inventory scale) and holding‑cost savings of ¥15 million.

Full‑scale rollout (July–August 2025) showed a 20% reduction in average turnover time without degrading service rate, confirming the robustness of the approach.

Conclusions: Policy regularization bridges the gap between classic inventory theory and modern DRL, delivering interpretable, stable, and scalable replenishment policies. The study demonstrates that, with domain‑specific regularization, DRL can surpass differentiable‑simulator methods in both efficiency and performance, establishing a new benchmark for industrial‑grade inventory decision systems.

Limitations and future work: The current regularizations are problem‑specific and may not directly transfer to other domains without adaptation. Future research will explore automated discovery of regularization structures and integration with large‑scale foundation models for broader supply‑chain applications.

Tags: deep learning, operations research, inventory management, reinforcement learning, industrial AI, policy regularization
Written by DaTaobao Tech
Official account of DaTaobao Technology