Optimizing News Recall with DDPG Reinforcement Learning and Transformer Architecture
This article explains how reinforcement learning, specifically the DDPG algorithm combined with Transformer-based networks, is applied to improve large‑scale news recall systems, detailing the business scenario, algorithm selection, model architecture, speed optimizations, training challenges, and observed online performance gains.
AlphaGo made reinforcement learning (RL) famous as AI began challenging human capabilities, but RL has long had extensive industrial applications; this article presents one such use case—optimizing a news recall model with RL.
Business Scenario: Personalized news recommendation aims to increase user engagement by selecting relevant news items for each user. The system architecture (Figure 1) involves a request from the Center, an Engine that performs coarse recall from the news pool, and a ranking module that finalizes the selection.
Why Choose Reinforcement Learning? Unlike supervised learning, RL does not require explicit target labels; it learns from reward signals such as clicks or dwell time, which makes labeling cheaper, allows modeling of delayed rewards, and suits the exploration‑exploitation trade‑off inherent in real‑time recommendation.
Why DDPG? The news recall problem can be modeled as a Markov Decision Process where user features form the state, the selection of news categories and counts form the action, and user clicks serve as the reward. Off‑policy methods like DDPG can train on historical online data and handle continuous actions, which matches the requirement of outputting proportional news counts per category.
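The MDP formulation above can be made concrete with a small sketch. The category names and the `recall_counts` helper are illustrative, not from the article; the point is only that a continuous action vector of per‑category proportions is turned into integer news counts for the recall step:

```python
# Hypothetical sketch of the action described above: a continuous vector of
# per-category proportions, converted into integer recall counts.
CATEGORIES = ["politics", "sports", "tech", "entertainment"]  # example label set

def recall_counts(action, total_slots):
    """Map a continuous action (per-category proportions) to integer
    news counts that sum to total_slots."""
    norm = sum(action) or 1.0
    props = [a / norm for a in action]
    counts = [int(p * total_slots) for p in props]
    # give any slots lost to integer truncation to the largest category
    remainder = total_slots - sum(counts)
    counts[props.index(max(props))] += remainder
    return dict(zip(CATEGORIES, counts))
```

The continuous proportions are exactly what makes a continuous-action method like DDPG a natural fit here.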
What Is DDPG? DDPG consists of an actor (policy) network and a critic (Q‑value) network. The actor proposes actions, while the critic evaluates them, guiding policy updates via the deterministic policy gradient. Both networks use a replay buffer and slowly updated target networks for stability. The critic is trained toward the target $y_t = r_t + \gamma\, Q'(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$, where $r_t$ is the reward, $\gamma$ the discount factor, and $Q'$ and $\mu'$ the target critic and target actor networks.
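The two stabilizing mechanisms—the bootstrapped critic target and the slowly tracking target networks—can be sketched in a few lines (function names are illustrative; `tau` and `gamma` defaults follow common DDPG practice):

```python
def ddpg_target(reward, next_q, gamma=0.99, done=False):
    """Bellman target for the critic: y = r + gamma * Q'(s', mu'(s')).
    The bootstrap term is zeroed at terminal states."""
    return reward + (0.0 if done else gamma * next_q)

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: target networks slowly track the online networks,
    which keeps the critic's regression target from moving too fast."""
    return [tau * o + (1 - tau) * t for t, o in zip(target_params, online_params)]
```

In training, `next_q` would come from the target critic evaluated at the target actor's action for the next state, sampled from the replay buffer.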
Transformer Integration: Both actor and critic networks adopt Transformer structures (Figure 2) to fuse user state features via multi‑head self‑attention, benefiting from parallelism and the ability to handle unordered inputs. The scaled dot‑product attention is expressed as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the key dimension.
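A minimal single-head version of that formula, written with plain lists of row vectors so the arithmetic is visible (a real implementation would of course use batched tensor ops):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because each query row attends to all keys independently, the outer loop parallelizes trivially—one of the properties the article cites for choosing Transformers.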
Speed Optimization: To reduce the computational cost over the massive action space, the fully‑connected output layer is replaced with an embedding lookup (Figure 4), which skips computation for actions that are never sampled and accelerates both training and inference.
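The cost difference can be illustrated with a toy sketch (function names are hypothetical). A dense output layer scores every action, proportional to the action count; an embedding lookup scores only the action ids actually sampled:

```python
def score_dense(state_vec, weight_matrix):
    # full output layer: one dot product per action (cost ~ |actions| * dim)
    return [sum(s * w for s, w in zip(state_vec, row)) for row in weight_matrix]

def score_sparse(state_vec, action_embeddings, action_ids):
    # embedding lookup: only the sampled actions are scored
    return [sum(s * w for s, w in zip(state_vec, action_embeddings[a]))
            for a in action_ids]
```

Both produce identical scores for the looked-up actions; the saving comes purely from never touching the rows of unsampled actions.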
Training Challenges: Offline pre‑training is hard to evaluate because the logged data cannot cover the entire action space, which biases the critic's Q‑value estimates. Constraining the actor's outputs and regularizing with dropout are applied to keep the critic stable. Monitoring actor updates is also difficult, so TensorBoard visualizations of Q‑values, gradients, and parameter distributions (Figure 5) are used to debug training.
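The article does not specify the exact constraint on actor outputs; one plausible form, assuming the action is a vector of per‑category proportions, is to clip each component into a valid range and renormalize before the critic sees it:

```python
def constrain_action(raw_action, low=0.0, high=1.0):
    """Hypothetical actor-output constraint: clip each component into
    [low, high], then renormalize so components read as proportions.
    This keeps the critic from being queried at actions far outside
    the region covered by the offline data."""
    clipped = [min(max(a, low), high) for a in raw_action]
    total = sum(clipped) or 1.0
    return [c / total for c in clipped]
```

Keeping the actor inside the data-supported region is what prevents the critic's biased extrapolations from dominating the policy gradient.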
Conclusion: The DDPG‑based reinforcement learning approach, enhanced with Transformer architectures and optimized output layers, has been deployed in the Sohu News app, yielding noticeable improvements in per‑user click and reading‑time metrics in online A/B tests.
References
[1] Timothy P. Lillicrap et al., "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[2] Ashish Vaswani et al., "Attention Is All You Need," NIPS 2017.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.