
Optimizing News Recall with DDPG Reinforcement Learning and Transformer Architecture

This article explains how reinforcement learning, specifically the DDPG algorithm combined with Transformer-based networks, is applied to improve large‑scale news recall systems, detailing the business scenario, algorithm selection, model architecture, speed optimizations, training challenges, and observed online performance gains.


In the era of AI challenging human capabilities, the AlphaGo series stands out, but reinforcement learning (RL) has already found extensive industrial applications; this article presents one such use case—optimizing a news recall model with RL.

Business Scenario: Personalized news recommendation aims to increase user engagement by selecting relevant news items for each user. The system architecture (Figure 1) involves a request from the Center service, an Engine that performs coarse recall from the news pool, and a ranking module that finalizes the selection.

Why Choose Reinforcement Learning? Unlike supervised learning, RL does not require explicit target labels; it learns from reward signals such as clicks or dwell time, making it cheaper to supervise, capable of modeling delayed rewards, and well suited to the exploration-exploitation trade-off inherent in real-time recommendation.

Why DDPG? The news recall problem can be modeled as a Markov Decision Process: user features form the state, the selection of news categories and their counts forms the action, and user clicks serve as the reward. Off-policy methods like DDPG [1] can train on historical online data and handle continuous actions, which matches the requirement of outputting proportional news counts per category.
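To make the continuous-action framing concrete, here is a minimal sketch of how an actor's unbounded per-category scores could be turned into integer recall quotas. The category names, quota total, and softmax mapping are illustrative assumptions, not Sohu's actual schema.

```python
import numpy as np

# Hypothetical recall setup: totals and category names are made up.
TOTAL_RECALL = 500  # total news items the engine recalls per request
CATEGORIES = ["sports", "tech", "finance", "entertainment"]

def action_to_quota(action: np.ndarray, total: int = TOTAL_RECALL) -> dict:
    """Map a continuous actor output to per-category recall counts.

    The actor emits one unbounded score per category; a softmax turns
    the scores into proportions, which are scaled to integer quotas.
    """
    exp = np.exp(action - action.max())          # numerically stable softmax
    proportions = exp / exp.sum()
    quota = np.floor(proportions * total).astype(int)
    quota[np.argmax(proportions)] += total - quota.sum()  # absorb rounding
    return dict(zip(CATEGORIES, quota.tolist()))

quota = action_to_quota(np.array([1.2, 0.3, -0.5, 0.8]))
```

Because the mapping is a smooth function of the actor's raw scores (up to rounding), the policy stays in a continuous action space that DDPG can optimize directly.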

What Is DDPG? DDPG [1] consists of an actor (policy) network and a critic (Q-value) network. The actor proposes actions, while the critic evaluates them; the actor is then updated along the gradient of the critic's Q-value with respect to the action. Both networks use a replay buffer and target networks for stability. The critic regresses toward the target y = r + γ · Q′(s′, μ′(s′ | θ^μ′) | θ^Q′), where r is the immediate reward, γ the discount factor, s′ the next state, and Q′ and μ′ the target critic and actor networks.
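A minimal sketch of that target computation, with toy linear "networks" standing in for the real actor and critic (the weight shapes and tanh squashing are illustrative assumptions):

```python
import numpy as np

GAMMA = 0.99  # discount factor

def critic(state, action, w):
    """Q(s, a | theta^Q): toy linear critic over concatenated (s, a)."""
    return float(np.dot(w, np.concatenate([state, action])))

def actor(state, w):
    """mu(s | theta^mu): toy linear actor with tanh squashing."""
    return np.tanh(w @ state)

def td_target(reward, next_state, w_actor_tgt, w_critic_tgt):
    """y = r + gamma * Q'(s', mu'(s')) -- what the critic regresses to."""
    next_action = actor(next_state, w_actor_tgt)
    return reward + GAMMA * critic(next_state, next_action, w_critic_tgt)

y = td_target(reward=0.5, next_state=np.array([1.0, 0.0]),
              w_actor_tgt=np.eye(2), w_critic_tgt=np.ones(4))
```

Note that both the next action and its Q-value come from the slowly-updated target networks, which is what keeps the regression target from chasing its own updates.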

Transformer Integration: Both the actor and critic networks adopt Transformer structures (Figure 2) to fuse user state features via multi-head self-attention, benefiting from parallelism and the ability to handle unordered inputs. Scaled dot-product attention is expressed as Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V [2].
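A single-head NumPy sketch of that formula (the article's networks use the multi-head variant; this shows only the core operation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  [2]."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted mix of values

# A query that strongly matches the first key attends almost
# entirely to the first value row.
Q = np.array([[10.0, 0.0]])
K = np.array([[10.0, 0.0], [0.0, 10.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = scaled_dot_product_attention(Q, K, V)
```

The √d_k scaling keeps the dot products from saturating the softmax as the feature dimension grows, which is what preserves useful gradients in deeper stacks.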

Speed Optimization: To reduce the computational cost of the massive action space, the fully-connected output layer is replaced with an embedding lookup (Figure 4), eliminating unnecessary calculations for sparse actions and accelerating training and inference.
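The idea can be sketched as follows: instead of multiplying the hidden vector against the full output weight matrix, gather only the weight columns for the actions that actually appear in the batch. The sizes below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 16, 100_000   # hidden size, full action vocabulary (illustrative)
W = rng.standard_normal((d, n_actions))
hidden = rng.standard_normal(d)

# Dense output layer: scores every action, even ones never sampled.
dense_scores = hidden @ W

# Embedding-lookup variant: gather only the columns for the actions
# present in this batch, skipping the rest of the matmul entirely.
batch_action_ids = np.array([7, 42, 99_999])
sparse_scores = hidden @ W[:, batch_action_ids]
```

The two paths produce identical scores for the sampled actions, but the lookup version does O(d · k) work for k observed actions instead of O(d · N) for the full vocabulary.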

Training Challenges: Offline pre-training struggles to evaluate RL models because the logged data cannot cover the entire action space, leading to biased critic estimates. Constraints on the actor's outputs and dropout regularization are applied to keep the critic stable. Monitoring actor updates is also difficult; TensorBoard visualizations of Q-values, gradients, and parameter distributions (Figure 5) are used to debug training.
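One common way to realize those two safeguards, sketched here as assumptions rather than the article's exact implementation: squash actor outputs into a bounded range so the critic is never queried far outside the region covered by logged data, and apply inverted dropout to critic features as a regularizer.

```python
import numpy as np

rng = np.random.default_rng(1)

def bounded_action(raw, low=0.0, high=1.0):
    """Squash unbounded actor outputs into [low, high] via tanh, so the
    critic only ever sees actions inside a constrained, covered range."""
    return low + (high - low) * 0.5 * (np.tanh(raw) + 1.0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero features with prob p during training and
    rescale the survivors, so evaluation needs no correction."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

a = bounded_action(np.array([-100.0, 0.0, 100.0]))
```

Bounding the action space does not remove the coverage bias, but it caps how far the critic must extrapolate beyond the offline data.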

Conclusion: The DDPG-based reinforcement learning approach, enhanced with Transformer architectures and optimized output layers, has been deployed in the Sohu News app, yielding noticeable improvements in per-user click and reading-time metrics in online A/B tests.

References

[1] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," arXiv:1509.02971, 2015.

[2] A. Vaswani et al., "Attention Is All You Need," NIPS 2017.

Tags: AI, Transformer, Reinforcement Learning, Online Advertising, News Recommendation, DDPG
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
