Sutton’s New Intentional Updates: Solving Streaming RL’s Major Flaw with a 1967 Formula

The article reviews the recent Intentional Updates framework—co‑authored by Turing laureate Richard Sutton—that redefines step‑size in streaming reinforcement learning using a 1967 NLMS‑style formula, details its algorithmic design, experimental validation, and remaining challenges.

Machine Heart

The Stream Barrier in Reinforcement Learning

Recent work (arXiv:2410.14606) showed that deep reinforcement learning struggles with truly online learning: removing replay buffers and using batch size 1 causes training to collapse, a problem dubbed the "stream barrier".

Intentional Updates – A New Perspective

Sharifnassab, Elsayed, Mahmood, and Sutton propose to replace the traditional step‑size (how much parameters move) with an "intentional" step‑size that directly specifies the desired change in the function output. This idea traces back to Nagumo and Noda’s 1967 NLMS algorithm, which set step‑size based on expected output change rather than parameter change.

In the Intentional Updates framework, each update first states the intended outcome (e.g., reduce value‑prediction error by 5% or limit policy probability change to a small amount) and then derives the appropriate step‑size.

Mathematical Core

The core formula is simple:

step‑size = (expected output change) / (actual influence of the gradient direction on the output)

For value learning, the "actual influence" is measured by the squared norm of the gradient vector (the NLMS normalization), yielding smaller steps in steep regions and larger steps in flat regions. For policy learning, the expected change is proportional to the advantage function, normalized by a running average to keep policy updates stable.
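As a rough illustration of the formula above, the following sketch applies it to a linear value function with a one-step TD error. It is not the authors' implementation: the function name `intentional_td0_step`, the intent fraction `eta`, and the stabilizer `eps` are assumptions introduced here. In this linear case the gradient of v(s) with respect to the weights is the feature vector itself, so the "influence" works out to its squared norm, as in NLMS.

```python
import numpy as np

def intentional_td0_step(w, x, r, x_next, gamma=0.99, eta=0.05, eps=1e-8):
    """One TD(0) update whose step size is derived from an intended output change.

    Assumption: linear values v(s) = w @ x, and the intent is to move v(s)
    by a fraction `eta` of the TD error.
    """
    delta = r + gamma * (w @ x_next) - w @ x   # TD error
    grad = x                                   # gradient of v(s) w.r.t. w (linear case)
    # To first order, an update alpha * delta * grad changes v(s) by
    # alpha * delta * (grad @ grad); solving for alpha gives the ratio below.
    alpha = eta / (grad @ grad + eps)
    w_new = w + alpha * delta * grad
    return w_new, delta

# Hypothetical features and reward, chosen only for illustration.
w = np.zeros(4)
x = np.array([1.0, 2.0, 0.0, -1.0])
x_next = np.array([0.0, 1.0, 1.0, 0.0])
w_new, delta = intentional_td0_step(w, x, r=1.0, x_next=x_next)
realized = w_new @ x - w @ x   # actual change in v(s)
intended = 0.05 * delta        # the change that was asked for
```

For a linear function the realized change matches the intended one up to the small `eps` term. With a neural network the match is only first-order, which is presumably why the authors measure the realized-to-intended ratio empirically.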

Algorithmic Enhancements

The authors combine the intentional step‑size with two engineering tricks: RMSProp‑style diagonal scaling to handle parameter‑wise magnitude differences, and eligibility traces to propagate rewards backward in time.

Three concrete algorithms result:

Intentional TD(λ) for value prediction

Intentional Q(λ) for discrete‑action control

Intentional Policy Gradient for continuous control
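One way the three ingredients (intentional step-size, RMSProp-style diagonal scaling, eligibility traces) might fit together for linear value prediction is sketched below. This is a simplified reading of the description above, not the published Intentional TD(λ): the class name, the choice to track the squared update `(delta * z)**2`, the hyper-parameter defaults, and the `min_influence` guard that skips an update when the direction barely affects the current state's value are all assumptions.

```python
import numpy as np

class IntentionalTDLambdaSketch:
    """Linear TD(lambda) with an output-driven step size (illustrative only)."""

    def __init__(self, n_features, gamma=0.99, lam=0.8, eta=0.05,
                 beta=0.999, eps=1e-8, min_influence=1e-6):
        self.w = np.zeros(n_features)    # value weights
        self.z = np.zeros(n_features)    # eligibility trace
        self.nu = np.ones(n_features)    # running average of squared updates
        self.gamma, self.lam, self.eta = gamma, lam, eta
        self.beta, self.eps, self.min_influence = beta, eps, min_influence

    def update(self, x, r, x_next, done=False):
        v = self.w @ x
        v_next = 0.0 if done else self.w @ x_next
        delta = r + self.gamma * v_next - v

        # Accumulating trace: carries credit back to earlier states.
        self.z = self.gamma * self.lam * self.z + x

        # RMSProp-style diagonal scaling of the trace direction.
        self.nu = self.beta * self.nu + (1.0 - self.beta) * (delta * self.z) ** 2
        direction = self.z / (np.sqrt(self.nu) + self.eps)

        # First-order effect of moving along `direction` on v(s).
        influence = direction @ x
        if influence > self.min_influence:
            alpha = self.eta / influence   # intended change / actual influence
            self.w = self.w + alpha * delta * direction

        if done:
            self.z = np.zeros_like(self.z)
        return delta

# Hypothetical transition, for illustration only.
agent = IntentionalTDLambdaSketch(n_features=3)
x = np.array([1.0, 0.5, -2.0])
x_next = np.array([0.0, 1.0, 1.0])
v_before = agent.w @ x
delta = agent.update(x, r=1.0, x_next=x_next)
v_after = agent.w @ x
```

On this first step the value of the current state moves by `eta * delta`, whatever the feature scale or the diagonal scaling. The skip-on-small-influence guard is a blunt choice; a real implementation would need a more careful rule for the case where the trace direction points away from the current gradient.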

Experimental Validation

On MuJoCo continuous‑control benchmarks (Ant, Humanoid, HalfCheetah) with batch size 1 and no replay, Intentional AC matches or exceeds SAC, while each update costs only about 1/140 of a SAC update in floating‑point operations.

On Atari and MinAtar discrete‑action games, Intentional Q achieves performance comparable to DQN with replay, using a single hyper‑parameter set across all tasks.

To verify that the intended updates are realized, the authors measured the ratio of actual to expected update magnitude. With eligibility traces disabled, the ratio’s standard deviation stayed between 0.016 and 0.029, and the 99th percentile remained below 1.07, confirming that updates closely follow their specifications.
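A check of this kind can be sketched on a toy nonlinear model. The example below is not the authors' measurement code: it uses a single `tanh` unit, a supervised target in place of a TD target, synthetic inputs, and fixed starting weights, all of which are assumptions made for illustration. It computes the ratio of realized to intended output change for many independent samples and summarizes it with the same two statistics mentioned above.

```python
import numpy as np

def realized_over_intended(w, x, target, eta=0.05):
    """Ratio of the actual output change to the intended one for v = tanh(w @ x)."""
    v = np.tanh(w @ x)
    delta = target - v                  # prediction error
    grad = (1.0 - v ** 2) * x           # gradient of v w.r.t. w
    alpha = eta / (grad @ grad)         # intentional step size (first-order)
    w_new = w + alpha * delta * grad
    realized = np.tanh(w_new @ x) - v   # what actually happened
    return realized / (eta * delta)     # 1.0 means the intent was met exactly

rng = np.random.default_rng(0)
w = np.full(8, 0.05)                    # fixed starting weights (assumption)
ratios = np.array([
    realized_over_intended(w, rng.standard_normal(8), rng.uniform(-1.0, 1.0))
    for _ in range(1000)
])
ratio_std = ratios.std()
ratio_p99 = np.percentile(ratios, 99)
```

Because the step size is derived from a first-order approximation, the ratio drifts away from 1 wherever the function curves strongly over the length of the step. The spread of `ratios` is therefore a direct measure of how well the linearization holds.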

Ablation studies show that removing RMSProp scaling or the σ term degrades performance, but the "intentional scaling" component remains the primary contributor.

Remaining Issues

In policy learning, the step‑size depends on the sampled action, introducing a subtle bias. Cosine similarity between the expected and actual update directions stays near 0.96 for Humanoid tasks but drops to a median of 0.63 for Ant‑v4, indicating that the bias can be significant in some environments.
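The source of this bias can be seen in a toy setting. The sketch below is an illustration constructed here, not the authors' evaluation protocol: it assumes a single-state softmax policy over a few discrete actions, fixed hypothetical advantages, and a step-size of 1 / ||∇ log π(a)||² for each sampled action. It computes, exactly rather than by sampling, the expected plain policy-gradient direction and the expected direction when each action gets its own step-size, then compares them by cosine similarity.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()
    p = np.exp(shifted)
    return p / p.sum()

def expected_directions(logits, advantages, eps=1e-12):
    """Expected update direction with and without an action-dependent step size."""
    pi = softmax(logits)
    plain = np.zeros_like(pi)
    scaled = np.zeros_like(pi)
    for a in range(len(pi)):
        g = -pi.copy()
        g[a] += 1.0                      # gradient of log pi(a) w.r.t. the logits
        alpha = 1.0 / (g @ g + eps)      # step size depends on the sampled action
        plain += pi[a] * advantages[a] * g
        scaled += pi[a] * alpha * advantages[a] * g
    return plain, scaled

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

advantages = np.array([1.0, -0.5, 0.2])  # hypothetical values

# Uniform policy: every action has the same gradient norm, so no tilt.
cos_uniform = cosine(*expected_directions(np.zeros(3), advantages))

# Skewed policy: gradient norms differ per action, so the average can tilt.
cos_skewed = cosine(*expected_directions(np.array([2.0, 0.0, -1.0]), advantages))
```

Under the uniform policy the step-size is identical for every action and the two directions coincide. Once the policy is skewed, actions with small gradient norms receive larger steps, and the expected update is no longer guaranteed to be parallel to the true policy gradient.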

The authors suggest future work on action‑independent step‑size strategies to keep the "intent" unbiased in expectation.

Conclusion

Intentional Updates demonstrate that a principled, output‑driven step‑size can overcome the stream barrier, enabling streaming deep reinforcement learning without large replay buffers or massive GPU clusters. While not a replacement for batch training of large models, this approach is promising for robots, edge devices, and any setting requiring continual, low‑cost adaptation.
