Can the OaK Architecture Unlock General AI? A Deep Dive into Continuous Learning and Planning

The article presents Richard Sutton’s OaK architecture—a domain‑general, empirical, open‑ended framework that equips agents with continuously learnable components, meta‑learned step‑sizes, and a five‑stage FC‑STOMP pipeline to build world models, generate sub‑problems, learn options, and plan at run‑time.

Data Party THU

OaK (Options and Knowledge) Architecture

The OaK framework is a model‑based reinforcement‑learning system designed for continual, domain‑general learning. Its three distinguishing properties are:

Every component (policy, value function, option models, feature generators, etc.) is capable of online continual learning.

Each learned weight is paired with a dedicated step‑size that is meta‑learned by online cross‑validation (e.g., IDBD‑style adaptation).

Abstract concepts are created through a five‑step evolution path called FC‑STOMP (Feature Construction → Sub‑Task Posing → Option Learning → Model Learning → Planning).
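As a rough illustration of the second property, the sketch below applies the IDBD update to a linear predictor, keeping one log step‑size per weight. The toy regression target, the meta step‑size theta and the initial step‑size of 0.05 are assumptions made for this example, not values taken from OaK.

```python
import math
import random


def idbd_update(w, beta, h, x, target, theta=0.01):
    """One IDBD step for a linear predictor; mutates w, beta and h in place.

    w:    weights
    beta: per-weight log step-sizes (alpha_i = exp(beta_i))
    h:    per-weight memory traces of recent weight updates
    """
    delta = target - sum(wi * xi for wi, xi in zip(w, x))  # prediction error
    for i, xi in enumerate(x):
        # Meta-update: raise the step-size when the current gradient agrees
        # with the recent update history, lower it when they disagree.
        beta[i] += theta * delta * xi * h[i]
        alpha = math.exp(beta[i])
        # Ordinary LMS update, but with this weight's own step-size.
        w[i] += alpha * delta * xi
        # Decay the trace and add the update that was just applied.
        h[i] = h[i] * max(0.0, 1.0 - alpha * xi * xi) + alpha * delta * xi
    return delta


def demo(steps=5000, seed=0):
    """Toy problem (assumed): only input 0 matters, target = 2 * x[0] + noise."""
    rng = random.Random(seed)
    n = 4
    w = [0.0] * n
    beta = [math.log(0.05)] * n
    h = [0.0] * n
    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        target = 2.0 * x[0] + rng.gauss(0.0, 0.1)
        idbd_update(w, beta, h, x, target)
    return w, [math.exp(b) for b in beta]
```

Calling demo() returns the learned weights and the final per‑weight step‑sizes, so the two can be inspected side by side.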

OaK architecture diagram

Design Goals

Domain‑General : No hand‑crafted priors; the system should work in any environment.

Empirical : All knowledge must emerge from interaction‑time experience, not from a pre‑training phase.

Open‑Ended Complexity : The agent’s abstraction capacity is limited only by available computation.

Design‑time vs. Run‑time

Design‑time is the construction phase, in which designers can hard‑code domain knowledge into the agent. Run‑time is the phase after deployment, when the agent continuously interacts with the world, learns, plans, and adapts. Large language models exemplify design‑time learning: their weights are learned before deployment and typically stay fixed afterwards. OaK instead insists that all crucial learning happen at run‑time.

Problem Setting: Reinforcement Learning and the Reward Hypothesis

The objective is to build an agent that maximises the expected cumulative reward in an arbitrary, partially unknown world. The Reward Hypothesis states that any goal can be expressed as maximising this scalar reward signal, a principle that underlies most of RL theory.

Eight Parallel Run‑time Steps

Learn the main policy and value function : Standard RL optimisation of the primary reward.

Generate new state features : Transform existing representations into potentially more useful features (e.g., via nonlinear basis expansion or learned encoders).

Rank features : Maintain meta‑data (e.g., TD‑error reduction, contribution to reward) for each feature and order them by utility.

Create sub‑problems : For the highest‑ranked features, instantiate a Reward‑Respecting Feature‑Achieving Sub‑Problem that aims to drive the feature to a high value while limiting loss of the main reward.

Learn options : Solve each sub‑problem with RL to obtain an option (policy + termination condition).

Learn option models : Predict the world transition resulting from executing an option; this constitutes the “knowledge” component.

Plan with models : Incorporate learned option models into a high‑level world model and perform approximate value‑iteration over options.

Management & maintenance : Continuously evaluate, prune, and create components to keep the system efficient.

Reward‑Respecting Feature‑Achieving Sub‑Problem

When a new feature ϕ_i is discovered, the agent defines a sub‑problem whose objective is to maximise

E\Big[ \sum_{t=0}^{T-1} R_t \;+\; \kappa\,\phi_i(S_T) \;+\; V(S_T) \Big]

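where T is the time step at which the option terminates, the expectation is over trajectories generated by the option’s policy and the environment, κ sets how strongly reaching the feature is rewarded, and V(S_T) is the main‑task value of the terminating state. Because both the rewards collected along the way and V(S_T) enter the objective, the option is penalised for giving up main‑task return, so it pursues the feature only as far as the bonus κ justifies.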
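To make the trade‑off concrete, the following sketch evaluates the objective for a single sampled trajectory. The function name and the two example trajectories are hypothetical and chosen only to show how κ changes which behaviour scores higher.

```python
def subproblem_return(rewards, phi_terminal, value_terminal, kappa):
    """Sub-problem objective for one trajectory that ends at time T.

    rewards:        main-task rewards collected before termination
    phi_terminal:   feature value phi_i(S_T) in the terminating state
    value_terminal: main-task value estimate V(S_T)
    kappa:          bonus weight on attaining the feature
    """
    return sum(rewards) + kappa * phi_terminal + value_terminal


# Trajectory A reaches the feature but pays three steps of -1 reward.
# Trajectory B stops at once in a slightly better state without the feature.
a_small = subproblem_return([-1, -1, -1], phi_terminal=1.0, value_terminal=5.0, kappa=2.0)
b_small = subproblem_return([0], phi_terminal=0.0, value_terminal=5.5, kappa=2.0)
a_large = subproblem_return([-1, -1, -1], phi_terminal=1.0, value_terminal=5.0, kappa=5.0)

print(a_small, b_small, a_large)  # 4.0 5.5 7.0
```

With κ = 2 the detour to the feature is not worth its cost (4.0 against 5.5); with κ = 5 it is (7.0 against 5.5).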

Sub‑problem objective diagram

FC‑STOMP: Five‑Step Evolution Path

Feature Construction : Perception builds interesting state features (e.g., via auto‑encoders, predictive coding, or hand‑crafted detectors).

Posing a Sub‑Task : High‑ranking features give rise to reward‑respecting sub‑tasks.

Learning an Option : Solve the sub‑task with RL to obtain an option (policy + termination).

Learning a Model : Learn the transition model of the new option (predict the distribution over the state in which the option terminates and the cumulative reward collected along the way).

Planning : Integrate the option and its model into the world model for long‑horizon planning.

This loop creates a discovery‑improvement cycle: new features inspire sub‑tasks, sub‑tasks produce options and models, and the resulting models guide the generation of better features.
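The option and model steps of the path above can be sketched as follows. The corridor environment, the Option class and the Monte Carlo averaging are assumptions made for illustration; OaK itself would learn the model online with temporal‑difference methods rather than by averaging complete rollouts.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    policy: Callable[[int], int]       # state -> action (+1 or -1)
    terminates: Callable[[int], bool]  # state -> should the option stop here?


def step(state, action, rng):
    """Hypothetical corridor with states 0..5: 10% chance of slipping
    (staying put), reward -1 on every step."""
    if rng.random() < 0.1:
        return state, -1.0
    return min(5, max(0, state + action)), -1.0


def run_option(option, state, rng, max_steps=100):
    """Follow the option's policy until it terminates.

    The termination test is applied after each step, so at least one
    action is always taken. Returns (cumulative reward, final state).
    """
    total = 0.0
    for _ in range(max_steps):
        state, reward = step(state, option.policy(state), rng)
        total += reward
        if option.terminates(state):
            break
    return total, state


def learn_option_model(option, start, episodes=2000, seed=0):
    """Estimate the option's model from one start state by sampling:
    expected cumulative reward and distribution over terminating states."""
    rng = random.Random(seed)
    mean_reward = 0.0
    counts = {}
    for n in range(1, episodes + 1):
        g, s_end = run_option(option, start, rng)
        mean_reward += (g - mean_reward) / n  # incremental average
        counts[s_end] = counts.get(s_end, 0) + 1
    return mean_reward, {s: c / episodes for s, c in counts.items()}


# An option that walks right until it reaches the end of the corridor.
go_right = Option(policy=lambda s: +1, terminates=lambda s: s == 5)
```

Calling learn_option_model(go_right, start=0) gives a reward estimate near -5/0.9 (five successful moves, each succeeding with probability 0.9) together with the terminating‑state distribution; this pair is the kind of summary a planner can use in place of step‑by‑step simulation.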

Algorithmic Foundations

Learning option value functions and models can reuse off‑policy General Value Function (GVF) algorithms such as GTD, Emphatic TD, Retrace, and ABQ. Planning corresponds to an approximate value‑iteration where “actions” are replaced by “options” and “single‑step rewards” by “option‑execution rewards”.
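A minimal tabular sketch of that planning step is shown below. The dictionary layout of the option models and the three‑state example are assumptions; in OaK the models would be learned function approximators and the backups approximate.

```python
def plan_over_options(states, models, sweeps=50):
    """Value iteration in which each backup is over options, not actions.

    models[s][o] = (expected cumulative reward of running o from s,
                    {terminating state: probability})
    The probabilities may sum to less than 1 to encode discounting.
    States with no available options keep value 0.
    """
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if models.get(s):
                v[s] = max(
                    r + sum(p * v[s2] for s2, p in dist.items())
                    for r, dist in models[s].values()
                )
    return v


def greedy_option(s, models, v):
    """Pick the option with the highest backed-up value in state s."""
    def backup(o):
        r, dist = models[s][o]
        return r + sum(p * v[s2] for s2, p in dist.items())
    return max(models[s], key=backup)


# Hypothetical option models: from A, "walk" reaches B cheaply and
# "jump" reaches the goal G directly at a higher cost.
states = ["A", "B", "G"]
models = {
    "A": {"walk": (-3.0, {"B": 1.0}), "jump": (-5.0, {"G": 1.0})},
    "B": {"walk": (-1.0, {"G": 1.0})},
    "G": {},
}
values = plan_over_options(states, models)
print(values)                              # {'A': -4.0, 'B': -1.0, 'G': 0.0}
print(greedy_option("A", models, values))  # walk
```

Each backup jumps over the whole duration of an option, which is why planning with option models can look further ahead per unit of computation than planning with single‑step actions.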

Key Technical Challenges

Reliable continual deep learning : Catastrophic forgetting and loss of plasticity remain open problems for deep networks; recent approaches (e.g., continual back‑propagation, replay buffers, regularisation) are promising but not yet sufficient.

Generating new state features (the “new term” problem): Requires a generate‑and‑test pipeline that produces many candidate features and selects those that improve performance. Meta‑learning of per‑feature step‑sizes (e.g., IDBD) is a crucial component for evaluating candidate usefulness online.
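One way to picture the testing half of generate‑and‑test is the bookkeeping sketch below. The utility measure (a running average of each feature's absolute contribution to the output) and the minimum‑age protection are assumptions borrowed from common practice in continual‑learning work, not a specification of OaK; a learned per‑feature step‑size could serve as the utility signal instead.

```python
def update_utilities(utilities, weights, features, decay=0.99):
    """Track a running average of |w_i * x_i| for each candidate feature."""
    for i, (w, x) in enumerate(zip(weights, features)):
        utilities[i] = decay * utilities[i] + (1.0 - decay) * abs(w * x)


def weakest(utilities, ages, k, min_age):
    """Indices of the k lowest-utility features that are old enough to judge.

    Features younger than min_age are protected so that new candidates
    get a chance to prove themselves before they can be replaced.
    """
    eligible = [i for i, age in enumerate(ages) if age >= min_age]
    return sorted(eligible, key=lambda i: utilities[i])[:k]


# Feature 3 has the lowest utility but is only 2 steps old, so the
# established but nearly useless feature 1 is selected for replacement.
utilities = [0.5, 0.01, 0.3, 0.0]
ages = [100, 100, 100, 2]
print(weakest(utilities, ages, k=1, min_age=10))  # [1]
```

The generation half would then overwrite the selected slots with fresh candidate features and reset their ages and utilities.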

Conclusion and Outlook

OaK proposes answers to several fundamental AI questions:

How does high‑level knowledge emerge from low‑level experience? – via the FC‑STOMP loop that turns discovered features into sub‑problems, options and models.

How are concepts formed? – as features, which serve as internal representations and become the basis of self‑generated, reward‑respecting sub‑tasks.

What is reasoning? – planning with a learned high‑level world model built from option models.

What purpose does play serve? – self‑generated sub‑problems (play) drive discovery of useful abstractions.

How does perception operate without human labels? – by forming concepts needed to solve the sub‑problems it creates.

For reinforcement‑learning researchers, OaK outlines a roadmap toward agents that continuously learn models, generate and solve their own sub‑problems, and improve through a closed loop of feature discovery, option learning, and planning—capabilities that are currently missing from mainstream AI systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
