Why Large Language Models Are Short‑Sighted and How Next‑ToBE Unlocks Anticipatory Reasoning

The article examines the short‑sighted nature of current next‑token prediction in LLMs, presents the Next‑ToBE (Next Token‑Bag Exploitation) method that reshapes the training objective to expose latent future‑token awareness, and shows through extensive experiments that this approach improves anticipatory reasoning and downstream task performance.

Data Party THU

Background

Next‑Token Prediction (NTP), together with the Transformer architecture, made large‑scale language models possible, but it trains the model to optimize only the immediate next token. This creates a short‑sighted bias that hampers long‑range planning.

Problem

Because the training objective is optimal only one step at a time, a model can be overly confident at the current step while the chain of subsequent steps drifts off course. The effect is most visible in multi‑step reasoning tasks such as mathematics, code generation, and planning.

Evidence of latent anticipatory capacity

At each step the model outputs a full probability distribution over the vocabulary. Experiments show that this distribution already carries information about several future tokens, not just the next one. To measure this, the authors define the Future‑tokens Hit Rate (FtHR): at a given step, take the top‑k tokens of the distribution and check how many of the tokens that actually appear within a future window they cover.
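The FtHR definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the choice to exclude the immediate next token from the future window, and the pooling of hits over all steps are assumptions made here for concreteness.

```python
from typing import Sequence


def top_k_ids(probs: Sequence[float], k: int) -> set:
    """Indices of the k largest probabilities in one distribution."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def future_tokens_hit_rate(dists, tokens, k, window):
    """Fraction of future-window tokens covered by the current top-k.

    dists[t]  : distribution the model emits at step t
    tokens[t] : token actually produced at step t
    Assumption: the future window for step t is tokens[t+1 : t+1+window],
    so it excludes the immediate next token. Hits are pooled over all steps.
    """
    hits = total = 0
    for t, probs in enumerate(dists):
        future = tokens[t + 1 : t + 1 + window]
        if not future:
            continue  # no future tokens left near the end of the sequence
        top = top_k_ids(probs, k)
        hits += sum(1 for tok in future if tok in top)
        total += len(future)
    return hits / total if total else 0.0


# Toy example with a 4-token vocabulary and a 3-step sequence.
dists = [
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10],
]
tokens = [0, 1, 2]
# 2 of the 3 future tokens fall inside the top-2 sets.
print(future_tokens_hit_rate(dists, tokens, k=2, window=2))
```

A larger k or a shorter window raises the score mechanically, so FtHR values are only comparable when both are held fixed.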

Figure 2 shows two things: (1) the current distribution covers a substantial portion of future tokens, and (2) the higher a future token ranks in the current distribution, the more likely it is to be generated correctly later.

Next‑ToBE method

Next‑ToBE (Next Token‑Bag Exploitation) modifies the training target without changing the model architecture. The one‑hot target is replaced by a soft target distribution that allocates a small amount of probability mass to a bag of tokens drawn from a future window. The resulting loss combines the original NTP term with a weighted future‑token term. Each future token’s weight is determined by two factors: the model’s intrinsic anticipatory preference (α) and a temporal‑semantic relation factor (β) obtained via random walks.
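One way to write this down is sketched below. It is a reading of the description above rather than the paper's exact formulation: the mixing form with λ, the product α·β as the weight, and the convention that the bag B_t excludes the next token x_{t+1} are all assumptions.

```latex
\begin{align*}
q_t(v) &= (1-\lambda)\,\mathbb{1}[v = x_{t+1}] + \lambda\,\tilde{w}_t(v),
\qquad
\tilde{w}_t(v) =
\begin{cases}
\dfrac{\alpha_t(v)\,\beta_t(v)}{\sum_{u \in B_t} \alpha_t(u)\,\beta_t(u)} & v \in B_t \\[2ex]
0 & \text{otherwise}
\end{cases} \\[1ex]
\mathcal{L}_t &= -\sum_{v} q_t(v)\,\log p_\theta(v \mid x_{\le t}) \\
&= (1-\lambda)\,\underbrace{\bigl(-\log p_\theta(x_{t+1} \mid x_{\le t})\bigr)}_{\text{NTP term}}
\;+\; \lambda\,\underbrace{\Bigl(-\sum_{v \in B_t} \tilde{w}_t(v)\,\log p_\theta(v \mid x_{\le t})\Bigr)}_{\text{future-token term}}
\end{align*}
```

Under these assumptions, cross‑entropy against the soft target splits exactly into the NTP term plus a weighted future‑token term. Minimizing the KL divergence to q_t gives the same gradients as long as α is treated as a constant, because the two objectives then differ only by the entropy of q_t, which does not depend on θ.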

Design principles for the future‑token distribution:

Current token remains the primary objective. The NTP loss is retained as the main term.

Future tokens receive spatio‑temporally weighted attention. Each token’s weight depends on (a) the model’s own raw probability for that token (α) and (b) its temporal‑semantic proximity to the current token (β).

Normalization of the target distribution. Weights are normalized to form a probability distribution; the model’s predictions are encouraged to match it (e.g., via KL divergence).
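The three principles can be sketched together as a small target‑construction routine. This is an illustrative sketch under stated assumptions, not the authors' code: `next_tobe_target`, `kl_divergence`, the mixing hyper‑parameter `lam`, and the product form of the weights are hypothetical, and the proximity factors `beta` are simply passed in rather than derived from random walks.

```python
import math


def next_tobe_target(model_probs, next_id, future_ids, beta, lam):
    """Build a soft target over the vocabulary for one training step.

    model_probs : the model's current distribution (source of the raw
                  probability factor; assumed to be treated as a constant)
    next_id     : ground-truth next token, which keeps mass 1 - lam
    future_ids  : tokens observed in the future window (the "bag")
    beta        : dict token_id -> proximity factor (assumed given)
    lam         : total probability mass handed to the bag (assumption)
    """
    target = [0.0] * len(model_probs)
    bag = sorted(set(future_ids) - {next_id})
    weights = {j: model_probs[j] * beta[j] for j in bag}
    total = sum(weights.values())
    if total <= 0.0:
        target[next_id] = 1.0  # empty bag: fall back to the one-hot target
        return target
    target[next_id] = 1.0 - lam  # principle 1: next token stays primary
    for j, w in weights.items():
        target[j] = lam * w / total  # principles 2 and 3: weight, normalize
    return target


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred), skipping entries where the target is zero."""
    return sum(q * math.log(q / max(p, eps)) for q, p in zip(target, pred) if q > 0)


probs = [0.60, 0.20, 0.15, 0.05]
target = next_tobe_target(probs, next_id=0, future_ids=[1, 2],
                          beta={1: 1.0, 2: 0.5}, lam=0.1)
print([round(q, 4) for q in target])
print(round(kl_divergence(target, probs), 4))
```

Setting `lam` to 0 recovers the one‑hot target, in which case the KL term reduces to the usual negative log‑likelihood of the next token.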

Unlike Multi‑Token Prediction (MTP) methods such as Medusa, Next‑ToBE does not add extra prediction heads or alter inference; it remains a standard autoregressive decoder.

Experimental evaluation

The authors evaluate three questions:

Does Next‑ToBE increase the model’s awareness of future tokens?

Does this awareness translate into higher accuracy of subsequent generations?

Does it improve performance on complex reasoning tasks?

After fine‑tuning, FtHR rises markedly, k‑step generation accuracy improves in tandem, and next‑token confidence drops from 0.87 to 0.81, indicating a deliberate reduction of over‑confidence (Figure 4).

Fine‑tuning three base models (Qwen2.5‑Math‑1.5B, Qwen2.5‑Math‑7B, Llama3.1‑8B‑Instruct) on three task families (mathematical reasoning, code generation, commonsense reasoning) yields 36 comparative experiments. In 35 out of 36 cases the Next‑ToBE‑fine‑tuned model achieves the best result (Table 1).

Next‑ToBE also uses less memory and training time than MTP‑style approaches.

Confidence–accuracy trade‑off

Increasing the hyper‑parameter λ shifts probability mass from the next token to future tokens. As a result, next‑token confidence declines while reasoning accuracy follows an inverted‑U‑shaped curve: it first rises as the model becomes “moderately uncertain,” then falls when uncertainty becomes excessive (Figure 5).
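The mechanical half of this trade‑off, how λ moves mass away from the next token, can be seen on the training target itself. The sketch below assumes the target keeps mass 1 − λ on the next token and spreads λ over the bag in proportion to fixed weights; `target_stats` and the example weights are hypothetical. It says nothing about downstream accuracy, which has to be measured empirically as in Figure 5.

```python
import math


def target_stats(lam, bag_weights):
    """Next-token mass and entropy (in nats) of the soft target for a given lam."""
    total = sum(bag_weights)
    masses = [1.0 - lam] + [lam * w / total for w in bag_weights]
    entropy = sum(-m * math.log(m) for m in masses if m > 0)
    return masses[0], entropy


for lam in (0.0, 0.1, 0.3, 0.5):
    next_mass, entropy = target_stats(lam, [0.2, 0.075])
    print(f"lambda={lam:.1f}  next-token mass={next_mass:.2f}  entropy={entropy:.3f}")
```

Over this range the target’s entropy grows with λ, which is the sense in which a larger λ asks the model to be less certain about the immediate next token.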

Conclusion

The short‑sightedness of LLMs originates from the one‑hot NTP loss that forces all probability mass onto a single token. Next‑ToBE relaxes this constraint by allocating a modest probability budget to a bag of future tokens, thereby re‑activating the model’s latent anticipatory capacity. This yields consistent gains in downstream reasoning tasks without architectural changes.

Paper link: https://openreview.net/pdf?id=T8IJojfaOh

Source: PaperWeekly