When Does Dot-Product Attention Switch from Positional to Semantic? A Phase Transition Theory

This paper presents a solvable low‑rank dot‑product attention model and, using high‑dimensional asymptotics and GAMP analysis, derives closed‑form characterizations of global optima that reveal a phase transition between positional and semantic attention mechanisms as sample complexity grows, with empirical validation against linear baselines.


Abstract

We study a solvable dot‑product attention model with trainable low‑rank query and key matrices and provide a closed‑form characterization of the global optimum of the non‑convex empirical risk. The optimum corresponds to two distinct mechanisms: positional attention, where tokens interact based on their sequence positions, and semantic attention, where interaction depends on token content. As sample complexity increases, the model undergoes a clear phase transition between these mechanisms. Comparisons with a linear positional baseline show that, with sufficient data, dot‑product attention leveraging semantic mechanisms significantly outperforms the baseline.

Introduction

Self‑attention layers are central to extracting information from textual data, simultaneously capturing positional order and semantic meaning. Empirical studies have shown that training scale and data volume determine which algorithmic mechanism the attention layer adopts, yet a theoretical description is lacking. Inspired by phase‑transition theory in physics, we construct an analytically tractable single‑layer dot‑product attention model and prove, in the high‑dimensional limit, a sharp transition between positional and semantic attention.

Model Construction

To obtain tractable results, we design a simplified self‑attention model that contains only one dot‑product attention layer. The query and key share the same trainable matrix, constrained to be low‑rank. In a teacher‑student framework, the teacher’s attention matrix is a mixture of a positional component and a semantic component, guaranteeing that the data contain both clear positional dependencies and genuine semantic correlations. The student, a dot‑product attention layer acting on inputs equipped with positional encodings, is trained to approximate the teacher’s mixed attention output by minimizing an ℓ₂‑regularized mean‑squared error loss.
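
To make the setup concrete, the sketch below instantiates one plausible version of this teacher–student construction: Gaussian tokens with additive positional encodings, a teacher whose attention matrix mixes a fixed positional matrix with a content‑based one, and a student attention layer with a shared low‑rank query/key matrix Q trained under an ℓ₂‑regularized squared loss. All shapes, the mixing weight omega, and the softmax normalization are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d, r = 2, 128, 1        # sequence length, token dimension, rank (illustrative)
lam = 1e-2                 # l2-regularization strength (assumed)

pos_enc = rng.standard_normal((T, d)) / np.sqrt(d)   # toy positional encodings

def embed(X):
    # Tokens plus additive positional encodings.
    return X + pos_enc

# Teacher: mixture of a fixed positional attention matrix and a semantic,
# content-based attention matrix built from the teacher's directions Q_star.
A_pos = softmax(rng.standard_normal((T, T)))
Q_star = rng.standard_normal((d, r)) / np.sqrt(d)
omega = 0.5                # positional/semantic mixing weight (hypothetical)

def teacher_output(X):
    Xe = embed(X)
    A_sem = softmax(Xe @ Q_star @ Q_star.T @ Xe.T / np.sqrt(d))
    return ((1 - omega) * A_pos + omega * A_sem) @ Xe

# Student: a single dot-product attention layer whose query and key share the
# same trainable low-rank matrix Q (tied query/key parameterization).
def student_output(X, Q):
    Xe = embed(X)
    return softmax(Xe @ Q @ Q.T @ Xe.T / np.sqrt(d)) @ Xe

def empirical_risk(Q, Xs):
    # l2-regularized mean-squared error between student and teacher outputs.
    mse = np.mean([np.sum((student_output(X, Q) - teacher_output(X)) ** 2) for X in Xs])
    return 0.5 * mse + 0.5 * lam * np.sum(Q ** 2)

Xs = [rng.standard_normal((T, d)) for _ in range(64)]   # n = 64 sample sequences
Q0 = rng.standard_normal((d, r)) / np.sqrt(d)
print("risk at random init:", empirical_risk(Q0, Xs))
```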

Figure 1: Phase transition in the simplified attention model

High‑Dimensional Closed‑Form Characterization

In the proportional high‑dimensional limit where the input dimension d and sample size n grow together with a constant ratio α = n/d, we apply the Generalized Approximate Message Passing (GAMP) state‑evolution framework. Solving the resulting self‑consistent equations yields closed‑form expressions for training loss and test error at the global optimum. These expressions allow us to identify which mechanism (positional or semantic) the optimum corresponds to for any given α and teacher composition.
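
As a rough illustration of how such self‑consistent equations are handled numerically, the sketch below runs a damped fixed‑point iteration over scalar order parameters. The update map is a toy stand‑in (a single tanh overlap equation), not the paper's actual state‑evolution equations, but it shows the mechanics and happens to sustain a non‑zero fixed point only once α exceeds 1.

```python
import numpy as np

def solve_state_evolution(update, init, damping=0.5, tol=1e-10, max_iter=100_000):
    # Damped fixed-point iteration over a vector of scalar order parameters.
    # In the paper's setting, `update` would be the GAMP state-evolution map.
    x = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        x_new = damping * np.asarray(update(x)) + (1.0 - damping) * x
        if np.max(np.abs(x_new - x)) < tol:
            break
        x = x_new
    return x

# Toy stand-in update (hypothetical): a one-dimensional overlap equation
# m = tanh(alpha * m), which only sustains a non-zero fixed point for alpha > 1.
for alpha in (0.5, 2.0):
    m = solve_state_evolution(lambda m: np.tanh(alpha * m), init=[0.1])
    print(f"alpha = {alpha}: fixed-point overlap (toy) = {m[0]:.4f}")
```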

Positional‑Semantic Phase Transition

The analysis shows that when the semantic weight of the teacher is low, the global optimum is a positional solution for α < α_c, where α_c is a critical sample‑complexity threshold. Once α exceeds α_c, the optimum switches to a semantic solution: the student effectively discards the positional encodings and aligns its query‑key matrix with the teacher’s semantic component. This behavior mirrors the sub‑critical and super‑critical regimes of a phase transition in physical systems.
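
Given closed‑form (or numerically evaluated) test errors for the two candidate branches, α_c can be read off as the point where the semantic branch first beats the positional one. The curves below are hypothetical placeholders used only to illustrate that crossing logic, not the paper's expressions.

```python
import numpy as np

# Hypothetical test-error curves for the two candidate optima as functions of
# sample complexity alpha = n/d (placeholders, not the paper's expressions).
def err_positional(alpha):
    return 0.4 + 0.1 / (1.0 + alpha)    # saturates: ignores semantic structure

def err_semantic(alpha):
    return 1.2 / (1.0 + alpha)          # keeps improving as data grow

alphas = np.linspace(0.1, 10.0, 2000)
gap = err_semantic(alphas) - err_positional(alphas)

# The global optimum follows whichever branch has lower error; the transition
# sits where the branches cross, i.e. at the first sign change of the gap.
crossing = np.flatnonzero(np.sign(gap[:-1]) != np.sign(gap[1:]))
alpha_c = alphas[crossing[0]] if crossing.size else None
print("estimated alpha_c (toy curves):", alpha_c)
```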

Empirical Comparison

We compare the dot‑product attention model with a linear baseline that can only implement positional mixing. For α < α_c, the linear baseline slightly outperforms dot‑product attention, but for α > α_c the attention model achieves significantly lower test mean‑squared error, confirming that the semantic mechanism provides a genuine advantage when data are abundant.
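
For reference, the linear positional baseline can be pictured as mixing the T tokens with a single trained T×T matrix that depends only on positions, never on token content; under a squared loss such a mixer admits a ridge‑regression closed form. The sketch below uses toy synthetic targets and is an assumption about the baseline's form, not the paper's exact parameterization.

```python
import numpy as np

T, d, n = 2, 128, 64
rng = np.random.default_rng(1)

def fit_positional_baseline(Xs, Ys, lam=1e-2):
    # Ridge-regression closed form for a purely positional mixer W (T x T):
    # minimize sum_i ||W X_i - Y_i||^2 + lam ||W||^2  =>  W = B (A + lam I)^{-1}
    A = sum(X @ X.T for X in Xs)
    B = sum(Y @ X.T for X, Y in zip(Xs, Ys))
    return B @ np.linalg.inv(A + lam * np.eye(T))

# Toy targets generated by a purely positional teacher W_true (hypothetical).
W_true = rng.standard_normal((T, T))
Xs = [rng.standard_normal((T, d)) for _ in range(n)]
Ys = [W_true @ X for X in Xs]

W_hat = fit_positional_baseline(Xs, Ys)
print("recovery error:", np.linalg.norm(W_hat - W_true))
```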

Figure 2: Loss landscape and empirical phase transition

Conclusion and Future Work

This work delivers the first rigorous high‑dimensional probabilistic characterization of the emergence of positional and semantic algorithmic mechanisms in dot‑product attention, revealing a clear phase transition between them. The findings deepen our understanding of attention fundamentals and suggest new directions for designing self‑attention models with better generalization. Future research may extend the analysis to multi‑head and cross‑attention architectures, non‑Gaussian long‑sequence data, and the dynamics of random initialization and gradient descent within the loss landscape.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

theoretical analysis · dot-product attention · GAMP · high-dimensional limit · phase transition
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.