Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path

The paper introduces JTok and its dynamic variant JTok‑M, a token‑indexed parameter scaling method that decouples model capacity from compute, achieving up to 35% compute reduction while delivering consistent performance gains across a wide range of downstream tasks and model sizes.

Large language model scaling traditionally follows a dense or MoE path, where increasing parameters inevitably raises compute and memory linearly, leading to diminishing returns and even performance regression at scale.

Is there a new scaling direction for LLMs that can lead us out of this dilemma?

To break this coupling, researchers from Shanghai Jiao Tong University and XiaoHongShu Hi Lab propose a third scaling dimension: token-indexed parameters, implemented via the JTok/JTok-M modules. Instead of widening or deepening the backbone, each token retrieves a modulation vector from an embedding table (static JTok) or a context-aware set of vectors (dynamic JTok-M) and injects it element-wise into the Transformer's MLP residuals.

Static and Dynamic Modulation

JTok adds a lightweight plugin to every Transformer layer. For each token, its ID indexes a vector that, after normalization, multiplies the layer’s residual. This requires no architectural changes and adds negligible FLOPs.
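
A minimal PyTorch sketch of what such a static plugin could look like is shown below; the module and parameter names (JTokModulation, mod_table) are illustrative rather than taken from the paper, and the zero initialization is an assumption.

```python
import torch
import torch.nn as nn

class JTokModulation(nn.Module):
    """Minimal sketch of a static token-indexed modulation plugin.

    Each token ID indexes a learned vector; the normalized vector rescales the
    MLP residual element-wise, adding capacity with negligible extra FLOPs.
    Names and initialization are illustrative, not details from the paper.
    """

    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        # One modulation vector per token ID. Zero init keeps the module close
        # to an identity transform at the start of training (an assumption).
        self.mod_table = nn.Embedding(vocab_size, hidden_dim)
        nn.init.zeros_(self.mod_table.weight)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids: torch.Tensor, mlp_residual: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq]; mlp_residual: [batch, seq, hidden_dim]
        mod = self.norm(self.mod_table(token_ids))   # look up and normalize
        return mlp_residual * (1.0 + mod)            # element-wise modulation
```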

JTok‑M extends this idea with two mechanisms:

Modulation‑vector pool: each token owns a set of candidate vectors forming a semantic sub‑space.

Context router: based on the token’s current hidden state, the router selects and blends the top‑K vectors, producing a context‑sensitive modulation.

Both mechanisms inherit an MoE-style load-balancing loss that keeps the vector pool efficiently utilized.
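
A rough PyTorch sketch of the dynamic variant follows, assuming a per-token vector pool and a linear top-K router; the pool size, top-K value, and all names are illustrative, and the load-balancing term is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JTokMModulation(nn.Module):
    """Sketch of a dynamic, context-routed modulation (names and sizes illustrative)."""

    def __init__(self, vocab_size: int, hidden_dim: int, pool_size: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Per-token pool of candidate modulation vectors: a small semantic
        # sub-space owned by each token ID. Zero init is an assumption.
        self.pool = nn.Parameter(torch.zeros(vocab_size, pool_size, hidden_dim))
        # Context router: scores each candidate from the token's current hidden state.
        self.router = nn.Linear(hidden_dim, pool_size)

    def forward(self, token_ids, hidden, mlp_residual):
        # token_ids: [b, s]; hidden, mlp_residual: [b, s, hidden_dim]
        candidates = self.pool[token_ids]                      # [b, s, pool, hidden]
        scores = self.router(hidden)                           # [b, s, pool]
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)               # blend weights over top-K
        idx = topk_idx.unsqueeze(-1).expand(*topk_idx.shape, candidates.size(-1))
        chosen = torch.gather(candidates, dim=2, index=idx)    # [b, s, K, hidden]
        mod = (weights.unsqueeze(-1) * chosen).sum(dim=2)      # context-sensitive modulation
        # An MoE-style load-balancing loss over `scores` would keep the pool
        # evenly utilized during training; omitted here for brevity.
        return mlp_residual * (1.0 + mod)
```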

Engineering Optimizations

The lookup table operates asynchronously with the backbone, allowing memory accesses to be overlapped with computation. Token frequency’s long‑tail distribution enables batch merging of identical lookups, dramatically lowering memory pressure. Training supports embedding‑parallelism, while inference can offload lookups to CPU, keeping GPU memory overhead minimal.

Lookup can be overlapped with main compute, hiding memory latency.

Frequent tokens are merged, reducing memory traffic.

Training uses parallel embedding; inference streams only required vector slices.
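
The duplicate-lookup merging idea might be sketched as follows, assuming identical token IDs in a batch can share a single table read; the helper name and tensor layout are illustrative, not from the paper.

```python
import torch

def merged_lookup(mod_table: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Fetch each distinct token's modulation vector once, then scatter it back.

    Because token frequency is long-tailed, a batch usually contains far fewer
    distinct IDs than total positions, so memory traffic drops sharply.
    """
    flat = token_ids.reshape(-1)
    unique_ids, inverse = torch.unique(flat, return_inverse=True)
    unique_vecs = mod_table[unique_ids]           # one table read per distinct token
    vecs = unique_vecs[inverse]                   # broadcast back to every position
    return vecs.view(*token_ids.shape, mod_table.size(-1))
```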

With these optimizations, JTok-M incurs less than 7% training-throughput loss and under 7.3% inference-throughput loss, with almost no increase in GPU memory usage.

Theoretical Analysis

The authors model the effective parameter count as N_eff = N_c + γηN_c, where N_c is the number of activated backbone parameters, η = N_n / N_c is the ratio of token-indexed parameters to backbone parameters, and γ is a discount factor capturing the effect of sparsity. Substituting N_eff into the classic scaling-law formula yields a parallel shift of the performance-vs-compute curve, meaning the same performance can be achieved with roughly 35% less compute, independent of model size.
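
As a hedged sketch of that substitution, assuming a simple power law L(C) = A·C^(−α) in compute and compute proportional to activated backbone parameters at fixed data (the paper's exact functional form may differ):

```latex
N_{\mathrm{eff}} = N_c + \gamma\,\eta\,N_c = (1 + \gamma\eta)\,N_c
\quad\Longrightarrow\quad
L_{\text{JTok}}(C) \approx A\,\bigl[(1 + \gamma\eta)\,C\bigr]^{-\alpha}
= (1 + \gamma\eta)^{-\alpha}\,A\,C^{-\alpha}.
```

On a log-log plot the factor (1 + γη) simply rescales the compute axis, which appears as a parallel shift of the dense curve; matching a dense model's loss then takes roughly C / (1 + γη) compute, consistent with the reported ~35% saving.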

Empirical log‑log plots confirm that JTok‑M’s performance‑vs‑compute frontier is almost perfectly parallel to the dense baseline, validating the theory.

Empirical Results

Across model scales from 650 M to 61 B parameters, JTok‑M consistently reduces loss and improves downstream benchmarks:

MMLU +4.1, ARC‑C +8.3, CEval +8.9 points.

Achieves the same performance with one‑third less compute.

Two key questions are answered:

When the backbone grows, does JTok‑M’s benefit persist? Yes—its scaling effect remains stable, saving ~35% compute regardless of backbone size.

When JTok‑M’s own parameters increase, does it follow a clear power‑law? Yes—doubling JTok‑M parameters reduces validation loss by ~0.0118 consistently, showing no saturation.
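
Stated as a fit, that observation corresponds to a log-linear relation between validation loss and JTok-M parameter count P; the reference point P_0 below is arbitrary, not a value reported in the paper:

```latex
L_{\text{val}}(P) \;\approx\; L_{\text{val}}(P_0) \;-\; 0.0118\,\log_2\!\frac{P}{P_0}.
```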

Downstream Task Gains

On a 1.5 B dense model, adding JTok raises the average accuracy of 14 tasks by 4.32 points (≈20% relative gain), with notable jumps of +4.6 on MMLU and +5.8 on ARC‑C.

On MoE backbones, JTok‑M delivers larger lifts, e.g., a 3.2 B model gains +5.59 average accuracy, with ARC‑C +7.25 and GSM8K +6.31.

For a 17 B MoE model (effective 61 B parameters), JTok-M shows early sample efficiency, surpassing the native MoE baseline after only a few billion training tokens, and finishes training with +4 points on MMLU and +8 to 9 points on harder benchmarks.

Conclusions

JTok/JTok‑M introduces a new, orthogonal scaling axis—token‑indexed capacity—that can be quantified, predicted, and applied without altering the training pipeline. It extends the classic two‑dimensional scaling law (parameters + data) to a three‑dimensional one, offering a low‑cost, stable path for future LLM development.
