How VersatileFFN Cuts Memory Use While Boosting LLM Performance

This article introduces Huawei's VersatileFFN, an adaptive wide‑and‑deep feed‑forward design for large language models that reuses parameters to cut memory consumption while improving inference quality. It covers the design's dual‑system inspiration, its technical mechanisms, the experimental gains, and the implications for efficient LLM deployment.

Data Party THU

Overview

VersatileFFN is a parameter‑efficient feed‑forward network (FFN) designed to replace the standard FFN in Transformer‑based large language models (LLMs). It dynamically allocates computation either by widening the network (virtual experts) or by deepening the processing (recursive loops) on a per‑token basis, while keeping the total number of trainable parameters unchanged.

Key Components

1. Width‑Versatile Path (Virtual Mixture‑of‑Experts)

The original FFN weight matrix W ∈ ℝ^{d_{model}×d_{ff}} is sliced into overlapping sub‑matrices using a sliding‑window scheme. Each slice acts as a “virtual expert” that processes a subset of the input features. Because the slices share the underlying parameters, no additional storage is required, yet the model obtains MoE‑style sparse routing.
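The sliding‑window slicing can be sketched as follows. This is a minimal illustration, not the paper's implementation; the names `window` and `stride` are assumptions, and a real model would slice the actual weight tensor rather than just compute index ranges.

```python
# Hypothetical sketch of the sliding-window "virtual expert" slicing.
# window/stride are illustrative hyper-parameter names, not from the paper.

def virtual_expert_slices(d_ff: int, window: int, stride: int):
    """Return (start, end) column ranges of overlapping slices of W.

    Each slice W[:, start:end] acts as one virtual expert; because every
    slice indexes into the same weight matrix, no parameters are copied.
    """
    slices = []
    start = 0
    while start + window <= d_ff:
        slices.append((start, start + window))
        start += stride
    return slices

# Example: d_ff = 8, window = 4, stride = 2 -> 3 overlapping virtual experts
print(virtual_expert_slices(8, 4, 2))  # [(0, 4), (2, 6), (4, 8)]
```

Because adjacent windows overlap, neighboring virtual experts share a fraction of their parameters, which is what keeps the storage cost identical to the dense FFN.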

2. Depth‑Versatile Path (Recursive Computation)

A single FFN block is applied repeatedly to the same token. A differentiable Loop Predictor predicts an integer k_i for token i, indicating how many recursion steps are needed. The recursion is implemented as:

h^{(0)}_i = x_i
for t in range(k_i):
    h^{(t+1)}_i = FFN(h^{(t)}_i)

This allows “slow‑thinking” tokens (e.g., logical reasoning) to receive more computation.
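The recursion above can be made concrete with a toy stand‑in for the FFN block. In this sketch the per‑token loop counts are given directly; in the actual model they come from the learned Loop Predictor.

```python
# Minimal sketch of the depth-versatile path. toy_ffn is an illustrative
# stand-in for the shared FFN block, not the paper's architecture.

def deep_path(x, k, ffn):
    """Apply ffn recursively: h^(0) = x, h^(t+1) = ffn(h^(t)), for k steps."""
    h = x
    for _ in range(k):
        h = ffn(h)
    return h

toy_ffn = lambda h: 2 * h + 1          # placeholder for the shared FFN block
tokens = [1.0, 1.0, 1.0]
loop_counts = [1, 2, 3]                # predicted k_i per token
outputs = [deep_path(x, k, toy_ffn) for x, k in zip(tokens, loop_counts)]
print(outputs)  # [3.0, 7.0, 15.0]
```

Note that only one set of FFN weights exists; extra depth costs compute but no extra parameters.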

3. Difficulty‑Aware Fusion Gate

A scalar gate g_i ∈ [0,1] is learned per token. The final output is a weighted sum of the two paths:

y_i = g_i · FFN_{wide}(x_i) + (1 - g_i) · h^{(k_i)}_i

Simple, high‑frequency tokens obtain g_i≈1 (favoring the wide path), while complex tokens obtain g_i≈0 (favoring the deep path).
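The fusion rule is a plain convex combination, shown here with placeholder path outputs (the gate values and numbers are illustrative, not from the paper):

```python
# Sketch of the difficulty-aware fusion, assuming precomputed path outputs.

def fuse(gate: float, wide_out: float, deep_out: float) -> float:
    """y_i = g_i * FFN_wide(x_i) + (1 - g_i) * h^(k_i)_i."""
    assert 0.0 <= gate <= 1.0
    return gate * wide_out + (1.0 - gate) * deep_out

# A simple token (g close to 1) follows the wide path;
# a hard token (g close to 0) follows the deep path.
print(fuse(0.9, 2.0, 10.0))  # ~2.8: dominated by the wide-path output
print(fuse(0.1, 2.0, 10.0))  # ~9.2: dominated by the deep-path output
```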

Training Setup

Models were trained on the FineWeb‑Edu corpus at three scales (≈354 M, 720 M, and 1.2 B parameters). The same hyper‑parameters as the baseline Transformer were used, with the addition of the Loop Predictor loss (cross‑entropy on the predicted loop count) and the gating loss (binary cross‑entropy encouraging appropriate gate values).
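The two auxiliary losses can be sketched as below. The exact target construction (which loop count or gate value counts as "correct") is an assumption here; the paper only states the loss types.

```python
import math

# Illustrative sketch of the two auxiliary losses; targets are assumed.

def loop_ce(probs, target_k):
    """Cross-entropy of the Loop Predictor's distribution over loop counts."""
    return -math.log(probs[target_k])

def gate_bce(g, target):
    """Binary cross-entropy pushing the gate toward 1 (wide) or 0 (deep)."""
    eps = 1e-9  # numerical guard against log(0)
    return -(target * math.log(g + eps) + (1 - target) * math.log(1 - g + eps))

# Predictor puts 70% mass on the correct loop count k = 2
print(round(loop_ce([0.1, 0.2, 0.7], 2), 4))   # -ln(0.7), about 0.3567
# Gate of 0.9 for a token whose target routing is "wide" (target = 1)
print(round(gate_bce(0.9, 1), 4))              # -ln(0.9), about 0.1054
```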

Results

Evaluation was performed on eight downstream benchmarks (ARC, HellaSwag, PIQA, etc.). Key findings:

At 720 M parameters, VersatileFFN achieved an average accuracy of 57.03%, compared with 53.83% for a standard Transformer (+3.2 percentage points).

VersatileFFN outperformed a traditional MoE model (55.87% accuracy) while adding no parameters; the MoE required 1.6× more parameters to reach comparable performance.

Compared with a fixed 6‑loop recurrent baseline, VersatileFFN obtained higher accuracy with ≈45% fewer FLOPs at the 354 M scale.

Parameter efficiency is highlighted: MoE models needed to increase from 720 M to 1.145 B parameters to match VersatileFFN’s performance, whereas VersatileFFN stayed at ~721 M parameters with negligible memory overhead.
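The quoted figures are mutually consistent, as a quick check shows:

```python
# Sanity check of the parameter-efficiency numbers quoted above.
moe_params = 1.145e9      # MoE scaled up to match VersatileFFN's accuracy
versatile_params = 721e6  # VersatileFFN stays near the 720 M baseline

ratio = moe_params / versatile_params
print(round(ratio, 2))  # about 1.59, matching the "1.6x more parameters" claim
```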

Analysis and Visualization

Layer‑wise inspection on the ARC‑c dataset showed that different layers learn distinct loop‑count distributions, confirming that the model allocates deeper computation selectively. Word‑cloud analysis indicated that verbs and concrete nouns (e.g., “remove”, “cut”) trigger more loops, while high‑frequency function words (e.g., “make”, “use”) trigger fewer loops.

References

Paper: https://arxiv.org/abs/2512.14531

Code repository: https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN

Tags: LLM, Transformer, parameter efficiency, Adaptive Computation, VersatileFFN
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.