Efficient Conformer for End‑to‑End Speech Recognition: Model, Implementation, Streaming Inference, and Experimental Results
This article presents a comprehensive overview of the Efficient Conformer model for large-scale end-to-end speech recognition: its architectural improvements (progressive downsampling and grouped multi-head self-attention), its PyTorch implementation in WeNet, the handling of streaming inference, the CER gains measured on AISHELL-1 and on production data, and plans for future development.
The 58.com TEG-AI Lab replaced its Kaldi-based ASR system with a WeNet end-to-end recognizer and further optimized it with the Efficient Conformer architecture, achieving a 3% absolute CER reduction over the best Kaldi system and a 61% decoding speedup.
Model Improvements
Efficient Conformer modifies the original Conformer by introducing two key techniques:
Progressive Downsampling – applies a stride-2 depthwise convolution in the convolution module of selected Conformer blocks to halve the time dimension, reducing the computational cost of all subsequent blocks.
Grouped Multi-Head Self-Attention (Grouped MHSA) – concatenates groups of g neighboring time frames before attention, lowering the complexity from O(n²d) to O(n²d/g), where g is the group size.
Additional efficient attention variants (Stride MHSA, Relative MHSA, Local MHSA) are also mentioned for interested readers.
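As a back-of-the-envelope check on the O(n²d/g) claim, the multiply count of the attention score matrix can be compared before and after grouping. This is an illustrative plain-Python sketch, not the WeNet code; the sizes n, d, and g are made up:

```python
def attention_cost(n: int, d: int) -> int:
    """Multiply count for the QK^T score matrix: n*n dot products of length d."""
    return n * n * d

# Standard MHSA over n frames of dimension d
n, d, g = 96, 256, 3
standard = attention_cost(n, d)

# Grouped MHSA: concatenate g neighboring frames, so the sequence
# shrinks to n/g while each frame widens to d*g.
grouped = attention_cost(n // g, d * g)

print(standard // grouped)  # → 3
```

Shortening the sequence to n/g while widening each frame to d·g shrinks the quadratic term by exactly g, which is where the O(n²d/g) figure comes from.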
Implementation Details
The model was re‑implemented in the WeNet open‑source project under the efficient_conformer module. Key code changes include:
```python
self.depthwise_conv = nn.Conv1d(
    channels,
    channels,
    kernel_size,
    stride=stride,  # for depthwise_conv in StrideConv
    padding=padding,
    groups=channels,
    bias=bias,
)
```

Mask synchronization after downsampling:
```python
if mask_pad.size(2) > 0:  # time > 0
    if mask_pad.size(2) != x.size(2):
        mask_pad = mask_pad[:, :, ::self.stride]
    x.masked_fill_(~mask_pad, 0.0)
```

Pointwise projection for residual connections:
```python
# add pointwise_conv for efficient conformer
if self.pointwise_conv_layer is not None:
    residual = residual.transpose(1, 2)
    residual = self.pointwise_conv_layer(residual)
    residual = residual.transpose(1, 2)
    assert residual.size(0) == x.size(0)
    assert residual.size(1) == x.size(1)
    assert residual.size(2) == x.size(2)
```

Grouped MHSA implementation (group size = 3):
```python
class GroupedRelPositionMultiHeadedAttention(MultiHeadedAttention):
    def __init__(self, n_head, n_feat, dropout_rate, group_size=3):
        super().__init__(n_head, n_feat, dropout_rate)
        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
        self.group_size = group_size
        self.d_k = n_feat // n_head
        self.pos_bias_u = nn.Parameter(
            torch.Tensor(self.h, self.d_k * self.group_size))
        self.pos_bias_v = nn.Parameter(
            torch.Tensor(self.h, self.d_k * self.group_size))
        torch.nn.init.xavier_uniform_(self.pos_bias_u)
        torch.nn.init.xavier_uniform_(self.pos_bias_v)
```

Padding helper to make the sequence length divisible by the group size:
```python
def pad4group(self, Q, K, V, P, mask, group_size: int = 3):
    overflow_Q = Q.size(2) % group_size
    overflow_KV = K.size(2) % group_size
    # Branch-free padding amount: 1e-17 is absorbed by float64 rounding,
    # so overflow // (overflow + 1e-17) is 1 when overflow > 0 and 0 when
    # overflow == 0; no padding is added for already-divisible lengths.
    padding_Q = (group_size - overflow_Q) * int(overflow_Q // (overflow_Q + 1e-17))
    padding_KV = (group_size - overflow_KV) * int(overflow_KV // (overflow_KV + 1e-17))
    Q = F.pad(Q, (0, 0, 0, padding_Q), value=0.0)
    K = F.pad(K, (0, 0, 0, padding_KV), value=0.0)
    V = F.pad(V, (0, 0, 0, padding_KV), value=0.0)
    if mask is not None and mask.size(2) > 0:
        mask = mask[:, ::group_size, ::group_size]
    # reshape for grouped attention
    ...
    return Q, K, V, P, mask, padding_Q
```

Streaming Inference
WeNet’s streaming mode calls forward_chunk on the encoder. Because Efficient Conformer performs temporal downsampling, cache tensors (attention and CNN caches) must be padded or repeated to match the original time resolution. The down‑sampling factor is computed per layer:
```python
def calculate_downsampling_factor(self, i: int) -> int:
    factor = 1
    for idx, stride_idx in enumerate(self.stride_layer_idx):
        if i > stride_idx:
            factor *= self.stride[idx]
    return factor
```

The attention cache is repeated to restore the original length, and the CNN cache is padded to kernel_size - 1 for the causal convolution.
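As a sanity check, the same logic can be run standalone in plain Python; the stride_layer_idx and stride defaults below assume the two-stride configuration evaluated in the next section:

```python
# Standalone version of the per-layer downsampling factor (plain Python),
# assuming stride-2 layers at encoder block indices 3 and 7.
def downsampling_factor(i, stride_layer_idx=(3, 7), stride=(2, 2)):
    factor = 1
    for idx, s_idx in enumerate(stride_layer_idx):
        if i > s_idx:
            factor *= stride[idx]
    return factor

print([downsampling_factor(i) for i in range(12)])
# → [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4]
```

Blocks up to and including each stride layer run at the incoming frame rate; every stride-2 layer halves it for all subsequent blocks, which is why the streaming caches must be rescaled by this factor.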
Experimental Results
On 58.com's internal production data (≈10 M hours of audio and >50 M dialogs per year), Efficient Conformer outperformed the baseline Conformer. On AISHELL-1, the best CER was 4.56% without a language model, improving on the Conformer baseline's 4.61%.
Two configuration variants were evaluated:
Variant 1 (one stride-2 layer, grouped attention in the first four blocks):

```yaml
efficient_conf:
  stride_layer_idx: [3]
  stride: [2]
  group_layer_idx: [0, 1, 2, 3]
  group_size: 3
  stride_kernel: true
```

Variant 2 (two stride-2 layers, grouped attention only at the stride layers):

```yaml
efficient_conf:
  stride_layer_idx: [3, 7]
  stride: [2, 2]
  group_layer_idx: [3, 7]
  group_size: 3
  stride_kernel: false
```

Both variants showed consistent CER reductions and decoding speed gains.
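To make these options concrete, here is a hypothetical plain-Python rendering of how the second variant's settings (stride_layer_idx: [3, 7], group_layer_idx: [3, 7]) map onto the encoder; the block count of 12 is an assumption, not taken from the configs:

```python
# Per-block behavior under variant 2: which blocks use grouped attention,
# and which apply a stride-2 downsampling convolution.
stride_layer_idx, stride = [3, 7], [2, 2]
group_layer_idx = [3, 7]

layers = []
for i in range(12):
    kind = "grouped" if i in group_layer_idx else "standard"
    s = stride[stride_layer_idx.index(i)] if i in stride_layer_idx else 1
    layers.append((i, kind, s))

print([l for l in layers if l[2] > 1])
# → [(3, 'grouped', 2), (7, 'grouped', 2)]
```

In this variant the two downsampling blocks are also the only grouped-attention blocks, so the extra attention savings are applied exactly where the sequence is still long.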
Future Work
Further improve open‑source benchmark results.
Add ONNX export support and GPU‑accelerated streaming deployment.
References
[1] 58.com: A Large-Scale Production Solution for WeNet End-to-End Speech Recognition (58同城:WeNet端到端语音识别大规模落地方案)
[2] Efficient Conformer: https://arxiv.org/pdf/2109.01163.pdf
[3] WeNet Efficient Conformer PR: https://github.com/wenet-e2e/wenet/pull/1636
[4] Efficient Conformer Code: https://github.com/burchim/EfficientConformer
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.