
Efficient Conformer for End‑to‑End Speech Recognition: Model, Implementation, Streaming Inference, and Experimental Results

This article presents a comprehensive overview of the Efficient Conformer model for large‑scale end‑to‑end speech recognition, detailing its architectural improvements such as progressive downsampling and grouped multi‑head self‑attention, the PyTorch implementation in WeNet, streaming inference handling, experimental CER gains on AISHELL‑1 and production data, and future development plans.

58 Tech

The 58.com TEG‑AI Lab replaced its Kaldi‑based ASR system with a WeNet end‑to‑end recognizer and further optimized it with the Efficient Conformer architecture, achieving a 3% absolute CER reduction over the best Kaldi system and a 61% increase in decoding speed.

Model Improvements

Efficient Conformer modifies the original Conformer by introducing two key techniques:

Progressive Downsampling – applies a stride‑2 depthwise convolution in selected Conformer blocks, halving the time dimension and reducing the computational cost of all subsequent blocks.

Grouped Multi‑Head Self‑Attention (Grouped MHSA) – folds groups of g neighboring frames into the feature dimension before computing attention, lowering the complexity from O(n²d) to O(n²d/g), where g is the group size.

The paper also describes additional efficient attention variants (Stride MHSA, Relative MHSA, Local MHSA) for interested readers.
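To make the complexity reduction concrete, here is a minimal sketch of the grouped‑attention idea (illustrative shapes, not the WeNet implementation, and omitting relative positional encoding): folding g frames into the feature dimension shrinks the T × T score matrix to (T/g) × (T/g), cutting the attention cost by a factor of g.

```python
import torch

B, H, T, d_k, g = 2, 4, 12, 8, 3  # batch, heads, time, head dim, group size

q = torch.randn(B, H, T, d_k)
k = torch.randn(B, H, T, d_k)
v = torch.randn(B, H, T, d_k)

# Fold g consecutive frames into the feature dimension: (B, H, T/g, d_k*g)
q_g = q.reshape(B, H, T // g, d_k * g)
k_g = k.reshape(B, H, T // g, d_k * g)
v_g = v.reshape(B, H, T // g, d_k * g)

# Score matrix is now (T/g) x (T/g) instead of T x T
scores = torch.matmul(q_g, k_g.transpose(-2, -1)) / (d_k * g) ** 0.5
attn = torch.softmax(scores, dim=-1)
out = torch.matmul(attn, v_g).reshape(B, H, T, d_k)  # unfold back to T frames

print(scores.shape)  # torch.Size([2, 4, 4, 4])
print(out.shape)     # torch.Size([2, 4, 12, 8])
```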

Implementation Details

The model was re‑implemented in the WeNet open‑source project under the efficient_conformer module. Key code changes include:

self.depthwise_conv = nn.Conv1d(
    channels,
    channels,
    kernel_size,
    stride=stride,  # for depthwise_conv in StrideConv
    padding=padding,
    groups=channels,
    bias=bias,
)
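As a quick illustration of what the stride does (sizes here are made up, not WeNet's defaults), a stride‑2 depthwise Conv1d halves the time axis, which is exactly how progressive downsampling shortens the sequence seen by later blocks:

```python
import torch
import torch.nn as nn

channels, kernel_size, stride = 8, 3, 2
conv = nn.Conv1d(channels, channels, kernel_size,
                 stride=stride, padding=(kernel_size - 1) // 2,
                 groups=channels, bias=True)

x = torch.randn(1, channels, 10)  # (batch, channels, time)
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 5]) – time halved from 10 to 5
```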

Mask synchronization after downsampling:

if mask_pad.size(2) > 0:  # time > 0
    if mask_pad.size(2) != x.size(2):
        mask_pad = mask_pad[:, :, ::self.stride]
    x.masked_fill_(~mask_pad, 0.0)
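A toy example of this synchronization, with illustrative shapes: after a stride‑2 layer the feature map has T/2 frames, so the padding mask must be subsampled the same way before it can be applied.

```python
import torch

stride = 2
mask_pad = torch.tensor([[[True, True, True, True, False, False]]])  # (B, 1, T)
x = torch.randn(1, 4, 3)  # (B, C, T//2) after downsampling

if mask_pad.size(2) != x.size(2):
    mask_pad = mask_pad[:, :, ::stride]  # (B, 1, 3)
x.masked_fill_(~mask_pad, 0.0)          # zero out the padded frames

print(mask_pad)  # tensor([[[ True,  True, False]]])
```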

Pointwise projection for residual connections:

# add pointwise_conv for efficient conformer
if self.pointwise_conv_layer is not None:
    residual = residual.transpose(1, 2)
    residual = self.pointwise_conv_layer(residual)
    residual = residual.transpose(1, 2)
    assert residual.size(0) == x.size(0)
    assert residual.size(1) == x.size(1)
    assert residual.size(2) == x.size(2)
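A sketch of what this projection accomplishes, assuming a kernel‑1 convolution carrying the block's stride (shapes are illustrative): after the main branch is downsampled, the residual branch no longer matches in time, so the pointwise convolution brings it to the same length.

```python
import torch
import torch.nn as nn

channels, stride, T = 8, 2, 10
pointwise = nn.Conv1d(channels, channels, kernel_size=1, stride=stride)

residual = torch.randn(1, T, channels)          # (B, T, C)
residual = pointwise(residual.transpose(1, 2))  # (B, C, T//stride)
residual = residual.transpose(1, 2)             # (B, T//stride, C)
print(residual.shape)  # torch.Size([1, 5, 8]) – now matches the main branch
```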

Grouped MHSA implementation (group size = 3):

class GroupedRelPositionMultiHeadedAttention(MultiHeadedAttention):
    def __init__(self, n_head, n_feat, dropout_rate, group_size=3):
        super().__init__(n_head, n_feat, dropout_rate)
        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
        self.group_size = group_size
        self.d_k = n_feat // n_head
        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k * self.group_size))
        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k * self.group_size))
        torch.nn.init.xavier_uniform_(self.pos_bias_u)
        torch.nn.init.xavier_uniform_(self.pos_bias_v)

Padding helper to make sequence length divisible by the group size:

def pad4group(self, Q, K, V, P, mask, group_size: int = 3):
    # Pad the time axis (dim 2) up to the next multiple of group_size.
    overflow_Q = Q.size(2) % group_size
    overflow_KV = K.size(2) % group_size
    padding_Q = (group_size - overflow_Q) % group_size   # 0 when already divisible
    padding_KV = (group_size - overflow_KV) % group_size
    Q = F.pad(Q, (0, 0, 0, padding_Q), value=0.0)
    K = F.pad(K, (0, 0, 0, padding_KV), value=0.0)
    V = F.pad(V, (0, 0, 0, padding_KV), value=0.0)
    if mask is not None and mask.size(2) > 0:
        mask = mask[:, ::group_size, ::group_size]
    # reshape for grouped attention
    ...
    return Q, K, V, P, mask, padding_Q
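A quick sanity check of the intended padding amounts, written here with an equivalent modulo expression: the time axis is padded up to the next multiple of the group size, and no padding is added when it already divides evenly.

```python
group_size = 3
paddings = {}
for length in (6, 7, 8):
    overflow = length % group_size
    paddings[length] = (group_size - overflow) % group_size

print(paddings)  # {6: 0, 7: 2, 8: 1} – padded lengths are all divisible by 3
```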

Streaming Inference

WeNet’s streaming mode calls forward_chunk on the encoder. Because Efficient Conformer performs temporal downsampling, cache tensors (attention and CNN caches) must be padded or repeated to match the original time resolution. The down‑sampling factor is computed per layer:

def calculate_downsampling_factor(self, i: int) -> int:
    factor = 1
    for idx, stride_idx in enumerate(self.stride_layer_idx):
        if i > stride_idx:
            factor *= self.stride[idx]
    return factor
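An illustrative walk‑through of this factor, using the two‑stride configuration shown later in this article (stride‑2 layers at block indices 3 and 7), rewritten as a standalone function:

```python
stride_layer_idx = [3, 7]
stride = [2, 2]

def calculate_downsampling_factor(i: int) -> int:
    # Multiply in the stride of every downsampling layer before block i.
    factor = 1
    for idx, stride_idx in enumerate(stride_layer_idx):
        if i > stride_idx:
            factor *= stride[idx]
    return factor

factors = [calculate_downsampling_factor(i) for i in range(10)]
print(factors)  # [1, 1, 1, 1, 2, 2, 2, 2, 4, 4]
```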

Attention cache is repeated to restore the original length, and CNN cache is padded to kernel_size‑1 for causal convolution.
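A hedged sketch of the cache handling described above (shapes are illustrative, not WeNet's exact cache layout): the attention cache, stored at the downsampled rate, is repeated along time to recover the original resolution, and the CNN cache is left‑padded to kernel_size − 1 frames for the causal convolution.

```python
import torch
import torch.nn.functional as F

factor, kernel_size = 2, 5

att_cache = torch.randn(1, 4, 8, 64)  # (B, H, T_down, d)
att_cache_full = att_cache.repeat_interleave(factor, dim=2)
print(att_cache_full.shape)  # torch.Size([1, 4, 16, 64])

cnn_cache = torch.randn(1, 256, 3)  # (B, C, T_cache)
pad = (kernel_size - 1) - cnn_cache.size(2)
cnn_cache = F.pad(cnn_cache, (pad, 0), value=0.0)  # left-pad the time axis
print(cnn_cache.shape)  # torch.Size([1, 256, 4])
```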

Experimental Results

On the internal 58.com dataset (≈10 million hours of audio and >50 million dialogs per year), Efficient Conformer outperformed the baseline Conformer. On AISHELL‑1, the best CER achieved was 4.56% (no LM), better than the Conformer's 4.61%.

Two configuration variants were evaluated:

Variant 1 (single downsampling layer):

efficient_conf:
    stride_layer_idx: [3]
    stride: [2]
    group_layer_idx: [0, 1, 2, 3]
    group_size: 3
    stride_kernel: true

Variant 2 (two downsampling layers):

efficient_conf:
    stride_layer_idx: [3, 7]
    stride: [2, 2]
    group_layer_idx: [3, 7]
    group_size: 3
    stride_kernel: false

Both variants showed consistent CER reductions and decoding speed gains.

Future Work

Further improve open‑source benchmark results.

Add ONNX export support and GPU‑accelerated streaming deployment.

References

[1] 58.com (58同城): Large‑Scale Deployment of WeNet End‑to‑End Speech Recognition

[2] Efficient Conformer: https://arxiv.org/pdf/2109.01163.pdf

[3] WeNet Efficient Conformer PR: https://github.com/wenet-e2e/wenet/pull/1636

[4] Efficient Conformer Code: https://github.com/burchim/EfficientConformer

Tags: model optimization, PyTorch, speech recognition, ASR, Efficient Conformer, streaming inference
Written by 58 Tech, the official tech channel of 58.com, a platform for tech innovation, sharing, and communication.
