
Efficient Conformer for End‑to‑End Speech Recognition: Model, Implementation, Streaming Inference, and Experimental Results

This article presents a comprehensive overview of the Efficient Conformer model for large‑scale end‑to‑end speech recognition, detailing its architectural improvements such as progressive downsampling and grouped multi‑head self‑attention, the PyTorch implementation in WeNet, streaming inference handling, experimental CER gains on AISHELL‑1 and production data, and future development plans.

58 Tech

The 58.com TEG‑AI Lab replaced its Kaldi‑based ASR system with a WeNet end‑to‑end recognizer and further optimized it with the Efficient Conformer architecture, achieving a 3% absolute CER reduction over the best Kaldi system and a 61% increase in decoding speed.

Model Improvements

Efficient Conformer modifies the original Conformer by introducing two key techniques:

Progressive Downsampling – applies a stride‑2 depthwise convolution in selected Conformer blocks, halving the time dimension and reducing the computational cost of all subsequent blocks.

Grouped Multi‑Head Self‑Attention (Grouped MHSA) – folds groups of g neighboring frames into the feature dimension before computing attention, lowering the complexity from O(n²d) to O(n²d/g), where g is the group size.

The paper also describes additional efficient attention variants (Stride MHSA, Relative MHSA, Local MHSA) for interested readers.
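To make the complexity reduction concrete, here is a minimal sketch of the grouped‑attention idea (illustrative shapes, not the WeNet implementation, and omitting relative positional encoding): folding g frames into the feature dimension shrinks the T × T score matrix to (T/g) × (T/g), cutting the attention cost by a factor of g.

```python
import torch

B, H, T, d_k, g = 2, 4, 12, 8, 3  # batch, heads, time, head dim, group size

q = torch.randn(B, H, T, d_k)
k = torch.randn(B, H, T, d_k)
v = torch.randn(B, H, T, d_k)

# Fold g consecutive frames into the feature dimension: (B, H, T/g, d_k*g)
q_g = q.reshape(B, H, T // g, d_k * g)
k_g = k.reshape(B, H, T // g, d_k * g)
v_g = v.reshape(B, H, T // g, d_k * g)

# Score matrix is now (T/g) x (T/g) instead of T x T
scores = torch.matmul(q_g, k_g.transpose(-2, -1)) / (d_k * g) ** 0.5
attn = torch.softmax(scores, dim=-1)
out = torch.matmul(attn, v_g).reshape(B, H, T, d_k)  # unfold back to T frames

print(scores.shape)  # torch.Size([2, 4, 4, 4])
print(out.shape)     # torch.Size([2, 4, 12, 8])
```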

Implementation Details

The model was re‑implemented in the WeNet open‑source project under the efficient_conformer module. Key code changes include:

self.depthwise_conv = nn.Conv1d(
    channels,
    channels,
    kernel_size,
    stride=stride,  # for depthwise_conv in StrideConv
    padding=padding,
    groups=channels,
    bias=bias,
)
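As a quick illustration of what the stride does (sizes here are made up, not WeNet's defaults), a stride‑2 depthwise Conv1d halves the time axis, which is exactly how progressive downsampling shortens the sequence seen by later blocks:

```python
import torch
import torch.nn as nn

channels, kernel_size, stride = 8, 3, 2
conv = nn.Conv1d(channels, channels, kernel_size,
                 stride=stride, padding=(kernel_size - 1) // 2,
                 groups=channels, bias=True)

x = torch.randn(1, channels, 10)  # (batch, channels, time)
y = conv(x)
print(y.shape)  # torch.Size([1, 8, 5]) – time halved from 10 to 5
```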

Mask synchronization after downsampling:

if mask_pad.size(2) > 0:  # time > 0
    if mask_pad.size(2) != x.size(2):
        mask_pad = mask_pad[:, :, ::self.stride]
    x.masked_fill_(~mask_pad, 0.0)
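A toy example of this synchronization, with illustrative shapes: after a stride‑2 layer the feature map has T/2 frames, so the padding mask must be subsampled the same way before it can be applied.

```python
import torch

stride = 2
mask_pad = torch.tensor([[[True, True, True, True, False, False]]])  # (B, 1, T)
x = torch.randn(1, 4, 3)  # (B, C, T//2) after downsampling

if mask_pad.size(2) != x.size(2):
    mask_pad = mask_pad[:, :, ::stride]  # (B, 1, 3)
x.masked_fill_(~mask_pad, 0.0)          # zero out the padded frames

print(mask_pad)  # tensor([[[ True,  True, False]]])
```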

Pointwise projection for residual connections:

# add pointwise_conv for efficient conformer
if self.pointwise_conv_layer is not None:
    residual = residual.transpose(1, 2)
    residual = self.pointwise_conv_layer(residual)
    residual = residual.transpose(1, 2)
    assert residual.size(0) == x.size(0)
    assert residual.size(1) == x.size(1)
    assert residual.size(2) == x.size(2)
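A sketch of what this projection accomplishes, assuming a kernel‑1 convolution carrying the block's stride (shapes are illustrative): after the main branch is downsampled, the residual branch no longer matches in time, so the pointwise convolution brings it to the same length.

```python
import torch
import torch.nn as nn

channels, stride, T = 8, 2, 10
pointwise = nn.Conv1d(channels, channels, kernel_size=1, stride=stride)

residual = torch.randn(1, T, channels)          # (B, T, C)
residual = pointwise(residual.transpose(1, 2))  # (B, C, T//stride)
residual = residual.transpose(1, 2)             # (B, T//stride, C)
print(residual.shape)  # torch.Size([1, 5, 8]) – now matches the main branch
```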

Grouped MHSA implementation (group size = 3):

class GroupedRelPositionMultiHeadedAttention(MultiHeadedAttention):
    def __init__(self, n_head, n_feat, dropout_rate, group_size=3):
        super().__init__(n_head, n_feat, dropout_rate)
        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
        self.group_size = group_size
        self.d_k = n_feat // n_head
        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k * self.group_size))
        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k * self.group_size))
        torch.nn.init.xavier_uniform_(self.pos_bias_u)
        torch.nn.init.xavier_uniform_(self.pos_bias_v)

Padding helper to make sequence length divisible by the group size:

def pad4group(self, Q, K, V, P, mask, group_size: int = 3):
    # Pad the time axis (dim 2) up to the next multiple of group_size.
    overflow_Q = Q.size(2) % group_size
    overflow_KV = K.size(2) % group_size
    padding_Q = (group_size - overflow_Q) % group_size   # 0 when already divisible
    padding_KV = (group_size - overflow_KV) % group_size
    Q = F.pad(Q, (0, 0, 0, padding_Q), value=0.0)
    K = F.pad(K, (0, 0, 0, padding_KV), value=0.0)
    V = F.pad(V, (0, 0, 0, padding_KV), value=0.0)
    if mask is not None and mask.size(2) > 0:
        mask = mask[:, ::group_size, ::group_size]
    # reshape for grouped attention
    ...
    return Q, K, V, P, mask, padding_Q
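A quick sanity check of the intended padding amounts, written here with an equivalent modulo expression: the time axis is padded up to the next multiple of the group size, and no padding is added when it already divides evenly.

```python
group_size = 3
paddings = {}
for length in (6, 7, 8):
    overflow = length % group_size
    paddings[length] = (group_size - overflow) % group_size

print(paddings)  # {6: 0, 7: 2, 8: 1} – padded lengths are all divisible by 3
```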

Streaming Inference

WeNet’s streaming mode calls forward_chunk on the encoder. Because Efficient Conformer performs temporal downsampling, cache tensors (attention and CNN caches) must be padded or repeated to match the original time resolution. The down‑sampling factor is computed per layer:

def calculate_downsampling_factor(self, i: int) -> int:
    factor = 1
    for idx, stride_idx in enumerate(self.stride_layer_idx):
        if i > stride_idx:
            factor *= self.stride[idx]
    return factor
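An illustrative walk‑through of this factor, using the two‑stride configuration shown later in this article (stride‑2 layers at block indices 3 and 7), rewritten as a standalone function:

```python
stride_layer_idx = [3, 7]
stride = [2, 2]

def calculate_downsampling_factor(i: int) -> int:
    # Multiply in the stride of every downsampling layer before block i.
    factor = 1
    for idx, stride_idx in enumerate(stride_layer_idx):
        if i > stride_idx:
            factor *= stride[idx]
    return factor

factors = [calculate_downsampling_factor(i) for i in range(10)]
print(factors)  # [1, 1, 1, 1, 2, 2, 2, 2, 4, 4]
```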

Attention cache is repeated to restore the original length, and CNN cache is padded to kernel_size‑1 for causal convolution.
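A hedged sketch of the cache handling described above (shapes are illustrative, not WeNet's exact cache layout): the attention cache, stored at the downsampled rate, is repeated along time to recover the original resolution, and the CNN cache is left‑padded to kernel_size − 1 frames for the causal convolution.

```python
import torch
import torch.nn.functional as F

factor, kernel_size = 2, 5

att_cache = torch.randn(1, 4, 8, 64)  # (B, H, T_down, d)
att_cache_full = att_cache.repeat_interleave(factor, dim=2)
print(att_cache_full.shape)  # torch.Size([1, 4, 16, 64])

cnn_cache = torch.randn(1, 256, 3)  # (B, C, T_cache)
pad = (kernel_size - 1) - cnn_cache.size(2)
cnn_cache = F.pad(cnn_cache, (pad, 0), value=0.0)  # left-pad the time axis
print(cnn_cache.shape)  # torch.Size([1, 256, 4])
```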

Experimental Results

On the internal 58.com dataset (≈10 million hours of audio and >50 million dialogs per year), Efficient Conformer outperformed the baseline Conformer. On AISHELL‑1, the best CER achieved was 4.56% (no LM), better than the Conformer's 4.61%.

Two configuration variants were evaluated:

Variant 1 (single downsampling layer):

efficient_conf:
    stride_layer_idx: [3]
    stride: [2]
    group_layer_idx: [0, 1, 2, 3]
    group_size: 3
    stride_kernel: true

Variant 2 (two downsampling layers):

efficient_conf:
    stride_layer_idx: [3, 7]
    stride: [2, 2]
    group_layer_idx: [3, 7]
    group_size: 3
    stride_kernel: false

Both variants showed consistent CER reductions and decoding speed gains.

Future Work

Further improve open‑source benchmark results.

Add ONNX export support and GPU‑accelerated streaming deployment.

References

[1] 58.com (58同城): Large‑Scale Deployment of WeNet End‑to‑End Speech Recognition

[2] Efficient Conformer: https://arxiv.org/pdf/2109.01163.pdf

[3] WeNet Efficient Conformer PR: https://github.com/wenet-e2e/wenet/pull/1636

[4] Efficient Conformer Code: https://github.com/burchim/EfficientConformer

Tags: model optimization, PyTorch, speech recognition, ASR, Efficient Conformer, streaming inference
Written by 58 Tech, the official tech channel of 58.com, a platform for tech innovation, sharing, and communication.
