How FCA Doubles BERT’s Inference Speed with Less Than 1% Accuracy Loss

This article explains how the Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism reduces BERT’s computational cost by over 50% while keeping accuracy loss under 1%, detailing the method, experimental results, and its significance for efficient large‑scale language models.

JD Cloud Developers

Supported by deep learning and big data, natural language processing has advanced rapidly, and pre‑trained language models have ushered in a new era of research and application.

Since Google released BERT, a large‑scale self‑supervised model, in 2018, three trends have stood out. First, parameter counts in large models have grown nearly tenfold each year, from roughly 300 M for BERT in 2018 to more than a trillion by 2021. Second, large models lower development cost and raise the accuracy ceiling, because they are pre‑trained on massive unlabeled data and then fine‑tuned on a small labeled set. Third, Transformer‑based models have expanded beyond NLP to vision, speech, and multimodal tasks.

However, the self‑attention mechanism in Transformers has a quadratic computational complexity O(L²) with respect to input sequence length, causing excessive computation, memory demand, and slow inference, especially on resource‑constrained devices.
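To make the quadratic term concrete, here is a rough sketch that counts multiply‑accumulate operations for the two attention matrix products in one layer. The function name and the decision to ignore the linear projections are assumptions made for illustration; the default hidden size of 768 matches BERT‑base.

```python
def attention_macs(seq_len: int, hidden: int = 768) -> int:
    """Approximate multiply-accumulates for the two L x L steps of one
    self-attention layer, summed over all heads (projections are ignored)."""
    # Q @ K^T: every token is compared with every other token.
    scores = seq_len * seq_len * hidden
    # softmax(scores) @ V: every output mixes all L value vectors.
    weighted_sum = seq_len * seq_len * hidden
    return scores + weighted_sum


if __name__ == "__main__":
    for length in (64, 128, 256, 512):
        print(length, attention_macs(length))
```

Doubling the sequence length quadruples this count, which is why shortening the sequence is an attractive lever for reducing cost.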

For example, BERT‑base processes a 100‑token sentence with about 3 GFLOPs, taking roughly 700 ms on CPU and 400 ms on GPU, which is too slow for real‑time applications such as intelligent customer service and search.

Main Idea of FCA

The JD Yanshi team proposes a Fine‑ and Coarse‑Granularity Hybrid Self‑Attention (FCA) mechanism to replace standard self‑attention. FCA inserts a neuron‑information‑scoring module between attention layers, classifying neurons into high‑information and low‑information groups. High‑information neurons are kept unchanged, while low‑information neurons are aggregated into one or a few neurons for the next layer, effectively shortening the sequence length at each layer.
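The keep‑and‑aggregate step described above can be sketched in a few lines. This is a simplified illustration, not the team's implementation: the function name `shorten_sequence`, the externally supplied non‑negative `scores`, the fixed `keep` count, and the score‑weighted average used for merging are all assumptions, since the article does not specify how scores are computed or how low‑information neurons are combined.

```python
from typing import List, Sequence

Vector = List[float]


def shorten_sequence(hidden: Sequence[Vector],
                     scores: Sequence[float],
                     keep: int) -> List[Vector]:
    """Keep the `keep` highest-scoring vectors unchanged (in original order)
    and merge all remaining vectors into one score-weighted average."""
    if len(hidden) != len(scores):
        raise ValueError("hidden and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    if keep >= len(hidden):
        return [list(v) for v in hidden]  # nothing to merge

    # Rank positions by informativeness score, highest first.
    order = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])   # restore original order for kept positions
    dropped = order[keep:]        # low-information positions to be merged

    # Hypothetical aggregation: average weighted by the (non-negative) scores.
    total = sum(scores[i] for i in dropped)
    if total > 0:
        weights = [scores[i] / total for i in dropped]
    else:
        weights = [1.0 / len(dropped)] * len(dropped)

    dim = len(hidden[0])
    merged = [sum(w * hidden[i][d] for w, i in zip(weights, dropped))
              for d in range(dim)]
    return [list(hidden[i]) for i in kept] + [merged]
```

With four input vectors and `keep=2`, the function returns three vectors: the two kept ones plus a single merged vector, so the next layer sees a sequence of length three instead of four.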

By progressively reducing the sequence length, FCA cuts the overall computation without sacrificing most of the model’s knowledge.
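A back‑of‑the‑envelope estimate shows how a shrinking per‑layer length translates into savings. The cost formula below counts multiply‑accumulates for the projections, the attention products, and the feed‑forward block of a BERT‑base‑sized layer (hidden size 768, feed‑forward size 3072, 12 layers). The length schedule is purely hypothetical and is not taken from the FCA experiments.

```python
from typing import Sequence


def layer_macs(seq_len: int, hidden: int = 768, ffn: int = 3072) -> int:
    """Approximate multiply-accumulates for one Transformer encoder layer."""
    projections = 4 * seq_len * hidden * hidden   # Q, K, V and output projections
    attention = 2 * seq_len * seq_len * hidden    # Q @ K^T and weights @ V
    feed_forward = 2 * seq_len * hidden * ffn     # two dense layers
    return projections + attention + feed_forward


def model_macs(lengths: Sequence[int]) -> int:
    """Total cost when layer i processes a sequence of length lengths[i]."""
    return sum(layer_macs(n) for n in lengths)


if __name__ == "__main__":
    baseline = model_macs([128] * 12)
    # Hypothetical schedule: the sequence shrinks every two layers.
    schedule = [128, 128, 96, 96, 64, 64, 48, 48, 32, 32, 24, 24]
    shortened = model_macs(schedule)
    print(f"estimated speedup: {baseline / shortened:.2f}x")
```

Because the projection and feed‑forward terms scale linearly with length and the attention term scales quadratically, every layer that runs on a shorter sequence saves on all three.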

Experimental Results

Experiments on seven NLU tasks (text similarity, sentiment classification, natural language inference, QA, etc.) show that FCA‑BERT achieves more than a two‑fold speedup with less than 1% drop in accuracy compared to the original BERT.

Table 1 (image) demonstrates the FLOPs reduction, and Table 2 (image) shows the corresponding accuracy, confirming that FCA provides a better trade‑off between efficiency and performance than current knowledge‑distillation acceleration methods.

FCA’s approach belongs to the category of model‑structure improvements, differing from model‑distillation or quantization methods that rely on external knowledge or hardware support.

Beyond this work, JD Yanshi has delivered numerous AI achievements across NLP and multimodal interaction, winning several international competition championships and deploying the technology in various industries such as finance, logistics, and manufacturing.

Figure: Model framework diagram
Table 1: FLOPs reduction
Table 2: Accuracy comparison