DIDB‑ViT Sets New SOTA for Binary ViTs, Even Outperforming Full‑Precision ResNet‑34 on Road Segmentation
The paper introduces DIDB‑ViT, a high‑fidelity, differential‑information‑driven binary Vision Transformer that narrows the gap to full‑precision models while keeping the original ViT architecture. It reports state‑of‑the‑art binary‑ViT results on image classification and ADE20K segmentation, and on aerial road segmentation it even surpasses full‑precision ResNet‑34.
Introduction
Vision Transformers (ViTs) deliver strong performance on visual tasks, but their large parameter counts and high computational costs hinder deployment on resource‑constrained edge devices. Existing binary ViT approaches either suffer severe accuracy loss or rely on full‑precision modules, compromising efficiency.
Related Work
Early binary neural network research focused on CNNs (e.g., BinaryConnect [4], XNOR‑Net [33], ReActNet [24]). Directly applying these techniques to ViT is ineffective because the attention mechanism behaves differently. Prior binary ViT methods such as BiViT [10], Bi‑ViT [17], BVT‑IMA [40], GSB [7], and SI‑BiViT [47] mitigate information loss with tricks like softmax‑aware binarization, channel‑wise scaling, or spatial interaction, yet they still depend on partial full‑precision components.
Method
Overview
DIDB‑ViT proposes a suite of modules that preserve information during binarization without altering the standard ViT topology:
Differential‑Information Binary Attention (DIBA): analyzes the difference between the full‑precision and binary attention matrices, extracts the differential term, and injects it back to restore token‑wise importance, mitigating the uniform‑weight problem caused by naive binarization.
High‑Fidelity Similarity Calculation (HFSC): applies a non‑downsampled discrete Haar wavelet to decompose the attention input into high‑ and low‑frequency components; separate binary Q and K tensors are computed for each frequency, and their similarity scores are fused, improving the fidelity of the binary similarity measure (a minimal sketch of the decomposition follows this list).
Improved RPReLU (IRPReLU): extends the RPReLU activation with a token‑wise learnable offset, allowing the activation distribution to be reshaped per token at negligible parameter and FLOP cost.
All modules are compatible with the original ViT block and require no extra operations or topology changes.
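To make the HFSC frequency split concrete, below is one plausible non‑downsampled 1‑D Haar decomposition along the token axis. The paper does not spell out the exact transform (it may well operate in 2‑D over the spatial grid), so the function name `haar_decompose`, the shapes, and the boundary padding are all illustrative assumptions, not the authors' implementation.

```python
import torch

def haar_decompose(x: torch.Tensor):
    """Undecimated (non-downsampled) 1-D Haar split: pair each token with its
    right neighbor to produce an average (low-frequency) band and a difference
    (high-frequency) band, both the same shape as the input."""
    # x: (batch, num_tokens, dim); repeat the last token so length is preserved.
    x_next = torch.cat([x[:, 1:, :], x[:, -1:, :]], dim=1)
    low = (x + x_next) / 2.0   # smooth component
    high = (x - x_next) / 2.0  # detail component
    return low, high
```

Separate binary Q/K projections would then be computed from `low` and `high` before fusing their similarity scores, as described next.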
Technical Details
Binary weights and activations are constrained to {‑1, +1} (or {0, 1}), enabling XNOR and popcount arithmetic. Linear layers are binarized with the standard sign function, using a straight‑through estimator (STE) for gradients.

In DIBA, the authors denote the full‑precision attention matrix A and its binary counterpart Â; the update for token i is the original attention term plus a differential correction derived from A − Â, restoring the token‑wise importance that naive binarization flattens.

HFSC computes Haar‑wavelet coefficients H_low and H_high for the attention input, forms binary pairs (Q_low, K_low) and (Q_high, K_high), and fuses the two similarity scores as Sim = α·Sim_low + (1 − α)·Sim_high, where α is a learnable scalar.

IRPReLU extends the RPReLU activation y = max(0, x) + a·min(0, x) with a token‑wise bias b_i: y_i = max(0, x_i) + a·min(0, x_i) + b_i.
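To ground these formulas, here is a minimal PyTorch sketch of the three ingredients: sign binarization with an STE, IRPReLU, and the HFSC similarity fusion. The names (`BinarySign`, `IRPReLU`, `fused_similarity`), the 0.25 slope initialization, and all tensor shapes are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class BinarySign(torch.autograd.Function):
    """Sign binarization to {-1, +1} with a straight-through estimator:
    gradients pass through unchanged where |x| <= 1 and are clipped elsewhere."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

def binarize(x):
    return BinarySign.apply(x)

class IRPReLU(nn.Module):
    """y_i = max(0, x_i) + a * min(0, x_i) + b_i: the RPReLU form stated above
    plus a token-wise learnable offset b (negligible extra parameters)."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.25))    # learnable negative slope
        self.b = nn.Parameter(torch.zeros(num_tokens, 1))  # token-wise offset b_i

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); b broadcasts across batch and channels.
        return torch.clamp(x, min=0) + self.a * torch.clamp(x, max=0) + self.b

def fused_similarity(q_low, k_low, q_high, k_high, alpha):
    """HFSC fusion: Sim = alpha * Sim_low + (1 - alpha) * Sim_high, where each
    band's similarity is the dot product of its binarized Q and K tensors."""
    sim_low = binarize(q_low) @ binarize(k_low).transpose(-2, -1)
    sim_high = binarize(q_high) @ binarize(k_high).transpose(-2, -1)
    return alpha * sim_low + (1 - alpha) * sim_high
```

In practice α would itself be a learnable `nn.Parameter`, matching the paper's description of it as a learnable scalar.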
Experiments
Setup
All experiments use PyTorch on a server with two NVIDIA A100 GPUs. Training uses AdamW with a cosine‑annealing learning‑rate schedule and an initial LR of 0.001. Batch sizes are 256 for classification, 18 for image segmentation, and 4 for road segmentation. Two training regimes are evaluated: single‑stage (default) and two‑stage (distillation‑enhanced). For classification, the loss combines cross‑entropy with a distillation term weighted 0.9.
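For reference, a minimal sketch of the reported optimizer setup (AdamW, initial LR 0.001, cosine annealing). The epoch count, the stand‑in model, and the omitted loss computation are assumptions added only to make the loop runnable.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(384, 100)   # stand-in for the binary ViT (assumption)
epochs = 300                        # assumed; not stated in the summary

optimizer = AdamW(model.parameters(), lr=1e-3)          # initial LR = 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # cosine-annealed LR

for epoch in range(epochs):
    # Per batch: task cross-entropy plus a distillation term weighted 0.9
    # (classification setting); omitted here to keep the sketch short.
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```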
Classification Results
On CIFAR‑100, DIDB‑ViT surpasses the previous SOTA binary ViT (BiT) by 5.9% absolute accuracy in the single‑stage setting and by 0.5% in the two‑stage setting. On Tiny‑ImageNet, using DeiT‑Tiny and DeiT‑Small backbones, it outperforms BFD by 2.1% and 7.0% (single‑stage) and by 13.6% and 19.3% (two‑stage). Notably, with two‑stage training DIDB‑ViT reaches 76.3% accuracy on CIFAR‑100, exceeding full‑precision DeiT‑Small by 3.3% and binary CNN ReActNet by 7.5%.
ImageNet‑1K
Across four ViT variants (DeiT‑Tiny, DeiT‑Small, BinaryViT, Swin‑Tiny), DIDB‑ViT consistently beats the strongest binary baselines, including +9.3% over BVT‑IMA on DeiT‑Tiny and +5.0% over SI‑BiViT on DeiT‑Small. Improvements on BinaryViT (+0.7%) and Swin‑Tiny (+6.0%) further validate the method's generality.
Segmentation
On ADE20K, DIDB‑ViT achieves the highest pixel accuracy and mIoU among binary models, surpassing BiDense and other BNN‑based segmenters. For aerial road segmentation (RS‑LVF dataset), it exceeds ReActNet and even full‑precision ResNet‑34, demonstrating the benefit of the binary attention and HFSC modules for dense prediction.
Ablation Studies
Removing DIBA, HFSC, or IRPReLU from the DeiT‑Small ImageNet‑1K model drops accuracy by 5.2%, 4.2%, and 1.8% respectively, confirming each component’s contribution. HFSC’s benefit is amplified under two‑stage training. Varying the balance coefficient α shows optimal performance at α = 0.9. Adjusting the receptive field size θ of the binary convolution in DIBA reveals a sweet spot (θ = 3) that balances accuracy and computation.
Limitations
Despite the gains, binary models remain constrained by limited representational capacity, especially for more complex downstream tasks such as object detection or video understanding. The fixed 3×3 receptive field and Haar‑wavelet decomposition may be further improved with adaptive mechanisms. Training relies on knowledge distillation and a two‑stage schedule, which could affect scalability.
Conclusion
DIDB‑ViT introduces a differential‑information‑driven binary ViT that retains the original architecture while delivering SOTA performance on classification and segmentation benchmarks. By integrating DIBA, HFSC, and IRPReLU, the method mitigates information loss inherent to binarization, offering an efficient solution for deploying vision transformers on edge hardware.