BEVANet’s Triple Boost for Real-Time Segmentation: Field, Edge, Speed

BEVANet tackles the efficiency-accuracy trade-off in real-time semantic segmentation by integrating large-kernel attention, an efficient visual attention (EVA) module, a bilateral architecture, and boundary-guided adaptive fusion, delivering 81.0 % mIoU on Cityscapes at 33 FPS and a stronger accuracy-speed trade-off than prior state-of-the-art real-time models.

AIWalker

Problem Statement and Challenges

Real‑time semantic segmentation for autonomous driving and robotics must simultaneously achieve high accuracy and low latency. Existing models either lack a sufficiently large receptive field, fail to refine object boundaries, or incur prohibitive computational cost.

Key Architectural Innovations (BEVANet)

Large-Kernel Attention (LKA): a lightweight attention primitive that captures long-range dependencies with minimal FLOPs.

Efficient Visual Attention (EVA) module: combines LKA with a convolutional feed-forward network (CFFN) to refine fused features via pointwise convolutions.

Sparse Decomposed Large Separable Kernel Attention (SDLSKA): decomposes a large kernel into a small convolution and two strip-dilated kernels, preserving context while keeping computation low.

Comprehensive Kernel Selection (CKS): a dynamic channel- and spatial-attention block that fuses small- and large-kernel features, adapting the receptive field on the fly.

Deep Large-Kernel Pyramid Pooling Module (DLKPPM): hierarchical residual pooling that integrates dilated convolutions and LKA, extending the effective receptive field to 35×35.

Bilateral Architecture (BA): inspired by PIDNet, maintains a High-Level branch (semantic context) and a Low-Level branch (contour detail) with continuous cross-branch communication.

Boundary-Guided Adaptive Fusion (BGAF): uses shortcut residual connections and boundary-importance weighting to merge high-level semantics and low-level details adaptively.

Methodology

The EVA module processes an input feature map F as follows:

# LKA captures global context
F_lka = LKA(F)
# CFFN refines via pointwise convs
F_out = CFFN(F_lka)

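A minimal runnable sketch of such a block is shown below. The LKA layout (depthwise conv, depthwise dilated conv, pointwise conv) follows the common VAN-style decomposition, and the kernel sizes, expansion ratio, and residual connections are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class LKA(nn.Module):
    # Large-kernel attention; VAN-style decomposition assumed.
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)   # local context
        self.dw_d = nn.Conv2d(dim, dim, 7, padding=9,
                              dilation=3, groups=dim)             # long-range context
        self.pw = nn.Conv2d(dim, dim, 1)                          # channel mixing

    def forward(self, x):
        return x * self.pw(self.dw_d(self.dw(x)))                 # attention as a gate

class CFFN(nn.Module):
    # Convolutional feed-forward network built from pointwise convs.
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim * expansion, 1),
                                 nn.GELU(),
                                 nn.Conv2d(dim * expansion, dim, 1))

    def forward(self, x):
        return self.net(x)

class EVA(nn.Module):
    # Efficient Visual Attention: LKA for global context, CFFN for refinement.
    def __init__(self, dim):
        super().__init__()
        self.lka, self.cffn = LKA(dim), CFFN(dim)

    def forward(self, f):
        f = f + self.lka(f)          # residual connections assumed
        return f + self.cffn(f)

For example, EVA(64)(torch.randn(1, 64, 64, 128)) returns a tensor of the same shape.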
SDLSKA expands the receptive field by first applying a 3×3 convolution, then two strip-dilated convolutions (horizontal and vertical) with dilation rates chosen to approximate a large kernel (e.g., 31×31). CKS receives the three parallel outputs, computes channel-wise attention a_c and spatial attention a_s, and produces a weighted sum (both components are sketched below):

F_fused = a_c * a_s * concat(F_small, F_strip_h, F_strip_v)

DLKPPM stacks several pooling branches of increasing stride, each followed by a dilated convolution and an LKA block, then aggregates them with a residual connection, preserving fine-grained spatial detail while enlarging the receptive field; a sketch follows the fusion example.
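The following sketch shows one way to realize SDLSKA and CKS in PyTorch under the description above. The strip length k=11 with dilation 3 gives the 31×31 effective extent; the depthwise grouping and the exact form of the two attention heads are illustrative assumptions, not the paper's verified design.

import torch
import torch.nn as nn

class SDLSKA(nn.Module):
    # A 3x3 conv plus horizontal/vertical strip-dilated convs; with k=11 and
    # dilation 3, each strip spans 3*(11-1)+1 = 31 pixels, approximating 31x31.
    def __init__(self, dim, k=11, d=3):
        super().__init__()
        pad = d * (k - 1) // 2
        self.small = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.strip_h = nn.Conv2d(dim, dim, (1, k), padding=(0, pad),
                                 dilation=(1, d), groups=dim)
        self.strip_v = nn.Conv2d(dim, dim, (k, 1), padding=(pad, 0),
                                 dilation=(d, 1), groups=dim)

    def forward(self, x):
        return self.small(x), self.strip_h(x), self.strip_v(x)

class CKS(nn.Module):
    # Channel attention a_c and spatial attention a_s over the concatenated
    # branches; implements F_fused = a_c * a_s * concat(...) with a final
    # pointwise projection back to dim channels.
    def __init__(self, dim):
        super().__init__()
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(3 * dim, 3 * dim, 1),
                                     nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3),
                                     nn.Sigmoid())
        self.proj = nn.Conv2d(3 * dim, dim, 1)

    def forward(self, f_small, f_h, f_v):
        f = torch.cat([f_small, f_h, f_v], dim=1)
        a_c = self.channel(f)                                # per-channel weights
        stats = torch.cat([f.mean(1, keepdim=True),
                           f.amax(1, keepdim=True)], dim=1)  # avg + max maps
        a_s = self.spatial(stats)                            # per-pixel weights
        return self.proj(a_c * a_s * f)

For a feature map x of shape (1, 64, H, W), CKS(64)(*SDLSKA(64)(x)) returns a tensor of the same shape with a dynamically weighted receptive field.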

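DLKPPM can be sketched in the same spirit. The pooling strides, the dilation rate, and the bilinear upsampling back to the input resolution are illustrative choices, and LKA reuses the class from the EVA sketch above; the summary only fixes the overall structure (pooling, dilated conv, LKA, residual aggregation).

import torch.nn as nn
import torch.nn.functional as F

class DLKPPM(nn.Module):
    # Pooling branches of increasing stride, each refined by a dilated conv
    # and an LKA block, then aggregated onto the input through a residual path.
    def __init__(self, dim, strides=(2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(s),                    # stride defaults to s
                          nn.Conv2d(dim, dim, 3, padding=2,
                                    dilation=2, groups=dim),
                          LKA(dim))                           # LKA from the EVA sketch
            for s in strides)

    def forward(self, x):
        out = x                                               # residual path keeps detail
        for branch in self.branches:
            y = branch(x)
            out = out + F.interpolate(y, size=x.shape[-2:],
                                      mode='bilinear', align_corners=False)
        return out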
BA splits the backbone into two streams. The High‑Level stream repeatedly downsamples to aggregate semantic context; the Low‑Level stream retains higher resolution for edge detail. BGAF receives the two streams, computes a boundary importance map via a shallow edge detector, and fuses the streams with learned weights proportional to boundary confidence.
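A schematic BGAF consistent with this description is sketched below; the shallow edge head, the blending rule, and the placement of the shortcut residual are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BGAF(nn.Module):
    # A shallow edge detector predicts a boundary-confidence map b in [0, 1];
    # low-level detail dominates near boundaries, high-level semantics elsewhere.
    def __init__(self, dim):
        super().__init__()
        self.edge_head = nn.Sequential(
            nn.Conv2d(dim, dim // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 4, 1, 1), nn.Sigmoid())

    def forward(self, f_high, f_low):
        # bring the semantic stream up to the detail stream's resolution
        f_high = F.interpolate(f_high, size=f_low.shape[-2:],
                               mode='bilinear', align_corners=False)
        b = self.edge_head(f_low)                # boundary importance map
        fused = b * f_low + (1.0 - b) * f_high   # boundary-weighted blend
        return fused + f_high                    # shortcut residual connection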

Training Protocol

Datasets: Cityscapes (19 classes, 2,975 training images) and CamVid (11 of 32 annotated classes, 701 images). Pre-training on ImageNet runs for 100 epochs (batch 256, LR 0.1, SGD, weight decay 1e-4, momentum 0.9). Main training follows prior work; a schedule sketch appears after the list:

Cityscapes: 484 epochs, batch 12, LR 0.008, polynomial LR decay, online hard example mining (OHEM).

CamVid: 200 epochs, batch 24, LR 0.003, same decay and OHEM settings.
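As a concrete reference, the Cityscapes schedule maps onto a standard SGD-plus-polynomial-decay setup like the sketch below. The decay power 0.9 is the usual segmentation default and is assumed here, and momentum/weight decay reuse the pre-training values, which the summary states only for ImageNet; the model is a stand-in.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)                  # stand-in for the real network
iters_per_epoch = 2975 // 12                 # 2,975 images, batch 12
max_iters = 484 * iters_per_epoch            # 484 epochs

optimizer = torch.optim.SGD(model.parameters(), lr=0.008,
                            momentum=0.9, weight_decay=1e-4)
# polynomial decay: lr(t) = lr0 * (1 - t / T) ** p, with p = 0.9 assumed
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iters) ** 0.9)

After each optimizer.step() in the training loop, scheduler.step() advances the decay by one iteration.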

All inference benchmarks run on an NVIDIA RTX 3090 with PyTorch 2.4, CUDA 12.1, Ubuntu 20.04, batch 1.
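FPS numbers of this kind are usually measured with warm-up runs and explicit GPU synchronization; a minimal timing loop consistent with that setup is sketched below, with a placeholder network and a Cityscapes-sized input.

import time
import torch

model = torch.nn.Conv2d(3, 19, 1).eval().cuda()    # placeholder network
x = torch.randn(1, 3, 1024, 2048, device='cuda')   # batch 1, full-resolution frame

with torch.no_grad():
    for _ in range(50):                            # warm-up runs
        model(x)
    torch.cuda.synchronize()                       # flush queued kernels
    t0 = time.perf_counter()
    for _ in range(200):
        model(x)
    torch.cuda.synchronize()
fps = 200 / (time.perf_counter() - t0)
print(f"{fps:.1f} FPS")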

Quantitative Results

With ImageNet pre‑training, BEVANet achieves 81.0 % mIoU on Cityscapes at 33 FPS; without pre‑training, 79.3 % mIoU.

Compared to PIDNet‑M, BEVANet improves mIoU by 0.9 % while sacrificing only 7 FPS.

On CamVid, the lightweight BEVANet‑S variant reaches 83 % mIoU with 20.1 GFLOPs, 1.1 % higher than PIDNet‑S‑Wider while using ~40 GFLOPs less.

Small‑object detection (traffic signs, plants) shows fewer mis‑classifications; BEVANet captures objects missed by PIDNet.

Completeness tests demonstrate reliable detection of cones, grass, and sidewalks, outperforming PIDNet on large‑object coverage.

Ablation Studies

Removing the bilateral architecture slightly reduces speed while accuracy is retained, indicating that the two-branch design adds efficiency rather than overhead. Replacing SDLSKA with its three constituent kernels drops mIoU by 0.8 % (to 78.6 %) and shrinks receptive-field coverage. Substituting CKS with the kernel-selection scheme from LSKNet loses 0.26 % mIoU and costs 0.5 FPS, demonstrating CKS's superior multi-scale fusion.

BGAF outperforms the baseline BAG module by 0.4 % mIoU, attributed to its shortcut path and boundary‑aware weighting. DLKPPM adds 0.5 % mIoU with only a 0.4 FPS penalty, confirming the benefit of deeper pyramid pooling with large kernels.

Qualitative Analysis

Visual inspection shows BEVANet correctly classifies small traffic signs and plants that PIDNet mislabels as pedestrians, and it captures unannotated vegetation, indicating stronger semantic understanding. It also fully detects large objects such as traffic cones and sidewalks where PIDNet fails.

Limitations and Future Directions

Although BEVANet reduces computational overhead compared with prior real‑time models, it still incurs non‑trivial cost; further optimization of the fusion strategy is planned. Pre‑training on ImageNet yields a 1.7 % mIoU boost, suggesting room for improving generalization under limited data. Evaluation is limited to Cityscapes and CamVid, so broader scene testing is needed.

Conclusion

BEVANet delivers a competitive trade‑off between accuracy (up to 81 % mIoU) and speed (33 FPS) for real‑time semantic segmentation. The combination of SDLSKA, CKS, DLKPPM, the bilateral architecture, and BGAF expands receptive‑field coverage, refines boundaries, and maintains efficiency. Code and pretrained models are released at the following URL:

https://github.com/maomao0819/BEVANet

Tags: real-time, computer vision, efficiency, semantic segmentation, large kernel attention