RoadSceneBench: A Lightweight Benchmark for Mid‑Level Road Scene Understanding

The CVPR 2026 paper introduces RoadSceneBench, a lightweight benchmark that evaluates models on six structured mid‑level road‑scene tasks using short front‑view video clips, and presents MapVLM with HRRP‑T training, which significantly outperforms existing closed‑ and open‑source visual‑language models.

Baidu Maps Tech Team

A CVPR 2026 paper from Baidu Maps proposes RoadSceneBench, a lightweight benchmark for mid‑level road scene understanding, the layer that links low‑level perception to high‑level driving decisions.

The benchmark addresses the need for models to infer road structure, such as lane count, ego‑lane position, junction/entrance/exit presence, lane‑change feasibility, traffic condition, and overall scene type, rather than merely detecting objects.

RoadSceneBench is built from front‑view video clips collected in 20 representative Chinese cities. It contains 2,341 short clips, each comprising 5 consecutive frames sampled at 1 FPS, totaling 11,705 high‑resolution images (4096 × 2160) and over 160,000 structured annotations. The dataset emphasizes lightweight, structured, and temporally consistent evaluation.

The benchmark defines six core tasks:

Lane Count Estimation

Ego‑lane Index

Junction/Entrance/Exit Recognition

Lane‑change Feasibility Reasoning

Traffic Condition Understanding

Road Scene Classification
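To make the six tasks concrete, here is a minimal sketch of how one frame's structured annotation might be represented and flattened into Q&A pairs. The field names, label strings, question wording, and the 1-based left-to-right lane indexing are all assumptions for illustration; the source names the tasks but does not publish a schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical schema: one field (or pair of fields) per benchmark task.
@dataclass(frozen=True)
class FrameAnnotation:
    lane_count: int          # Lane Count Estimation
    ego_lane_index: int      # Ego-lane Index (assumed 1-based, from the left)
    junction_type: str       # Junction/Entrance/Exit Recognition
    can_change_left: bool    # Lane-change Feasibility Reasoning
    can_change_right: bool   # Lane-change Feasibility Reasoning
    traffic_condition: str   # Traffic Condition Understanding
    scene_type: str          # Road Scene Classification

# Hypothetical question templates, one per field.
QUESTIONS = {
    "lane_count": "How many lanes does the road have?",
    "ego_lane_index": "Which lane is the ego vehicle in?",
    "junction_type": "Is there a junction, entrance, or exit ahead?",
    "can_change_left": "Can the ego vehicle change to the left lane?",
    "can_change_right": "Can the ego vehicle change to the right lane?",
    "traffic_condition": "What is the traffic condition?",
    "scene_type": "What type of road scene is this?",
}

def to_qa_pairs(ann):
    """Flatten one annotation into (question, answer) string pairs."""
    return [(QUESTIONS[name], str(value)) for name, value in asdict(ann).items()]
```

Under this layout a clip would simply be a list of five such records, one per frame.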

Data construction follows three stages: (1) data collection from sensor‑equipped vehicles, (2) automatic filtering combined with manual review by 20 professional annotators, and (3) structured Q&A annotation that serves both as ground‑truth for evaluation and as supervision for model fine‑tuning. Pseudo‑labels generated by existing segmentation models are corrected by experts, and annotations enforce both intra‑frame logical consistency and inter‑frame temporal consistency.
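The two consistency requirements can be illustrated with a small validator. This is only a sketch under assumed rules: the dictionary keys, the 1-based lane indexing, the "no left change from the leftmost lane" rule, and the one-lane-per-step limit are hypothetical examples, not the checks the annotation team actually used.

```python
def intra_frame_errors(frame):
    """Return logical contradictions within a single frame's labels."""
    errors = []
    # Assumed rule: the ego lane must be one of the counted lanes.
    if not 1 <= frame["ego_lane_index"] <= frame["lane_count"]:
        errors.append("ego_lane_index outside 1..lane_count")
    # Assumed rule: no lane exists to the left of the leftmost lane.
    if frame["ego_lane_index"] == 1 and frame["can_change_left"]:
        errors.append("left change from leftmost lane")
    # Assumed rule: no lane exists to the right of the rightmost lane.
    if frame["ego_lane_index"] == frame["lane_count"] and frame["can_change_right"]:
        errors.append("right change from rightmost lane")
    return errors

def inter_frame_errors(frames, max_lane_jump=1):
    """Return (frame_index, message) for implausible changes between frames."""
    errors = []
    for i in range(1, len(frames)):
        jump = abs(frames[i]["ego_lane_index"] - frames[i - 1]["ego_lane_index"])
        # Assumed rule: at 1 FPS the ego vehicle shifts at most one lane per step.
        if jump > max_lane_jump:
            errors.append((i, f"ego lane moved by {jump} lanes in one step"))
    return errors
```

A clip would pass review in this sketch only if both functions return empty lists for all five frames.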

Based on the benchmark, the authors introduce MapVLM, a visual‑language model built on Qwen2.5‑VL‑7B. Training proceeds in two phases: supervised fine‑tuning (SFT) using LoRA on the structured Q&A, followed by HRRP‑T (Hierarchical Relational Reward Propagation with Temporal Consistency). HRRP‑T provides two reward types: frame‑level rewards (scene‑level, relational‑level, semantic‑level) that enforce correct single‑frame reasoning, and temporal‑level rewards (smoothness and plausibility) that penalize implausible changes across consecutive frames.
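The reward structure described above can be sketched in a few lines. The summary does not give HRRP‑T's actual formulas, so everything below is an assumption made for illustration: the grouping of fields into the three frame-level tiers, the exact-match scoring, the way smoothness and plausibility are defined on the ego-lane index, and the equal weighting.

```python
# Hypothetical assignment of label fields to the three frame-level tiers.
LEVELS = {
    "scene": ["scene_type", "traffic_condition"],
    "relational": ["lane_count", "ego_lane_index"],
    "semantic": ["junction_type"],
}

def frame_reward(pred, gold):
    """Mean over tiers of the fraction of fields answered correctly."""
    per_level = []
    for fields in LEVELS.values():
        hits = sum(pred[f] == gold[f] for f in fields)
        per_level.append(hits / len(fields))
    return sum(per_level) / len(per_level)

def temporal_reward(preds, key="ego_lane_index", max_jump=1):
    """Average of a smoothness term and a plausibility term over frame pairs."""
    pairs = list(zip(preds, preds[1:]))
    if not pairs:
        return 1.0
    # Smoothness (assumed): share of consecutive pairs with an unchanged value.
    smooth = sum(a[key] == b[key] for a, b in pairs) / len(pairs)
    # Plausibility (assumed): share of pairs whose change stays within max_jump.
    plausible = sum(abs(a[key] - b[key]) <= max_jump for a, b in pairs) / len(pairs)
    return 0.5 * (smooth + plausible)

def clip_reward(preds, golds, temporal_weight=0.5):
    """Blend the mean frame-level reward with the temporal-level reward."""
    frame_part = sum(frame_reward(p, g) for p, g in zip(preds, golds)) / len(golds)
    return (1 - temporal_weight) * frame_part + temporal_weight * temporal_reward(preds)
```

Note that this crude smoothness term also penalizes a genuine lane change; a real reward would need to separate legitimate transitions from prediction jitter, which is presumably what the plausibility component is meant to address.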

Experiments evaluate closed‑source models (GPT‑4o, Gemini‑2.5‑Pro, Claude‑3.7‑Sonnet) and open‑source VLMs (ERNIE, DeepSeek, LLaVA, InternVL, Qwen series) on all six tasks using Precision and Recall. MapVLM attains the highest overall scores (Precision 75.78, Recall 72.17), well ahead of the strongest closed‑source baseline, Gemini‑2.5‑Pro (Precision 60.61, Recall 52.70), and of all open‑source models. The HRRP‑T stage raises ego‑lane Recall from 50.37 to 84.67.
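For readers who want to reproduce this style of scoring, the following is a minimal sketch of per-task Precision and Recall over categorical answers. Macro-averaging across classes is an assumption; the summary does not state which averaging scheme the paper uses.

```python
from collections import Counter

def macro_precision_recall(y_true, y_pred):
    """Macro-averaged precision and recall for single-label predictions."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    classes = set(true_count) | set(pred_count)
    # A class that was never predicted (or never present) contributes 0.0.
    precision = sum(tp[c] / pred_count[c] if pred_count[c] else 0.0 for c in classes) / len(classes)
    recall = sum(tp[c] / true_count[c] if true_count[c] else 0.0 for c in classes) / len(classes)
    return precision, recall
```

For example, calling it on ground-truth lane counts and a model's predicted lane counts yields one (precision, recall) pair for the lane-count task; an overall score could then be an average over the six tasks, which is again an assumption here.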

Result analysis shows that RoadSceneBench poses a significant challenge for existing VLMs, especially on structured tasks such as ego‑lane localization and lane‑change reasoning, where many models exhibit unstable predictions. The temporal consistency enforced by HRRP‑T reduces prediction drift in occluded or crowded scenes, as illustrated by a five‑frame congested urban example.

Future directions include expanding the dataset to more geographic regions and incorporating dynamic events like construction, accidents, and temporary lane closures, thereby fostering broader research on reliable road‑scene understanding.
