Can Language‑Centric Tree Reasoning Transform Video Question Answering?
This article introduces a language‑centric tree reasoning (LTR) framework that recursively decomposes VideoQA queries into perceptual sub‑questions and performs bottom‑up logical inference with video assistance, achieving significantly higher accuracy and explainability across eleven benchmark datasets.
1. Introduction
Existing multimodal large language models (MLLMs) for Video Question Answering (VideoQA) suffer from uncontrolled and opaque reasoning, which limits their ability to perform advanced cognitive inference. To address this, the Bilibili Index team, together with Shanghai Jiao Tong University, proposes a novel language‑centric tree reasoning (LTR) framework, which has been accepted at ICML 2025.
2. Motivation
VideoQA requires moving from low‑level perception (identifying objects, actions, and scenes) to high‑level cognition (understanding the logical structure behind questions). Current video‑oriented MLLMs such as Video‑LLaMA and Video‑LLaVA offer some explanatory output but lack systematic System‑2 reasoning, making their inference paths difficult to control and verify.
3. Method
The LTR framework operates in two stages. In the first stage, it recursively generates a language‑centric logical tree by splitting the original question into simpler, logically coherent sub‑questions until they become perceptual leaf nodes. In the second stage, the MLLM answers all leaf nodes, then, with video assistance, performs bottom‑up logical inference within the tree, aggregating answers to produce the final response and a traceable reasoning path.
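To make the two stages concrete, the following is a minimal sketch of the recursive tree construction and bottom‑up aggregation. It assumes caller‑supplied callables for the language model and the video MLLM; the node structure, function names, and recursion depth limit are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two LTR stages. The callables (is_perceptual, decompose,
# answer_leaf, infer) stand in for prompted LLM / MLLM calls and are assumptions
# for illustration, not the published interface.
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    question: str
    children: List["Node"] = field(default_factory=list)
    answer: Optional[str] = None


def build_tree(
    question: str,
    is_perceptual: Callable[[str], bool],   # LLM judge: answerable by direct observation?
    decompose: Callable[[str], List[str]],  # LLM: split into simpler, logically coherent sub-questions
    depth: int = 0,
    max_depth: int = 3,
) -> Node:
    """Stage 1: top-down recursive splitting into a language-centric logical tree."""
    node = Node(question)
    if depth >= max_depth or is_perceptual(question):
        return node  # perceptual leaf node
    for sub in decompose(question):
        node.children.append(build_tree(sub, is_perceptual, decompose, depth + 1, max_depth))
    return node


def answer_tree(
    node: Node,
    video: object,
    answer_leaf: Callable[[object, str], str],                   # MLLM perception on the video
    infer: Callable[[str, List[Tuple[str, str]], object], str],  # logical aggregation with video assistance
) -> str:
    """Stage 2: answer leaf nodes with the MLLM, then aggregate answers bottom-up."""
    if not node.children:
        node.answer = answer_leaf(video, node.question)
    else:
        child_qa = [(c.question, answer_tree(c, video, answer_leaf, infer)) for c in node.children]
        node.answer = infer(node.question, child_qa, video)
    return node.answer
```

Because every intermediate node stores its question and answer, walking the tree after inference yields the traceable reasoning path mentioned above.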
The process is training‑free; prompts are used to guide both the tree construction (via retrieval‑augmented generation) and the bottom‑up inference.
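The prompt wording below is an assumption for demonstration only; the paper's actual prompts are not reproduced here. It illustrates how retrieved decomposition examples could be injected into the splitting prompt and how child question–answer pairs could be fed to the aggregation prompt.

```python
# Illustrative prompt templates for the training-free setup (hypothetical wording).
# {retrieved_examples} would be filled from a retrieval-augmented-generation store
# of previously decomposed questions.
DECOMPOSE_PROMPT = """You are building a logical reasoning tree for video question answering.
Question: {question}
Similar questions and their decompositions, retrieved as references:
{retrieved_examples}
If the question can be answered by direct visual observation, reply "LEAF".
Otherwise, list 2-4 simpler sub-questions whose answers logically determine the answer."""

INFER_PROMPT = """You are aggregating answers in a reasoning tree for a video.
Parent question: {question}
Sub-questions and their answers:
{child_qa}
Using this evidence (and the video where needed), answer the parent question and
briefly state the logical step that connects the sub-answers to it."""
```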
4. Experiments
LTR was evaluated on eleven VideoQA benchmarks (MSVD‑QA, MSRVTT‑QA, TGIF‑QA, ActivityNet‑QA, AGQA‑Decomp, NExT‑QA, CausalVidQA, STAR, EgoSchema, Video‑MME, MVBench). For open‑ended questions, GPT‑3.5 was used for scoring; for multiple‑choice, the model selected answers directly. On AGQA‑Decomp, compositional consistency metrics (cR, cP, c‑F1) were also reported.
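For reference, open-ended VideoQA scoring of this kind typically prompts GPT-3.5 to judge whether a prediction matches the ground truth and to assign a score. The sketch below assumes a generic `call_gpt35` wrapper and an illustrative judging prompt with a 0-5 score range; neither is taken from the paper.

```python
# Hedged sketch of GPT-assisted scoring for open-ended answers. `call_gpt35` is an
# assumed wrapper around a chat-completion API returning the model's JSON reply;
# the prompt and score range follow the commonly used protocol and may differ
# from the paper's exact setup.
import json
from typing import Callable, Dict, List

JUDGE_PROMPT = """Question: {question}
Ground-truth answer: {reference}
Predicted answer: {prediction}
Does the prediction match the ground truth? Reply as JSON:
{{"pred": "yes" or "no", "score": integer from 0 to 5}}"""


def evaluate_open_ended(
    samples: List[Dict[str, str]],     # each item: question / reference / prediction
    call_gpt35: Callable[[str], str],  # assumed LLM wrapper
) -> Dict[str, float]:
    correct, total_score = 0, 0
    for s in samples:
        reply = json.loads(call_gpt35(JUDGE_PROMPT.format(**s)))
        correct += reply["pred"].lower() == "yes"
        total_score += reply["score"]
    n = len(samples)
    return {"accuracy": correct / n, "avg_score": total_score / n}
```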
Results show that LTR consistently outperforms nine baselines in accuracy, GPT‑based scores, and compositional consistency, especially on complex cognitive sub‑questions. The improvement is attributed to the cooperation between the top‑down recursive splitting and bottom‑up tree reasoning stages, which strengthens logical reasoning while preserving traceability.
Across all benchmarks, LTR yields larger gains on tasks requiring deeper logical reasoning (counterfactual, prediction) than on pure perception tasks, confirming that the framework primarily enhances cognitive inference capabilities.
5. Conclusion
The proposed two‑stage language‑centric tree reasoning framework improves both the accuracy and interpretability of multimodal LLMs for VideoQA. By recursively constructing a logical tree and performing bottom‑up inference with video evidence, LTR provides a transparent, verifiable reasoning path and sets a new direction for language‑driven video understanding.
References
Chen, J., Yan, J., Fang, Y., and Niu, L. Meta‑point learning and refining for category‑agnostic pose estimation. CVPR, 2024.
Zhang, H., Li, X., and Bing, L. Video‑LLaMA: An instruction‑tuned audio‑visual language model for video understanding. EMNLP, 2023.
Lin, B., Zhu, B., et al. Video‑LLaVA: Learning united visual representation by alignment before projection. EMNLP, 2024.
Fei, H., Wu, S., et al. Video‑of‑thought: Step‑by‑step video reasoning from perception to cognition. ICML, 2024.
Qian, Z., Wang, X., et al. Dynamic spatio‑temporal modular network for video question answering. ACM MM, 2022.
Xu, J., Mei, T., et al. MSR‑VTT: A large video description dataset for bridging video and language. CVPR, 2016.
Jang, Y., Song, Y., et al. TGIF‑QA: Toward spatio‑temporal reasoning in visual question answering. CVPR, 2017.
Yu, Z., Xu, D., et al. ActivityNet‑QA: A dataset for understanding complex web videos via question answering. AAAI, 2019.
Gandhi, M., Gul, M. O., et al. Measuring compositional consistency for video question answering. CVPR, 2022.
Xiao, J., Shang, X., et al. NExT‑QA: Next phase of question‑answering to explaining temporal actions. CVPR, 2021.
Li, J., Niu, L., and Zhang, L. From Representation to Reasoning: Towards both evidence and commonsense reasoning for video question‑answering. CVPR, 2022.
Wu, B., Yu, S., et al. STAR: A benchmark for situated reasoning in real‑world videos. NeurIPS, 2023.
Mangalam, K., Akshulakov, R., and Malik, J. EgoSchema: A diagnostic benchmark for very long‑form video language understanding. NeurIPS, 2023.
Fu, C., Dai, Y., et al. Video‑MME: The first comprehensive evaluation benchmark of multimodal LLMs in video analysis. arXiv, 2024.
Li, K., Wang, Y., et al. MVBench: A comprehensive multimodal video understanding benchmark. CVPR, 2024.