Can Language‑Centric Tree Reasoning Transform Video Question Answering?

This article introduces a language‑centric tree reasoning (LTR) framework that recursively decomposes VideoQA queries into perceptual sub‑questions and performs bottom‑up logical inference with video assistance, achieving significantly higher accuracy and stronger explainability across eleven benchmark datasets.

Bilibili Tech

1. Introduction

Existing multimodal large language models (MLLMs) for Video Question Answering (VideoQA) suffer from uncontrolled and opaque reasoning, which limits their ability to perform advanced cognitive inference. To tackle this, the Bilibili Index team, together with Shanghai Jiao Tong University, proposes a novel language‑centric tree reasoning (LTR) framework; the work has been accepted at ICML 2025.

2. Motivation

VideoQA requires moving from low‑level perception (identifying objects, actions, scenes) to high‑level cognition (understanding the logical structure behind questions). Current MLLM extensions such as Video‑LLaMA and Video‑LLaVA provide some explanations but lack systematic System‑2 reasoning, making their inference paths difficult to control and verify.

Figure 1: Human System‑2 reasoning process for complex VideoQA

3. Method

The LTR framework operates in two stages. In the first stage, it recursively generates a language‑centric logical tree by splitting the original question into simpler, logically coherent sub‑questions until they become perceptual leaf nodes. In the second stage, the MLLM answers all leaf nodes, then, with video assistance, performs bottom‑up logical inference within the tree, aggregating answers to produce the final response and a traceable reasoning path.
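
To make the first stage concrete, the sketch below shows one way the recursive decomposition could be organized in Python. The `Node` class, the `build_tree` function, the prompt wording, and the `llm` callable are illustrative assumptions, not the authors' released code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    question: str
    children: list = field(default_factory=list)
    answer: str | None = None

def build_tree(question: str, llm, max_depth: int = 3, depth: int = 0) -> Node:
    """Recursively split a question until every leaf is a simple perceptual query."""
    node = Node(question)
    # Ask the language model whether the question is already a perceptual leaf.
    verdict = llm(f"Is this a simple perceptual question? Answer yes or no: {question}")
    if depth >= max_depth or verdict.strip().lower().startswith("yes"):
        return node
    # Otherwise request logically coherent sub-questions, one per line, and recurse.
    subs = llm(f"Decompose this question into simpler sub-questions, one per line: {question}")
    for sub in (s.strip() for s in subs.splitlines()):
        if sub:
            node.children.append(build_tree(sub, llm, max_depth, depth + 1))
    return node
```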

Figure 2: Overview of the LTR framework

The process is training‑free; prompts are used to guide both the tree construction (via retrieval‑augmented generation) and the bottom‑up inference.
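
Continuing the illustrative sketch above, the second stage could answer the leaves with the video‑grounded MLLM and then aggregate answers bottom‑up through the tree. Again, `answer_tree` and the `mllm(video, prompt)` interface are hypothetical stand‑ins rather than the paper's implementation.

```python
def answer_tree(node: Node, mllm, video) -> str:
    """Answer perceptual leaves with the video-grounded MLLM, then aggregate bottom-up."""
    if not node.children:
        # Leaf: a perceptual sub-question answered directly against the video.
        node.answer = mllm(video, node.question)
        return node.answer
    # Internal node: resolve all children first, then infer this node's answer
    # from their question-answer pairs, with the video available for verification.
    evidence = "\n".join(
        f"Q: {child.question}\nA: {answer_tree(child, mllm, video)}"
        for child in node.children
    )
    node.answer = mllm(
        video,
        f"Using the sub-question answers below as evidence, answer: {node.question}\n{evidence}",
    )
    return node.answer
```

A call such as `answer_tree(build_tree(question, llm), mllm, video)` would return the root answer while every intermediate question‑answer pair remains stored in the tree, which is what makes the reasoning path traceable.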

4. Experiments

LTR was evaluated on eleven VideoQA benchmarks (MSVD‑QA, MSRVTT‑QA, TGIF‑QA, ActivityNet‑QA, AGQA‑Decomp, NExT‑QA, Causal‑VidQA, STAR, EgoSchema, Video‑MME, MVBench). For open‑ended questions, GPT‑3.5 was used to score the generated answers; for multiple‑choice questions, the model selected an option directly. On AGQA‑Decomp, compositional consistency metrics (cR, cP, c‑F1) were also reported.
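
As a rough illustration of the open‑ended evaluation protocol, a GPT‑based scoring step might look like the sketch below. The prompt wording, the `judge` helper, and the reply parsing are assumptions for demonstration; the paper's exact evaluation prompt and settings are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, reference: str, prediction: str) -> tuple[bool, int]:
    """Ask GPT-3.5 whether a predicted answer matches the reference and to score it 0-5."""
    prompt = (
        "You are evaluating a VideoQA answer.\n"
        f"Question: {question}\nCorrect answer: {reference}\nPredicted answer: {prediction}\n"
        "Reply with yes or no for correctness, then an integer score from 0 to 5, "
        "separated by a comma."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    verdict, score = (part.strip() for part in reply.split(",")[:2])
    return verdict.lower().startswith("yes"), int(score)
```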

Table 1: Accuracy, score, and compositional consistency on AGQA‑Decomp

Results show that LTR consistently outperforms nine baselines in accuracy, score, and compositional consistency, especially on complex cognitive sub‑questions. The improvement is attributed to the cooperation between the top‑down recursive splitting stage and the bottom‑up tree reasoning stage, which strengthens logical reasoning while preserving traceability.

Table 2: Results on MVBench
Table 3: Zero‑shot performance on Causal‑VidQA
Table 4: Results on NExT‑QA

Across all benchmarks, LTR yields larger gains on tasks requiring deeper logical reasoning (counterfactual, prediction) than on pure perception tasks, confirming that the framework primarily enhances cognitive inference capabilities.

5. Conclusion

The proposed two‑stage language‑centric tree reasoning framework improves both the accuracy and interpretability of multimodal LLMs for VideoQA. By recursively constructing a logical tree and performing bottom‑up inference with video evidence, LTR provides a transparent, verifiable reasoning path and sets a new direction for language‑driven video understanding.

References

Chen, J., Yan, J., Fang, Y., and Niu, L. Meta‑point learning and refining for category‑agnostic pose estimation. CVPR, 2024.

Zhang, H., Li, X., and Bing, L. Video‑LLaMA: An instruction‑tuned audio‑visual language model for video understanding. EMNLP, 2023.

Lin, B., Zhu, B., et al. Video‑LLaVA: Learning united visual representation by alignment before projection. EMNLP, 2024.

Fei, H., Wu, S., et al. Video‑of‑thought: Step‑by‑step video reasoning from perception to cognition. ICML, 2024.

Qian, Z., Wang, X., et al. Dynamic spatio‑temporal modular network for video question answering. ACM MM, 2022.

Xu, J., Mei, T., et al. MSR‑VTT: A large video description dataset for bridging video and language. CVPR, 2016.

Jang, Y., Song, Y., et al. TGIF‑QA: Toward spatio‑temporal reasoning in visual question answering. CVPR, 2017.

Yu, Z., Xu, D., et al. ActivityNet‑QA: A dataset for understanding complex web videos via question answering. AAAI, 2019.

Gandhi, M., Gul, M. O., et al. Measuring compositional consistency for video question answering. CVPR, 2022.

Xiao, J., Shang, X., et al. NExT‑QA: Next phase of question‑answering to explaining temporal actions. CVPR, 2021.

Li, J., Niu, L., and Zhang, L. From Representation to Reasoning: Towards both evidence and commonsense reasoning for video question‑answering. CVPR, 2022.

Wu, B., Yu, S., et al. STAR: A benchmark for situated reasoning in real‑world videos. NeurIPS, 2023.

Mangalam, K., Akshulakov, R., and Malik, J. EgoSchema: A diagnostic benchmark for very long‑form video language understanding. NeurIPS, 2023.

Fu, C., Dai, Y., et al. Video‑MME: The first comprehensive evaluation benchmark of multimodal LLMs in video analysis. arXiv, 2024.

Li, K., Wang, Y., et al. MVBench: A comprehensive multimodal video understanding benchmark. CVPR, 2024.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Artificial Intelligence, Explainability, Multimodal LLM, Tree Reasoning, VideoQA
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.