Can Language‑Centric Tree Reasoning Transform Video Question Answering?
This article introduces a language‑centric tree reasoning (LTR) framework that recursively decomposes VideoQA queries into perceptual sub‑questions and performs bottom‑up logical inference with video assistance, achieving significantly higher accuracy and explainability across eleven benchmark datasets.
1. Introduction
Existing multimodal large language models (MLLMs) for Video Question Answering (VideoQA) suffer from uncontrolled and opaque reasoning, which limits their ability to perform advanced cognitive inference. To address this, the Bilibili Index team, together with Shanghai Jiao Tong University, proposes a novel language‑centric tree reasoning (LTR) framework, which has been accepted at ICML 2025.
2. Motivation
VideoQA requires moving from low‑level perception (identifying objects, actions, and scenes) to high‑level cognition (understanding the logical structure behind questions). Current video‑oriented MLLMs such as Video‑LLaMA and Video‑LLaVA offer some explanatory output but lack systematic System‑2 reasoning, making their inference paths difficult to control and verify.
3. Method
The LTR framework operates in two stages. In the first stage, it recursively generates a language‑centric logical tree by splitting the original question into simpler, logically coherent sub‑questions until they become perceptual leaf nodes. In the second stage, the MLLM answers all leaf nodes, then, with video assistance, performs bottom‑up logical inference within the tree, aggregating answers to produce the final response and a traceable reasoning path.
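To make the two stages concrete, the following is a minimal sketch of the recursive tree construction and bottom‑up aggregation. It assumes caller‑supplied callables for the language model and the video MLLM; the node structure, function names, and recursion depth limit are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two LTR stages. The callables (is_perceptual, decompose,
# answer_leaf, infer) stand in for prompted LLM / MLLM calls and are assumptions
# for illustration, not the published interface.
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    question: str
    children: List["Node"] = field(default_factory=list)
    answer: Optional[str] = None


def build_tree(
    question: str,
    is_perceptual: Callable[[str], bool],   # LLM judge: answerable by direct observation?
    decompose: Callable[[str], List[str]],  # LLM: split into simpler, logically coherent sub-questions
    depth: int = 0,
    max_depth: int = 3,
) -> Node:
    """Stage 1: top-down recursive splitting into a language-centric logical tree."""
    node = Node(question)
    if depth >= max_depth or is_perceptual(question):
        return node  # perceptual leaf node
    for sub in decompose(question):
        node.children.append(build_tree(sub, is_perceptual, decompose, depth + 1, max_depth))
    return node


def answer_tree(
    node: Node,
    video: object,
    answer_leaf: Callable[[object, str], str],                   # MLLM perception on the video
    infer: Callable[[str, List[Tuple[str, str]], object], str],  # logical aggregation with video assistance
) -> str:
    """Stage 2: answer leaf nodes with the MLLM, then aggregate answers bottom-up."""
    if not node.children:
        node.answer = answer_leaf(video, node.question)
    else:
        child_qa = [(c.question, answer_tree(c, video, answer_leaf, infer)) for c in node.children]
        node.answer = infer(node.question, child_qa, video)
    return node.answer
```

Because every intermediate node stores its question and answer, walking the tree after inference yields the traceable reasoning path mentioned above.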
The process is training‑free; prompts are used to guide both the tree construction (via retrieval‑augmented generation) and the bottom‑up inference.
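The prompt wording below is an assumption for demonstration only; the paper's actual prompts are not reproduced here. It illustrates how retrieved decomposition examples could be injected into the splitting prompt and how child question–answer pairs could be fed to the aggregation prompt.

```python
# Illustrative prompt templates for the training-free setup (hypothetical wording).
# {retrieved_examples} would be filled from a retrieval-augmented-generation store
# of previously decomposed questions.
DECOMPOSE_PROMPT = """You are building a logical reasoning tree for video question answering.
Question: {question}
Similar questions and their decompositions, retrieved as references:
{retrieved_examples}
If the question can be answered by direct visual observation, reply "LEAF".
Otherwise, list 2-4 simpler sub-questions whose answers logically determine the answer."""

INFER_PROMPT = """You are aggregating answers in a reasoning tree for a video.
Parent question: {question}
Sub-questions and their answers:
{child_qa}
Using this evidence (and the video where needed), answer the parent question and
briefly state the logical step that connects the sub-answers to it."""
```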
4. Experiments
LTR was evaluated on eleven VideoQA benchmarks (MSVD‑QA, MSRVTT‑QA, TGIF‑QA, ActivityNet‑QA, AGQA‑Decomp, NExT‑QA, CausalVidQA, STAR, EgoSchema, Video‑MME, MVBench). For open‑ended questions, GPT‑3.5 was used for scoring; for multiple‑choice, the model selected answers directly. On AGQA‑Decomp, compositional consistency metrics (cR, cP, c‑F1) were also reported.
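For reference, open-ended VideoQA scoring of this kind typically prompts GPT-3.5 to judge whether a prediction matches the ground truth and to assign a score. The sketch below assumes a generic `call_gpt35` wrapper and an illustrative judging prompt with a 0-5 score range; neither is taken from the paper.

```python
# Hedged sketch of GPT-assisted scoring for open-ended answers. `call_gpt35` is an
# assumed wrapper around a chat-completion API returning the model's JSON reply;
# the prompt and score range follow the commonly used protocol and may differ
# from the paper's exact setup.
import json
from typing import Callable, Dict, List

JUDGE_PROMPT = """Question: {question}
Ground-truth answer: {reference}
Predicted answer: {prediction}
Does the prediction match the ground truth? Reply as JSON:
{{"pred": "yes" or "no", "score": integer from 0 to 5}}"""


def evaluate_open_ended(
    samples: List[Dict[str, str]],     # each item: question / reference / prediction
    call_gpt35: Callable[[str], str],  # assumed LLM wrapper
) -> Dict[str, float]:
    correct, total_score = 0, 0
    for s in samples:
        reply = json.loads(call_gpt35(JUDGE_PROMPT.format(**s)))
        correct += reply["pred"].lower() == "yes"
        total_score += reply["score"]
    n = len(samples)
    return {"accuracy": correct / n, "avg_score": total_score / n}
```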
Results show that LTR consistently outperforms nine baselines in accuracy, GPT‑based scores, and compositional consistency, especially on complex cognitive sub‑questions. The improvement is attributed to the cooperation between the top‑down recursive splitting and bottom‑up tree reasoning stages, which strengthens logical reasoning while preserving traceability.
Across all benchmarks, LTR yields larger gains on tasks requiring deeper logical reasoning (counterfactual, prediction) than on pure perception tasks, confirming that the framework primarily enhances cognitive inference capabilities.
5. Conclusion
The proposed two‑stage language‑centric tree reasoning framework improves both the accuracy and interpretability of multimodal LLMs for VideoQA. By recursively constructing a logical tree and performing bottom‑up inference with video evidence, LTR provides a transparent, verifiable reasoning path and sets a new direction for language‑driven video understanding.
References
Chen, J., Yan, J., Fang, Y., and Niu, L. Meta‑point learning and refining for category‑agnostic pose estimation. CVPR, 2024.
Zhang, H., Li, X., and Bing, L. Video‑LLaMA: An instruction‑tuned audio‑visual language model for video understanding. EMNLP, 2023.
Lin, B., Zhu, B., et al. Video‑LLaVA: Learning united visual representation by alignment before projection. EMNLP, 2024.
Fei, H., Wu, S., et al. Video‑of‑thought: Step‑by‑step video reasoning from perception to cognition. ICML, 2024.
Qian, Z., Wang, X., et al. Dynamic spatio‑temporal modular network for video question answering. ACM MM, 2022.
Xu, J., Mei, T., et al. MSR‑VTT: A large video description dataset for bridging video and language. CVPR, 2016.
Jang, Y., Song, Y., et al. TGIF‑QA: Toward spatio‑temporal reasoning in visual question answering. CVPR, 2017.
Yu, Z., Xu, D., et al. ActivityNet‑QA: A dataset for understanding complex web videos via question answering. AAAI, 2019.
Gandhi, M., Gul, M. O., et al. Measuring compositional consistency for video question answering. CVPR, 2022.
Xiao, J., Shang, X., et al. NExT‑QA: Next phase of question‑answering to explaining temporal actions. CVPR, 2021.
Li, J., Niu, L., and Zhang, L. From Representation to Reasoning: Towards both evidence and commonsense reasoning for video question‑answering. CVPR, 2022.
Wu, B., Yu, S., et al. STAR: A benchmark for situated reasoning in real‑world videos. NeurIPS, 2023.
Mangalam, K., Akshulakov, R., and Malik, J. EgoSchema: A diagnostic benchmark for very long‑form video language understanding. NeurIPS, 2023.
Fu, C., Dai, Y., et al. Video‑MME: The first comprehensive evaluation benchmark of multimodal LLMs in video analysis. arXiv, 2024.
Li, K., Wang, Y., et al. MVBench: A comprehensive multimodal video understanding benchmark. CVPR, 2024.