Can Large Language Models Master Co‑Temporal Reasoning? Introducing COTEMPQA

This article presents COTEMPQA, a benchmark for evaluating large language models on co-temporal reasoning. It details the benchmark's four scenario types, its construction pipeline, experimental results across models, and an error analysis, and it proposes MR-COT, a strategy that leverages mathematical reasoning to substantially improve performance.


Background and Related Work

Temporal reasoning is essential for language models to understand the world, yet existing temporal-reasoning datasets such as TIMEQA, TEMPLAMA, and TEMPREASON focus on isolated events and fail to capture the complexity of co-temporal events, i.e., multiple facts that hold simultaneously in real-world scenarios.

Our Contribution – COTEMPQA Dataset

The COTEMPQA benchmark contains 4,748 samples designed to evaluate large language models across four co‑temporal scenarios: Equal, Overlap, During, and Mix.

Dataset Overview

COTEMPQA provides a comprehensive set of co‑temporal question‑answer pairs, each consisting of a conditional fact and a query fact, covering the four defined scenarios.

Four Co‑Temporal Scenarios

Equal: Two facts occur over exactly the same time interval; the model only needs to recognize identical periods.

Overlap: Two facts partially overlap in time; the model must detect the intersecting segment.

During: One fact's interval is fully contained within another's, requiring understanding of a nested relationship.

Mix: A combination of equal, overlap, and during relations, representing the most complex case with multiple correct answers.
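The three pairwise relations above can be expressed as a small interval classifier; Mix is not a pairwise label but arises when a question admits multiple answers spanning different relations. The following is a minimal sketch that assumes closed intervals with inclusive year endpoints, not necessarily the benchmark's exact boundary convention:

```python
def classify_relation(a_start, a_end, b_start, b_end):
    """Classify the co-temporal relation between two closed time intervals.

    Covers the three pairwise COTEMPQA scenario types (equal, during,
    overlap); intervals with no shared time are labeled "disjoint".
    """
    # No shared time at all -> the facts are not co-temporal.
    if a_end < b_start or b_end < a_start:
        return "disjoint"
    # Identical periods.
    if (a_start, a_end) == (b_start, b_end):
        return "equal"
    # One interval fully nested inside the other.
    if (b_start <= a_start and a_end <= b_end) or \
       (a_start <= b_start and b_end <= a_end):
        return "during"
    # Otherwise the intervals share only a partial segment.
    return "overlap"
```

Checking the nested case before the partial case matters: a contained interval also "intersects" the container, so the order of the branches encodes the specificity of the relations.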

Dataset Construction Process

1. Extract time-related facts from Wikidata and convert them into a five-tuple format (subject, relation, object, start time, end time).

2. Group facts by subject, ensuring each group contains at least three temporal facts.

3. Identify co-temporal relationships by comparing timestamps and classifying them into the four scenario types.

4. Generate QA pairs by selecting one fact as the condition and another as the query, using 17 predefined relation pairs and corresponding question templates.
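The pipeline above can be sketched end to end. The facts, names, and question template here are hypothetical placeholders; the actual benchmark draws facts from Wikidata and uses 17 predefined relation pairs with their own templates:

```python
from collections import defaultdict

# Hypothetical five-tuple facts: (subject, relation, object, start, end).
facts = [
    ("Alice", "employer", "Acme", 2001, 2008),
    ("Alice", "position held", "CTO", 2003, 2006),
    ("Alice", "member of", "the board", 2003, 2006),
]

# Step 2: group facts by subject and keep groups with >= 3 temporal facts.
groups = defaultdict(list)
for fact in facts:
    groups[fact[0]].append(fact)
groups = {s: fs for s, fs in groups.items() if len(fs) >= 3}

# Steps 3-4: pair a condition fact with a query fact whenever their
# intervals intersect, and fill a simple question template.
qa_pairs = []
for fs in groups.values():
    for i, cond in enumerate(fs):
        for query in fs[i + 1:]:
            if cond[3] <= query[4] and query[3] <= cond[4]:  # intervals intersect
                question = (f"While {cond[0]}'s {cond[1]} was {cond[2]}, "
                            f"what was {cond[0]}'s {query[1]}?")
                qa_pairs.append((question, query[2]))
```

With the sample facts, every pair of intervals shares time, so three QA pairs are generated; the first asks about Alice's position held while her employer was Acme, with answer "CTO".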

Experimental Results and Analysis

Model Performance

GPT‑4 achieves the highest overall score but still lags far behind human performance (54.7 vs. 92.8).

Performance drops sharply from Equal (92.7) to Overlap (59.4), During (50.1), and Mix (45.0) scenarios.

Closed‑Book vs. Open‑Book QA

Closed‑book QA: GPT‑4 scores 14.5, indicating limited reasoning without external context.

Open‑book QA: GPT‑4 improves to 54.7, showing that access to retrieved information helps but still falls short of humans.

Error Analysis

Incomplete Answer: The model returns only a subset of the correct answers when multiple exist.

Uncertainty Error: The model refuses to answer because it cannot confidently extract the co-temporal relation.

Wrong Answer: The model provides an outright incorrect answer, revealing gaps in co-temporal reasoning.

Case Study

Basic Ability: LLMs handle simple equal scenarios well.

Increased Complexity: Overlap and During scenarios require deeper inference about intersecting intervals.

Mixed Scenario: Multiple correct answers and varied relations make this the most challenging.

Impact of Different Abilities

Mathematical reasoning models (e.g., WizardMath‑70B) outperform base models on co‑temporal tasks, suggesting a strong correlation between math skills and temporal inference.

Even the best math‑oriented model struggles with the Mix scenario because it tends to output a single answer instead of enumerating all valid possibilities.

Improvement Strategy – MR‑COT

Why Mathematical Reasoning Matters

Experiments show that incorporating mathematical reasoning dramatically boosts performance on co‑temporal tasks; WizardMath‑70B scores 30.1 versus 22.2 for LLaMA‑70B.

Proposed MR‑COT Strategy

MR‑COT combines mathematical reasoning with chain‑of‑thought prompting to enhance co‑temporal inference. The procedure includes:

Identify Key Time Points: Pinpoint the exact timestamps of events.

Structure a Timeline: Order events chronologically.

Mathematically Detect Overlaps: Use arithmetic to determine intersecting intervals.
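The three steps reduce to a few lines of arithmetic. This is an illustrative sketch of the interval math MR-COT asks the model to perform, with made-up events, not the paper's actual prompt:

```python
def intersect(a, b):
    """Return the intersection of two (start, end) intervals, or None.

    The overlap runs from the latest start to the earliest end; if the
    latest start comes after the earliest end, the facts never co-occur.
    """
    start = max(a[0], b[0])  # latest start time
    end = min(a[1], b[1])    # earliest end time
    return (start, end) if start <= end else None

# Step 1: identify key time points for each (hypothetical) event.
events = {"CEO at Acme": (2001, 2008), "board member": (2005, 2010)}

# Step 2: structure a timeline by ordering events chronologically.
timeline = sorted(events.items(), key=lambda kv: kv[1])

# Step 3: mathematically detect the overlap.
overlap = intersect(events["CEO at Acme"], events["board member"])
# overlap == (2005, 2008): the two roles were held co-temporally.
```

The point of the strategy is that "were X and Y true at the same time?" becomes a max/min comparison rather than a fuzzy judgment over prose.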

Results

In open‑book QA, MR‑COT improves scores on Overlap, During, and Mix tasks by 14.6, 11.4, and 13.5 points respectively.

In closed‑book QA, it yields a modest overall gain of 1.3 points.

Despite these gains, a large gap remains compared to human performance (92.8), indicating ample room for further research.

Conclusion

We introduced the COTEMPQA benchmark to assess large language models on co‑temporal reasoning, revealing that while models handle simple equal scenarios well, they struggle with overlapping, during, and mixed relations. Mathematical reasoning proves crucial, and the MR‑COT approach substantially narrows the performance gap, offering a promising direction for future improvements.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

NewBeeNLP
