
LTLBench: Temporal Logic Benchmark

Updated 2 February 2026
  • LTLBench is a synthetic benchmark designed to evaluate LLM temporal reasoning by generating LTL-based challenges with varied complexity.
  • It employs a four-stage pipeline including directed graph generation, LTL formula synthesis, NuSMV model checking, and natural-language conversion to produce rigorous ground-truth labels.
  • The benchmark systematically analyzes the impact of formula depth and state-space size on model performance, revealing limitations in handling nested and global temporal relations.

LTLBench is a synthetic benchmark designed to precisely evaluate temporal reasoning capabilities in LLMs by generating challenges grounded in Linear Temporal Logic (LTL). It utilizes a controlled pipeline to systematically vary problem complexity through manipulation of event space and formula depth, leveraging automated model checking to generate rigorous ground-truth labels. The benchmark reveals critical insights into the strengths and current limitations of LLMs in handling temporal-logic inference, especially beyond basic operator usage (Tang et al., 2024).

1. Construction Pipeline

LTLBench employs a structured four-stage generation pipeline:

A. Random Directed Graph Generation

A random directed graph $G = (V, E)$ with $n > 1$ nodes (events) models the transition system. Each edge $\text{event}_i \rightarrow \text{event}_j$ denotes that event $j$ can succeed event $i$, supporting cycles and arbitrary connectivity. This graph defines both the NuSMV state space and the vocabulary for LTL formulas.
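This stage can be sketched in a few lines. The paper does not publish its generator, so the function name, the edge probability, and the totality fix-up below are assumptions:

```python
import random

def random_directed_graph(n, edge_prob=0.5, seed=None):
    """Sample a random directed graph over events 1..n (hypothetical
    re-implementation of LTLBench's first pipeline stage)."""
    rng = random.Random(seed)
    nodes = list(range(1, n + 1))
    # Allow arbitrary connectivity, including cycles and self-loops.
    edges = [(i, j) for i in nodes for j in nodes if rng.random() < edge_prob]
    # Ensure every event has at least one successor so the NuSMV
    # transition relation is total (every infinite path can continue).
    for i in nodes:
        if not any(src == i for src, _ in edges):
            edges.append((i, rng.choice(nodes)))
    return nodes, edges
```

The successor fix-up matters because LTL is interpreted over infinite paths; a dead-end state would make the transition system ill-formed for model checking.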

B. LTL Formula Generation

Random LTL formulas with exactly $m > 0$ operators are synthesized using a methodology adapted from Zhu (2021).

  • Unary operators: $\mathbf{X}$ (next), $\mathbf{F}$ (eventually), $\mathbf{G}$ (globally), $\lnot$ (negation)
  • Binary operators: $\land$ (and), $\lor$ (or), $\rightarrow$ (implication)

Recursive construction picks operators uniformly, with subformulas built up to meet the $m$-operator constraint:

$$\phi ::= p \mid \lnot \phi \mid \phi_1 \land \phi_2 \mid \phi_1 \lor \phi_2 \mid \mathbf{X}\,\phi \mid \mathbf{F}\,\phi \mid \mathbf{G}\,\phi \mid \phi_1 \rightarrow \phi_2$$

Example ($n=2$, $m=3$):

$$\underbrace{p \rightarrow \mathbf{G}(\mathbf{F}\,q)}_{m=3}, \quad p \equiv \text{event}_1,\; q \equiv \text{event}_2$$
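The recursive construction described above can be sketched as follows. This is not the authors' code; the function name and the ASCII operator spellings (`X`, `F`, `G`, `!`, `&`, `|`, `->`) are assumptions for illustration:

```python
import random

UNARY = ["X", "F", "G", "!"]
BINARY = ["&", "|", "->"]

def random_ltl(m, atoms, rng=None):
    """Build a random LTL formula string with exactly m operators
    (a sketch of the Zhu (2021)-style generator described above)."""
    rng = rng or random.Random()
    if m == 0:
        return rng.choice(atoms)  # base case: a bare atomic proposition
    op = rng.choice(UNARY + BINARY)
    if op in UNARY:
        return f"{op} ({random_ltl(m - 1, atoms, rng)})"
    # Split the remaining m-1 operators between the two subformulas.
    k = rng.randint(0, m - 1)
    left = random_ltl(k, atoms, rng)
    right = random_ltl(m - 1 - k, atoms, rng)
    return f"({left}) {op} ({right})"
```

Because the operator budget is threaded through the recursion exactly, every generated formula has precisely $m$ operators, which is what lets the benchmark sweep difficulty by $m$.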

C. NuSMV Model Checking

A NuSMV module is produced with:

  • One state variable over $n$ events
  • A random initial state $\text{event}_k$
  • Transition dynamics encoding all edges via “next(state) := case … esac;”
  • An attached LTLSPEC for the generated formula

NuSMV labelings (“SAT”/“UNSAT”) serve as ground truth, i.e., whether the formula holds on every infinite path through the graph.
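A module in this shape can be rendered from the graph with simple string templating. The sketch below is illustrative; the exact module layout and the self-loop fallback in the `case` block are assumptions, not the paper's code:

```python
def to_nusmv(n, edges, init, ltl_spec):
    """Render the transition graph as a NuSMV module (illustrative
    sketch of LTLBench's third pipeline stage)."""
    succ = {i: sorted({j for s, j in edges if s == i}) for i in range(1, n + 1)}
    lines = [
        "MODULE main",
        "VAR",
        f"  state : {{{', '.join(f'event{i}' for i in range(1, n + 1))}}};",
        "ASSIGN",
        f"  init(state) := event{init};",
        "  next(state) := case",
    ]
    for i, targets in succ.items():
        if targets:
            choices = ", ".join(f"event{j}" for j in targets)
            lines.append(f"    state = event{i} : {{{choices}}};")
    lines.append("    TRUE : state;")  # fallback keeps the relation total
    lines.extend(["  esac;", f"LTLSPEC {ltl_spec}"])
    return "\n".join(lines)
```

Running `NuSMV` on the emitted file then yields the True/False ground-truth label for the attached LTLSPEC.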

D. Natural-Language Challenge Conversion

For evaluation, each problem is presented as:

  • Context: “Initially, event$_k$ happened. After event$_i$, event$_j$ can happen.” (lists all transitions)
  • Hypothesis: decomposes the LTL spec into named clauses $C_1, \dots, C_r$, then asks whether $C_r$ is True or False.

Deterministic zero-shot prompting (temperature = 0) is used; responses must be “True” or “False”.
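The prompt assembly for these templates is mechanical. In the sketch below, the context sentences follow the templates quoted above, while the clause texts are assumed to be produced elsewhere and passed in as English strings; the question wording is a hypothetical paraphrase:

```python
def to_natural_language(init, edges, clauses):
    """Assemble the Context + Hypothesis prompt from the transition
    graph and pre-decomposed clause descriptions (illustrative)."""
    context = [f"Initially, event{init} happened."]
    # One sentence per edge, listing all transitions.
    context += [f"After event{i}, event{j} can happen." for i, j in edges]
    # Name the clauses C1..Cr and ask about the last one.
    hypothesis = [f"C{k}: {text}" for k, text in enumerate(clauses, start=1)]
    question = f"Is C{len(clauses)} True or False? Answer True or False."
    return " ".join(context), "\n".join(hypothesis + [question])
```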

2. Dataset Composition and Complexity

LTLBench provides systematic control over temporal reasoning challenge difficulty:

  • Core Benchmark:
    • ($n=3$, $m=3$): 2,000 examples, balanced (1,000 “True”, 1,000 “False”)
  • Difficulty Sweeps:
    • Fix $n=2$, sweep $m \in \{1,2,3,4,5,7,9\}$; 300 examples per $m$ (2,100 total)
    • Fix $m=2$, sweep $n \in \{2,3,4,5,7,9\}$; 300 examples per $n$ (1,800 total)

Problem complexity is a function of both state-space size ($n$) and formula depth ($m$). The induced Kripke structure has size $O(n)$, but model-checking cost is approximately

$$\text{Complexity} \approx O(n \times 2^m)$$

with exponential growth in $m$ dominating for deep formulas.
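Plugging the two sweep settings into this cost proxy makes the asymmetry concrete, assuming the $n \times 2^m$ approximation:

```python
def complexity(n, m):
    """Rough model-checking cost proxy n * 2**m from the text above."""
    return n * 2 ** m

# The formula-depth sweep (n = 2 fixed) grows exponentially in m ...
depth_sweep = {m: complexity(2, m) for m in (1, 2, 3, 4, 5, 7, 9)}
# ... while the state-space sweep (m = 2 fixed) grows only linearly in n.
state_sweep = {n: complexity(n, 2) for n in (2, 3, 4, 5, 7, 9)}
```

Across the depth sweep the proxy spans a 256x range (from $2 \times 2^1 = 4$ to $2 \times 2^9 = 1024$), while the state-space sweep spans only 4.5x (from 8 to 36), which matches the steeper accuracy decline observed along $m$.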

3. Evaluation Protocol

The benchmark tests zero-shot temporal reasoning using deterministic prompting; each challenge is provided as a “Context + Hypothesis” chat turn. Six LLM architectures are evaluated:

  • Large-parameter models:
    • gpt-3.5-turbo
    • llama3:70b-instruct
    • qwen:72b-chat
  • Small-parameter models:
    • gemma:7b-instruct
    • mistral:7b-instruct
    • qwen:7b-chat

Metrics reported per model and per difficulty:

  • Accuracy (ACC)
  • F$_1$-score
  • Area under ROC curve (AUC)

No explicit reasoning-time metric is reported.
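These metrics can be computed directly from the hard True/False answers. The helper below is hypothetical (the paper does not publish its scoring code); note that with hard binary labels, ROC-AUC reduces to the mean of the true-positive and true-negative rates:

```python
def binary_metrics(y_true, y_pred):
    """ACC, F1, and AUC for hard True/False predictions
    (hypothetical helper, not the paper's evaluation code)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    auc = (tpr + tnr) / 2  # single-threshold ROC-AUC
    return acc, f1, auc
```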

4. Empirical Results

Performance on the Core Benchmark ($n=3$, $m=3$)

Model                 ACC    F1     AUC
gpt-3.5-turbo         0.56   0.55   0.56
llama3:70b-instruct   0.59   0.59   0.59
qwen:72b-chat         0.60   0.59   0.60
gemma:7b-instruct     0.55   0.53   0.55
qwen:7b-chat          0.54   0.54   0.54
mistral:7b-instruct   0.54   0.50   0.54

Large-parameter models average roughly 0.58 ACC; small-parameter models roughly 0.54. All models fall short of robust inference on general $m=3$ tasks.

Performance by Formula Depth ($m$, with $n=2$ fixed)

  • Accuracy and AUC drop precipitously from $m=1$ to $m=2$
  • Beyond $m=5$, most models approach random guessing (ACC $\approx$ 0.5, AUC $\approx$ 0.5)
  • The top-performing model (qwen:72b-chat) falls from ACC $\approx$ 0.85 at $m=1$ to ACC $\approx$ 0.52 at $m=7$

Performance by State-Space Size ($n$, with $m=2$ fixed)

  • As $n$ increases from 2 to 9, accuracy and AUC decline, but more gradually: ACC drops from $\approx 0.75$ at $n=2$ to $\approx 0.55$ at $n=9$

Strengths and Limitations

  • LLMs can solve highly elementary LTL tasks ($m=1$) at >80% ACC
  • Robustness declines rapidly for nested formulas ($m \geq 2$)
  • Large models outperform small ones at low complexity, but all collapse toward random performance for deep formulas or large state spaces
  • Basic “next”/“eventually” reasoning emerges, but globally scoped or fixed-point inference is outside current zero-shot capabilities

A plausible implication is that current LLMs only instantiate shallow forms of temporal logic reasoning in the absence of explicit reasoning modules or augmented supervision.

5. LTL Syntax and Illustrative Formulas

Standard LTL syntax describes atomic, unary, and binary temporal constructs:

$$\phi ::= p \mid \lnot \phi \mid \phi_1 \land \phi_2 \mid \phi_1 \lor \phi_2 \mid \mathbf{X}\,\phi \mid \mathbf{F}\,\phi \mid \mathbf{G}\,\phi \mid \phi_1 \rightarrow \phi_2$$

Example formulas in the benchmark include:

  • “Whenever pp holds, qq will eventually hold”:

$$\mathbf{G}\bigl(p \rightarrow \mathbf{F}\,q\bigr)$$

  • “If $r$ holds and $s$ holds next, then $s$ always holds”:

$$r \wedge \mathbf{X}\,s \rightarrow \mathbf{G}\,s$$

Formula nesting depth ($m$) is the key determinant of challenge complexity, with the complexity measure above highlighting exponential scaling in $m$.
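To make the semantics of such formulas concrete, the sketch below evaluates an LTL formula on an ultimately periodic (“lasso”) trace, which is how infinite paths through a finite graph look. LTLBench itself delegates this to NuSMV, so the tuple-based formula encoding and the function here are purely illustrative:

```python
def holds(phi, trace, loop_start, i=0):
    """Evaluate LTL formula phi at position i of the infinite trace
    trace[0..] where trace[loop_start:] repeats forever.
    phi is nested tuples: ("atom", p), ("!", a), ("X", a), ("F", a),
    ("G", a), ("&", a, b), ("|", a, b), ("->", a, b)."""
    nxt = lambda j: j + 1 if j + 1 < len(trace) else loop_start
    op, rest = phi[0], phi[1:]
    if op == "atom":
        return rest[0] in trace[i]
    if op == "!":
        return not holds(rest[0], trace, loop_start, i)
    if op == "X":
        return holds(rest[0], trace, loop_start, nxt(i))
    if op in ("F", "G"):
        # Every distinct suffix of the lasso starts at a position
        # reachable from i by repeatedly stepping nxt, so F/G only
        # need to inspect finitely many positions.
        seen, j, vals = set(), i, []
        while j not in seen:
            seen.add(j)
            vals.append(holds(rest[0], trace, loop_start, j))
            j = nxt(j)
        return any(vals) if op == "F" else all(vals)
    a = holds(rest[0], trace, loop_start, i)
    b = holds(rest[1], trace, loop_start, i)
    return {"&": a and b, "|": a or b, "->": (not a) or b}[op]
```

For instance, on the trace $\{p\}, (\{q\}, \emptyset)^\omega$, the first example formula $\mathbf{G}(p \rightarrow \mathbf{F}\,q)$ holds, while $\mathbf{G}(\mathbf{F}\,p)$ does not, since $p$ never recurs inside the loop.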

6. Significance and Prospects

LTLBench provides a fully synthetic, controllable, and rigorously labeled benchmark for zero-shot temporal-logic reasoning in LLMs. Through parameter sweeps in (n,m)(n, m), LTLBench exposes the accuracy ceiling of current models and identifies a gap in their ability to handle nested, globally-scoped, or fixed-point temporal relationships. The systematic approach allows precise measurement of LLM limitations and offers a flexible platform for evaluating progress in temporal reasoning architectures (Tang et al., 2024).

This suggests future research on LLM temporal reasoning will benefit from targeted architectural modifications and training protocols that address deep operator nesting and global temporal dependencies, using LTLBench both as baseline and diagnostic instrument.
