LTLBench: Temporal Logic Benchmark
- LTLBench is a synthetic benchmark designed to evaluate LLM temporal reasoning by generating LTL-based challenges with varied complexity.
- It employs a four-stage pipeline including directed graph generation, LTL formula synthesis, NuSMV model checking, and natural-language conversion to produce rigorous ground-truth labels.
- The benchmark systematically analyzes the impact of formula depth and state-space size on model performance, revealing limitations in handling nested and global temporal relations.
LTLBench is a synthetic benchmark designed to precisely evaluate temporal reasoning capabilities in LLMs by generating challenges grounded in Linear Temporal Logic (LTL). It utilizes a controlled pipeline to systematically vary problem complexity through manipulation of event-space size (n events) and formula depth (m operators), leveraging automated model checking to generate rigorous ground-truth labels. The benchmark reveals critical insights into the strengths and current limitations of LLMs in handling temporal-logic inference, especially beyond basic operator usage (Tang et al., 2024).
1. Construction Pipeline
LTLBench employs a structured four-stage generation pipeline:
A. Random Directed Graph Generation
A random directed graph with n nodes (events) models the transition system. Each edge (u, v) denotes that event v can occur immediately after event u; cycles and arbitrary connectivity are supported. This graph defines both the NuSMV state space and the vocabulary of atomic propositions for LTL formulas.
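The graph stage can be sketched in a few lines of Python. The function name `make_transition_graph` and its parameters are illustrative, not taken from the LTLBench release:

```python
import random

def make_transition_graph(n_events, n_edges, seed=None):
    """Random directed graph as an adjacency dict: nodes 0..n_events-1
    are events, and edge (u, v) means v can occur right after u.
    Cycles and self-loops are permitted."""
    rng = random.Random(seed)
    n_edges = min(n_edges, n_events * n_events)  # cap at all possible edges
    edges = set()
    while len(edges) < n_edges:
        edges.add((rng.randrange(n_events), rng.randrange(n_events)))
    graph = {u: [] for u in range(n_events)}
    for u, v in sorted(edges):
        graph[u].append(v)
    return graph

g = make_transition_graph(n_events=3, n_edges=4, seed=0)
```

Sampling edges into a set (rather than a list) keeps the graph simple, i.e., free of duplicate transitions.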
B. LTL Formula Generation
Random LTL formulas containing exactly m operators are synthesized using a methodology adapted from Zhu (2021).
- Unary operators: X (next), F (eventually), G (globally), ¬ (negation)
- Binary operators: ∧ (and), ∨ (or), → (implication)

Recursive construction picks operators uniformly at random, building subformulas until exactly m operators have been used. Example (m = 2): F(e0 ∧ e1).
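A minimal sketch of the recursive construction with an exact operator budget m, in the spirit of the adapted Zhu (2021) procedure; the ASCII operator spellings and function names here are assumptions, not the paper's code:

```python
import random

UNARY = ["X", "F", "G", "!"]   # next, eventually, globally, negation
BINARY = ["&", "|", "->"]      # and, or, implication

def random_formula(m, atoms, rng):
    """Return a formula string that uses exactly m operators."""
    if m == 0:
        return rng.choice(atoms)
    op = rng.choice(UNARY + BINARY)
    if op in UNARY:
        return f"{op} ({random_formula(m - 1, atoms, rng)})"
    left = rng.randint(0, m - 1)   # split remaining budget between children
    return (f"({random_formula(left, atoms, rng)}) {op} "
            f"({random_formula(m - 1 - left, atoms, rng)})")

phi = random_formula(2, ["e0", "e1", "e2"], random.Random(0))
```

For a binary operator, one subformula gets `left` operators and the other gets the remainder, so the total always lands exactly on m.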
C. NuSMV Model Checking
A NuSMV module is produced with:
- One state variable ranging over the n events
- Random initial state
- Transition dynamics encoding all edges via “next(state) := case … esac;”
- An attached LTLSPEC for the generated formula

NuSMV's verdict on the LTLSPEC (“SAT”/“UNSAT”) serves as the ground-truth label, i.e., whether the formula holds on every infinite path through the graph.
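The module emission can be sketched as a string template over the graph. The emitted shape (one `state` variable, a `case` transition block, an `LTLSPEC`) follows the description above, but the exact formatting LTLBench emits may differ:

```python
def to_nusmv(graph, init_event, spec):
    """Emit a NuSMV module for a transition graph plus an LTLSPEC."""
    events = sorted(graph)
    lines = [
        "MODULE main",
        "VAR",
        f"  state : {{{', '.join(f'e{u}' for u in events)}}};",
        "ASSIGN",
        f"  init(state) := e{init_event};",
        "  next(state) := case",
    ]
    for u in events:
        succs = graph[u] or [u]   # self-loop dead ends so every path is infinite
        lines.append(f"    state = e{u} : {{{', '.join(f'e{v}' for v in succs)}}};")
    lines += ["  esac;", f"LTLSPEC {spec}"]
    return "\n".join(lines)

model = to_nusmv({0: [1], 1: [0, 2], 2: []}, init_event=0,
                 spec="G (state = e0 -> F (state = e1))")
```

Dead-end nodes are given a self-loop so that every run of the model is an infinite path, which LTL semantics assumes.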
D. Natural-Language Challenge Conversion
For evaluation, each problem is presented as:
- Context: “Initially, event e1 happened. After event ei, event ej can happen.” (one sentence per transition in the graph)
- Hypothesis: the LTL specification is decomposed into named clauses, and the model is asked whether the hypothesis is True or False.

Deterministic zero-shot prompting (temperature = 0) is used; responses must be “True” or “False.”
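An illustrative rendering of a graph instance into the Context/Hypothesis prompt format described above; the exact wording and clause-naming scheme in LTLBench may differ:

```python
def to_prompt(graph, init_event, clauses):
    """Render the Context plus a Hypothesis built from named clauses."""
    sents = [f"Initially, event e{init_event} happened."]
    for u in sorted(graph):
        for v in graph[u]:
            sents.append(f"After event e{u}, event e{v} can happen.")
    context = " ".join(sents)
    hypothesis = " ".join(f"Clause {i + 1}: {c}." for i, c in enumerate(clauses))
    return f"{context}\n{hypothesis} Is the hypothesis True or False?"

prompt = to_prompt({0: [1], 1: [0, 2], 2: []}, 0,
                   ["eventually e1 happens", "e2 always follows e1"])
```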
2. Dataset Composition and Complexity
LTLBench provides systematic control over temporal reasoning challenge difficulty:
- Core Benchmark:
- Default event count n and operator count m: 2,000 examples, balanced (1,000 “True”, 1,000 “False”)
- Difficulty Sweeps:
- Fix the event count n, sweep the operator count m; 300 examples per value of m (2,100 total)
- Fix m, sweep n; 300 examples per value of n (1,800 total)
Problem complexity is a function of both state-space size (n) and formula depth (m). The induced Kripke structure M has O(n) states with one transition per graph edge, and LTL model-checking cost is approximately O(|M| · 2^O(m)), with the exponential growth in m dominating for deep formulas.
3. Evaluation Protocol
The benchmark tests zero-shot temporal reasoning using deterministic prompting; each challenge is provided as a “Context + Hypothesis” chat turn. Six LLM architectures are evaluated:
- Large-parameter models:
- gpt-3.5-turbo
- llama3:70b-instruct
- qwen:72b-chat
- Small-parameter models:
- gemma:7b-instruct
- mistral:7b-instruct
- qwen:7b-chat
Metrics reported per model and per difficulty:
- Accuracy (ACC)
- F-score
- Area under ROC curve (AUC)
No explicit reasoning-time metric is reported.
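Pure-Python versions of the three reported metrics. Assuming hard binary “True”/“False” predictions, the ROC curve has a single threshold point, so AUC reduces to balanced accuracy; this is a sketch, not the paper's evaluation code:

```python
def metrics(y_true, y_pred):
    """ACC, F1, and AUC for binary labels and hard binary predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # ROC area for a single hard threshold: mean of TPR and TNR
    auc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return acc, f1, auc

acc, f1, auc = metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

(The divisions assume both classes occur in the sample, which holds for the balanced LTLBench splits.)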
4. Empirical Results
Performance on the Core Benchmark (default n and m)
| Model | ACC | F | AUC |
|---|---|---|---|
| gpt-3.5-turbo | 0.56 | 0.55 | 0.56 |
| llama3:70b-instruct | 0.59 | 0.59 | 0.59 |
| qwen:72b-chat | 0.60 | 0.59 | 0.60 |
| gemma:7b-instruct | 0.55 | 0.53 | 0.55 |
| qwen:7b-chat | 0.54 | 0.54 | 0.54 |
| mistral:7b-instruct | 0.54 | 0.50 | 0.54 |
Large-parameter models average ~0.58 ACC; small-parameter models ~0.54. All models fall short of robust inference even at this default difficulty.
Performance by Formula Depth (sweeping m with n fixed)
- Accuracy and AUC drop precipitously as m grows beyond the smallest depths
- At larger depths, most models approach random guessing (ACC ≈ 0.5, AUC ≈ 0.5)
- The top-performing model (qwen:72b-chat) falls from ACC ≈ 0.85 at the lowest depth to ACC ≈ 0.52 at the highest
Performance by State-Space Size (sweeping n with m fixed)
- As n increases from 2 to 9, accuracy and AUC decline, but more gradually than under the formula-depth sweep
Strengths and Limitations
- LLMs can solve highly elementary LTL tasks (smallest m) at ~80% ACC
- Robustness declines rapidly once formulas become nested (larger m)
- Large models outperform small ones for low complexity, but all collapse toward random performance for deep formulas or large state spaces
- Basic “next”/“eventually” reasoning emerges, but globally-scoped or fixed-point inference is outside current zero-shot capabilities
A plausible implication is that current LLMs only instantiate shallow forms of temporal logic reasoning in the absence of explicit reasoning modules or augmented supervision.
5. LTL Syntax and Illustrative Formulas
Standard LTL syntax combines atomic, unary, and binary temporal constructs; restricted to the benchmark's operator set, the grammar is:

φ ::= e | ¬φ | φ ∧ φ | φ ∨ φ | φ → φ | X φ | F φ | G φ

Example formulas in the benchmark's style include:
- “Whenever a holds, b will eventually hold”: G(a → F b)
- “If a happens next, then b always holds”: X a → G b
The formula nesting depth (m) is the key determinant of challenge complexity, with the aforementioned complexity measure highlighting exponential scaling in m.
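To make the depth parameter concrete, a tiny sketch that measures operator count and nesting depth of a formula encoded as nested tuples `(operator, *children)`; the encoding is illustrative:

```python
def stats(f):
    """Return (operator_count, nesting_depth) of a tuple-encoded formula."""
    if isinstance(f, str):                    # atomic proposition
        return 0, 0
    counts, depths = zip(*(stats(c) for c in f[1:]))
    return 1 + sum(counts), 1 + max(depths)

# G(a -> F b): three operators, all nested, so depth equals count here
phi = ("G", ("->", "a", ("F", "b")))
```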
6. Significance and Prospects
LTLBench provides a fully synthetic, controllable, and rigorously labeled benchmark for zero-shot temporal-logic reasoning in LLMs. Through parameter sweeps in n and m, LTLBench exposes the accuracy ceiling of current models and identifies a gap in their ability to handle nested, globally scoped, or fixed-point temporal relationships. The systematic approach allows precise measurement of LLM limitations and offers a flexible platform for evaluating progress in temporal reasoning architectures (Tang et al., 2024).
This suggests future research on LLM temporal reasoning will benefit from targeted architectural modifications and training protocols that address deep operator nesting and global temporal dependencies, using LTLBench both as baseline and diagnostic instrument.