LTLBench: Temporal Logic Benchmark
- LTLBench is a synthetic benchmark designed to evaluate LLM temporal reasoning by generating LTL-based challenges with varied complexity.
- It employs a four-stage pipeline including directed graph generation, LTL formula synthesis, NuSMV model checking, and natural-language conversion to produce rigorous ground-truth labels.
- The benchmark systematically analyzes the impact of formula depth and state-space size on model performance, revealing limitations in handling nested and global temporal relations.
LTLBench is a synthetic benchmark designed to precisely evaluate temporal reasoning capabilities in LLMs by generating challenges grounded in Linear Temporal Logic (LTL). It utilizes a controlled pipeline to systematically vary problem complexity through manipulation of event-space size (n events) and formula depth (m operators), leveraging automated model checking to generate rigorous ground-truth labels. The benchmark reveals critical insights into the strengths and current limitations of LLMs in handling temporal-logic inference, especially beyond basic operator usage (Tang et al., 2024).
1. Construction Pipeline
LTLBench employs a structured four-stage generation pipeline:
A. Random Directed Graph Generation
A random directed graph with n nodes (events) models the transition system. Each edge (u, v) denotes that event v can occur immediately after event u; cycles and arbitrary connectivity are supported. This graph defines both the NuSMV state space and the vocabulary of atomic propositions for LTL formulas.
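The graph stage can be sketched in a few lines of Python. The function name `make_transition_graph` and its parameters are illustrative, not taken from the LTLBench release:

```python
import random

def make_transition_graph(n_events, n_edges, seed=None):
    """Random directed graph as an adjacency dict: nodes 0..n_events-1
    are events, and edge (u, v) means v can occur right after u.
    Cycles and self-loops are permitted."""
    rng = random.Random(seed)
    n_edges = min(n_edges, n_events * n_events)  # cap at all possible edges
    edges = set()
    while len(edges) < n_edges:
        edges.add((rng.randrange(n_events), rng.randrange(n_events)))
    graph = {u: [] for u in range(n_events)}
    for u, v in sorted(edges):
        graph[u].append(v)
    return graph

g = make_transition_graph(n_events=3, n_edges=4, seed=0)
```

Sampling edges into a set (rather than a list) keeps the graph simple, i.e., free of duplicate transitions.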
B. LTL Formula Generation
Random LTL formulas containing exactly m operators are synthesized using a methodology adapted from Zhu (2021).
- Unary operators: X (next), F (eventually), G (globally), ¬ (negation)
- Binary operators: ∧ (and), ∨ (or), → (implication)

Recursive construction picks operators uniformly at random, building subformulas until exactly m operators have been used. Example (m = 2): F(e0 ∧ e1).
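A minimal sketch of the recursive construction with an exact operator budget m, in the spirit of the adapted Zhu (2021) procedure; the ASCII operator spellings and function names here are assumptions, not the paper's code:

```python
import random

UNARY = ["X", "F", "G", "!"]   # next, eventually, globally, negation
BINARY = ["&", "|", "->"]      # and, or, implication

def random_formula(m, atoms, rng):
    """Return a formula string that uses exactly m operators."""
    if m == 0:
        return rng.choice(atoms)
    op = rng.choice(UNARY + BINARY)
    if op in UNARY:
        return f"{op} ({random_formula(m - 1, atoms, rng)})"
    left = rng.randint(0, m - 1)   # split remaining budget between children
    return (f"({random_formula(left, atoms, rng)}) {op} "
            f"({random_formula(m - 1 - left, atoms, rng)})")

phi = random_formula(2, ["e0", "e1", "e2"], random.Random(0))
```

For a binary operator, one subformula gets `left` operators and the other gets the remainder, so the total always lands exactly on m.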
C. NuSMV Model Checking
A NuSMV module is produced with:
- One state variable ranging over the n events
- Random initial state
- Transition dynamics encoding all edges via “next(state) := case … esac;”
- An attached LTLSPEC for the generated formula

NuSMV's verdict on the LTLSPEC (“SAT”/“UNSAT”) serves as the ground-truth label, i.e., whether the formula holds on every infinite path through the graph.
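The module emission can be sketched as a string template over the graph. The emitted shape (one `state` variable, a `case` transition block, an `LTLSPEC`) follows the description above, but the exact formatting LTLBench emits may differ:

```python
def to_nusmv(graph, init_event, spec):
    """Emit a NuSMV module for a transition graph plus an LTLSPEC."""
    events = sorted(graph)
    lines = [
        "MODULE main",
        "VAR",
        f"  state : {{{', '.join(f'e{u}' for u in events)}}};",
        "ASSIGN",
        f"  init(state) := e{init_event};",
        "  next(state) := case",
    ]
    for u in events:
        succs = graph[u] or [u]   # self-loop dead ends so every path is infinite
        lines.append(f"    state = e{u} : {{{', '.join(f'e{v}' for v in succs)}}};")
    lines += ["  esac;", f"LTLSPEC {spec}"]
    return "\n".join(lines)

model = to_nusmv({0: [1], 1: [0, 2], 2: []}, init_event=0,
                 spec="G (state = e0 -> F (state = e1))")
```

Dead-end nodes are given a self-loop so that every run of the model is an infinite path, which LTL semantics assumes.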
D. Natural-Language Challenge Conversion
For evaluation, each problem is presented as:
- Context: “Initially, event e1 happened. After event ei, event ej can happen.” (one sentence per transition in the graph)
- Hypothesis: the LTL specification is decomposed into named clauses, and the model is asked whether the hypothesis is True or False.

Deterministic zero-shot prompting (temperature = 0) is used; responses must be “True” or “False.”
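An illustrative rendering of a graph instance into the Context/Hypothesis prompt format described above; the exact wording and clause-naming scheme in LTLBench may differ:

```python
def to_prompt(graph, init_event, clauses):
    """Render the Context plus a Hypothesis built from named clauses."""
    sents = [f"Initially, event e{init_event} happened."]
    for u in sorted(graph):
        for v in graph[u]:
            sents.append(f"After event e{u}, event e{v} can happen.")
    context = " ".join(sents)
    hypothesis = " ".join(f"Clause {i + 1}: {c}." for i, c in enumerate(clauses))
    return f"{context}\n{hypothesis} Is the hypothesis True or False?"

prompt = to_prompt({0: [1], 1: [0, 2], 2: []}, 0,
                   ["eventually e1 happens", "e2 always follows e1"])
```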
2. Dataset Composition and Complexity
LTLBench provides systematic control over temporal reasoning challenge difficulty:
- Core Benchmark:
- Default event count n and operator count m: 2,000 examples, balanced (1,000 “True”, 1,000 “False”)
- Difficulty Sweeps:
- Fix the event count n, sweep the operator count m; 300 examples per value of m (2,100 total)
- Fix m, sweep n; 300 examples per value of n (1,800 total)
Problem complexity is a function of both state-space size (n) and formula depth (m). The induced Kripke structure M has O(n) states with one transition per graph edge, and LTL model-checking cost is approximately O(|M| · 2^O(m)), with the exponential growth in m dominating for deep formulas.
3. Evaluation Protocol
The benchmark tests zero-shot temporal reasoning using deterministic prompting; each challenge is provided as a “Context + Hypothesis” chat turn. Six LLM architectures are evaluated:
- Large-parameter models:
- gpt-3.5-turbo
- llama3:70b-instruct
- qwen:72b-chat
- Small-parameter models:
- gemma:7b-instruct
- mistral:7b-instruct
- qwen:7b-chat
Metrics reported per model and per difficulty:
- Accuracy (ACC)
- F-score
- Area under ROC curve (AUC)
No explicit reasoning-time metric is reported.
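Pure-Python versions of the three reported metrics. Assuming hard binary “True”/“False” predictions, the ROC curve has a single threshold point, so AUC reduces to balanced accuracy; this is a sketch, not the paper's evaluation code:

```python
def metrics(y_true, y_pred):
    """ACC, F1, and AUC for binary labels and hard binary predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # ROC area for a single hard threshold: mean of TPR and TNR
    auc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return acc, f1, auc

acc, f1, auc = metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

(The divisions assume both classes occur in the sample, which holds for the balanced LTLBench splits.)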
4. Empirical Results
Performance on the Core Benchmark (default n and m)
| Model | ACC | F | AUC |
|---|---|---|---|
| gpt-3.5-turbo | 0.56 | 0.55 | 0.56 |
| llama3:70b-instruct | 0.59 | 0.59 | 0.59 |
| qwen:72b-chat | 0.60 | 0.59 | 0.60 |
| gemma:7b-instruct | 0.55 | 0.53 | 0.55 |
| qwen:7b-chat | 0.54 | 0.54 | 0.54 |
| mistral:7b-instruct | 0.54 | 0.50 | 0.54 |
Large-parameter models average ~0.58 ACC; small-parameter models ~0.54. All models fall short of robust inference even at this default difficulty.
Performance by Formula Depth (sweeping m with n fixed)
- Accuracy and AUC drop precipitously as m grows beyond the smallest depths
- At larger depths, most models approach random guessing (ACC ≈ 0.5, AUC ≈ 0.5)
- The top-performing model (qwen:72b-chat) falls from ACC ≈ 0.85 at the lowest depth to ACC ≈ 0.52 at the highest
Performance by State-Space Size (sweeping n with m fixed)
- As n increases from 2 to 9, accuracy and AUC decline, but more gradually than under the formula-depth sweep
Strengths and Limitations
- LLMs can solve highly elementary LTL tasks (smallest m) at ~80% ACC
- Robustness declines rapidly once formulas become nested (larger m)
- Large models outperform small ones for low complexity, but all collapse toward random performance for deep formulas or large state spaces
- Basic “next”/“eventually” reasoning emerges, but globally-scoped or fixed-point inference is outside current zero-shot capabilities
A plausible implication is that current LLMs only instantiate shallow forms of temporal logic reasoning in the absence of explicit reasoning modules or augmented supervision.
5. LTL Syntax and Illustrative Formulas
Standard LTL syntax combines atomic, unary, and binary temporal constructs; restricted to the benchmark's operator set, the grammar is:

φ ::= e | ¬φ | φ ∧ φ | φ ∨ φ | φ → φ | X φ | F φ | G φ

Example formulas in the benchmark's style include:
- “Whenever a holds, b will eventually hold”: G(a → F b)
- “If a happens next, then b always holds”: X a → G b
The formula nesting depth (m) is the key determinant of challenge complexity, with the aforementioned complexity measure highlighting exponential scaling in m.
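To make the depth parameter concrete, a tiny sketch that measures operator count and nesting depth of a formula encoded as nested tuples `(operator, *children)`; the encoding is illustrative:

```python
def stats(f):
    """Return (operator_count, nesting_depth) of a tuple-encoded formula."""
    if isinstance(f, str):                    # atomic proposition
        return 0, 0
    counts, depths = zip(*(stats(c) for c in f[1:]))
    return 1 + sum(counts), 1 + max(depths)

# G(a -> F b): three operators, all nested, so depth equals count here
phi = ("G", ("->", "a", ("F", "b")))
```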
6. Significance and Prospects
LTLBench provides a fully synthetic, controllable, and rigorously labeled benchmark for zero-shot temporal-logic reasoning in LLMs. Through parameter sweeps in n and m, LTLBench exposes the accuracy ceiling of current models and identifies a gap in their ability to handle nested, globally scoped, or fixed-point temporal relationships. The systematic approach allows precise measurement of LLM limitations and offers a flexible platform for evaluating progress in temporal reasoning architectures (Tang et al., 2024).
This suggests future research on LLM temporal reasoning will benefit from targeted architectural modifications and training protocols that address deep operator nesting and global temporal dependencies, using LTLBench both as baseline and diagnostic instrument.