Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Published 22 May 2025 in cs.CL and cs.AI | (2505.16998v1)

Abstract: LLMs have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small LLMs, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.


Summary

The paper demonstrates that Thinking models outperform Instruct models in structured logical tasks when employing formal language trajectories.
The paper reveals significant performance variations across deductive, inductive, abductive, and mixed-form reasoning tasks with formal formats.
The paper shows that fine-tuning a small LLM (Qwen2.5-7B-Base) on curated formal-language data improves accuracy by up to 17.0% on the CSP format.

Introduction

The paper investigates how LLMs perform on logical reasoning tasks when formal languages are used to guide the reasoning process. It is motivated by a gap in the literature: most existing work uses formal languages to steer LLMs toward reliable reasoning paths, while systematic evaluation of these capabilities remains limited. The evaluation framework considers three dimensions: the type of LLM, the taxonomy of reasoning tasks, and the trajectory format used to express the reasoning path (Figure 1).

Figure 1: Evaluation framework with three specific dimensions: spectrum of LLMs, taxonomy of logical reasoning tasks, and format of trajectories.

Methodology

The methodology consists of constructing an evaluation framework and gathering datasets for four types of logical reasoning: deductive, inductive, abductive, and mixed-form. The paper analyzes LLMs along all three dimensions to determine how well they can leverage formal languages. Reasoning trajectories are evaluated in four formats: PoT (Program of Thought, written in Python), Z3 (an SMT solver language), CSP (constraint satisfaction problem), and the default Text.
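To make the difference between trajectory formats concrete, the following is a minimal sketch of one toy deductive problem written in two styles: a PoT-style program that computes the answer imperatively, and a CSP-style formulation that declares variables, domains, and constraints and hands them to a generic solver. The puzzle, the function names, and the tiny backtracking solver are illustrative assumptions and are not taken from the paper or its benchmarks. A Z3-style version is omitted only because it would need the external z3-solver package.

```python
from itertools import permutations

# Toy deductive problem (hypothetical, not from the paper's benchmarks):
# Alice, Bob and Carol finish a race in positions 1-3 with no ties.
# Alice is not first. Bob finishes immediately before Carol.


def solve_pot():
    """PoT-style trajectory: an imperative program that computes the answer."""
    for alice, bob, carol in permutations([1, 2, 3]):
        if alice != 1 and bob + 1 == carol:
            return {"Alice": alice, "Bob": bob, "Carol": carol}
    return None


def solve_csp():
    """CSP-style trajectory: declare variables, domains and constraints,
    then pass them to a generic solver (here a tiny backtracking search)."""
    variables = ["Alice", "Bob", "Carol"]
    domains = {name: [1, 2, 3] for name in variables}
    constraints = [
        (("Alice",), lambda a: a != 1),
        (("Bob", "Carol"), lambda b, c: b + 1 == c),
        (("Alice", "Bob", "Carol"), lambda a, b, c: len({a, b, c}) == 3),
    ]

    def consistent(assignment):
        # Check only the constraints whose variables are all assigned.
        for scope, predicate in constraints:
            if all(v in assignment for v in scope):
                if not predicate(*(assignment[v] for v in scope)):
                    return False
        return True

    def backtrack(assignment):
        if len(assignment) == len(variables):
            return dict(assignment)
        var = variables[len(assignment)]
        for value in domains[var]:
            assignment[var] = value
            if consistent(assignment):
                result = backtrack(assignment)
                if result is not None:
                    return result
            del assignment[var]
        return None

    return backtrack({})


if __name__ == "__main__":
    print(solve_pot())  # {'Alice': 3, 'Bob': 1, 'Carol': 2}
    print(solve_csp())  # {'Alice': 3, 'Bob': 1, 'Carol': 2}
```

Both styles reach the same answer, but the PoT version encodes how to search, while the CSP version only states what must hold. That structural difference is one plausible reason a model trained on one format may not transfer cleanly to another.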

Results

Evaluation of LLMs Across Dimensions

The findings reveal that Thinking models outperform Instruct models, especially when formal languages are used. However, all models show considerable limitations on inductive reasoning tasks, whether or not a formal language is used.

Radar plot analysis (Figure 2) shows that performance varies across logical reasoning tasks and trajectory formats. The Text format generally outperforms the formal languages, except for specific models such as QwQ-32B, which maintains robust performance across all formats. The results also show that performance with formal languages drops significantly on complex tasks.

Figure 2: Radar plots illustrating the performance (%) of multiple LLMs across different reasoning task types and trajectory formats.

Preferred Formats for Different Reasoning Tasks

The study shows that different reasoning tasks favor different trajectory formats (Figure 3). The Text format excels in language comprehension and open-ended tasks. Well-structured tasks, particularly mathematical and symbolic reasoning, perform better with the PoT format. Z3 is well suited to formal logic tasks, which shows the advantage of logic-based languages when strict logical rules apply. Lastly, the CSP format is strongest on structured logic tasks with numerous constraints.

Figure 3: Task performance by trajectory format in the GPT-4o results, showing which format each reasoning task prefers.

Generalization Analysis

The analysis across reasoning tasks and trajectory formats indicates that PoT trajectories transfer positively to other formats, whereas CSP transfers poorly, which points to structural differences among formal languages (Figures 4 and 5). Across reasoning types, the results suggest positive transfer throughout, with the deductive-CSP configuration being the most easily generalized.

Figure 4: Fine-grained by Trajectory Format.

Figure 5: Trajectory Format.
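One way to read a cross-format generalization analysis is as a transfer matrix: train on one source format, evaluate on every other format, and average the off-diagonal accuracy changes per source. The sketch below illustrates only that bookkeeping. The function name, the matrix layout, and every number in the usage example are hypothetical placeholders and are not results from the paper.

```python
FORMATS = ["Text", "PoT", "Z3", "CSP"]


def transfer_summary(gains):
    """Average cross-format gain per source format.

    gains[src][dst] is the accuracy change on format `dst` after
    training on format `src`. The diagonal (src == dst) is skipped,
    so the result measures transfer rather than in-format learning.
    """
    summary = {}
    for src in gains:
        off_diagonal = [gains[src][dst] for dst in FORMATS if dst != src]
        summary[src] = sum(off_diagonal) / len(off_diagonal)
    return summary


if __name__ == "__main__":
    # Placeholder numbers for illustration only, NOT taken from the paper.
    placeholder_gains = {
        "PoT": {"Text": 2.0, "PoT": 9.0, "Z3": 3.0, "CSP": 1.0},
        "CSP": {"Text": -1.0, "PoT": 0.0, "Z3": -2.0, "CSP": 8.0},
    }
    print(transfer_summary(placeholder_gains))  # {'PoT': 2.0, 'CSP': -1.0}
```

Under this reading, a source format with a positive average transfers well and one with a negative average transfers poorly, which is the shape of the PoT versus CSP contrast described above.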

Enhancing LLMs with Formal Data

The study further investigates enhancing small LLMs with a rejected fine-tuning approach on curated formal-language data. Accuracy improves on all four formats, with the largest gain, 17.0%, on the CSP format (table below).

Model	Text	PoT	Z3	CSP	Avg
Qwen2.5-7B-Base w.Formal	+3.0	+4.0	+7.7	+17.0	+8.0

Table: Performance improvements on different formats using formal data augmentation.
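The summary does not spell out how the rejected fine-tuning data is built. Assuming it follows the common rejection-sampling recipe (sample several trajectories per problem, execute them, and keep only those whose answer matches the gold label), the filtering step might look like the sketch below. The function names and the convention that a trajectory stores its result in a variable called `answer` are assumptions made for illustration.

```python
def run_trajectory(code):
    """Execute a PoT-style trajectory and return its `answer` variable.

    Returns None if the code raises or never defines `answer`.
    Assumption: trajectories are trusted or sandboxed; plain exec()
    is not safe for arbitrary model output.
    """
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return None
    return namespace.get("answer")


def rejection_filter(samples, gold):
    """Keep the sampled trajectories whose executed answer equals `gold`."""
    kept = []
    for code in samples:
        result = run_trajectory(code)
        if result is not None and result == gold:
            kept.append(code)
    return kept


if __name__ == "__main__":
    samples = [
        "answer = 2 + 3",          # correct
        "answer = 2 * 3",          # wrong answer, rejected
        "answer = undefined_var",  # raises NameError, rejected
        "answer = sum([1, 4])",    # correct
    ]
    print(rejection_filter(samples, gold=5))
    # ['answer = 2 + 3', 'answer = sum([1, 4])']
```

The retained trajectories would then serve as supervised fine-tuning targets. Formal-language trajectories suit this kind of filtering because they can be checked by execution rather than by a learned judge.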

Conclusion

The paper systematically evaluates LLMs’ performance in logical reasoning tasks using formal languages, highlighting key observations about trajectory preferences and generalization capabilities. It provides insights into enhancing LLMs using formal language datasets, advocating for balanced improvements and task-specific trajectory alignment to augment reasoning capabilities.

The comprehensive evaluation underscores that while LLMs exhibit strong capabilities in certain task types, their performance is inconsistent when formal languages are employed, and generalization remains a critical challenge. Future work is directed towards expanding dataset coverage, enhancing formal reasoning capabilities, and exploring diverse symbolic systems.
