VeruSyn: Scalable Proof Synthesis for Rust
- VeruSyn is a two-stage data synthesis pipeline that produces large-scale, proof-annotated Rust programs with machine-verifiable correctness proofs using the Verus framework.
- It combines self-synthesis, tutorial-driven expansion, and agent trajectory capture to overcome data scarcity and complex reasoning challenges in formal verification.
- Fine-tuned language models leveraging the VeruSyn dataset achieve high proof success rates at significantly lower costs compared to many commercial alternatives.
VeruSyn is a two-stage data synthesis pipeline designed to generate large-scale proof-annotated Rust code for formal verification, specifically targeting the Verus verification framework. The system addresses the dual challenge of data scarcity and the elevated reasoning requirements inherent in synthesizing formal correctness proofs alongside code, as opposed to code generation alone. By leveraging self-synthesis, tutorial-driven data expansion, and agent trajectory capture, VeruSyn produces a dataset comprising 6.9 million Rust programs, each annotated with a specification and a machine-verifiable correctness proof. This corpus enables the fine-tuning of general-purpose LLMs such as Qwen2.5-Coder-32B-Instruct, yielding proof success rates rivaling those of leading commercial models at orders of magnitude lower cost (Di et al., 4 Feb 2026).
1. Motivation and Existing Challenges
The increasing adoption of Rust for system-level software has amplified demand for formally verified code that ensures properties like memory safety, absence of data races, and functional correctness. While the Rust type system addresses many safety concerns, low-level code frequently necessitates explicit proof constructs, including preconditions, postconditions, and loop invariants, to achieve full formal verification. Manual construction of such proofs via Verus—a Rust-embedded verification tool using SMT backends like Z3—is both time-consuming and error-prone.
Attempts to automate Verus proof generation have faced two key obstacles:
- Proof-annotated Data Scarcity: Natural repositories contain fewer than 1,000 distinct verification tasks, providing insufficient coverage for training sophisticated LLMs.
- LLM Limitations: Smaller open-source LLMs and budget commercial models (e.g., o4-mini) underperform in proof synthesis, whereas only expensive commercial models (e.g., Claude Sonnet 4.5) demonstrate reliable results, creating a prohibitive cost–accuracy tradeoff.
VeruSyn was developed to synthesize a massive, diverse dataset that includes challenging, long-chain-of-thought (CoT) examples, closing the data and reasoning gap for cost-effective LLM-based proof synthesis.
2. Data Synthesis Methodology
VeruSyn introduces a two-stage pipeline for maximizing data quantity, feature diversity, and reasoning depth.
2.1 Self-Synthesis
Self-synthesis fine-tunes an LLM to autoregressively produce Rust+Verus code/proof pairs. A canonical prompt template solicits proof-annotated functions of at least 20 lines of code. SimHash deduplication is applied after every 500 generated outputs, and a synthesis round halts when the duplicate rate exceeds 95%. Verification proceeds via the Verus tool:
- Verified programs are added as "DirectGen" data.
- Failed programs are debugged iteratively by the LLM, using Verus error messages; successfully debugged pairs are recorded as "Debug" data.
Pseudocode for this process formally specifies initialization on SAFE seed data, batching, deduplication, direct verification, and feedback-driven debugging.
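The loop described above can be sketched in Python. Here `generate`, `verify`, and `debug` are hypothetical stand-ins for the fine-tuned LLM sampler, the Verus checker, and the LLM repair prompt, and the exact-hash deduplication is a simplification of the SimHash near-duplicate check used in the actual pipeline:

```python
import hashlib
from typing import Callable, List, Tuple

def self_synthesis_round(
    generate: Callable[[], str],            # sample one Rust+Verus program
    verify: Callable[[str], Tuple[bool, str]],  # (verified?, error message)
    debug: Callable[[str, str], str],       # repair a program given errors
    batch_size: int = 500,
    dup_threshold: float = 0.95,
    max_debug_iters: int = 5,
) -> Tuple[List[str], List[str]]:
    """One round of self-synthesis: batch, dedupe, verify, debug."""
    seen_hashes = set()                     # exact-dup proxy for SimHash
    direct_gen, debug_data = [], []
    while True:
        batch = [generate() for _ in range(batch_size)]
        fresh = []
        for prog in batch:
            h = hashlib.sha256(prog.encode()).hexdigest()
            if h not in seen_hashes:
                seen_hashes.add(h)
                fresh.append(prog)
        # Halt the round once a batch is almost entirely duplicates.
        if 1 - len(fresh) / batch_size > dup_threshold:
            break
        for prog in fresh:
            ok, err = verify(prog)
            if ok:
                direct_gen.append(prog)     # "DirectGen" data
                continue
            # Feedback-driven debugging using Verus error messages.
            for _ in range(max_debug_iters):
                prog = debug(prog, err)
                ok, err = verify(prog)
                if ok:
                    debug_data.append(prog) # "Debug" data
                    break
    return direct_gen, debug_data
```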
2.2 Tutorial-Based Synthesis
Tutorial-based synthesis systematically ensures comprehensive coverage of the 20 distinct Verus features, such as forall, invariant, nonlinear arithmetic, and bit-vector reasoning. The process comprises:
- Manual curation of approximately 1,000 seed programs illustrating all Verus tutorial concepts.
- Prompting the LLM to generate 2,000 variants per seed, guided by the corresponding tutorial.
- Measuring feature origin and coverage within the verified subset; additional seed programs are authored for underrepresented features.
- Generating up to 4,000 variants per newly supplemented seed.
After deduplication and debugging, this stage yields approximately 1.1 million verified programs, ensuring robust feature representation.
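The coverage-balancing step above can be sketched as follows, assuming each verified program is tagged with the Verus features it exercises; the schema and the 0.1% flagging threshold are illustrative assumptions, not taken from the paper:

```python
from collections import Counter
from typing import Dict, List

def balance_feature_coverage(
    verified: List[Dict],       # each item: {"code": str, "features": [str]}
    all_features: List[str],    # the 20 Verus tutorial features
    min_share: float = 0.001,   # flag features below 0.1% of the corpus
) -> List[str]:
    """Return the features that need additional hand-written seeds."""
    counts = Counter()
    for item in verified:
        counts.update(item["features"])
    total = len(verified)
    return [f for f in all_features if counts[f] / max(total, 1) < min_share]
```

Underrepresented features identified this way then receive newly authored seeds, each expanded with up to 4,000 LLM-generated variants.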
2.3 Agent Trajectory Synthesis
To model complex, long, multi-step proof development (unaddressed by isolated program–proof pairs), VeruSyn captures complete CoT “agent trajectories.” Using a high-performing automated agent stack (GitHub Copilot plus Claude Sonnet 4.5), it records sequences of reasoning–action pairs transforming a raw Rust program into a fully verified one, $\tau = \big((c_1, a_1), (c_2, a_2), \dots, (c_T, a_T)\big)$, where each action $a_t$ includes operations like reading files, invoking Verus, or editing proofs, and each $c_t$ encodes the chain-of-thought. Trajectories are structurally split into direct-gen and debugging steps, providing granular supervision for downstream model fine-tuning.
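One plausible schema for such a trajectory, with the direct-gen/debugging split keyed on the first failing Verus invocation; the field and action names are assumptions for illustration, not the paper's exact format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Step:
    thought: str       # chain-of-thought text preceding the action
    action: str        # e.g. "read_file", "run_verus", "edit_proof"
    observation: str   # tool output, e.g. Verus error messages

@dataclass
class Trajectory:
    program: str                 # the raw Rust program being verified
    steps: List[Step] = field(default_factory=list)

    def split(self) -> Tuple[List[Step], List[Step]]:
        """Split into the direct-gen prefix (through the first failing
        Verus run) and the debugging suffix that follows it."""
        for i, s in enumerate(self.steps):
            if s.action == "run_verus" and "error" in s.observation:
                return self.steps[: i + 1], self.steps[i + 1 :]
        return self.steps, []
```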
3. Dataset Composition and Scale
Two full rounds of self-synthesis and tutorial-based generation, followed by dataset alignment to the latest Verus release, yield an aggregate of 6,896,180 verified Rust programs:
- Mean LOC: ≈ 39 (on par with SAFE/AlphaVerus datasets).
- Feature coverage: Each of the 20 tutorial features appears, the rarest in 8,611 examples (≥ 0.12%).
- CoT data: 4,557 direct-gen and 3,162 debugging trajectories.
- Partitioned corpus: 5.7M direct-gen and 1.2M debug-augmented samples.
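As a quick arithmetic check on the reported coverage floor, the rarest feature's share of the corpus works out as stated:

```python
total = 6_896_180   # verified programs in the final corpus
rarest = 8_611      # examples of the least-represented tutorial feature
share = rarest / total
print(f"{share:.4%}")  # 0.1249%, consistent with the stated >= 0.12%
```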
A summary table of synthesis stages:
| Phase | Synthesized | Deduped | Verified |
|---|---|---|---|
| First round–Self | 6.4M | 4.2M | 1.0M |
| First round–Tutorial | 2.0M | 1.6M | 0.3M |
| Second round–Self | 29.0M | 6.3M | 5.3M |
| Second round–Tutorial | 5.8M | 2.6M | 1.1M |
| Final total | — | 14.7M | 7.7M |
Following final processing, 6.9M programs constitute the canonical dataset (Di et al., 4 Feb 2026).
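The table's per-round counts (in millions) can be cross-checked against the totals row:

```python
# (deduped, verified) per synthesis stage, in millions of programs
stages = {
    "round1_self":     (4.2, 1.0),
    "round1_tutorial": (1.6, 0.3),
    "round2_self":     (6.3, 5.3),
    "round2_tutorial": (2.6, 1.1),
}
deduped = round(sum(d for d, _ in stages.values()), 1)
verified = round(sum(v for _, v in stages.values()), 1)
print(deduped, verified)  # 14.7 7.7, matching the "Final total" row
```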
4. Model Fine-Tuning and Evaluation
Fine-tuning proceeds in two SFT (Supervised Fine-Tuning) stages:
- Stage 1: Part 1 (self + tutorial) data; 2 epochs, batch size 128.
- Stage 2: Part 2 (agent trajectory) data; 5 epochs, batch size 32.
Evaluation Metrics:
- Accuracy@K (with or without debugging): the fraction of tasks solved by at least one of $K$ sampled proof attempts, $\text{Accuracy@}K = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \mathbb{1}\left[\exists\, j \le K : \text{verify}(p_{i,j})\right]$; in the debugging setting, each failed attempt may be iteratively repaired with Verus feedback before counting.
- Cost per task, aggregated by token volume and published per-token API rates.
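Both metrics can be computed as follows; this is a sketch, and the token counts and per-token rates in the cost function's usage are placeholders rather than the paper's figures:

```python
from typing import List

def accuracy_at_k(results: List[List[bool]], k: int) -> float:
    """results[i][j] = whether the j-th sampled proof for task i verified.
    A task counts as solved if any of its first k samples verifies."""
    solved = sum(any(r[:k]) for r in results)
    return solved / len(results)

def cost_per_task(in_tokens: int, out_tokens: int,
                  in_rate: float, out_rate: float) -> float:
    """Aggregate token volume times published per-token API rates."""
    return in_tokens * in_rate + out_tokens * out_rate
```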
Cost–proof tradeoffs illustrate Qwen2.5-VeruSyn's substantial advantages:
| Model | Accuracy@100 | Cost/Task |
|---|---|---|
| Qwen2.5-VeruSyn (5x debug) | 49% | $0.61 |
| Claude Sonnet 4.5 (no debug) | 46% | $8.04 |
| o4-mini (no debug) | 15% | $3.39 |
On VeruSAGE-Bench, Qwen2.5-VeruSyn (with SFT and debugging) attains 49% Accuracy@100, a substantial gain over o4-mini (24%) and approaching Claude Sonnet 4.5 (54%) at competitive cost. On algorithmic benchmarks (VerusBench), Qwen2.5-VeruSyn reaches 83% Accuracy@100 (debug-enabled), surpassing Claude Sonnet 4.5's 76%.
5. Ablation Studies and Statistical Significance
Ablation experiments corroborate that both data scale and tutorial-driven coverage are critical for attaining high proof success rates:
- Full Part 1 data (6.9M) outperforms smaller subsets, random samples, or tutorial-only collections.
- The presence of CoT/trajectory data provides a measurable boost in complex or long-horizon tasks.
A paired McNemar's test confirms statistical significance ($p < 0.001$) for Qwen2.5-VeruSyn's improvement over AlphaVerus-SFT on VeruSAGE predictions.
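McNemar's test depends only on the discordant pairs: tasks solved by one model but not the other. A minimal exact-binomial version is sketched below; the counts in the usage check are hypothetical, chosen only to illustrate the shape of the comparison, not the paper's data:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant-pair counts:
    b = tasks only model A solves, c = tasks only model B solves."""
    n = b + c
    k = min(b, c)
    # Two-sided tail of Binomial(n, 0.5) at the observed split,
    # doubled and capped at 1.0 per the usual convention.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```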
6. Contributions, Limitations, and Future Directions
Documented contributions include:
- A novel, scalable two-stage synthesis pipeline combining self-synthesis, tutorial expansion, and agent trajectory data.
- The construction of a 6.9M-scale dataset of Rust+Verus code-spec-proof triples with comprehensive feature coverage.
- Demonstration that fine-tuned Qwen2.5-Coder-32B achieves proof synthesis quality competitive with Claude Sonnet 4.5 at 1/50th the cost and threefold greater accuracy than preceding research models.
- Systematic ablations quantifying the effects of dataset size, representational diversity, and the incorporation of long chain-of-thoughts.
Notable limitations include restriction to mostly single-function proofs, limited multi-module or cross-crate coverage, and practical constraints on the volume of high-quality agent trajectory data due to the scope of system-level tasks solvable by existing agents. Future avenues suggested include automating trajectory synthesis for larger codebases, integrating programmatic tool use within smaller LLMs, and generalizing the approach to alternative verification frameworks (e.g., Kani, Prusti).
VeruSyn demonstrates the effectiveness of large-scale, procedurally synthesized datasets for advancing cost-effective formal verification of Rust system software. The approach shows potential for extending formal methods to broader contexts within program synthesis and machine-assisted reasoning (Di et al., 4 Feb 2026).