PACIFIC: Verified Code Instruction Following
- The paper introduces a principled framework that verifies each instruction in code via deterministic, executable checks, ensuring precise measurement of instruction adherence.
- It details a modular architecture that samples instruction chains and employs step-by-step validation to assess compositional code reasoning.
- Empirical results show that reinforcement learning with verifiable rewards can outperform SFT-only baselines by 15–20 percentage points on held-out benchmarks.
Precise Automatically Checked Instruction Following In Code (PACIFIC) is a principled framework and benchmarking paradigm for evaluating, training, and analyzing LLMs on their ability to precisely follow programmatic instructions, with deterministic, code-executable verification. Unlike end-to-end coding benchmarks that focus solely on functional correctness, PACIFIC enforces that each instruction or output constraint is unambiguously specified and can be automatically validated, enabling high-fidelity measurement of instruction-following and stepwise code reasoning, as well as robust training through verifiable reward signals.
1. Formal Problem Definition and Motivation
PACIFIC addresses the following problem: given an initial input (typically a number, string, or code context) and a sequence of discrete instructions (code operations, user-specified output constraints, or compositional coding requirements), an LLM is evaluated on whether it can produce the precise sequence of intermediate and final outputs that would result from executing each instruction in order—without external code execution (“dry running”) (Dreyfuss et al., 11 Dec 2025).
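A minimal sketch of the dry-run task may help fix intuitions. The instruction names and values below are illustrative, not drawn from the paper; the point is that the reference trace is produced by actually executing the chain, while the model must predict it without execution:

```python
# Dry-run evaluation: the model must predict each intermediate value
# that results from applying the instructions in order, without being
# allowed to execute code. Here we build the reference trace that a
# PACIFIC-style checker would compare model answers against.
# (Instruction set and values are illustrative, not from the paper.)

def build_reference_trace(initial, instructions):
    """Apply each instruction to the running state and record every output."""
    state = initial
    trace = []
    for op in instructions:
        state = op(state)
        trace.append(state)
    return trace

# A hypothetical 3-instruction chain over a string input.
chain = [
    lambda s: s.upper(),           # instruction 1: uppercase
    lambda s: s[::-1],             # instruction 2: reverse
    lambda s: s.replace("A", "_"), # instruction 3: substitute
]

print(build_reference_trace("pacific", chain))
# -> ['PACIFIC', 'CIFICAP', 'CIFIC_P']
```

A model is scored correct only if its stepwise answers match this trace exactly, step by step.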
The core requirements of PACIFIC benchmarks are:
- Precision: Every instruction or output specification is fully determined in machine-readable form.
- Automatic Checkability: There exists a deterministic, executable validator (e.g., Python test, static analyzer, evaluation script) that marks outputs as correct or incorrect (Wang et al., 5 Mar 2025, Moon et al., 9 Oct 2025, Pyatkin et al., 3 Jul 2025).
The importance of PACIFIC-style evaluation arises from the limitations of prior code benchmarks:
- Overreliance on end-to-end correctness, without measuring adherence to individual instructions or intermediate states.
- Use of human or LLM-based judgment, leading to subjectivity, latency, or evaluation bias.
- Limited sensitivity to training data contamination and generalization over unseen constraints (Dreyfuss et al., 11 Dec 2025, Pyatkin et al., 3 Jul 2025).
PACIFIC fills this gap by providing deterministic, scalable, and contamination-resilient benchmarks that isolate core capabilities in instruction-following, multi-step reasoning, and compositional code understanding.
2. Benchmark Framework Architecture and Generation
The canonical PACIFIC pipeline (Dreyfuss et al., 11 Dec 2025) comprises three modular components:
| Component | Functionality | Typical Instantiation |
|---|---|---|
| Instruction Pool | Bank of atomic instructions (multiple languages) | Python, Java, C++ ops |
| Benchmark Generator | Samples instruction chains, computes references | Chains, config (length, seed) |
| Evaluation Engine | Parses outputs, checks correctness per step/tag | Tag-based/output canonicalizer |
The benchmark generator samples randomized input values and instruction selections under user-specified constraints:
- Number of Instructions
- Target Output Length
- Cyclomatic Complexity of context code
Difficulty is calibrated via the weighted sum

$$D = w_1 \, N_{\text{instr}} + w_2 \, L_{\text{out}} + w_3 \, C_{\text{cyc}},$$

where $N_{\text{instr}}$ is the number of instructions, $L_{\text{out}}$ the target output length, $C_{\text{cyc}}$ the cyclomatic complexity, and the weights $w_1, w_2, w_3$ are user-defined.
Each sample consists of the input, instruction sequence, and canonical expected outputs. Multiple programming languages and prompt representations may be included to control contamination risk, quantified as a function of the set of representation variants (Dreyfuss et al., 11 Dec 2025).
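The generator stage above can be sketched as follows. This is a structural illustration inferred from the pipeline description, with a toy instruction pool; the names and config fields are assumptions, not the paper's API:

```python
import random

# Sketch of a PACIFIC-style benchmark generator: a seed plus a config
# (chain length, instruction pool) deterministically yields one sample
# containing the input, the instruction chain, and the canonical
# expected outputs. (Toy pool; names are illustrative, not the paper's.)

INSTRUCTION_POOL = {
    "add_3":  lambda x: x + 3,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
    "square": lambda x: x * x,
}

def generate_sample(seed, num_instructions):
    rng = random.Random(seed)  # seed-based resampling aids contamination resistance
    value = rng.randint(1, 20)
    names = [rng.choice(list(INSTRUCTION_POOL)) for _ in range(num_instructions)]
    outputs, state = [], value
    for name in names:
        state = INSTRUCTION_POOL[name](state)
        outputs.append(state)
    return {"input": value, "instructions": names, "expected": outputs}

sample = generate_sample(seed=0, num_instructions=4)
# The same (seed, config) pair always regenerates the identical sample.
assert sample == generate_sample(seed=0, num_instructions=4)
```

Because every sample is a pure function of `(seed, config)`, fresh uncontaminated benchmarks can be regenerated at will without storing static test sets.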
3. Automatic Checking and Evaluation
PACIFIC evaluation is strictly programmatic. Each response must emit answer spans within explicit tag templates (e.g., `[ANSWER] [j] ... [\ANSWER]` for step $j$). The evaluation engine applies tolerant normalization (trimming, whitespace collapse, numeric canonicalization) to avoid trivial formatting mismatches. An output for instruction $j$ is considered correct if

$$\mathrm{norm}(o_j) = \mathrm{norm}(r_j),$$

where $o_j$ and $r_j$ are the model and reference outputs, respectively (Dreyfuss et al., 11 Dec 2025).
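A minimal sketch of this checking step, assuming the `[ANSWER][j] ... [\ANSWER]` tag format described above; the specific normalization rules are an illustrative guess at "tolerant normalization", not the paper's exact implementation:

```python
import re

# Sketch of tag-based, strictly programmatic checking: extract per-step
# answer spans, normalize both sides, and compare for exact equality.

def extract_answers(response):
    """Pull per-step answer spans out of a model response by tag."""
    pattern = re.compile(r"\[ANSWER\]\s*\[(\d+)\]\s*(.*?)\s*\[\\ANSWER\]", re.S)
    return {int(j): span for j, span in pattern.findall(response)}

def normalize(text):
    """Trim, collapse whitespace, and canonicalize numbers (e.g. 2.0 -> 2)."""
    text = " ".join(text.split())
    try:
        num = float(text)
        return str(int(num)) if num == int(num) else str(num)
    except ValueError:
        return text

def check(response, references):
    """Mark instruction j correct iff normalized output matches reference j."""
    answers = extract_answers(response)
    return [normalize(answers.get(j, "")) == normalize(ref)
            for j, ref in enumerate(references, start=1)]

resp = "[ANSWER][1] 42 [\\ANSWER] [ANSWER][2] hello  world [\\ANSWER]"
print(check(resp, ["42.0", "hello world"]))  # -> [True, True]
```

Note that `42` and `42.0` compare equal after numeric canonicalization, while a missing or mis-tagged span simply scores as incorrect, so no human judgment enters the loop.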
Standard metrics include:
- Instruction-Level Accuracy: $\mathrm{Acc}_{\text{inst}} = \frac{1}{\sum_i M_i} \sum_{i=1}^{N} \sum_{j=1}^{M_i} c_{ij}$
- Prompt-Level Accuracy: $\mathrm{Acc}_{\text{prompt}} = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{M_i} c_{ij}$
where $c_{ij} \in \{0, 1\}$ is correctness for instruction $j$ in sample $i$, over $N$ samples with $M_i$ instructions each.
Supplementary metrics: precision, recall, $F_1$, pass@k (for code), and differentiation power between difficulty levels (Dreyfuss et al., 11 Dec 2025, Wang et al., 5 Mar 2025). Statistical significance is assessed using standard error estimates and two-proportion $z$-tests on accuracy.
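The two headline metrics can be computed directly from a correctness matrix; a short sketch, using `c[i][j]` for $c_{ij}$:

```python
# Compute the two headline PACIFIC metrics from a correctness matrix
# c[i][j] in {0, 1} (sample i, instruction j), matching the formulas above.

def instruction_level_accuracy(c):
    """Fraction of all (sample, instruction) pairs answered correctly."""
    total = sum(len(row) for row in c)
    return sum(map(sum, c)) / total

def prompt_level_accuracy(c):
    """Fraction of samples for which *every* instruction is correct."""
    return sum(all(row) for row in c) / len(c)

c = [[1, 1, 1],   # sample fully correct
     [1, 0, 1],   # one miss -> fails prompt-level
     [1, 1, 0]]   # one miss -> fails prompt-level
print(instruction_level_accuracy(c))  # 7/9
print(prompt_level_accuracy(c))       # 1/3
```

The gap between the two values illustrates why prompt-level accuracy collapses on long chains: a single wrong step zeroes out the whole sample.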
4. Representative Instantiations: MCBench, CodeIF-Bench, AutoIF, IFBench
Several recent systems instantiate the PACIFIC paradigm:
- MCBench (Moon et al., 9 Oct 2025): Benchmarks LLMs on step-by-step calculation of string-matching NLP metrics (e.g., BLEU), where each intermediate result and the final numeric answer are verified against reference code. Metrics include Final Accuracy, Format Following, and Following Depth.
- CodeIF-Bench (Wang et al., 5 Mar 2025): Multi-turn code generation tasks embedding nine classes of verifiable instructions (e.g., explicit exception handling, edge-case management, code style), each checkable by executable test harnesses. Pass@k and overall scores track precise multi-step compliance.
- AutoIF (Self-Play with Execution Feedback) (Dong et al., 2024): A data synthesis and training pipeline that programs LLMs to generate instructions, code verifiers, and test cases, with execution-based rejection sampling for filtering correct/failing responses. Used for both supervised fine-tuning and direct preference optimization (DPO).
- IFBench (Pyatkin et al., 3 Jul 2025): Focuses on out-of-domain generalization with 58 novel constraint templates, each paired with deterministic Python verifiers, and evaluates models trained with reinforcement learning from verifiable rewards (RLVR).
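What unites these systems is the pairing of a constraint template with a deterministic Python verifier. A sketch in that style, assuming a hypothetical constraint that is illustrative rather than one of IFBench's 58 templates:

```python
# Sketch of an IFBench-style deterministic verifier: a constraint template
# is paired with a Python function returning True/False, with no human or
# LLM judgment involved. (The specific constraint here is hypothetical.)

def verify_max_words_per_sentence(response, limit):
    """Constraint: every sentence must contain at most `limit` words."""
    # Crude sentence splitting is acceptable for a sketch; a real verifier
    # would fix a precise, documented tokenization rule.
    flattened = response.replace("!", ".").replace("?", ".")
    sentences = [s for s in flattened.split(".") if s.strip()]
    return all(len(s.split()) <= limit for s in sentences)

assert verify_max_words_per_sentence("Short reply. Also short.", limit=3)
assert not verify_max_words_per_sentence(
    "This sentence clearly has too many words.", limit=5)
```

Because the verifier is a pure function of the response text, the same check serves both evaluation and reward computation during training.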
5. Training Paradigms and Generalization
PACIFIC enables both evaluation and training innovations:
- Supervised Fine-Tuning (SFT): Models are trained only on outputs that pass all executable constraints.
- Offline and Online Direct Preference Optimization (DPO): Models are updated using pairwise preference data mined from pass/fail verifier outcomes.
- Reinforcement Learning with Verifiable Rewards (RLVR): Policies are optimized for high expected reward from code-validated constraints, enabling generalization to unseen instructions (Pyatkin et al., 3 Jul 2025).
Formal reward:

$$R(x, y) = \prod_{k} \mathbb{1}\!\left[v_k(y) = 1\right],$$

where the $v_k$ are the executable verifiers associated with prompt $x$, with policy-gradient objective

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[R(x, y)\right], \qquad \nabla_\theta J(\theta) = \mathbb{E}\!\left[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\right].$$

In practice, Group Relative Policy Optimization (GRPO) can enforce trust-region-style updates.
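A minimal sketch of the reward computation, assuming a fractional (share-of-verifiers-passed) reward as a shaped variant of the all-pass binary reward; the verifiers and the REINFORCE-style weighting shown are illustrative, not the papers' exact objective:

```python
# Sketch of an RLVR training signal: the reward for a sampled response is
# the share of deterministic, executable verifiers it passes (papers often
# use an all-pass binary reward; this fractional form is a common shaped
# variant, assumed here for illustration).

def verifiable_reward(response, verifiers):
    """R(x, y) in [0, 1]: fraction of deterministic checks that pass."""
    return sum(v(response) for v in verifiers) / len(verifiers)

def reinforce_weight(reward, log_prob):
    """REINFORCE-style scalar: grad J ~ R(x, y) * grad log pi(y | x).
    Only the scalar weighting is shown, not a real model update."""
    return reward * log_prob

# Two toy verifiers: response must end with a period and stay short.
verifiers = [lambda y: y.endswith("."), lambda y: len(y.split()) <= 5]
print(verifiable_reward("Done.", verifiers))  # -> 1.0
```

Responses that fail every check contribute zero gradient signal, which is what makes execution-based rejection sampling (as in AutoIF) a natural companion to this reward.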
Empirical results show significant gains in both in-domain and out-of-domain constraint satisfaction, with models trained using RLVR and multi-constraint settings outperforming SFT-only baselines by 15–20 percentage points on held-out test benchmarks (Pyatkin et al., 3 Jul 2025, Dong et al., 2024).
6. Limitations, Scalability, and Extension Directions
PACIFIC frameworks currently focus on constraints that are directly and automatically checkable via code or executable tests. Semantic or highly open-ended requirements remain a challenge. Increasing cyclomatic complexity, supporting multi-round/multi-module contexts, and integrating multiple modalities are active extension areas.
Scalability is enabled by:
- Seed-based resampling and prompt/representation diversity, minimizing contamination risk.
- Modular pipeline components, facilitating new instruction types or programming languages (Dreyfuss et al., 11 Dec 2025).
Future work includes:
- Compositional constraints (e.g., joint property enforcement).
- Hybrid (symbolic + execution) verifiers.
- PACIFIC for cross-modal (e.g., image + code) instruction following.
- Layered RL curriculum, with verifiable code properties of increasing difficulty (Pyatkin et al., 3 Jul 2025).
7. Benchmarking Practices and Comparative Insights
PACIFIC benchmarks differ substantially from traditional code evaluation. On hard multi-step benchmarks (e.g., 15 instructions, target output length ≥10), prompt-level accuracy drops to near 0% for most existing LLMs, with even the best models below 28%, emphasizing the challenge of compositional dry-run reasoning (Dreyfuss et al., 11 Dec 2025, Moon et al., 9 Oct 2025). Distinct performance rankings across PACIFIC, code, and natural-language leaderboards indicate that instruction following and "code dry running" are orthogonal skills not captured by end-to-end correctness alone.
Best practices for PACIFIC-style evaluation include:
- Prioritizing precision and automatable validation for each instruction.
- Enabling multi-turn, cumulative evaluation protocols (reflecting real developer workflows).
- Stratifying by complexity and context scope.
- Maintaining modular, extensible test and evaluation infrastructure (Wang et al., 5 Mar 2025, Dreyfuss et al., 11 Dec 2025).
A plausible implication is that as LLMs are increasingly deployed for complex, feedback-driven programming applications, PACIFIC’s requirements for granular, automatically checkable instruction adherence will become a critical axis for both capability and safety assessment.