
PACIFIC: Verified Code Instruction Following

Updated 13 December 2025
  • The paper introduces a principled framework that verifies each instruction in code using deterministic executability, ensuring precise instruction adherence.
  • It details a modular architecture that samples instruction chains and employs step-by-step validation to assess compositional code reasoning.
  • Empirical results show that reinforcement learning with verifiable rewards can outperform SFT-only baselines by 15–20 percentage points on held-out benchmarks.

Precise Automatically Checked Instruction Following In Code (PACIFIC) is a principled framework and benchmarking paradigm for evaluating, training, and analyzing LLMs on their ability to precisely follow programmatic instructions, with deterministic, code-executable verification. Unlike end-to-end coding benchmarks that focus solely on functional correctness, PACIFIC enforces that each instruction or output constraint is unambiguously specified and can be automatically validated, enabling high-fidelity measurement of instruction-following and stepwise code reasoning, as well as robust training through verifiable reward signals.

1. Formal Problem Definition and Motivation

PACIFIC addresses the following problem: given an initial input (typically a number, string, or code context) and a sequence of discrete instructions (code operations, user-specified output constraints, or compositional coding requirements), an LLM is evaluated on whether it can produce the precise sequence of intermediate and final outputs that would result from executing each instruction in order—without external code execution (“dry running”) (Dreyfuss et al., 11 Dec 2025).

The core requirements of PACIFIC benchmarks are that every instruction be unambiguously specified and that its expected outputs be deterministically verifiable by automatic code execution.

The importance of PACIFIC-style evaluation arises from the limitations of prior code benchmarks:

  • Overreliance on end-to-end correctness, without measuring adherence to individual instructions or intermediate states.
  • Use of human or LLM-based judgment, leading to subjectivity, latency, or evaluation bias.
  • Limited sensitivity to training data contamination and generalization over unseen constraints (Dreyfuss et al., 11 Dec 2025, Pyatkin et al., 3 Jul 2025).

PACIFIC fills this gap by providing deterministic, scalable, and contamination-resilient benchmarks that isolate core capabilities in instruction-following, multi-step reasoning, and compositional code understanding.

2. Benchmark Framework Architecture and Generation

The canonical PACIFIC pipeline (Dreyfuss et al., 11 Dec 2025) comprises three modular components:

| Component | Functionality | Typical Instantiation |
| --- | --- | --- |
| Instruction Pool | Bank of atomic instructions (multiple languages) | Python, Java, C++ ops |
| Benchmark Generator | Samples instruction chains, computes references | Chains, config (length, seed) |
| Evaluation Engine | Parses outputs, checks correctness per step/tag | Tag-based output canonicalizer |

The benchmark generator samples randomized input values and instruction selections under user-specified constraints:

  • Number of instructions $|I|$
  • Target output length $|O|$
  • Cyclomatic complexity $CC(C)$ of the context code

Difficulty is calibrated via the weighted sum

$$D(B) = \alpha \cdot |I| + \beta \cdot CC(C) + \gamma \cdot |O|$$

where $\alpha, \beta, \gamma$ are user-defined weights.

Each sample consists of the input, the instruction sequence, and the canonical expected outputs. Multiple programming languages and prompt representations may be included to control contamination risk, quantified as $R(B) = 1/|\mathcal{V}|$ for the set of variants $\mathcal{V}$ (Dreyfuss et al., 11 Dec 2025).
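The generator described above can be sketched in a few lines. The instruction pool, the field names, and the difficulty weights below are illustrative assumptions, not the paper's implementation; the key idea is that references are computed by actually executing each sampled instruction in order:

```python
import random

# Hypothetical atomic instruction pool: name -> pure function on the running value.
INSTRUCTION_POOL = {
    "increment": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def generate_sample(num_instructions, seed=0):
    """Sample a random instruction chain and compute the canonical
    reference output after each step by executing the chain."""
    rng = random.Random(seed)
    start = rng.randint(1, 9)
    value, chain, references = start, [], []
    for _ in range(num_instructions):
        name = rng.choice(list(INSTRUCTION_POOL))
        value = INSTRUCTION_POOL[name](value)
        chain.append(name)
        references.append(value)
    return {"input": start, "instructions": chain, "references": references}

def difficulty(sample, alpha=1.0, beta=1.0, gamma=0.1, cc=1):
    """D(B) = alpha*|I| + beta*CC(C) + gamma*|O| with user-defined weights."""
    out_len = sum(len(str(o)) for o in sample["references"])
    return alpha * len(sample["instructions"]) + beta * cc + gamma * out_len
```

Because generation is seeded, the same configuration reproduces the same benchmark sample, which supports the contamination-control variants described above.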

3. Automatic Checking and Evaluation

PACIFIC evaluation is strictly programmatic. Each response must emit answer spans within explicit tag templates (e.g., [ANSWER] [j] ... [\ANSWER] for step $j$). The evaluation engine applies tolerant normalization (trimming, whitespace collapse, numeric canonicalization) to avoid trivial formatting mismatches. An output for instruction $j$ is considered correct if

$$\mathrm{norm}(\hat{o}_{i,j}) = \mathrm{norm}(o_{i,j})$$

where $\hat{o}_{i,j}$ and $o_{i,j}$ are the model and reference outputs, respectively (Dreyfuss et al., 11 Dec 2025).
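The tag extraction and tolerant normalization can be sketched as follows; the regex and the specific normalization rules are assumptions standing in for the paper's canonicalizer:

```python
import re

# Matches "[ANSWER] [j] ... [\ANSWER]" spans, capturing the step index j.
ANSWER_RE = re.compile(r"\[ANSWER\]\s*\[(\d+)\](.*?)\[\\ANSWER\]", re.DOTALL)

def parse_answers(response):
    """Extract {step_index: raw_span} from a tagged model response."""
    return {int(j): span for j, span in ANSWER_RE.findall(response)}

def norm(s):
    """Tolerant normalization: trim, collapse whitespace, canonicalize numbers."""
    s = " ".join(s.split())        # trim and collapse internal whitespace
    try:
        return repr(float(s))      # e.g. "3.50" and "3.5" compare equal
    except ValueError:
        return s                   # non-numeric spans compared as-is

def check_step(model_out, reference):
    """True iff norm(o_hat) == norm(o) for one instruction step."""
    return norm(model_out) == norm(reference)
```

Normalizing both sides before comparison is what keeps the check deterministic while forgiving trivial formatting differences.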

Standard metrics include:

  • Instruction-Level Accuracy:

$$\frac{\sum_{i=1}^{N}\sum_{j=1}^{n_i} c_{i,j}}{\sum_{i=1}^{N} n_i}$$

  • Prompt-Level Accuracy:

$$\frac{\sum_{i=1}^{N} p_i}{N}$$

where $c_{i,j}$ is the correctness of instruction $j$ in sample $i$ and $p_i = \prod_{j=1}^{n_i} c_{i,j}$.
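Given a per-step correctness matrix, both accuracies reduce to a few lines (a sketch; `c[i][j]` holds $c_{i,j}$ as a boolean, with rows of varying length $n_i$):

```python
def instruction_level_accuracy(c):
    """Fraction of individual instructions answered correctly."""
    total = sum(len(row) for row in c)
    return sum(sum(row) for row in c) / total

def prompt_level_accuracy(c):
    """A prompt counts only if every one of its instructions is correct
    (the product over c_{i,j} is 1 iff all steps pass)."""
    return sum(all(row) for row in c) / len(c)
```

Prompt-level accuracy is strictly harsher: a single failed step zeroes out the whole sample, which is why it separates models far more sharply on long chains.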

Supplementary metrics include precision, recall, $F_1$, pass@k (for code), and differentiation power between difficulty levels (Dreyfuss et al., 11 Dec 2025, Wang et al., 5 Mar 2025). Statistical significance is assessed using standard error estimates and two-proportion $z$-tests on accuracy.
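The two-proportion $z$-test used for significance is the standard pooled-variance formula, sketched here (this is the textbook test, not code from the paper):

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """z statistic for H0: p1 == p2, given k successes out of n trials each."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))    # pooled standard error
    return (p1 - p2) / se
```

For example, 60/100 vs. 45/100 correct prompts gives $z \approx 2.12$, significant at the 5% level for a two-sided test.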

4. Representative Instantiations: MCBench, CodeIF-Bench, AutoIF, IFBench

Several recent systems instantiate the PACIFIC paradigm, including MCBench, CodeIF-Bench, AutoIF, and IFBench.

5. Training Paradigms and Generalization

PACIFIC enables both evaluation and training innovations:

  • Supervised Fine-Tuning (SFT): Models are trained only on outputs that pass all executable constraints.
  • Offline and Online Direct Preference Optimization (DPO): Models are updated using pairwise preference data mined from pass/fail verifier outcomes.
  • Reinforcement Learning with Verifiable Rewards (RLVR): Policies are optimized for high expected reward from code-validated constraints, enabling generalization to unseen instructions (Pyatkin et al., 3 Jul 2025).
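The DPO preference mining in the list above can be sketched as follows; the verifier interface and function names are assumptions, but the core idea matches the text: every output that passes all executable constraints is preferred over every output that fails:

```python
def mine_preference_pairs(prompt, candidates, verifier):
    """Build (chosen, rejected) pairs for DPO from pass/fail verifier outcomes.

    `verifier(prompt, candidate)` returns True iff the candidate passes
    every executable constraint for that prompt.
    """
    passed = [c for c in candidates if verifier(prompt, c)]
    failed = [c for c in candidates if not verifier(prompt, c)]
    return [(good, bad) for good in passed for bad in failed]
```

Because the verifier is deterministic code, the resulting preference data needs no human or LLM judge, which is what makes the offline and online DPO loops scale.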

Formal reward:

$$R(y; C) = \sum_{i=1}^{n} w_i \, v(c_i, y)$$

with policy-gradient objective

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, C)}\left[ R(y; C) \right]$$

In practice, group-relative policy optimization (GRPO) can enforce trust-region-style updates.
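The verifiable reward $R(y;C) = \sum_i w_i\, v(c_i, y)$ reduces to a one-line sum once constraints are represented as weighted verifier functions (a minimal sketch; the constraint representation is an assumption):

```python
def verifiable_reward(y, constraints):
    """R(y; C) = sum_i w_i * v(c_i, y).

    `constraints` is a list of (weight, verifier) pairs, where each
    verifier maps a candidate output y to True/False (i.e. v in {0, 1}).
    """
    return sum(w * float(v(y)) for w, v in constraints)
```

An RLVR loop then samples completions from the policy and feeds this scalar straight into the policy-gradient objective, with no learned reward model in between.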

Empirical results show significant gains in both in-domain and out-of-domain constraint satisfaction, with models trained using RLVR and multi-constraint settings outperforming SFT-only baselines by 15–20 percentage points on held-out test benchmarks (Pyatkin et al., 3 Jul 2025, Dong et al., 2024).

6. Limitations, Scalability, and Extension Directions

PACIFIC frameworks currently focus on constraints that are directly and automatically checkable via code or executable tests. Semantic or highly open-ended requirements remain a challenge. Increasing cyclomatic complexity, supporting multi-round/multi-module contexts, and integrating multiple modalities are active extension areas.

Scalability is enabled by the modular pipeline: new instruction pools, target languages, and difficulty configurations can be added without modifying the evaluation engine.

Future work includes:

  • Compositional constraints (e.g., joint property enforcement).
  • Hybrid (symbolic + execution) verifiers.
  • PACIFIC for cross-modal (e.g., image + code) instruction following.
  • Layered RL curriculum, with verifiable code properties of increasing difficulty (Pyatkin et al., 3 Jul 2025).

7. Benchmarking Practices and Comparative Insights

PACIFIC benchmarks differ substantially from traditional code evaluation. On hard multi-step benchmarks (e.g., 15 instructions, target output length ≥ 10), prompt-level accuracy drops sharply: most existing LLMs score near 0%, and even the best models stay below 28%, underscoring the challenge of compositional dry-run reasoning (Dreyfuss et al., 11 Dec 2025, Moon et al., 9 Oct 2025). Distinct performance rankings across PACIFIC, code, and natural-language leaderboards indicate that instruction following and "code dry running" are orthogonal skills not captured by end-to-end correctness alone.

Best practices for PACIFIC-style evaluation include:

  • Prioritizing precision and automatable validation for each instruction.
  • Enabling multi-turn, cumulative evaluation protocols (reflecting real developer workflows).
  • Stratifying by complexity and context scope.
  • Maintaining modular, extensible test and evaluation infrastructure (Wang et al., 5 Mar 2025, Dreyfuss et al., 11 Dec 2025).

A plausible implication is that as LLMs are increasingly deployed for complex, feedback-driven programming applications, PACIFIC’s requirements for granular, automatically checkable instruction adherence will become a critical axis for both capability and safety assessment.
