PACIFIC: Verified Code Instruction Following
- The paper introduces a principled framework that verifies each instruction in code via deterministic, executable checks, ensuring precise measurement of instruction adherence.
- It details a modular architecture that samples instruction chains and employs step-by-step validation to assess compositional code reasoning.
- Empirical results show that reinforcement learning with verifiable rewards can outperform SFT-only baselines by 15–20 percentage points on held-out benchmarks.
Precise Automatically Checked Instruction Following In Code (PACIFIC) is a principled framework and benchmarking paradigm for evaluating, training, and analyzing LLMs on their ability to precisely follow programmatic instructions, with deterministic, code-executable verification. Unlike end-to-end coding benchmarks that focus solely on functional correctness, PACIFIC enforces that each instruction or output constraint is unambiguously specified and can be automatically validated, enabling high-fidelity measurement of instruction-following and stepwise code reasoning, as well as robust training through verifiable reward signals.
1. Formal Problem Definition and Motivation
PACIFIC addresses the following problem: given an initial input (typically a number, string, or code context) and a sequence of discrete instructions (code operations, user-specified output constraints, or compositional coding requirements), an LLM is evaluated on whether it can produce the precise sequence of intermediate and final outputs that would result from executing each instruction in order—without external code execution (“dry running”) (Dreyfuss et al., 11 Dec 2025).
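A minimal sketch of the dry-run task may help fix intuitions. The instruction names and values below are illustrative, not drawn from the paper; the point is that the reference trace is produced by actually executing the chain, while the model must predict it without execution:

```python
# Dry-run evaluation: the model must predict each intermediate value
# that results from applying the instructions in order, without being
# allowed to execute code. Here we build the reference trace that a
# PACIFIC-style checker would compare model answers against.
# (Instruction set and values are illustrative, not from the paper.)

def build_reference_trace(initial, instructions):
    """Apply each instruction to the running state and record every output."""
    state = initial
    trace = []
    for op in instructions:
        state = op(state)
        trace.append(state)
    return trace

# A hypothetical 3-instruction chain over a string input.
chain = [
    lambda s: s.upper(),           # instruction 1: uppercase
    lambda s: s[::-1],             # instruction 2: reverse
    lambda s: s.replace("A", "_"), # instruction 3: substitute
]

print(build_reference_trace("pacific", chain))
# -> ['PACIFIC', 'CIFICAP', 'CIFIC_P']
```

A model is scored correct only if its stepwise answers match this trace exactly, step by step.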
The core requirements of PACIFIC benchmarks are:
- Precision: Every instruction or output specification is fully determined in machine-readable form.
- Automatic Checkability: There exists a deterministic, executable validator (e.g., Python test, static analyzer, evaluation script) that marks outputs as correct or incorrect (Wang et al., 5 Mar 2025, Moon et al., 9 Oct 2025, Pyatkin et al., 3 Jul 2025).
The importance of PACIFIC-style evaluation arises from the limitations of prior code benchmarks:
- Overreliance on end-to-end correctness, without measuring adherence to individual instructions or intermediate states.
- Use of human or LLM-based judgment, leading to subjectivity, latency, or evaluation bias.
- Limited sensitivity to training data contamination and generalization over unseen constraints (Dreyfuss et al., 11 Dec 2025, Pyatkin et al., 3 Jul 2025).
PACIFIC fills this gap by providing deterministic, scalable, and contamination-resilient benchmarks that isolate core capabilities in instruction-following, multi-step reasoning, and compositional code understanding.
2. Benchmark Framework Architecture and Generation
The canonical PACIFIC pipeline (Dreyfuss et al., 11 Dec 2025) comprises three modular components:
| Component | Functionality | Typical Instantiation |
|---|---|---|
| Instruction Pool | Bank of atomic instructions (multiple languages) | Python, Java, C++ ops |
| Benchmark Generator | Samples instruction chains, computes references | Chains, config (length, seed) |
| Evaluation Engine | Parses outputs, checks correctness per step/tag | Tag-based/output canonicalizer |
The benchmark generator samples randomized input values and instruction selections under user-specified constraints:
- Number of Instructions
- Target Output Length
- Cyclomatic Complexity of context code
Difficulty is calibrated via the weighted sum

$$D = w_1 \, N_{\text{instr}} + w_2 \, L_{\text{out}} + w_3 \, C_{\text{cyc}},$$

where $N_{\text{instr}}$ is the number of instructions, $L_{\text{out}}$ the target output length, $C_{\text{cyc}}$ the cyclomatic complexity, and the weights $w_1, w_2, w_3$ are user-defined.
Each sample consists of the input, instruction sequence, and canonical expected outputs. Multiple programming languages and prompt representations may be included to control contamination risk, quantified as a function of the set of representation variants (Dreyfuss et al., 11 Dec 2025).
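The generator stage above can be sketched as follows. This is a structural illustration inferred from the pipeline description, with a toy instruction pool; the names and config fields are assumptions, not the paper's API:

```python
import random

# Sketch of a PACIFIC-style benchmark generator: a seed plus a config
# (chain length, instruction pool) deterministically yields one sample
# containing the input, the instruction chain, and the canonical
# expected outputs. (Toy pool; names are illustrative, not the paper's.)

INSTRUCTION_POOL = {
    "add_3":  lambda x: x + 3,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
    "square": lambda x: x * x,
}

def generate_sample(seed, num_instructions):
    rng = random.Random(seed)  # seed-based resampling aids contamination resistance
    value = rng.randint(1, 20)
    names = [rng.choice(list(INSTRUCTION_POOL)) for _ in range(num_instructions)]
    outputs, state = [], value
    for name in names:
        state = INSTRUCTION_POOL[name](state)
        outputs.append(state)
    return {"input": value, "instructions": names, "expected": outputs}

sample = generate_sample(seed=0, num_instructions=4)
# The same (seed, config) pair always regenerates the identical sample.
assert sample == generate_sample(seed=0, num_instructions=4)
```

Because every sample is a pure function of `(seed, config)`, fresh uncontaminated benchmarks can be regenerated at will without storing static test sets.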
3. Automatic Checking and Evaluation
PACIFIC evaluation is strictly programmatic. Each response must emit answer spans within explicit tag templates (e.g., `[ANSWER] [j] ... [\ANSWER]` for step $j$). The evaluation engine applies tolerant normalization (trimming, whitespace collapse, numeric canonicalization) to avoid trivial formatting mismatches. An output for instruction $j$ is considered correct if

$$\mathrm{norm}(o_j) = \mathrm{norm}(r_j),$$

where $o_j$ and $r_j$ are the model and reference outputs, respectively (Dreyfuss et al., 11 Dec 2025).
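A minimal sketch of this checking step, assuming the `[ANSWER][j] ... [\ANSWER]` tag format described above; the specific normalization rules are an illustrative guess at "tolerant normalization", not the paper's exact implementation:

```python
import re

# Sketch of tag-based, strictly programmatic checking: extract per-step
# answer spans, normalize both sides, and compare for exact equality.

def extract_answers(response):
    """Pull per-step answer spans out of a model response by tag."""
    pattern = re.compile(r"\[ANSWER\]\s*\[(\d+)\]\s*(.*?)\s*\[\\ANSWER\]", re.S)
    return {int(j): span for j, span in pattern.findall(response)}

def normalize(text):
    """Trim, collapse whitespace, and canonicalize numbers (e.g. 2.0 -> 2)."""
    text = " ".join(text.split())
    try:
        num = float(text)
        return str(int(num)) if num == int(num) else str(num)
    except ValueError:
        return text

def check(response, references):
    """Mark instruction j correct iff normalized output matches reference j."""
    answers = extract_answers(response)
    return [normalize(answers.get(j, "")) == normalize(ref)
            for j, ref in enumerate(references, start=1)]

resp = "[ANSWER][1] 42 [\\ANSWER] [ANSWER][2] hello  world [\\ANSWER]"
print(check(resp, ["42.0", "hello world"]))  # -> [True, True]
```

Note that `42` and `42.0` compare equal after numeric canonicalization, while a missing or mis-tagged span simply scores as incorrect, so no human judgment enters the loop.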
Standard metrics include:
- Instruction-Level Accuracy: $\mathrm{Acc}_{\text{inst}} = \frac{1}{\sum_i M_i} \sum_{i=1}^{N} \sum_{j=1}^{M_i} c_{ij}$
- Prompt-Level Accuracy: $\mathrm{Acc}_{\text{prompt}} = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{M_i} c_{ij}$
where $c_{ij} \in \{0, 1\}$ is correctness for instruction $j$ in sample $i$, over $N$ samples with $M_i$ instructions each.
Supplementary metrics: precision, recall, $F_1$, pass@k (for code), and differentiation power between difficulty levels (Dreyfuss et al., 11 Dec 2025, Wang et al., 5 Mar 2025). Statistical significance is assessed using standard error estimates and two-proportion $z$-tests on accuracy.
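The two headline metrics can be computed directly from a correctness matrix; a short sketch, using `c[i][j]` for $c_{ij}$:

```python
# Compute the two headline PACIFIC metrics from a correctness matrix
# c[i][j] in {0, 1} (sample i, instruction j), matching the formulas above.

def instruction_level_accuracy(c):
    """Fraction of all (sample, instruction) pairs answered correctly."""
    total = sum(len(row) for row in c)
    return sum(map(sum, c)) / total

def prompt_level_accuracy(c):
    """Fraction of samples for which *every* instruction is correct."""
    return sum(all(row) for row in c) / len(c)

c = [[1, 1, 1],   # sample fully correct
     [1, 0, 1],   # one miss -> fails prompt-level
     [1, 1, 0]]   # one miss -> fails prompt-level
print(instruction_level_accuracy(c))  # 7/9
print(prompt_level_accuracy(c))       # 1/3
```

The gap between the two values illustrates why prompt-level accuracy collapses on long chains: a single wrong step zeroes out the whole sample.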
4. Representative Instantiations: MCBench, CodeIF-Bench, AutoIF, IFBench
Several recent systems instantiate the PACIFIC paradigm:
- MCBench (Moon et al., 9 Oct 2025): Benchmarks LLMs on step-by-step calculation of string-matching NLP metrics (e.g., BLEU), where each intermediate result and the final numeric answer are verified against reference code. Metrics include Final Accuracy, Format Following, and Following Depth.
- CodeIF-Bench (Wang et al., 5 Mar 2025): Multi-turn code generation tasks embedding nine classes of verifiable instructions (e.g., explicit exception handling, edge-case management, code style), each checkable by executable test harnesses. Pass@k and overall scores track precise multi-step compliance.
- AutoIF (Self-Play with Execution Feedback) (Dong et al., 2024): A data synthesis and training pipeline that programs LLMs to generate instructions, code verifiers, and test cases, with execution-based rejection sampling for filtering correct/failing responses. Used for both supervised fine-tuning and direct preference optimization (DPO).
- IFBench (Pyatkin et al., 3 Jul 2025): Focuses on out-of-domain generalization with 58 novel constraint templates, each paired with deterministic Python verifiers, and evaluates models trained with reinforcement learning from verifiable rewards (RLVR).
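What unites these systems is the pairing of a constraint template with a deterministic Python verifier. A sketch in that style, assuming a hypothetical constraint that is illustrative rather than one of IFBench's 58 templates:

```python
# Sketch of an IFBench-style deterministic verifier: a constraint template
# is paired with a Python function returning True/False, with no human or
# LLM judgment involved. (The specific constraint here is hypothetical.)

def verify_max_words_per_sentence(response, limit):
    """Constraint: every sentence must contain at most `limit` words."""
    # Crude sentence splitting is acceptable for a sketch; a real verifier
    # would fix a precise, documented tokenization rule.
    flattened = response.replace("!", ".").replace("?", ".")
    sentences = [s for s in flattened.split(".") if s.strip()]
    return all(len(s.split()) <= limit for s in sentences)

assert verify_max_words_per_sentence("Short reply. Also short.", limit=3)
assert not verify_max_words_per_sentence(
    "This sentence clearly has too many words.", limit=5)
```

Because the verifier is a pure function of the response text, the same check serves both evaluation and reward computation during training.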
5. Training Paradigms and Generalization
PACIFIC enables both evaluation and training innovations:
- Supervised Fine-Tuning (SFT): Models are trained only on outputs that pass all executable constraints.
- Offline and Online Direct Preference Optimization (DPO): Models are updated using pairwise preference data mined from pass/fail verifier outcomes.
- Reinforcement Learning with Verifiable Rewards (RLVR): Policies are optimized for high expected reward from code-validated constraints, enabling generalization to unseen instructions (Pyatkin et al., 3 Jul 2025).
Formal reward:

$$R(x, y) = \prod_{k} \mathbb{1}\!\left[v_k(y) = 1\right],$$

where the $v_k$ are the executable verifiers associated with prompt $x$, with policy-gradient objective

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[R(x, y)\right], \qquad \nabla_\theta J(\theta) = \mathbb{E}\!\left[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\right].$$

In practice, Group Relative Policy Optimization (GRPO) can enforce trust-region-style updates.
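A minimal sketch of the reward computation, assuming a fractional (share-of-verifiers-passed) reward as a shaped variant of the all-pass binary reward; the verifiers and the REINFORCE-style weighting shown are illustrative, not the papers' exact objective:

```python
# Sketch of an RLVR training signal: the reward for a sampled response is
# the share of deterministic, executable verifiers it passes (papers often
# use an all-pass binary reward; this fractional form is a common shaped
# variant, assumed here for illustration).

def verifiable_reward(response, verifiers):
    """R(x, y) in [0, 1]: fraction of deterministic checks that pass."""
    return sum(v(response) for v in verifiers) / len(verifiers)

def reinforce_weight(reward, log_prob):
    """REINFORCE-style scalar: grad J ~ R(x, y) * grad log pi(y | x).
    Only the scalar weighting is shown, not a real model update."""
    return reward * log_prob

# Two toy verifiers: response must end with a period and stay short.
verifiers = [lambda y: y.endswith("."), lambda y: len(y.split()) <= 5]
print(verifiable_reward("Done.", verifiers))  # -> 1.0
```

Responses that fail every check contribute zero gradient signal, which is what makes execution-based rejection sampling (as in AutoIF) a natural companion to this reward.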
Empirical results show significant gains in both in-domain and out-of-domain constraint satisfaction, with models trained using RLVR and multi-constraint settings outperforming SFT-only baselines by 15–20 percentage points on held-out test benchmarks (Pyatkin et al., 3 Jul 2025, Dong et al., 2024).
6. Limitations, Scalability, and Extension Directions
PACIFIC frameworks currently focus on constraints that are directly and automatically checkable via code or executable tests. Semantic or highly open-ended requirements remain a challenge. Increasing cyclomatic complexity, supporting multi-round/multi-module contexts, and integrating multiple modalities are active extension areas.
Scalability is enabled by:
- Seed-based resampling and prompt/representation diversity, minimizing contamination risk.
- Modular pipeline components, facilitating new instruction types or programming languages (Dreyfuss et al., 11 Dec 2025).
Future work includes:
- Compositional constraints (e.g., joint property enforcement).
- Hybrid (symbolic + execution) verifiers.
- PACIFIC for cross-modal (e.g., image + code) instruction following.
- Layered RL curriculum, with verifiable code properties of increasing difficulty (Pyatkin et al., 3 Jul 2025).
7. Benchmarking Practices and Comparative Insights
PACIFIC benchmarks differ substantially from traditional code evaluation. On hard multi-step benchmarks (e.g., 15 instructions, target output length ≥10), prompt-level accuracy drops to near 0% for most existing LLMs, with even the best models below 28%, emphasizing the challenge of compositional dry-run reasoning (Dreyfuss et al., 11 Dec 2025, Moon et al., 9 Oct 2025). Distinct performance rankings across PACIFIC, code, and natural-language leaderboards indicate that instruction following and "code dry running" are orthogonal skills not captured by end-to-end correctness alone.
Best practices for PACIFIC-style evaluation include:
- Prioritizing precision and automatable validation for each instruction.
- Enabling multi-turn, cumulative evaluation protocols (reflecting real developer workflows).
- Stratifying by complexity and context scope.
- Maintaining modular, extensible test and evaluation infrastructure (Wang et al., 5 Mar 2025, Dreyfuss et al., 11 Dec 2025).
A plausible implication is that as LLMs are increasingly deployed for complex, feedback-driven programming applications, PACIFIC’s requirements for granular, automatically checkable instruction adherence will become a critical axis for both capability and safety assessment.