
Analyzing and Internalizing Complex Policy Documents for LLM Agents

Published 13 Oct 2025 in cs.AI | (2510.11588v1)

Abstract: LLM-based agentic systems rely on in-context policy documents encoding diverse business rules. As requirements grow, these documents expand rapidly, causing high computational overhead. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. Prior prompt compression work targets generic prompts, but agentic policy documents span multiple complexity levels and require deeper reasoning, making internalization harder. We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels, enabling systematic evaluation of agents' ability to handle complexity and offering a unified framework for assessing policy internalization. Our analysis shows that complex policy specifications governing workflows pose major reasoning challenges. Supporting internalization with gold user agent interaction trajectories containing chain-of-thought (CoT) annotations via supervised fine-tuning (SFT) is data-intensive and degrades sharply as policy complexity increases. To mitigate data and reasoning burdens, we propose Category-Aware Policy Continued Pretraining (CAP-CPT). Our automated pipeline parses policy documents to extract key specifications, grouping them into factual, behavioral, and conditional categories, and isolating complex conditions that drive workflow complexity. This guides targeted data synthesis and enables agents to internalize policy information through an autoregressive pretraining loss. Experiments show CAP-CPT improves SFT baselines in all settings, with up to 41% and 22% gains on Qwen-3-32B, achieving 97.3% prompt length reduction on CC-Gen and further enhancing tau-Bench with minimal SFT data.

Summary

  • The paper introduces CAP-CPT, a novel pipeline that internalizes complex policy documents into LLM priors, achieving performance improvements up to 41% on workflow complexity tasks.
  • It systematically benchmarks policy complexity across multiple axes, revealing that logical branching in workflows is the critical bottleneck for LLM agent success.
  • Empirical results demonstrate that targeted data generation and continued pretraining significantly reduce prompt length while enhancing recall, generalization, and computational efficiency.

Motivation and Problem Statement

LLM agents rely heavily on in-context policy documents, which encode extensive business rules, workflows, and behavioral constraints. As these documents grow in complexity and length—often exceeding tens of thousands of tokens in real-world scenarios—they pose significant computational and reasoning challenges. Conventional prompt compression techniques are ill-suited for policy documents; their generic approaches fail to address the multi-level complexity and intricate reasoning demands required by agentic policies. The central technical challenge is developing efficient internalization algorithms that embed policy documents into model priors, enabling agents to recall and enforce business rules without extensive prompt context, while maintaining or improving performance.

Benchmarking Policy Complexity and Agent Reasoning

The paper introduces CC-Gen, an agentic benchmark generator with controllable policy complexity. This framework systematically categorizes complexity along four axes:

  • Task-level complexity: Number and arguments of agent tasks.
  • Workflow-level complexity: Depth and branching of logical rules (e.g., nested conditional statements).
  • Environmental-level complexity: Scale and richness of accessible databases.
  • Query-level complexity: User query intricacy.
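The four axes above can be pictured as knobs on a benchmark-generator config. The sketch below is a hypothetical illustration of such a config; the field names and defaults are ours, not an API published with the paper:

```python
from dataclasses import dataclass


@dataclass
class ComplexityConfig:
    """Hypothetical knobs for a CC-Gen-style benchmark generator."""
    num_tasks: int = 5          # task-level: number of agent tasks
    args_per_task: int = 2      # task-level: arguments per task
    condition_depth: int = 1    # workflow-level: nesting depth of conditionals
    branches_per_node: int = 2  # workflow-level: branching factor
    db_tables: int = 3          # environment-level: database scale
    query_constraints: int = 1  # query-level: constraints packed into a query

    def workflow_paths(self) -> int:
        # A conditional tree of depth d with branching factor b has
        # b**d leaf paths, a rough proxy for reasoning burden.
        return self.branches_per_node ** self.condition_depth
```

Framed this way, workflow complexity is the only axis that grows the reasoning burden exponentially rather than linearly, which foreshadows the experimental findings below.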

Experiments show that environmental complexity has negligible impact on agent performance, while task and workflow complexity significantly degrade success rates. Notably, workflow complexity induces sharp performance drops, even for SOTA LLMs (Qwen-3-32B, Claude-3.5-Sonnet), revealing that logical reasoning over policy rules is a major bottleneck. The Qwen-3 series displays greater robustness to complexity than other models, suggesting model architecture and pretraining play pivotal roles in agentic reasoning.
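To see why workflow branching dominates, consider a toy refund policy written as a nested conditional. This example is ours, not from the paper: each added level of nesting multiplies the number of decision paths an agent must track from an in-context document.

```python
def refund_decision(order: dict) -> str:
    """Toy nested policy rule (depth 3). An agent reading this as prose
    must hold every branch in working memory to answer correctly."""
    if order["returned_within_days"] <= 30:
        if order["item_condition"] == "unopened":
            return "full_refund"
        if order["membership"] == "premium":
            return "full_refund" if order["price"] < 100 else "partial_refund"
        return "store_credit"
    return "deny"
```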

Policy Internalization Methodology: CAP-CPT

To address these challenges, the authors propose Category-Aware Policy Continued Pretraining (CAP-CPT), an automated pipeline for policy complexity analysis and targeted data synthesis:

  1. Policy Specification Categorization: The pipeline organizes policy rules into factual, behavioral, and conditional (simple/complex) types via LLM-driven document analysis. Complex conditionals are isolated, given their outsized impact on agent workflows.
  2. Targeted Data Generation:
    • Factual specifications: Paraphrases and QA pairs for memorization and recall.
  • Behavioral specifications: Scenario-based demonstrations that model the desired agent behavior.
    • Conditional specifications: Scenario simulation data over diverse subproblems, operationalizing policy rules into executable workflows.
    • All generated data are tagged with policy identifiers to support recall without explicit document context.
  3. Continued Pretraining (CPT): Training is performed with an autoregressive loss over curated data, enhancing “durable recall” and broad generalization of policy knowledge rather than rote memorization. Combined with supervised fine-tuning (SFT) on gold Chain-of-Thought (CoT) trajectories, this approach achieves substantial gains, particularly in data-sparse and high-complexity scenarios.
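The categorization step (step 1) can be sketched with a simple heuristic. The paper performs this with LLM-driven document analysis, so the keyword rules below are only an illustrative stand-in for routing specifications into the three categories:

```python
import re

# Crude lexical cues; the paper's pipeline uses an LLM instead.
_COND = re.compile(r"\b(if|when|unless)\b", re.IGNORECASE)
_COND_MARKERS = re.compile(r"\b(if|when|unless|and|or)\b", re.IGNORECASE)
_BEHAVIOR = re.compile(r"\b(must|should|never|always)\b", re.IGNORECASE)


def categorize_spec(spec: str) -> str:
    """Route one policy specification into factual, behavioral,
    or conditional (simple vs. complex) categories."""
    if _COND.search(spec):
        # Count condition markers as a proxy for conditional complexity.
        markers = len(_COND_MARKERS.findall(spec))
        return "conditional_complex" if markers >= 3 else "conditional_simple"
    if _BEHAVIOR.search(spec):
        return "behavioral"
    return "factual"
```

Each category then feeds the matching synthesis recipe from step 2, with complex conditionals receiving the heaviest scenario-simulation coverage.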

Empirical Results and Numerical Highlights

Extensive experiments validate the effectiveness of CAP-CPT:

  • Performance Gains: CAP-CPT + SFT improves over the SFT baseline across all data scenarios, yielding gains up to 41% (workflow complexity) and 22% (task complexity) on Qwen-3-32B. On Qwen-2.5-32B, the approach boosts performance by 44% in sparse data settings and narrows the performance gap between workflow complexity levels by up to 37%.
  • Prompt Compression: Achieves up to 97.3% reduction in input token length on synthetic benchmarks.
  • Generalization: CAP-CPT remains robust across diverse policy-centric evaluation tasks, including policy referral, substitution, override, and general instruction following, yet does not surpass the prompting baseline in every setting, exposing further research directions.
  • Real-world Application: Applied to τ-Bench with limited data, CAP-CPT improves success rate from 26.96% (prompting) to 28.70%, compressing input by 34.8%.
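The compression figures are straightforward to interpret. With illustrative token counts (not reported in the paper), a 97.3% reduction means keeping roughly 270 of every 10,000 policy tokens in context:

```python
def prompt_reduction(original_tokens: int, internalized_tokens: int) -> float:
    """Fraction of the in-context policy prompt eliminated by internalization."""
    return 1 - internalized_tokens / original_tokens

# Illustrative: 10,000 policy tokens reduced to 270 in-context tokens.
# prompt_reduction(10_000, 270) == 0.973, i.e. a 97.3% reduction.
```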

Ablation studies confirm that both scenario-simulation data and CPT loss are critical for handling complexity and outperform SFT-alone methods.

Theoretical and Practical Implications

By systematically modeling and benchmarking policy complexity, this work identifies workflow depth and logical branching as primary sources of reasoning failure in LLM agents, providing actionable insights for both LLM design and agentic system engineering. CAP-CPT's category-aware approach, relying on policy identifiers and explicit data grouping, enables scalable internalization with minimal assumptions about policy structure or domain.

From a practical perspective, computational overhead and context length constraints in agentic deployments are mitigated. Theoretical implications include improved understanding of model reasoning limits, catastrophic forgetting in SFT, and the necessity for targeted continued pretraining to accommodate evolving business requirements.

Future Directions

The limitations section highlights several research avenues:

  • Extension to multimodal, multi-turn, and intent-rich agentic benchmarks.
  • Integration with reinforcement learning for alignment under RL feedback, as in concurrent work (Tri-MPI (Wang et al., 10 Oct 2025)).
  • Advances in override/referral granularity and robustness via counterfactual training.
  • Development of prior-preserving regularizers and continual-learning safeguards to avoid negative transfer and forgetting.

These directions are both theoretically relevant for ongoing AI alignment and practically critical for scalable, adaptable agentic infrastructure.

Conclusion

This paper establishes a rigorous framework for evaluating and internalizing complex policy documents in LLM agents. By characterizing complexity, benchmarking agentic reasoning, and employing category-aware continued pretraining, it delivers substantial improvements in performance, generalization, and computational efficiency. The findings underscore the inadequacy of generic prompt compression for policy internalization and motivate future research in scalable, robust policy embedding for AI agents, ultimately enabling more reliable and context-efficient deployment in real-world applications (2510.11588).
