Evaluating Business-Policy Adherence of Customer Support LLM Agents

Develop standardized evaluation methodologies and benchmarks to assess whether large language model–based customer support agents act in accordance with business rules and real-world support workflows. Such evaluation should rigorously measure adherence to multi-step policies and to the dependencies between tasks.

Background

The paper (Balaji et al., 2026) motivates the need for policy-aware agents in customer support by highlighting the limitations of traditional interactive voice response (IVR) systems, which enforce rigid flows but provide a poor user experience. LLM agents promise flexible, multi-turn interactions, but they must still comply with the business rules encoded in Standard Operating Procedures (SOPs).

Existing benchmarks largely emphasize tool selection or goal completion. They do not adequately measure whether agents follow required multi-step workflows with complex dependencies, or whether they remain robust to disturbances (e.g., missing inputs, tool failures). JourneyBench is introduced to address this evaluation gap using SOP graphs and a policy-adherence metric, the User Journey Coverage Score (UJCS). Its introduction underscores that reliable assessment of adherence remains a central challenge.
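
To make the idea of checking a workflow with dependencies concrete, the following is a minimal sketch, not the JourneyBench implementation and not the UJCS metric, whose definition is not given here. It assumes an SOP can be written as a mapping from each step to the steps that must precede it, and that an agent run can be reduced to an ordered list of step names. The step names, `SOP_PREREQS`, `dependency_violations`, and `adherence_fraction` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical refund SOP: each step maps to the steps that must come before it.
SOP_PREREQS: Dict[str, Set[str]] = {
    "verify_identity": set(),
    "lookup_order": {"verify_identity"},
    "check_refund_eligibility": {"lookup_order"},
    "issue_refund": {"check_refund_eligibility"},
}


def dependency_violations(trace: List[str], prereqs: Dict[str, Set[str]]) -> List[str]:
    """Return the steps in `trace` that ran before all of their prerequisites."""
    done: Set[str] = set()
    violations: List[str] = []
    for step in trace:
        missing = prereqs.get(step, set()) - done
        if missing:
            violations.append(step)
        # Record the step even if it violated the SOP, so that later steps
        # are judged only on their own direct prerequisites.
        done.add(step)
    return violations


def adherence_fraction(trace: List[str], prereqs: Dict[str, Set[str]]) -> float:
    """Illustrative score (an assumption, not UJCS): share of executed steps
    whose prerequisites were already satisfied."""
    if not trace:
        return 0.0
    return 1.0 - len(dependency_violations(trace, prereqs)) / len(trace)


compliant = ["verify_identity", "lookup_order", "check_refund_eligibility", "issue_refund"]
shortcut = ["verify_identity", "issue_refund"]  # reaches the goal but skips required steps

print(dependency_violations(compliant, SOP_PREREQS), adherence_fraction(compliant, SOP_PREREQS))
print(dependency_violations(shortcut, SOP_PREREQS), adherence_fraction(shortcut, SOP_PREREQS))
```

The `shortcut` trace illustrates the gap described above: a goal-completion check would accept it because the refund was issued, while a dependency check flags `issue_refund` as having run without its required predecessor.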

References

While LLM agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge.

Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence (2601.00596 - Balaji et al., 2 Jan 2026) in Abstract (page 1)