Credit horizon selection in LLM-based generative optimization

Determine how many steps of a multi-step agentic system's execution trace to include in the learning context (the credit horizon) when using large language model–based generative optimization. In particular, assess whether the optimizer should update the system from immediate per-step feedback or only after all feedback for the full multi-step process has been observed.

Background

The paper defines a "learning loop" for LLM-based generative optimization and highlights that, in multi-step tasks, engineers must choose how much of an execution trace to provide to the optimizer in the learning context.

This choice, termed the credit horizon, is critical because different horizons can change optimization behavior and outcomes. The authors illustrate the issue with Atari game-playing, where optimizing for immediate rewards sometimes suffices while other tasks require longer traces, underscoring that this is a genuine design decision without a known universal rule.
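
To make the design decision concrete, the following is a minimal sketch of how a credit horizon could be applied when building the learning context. It is not the paper's implementation: the `Step` record, the `select_horizon` and `build_learning_context` functions, and the text format are all hypothetical names introduced here for illustration, and the sketch assumes the trace is a simple list of per-step records with scalar feedback.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One step of a hypothetical execution trace (illustrative structure)."""
    observation: str
    action: str
    feedback: float  # assumed scalar feedback, e.g. a per-step reward


def select_horizon(trace: List[Step], horizon: Optional[int] = None) -> List[Step]:
    """Return the most recent `horizon` steps; None selects the full trace."""
    if horizon is None:
        return list(trace)
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    # Slicing tolerates a horizon longer than the trace and returns all steps.
    return trace[-horizon:]


def build_learning_context(trace: List[Step], horizon: Optional[int] = None) -> str:
    """Format the selected steps as plain text for the optimizer's prompt."""
    steps = select_horizon(trace, horizon)
    offset = len(trace) - len(steps)  # keep original step indices in the output
    lines = [
        f"step {offset + i}: obs={s.observation} action={s.action} feedback={s.feedback}"
        for i, s in enumerate(steps)
    ]
    return "\n".join(lines)
```

Under these assumptions, `horizon=1` corresponds to optimizing on instantaneous feedback, while `horizon=None` waits for the full multi-step trace. Intermediate values trade context length against how much delayed feedback the optimizer can see.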

References

More generally, it is unclear how many steps of the process's execution trace should be included in the learning context for the optimizer. Should we optimize the agent for instantaneous feedback, or should we only optimize it until all feedback in the multi-step process has been observed?

— Understanding the Challenges in Iterative Generative Optimization with LLMs (2603.23994 - Nie et al., 25 Mar 2026) in Learning Context (Credit Horizon), Section 2
