How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Published 24 Apr 2026 in cs.CL, cs.CY, cs.HC, and cs.SE | (2604.22750v1)

Abstract: The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

Authors (8)

Summary

The paper reveals that agentic coding tasks incur token costs up to 1000× higher than single-turn tasks due to persistent context ingestion and redundant operations.
The paper demonstrates that token efficiency varies significantly among LLMs: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consume, on average, over 1.5 million more tokens than GPT-5.
The paper shows that cost prediction using self-estimation is challenging due to substantial underestimation and weak correlation with actual token usage, impacting cost transparency.

Systematic Analysis of Token Consumption in Agentic Coding Tasks

Introduction

The proliferation of agentic coding workflows utilizing frontier LLMs has dramatically escalated token consumption, presenting both economic and technical challenges for practitioners and system designers. "How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks" (2604.22750) addresses three central questions: (1) What dominates the token expenditure in agentic coding tasks? (2) How do different LLMs compare in token efficiency under identical workloads? (3) Can token usage be reliably predicted prior to execution, enabling cost transparency and budget-aware deployment?

Token Consumption Patterns in Agentic Coding Tasks

Empirical analysis across eight frontier LLMs on SWE-bench Verified demonstrates that agentic coding tasks incur orders of magnitude higher token costs than both single-turn code reasoning (e.g., CRUXEval) and code chat (e.g., ShareGPT-style code feedback dialogs). The primary cost driver is input token accumulation due to persistent context ingestion, tool-based exploration, and repeated context reuse—even under token caching regimes.

Figure 1: Agentic coding task token costs exceed code reasoning and code chat by $1000\times$, dominated by input token growth.
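
To make the input-accumulation effect concrete, the following is a minimal sketch under a simplifying assumption that is not taken from the paper: every round re-sends the entire conversation history as input, and each round appends a fixed number of new tokens. The function name and the numbers are hypothetical.

```python
def cumulative_input_tokens(new_tokens_per_round, system_prompt_tokens=0):
    """Total input tokens processed when each round re-sends the full history.

    Simplified model (an assumption): the context only grows, and the whole
    context is counted as input on every round.
    """
    context = system_prompt_tokens
    total_input = 0
    for new_tokens in new_tokens_per_round:
        context += new_tokens   # history grows by this round's additions
        total_input += context  # the whole history is read as input again
    return total_input


# Hypothetical trajectory: 50 rounds, 2,000 new tokens per round.
rounds = [2_000] * 50
single_pass = sum(rounds)                      # tokens if read only once
agentic = cumulative_input_tokens(rounds)      # tokens when re-read every round
print(single_pass, agentic, agentic / single_pass)
```

Under this toy model the input total grows roughly quadratically with the number of rounds, which is one way to see why input tokens, not output tokens, can dominate an agentic trajectory.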

Token usage exhibits high stochasticity both across tasks and across repeated runs of the same problem; individual runs can vary by up to $30\times$ in total tokens. The heaviest tail is observed in complex tasks, which also present the largest cross-run variance—making upfront cost prediction difficult.

Figure 2: Token cost variance is substantial across problems and repeated runs, with high-cost tasks showing the greatest unpredictability.
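
One simple way to quantify run-to-run spread is the ratio between the most and least expensive run of the same task. The sketch below assumes a hypothetical log of `(task_id, total_tokens)` pairs; the paper's own variance measure may differ.

```python
from collections import defaultdict


def max_min_ratio_per_task(runs):
    """Return {task_id: max_tokens / min_tokens} for tasks with repeated runs.

    `runs` is an iterable of (task_id, total_tokens) pairs with positive
    token counts (a hypothetical log format).
    """
    by_task = defaultdict(list)
    for task_id, tokens in runs:
        by_task[task_id].append(tokens)
    # A ratio is only meaningful when a task has at least two runs.
    return {t: max(v) / min(v) for t, v in by_task.items() if len(v) > 1}


# Made-up numbers for illustration only.
example_runs = [("task-a", 100_000), ("task-a", 3_000_000),
                ("task-b", 50_000), ("task-b", 100_000),
                ("task-c", 10_000)]
print(max_min_ratio_per_task(example_runs))
```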

Cost-Accuracy Dynamics and Behavioral Patterns

Scaling token expenditure does not monotonically improve agentic task accuracy. Aggregate results show that accuracy saturates or even declines at the highest cost regimes—reflecting inefficiency, redundancy, or ineffective exploration. Intermediate cost runs often peak in accuracy, consistent with empirical findings on inverse test-time scaling laws for reasoning agents.

Figure 3: Task accuracy peaks at intermediate token cost and saturates or declines at higher costs across all evaluated models.
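
A cost-accuracy curve of this kind can be approximated by sorting runs by token cost, splitting them into equal-count bins, and computing the resolve rate per bin. This is a sketch of that idea, assuming hypothetical `(total_tokens, resolved)` records; the binning scheme used in the paper is not specified here.

```python
def accuracy_by_cost_bin(runs, n_bins=4):
    """Resolve rate per equal-count cost bin, from cheapest to most expensive.

    `runs` is a list of (total_tokens, resolved) pairs, where `resolved`
    is True when the task was solved (a hypothetical record format).
    """
    ordered = sorted(runs, key=lambda r: r[0])
    n = len(ordered)
    rates = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty bins when there are fewer runs than bins
            rates.append(sum(1 for _, ok in chunk if ok) / len(chunk))
    return rates


# Made-up runs in which accuracy peaks at intermediate cost.
toy_runs = [(1, False), (2, True), (3, True), (4, True),
            (5, True), (6, False), (7, False), (8, False)]
print(accuracy_by_cost_bin(toy_runs))
```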

Detailed behavioral analysis reveals that high-cost runs are associated with frequent repeated views and modification actions on the same files, indicating unproductive redundancy and inefficient search patterns.

Figure 4: High token-cost runs correlate with increased repeated file access operations, driving up token consumption.
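
As an illustration of how such redundancy might be measured, the sketch below computes the fraction of file actions that repeat an earlier action of the same type on the same path. The `(action_type, file_path)` trajectory format and the metric itself are assumptions, not the paper's definition.

```python
from collections import Counter


def repeated_action_rate(actions):
    """Fraction of actions that repeat an earlier (action_type, file_path).

    `actions` is a list of (action_type, file_path) tuples, for example
    ("view", "src/app.py"); this format is hypothetical.
    """
    if not actions:
        return 0.0
    counts = Counter(actions)
    # Every occurrence beyond the first of a given pair counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(actions)


trajectory = [("view", "a.py"), ("view", "a.py"),
              ("edit", "a.py"), ("view", "b.py")]
print(repeated_action_rate(trajectory))
```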

Task difficulty as rated by human experts is only weakly aligned with agent token expenditure. The rank correlation is statistically significant but modest (Kendall's $\tau_b = 0.32$), and the token-cost distributions of different difficulty levels overlap substantially.

Figure 5: Human-assessed problem difficulty fails to reliably predict agent token consumption, highlighting a misalignment.
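
For readers unfamiliar with $\tau_b$, here is a small, dependency-free sketch of Kendall's tau-b, which handles the ties that arise when difficulty is an ordinal rating. The input data are made up; only the formula is standard.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue             # tied in both variables: excluded entirely
        elif dx == 0:
            ties_x += 1          # tied only in x
        elif dy == 0:
            ties_y += 1          # tied only in y
        elif dx * dy > 0:
            concordant += 1      # pair ordered the same way in x and y
        else:
            discordant += 1      # pair ordered in opposite ways
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical difficulty ratings (ordinal) and token costs for four tasks.
difficulty = [1, 1, 2, 3]
tokens = [10, 20, 15, 30]
print(round(kendall_tau_b(difficulty, tokens), 3))
```

A value near 1 would mean that harder-rated tasks almost always cost more tokens; a value around 0.3 leaves many task pairs where the ordering disagrees.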

Token Efficiency of Frontier LLMs

Direct comparative evaluation reveals systematic differences in token efficiency between models. For identical tasks, models such as Kimi-K2 and Claude-Sonnet-4.5 consume, on average, over 1.5 million more tokens than GPT-5. These differences persist when the analysis is restricted to tasks that every model solves or that every model fails, which suggests that efficiency is primarily model-dependent rather than task-dependent.

Figure 6: Model token efficiency varies intrinsically and consistently, regardless of task success/failure.

Token-inefficient models exhibit higher rates of exploratory actions (file viewing/modification), with a larger proportion of redundant operations.

Figure 7: Fine-grained file action statistics indicate consistent inefficiency patterns across models.

Cost Dynamics Across Execution Phases

Phase-level and round-level analyses using Claude Sonnet-4.5 demonstrate that input tokens (especially cache-read tokens) dominate both total token volume and dollar cost throughout the agentic workflow. Despite steep discounts on cache reads, their cumulative volume can outweigh output token costs.

Figure 8: Input tokens—primarily cache reads—are the principal contributors to agentic task cost across all phases.
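
The arithmetic behind this observation can be sketched as follows. The per-million-token prices and the usage numbers are illustrative placeholders chosen for this example, not figures from the paper or a quoted price list; the point is only that a heavily discounted category can still dominate when its volume is large enough.

```python
def cost_breakdown(usage, price_per_mtok):
    """Dollar cost per token category, given prices per million tokens."""
    return {k: usage[k] * price_per_mtok[k] / 1_000_000 for k in usage}


# Hypothetical prices (USD per million tokens); cache reads are discounted 10x.
prices = {"input": 3.0, "cache_read": 0.3, "output": 15.0}

# Hypothetical usage for one trajectory: cache reads dominate the volume.
usage = {"input": 50_000, "cache_read": 2_000_000, "output": 20_000}

costs = cost_breakdown(usage, prices)
print(costs, round(sum(costs.values()), 2))
```

With these made-up numbers, cache reads cost more in total than output tokens even though each cache-read token is fifty times cheaper than an output token.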

Cost spikes in trajectory execution are attributable to discrete actions such as new file views and test execution, which add new context.

Figure 9: Round-level analysis demonstrates stable cache-read costs with expense spikes from specific actions.

Pre-execution Token Usage Prediction: Feasibility and Limitations

In the self-prediction experiments, agents are asked to estimate their own token consumption before execution, with access to their full tool-calling and analysis capabilities. The resulting predictions correlate only weakly to moderately with actual token usage (at most $\sim 0.39$ across tested models) and systematically underestimate it, especially for input tokens. Prediction overhead is generally lower than execution cost, but spending more tokens on prediction does not reliably yield higher prediction accuracy.

Figure 10: Prediction performance is modest; prediction overhead varies across models and is not strictly associated with accuracy.

Calibration plots confirm persistent underestimation of token usage for both input and output, regardless of model or in-context demonstration.

Figure 11: Agents systematically underestimate both input and output tokens in self-prediction.
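
A self-prediction experiment of this kind can be scored with two simple quantities: a correlation between predicted and actual token counts, and a measure of how often and how far predictions fall short. The sketch below uses Pearson correlation as an assumption, since the type of correlation is not stated here, and the numbers are made up.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def evaluate_predictions(predicted, actual):
    """Summarize agreement and bias between predicted and actual tokens."""
    pairs = list(zip(predicted, actual))
    return {
        "pearson_r": pearson(predicted, actual),
        # Mean of predicted/actual; values below 1 indicate underestimation.
        "mean_pred_over_actual": mean(p / a for p, a in pairs),
        # Share of tasks where the prediction was below the actual usage.
        "frac_underestimated": mean(1.0 if p < a else 0.0 for p, a in pairs),
    }


# Made-up example: predictions are well ordered but consistently too low.
report = evaluate_predictions([100, 200, 300], [200, 400, 600])
print(report)
```

The example shows why correlation and calibration need to be reported separately: predictions can rank tasks perfectly and still underestimate every one of them by half.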

Output Token Analyses

Supplementary analyses demonstrate that results using output tokens mirror those found for input tokens: higher output token expenditure correlates with reduced accuracy and increased redundant file operations.

Figure 12: Excess output tokens do not improve accuracy and are linked to inefficient file interaction patterns.

Implications and Future Directions

The findings highlight several important implications:

Transparency Gap: The unpredictability and stochasticity of token consumption in agentic workflows make fixed-price or upfront cost estimation currently impractical. Coarse-grained self-prediction can provide budgeting signals but lacks precision for instance-level pricing.
Runtime Efficiency and Control: Model-specific inefficiencies necessitate deployment of budget-aware tool-use and runtime token constraint policies for cost control (Liu et al., 21 Nov 2025, Lin et al., 30 Oct 2025). More adaptive stopping and dynamic compute strategies are essential.
Benchmarking and Strategy Design: Accurate agent benchmarking must account for token efficiency and behavioral redundancy, not just success rate. Energy-per-token and Green AI metrics merit further integration (Schwartz et al., 2019, Alomrani et al., 2 Jul 2025).
Research Directions: Enhancements in agent memory management (Hu et al., 15 Dec 2025, Xiong et al., 21 May 2025), dynamic context evolution (Wan et al., 9 Oct 2025), and execution trajectory diagnosis (Barke et al., 2 Feb 2026) are promising avenues. Formal modeling of cost variance and agent self-modeling may improve prediction reliability.
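
As a rough illustration of the runtime token constraints mentioned under Runtime Efficiency and Control, the sketch below stops an agent loop once cumulative usage crosses a fixed budget. The class, the function, and the per-round usage tuples are hypothetical; the cited works describe their own, more adaptive policies.

```python
class TokenBudget:
    """Tracks cumulative token usage against a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens, output_tokens):
        """Record one round's usage; return True while within the limit."""
        self.used += input_tokens + output_tokens
        return self.used <= self.limit


def run_with_budget(step_usages, limit):
    """Replay per-round (input, output) usage, stopping at the budget.

    Returns (rounds completed within budget, total tokens charged). The
    round that crosses the limit is charged but not counted as completed.
    """
    budget = TokenBudget(limit)
    completed = 0
    for input_tokens, output_tokens in step_usages:
        if not budget.charge(input_tokens, output_tokens):
            break
        completed += 1
    return completed, budget.used


print(run_with_budget([(100, 10), (200, 10), (400, 10)], limit=400))
```

A hard cutoff like this is the simplest possible policy; it bounds spend but cannot tell a productive long run from a redundant one.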

Conclusion

This systematic study elucidates the sources, dynamics, and predictability limits of token consumption in agentic coding settings. Token costs are dominated by input context, highly variable, and only weakly predictable both by models themselves and by expert perception. Model-level behavioral patterns drive efficiency differences independent of task difficulty. While current agents can provide coarse-grained cost signals, robust upfront prediction of execution tokens remains an open challenge, warranting continued research into adaptive and controllable agent deployments, runtime cost constraints, and transparent pricing schemes for real-world applications.