Lita: Minimal Agentic Framework
- Lita is a minimal agentic framework that evaluates LLMs on coding tasks while stripping away complex workflow scaffolding.
- It utilizes a streamlined toolset and model-predicted decision-making to surface intrinsic reasoning and error recovery.
- Empirical results show that Lita achieves competitive performance with lower token consumption and reduced design complexity.
Lita denotes a minimal agentic framework for LLMs, operationalizing principles of “liteness” to expose the intrinsic agentic coding competence of contemporary models while minimizing manual workflow engineering and design complexity. Designed to unify and simplify the evaluation of LLMs on software engineering tasks, Lita omits elaborate prompt scaffolding, prescriptive multi-step workflows, and excessive tool integration. Its core aims are to enable fairer, more faithful evaluation of agentic abilities, reduce methodological confounds associated with workflow tuning, and provide a lightweight yet fully autonomous agent interface for code-based tasks (Dai et al., 30 Sep 2025).
1. Motivation and Design Philosophy
The Lita framework is motivated by the observation that existing agentic coding systems often depend heavily on custom pipelines, prompt templates, and manual environment handling. These workflow-based or over-engineered agentic systems introduce three primary challenges:
- Fairness: Prompt-tuning and pipeline engineering frequently advantage particular models or datasets, confounding head-to-head evaluation.
- Truthfulness: Rich scaffolding may obscure the LLM’s own planning, debugging, and error recovery competence, making reported benchmark success partially an artifact of the agent's intervention.
- Efficiency: Engineered agent pipelines are costly to build and maintain, increase token consumption, and risk data leakage via benchmark-specific details.
Lita proposes a paradigm of “liteness”—a minimal agentic principle—characterized by minimal system complexity, model-agnostic interaction, autonomy from explicit workflow encoding, and a focus on surfacing model-internal reasoning rather than external orchestration.
A core theoretical insight is the “Agent Complexity Law”: as LLM capability increases, the marginal performance difference between complex, workflow-heavy agents and minimal agents shrinks, converging to zero for sufficiently strong models:

$$\lim_{s \to \infty} \big[\, P(c_1, s) - P(c_2, s) \,\big] = 0 \quad \text{for } c_1 > c_2,$$

where $P(c, s)$ is the task success rate for agent complexity $c$ using model strength $s$ (Dai et al., 30 Sep 2025).
2. Architectural Structure and Workflow
Lita’s architecture is modular, comprising three conceptual subsystems:
- Tools: The minimal action set enabling file-system and execution-environment manipulation. Only essential “action” tools are supplied: `Editor(path, diff)`, `Terminal(command)`, `Search(query, path)`, and `Finish()`. Additionally, “thinking” tools record explicit reflection (`Think(…)`) and planning steps (`Plan(…)`).
- Reasoning: LLM-driven reasoning emerges through “Think” and “Plan” invocations, with the LLM deciding tool sequences based solely on current and past state, absent any engineered control logic.
- Memory: A linear concatenation of previous interactions is retained; optionally, a `Summarize()` tool allows context condensation, but by default no hierarchical or replay memory is provided.
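A compact rendering of this toolset as function-calling descriptors can make the minimal action set concrete; the field layout below is an illustrative assumption, not Lita’s actual schema:

```python
# Illustrative tool schema in the style of JSON function-calling descriptors:
# the four "action" tools plus the two "thinking" tools described above.
TOOLS = [
    {"name": "Editor",   "params": ["path", "diff"]},
    {"name": "Terminal", "params": ["command"]},
    {"name": "Search",   "params": ["query", "path"]},
    {"name": "Finish",   "params": []},
    {"name": "Think",    "params": ["text"]},  # records explicit reflection
    {"name": "Plan",     "params": ["text"]},  # records planning steps
]
```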
The Lita agent loop is fully next-action-prediction driven:
```
Initialize: memory m ← [ ], environment E, tool set T
For t in 1 .. T_max:
    Compose prompt:
        - System: "You are Lita, a lite coding agent…"
        - Tool schema T
        - History m
        - Current state from E
    a ← LLM_CALL(prompt)            # tool invocation
    If a.name == Finish: break
    Execute a in E; observe result r
    Append "Agent invoked a, got result r" to m
```
No conditional routing, workflow handlers, or hard-coded control flow are imposed; all agentic behavior is model-predicted (Dai et al., 30 Sep 2025).
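The loop above can be sketched as a small, self-contained Python driver. `llm_call` and `execute` are stand-in stubs (the real system would call an LLM and run the chosen tool in the environment); all names and signatures here are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                                  # tool name, e.g. "Terminal" or "Finish"
    args: dict = field(default_factory=dict)   # tool arguments

def llm_call(prompt: str) -> Action:
    # Stub standing in for the model's next-action prediction;
    # it finishes immediately so the sketch runs as-is.
    return Action(name="Finish")

def execute(action: Action) -> str:
    # Stub standing in for running the chosen tool in the environment.
    return "ok"

def run_lita(task: str, max_turns: int = 100) -> list:
    memory = []                                # linear history of interactions
    for _ in range(max_turns):
        prompt = "\n".join([
            "You are Lita, a lite coding agent…",  # system prompt
            f"Task: {task}",
            *memory,                               # full interaction history
        ])
        action = llm_call(prompt)
        if action.name == "Finish":            # model-predicted termination
            break
        result = execute(action)
        memory.append(f"Agent invoked {action.name}, got result {result}")
    return memory
```

Note that the only control flow is the `Finish` check: every other decision is delegated to the model’s next-action prediction.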
System “intrinsic complexity” is quantified by:

$$C = N_{\text{tools}} + \lambda\, N_{\text{tokens}},$$

where $N_{\text{tools}}$ is the number of supported tools, $N_{\text{tokens}}$ is the number of system-level tokens, and $\lambda$ is a scaling parameter (e.g., $1/1000$).
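Reading the metric as tool count plus a scaled system-token count (an assumed additive form, $C = N_{\text{tools}} + \lambda N_{\text{tokens}}$ with $\lambda = 1/1000$), it can be computed directly; the counts below are invented for illustration:

```python
def intrinsic_complexity(n_tools: int, n_system_tokens: int,
                         lam: float = 1 / 1000) -> float:
    """Tool count plus scaled system-level token count (assumed additive form)."""
    return n_tools + lam * n_system_tokens

# Invented illustrative counts for two scaffolds:
minimal = intrinsic_complexity(n_tools=1, n_system_tokens=500)    # Terminal-only
heavy = intrinsic_complexity(n_tools=12, n_system_tokens=8000)    # workflow-heavy
print(minimal, heavy)  # 1.5 20.0
```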
3. Evaluation Methodology and Experimental Design
Lita was systematically benchmarked using agentic conversions of canonical code evaluation datasets:
- HumanEval: Function-level code completion.
- Aider Polyglot: Intermediate-difficulty, multilingual code generation.
- SWE-Bench Verified: Real-world bug-fixing tasks (high difficulty).
Each benchmark instance is formulated for agentic interaction, including an initial state (file system snapshot), task description, expected output state, and validation steps for code execution or testing.
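One natural way to carry these four components is a small record type; the field names and contents below are illustrative, not the benchmarks’ actual schema:

```python
from dataclasses import dataclass

@dataclass
class AgenticInstance:
    task_id: str
    initial_state: dict      # file path -> contents (filesystem snapshot)
    task_description: str
    expected_state: dict     # files after a successful solution
    validation_cmd: str      # command run to validate, e.g. a test suite

# A toy instance (contents invented for illustration):
inst = AgenticInstance(
    task_id="demo-001",
    initial_state={"solver.py": "def add(a, b):\n    return a - b\n"},
    task_description="`add` should return the sum of its arguments.",
    expected_state={"solver.py": "def add(a, b):\n    return a + b\n"},
    validation_cmd="pytest tests/test_solver.py",
)
```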
Evaluation encompasses both proprietary (GPT-4.1, GPT-5, Claude Opus 4) and open-source LLMs (Qwen3-Coder 480B, 30B). Agents for comparison include:
- Workflow-guided (Aider on Polyglot)
- Agentic (OpenHands, mini-SWE-agent)
- Lita variants: “Lita” (full tools), “Lita-diff” (diff-based editor), “Lita-mini” (Terminal only), “Lita-reason” (Terminal, Think, Plan).
Metrics reported include:
- Polyglot: pass@1 (early and after 50 turns), diff-format adherence, input/output token usage, cost (USD).
- SWE-Bench: percentage resolved within 100 iterations, token usage, and cost.
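In this single-run agentic setting, pass@1 reduces to the fraction of instances whose final submission passes validation; a small helper (an illustration, not the benchmarks’ actual scorer) makes the metric concrete:

```python
def pass_at_1(results: list) -> float:
    """Fraction of tasks whose single attempt passed validation."""
    return sum(results) / len(results) if results else 0.0

print(pass_at_1([True, True, False, True]))  # 0.75
```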
4. Empirical Findings
Performance Across Benchmarks
| Scaffold | LLM | Polyglot 50T Pass@1 (%) | SWE-Bench Verified Solved (%) | Input Tokens (M) | Cost (\$) |
|---|---|---|---|---|---|
| Lita | Claude Opus 4 | 96.4 | 62.6 | 20.7 | 376.2 |
| OpenHands | Claude Opus 4 | 95.4 | 67.8 | 34.2 | 587.9 |
| Lita | GPT-5 | 96.0 | – | 7.0 | 15.4 |
| OpenHands | GPT-5 | 96.8 | – | 15.6 | 27.8 |
Key empirical results:
- On strong models (e.g., Claude Opus 4, GPT-5), Lita achieves competitive—often state-of-the-art—performance with 30–50% reduced input tokens and lower overall design complexity compared to workflow-based or complex agentic baselines (Dai et al., 30 Sep 2025).
- On Polyglot, minimal agentic Lita matches or surpasses more intricate scaffolds, especially in later stages of iterative completion (“50 turns”), showing that LLMs recover autonomously from initial mistakes.
- For SWE-Bench, Lita’s solved rate is ≈10% below that of the most elaborate agent, OpenHands, but this gap diminishes as model strength increases. For some top-tier models, Lita-mini (Terminal-only) approaches full-agent scores.
Agent Complexity Law
The observed reduction in performance gap between minimal and complex agents as LLM capability grows is formalized as:
$$\lim_{s \to \infty} \big[\, P(c_1, s) - P(c_2, s) \,\big] = 0 \quad \text{for } c_1 > c_2,$$

capturing that, for sufficiently capable models, manual design complexity provides vanishing returns in agentic coding tasks (Dai et al., 30 Sep 2025).
5. Theoretical and Methodological Contributions
Lita introduces a precise framework for evaluating agentic capacity with minimal confounding factors:
- Decoupled evaluation: Tools and prompt schema are model- and benchmark-agnostic, supporting fairer cross-model comparison.
- Intrinsic agentic competence: The design surfaces the emergent planning, error recovery, and decision sequencing abilities of the LLM rather than those imposed by workflow engineering.
- Quantitative “liteness” metric: Adds a formal basis for analyzing agent complexity relative to model strength. Empirical evidence substantiates the conjecture that model improvement diminishes the advantage of additional agent design.
The workflow also establishes a template for constructing minimal agentic evaluations in other domains, by translating environmental preconditions, user goals, and system affordances into compact tool/action schemas and interpretable environment states (Dai et al., 30 Sep 2025).
6. Limitations and Scope of Applicability
Current Lita evaluations are restricted to agentic coding for single-repo workflows, unit tests, and resolution of atomic bug-fixing or implementation tasks. The framework—by design—omits:
- Retrieval-augmented generation
- Web search or external knowledge integration
- Multi-agent dialogue or inter-agent collaboration
The benchmarks themselves probe only a controlled subset of real-world agentic engineering. Performance on multi-repository refactoring, long-horizon project management, continuous integration tasks, or collaborative version control remains unvalidated. Lita presently does not target non-coding agentic settings or scenarios requiring human-in-the-loop evaluation.
A further noted limitation is the absence of explicit user experience considerations and the exclusion of advanced retrieval/interface extensions (e.g., web-based tool augmentation or hierarchical memory) (Dai et al., 30 Sep 2025).
7. Implications and Future Directions
Lita exemplifies a broader trend in AI systems research: as LLMs become increasingly robust, the cost and value of elaborate agentic scaffolding decline, motivating a shift toward unified, lightweight evaluation of autonomy and tool-use. The Lita paradigm indicates that:
- Minimal, well-specified toolsets and unadorned prompts suffice to expose the limits and competencies of current LLM agentic architectures.
- Substantial reductions in token consumption (30–70%) and engineering effort are achievable without appreciable loss in agentic performance, especially for the most advanced models.
- Future agent frameworks may focus more on the challenging design of environments, tool affordances, and task structure, rather than on complex orchestration mechanisms.
- The Agent Complexity Law suggests that model-agnostic, intrinsic evaluation will become increasingly accurate—and economical—for benchmarking universal agency as LLMs attain higher levels of cognitive competence (Dai et al., 30 Sep 2025).
A plausible implication is that as models approach the “perfect model” limit, agent design will become commoditized, with emphasis shifting from workflow optimization to environment and tool evaluation. This suggests opportunities for extending Lita-like frameworks to non-coding domains, as well as more comprehensive studies on generalization, robustness, and scaling in minimal agentic settings.