Performance of TALM on Large, Interconnected Codebases

Determine the performance and robustness of TALM (Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory) when applied to system-level software engineering projects involving large, interconnected codebases that span multiple classes, modules, or packages. Such settings are not represented in the HumanEval, BigCodeBench, or ClassEval benchmarks used in the study.

Background

The paper evaluates TALM on HumanEval, BigCodeBench, and ClassEval, which primarily cover function-level and class-level tasks. Although ClassEval introduces greater structural complexity than HumanEval or BigCodeBench, it is still limited to class-level scope.

The authors note that realistic software engineering often requires system-level projects involving multiple classes, modules, or packages. Because such large-scale contexts are unavailable in current benchmarks, the study does not assess TALM's performance on truly large, interconnected codebases, leaving this question explicitly open.

References

"Such large-scale contexts were not available in existing benchmarks, leaving open the question of how TALM would perform on truly large and interconnected codebases."

TALM: Dynamic Tree-Structured Multi-Agent Framework with Long-Term Memory for Scalable Code Generation (2510.23010 - Shen et al., 27 Oct 2025), Section: Limitations