
To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models

Published 16 Oct 2025 in cs.LG | (2510.14826v1)

Abstract: State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any "truly long-form" generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.

Summary

  • The paper demonstrates that interactive tool-use enables SSMs to generalize to arbitrarily long sequences by overcoming fixed-size memory constraints.
  • It provides a rigorous theoretical framework and extensive empirical validation across coding, arithmetic, and reasoning tasks.
  • The study highlights that multi-turn interactions with external memory tools are essential for achieving Turing-complete computation and length generalization in long-form settings.

Tool-Use Enables Length Generalization in State Space Models

Introduction

This paper rigorously investigates the limitations and capabilities of State Space Models (SSMs) for long-form sequence modeling, particularly in the context of interactive tool-use. SSMs, such as Mamba, have emerged as efficient alternatives to Transformers due to their fixed-size memory and linear computational scaling. However, the authors provide a formal negative result: SSMs are fundamentally unable to solve "truly long-form" generation tasks when operating in standalone or single-turn tool-use settings, regardless of chain-of-thought (CoT) reasoning. This limitation arises from the bounded memory inherent to SSMs, which restricts their expressive power as task complexity and output length grow.

The central contribution is a theoretical and empirical demonstration that interactive tool-use—where the model can issue multiple commands to external memory or computational tools—enables SSMs to generalize to arbitrary sequence lengths and problem complexities. The paper formalizes this claim, proves that interactive tool-use is both necessary and sufficient for SSMs to achieve length generalization on any tractable long-form generation task, and validates these findings across arithmetic, reasoning, and coding domains.

Theoretical Framework

The authors define long-form generation tasks as those where the effective number of possible outputs grows with problem complexity. Examples include multi-digit arithmetic, sorting, and code fixing. They introduce the Generalized State Space Model (GSSM) abstraction, encompassing SSMs, RNNs, and Transformers with local attention, and analyze their behavior under three regimes: CoT-only, single-turn tool-use, and interactive tool-use.

A key theorem establishes that GSSMs cannot solve long-form generation tasks in the CoT-only or single-turn tool-use settings. Specifically, for any coverage parameter α, there exists a problem complexity n₀ such that the error rate of any GSSM is at least 1−α for all n ≥ n₀. This result is proved by showing that the fixed-size state space cannot encode the exponentially growing output space required for long-form tasks.

Conversely, the authors prove that with interactive tool-use—specifically, access to an external memory tool with read/write and pointer operations—GSSMs can simulate a Turing machine and thus achieve length generalization. They construct training distributions that cover the necessary state-symbol transitions and show that a simple learning algorithm can generalize to arbitrary problem lengths, provided sufficient coverage in the training data.
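To make the construction concrete, here is a minimal Python sketch of the idea (our own illustration with hypothetical names, not the paper's implementation): a constant-size controller, standing in for a trained GSSM, drives an external pointer-based tape tool via read/write/move commands and thereby simulates a Turing machine. Because the controller's transition table is fixed, input length never strains its internal state.

```python
class TapeTool:
    """External memory tool: pointer-based read/write over an unbounded tape."""
    def __init__(self, content):
        self.cells = dict(enumerate(content))
        self.pos = 0

    def read(self):
        return self.cells.get(self.pos, "_")   # "_" denotes a blank cell

    def write(self, sym):
        self.cells[self.pos] = sym

    def move(self, direction):
        self.pos += 1 if direction == "R" else -1

def run_tm(delta, tape, state="q0", max_steps=10_000):
    """Drive the tape tool with a fixed transition table delta:
    (state, read_symbol) -> (next_state, write_symbol, move)."""
    for _ in range(max_steps):
        if state == "halt":
            break
        state, sym, mv = delta[(state, tape.read())]
        tape.write(sym)
        tape.move(mv)
    return tape

# Binary increment, least-significant bit first: flip trailing 1s to 0,
# then write 1 at the first 0 or blank.
delta = {
    ("q0", "1"): ("q0", "0", "R"),
    ("q0", "0"): ("halt", "1", "R"),
    ("q0", "_"): ("halt", "1", "R"),
}

tape = run_tm(delta, TapeTool("111"))          # 7 (LSB-first), plus one
result = "".join(tape.cells.get(i, "_") for i in range(4))
print(result)                                  # "0001", i.e. 8 LSB-first
```

The controller here has three transition-table entries regardless of input length; all length-dependent storage lives in the tool, which mirrors the paper's separation of bounded model state from unbounded external memory.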

Experimental Validation

The empirical section evaluates SSMs (Mamba, LSTM, GRU) and Transformer baselines (Pythia, Mistral) on synthetic and realistic tasks, using pointer-based and search-based memory tools.

Coding Task: Tool-Use Trajectories

The authors finetune Mamba and Pythia on trajectories collected from three types of tool-use agents for a code fixing task: single-turn, interactive, and agent distillation. Models are trained on codebases of up to 16 files and evaluated on larger codebases (Figure 1).

Figure 1: Mamba and Pythia finetuned on tool-use agent trajectories for code fixing; Mamba generalizes to larger codebases when trained on interactive agent data, outperforming Pythia in out-of-distribution settings.

Results show that while Pythia outperforms Mamba on small codebases and single-turn tool-use, Mamba exhibits superior extrapolation to larger codebases when trained to imitate interactive agents. This supports the theoretical claim that interactive tool-use is essential for length generalization in SSMs.

Arithmetic Tasks: Multi-Digit Addition and Multiplication

Models are trained on synthetic trajectories for multi-digit addition and multiplication using pointer-based memory tools. For addition, Mamba and LSTM trained on 5-digit examples generalize perfectly to 1,000-digit addition, whereas Transformers fail to extrapolate (Figure 2).

Figure 2: Left: Interactive tool-use agent trajectory for multi-digit addition with pointer-based memory. Right: Accuracy of SSMs and Transformers on addition tasks, showing SSMs generalize to 1,000 digits while Transformers do not.

Similar results are observed for multiplication, with SSMs maintaining high accuracy on numbers orders of magnitude longer than those seen during training.
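The addition result can be illustrated with a small sketch (our own construction, with hypothetical names; not the paper's exact trajectory format): the controller keeps only a single carry digit as persistent state, while every digit access goes through pointer reads into external memory, so the same procedure scales unchanged from 5-digit to 1,000-digit operands.

```python
def add_with_pointers(a: str, b: str) -> str:
    mem = {"a": a, "b": b}                     # external memory (the context)

    def read(name, i):                         # pointer read: i-th digit from the right
        s = mem[name]
        return int(s[-1 - i]) if i < len(s) else 0

    carry, out = 0, []
    for i in range(max(len(a), len(b))):       # one pointer step per digit
        total = read("a", i) + read("b", i) + carry
        out.append(str(total % 10))            # emit one output digit
        carry = total // 10                    # the only persistent state
    if carry:
        out.append("1")
    return "".join(reversed(out))

print(add_with_pointers("999", "1"))                      # 1000
n = "7" * 1000                                            # a 1,000-digit operand
print(add_with_pointers(n, n) == str(int(n) + int(n)))    # True
```

Nothing in the loop depends on operand length except the number of pointer steps, which is the mechanism the paper credits for length generalization in arithmetic.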

Reasoning Tasks: Logical Graphs and Tower of Hanoi

The logical graph task uses a search-based memory tool, and SSMs generalize from graphs with 10 nodes (training) to 1,000 nodes (testing) with near-perfect accuracy (Figure 3). For Tower of Hanoi, which has exponentially growing output length, SSMs show more limited but still significant length generalization, with performance sensitive to the initialization seed (Figure 5).

Figure 3: Illustration of logical graph reasoning and code fixing tasks, highlighting the graph-based structure and synthetic problem generation.

Figure 4: Example logical reasoning graph and its encoding.

Figure 5: Performance variability across seeds for Mamba on Tower of Hanoi, indicating sensitivity in length generalization for tasks with exponential output growth.

Aggregate Results

Figure 6: Length generalization performance of SSMs and Transformers on multiplication, logical graph, and Tower of Hanoi tasks; SSMs consistently outperform Transformers in out-of-distribution extrapolation.

Figure 7: Multiplication generalization for Mamba across training configurations, showing stable accuracy as sequence length increases with sufficient training coverage.

Ablations and Task Mixtures

Ablation studies confirm that CoT-only, reversed answer, and single-turn tool-use settings do not yield length generalization. Only interactive tool-use with sufficient coverage in training enables SSMs to generalize. Task mixture experiments show that co-training with structurally related auxiliary tasks (e.g., addition for multiplication) can improve generalization under limited training budgets.

Realistic Coding Agent Evaluation

The code fixing task is further evaluated using trajectories from SWE-agent-LM-32B, demonstrating that Mamba maintains higher pass rates and longer sequence handling than Transformer baselines as codebase complexity increases (Figure 8).

Figure 8: Pass rate and median sequence length for SWE-agent-LM-32B on the code fixing task, illustrating degradation with increasing codebase size.

Implications and Future Directions

The findings have significant implications for the design of efficient, scalable language modeling systems. SSMs, when equipped with interactive tool-use, can overcome their memory bottleneck and achieve both efficiency and accuracy in long-form generation tasks. This positions SSMs as strong candidates for agentic and tool-based applications, such as coding assistants, mathematical reasoning agents, and complex search systems.

The analysis also highlights the importance of evaluating model architectures in system-level, agentic contexts rather than as standalone predictors. As LLMs are increasingly deployed as agents interacting with external tools, the synergy between model architecture and tool-use becomes critical.

Future research should focus on developing pretrained SSMs with robust tool-use capabilities, exploring hybrid architectures that combine SSM efficiency with Transformer expressivity, and formalizing training paradigms that maximize coverage of state-symbol transitions for length generalization.

Conclusion

This paper establishes that SSMs are fundamentally limited in standalone and single-turn tool-use settings for long-form generation tasks due to bounded memory. However, interactive tool-use enables SSMs to generalize to arbitrary sequence lengths and complexities, matching or exceeding Transformer performance in agentic settings. Theoretical proofs and extensive experiments across arithmetic, reasoning, and coding domains validate these claims. The work encourages a shift toward system-level evaluation and development of tool-augmented SSM agents, with broad implications for efficient, scalable AI systems.


Explain it Like I'm 14

Plain-language summary of “To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models”

Overview

This paper studies two kinds of AI models that read and write long sequences of text: Transformers (like many big chatbots) and State Space Models (SSMs, like Mamba). Transformers are powerful but slow and costly when the text gets long. SSMs are much faster on long text, but they have a big weakness: their memory is fixed-size, so they struggle with truly long, multi-step problems.

The main idea of the paper is simple: if you let SSMs use external tools—like a notebook they can read and write, a search function, or a code runner—they can “think” step-by-step with help and handle much longer and harder problems than before. With the right kind of training, SSMs can keep solving bigger versions of the same problem far beyond what they saw during training. The authors call this length generalization.

What questions did the paper ask?

  • Can SSMs, with their fixed memory, solve tasks that get longer and more complicated as the input grows?
  • If not, can giving SSMs interactive access to tools (like an external memory or filesystem) remove this limit?
  • Does this work not just in theory, but also in practice, on tasks like big-number math, logic, and coding?

How did they study it?

The authors combine theory (mathematical proofs) and experiments.

Theory in everyday terms

  • Think of an SSM as a person with a small notepad (fixed memory). If you ask them to write a very long essay or solve a long puzzle where the answer gets longer as the task grows, they eventually run out of space. The paper proves this limitation formally: without outside help, SSMs cannot reliably solve “truly long-form” tasks just by thinking to themselves (even if they write a step-by-step solution, called Chain-of-Thought).
  • Now give the SSM a set of tools—like a big external notebook it can read and write, and allow it to use these tools interactively (not just once). The paper proves that with the right training examples, an SSM can learn to use these tools to solve any task that is computable and generalize to much longer versions—meaning it can handle bigger inputs or more steps than it saw during training.

Key settings they compare:

  • CoT-only: the model “thinks out loud” but has no tools.
  • Single-turn tool-use: the model can call a tool once.
  • Interactive tool-use: the model can use tools many times, switching between “thinking,” using tools, and writing the final answer.

Result from the proofs:

  • CoT-only and single-turn tool-use aren’t enough; SSMs still hit a wall on long tasks.
  • Interactive tool-use breaks the wall and enables length generalization.
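The interactive regime above can be sketched as a minimal agent loop (illustrative only; every name here is ours, not the paper's API): the policy alternates between tool calls and a final answer, with observations appended to the transcript after each call. In contrast, the single-turn regime would permit at most one call before answering.

```python
def interactive_loop(policy, tools, task, max_turns=100):
    """Interactive tool-use: tool calls and observations interleave until
    the policy emits a final answer."""
    transcript = [("task", task)]
    for _ in range(max_turns):
        kind, payload = policy(transcript)
        if kind == "answer":
            return payload
        name, arg = payload                    # a tool call, e.g. ("read", 3)
        transcript += [("call", payload), ("obs", tools[name](arg))]
    raise RuntimeError("turn budget exhausted")

# Toy task: reverse a string the policy never holds whole; each character
# is fetched through a "read char at index" tool, keeping policy state tiny.
def make_policy(length):
    def policy(transcript):
        seen = [p for k, p in transcript if k == "obs"]
        if len(seen) < length:
            return "call", ("read", length - 1 - len(seen))
        return "answer", "".join(seen)
    return policy

text = "interactive"
answer = interactive_loop(make_policy(len(text)),
                          {"read": lambda i: text[i]}, "reverse the input")
print(answer)                                  # evitcaretni
```

The loop structure, not the policy's capacity, is what lets the output length scale with the task.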

Experiments in everyday terms

The authors train models to follow step-by-step demonstrations that include tool commands. They test on much larger problems than the ones seen during training.

Tools they provide to the models:

  • A “pointer-based memory”: like a cursor that can move left/right across the whole conversation and read tokens (helps with long arithmetic and step tracking).
  • A “search tool”: find all occurrences of a string in the context (helps with logic tasks).
  • Real coding tools: shell commands to read/edit files and run code (helps fix bugs across many files).
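The search tool described above can be sketched in a few lines (hypothetical naming and interface; the paper's actual tool may differ): it returns every position where a query occurs in the long context, so the model can locate a definition without scanning the whole context in its own state.

```python
def search_all(context: str, query: str) -> list[int]:
    """Return every index at which query occurs in context."""
    hits, start = [], 0
    while (i := context.find(query, start)) != -1:
        hits.append(i)
        start = i + 1
    return hits

# Locating all uses of gate "g1" in a logic-task context:
ctx = "g3 = AND(g1, g2); g4 = OR(g3, g1)"
print(search_all(ctx, "g1"))                   # [9, 30]
```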

Tasks they tried:

  • Multi-digit addition and multiplication: Learn the standard paper-and-pencil methods. Models trained on small numbers (like 5–10 digits) are tested on much bigger ones (up to 1,000 digits).
  • Logical graphs: Read a small program that describes a network of logical operations and compute the final result.
  • Tower of Hanoi: Output the sequence of moves to solve the puzzle as the number of disks grows.
  • Coding: Fix a bug that requires editing multiple files in a codebase. They collected real tool-use traces from coding agents and trained models to imitate them.
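Tower of Hanoi stands out in this list because its required output grows exponentially with the number of disks. A standard recursive generator (our illustration, not the paper's code) makes the scaling explicit:

```python
def hanoi(n, src="A", aux="B", dst="C"):
    """Yield the optimal move sequence for n disks: 2**n - 1 moves."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)     # park n-1 disks on aux
    yield (src, dst)                           # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)     # restack n-1 disks on dst

for n in (3, 5, 10):
    print(n, sum(1 for _ in hanoi(n)))         # 7, 31, and 1023 moves
```

A model trained on small n must therefore emit exponentially longer outputs at test time, which helps explain the more modest generalization reported for this task.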

Models they compared:

  • SSMs and RNNs: Mamba, LSTM, GRU
  • Transformers: Pythia, a Mistral-based local-attention Transformer

What did they find, and why is it important?

  • Without tools, SSMs fail on truly long, growing-output tasks. Even if they “show their work,” their fixed memory becomes a bottleneck. This confirms the theory.
  • Single-turn tool-use isn’t enough for SSMs to beat the limitation.
  • With interactive tool-use, SSMs show remarkable length generalization:
    • Arithmetic: Mamba and other recurrent models trained on small numbers reliably solved 1,000-digit additions and multiplications by following the learned algorithms with the memory tool.
    • Logical graphs: SSMs scaled to graphs far larger than the training ones.
    • Tower of Hanoi: Some improvement, but more modest, likely because the number of moves grows exponentially.
    • Coding: When trained on interactive tool-use traces (like iteratively running code, reading files, making small changes), Mamba handled larger codebases better than a comparable Transformer. For small codebases, the Transformer often did great; but as projects got larger and longer than the training data, Mamba’s accuracy held up better—if it was trained to imitate interactive agents.
  • Bottom line: Giving SSMs the ability to use tools interactively turns their fixed memory into a non-issue, letting them keep solving bigger and bigger problems while staying efficient.

What does this mean going forward?

  • SSMs could be excellent “agents” that use tools—like code editors, search engines, databases, and external memory—especially when tasks need many steps and produce long outputs.
  • This can make AI systems both fast (thanks to SSMs’ efficiency on long sequences) and capable (thanks to tools that extend memory and computation).
  • It changes how we should evaluate and train models: not just as standalone text generators, but as parts of a system that can act, read, write, and iterate with tools.
  • For real-world applications (coding assistants, math tutors, research helpers), training with interactive tool-use examples can unlock strong generalization to bigger, more complex problems than seen during training.

Takeaway

SSMs alone are fast but hit a wall on long, growing tasks. Add interactive tool-use—especially external memory—and that wall disappears. With the right training, SSMs can “learn the method” and keep applying it to much larger problems, making them a strong, efficient alternative to Transformers in tool-based, multi-step settings.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and open questions that remain unresolved and can guide future research:

  • Theoretical necessity claims are only shown for CoT-only and single-turn settings; the minimal degree of interaction (e.g., bounded number of tool calls or bounded observation bandwidth) required for GSSMs to overcome the lower bound is not characterized.
  • Length generalization proof relies on a custom pointer-based external memory tool with idealized read/write semantics; robustness to realistic tool constraints (latency, failures, partial observability, rate limits, noisy outputs, and non-atomic operations) is not analyzed.
  • The learning algorithm used in the positive result is a non-standard “string matching” approach; there is no proof that standard training (e.g., SGD on common SSMs/RNNs) reliably induces the required tool-use behavior and length generalization.
  • Sample complexity and coverage requirements for the training trajectories (e.g., how much of the Turing machine’s transition table must be seen, as a function of task complexity) are not quantified.
  • Theoretical framework assumes finite state spaces for GSSMs; extension of the lower bound and positive results to continuous, high-precision hidden states typical of modern SSMs (and the role of numerical precision/quantization) is not provided.
  • The definition of long-form generation assumes finite input/output spaces and monotone support growth; applicability to tasks with infinite alphabets, continuous outputs, or non-monotone output distributions is unclear.
  • The lower bound provides an error floor of at least 1−α beyond some length, but the tightness of this bound and its dependence on architecture size, precision, and output entropy are not empirically validated.
  • Sufficiency result hinges on exact simulation of a Turing machine via tool calls; there is no characterization of tasks for which exponential-time or exponential-length outputs make interactive SSMs impractical (e.g., Tower of Hanoi), nor formal criteria linking output growth rate to achievable generalization.
  • The impact of different external memory abstractions (pointer tape vs. random-access store, key–value memory, vector databases, graph stores) on learnability, data efficiency, and length generalization is not explored.
  • The role of logging choices (e.g., whether to append observations to the context) on learnability and generalization is only fixed to one regime; alternative logging policies and their trade-offs are not studied.
  • No analysis of the computational cost and latency introduced by frequent interactive tool calls; end-to-end efficiency gains of SSMs with tools vs. Transformers with tools are not quantified.
  • Transformers with interactive tool-use and/or external memory retrieval were not systematically benchmarked in the same agentic settings, leaving unclear whether similar length generalization could be achieved with attention-based models.
  • Hybrid architectures (attention + SSM) are noted but not evaluated in agentic settings; their potential to balance efficiency with generalization in tool-use scenarios remains open.
  • The experimental addition/multiplication tasks use strictly synthetic, noiseless trajectories and exact-match evaluation; robustness to noisy, imperfect, or suboptimal demonstrations (common in real settings) is not examined.
  • The distillation setup for coding filters for successful and short trajectories, introducing selection bias; effects of training on mixed-quality or failed trajectories on generalization are not assessed.
  • The coding task focuses on a single synthetic bug class (propagating a variable through dependencies); generalization to diverse, realistic bug types, larger and heterogeneous repositories, and multi-file refactoring is untested.
  • The agentic coding environment uses simple bash tools; portability to richer tool ecosystems (IDEs, compilers, static analyzers, test harnesses, version control workflows) and the effect on generalization are unexplored.
  • Sensitivity to initialization and training randomness is noted for Tower of Hanoi but not systematically studied across tasks; reproducibility and stability across seeds and training curricula need quantification.
  • The pointer-based memory experiments show strong generalization under teacher-forcing; error accumulation at inference (e.g., compounding mistakes in tool calls and reads) and mitigation strategies are not evaluated.
  • Task mixture results (addition + multiplication) hint at synergies under limited budgets, but there is no principled model of task relatedness or guidance for constructing auxiliary task sets that maximize generalization.
  • Theoretical and experimental work focuses on exact correctness; approximate solutions, probabilistic tasks, and trade-offs between accuracy and tool-call budgets are not modeled.
  • Safety, security, and sandboxing considerations for tool-use (e.g., preventing harmful commands, ensuring integrity of external state) are not addressed.
  • Memory persistence across sessions (lifelong external memory) and multi-agent collaboration over shared tools/memory are not covered; implications for scalability and generalization are open.
  • There is no release of code, datasets, or tool implementations to facilitate replication and broader evaluation across tasks and models.
  • Practical guidance on constructing training trajectories that achieve the necessary coverage (e.g., curriculum design, trajectory diversity, automated trajectory generation) is not provided.
  • The trade-off between SSM hidden-state size and interactive tool complexity (e.g., how increasing model capacity reduces reliance on tools) remains unexplored.
  • The effect of positional encodings, state updates, and architectural variants within SSMs on tool-use learning and length generalization is only minimally ablated.
  • How these findings translate to multilingual, multimodal, or streaming settings (e.g., speech or vision + text) with tool-use is left unexplored.

Practical Applications

Immediate Applications

The following bullets summarize specific, deployable use cases that leverage interactive tool-use with State Space Models (SSMs) to achieve efficient long-form generation and length generalization today.

  • Software engineering (coding agents)
    • Application: SSM-based interactive code agents for large repositories that perform iterative run–test–edit loops to fix bugs, trace variables, and propagate changes across deep dependency graphs.
    • Sector: Software
    • Tools/products/workflows: IDE plugins, CI/CD bot runners, grep/search tools, bash/file I/O, ReAct-style multi-turn workflows, trajectory distillation from capable SWE agents.
    • Assumptions/dependencies: Requires curated interactive tool-use trajectories (not single-turn), robust tool APIs (file system, test runners), and task prompts that encourage iterative interaction. Performance degrades if limited to single-turn use.
  • Cost-optimized agent pipelines (LLMops)
    • Application: Replace Transformer-based agents with tool-augmented SSMs for tasks dominated by external tool interaction (code search, repository introspection, run–fix cycles), reducing per-token compute while maintaining accuracy beyond training lengths.
    • Sector: Software, Cloud/Platforms
    • Tools/products/workflows: Orchestrators with function-calling or action tagging, KV-cached tool outputs, external memory services.
    • Assumptions/dependencies: Benefits concentrate on tool-heavy pipelines; requires sufficient trajectory coverage of action–observation pairs; attention is still useful for purely in-context tasks.
  • Math tutoring and algorithmic learning apps
    • Application: Step-by-step arithmetic tutors (multi-digit addition/multiplication) that explicitly use external memory (“scratchpad” with pointers/search) to generalize from short training examples to very long problems.
    • Sector: Education
    • Tools/products/workflows: Pointer-based memory tools, structured “scratchpad” outputs, CoT (chain-of-thought) scaffolds.
    • Assumptions/dependencies: Needs high-quality demonstrations with interactive tool calls; generalization may be limited for tasks whose output length grows exponentially (e.g., Tower of Hanoi).
  • Code review and variable flow tracing
    • Application: Agents that trace variable propagation across modules (e.g., ensuring a specific variable is passed through all dependency paths), detect missing parameters, and propose fixes at scale.
    • Sector: Software, Security/Quality
    • Tools/products/workflows: Dependency graph extractors, code search tools, static analyzers, interactive edit-and-run loops.
    • Assumptions/dependencies: Requires accurate graph construction and instrumented environments; benefits from interactive multi-turn workflows.
  • Customer support and case management in long threads
    • Application: SSM agents that navigate extensive ticket histories, search tool outputs, and iteratively update summaries or action plans with external memory.
    • Sector: Customer Support, Enterprise Software
    • Tools/products/workflows: Search in CRM/ticket systems, external memory for session state, ReAct-style “thought–action–observation” loops.
    • Assumptions/dependencies: Integration with enterprise systems, access control, careful trajectory curation to avoid hallucinated actions.
  • Data pipeline reasoning for DAGs
    • Application: Agents that inspect large directed acyclic graphs (DAGs) of transformations, search for node definitions and dependencies, and plan consistent updates across the graph.
    • Sector: Data Engineering/ETL
    • Tools/products/workflows: Graph exploration tools, search APIs, external memory for intermediate derivations, iterative planning loops.
    • Assumptions/dependencies: Requires tooling to expose DAGs in searchable formats and sufficient demonstration coverage.
  • Academic benchmarks and replicable agentic evaluations
    • Application: Construct and share datasets that include interactive tool-use trajectories (pointer-based memory, search, bash), enabling reproducible evaluations of length generalization under different agent designs.
    • Sector: Academia/Research
    • Tools/products/workflows: Open-source ReAct-style environments, standardized tool interfaces, length generalization benchmarks (arithmetic, logic graphs, coding).
    • Assumptions/dependencies: Community adoption of tool-use benchmarks; clear protocol for masking tool observations during training.
  • Energy-efficient AI guidance (organizational policy)
    • Application: Internal policies encouraging tool-augmented SSMs for long-form, tool-heavy tasks to reduce compute and costs without sacrificing quality.
    • Sector: Policy (organizational/IT governance)
    • Tools/products/workflows: Procurement and architecture guidelines, cost/performance dashboards comparing SSM vs Transformer agents.
    • Assumptions/dependencies: Evidence gathering on workload-specific benefits; risk assessment for tasks requiring global, in-context attention.

Long-Term Applications

These use cases are promising but require additional research, tooling, and scaling—especially in safety, integration, and data availability.

  • Longitudinal healthcare reasoning
    • Application: Agents that navigate years of EHRs with external memory, perform long-horizon reasoning (medication changes, lab trends), and produce actionable care plans.
    • Sector: Healthcare
    • Tools/products/workflows: Secure external memory stores, audited tool calls, structured clinical knowledge tools, ReAct-style clinical prompts.
    • Assumptions/dependencies: HIPAA-compliant infrastructure, rigorous validation, bias/safety guardrails, and high-quality clinical trajectories.
  • Financial analysis and backtesting over large horizons
    • Application: Tool-augmented SSMs for long-horizon backtesting, portfolio attribution, and regulatory reporting over massive time-series.
    • Sector: Finance
    • Tools/products/workflows: Market data interfaces, external memory for intermediate results, audit-ready action logs.
    • Assumptions/dependencies: Data quality, latency constraints, strict compliance and auditability.
  • Robotics and long-horizon planning
    • Application: Agents with external memory for planning sequences of actions in complex environments, recalling maps, and iteratively updating plans from sensor/tool observations.
    • Sector: Robotics/Autonomy
    • Tools/products/workflows: Memory tools connected to world models, planning toolkits, action–observation loops.
    • Assumptions/dependencies: Real-time constraints, safety certification, robust integration with physical systems.
  • Scientific workflows and lab automation
    • Application: Agents that manage long experimental protocols, search lab logs, and maintain persistent external memory across runs to generalize procedures.
    • Sector: Scientific Research/Lab Automation
    • Tools/products/workflows: LIMS integration, structured experiment logs, external scratchpads for protocols.
    • Assumptions/dependencies: Standardization of lab tool interfaces, high-fidelity logging, contamination/safety controls.
  • Formal verification and proof assistance
    • Application: Efficient proof assistants that use external memory to navigate large theorem libraries and construct long proofs in interactive steps.
    • Sector: Software Verification/Math
    • Tools/products/workflows: Theorem search tools, tactic libraries, external proof state memory.
    • Assumptions/dependencies: High-quality proof trajectories, domain-specific safety checks, reliable tool interoperability.
  • Enterprise-scale code maintenance for monorepos
    • Application: Autonomous agents that plan and execute multi-file refactors, API migrations, and dependency updates across massive codebases.
    • Sector: Software
    • Tools/products/workflows: Repository-wide search/edit tools, CI/CD orchestration, persistent external memory to track refactor state.
    • Assumptions/dependencies: Access controls, rollback strategies, multi-agent coordination, robust evaluation frameworks.
  • Multi-agent systems with shared external memory
    • Application: Teams of SSM agents coordinating via shared, auditable memory substrates to decompose and solve large tasks.
    • Sector: Software/Platforms
    • Tools/products/workflows: Shared memory boards, permissions and consensus, telemetry for tool calls.
    • Assumptions/dependencies: Access control policies, conflict resolution protocols, monitoring and governance.
  • Curriculum-scale educational tutoring
    • Application: Tutors that generalize to complex, multi-step problem sequences across grades by leveraging external memory and interactive reasoning.
    • Sector: Education
    • Tools/products/workflows: Structured trajectory datasets, progressive curricula, interactive scratchpads.
    • Assumptions/dependencies: Content alignment, fairness and bias audits, student privacy protections.
  • Regulatory standards for agentic tool-use and auditability
    • Application: Policies defining logging, traceability, and accountability for agents that write to and read from external memory.
    • Sector: Public Policy/Regulation
    • Tools/products/workflows: Audited tool APIs, standardized action–observation logs, compliance testing suites.
    • Assumptions/dependencies: Cross-industry consensus, privacy/security frameworks, enforcement mechanisms.
  • Cloud platform services for external memory and tool orchestration
    • Application: Managed services offering persistent, scalable “scratchpad” memory and tool orchestration optimized for SSM agents.
    • Sector: Cloud/Platforms
    • Tools/products/workflows: Memory-as-a-service, action routers, tool adapters with security layers.
    • Assumptions/dependencies: Strong isolation, cost controls, service reliability and portability.

Cross-cutting assumptions and dependencies

Across applications, feasibility hinges on:

  • Interactive tool-use: Single-turn tool calls (one-shot) are not sufficient; agents need multi-turn, ReAct-style loops with thoughts, actions, and observations interleaved.
  • External memory design: Pointer/search-based tools must be reliable, secure, and exposed via stable APIs. Observations should be masked during training when appropriate.
  • Training data quality: Length generalization requires trajectories with adequate coverage of state–symbol transitions (analogous to Turing machine operations). Poor coverage limits generalization.
  • Safety and auditability: Tool calls should be logged with clear provenance; access controls and guardrails must prevent unsafe writes/reads.
  • Task fit: SSMs excel when external tools dominate the workload and outputs scale with problem complexity; Transformers may remain preferable for purely in-context tasks without tool interaction.
  • Output scaling: Generalization is strongest for tasks with linear output growth (e.g., arithmetic); tasks with exponentially growing outputs (e.g., Tower of Hanoi) require more careful design and may show limited extrapolation.
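The multi-turn, ReAct-style loop with external memory described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the names (`MemoryTool`, `run_agent`, the policy's `step` dictionary) are assumptions chosen for the sketch.

```python
# Minimal sketch of a ReAct-style loop: the agent interleaves actions and
# observations, offloading state to an external key-value memory tool.
# All interfaces here are illustrative, not the paper's API.

class MemoryTool:
    """External memory the agent reads/writes via tool calls."""
    def __init__(self):
        self.store = {}

    def call(self, action, key, value=None):
        if action == "write":
            self.store[key] = value
            return "ok"
        if action == "read":
            return self.store.get(key, "<empty>")
        raise ValueError(f"unknown action: {action}")

def run_agent(policy, memory, max_turns=10):
    """Loop until the policy emits a final answer or the turn budget runs out."""
    observation = None
    for _ in range(max_turns):
        step = policy(observation)                 # thought + action from the model
        if step["type"] == "final":
            return step["answer"]
        observation = memory.call(**step["args"])  # tool result fed back next turn
    return None
```

A single-turn variant would allow only one `memory.call` before the final answer; the point of the interactive setting is that the loop can run for as many turns as the problem requires, so the agent's effective memory is no longer bounded by its fixed-size state.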

Glossary

Below is an alphabetical list of advanced domain-specific terms from the paper, each with a short definition and a verbatim usage example.

  • Agentic settings: Scenarios where models act as agents that interact with tools and environments to solve tasks. "interactive tool-based and agentic settings."

  • Chain of Thought (CoT): A prompting style where models produce step-by-step reasoning before the final answer. "As test-time scaling paradigms that involve the generation of long Chain of Thought (CoT) have become the leading solution for improving reasoning capabilities"
  • Directed-acyclic computation graph: A directed graph with no cycles used to represent dependencies in computations. "In this task, we construct a directed-acyclic computation graph with $n$ nodes."
  • Distillation: Training a model to imitate the behavior or trajectories of a stronger agent/model. "3) Distillation: SWE-agent LLM \citep{yang2025swe} instructed to solve the bug in the code."
  • External memory: Storage outside the model that can be read/written via tools to extend the model’s effective memory. "allowing SSMs access to external memory through interactive tool-use makes them much more powerful."
  • Flash Attention: An optimized attention algorithm that improves efficiency and memory usage for long sequences. "efficient implementations such as Flash Attention \citep{dao2022flashattention} and KV caching enable close to linear complexity."
  • Function calling: A model capability to invoke structured functions/tools programmatically. "does not natively support function calling"
  • Generalized State Space Model (GSSM): A fixed-memory sequence model defined by an update rule and an output rule over a finite state. "A generalized state space model (GSSM) is a (potentially probabilistic) function $h : \Sigma^* \to \Delta(\Sigma^*)$"
  • In-context learning: Learning behavior from examples provided in the input context without parameter updates. "tasks that involve memorization of long sequences and in-context learning \citep{jelassi2024repeat, park2024can, akyurek2024context}."
  • Interactive Tool-Use: Multi-turn tool invocation interleaved with thoughts and outputs. "3) Interactive Tool-Use: The model is allowed to use as many commands as it needs, and freely interleave thoughts, commands and outputs."
  • KV caching: Storing key-value pairs from past attention steps to speed up autoregressive generation. "efficient implementations such as Flash Attention \citep{dao2022flashattention} and KV caching enable close to linear complexity."
  • Length generalization: The ability to generalize from short/simple training instances to problems of arbitrary length/complexity. "achieve length generalization on any tractable long-form generation task."
  • Linear Transformers: Transformer variants designed so computation scales linearly with sequence length. "Examples of such architectures include variants of Linear Transformers \citep{katharopoulos2020transformers}"
  • Local (sliding window) attention: Attention restricted to a fixed window of nearby tokens to reduce compute/memory. "Transformers where all the attention layers have local (sliding window) attention with fixed window size, are also GSSMs."
  • Long-form generation: Tasks where the output length or space grows with problem complexity. "We formally define long-form generation tasks to be problems where the effective number of outputs grows with the complexity of the problem."
  • Mamba: A state space model architecture for efficient sequence modeling with linear-time inference. "We finetune Mamba and Pythia (Transformer) on trajectories collected from different tool-use agents for solving a coding problem."
  • Memory-tool oracle: A tool interface that exposes read/write operations to an external memory for the agent. "There exists a memory-tool oracle and a simple GSSM learning algorithm"
  • Neural GPU: A neural architecture designed to learn algorithmic tasks via parallel, convolutional-like operations. "with some improvements such as the Neural GPU \citep{kaiser2015neural}"
  • Neural Turing Machine (NTM): A neural model augmented with differentiable external memory that can emulate algorithm execution. "\cite{graves2014neural} introduces the Neural Turing Machine (NTM), a neural network that can simulate Turing machines and thus execute computable algorithms."
  • Next-token prediction objective: The standard language modeling objective of predicting the next token. "We train the model using the standard next-token prediction objective with teacher-forcing"
  • Pointer-based memory: An external memory mechanism where the agent moves pointers to read/write tokens at specific positions. "we augment the model with a pointer-based memory tool that gives the model access to past tokens in the input/output context."
  • Positional encodings: Representations injected into models to encode token positions for sequence processing. "including various choices of positional encodings and output format"
  • ReAct: A framework for interleaving reasoning (thoughts) and actions (tool calls) with observations. "We introduce a new theoretical framework for studying ReAct \citep{yao2023react} agents"
  • RetNet: A sequence model architecture with retentive mechanisms designed for efficient long-context processing. "H3 \citep{fu2022hungry}, RetNet \citep{sun2023retentive}, Mamba-1 and Mamba-2 \citep{gu2023mamba, dao2024transformers}"
  • RoPE scaling: A technique to extend rotary positional embeddings to longer contexts at inference. "We experimented with applying RoPE scaling when using the Transformer beyond the training context length, both in finetuning and evaluation, and observed mixed results."
  • Scratchpad: A technique where intermediate computations are written explicitly to aid model reasoning. "scratchpads \citep{nye2021show,lee2023teaching,zhou2023algorithms, abbe2024far}"
  • Single-Turn Tool-Use: A setting where the agent may issue only one tool command before producing the final answer. "2) Single-Turn Tool-Use: The model is allowed to issue a single command, followed by an observation, and then generate the output."
  • State Space Models (SSMs): Sequence models with fixed-size internal state updated over time, enabling linear compute scaling. "State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling."
  • Teacher-forcing: Feeding the ground-truth previous tokens during training to stabilize learning. "We train the model using the standard next-token prediction objective with teacher-forcing"
  • Test-time scaling: Improving capability by increasing computation (e.g., longer reasoning) during inference. "As test-time scaling paradigms that involve the generation of long Chain of Thought (CoT) have become the leading solution for improving reasoning capabilities"
  • Tool-use trajectories: Logged sequences of thoughts, actions, and observations that demonstrate tool interactions. "In this paper, we study the length generalization of SSMs when trained on data with tool-use trajectories."
  • Tower of Hanoi: A benchmark puzzle used to assess algorithmic reasoning and planning in models. "Tower of Hanoi task (a task which proved to be difficult for reasoning models, see \cite{shojaee2025illusion})."
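The pointer-based memory entry above can be made concrete with a short sketch: the agent moves a pointer over the token context and reads or writes at that position, instead of holding the tokens in its fixed-size state. The class name and method names are illustrative assumptions, not the paper's interface.

```python
# Toy pointer-based memory tool: exposes past input/output tokens through
# pointer moves and positional reads/writes. Illustrative only.

class PointerMemory:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.ptr = 0

    def move(self, delta):
        # Clamp the pointer to valid positions in the context.
        self.ptr = max(0, min(len(self.tokens) - 1, self.ptr + delta))

    def read(self):
        # Return the token under the pointer.
        return self.tokens[self.ptr]

    def write(self, token):
        # Overwrite the token under the pointer.
        self.tokens[self.ptr] = token
```

Because each tool call touches only one position, the model's own state need only track local control information (e.g. which pointer move to issue next), which is what lets a fixed-memory model scale to arbitrarily long contexts.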

Open Problems

We found no open problems mentioned in this paper.
