- The paper introduces a modular, multi-agent framework that utilizes an IR-driven process to improve FSM Verilog code generation.
- It demonstrates significant improvements over baselines, achieving up to 11.94 percentage points higher functional correctness (pass@1) and a 17.62-point lower syntax error rate.
- The integration of automated SystemC-based testing and agent feedback loops underlines its scalability and robustness in handling complex FSM designs.
AutoFSM: Multi-Agent Framework for FSM Code Generation with IR and SystemC-Based Testing
Introduction and Motivation
The generation of Register-Transfer Level (RTL) code for finite state machines (FSMs) is central to digital system design, yet it remains an error-prone and labor-intensive process, particularly as FSM complexity increases. While LLMs show promise at translating natural language into code, their application to Verilog generation for FSMs is hindered by high syntax error rates, inefficient debugging cycles, and dependence on manual or predefined testbenches. These limitations stem largely from the scarcity of Verilog in LLM training data, which leaves models deficient in both Verilog syntax and semantics, and from the absence of integrated, automated verification feedback in existing frameworks.
AutoFSM directly addresses these challenges by introducing a structured, collaborative agent-based framework that leverages an intermediate representation (IR) and fully automated SystemC-based testing. The design intent is to improve both the syntactic correctness and functional reliability of autogenerated FSM Verilog, while providing a scalable, interpretable code-debug loop.
System Architecture and Core Innovations
AutoFSM employs a division of labor among specialized agents: FSMExtractor (NL-description-to-IR conversion), Verifier (semantic consistency checking), Coder (IR-to-Verilog translation), Tester (testbench and SystemC modeling), Fixer (error localization and patching), and Judger (fault classification and feedback). This modular workflow is orchestrated atop the MetaGPT platform, which streamlines agent context management and collaborative task planning.
The principal architectural innovation is the strict separation of the semantic stage (FSM structure and behavior) from the syntactic stage (Verilog HDL) via a context-rich, JSON-based IR. The IR is designed to be specified precisely by LLMs, avoiding the ambiguities that arise when models must emit schema-constrained formats such as YAML directly. A conversion toolchain (json2yaml, plus significant custom extensions to state tracking and register support in fsm2sv) automates Verilog synthesis from this IR, with agent-in-the-loop verification at each step.
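To make the IR-to-HDL split concrete, here is a toy example of what such a JSON IR might look like, together with one tiny emission step. The schema and field names are hypothetical, not the paper's actual IR, and the emitter covers only a state enum, a small slice of what a full fsm2sv-style backend would generate.

```python
import json

# Hypothetical JSON IR for a two-state FSM; field names are illustrative,
# not the paper's actual schema.
ir_text = """
{
  "name": "toggle",
  "reset_state": "S0",
  "states": ["S0", "S1"],
  "transitions": [
    {"from": "S0", "to": "S1", "cond": "go"},
    {"from": "S1", "to": "S0", "cond": "go"}
  ]
}
"""

def emit_state_enum(ir: dict) -> str:
    """Emit a SystemVerilog state enum -- one tiny slice of what a full
    fsm2sv-style backend would synthesize from the IR."""
    width = max(1, (len(ir["states"]) - 1).bit_length())
    names = ", ".join(ir["states"])
    return f"typedef enum logic [{width - 1}:0] {{ {names} }} state_t;"

ir = json.loads(ir_text)
print(emit_state_enum(ir))
# -> typedef enum logic [0:0] { S0, S1 } state_t;
```

Because the IR is explicit about states and transitions, the Verifier agent can check it against the NL spec before any Verilog exists, which is exactly the semantic/syntactic separation the architecture relies on.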
Functional validation is systematized by generating SystemC models from the design description. Automated test stimuli and reference output traces enable differential checking between DUT and reference model, with failures localized through comprehensive error tracebacks. The framework’s integration with Verilator provides cycle-accurate simulations, with all failures fed directly into the multi-agent correction loop.
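The differential-checking step can be sketched as a cycle-by-cycle trace comparison. The trace format and field names below are assumptions; in the actual system the DUT trace would come from Verilator and the reference trace from the generated SystemC model.

```python
# Minimal sketch of differential trace checking between a DUT and a
# reference model; trace format and field names are assumptions.
def diff_traces(dut: list[dict], ref: list[dict]) -> list[str]:
    """Return one human-readable mismatch report per failing cycle."""
    failures = []
    for cycle, (d, r) in enumerate(zip(dut, ref)):
        bad = sorted(k for k in r if d.get(k) != r[k])
        if bad:
            failures.append(
                f"cycle {cycle}: mismatch on {bad} (dut={d}, ref={r})")
    return failures

# Toy traces: the DUT fails to leave S0 on cycle 1.
ref_trace = [{"state": "S0", "out": 0}, {"state": "S1", "out": 1}]
dut_trace = [{"state": "S0", "out": 0}, {"state": "S0", "out": 1}]
print(diff_traces(dut_trace, ref_trace))
```

Reports of this shape, pinpointing the first divergent cycle and signal, are what lets the Fixer agent localize a fault instead of re-debugging the whole module.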
Benchmark, Experimental Methodology, and Results
To enable systematic, hierarchical evaluation, the authors introduce the SKT-FSM benchmark: 67 carefully curated FSM samples spanning an easy/medium/hard complexity gradient. The benchmark addresses VerilogEval's limited FSM coverage, offering a testbed reflective of practical FSM control design requirements.
Evaluation is performed using DeepSeek-V3 and GPT-4o as LLM substrates, with baseline comparisons to the open-source multi-agent framework MAGE and additional large models (Gemini 2.5, Qwen2.5-Max). The primary metrics are pass@1 (holistic functional correctness via test pass rates) and syntax error rate. Experimental control is strictly maintained (deterministic LLM sampling, bounded agent actions).
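At a high level, the two headline metrics reduce to simple fractions over per-task outcomes. The sketch below is a back-of-envelope formulation; the exact scoring protocol (test harness, sample counts) is defined by the paper, not here.

```python
# Back-of-envelope sketch of the two headline metrics; the exact scoring
# protocol is the paper's, not this code's.
def pass_at_1(passed: list[bool]) -> float:
    """Fraction of tasks whose single generated sample passes all tests."""
    return sum(passed) / len(passed)

def syntax_error_rate(compiled: list[bool]) -> float:
    """Fraction of generated samples that fail to compile."""
    return 1.0 - sum(compiled) / len(compiled)

# Toy run over four tasks:
print(pass_at_1([True, False, True, True]))          # 0.75
print(syntax_error_rate([True, True, False, True]))  # 0.25
```

Note that pass@1 is computed from a single sample per task (deterministic sampling), so it directly measures first-shot reliability rather than best-of-k search.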
Quantitative results show that AutoFSM achieves up to 11.94 percentage points higher pass rates and 17.62 points lower syntax error rates than MAGE, controlling for LLM choice. On DeepSeek-V3, pass@1 reaches 58.21% (vs. 46.27% for MAGE) with a syntax error rate of just 0.75%. On GPT-4o, the improvements persist (pass@1: 44.78% vs. 38.81%; syntax error rate: 5.22% vs. 22.84%). Ablations confirm that both the IR-based toolchain and the SystemC-based feedback loop are critical: disabling either causes a substantial drop in pass@1 and a sharp rise in syntax errors.
The pass@1 advantage widens as FSM complexity increases, with AutoFSM outperforming baselines by up to 30.77 percentage points on hard SKT-FSM tasks, evidence of better generalization to realistic design scenarios.
Implications and Future Directions
Practically, AutoFSM’s modular, validation-driven architecture advances reliability, maintainability, and scalability for hardware code generation pipelines. The adoption of semantically explicit IRs and systematized agent feedback enables robust code generation even in the presence of data sparsity, and supports efficient fault isolation and correction cycles. The SystemC-based automated testbench generation addresses a notable bottleneck in prior frameworks by removing dependence on handcrafted test programs, making AutoFSM adaptable for deployment in dynamic or large-scale design environments.
Theoretically, these results substantiate the efficacy of agent specialization and intermediate formal abstractions for complex code generation tasks. The significant reduction in syntax errors and targeted correction loop suggest that LLM-driven hardware design can approach practical utility when coupled with explicit semantic mediation and closed-loop verification.
Looking forward, extending the IR and agent protocol to support richer FSM constructs (e.g., hierarchical/nested state machines, timed/transient transitions), broadening integration with professional EDA flows, and exploring reinforcement/active learning across agent simulations are promising directions. Additionally, the hierarchical SKT-FSM benchmark lays a foundation for reproducible, community-wide progress tracking in FSM code generation research.
Conclusion
AutoFSM delivers a robust, agent-based approach for FSM Verilog code generation, distinguished by its IR-driven pipeline and automated SystemC-based verification. It decisively reduces LLM-induced syntax errors and raises the ceiling on functional correctness, outperforming contemporary multi-agent baselines and general-purpose LLMs across all evaluated metrics on a challenging FSM benchmark. The modular agent and feedback architecture of AutoFSM sets a scalable precedent for future advancements in AI-driven hardware design automation.