LLMulator: Generalizable Cost Modeling for Dataflow Accelerators with Input-Adaptive Control Flow

Published 25 Aug 2025 in cs.AR | (2508.17826v1)

Abstract: Accurate and fast performance prediction for dataflow-based accelerators is vital for efficient hardware design and design space exploration, yet existing methods struggle to generalize across architectures, applications, and input-dependent control flows. We present LLMulator, a progressive numeric modeling framework leveraging the program semantic knowledge of pre-trained LLMs for robust, hardware- and application-aware prediction. Our numeric model treats performance values as categorical token sequences, enabling range-agnostic estimates and confidence-aware predictions for unseen applications. To handle input-dependent control flows, we introduce a reinforcement learning-based dynamic calibration method, reducing cycle prediction error by 9.7% over static models and converging to 11.2% error after a few iterations. For cross-hardware generalization, we develop a progressive data augmentation strategy that generates diverse datasets covering multi-level dataflow structures, memory parameters, and loop mapping primitives, significantly boosting prediction accuracy across architectures and configurations.

Summary

The paper introduces LLMulator, a cost modeling framework that integrates LLMs to predict dataflow accelerator performance with input-adaptive control flow.
It employs progressive numeric tokenization, digit-wise categorical modeling, and reinforcement learning for robust prediction and minimal error propagation.
Experimental results show up to a 41% reduction in cycle prediction error and excellent generalization across unseen hardware configurations.

Summary of "LLMulator: Generalizable Cost Modeling for Dataflow Accelerators with Input-Adaptive Control Flow" (2508.17826)

Motivation and Problem Statement

LLMulator is a framework for performance prediction of dataflow accelerators that uses LLMs for generalizable and interpretable cost modeling. It addresses three limitations of prior modeling approaches: poor generalization to new applications and hardware configurations, inability to capture input-dependent control flow, and unreliable numerical estimation over large performance ranges.

Traditional rule-based simulators (e.g., Timeloop), GNN-based predictors, and LM-based regression models exhibit diminished accuracy and adaptability when facing unseen applications, hardware features, or dynamic control flows. LLMulator's design directly targets these deficiencies through progressive numeric modeling, reinforcement learning-based dynamic calibration, and hierarchical data augmentation.

Framework Architecture

LLMulator's architecture comprises three modules:

Dataset Synthesizer: Generates diverse dataflow programs across a spectrum of hardware/software configurations. A multi-stage generation process incorporates AST-based construction, dataflow-specialized program mutation, and LLM-driven augmentation to maximize coverage of real-world accelerator scenarios.
Numeric Modeling-Based Static Prediction: Introduces progressive program encoding and digit-wise, classification-based output modeling. Numerical values within dataflow programs are tokenized with precision-preserving schemes, and outputs are predicted with hierarchical categorical modeling (digit-wise classification with beam search), yielding position-wise confidence estimations and reducing distortion at extreme values.
Dynamic Prediction-Based Calibration: Applies reinforcement learning, specifically Direct Preference Optimization (DPO), to online profiler feedback for input-adaptive cycle prediction. A replay-cost buffer builds preference pairs from measured vs. predicted performance, and these pairs are used to update the model parameters so that predictions adapt to input-dependent behavior. Operator-level selective attention masking further reduces inference cost by skipping redundant computation for static operators.
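
The replay-cost buffer described in the last module can be illustrated with a minimal sketch. It assumes that each profiled run stores several sampled cycle predictions next to the measured cycle count, and that the prediction closest to the measurement is preferred over the one farthest from it. The names `ReplayEntry` and `build_preference_pairs`, the pairing rule, and the sample numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplayEntry:
    """One profiled run (hypothetical layout, not the paper's data format)."""
    program_input: str      # description of the runtime input
    candidates: List[int]   # cycle counts sampled from the model
    measured: int           # cycle count reported by the profiler (> 0)


def relative_error(predicted: int, measured: int) -> float:
    """Absolute relative error of a prediction against the measurement."""
    return abs(predicted - measured) / measured


def build_preference_pairs(buffer: List[ReplayEntry]) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples of the kind a DPO trainer consumes."""
    pairs = []
    for entry in buffer:
        ranked = sorted(entry.candidates,
                        key=lambda c: relative_error(c, entry.measured))
        best, worst = ranked[0], ranked[-1]
        if best == worst:
            continue  # identical candidates carry no preference signal
        pairs.append((entry.program_input, str(best), str(worst)))
    return pairs


# Illustrative numbers only.
buffer = [
    ReplayEntry("n=64", [4100, 5200, 3900], measured=4096),
    ReplayEntry("n=128", [16000, 16000], measured=16384),
]
pairs = build_preference_pairs(buffer)
print(pairs)
```

In this sketch the second entry produces no pair because both candidates are identical. A real calibration loop would pass such triples to a DPO training step, which is omitted here.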

Technical Innovations

Tokenization and Output Decoding: Progressive numerical tokenization aligns token counts with digit lengths, reducing representation ambiguity and error propagation. Output decoding uses categorical cross-entropy at the digit level, optimizing for both spatial and temporal efficiency and circumventing the limitations of regression models constrained by normalization and saturation effects.
Confidence-aware Predictions: Beam search and logit analysis yield interpretability and actionable confidence metrics. Prediction error correlates with final-layer logits, enabling quantifiable uncertainty modeling.
Data Augmentation and Reasoning: Hierarchical data synthesis incorporates intermediate reasoning fragments (tag-delimited extracted features) alongside program code, allowing the model to integrate rich context and hardware mapping parameters (memory delays, loop unrolls) into its learning space.
Dynamic Calibration and Acceleration: Real-time DPO updates, operator-level masking, and partial attention mechanisms optimize inference latency (up to 30.6% reduction in extreme context-length scenarios) and model adaptability for workloads with substantial input-dependent control flow.
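
The digit-wise output modeling and confidence estimation listed above can be sketched as follows. This is a simplified illustration, not the paper's decoder: it assumes one token per decimal digit and per-position probability distributions that are independent of each other, whereas an autoregressive model would condition each digit on the digits already emitted. The helper names and the probabilities are hypothetical.

```python
import math
from typing import List, Tuple


def tokenize_number(value: int) -> List[str]:
    """One token per decimal digit, so the token count equals the digit length."""
    return list(str(value))


def beam_search_digits(
    digit_probs: List[List[float]], beam_width: int = 3
) -> List[Tuple[int, float, List[float]]]:
    """Decode a number from per-position distributions over the digits 0-9.

    Returns (value, sequence probability, per-digit confidences), best first.
    """
    beams: List[Tuple[List[int], float]] = [([], 0.0)]  # (digits, log-probability)
    for probs in digit_probs:
        expanded = []
        for digits, logp in beams:
            for d, p in enumerate(probs):
                if p > 0.0:
                    expanded.append((digits + [d], logp + math.log(p)))
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_width]  # keep only the most likely prefixes
    results = []
    for digits, logp in beams:
        value = int("".join(map(str, digits)))
        confidences = [digit_probs[i][d] for i, d in enumerate(digits)]
        results.append((value, math.exp(logp), confidences))
    return results


def two_point(peak: int, p_peak: float, alt: int) -> List[float]:
    """Hypothetical distribution: mass p_peak on `peak`, the rest on `alt`."""
    probs = [0.0] * 10
    probs[peak] = p_peak
    probs[alt] = 1.0 - p_peak
    return probs


# Three digit positions; the model is least sure about the last digit.
digit_probs = [two_point(4, 0.9, 3), two_point(0, 0.8, 1), two_point(9, 0.6, 8)]
best_value, best_prob, confidences = beam_search_digits(digit_probs)[0]
print(tokenize_number(4096), best_value, round(best_prob, 3), confidences)
```

The per-digit confidences show where the estimate is weakest: here the leading digit is predicted with probability 0.9 and the last digit with only 0.6, so the order of magnitude is more trustworthy than the low-order digits.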

Experimental Results

Extensive evaluation on Polybench and modern application benchmarks (image processing, NLP) demonstrates the following:
- Numerical Performance: LLMulator achieves a Mean Absolute Percentage Error (MAPE) of 12.2% across all metrics (static power, area, cycles), outperforming GNNHLS (28.9%) and TLP (20.0%) by substantial margins. Dynamic calibration reduces cycle prediction error by up to 41% over static models.
- Edge Case and Hardware Generalization: Digit-wise categorical output modeling achieves superior accuracy on unseen or extreme value scenarios compared to regression-based baselines. Progressive data augmentation enables generalization to new hardware configurations (memory delay, loop mapping primitives) with minimal increase in prediction error.
- Inference Latency: LLM-based inference incurs higher raw computation time (approximately 1.01 s), but partial attention and operator masking reduce inference latency by up to 30.6% in extreme context-length scenarios.
- Transferability: LLMulator retains robust generalization when predicting performance for canonical accelerator architectures (TPU v1, Eyeriss, ShiDianNao) and real-world workloads compiled via MLIRSynth, achieving MAPE in the 6.9-10.7% range even without retraining.
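
For reference, the MAPE metric quoted in these results follows the standard definition: the mean of |predicted - actual| / |actual|, expressed as a percentage. The sketch below uses made-up values, not data from the paper.

```python
from typing import Sequence


def mape(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


# Illustrative cycle counts only.
predicted = [110.0, 90.0, 200.0]
actual = [100.0, 100.0, 250.0]
score = mape(predicted, actual)
print(f"MAPE = {score:.2f}%")
```

Because each error is normalized by the true value, MAPE weights a 10% miss on a short kernel the same as a 10% miss on a long-running one, which suits metrics that span several orders of magnitude.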

Practical and Theoretical Implications

LLMulator combines the semantic capabilities of LLMs with domain-specific numerical modeling and RL-based calibration for accelerator cost modeling. The explicit digit-level confidence modeling and hierarchical data augmentation provide interpretability and actionable uncertainty estimates, which suit iterative hardware-software co-design, rapid design space exploration, and prediction across evolving architectures and workloads.

Theoretically, this approach demonstrates that pre-trained LLMs, when coupled with progressive numeric modeling and reinforcement feedback, can generalize beyond classical regression models and static analytical formulations. Its methodology could be extended to performance modeling in related domains (e.g., compiler optimization, automated design synthesis) and inspires future research in context-adaptive hardware systems.

Future Outlook

Further research directions include augmenting program normalization and abstraction techniques to strengthen accuracy on deeply nested and highly abstracted dataflows, scaling base LLM parameters for enhanced reliability, and expanding reasoning chains for richer interpretability. Integration with hardware-in-the-loop learning could further raise input-adaptive calibration robustness.

Conclusion

LLMulator constitutes a comprehensive, modular framework advancing the generalization, interpretability, and input-adaptiveness of dataflow accelerator modeling. Through synergistic application of LLMs, digit-wise categorical modeling, and RL-based dynamic calibration, it achieves substantial accuracy improvements and sets new methodological standards for hardware cost prediction in increasingly diverse and dynamic accelerator landscapes.