Generalization of rule-optimized small language models to unseen domains without iterative error analysis

Determine how specialized small language models, such as Qwen3 4B integrated within the Code Generation Agent and equipped with a fine-tuned set of domain-specific prompt rules, can generalize to new, unseen application domains without the iterative error-analysis and rule-induction cycle used in this study.

Background

The paper introduces an error-driven optimization framework for arithmetic reasoning on tabular financial data using a Code Generation Agent with on-premises small LLMs. By clustering erroneous predictions and formulating domain-specific prompt rules, the authors achieve significant accuracy gains with Qwen3 4B, improving Exact Match from 59.96% to 70.82% and surpassing GPT-3.5 Turbo.

While the approach is effective within the studied financial domain and dataset (TAT-QA), the method relies on iterative error clustering and rule induction. The authors identify a critical remaining challenge: enabling these specialized, rule-optimized small LLMs to generalize to previously unseen domains without repeating the full error-analysis cycle. This generalization is important for scalable, privacy-preserving deployment across diverse regulated sectors.
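The error-analysis cycle the question asks to avoid can be pictured as a loop that groups erroneous predictions and turns each group into a prompt rule. The sketch below is purely illustrative, not the paper's implementation: the function names (`cluster_errors`, `induce_rule`, `optimize_prompt`), the keyword-based "clustering", and the rule templates are all assumptions standing in for the paper's actual clustering and rule-induction procedure.

```python
# Toy sketch of an error-driven prompt-optimization iteration.
# All names and rule templates here are hypothetical illustrations,
# not the method from Pándy et al.; real clustering would operate on
# model predictions, not a pre-labeled error_type field.
from collections import defaultdict

def cluster_errors(errors):
    """Group erroneous predictions by a coarse error-type label (toy clustering)."""
    clusters = defaultdict(list)
    for err in errors:
        clusters[err["error_type"]].append(err)
    return clusters

def induce_rule(error_type, cases):
    """Map an error cluster to a domain-specific prompt rule (hypothetical templates)."""
    templates = {
        "unit_mismatch": "Convert all monetary values to the same unit before computing.",
        "sign_error": "Treat values in parentheses in financial tables as negative numbers.",
    }
    return templates.get(
        error_type,
        f"Avoid errors of type '{error_type}' ({len(cases)} cases observed).",
    )

def optimize_prompt(base_prompt, errors):
    """One iteration of the cycle: cluster errors, append one induced rule per cluster."""
    rules = [induce_rule(t, cs) for t, cs in sorted(cluster_errors(errors).items())]
    return base_prompt + "\nRules:\n" + "\n".join(f"- {r}" for r in rules)

errors = [
    {"question": "q1", "error_type": "unit_mismatch"},
    {"question": "q2", "error_type": "unit_mismatch"},
    {"question": "q3", "error_type": "sign_error"},
]
print(optimize_prompt("Write Python code to answer the question from the table.", errors))
```

Generalizing to an unseen domain without this loop would mean producing a useful rule set up front, rather than deriving it from observed errors as above.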

References

A critical open question remains, however, as to how these specialized small models, equipped with a fine-tuned rule-set, can generalize to new, unseen domains without the iterative error-analysis cycle presented here.

Error-Driven Prompt Optimization for Arithmetic Reasoning (2512.13323 - Pándy et al., 15 Dec 2025) in Section 7 (Conclusion)