Comparative evaluation of agent architectures on CTI-REALM

Determine how multiple agent frameworks, including plan-and-execute and tree-of-thought architectures, perform relative to one another on the CTI-REALM benchmark tasks, in order to assess how the choice of agent framework affects detection engineering outcomes.

Background

The CTI-REALM study evaluates models using a single ReAct agent architecture to isolate model capability differences under controlled conditions. This removes agent variability as a confound, but it leaves open how alternative agent frameworks would perform on the same CTI-REALM tasks.

Understanding the relative effectiveness of different agentic paradigms (e.g., plan-and-execute, tree-of-thought) matters for practitioners who want to deploy agents for end-to-end detection engineering workflows, and the authors explicitly identify this comparison as future work.
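
One way to picture the comparison is as a grid: every model is paired with every agent architecture and scored on the same tasks, so that averaging over models leaves differences attributable to the architecture. The sketch below is illustrative only and is not the CTI-REALM harness. The architecture labels, the function names, and the `score_fn` callback are hypothetical, and `score_fn` stands in for whatever per-task scoring the benchmark actually applies.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical labels for the architectures under comparison.
ARCHITECTURES = ["react", "plan_and_execute", "tree_of_thought"]

# Hypothetical callback: (model, architecture, task) -> score for that run.
ScoreFn = Callable[[str, str, str], float]


def run_grid(
    models: List[str],
    architectures: List[str],
    tasks: List[str],
    score_fn: ScoreFn,
) -> Dict[Tuple[str, str], float]:
    """Return the mean task score for every (model, architecture) pair."""
    results: Dict[Tuple[str, str], float] = {}
    for model, arch in product(models, architectures):
        results[(model, arch)] = mean(score_fn(model, arch, t) for t in tasks)
    return results


def architecture_effect(
    results: Dict[Tuple[str, str], float],
) -> Dict[str, float]:
    """Average each architecture's scores across all models."""
    by_arch: Dict[str, List[float]] = {}
    for (_model, arch), score in results.items():
        by_arch.setdefault(arch, []).append(score)
    return {arch: mean(scores) for arch, scores in by_arch.items()}
```

Because every model appears under every architecture, the per-architecture averages are not skewed by which models happened to be run with which framework. A real study would also need repeated runs and variance estimates, which this sketch omits.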

References

We use a single agent architecture to isolate model capability differences under controlled conditions; comparing multiple agent frameworks (e.g., plan-and-execute, tree-of-thought) is left to future work.

CTI-REALM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation Capabilities (2603.13517 - Chakraborty et al., 13 Mar 2026) in Experimental Setup, Agent Architecture