
Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring

Published 11 Jun 2025 in cs.LG and cs.AI | (2506.09742v1)

Abstract: Monitoring Machine Learning (ML) models in production environments is crucial, yet traditional approaches often yield verbose, low-interpretability outputs that hinder effective decision-making. We propose a cognitive architecture for ML monitoring that applies feature engineering principles to agents based on LLMs, significantly enhancing the interpretability of monitoring outputs. Central to our approach is a Decision Procedure module that simulates feature engineering through three key steps: Refactor, Break Down, and Compile. The Refactor step improves data representation to better capture feature semantics, allowing the LLM to focus on salient aspects of the monitoring data while reducing noise and irrelevant information. Break Down decomposes complex information for detailed analysis, and Compile integrates sub-insights into clear, interpretable outputs. This process leads to a more deterministic planning approach, reducing dependence on LLM-generated planning, which can sometimes be inconsistent and overly general. The combination of feature engineering-driven planning and selective LLM utilization results in a robust decision support system, capable of providing highly interpretable and actionable insights. Experiments using multiple LLMs demonstrate the efficacy of our approach, achieving significantly higher accuracy compared to various baselines across several domains.

Summary

  • The paper demonstrates that a structured, feature engineering-inspired decision procedure significantly improves the interpretability and performance of ML monitoring systems.
  • It integrates procedural, episodic, semantic, and working memory modules to manage and synthesize complex data into actionable insights.
  • Experimental results show that the CAMA architecture achieves superior accuracy, reaching up to 92.3% in drift scenarios, compared to baseline methods.


Introduction

The paper "Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring" proposes a novel cognitive architecture designed to enhance the interpretability of outputs produced by ML monitoring systems in production environments. Monitoring ML models is essential as models can degrade over time, but traditional monitoring techniques often produce outputs with low interpretability, making efficient decision-making difficult. By using principles of feature engineering combined with LLMs, the authors aim to automate and improve the interpretability of ML monitoring tools.

Central to the proposal is the Decision Procedure module, which simulates feature engineering through a structured process involving three critical steps: Refactor, Break Down, and Compile. These steps collectively improve data representation, provide detailed analysis of complex information, and synthesize sub-insights into interpretable outputs that enable actionable decision-making. This architecture, termed Cognitive Architecture for Monitoring Agent (CAMA), integrates various memory components to manage monitoring data efficiently, thus creating a robust decision support system. Experimental results across various datasets demonstrate CAMA's superior performance in delivering interpretable insights compared to baseline methods (Figure 1).

Figure 1: Cognitive architecture for ML monitoring. The system integrates Procedural (PM), Episodic (EM), Semantic (SM), and Working (WM) Memory. The central Decision Procedure (DP) implements a feature engineering-inspired approach: Refactor, Break Down, and Compile.

Approach

The cognitive architecture employs a multi-step decision procedure leveraging structured memory modules inspired by cognitive science. The architecture handles monitoring outputs, such as drift scores and SHAP values, providing structured, actionable recommendations. The memory modules are:

  • Procedural Memory (M_P): Contains the agent code and LLM prompts.
  • Episodic Memory (M_E): Stores historical instances of test data and insights.
  • Semantic Memory (M_S): Holds generalized knowledge from training data and models.
  • Working Memory (M_W): Maintains current context and ongoing reasoning processes.
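The four memory modules above can be sketched as simple data containers. The class names mirror the paper's M_P/M_E/M_S/M_W terminology, but the fields and methods below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: field names and the add() helper are
# assumptions, not the paper's actual code.

@dataclass
class ProceduralMemory:  # M_P: agent code and LLM prompts
    prompts: dict = field(default_factory=dict)

@dataclass
class EpisodicMemory:    # M_E: historical test data and insights
    episodes: list = field(default_factory=list)

    def add(self, episode: dict) -> None:
        self.episodes.append(episode)

@dataclass
class SemanticMemory:    # M_S: generalized knowledge of training data/models
    facts: dict = field(default_factory=dict)

@dataclass
class WorkingMemory:     # M_W: current context and ongoing reasoning
    context: dict = field(default_factory=dict)
```

Splitting state this way keeps stable assets (prompts, domain facts) separate from the per-run scratchpad, which is what lets the Refactor step gather context deterministically.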

The decision procedure includes:

  1. Refactor: Improves representation by restructuring data without LLM calls, enhancing feature semantics while removing noise (Figure 2).

    Figure 2: Refactor step gathers and structures information from memories without LLM calls.

  2. Break Down: Analyzes features in parallel, leveraging LLM calls for detailed insights on individual feature interactions (Figure 3).

    Figure 3: Break Down step analyzes features in parallel using LLM calls.

  3. Compile: Synthesizes insights into comprehensive reports through LLM calls, providing clear summaries and recommendations (Figure 4).

    Figure 4: Compile step generates the final comprehensive report.
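The three steps above can be sketched as a small pipeline. This is a hedged illustration only: `call_llm` is a stub standing in for a real LLM client, and the 0.1 drift threshold is an invented example value, not a parameter from the paper:

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stub: a real system would call an LLM API here.
    return f"insight({prompt})"

def refactor(raw: dict) -> dict:
    # Deterministic restructuring, no LLM calls: keep only features
    # whose drift score clears a noise threshold (0.1 is illustrative).
    return {f: v for f, v in raw.items() if v.get("drift", 0.0) >= 0.1}

def break_down(structured: dict) -> list:
    # One LLM call per surviving feature, run in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(call_llm, (f"{f}: {v}" for f, v in structured.items())))

def compile_report(insights: list) -> str:
    # A final LLM call synthesizes the sub-insights into one report.
    return call_llm("summarize: " + "; ".join(insights))

raw = {"age": {"drift": 0.42}, "income": {"drift": 0.03}}
report = compile_report(break_down(refactor(raw)))
```

Note that only Break Down and Compile spend LLM calls; Refactor is plain code, which is where the determinism comes from.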

This approach allows the system to deliver highly interpretable and actionable monitoring outputs, significantly reducing reliance on LLM-generated planning, which tends to be inconsistent and overly general.

Experimental Setup

The architecture was tested using four LLMs of varying sizes and three synthetic datasets (Loan Default Prediction, Eligibility Simulation, Chronic Condition Prediction) designed to simulate real-world drift scenarios. Evaluation involved accuracy, unknown-response ratio, token usage, and processing time, following methodologies similar to MMLU (2506.09742) (Figure 5).
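As a rough illustration of two of these metrics, accuracy and the unknown-response ratio could be computed as follows; the function names and the "unknown" label are assumptions for the sketch:

```python
# Minimal sketch of two evaluation metrics; names are illustrative.

def accuracy(preds: list, truth: list) -> float:
    """Fraction of agent reports that match the ground-truth label."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def unknown_ratio(preds: list) -> float:
    """Fraction of responses where the agent declined to answer."""
    return sum(p == "unknown" for p in preds) / len(preds)

preds = ["drift", "unknown", "drift", "no_drift"]
truth = ["drift", "drift", "drift", "no_drift"]
assert accuracy(preds, truth) == 0.75
assert unknown_ratio(preds) == 0.25
```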

Figure 5: Accuracy comparison for the Healthcare dataset across different models and methods.

Figure 6: Accuracy comparison for the Eligibility dataset across different models and methods.

Figure 7: Accuracy comparison for the Financial dataset across different models and methods.

Results and Discussion

The results reveal that CAMA consistently outperforms all baseline methods across models and datasets. The architecture achieves high accuracy, notably 92.3% with the llama3-70b model, underscoring its robustness in interpreting and acting on complex data scenarios. Its sharp reduction in unknown responses suggests that the structured approach supports comprehensive understanding and insight generation.

CAMA's adaptive use of LLMs and memory components ensures that interpretable and actionable monitoring reports are generated efficiently, with significant improvements over traditional methods. The comprehensive evaluation across datasets shows substantial gains in performance metrics, demonstrating the versatility and applicability of the proposed architecture in real-world scenarios.

Conclusion

The paper introduces a compelling cognitive architecture for ML monitoring that applies feature engineering principles to enhance the interpretability and actionability of monitoring outputs through structured memory and decision-making processes. CAMA's ability to integrate different LLMs and adapt to varying data complexities positions it as a powerful tool for improving ML model monitoring. Future work should explore its adaptability to other domains and optimization techniques to further reduce computational costs while ensuring robust monitoring across diverse production environments.

Explain it Like I'm 14

Simple Explanation of the Paper

Overview

This paper introduces a smart helper system, called CAMA, that keeps an eye on ML models after they’ve been deployed in the real world. Its goal is to turn confusing monitoring numbers into clear, easy-to-understand reports that help people decide what to do next, like retrain a model or fix a data pipeline.

Think of it like a coach that watches a team (the ML model), studies game stats (monitoring tools), and writes a clear game summary with practical advice.

Key Objectives

The paper asks three main questions:

  • Can we use modern AI (Large Language Models, or LLMs) to explain ML monitoring results in plain, useful language?
  • Can we make those explanations more reliable by organizing the information better before the AI analyzes it?
  • Does this approach work well across different kinds and sizes of AI models and in different problem areas?

Methods and Approach

The authors build an “adaptive cognitive architecture” — basically, a structured way for an AI agent to think — and apply feature engineering ideas to the agent’s process. Feature engineering is the practice of cleaning and organizing data so models can understand it better. Here, they use it to help the AI agent understand monitoring outputs better.

They focus on two parts:

The memory modules

The system uses four kinds of “memory,” similar to how people remember things:

  • Procedural Memory: The “how-to” instructions and code the agent uses.
  • Episodic Memory: Notes from past monitoring sessions, like test data and previous reports.
  • Semantic Memory: General knowledge about the training data, the ML model, and tools (like what features mean).
  • Working Memory: The “scratchpad” for whatever the agent is currently analyzing.

This setup helps the agent stay organized, avoid noise, and remember what matters.

The 3-step decision procedure

To turn raw monitoring outputs into a clear report, the agent follows three steps:

  • Refactor: Reorganize the information so important parts stand out and irrelevant details are minimized. Think of it like tidying a messy desk before you start homework.
  • Break Down: Analyze each feature or signal separately and in parallel. It’s like looking at every part of a bike to see which piece is squeaking.
  • Compile: Combine all the mini-insights into one clear report with summaries and actionable recommendations.

This flow reduces the need for the AI to “make up” planning steps and makes its analysis more consistent and understandable.

What They Tested

The team ran experiments using different LLMs (from small to large) and three datasets that represent real-world situations where data can change over time (called “distribution drift”):

  • Loan Default Prediction (easy)
  • Eligibility Simulation (medium)
  • Chronic Condition Prediction (hard)

They used standard monitoring tools (like drift scores and SHAP values, where SHAP scores tell you how much each input feature affected a model’s prediction) and compared their method against popular prompting strategies (like Chain of Thought, Reflection, ReAct, and more).
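As one concrete example of a drift score, the two-sample Kolmogorov-Smirnov statistic measures the biggest gap between the distributions of old and new data (0 means identical, values near 1 mean very different). The implementation below is our own minimal sketch, not the paper's tooling:

```python
# Illustrative drift score: two-sample Kolmogorov-Smirnov statistic
# in plain Python (libraries like SciPy provide an optimized version).

def ks_statistic(ref: list, cur: list) -> float:
    """Max gap between the empirical CDFs of two samples (0 = identical)."""
    ref, cur = sorted(ref), sorted(cur)

    def ecdf(sample, x):
        # Fraction of the sample at or below x.
        return sum(v <= x for v in sample) / len(sample)

    points = ref + cur
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in points)

reference = [1.0, 2.0, 3.0, 4.0]   # training-time feature values
shifted   = [3.0, 4.0, 5.0, 6.0]   # production values after drift
assert ks_statistic(reference, reference) == 0.0
assert ks_statistic(reference, shifted) == 0.5
```

A monitoring tool would compute a score like this per feature, and CAMA's job is to turn those raw numbers into a readable report.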

They measured:

  • Accuracy: How often the agent’s report matched the ground truth.
  • Unknown ratio: How often the agent said “I don’t know.”
  • Tokens: How long the reports were.
  • Time: How long it took to generate the reports.

Main Findings and Why They Matter

The results show that CAMA consistently produced more accurate, more complete, and more useful reports than other methods, across different model sizes and datasets.

Highlights:

  • With a large model (llama3-70b), CAMA reached about 92% accuracy, while the next best method was around 59%.
  • With a medium model (llama3-8b), CAMA hit about 91% accuracy, far above the others.
  • Even with a small model (llama-3.2-1b), CAMA did much better than the alternatives.
  • CAMA had very low “I don’t know” rates, meaning it gave confident, informative answers.

They also did an “ablation study” (removing parts to see what breaks) and found all three steps (Refactor, Break Down, Compile) are necessary. Taking out any step caused big drops in accuracy, especially removing Refactor or Break Down.

Why this matters:

  • Teams can trust the monitoring summaries more.
  • Reports are easier to understand, so decisions like “retrain now” or “collect new data” become clearer.
  • The system works well even with smaller, cheaper models, which is useful in limited-resource settings.

Implications and Potential Impact

This approach can make day-to-day ML monitoring faster, clearer, and more reliable. It helps:

  • Reduce the workload for engineers and data scientists who currently need to interpret complex metrics.
  • Catch problems earlier when data in the real world changes over time (drift), which can prevent bad predictions.
  • Support multi-agent setups where one agent detects a problem and another suggests fixes.

Limitations and future work:

  • The tests used specific datasets and tools, so more testing across different domains would be helpful.
  • It can take longer and use more tokens than simpler methods; future improvements could make it faster and more efficient.

Overall, CAMA offers a practical, LLM-agnostic way to turn technical monitoring outputs into understandable advice, helping organizations keep their ML models healthy and trustworthy.
