Context-Aware Automated Feature Engineering
- CAAFE is a framework that integrates task descriptions, metadata, and side information to automate feature generation and selection.
- It leverages LLM-based prompts, multi-perspective sequence modeling, and genetic algorithms to create and validate context-aware features.
- Empirical studies report significant improvements in ROC AUC and precision-recall metrics across domains like fraud detection, biosignal analysis, and recommendations.
Context-Aware Automated Feature Engineering (CAAFE) encompasses algorithmic frameworks that leverage data context—potentially from task description, metadata, or side information—to drive automated generation, selection, or construction of features for downstream machine learning. CAAFE methods stand in contrast to traditional automated feature engineering (AutoFE) by encoding and exploiting semantic or structural context, typically via generative models, LLMs, or evolutionary search, providing both performance improvements and interpretability gains in diverse domains such as tabular learning, biosignal analysis, sequential fraud detection, and recommender systems (Hollmann et al., 2023, Lin et al., 2023, Lucas et al., 2019, Liu et al., 9 Dec 2025, Livne et al., 2020).
1. Motivations and Foundational Concepts
Automated Feature Engineering aims to mitigate the substantial human effort devoted to data cleaning, feature creation, and domain integration—estimated at ≈77% of the typical data science cycle, as opposed to modeling and hyperparameter tuning (≈23%) (Hollmann et al., 2023). Classical AutoML pipelines do not address this phase. CAAFE frameworks respond by embedding task-specific context—including dataset semantics, data types, domain knowledge, and user constraints—into automated, reproducible, and interpretable feature generation. Research emphasizes context as a means to:
- Enable incorporation of expert/domain knowledge not inferable directly from the data (Hollmann et al., 2023, Liu et al., 9 Dec 2025),
- Regulate feature dimensionality and preserve transparency in high-dimensional settings (Livne et al., 2020, Lucas et al., 2019),
- Ensure semantic relevance and interpretability in generated features (Lin et al., 2023, Hollmann et al., 2023).
2. Algorithmic Methodologies
CAAFE comprises several algorithmic instantiations:
- LLM-driven, prompt-based iterated generation
- Frameworks such as CAAFE (Hollmann et al., 2023), DeepFeature (Liu et al., 9 Dec 2025), and SMARTFEAT (Lin et al., 2023) employ LLMs prompted with dataset descriptions, example rows, and explicit instructions to propose new features as executable code together with explanatory metadata.
- Iterative pipelines: Feature proposals are generated, executable code is synthesized and validated, the downstream model is retrained, and features are retained only if a pre-specified metric (e.g., ROC AUC, accuracy) improves. Validation includes sandboxed code execution and empirical model-performance tests (Hollmann et al., 2023, Lin et al., 2023, Liu et al., 9 Dec 2025).
- Multi-perspective HMMs for sequence modeling
- The CAAFE-HMM approach (Lucas et al., 2019) uses three binary perspectives (Class: fraud/genuine, Context: cardholder/terminal, Temporal: amount/delta) to construct eight context-specific event histories, training one HMM per configuration. The eight windowed sequence-likelihood scores form a context-aware feature vector.
- Evolutionary algorithms for explicit context selection
- In context-aware recommenders, context-subsets are evolved via genetic algorithms (GA), using fitness functions that balance accuracy, subset compactness, and number of active sensor groups. Elite subsets are used directly or as a basis for stacking ensembles (Livne et al., 2020).
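The accept-if-improves skeleton shared by the LLM-driven variants above can be sketched as follows. This is a minimal illustration, not the published implementation: `propose_feature` is a hypothetical stand-in for the LLM call (here it simply proposes a random pairwise product of existing columns), and the held-out ROC AUC gates acceptance.

```python
# Minimal sketch of a CAAFE-style accept-if-improves loop.
# `propose_feature` is a hypothetical placeholder for an LLM proposal.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def propose_feature(X, rng):
    """Stand-in for an LLM-proposed transformation: a random
    pairwise product of two existing columns."""
    i, j = rng.choice(X.shape[1], size=2, replace=False)
    return X[:, i] * X[:, j]

def evaluate(X_tr, y_tr, X_va, y_va):
    """Retrain the downstream model and score it on validation data."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)

best = evaluate(X_tr, y_tr, X_va, y_va)
for _ in range(5):                          # K proposal rounds
    feat = propose_feature(np.vstack([X_tr, X_va]), rng)
    f_tr, f_va = feat[:len(X_tr)], feat[len(X_tr):]
    cand = evaluate(np.column_stack([X_tr, f_tr]), y_tr,
                    np.column_stack([X_va, f_va]), y_va)
    if cand > best:                         # retain only on improvement
        X_tr = np.column_stack([X_tr, f_tr])
        X_va = np.column_stack([X_va, f_va])
        best = cand
```

In the published systems the proposal step additionally returns explanatory metadata and runs inside a sandbox; the gating logic, however, reduces to this compare-and-keep pattern.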
3. Architectural and Operational Details
LLM-Based CAAFE (Tabular and Biosignal Domains)
- Pipeline:
- Inputs: training/validation splits, user-provided context (description, schema, sample rows).
- Prompt construction: includes schema, task intent, column statistics, sample data, and a chain-of-thought rationale template.
- LLM returns: Python code block (feature transformation), explanatory comment.
- Code execution: Executed in a restricted environment (pandas or NumPy), errors and outputs validated.
- Acceptance: Feature retained if empirical metric improves (ΔROC_AUC > 0, ΔACC > 0) over prior round (Hollmann et al., 2023).
- Iteration: Continue until K iterations or convergence.
- Iterative Feedback (DeepFeature):
- Multi-source generation: Direct LLM (S₁), context-guided LLM (S₂: sensor/static context + retrieved literature), operator-based combinations (S₃).
- Feature assessment: RFE, mutual information, confusion matrix feedback.
- Multi-layer filtering: AST parsing, code structure/content validation, canary-batch execution before full application (Liu et al., 9 Dec 2025).
- Efficient Operator Selection (SMARTFEAT):
- Context collector: Assembles contextual embeddings for operator selection.
- Operator selector: FM builds a proposal set or samples candidates by conditional confidence, avoiding the combinatorial explosion of exhaustively enumerating operator–column combinations.
- Function generator: Converts FM descriptions into executable code, verifies correctness.
- Validation: Sanity checks and empirical model performance (Lin et al., 2023).
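The multi-layer filtering idea above (AST parsing, content checks, canary-batch execution before full application) can be illustrated with a short sketch. The function name, the banned-token list, and the restricted namespace are illustrative assumptions, not details from the cited systems:

```python
# Sketch of multi-stage validation for LLM-generated feature code:
# (1) the code must parse as valid Python (AST),
# (2) a crude content filter rejects disallowed constructs,
# (3) the code runs on a small canary batch before the full frame.
import ast
import numpy as np
import pandas as pd

BANNED = ("import", "exec", "eval", "open", "__")  # illustrative list

def validate_and_run(code: str, df: pd.DataFrame, canary_rows: int = 32):
    tree = ast.parse(code)                       # stage 1: must parse
    if any(tok in code for tok in BANNED):       # stage 2: content filter
        raise ValueError("disallowed construct in generated code")
    compiled = compile(tree, "<feature>", "exec")
    out = None
    # Stage 3: canary batch first, then the full DataFrame.
    for frame in (df.head(canary_rows).copy(), df.copy()):
        namespace = {"pd": pd, "np": np, "df": frame}
        exec(compiled, namespace)
        out = namespace["df"]
    return out

df = pd.DataFrame({"amount": [10.0, 200.0, 35.0], "n_tx": [1, 4, 2]})
df2 = validate_and_run("df['amt_per_tx'] = df['amount'] / df['n_tx']", df)
```

A production filter would inspect the AST node types rather than the raw source string, but the staged structure — parse, screen, canary, full run — is the point being illustrated.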
| Framework | Context Types Used | Feature Generation Mechanism |
|---|---|---|
| CAAFE (Hollmann et al., 2023) | Dataset schema, task, sample rows | LLM + iterative retrain+accept+explain |
| DeepFeature (Liu et al., 9 Dec 2025) | Sensor metadata, static+retrieved expert | LLM, context-aware prompt, feedback loop |
| CAAFE-HMM (Lucas et al., 2019) | Sequential event context, label, role | Multi-perspective HMM sequence likelihood |
| CARS-GA (Livne et al., 2020) | Sensor groups, user privacy, etc. | GA subset selection, deep stacking |
| SMARTFEAT (Lin et al., 2023) | Schema, target, model meta | FM-driven operator/function selection |
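The CAAFE-HMM mechanism summarized in the table — per-context generative models scored over sliding event windows, with the log-likelihoods used as features — can be sketched with a minimal discrete forward algorithm. The HMM parameters and the event quantisation below are toy values, not taken from the paper:

```python
# Sketch of windowed sequence-likelihood features in the CAAFE-HMM
# spirit: score each sliding window of symbolised events under an HMM
# and use the log-likelihoods as a feature vector. Toy parameters.
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM
    (scaled forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

# Toy 2-state HMM over 3 event symbols (e.g. quantised amounts).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

events = np.array([0, 1, 0, 2, 2, 1, 0, 0])
window = 4
feats = [forward_loglik(events[i:i + window], pi, A, B)
         for i in range(len(events) - window + 1)]
```

In the full method, one such model is trained per perspective configuration (eight in total), so each transaction receives an eight-dimensional likelihood feature vector rather than the single score shown here.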
4. Evaluation Protocols and Empirical Results
- Tabular Benchmarks (CAAFE, SMARTFEAT):
- CAAFE using GPT-4 improved average ROC AUC from 0.798 to 0.822 across 14 datasets, outperforming a no-feature-engineering baseline and matching the jump from logistic regression to random forest (Hollmann et al., 2023).
- SMARTFEAT achieved an 11.5% AUC gain (0.78→0.87) on the Insurance Claims dataset, with 80% of top-10 features being FM-generated (Lin et al., 2023).
- Sequential Fraud Detection (CAAFE-HMM):
- Precision–Recall AUC improved by +40.5% (raw→raw+HMM) in e-commerce and +85.4% in face-to-face mode using eight HMM-based features over industrial credit card transactions (Lucas et al., 2019).
- Default missing-value encoding (“default0”: missing=0) performed best for incomplete history.
- Wearable Biosignal Tasks (DeepFeature):
- Mean AUROC improved 4.21–9.67% over the best classical/LLM baselines across eight biosignal classification tasks. Context-guided and operator-based augmentations contributed AUROC gains of +3.53% and +6.75% respectively, and the feedback loop added a further +5.58% (Liu et al., 9 Dec 2025).
- Context-Aware Recommendations (CARS-GA):
- GA-based CAAFE pipelines outperformed latent compression and explicit feature selection on the CARS (AUC 0.8108–0.8407 vs. 0.8056 for the best baseline) and Hearo datasets, while using fewer selected features and dimensions (Livne et al., 2020).
5. Interpretability, Robustness, and User Aspects
- Interpretability:
- All LLM-driven pipelines record feature explanations, rationales, and code (e.g., CAAFE's one-line description and “Usefulness” comment templates (Hollmann et al., 2023), DeepFeature’s feature spec JSON (Liu et al., 9 Dec 2025)).
- Sensor- and context-subset methods (CARS-GA) yield explicit mappings from sensor data to recommendations, enabling concrete user-facing explanations (Livne et al., 2020).
- Robustness and Limitations:
- LLM “hallucinations” can result in invalid or nonsensical transformations; robust multi-stage code validation and performance gating provide mitigation but not elimination (Hollmann et al., 2023, Liu et al., 9 Dec 2025, Lin et al., 2023).
- Persistent challenges include limited LLM context windows (e.g., 8k tokens), overfitting on small validation splits, and reliance on user-supplied context.
- In recommender settings, context subset optimization can be tuned to balance privacy/battery constraints with accuracy.
- Efficiency:
- Operator-guided and sampling-based procedures (e.g., SMARTFEAT) avoid the combinatorial growth of naive feature generation (Lin et al., 2023).
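The privacy/battery-versus-accuracy trade-off noted above can be encoded directly in a GA fitness function, as in the CARS-GA idea of penalising both subset size and the number of active sensor groups. The weights, sensor mapping, and helper names below are hypothetical:

```python
# Illustrative GA fitness for context-subset selection: trade off
# model accuracy against subset size and active sensor groups.
# Weights and the sensor-group mapping are hypothetical.
import random

SENSOR_GROUP = {0: "gps", 1: "gps", 2: "mic", 3: "mic", 4: "accel"}

def fitness(mask, accuracy, w_size=0.02, w_groups=0.05):
    """mask: binary list over candidate context features.
    Penalise each active feature and each distinct sensor group
    that must be kept powered on."""
    n_active = sum(mask)
    groups = {SENSOR_GROUP[i] for i, m in enumerate(mask) if m}
    return accuracy - w_size * n_active - w_groups * len(groups)

def mutate(mask, p=0.2):
    """Bit-flip mutation: each gene toggles with probability p."""
    return [b ^ (random.random() < p) for b in mask]

# Two subsets with equal accuracy: the one touching fewer sensor
# groups scores higher fitness.
compact = fitness([1, 1, 0, 0, 0], accuracy=0.9)   # gps only
spread = fitness([1, 0, 1, 0, 1], accuracy=0.9)    # gps + mic + accel
```

A full pipeline would wrap this in standard GA machinery (tournament selection, crossover, elitism) and feed the elite subsets into the stacking ensemble described earlier.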
6. Generalization and Extensions
- Domain Transfer:
- The CAAFE-HMM recipe (multi-context models, per-context windowed sequence analysis, and likelihood features) generalizes to sequential prediction in any domain where events are temporally and contextually structured; any generative model (not restricted to HMMs) can be used (Lucas et al., 2019).
- LLM/AutoML-based CAAFE is applicable to wide-table tabular, sequential, and biosignal/time-series domains, with adaptation for modality- and context-specific feature engineering demonstrated in DeepFeature (Liu et al., 9 Dec 2025).
- Future Directions:
- Potential directions include LLM prompt and model fine-tuning for code correctness, integration with bandit or Bayesian search to optimize feature proposals, advanced statistical acceptance criteria (e.g., cross-validated permutation tests), and explicit support for multimodal or multi-table data (Hollmann et al., 2023, Liu et al., 9 Dec 2025, Lin et al., 2023).
- Human-in-the-loop validation and interactive pipelines are suggested to further increase reliability and semantic alignment (Hollmann et al., 2023).
7. Comparative Summary and Practical Recommendations
CAAFE encompasses a repertoire of strategies ranging from context-specific LLM prompting and multi-perspective sequence modeling to operator-guided function synthesis and evolutionary context-subset selection. Empirical evidence across diverse research demonstrates consistent improvements in predictive accuracy and interpretability across general tabular, biosignal, sequential-event, and recommendation settings (Hollmann et al., 2023, Lucas et al., 2019, Liu et al., 9 Dec 2025, Livne et al., 2020, Lin et al., 2023). Robust feature-acceptance mechanisms and thorough code validation are essential to prevent LLM-driven failures. Explicit modeling of context—task descriptions, domain knowledge, sensor metadata, temporal structures—enables AutoML ecosystems to automate not only modeling but also the key data-engineering bottleneck, supporting the emerging paradigm of “semantic AutoML.”