Shapley Value of Whole Columns
- Shapley value of whole columns is a principled method that attributes feature utility by averaging marginal contributions over all subsets, based on cooperative game theory.
- The approach leverages both sampling-based and nonparametric regression methods to efficiently approximate contributions in high-dimensional datasets.
- Practical variants like SHAP and SAGE illustrate how different payoff functions yield distinct feature attributions, highlighting trade-offs in model interpretation and feature selection.
The Shapley value of whole columns is a principled, axiomatic approach to attributing aggregate utility or importance to each feature column in a dataset, based on averaging the marginal contributions that features make to a model’s performance across all possible subsets. Originating in cooperative game theory and widely adopted in machine learning for Explainable AI (XAI), the Shapley value treats each feature as an agent in a game where the payoff is the model’s evaluation function—such as predictive accuracy, expected log-loss reduction, or information criteria—computed on arbitrary subsets of features. Computation and interpretation of whole-column Shapley values, together with their limitations for feature selection, comprise a central topic in the current literature (Fryer et al., 2021, Miftachov et al., 2022, Li et al., 2024).
1. Formal Definition of the Shapley Value for Feature Columns
Consider a set $N = \{1, \dots, d\}$ indexing the feature columns of a dataset. Let $v : 2^N \to \mathbb{R}$ be a payoff function assigning to each subset $S \subseteq N$ the value $v(S)$, representing the performance achieved by a model utilizing exactly the features in $S$ (with $v(\emptyset) = 0$ by convention). The Shapley value for feature $i$ is defined as:

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(d - |S| - 1)!}{d!}\,\big[v(S \cup \{i\}) - v(S)\big].$$

Here, $S$ ranges over all subsets not containing $i$, the combinatorial weight $\frac{|S|!\,(d - |S| - 1)!}{d!}$ reflects the fraction of all feature orderings in which exactly the features in $S$ precede $i$, and the bracketed term is the marginal contribution of $i$ to $S$. Intuitively, $\phi_i(v)$ quantifies the average extra gain from feature $i$ across all possible contexts in which it may be added (Fryer et al., 2021).
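For small $d$, the definition can be evaluated exactly by enumerating all coalitions. A minimal sketch, where the dictionary payoff is an illustrative stand-in for a real model-evaluation function:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, payoff):
    """Exact Shapley values by enumerating all coalitions.

    `payoff` maps a frozenset of feature indices to a real-valued
    model score; `payoff(frozenset())` should return 0.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                # Fraction of orderings in which exactly S precedes i.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (payoff(S | {i}) - payoff(S))
        phi[i] = total
    return phi

# Toy payoff: hypothetical "variance explained" scores per feature subset.
v = {frozenset(): 0.0, frozenset({0}): 0.4, frozenset({1}): 0.3,
     frozenset({0, 1}): 0.9}
phi = shapley_values([0, 1], v.__getitem__)
```

By efficiency, the two values sum to the full-model payoff $v(\{0,1\}) = 0.9$.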
2. Axiomatic Properties and Their Role
The classical Shapley value allocation is uniquely determined by four axioms interpreted in the feature-column setting:
- Efficiency: $\sum_{i \in N} \phi_i(v) = v(N)$. The total full-model utility is exactly allocated among the features.
- Symmetry: If two features always contribute the same increment to every coalition, their Shapley values are equal.
- Dummy (Null Player): If a feature never increases performance in any subset, its Shapley value is zero.
- Additivity: The Shapley value distributes over payoff functions: for payoff functions $v$ and $w$, $\phi_i(v + w) = \phi_i(v) + \phi_i(w)$.
These axioms underlie a form of "game-theoretic fairness": each feature’s score depends on marginal utility averaged over every possible subset (Fryer et al., 2021). However, this collective rationality does not align perfectly with typical feature selection objectives (see Section 4).
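The efficiency and dummy axioms can be verified numerically on small games. A self-contained sketch, with an illustrative payoff dictionary in which feature 2 is a null player:

```python
from itertools import combinations
from math import factorial

def shapley(n, v):
    """Exact Shapley values for an n-player game with payoff dict v."""
    phi = [0.0] * n
    for i in range(n):
        for k in range(n):
            for S in combinations([j for j in range(n) if j != i], k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v[frozenset(S) | {i}] - v[frozenset(S)])
    return phi

# Feature 2 never changes the payoff of any coalition (a null player).
v = {frozenset(s): score for s, score in [
    ((), 0.0), ((0,), 0.5), ((1,), 0.2), ((2,), 0.0),
    ((0, 1), 0.8), ((0, 2), 0.5), ((1, 2), 0.2), ((0, 1, 2), 0.8)]}
phi = shapley(3, v)
assert abs(sum(phi) - v[frozenset({0, 1, 2})]) < 1e-9  # efficiency
assert abs(phi[2]) < 1e-12                             # dummy axiom
```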
3. Estimation Strategies for Shapley Values
Direct computation of all marginal contributions requires $2^d$ payoff evaluations and is infeasible for large $d$, prompting both analytical and statistical approximations:
3.1. Sampling-Based Approximation (OFA–A Framework)
The OFA–A (One-For-All) framework provides a unified stochastic estimator to efficiently approximate Shapley values for each feature:
- Let $d$ be the number of features.
- For subset sizes $k = 0, 1, \dots, d$, define the sampling probabilities $p_k$, normalized so that $\sum_k p_k = 1$.
- Draw $m$ samples of random subsets $S$ of size $k \sim \{p_k\}$, recording, for each feature $i$, whether it is present (updating a running mean $\bar{v}_i^{+}$ of $v(S)$) or absent (updating $\bar{v}_i^{-}$) in $S$.
- The overall Shapley estimate is computed by $\hat{\phi}_i = \bar{v}_i^{+} - \bar{v}_i^{-}$ for $i = 1, \dots, d$ (Li et al., 2024).
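The one-sample-fits-all idea—every sampled subset simultaneously updates the running means of all features—can be sketched as follows. The uniform subset-size distribution here is a placeholder assumption; the actual probabilities $p_k$ in Li et al. (2024) may differ:

```python
import random

def ofa_style_estimate(n, payoff, m, rng):
    """One-sample-fits-all Monte Carlo sketch: each sampled subset
    updates the 'present' or 'absent' running mean of EVERY feature."""
    sum_in = [0.0] * n; cnt_in = [0] * n
    sum_out = [0.0] * n; cnt_out = [0] * n
    for _ in range(m):
        k = rng.randint(0, n)  # placeholder: uniform size distribution
        S = frozenset(rng.sample(range(n), k))
        val = payoff(S)
        for i in range(n):
            if i in S:
                sum_in[i] += val; cnt_in[i] += 1
            else:
                sum_out[i] += val; cnt_out[i] += 1
    return [sum_in[i] / max(cnt_in[i], 1) - sum_out[i] / max(cnt_out[i], 1)
            for i in range(n)]

# Additive toy payoff: each present feature contributes a fixed weight,
# so more important features should receive larger estimates.
w = [0.6, 0.3, 0.1]
est = ofa_style_estimate(3, lambda S: sum(w[i] for i in S),
                         m=20000, rng=random.Random(0))
```

With the placeholder size distribution the estimator recovers the correct ranking of the features, though matching the exact Shapley values requires the specific weighting of the OFA–A framework.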
3.2. Nonparametric Regression-Based Estimation
For the regression problem $Y = m(X_1, \dots, X_d) + \varepsilon$, the population-level Shapley curve for feature $j$ is:

$$\phi_j(x) = \sum_{S \subseteq \{1,\dots,d\} \setminus \{j\}} \frac{|S|!\,(d - |S| - 1)!}{d!}\,\big[m_{S \cup \{j\}}(x_{S \cup \{j\}}) - m_S(x_S)\big],$$

with $m_S(x_S) = \mathbb{E}[Y \mid X_S = x_S]$. The global (integrated) Shapley value is $\bar{\phi}_j = \mathbb{E}\big[\phi_j(X)\big]$.
Two estimation approaches are:
- Component-based: Separate local-linear regressions for all subset regressions $m_S$, with plug-in estimates for the differences $m_{S \cup \{j\}} - m_S$.
- Integration-based: A full $d$-variate regression $\hat{m}$ is fitted, with estimates for the marginal means $m_S$ constructed via Monte Carlo or kernel methods (Miftachov et al., 2022).
Statistical theory guarantees minimax-optimal convergence rates for mean-integrated squared error under appropriate smoothness (Miftachov et al., 2022).
A wild-bootstrap procedure enables valid confidence bands for the estimated Shapley curves via residual reweighting and local refitting.
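A rough component-based plug-in sketch, substituting Nadaraya–Watson smoothing for the paper's local-linear estimators and using an arbitrary fixed bandwidth:

```python
import numpy as np
from itertools import combinations
from math import factorial

def nw_mean(X_S, y, x_S, h=0.5):
    """Nadaraya-Watson estimate of E[Y | X_S = x_S] (Gaussian kernel)."""
    if X_S.shape[1] == 0:          # empty subset: unconditional mean
        return y.mean()
    d2 = ((X_S - x_S) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    return (w @ y) / w.sum()

def shapley_curve_at(X, y, j, x):
    """Component-based plug-in Shapley curve estimate at point x."""
    _, d = X.shape
    others = [k for k in range(d) if k != j]
    total = 0.0
    for size in range(d):
        for S in combinations(others, size):
            wgt = factorial(size) * factorial(d - size - 1) / factorial(d)
            Sj = list(S) + [j]
            total += wgt * (nw_mean(X[:, Sj], y, x[Sj])
                            - nw_mean(X[:, list(S)], y, x[list(S)]))
    return total

# Toy data: feature 0 enters linearly, so its curve should be positive
# at x_0 = 0.5.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
y = 2 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
val = shapley_curve_at(X, y, j=0, x=np.array([0.5, 0.0]))
```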
4. Failures of Shapley Value for Feature Selection
Despite their theoretical fairness, Shapley values can work against the goal of identifying compact, parsimonious feature subsets. Three illustrative counterexamples demonstrate this:
- Taxicab payoff: Irrelevant features can receive strictly positive Shapley values if they improve performance only in suboptimal models, not in the global optimum.
- Secret holder problem: Essential features with crucial conditional contributions may not receive maximal Shapley credit if their marginal contributions are hidden except in particular coalitions.
- Monotonic suboptimality: Under non-monotonic payoffs (e.g., AIC/BIC), the efficiency axiom forces credit allocation to features that optimal model selection would exclude.
Simulations show that mean-absolute SHAP (using predicted values as payoffs) routinely selects spurious features in Markov boundary and interaction models, while SAGE (using expected-loss difference) more robustly highlights features central to the optimal predictive submodel—but is not universally perfect (Fryer et al., 2021).
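A constructed three-feature game in the spirit of the taxicab counterexample (not the exact payoff from Fryer et al., 2021) makes the first failure concrete: feature 2 improves performance only when used alone, yet receives strictly positive credit even though the optimal subset $\{0, 1\}$ gains nothing from it.

```python
from itertools import combinations
from math import factorial

def shapley(n, v):
    """Exact Shapley values for an n-player game with payoff dict v."""
    phi = [0.0] * n
    for i in range(n):
        for k in range(n):
            for S in combinations([j for j in range(n) if j != i], k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v[frozenset(S) | {i}] - v[frozenset(S)])
    return phi

# Feature 2 helps only in the suboptimal singleton model; adding it to
# any coalition containing 0 or 1 changes nothing.
v = {frozenset(s): p for s, p in [
    ((), 0.0), ((0,), 0.6), ((1,), 0.6), ((2,), 0.3),
    ((0, 1), 1.0), ((0, 2), 0.6), ((1, 2), 0.6), ((0, 1, 2), 1.0)]}
phi = shapley(3, v)
# phi[2] = 1/3 * 0.3 = 0.1 > 0: positive credit for a feature that is
# useless in and around the optimal model.
```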
5. Variants and Interpretations: SHAP, SAGE, and Shapley Curves
Three major practical instantiations of column-wise Shapley values are:
- SHAP (SHapley Additive exPlanations): Computes local Shapley values for individual predictions, using conditional expectation of model output under partially missing features; global SHAP importance averages these absolute values over data points. SHAP’s payoff function is based on prediction, potentially leading to averaging over submodels not optimal for feature selection (Fryer et al., 2021).
- SAGE (Shapley Additive Global importancE): Employs a payoff function based on reduction in expected model loss (e.g., cross-entropy); resulting SAGE values correlate more closely with true submodel relevance (Fryer et al., 2021).
- Shapley curves: Provide a continuous function decomposing a feature’s local contribution across the feature and sample space, with statistical estimation techniques for both pointwise and global quantities and associated uncertainty bands (Miftachov et al., 2022).
The choice of evaluation function is critical: SHAP and SAGE yield different attribution behaviors for the same dataset and model, especially under structural confounding or non-monotonic scoring (Fryer et al., 2021, Miftachov et al., 2022).
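The divergence can be reproduced on a toy misspecified model. Here `v_pred` and `v_loss` are simplified stand-ins for the SHAP-style and SAGE-style payoffs (absent features are marginalized by their mean, an independence assumption): a model that uses a pure-noise feature gets positive prediction-based credit for it, but negative loss-based credit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x0 = rng.standard_normal(n)
x1 = rng.standard_normal(n)            # pure noise, independent of y
y = x0 + 0.1 * rng.standard_normal(n)  # only x0 matters for y
# Misspecified model that also uses the noise feature: f(x) = x0 + x1.

def restricted(S):
    """E[f(X) | X_S]: absent features replaced by their mean (0 here)."""
    out = np.zeros(n)
    if 0 in S: out += x0
    if 1 in S: out += x1
    return out

def v_pred(S):
    """SHAP-style payoff: variance of the restricted prediction."""
    return np.var(restricted(S))

def v_loss(S):
    """SAGE-style payoff: MSE reduction relative to the empty model."""
    return np.mean(y ** 2) - np.mean((y - restricted(S)) ** 2)

def shapley2(v):
    """Exact 2-feature Shapley values for payoff function v."""
    e, a, b, ab = (v(frozenset()), v(frozenset({0})),
                   v(frozenset({1})), v(frozenset({0, 1})))
    return [0.5 * (a - e) + 0.5 * (ab - b),
            0.5 * (b - e) + 0.5 * (ab - a)]

phi_pred = shapley2(v_pred)   # credits the noise feature (~ +1)
phi_loss = shapley2(v_loss)   # penalizes it (~ -1)
```

The same model and data thus yield opposite signs for the noise feature depending on the payoff, illustrating why the choice of $v$ must match the analytical goal.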
6. Practical Lessons, Recommendations, and Statistical Guarantees
The global Shapley value for a column reflects marginal importance averaged over all submodels, not just those in or near the optimum. Several recommendations have emerged:
- Match the payoff function $v$ to the inferential or predictive goal: use SAGE’s loss-based payoff when overall model performance is the target.
- Recognize that Shapley-based rankings can be misaligned with both Markov boundary membership and minimal-optimal feature subsets.
- For non-monotonic payoffs, the Shapley efficiency constraint allocates credit to globally irrelevant features.
- Estimating Shapley value sampling variance or constructing confidence intervals for attributions is essential; high variance indicates instability of rankings (Fryer et al., 2021, Miftachov et al., 2022).
- For large $d$, employ scalable techniques: efficient MC sampling (OFA–A), kernel approximations, and tree-structured recursions (Li et al., 2024).
- Leverage domain knowledge for pre-selection or fixed-inclusion to mitigate known pathologies such as taxicab payoff effects (Fryer et al., 2021).
The wild bootstrap for nonparametric Shapley curves yields consistent confidence bands, while one-shot MC-sampling approaches (OFA–A) enable fast, simultaneous approximate estimation of all feature-column Shapley values with rigorously quantifiable error (Miftachov et al., 2022, Li et al., 2024).
7. Summary Table: Algorithms and Implementations
| Method | Underlying payoff | Computational complexity |
|---|---|---|
| Exact enumeration | Any $v$ | $O(2^d)$ payoff evaluations |
| Kernel SHAP | SHAP (prediction averaging) | MC sampling of coalitions plus weighted least squares |
| TreeSHAP | SHAP for tree ensembles | Polynomial in tree size |
| Monte Carlo SAGE | Expected-loss reduction | MC sampling, $O(m)$ loss evaluations |
| OFA–A one-for-all | Any Beta-probabilistic value | $O(m)$ payoff evaluations, all $d$ features simultaneously |
| Shapley curves | $v(S) = m_S(x_S)$ (population) | Nonparametric regression over the $2^{d-1}$ subsets |
All sampling-based methods require specification of the payoff function $v$ and estimation of performance scores or losses on arbitrary feature subsets. Optimal implementation depends on the feature dimensionality, the nature of $v$, and the relevance of valid uncertainty quantification.
In conclusion, the Shapley value provides a rigorously grounded but not universally optimal solution for feature column attribution and selection, with estimation, interpretive, and methodological nuances that must be matched carefully to the analytical objective and data structure (Fryer et al., 2021, Miftachov et al., 2022, Li et al., 2024).