Shapley Value of Whole Columns
- Shapley value of whole columns is a principled method that attributes feature utility by averaging marginal contributions over all subsets, based on cooperative game theory.
- The approach leverages both sampling-based and nonparametric regression methods to efficiently approximate contributions in high-dimensional datasets.
- Practical variants like SHAP and SAGE illustrate how different payoff functions yield distinct feature attributions, highlighting trade-offs in model interpretation and feature selection.
The Shapley value of whole columns is a principled, axiomatic approach to attributing aggregate utility or importance to each feature column in a dataset, based on averaging the marginal contributions that features make to a model’s performance across all possible subsets. Originating in cooperative game theory and widely adopted in machine learning for Explainable AI (XAI), the Shapley value treats each feature as an agent in a game where the payoff is the model’s evaluation function—such as predictive accuracy, expected log-loss reduction, or information criteria—computed on arbitrary subsets of features. Computation and interpretation of whole-column Shapley values, together with their limitations for feature selection, comprise a central topic in the current literature (Fryer et al., 2021, Miftachov et al., 2022, Li et al., 2024).
1. Formal Definition of the Shapley Value for Feature Columns
Consider a set $N = \{1, \dots, d\}$ indexing the feature columns of a dataset. Let $v : 2^N \to \mathbb{R}$ be a payoff function assigning to each subset $S \subseteq N$ the value $v(S)$, representing the performance achieved by a model utilizing exactly the features in $S$ (with $v(\emptyset) = 0$ by convention). The Shapley value for feature $i$ is defined as:

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(d - |S| - 1)!}{d!}\,\big[v(S \cup \{i\}) - v(S)\big].$$

Here, $S$ ranges over all subsets not containing $i$, the combinatorial weight $\frac{|S|!\,(d - |S| - 1)!}{d!}$ reflects the fraction of all feature orderings in which exactly the features in $S$ precede $i$, and the bracketed term is the marginal contribution of $i$ to $S$. Intuitively, $\phi_i(v)$ quantifies the average extra gain from feature $i$ across all possible contexts in which it may be added (Fryer et al., 2021).
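For small $d$, the definition can be evaluated exactly by enumerating all coalitions. A minimal sketch, where the dictionary payoff is an illustrative stand-in for a real model-evaluation function:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, payoff):
    """Exact Shapley values by enumerating all coalitions.

    `payoff` maps a frozenset of feature indices to a real-valued
    model score; `payoff(frozenset())` should return 0.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                # Fraction of orderings in which exactly S precedes i.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (payoff(S | {i}) - payoff(S))
        phi[i] = total
    return phi

# Toy payoff: hypothetical "variance explained" scores per feature subset.
v = {frozenset(): 0.0, frozenset({0}): 0.4, frozenset({1}): 0.3,
     frozenset({0, 1}): 0.9}
phi = shapley_values([0, 1], v.__getitem__)
```

By efficiency, the two values sum to the full-model payoff $v(\{0,1\}) = 0.9$.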
2. Axiomatic Properties and Their Role
The classical Shapley value allocation is uniquely determined by four axioms interpreted in the feature-column setting:
- Efficiency: $\sum_{i \in N} \phi_i(v) = v(N)$. The total full-model utility is exactly allocated among the features.
- Symmetry: If two features always contribute the same increment to every coalition, their Shapley values are equal.
- Dummy (Null Player): If a feature never increases performance in any subset, its Shapley value is zero.
- Additivity: The Shapley value distributes over payoff functions: for payoff functions $v$ and $w$, $\phi_i(v + w) = \phi_i(v) + \phi_i(w)$.
These axioms underlie a form of "game-theoretic fairness": each feature’s score depends on marginal utility averaged over every possible subset (Fryer et al., 2021). However, this collective rationality does not align perfectly with typical feature selection objectives (see Section 4).
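The efficiency and dummy axioms can be verified numerically on small games. A self-contained sketch, with an illustrative payoff dictionary in which feature 2 is a null player:

```python
from itertools import combinations
from math import factorial

def shapley(n, v):
    """Exact Shapley values for an n-player game with payoff dict v."""
    phi = [0.0] * n
    for i in range(n):
        for k in range(n):
            for S in combinations([j for j in range(n) if j != i], k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v[frozenset(S) | {i}] - v[frozenset(S)])
    return phi

# Feature 2 never changes the payoff of any coalition (a null player).
v = {frozenset(s): score for s, score in [
    ((), 0.0), ((0,), 0.5), ((1,), 0.2), ((2,), 0.0),
    ((0, 1), 0.8), ((0, 2), 0.5), ((1, 2), 0.2), ((0, 1, 2), 0.8)]}
phi = shapley(3, v)
assert abs(sum(phi) - v[frozenset({0, 1, 2})]) < 1e-9  # efficiency
assert abs(phi[2]) < 1e-12                             # dummy axiom
```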
3. Estimation Strategies for Shapley Values
Direct computation of all marginal contributions requires $2^d$ payoff evaluations and is infeasible for large $d$, prompting both analytical and statistical approximations:
3.1. Sampling-Based Approximation (OFA–A Framework)
The OFA–A (One-For-All) framework provides a unified stochastic estimator to efficiently approximate Shapley values for each feature:
- Let $d$ be the number of features.
- For subset sizes $k = 0, 1, \dots, d$, define the sampling probabilities $p_k$, normalized so that $\sum_k p_k = 1$.
- Draw $m$ samples of random subsets $S$ of size $k \sim \{p_k\}$, recording, for each feature $i$, whether it is present (updating a running mean $\bar{v}_i^{+}$ of $v(S)$) or absent (updating $\bar{v}_i^{-}$) in $S$.
- The overall Shapley estimate is computed by $\hat{\phi}_i = \bar{v}_i^{+} - \bar{v}_i^{-}$ for $i = 1, \dots, d$ (Li et al., 2024).
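The one-sample-fits-all idea—every sampled subset simultaneously updates the running means of all features—can be sketched as follows. The uniform subset-size distribution here is a placeholder assumption; the actual probabilities $p_k$ in Li et al. (2024) may differ:

```python
import random

def ofa_style_estimate(n, payoff, m, rng):
    """One-sample-fits-all Monte Carlo sketch: each sampled subset
    updates the 'present' or 'absent' running mean of EVERY feature."""
    sum_in = [0.0] * n; cnt_in = [0] * n
    sum_out = [0.0] * n; cnt_out = [0] * n
    for _ in range(m):
        k = rng.randint(0, n)  # placeholder: uniform size distribution
        S = frozenset(rng.sample(range(n), k))
        val = payoff(S)
        for i in range(n):
            if i in S:
                sum_in[i] += val; cnt_in[i] += 1
            else:
                sum_out[i] += val; cnt_out[i] += 1
    return [sum_in[i] / max(cnt_in[i], 1) - sum_out[i] / max(cnt_out[i], 1)
            for i in range(n)]

# Additive toy payoff: each present feature contributes a fixed weight,
# so more important features should receive larger estimates.
w = [0.6, 0.3, 0.1]
est = ofa_style_estimate(3, lambda S: sum(w[i] for i in S),
                         m=20000, rng=random.Random(0))
```

With the placeholder size distribution the estimator recovers the correct ranking of the features, though matching the exact Shapley values requires the specific weighting of the OFA–A framework.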
3.2. Nonparametric Regression-Based Estimation
For the regression problem $Y = m(X_1, \dots, X_d) + \varepsilon$, the population-level Shapley curve for feature $j$ is:

$$\phi_j(x) = \sum_{S \subseteq \{1,\dots,d\} \setminus \{j\}} \frac{|S|!\,(d - |S| - 1)!}{d!}\,\big[m_{S \cup \{j\}}(x_{S \cup \{j\}}) - m_S(x_S)\big],$$

with $m_S(x_S) = \mathbb{E}[Y \mid X_S = x_S]$. The global (integrated) Shapley value is $\bar{\phi}_j = \mathbb{E}\big[\phi_j(X)\big]$.
Two estimation approaches are:
- Component-based: Separate local-linear regressions for all subset regressions $m_S$, with plug-in estimates for the differences $m_{S \cup \{j\}} - m_S$.
- Integration-based: A full $d$-variate regression $\hat{m}$ is fitted, with estimates for the marginal means $m_S$ constructed via Monte Carlo or kernel methods (Miftachov et al., 2022).
Statistical theory guarantees minimax-optimal convergence rates for mean-integrated squared error under appropriate smoothness (Miftachov et al., 2022).
A wild-bootstrap procedure enables valid confidence bands for the estimated Shapley curves via residual reweighting and local refitting.
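A rough component-based plug-in sketch, substituting Nadaraya–Watson smoothing for the paper's local-linear estimators and using an arbitrary fixed bandwidth:

```python
import numpy as np
from itertools import combinations
from math import factorial

def nw_mean(X_S, y, x_S, h=0.5):
    """Nadaraya-Watson estimate of E[Y | X_S = x_S] (Gaussian kernel)."""
    if X_S.shape[1] == 0:          # empty subset: unconditional mean
        return y.mean()
    d2 = ((X_S - x_S) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    return (w @ y) / w.sum()

def shapley_curve_at(X, y, j, x):
    """Component-based plug-in Shapley curve estimate at point x."""
    _, d = X.shape
    others = [k for k in range(d) if k != j]
    total = 0.0
    for size in range(d):
        for S in combinations(others, size):
            wgt = factorial(size) * factorial(d - size - 1) / factorial(d)
            Sj = list(S) + [j]
            total += wgt * (nw_mean(X[:, Sj], y, x[Sj])
                            - nw_mean(X[:, list(S)], y, x[list(S)]))
    return total

# Toy data: feature 0 enters linearly, so its curve should be positive
# at x_0 = 0.5.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 2))
y = 2 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
val = shapley_curve_at(X, y, j=0, x=np.array([0.5, 0.0]))
```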
4. Failures of Shapley Value for Feature Selection
Despite their theoretical fairness, Shapley values can work against the goal of identifying compact, parsimonious feature subsets. Three illustrative counterexamples demonstrate this:
- Taxicab payoff: Irrelevant features can receive strictly positive Shapley values if they improve performance only in suboptimal models, not in the global optimum.
- Secret holder problem: Essential features with crucial conditional contributions may not receive maximal Shapley credit if their marginal contributions are hidden except in particular coalitions.
- Monotonic suboptimality: Under non-monotonic payoffs (e.g., AIC/BIC), the efficiency axiom forces credit allocation to features that optimal model selection would exclude.
Simulations show that mean-absolute SHAP (using predicted values as payoffs) routinely selects spurious features in Markov boundary and interaction models, while SAGE (using expected-loss difference) more robustly highlights features central to the optimal predictive submodel—but is not universally perfect (Fryer et al., 2021).
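A constructed three-feature game in the spirit of the taxicab counterexample (not the exact payoff from Fryer et al., 2021) makes the first failure concrete: feature 2 improves performance only when used alone, yet receives strictly positive credit even though the optimal subset $\{0, 1\}$ gains nothing from it.

```python
from itertools import combinations
from math import factorial

def shapley(n, v):
    """Exact Shapley values for an n-player game with payoff dict v."""
    phi = [0.0] * n
    for i in range(n):
        for k in range(n):
            for S in combinations([j for j in range(n) if j != i], k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v[frozenset(S) | {i}] - v[frozenset(S)])
    return phi

# Feature 2 helps only in the suboptimal singleton model; adding it to
# any coalition containing 0 or 1 changes nothing.
v = {frozenset(s): p for s, p in [
    ((), 0.0), ((0,), 0.6), ((1,), 0.6), ((2,), 0.3),
    ((0, 1), 1.0), ((0, 2), 0.6), ((1, 2), 0.6), ((0, 1, 2), 1.0)]}
phi = shapley(3, v)
# phi[2] = 1/3 * 0.3 = 0.1 > 0: positive credit for a feature that is
# useless in and around the optimal model.
```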
5. Variants and Interpretations: SHAP, SAGE, and Shapley Curves
Three major practical instantiations of column-wise Shapley values are:
- SHAP (SHapley Additive exPlanations): Computes local Shapley values for individual predictions, using conditional expectation of model output under partially missing features; global SHAP importance averages these absolute values over data points. SHAP’s payoff function is based on prediction, potentially leading to averaging over submodels not optimal for feature selection (Fryer et al., 2021).
- SAGE (Shapley Additive Global importancE): Employs a payoff function based on reduction in expected model loss (e.g., cross-entropy); resulting SAGE values correlate more closely with true submodel relevance (Fryer et al., 2021).
- Shapley curves: Provide a continuous function decomposing a feature’s local contribution across the feature and sample space, with statistical estimation techniques for both pointwise and global quantities and associated uncertainty bands (Miftachov et al., 2022).
The choice of evaluation function is critical: SHAP and SAGE yield different attribution behaviors for the same dataset and model, especially under structural confounding or non-monotonic scoring (Fryer et al., 2021, Miftachov et al., 2022).
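The divergence can be reproduced on a toy misspecified model. Here `v_pred` and `v_loss` are simplified stand-ins for the SHAP-style and SAGE-style payoffs (absent features are marginalized by their mean, an independence assumption): a model that uses a pure-noise feature gets positive prediction-based credit for it, but negative loss-based credit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x0 = rng.standard_normal(n)
x1 = rng.standard_normal(n)            # pure noise, independent of y
y = x0 + 0.1 * rng.standard_normal(n)  # only x0 matters for y
# Misspecified model that also uses the noise feature: f(x) = x0 + x1.

def restricted(S):
    """E[f(X) | X_S]: absent features replaced by their mean (0 here)."""
    out = np.zeros(n)
    if 0 in S: out += x0
    if 1 in S: out += x1
    return out

def v_pred(S):
    """SHAP-style payoff: variance of the restricted prediction."""
    return np.var(restricted(S))

def v_loss(S):
    """SAGE-style payoff: MSE reduction relative to the empty model."""
    return np.mean(y ** 2) - np.mean((y - restricted(S)) ** 2)

def shapley2(v):
    """Exact 2-feature Shapley values for payoff function v."""
    e, a, b, ab = (v(frozenset()), v(frozenset({0})),
                   v(frozenset({1})), v(frozenset({0, 1})))
    return [0.5 * (a - e) + 0.5 * (ab - b),
            0.5 * (b - e) + 0.5 * (ab - a)]

phi_pred = shapley2(v_pred)   # credits the noise feature (~ +1)
phi_loss = shapley2(v_loss)   # penalizes it (~ -1)
```

The same model and data thus yield opposite signs for the noise feature depending on the payoff, illustrating why the choice of $v$ must match the analytical goal.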
6. Practical Lessons, Recommendations, and Statistical Guarantees
The global Shapley value for a column reflects marginal importance averaged over all submodels, not just those in or near the optimum. Several recommendations have emerged:
- Match the payoff function $v$ to the inferential or predictive goal: use SAGE’s loss-based payoff when overall model performance is the target.
- Recognize that Shapley-based rankings can be misaligned with both Markov boundary membership and minimal-optimal feature subsets.
- For non-monotonic payoffs, the Shapley efficiency constraint allocates credit to globally irrelevant features.
- Estimating Shapley value sampling variance or constructing confidence intervals for attributions is essential; high variance indicates instability of rankings (Fryer et al., 2021, Miftachov et al., 2022).
- For large $d$, employ scalable techniques: efficient MC sampling (OFA–A), kernel approximations, and tree-structured recursions (Li et al., 2024).
- Leverage domain knowledge for pre-selection or fixed-inclusion to mitigate known pathologies such as taxicab payoff effects (Fryer et al., 2021).
The wild bootstrap for nonparametric Shapley curves yields consistent confidence bands, while one-shot MC-sampling approaches (OFA–A) enable fast, simultaneous approximate estimation of all feature-column Shapley values with rigorously quantifiable error (Miftachov et al., 2022, Li et al., 2024).
7. Summary Table: Algorithms and Implementations
| Method | Underlying payoff | Computational complexity |
|---|---|---|
| Exact enumeration | Any $v$ | $O(2^d)$ payoff evaluations |
| Kernel SHAP | SHAP (prediction averaging) | MC sampling of coalitions plus weighted least squares |
| TreeSHAP | SHAP for tree ensembles | Polynomial in tree size |
| Monte Carlo SAGE | Expected-loss reduction | MC sampling, $O(m)$ loss evaluations |
| OFA–A one-for-all | Any Beta-probabilistic value | $O(m)$ payoff evaluations, all $d$ features simultaneously |
| Shapley curves | $v(S) = m_S(x_S)$ (population) | Nonparametric regression over the $2^{d-1}$ subsets |
All sampling-based methods require specification of the payoff function $v$ and estimation of performance scores or losses on arbitrary feature subsets. Optimal implementation depends on the feature dimensionality, the nature of $v$, and the relevance of valid uncertainty quantification.
In conclusion, the Shapley value provides a rigorously grounded but not universally optimal solution for feature column attribution and selection, with estimation, interpretive, and methodological nuances that must be matched carefully to the analytical objective and data structure (Fryer et al., 2021, Miftachov et al., 2022, Li et al., 2024).