Minimum Redundancy Maximum Relevance (mRMR)
- mRMR is an information-theoretic feature selection approach that identifies feature subsets maximizing relevance to the target and minimizing inter-feature duplication.
- It employs mutual information, along with alternative dependency measures such as HSIC and distance correlation, to handle nonlinear, high-dimensional datasets via scalable algorithms.
- Flexible variants, including weighted and sparse mRMR, address challenges such as estimator sensitivity and automatic subset size selection for diverse applications.
Minimum Redundancy Maximum Relevance (mRMR) is a feature selection paradigm founded on information-theoretic principles, designed to construct subsets of variables that jointly maximize statistical dependency with a target while minimizing mutual information among themselves. Its widespread use in computational biology, neuroinformatics, finance, large-scale machine learning, and network inference is grounded in its balance of predictive power and parsimony, as well as its adaptability to non-linear and high-dimensional data scenarios.
1. Mathematical Foundations of mRMR
The canonical mRMR criterion operates over a candidate feature set $X = \{x_1, \dots, x_p\}$ and a target variable $y$. Its objective, in difference form, is

$$\max_{S \subseteq X}\; \frac{1}{|S|} \sum_{x_i \in S} I(x_i; y) \;-\; \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j),$$

where $I(\cdot\,;\cdot)$ denotes mutual information, quantifying statistical dependence between variables (Yousefi et al., 16 Jan 2026, Wollstadt et al., 2021, Yu et al., 2021). The quotient form is also used:

$$\max_{S \subseteq X}\; \frac{\tfrac{1}{|S|} \sum_{x_i \in S} I(x_i; y)}{\tfrac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)}.$$
These objectives arise naturally under the information bottleneck principle, seeking features that explain the target without inter-feature duplicity.
The mRMR difference criterion is optimized via greedy forward selection: given the current subset $S$, the next feature is

$$x^{*} = \arg\max_{x_j \in X \setminus S} \left[ I(x_j; y) - \frac{1}{|S|} \sum_{x_i \in S} I(x_j; x_i) \right].$$

Selecting the maximizing $x^{*}$ at each step builds the subset iteratively (Barker et al., 2024, Reggiani et al., 2017).
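As a concrete sketch of the greedy difference-form procedure above (not a reference implementation; `mutual_information` and `mrmr_forward` are illustrative names, and a simple plug-in MI estimator for discretized features is assumed):

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in empirical MI (in nats) between two discrete-valued 1-D arrays."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    for (a, b), c in joint.items():
        px[a] = px.get(a, 0) + c
        py[b] = py.get(b, 0) + c
    mi = 0.0
    for (a, b), c in joint.items():
        # (c/n) * log( p(a,b) / (p(a) p(b)) ), with counts substituted
        mi += (c / n) * np.log(c * n / (px[a] * py[b]))
    return mi

def mrmr_forward(X, y, k):
    """Greedy forward selection under the mRMR difference criterion.
    X: (n_samples, n_features) array of discretized features; y: targets.
    Returns the indices of the k selected features."""
    p = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(p)]
    selected = [int(np.argmax(relevance))]       # first pick: max relevance
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, j], X[:, i])
                                  for i in selected])
            score = relevance[j] - redundancy    # difference form
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

On synthetic data with a strongly relevant feature, an exact duplicate of it, and a weakly relevant independent feature, the second greedy pick skips the duplicate (its redundancy cancels its relevance) in favor of the independent feature.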
2. Information-Theoretic Justification, Redundancy, and Synergy
Mutual information, as employed in mRMR, does not distinguish unique versus redundant versus synergistic contributions among variables. The Partial Information Decomposition (PID) framework decomposes the joint information that a candidate $X_i$ and an already-selected set $S$ carry about the target $Y$ into unique, redundant, and synergistic atoms (Wollstadt et al., 2021):
- Relevance (unique + synergy): $I(Y; X_i \mid S) = \mathrm{unq}(Y\!:\!X_i \setminus S) + \mathrm{syn}(Y\!:\!X_i; S)$.
- Redundancy: $\mathrm{shd}(Y\!:\!X_i; S)$, the information about $Y$ that $X_i$ shares with $S$.
The PID analysis establishes that maximizing conditional mutual information (CMI), $I(Y; X_i \mid S)$, achieves true minimum redundancy and maximum relevance, including synergistic effects that standard mRMR may miss. Thus, CMI-based forward–backward selection methods provide PID-optimal feature sets, especially for interactive or nonlinear settings.
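A minimal sketch of CMI-based forward selection for discrete data, using plug-in joint entropies (the function names and the stopping threshold `eps` are illustrative, not taken from the cited work):

```python
import numpy as np
from collections import Counter

def entropy(*cols):
    """Joint plug-in entropy (in nats) of one or more discrete columns."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum((c / n) * np.log(c / n) for c in counts.values())

def cmi(y, x, S):
    """Conditional mutual information I(y; x | S) via the entropy identity
    I(y; x | S) = H(x, S) + H(y, S) - H(x, y, S) - H(S)."""
    if not S:
        return entropy(x) + entropy(y) - entropy(x, y)
    return (entropy(x, *S) + entropy(y, *S)
            - entropy(x, y, *S) - entropy(*S))

def cmi_forward(X, y, k, eps=1e-3):
    """Forward selection by conditional MI; stops early when no candidate
    adds more than eps nats of information about y beyond the selected set."""
    selected, S_cols = [], []
    for _ in range(k):
        scores = [(cmi(y, X[:, j], S_cols), j)
                  for j in range(X.shape[1]) if j not in selected]
        best_score, best_j = max(scores)
        if best_score < eps:
            break                      # nothing left to gain: natural stopping rule
        selected.append(best_j)
        S_cols.append(X[:, best_j])
    return selected
```

Because each score conditions on everything already selected, an exact duplicate of a chosen feature scores zero at the next step, and the procedure stops automatically once the selected set renders the target (conditionally) fully explained.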
3. Algorithmic Implementations: Greedy, MILP, and Scalable MapReduce
Greedy Forward Selection
The standard approach initializes $S = \emptyset$ and iteratively adds the candidate $x_j \in X \setminus S$ that maximizes $I(x_j; y) - \frac{1}{|S|} \sum_{x_i \in S} I(x_j; x_i)$, repeating until $|S| = k$ for a prespecified cardinality $k$ (Yousefi et al., 16 Jan 2026, Barker et al., 2024, Reggiani et al., 2017).
Enhanced MILP Formulations
A mixed-integer linear programming (MILP) approach reformulates mRMR as a fractional program over binary selection variables $z_i \in \{0, 1\}$. Fractional and bilinear terms are convexified via perspective and McCormick relaxations. Perspective-based MILPs achieve provable global optimality and outpace big-M and disjunctive benchmarks in both solution quality and runtime for up to several hundred candidate features (He et al., 22 Aug 2025).
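In outline (a sketch of the standard fractional form; the exact formulation of He et al. may differ in detail), with $r_i = I(x_i; y)$, $q_{ij} = I(x_i; x_j)$, and binary selection variables $z_i$, the quotient-form objective reads

$$\max_{z \in \{0,1\}^p}\; \frac{\sum_i r_i z_i}{\sum_{i,j} q_{ij}\, z_i z_j}, \qquad \sum_i z_i = k.$$

Each bilinear product $z_i z_j$ is replaced by an auxiliary variable $w_{ij}$ subject to the McCormick envelope $w_{ij} \le z_i$, $w_{ij} \le z_j$, $w_{ij} \ge z_i + z_j - 1$, $w_{ij} \ge 0$, which is exact for binary variables; the remaining linear–fractional objective can then be linearized by a Charnes–Cooper-style (perspective) change of variables.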
Scalability Aspects
For massive datasets, distributed MapReduce frameworks partition data either row-wise (“tall and narrow”) or feature-wise (“wide and short”), using combiners and broadcast variables to minimize I/O and network bottlenecks (Vivek et al., 2022, Reggiani et al., 2017). Vertical partitioning (VMR_mRMR) is preferred for “wide” regimes, while horizontal (HMR_mRMR) is superior for “tall” data, and ultra-scalable implementations only require minor adaptation to support arbitrary MI-score functions.
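The vertical-partitioning idea can be mimicked on a single machine: a "map" step scores each feature block against a broadcast target, and a "reduce" step merges the per-block score tables. This is a toy simulation, with absolute correlation standing in for an MI score and all names illustrative:

```python
import numpy as np

def relevance_scores(block, block_offset, y):
    """'Map' step: score every feature in one vertical (feature-wise)
    partition against the broadcast target y. Absolute Pearson correlation
    is used here as a cheap stand-in for an MI estimate."""
    return {block_offset + j: abs(np.corrcoef(block[:, j], y)[0, 1])
            for j in range(block.shape[1])}

def top_feature(X, y, n_partitions=4):
    """'Reduce' step: split X column-wise into partitions, collect each
    partition's score table, merge them, and return the global argmax."""
    blocks = np.array_split(X, n_partitions, axis=1)
    offsets = np.cumsum([0] + [b.shape[1] for b in blocks[:-1]])
    merged = {}
    for block, off in zip(blocks, offsets):
        merged.update(relevance_scores(block, off, y))
    return max(merged, key=merged.get)
```

In a real distributed setting each `relevance_scores` call would run on a separate worker holding only its feature block, with `y` shipped once as a broadcast variable; swapping in a different score function is the "minor adaptation" the text refers to.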
| Formulation | Optimization | Scalability |
|---|---|---|
| Greedy Forward | Iterative, locally optimal | Single machine |
| MILP (PersRLT) | Global opt., tight relaxations | Hundreds of features; optimal certificates |
| MapReduce | Data-parallel WM, vertical/horizontal splitting | Millions of features/samples; distributed |
4. Nonlinear and Nonparametric Extensions
Nonlinear Redundancy and Relevance
Classic mRMR relies on mutual information, often estimated by discretization. To address nonlinearity, several alternative dependency measures have been integrated:
- HSIC (Hilbert–Schmidt Independence Criterion): Captures any nonlinear relationships; leveraged in convex N³LARS, which is scalable and globally optimal (Yamada et al., 2014).
- Distance Correlation ($\mathcal{R}$): zero if and only if the variables are independent; tuning-parameter-free and suitable for functional data (Berrendero et al., 2015).
- Wasserstein Distance: Nonparametric redundancy via optimal transport in continuous data without discretization (Nie et al., 2023).
These measures replace $I(\cdot\,;\cdot)$ in the mRMR objective, providing robustness to sampling noise and complex dependence structures.
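For illustration, the sample distance correlation can be computed directly from double-centered pairwise distance matrices; this is a minimal sketch for 1-D samples (the function name is illustrative):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples.
    In the population it is zero if and only if x and y are independent,
    for any (possibly nonlinear) form of dependence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])        # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double-center each distance matrix (subtract row/column means, add grand mean)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                     # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0
```

Unlike Pearson correlation, this statistic picks up purely nonlinear dependence: for instance, it is clearly positive for $y = x^2$ with symmetric $x$, where the linear correlation is near zero.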
5. Variants, Penalization, and Modern Enhancements
Weighted and Penalized mRMR
A weighted mRMR inserts a tradeoff weight $\alpha \in [0, 1]$ to balance relevance vs. redundancy explicitly:

$$x^{*} = \arg\max_{x_j \in X \setminus S} \left[ \alpha\, I(x_j; y) - (1 - \alpha)\, \frac{1}{|S|} \sum_{x_i \in S} I(x_j; x_i) \right].$$

Empirically, intermediate weights (e.g., $\alpha \approx$ 0.2–0.3) optimize the dimension–accuracy tradeoff in practical applications such as transient-stability assessment of power systems (Li et al., 2019).
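The weighted scoring rule for a single candidate is a one-liner; this sketch (names and default weight are illustrative) makes the relevance/redundancy tradeoff explicit:

```python
import numpy as np

def weighted_mrmr_score(relevance_j, redundancies_j, alpha=0.25):
    """Weighted mRMR score for one candidate feature:
    alpha * relevance - (1 - alpha) * (mean pairwise redundancy).
    With alpha = 0.5 this is a rescaled version of the plain difference form;
    smaller alpha penalizes redundancy more aggressively."""
    redundancy = float(np.mean(redundancies_j)) if len(redundancies_j) else 0.0
    return alpha * relevance_j - (1.0 - alpha) * redundancy
```

Dropping this score into the greedy loop in place of the unweighted difference turns subset compactness into a tunable knob.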
Sparse mRMR (SmRMR)
A penalized continuous relaxation replaces combinatorial selection with nonnegative coefficients $\beta_j \ge 0$, yielding a convex (or, for SCAD/MCP penalties, nonconvex) optimization with knockoff-based FDR control, guaranteeing that inactive features receive exactly zero coefficients under mild conditions (Naylor et al., 26 Aug 2025).
Boosting Unique Relevance
mRMR can be augmented to explicitly account for unique relevance, using either nearest-neighbor MI estimation or a classifier-based conditional loss. The MRwMR-BUR variant demonstrates substantial reductions in selected features while increasing test accuracy relative to plain mRMR (Liu et al., 2022).
Hybrid Wrappers and Metaheuristics
mRMR is often used to pre-filter features, followed by a metaheuristic wrapper (e.g., BHOA), leading to both computational gains and improved accuracy in high-dimensional biology settings (Mehrabi et al., 2023). Hybrid procedures enable large-scale searches over filtered sets (the $k$-best features by mRMR), with fitness scores combining classification performance and subset size.
6. Empirical Evaluation and Application Scenarios
Extensive benchmarking demonstrates that mRMR and its generalizations yield robust, interpretable feature subsets across biomedical, financial, functional data, and network-structure inference problems:
- In EEG-based depression detection, applying mRMR to deep representations yields a 75% reduction in features with an improvement of 1–8 percentage points in accuracy relative to non-mRMR baselines (Yousefi et al., 16 Jan 2026).
- In functional data analysis, distance-correlation based mRMR outperforms mutual-information approaches, with consistently higher accuracy and sparser subsets, especially for small sample sizes $n$ (Berrendero et al., 2015).
- For Bayesian network structure learning, MRMR-based FCBF methods efficiently recover Markov blankets and PC sets with lower computational complexity and accuracy competitive with modern global optimizers (Yu et al., 2021).
- In financial prediction, MRMR-SVM-RFE hybrids improve all accuracy metrics by 3–9 percentage points and better balance redundancy and classifier-alignment (Ding et al., 2024).
- For large-scale high-dimensional problems, scalable MapReduce or Spark implementations retain near-linear speedup and sublinear sensitivity to feature subset size (Reggiani et al., 2017, Vivek et al., 2022).
7. Limitations, Guidelines, and Emerging Directions
- Estimator Choice: MI estimates are sensitive to discretization or kernel parameters; nonparametric and kernel-based dependencies (HSIC, distance correlation) often ameliorate this, especially with continuous or functional data (Yamada et al., 2014, Berrendero et al., 2015).
- Subset Cardinality: Standard mRMR requires pre-specification of subset size. Recent advances, such as genetic algorithms (MVMR-FS) and continuous relaxations, incorporate automatic cardinality selection (Nie et al., 2023, Naylor et al., 26 Aug 2025).
- Scalability: Combinatorial search is NP-hard; distributed methods, blockwise/approximated dependency computation, and convex optimization (e.g., N³LARS, SmRMR) are preferred in large-$n$ or large-$p$ contexts.
- Optimality/Coverage: PID/CMI analysis identifies cases where conventional mRMR fails to distinguish redundancy from synergy. For maximal data efficiency and rigor, CMI-based selection or hybrid mRMR/unique-relevance criteria should be favored where interaction effects exist (Wollstadt et al., 2021, Liu et al., 2022).
In sum, mRMR is a versatile, theoretically justified framework for selecting concise, non-redundant, strongly relevant feature subsets in high-dimensional, nonlinear, and large-scale statistical learning problems, with numerous specialized variants adapted to modern computational and statistical requirements.