
Optimal Decision Trees with Continuous Features

Updated 28 January 2026
  • Optimal decision trees with continuous features are models that optimize split thresholds on continuous variables, avoiding greedy heuristics for improved global performance.
  • Dynamic programming, branch-and-bound, and continuous optimization techniques effectively prune vast search spaces to achieve global or near-global optimality.
  • These methods enhance interpretability and sparsity in applications like classification and regression, utilizing both axis-aligned and oblique splits.

Optimal decision trees with continuous features refer to tree-based predictive models in which the selection of split points (thresholds) is optimized directly over continuous-valued variables, rather than limited to a finite set of categorical or discretely binned options. These models strive for global or near-global optimality in training loss (classification or regression error) and model complexity (e.g., tree size, sparsity), in contrast to greedy decision tree algorithms that make sequential, locally optimal splits. The challenge is fundamentally combinatorial: for even moderate tree depths, the number of possible tree topologies and threshold assignments over continuous domains becomes enormous, making the design of scalable and theoretically sound algorithms nontrivial. The field has seen a surge of methodological advancements involving dynamic programming, tailored branch-and-bound, continuous optimization, and hybrid approaches, all with the goal of efficiently producing interpretable, sparse, and accurate trees for data with continuous features.

1. Mathematical Formulations and Models

The majority of optimal decision tree algorithms with continuous features adopt either axis-aligned (univariate) split models, in which each internal node splits on a single feature at a threshold, or oblique (multivariate) models, where splits occur on general hyperplanes:

  • Axis-aligned model: Each internal node chooses a feature $f$ and a threshold $\tau$, partitioning data by $x_f \le \tau$ vs. $x_f > \tau$.
  • Oblique model: Each internal node learns a vector $a \in \mathbb{R}^p$ and a threshold $\mu$, partitioning via $a^\top x \le \mu$ (Blanquero et al., 2021, Blanquero et al., 2020).
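As a concrete illustration (a sketch, not tied to any cited implementation), the two split predicates differ only in whether routing compares a single coordinate or a learned linear score against the threshold:

```python
import numpy as np

def axis_aligned_split(X, f, tau):
    """Route samples by a single-feature threshold: x_f <= tau vs x_f > tau."""
    mask = X[:, f] <= tau
    return X[mask], X[~mask]

def oblique_split(X, a, mu):
    """Route samples by a hyperplane: a^T x <= mu vs a^T x > mu."""
    mask = X @ a <= mu
    return X[mask], X[~mask]

X = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 4.0]])
left, right = axis_aligned_split(X, f=0, tau=1.5)                 # compares x_0 to 1.5
left_o, right_o = oblique_split(X, a=np.array([1.0, -1.0]), mu=0.0)  # compares x_0 - x_1 to 0
```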

The objective typically minimizes a loss (e.g., $0$-$1$ classification error, squared loss for regression) plus a penalty on tree complexity (leaf count, number of nonzero split coefficients, or features used). For univariate regression problems, this is

$$\min_{T:\,\operatorname{depth}(T) \le d}\ \sum_{i=1}^n \ell(y_i, T(x_i)) + \lambda \cdot S(T)$$

where $\ell$ is the loss and $S(T)$ the complexity penalty (Mazumder et al., 2022, Heredia et al., 27 Oct 2025).
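For a fixed tree the objective is simply loss plus penalty. A minimal sketch for a depth-1 regression stump with mean-valued leaves (squared loss, $S(T)$ taken as the leaf count — an illustrative choice, not any specific paper's penalty):

```python
import numpy as np

def tree_objective(y, y_pred, n_leaves, lam):
    """Regularized objective: sum of squared losses plus lambda * complexity."""
    return float(np.sum((y - y_pred) ** 2) + lam * n_leaves)

# Depth-1 stump: split at tau, each leaf predicts the mean of its samples.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
tau = 2.5
left, right = y[x <= tau], y[x > tau]
y_pred = np.where(x <= tau, left.mean(), right.mean())
obj = tree_objective(y, y_pred, n_leaves=2, lam=0.1)  # zero loss here, so obj = 0.2
```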

Optimal tree learning for continuous features is NP-hard; exact and approximation algorithms rely on smart exploitation of problem structure, relaxation, and bounding strategies.

2. Algorithmic Approaches and Optimization Techniques

2.1 Dynamic Programming and Branch-and-Bound

Recent foundational progress has leveraged dynamic programming (DP) subproblem decomposition with branch-and-bound (BnB) to prune the search space. In "Optimal Classification Trees for Continuous Feature Data Using Dynamic Programming with Branch-and-Bound" (Brita et al., 14 Jan 2025), the DP variable encodes the current data partition and depth. Pruning relies on similarity lower bounds (SLB) across dataset partitions, neighborhood pruning using optimality gaps, and sub-interval exclusion. These rules allow skipping over large regions of the threshold space where improvement is provably impossible. When $d = 2$, efficient algorithms scan all thresholds for root/child splits in $O(pn\log m)$ per-feature time. In practice, this yields $>99\%$ pruning in depth-two subproblems and $1$-$2$ orders of magnitude acceleration over prior DP+BnB methods (Brita et al., 14 Jan 2025, Mazumder et al., 2022).
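The per-feature threshold sweep underlying such methods can be sketched for a single depth-1 subproblem — an illustration of the sorted-scan idea, not ConTree itself. Sorting once lets every threshold be evaluated in a single pass by maintaining running class counts:

```python
import numpy as np

def best_axis_split(x, y):
    """Scan all thresholds of one feature in O(n log n): sort once, then sweep,
    tracking the misclassification error of a depth-1 stump with majority leaves.
    y is assumed binary (0/1)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    total_pos = int(ys.sum())
    best_err, best_tau = n, None
    pos_left = 0
    for i in range(n - 1):
        pos_left += int(ys[i])
        if xs[i] == xs[i + 1]:
            continue  # no valid threshold between equal feature values
        left_n, right_n = i + 1, n - i - 1
        pos_right = total_pos - pos_left
        # each leaf predicts its majority class
        err = min(pos_left, left_n - pos_left) + min(pos_right, right_n - pos_right)
        if err < best_err:
            best_err, best_tau = err, (xs[i] + xs[i + 1]) / 2
    return best_err, best_tau
```

Running this over all $p$ features gives the best root split of a depth-1 tree; the DP+BnB methods above add bounds that let most of these sweeps be skipped entirely.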

2.2 Specialized Branch-and-Bound for Regression

Reduced-space BnB, as in RS-ORT (Heredia et al., 27 Oct 2025), branches exclusively over tree-structural variables (splits and thresholds), not over individual data samples. Bound tightening exploits closed-form solutions for leaf predictions, empirical threshold discretization (restricting candidate thresholds to observed feature values), and optimal depth-1 parsing. Parallel evaluation across nn samples further decouples computation, making the search-tree size independent of sample count. RS-ORT achieves global optimality for trees of modest depth and up to millions of continuous-valued samples (Heredia et al., 27 Oct 2025).

2.3 Quantile- and Sub-block-based Pruning

Quant-BnB (Mazumder et al., 2022) exploits quantile-based dissection of the threshold search space, recursively partitioning intervals of possible thresholds into a small number of subblocks. Analytical upper and lower bounds over these blocks enable aggressive pruning. This approach supports regression and classification, with significant speed gains for shallow (depth $\le 3$) trees.
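The block-and-bound idea can be illustrated with a deliberately simple lower bound (a sketch, not Quant-BnB's actual bounds): for any threshold inside a block $(a, b]$, samples with $x \le a$ are certainly routed left and samples with $x > b$ certainly right, so counting only their best-case leaf errors gives a valid optimistic bound for the whole block:

```python
import numpy as np

def block_lower_bound(x, y, a, b):
    """Optimistic bound on stump error for any threshold tau in (a, b]:
    samples in (a, b] could land in either leaf, so count them as zero error."""
    left = y[x <= a]
    right = y[x > b]
    err_left = min(int(left.sum()), len(left) - int(left.sum()))
    err_right = min(int(right.sum()), len(right) - int(right.sum()))
    return err_left + err_right

def quantile_blocks(x, k):
    """Split the threshold range into k blocks at empirical quantiles."""
    qs = np.quantile(x, np.linspace(0, 1, k + 1))
    return list(zip(qs[:-1], qs[1:]))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
incumbent = 1  # error of the best tree found so far
blocks = quantile_blocks(x, k=3)
# keep only blocks whose optimistic bound could still beat the incumbent
surviving = [(a, b) for a, b in blocks if block_lower_bound(x, y, a, b) < incumbent]
```

Here two of the three blocks are pruned without evaluating any threshold inside them; only the block containing the class boundary survives.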

2.4 AND/OR Graph Search

The Branches algorithm (Chaouki et al., 2024) reformulates optimal tree construction as an AND/OR graph search problem, solved with AO* and dynamic programming, supporting direct handling of continuous splits by augmenting the action space at each branch-state with all possible threshold splits. Purification bounds and best-first expansion guarantee global optimality and outperform classical DP+BnB, especially for larger tree depths.

2.5 Anytime, Limited Discrepancy, and Hybrid Methods

CA-ConTree (Kiossou et al., 21 Jan 2026) addresses the poor anytime performance of depth-first DP+BnB by integrating limited discrepancy search over heuristic feature and threshold orderings. This distributes computational effort across early and late subtrees, yielding high-quality trees quickly at any cutoff while eventually guaranteeing optimality. Hybrid sparse-lookahead methods such as SPLIT (Babbar et al., 21 Feb 2025) perform global DP+BnB up to a small lookahead depth and complete the remaining subtrees greedily; this delivers near-optimal trees in polynomial time while remaining amenable to post-processing for full optimality.

2.6 Continuous Optimization and Differentiable Trees

Continuous relaxations (e.g., ORCT (Blanquero et al., 2021), S-ORCT (Blanquero et al., 2020), argmin-differentiable trees (Zantedeschi et al., 2020)) formulate oblique or randomized trees as continuous nonlinear programs, optimizing over split coefficients and probabilistic routing. Polyhedral norms enforce local and global sparsity, giving explicit trade-offs between accuracy and interpretability (Blanquero et al., 2020). Such approaches bypass combinatorial search at the price of local optimality, but typically provide competitive accuracy and fine-grained sparsity control, and can be used as layers in end-to-end differentiable architectures (Zantedeschi et al., 2020).
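A minimal sketch of probabilistic routing for a depth-1 soft tree (illustrative only — the `temperature` parameter is an assumption of this sketch, not the exact ORCT formulation): replacing the hard indicator $a^\top x \le \mu$ with a sigmoid makes the prediction differentiable in all parameters, so gradient methods apply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_predict(X, a, mu, leaf_left, leaf_right, temperature=1.0):
    """Probabilistic routing: each sample goes left with probability
    sigmoid((mu - a^T x) / temperature); the prediction is the
    probability-weighted mix of leaf values, hence differentiable in
    (a, mu, leaf_left, leaf_right)."""
    p_left = sigmoid((mu - X @ a) / temperature)
    return p_left * leaf_left + (1.0 - p_left) * leaf_right

X = np.array([[0.0], [2.0]])
# As temperature -> 0 the soft routing approaches the hard split a^T x <= mu.
pred = soft_tree_predict(X, a=np.array([1.0]), mu=1.0,
                         leaf_left=0.0, leaf_right=1.0, temperature=0.01)
```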

2.7 Mixed-Integer Optimization (MIO)

MIP-based formulations permit either univariate or hyperplane splits (OCT-H/ORT-H (Bertsimas et al., 2022), FlowOCT (Aghaei et al., 2021)), with variables for split coefficients, assignment of samples to leaves, and binary decisions about split activation. Big-M or big-M-free (locally ideal) encodings are used, optionally augmented with Benders' decomposition and cutting-plane constraints. For continuous features, one often pre-generates a grid of feasible thresholds (e.g., midpoints between unique sorted samples), but these models encounter memory bottlenecks at scale (Bertsimas et al., 2022, Aghaei et al., 2021, Heredia et al., 27 Oct 2025).

2.8 Moving-Horizon Metaheuristics

To handle large-scale, deep-tree regimes, metaheuristic approaches such as MH-DEOCT (Ren et al., 2023) interleave tree construction and optimization, using differential evolution within a moving-horizon strategy. GPU acceleration and intelligent discrete tree decoding (eliminating duplicated candidate splits) enable the scaling of optimal-like trees to 10 million samples and depth eight, with empirical accuracy indistinguishable from global IP (Ren et al., 2023).

3. Handling Continuous Features: Threshold Search, Oblique Splits, and Sparsity

3.1 Threshold Selection and Efficient Pruning

For axis-aligned splits, the core challenge is the explosion in the number of candidate thresholds—a unique candidate exists for every midpoint between sorted values per feature (up to $n-1$ per feature). Modern algorithms restrict and prioritize splits:

  • Prune sub-intervals where the optimality gap indicates no improvement is possible (Brita et al., 14 Jan 2025).
  • For some MIP/Dynamic Programming approaches, only thresholds that yield a change in class distribution are considered (CART-style).
  • Quantile binning and threshold guessing (e.g., via boosted stump ensembles) reduce threshold sets with negligible accuracy impact (Babbar et al., 21 Feb 2025).
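The CART-style reduction above can be sketched as follows (an illustrative implementation, not drawn from any of the cited papers): for $0$-$1$ loss, only midpoints between consecutive distinct sorted values whose labels differ need to be enumerated, since a split between same-label neighbors cannot improve over its boundary alternatives.

```python
import numpy as np

def candidate_thresholds(x, y):
    """CART-style candidate set: midpoints between consecutive distinct
    sorted feature values whose class labels differ."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    taus = []
    for i in range(len(xs) - 1):
        if xs[i] != xs[i + 1] and ys[i] != ys[i + 1]:
            taus.append((xs[i] + xs[i + 1]) / 2)
    return taus

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
taus = candidate_thresholds(x, y)  # only the class-boundary midpoint survives
```

Here the full midpoint set would have $n-1 = 3$ candidates, but only one label boundary exists, so only one threshold needs to be tried.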

3.2 Oblique and Randomized Splits

Oblique methods optimize split hyperplanes, often under $\ell_1$ or group-norm sparsity constraints to ensure interpretability (Blanquero et al., 2021, Blanquero et al., 2020, Bertsimas et al., 2022). Probabilistic routing (randomization at split nodes) renders the process differentiable and improves robustness relative to hard thresholding.

3.3 Sparsity: Local and Global

Sparsity is enforced either at the split (local, few features per node) or tree level (global, few features across all splits) via:

  • $\ell_1$-norm penalties on split vectors for local sparsity,
  • $\ell_\infty$ group norms for global sparsity,
  • explicit cap on the number of splits or features used (Blanquero et al., 2020, Aghaei et al., 2021).

Empirically, moderate sparsity can be achieved with minimal loss in classification accuracy.
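As a small numerical illustration of the two norms (a sketch with a hypothetical coefficient matrix): each row of `A` holds the split coefficients of one internal node. The $\ell_1$ penalty sums all magnitudes, while the group $\ell_\infty$ penalty sums, over features, the largest magnitude across nodes — charging each feature only once no matter how many nodes use it, which is what drives features out of the whole tree.

```python
import numpy as np

# Hypothetical split-coefficient matrix: one row per internal node,
# one column per feature.
A = np.array([[0.9, 0.0, 0.1],
              [0.0, 0.0, 0.8]])

local_sparsity = np.abs(A).sum()               # l1: penalizes coefficients per node
global_sparsity = np.abs(A).max(axis=0).sum()  # group l-inf: penalizes features used anywhere
```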

4. Computational Complexity and Scalability

The search space for optimal trees with continuous features is combinatorial. Key scalability advances include:

  • Aggressive pruning via upper and lower bounds (similarity bounds, interval and neighborhood pruning).
  • Nodewise decomposition and parallel execution decoupling computation across data samples (Heredia et al., 27 Oct 2025).
  • Subdivision of the threshold search space via quantiles or blocks, reducing branching factor (Mazumder et al., 2022).
  • Efficient caching and memoization of subproblems, exploiting isomorphism among data partitions (Brita et al., 14 Jan 2025, Chaouki et al., 2024).
  • GPU-based acceleration for large trees and massive datasets, especially in moving-horizon or hybrid settings (Ren et al., 2023).

State-of-the-art methods (as of 2026) can efficiently compute globally optimal trees of depth 3–5 on datasets with up to $10^6$ samples and 50+ features, with several orders-of-magnitude improvement over prior methods in both speed and memory (Heredia et al., 27 Oct 2025, Brita et al., 14 Jan 2025, Mazumder et al., 2022, Ren et al., 2023, Chaouki et al., 2024).

5. Empirical Performance and Benchmarks

Empirical studies consistently show that optimal trees on continuous features achieve significant accuracy gains over classical greedy heuristics:

  • ConTree (Brita et al., 14 Jan 2025) achieves +4.7 pp test accuracy over CART at depth 3, beating quantile-binarized optimal trees by 1 pp.
  • Quant-BnB (Mazumder et al., 2022) reduces test error vs. CART by 5–30% on two-thirds of datasets at depth 2–3.
  • RS-ORT (Heredia et al., 27 Oct 2025) produces lower or equivalent test RMSEs compared to dominant MIP baselines, at vastly better scaling.
  • S-ORCT (Blanquero et al., 2020) matches or exceeds CART in test accuracy with far greater global or local sparsity.
  • MH-DEOCT (Ren et al., 2023) achieves train/test accuracy within $0.4\%$ of global optimality on 68 UCI datasets, even at $n > 10^7$ and depth 8.

These improvements persist across both classification and regression benchmarks, with optimal trees typically requiring far smaller tree depth and/or number of features to match greedy or ensemble methods (Ren et al., 2023, Babbar et al., 21 Feb 2025).

6. Interpretability and Extensions

Optimal decision trees with continuous features provide interpretable predictive models, revealing global decision rules and variable interactions. Sparsity, depth constraints, and explicit regularization guide interpretability. Several frameworks allow the incorporation of explicit side constraints (e.g., minimum samples per leaf, fairness, and group sparsity) (Aghaei et al., 2021). Extensions include cost-sensitive objectives, minimum leaf-size constraints, coverage constraints for each class, direct AUC or F-score optimization, and calculation of the Rashomon set of near-optimal trees (Babbar et al., 21 Feb 2025).

Oblique and randomized tree models offer continuous, differentiable surfaces, improved robustness to noise, and facilitate integration as layers in end-to-end deep networks (Blanquero et al., 2021, Zantedeschi et al., 2020). MIP-based formulations support direct embedding of optimal trees as explicit, piecewise-linear surrogate constraints in larger optimization and decision-making systems (Bertsimas et al., 2022).

7. Future Directions and Open Problems

Open challenges and promising research directions include:

  • Scalable algorithms for globally optimal trees of depth $d > 4$, especially with many continuous features.
  • Adaptive threshold selection methods, oblique split search, and convex-relaxation-based pruning for even tighter bounds.
  • Efficient exploration and enumeration of Rashomon sets (the $\varepsilon$-near-optimal set of trees) for quantifying model uncertainty (Babbar et al., 21 Feb 2025).
  • Parallel and distributed frameworks for nodewise computation in massive-scale settings.
  • Approximation algorithms with explicit performance guarantees as a function of tree depth, size, and data dimensionality.

Empirical results suggest that hybrid and metaheuristic approaches (e.g., moving-horizon DE, continuous relaxation) will continue bridging the gap between provable optimality and practical scalability, especially when interpretability and feature importance are equally important as prediction accuracy (Ren et al., 2023, Babbar et al., 21 Feb 2025).


