Minimal Polynomial-Kernel Feature Basis
- Minimal polynomial-kernel feature basis is a compact set of feature vectors that exactly spans the native Hilbert space of a fixed-degree polynomial kernel.
- Deterministic and randomized constructions, via fundamental systems and Maclaurin expansions, ensure full kernel expressiveness with minimal computational and storage overhead.
- The methodologies offer strong theoretical guarantees and have demonstrated empirical efficiency in scalable kernel classification and kernelized attention applications.
A minimal polynomial-kernel feature basis is a compact and theoretically optimal set of feature vectors that exactly spans the native Hilbert space associated with a polynomial kernel of fixed degree. This concept underpins recent advances in both scalable kernel classification and kernelized attention, ensuring that the full expressive power of the polynomial kernel is available with drastically reduced storage and computational requirements, achieved through careful basis selection or randomization. Key developments include deterministic constructions via fundamental systems and random Maclaurin features certified by harmonic-analysis results such as Schoenberg's theorem.
1. Polynomial Kernels and Their Native Feature Spaces
Given $d$-dimensional inputs $x, y \in \mathbb{R}^d$ and a fixed integer $m \ge 1$, the $m$-th degree polynomial kernel is defined by

$$K_m(x, y) = \big(1 + \langle x, y \rangle\big)^m,$$

where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean dot product. The native reproducing-kernel Hilbert space (RKHS) for $K_m$ is the set of real polynomials of total degree at most $m$:

$$\mathcal{H}_m = \Pi_m(\mathbb{R}^d) = \operatorname{span}\{\, x^\alpha : |\alpha| \le m \,\}.$$

The dimension of $\Pi_m(\mathbb{R}^d)$ is

$$N = \binom{d + m}{m},$$

which is finite for any $d, m$ (Zeng et al., 2019).
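As a quick sanity check, the dimension formula can be evaluated directly; this is a minimal sketch and the function name is ours:

```python
from math import comb

def poly_space_dim(d: int, m: int) -> int:
    """Dimension N = C(d + m, m) of the space of real polynomials
    of total degree <= m in d variables."""
    return comb(d + m, m)

# e.g. a quadratic kernel on 10-dimensional inputs
print(poly_space_dim(10, 2))  # -> 66
```

Even for moderate $d$ and $m$ this dimension stays far below the $n^2$ entries of a full kernel matrix on $n$ samples.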
2. Deterministic Construction: Fundamental Systems and Feature Expansions
A minimal feature basis for polynomial-kernel classification comprises $N$ carefully chosen “center points” $x_1, \dots, x_N \in \mathbb{R}^d$, such that the kernel translates

$$K_m(\cdot, x_j), \qquad j = 1, \dots, N,$$

span the native space $\Pi_m(\mathbb{R}^d)$. This set is called a $\Pi_m$–fundamental system if the corresponding Gram matrix

$$G = \big[K_m(x_i, x_j)\big]_{i,j=1}^{N}$$

is nonsingular, equivalently $\det G \neq 0$.

A classical result asserts that $N$ points drawn i.i.d. from any absolutely-continuous distribution almost surely form a fundamental system—so in practice, selecting $N$ distinct samples suffices (Zeng et al., 2019). Any $f \in \Pi_m(\mathbb{R}^d)$ then admits a unique expansion

$$f(x) = \sum_{j=1}^{N} c_j\, K_m(x, x_j)$$

with coefficients $c = (c_1, \dots, c_N)^\top \in \mathbb{R}^N$. The induced feature map is

$$\Phi(x) = \big(K_m(x, x_1), \dots, K_m(x, x_N)\big)^\top \in \mathbb{R}^N,$$

which is minimal by construction.
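The construction can be sketched in a few lines, assuming the inhomogeneous kernel $K_m(x, y) = (1 + \langle x, y \rangle)^m$; the function and variable names are illustrative:

```python
import numpy as np
from math import comb

def poly_features(X, centers, m):
    """Feature map Phi(x) = (K_m(x, x_1), ..., K_m(x, x_N))
    for K_m(x, y) = (1 + <x, y>)^m and center points x_1, ..., x_N."""
    return (1.0 + X @ centers.T) ** m

rng = np.random.default_rng(0)
d, m = 3, 2
N = comb(d + m, m)                      # N = C(5, 2) = 10
centers = rng.standard_normal((N, d))   # i.i.d. absolutely-continuous draws:
                                        # almost surely a fundamental system
G = poly_features(centers, centers, m)  # N x N Gram matrix
assert np.linalg.matrix_rank(G) == N    # nonsingular <=> fundamental system
```

The rank check verifies numerically that the randomly drawn centers indeed form a fundamental system, as the almost-sure result predicts.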
3. Randomized Construction: Minimal Random-Maclaurin Feature Bases via Schoenberg's Theorem
For broader kernelized modeling—including efficient attention mechanisms—the minimal polynomial basis may be constructed via randomized Maclaurin feature maps in accordance with Schoenberg’s theorem. Any dot-product kernel $k(x, y) = f(\langle x, y \rangle)$ with nonnegative Maclaurin coefficients $a_n \ge 0$, such as $f(t) = (1 + t)^m = \sum_{n=0}^{m} \binom{m}{n} t^n$, can be expanded as

$$k(x, y) = \sum_{n \ge 0} a_n \langle x, y \rangle^n.$$
A minimal random-feature approximation is achieved by drawing $D$ features as follows (Guo et al., 18 May 2025):
- For each feature index $i = 1, \dots, D$, sample a degree $n_i$ with probability $p_n \propto a_n$.
- For $j = 1, \dots, n_i$, draw independent Rademacher vectors $w_{i,j} \in \{-1, +1\}^d$.
- Form the raw feature $z_i(x) = \sqrt{a_{n_i}/p_{n_i}} \prod_{j=1}^{n_i} \langle w_{i,j}, x \rangle$.
- The map $\phi(x) = D^{-1/2}\big(z_1(x), \dots, z_D(x)\big)^\top$ satisfies $\mathbb{E}\big[\langle \phi(x), \phi(y) \rangle\big] = k(x, y)$.
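A runnable sketch of this sampling scheme for the degree-$m$ kernel $(1 + \langle x, y \rangle)^m$, whose Maclaurin coefficients are binomial coefficients; the function name and the empirical unbiasedness check are ours:

```python
import numpy as np
from math import comb

def random_maclaurin(X, m, D, rng):
    """Unbiased random features for k(x, y) = (1 + <x, y>)^m,
    whose Maclaurin coefficients are a_n = C(m, n), n = 0..m."""
    n_pts, d = X.shape
    a = np.array([comb(m, n) for n in range(m + 1)], dtype=float)
    p = a / a.sum()                        # sample degree n with prob a_n / sum(a)
    Z = np.empty((n_pts, D))
    for i in range(D):
        n = rng.choice(m + 1, p=p)
        W = rng.choice([-1.0, 1.0], size=(n, d))   # n Rademacher vectors
        # sqrt(a_n / p_n) = sqrt(sum(a)) is constant under this sampling law
        Z[:, i] = np.sqrt(a.sum()) * np.prod(W @ X.T, axis=0)
    return Z / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) / 3.0      # small norms keep feature variance low
K_exact = (1.0 + X @ X.T) ** 3
Z = random_maclaurin(X, m=3, D=40000, rng=rng)
K_hat = Z @ Z.T                            # approximates K_exact in expectation
```

Unbiasedness follows because $\mathbb{E}[\langle w, x \rangle \langle w, y \rangle] = \langle x, y \rangle$ for a Rademacher vector $w$, so each sampled degree contributes $a_n \langle x, y \rangle^n$ on average.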
A two-stage regularization (“pre–post SBN”) wraps this mapping, where batch normalization and projection onto the unit sphere enforce compatibility with Schoenberg's theorem (which assumes unit-norm inputs, $\|x\|_2 = 1$), and subsequent learned scaling parameters restore output magnitudes for downstream computation (Guo et al., 18 May 2025).
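One plausible reading of this pre–post normalization can be sketched as follows; this is an illustrative interpretation, not the authors' exact procedure, and the function name and the learned scalar `gamma` are our assumptions:

```python
import numpy as np

def pre_post_sbn(phi, X, gamma=1.0, eps=1e-12):
    """Illustrative sketch of pre-post normalization: standardize features,
    project inputs onto the unit sphere (the regime Schoenberg's theorem
    assumes), then restore magnitude with a learned post-scaling gamma."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)            # batch-norm step
    Xs = Xc / (np.linalg.norm(Xc, axis=1, keepdims=True) + eps)  # unit sphere
    return gamma * phi(Xs)                                       # learned scaling

# usage with an arbitrary feature map (identity here, for illustration)
Z = pre_post_sbn(lambda A: A, np.random.default_rng(0).standard_normal((4, 3)))
```

With `gamma = 1` and the identity map, every output row lies on the unit sphere, which is exactly the precondition the theorem requires.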
4. Optimization and Computational Properties
In classification, the margin-free empirical risk

$$\min_{f \in \Pi_m(\mathbb{R}^d)} \; \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i f(x_i)\big)$$

is minimized over $\Pi_m(\mathbb{R}^d)$, reducible via the expansion $f = \sum_j c_j K_m(\cdot, x_j)$ to the finite-dimensional problem

$$\min_{c \in \mathbb{R}^N} \; \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i (\Phi c)_i\big), \qquad \Phi_{ij} = K_m(x_i, x_j)$$

(Zeng et al., 2019). The ADMM algorithm solves this with provable global convergence for any penalty parameter $\rho > 0$: introducing the split $u = Y \Phi c$ with $Y = \operatorname{diag}(y_1, \dots, y_n)$, it applies closed-form updates exploiting the Gram structure,

$$c^{k+1} = \big(\Phi^\top \Phi\big)^{-1} \Phi^\top Y \big(u^k + \lambda^k\big),$$

with hinge-proximal updates for $u$ and scaled dual multipliers $\lambda$.
The total training cost after precomputing $\Phi \in \mathbb{R}^{n \times N}$ and a factorization of $\Phi^\top \Phi$ is $O(nN)$ per iteration, with storage $O(nN)$, substantially less than the $O(n^2)$ required for full kernel matrices. In practice, a moderate number of ADMM iterations often suffices (Zeng et al., 2019).
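The ADMM scheme can be sketched end-to-end on a toy problem; the update rules follow the split $u = Y\Phi c$ described above, while the synthetic data, parameter values, and function names are our illustrative choices:

```python
import numpy as np

def admm_hinge(Phi, y, rho=0.01, iters=300):
    """ADMM sketch for min_c (1/n) sum_i max(0, 1 - y_i (Phi c)_i),
    using the split u = Y Phi c with scaled dual multipliers lam."""
    n, N = Phi.shape
    A = y[:, None] * Phi                       # A = Y Phi
    AtA = A.T @ A + 1e-8 * np.eye(N)           # tiny ridge for numerical stability
    c, u, lam = np.zeros(N), np.zeros(n), np.zeros(n)
    s = 1.0 / (n * rho)                        # hinge proximal step size
    for _ in range(iters):
        c = np.linalg.solve(AtA, A.T @ (u + lam))        # closed-form c-update
        v = A @ c - lam
        u = np.where(v >= 1.0, v,
                     np.where(v <= 1.0 - s, v + s, 1.0)) # prox of the hinge loss
        lam += u - A @ c                       # scaled dual update
    return c

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)  # XOR-style labels: needs degree 2
centers = rng.standard_normal((6, 2))           # N = C(2+2, 2) = 6 centers
Phi = (1.0 + X @ centers.T) ** 2                # minimal feature map
c = admm_hinge(Phi, y)
acc = np.mean(np.sign(Phi @ c) == y)            # training accuracy
```

Only the $n \times N$ matrix `Phi` and the $N \times N$ factorization are ever stored, matching the $O(nN)$ cost discussed above.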
5. Statistical and Approximation Guarantees
Proposition 1 in (Zeng et al., 2019) guarantees that a random center set of size $N = \binom{d+m}{m}$ almost surely spans the entire polynomial space. The ADMM scheme converges to a global optimum at an $O(1/T)$ rate in the iteration count $T$, and under Tsybakov noise conditions with a geometric noise exponent the excess risk decays at a polynomial rate in the sample size, approaching the optimal rate as the noise exponent grows. Notably, explicit RKHS-norm regularization is unnecessary—the feature-space capacity is fully controlled by the finite dimension $N$.
For randomized Maclaurin features, the Hoeffding-type error bound is

$$\Pr\Big(\big|\langle \phi(x), \phi(y) \rangle - k(x, y)\big| \ge \varepsilon\Big) \;\le\; 2 \exp\!\Big(\!-\frac{D \varepsilon^2}{2 M^2}\Big),$$

with the per-feature products bounded in magnitude by $M$ via the pre-scaling transformation. To ensure approximation error at most $\varepsilon$ with confidence $1 - \delta$,

$$D \;\ge\; \frac{2 M^2}{\varepsilon^2} \log\frac{2}{\delta}$$

suffices (Guo et al., 18 May 2025).
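Turning the Hoeffding bound into a feature budget is a one-liner; the helper name is ours:

```python
from math import ceil, log

def features_needed(eps: float, delta: float, M: float) -> int:
    """D >= 2 M^2 / eps^2 * log(2 / delta) features suffice so that
    |<phi(x), phi(y)> - k(x, y)| <= eps with probability >= 1 - delta,
    assuming each per-feature product is bounded in [-M, M]."""
    return ceil(2.0 * M**2 / eps**2 * log(2.0 / delta))

print(features_needed(0.1, 0.01, 1.0))  # -> 1060
```

Note the logarithmic dependence on the confidence level: tightening $\delta$ is cheap, while halving $\varepsilon$ quadruples the feature count.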
6. Practical Guidelines and Empirical Performance
For the deterministic approach, the two critical parameters are the polynomial degree $m$ (tuned via cross-validation) and the fundamental system size $N = \binom{d+m}{m}$ (fixed by kernel degree and input dimension) (Zeng et al., 2019). In random-feature methods, the recipe involves Maclaurin expansion, degree sampling, vector randomization, unit-norm pre-scaling, and post-scaling with learned parameters, with the number of features $D$ selected per target error tolerance and input norm bounds (Guo et al., 18 May 2025).
Empirical evaluations confirm that these minimal bases can yield test accuracies within tenths of a percent of full kernel SVMs on benchmarks, while running orders of magnitude faster and consuming drastically less memory. In large-scale scenarios (e.g., SUSY, HIGGS, MNIST), minimal polynomial basis schemes attain strong metrics (e.g., AUC ≈ 0.876 in under ten minutes), outperforming Nyström or random-Fourier approaches at the same feature dimension (Zeng et al., 2019, Guo et al., 18 May 2025).
7. Comparison with Other Feature Mappings and Approximations
A minimal polynomial-kernel basis as constructed above guarantees exact coverage of the native polynomial space, unlike random-Fourier or Nyström approximations, which trade exactness for looser control over expressiveness and regularization. In particular, random-Fourier expansions are tied to shift-invariant kernels under Bochner’s theorem, and typically require bandwidth tuning and regularization. By contrast, minimal polynomial feature mappings—deterministic via fundamental systems or randomized via Schoenberg-certified Maclaurin expansions—ensure the representational requirements are fully met, with provable approximation error bounds and straightforward parameterization by kernel degree and input dimension. This suggests strong applicability for scalable learning and kernelized modeling in modern large-scale and sequential data regimes (Zeng et al., 2019, Guo et al., 18 May 2025).