
Minimal Polynomial-Kernel Feature Basis

Updated 4 February 2026
  • Minimal polynomial-kernel feature basis is a compact set of feature vectors that exactly spans the native Hilbert space of a fixed-degree polynomial kernel.
  • Deterministic and randomized constructions, via fundamental systems and Maclaurin expansions, ensure full kernel expressiveness with minimal computational and storage overhead.
  • The methodologies offer strong theoretical guarantees and have demonstrated empirical efficiency in scalable kernel classification and kernelized attention applications.

A minimal polynomial-kernel feature basis is a compact, theoretically optimal set of feature vectors that exactly spans the native Hilbert space associated with a polynomial kernel of fixed degree. The concept underpins recent advances in both scalable kernel classification and kernelized attention: through careful basis selection or randomization, the full expressive power of the polynomial kernel becomes available at drastically reduced storage and computational cost. Key developments include deterministic constructions via fundamental systems and random Maclaurin features certified by harmonic-analysis results such as Schoenberg's theorem.

1. Polynomial Kernels and Their Native Feature Spaces

Given $d$-dimensional inputs $x, x' \in \mathbb{R}^d$ and a fixed integer $s \geq 1$, the $s$-th degree polynomial kernel is defined by

$$K_s(x, x') = (1 + x \cdot x')^s,$$

where $x \cdot x'$ denotes the standard Euclidean dot product. The native reproducing-kernel Hilbert space (RKHS) for $K_s$ is the set of real polynomials of total degree at most $s$:

$$\mathcal{H}_s = \{p(x) : p \text{ is a polynomial of degree} \leq s\}.$$

The dimension of $\mathcal{H}_s$ is

$$n = \dim \mathcal{H}_s = \binom{s + d}{s},$$

which is finite for any $s, d$ (Zeng et al., 2019).
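
Since $n$ grows combinatorially in $s$ and $d$, it is worth computing before committing to a kernel degree. A one-line helper (the function name is ours):

```python
from math import comb

def native_space_dim(d: int, s: int) -> int:
    """Dimension of the space of d-variate polynomials of total degree <= s."""
    return comb(s + d, s)

# For d = 10 inputs and a cubic kernel (s = 3), the native space has
# dimension C(13, 3) = 286 -- tiny compared to typical sample sizes.
print(native_space_dim(10, 3))  # -> 286
```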

2. Deterministic Construction: Fundamental Systems and Feature Expansions

A minimal feature basis for polynomial-kernel classification comprises $n$ carefully chosen "center points" $\{\eta_j\}_{j=1}^n \subset \mathbb{R}^d$ such that the kernel translates

$$\{K_s(\eta_j, \cdot)\}_{j=1}^n = \{(1 + \eta_j \cdot x)^s\}_{j=1}^n$$

span the native space $\mathcal{H}_s$. This set is called a $K_s$-fundamental system if the corresponding Gram matrix

$$G_{jk} = (1 + \eta_j \cdot \eta_k)^s$$

is nonsingular, or equivalently $\dim \mathcal{H}_{\eta, n} = n$, where

$$\mathcal{H}_{\eta, n} := \Big\{\sum_{j=1}^n c_j (1 + \eta_j \cdot x)^s : c \in \mathbb{R}^n\Big\}.$$

A classical result asserts that $n$ points drawn i.i.d. from any absolutely continuous distribution almost surely form a fundamental system; in practice, selecting $n$ distinct random samples suffices (Zeng et al., 2019). Any $f \in \mathcal{H}_s$ then admits a unique expansion

$$f(x) = \sum_{j=1}^n u_j (1 + \eta_j \cdot x)^s,$$

with $u \in \mathbb{R}^n$. The induced feature map is

$$\Psi(x) = \big((1 + \eta_1 \cdot x)^s, \dots, (1 + \eta_n \cdot x)^s\big)^T \in \mathbb{R}^n,$$

which is minimal by construction.
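
A minimal sketch of this construction in NumPy, assuming random Gaussian centers (which, per the almost-sure result above, form a fundamental system); `minimal_feature_map` is a hypothetical helper name:

```python
import numpy as np
from math import comb

def minimal_feature_map(X, centers, s):
    """Psi(x)_j = (1 + eta_j . x)^s, stacked row-wise for each input x."""
    return (1.0 + X @ centers.T) ** s

rng = np.random.default_rng(0)
d, s = 2, 2
n = comb(s + d, s)                      # n = 6 basis functions

# Random Gaussian centers are almost surely a K_s-fundamental system.
centers = rng.standard_normal((n, d))
G = (1.0 + centers @ centers.T) ** s    # Gram matrix G_jk
assert np.linalg.matrix_rank(G) == n    # nonsingular => translates span H_s

X = rng.standard_normal((5, d))
Psi = minimal_feature_map(X, centers, s)
print(Psi.shape)                        # (5, 6): one n-dim feature row per input
```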

3. Randomized Construction: Minimal Random-Maclaurin Feature Bases via Schoenberg's Theorem

For broader kernelized modeling, including efficient attention mechanisms, the minimal polynomial basis may be constructed via randomized Maclaurin feature maps in accordance with Schoenberg's theorem. Any dot-product kernel with nonnegative Maclaurin coefficients, such as $K(x, y) = (x^T y + c)^d$, can be expanded as

$$K(x, y) = \sum_{n=0}^{d} a_n (x^T y)^n, \quad a_n = \binom{d}{n} c^{d-n}.$$

A minimal random-feature approximation is obtained by drawing $m$ features as follows (Guo et al., 18 May 2025):

  • For each feature index $i$, sample a degree $N \in \{0, \dots, d\}$ with probability $P(N = n) = a_n / A$, where $A = \sum_{k=0}^d a_k$.
  • For $N > 0$, draw $N$ independent Rademacher vectors $\omega_j \in \{\pm 1\}^d$.
  • Form the raw feature $\psi_i(x) = \sqrt{A / a_N} \prod_{j=1}^N \langle \omega_j, x \rangle$.
  • The map $\phi(x) = (1/\sqrt{m})(\psi_1(x), \dots, \psi_m(x))^T$ satisfies $\mathbb{E}[\phi(x)^T \phi(y)] = K(x, y)$.

A two-stage regularization ("pre-post SBN") wraps this mapping: batch normalization and projection onto the unit sphere enforce compatibility with Schoenberg's theorem (which assumes $\|x\| \leq 1$), and subsequent learned scaling parameters $(\gamma, \beta)$ restore output magnitudes for downstream computation (Guo et al., 18 May 2025).
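
A minimal sketch of the two stages, assuming the pre-stage is ordinary feature-wise batch normalization followed by exact projection onto the unit sphere and the post-stage is an elementwise affine rescaling (the paper's exact normalization details may differ):

```python
import numpy as np

def pre_sbn(X, eps=1e-8):
    """Pre-stage sketch: batch-normalize, then project rows onto the unit
    sphere so that ||x|| <= 1, as Schoenberg's theorem assumes."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
    norms = np.linalg.norm(Xc, axis=1, keepdims=True)
    return Xc / np.maximum(norms, eps)

def post_scale(Phi, gamma, beta):
    """Post-stage sketch: learned affine rescaling (gamma, beta) restores
    output magnitudes for downstream computation."""
    return gamma * Phi + beta
```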

4. Optimization and Computational Properties

In classification, the margin-free empirical risk

$$\frac{1}{m} \sum_{i=1}^m (1 - y_i f(x_i))_+$$

is minimized over $f \in \mathcal{H}_{\eta, n}$, which reduces to

$$\min_{u \in \mathbb{R}^n} \frac{1}{m} \sum_{i=1}^m \Big(1 - y_i \sum_{j=1}^n A_{ij} u_j\Big)_+,$$

where $A_{ij} = (1 + x_i \cdot \eta_j)^s$ (Zeng et al., 2019). An ADMM algorithm solves this with provable global convergence for any $\alpha, \beta > 0$, applying closed-form updates that exploit the Gram structure:

$$u^{k+1} = (\beta A^T A + \alpha I_n)^{-1}(\alpha u^k + \beta A^T v^k - A^T w^k),$$

with hinge-proximal updates for $v$ and scaled dual-multiplier updates for $w$.
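
A compact sketch of these updates, assuming the labels are folded into the design matrix as $B = \operatorname{diag}(y) A$ so the splitting constraint reads $v = Bu$ (the solver name and this folding are our choices, not notation from the paper):

```python
import numpy as np

def admm_hinge(A, y, alpha=1.0, beta=1.0, iters=500):
    """Proximal-ADMM sketch for min_u (1/m) sum_i (1 - y_i (A u)_i)_+ ."""
    m, n = A.shape
    B = y[:, None] * A                     # fold labels in: constraint v = B u
    u, v, w = np.zeros(n), np.zeros(m), np.zeros(m)
    M = np.linalg.inv(beta * B.T @ B + alpha * np.eye(n))   # precomputed once
    c = 1.0 / m                            # hinge weight per sample
    for _ in range(iters):
        u = M @ (alpha * u + beta * B.T @ v - B.T @ w)      # closed-form u-update
        z = B @ u + w / beta
        # hinge-proximal update: prox of (c/beta)(1 - v)_+, elementwise
        v = np.where(z >= 1.0, z, np.where(z <= 1.0 - c / beta, z + c / beta, 1.0))
        w = w + beta * (B @ u - v)                          # scaled dual update
    return u
```

On a toy linearly separable problem with two centers and $s = 1$, this drives the empirical hinge risk well below its value at $u = 0$.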

The total training cost, after precomputing $(\beta A^T A + \alpha I_n)^{-1}$, is

$$\mathcal{O}(mn^2 + n^3 + Tmn) \approx \mathcal{O}(mn(T + n)),$$

with storage $\mathcal{O}(mn + n^2)$, substantially less than the $\mathcal{O}(m^2)$ required for full kernel matrices. Typically $T \ll m$ iterations are needed; often $T \approx 5$ suffices (Zeng et al., 2019).

5. Statistical and Approximation Guarantees

Proposition 1 in (Zeng et al., 2019) guarantees that a random center set of size $n = \binom{s + d}{s}$ spans the entire polynomial space. The ADMM scheme converges to global optima at an $o(1/k)$ rate, and the statistical learning rate under Tsybakov conditions with geometric noise exponent $\alpha$ achieves

$$\mathcal{R}(\operatorname{sgn}(f_{D, n})) - \mathcal{R}(f_c) \leq C m^{-\alpha \theta^*} \log \tfrac{1}{\delta},$$

with $\theta^* = \frac{q + 1}{\alpha(q + 2 + pq/2) + d(q + 1)}$, attaining the optimal $(q+1)/(q+2)$ rate for large $\alpha$. Notably, explicit RKHS-norm regularization is unnecessary: the feature-space capacity is fully controlled by $n = \binom{s + d}{s}$.

For randomized Maclaurin features, a Hoeffding-type error bound holds:

$$P\big(|\phi(x)^T \phi(y) - K(x, y)| > \epsilon\big) \leq 2 \exp\Big(-\frac{m \epsilon^2}{2 R^{4d}}\Big),$$

with $R$ bounded via the pre-scaling transformation. To ensure approximation error at most $\epsilon$ with confidence $1 - \delta$,

$$m \geq \frac{2 R^{4d}}{\epsilon^2} \ln \frac{2}{\delta}$$

suffices (Guo et al., 18 May 2025).

6. Practical Guidelines and Empirical Performance

For the deterministic approach, the two critical parameters are the polynomial degree $s \in \{1, \ldots, 10\}$ (tuned via cross-validation) and the fundamental-system size $n = \binom{s + d}{s}$ (fixed by the kernel degree and input dimension) (Zeng et al., 2019). In random-feature methods, the recipe involves Maclaurin expansion, degree sampling, vector randomization, unit-norm pre-scaling, and post-scaling with $(\gamma, \beta)$, with the number of features $m$ selected according to the target error tolerance and input norm bounds (Guo et al., 18 May 2025).

Empirical evaluations confirm that these minimal bases can yield test accuracies within tenths of a percent of full kernel SVMs on benchmarks, while running orders of magnitude faster and consuming drastically less memory. In large-scale scenarios (e.g., SUSY, HIGGS, MNIST), minimal polynomial basis schemes attain strong metrics (e.g., AUC $\approx 0.876$ in under ten minutes with $n \approx 1500$), outperforming Nyström or random-Fourier approaches at the same feature dimension (Zeng et al., 2019, Guo et al., 18 May 2025).

7. Comparison with Other Feature Mappings and Approximations

A minimal polynomial-kernel basis as constructed above guarantees exact coverage of the native polynomial space, unlike random-Fourier or Nyström approximations, which trade $n \ll m$ features for looser control over expressiveness and regularization. In particular, random-Fourier expansions are tied to shift-invariant kernels via Bochner's theorem and typically require bandwidth tuning and regularization. By contrast, minimal polynomial feature mappings, whether deterministic via fundamental systems or randomized via Schoenberg-certified Maclaurin expansions, fully meet the representational requirements, with provable approximation-error bounds and straightforward parameterization by kernel degree and input dimension. This suggests strong applicability for scalable learning and kernelized modeling in modern large-scale and sequential data regimes (Zeng et al., 2019, Guo et al., 18 May 2025).

References (2)
