Minimal Polynomial-Kernel Feature Basis
- Minimal polynomial-kernel feature basis is a compact set of feature vectors that exactly spans the native Hilbert space of a fixed-degree polynomial kernel.
- Deterministic and randomized constructions, via fundamental systems and Maclaurin expansions, ensure full kernel expressiveness with minimal computational and storage overhead.
- The methodologies offer strong theoretical guarantees and have demonstrated empirical efficiency in scalable kernel classification and kernelized attention applications.
A minimal polynomial-kernel feature basis is a compact and theoretically optimal set of feature vectors that exactly spans the native Hilbert space associated with a polynomial kernel of fixed degree. This concept underpins recent advances in both scalable kernel classification and kernelized attention, ensuring that the full expressive power of the polynomial kernel is available with drastically reduced storage and computational requirements, achieved through careful basis selection or randomization. Key developments include deterministic constructions via fundamental systems and random Maclaurin features certified by harmonic-analysis results such as Schoenberg's theorem.
1. Polynomial Kernels and Their Native Feature Spaces
Given $d$-dimensional inputs $x, y \in \mathbb{R}^d$ and a fixed integer $m \ge 1$, the $m$-th degree polynomial kernel is defined by

$$K_m(x, y) = \big(1 + \langle x, y \rangle\big)^m,$$

where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean dot product. The native reproducing-kernel Hilbert space (RKHS) for $K_m$ is the set of real polynomials of total degree at most $m$:

$$\mathcal{H}_m = \Pi_m(\mathbb{R}^d) = \operatorname{span}\{\, x^\alpha : |\alpha| \le m \,\}.$$

The dimension of $\Pi_m(\mathbb{R}^d)$ is

$$N = \binom{d + m}{m},$$

which is finite for any $d, m$ (Zeng et al., 2019).
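As a quick sanity check, the dimension formula can be evaluated directly; this is a minimal sketch and the function name is ours:

```python
from math import comb

def poly_space_dim(d: int, m: int) -> int:
    """Dimension N = C(d + m, m) of the space of real polynomials
    of total degree <= m in d variables."""
    return comb(d + m, m)

# e.g. a quadratic kernel on 10-dimensional inputs
print(poly_space_dim(10, 2))  # -> 66
```

Even for moderate $d$ and $m$ this dimension stays far below the $n^2$ entries of a full kernel matrix on $n$ samples.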
2. Deterministic Construction: Fundamental Systems and Feature Expansions
A minimal feature basis for polynomial-kernel classification comprises $N$ carefully chosen “center points” $x_1, \dots, x_N \in \mathbb{R}^d$, such that the kernel translates

$$K_m(\cdot, x_j), \qquad j = 1, \dots, N,$$

span the native space $\Pi_m(\mathbb{R}^d)$. This set is called a $\Pi_m$–fundamental system if the corresponding Gram matrix

$$G = \big[K_m(x_i, x_j)\big]_{i,j=1}^{N}$$

is nonsingular, equivalently $\det G \neq 0$.

A classical result asserts that $N$ points drawn i.i.d. from any absolutely-continuous distribution almost surely form a fundamental system—so in practice, selecting $N$ distinct samples suffices (Zeng et al., 2019). Any $f \in \Pi_m(\mathbb{R}^d)$ then admits a unique expansion

$$f(x) = \sum_{j=1}^{N} c_j\, K_m(x, x_j)$$

with coefficients $c = (c_1, \dots, c_N)^\top \in \mathbb{R}^N$. The induced feature map is

$$\Phi(x) = \big(K_m(x, x_1), \dots, K_m(x, x_N)\big)^\top \in \mathbb{R}^N,$$

which is minimal by construction.
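The construction can be sketched in a few lines, assuming the inhomogeneous kernel $K_m(x, y) = (1 + \langle x, y \rangle)^m$; the function and variable names are illustrative:

```python
import numpy as np
from math import comb

def poly_features(X, centers, m):
    """Feature map Phi(x) = (K_m(x, x_1), ..., K_m(x, x_N))
    for K_m(x, y) = (1 + <x, y>)^m and center points x_1, ..., x_N."""
    return (1.0 + X @ centers.T) ** m

rng = np.random.default_rng(0)
d, m = 3, 2
N = comb(d + m, m)                      # N = C(5, 2) = 10
centers = rng.standard_normal((N, d))   # i.i.d. absolutely-continuous draws:
                                        # almost surely a fundamental system
G = poly_features(centers, centers, m)  # N x N Gram matrix
assert np.linalg.matrix_rank(G) == N    # nonsingular <=> fundamental system
```

The rank check verifies numerically that the randomly drawn centers indeed form a fundamental system, as the almost-sure result predicts.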
3. Randomized Construction: Minimal Random-Maclaurin Feature Bases via Schoenberg's Theorem
For broader kernelized modeling—including efficient attention mechanisms—the minimal polynomial basis may be constructed via randomized Maclaurin feature maps in accordance with Schoenberg’s theorem. Any dot-product kernel $k(x, y) = f(\langle x, y \rangle)$ with nonnegative Maclaurin coefficients $a_n \ge 0$, such as $f(t) = (1 + t)^m = \sum_{n=0}^{m} \binom{m}{n} t^n$, can be expanded as

$$k(x, y) = \sum_{n \ge 0} a_n \langle x, y \rangle^n.$$
A minimal random-feature approximation is achieved by drawing $D$ features as follows (Guo et al., 18 May 2025):
- For each feature index $i = 1, \dots, D$, sample a degree $n_i$ with probability $p_n \propto a_n$.
- For $j = 1, \dots, n_i$, draw independent Rademacher vectors $w_{i,j} \in \{-1, +1\}^d$.
- Form the raw feature $z_i(x) = \sqrt{a_{n_i}/p_{n_i}} \prod_{j=1}^{n_i} \langle w_{i,j}, x \rangle$.
- The map $\phi(x) = D^{-1/2}\big(z_1(x), \dots, z_D(x)\big)^\top$ satisfies $\mathbb{E}\big[\langle \phi(x), \phi(y) \rangle\big] = k(x, y)$.
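A runnable sketch of this sampling scheme for the degree-$m$ kernel $(1 + \langle x, y \rangle)^m$, whose Maclaurin coefficients are binomial coefficients; the function name and the empirical unbiasedness check are ours:

```python
import numpy as np
from math import comb

def random_maclaurin(X, m, D, rng):
    """Unbiased random features for k(x, y) = (1 + <x, y>)^m,
    whose Maclaurin coefficients are a_n = C(m, n), n = 0..m."""
    n_pts, d = X.shape
    a = np.array([comb(m, n) for n in range(m + 1)], dtype=float)
    p = a / a.sum()                        # sample degree n with prob a_n / sum(a)
    Z = np.empty((n_pts, D))
    for i in range(D):
        n = rng.choice(m + 1, p=p)
        W = rng.choice([-1.0, 1.0], size=(n, d))   # n Rademacher vectors
        # sqrt(a_n / p_n) = sqrt(sum(a)) is constant under this sampling law
        Z[:, i] = np.sqrt(a.sum()) * np.prod(W @ X.T, axis=0)
    return Z / np.sqrt(D)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) / 3.0      # small norms keep feature variance low
K_exact = (1.0 + X @ X.T) ** 3
Z = random_maclaurin(X, m=3, D=40000, rng=rng)
K_hat = Z @ Z.T                            # approximates K_exact in expectation
```

Unbiasedness follows because $\mathbb{E}[\langle w, x \rangle \langle w, y \rangle] = \langle x, y \rangle$ for a Rademacher vector $w$, so each sampled degree contributes $a_n \langle x, y \rangle^n$ on average.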
A two-stage regularization (“pre–post SBN”) wraps this mapping, where batch normalization and projection onto the unit sphere enforce compatibility with Schoenberg's theorem (which assumes unit-norm inputs, $\|x\|_2 = 1$), and subsequent learned scaling parameters restore output magnitudes for downstream computation (Guo et al., 18 May 2025).
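One plausible reading of this pre–post normalization can be sketched as follows; this is an illustrative interpretation, not the authors' exact procedure, and the function name and the learned scalar `gamma` are our assumptions:

```python
import numpy as np

def pre_post_sbn(phi, X, gamma=1.0, eps=1e-12):
    """Illustrative sketch of pre-post normalization: standardize features,
    project inputs onto the unit sphere (the regime Schoenberg's theorem
    assumes), then restore magnitude with a learned post-scaling gamma."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)            # batch-norm step
    Xs = Xc / (np.linalg.norm(Xc, axis=1, keepdims=True) + eps)  # unit sphere
    return gamma * phi(Xs)                                       # learned scaling

# usage with an arbitrary feature map (identity here, for illustration)
Z = pre_post_sbn(lambda A: A, np.random.default_rng(0).standard_normal((4, 3)))
```

With `gamma = 1` and the identity map, every output row lies on the unit sphere, which is exactly the precondition the theorem requires.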
4. Optimization and Computational Properties
In classification, the margin-free empirical risk

$$\min_{f \in \Pi_m(\mathbb{R}^d)} \; \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i f(x_i)\big)$$

is minimized over $\Pi_m(\mathbb{R}^d)$, reducible via the expansion $f = \sum_j c_j K_m(\cdot, x_j)$ to the finite-dimensional problem

$$\min_{c \in \mathbb{R}^N} \; \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - y_i (\Phi c)_i\big), \qquad \Phi_{ij} = K_m(x_i, x_j)$$

(Zeng et al., 2019). The ADMM algorithm solves this with provable global convergence for any penalty parameter $\rho > 0$: introducing the split $u = Y \Phi c$ with $Y = \operatorname{diag}(y_1, \dots, y_n)$, it applies closed-form updates exploiting the Gram structure,

$$c^{k+1} = \big(\Phi^\top \Phi\big)^{-1} \Phi^\top Y \big(u^k + \lambda^k\big),$$

with hinge-proximal updates for $u$ and scaled dual multipliers $\lambda$.
The total training cost after precomputing $\Phi \in \mathbb{R}^{n \times N}$ and a factorization of $\Phi^\top \Phi$ is $O(nN)$ per iteration, with storage $O(nN)$, substantially less than the $O(n^2)$ required for full kernel matrices. In practice, a moderate number of ADMM iterations often suffices (Zeng et al., 2019).
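The ADMM scheme can be sketched end-to-end on a toy problem; the update rules follow the split $u = Y\Phi c$ described above, while the synthetic data, parameter values, and function names are our illustrative choices:

```python
import numpy as np

def admm_hinge(Phi, y, rho=0.01, iters=300):
    """ADMM sketch for min_c (1/n) sum_i max(0, 1 - y_i (Phi c)_i),
    using the split u = Y Phi c with scaled dual multipliers lam."""
    n, N = Phi.shape
    A = y[:, None] * Phi                       # A = Y Phi
    AtA = A.T @ A + 1e-8 * np.eye(N)           # tiny ridge for numerical stability
    c, u, lam = np.zeros(N), np.zeros(n), np.zeros(n)
    s = 1.0 / (n * rho)                        # hinge proximal step size
    for _ in range(iters):
        c = np.linalg.solve(AtA, A.T @ (u + lam))        # closed-form c-update
        v = A @ c - lam
        u = np.where(v >= 1.0, v,
                     np.where(v <= 1.0 - s, v + s, 1.0)) # prox of the hinge loss
        lam += u - A @ c                       # scaled dual update
    return c

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)  # XOR-style labels: needs degree 2
centers = rng.standard_normal((6, 2))           # N = C(2+2, 2) = 6 centers
Phi = (1.0 + X @ centers.T) ** 2                # minimal feature map
c = admm_hinge(Phi, y)
acc = np.mean(np.sign(Phi @ c) == y)            # training accuracy
```

Only the $n \times N$ matrix `Phi` and the $N \times N$ factorization are ever stored, matching the $O(nN)$ cost discussed above.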
5. Statistical and Approximation Guarantees
Proposition 1 in (Zeng et al., 2019) guarantees that a random center set of size $N = \binom{d+m}{m}$ almost surely spans the entire polynomial space. The ADMM scheme converges to a global optimum at an $O(1/T)$ rate in the iteration count $T$, and under Tsybakov noise conditions with a geometric noise exponent the excess risk decays at a polynomial rate in the sample size, approaching the optimal rate as the noise exponent grows. Notably, explicit RKHS-norm regularization is unnecessary—the feature-space capacity is fully controlled by the finite dimension $N$.
For randomized Maclaurin features, the Hoeffding-type error bound is

$$\Pr\Big(\big|\langle \phi(x), \phi(y) \rangle - k(x, y)\big| \ge \varepsilon\Big) \;\le\; 2 \exp\!\Big(\!-\frac{D \varepsilon^2}{2 M^2}\Big),$$

with the per-feature products bounded in magnitude by $M$ via the pre-scaling transformation. To ensure approximation error at most $\varepsilon$ with confidence $1 - \delta$,

$$D \;\ge\; \frac{2 M^2}{\varepsilon^2} \log\frac{2}{\delta}$$

suffices (Guo et al., 18 May 2025).
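Turning the Hoeffding bound into a feature budget is a one-liner; the helper name is ours:

```python
from math import ceil, log

def features_needed(eps: float, delta: float, M: float) -> int:
    """D >= 2 M^2 / eps^2 * log(2 / delta) features suffice so that
    |<phi(x), phi(y)> - k(x, y)| <= eps with probability >= 1 - delta,
    assuming each per-feature product is bounded in [-M, M]."""
    return ceil(2.0 * M**2 / eps**2 * log(2.0 / delta))

print(features_needed(0.1, 0.01, 1.0))  # -> 1060
```

Note the logarithmic dependence on the confidence level: tightening $\delta$ is cheap, while halving $\varepsilon$ quadruples the feature count.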
6. Practical Guidelines and Empirical Performance
For the deterministic approach, the two critical parameters are the polynomial degree $m$ (tuned via cross-validation) and the fundamental system size $N = \binom{d+m}{m}$ (fixed by kernel degree and input dimension) (Zeng et al., 2019). In random-feature methods, the recipe involves Maclaurin expansion, degree sampling, vector randomization, unit-norm pre-scaling, and post-scaling with learned parameters, with the number of features $D$ selected per target error tolerance and input norm bounds (Guo et al., 18 May 2025).
Empirical evaluations confirm that these minimal bases can yield test accuracies within tenths of a percent of full kernel SVMs on benchmarks, while running orders of magnitude faster and consuming drastically less memory. In large-scale scenarios (e.g., SUSY, HIGGS, MNIST), minimal polynomial basis schemes attain strong metrics (e.g., AUC ≈ 0.876 in under ten minutes), outperforming Nyström or random-Fourier approaches at the same feature dimension (Zeng et al., 2019, Guo et al., 18 May 2025).
7. Comparison with Other Feature Mappings and Approximations
A minimal polynomial-kernel basis as constructed above guarantees exact coverage of the native polynomial space, unlike random-Fourier or Nyström approximations, which trade exactness for looser control over expressiveness and regularization. In particular, random-Fourier expansions are tied to shift-invariant kernels under Bochner’s theorem, and typically require bandwidth tuning and regularization. By contrast, minimal polynomial feature mappings—deterministic via fundamental systems or randomized via Schoenberg-certified Maclaurin expansions—ensure the representational requirements are fully met, with provable approximation error bounds and straightforward parameterization by kernel degree and input dimension. This suggests strong applicability for scalable learning and kernelized modeling in modern large-scale and sequential data regimes (Zeng et al., 2019, Guo et al., 18 May 2025).