
One-Class SVM: Smooth CVaR Optimization

Updated 4 December 2025
  • The paper introduces a novel OC-SVM framework that integrates smooth CVaR surrogates with signature-based embeddings to achieve tractable risk calibration.
  • The methodology uses shuffle-product identities to derive closed-form polynomial surrogates, leading to explicit error bounds and enhanced hypothesis testing.
  • Empirical evaluations in anomalous diffusion and RNA modification detection demonstrate improved type I/II error control and increased detection power over traditional methods.

One-class Support Vector Machine (OC-SVM) algorithms optimising smooth Conditional Value-at-Risk (CVaR) objectives constitute a significant advance in novelty detection within path spaces, connecting sequential data analysis, statistical learning, and probability in function spaces. This class of algorithms exploits signature-based feature embeddings and the shuffle-product structure, enabling closed-form polynomial surrogates for risk-sensitive test statistics and new theoretical guarantees for error control and statistical power in hypothesis testing settings (Gasteratos et al., 2 Dec 2025).

1. Signature-based Features and Smooth CVaR Surrogates

Let $X \sim \mu$ denote a path whose signature $S_N(X)$ is taken up to truncation level $N$. To approximate the positive part $[u]^+$ on $[-K, K]$, a polynomial $Q_n(u) = \sum_{i=0}^n a_i u^i$ is introduced. The smooth CVaR surrogate is then defined by

f^n_\alpha(\rho) = \rho + \frac{1}{1-\alpha}\, E_\mu\left[ Q_n(\langle w, S_N(X) \rangle - \rho) \right], \quad \rho \in [-K, K]

Employing the shuffle-product identity, $(\langle w, S \rangle)^i = \langle w^{\shuffle i}, S \rangle$, Theorem 3.1 shows the surrogate may be rewritten as

$E_\mu\left[ Q_n(\langle w, S(X) \rangle - \rho) \right] = \langle Q_n^{\shuffle}(w - \rho 1), E_\mu[S(X)] \rangle$

where $Q_n^{\shuffle}(\ell) = \sum_{i=0}^n a_i \ell^{\shuffle i} \in (T^{nN}(\mathbb{R}^d))^*$. The result is an explicit polynomial in $\rho$ whose coefficients depend only on the expected signature $E_\mu[S(X)]$ and shuffle-powers of $w$. This surrogate admits closed-form computation, substantially improving tractability for high-dimensional path data.
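The shuffle identity can be checked numerically. The sketch below (an illustration, not the paper's code; all helper names are ours) builds the truncated signature of a two-segment path in $\mathbb{R}^2$ via Chen's identity and verifies $\langle w, S \rangle^2 = \langle w^{\shuffle 2}, S \rangle$ for a functional $w$ supported on levels 1 and 2:

```python
from itertools import product
from math import factorial

def seg_sig(v, depth):
    """Truncated signature of one linear segment with increment v:
    the level-k term is the k-th tensor power of v divided by k!."""
    d = len(v)
    sig = {(): 1.0}
    for k in range(1, depth + 1):
        for word in product(range(d), repeat=k):
            c = 1.0
            for i in word:
                c *= v[i]
            sig[word] = c / factorial(k)
    return sig

def chen(s1, s2, depth):
    """Chen's identity: the signature of a concatenated path is the
    truncated tensor product of the segment signatures."""
    out = {}
    for w1, c1 in s1.items():
        for w2, c2 in s2.items():
            if len(w1) + len(w2) <= depth:
                out[w1 + w2] = out.get(w1 + w2, 0.0) + c1 * c2
    return out

def shuffles(w1, w2):
    """All shuffles of two words, with multiplicities."""
    if not w1:
        return {w2: 1}
    if not w2:
        return {w1: 1}
    out = {}
    for w, m in shuffles(w1[:-1], w2).items():
        out[w + w1[-1:]] = out.get(w + w1[-1:], 0) + m
    for w, m in shuffles(w1, w2[:-1]).items():
        out[w + w2[-1:]] = out.get(w + w2[-1:], 0) + m
    return out

def shuffle_product(u, v):
    """Shuffle product of two linear functionals (dicts word -> coefficient)."""
    out = {}
    for w1, c1 in u.items():
        for w2, c2 in v.items():
            for w, m in shuffles(w1, w2).items():
                out[w] = out.get(w, 0.0) + m * c1 * c2
    return out

def pair(u, sig):
    """Dual pairing <u, S>."""
    return sum(c * sig.get(w, 0.0) for w, c in u.items())

# two-segment path in R^2, signature truncated at level 4
S = chen(seg_sig((1.0, 0.5), 4), seg_sig((-0.3, 1.0), 4), 4)
w = {(0,): 1.0, (1,): -2.0, (0, 1): 0.5}   # functional on levels 1-2
lhs = pair(w, S) ** 2
rhs = pair(shuffle_product(w, w), S)       # <w, S>^2 = <w shuffle w, S>
```

Truncation at level 4 suffices here because the shuffle of two words of length at most 2 has length at most 4, so both pairings use only exactly computed signature coefficients.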

2. OC-SVM Formulations and Optimisation Problems

The OC-SVM framework is captured as minimising a regularised CVaR of negative scoring functionals. Given a feature map, typically the truncated signature $S_N(X)$ or its infinite-level version, the population-level objective reads

\min_{w,\,\rho \in [-K,K]}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{1-\alpha}\, E_\mu\!\left[ \big( \langle w, S_N(X) \rangle - \rho \big)^+ \right]

Replacing the positive part $[\cdot]^+$ by the smooth surrogate $Q_n$ yields the smooth-CVaR OC-SVM problem,

\min_{w,\,\rho \in [-K,K]}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{1-\alpha}\, E_\mu\!\left[ Q_n\big( \langle w, S_N(X) \rangle - \rho \big) \right]

For empirical OC-SVM with finite samples $x_1, \dots, x_m$, the unconstrained primal is

\min_{w,\,\rho}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{(1-\alpha)m} \sum_{j=1}^m \big( \langle w, S_N(x_j) \rangle - \rho \big)^+

A constrained quadratic program variant introduces slack variables $\xi_j \ge 0$:

\min_{w,\,\rho,\,\xi}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{(1-\alpha)m} \sum_{j=1}^m \xi_j \quad \text{s.t.} \quad \xi_j \ge \langle w, S_N(x_j) \rangle - \rho,\ \ \xi_j \ge 0
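To make the CVaR mechanics concrete, the following toy sketch (our illustration; the Gaussian scores, grid search, and least-squares polynomial are stand-ins for the paper's construction) evaluates the Rockafellar–Uryasev objective $\rho + E[(Z - \rho)^+]/(1-\alpha)$ on a grid and compares the exact hinge with a degree-8 polynomial surrogate:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
z = rng.normal(size=20_000)        # stand-in for the scores <w, S_N(x_j)>

# Rockafellar-Uryasev form: CVaR_alpha(z) = min_rho rho + E[(z - rho)^+]/(1-alpha)
rhos = np.linspace(-1.0, 4.0, 501)
hinge = np.maximum(z[:, None] - rhos[None, :], 0.0).mean(axis=0)
f = rhos + hinge / (1 - alpha)
cvar = f.min()                     # the minimum value recovers CVaR_alpha
var = rhos[f.argmin()]             # the minimiser approximates the alpha-quantile

# Degree-8 polynomial surrogate for [u]^+ on [-K, K]: a least-squares fit,
# used here as a generic stand-in for Q_n
K = 10.0
u = np.linspace(-K, K, 4001)
coeffs = np.polyfit(u, np.maximum(u, 0.0), deg=8)
f_smooth = rhos + np.polyval(coeffs, z[:, None] - rhos[None, :]).mean(axis=0) / (1 - alpha)
```

The exact hinge objective recovers the empirical CVaR (the mean of the top $(1-\alpha)$ fraction of scores), while the polynomial version illustrates the smoothing error a surrogate $Q_n$ introduces.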

3. Dual Formulation and Signature Kernels

The dual form of the constrained OC-SVM is

\min_{\beta \in \mathbb{R}^m}\ \frac{1}{2}\, \beta^\top K \beta \quad \text{s.t.} \quad 0 \le \beta_j \le \frac{1}{(1-\alpha)m}, \quad \sum_{j=1}^m \beta_j = 1

for the kernel matrix $K = (K_{jl})$. With the truncated signature as feature map, the signature kernel is

K_{jl} = \langle S_N(x_j), S_N(x_l) \rangle_{T^N(\mathbb{R}^d)}

Given a dual solution $\beta^\star$, the primal vector is supported on the training signatures, $w^\star = \sum_{j=1}^m \beta^\star_j S_N(x_j)$, and the test score for a new path $x$ is $\langle w^\star, S_N(x) \rangle$, compared against the learned level $\rho^\star$. In the smooth-CVaR population version, the expected signature $E_\mu[S(X)]$ enters via the closed-form surrogate, replacing empirical averages.
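A quick numerical check that the truncated signature kernel yields a valid Gram matrix (a toy sketch with our own helper names; depth, path count, and dimensions are arbitrary):

```python
import numpy as np
from itertools import product
from math import factorial

def seg_sig(v, depth):
    # truncated signature of one linear segment: level-k term is v^(tensor k)/k!
    d = len(v)
    sig = {(): 1.0}
    for k in range(1, depth + 1):
        for word in product(range(d), repeat=k):
            c = 1.0
            for i in word:
                c *= v[i]
            sig[word] = c / factorial(k)
    return sig

def chen(s1, s2, depth):
    # Chen's identity: concatenation = truncated tensor product of signatures
    out = {}
    for w1, c1 in s1.items():
        for w2, c2 in s2.items():
            if len(w1) + len(w2) <= depth:
                out[w1 + w2] = out.get(w1 + w2, 0.0) + c1 * c2
    return out

def path_sig(increments, depth):
    # signature of a piecewise-linear path given its segment increments
    sig = {(): 1.0}
    for v in increments:
        sig = chen(sig, seg_sig(tuple(v), depth), depth)
    return sig

def sig_kernel(s1, s2):
    # truncated signature kernel: inner product of coefficient vectors
    return sum(c * s2.get(w, 0.0) for w, c in s1.items())

rng = np.random.default_rng(1)
paths = [rng.normal(size=(3, 2)) for _ in range(6)]   # 6 piecewise-linear paths in R^2
sigs = [path_sig(p, 3) for p in paths]                # truncation level N = 3
gram = np.array([[sig_kernel(a, b) for b in sigs] for a in sigs])
```

Since each entry is an explicit inner product of feature vectors, the Gram matrix is symmetric positive semidefinite, as the dual QP requires.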

4. Theoretical Error Bounds: Type I and Power

For the signature-based test statistic, the following bounds are established:

  • Type I Error (Theorem 3.4): If the null law $\mu$ obeys a transportation-cost inequality, a class that includes Gaussian measures and laws of RDE solutions, then the null statistic satisfies an explicit Weibull-type tail bound, with constants governed by the deviation scale of $\mu$. Inverting this tail bound provides a quantile estimate at any nominal level, and hence super-uniform p-values: rejecting when the p-value falls below the nominal level controls type I error at that level.

  • Type II Error (Power, Theorem 3.3): For alternative laws $\nu$ with finite first moment, the detection power is bounded below by an explicit function of the relative entropy $H(\nu \mid \mu)$. Thus, finite relative entropy ensures nontrivial lower bounds on power.
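The super-uniformity mechanism can be illustrated generically (a sketch with stand-in choices, not the paper's constants: null scores are taken as $|N(0,1)|$ and the classical Gaussian tail bound $P(|Z| > t) \le e^{-t^2/2}$ plays the role of the Weibull-type bound):

```python
import numpy as np

rng = np.random.default_rng(0)

# Null statistics: |N(0,1)| as a generic stand-in for the test statistic
t_null = np.abs(rng.normal(size=100_000))

# Any valid tail bound G(t) >= P(T > t) turns the statistic into a
# super-uniform p-value p = G(T), i.e. P(p <= u) <= u for every u.
p = np.minimum(1.0, np.exp(-t_null ** 2 / 2))

# Empirical false-positive rate when rejecting at p <= u
fpr = {u: (p <= u).mean() for u in (0.01, 0.05, 0.10)}
```

Because the bound is conservative, the realised false-positive rate sits below the nominal level at every threshold, which is exactly the behaviour reported for the Theorem 3.4 calibration.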

5. Algorithmic Procedure and Practical Considerations

Population-level smooth-CVaR OC-SVM is implemented via the following high-level steps:

  1. Empirical expected signature: estimate $E_\mu[S(X)]$ by averaging truncated signatures over the training paths.
  2. Surrogate objective: assemble the closed-form polynomial surrogate via the shuffle identity (Theorem 3.1).
  3. Joint optimisation: minimise the surrogate over $(w, \rho)$.
    • Solved by alternation or explicit polynomial root-finding for small degree $n$.
  4. Test statistic: evaluate $\langle w, S_N(x) \rangle$ for a new path $x$.
  5. Hypothesis rejection: reject the null $x \sim \mu$ when the statistic's calibrated p-value falls below the nominal level.

For sample-based OC-SVM, standard primal/dual QP solvers with signature kernels are used (e.g., LIBSVM, ThunderSVM). At test time, the score $\langle w, S_N(x) \rangle$ is compared to the learned bias $\rho$.
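A minimal sketch of the sample-based pipeline using scikit-learn's LIBSVM-backed `OneClassSVM` with a precomputed kernel. The RBF Gram on toy 2-D features is our stand-in; in the signature setting the Gram matrix would be $K_{jl} = \langle S_N(x_j), S_N(x_l) \rangle$:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Toy stand-in features; in practice each row would be a path's signature
X_train = rng.normal(scale=0.5, size=(200, 2))
X_test = np.array([[0.0, 0.0],    # typical point
                   [8.0, 8.0]])   # clear novelty

def rbf_gram(A, B):
    # placeholder kernel; swap in the signature-kernel Gram here
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / 2.0)

# LIBSVM-backed OC-SVM on a precomputed kernel, as in the sample-based setting
clf = OneClassSVM(kernel="precomputed", nu=0.1)
clf.fit(rbf_gram(X_train, X_train))

# decision_function plays the role of <w, S_N(x)> - rho: negative flags novelty
scores = clf.decision_function(rbf_gram(X_test, X_train))
```

The `precomputed` option is what allows any PSD kernel, including signature kernels, to be plugged into a standard QP solver unchanged.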

6. Empirical Evaluation: Diffusion and Molecular Biology

Anomalous Diffusion

  • Setup: Binary discrimination between standard Brownian motion and a “spiked-BM” alternative whose paths carry an added spike perturbation of controlled amplitude.
  • Statistic: Signature-based distance to the null expected signature; also a linear form in the signature features.
  • Results:
    • AUROC vs spike amplitude: monotonic increase, with no sharp phase transition.
    • Type I and II control: Empirical p-values give marginal FDR close to the nominal level but high conditional variability. The Weibull tail-bound of Theorem 3.4 yields super-uniform p-values with tighter FDR and FPR control.
    • Comparison: The signature-based distance outperforms TAMSD and is competitive with kernelised OC-SVM.

RNA Modification Detection

  • Data: Synthetic 100-nt oligos carrying three modification types (including inosine and m5C) at fixed positions; Nanopore direct RNA reads (Leger et al. 2021).
  • Preprocessing: Dorado basecalling, Uncalled4 event alignment, per-base segmentation.
  • Methods:
    • OC-SVM on signature features (time-augmented with invisibility reset), trained on unmodified reads per site, with p-values calibrated on held-out unmodified reads.
    • OC-SVM on standard 2D features (mean current and dwell time).
  • Results: At the nominal BH–FDR level, signature OC-SVM yields substantially higher recall (power) for all modification types, with type I error controlled at the nominal level.

7. Connections, Scope, and Implications

These developments bridge hypothesis testing, path signatures (Lyons et al.), transportation-cost inequalities (Gasteratos and Jacquier 2023), and robust machine learning. The use of smooth CVaR surrogates via shuffle-product identities establishes new analytic techniques for risk calibration and empirical p-value calculation. Non-asymptotic bounds on error rates generalise beyond Gaussian settings to laws of rough differential equation solutions, supporting broader applications in anomalous diffusion analysis and molecular biology. A plausible implication is further cross-fertilisation with time-series anomaly detection and functional data analysis, leveraging closed-form population objectives and signature kernel methods.

The principal contribution is the integration of population-level risk surrogates, shuffle-product algebra, and theoretical guarantees for novelty detection (Gasteratos et al., 2 Dec 2025). This framework enables more refined control of type I and type II errors and supports robust calibration for high-dimensional non-Euclidean data spaces.
