
Koo-Fu CLIP: FK-LDA for CLIP Embeddings

Updated 8 February 2026
  • Koo-Fu CLIP is a supervised adaptation protocol that reshapes high-dimensional CLIP embeddings using FK-LDA to enhance class discriminability.
  • It employs a closed-form two-stage pipeline with regularization and whitening to reduce dimensionality and suppress within-class variance.
  • Empirical results on ImageNet benchmarks demonstrate improved classification accuracy and efficient retrieval in large-scale supervised tasks.

Koo-Fu CLIP is a supervised adaptation protocol for vision-language representations, specifically targeting embeddings produced by models such as CLIP. It leverages Fukunaga–Koontz Linear Discriminant Analysis (FK-LDA) to reshape the geometry of high-dimensional embedding spaces, suppressing within-class variation and enhancing between-class discriminability via a closed-form linear projection. The approach achieves substantial dimensionality reduction and efficient adaptation for large-scale supervised classification and retrieval tasks, while remaining computationally lightweight and robust on datasets such as ImageNet-1K, ImageNet-14K, and ImageNet-21K (Suchanek et al., 1 Feb 2026).

1. Motivation and Problem Context

CLIP and similar vision-language models yield general-purpose, high-dimensional embeddings (often 768-D or higher) that are effective for zero-shot transfer but not tailored for supervised classification. The raw embedding space often contains excessive within-class variance and only limited separation between classes, impeding straightforward nearest-prototype classification and efficient use in downstream, label-rich scenarios. Adapting these representations for discriminative tasks therefore requires a process that can both compress the feature dimensions and amplify inter-class structure without expensive retraining (Suchanek et al., 1 Feb 2026).

Koo-Fu CLIP directly addresses this challenge: given labeled embedding data $x_i \in \mathbb{R}^D$ with labels $y_i \in \{1, \dots, K\}$, the objective is to learn a linear map into a lower-dimensional space $\mathbb{R}^L$, with $L \leq D$, that minimizes within-class variance and maximizes between-class separation.

2. Fukunaga–Koontz Linear Discriminant Analysis

Koo-Fu CLIP operationalizes FK-LDA for vision-language adaptation. FK-LDA extends classical Linear Discriminant Analysis (LDA) by introducing a whitening step that regularizes and numerically stabilizes the mapping process. The essential goal is to find a projection $W$ maximizing the Fisher criterion:

$$J(W) = \frac{\operatorname{tr}(W^\top S_b W)}{\operatorname{tr}(W^\top S_w W)}$$

where $S_w$ is the within-class scatter matrix and $S_b$ is the between-class scatter matrix. Unlike classical LDA, which solves the generalized eigenproblem $S_b v = \lambda S_w v$ (with rank limitations and instability if $S_w$ is ill-conditioned), FK-LDA first whitens $S_w$ and then diagonalizes $S_b$ in this whitened space (Suchanek et al., 1 Feb 2026).

Scatter Matrix Construction

  • Global mean: $\mu = \frac{1}{N} \sum_{i=1}^N x_i$
  • Class mean: $\mu_k = \frac{1}{N_k} \sum_{i: y_i = k} x_i$
  • Within-class scatter: $S_w = \sum_{k=1}^K \sum_{i: y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top$
  • Between-class scatter: $S_b = \sum_{k=1}^K N_k (\mu_k - \mu)(\mu_k - \mu)^\top$
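The scatter construction above can be sketched in NumPy (an illustrative helper, not the authors' reference implementation):

```python
import numpy as np

def scatter_matrices(X, y):
    """Compute between-class (S_b) and within-class (S_w) scatter matrices.

    X: (N, D) array of embeddings; y: (N,) integer class labels.
    """
    mu = X.mean(axis=0)                   # global mean
    D = X.shape[1]
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)            # class mean
        Xc = Xk - mu_k
        S_w += Xc.T @ Xc                  # within-class scatter
        d = (mu_k - mu)[:, None]
        S_b += len(Xk) * (d @ d.T)        # between-class scatter
    return S_b, S_w
```

A useful sanity check is the standard decomposition $S_t = S_b + S_w$, where $S_t$ is the total scatter about the global mean.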

3. Closed-Form Solution and Algorithmic Pipeline

The FK-LDA adaptation comprises a two-stage diagonalization pipeline:

Closed-Form Steps

  1. Regularization: Add a ridge term to ensure $S_w$ is invertible:

$$S_w' = S_w + \lambda I$$

  2. Whitening: Eigendecompose $S_w'$:

$$S_w' = V \Lambda V^\top, \quad Z = V \Lambda^{-1/2} V^\top$$

The whitening transform $Z$ maps $S_w'$ to the identity.

  3. Whitened Means and Between-Class Scatter: For each class $k$,

$$\Delta_k = \mu_k - \mu, \quad \delta_k = Z \Delta_k$$

$$S_b' = \sum_{k=1}^K N_k \delta_k \delta_k^\top = Z S_b Z$$

  4. Diagonalization of Between-Class Scatter: Eigendecompose $S_b'$:

$$S_b' = U \Gamma U^\top$$

$U_L$ denotes the matrix of the top $L$ eigenvectors.

  5. Final Projection: The transformation mapping $x \in \mathbb{R}^D$ to $y \in \mathbb{R}^L$ is:

$$y = T x, \quad \text{with} \quad T = U_L^\top Z$$

The columns of $W = Z U_L$ (equivalently, the rows of $T$) define discriminant directions that maximize inter-class spread while suppressing intra-class variation.

Summary of Steps

Step  Operation                               Output
 1    Compute means and scatters              $S_b$, $S_w$
 2    Regularize $S_w$                        $S_w'$
 3    Eigendecompose $S_w'$                   $V$, $\Lambda$
 4    Compute $Z = V \Lambda^{-1/2} V^\top$   whitening matrix $Z$
 5    Whiten means, build $S_b'$              $S_b'$
 6    Eigendecompose $S_b'$                   $U$, $\Gamma$
 7    Select top $L$ columns $U_L$            final projector $T = U_L^\top Z$
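The steps above can be condensed into one closed-form fit. The sketch below assumes a NumPy implementation; `fit_fk_lda` and its convention of scaling the ridge by the mean diagonal of $S_w$ are illustrative choices, not the paper's reference code:

```python
import numpy as np

def fit_fk_lda(X, y, L, lam=1e-2):
    """Closed-form FK-LDA projector. X: (N, D) embeddings, y: (N,) labels.
    Returns T of shape (L, D) so that a sample x maps to T @ x."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for k in np.unique(y):                             # step 1: means and scatters
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Xc = Xk - mu_k
        S_w += Xc.T @ Xc
        d = (mu_k - mu)[:, None]
        S_b += len(Xk) * (d @ d.T)
    ridge = lam * np.mean(np.diag(S_w))                # step 2: regularize S_w
    S_w_reg = S_w + ridge * np.eye(D)
    evals, V = np.linalg.eigh(S_w_reg)                 # step 3: eigendecompose S_w'
    Z = V @ np.diag(evals ** -0.5) @ V.T               # step 4: whitening matrix Z
    S_b_white = Z @ S_b @ Z                            # step 5: whitened S_b'
    _, U = np.linalg.eigh(S_b_white)                   # step 6: eigendecompose S_b'
    U_L = U[:, ::-1][:, :L]                            # step 7: top-L eigenvectors
    return U_L.T @ Z                                   # final projector T = U_L^T Z
```

By construction, $T S_w' T^\top = I_L$ in the projected space, so distances there are not dominated by intra-class directions.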

4. Geometric and Statistical Properties

The whitening transformation $Z$ spheres the within-class covariances, making all classes isotropic (identity covariance) in the transformed space. This suppresses within-class variation uniformly across directions. The subsequent diagonalization of $S_b'$ finds orthogonal axes that maximize the projected between-class separations. Thus, the final embedding aligns class means along maximally discriminative directions while compressing to a user-specified dimension $L$.
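The sphering property admits a minimal numerical check; here a synthetic symmetric positive-definite matrix stands in for the regularized within-class scatter:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
S_w = A @ A.T + 1e-2 * np.eye(5)             # SPD stand-in for regularized S_w
evals, V = np.linalg.eigh(S_w)
Z = V @ np.diag(evals ** -0.5) @ V.T         # whitening transform
print(np.allclose(Z @ S_w @ Z, np.eye(5)))   # True: Z maps S_w to the identity
```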

A key distinction from classical LDA is the absence of a hard rank constraint: classical LDA yields at most $K-1$ meaningful directions due to the rank of $S_b$, whereas FK-LDA in the whitened space is full-rank up to $D$, permitting arbitrary dimensionality reduction to any $L \leq D$.
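The rank constraint on $S_b$ is easy to see numerically: the $N_k$-weighted class-mean deviations sum to zero, so $\operatorname{rank}(S_b) \leq K-1$. A synthetic illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, n = 3, 8, 10                           # 3 classes in an 8-D space
means = rng.normal(size=(K, D))
X = np.concatenate([rng.normal(loc=m, size=(n, D)) for m in means])
mu = X.mean(axis=0)
S_b = np.zeros((D, D))
for k in range(K):
    d = X[k * n:(k + 1) * n].mean(axis=0) - mu
    S_b += n * np.outer(d, d)
print(np.linalg.matrix_rank(S_b))            # at most K - 1 = 2, despite D = 8
```

FK-LDA's whitened-space eigendecomposition, by contrast, yields a full orthonormal basis of $\mathbb{R}^D$, so any $L \leq D$ can be selected.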

5. Computational Complexity and Practical Implementation

The adaptation requires:

  • Accumulating scatter matrices: $O(ND^2)$
  • Two eigendecompositions (size $D \times D$): $O(D^3)$ each
  • Inference for a new sample: $O(DL)$ (a single matrix-vector multiplication)

Choice of the regularization parameter $\lambda$ is not critical: for CLIP-scale data, any value in the range $10^{-3}$ to $10^{-1}$ times the mean diagonal of $S_w$ suffices to ensure numerical stability and consistent accuracy.

The method supports dimensionality reduction by varying $L$, with the ability to compress the embedding vectors by 10- to 12-fold with little or no loss in accuracy on large-scale benchmarks.

6. Empirical Performance and Application Scope

On ImageNet-1K, nearest-prototype classification in the Koo-Fu CLIP-optimized space achieves a top-1 accuracy increase from 75.1% (raw CLIP) to 79.1%. Performance gains are robust as the number of classes scales to ImageNet-14K and ImageNet-21K. The approach provides an efficient closed-form post-hoc transformation, requiring only linear operations and avoiding retraining. This enables scalable large-scale classification and retrieval, especially in scenarios where prototype-based inference or severe embedding compression is needed (Suchanek et al., 1 Feb 2026).
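Nearest-prototype inference in the projected space amounts to one matrix-vector product per sample plus a distance to each projected class mean. A hypothetical helper sketching the setup (the accuracy figures above are from the paper, not this toy data):

```python
import numpy as np

def prototype_classify(T, X_train, y_train, X_test):
    """Nearest-prototype classification in the projected space.

    T: (L, D) linear projector; prototypes are the projected class means.
    """
    classes = np.unique(y_train)
    protos = np.stack([(X_train[y_train == k] @ T.T).mean(axis=0) for k in classes])
    Z = X_test @ T.T                                     # project test samples
    dists = np.linalg.norm(Z[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)]             # nearest prototype wins
```

With any projector that preserves class separation, well-separated clusters are classified correctly; in Koo-Fu CLIP, `T` would be the FK-LDA projector fitted on labeled embeddings.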

FK-LDA is most effective in scenarios characterized by high-dimensional embeddings and relatively few samples per class, yielding improved class separation for both accuracy and computational efficiency.

7. Theoretical Considerations and Limitations

FK-LDA guarantees a closed-form solution that is robust to the empirical ill-conditioning commonly encountered in high-dimensional embedding spaces generated by models like CLIP. The method analytically optimizes the Fisher criterion, producing a solution that is not limited by the number of classes and is agnostic to the specifics of the base embedding model.

A plausible implication is that FK-LDA may be deployed as a generic adaptation layer atop any pre-trained general-purpose representation with minimal tuning, offering a numerically stable and theoretically grounded alternative to traditional discriminant analysis schemes. Classical LDA’s instability and rank limitation are circumvented by the whitening step, suggesting that Koo-Fu CLIP may be especially suited for transfer learning and fast adaptation pipelines where re-training the backbone is infeasible.

For applications requiring more nuanced or nonlinear discriminative adaptation, a limitation is the linearity of the FK-LDA transformation; nonlinearity is not modeled, so the method’s efficacy is bounded by the linear separability of class structure in the base embedding. However, for a large regime of supervised adaptation problems—particularly those characterized by high-dimensional, over-complete feature spaces—FK-LDA, as instantiated in Koo-Fu CLIP, provides an efficient and well-founded solution.
