Koo-Fu CLIP: FK-LDA for CLIP Embeddings
- Koo-Fu CLIP is a supervised adaptation protocol that reshapes high-dimensional CLIP embeddings using FK-LDA to enhance class discriminability.
- It employs a closed-form two-stage pipeline with regularization and whitening to reduce dimensionality and suppress within-class variance.
- Empirical results on ImageNet benchmarks demonstrate improved classification accuracy and efficient retrieval in large-scale supervised tasks.
Koo-Fu CLIP is a supervised adaptation protocol for vision-language representations, specifically targeting embeddings produced by models such as CLIP. It leverages Fukunaga–Koontz Linear Discriminant Analysis (FK-LDA) to reshape the geometry of high-dimensional embedding spaces, suppressing within-class variation and enhancing between-class discriminability via a closed-form linear projection. The approach achieves substantial dimensionality reduction and efficient adaptation for large-scale supervised classification and retrieval tasks, while remaining computationally lightweight and robust on datasets such as ImageNet-1K, ImageNet-14K, and ImageNet-21K (Suchanek et al., 1 Feb 2026).
1. Motivation and Problem Context
CLIP and similar vision-language models yield general-purpose, high-dimensional embeddings (often 768-D or higher) that are effective for zero-shot transfer but not tailored for supervised classification. The raw embedding space often contains excessive within-class variance and only limited separation between classes, impeding straightforward nearest-prototype classification and efficient use in downstream, label-rich scenarios. Adapting these representations for discriminative tasks therefore requires a process that can both compress the feature dimensions and amplify inter-class structure without expensive retraining (Suchanek et al., 1 Feb 2026).
Koo-Fu CLIP directly addresses this challenge: given labeled embedding data $\{(x_i, y_i)\}_{i=1}^{N}$ with embeddings $x_i \in \mathbb{R}^D$ and labels $y_i \in \{1, \dots, C\}$, the objective is to learn a linear map $W \in \mathbb{R}^{d \times D}$ into a lower-dimensional space $\mathbb{R}^d$, with $d \ll D$, that minimizes within-class variance and maximizes between-class separation.
2. Fukunaga–Koontz Linear Discriminant Analysis
Koo-Fu CLIP operationalizes FK-LDA for vision-language adaptation. FK-LDA extends classical Linear Discriminant Analysis (LDA) by introducing a whitening step that regularizes and numerically stabilizes the mapping process. The essential goal is to find a projection $W$ maximizing the Fisher criterion:

$$J(W) = \operatorname{tr}\!\left( (W S_W W^\top)^{-1} \, W S_B W^\top \right),$$

where $S_W$ is the within-class scatter matrix and $S_B$ is the between-class scatter matrix. Unlike classical LDA, which solves the generalized eigenproblem $S_B v = \lambda S_W v$ (with rank limitations and instability if $S_W$ is ill-conditioned), FK-LDA first whitens $S_W$ and then diagonalizes $S_B$ in this whitened space (Suchanek et al., 1 Feb 2026).
Scatter Matrix Construction
- Global mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Class mean: $\mu_c = \frac{1}{N_c} \sum_{i : y_i = c} x_i$
- Within-class scatter: $S_W = \sum_{c=1}^{C} \sum_{i : y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top$
- Between-class scatter: $S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^\top$
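The scatter construction can be sketched directly in NumPy; this is a minimal illustration with toy shapes and random data standing in for CLIP embeddings, and it verifies the standard decomposition of the total scatter into within- and between-class parts:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 300, 8, 3                      # samples, embedding dim, classes (toy sizes)
X = rng.normal(size=(N, D))              # stand-in for CLIP embeddings
y = rng.integers(0, C, size=N)           # class labels

mu = X.mean(axis=0)                      # global mean
S_W = np.zeros((D, D))
S_B = np.zeros((D, D))
for c in range(C):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)               # class mean
    diff = Xc - mu_c
    S_W += diff.T @ diff                 # within-class scatter
    delta = (mu_c - mu)[:, None]
    S_B += len(Xc) * (delta @ delta.T)   # between-class scatter

# Sanity check: the total scatter decomposes as S_T = S_W + S_B
S_T = (X - mu).T @ (X - mu)
assert np.allclose(S_T, S_W + S_B)
```

The final assertion is a useful unit test when implementing the pipeline, since it catches indexing and weighting mistakes in either scatter matrix.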
3. Closed-Form Solution and Algorithmic Pipeline
The FK-LDA adaptation comprises a two-stage diagonalization pipeline:
Closed-Form Steps
- Regularization: Add a ridge term to ensure $S_W$ is invertible: $\tilde{S}_W = S_W + \lambda I$.
- Whitening: Eigendecompose $\tilde{S}_W = U \Lambda U^\top$. The whitening transform $P = \Lambda^{-1/2} U^\top$ maps $\tilde{S}_W$ to the identity: $P \tilde{S}_W P^\top = I$.
- Whitened Means and Between-Class Scatter: For each class $c$, compute $\tilde{\mu}_c = P \mu_c$ and $\tilde{\mu} = P \mu$; then $\tilde{S}_B = \sum_{c=1}^{C} N_c (\tilde{\mu}_c - \tilde{\mu})(\tilde{\mu}_c - \tilde{\mu})^\top$.
- Diagonalization of Between-Class Scatter: Eigendecompose $\tilde{S}_B = V \Sigma V^\top$; $V_d \in \mathbb{R}^{D \times d}$ collects the top $d$ eigenvectors.
- Final Projection: The transformation mapping $x \in \mathbb{R}^D$ to $z \in \mathbb{R}^d$ is $W = V_d^\top P = V_d^\top \Lambda^{-1/2} U^\top$, i.e. $z = W x$.
The columns of $W^\top$ (equivalently, the rows of $W$) define discriminant directions maximizing inter-class spread while suppressing intra-class variation.
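A compact NumPy sketch of the full two-stage pipeline, assuming the standard scatter definitions above (toy data; not the reference implementation from the paper):

```python
import numpy as np

def fk_lda(X, y, d, lam=1e-3):
    """Closed-form FK-LDA projector W (d x D): whiten S_W, then diagonalize S_B."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        delta = (mu_c - mu)[:, None]
        S_B += len(Xc) * (delta @ delta.T)
    S_W_reg = S_W + lam * np.eye(D)           # ridge regularization
    evals, U = np.linalg.eigh(S_W_reg)        # eigendecomposition of regularized S_W
    P = (U / np.sqrt(evals)).T                # whitening: P = Lambda^{-1/2} U^T
    S_B_t = P @ S_B @ P.T                     # between-class scatter, whitened space
    _, V = np.linalg.eigh(S_B_t)
    V_d = V[:, ::-1][:, :d]                   # top-d eigenvectors (eigh sorts ascending)
    return V_d.T @ P                          # final projector W = V_d^T P

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 4, size=200)
W = fk_lda(X, y, d=3)
# In the projected space, the regularized within-class scatter becomes the identity.
```

Note the `[::-1]` reversal: `numpy.linalg.eigh` returns eigenvalues in ascending order, so the most discriminative directions are the last columns.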
Summary of Steps
| Step | Operation | Output |
|---|---|---|
| 1 | Compute means and scatters | $\mu$, $\mu_c$, $S_W$, $S_B$ |
| 2 | Regularize | $\tilde{S}_W = S_W + \lambda I$ |
| 3 | Eigendecompose $\tilde{S}_W$ | $U$, $\Lambda$ |
| 4 | Compute $P = \Lambda^{-1/2} U^\top$ | Whitening matrix $P$ |
| 5 | Whiten means, build $\tilde{S}_B$ | $\tilde{S}_B$ |
| 6 | Eigendecompose $\tilde{S}_B$ | $V$, $\Sigma$ |
| 7 | Select top $d$ eigenvectors | Final projector $W = V_d^\top P$ |
4. Geometric and Statistical Properties
The whitening transformation spheres the within-class covariances, making all classes isotropic (identity covariance) in the transformed space. This suppresses within-class variation uniformly across directions. The subsequent diagonalization of $\tilde{S}_B$ finds orthogonal axes that maximize the projected between-class separations. Thus, the final embedding aligns class means along maximally discriminative directions while compressing to a user-specified dimension $d$.
A key distinction from classical LDA is the absence of a hard rank constraint: classical LDA yields at most $C - 1$ meaningful directions due to the rank of $S_B$, whereas FK-LDA in the whitened space is full-rank up to $D$, permitting dimensionality reduction to any $d \le D$.
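The rank limitation of the classical between-class scatter can be checked numerically; the sketch below (toy numbers, not from the paper) builds $S_B$ from $C$ random class means and confirms its rank is $C - 1$, since the weighted mean deviations sum to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
D, C = 32, 5
mu = rng.normal(size=(C, D))                      # class means (toy values)
N_c = rng.integers(50, 100, size=C)               # samples per class
g = (N_c[:, None] * mu).sum(axis=0) / N_c.sum()   # weighted global mean

# Between-class scatter: weighted outer products of mean deviations
S_B = sum(n * np.outer(m - g, m - g) for n, m in zip(N_c, mu))

# The deviations (mu_c - g) satisfy one linear constraint, so rank <= C - 1
rank = np.linalg.matrix_rank(S_B)
assert rank == C - 1
```

This is why classical LDA cannot produce more than $C - 1$ useful directions, and why the whitening-first construction matters when a different target dimension is desired.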
5. Computational Complexity and Practical Implementation
The adaptation requires:
- Accumulating scatter matrices: $O(N D^2)$
- Two eigendecompositions (size $D \times D$): $O(D^3)$ each
- Inference for a new sample: $O(dD)$ (a single matrix-vector multiplication)
The choice of regularization parameter $\lambda$ is not critical: for CLIP-scale data, any value within a broad range proportional to the mean diagonal entry of $S_W$ suffices to ensure numerical stability and consistent accuracy.
The method supports dimensionality reduction by varying $d$, compressing the embedding vectors $10$- to $12$-fold with little or no loss in accuracy on large-scale benchmarks.
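The inference cost claim is easy to make concrete: once the projector is fitted, adapting a new embedding is one matrix-vector product. A minimal sketch, assuming a 768-D CLIP embedding compressed 12-fold (the projector here is random, standing in for a fitted one):

```python
import numpy as np

rng = np.random.default_rng(3)
D, d = 768, 64                 # CLIP-scale embedding -> 12x compression
W = rng.normal(size=(d, D))    # stand-in for a fitted FK-LDA projector
x = rng.normal(size=D)         # a new CLIP embedding

z = W @ x                      # O(dD): single matrix-vector multiplication
assert z.shape == (d,)
```

In a batched setting the same operation is a single `(n, D) @ (D, d)` matrix multiply, so adaptation adds negligible overhead to CLIP's own forward pass.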
6. Empirical Performance and Application Scope
On ImageNet-1K, nearest-prototype classification in the Koo-Fu CLIP-optimized space achieves a marked top-1 accuracy improvement over the raw CLIP embedding space. Performance gains remain robust as the number of classes scales to ImageNet-14K and ImageNet-21K. The approach provides an efficient closed-form post-hoc transformation, requiring only linear operations and avoiding retraining. This enables scalable large-scale classification and retrieval, especially in scenarios where prototype-based inference or severe embedding compression is needed (Suchanek et al., 1 Feb 2026).
FK-LDA is most effective in scenarios characterized by high-dimensional embeddings and relatively few samples per class, yielding improved class separation for both accuracy and computational efficiency.
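Nearest-prototype inference in the adapted space can be sketched as follows; this toy example uses synthetic well-separated classes and a random projector standing in for a fitted FK-LDA map (prototypes are the projected class means, and classification picks the nearest one):

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, C, n_per = 16, 4, 3, 40
# Synthetic, well-separated Gaussian classes standing in for adapted embeddings
means = rng.normal(scale=5.0, size=(C, D))
X = np.concatenate([m + rng.normal(size=(n_per, D)) for m in means])
y = np.repeat(np.arange(C), n_per)

W = rng.normal(size=(d, D))              # stand-in for a fitted FK-LDA projector
Z = X @ W.T                              # project all embeddings to d dims

# Class prototypes: projected class means (here fit on the same toy data)
prototypes = np.stack([Z[y == c].mean(axis=0) for c in range(C)])
dists = ((Z[:, None, :] - prototypes[None]) ** 2).sum(axis=-1)  # squared distances
pred = dists.argmin(axis=1)              # nearest-prototype decision
acc = (pred == y).mean()
```

Because each prototype is a single $d$-dimensional vector per class, this decision rule scales to very large class counts with only a small distance matrix per query.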
7. Theoretical Considerations and Limitations
FK-LDA guarantees a closed-form solution that is robust to the empirical ill-conditioning commonly encountered in high-dimensional embedding spaces generated by models like CLIP. The method analytically optimizes the Fisher criterion, producing a solution that is not limited by the number of classes and is agnostic to the specifics of the base embedding model.
A plausible implication is that FK-LDA may be deployed as a generic adaptation layer atop any pre-trained general-purpose representation with minimal tuning, offering a numerically stable and theoretically grounded alternative to traditional discriminant analysis schemes. Classical LDA’s instability and rank limitation are circumvented by the whitening step, suggesting that Koo-Fu CLIP may be especially suited for transfer learning and fast adaptation pipelines where re-training the backbone is infeasible.
A limitation for applications requiring more nuanced discriminative adaptation is the linearity of the FK-LDA transformation: nonlinearity is not modeled, so the method's efficacy is bounded by the linear separability of class structure in the base embedding. However, for a broad class of supervised adaptation problems, particularly those characterized by high-dimensional, over-complete feature spaces, FK-LDA, as instantiated in Koo-Fu CLIP, provides an efficient and well-founded solution.