Koo-Fu CLIP: FK-LDA for CLIP Embeddings
- Koo-Fu CLIP is a supervised adaptation protocol that reshapes high-dimensional CLIP embeddings using FK-LDA to enhance class discriminability.
- It employs a closed-form two-stage pipeline with regularization and whitening to reduce dimensionality and suppress within-class variance.
- Empirical results on ImageNet benchmarks demonstrate improved classification accuracy and efficient retrieval in large-scale supervised tasks.
Koo-Fu CLIP is a supervised adaptation protocol for vision-language representations, specifically targeting embeddings produced by models such as CLIP. It leverages Fukunaga–Koontz Linear Discriminant Analysis (FK-LDA) to reshape the geometry of high-dimensional embedding spaces, suppressing within-class variation and enhancing between-class discriminability via a closed-form linear projection. The approach achieves substantial dimensionality reduction and efficient adaptation for large-scale supervised classification and retrieval tasks, while remaining computationally lightweight and robust on datasets such as ImageNet-1K, ImageNet-14K, and ImageNet-21K (Suchanek et al., 1 Feb 2026).
1. Motivation and Problem Context
CLIP and similar vision-language models yield general-purpose, high-dimensional embeddings (often 768-D or higher) that are effective for zero-shot transfer but not tailored for supervised classification. The raw embedding space often contains excessive within-class variance and only limited separation between classes, impeding straightforward nearest-prototype classification and efficient use in downstream, label-rich scenarios. Adapting these representations for discriminative tasks therefore requires a process that can both compress the feature dimensions and amplify inter-class structure without expensive retraining (Suchanek et al., 1 Feb 2026).
Koo-Fu CLIP directly addresses this challenge: given labeled embedding data $\{(x_i, y_i)\}_{i=1}^{N}$ with embeddings $x_i \in \mathbb{R}^D$ and labels $y_i \in \{1, \dots, C\}$, the objective is to learn a linear map $W \in \mathbb{R}^{d \times D}$ into a lower-dimensional space $\mathbb{R}^d$, with $d \ll D$, that minimizes within-class variance and maximizes between-class separation.
2. Fukunaga–Koontz Linear Discriminant Analysis
Koo-Fu CLIP operationalizes FK-LDA for vision-language adaptation. FK-LDA extends classical Linear Discriminant Analysis (LDA) by introducing a whitening step that regularizes and numerically stabilizes the mapping process. The essential goal is to find a projection $W$ maximizing the Fisher criterion:

$$J(W) = \operatorname{tr}\!\left( (W S_W W^\top)^{-1} \, W S_B W^\top \right),$$

where $S_W$ is the within-class scatter matrix and $S_B$ is the between-class scatter matrix. Unlike classical LDA, which solves the generalized eigenproblem $S_B v = \lambda S_W v$ (with rank limitations and instability if $S_W$ is ill-conditioned), FK-LDA first whitens $S_W$ and then diagonalizes $S_B$ in this whitened space (Suchanek et al., 1 Feb 2026).
Scatter Matrix Construction
- Global mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Class mean: $\mu_c = \frac{1}{N_c} \sum_{i : y_i = c} x_i$
- Within-class scatter: $S_W = \sum_{c=1}^{C} \sum_{i : y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top$
- Between-class scatter: $S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^\top$
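The scatter construction can be sketched directly in NumPy; this is a minimal illustration with toy shapes and random data standing in for CLIP embeddings, and it verifies the standard decomposition of the total scatter into within- and between-class parts:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 300, 8, 3                      # samples, embedding dim, classes (toy sizes)
X = rng.normal(size=(N, D))              # stand-in for CLIP embeddings
y = rng.integers(0, C, size=N)           # class labels

mu = X.mean(axis=0)                      # global mean
S_W = np.zeros((D, D))
S_B = np.zeros((D, D))
for c in range(C):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)               # class mean
    diff = Xc - mu_c
    S_W += diff.T @ diff                 # within-class scatter
    delta = (mu_c - mu)[:, None]
    S_B += len(Xc) * (delta @ delta.T)   # between-class scatter

# Sanity check: the total scatter decomposes as S_T = S_W + S_B
S_T = (X - mu).T @ (X - mu)
assert np.allclose(S_T, S_W + S_B)
```

The final assertion is a useful unit test when implementing the pipeline, since it catches indexing and weighting mistakes in either scatter matrix.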
3. Closed-Form Solution and Algorithmic Pipeline
The FK-LDA adaptation comprises a two-stage diagonalization pipeline:
Closed-Form Steps
- Regularization: Add a ridge term to ensure $S_W$ is invertible: $\tilde{S}_W = S_W + \lambda I$.
- Whitening: Eigendecompose $\tilde{S}_W = U \Lambda U^\top$. The whitening transform $P = \Lambda^{-1/2} U^\top$ maps $\tilde{S}_W$ to the identity: $P \tilde{S}_W P^\top = I$.
- Whitened Means and Between-Class Scatter: For each class $c$, compute $\tilde{\mu}_c = P \mu_c$ and $\tilde{\mu} = P \mu$; then $\tilde{S}_B = \sum_{c=1}^{C} N_c (\tilde{\mu}_c - \tilde{\mu})(\tilde{\mu}_c - \tilde{\mu})^\top$.
- Diagonalization of Between-Class Scatter: Eigendecompose $\tilde{S}_B = V \Sigma V^\top$; $V_d \in \mathbb{R}^{D \times d}$ collects the top $d$ eigenvectors.
- Final Projection: The transformation mapping $x \in \mathbb{R}^D$ to $z \in \mathbb{R}^d$ is $W = V_d^\top P = V_d^\top \Lambda^{-1/2} U^\top$, i.e. $z = W x$.
The columns of $W^\top$ (equivalently, the rows of $W$) define discriminant directions maximizing inter-class spread while suppressing intra-class variation.
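A compact NumPy sketch of the full two-stage pipeline, assuming the standard scatter definitions above (toy data; not the reference implementation from the paper):

```python
import numpy as np

def fk_lda(X, y, d, lam=1e-3):
    """Closed-form FK-LDA projector W (d x D): whiten S_W, then diagonalize S_B."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        delta = (mu_c - mu)[:, None]
        S_B += len(Xc) * (delta @ delta.T)
    S_W_reg = S_W + lam * np.eye(D)           # ridge regularization
    evals, U = np.linalg.eigh(S_W_reg)        # eigendecomposition of regularized S_W
    P = (U / np.sqrt(evals)).T                # whitening: P = Lambda^{-1/2} U^T
    S_B_t = P @ S_B @ P.T                     # between-class scatter, whitened space
    _, V = np.linalg.eigh(S_B_t)
    V_d = V[:, ::-1][:, :d]                   # top-d eigenvectors (eigh sorts ascending)
    return V_d.T @ P                          # final projector W = V_d^T P

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 4, size=200)
W = fk_lda(X, y, d=3)
# In the projected space, the regularized within-class scatter becomes the identity.
```

Note the `[::-1]` reversal: `numpy.linalg.eigh` returns eigenvalues in ascending order, so the most discriminative directions are the last columns.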
Summary of Steps
| Step | Operation | Output |
|---|---|---|
| 1 | Compute means and scatters | $\mu$, $\mu_c$, $S_W$, $S_B$ |
| 2 | Regularize | $\tilde{S}_W = S_W + \lambda I$ |
| 3 | Eigendecompose $\tilde{S}_W$ | $U$, $\Lambda$ |
| 4 | Compute $P = \Lambda^{-1/2} U^\top$ | Whitening matrix $P$ |
| 5 | Whiten means, build $\tilde{S}_B$ | $\tilde{S}_B$ |
| 6 | Eigendecompose $\tilde{S}_B$ | $V$, $\Sigma$ |
| 7 | Select top $d$ eigenvectors | Final projector $W = V_d^\top P$ |
4. Geometric and Statistical Properties
The whitening transformation spheres the within-class covariances, making all classes isotropic (identity covariance) in the transformed space. This suppresses within-class variation uniformly across directions. The subsequent diagonalization of $\tilde{S}_B$ finds orthogonal axes that maximize the projected between-class separations. Thus, the final embedding aligns class means along maximally discriminative directions while compressing to a user-specified dimension $d$.
A key distinction from classical LDA is the absence of a hard rank constraint: classical LDA yields at most $C - 1$ meaningful directions due to the rank of $S_B$, whereas FK-LDA in the whitened space is full-rank up to $D$, permitting dimensionality reduction to any $d \le D$.
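The rank limitation of the classical between-class scatter can be checked numerically; the sketch below (toy numbers, not from the paper) builds $S_B$ from $C$ random class means and confirms its rank is $C - 1$, since the weighted mean deviations sum to zero:

```python
import numpy as np

rng = np.random.default_rng(2)
D, C = 32, 5
mu = rng.normal(size=(C, D))                      # class means (toy values)
N_c = rng.integers(50, 100, size=C)               # samples per class
g = (N_c[:, None] * mu).sum(axis=0) / N_c.sum()   # weighted global mean

# Between-class scatter: weighted outer products of mean deviations
S_B = sum(n * np.outer(m - g, m - g) for n, m in zip(N_c, mu))

# The deviations (mu_c - g) satisfy one linear constraint, so rank <= C - 1
rank = np.linalg.matrix_rank(S_B)
assert rank == C - 1
```

This is why classical LDA cannot produce more than $C - 1$ useful directions, and why the whitening-first construction matters when a different target dimension is desired.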
5. Computational Complexity and Practical Implementation
The adaptation requires:
- Accumulating scatter matrices: $O(N D^2)$
- Two eigendecompositions (size $D \times D$): $O(D^3)$ each
- Inference for a new sample: $O(dD)$ (a single matrix-vector multiplication)
The choice of regularization parameter $\lambda$ is not critical: for CLIP-scale data, any value within a broad range proportional to the mean diagonal entry of $S_W$ suffices to ensure numerical stability and consistent accuracy.
The method supports dimensionality reduction by varying $d$, compressing the embedding vectors $10$- to $12$-fold with little or no loss in accuracy on large-scale benchmarks.
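The inference cost claim is easy to make concrete: once the projector is fitted, adapting a new embedding is one matrix-vector product. A minimal sketch, assuming a 768-D CLIP embedding compressed 12-fold (the projector here is random, standing in for a fitted one):

```python
import numpy as np

rng = np.random.default_rng(3)
D, d = 768, 64                 # CLIP-scale embedding -> 12x compression
W = rng.normal(size=(d, D))    # stand-in for a fitted FK-LDA projector
x = rng.normal(size=D)         # a new CLIP embedding

z = W @ x                      # O(dD): single matrix-vector multiplication
assert z.shape == (d,)
```

In a batched setting the same operation is a single `(n, D) @ (D, d)` matrix multiply, so adaptation adds negligible overhead to CLIP's own forward pass.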
6. Empirical Performance and Application Scope
On ImageNet-1K, nearest-prototype classification in the Koo-Fu CLIP-optimized space achieves a marked top-1 accuracy improvement over the raw CLIP embedding space. Performance gains remain robust as the number of classes scales to ImageNet-14K and ImageNet-21K. The approach provides an efficient closed-form post-hoc transformation, requiring only linear operations and avoiding retraining. This enables scalable large-scale classification and retrieval, especially in scenarios where prototype-based inference or severe embedding compression is needed (Suchanek et al., 1 Feb 2026).
FK-LDA is most effective in scenarios characterized by high-dimensional embeddings and relatively few samples per class, yielding improved class separation for both accuracy and computational efficiency.
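Nearest-prototype inference in the adapted space can be sketched as follows; this toy example uses synthetic well-separated classes and a random projector standing in for a fitted FK-LDA map (prototypes are the projected class means, and classification picks the nearest one):

```python
import numpy as np

rng = np.random.default_rng(4)
D, d, C, n_per = 16, 4, 3, 40
# Synthetic, well-separated Gaussian classes standing in for adapted embeddings
means = rng.normal(scale=5.0, size=(C, D))
X = np.concatenate([m + rng.normal(size=(n_per, D)) for m in means])
y = np.repeat(np.arange(C), n_per)

W = rng.normal(size=(d, D))              # stand-in for a fitted FK-LDA projector
Z = X @ W.T                              # project all embeddings to d dims

# Class prototypes: projected class means (here fit on the same toy data)
prototypes = np.stack([Z[y == c].mean(axis=0) for c in range(C)])
dists = ((Z[:, None, :] - prototypes[None]) ** 2).sum(axis=-1)  # squared distances
pred = dists.argmin(axis=1)              # nearest-prototype decision
acc = (pred == y).mean()
```

Because each prototype is a single $d$-dimensional vector per class, this decision rule scales to very large class counts with only a small distance matrix per query.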
7. Theoretical Considerations and Limitations
FK-LDA guarantees a closed-form solution that is robust to the empirical ill-conditioning commonly encountered in high-dimensional embedding spaces generated by models like CLIP. The method analytically optimizes the Fisher criterion, producing a solution that is not limited by the number of classes and is agnostic to the specifics of the base embedding model.
A plausible implication is that FK-LDA may be deployed as a generic adaptation layer atop any pre-trained general-purpose representation with minimal tuning, offering a numerically stable and theoretically grounded alternative to traditional discriminant analysis schemes. Classical LDA’s instability and rank limitation are circumvented by the whitening step, suggesting that Koo-Fu CLIP may be especially suited for transfer learning and fast adaptation pipelines where re-training the backbone is infeasible.
A limitation for applications requiring more nuanced discriminative adaptation is the linearity of the FK-LDA transformation: nonlinearity is not modeled, so the method's efficacy is bounded by the linear separability of class structure in the base embedding. However, for a broad class of supervised adaptation problems, particularly those characterized by high-dimensional, over-complete feature spaces, FK-LDA, as instantiated in Koo-Fu CLIP, provides an efficient and well-founded solution.