Zero-Shot Learning Techniques

Updated 5 February 2026
  • Zero-shot learning is a machine learning paradigm that predicts unseen classes using auxiliary semantic information like attribute vectors and knowledge graphs.
  • Methods include direct embedding, generative feature synthesis, and graph-based alignment to bridge the gap between visual data and semantic representations.
  • These techniques address practical challenges such as domain shift and scalability, enabling applications in computer vision, NLP, multi-label classification, and object detection tasks.

Zero-shot learning (ZSL) is a paradigm in statistical machine learning and artificial intelligence where models are explicitly constructed to recognize or predict classes or concepts not present in the training data. The primary motivation behind ZSL is to overcome the combinatorial explosion of possible visual, textual, or multi-modal categories by leveraging structured side information—such as attributes, semantic word vectors, or knowledge graphs—to transfer knowledge to previously unseen classes. ZSL has been adopted in diverse domains, including computer vision, natural language processing, multi-label and multi-modal classification, and object detection, becoming foundational for open-world and low-resource learning scenarios.

1. Foundational Principles and Problem Formulation

Zero-shot learning formalizes the recognition of an instance x whose true class label y belongs to an unseen class set U, disjoint from the seen (training) class set S (U ∩ S = ∅). The key to ZSL is access to auxiliary information relating seen and unseen classes—most commonly:

  • Attribute vectors (binary or real-valued): both seen and unseen classes are associated with human-curated or learned vectors describing abstract properties (e.g. "has stripes," "can fly").
  • Semantic word or phrase embeddings: classes are referenced by terms mapped into a continuous vector space (e.g., word2vec, GloVe).
  • Knowledge graphs: nodes represent classes/concepts; edges encode ontological or commonsense relations (e.g., ConceptNet, WordNet).

Central ZSL tasks include standard zero-shot recognition (classify among unseen classes), generalized zero-shot recognition (classify among seen + unseen), zero-shot retrieval (rank database samples by unseen class queries), and zero-shot detection (localize and label unseen-class regions in images or documents).

The prevailing theoretical challenge is knowledge transfer, i.e., how to map, align, or connect the space of observed data (e.g., images, documents), the auxiliary semantic space, and the class-conditional models for both seen and unseen categories.
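
As a minimal illustration of this transfer (a sketch, not any specific published method; the data, dimensions, and names are synthetic assumptions): learn a linear map from features into attribute space on seen classes only, then classify an unseen-class instance by its nearest attribute prototype.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 seen classes, 2 unseen classes, 4-dim attributes, 6-dim features.
# Attribute prototypes (rows) are the assumed side information for ALL classes.
A_seen = rng.random((5, 4))
A_unseen = rng.random((2, 4))

# Simulated seen-class data: features generated linearly from class attributes.
M_true = rng.standard_normal((4, 6))
labels = rng.integers(0, 5, size=300)
X = A_seen[labels] @ M_true + 0.05 * rng.standard_normal((300, 6))

# Learn the feature -> attribute map W by least squares on seen data only.
W, *_ = np.linalg.lstsq(X, A_seen[labels], rcond=None)

def zsl_predict(x, prototypes):
    """Project a feature vector into attribute space; return nearest prototype index."""
    a_hat = x @ W
    return int(np.argmin(np.linalg.norm(prototypes - a_hat, axis=1)))

# An unseen-class sample is recognized with no training example of its class.
x_new = A_unseen[1] @ M_true
pred = zsl_predict(x_new, A_unseen)
```

The map W never sees the unseen classes; only their attribute prototypes are needed at test time, which is exactly the transfer mechanism described above.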

2. Taxonomy of Zero-Shot Learning Techniques

ZSL methods can be structured along four primary methodological axes:

2.1 Direct Embedding and Compatibility Models

Linear/Bilinear Embeddings: A majority of early ZSL approaches learn a function f: R^d → R^m mapping input features to semantic embeddings, or a compatibility score s(x, y) between input features and class embeddings. Examples include Direct Attribute Prediction (DAP), Embarrassingly Simple Zero-Shot Learning (ESZSL), Attribute Label Embedding (ALE), and Structured Joint Embedding (SJE), which utilize cross-domain dot-product or bilinear operators and derive closed-form or SGD-trained projection matrices (Saad et al., 2022).
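
The ESZSL objective admits a closed-form bilinear solution; a sketch on synthetic data, assuming the matrix-shape conventions of the original formulation (gamma and lam are the two regularizers):

```python
import numpy as np

def eszsl_fit(X, Y, S, gamma=1.0, lam=1.0):
    """Closed-form ESZSL: V = (X X^T + gamma I)^-1  X Y S^T  (S S^T + lam I)^-1.

    X: (d, n) seen features, Y: (n, z) one-hot labels, S: (a, z) class signatures.
    """
    d, a = X.shape[0], S.shape[0]
    left = np.linalg.inv(X @ X.T + gamma * np.eye(d))
    right = np.linalg.inv(S @ S.T + lam * np.eye(a))
    return left @ X @ Y @ S.T @ right          # bilinear operator V: (d, a)

def eszsl_predict(V, x, S_classes):
    """Compatibility score s(x, y) = x^T V s_y; return the best class index."""
    return int(np.argmax(x @ V @ S_classes))

# Synthetic check: each feature equals its class's attribute vector plus noise.
rng = np.random.default_rng(1)
S = rng.random((5, 4))                          # 4 classes, 5 attributes each
labels = rng.integers(0, 4, size=200)
X = S[:, labels] + 0.01 * rng.standard_normal((5, 200))
Y = np.eye(4)[labels]
V = eszsl_fit(X, Y, S, gamma=0.01, lam=0.01)
acc = np.mean([eszsl_predict(V, X[:, i], S) == labels[i] for i in range(200)])
```

At test time the same score is evaluated against unseen-class signatures only (standard ZSL) or against all class signatures (generalized ZSL).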

Semantic Auto-Encoders and Latent Alignment: Later work emphasizes reconstruction and bidirectional mappings. Semantic Auto-Encoder (SAE) imposes both encoder (feature to semantic) and decoder (semantic to feature) objectives (Saad et al., 2022), often to alleviate domain shift. Latent Space Encoding (LSE) generalizes this approach by using a tied encoder-decoder across feature and semantic modalities, with a jointly optimized latent subspace (Yu et al., 2017).
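
SAE's tied linear encoder-decoder also has a closed-form solution, obtained from a Sylvester equation; a sketch with random stand-in data:

```python
import numpy as np
from scipy.linalg import solve_sylvester

# SAE objective: min_W ||X - W^T S||^2 + lam * ||W X - S||^2, with one tied W as
# encoder (features -> semantics) and decoder (semantics -> features).
# Setting the gradient to zero yields the Sylvester equation A W + W B = C with
#   A = S S^T,  B = lam * X X^T,  C = (1 + lam) * S X^T.
rng = np.random.default_rng(2)
d, a, n = 8, 3, 100                  # feature dim, semantic dim, sample count
X = rng.standard_normal((d, n))      # stand-in features (one column per sample)
S = rng.standard_normal((a, n))      # stand-in per-sample semantic vectors
lam = 0.5

A = S @ S.T
B = lam * (X @ X.T)
C = (1 + lam) * (S @ X.T)
W = solve_sylvester(A, B, C)         # W: (a, d) tied projection matrix
```

The reconstruction term (decoding semantics back into features) is what constrains the projection and helps counter the domain shift mentioned above.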

2.2 Generative and Feature-Synthesis Approaches

Generative methods address projection-domain shift and class imbalance by simulating or synthesizing feature distributions for unseen classes. Modalities include:

  • Conditional VAEs and GANs: These frameworks use semantic vectors as conditioning inputs to probabilistically generate visual features for unseen classes. The Simultaneously Generating and Learning (SGAL) VAE alternates between generating missing unseen samples and optimizing model parameters, allowing the classifier to "experience" unseen classes during training (Yu et al., 2019). Knowledge Sharing GANs further incorporate enhanced semantic representations via neighbor sharing (Ting et al., 2021).
  • Pseudo Feature Generation: Some methods construct unseen-class features by combinatorially matching attribute-level CNN features learned on seen classes, enabling data-augmentation at the feature level (GPFR) (Lu et al., 2017). Here, a joint attribute feature extractor (JAFE) builds a "cognitive repository" of confident feature-attribute vectors, which are stochastically sampled to synthesize unseen-class pseudo-examples.
  • Transductive Distribution Synthesis: In transductive ZSL, with access to the unlabelled set of unseen features, class-conditional distributions for unseen categories are estimated by transferring manifold structure from the auxiliary semantic space (e.g., via sparse-coding to reconstruct unseen semantic embeddings in terms of seen-class bases, and transferring those coefficients to the statistics of seen feature distributions) (Zhao et al., 2016). The parameters of unseen-class GMMs are then adapted via EM.
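
The pipeline these generative methods share (condition a generator on class semantics, synthesize unseen-class features, then train an ordinary classifier) can be sketched with a simple Gaussian stand-in for the GAN/VAE generator; the linear mean model and all names here are illustrative assumptions, not any published architecture:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_mean_model(S_seen, Mu_seen, lam=1e-2):
    """Ridge-regress class feature means from class attributes (stand-in generator)."""
    a = S_seen.shape[0]
    return Mu_seen @ S_seen.T @ np.linalg.inv(S_seen @ S_seen.T + lam * np.eye(a))

def synthesize(W, S_unseen, n_per_class=100, sigma=0.1):
    """Sample pseudo-features for each unseen class around its predicted mean."""
    feats, labels = [], []
    for j in range(S_unseen.shape[1]):
        mu = W @ S_unseen[:, j]
        feats.append(mu + sigma * rng.standard_normal((n_per_class, mu.size)))
        labels.append(np.full(n_per_class, j))
    return np.vstack(feats), np.concatenate(labels)

# Toy world: true class means are linear in 4-dim attributes; features are 6-dim.
M_true = rng.standard_normal((6, 4))
S_seen, S_unseen = rng.random((4, 6)), rng.random((4, 2))
Mu_seen = M_true @ S_seen                       # (6, 6): six seen-class means

W = fit_mean_model(S_seen, Mu_seen)
Xs, ys = synthesize(W, S_unseen)
# Any supervised classifier can now train on (Xs, ys); here, nearest class mean.
centroids = np.stack([Xs[ys == j].mean(axis=0) for j in range(2)])
x_test = M_true @ S_unseen[:, 1] + 0.05 * rng.standard_normal(6)
pred = int(np.argmin(np.linalg.norm(centroids - x_test, axis=1)))
```

Replacing the Gaussian sampler with a conditional VAE or GAN recovers the methods above; the key step in all of them is turning the zero-shot problem into a conventional supervised one via synthesized features.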

2.3 Structure and Graph-Based Alignment

Graph-based methods exploit structured relationships between classes:

  • Knowledge Graph Embedding: Embedding class nodes from ontologies or commonsense graphs (ConceptNet, WordNet) using learned graph neural networks (e.g., Transformer-GCN) produces class representations that aggregate multi-hop relational information, outperforming earlier WordNet and GCN-based techniques in both language and vision domains (Nayak et al., 2020).
  • Contextual and Relational Potentials: Models such as context-aware ZSL incorporate inter-object or inter-entity relationships via graph-conditional random fields (CRFs), augmenting local (unary) compatibility potentials with pairwise relational priors inferred from scene graphs or relationship KGs (Luo et al., 2019). This approach enables the exploitation of spatial and categorical context in ZSL.
  • Coupled Dictionary and Structure Alignment: Coupled dictionary learning aligns visual and semantic class prototypes under a shared code basis, explicitly correcting for misalignment in structure between the spaces and improving discriminability for class prototypes. This method anchors unseen-class prototypes using their semantic descriptions and jointly reconstructs both seen and unseen class representations (Jiang et al., 2018).
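
The propagation step common to these graph methods, aggregating neighbor embeddings under a symmetrically normalized adjacency, can be sketched in a few lines; the toy class graph and dimensions are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy class graph: "zebra" - "horse" - "donkey", with initial word embeddings.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(4)
H = rng.standard_normal((3, 8))                 # initial class embeddings
W = rng.standard_normal((8, 8))                 # learned layer weights (random here)
H1 = gcn_layer(A, H, W)                         # neighbor-aware class embeddings
```

Stacking such layers gives each class node multi-hop relational context, which is what lets an unseen class like "zebra" inherit structure from related seen classes.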

2.4 Multi-Modal and Multi-Label Approaches

Unified Embedding across Modalities: Multi-Battery Factor Analysis (MBFA) generalizes CCA to simultaneously embed visual features and multiple types of side information (e.g. attributes, word vectors, co-occurrence) into a joint space, maximizing joint covariance and enabling direct similarity comparison for ZSL inference (Ji et al., 2016).

Multi-label Zero-shot Learning: Multi-label ZSL extends the standard setting by allowing instances to be assigned subsets of labels, which may themselves be unseen. This requires compositionally synthesizing "prototypical" semantic representations for label subsets (in principle, the full power set of the label vocabulary) and transductive label-propagation mechanisms for leveraging label dependencies (Fu et al., 2015).
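
The compositional idea can be sketched as follows, using the mean of member embeddings as an (illustrative) subset prototype and nearest-prototype assignment:

```python
import numpy as np
from itertools import chain, combinations

def subset_prototypes(label_emb):
    """Build one prototype per non-empty label subset: mean of member embeddings."""
    n = len(label_emb)
    subsets = list(chain.from_iterable(
        combinations(range(n), k) for k in range(1, n + 1)))
    protos = np.stack([np.mean([label_emb[i] for i in s], axis=0) for s in subsets])
    return subsets, protos

def predict_label_set(x, subsets, protos):
    """Assign an instance embedding to the nearest subset prototype."""
    return subsets[int(np.argmin(np.linalg.norm(protos - x, axis=1)))]

label_emb = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([-1.0, 0.0])}
subsets, protos = subset_prototypes(label_emb)
# An instance midway between labels 0 and 1 maps to the subset (0, 1).
print(predict_label_set(np.array([0.5, 0.5]), subsets, protos))  # (0, 1)
```

The exponential growth of the power set is exactly why the published approaches add label-dependency priors and propagation rather than enumerating subsets at scale.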

3. Performance, Experimental Protocols, and Meta-Analysis

3.1 Standard Benchmarks and Protocols

Typical ZSL evaluation uses visually diverse datasets with fixed seen/unseen splits and attributes or word embeddings:

  • AwA (Animals with Attributes): 50 classes, 85-dim attributes, canonical 40/10 seen/unseen split, used across most ZSL literature.
  • CUB (Caltech-UCSD Birds-200-2011): 200 classes, 312 attributes, 150/50 split.
  • aPY (aPascal/aYahoo), SUN, ImageNet: various classes/attributes/word vectors; often used to illustrate scalability and fine-grained discrimination.

Performance is measured via per-class top-1 accuracy (TZSL), generalized ZSL accuracy (typically the harmonic mean H of per-class seen and unseen accuracies), and mean Average Precision (mAP) for retrieval tasks.
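
A minimal implementation of these metrics, assuming the usual macro-averaged per-class accuracy and the harmonic-mean summary H:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Macro average of per-class accuracies, as used in ZSL evaluation protocols."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """GZSL summary metric: H = 2 * As * Au / (As + Au)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

print(round(harmonic_mean(0.8, 0.4), 3))  # 0.533
```

The harmonic mean penalizes models that trade unseen-class accuracy for seen-class accuracy, which plain average accuracy would mask.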

3.2 Comparative Outcomes

  • Generative methods (SGAL, knowledge-sharing GANs): State-of-the-art or competitive generalized ZSL accuracy on benchmarks (e.g., SGAL hits H=65.6% on AwA2; KS-GAN achieves SOTA AUSUC on CUB/NAB SCE splits) (Yu et al., 2019, Ting et al., 2021).
  • Graph-embedding (ZSL-KG): Outperforms WordNet-GCNs by 1–16 points (harmonic mean) on aPY, AWA2, OntoNotes, and intent datasets (Nayak et al., 2020).
  • MBFA-ZSL: Achieves 4–14% higher unseen-class accuracy versus alternatives on AwA, CUB, SUN, benefitting from multi-modal side information (Ji et al., 2016).

Meta-classifier studies demonstrate that simple consensus voting across multiple linear ZSL models (e.g., ESZSL, ALE, SJE, DeViSE, SAE) consistently yields a 1–3% absolute gain in top-1 accuracy on standard ZSL splits (Saad et al., 2022).

4. Specialized and Advanced Scenarios

4.1 Zero-Shot Object Detection

Object detection ZSL (ZSD) extends recognition to localizing unseen instances in images. Approaches include:

  • Embedding-alignment: ZSD-YOLO predicts feature vectors at bounding-box proposals, aligning them with class text embeddings (e.g., CLIP) via cross-entropy and L1 image-level loss. During inference, class assignment is performed via nearest-neighbor search in the embedding space (Badawi et al., 2024).
  • Generative Feature Synthesis: GANs conditioned on semantic descriptions are used to synthesize RoI features for unseen classes, closing the distributional gap and improving mAP/Recall@100 (Badawi et al., 2024).
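
The nearest-neighbor assignment step reduces to cosine similarity between L2-normalized region features and class text embeddings; a sketch with random stand-ins for the CLIP outputs:

```python
import numpy as np

def assign_classes(region_feats, text_emb):
    """Cosine-similarity nearest neighbor: one class index per region feature."""
    R = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.argmax(R @ T.T, axis=1)

rng = np.random.default_rng(5)
text_emb = rng.standard_normal((4, 16))         # stand-in for 4 class text embeddings
regions = text_emb[[2, 0]] + 0.1 * rng.standard_normal((2, 16))
preds = assign_classes(regions, text_emb)
```

Because class names only enter through their text embeddings, the same detector head can score classes never seen during training.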

4.2 Adversarial Augmentation and Robustness

HAS (Harnessing Adversarial Samples) replaces conventional augmentation with semantic-preserving adversarial perturbations constructed to (a) emulate negative classes, (b) enforce feature reliability constraints, and (c) diversify attribute localization, avoiding semantic drift and improving ZSL and GZSL accuracy on fine-grained, high-class-count datasets (Chen et al., 2023).

4.3 Topic Models and Hierarchical Coding

Zero-shot recognition via topic models replaces mid-level manual attribute annotation with pLSA-based topic representations derived from visual vocabularies, coupled with hierarchical class ("HiC") codebooks for robust transfer. The approach achieves comparable performance to attribute-based methods without human-labeled features (Hoo et al., 2014).

5. Limitations, Open Challenges, and Future Directions

5.1 Limitations

  • Domain Shift: The projection domain shift problem—where mappings learned on seen classes misalign for unseen classes—remains fundamental, mitigated but not solved by generative or structure-alignment approaches (Tang et al., 2019).
  • Semantic Embedding Quality: Model performance is sensitive to the choice and granularity of side information. Poorly aligned or noisy semantic representations (attributes, word embeddings, or knowledge graphs) can degrade ZSL accuracy or induce "hubness" in embedding spaces (Saad et al., 2022, Nayak et al., 2020).
  • Scalability: Computational cost (e.g., for pLSA/LDA codebooks, eigen-solvers in MBFA, or EM in transductive ZSL) and memory burden of dense graphs/embeddings can impede deployment for large-scale ontologies (Hoo et al., 2014, Ji et al., 2016).

5.2 Future Research Directions

  • Nonlinear and Deep Extensions: Kernelized/differentiable extensions of MBFA or dictionary learning (deep dictionary learning, GNNs, advanced transformers) for richer relational structure (Ji et al., 2016, Nayak et al., 2020).
  • Transductive and Self-training Methods: Leveraging unlabeled unseen-class feature manifolds with domain adaptation, label propagation, or iterative self-training visibly improves accuracy and mitigates domain shift (Zhao et al., 2016, Fu et al., 2015).
  • Multi-modal and Hierarchical ZSL: Unifying higher-arity side information, label hierarchies, and ontologies (e.g., taxonomic and graph structures) (Lake, 2022, Nayak et al., 2020). Addressing taxonomy evolution and transfer in real-world settings such as job classification in human resources (Lake, 2022).
  • Uncertainty and Robustness: Incorporating robust optimization, dropout-driven uncertainty modeling, and semantically consistent augmentation to stabilize training and inference in generalized and open-set contexts (Yu et al., 2019, Chen et al., 2023).
  • Prompt and LLM Integration: Improving calibration and systematization of prompt-based or language foundation model-driven ZSL, especially for open-vocabulary detection and classification (Badawi et al., 2024).

ZSL remains an active area, with state-of-the-art techniques regularly surpassing previous approaches across vision and language domains by capitalizing on structured side information, advanced embedding, and generative methodologies. Future advances are expected from deeper integration of transductive adaptation, richer multimodal context, and scalability to increasingly fine-grained, complex, and dynamic class spaces.
