Multilabel Movie Genre Classification
- Multilabel movie genre classification is a task that assigns one or more genres to films using multimodal data such as text, visuals, audio, and metadata.
- Advanced techniques like transformer models, deep neural networks, and multimodal fusion strategies achieve high performance, with metrics reaching up to 90% Macro-F1 on benchmark datasets.
- Applications span recommendation engines, digital archives, and content-based retrieval, while challenges include addressing class imbalance and incomplete metadata.
Multilabel movie genre classification refers to the problem of automatically assigning one or more genres to a movie instance, recognizing that most films exhibit multiple overlapping genre characteristics. This task is a principal challenge at the intersection of information retrieval, machine learning, and multimedia analysis, underpinning recommender systems, archival organization, and audience expectation modeling. In contrast to single-label genre prediction, multilabel classification allows each movie to be tagged with any subset of a predefined genre set, reflecting the complex and combinatorial nature of contemporary film categorization.
1. Problem Formulation and Datasets
Formally, multilabel genre classification seeks a function $f: \mathcal{X} \rightarrow \{0,1\}^{|\mathcal{G}|}$, where $\mathcal{X}$ is the feature space (metadata, video, audio, text, or multimodal representations) and $\mathcal{G}$ is the set of possible genres. Each movie instance $x \in \mathcal{X}$ is annotated with a binary vector $y \in \{0,1\}^{|\mathcal{G}|}$. Primary benchmark datasets include:
- MM-IMDb and MM-IMDb 2.0: Up to 33k titles, 23 genres, providing multimodal data (posters, plot summaries, metadata) (Li et al., 2023).
- MovieNet: 28k trailers, curated genres, strong class imbalance (Zhang et al., 2022, Sulun et al., 2024).
- Trailers12k: 12k manually labeled trailers for 10 genres (Montalvo-Lezama et al., 2022).
- MMX-Trailer-20: ~9k trailers with dense clips and precomputed embeddings for video/audio (Fish et al., 2020).
- IMDb/RottenTomatoes/Gracenote fusion: ~1 million movie instances with genre labels consolidated from multiple sources (Agrawal et al., 2023).
- IMDb plot and review text sets: 250k+ plot summaries (Hoang, 2018); 7k-50k+ reviews (Nyberg, 2018, Kar et al., 2019).
- Large poster collections: >13k posters with multilabel annotations (Nareti et al., 2023, Nareti et al., 2024).
Label cardinality (average number of labels per movie) varies, typically between 1.8 and 3.7, with significant long-tailed imbalance.
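The binary-vector annotation and label cardinality defined above can be illustrated with a minimal sketch; the genre set and titles here are toy examples, not drawn from any of the benchmarks:

```python
# Toy genre vocabulary (hypothetical; benchmarks use 10-28 genres).
GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]

def binarize(labels, genre_set=GENRES):
    """Encode a movie's genre list as a binary indicator vector y in {0,1}^|G|."""
    return [1 if g in labels else 0 for g in genre_set]

movies = [["Action", "Comedy"], ["Drama"], ["Drama", "Romance", "Comedy"]]
Y = [binarize(m) for m in movies]

# Label cardinality: average number of positive labels per movie.
cardinality = sum(sum(y) for y in Y) / len(Y)
```

For this toy set the cardinality is 2.0, squarely in the 1.8-3.7 range reported for the benchmarks.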
2. Feature Modalities and Representation
Genres can be inferred from a broad range of modalities, either unimodal or in multimodal fusion:
- Textual narrative: Plot summaries, reviews, taglines, and metadata (cast, crew, release year). Document representations range from tf–idf (Nyberg, 2018), Bag-of-Words (Hoang, 2018), to contextual embeddings from BERT, GPT, or Doc2Vec (Agrawal et al., 2023, Kar et al., 2019).
- Visual signals: Movie posters (Nareti et al., 2023, Nareti et al., 2024) using CNN/ResNet or transformer architectures; video frame sequences or keyframes from trailers, processed via pretrained CNNs or vision transformers (Sulun et al., 2024, Zhang et al., 2022, Montalvo-Lezama et al., 2022).
- Audio and speech: Raw soundtrack, MFCCs, spectrograms, and learned audio/event/music embeddings (Mangolin et al., 2020, Zhang et al., 2022, Sulun et al., 2024).
- Multimodal fusion: Late, early, or joint fusion of poster, synopses, trailer frames, subtitles, and audio (Mangolin et al., 2020, Fish et al., 2020, Li et al., 2023, Zhang et al., 2022, Sulun et al., 2024).
- Knowledge graph features: Encoded cast, director, and genre relations as a domain knowledge graph (Li et al., 2023).
Some systems employ explicit feature engineering (e.g., handcrafted LBP descriptors on posters and spectrograms (Mangolin et al., 2020)) while others focus exclusively on end-to-end deep learning approaches.
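As a concrete instance of the simplest textual representation above, a minimal tf-idf sketch over toy plot synopses (whitespace tokenization, no stemming or stop-word removal, unlike production pipelines):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf vectors for tokenized documents (minimal sketch)."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vocab = sorted(df)
    idf = {t: math.log(N / df[t]) for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[t] / len(d) * idf[t] for t in vocab])
    return vocab, vecs

# Hypothetical one-line synopses.
synopses = [
    "a detective hunts a killer".split(),
    "a killer clown haunts a town".split(),
    "two friends fall in love".split(),
]
vocab, X = tfidf(synopses)
```

Genre-discriminative terms ("killer", "love") receive nonzero weight only in the documents containing them, which is what makes even this shallow representation a usable baseline before contextual embeddings.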
3. Model Architectures and Fusion Strategies
A wide array of classifier architectures has been evaluated for multilabel movie genre classification, including:
- Binary relevance and classifier chains: One-vs-rest SVM or MLP per genre (Mangolin et al., 2020, Nyberg, 2018).
- Recurrent neural networks: Sequence models (GRU, LSTM) applied to plot summaries and subtitles (Hoang, 2018, Mangolin et al., 2020), typically with multi-label sigmoid or softmax output and example-specific thresholding.
- Transformer-based models: ViT, CLIP, Swin, and customized transformers for both image/poster (Nareti et al., 2023, Nareti et al., 2024) and video (Sulun et al., 2024, Montalvo-Lezama et al., 2022) domains. Transformer layers aggregate features across sampled frames or modalities.
- Multimodal fusion: Adaptive scalar fusion (Zhang et al., 2022), collaborative gating (Fish et al., 2020), and domain knowledge graph integration via attention (Li et al., 2023).
- Ensemble systems: Weighted late fusion or product/average/max rules over top-performing unimodal classifiers (Mangolin et al., 2020, Nareti et al., 2023).
- Contrastive and fine-grained training: Contrastive loss (NT-Xent) to refine semantic inter-movie embeddings (Fish et al., 2020); genre-centric anchored contrastive learning guided by KG-anchored centroids (Li et al., 2023).
- LLM-oriented architectures: BERT/GPT embeddings powering "genre spectrum" deep MLPs (Agrawal et al., 2023); meta-label (micro-genre) generation via LLM prompt-augmented multi-heads.
Example: The "Movie-CLIP" model fuses sparse shot-sampled CLIP visual features, PANNs audio embeddings, and keyword-filtered CLIP textual features via a learnable scalar-weighted sum, achieving 65.4% mAP on MovieNet (Zhang et al., 2022).
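The scalar-weighted fusion idea can be sketched as follows; this is an illustrative reimplementation with made-up per-modality outputs and weights, not the Movie-CLIP code, and in training the scalars would be learned jointly with the classifier:

```python
import numpy as np

def fuse(modality_probs, weights):
    """Scalar-weighted late fusion: softmax-normalized weights combine
    per-modality genre probabilities into one prediction vector."""
    w = np.exp(weights - np.max(weights))   # numerically stable softmax
    w = w / w.sum()
    # modality_probs has shape (n_modalities, n_genres).
    return np.tensordot(w, modality_probs, axes=1)

# Hypothetical per-genre sigmoid outputs from three unimodal branches.
visual = np.array([0.9, 0.2, 0.1])
audio  = np.array([0.6, 0.4, 0.2])
text   = np.array([0.8, 0.1, 0.3])
weights = np.array([1.0, 0.0, 0.5])        # hypothetical learned scalars
fused = fuse(np.stack([visual, audio, text]), weights)
```

Because the softmax weights form a convex combination, the fused scores stay valid probabilities while letting the model down-weight unreliable modalities.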
4. Loss Functions, Inference, and Thresholding
The standard multi-label setting employs the following candidate losses and inference policies:
- Binary cross-entropy: Computed per label, often with class-balancing or weighted loss terms to address label imbalance (Mangolin et al., 2020, Zhang et al., 2022, Sulun et al., 2024, Li et al., 2023).
- Asymmetric loss (ASL): Overweights rare positive classes and underweights negatives, often with margin clipping (Nareti et al., 2023, Nareti et al., 2024).
- Contrastive objectives: NT-Xent for inter-sample embedding structuring (Fish et al., 2020), KG-centric contrastive anchoring (Li et al., 2023).
- Ranking or learned-threshold losses: Exponential rank loss, example-adaptive thresholding via regression (Hoang, 2018) to calibrate the number of predicted labels to movie ambiguity.
- Label inference: Fixed thresholding (e.g. 0.5) on sigmoid outputs (Sulun et al., 2024, Nareti et al., 2023); probability-based per-instance decision using learned cutpoints (Hoang, 2018); variable-length genre prediction with probabilistic co-occurrence modules (Nareti et al., 2023).
Systems supporting probabilistic outputs enable retrieval and fine-grained semantic similarity, as in "Genre Spectrum" (Agrawal et al., 2023) and NT-Xent fine-tuned clustering (Fish et al., 2020).
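The two most common ingredients above, class-weighted binary cross-entropy and fixed-threshold inference, can be sketched in a few lines (toy tensors; the per-genre positive weights are hypothetical values one might derive from inverse label frequencies):

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """Per-label binary cross-entropy with positive-class weights,
    a common remedy for long-tailed label imbalance (sketch)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_prob)
             + (1 - y_true) * np.log(1 - y_prob))
    return loss.mean()

def predict(y_prob, threshold=0.5):
    """Fixed-threshold label inference on sigmoid outputs."""
    return (y_prob >= threshold).astype(int)

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.8, 0.3, 0.4, 0.1])
pos_weight = np.array([2.0, 1.0, 2.0, 1.0])  # hypothetical rare-genre weights
```

Note how the fixed 0.5 threshold misses the third label despite a 0.4 score; example-adaptive thresholding (Hoang, 2018) exists precisely to calibrate such borderline cases.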
5. Empirical Evaluation and Results
Evaluation is conducted using a range of multilabel metrics:
| Paper/Method | Macro-F1 / mAP | Micro-F1 / mAP | Hamming Loss | Best Modality Fusion | Key Dataset |
|---|---|---|---|---|---|
| Genre Spectrum (Agrawal et al., 2023) | ≈0.90 / — | 0.78 / — | — | BERT/GPT-4 multi-label MLPs | IMDb, Rotten Tomatoes, Gracenote |
| IDKG (Li et al., 2023) | 0.832 / — | 0.849 / — | — | KG + poster + plot fusion, contrastive | MM-IMDb, MM-IMDb 2.0 |
| Movie-CLIP (Zhang et al., 2022) | — / 0.654 | — / 0.752 | — | Visual + audio + ASR-CLIP fusion | MovieNet |
| MMX-Trailer-20 (Fish et al., 2020) | — / 0.597 | — / 0.583 | — | Collaborative gating over expert nets | MMX-Trailer-20 |
| Poster (ERDT) (Nareti et al., 2023) | 0.564 / — | — | 0.1655 | ResDenseTransformer ensemble | IMDb Posters |
| Poster (MCAM+SMSAM) (Nareti et al., 2024) | 0.682 / — | — | — | CLIP bi-modal, cross-attention | IMDb Posters |
| Trailers12k (Montalvo-Lezama et al., 2022) | — | — / 0.756 (μAP) | — | Swin-3D Transformer | Trailers12k |
| Multimodal late fusion (Mangolin et al., 2020) | 0.628 / — | — | — | LSTM on synopsis + CNN on video | TMDb / OpenSubtitles / Posters |
Macro-F1 and mean average precision (mAP) remain standard, but many works report per-class/genre F1, balanced accuracy, hit ratio, and Jaccard index. The highest reported macro-F1/mAP values on large, modern, multimodal sets approach ≈0.83–0.90 (Agrawal et al., 2023, Li et al., 2023). Ensemble fusion and KG-guided contrastive learning yield the largest improvements, especially for long-tail genres.
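The metrics in the table can be computed directly from binary label matrices; a minimal sketch with toy predictions (rows are movies, columns are genres):

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Macro- and micro-averaged F1 over genre labels (columns)."""
    tp = (y_true & y_pred).sum(axis=0)
    fp = ((1 - y_true) & y_pred).sum(axis=0)
    fn = (y_true & (1 - y_pred)).sum(axis=0)
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    macro = per_class.mean()                     # average of per-genre F1
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    return macro, micro

def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly."""
    return (y_true != y_pred).mean()

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
macro, micro = f1_scores(y_true, y_pred)
```

Here a single missed label on a rare genre drops macro-F1 to 2/3 while micro-F1 stays at 0.8, which is why macro-averaged metrics are preferred for exposing long-tail weaknesses.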
6. Domains of Application, Extensions, and Limitations
Applications:
- Automated genre annotation for digital archives, recommendation engines, and streaming platforms (Agrawal et al., 2023).
- Content-based retrieval and clustering using genre spectrum or NT-Xent-like embedding spaces (Fish et al., 2020).
- Fine-grained similarity search (e.g., "nearby" movies in multilabel semantic space).
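Such similarity search reduces to nearest-neighbor retrieval in the genre-probability embedding space; a minimal cosine-similarity sketch over a hypothetical three-genre catalog:

```python
import numpy as np

def nearest(query, catalog, k=2):
    """Rank catalog movies by cosine similarity of genre-probability
    embeddings (sketch of content-based retrieval)."""
    q = query / np.linalg.norm(query)
    C = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    sims = C @ q
    return np.argsort(-sims)[:k]

# Hypothetical genre-spectrum vectors (action, comedy, drama).
catalog = np.array([
    [0.9, 0.1, 0.0],   # mostly action
    [0.1, 0.8, 0.2],   # mostly comedy
    [0.8, 0.2, 0.1],   # action-leaning
])
query = np.array([0.85, 0.15, 0.05])
top = nearest(query, catalog)
```

The two action-leaning titles are retrieved ahead of the comedy, illustrating how soft multilabel outputs support "nearby movie" queries beyond exact genre matches.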
Extensions:
- Inclusion of micro-genres using LLM-generated labels (Agrawal et al., 2023).
- Knowledge graph integration for stronger metadata reasoning (Li et al., 2023).
- Fine-grained semantic clustering to dissociate subtle style/tone blends within/between coarse-class genres (Fish et al., 2020).
- Multimodal label augmentation through cross-modal co-occurrence inference (Nareti et al., 2023).
Limitations:
- Performance degrades on low-signal classes and with imbalanced label distributions (Nareti et al., 2023, Nareti et al., 2024).
- Models relying solely on visual, audio, or shallow text representations underperform those aggregating deep contextual embeddings or multi-source fusion (Hoang, 2018, Mangolin et al., 2020, Agrawal et al., 2023).
- Absence of joint modeling for hierarchical or ontology-aware genre structures (Li et al., 2023).
- Incomplete metadata (missing cast/crew nodes or poor textual descriptions) remains problematic.
7. Open Challenges and Future Directions
Current research trajectories include:
- Richer genre taxonomies: hierarchical/micro-genre labels, cross-source integration (Agrawal et al., 2023, Li et al., 2023).
- Robustness to extreme label imbalance and domain transfer (e.g., from posters/trailers to full-length content) (Montalvo-Lezama et al., 2022, Sulun et al., 2024).
- Fusion of increasingly diverse modalities (e.g., subtitle dialogue, OCR-extracted poster text, musical cues, KG-augmented metadata) (Sulun et al., 2024, Mangolin et al., 2020, Nareti et al., 2024).
- Zero-/few-shot genre discovery via LLM-augmented representations and metadata mining (Agrawal et al., 2023).
- Better leveraging of relational priors in metadata and cold-start scenarios via knowledge-graph-based embeddings (Li et al., 2023).
A plausible implication is that continued advances in LLMs, self-supervised cross-modal learning, and adaptive thresholding/fusion strategies will further elevate multilabel movie genre classification performance and expand its applicability across diverse content ecosystems.