Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models

Published 6 Apr 2026 in cs.AI | (2604.04482v1)

Abstract: Learners' use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors' ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal LLMs (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces a scalable and explainable pipeline that leverages MLLM embeddings to predict learner-video interactions from raw video segments.
It fuses multimodal features from transcripts, visuals, and slides to align predictions with CTML principles, achieving AUCs up to 0.76 in behavior prediction.
The approach employs TCAV analysis to interpret CTML features, offering actionable insights that bridge empirical behavioral traces and instructional design.

Scalable and Explainable Prediction of Learner-Video Interactions with Multimodal LLMs

Introduction

This paper explores the problem of predicting moment-to-moment learner interactions (watching, pausing, rewinding, skipping) with educational video content, using only the raw video segment as input. Rather than relying on retrospective video analytics or aggregated behavioral data, the proposed pipeline leverages multimodal LLMs (MLLMs) for content-based pre-screening and explainable modeling. Importantly, the approach is grounded in the Cognitive Theory of Multimedia Learning (CTML), with explicit feature coding and interpretation of model outputs in relation to well-established instructional design principles.

The work addresses three core research questions: (1) To what degree can MLLM embeddings of short video segments predict population-level interaction peaks; (2) Can these predictions be interpreted through CTML feature codings; (3) How sensitive are embedding-based models to multimedia learning concepts, as quantified by Testing with Concept Activation Vectors (TCAV).

Methodological Framework

A scalable, interpretable pipeline is introduced (Figure 1). The approach first aggregates clickstream interactions into percentile-ranked behavioral signals and preprocesses video content into synchronized multimodal views: automatic transcripts, visual slide text, and sampled video frames.

Figure 1: Pipeline for predicting learners’ interactions with online learning videos and explaining predictions from multimedia learning theory.

Video segments (20 seconds around each time-point) are embedded with state-of-the-art vision-language and text transformers (e.g., Qwen3-Embedding-8B, SigLIP, Qwen2.5-VL). Embeddings from these modalities are concatenated and fed into a lightweight neural classifier, which predicts whether a given moment falls among the top K% of interaction peaks (watching, pausing, rewinding, or skipping) per video. The model includes a video-level reference embedding to normalize importance across heterogeneous content.

Figure 2: Three modalities of video segments around $t$ are encoded by pre-trained transformers. A neural classifier predicts if $\text{Signal}_v(t)$ is among the top K\% at timepoint $t$ .

To expose theoretical interpretability, video moments are annotated with a rubric of 15 CTML features, including visual complexity, instructor presence, formulas, semantic and visual breakpoints, redundancy, and more. These features are coded by GPT-5, and their presence/absence forms the basis for subsequent TCAV analysis: for each concept, a linear probe is trained on the model’s hidden layers, and the TCAV score quantifies the fraction of correct predictions that are locally sensitive to the given feature.

Experimental Results

Predictive Performance of MLLM Embeddings

Evaluation spans 77 million video interactions across 1,641 videos from 66 STEM-focused courses. Predictive performance is measured by area under the ROC curve (AUC) and operational Lift@K metrics.

The multimodal classifier achieves AUCs up to 0.76 for rewinding behavior and 0.74 for pausing, clearly outperforming random baselines for all signals except skipping (which is less temporally localized to content). Models generalize effectively to unseen academic fields, with only minor performance degradation outside the training domain (Figure 3).

Figure 3: AUC (± std, 5 seeds) for prediction of top 5% learner-video interaction moments on fields unseen during training as a measure of generalization of the classifier.

Segment transcript and visual embeddings both convey informative signal, but maximal performance is obtained by fusing all modalities and including the context-averaged video reference embedding. The choice of layer and encoding depth in vision-LLMs affects performance (layer 32 offers best linear separability for CTML concepts), but smaller 7B parameter encoders perform comparably well to larger models, supporting the pipeline’s scalability.

Informativeness of CTML-Coded Features

CTML features auto-coded by GPT-5 display high inter-rater agreement with humans for visually objectified concepts (e.g., instructor presence, visual breakpoints) and moderate agreement for subjective transcript-driven features (e.g., signaling, interactivity).

Figure 4: Learner-video interactions at 6,000 video moments (50% from top 5% ranks) by CTML features coded by GPT-5; inter-rater agreement is shown in the right panel.

Most CTML features are significantly associated with behavioral signal. E.g., moments containing formulas, semantic or visual breakpoints show elevated pausing and rewinding; photo- or instructor-present scenes correlate with lower cognitive-load proxy signals. Visual complexity and redundancy display non-linear influences, matching theoretical expectations from CTML.

Predictive power of CTML feature sets alone saturates quickly as sample size grows, and is outperformed by high-dimensional MLLM embeddings at scale. However, with moderate ( $<$ 1,000) annotated datapoints, CTML features confer stronger predictive performance—highlighting their value as interpretable, low-cost priors.

TCAV-Based Explainability

TCAV analysis on the classifier’s embedding and hidden spaces uncovers robust, statistically significant model sensitivity to nearly all CTML features—demonstrating that MLLM embeddings not only support high classification accuracy, but encode meaningful instructional design concepts.

Figure 5: TCAV values of CTML concepts and activations in the classifier H. Significant (

) values above 0.5 indicate positive sensitivity to the concept.*

The model’s positive sensitivity is strongest for visual breakpoints, annotations, redundancy, and structured visualizations, consistent with CTML’s predictions on cognitive load. Some modality-dependent and concept-position-specific effects are also revealed (e.g., formulas’ positive association with pausing when spoken, negative when on slides). Binary CTML concepts reach AUCs above 0.82 in linear probe decoding tests. Even transcript-only embeddings contain latent information about visual features, indicating MLLM cross-modal integration.

Practical and Theoretical Implications

This research demonstrates the feasibility of pre-screening instructional videos for cognitive load proxies prior to deployment—overcoming the cold-start challenge with scalable content-based AI explainability. The pipeline enables empirical hypothesis testing in multimedia learning research at unprecedented granularity and scale and opens new frontiers for rapid, interpretable instructional design feedback.

On the methodological side, the integration of MLLMs with interpretable concept-based probing sets a template for explainable educational AI, strengthening alignment with cognitive theory rather than opaque association mining.

Going forward, several directions emerge: (1) extension to non-STEM domains to uncover discipline-specific generalization limits; (2) integration of course- or learner-level covariates; (3) more fine-grained, temporally adaptive context selection; (4) improved annotation of abstract CTML concepts with additional human-AI collaboration.

Conclusion

This paper introduces a content-grounded, scalable, and interpretable framework for predicting and explaining learner-video interaction patterns. Leveraging MLLM-based segment embeddings, automated theory-driven coding, and concept activation probing, the approach yields high predictive accuracy, strong generalizability, and rich interpretability with minimal cost. The results substantiate the view that multimodal LLMs, when coupled with explicit theoretical constructs and explainable AI methods, can bridge the long-standing gap between empirical behavioral traces and actionable instructional design.

Markdown Report Issue