Adaptive Multi-modal Guidance (AMG)

Updated 16 January 2026

Adaptive Multi-modal Guidance is a framework that dynamically integrates heterogeneous data streams for real-time decision-making and feedback.
It employs modular architectures with attention, gating, and transformer mechanisms to optimize multi-modal fusion and adapt to changing inputs.
AMG enhances applications such as live task guidance, path planning, and sensor fusion, demonstrated by robust empirical metrics in diverse domains.

Adaptive Multi-modal Guidance (AMG) is a class of algorithmic frameworks designed to optimize decision-making, prediction, or interaction in systems processing heterogeneous data streams from multiple modalities. AMG distinguishes itself from fixed or late-fusion approaches by adaptively weighting, integrating, and temporally sequencing information from visual, linguistic, auditory, spatial, sensor, or semantic sources in response to online feedback, user-specific uncertainty, and dynamic environmental cues. The AMG principle applies in contexts ranging from live instructional guidance, conditional prompt generation, and robust sensor fusion to path planning and cross-modal reconstruction. Current AMG architectures are implemented as streaming, event-driven modules, gating networks, attention mutual-guidance blocks, or iterative fusion transformers, explicitly formulated to maximize fidelity and robustness in supervisory, generative, recognition, and control tasks.

1. Foundational Problem Formulations in AMG

AMG is formalized as the problem of achieving real-time, step-wise guidance and feedback in multi-modal domains. In “Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?” (Bhattacharyya et al., 27 Nov 2025), the problem is defined as asynchronous, streaming vision–language sequence prediction. Here, a guidance agent receives a continuous stream of video frames $V = \{v_1, v_2, ...\}$ , emits instructions $I_t$ , detects completion events and error states, and outputs precise, timestamped feedback in real time:

At time $t$ , the model observes $v_1...v_t$ and must decide among three actions: $\texttt{NextInstruction}(\tau; P)$ (for plan step $\tau$ of plan $P$ ), $\texttt{Feedback}(\tau; OK)$ , or $\texttt{Feedback}(\tau; Mistake)$ .
Completed steps are validated via temporal alignment window $|t_{\text{pred}} - t_{\text{gt}}| \leq \Delta/2$ , typically $\Delta=30$ seconds.
The system objective is to maximize instruction completion accuracy (IC–Acc), mistake $F_1$ score, and feedback fluency (BERTScore, ROUGE-L), all subject to strict latency requirements (first-token latency $\lesssim 1.1$ s).

Similarly, AMG in path planning (Ha et al., 5 Jan 2026) fuses global language-planned waypoints and vision-language confidence into a decaying guidance term for incremental heuristic search, incorporating uncertainty and geometric constraints into the cost function over the search graph.

2. Model Architectures and Adaptive Mechanisms

State-of-the-art AMG models employ modular architectures that explicitly encode and adapt to multimodal signals.

LiveMamba (Task Guidance) (Bhattacharyya et al., 27 Nov 2025):
- Vision encoder InternViT-300M-448 generates $M=1025$ tokens/frame; Q-Former adapter compresses this to $K=32$ visual tokens.
- Mamba-130M LLM, operating recurrently with special tokens $<$ vision $>$ and $<$ response $>$ , enables frame-aligned, event-based output.
- Streaming pipeline interleaves video with recipe plan, prior instructions, and feedback, permitting asynchrony and error detection.
- Iterative re-planning invokes external LLMs (Qwen3-32B) for step recovery when divergences are detected.
MuGCP Attention Mutual-Guidance (AMG) Module (Yang et al., 11 Jul 2025):
- Receives semantic conditional prompts $P_{\mathcal{TD}^{(i)}}$ and image features $F_x$ , generating visual conditional prompts $P_{\mathcal{VC}^{(i)}}$ via self-attention, cross-attention, and gated fusion followed by a feedforward network, with downstream Multi-Prompt Fusion to contextualize both SCP and VCP.
- Enables adaptive prompt generation facilitating instance-specific learning and strong generalization to unseen classes.
RCMCL Adaptive Modality Gating (AMG) (Akgul et al., 6 Nov 2025):
- Modality-specific feature $Z_M$ gates are computed as $G_M = \sigma(W_G Z_M + b_G)$ , dynamically weighting RGB-D, skeleton, point cloud inputs.
- Gates are normalized and used for weighted fusion: $Z_{\text{fuse}} = w_R Z_R + w_S Z_S + w_P Z_P$ .
GAFusion Modality-Guided Transformer Blocks (Li et al., 2024):
- LiDAR occupancy masks and sparse depth maps provide structured geometric priors for image streams.
- Multi-scale dual-path transformers augment the receptive field and contextual awareness of camera features.
- LiDAR-guided adaptive fusion transformer (LGAFT) implements per-pixel gating, balancing camera and LiDAR features for optimal BEV object detection.

3. Benchmarks, Data Streams, and Annotation Protocols

Dedicated AMG benchmarks feature densely labeled, temporally resolved, multi-modal datasets.

Qualcomm Interactive Cooking (Bhattacharyya et al., 27 Nov 2025):
- Extends CaptainCook4D (384 egocentric videos).
- Annotated with action segments, mistake categories (seven types), precise instruction and feedback timestamps, and topological recipe graphs.
- Advanced set supports dynamic re-planning, requiring external LLM interventions every $\sim$ 2.7 instructions.
CoT-Movie-Dubbing Dataset (DeepDubber-V1) (Zheng et al., 31 Mar 2025):
- 7.2 hours of multi-modal video clips with subtitles, lip bounding-boxes, fine-grained meta-data (scene type, gender, age, emotion), and chain-of-thought traces.
nuScenes (GAFusion) (Li et al., 2024):
- BEV-based LiDAR and camera data annotated for 3D object detection, with mAP and NDS as primary metrics, and auxiliary depth and occupancy supervision.

AMG models ingest high-dimensional sensory data, synchronized by preprocessing blocks, and extract structured features from each modality for joint or sequential fusion.

4. Training Regimes, Loss Functions, and Optimization Techniques

AMG systems utilize supervised, self-supervised, and reinforcement learning, leveraging bespoke augmentation regimes and task-specific loss functions.

LiveMamba (Bhattacharyya et al., 27 Nov 2025):
- Q-Former pre-trained on object grounding, captioning, and fine-grained action understanding (LVIS, EPIC-KITCHENS VISOR, SSv2).
- Fine-tuning employs temporal jitter, instruction completion augmentations, and counterfactual mistake injections.
- AdamW optimizer, learning rate $1 \times 10^{-5}$ , pre-train for $200K$ iterations, fine-tune for $120K$ iterations on $8 \times$ H100 GPUs.
DeepDubber-V1 (Zheng et al., 31 Mar 2025):
- Mixed supervised (CoT response labeling), RL preference optimization (DPO, BCO), sequence generation (SFT), and multi-condition classifier-free guidance.
- Losses include accuracy, WER, speaker and emotion similarity, MCD, LSE, enforced duration alignment for lip sync, and binary cross-entropy on format and outcome tags.
RCMCL (Akgul et al., 6 Nov 2025):
- Totals loss: $\mathcal{L}_{\text{total}} = \lambda_{CM} \mathcal{L}_{CM} + \lambda_{IM} \mathcal{L}_{IM} + \lambda_{\text{deg}} \mathcal{L}_{\text{deg}}$ ; $\lambda_{CM}, \lambda_{IM}, \lambda_{\text{deg}}$ are hyper-parameters for cross-modal, intra-modal, and degradation simulation terms.
- AMG gates are back-propagated through all objectives, but fusion is applied only at inference.
MuGCP (Yang et al., 11 Jul 2025):
- Standard cross-entropy on paired visual and textual features, plus consistency loss on augmented inputs: $\mathcal{L}_{cc}=2-\cos(T',\Phi_{ta}(T^*))- \cos(F',\Phi_{va}(F^*))$ , with $\lambda=8$ .

5. Evaluation Metrics and Empirical Outcomes

AMG systems are evaluated using rigorous quantitative metrics and comparative baselines.

LiveMamba (Bhattacharyya et al., 27 Nov 2025):
- IC-Acc Main Set (streaming): 31.5%.
- Mistake $F_1$ : 0.13 on streaming, 0.19 on turn-based.
- Zero-shot baselines achieve IC-Acc $<$ 24% and Mistake $F_1<$ 0.02.
- Feedback fluency: BERTScore 0.651, ROUGE-L 0.561.
- Advanced planning with re-planner: IC-Acc 12.6%, Mistake $F_1$ 0.19.
Brain-Streams (Joo et al., 2024):
- Subject-averaged metrics (NSD test set):
- All guidance: CLIP 95.2%, Incep 94.0%, PixCorr 0.342, SSIM 0.365.
- Ablations confirm adaptive multi-modal guidance yields higher semantic fidelity than single/two stream models.
RCMCL (Akgul et al., 6 Nov 2025):
- Dual-modality dropout: only 11.5% degradation in accuracy vs. higher rates for non-adaptive fusion.
- Adding AMG boosts clean accuracy (NTU-60 CS) from 90.4% to 91.0%.
GAFusion (Li et al., 2024):
- nuScenes test: 73.6% mAP, 74.9% NDS.
- Ablation: LOG only +1.0% mAP, +0.6% NDS; SDG+LOG +1.4% mAP, +0.8% NDS.
- State-of-the-art BEV fusion, outperforming previous methods.
MuGCP (Yang et al., 11 Jul 2025):
- Full AMG (self + cross-attn + gating): HM 82.03, outperforming ablations (linear 79.85, self-only 80.72, cross-only 80.93).
Automotive AMG (Gomaa, 2022):
- Pointing-gaze referencing: RMSE reduction 20–40%, response times 1.10 s.
- Personalized fusion reduces angular errors from $\mu=6.2^{\circ}$ (baseline) to $\mu=3.6^{\circ}$ (transfer-learning).

6. Adaptiveness, Feedback, and Interaction Strategies

AMG algorithms deploy adaptive weighting and feedback mechanisms to maintain reliability in dynamic, multi-modal environments.

Temporal and Counterfactual Augmentation (Bhattacharyya et al., 27 Nov 2025):
- Temporal jitter and synthetic mistakes enhance model robustness to timing drift and diverse errors.
“When-to-Say” Tokenization (Bhattacharyya et al., 27 Nov 2025):
- Enables asynchronous, frame-wise output for streaming guidance.
Iterative Re-planning (Bhattacharyya et al., 27 Nov 2025):
- External planner invoked on user divergence, decoupling core feedback from plan graph updates.
Prompt Gating and Mutual Guidance (Yang et al., 11 Jul 2025):
- Shared attention layers enable joint semantic-visual prompt coordination, improving adaptation to new instances.
Modality Gating and Fusion Rules (Akgul et al., 6 Nov 2025, Gomaa, 2022):
- AMG learns reliability functions and applies rule-based gating, e.g., fallback on high-confidence modalities or domain-specific overrides in the face of sensor failures or ambiguous input.

7. Limitations and Future Research Directions

Limitations of current AMG frameworks include:

Domain specificity, e.g., LiveMamba confined to cooking (Bhattacharyya et al., 27 Nov 2025).
Difficulty in handling subtle or complex multi-step errors (Bhattacharyya et al., 27 Nov 2025).
Dataset scale and annotation dependency, e.g., DeepDubber-V1 relies on curated CoT traces (Zheng et al., 31 Mar 2025).
Real-time, closed-loop evaluation remains a challenge for live feedback systems (Bhattacharyya et al., 27 Nov 2025).
Human-centered adaptation requires scalable, unsupervised user modeling (Gomaa, 2022).

Envisaged advancements comprise:

Extending AMG beyond narrow procedural domains (fitness, assembly).
Integrating further modalities (audio, haptics).
End-to-end training over structured plan graphs and reasoning over dynamic tasks.
Accelerating re-planner latency for iterative plan adaptation.
Learning to fuse and weight modalities in response to both environmental uncertainty and user corrections.
Scaling AMG benchmarks and datasets to support robust comparative research.

AMG represents a convergence point for multi-modal learning architectures, enabling fine-grained, adaptive interaction and robust fusion across temporal, semantic, spatial, and linguistic modalities. By combining dedicated benchmarking, mutual-attention modules, uncertainty-aware gating, and user-centered adaptation, AMG frameworks establish scalable baselines and compelling directions for interactive perception, procedural guidance, and robust sensor fusion.