Multi-Image Finetuning Strategies
- Multi-image finetuning is a technique that jointly processes multiple images to adapt model parameters based on diverse, multimodal data.
- It employs strategies like parallel backbones, adaptive fusion, and cross-attention to overcome the limitations of single-strategy fine-tuning.
- Applications span mixture distributions, generative composition, and multi-task learning, yielding improvements in accuracy, consistency, and AUROC.
Multi-image finetuning refers to a spectrum of training and adaptation strategies in which multiple images, or sets of images, are leveraged jointly: as input to a model, as multi-reference conditioning, or as a mechanism for adapting model weights in a fine-grained, data- or task-dependent manner. This approach generalizes standard single-image finetuning to more complex settings, addressing scenarios such as mixture distributions, multi-task learning, vision-LLM alignment, and generative composition with multi-view or multi-reference data. Recent studies demonstrate that explicitly modeling multi-image contexts, via parallel backbones, adapter fusion, architectural cross-attention, or preference optimization, can yield significant gains in performance, consistency, and sample efficiency compared to conventional fine-tuning protocols.
1. Motivation and Limitations of Standard Fine-tuning
Conventional single-strategy fine-tuning procedures assume data homogeneity and optimize global hyperparameters (e.g., learning rates, layer freezing) for all samples. In many vision scenarios, however, the target domain exhibits inherent multimodality—heterogeneous mixtures of classes or tasks, or compositional scenes that require modeling distinct visual features, context, or semantics per instance. Empirically, a single set of fine-tuning hyperparameters fails to simultaneously (i) retain general features for some modes and (ii) adapt aggressively to others, leading to underfitting or overfitting depending on the subpopulation. For instance, in image classification across highly diverse modes (e.g., aircraft versus textures), single-strategy fine-tuning underperforms due to suboptimal feature adaptation on at least one component (Shen et al., 2022).
Multi-image finetuning strategies aim to overcome this rigidity through more nuanced, data- and task-adaptive approaches: explicitly modeling multi-image contexts, leveraging multiple reference observations, or merging the specializations of multiple fine-tuned modules or adapters.
2. Multi-variant Fusion and Adaptive Assignment
A prominent approach to multi-image finetuning is to instantiate multiple parallel sub-models, each representing a distinct fine-tuning regime ("conservative" versus "aggressive" learning rates), and fuse their outputs using an adaptive sample-wise weighting (Shen et al., 2022). The Adaptable Multi-tuning Fusion (AMF) framework defines:
- Parallel Inception-v4 prediction sub-networks, differing only in their learning-rate schedules, with no frozen layers.
- A separate policy network (ResNet-34 backbone), which computes unnormalized logits for each input sample and transforms them into a soft assignment vector via softmax.
- A latent feature from each sub-network; these features are weighted by the assignment vector and concatenated into a fused representation, which is then classified.
All model components are optimized jointly via cross-entropy loss. This enables the policy network to learn which fine-tuning variant is optimal for each sample, producing near-deterministic assignment (assignment accuracy converges to 99–100%), particularly beneficial for mixture distributions such as Aircraft-DTD. On such heterogeneous data, AMF delivers a test top-1 accuracy gain of +1.69% (aircraft) and +2.79% (DTD) over standard fine-tuning, demonstrating the necessity of per-sample adaptation (Shen et al., 2022).
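The per-sample fusion described above can be sketched in a few lines. This is a minimal illustration, not code from the paper; the two-variant setup, function names, and feature dimensions are assumptions:

```python
import math

def softmax(logits):
    # Convert the policy network's unnormalized logits into a
    # soft assignment over fine-tuning variants.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_features(policy_logits, variant_features):
    # Weight each sub-network's latent feature by its assignment
    # score, then concatenate into one fused vector for the classifier.
    weights = softmax(policy_logits)
    fused = []
    for w, feature in zip(weights, variant_features):
        fused.extend(w * v for v in feature)
    return fused

# Two variants ("conservative" vs. "aggressive"), 3-dim latent features;
# a confident policy (logits 2.0 vs. 0.0) favours the first variant.
fused = fuse_features([2.0, 0.0], [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
```

Because the cross-entropy gradient flows through both the sub-networks and the policy logits during joint training, the soft assignment can sharpen toward the near-deterministic per-sample selection reported above.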
3. Multi-reference, Multi-view, and Multi-instance Fine-tuning
Multi-image finetuning is also central to tasks where the model must integrate or reason over multiple views or reference renderings of the same object or scene.
a. Multi-reference Conditioning in Generative Composition
The MureObjectStitch approach extends compositional diffusion models by finetuning on real images of a single foreground object (Chen et al., 2024). Per-object fine-tuning constructs training triplets of a background image, a composed reference set of foreground views, and the ground-truth composite; the denoising network is trained to reconstruct the ground truth using cross-attention over all reference features. The cross-attention mechanism is simply modified to accept the concatenated feature maps of the references, with no additional regularization; thus, the network can dynamically select fine-grained details from each view.
Empirically, multi-reference fine-tuning allows precise preservation of shape, pose, and fine texture in composites, outperforming single-reference baselines and maintaining geometric adaptability to diverse backgrounds (Chen et al., 2024).
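The multi-reference modification amounts to concatenating the reference tokens along the key/value axis of the cross-attention. A minimal single-query sketch in plain Python, with illustrative shapes and names:

```python
import math

def attention(query, keys, values):
    # Single-query scaled dot-product attention over all key/value tokens.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w = [e / z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

def multi_reference_attend(query, reference_feature_maps):
    # Concatenate token features from all reference views into one
    # key/value set, so the query attends across every reference at once.
    tokens = [tok for ref in reference_feature_maps for tok in ref]
    return attention(query, tokens, tokens)  # tokens double as values here

# Two reference views, each contributing two 2-dim tokens.
out = multi_reference_attend([1.0, 0.0], [[[1.0, 0.0], [0.0, 1.0]],
                                          [[0.5, 0.5], [1.0, 1.0]]])
```

Because all reference tokens compete in a single softmax, the network can pull shape cues from one view and texture cues from another without any extra regularization term.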
b. Multi-view Consistency in Generative Diffusion
For text-to-multi-view diffusion, Carve3D introduces a two-stage SFT–RLFT pipeline: initial supervised fine-tuning (SFT) trains the model to produce tiled multi-view images; subsequent RL finetuning (RLFT) further optimizes model parameters to maximize cross-view 3D consistency as quantified by a NeRF-based Multi-view Reconstruction Consistency (MRC) metric (Xie et al., 2023). Carve3D uses on-policy REINFORCE with a KL penalty to prevent drift from the SFT prior, resulting in models with lower geometric/photometric inconsistency and superior downstream 3D reconstructions.
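The RLFT objective combines a policy-gradient term with a KL penalty toward the SFT prior. The scalar sketch below is a simplification under stated assumptions: the function name, baseline, and per-sample KL estimate are illustrative, and the reward stands in for negated MRC (lower inconsistency means higher reward):

```python
def rlft_loss(log_prob, log_prob_sft, reward, baseline, kl_coef=0.1):
    # REINFORCE objective with a KL penalty toward the SFT prior.
    advantage = reward - baseline
    pg_loss = -log_prob * advantage      # policy-gradient term
    kl = log_prob - log_prob_sft         # crude per-sample KL estimate
    return pg_loss + kl_coef * kl
```

The KL term pulls the finetuned policy's log-probabilities back toward the SFT model's, which is what prevents the reward-driven updates from drifting away from the supervised prior.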
4. Parameter-efficient Multi-task Fusion
Multi-image finetuning in the context of multi-task learning is addressed in the "Multi LoRA Meets Vision" protocol, where multiple sets of low-rank adapters (LoRA) are trained independently on different tasks and subsequently merged for unified inference (Kesim et al., 2024). The per-task LoRA adapter weights are linearly combined, commonly via concatenation or averaging, to form a single weight offset that is added onto the frozen backbone for efficient multitask prediction.
Experimental results indicate that merging adapters from dissimilar domains (e.g., face recognition, galaxy classification, satellite imagery) incurs minor degradation (<2 percentage points in F1 or accuracy), whereas merging on similar domains can lead to destructive interference, especially with large-output-dimension adapters (e.g., facial landmark detection). As an empirical rule, merging is most effective when task domains have ≤10% visual overlap, and merging more than three adapters yields diminishing returns (Kesim et al., 2024).
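A minimal sketch of the averaging variant of adapter merging, using plain-list matrices; the ranks, dimensions, and function names are illustrative, not from the paper:

```python
def lora_offset(A, B):
    # Low-rank offset dW = B @ A, with A of shape (r, d_in)
    # and B of shape (d_out, r), represented as nested lists.
    r, d_in = len(A), len(A[0])
    d_out = len(B)
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d_in)]
            for i in range(d_out)]

def merge_adapters(adapters):
    # Average the per-task low-rank offsets into one dW that is
    # added onto the frozen backbone weights at inference time.
    offsets = [lora_offset(A, B) for A, B in adapters]
    n = len(offsets)
    d_out, d_in = len(offsets[0]), len(offsets[0][0])
    return [[sum(o[i][j] for o in offsets) / n for j in range(d_in)]
            for i in range(d_out)]

# Two rank-1 adapters on a 2x2 layer, each specializing one direction.
merged = merge_adapters([([[1.0, 0.0]], [[1.0], [0.0]]),
                         ([[0.0, 1.0]], [[0.0], [1.0]])])
# -> [[0.5, 0.0], [0.0, 0.5]]
```

Averaging scales each task's offset by 1/N, which offers one intuition for the diminishing returns observed when many adapters, or adapters with overlapping directions, are merged.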
5. Multi-image Reasoning in Vision-LLMs
For large vision-LLMs (LVLMs), multi-image finetuning supports complex input/output regimes where questions or instructions apply to collections of images. The MIA-DPO framework synthesizes multi-image instructions by augmenting single-image data with unrelated images, presented either as sequences, grid collages, or pic-in-pic arrangements (Liu et al., 2024). The model is then aligned via a Direct Preference Optimization (DPO) loss, using attention scores to automatically select "rejected" answers that disregard the correct target image—eliminating the need for manual annotation.
Adoption of MIA-DPO yields performance gains of +3.0 percentage points on LLaVA-v1.5 and +4.3 points on InternLM-XC2.5 across five multi-image benchmarks, with negligible impact on single-image tasks. Attention-based sample selection is critical: it focuses the model's cross-modal mappings on task-relevant image tokens, mitigating sequence-confusion and element-interference hallucinations (Liu et al., 2024).
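The alignment step uses the standard DPO loss over (chosen, rejected) answer pairs. A scalar sketch, taking sequence log-probabilities as inputs; the function name and the default beta are assumptions:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO loss for one preference pair: -log sigmoid of the
    # reference-adjusted log-probability margin. In MIA-DPO the
    # rejected answer is the one whose attention ignored the
    # correct target image, selected automatically.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference model the margin is zero and the loss is log 2; widening the margin in favour of the chosen answer drives the loss down, which is the gradient signal steering attention toward the target image.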
6. Multi-instance Learning and Masked Context Modelling
In multiple instance learning (MIL) for histopathology, multi-image finetuning can be interpreted as fine-tuning the feature extractor (e.g., ResNet-18) to predict the representations of masked patches within a context window, given the unmasked neighbouring patch representations. This is achieved via Masked Context Modelling (MCM) and knowledge distillation from a strong teacher network (EfficientNetV2-L); the student is trained to reconstruct the teacher's feature vectors for masked instances, using only a single epoch of fine-tuning (Pisula et al., 2024). This approach injects strong context-awareness into the student without requiring pixel-level reconstruction or extra supervision and consistently improves downstream MIL classification, often outperforming the teacher in AUROC.
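The masked-context distillation target reduces to a regression loss computed only at masked positions. A minimal sketch over per-patch feature vectors; the mean-squared-error form and names are illustrative assumptions:

```python
def mcm_distill_loss(student_preds, teacher_feats, mask):
    # Mean squared error between the student's predicted features and
    # the teacher's features, restricted to masked patch positions.
    total, count = 0.0, 0
    for pred, target, masked in zip(student_preds, teacher_feats, mask):
        if masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, target))
            count += len(pred)
    return total / count if count else 0.0
```

Only masked positions contribute, so the student must infer each hidden patch's representation from its unmasked neighbours, which is where the context-awareness comes from.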
7. Empirical Benchmarks and Observed Gains
The following table summarizes selected empirical results from recent studies illustrating the gains of multi-image finetuning protocols over relevant baselines.
| Protocol/Paper | Main Setting | Metric / Gain | Notes |
|---|---|---|---|
| AMF (Shen et al., 2022) | Mixture image classification | +1.69% aircraft / +2.79% DTD (Top-1 acc) | Policy net adaptively fuses per-sample sub-models |
| Carve3D (Xie et al., 2023) | Multi-view diffusion finetune | 0.0606 MRC (best; lower better) | RLFT + NeRF metric, >68% user pref. 3D consistency |
| MureObjectStitch (Chen et al., 2024) | Multi-ref composition | Details preserved visually | Multi-reference cross-attn, no FID/LPIPS reported |
| Multi-LoRA (Kesim et al., 2024) | Multitask ViT | ≤2 pp F1 drop for merging dissimilar | More than 3 adapters, or similar domains, degrades |
| MIA-DPO (Liu et al., 2024) | Multi-image LVLM alignment | +3.0 pp (LLaVA-v1.5), +4.3 pp (InternLM-XC2.5) | Fully automatic DPO data construction |
| MCM+KD MIL (Pisula et al., 2024) | MIL feature extractor | +0.1–0.12 AUROC gain (per task) | Student beats teacher with 1-epoch context modelling |
In summary, multi-image finetuning frameworks leverage advanced assignment, fusion, multi-view conditioning, and context-sensitive representation learning to achieve robust adaptation in diverse, multimodal, or multi-task vision settings, often with significant gains in efficiency and downstream performance.