Dynamic Resolution Input Strategy (DRIS)
- DRIS is a dynamic input strategy that adaptively selects image resolutions based on content, task requirements, and computational constraints to maximize semantic fidelity.
- It leverages learned predictors and patch-level routing to optimize image processing, resulting in significant reductions in FLOPs while maintaining or improving accuracy.
- DRIS is applied across domains such as MLLMs, object detection, OCR, and remote sensing, balancing resource use with detail retention through multi-stage resolution selection.
Dynamic Resolution Input Strategy (DRIS) describes a class of mechanisms that adaptively select and process image input resolutions based on content, task requirements, or computational constraints. Rather than statically resizing all images or regions to fixed dimensions, DRIS algorithms learn to allocate detail adaptively, aiming to maximize semantic fidelity within hardware or computational budgets. The approach is now canonical across domains including multimodal LLMs (MLLMs), object detection, OCR, super-resolution, autonomous perception, and remote sensing. Contemporary designs leverage learned predictors, region or patch-level routing, and multi-stage dynamic compression, yielding substantial reductions in computational cost while retaining—or even improving—accuracy for vision-language and pure vision tasks.
1. Fundamental Principles and Mechanisms
DRIS encompasses architectures wherein image input resolution is not fixed but determined via an adaptive or learned function. Central paradigms include:
- Content-Adaptive Routing: Systems partition images into regions, patches, or crops, and select a processing resolution based on estimated semantic complexity, saliency, or information density. For example, ViCO employs a patch-level router assigning high or low token count connectors per patch (Cui et al., 14 Oct 2025); DynRsl-VLM crops high-resolution regions around detected entities and processes global context at low resolution (Zhou et al., 14 Mar 2025); remote sensing applications use per-pixel or per-region saliency maps to decide refinement levels (Zhang et al., 29 Dec 2025).
- Learned Prediction Modules: A lightweight predictor trained jointly with the primary network outputs a discrete or continuous scaling factor per image (DRNet (Zhu et al., 2021), Elastic-DETR (Seo et al., 2024), DyRA (Seo et al., 2023)). Predictors typically use small CNN or transformer architectures and output either a resolution index or scale factor via Gumbel-Softmax or sigmoid normalization.
- Multi-Connector Token Compression: In MLLMs, vision tokens are downsampled using connector MLPs with variable compression ratios, minimizing the number sent to the LLM. ViCO’s connectors yield 64 or 256 tokens based on routing decisions; DynRefer nests variable crops around region-of-interest boxes and fuses features over multiple scales (Zhao et al., 2024).
- Optimization for Resource and Semantic Trade-offs: Most DRIS designs jointly minimize semantic loss (e.g., cross-entropy or KL divergence to reference outputs) and computational cost (e.g., expected FLOPs), using explicit trade-off regularizers or multi-stage losses.
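The routing-plus-budget idea behind these paradigms can be sketched in a few lines. The toy Python below uses the 64/256 token counts from the ViCO description, but the fixed saliency threshold and the `route_patches` helper are illustrative assumptions, not the paper's learned router:

```python
import numpy as np

# High- and low-compression connector outputs, per the ViCO description.
HIGH_TOKENS, LOW_TOKENS = 256, 64

def route_patches(saliency: np.ndarray, threshold: float = 0.5):
    """Route each patch to the high- or low-token connector and
    return the routing mask plus the total vision-token budget."""
    high = saliency > threshold                    # boolean routing decision
    tokens = np.where(high, HIGH_TOKENS, LOW_TOKENS)
    return high, int(tokens.sum())

# Hypothetical per-patch saliency: text region, sky, face, road.
saliency = np.array([0.9, 0.2, 0.7, 0.1])
high, total_tokens = route_patches(saliency)
# 2 patches high + 2 low -> 2*256 + 2*64 = 640 tokens, vs 4*256 = 1024 static
```

A learned router replaces the fixed threshold with a trained classifier, but the token-budget arithmetic is the same.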
2. Formal Models and Training Algorithms
DRIS methodology is underpinned by rigorous formalizations and two-stage or joint training approaches.
Compression Ratio Selection (ViCO)
Let $x$ be an image partitioned into patches $\{x_1, \dots, x_N\}$, each routed to high or low resolution via the learned router $R_\phi$. The consistency loss minimizing KL divergence between reference (full-res) outputs and policy outputs under mixed-resolution input is:

$$\mathcal{L}_{\mathrm{KL}} = \mathbb{E}_t\left[ D_{\mathrm{KL}}\!\left( \pi_{\mathrm{ref}}(y_t \mid x_{\mathrm{HR}}, y_{<t}) \,\big\|\, \pi_{\theta}(y_t \mid x_{\mathrm{mix}}, y_{<t}) \right) \right]$$
Sensitivity ratios per patch determine router supervision, and binary cross-entropy loss trains the Visual Resolution Router (Cui et al., 14 Oct 2025).
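A minimal numeric sketch of this consistency/sensitivity computation, assuming toy next-token distributions (the distributions and the exact ratio definition below are illustrative, not taken from the paper):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

# Reference next-token distribution from the full-resolution pass, and the
# distributions obtained when one patch is compressed vs. kept at full res.
p_ref        = np.array([0.70, 0.20, 0.10])
p_compressed = np.array([0.40, 0.35, 0.25])   # patch routed to 64 tokens
p_kept       = np.array([0.68, 0.21, 0.11])   # patch routed to 256 tokens

# Sensitivity ratio: how much more the output diverges when this patch is
# compressed; a large ratio supervises the router to keep high resolution.
sensitivity = kl_div(p_ref, p_compressed) / kl_div(p_ref, p_kept)
```

Patches whose ratio exceeds 1 by a wide margin become positive labels for the binary cross-entropy router supervision.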
Adaptive Image-Scale Prediction
In DRNet and Elastic-DETR, a sample-specific predictor $f_\phi$ outputs a resolution index $r$ or scale factor $s$:

$$r = \arg\max_k \, \mathrm{GumbelSoftmax}_k\big(f_\phi(x)\big) \qquad \text{or} \qquad s = s_{\max} \cdot \sigma\big(f_\phi(x)\big)$$

The complete objective is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda \, \mathcal{L}_{\mathrm{FLOPs}}, \qquad \mathcal{L}_{\mathrm{FLOPs}} = \left( \frac{\mathbb{E}[F(r)]}{F_{\mathrm{target}}} - 1 \right)^{2}$$

where $\mathcal{L}_{\mathrm{FLOPs}}$ regularizes expected computational cost against a FLOPs target (Zhu et al., 2021, Seo et al., 2024).
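As a toy illustration of a FLOPs-target regularizer of this kind (all numbers, the budget `f_target`, and the 0.1 trade-off weight are assumptions for the sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Per-resolution cost (GFLOPs) for hypothetical input sizes, and the
# predictor's logits over those choices (values are illustrative).
flops = np.array([20.0, 45.0, 80.0])      # e.g. small / medium / large input
logits = np.array([0.2, 1.5, -0.3])
f_target = 40.0                            # compute budget

probs = softmax(logits)
expected_flops = float(probs @ flops)
# Quadratic penalty pulls the expected cost toward the budget.
reg = (expected_flops / f_target - 1.0) ** 2
task_loss = 0.8                            # placeholder cross-entropy value
total_loss = task_loss + 0.1 * reg
```

Because the penalty is on the *expected* cost under the predictor's distribution, it stays differentiable and can be trained jointly with the task loss.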
In DyRA, loss aggregation over object-size bins uses ParetoScaleLoss:

$$\mathcal{L}_{\mathrm{Pareto}}(s) = \sum_{b \in \mathcal{B}} w_b \, \mathcal{L}_{\mathrm{det}}^{(b)}(s)$$

aggregating detection loss over size bins $\mathcal{B}$ so that no single object-size range dominates the scale decision, with BalanceLoss adapting scale boundaries to align with detector scale performance (Seo et al., 2023).
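A hedged sketch of size-binned loss aggregation (the plain weighted sum is an illustrative stand-in for DyRA's exact formulation):

```python
import numpy as np

def pareto_scale_loss(bin_losses: np.ndarray, weights=None) -> float:
    """Aggregate per-size-bin detection losses for one candidate scale;
    summing over all bins keeps any one object-size range from being
    sacrificed to improve another."""
    w = np.ones_like(bin_losses) if weights is None else np.asarray(weights)
    return float(np.sum(w * bin_losses))

# Toy per-bin losses (small / medium / large objects) at a low scale factor:
# small objects suffer most when the image is downscaled.
losses_at_s = np.array([0.9, 0.4, 0.3])
total = pareto_scale_loss(losses_at_s)
```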
Dynamic Knowledge Distillation (Text Spotting)
In DLD, a Gumbel-Softmax selector supervises the resolution choice by minimizing KL divergence to a teacher network, with sequential knowledge distillation aligning low-res student recognition with high-res teacher predictions (Chen et al., 2022).
3. Architectural Realizations
DRIS implementations span patch-based, nested-view, and global-local partitioning strategies, enabled by modular predictors and dynamic routing.
| Architecture | Resolution Control | Principle |
|---|---|---|
| ViCO (Cui et al., 14 Oct 2025) | Patch-wise connectors | Semantic-token routing |
| DynRefer (Zhao et al., 2024) | Nested crops (N views) | Detail/context fusion |
| AdaptVision (Wang et al., 2024) | Grid partitioning | Density-guided fusion |
| Elastic-DETR (Seo et al., 2024) | Image scale factor | Content-specific scale |
| DyRA (Seo et al., 2023) | Image scale factor | Pareto/balance loss |
| ESSR (Hsu et al., 26 Mar 2025) | Edge-based patching | MAC/PSNR trade-off |
| Remote Sensing VLM (Zhang et al., 29 Dec 2025) | Per-region saliency | ROI refinement |
ViCO, DynRefer, and AdaptVision favor sub-image patching, connector selection, and token compression. DRIS for object detectors leverages image-wide scaling, with predictors co-trained by scale-specific losses.
Region selection methods in DRIS allocate high resolution only to top-k high-saliency regions, while retaining low-resolution background context—see remote sensing VLM (Zhang et al., 29 Dec 2025).
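The top-k selection step itself is short; a minimal sketch assuming precomputed per-region saliency scores:

```python
import numpy as np

def select_topk_regions(saliency: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-saliency regions, most salient first;
    only these are re-processed at high resolution."""
    k = min(k, saliency.size)
    return np.argsort(saliency)[-k:][::-1]

# Hypothetical saliency for five candidate regions.
region_saliency = np.array([0.1, 0.8, 0.3, 0.95, 0.05])
refined = select_topk_regions(region_saliency, k=2)   # regions 3 and 1
```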
Low-level hardware accelerators (ESSR) use patch edge scores and adaptive thresholding for subnet selection, yielding 50% MAC reduction at negligible PSNR loss (Hsu et al., 26 Mar 2025).
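An edge-score gate of this kind can be approximated with a per-patch gradient-magnitude statistic; the score and fixed threshold below are illustrative, not ESSR's hardware metric or its adaptive thresholding:

```python
import numpy as np

def edge_score(patch: np.ndarray) -> float:
    """Mean absolute gradient magnitude as a cheap edge-density proxy."""
    gy, gx = np.gradient(patch.astype(float))
    return float(np.mean(np.abs(gx) + np.abs(gy)))

flat = np.full((8, 8), 128.0)                      # uniform patch: cheap subnet
checker = np.indices((8, 8)).sum(0) % 2 * 255.0    # high-frequency patch

threshold = 10.0                                   # illustrative cutoff
# Patches above threshold take the full SR subnet; flat patches take the
# reduced subnet, trading a small PSNR loss for fewer MACs.
use_full_subnet = [edge_score(p) > threshold for p in (flat, checker)]
```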
4. Experimental Benchmarks and Quantitative Impact
Empirical evaluations of DRIS demonstrate:
- Vision Token Compression: In ViCO, up to 50% reduction in vision tokens with ≥99.6% accuracy retention across OCR and reasoning tasks (Cui et al., 14 Oct 2025).
- FLOPs Reduction and Accuracy Gains: DRNet achieves 44% FLOPs reduction in ResNet-50 with negligible accuracy drop; Elastic-DETR produces 26% decrease in computation or 3.5% AP gain over MS-trained DN-DETR (Seo et al., 2024, Zhu et al., 2021).
- Multimodal Gains: DynRefer reports +8.6 CIDEr and +7.3 mAP improvement in region captioning and attribute detection versus fixed-resolution baselines (Zhao et al., 2024).
- Super-Resolution: ESSR accelerator achieves 50% MAC reduction with only 0.1dB PSNR decrease at 8K@30FPS throughput (Hsu et al., 26 Mar 2025).
- Resolution Robustness Benchmarks: Res-Bench (Li et al., 19 Oct 2025) introduces metrics (Spearman’s ρ, ACE, RCE) for resolution stability, showing that patch-based DRIS yields lower volatility at low-res, while native dynamic models score higher at high-res but with less robustness.
- Autonomous Driving Perception: DynRsl-VLM demonstrates improved distance MAE (-0.6m), higher risk MAP (+2.5), and enhanced reasoning BLEU in end-to-end VQA (Zhou et al., 14 Mar 2025).
- Remote Sensing: Coarse-to-fine DRIS achieves ×4 speedup with minimal BLEU-4 loss, and +7.5% accuracy improvement over LoRA baseline (Zhang et al., 29 Dec 2025).
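Of the Res-Bench metrics, Spearman's ρ over a resolution sweep is straightforward to compute; a small sketch with a made-up accuracy curve (ACE and RCE are defined by the benchmark and omitted here, and the tie-free rank trick below assumes distinct values):

```python
import numpy as np

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation for tie-free data (illustrative only)."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)   # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

resolutions = np.array([224, 448, 672, 896, 1120])
accuracy    = np.array([61.0, 70.5, 74.2, 75.0, 74.8])   # hypothetical sweep

rho = spearman_rho(resolutions, accuracy)   # near 1 => mostly monotone gains
```

A ρ close to 1 indicates accuracy rises monotonically with resolution; the slight dip at the highest resolution above pulls it below 1.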
5. Task-Specific Variants and Domain Applications
DRIS frameworks are now pervasive in:
- Multimodal Vision-LLMs (MLLMs): Adaptive token routing based on semantic complexity is standard for LLM-based perception, OCR, spatial reasoning, and captioning (ViCO, DynRefer, AdaptVision, Res-Bench) (Cui et al., 14 Oct 2025, Zhao et al., 2024, Wang et al., 2024, Li et al., 19 Oct 2025).
- Object Detection: Continuous and discrete scaling predictors, ParetoScaleLoss, and content-driven thresholds optimize detectors including DETR, RetinaNet, Faster-RCNN, FCOS, and DINO (Seo et al., 2024, Seo et al., 2023).
- Document and Text Spotting: Dynamic low-resolution distillation merges scale selection with sequential knowledge distillation to match high-res performance at reduced costs (Chen et al., 2022).
- Super-Resolution Hardware: Edge-selective patch routing underpins energy-efficient and hardware-constrained SR for high-resolution imaging (Hsu et al., 26 Mar 2025).
- Autonomous Driving: Region-level resolution refinement preserves safety-critical details in VLM-based scene interpretation (Zhou et al., 14 Mar 2025).
- Remote Sensing: Coarse-to-fine DRIS balances ROI detail with computational efficiency, crucial for cross-modal fusion and semantic interpretation pipelines (Zhang et al., 29 Dec 2025).
6. Limitations, Design Recommendations, and Benchmarked Robustness
Challenges and guidelines for DRIS include:
- Stability vs. Peak Accuracy: Patch-based or hybrid strategies confer superior robustness across resolutions, but native dynamic processing achieves higher peak accuracy at high-res (Res-Bench findings) (Li et al., 19 Oct 2025). Stability regularization and mixed-resolution fine-tuning mitigate volatility.
- Predictor Complexity: Overhead is modest (e.g., 1.5 GFLOPs in Elastic-DETR, 0.17–0.29 GFLOPs in DRNet), but must be balanced against overall gains, especially in resource-constrained or latency-sensitive deployments (Seo et al., 2024, Zhu et al., 2021).
- Resolution Selection Granularity: Current frameworks (e.g. DyRA) operate at image-level scale only; fine-grained local control offers further room for optimization (Seo et al., 2023).
- Hyperparameter Tuning: Thresholds, the number of refined ROIs, and cutoff points should be grid-searched for optimal balance; aggressive thresholding risks omitting fine targets, while an undersized k yields overly coarse outputs (Zhang et al., 29 Dec 2025).
- Hardware Adaptivity: For high-resolution accelerators, dynamic range adaptation, group-of-layer mapping, and SRAM-efficient SFBs are essential for utilization and throughput (Hsu et al., 26 Mar 2025).
- Domain-Specific Pitfalls: On uniformly complex images, coarse-to-fine DRIS may fall back to high-resolution global processing, limiting speedup (Zhang et al., 29 Dec 2025). Perceptual hash-based view selection is a fast, robust approach for region-level choice in multimodal tasks (Zhao et al., 2024).
- Metrics for Robustness: Beyond accuracy, metrics such as Spearman’s ρ, ACE, and RCE are now standard for benchmarking DRIS-enabled models (Li et al., 19 Oct 2025).
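The grid search recommended for thresholds and ROI counts can be sketched as follows, with `eval_config` as a synthetic stand-in for a real validation run (its linear accuracy/cost model and the 1.7× budget are assumptions for illustration):

```python
import itertools

def eval_config(tau: float, k: int):
    """Synthetic surrogate for a validation run: lower saliency threshold
    tau and more refined ROIs k help accuracy, but cost grows with k."""
    accuracy = 0.70 + 0.04 * k - 0.05 * tau
    cost = 1.0 + 0.3 * k                    # relative compute vs. coarse pass
    return accuracy, cost

best, best_acc = None, -1.0
for tau, k in itertools.product([0.3, 0.5, 0.7], [1, 2, 3]):
    acc, cost = eval_config(tau, k)
    if cost <= 1.7 and acc > best_acc:      # keep only configs within budget
        best, best_acc = (tau, k), acc
# Under this surrogate, the budget excludes k=3 and the best feasible
# setting is the lowest threshold with k=2.
```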
7. Outlook and Evolving Trends
Recent works advocate for:
- Integration of super-resolution modules to enhance low-res inputs prior to DRIS routing, jointly fine-tuned for task response (Li et al., 19 Oct 2025).
- Combinatorial routing for mixed-token models, enabling fine-grained, patch-level allocation based on saliency or task prior (Cui et al., 14 Oct 2025, Zhao et al., 2024).
- Multi-stage coarse-to-fine schemes, especially for remote sensing, balancing accuracy and compute by hard thresholding of saliency maps and top-k local ROI refinement (Zhang et al., 29 Dec 2025).
- Exploration of continuous-resolution predictors leveraging transformer-based representations and robust loss aggregation, as in DyRA and Elastic-DETR, for standard object detection frameworks (Seo et al., 2024, Seo et al., 2023).
A plausible implication is that future DRIS systems will employ semantic- and task-aware predictors yielding globally optimal allocation of resolution and computation across heterogeneous models, domains, and hardware environments.