
Fine-grained Single-instance Perception

Updated 27 January 2026
  • FSP is a vision-language paradigm that precisely perceives, localizes, and describes isolated visual instances or regions with high fidelity.
  • It leverages region-specific attention, autoregressive captioning, and self-distillation techniques to capture minute details in images, documents, and videos.
  • Key datasets and architectures such as GranViT and SD-RPN demonstrate FSP’s impact on enhancing fine-grained object recognition, OCR, and segmentation tasks.

Fine-grained Single-instance Perception (FSP) denotes the precise, attribute-level perception, localization, and description of isolated visual instances or regions in complex data. In contemporary vision and vision-language modeling, FSP is characterized by the capacity to (a) selectively attend to an arbitrarily supplied region of interest (ROI, such as a single bounding box), (b) produce detailed, natural-language or symbolic descriptions of that isolated region, and (c) conversely, localize or segment a visual region corresponding to a supplied textual, attribute, or reasoning query. Unlike broad scene-level understanding, FSP mandates high-fidelity discrimination at the level of single objects, parts, or even small text tokens, with an explicit focus on capturing instance-level and sub-instance details (e.g., “the blue brake caliper,” “the word ‘Subtotal’ in bold in the lower right”). Modern FSP methods have progressed from isolated natural images and instance segmentation to visually rich document OCR, multimodal LLMs, and even fine-grained events in video.

1. Formalization and Task Definition

A precise definition of FSP is furnished in the context of GranViT: the capability of a vision-LLM to (a) attend to an arbitrarily supplied region (single bounding box), (b) generate a natural-language description of that ROI, and (c) invert this process to localize—in normalized image coordinates—the region referred to by a short phrase (Zheng et al., 23 Oct 2025). The “single-instance” qualifier emphasizes that each training example focuses exclusively on precisely one object or region at a time, while “fine-grained” stipulates the necessity to capture detailed, discriminatory attributes at the instance or sub-instance scale.
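The caption-to-box direction above requires mapping boxes to and from discrete coordinate tokens. The following sketch illustrates one common scheme: normalize pixel coordinates to [0, 1] and quantize each into one of a fixed number of bins. The bin count of 1000 and the helper names are illustrative assumptions, not GranViT's actual tokenizer.

```python
# Illustrative sketch (NOT GranViT's actual scheme): normalize a pixel-space
# bounding box and discretize it into four integer coordinate tokens.

def box_to_tokens(box, img_w, img_h, num_bins=1000):
    """Map a pixel box (x1, y1, x2, y2) to four coordinate tokens."""
    x1, y1, x2, y2 = box
    normalized = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    # Each normalized coordinate in [0, 1] becomes one of `num_bins` tokens.
    return [min(int(c * num_bins), num_bins - 1) for c in normalized]

def tokens_to_box(tokens, img_w, img_h, num_bins=1000):
    """Invert the discretization back to approximate pixel coordinates."""
    scale = (img_w, img_h, img_w, img_h)
    return [t / num_bins * s for t, s in zip(tokens, scale)]

tokens = box_to_tokens((128, 64, 512, 320), img_w=1024, img_h=768)
# tokens == [125, 83, 500, 416]
```

The inverse mapping is lossy only up to the bin width, which is why a four-token sequence suffices for the localization objective.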

In fine-grained visual classification (FGVC), FSP encompasses the process of identifying and classifying an individual object into one of many visually similar subcategories by modeling both minute local cues (texture, edge, part) and global structure (shape, part layout), without reliance on explicit part annotations (Xu et al., 9 Aug 2025).

Modalities in which FSP has been formulated extend beyond images to video, where the objective is to isolate and describe individual, possibly transient, fine-grained events (e.g., a blink, a tap) within a longer sequence (Zhao et al., 24 Nov 2025). FSP also encompasses OCR, document layout analysis, and structured reasoning over complex scenes.

2. Key Datasets and Annotated Corpora

Large-scale, high-quality region-level annotation is foundational to FSP. Gran-29M, constructed for GranViT, comprises 29.51 million images (natural and OCR), incorporating 183.55 million region-level annotations. Each annotation includes a normalized bounding box and an associated caption (natural images) or extracted string (for OCR/text in images) (Zheng et al., 23 Oct 2025). Dataset splits are carefully filtered by pixel resolution, aspect ratio, bounding box area, and minimum instance count per image, ensuring the presence of sufficiently fine-grained regions for both natural and synthetic images.

In the document domain, strong region-annotated sources include public OCR/text-in-image benchmarks. Other FSP-oriented datasets in FGVC, such as CUB-200-2011, NABirds, FGVC-Aircraft, and Stanford Cars, provide class-level granularity, although regionwise annotation is less emphasized (Xu et al., 9 Aug 2025). In video, VideoPerceiver-80K is curated for FSP: 80,000 clips (∼1s duration) from motion, expression, and atomic event datasets, with dense fine-grained temporal annotations (Zhao et al., 24 Nov 2025).

For referring segmentation, datasets like RefCOCO, RefCOCO+, and RefCOCOg provide localized phrase-to-region annotation, supporting the development and benchmarking of instance-aware perception (Liu et al., 2022).

3. Model Architectures and Algorithmic Frameworks

(a) Region-level Autoregressive Perception

GranViT exemplifies a region-based autoregressive FSP pipeline: a Vision Transformer (ViT) backbone is coupled to a projector and LLM decoder (e.g., Qwen2.5-VL). Patch tokens are generated from fixed-resolution images or tiles. For bounding-box-to-caption tasks, RoIAlign pools region features; for caption-to-box, the LLM emits a four-token coordinate sequence. Self-distillation is imposed at the region feature level to enforce explicit localization (Zheng et al., 23 Oct 2025).
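The data flow from a ViT patch grid to a single region feature can be sketched as follows. This is a simplified stand-in: GranViT uses true RoIAlign with bilinear sampling, whereas this sketch mean-pools the patch tokens whose grid cells fall inside a normalized box; shapes and names are assumptions for illustration.

```python
# Simplified stand-in for the RoIAlign step: mean-pool patch tokens inside a
# normalized bounding box. (GranViT uses RoIAlign with bilinear sampling.)
import numpy as np

def pool_region_feature(patch_tokens, box, grid_hw):
    """patch_tokens: (H*W, D) ViT outputs; box: normalized (x1, y1, x2, y2)."""
    h, w = grid_hw
    grid = patch_tokens.reshape(h, w, -1)
    x1, y1, x2, y2 = box
    # Convert normalized coordinates to patch-grid index ranges.
    r1, r2 = int(y1 * h), max(int(np.ceil(y2 * h)), int(y1 * h) + 1)
    c1, c2 = int(x1 * w), max(int(np.ceil(x2 * w)), int(x1 * w) + 1)
    return grid[r1:r2, c1:c2].mean(axis=(0, 1))  # (D,) region feature

tokens = np.arange(16 * 16 * 4, dtype=np.float32).reshape(256, 4)
feat = pool_region_feature(tokens, box=(0.25, 0.25, 0.75, 0.75), grid_hw=(16, 16))
# feat has shape (D,) = (4,)
```

The pooled feature is what the projector passes to the LLM decoder for bounding-box-to-caption generation.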

(b) Self-distilled Region Proposals

SD-RPN leverages teacher-student self-distillation within MLLMs to extract pseudo-ROI labels by denoising attention maps, training a lightweight region proposal network (RPN) to enable efficient, annotation-free single-pass ROI localization. The RPN is integrated atop frozen MLLM layers and trained via masked BCE on binarized, denoised pseudo-labels. This pipeline decouples ROI localization from slow autoregressive decoding, delivering fast, scalable fine-grained perception (Shi et al., 21 Sep 2025).
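The pseudo-label step above can be sketched minimally: an attention map is "denoised" here by simple thresholding, with ambiguous mid-range scores marked as ignored so the masked BCE skips them. The thresholds and thresholding rule are illustrative assumptions, not SD-RPN's actual denoising procedure.

```python
# Hedged sketch of pseudo-ROI label extraction from a (already normalized)
# attention map: confident cells become binary labels, ambiguous cells are
# excluded via a validity mask. Thresholds are illustrative assumptions.
import numpy as np

def pseudo_roi_labels(attn, lo=0.3, hi=0.7):
    """attn: (H, W) attention map in [0, 1].
    Returns (labels, mask): binary pseudo-labels and a validity mask."""
    labels = (attn >= hi).astype(np.float32)                 # confident foreground
    mask = ((attn >= hi) | (attn <= lo)).astype(np.float32)  # skip ambiguous cells
    return labels, mask

attn = np.array([[0.9, 0.5], [0.1, 0.8]])
labels, mask = pseudo_roi_labels(attn)
# The ambiguous 0.5 cell gets mask == 0 and contributes nothing to the loss.
```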

(c) Cascaded Spatial Decomposition

SCOPE develops single-instance fine-grained perception by adaptively fusing shallow detail (edges, texture) with deep semantic features, via cascaded Subtle Detail Extractor (SDE) and Salient Semantic Refiner (SSR) modules. This allows position-specific, stage-wise enhancement and integration of local and global cues, improving discriminability for highly similar object classes (Xu et al., 9 Aug 2025).
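The shallow/deep fusion idea can be illustrated with a gated blend. Note the hedge: SCOPE's SDE and SSR modules are learned, cascaded, and position-specific, whereas this sketch uses a single per-location sigmoid gate purely to show the form of adaptive fusion.

```python
# Illustrative gated fusion of shallow detail features (edges, texture) with
# deep semantic features. NOT SCOPE's actual SDE/SSR modules; shapes and the
# single-gate design are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(shallow, deep, gate_logits):
    """Blend shallow and deep features per spatial location.
    shallow, deep, gate_logits: (H, W, D) arrays."""
    g = sigmoid(gate_logits)           # per-position mixing weight in (0, 1)
    return g * shallow + (1.0 - g) * deep

shallow = np.ones((2, 2, 3))
deep = np.zeros((2, 2, 3))
fused = gated_fusion(shallow, deep, gate_logits=np.zeros((2, 2, 3)))
# With zero logits the gate is 0.5, so each output value is 0.5.
```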

(d) Unified Language-Driven Perception

UFO unifies detection, segmentation, and reasoning under an open-ended language interface: both bounding boxes and pixelwise masks are generated as autoregressive token sequences, with segmentation mediated by embedding-retrieval from joint visual-token banks using mask token embeddings. This approach supports end-to-end multi-task training and simplifies architectural design (Tang et al., 3 Mar 2025).
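The embedding-retrieval idea described above reduces to a dot product between a mask-token embedding and the bank of visual tokens; reshaping the per-token scores recovers a coarse spatial mask. Dimensions and the sigmoid readout below are assumptions for illustration, not UFO's exact formulation.

```python
# Minimal sketch of embedding-retrieval segmentation: score every visual token
# against a mask-token embedding and reshape into a mask. Dimensions assumed.
import numpy as np

def retrieve_mask(visual_tokens, mask_embedding, grid_hw):
    """visual_tokens: (H*W, D); mask_embedding: (D,). Returns (H, W) mask probs."""
    scores = visual_tokens @ mask_embedding        # similarity per visual token
    probs = 1.0 / (1.0 + np.exp(-scores))          # sigmoid -> mask probabilities
    return probs.reshape(grid_hw)

rng = np.random.default_rng(0)
visual = rng.normal(size=(64, 8))
mask = retrieve_mask(visual, mask_embedding=rng.normal(size=8), grid_hw=(8, 8))
# mask is an (8, 8) array of probabilities in (0, 1)
```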

(e) Feature Retrieval for Grounded Localization

VLM-FO1 transforms object-centric FSP by replacing brittle coordinate decoding with feature-token retrieval: region proposals are encoded into hybrid tokens combining semantic and spatial detail, projected to the LLM embedding space, and referenced by position-specific tokens in the prompt. This plug-in paradigm enables robust referencing and grounding in multi-object scenes and generalizes to instance segmentation, keypoint detection, and counting (Liu et al., 30 Sep 2025).
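The contrast with coordinate decoding can be made concrete: instead of emitting four coordinate tokens, the model selects which region token best matches a query. The cosine-similarity retrieval below is an illustrative stand-in for how the LLM references position-specific region tokens, not VLM-FO1's actual mechanism.

```python
# Hedged sketch of feature-token retrieval: pick the best-matching region
# proposal for a query embedding instead of decoding box coordinates.
import numpy as np

def retrieve_region(region_tokens, query):
    """region_tokens: (N, D) hybrid region features; query: (D,).
    Returns the index of the best-matching region proposal."""
    rt = region_tokens / np.linalg.norm(region_tokens, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return int(np.argmax(rt @ q))                  # cosine-similarity argmax

regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx = retrieve_region(regions, query=np.array([0.1, 0.9]))
# idx == 1: the second proposal is closest in direction to the query
```

Returning an index into a fixed proposal set sidesteps the brittleness of regressing coordinates token by token.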

(f) Efficient Instance Segmentation

EffSeg realizes FSP as efficient high-resolution mask generation (e.g., 112×112) using structure-preserving sparsity: only "active" features at selected locations are refined with local 2D operations, indexed by a dense spatial map. This achieves RefineMask-level fine-grained segmentation accuracy at a fraction of the compute cost (Picron et al., 2023).
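The structure-preserving sparsity idea can be sketched as follows: only the top-k most uncertain mask locations are selected as "active" and refined, indexed through a dense spatial map so 2D structure is preserved. The refinement op here (snapping to hard decisions) is a placeholder assumption; EffSeg applies learned local 2D operations.

```python
# Illustrative sketch of sparse mask refinement: refine only the k locations
# closest to the 0.5 decision boundary. The "refinement" is a placeholder.
import numpy as np

def sparse_refine(mask_probs, k):
    """Refine only the k most uncertain locations of an (H, W) mask."""
    uncertainty = -np.abs(mask_probs - 0.5)            # higher = more uncertain
    flat_idx = np.argsort(uncertainty.ravel())[-k:]    # indices of active features
    refined = mask_probs.copy()
    rows, cols = np.unravel_index(flat_idx, mask_probs.shape)
    # Placeholder refinement: snap uncertain probabilities to hard decisions.
    refined[rows, cols] = (mask_probs[rows, cols] > 0.5).astype(float)
    return refined, flat_idx

probs = np.array([[0.95, 0.55], [0.45, 0.05]])
refined, active = sparse_refine(probs, k=2)
# Only the ambiguous 0.55 and 0.45 entries are touched.
```

Because confident locations are never revisited, compute scales with the number of active features rather than the full mask resolution.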

(g) FSP Beyond Vision

FSP has also been demonstrated for human body parsing using 1D WiFi antennas; a deep U-Net maps low-dimensional channel-state information to body segmentation masks and keypoint heatmaps, illustrating the cross-sensor generality of FSP (Wang et al., 2019).

4. Learning Paradigms and Losses

FSP models employ task-specific and cross-modal objectives:

  • For region captioning, categorical cross-entropy over autoregressive token outputs:

L_{\mathrm{Bbox2Caption}} = - \sum_{l=1}^{L} \log P(o_l = t_l \mid o_{<l}, x')

(Zheng et al., 23 Oct 2025)

  • For caption-to-box regression, cross-entropy over discretized coordinates:

L_{\mathrm{Caption2Bbox}} = - \sum_{k=1}^{4} \log P(o_k' = c_k \mid o_{<k}', \mathrm{image})

(Zheng et al., 23 Oct 2025)

  • Self-distillation loss aligns region features via MSE between teacher and student encoders:

L_{\mathrm{distill}} = \mathrm{MSE}\big(x'_{\mathrm{crop}}, \mathrm{RoIAlign}(x')\big)

(Zheng et al., 23 Oct 2025)

  • In SD-RPN, a masked BCE is imposed:

L_{\mathrm{BCE}} = - \sum_{j:\,\mathrm{mask}[j]=1} \Big[ \bar{M}_{\mathrm{roi}}[j] \log \sigma(\hat{S}_{\mathrm{roi}}[j]) + \big(1 - \bar{M}_{\mathrm{roi}}[j]\big) \log\big(1 - \sigma(\hat{S}_{\mathrm{roi}}[j])\big) \Big]

(Shi et al., 21 Sep 2025)
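The masked BCE above translates directly into code: the per-element binary cross-entropy is summed only over positions where the validity mask is 1, using the denoised binary pseudo-labels and predicted logits. This is a straightforward numpy transcription of the formula, with toy inputs.

```python
# Numpy transcription of the masked BCE: sum BCE terms only where mask == 1,
# with labels playing the role of M_bar_roi and logits the role of S_hat_roi.
import numpy as np

def masked_bce(logits, labels, mask, eps=1e-12):
    """logits, labels, mask: flat arrays of equal length; labels in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-logits))                  # sigma(S_hat)
    per_elem = labels * np.log(p + eps) + (1.0 - labels) * np.log(1.0 - p + eps)
    return -np.sum(mask * per_elem)                    # sum over valid positions

logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 0.0, 1.0])
mask = np.array([1.0, 1.0, 0.0])                       # third position ignored
loss = masked_bce(logits, labels, mask)
```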

5. Benchmarks, Empirical Results, and Ablations

FSP advances are documented across canonical benchmarks:

| Model | FG Recognition (%) | OCR (%) | RefCOCO* | Other Gains |
|---|---|---|---|---|
| SAILViT | 77.95 | 53.33 | | |
| GranViT | 80.78 | 55.97 | +2.83 | SOTA on multiple VQA tasks (Zheng et al., 23 Oct 2025) |
| SD-RPN+LLaVA | | +12.4 (DocVQA), +12.6 (TextVQA) | | 0.62× throughput; annotation-free (Shi et al., 21 Sep 2025) |
| SCOPE (Swin-B) | 92.7 (CUB) | | | New SOTA (CUB, FGVC-Aircraft) (Xu et al., 9 Aug 2025) |
| EffSeg | | | | 71% FLOPs reduction vs. RefineMask (Picron et al., 2023) |
| VideoPerceiver | | | | +0.15 (MotionBench), +20 pp (VRU-Accident); SOTA on rare, fine-grained action events (Zhao et al., 24 Nov 2025) |

Ablations confirm that region-level training, self-distillation, adaptive filtering, and refinement modules consistently yield additive or multiplicative performance boosts over baselines (Zheng et al., 23 Oct 2025, Xu et al., 9 Aug 2025, Shi et al., 21 Sep 2025).

Performance is robust to the choice of VLM backbone (e.g., Qwen, InternViT), and certain FSP improvements (e.g., from GranViT pretraining, SD-RPN integration) transfer across LLM sizes and architectures (Zheng et al., 23 Oct 2025, Shi et al., 21 Sep 2025).

6. Extensions, Limitations, and Future Directions

Key limitations for current FSP methodology include annotation cost (for high-resolution region labels), computational overhead (for adaptive spatial filtering or region-wise inference), and sensitivity to out-of-distribution domains (e.g., documents vs. natural images, non-English text) (Carvalho et al., 25 Nov 2025, Xu et al., 9 Aug 2025). FSP methods show diminishing returns at extremely high resolutions or for crops that do not correspond to meaningful task units.

Promising future directions include lightweight filter factorization, integration with sparse or hybrid spatial-frequency attention, extension to multi-object and scene-centric scenarios, and richer multimodal fusion (e.g., cross-sensor, 3D, and temporal events) (Xu et al., 9 Aug 2025, Zhao et al., 24 Nov 2025). Plug-and-play, feature-retrieval–based referencing of arbitrary region/part proposals as in VLM-FO1 is expected to enable generalization to part-level attributes, video, and structured reasoning (Liu et al., 30 Sep 2025).

The FSP paradigm has demonstrably expanded the design space for vision-LLMs, enabling not only state-of-the-art fine-grained recognition but also mutual alignment of visual and language spaces at the single-instance level, with applications across VQA, OCR, dense captioning, object part analysis, and video event understanding (Zheng et al., 23 Oct 2025, Zhao et al., 24 Nov 2025).
