Two-Stage Detection–Captioning Pipelines
- Two-stage detection–captioning pipelines are systems that separate visual detection from language generation, enhancing interpretability and caption accuracy.
- They leverage dedicated detectors to extract detailed object cues that are transformed into structured inputs for neural caption generators.
- Empirical findings show these pipelines boost caption relevance and reduce hallucinations compared to holistic end-to-end approaches.
A two-stage detection–captioning pipeline is a vision–language system architecture in which explicit object detections or visual concepts from dedicated detectors serve as the primary input to a downstream caption generation module. This paradigm reestablishes a modular interface between visual perception and language modeling, supporting both interpretability and richer descriptive detail relative to direct end-to-end approaches. The two-stage framework has taken prominent forms across classical detection-conditioned captioners leveraging interpretable object cues, modern region-based detectors coupled with neural language models, and recent integrated object reasoning–caption refinement strategies targeting high-resolution imagery (Wang et al., 2018, Fang et al., 2014, Lee et al., 31 Oct 2025).
1. Architectural Overview
The canonical two-stage detection–captioning pipeline is structured as follows:
- Stage 1: Object Detection or Visual Concept Extraction
- A detection module (object detector, region proposal network, or MIL-based concept detector) processes an input image of size $H \times W$, producing a set of detections $\mathcal{D} = \{(b_i, c_i, s_i)\}_{i=1}^{N}$, where $b_i \in \mathbb{R}^4$ is a bounding box, $c_i$ a class label, and $s_i \in [0, 1]$ a detector confidence.
- Post-processing may include score thresholding, non-maximum suppression (NMS), or proposal filtering, depending on the detection workflow.
- Stage 2: Caption Generation
- Detection outputs are mapped to fixed-length, interpretable vector representations encoding per-category counts, size, spatial position, and/or detector confidence.
- These structured features are projected, often via learned affine and non-linear layers, to form a visual conditioning vector $v$, which seeds a caption generation module (e.g., LSTM, maximum-entropy language model, or VLM+LLM).
- Variants may include region-specific captioning (cropping and describing detected objects) and LLM-powered text fusion, as in multi-stage high-resolution pipelines.
This interface exposes both intermediate visual semantics and modularity for explicit cue integration (Wang et al., 2018, Fang et al., 2014, Lee et al., 31 Oct 2025).
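The two-stage interface can be sketched as a thin composition of interchangeable components. This is a minimal illustration; the type and function names (`Detection`, `two_stage_caption`, the stage callables) are assumptions for exposition, not APIs from the cited systems.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Detection:
    box: tuple    # (x, y, w, h) in pixels
    label: str    # class name, e.g. "person"
    score: float  # detector confidence in [0, 1]

def two_stage_caption(
    image,
    detect: Callable[[object], List[Detection]],
    encode: Callable[[List[Detection]], list],
    generate: Callable[[list], str],
    score_threshold: float = 0.5,
) -> str:
    # Stage 1: run the detector and filter low-confidence boxes.
    detections = [d for d in detect(image) if d.score >= score_threshold]
    # Stage 2: encode detections into structured features, then decode a caption.
    features = encode(detections)   # e.g. counts / size / position cues
    return generate(features)       # e.g. an LSTM or VLM+LLM decoder
```

Either stage can be swapped or inspected independently, which is exactly the modularity the pipeline exposes.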
2. Object Detection and Visual Concept Encodings
Detection–captioning systems instantiate the first stage using a range of detectors:
- COCO-trained object detectors (e.g., YOLOv2): Provide high-precision bounding boxes and class labels over the 80 COCO categories (Wang et al., 2018).
- Multiple Instance Learning (MIL) visual concept detectors: Learn classifiers for frequent caption words using weakly supervised region proposals and a bag-level Noisy-OR cross-entropy loss over $P(w \mid I_i) = 1 - \prod_{j \in R_i} \bigl(1 - p^{w}_{ij}\bigr)$, where $p^{w}_{ij}$ is the probability that region $j$ of image $i$ contains concept $w$ (Fang et al., 2014).
- Open-vocabulary detectors and ensembles: For high-resolution and compositional scenes, detectors like GroundingDINO, YOLO-World, and OWLv2 are used to verify a candidate pool of objects predicted by LLMs (Lee et al., 31 Oct 2025).
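The bag-level Noisy-OR aggregation used by the MIL detectors can be computed directly. A minimal sketch, with illustrative function names:

```python
import math

def noisy_or(region_probs):
    # Probability that at least one region contains the concept,
    # given per-region probabilities p_ij, under the Noisy-OR model.
    prod = 1.0
    for p in region_probs:
        prod *= 1.0 - p
    return 1.0 - prod

def mil_loss(region_probs, in_caption):
    # Bag-level cross-entropy: in_caption is 1 if the concept word
    # appears in the image's caption, else 0.
    eps = 1e-12
    p = noisy_or(region_probs)
    return -(in_caption * math.log(p + eps)
             + (1 - in_caption) * math.log(1.0 - p + eps))
```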
Feature engineering from detection results often yields:
- Frequency (object counts): Per-class bag-of-objects vector $f \in \mathbb{Z}_{\ge 0}^{C}$ over the $C$ detector categories.
- Size statistics: Max normalized area $a_c = \max_i \mathrm{area}_i$, where $\mathrm{area}_i$ is the instance area relative to the image.
- Spatial encodings: Either as scalar centrality (distance to the image center) or as a full 5-tuple $(x, y, w, h, a)$ per instance, for up to $K$ instances per class.
These cues provide dense semantic coverage, disambiguate instance details, and significantly improve caption relevance when compared to holistic CNN features (Wang et al., 2018, Fang et al., 2014).
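A minimal sketch of these frequency / size / centrality encodings, assuming a small illustrative class list and pixel-space boxes; `detection_features` and `CLASSES` are names chosen here, not from the cited work:

```python
import math
import numpy as np

CLASSES = ["person", "dog", "bench"]  # illustrative subset of detector classes

def detection_features(detections, image_w, image_h):
    # detections: list of (class_name, x, y, w, h) tuples in pixels.
    C = len(CLASSES)
    counts = np.zeros(C)       # frequency: per-class object counts
    max_area = np.zeros(C)     # size: max instance area normalized by image area
    centrality = np.ones(C)    # position: min normalized distance to image center
    cx, cy = image_w / 2.0, image_h / 2.0
    for name, x, y, w, h in detections:
        if name not in CLASSES:
            continue
        k = CLASSES.index(name)
        counts[k] += 1
        max_area[k] = max(max_area[k], (w * h) / (image_w * image_h))
        dist = math.hypot(x + w / 2.0 - cx, y + h / 2.0 - cy) / math.hypot(cx, cy)
        centrality[k] = min(centrality[k], dist)
    return np.concatenate([counts, max_area, centrality])
```

The concatenated vector is fixed-length regardless of how many objects are detected, which is what lets it condition a decoder directly.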
3. Caption Generation Algorithms
The second stage translates detection-derived visual representations into fluent descriptions:
- LSTM-based Neural Captioners: Structured detection features are concatenated, projected via a learned affine transform, and used either as the initial hidden state or as the LSTM input at $t = 0$. Standard LSTM decoder settings include a 2-layer architecture with hidden size 256 and word embedding size 128 (Wang et al., 2018).
- Maximum-Entropy Language Models: Condition on the pool of visual attributes discovered by MIL detectors. At each step, features track which attributes have already been mentioned, controlling for coverage, with log-linear scoring and NCE-based training (Fang et al., 2014).
- Region-Specific Captioning and LLM Synthesis: For high-res settings, candidate co-occurring objects are predicted, detected, cropped, and passed to a VLM for region-level captioning. Final descriptions are synthesized by an LLM, which ensures integration of all verified content and removes hallucinated mentions (objects referenced in the initial caption but not detected) (Lee et al., 31 Oct 2025).
Inference and Decoding:
- Generation may use greedy decoding (beam size 1) to isolate representational effects, or beam search (k-best) with re-ranking for coverage and fluency (Wang et al., 2018, Fang et al., 2014).
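A sketch of how projected detection features seed the decoder, and of beam-size-1 (greedy) decoding. The weights are random stand-ins and `step` is a stub in place of a real LSTM step; only the hidden size (256) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM, HIDDEN = 9, 256  # hidden size 256 as in the text; FEAT_DIM is illustrative

# Learned affine projection (random stand-ins here) mapping detection
# features to the decoder's initial hidden state: h0 = tanh(W v + b).
W = rng.normal(scale=0.01, size=(HIDDEN, FEAT_DIM))
b = np.zeros(HIDDEN)

def init_hidden(det_features):
    return np.tanh(W @ det_features + b)

def greedy_decode(h0, step, bos=0, eos=1, max_len=20):
    # Beam size 1: take the argmax token at each step until EOS.
    # step(h, token) -> (h, logits) stands in for one LSTM step.
    h, tok, out = h0, bos, []
    for _ in range(max_len):
        h, logits = step(h, tok)
        tok = int(np.argmax(logits))
        if tok == eos:
            break
        out.append(tok)
    return out
```

Beam search replaces the single argmax with a k-best frontier; the conditioning interface is unchanged.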
4. Performance Analysis and Empirical Insights
Evaluation across datasets (MS COCO, curated high-resolution sets) employs standard captioning measures (BLEU, CIDEr, METEOR) and hallucination-specific benchmarks (POPE):
| Feature/Model | CIDEr (COCO) | BLEU-4 (COCO) | Hallucination F1 (POPE) | Source |
|---|---|---|---|---|
| ResNet-152 POOL5 + LSTM | 0.749 | — | — | (Wang et al., 2018) |
| Bag-of-objects GT counts | 0.807 | — | — | (Wang et al., 2018) |
| Frequency + Size + Position | 0.849 | — | — | (Wang et al., 2018) |
| MIL + ME LM + DMSM rerank | — | 29.1% | — | (Fang et al., 2014) |
| High-res VLM baseline | — | — | 0.1484 | (Lee et al., 31 Oct 2025) |
| High-res detection–captioning pipeline | — | — | 0.2153 | (Lee et al., 31 Oct 2025) |
Key findings include:
- Explicit detection cues (frequency, size, spatial position) are complementary, with joint modeling outperforming either CNN or binarized-only baselines (Wang et al., 2018).
- Covering true object counts, including typically under-mentioned or low-frequency classes, is critical: removing "person," "train," etc. causes a major metric drop (Wang et al., 2018).
- Additional region-specific captioning and LLM-driven fusion increases caption detail and correctness, while reducing hallucinations by explicit removal of undetected referents (Lee et al., 31 Oct 2025).
- MIL-based detectors for frequent words enable the system to learn concepts across part-of-speech boundaries (not only nouns), conditioning the language model for robust attribute coverage (Fang et al., 2014).
5. Interpretability, Modularity, and Hallucination Control
A core advantage of the two-stage pipeline is interpretability—object detections can be inspected to understand and debug generated content ("why did the model say 'three benches'?") (Wang et al., 2018). Further, each stage can be independently improved or analyzed: detectors can integrate attributes or relations, LLMs can explicitly enforce attribute coverage or penalize hallucinations, and region cropping ensures that small or occluded objects receive detailed description (Lee et al., 31 Oct 2025).
To minimize hallucinations:
- Detected objects verified by high-threshold fusion are the only valid referents in the final caption; LLM synthesis excludes dropped objects (Lee et al., 31 Oct 2025).
- Co-occurring objects predicted by LLMs are filtered by detection stage, enforcing factual grounding.
- Traditional pipelines relying solely on VLM outputs, without detection checkpoints, show higher hallucination rates; in the POPE evaluation, the detection–captioning pipeline improves F1 by roughly 45% relative to the VLM baseline (Lee et al., 31 Oct 2025).
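The verification-based filtering can be sketched as follows; the function name and threshold value are assumptions for illustration, not the exact mechanism of Lee et al.:

```python
def filter_referents(candidates, detector_scores, threshold=0.6):
    # candidates: object names proposed by the LLM / initial caption.
    # detector_scores: name -> best verified detector confidence.
    # Only detector-verified objects survive into the final caption;
    # everything else is dropped as a potential hallucination.
    verified = [c for c in candidates
                if detector_scores.get(c, 0.0) >= threshold]
    dropped = [c for c in candidates if c not in verified]
    return verified, dropped
```

The LLM synthesis stage then receives only `verified` objects, so undetected referents cannot re-enter the final description.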
6. Limitations and Evolving Directions
Notwithstanding consistent empirical improvements, several practical and methodological constraints exist:
- Pipeline latency arises from sequential VLM, LLM, detector, and synthesis calls (Lee et al., 31 Oct 2025).
- Recall can be limited by detector blindspots, especially for small or rare objects absent from pre-defined class vocabularies.
- Pipelines relying on external detectors without end-to-end learning cannot exploit joint parameter updates across stages (Lee et al., 31 Oct 2025).
- Scalability issues persist in high-resolution and open-vocabulary regimes due to computational bottlenecks in detection and cropping.
Future research is exploring tighter integration:
- End-to-end architectures that unify detection with captioning, potentially leveraging open-vocabulary models (Lee et al., 31 Oct 2025).
- Temporal consistency for video, adaptive fine-tuning for specialized domains, or direct attribute/relationship detection integration (Lee et al., 31 Oct 2025, Wang et al., 2018).
7. Historical Context and Research Impact
Historically, image captioning systems incorporated explicit region or object proposals as a precursor to text generation (e.g., MIL detectors and ME LMs in (Fang et al., 2014)). The field later shifted towards end-to-end CNN–RNN architectures, which subsumed explicit detection within learned mid-level features. Subsequent work revalidated the utility of detection-based intermediate representations, demonstrating that their interpretable cues—counts, sizes, positions—yield performance competitive with or superior to deep, end-to-end embeddings, while greatly enhancing interpretability (Wang et al., 2018).
Recently, the need for reliable high-fidelity captioning in high-resolution and open-domain scenes has further driven adoption of multi-stage detection–captioning frameworks, combining modern VLMs, LLMs, and detector ensembles for detailed, factual, and hallucination-minimized descriptions (Lee et al., 31 Oct 2025).
The two-stage detection–captioning paradigm thus remains both foundational in vision–language research and at the forefront of robust, explainable caption generation for complex, real-world imagery.