
YOLO-UniOW: Unified Open-World Detector

Updated 28 January 2026
  • YOLO-UniOW is a unified object detection framework that simultaneously recognizes known objects and flags unknown ones using vision-language integration.
  • It employs adaptive decision learning with low-rank CLIP calibration and dual detection heads to maintain high accuracy and real-time speed.
  • The framework supports dynamic vocabulary expansion without incremental re-training, demonstrating state-of-the-art performance on multiple benchmarks.

YOLO-UniOW is an efficient and versatile object detection framework that unifies open-vocabulary and open-world object detection within a single architecture. It addresses the limitations of traditional closed-set and open-vocabulary methods by simultaneously recognizing known categories, detecting out-of-distribution unknowns, and supporting dynamic vocabulary expansion without incremental re-training. This model leverages innovations in adaptive decision learning and wildcard learning strategies to achieve high accuracy and real-time speed across diverse detection scenarios (Liu et al., 2024).

1. Universal Open-World Detection: Problem Formulation

Traditional object detectors, such as Faster R-CNN or YOLOv5, are constrained by a fixed training vocabulary $C_{\text{train}}$ and treat any out-of-vocabulary (OOV) object as background. Open-vocabulary detection (OVD) allows recognition of novel classes by leveraging vision–language models (VLMs), such as CLIP, to align text and image features via cross-modal fusion. However, OVD approaches typically predefine class names at inference and rely on computationally intensive fusion operations. Open-world object detection (OWOD) extends the paradigm by requiring detection of unknown (unlabeled) categories and enabling incremental learning of new classes without catastrophic forgetting.

YOLO-UniOW operationalizes the Universal Open-World Detection (Uni-OWD) paradigm, defining the category space as $C = C_k \cup C_{\text{unk}}$, where $C_k$ are known categories (each assigned a text name $T_c \in V$) and $C_{\text{unk}}$ are unknown categories. The detector $D$ maps an image $I$ and vocabulary $V$ to:

  • Known category localization: $D(I, V) \rightarrow \{(b, c_k) \mid b \in B_{c_k}, c_k \in C_k\}$
  • Unknown object flagging: $D(I, T_w) \rightarrow \{(b, \text{unknown}) \mid b \in B_{\text{unk}}\}$
  • Dynamic vocabulary expansion upon discovery, without extensive incremental re-training

2. Model Architecture and CLIP Integration

YOLO-UniOW is based on the YOLOv10 detection backbone with a PAN-style neck and two parallel detection heads:

  • One-to-One (o2o) head: assigns one prediction per object, enabling NMS-free inference
  • One-to-Many (o2m) head: assigns multiple candidate predictions per object and applies NMS

Each predicted region is associated with a $D$-dimensional region embedding $f_i \in \mathbb{R}^D$. The architecture integrates a frozen CLIP text encoder that produces embeddings $t_j \in \mathbb{R}^D$ for each class name $T_j \in V$, including special wildcard tokens. Instead of early or mid-level fusion of image and text features, the classification score for each region is computed as the cosine similarity:

$$s_{ij} = \text{cosine}(f_i, t_j) = \frac{f_i \cdot t_j}{\|f_i\| \, \|t_j\|}$$

These similarity scores serve as logits over the (potentially dynamic) class vocabulary, including known classes and wildcards.
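This late-fusion classifier reduces to a matrix product between normalized embeddings. A minimal NumPy sketch, assuming the region embeddings $f_i$ and cached text embeddings $t_j$ are already available (function and variable names are illustrative):

```python
import numpy as np

def cosine_logits(region_embs, text_embs):
    """Cosine-similarity scores s_ij between region embeddings f_i (N, D)
    and cached class-text embeddings t_j (C, D)."""
    f = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return f @ t.T  # (N, C): one score per region-class pair

# Toy example: 2 regions scored against 3 class names (e.g. two known
# categories plus a wildcard token).
rng = np.random.default_rng(0)
scores = cosine_logits(rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
assert scores.shape == (2, 3)
```

Because the text embeddings are cached, growing the vocabulary only adds columns to `text_embs`; no image-side computation changes.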

Dataflow Overview

| Stage | Output | Description |
|---|---|---|
| YOLOv10 backbone | Feature maps | Multi-scale image features |
| PAN neck | Aggregated feature maps | Path aggregation for spatial context |
| Detection heads | Boxes, embeddings | o2o/o2m heads output boxes, objectness, and region embeddings $f_i$ |
| CLIP text encoder | Text embeddings $t_j$ | Cached for all classes, including wildcards |
| Cosine classifier | Scores $s_{ij}$ | Score for each region–class pairing |

3. Adaptive Decision Learning

To circumvent the inference overhead associated with cross-modal fusion in OVD, YOLO-UniOW introduces Adaptive Decision Learning (ADL). ADL “calibrates” the CLIP text encoder using Low-Rank Adaptation (LoRA):

$$W'x = W_0 x + \Delta W x, \quad \Delta W = AB, \quad \text{rank}(A) = r \ll d$$

Here, $W_0$ is the frozen CLIP projection layer, and only the low-rank factors $\{A, B\}$ are trained. This allows the CLIP text encoder to adapt its class prototypes toward those most suitable for detection while adding zero overhead at inference, since all $t_j$ are precomputed.
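The LoRA update above can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions, not the paper's actual layer sizes; following common LoRA practice (an assumption, not stated in the source), one factor starts at zero so the update is inactive before training:

```python
import numpy as np

def lora_forward(x, W0, A, B):
    """y = W0 x + Delta W x, with Delta W = A @ B of rank at most r.
    W0 stays frozen; only A and B would receive gradients."""
    return W0 @ x + A @ (B @ x)

d_out, d_in, r = 8, 6, 2                 # toy dimensions with r << d
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = np.zeros((d_out, r))                 # one factor zero at init,
B = rng.normal(size=(r, d_in))           # so Delta W = 0 before training
x = rng.normal(size=d_in)

y = lora_forward(x, W0, A, B)
assert np.allclose(y, W0 @ x)            # zero update at initialization
```

After training, $AB$ can be merged into $W_0$ once and the adapted text embeddings cached, which is why ADL costs nothing at inference.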

Supervision is applied via a temperature-scaled softmax contrastive loss:

$$L_{\text{ADL}} = -\sum_i \log \frac{\exp(\text{sim}(f_i, t_{y_i})/\tau)}{\sum_{j \in V} \exp(\text{sim}(f_i, t_j)/\tau)}$$

where $\tau$ is the softmax temperature.
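A direct NumPy transcription of this loss, with an illustrative temperature value (the paper's $\tau$ is not given here):

```python
import numpy as np

def adl_loss(sims, labels, tau=0.1):
    """Temperature-scaled softmax contrastive loss L_ADL.
    sims: (N, C) cosine similarities s_ij; labels: (N,) index y_i of each
    region's matching class text; tau: softmax temperature (illustrative)."""
    logits = sims / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].sum()

sims = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
# Loss is lower when each region is most similar to its own class text.
assert adl_loss(sims, np.array([0, 1])) < adl_loss(sims, np.array([1, 0]))
```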

A dual-head matching strategy incorporates both cosine similarity (semantic alignment) and box IoU (spatial alignment):

$$m(\alpha, \beta) = s^\alpha \cdot u^\beta$$

with $s = \text{sim}(f_i, t_{y_i})$ and $u = \text{IoU}(b_i, b_{\text{gt}})$, using the same exponents for both detection heads.
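The metric itself is a one-liner. The exponent values below are illustrative defaults borrowed from task-aligned matching schemes, not values stated in the source:

```python
def match_score(sim, iou, alpha=0.5, beta=6.0):
    """Matching quality m = s^alpha * u^beta, combining semantic alignment
    (cosine similarity s) and spatial alignment (IoU u). The exponents
    alpha and beta here are illustrative, not the paper's values."""
    return (sim ** alpha) * (iou ** beta)

# A well-localized prediction dominates a poorly localized one with the
# same semantic score, since beta > alpha weights spatial alignment heavily.
assert match_score(0.9, 0.9) > match_score(0.9, 0.3)
assert match_score(1.0, 1.0) == 1.0
```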

4. Wildcard Learning for Unknown Discovery

YOLO-UniOW embeds two wildcard tokens in the CLIP vocabulary:

  • $T_{\text{obj}}$ ("object"): trains the model to recognize generic objects, not tied to specific categories.
  • $T_{\text{unk}}$ ("unknown"): learns to flag instances not aligned with any known class.

Wildcard learning uses a two-stage procedure:

  1. Object pre-tuning: All training instances are labeled "object," and only the LoRA modules of the text encoder are trained.
  2. Unknown fine-tuning: With all embeddings frozen except $T_{\text{unk}}$, pseudo-unknown boxes are generated using $T_{\text{obj}}$ and the known-class predictions. Boxes with $\text{IoU} < 0.5$ against every ground-truth box and objectness score $s_{\text{obj}} > 0.01$ are pseudo-labeled as unknown. $T_{\text{unk}}$ is then fine-tuned with a binary cross-entropy loss using $s_{\text{obj}}$ as a soft target.

After prediction, a de-duplication filter removes unknown boxes with $\text{IoU} > 0.99$ against any high-scoring known detection.
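The pseudo-labeling rule from step 2 can be sketched as follows; a minimal NumPy version using the thresholds quoted above (IoU $< 0.5$, $s_{\text{obj}} > 0.01$), with helper names of my own choosing:

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box (x1, y1, x2, y2) and an (M, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_pseudo_unknowns(boxes, s_obj, gt_boxes):
    """Pseudo-label as unknown: IoU < 0.5 to every GT and s_obj > 0.01."""
    keep = []
    for box, s in zip(boxes, s_obj):
        if s > 0.01 and iou_one_to_many(box, gt_boxes).max() < 0.5:
            keep.append(box)
    return keep

gt = np.array([[0.0, 0.0, 10.0, 10.0]])
cands = [np.array([1.0, 1.0, 9.0, 9.0]),      # overlaps a known GT: rejected
         np.array([20.0, 20.0, 30.0, 30.0]),  # far from all GT: kept
         np.array([40.0, 40.0, 50.0, 50.0])]  # far, but s_obj too low: rejected
unknowns = select_pseudo_unknowns(cands, [0.6, 0.5, 0.005], gt)
assert len(unknowns) == 1
```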

5. Training and Inference Protocols

  • Pre-training: Conducted on Objects365 and GoldG (image–text pairs) using AdamW (learning rate $5\times10^{-4}$, weight decay 0.025), batch size 128 across 8 GPUs, LoRA rank 16, and standard YOLO augmentations.
  • Open-world fine-tuning: Three epochs for $T_{\text{obj}}$ at learning rate $1\times10^{-4}$; three epochs for $T_{\text{unk}}$ and the known-class text embeddings at learning rate $1\times10^{-3}$ with zero weight decay. All other weights are frozen; batch size is 16 per GPU.
  • Inference: All $t_j$ embeddings are cached; detections are kept above a score of $0.05$ and marked confident above $0.2$. Unknown boxes are deduplicated as described above. The one-to-one detection head requires no NMS, and inference speed reaches up to 119.3 FPS.
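The embedding cache underpinning this inference protocol also explains the dynamic vocabulary expansion claimed in the formulation. A toy sketch (class and method names are my own; this is not the released API):

```python
import numpy as np

class TextEmbeddingCache:
    """Toy cache showing why vocabulary expansion needs no re-training:
    adding a class appends one precomputed text embedding, leaving the
    detector weights and all other cached embeddings untouched."""
    def __init__(self, dim):
        self.names, self.embs = [], np.zeros((0, dim))

    def add(self, name, emb):
        self.names.append(name)
        self.embs = np.vstack([self.embs, emb / np.linalg.norm(emb)])

    def logits(self, region_embs):
        f = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
        return f @ self.embs.T          # cosine scores s_ij

rng = np.random.default_rng(2)
cache = TextEmbeddingCache(dim=4)
cache.add("person", rng.normal(size=4))
regions = rng.normal(size=(2, 4))
assert cache.logits(regions).shape == (2, 1)
cache.add("tricycle", rng.normal(size=4))    # dynamic expansion at test time
assert cache.logits(regions).shape == (2, 2)
```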

6. Experimental Results and Ablation Studies

On the LVIS minival benchmark (1,203 classes), YOLO-UniOW-L achieves 34.6 AP (including 30.0 AP on rare categories), outperforming YOLO-Worldv2-L by +1.6 AP and +7.7 AP$_r$, at an inference speed of 69.6 FPS. It also sets state-of-the-art results on the M-OWODB, S-OWODB, and nuScenes benchmarks, including 82.6 U-Recall on Task 1 of M-OWODB (+16.7 over OVOW*) and substantial improvements on the remaining tasks.

Ablations indicate:

  • Omitting VL-PAN fusion does not reduce performance; the Adaptive Decision Learning strategy yields a roughly +4 AP improvement.
  • LoRA calibration of the text encoder produces the largest AP gains, especially for rare categories.
  • Zero-shot "object" detection outperforms oracle baselines on unknown discovery; full wildcard learning further boosts U-Recall to approximately 80%.

7. Limitations and Future Directions

YOLO-UniOW's unknown recall in highly cluttered environments (nu-OWODB) lies in the 40–45% range, indicating sensitivity to small or camouflaged objects. Pseudo-labeling with $T_{\text{obj}}$ may introduce noise through over-generalization. Future directions include better pseudo-label thresholding using spatial reasoning, joint end-to-end fine-tuning of the image encoder, continual online discovery with vocabulary management, and extension to instance and panoptic segmentation (Liu et al., 2024).


YOLO-UniOW establishes a unified, computationally efficient open-world detection methodology that integrates adaptive text-visual alignment and robust unknown detection, setting new accuracy and versatility benchmarks for the field.

References

Liu et al. (2024). YOLO-UniOW: Efficient Universal Open-World Object Detection.
