3AM: 3egment Anything with Geometric Consistency in Videos

Published 13 Jan 2026 in cs.CV | (2601.08831v3)

Abstract: Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/

Summary

  • The paper introduces a novel feature merger that bridges 2D appearance cues with 3D geometric features to maintain object identity under varying viewpoints.
  • Using a field-of-view aware sampling strategy during training, the method learns effective 3D correspondence with only RGB inputs, ensuring spatial consistency.
  • Experimental validation on ScanNet++ and Replica shows significant IoU improvements, with metrics like 90.6% IoU, demonstrating 3AM's superior performance over state-of-the-art VOS methods.

3AM: Geometry-Consistent Segmentation in Videos

Introduction

The paper "3AM: 3egment Anything with Geometric Consistency in Videos" presents a novel approach to video object segmentation (VOS) that addresses the limitations of traditional methods which struggle under significant viewpoint changes. The authors propose integrating 3D-aware features with 2D appearance-based methods to achieve spatially consistent tracking. This work targets the fundamental challenge of maintaining object identity in dynamic environments, which is crucial for applications like autonomous driving, robotics, and augmented reality.

Methodology

3AM's methodology bridges the gap between 2D tracking efficiency and 3D geometric consistency. It introduces a Feature Merger that fuses 3D-aware features from the MUSt3R model with appearance cues from the SAM2 model via cross-attention and convolution. This merger enables robust object recognition under drastic viewpoint variations while requiring only RGB input at inference (Figure 1).

Figure 1: Overview of the 3AM pipeline showing the integration of geometric and appearance features.
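To make the fusion concrete, here is a minimal NumPy sketch of a cross-attention merger in the spirit described above. It is an illustrative toy, not the paper's implementation: the function name `feature_merger`, the single-head attention, the random stand-in weights, and the residual update are all assumptions; the actual module also involves convolutions and multi-level MUSt3R features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_merger(appearance, geometry, rng=np.random.default_rng(0)):
    """Fuse appearance tokens (queries) with geometry tokens (keys/values)
    via single-head cross-attention plus a residual update.
    appearance: (N, d) SAM2-style tokens; geometry: (M, d) MUSt3R-style tokens."""
    n, d = appearance.shape
    # Randomly initialised projections stand in for learned weights.
    wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = appearance @ wq, geometry @ wk, geometry @ wv
    attn = softmax(q @ k.T / np.sqrt(d))   # (N, M) soft correspondence weights
    fused = attn @ v @ wo                  # geometry-informed update per token
    return appearance + fused              # residual merge keeps appearance cues

appearance = np.random.default_rng(1).standard_normal((16, 32))
geometry = np.random.default_rng(2).standard_normal((64, 32))
merged = feature_merger(appearance, geometry)
print(merged.shape)  # (16, 32)
```

The residual form means the merger can only add geometric context on top of the appearance features, which matches the paper's framing of a lightweight enhancement to SAM2 rather than a replacement of its backbone.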

A critical innovation in 3AM is its field-of-view aware sampling strategy during training. This approach ensures that the frames used for training observe spatially consistent object regions, so the model learns effective 3D correspondence without explicit 3D supervision. The integration of memory attention and mask decoding further refines the spatial alignment of objects across frames.
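The sampling idea can be sketched as follows. This is a hypothetical approximation: the function names, the use of Jaccard overlap over per-frame sets of visible scene-region IDs, and the greedy selection are my assumptions; the paper's actual criterion for field-of-view consistency is not detailed in this summary.

```python
def fov_overlap(a, b):
    """Jaccard overlap between the scene-region IDs two frames observe."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def sample_clip(frames_regions, anchor, k, min_overlap=0.3):
    """Greedily grow a training clip from `anchor`, keeping only frames that
    share at least `min_overlap` field-of-view with every frame already chosen."""
    clip = [anchor]
    for f in sorted(frames_regions):
        if f == anchor or len(clip) == k:
            continue
        if all(fov_overlap(frames_regions[f], frames_regions[c]) >= min_overlap
               for c in clip):
            clip.append(f)
    return clip

frames = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {8, 9}, 3: {2, 3, 5}}
print(sample_clip(frames, anchor=0, k=3))  # frame 2 is rejected: no overlap
```

The point of such a filter is that frames looking at disjoint parts of the scene give the correspondence loss contradictory signals, so excluding them stabilises 3D correspondence learning.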

Experimental Validation

The efficacy of 3AM is demonstrated through extensive experiments on challenging datasets, such as ScanNet++ and Replica, which feature wide-baseline motion and diverse camera viewpoints. The results show substantial performance gains, with 3AM achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, marking improvements of +15.9 and +30.4 points over existing state-of-the-art VOS methods (Figure 2).

Figure 2: Visual comparison of VOS methods, highlighting the superior object tracking capability of 3AM.
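For readers unfamiliar with the two reported metrics, here is a small sketch of how they might be computed. The per-frame IoU is standard; the definition of "Positive IoU" used below (IoU averaged only over frames where the object is actually visible) is an assumption on my part, as this summary does not spell it out.

```python
import numpy as np

def frame_iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union  # empty-vs-empty counts as perfect

def video_scores(preds, gts):
    """Mean IoU over all frames, and (as assumed here) 'Positive IoU'
    averaged only over frames where the ground-truth object is visible."""
    ious = [frame_iou(p, g) for p, g in zip(preds, gts)]
    pos = [iou for iou, g in zip(ious, gts) if g.any()]
    return float(np.mean(ious)), float(np.mean(pos)) if pos else 0.0

gt = [np.array([[1, 1], [0, 0]], bool), np.zeros((2, 2), bool)]
pred = [np.array([[1, 0], [0, 0]], bool), np.zeros((2, 2), bool)]
iou, pos_iou = video_scores(pred, gt)
print(round(iou, 3), round(pos_iou, 3))  # 0.75 0.5
```

Under this reading, Positive IoU isolates segmentation quality when the object is present, which is why it is the harder metric on reappearance-heavy subsets.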

The paper includes comprehensive comparisons against existing methods such as SAM2, SAM2Long, and DAM4SAM. 3AM consistently outperforms these models by significant margins across metrics, attesting to its robustness in maintaining object identity despite large viewpoint shifts.

Theoretical and Practical Implications

3AM's approach offers practical implications for deploying VOS systems in real-world scenarios that demand agility and accuracy in dynamic environments. The model's reliance solely on RGB input for inference broadens its applicability, reducing the computational overhead typically associated with 3D-based systems.

Theoretically, 3AM paves the way for more unified frameworks that simultaneously leverage geometric and appearance information, challenging the existing paradigms in VOS. Future developments could explore the integration of additional cues such as semantics and motion dynamics to further enhance the robustness of segmentation models.

Conclusion

3AM represents a significant advancement in video object segmentation by incorporating geometric consistency into the tracking process without incurring the complexities of traditional 3D methods. This achievement could redefine segmentation tasks across various AI applications, making 3AM a valuable contribution to the field.

Explain it Like I'm 14

Overview

This paper introduces 3AM, a smarter way to track and cut out objects in videos, even when the camera moves a lot or the object looks very different from one angle to another. It builds on SAM2 (a popular tool that lets you segment “anything” with clicks, boxes, or masks) by adding “3D awareness,” so the model doesn’t just remember how things look—it also learns where they are in space.

What questions is the paper trying to answer?

Here are the simple goals the researchers had:

  • How can we keep tracking the same object across a video when the camera view changes a lot?
  • Can we make 2D tools like SAM2 more reliable by teaching them about 3D geometry, but without needing special 3D inputs (like camera positions or depth) at test time?
  • Is it possible to get consistent results across very different viewpoints quickly enough for real-world use?

How did they do it?

Think of tracking an object like keeping your eye on the same toy while you walk around a room. From each new angle, the toy looks different. A tool that only uses how the toy looks in 2D can get confused. But if the tool also knows the toy’s place in 3D space, it can tell it’s the same toy, even from a new angle.

Here’s the approach in everyday terms:

  • SAM2 is good at recognizing how things look and remembering them across frames (like a short-term memory for a video).
  • MUSt3R is a model that learns 3D structure from multiple images—it’s good at understanding how different views relate in space.
  • 3AM blends both worlds. It takes SAM2’s “appearance features” (how the object looks) and MUSt3R’s “geometry-aware features” (where the object sits in space) and merges them using a small, efficient module called the Feature Merger.
  • During training, they use a smart frame-picking strategy called “field-of-view aware sampling.” In simple terms: when teaching the model, they pick sets of frames that actually look at overlapping physical parts of the object, not totally different sides. This helps the model learn consistent 3D correspondence rather than getting mixed signals.
  • At test time, 3AM only needs regular RGB video frames and a user prompt (a point, box, or mask). No camera poses, no depth maps, no heavy 3D preprocessing.
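The steps above can be sketched as a per-frame tracking loop. Everything here is a stand-in stub (the encoders, merger, and decoder are simple lambdas), so this shows only the data flow described in the bullets, not the real learned networks.

```python
# Toy sketch of the inference loop: encode each frame's appearance and
# geometry features, merge them, then decode a mask conditioned on memory.
def track(frames, prompt_mask, encode_appearance, encode_geometry, merge, decode):
    memory, masks = [], []
    for t, frame in enumerate(frames):
        feat = merge(encode_appearance(frame), encode_geometry(frame))
        # The first frame's mask comes from the user prompt; later frames
        # are decoded against the memory bank of past features and masks.
        mask = prompt_mask if t == 0 else decode(feat, memory)
        memory.append((feat, mask))
        masks.append(mask)
    return masks

# Stubs: features are the raw frame; decoding just copies the last mask.
frames = ["f0", "f1", "f2"]
masks = track(frames, "m0",
              encode_appearance=lambda f: f,
              encode_geometry=lambda f: f,
              merge=lambda a, g: (a, g),
              decode=lambda feat, mem: mem[-1][1])
print(masks)  # ['m0', 'm0', 'm0']
```

Note that nothing in the loop consumes camera poses or depth, which is the key practical property: geometry awareness lives inside the merged features, not in extra inputs.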

What did they find?

On challenging datasets with big camera motion and lots of viewpoint changes (like ScanNet++ and Replica), 3AM did much better than existing video object segmentation methods:

  • On the hard “Selected Subset” of ScanNet++ (designed to test reappearance and wide viewpoint changes), 3AM reached about 90.6% IoU and 71.7% Positive IoU, beating strong SAM2-based baselines by +15.9 and +30.4 points.
  • On the Replica dataset, 3AM also achieved the best scores, especially when the object was visible and correctly found.
  • 3AM even showed strong 3D results without using 3D ground truth at test time: by keeping 2D masks consistent across views, you can project them into 3D and get reliable 3D instances. This suggests you don’t always need heavy 3D merging if your 2D tracking is geometry-aware.
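The last bullet's "project masks into 3D" step can be illustrated with a standard pinhole unprojection. This is an evaluation-time sketch under assumed conventions (depth and camera poses come from the dataset, not from 3AM's inference; `K` is a standard intrinsics matrix): masked pixels are lifted to world-space points, which can then be labelled per object.

```python
import numpy as np

def unproject_mask(mask, depth, K, cam_to_world):
    """Lift masked pixels to world-space 3D points using per-pixel depth,
    pinhole intrinsics K, and a 4x4 camera-to-world pose."""
    v, u = np.nonzero(mask)                 # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # (4, N) homogeneous
    return (cam_to_world @ pts_cam)[:3].T                   # (N, 3) world points

K = np.array([[100.0, 0, 2], [0, 100.0, 2], [0, 0, 1]])
mask = np.zeros((4, 4), bool); mask[2, 2] = True
depth = np.full((4, 4), 2.0)
pts = unproject_mask(mask, depth, K, np.eye(4))
print(pts)  # [[0. 0. 2.]]
```

If the 2D masks stay consistent across views, points unprojected from different frames of the same object land in the same 3D region, which is why geometry-aware 2D tracking can substitute for heavy 3D instance merging.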

Why this is important:

  • It means we can get stable, consistent object masks across a video even when the camera moves a lot or the scene is cluttered.
  • It keeps the best part of SAM2—promptable, interactive segmentation—while making it robust to difficult camera motion.

What’s the impact?

If you can track objects consistently across big viewpoint changes without needing special 3D inputs, lots of applications get better:

  • Video editing: cleaner cut-outs even in tricky shots.
  • Robotics and AR: more reliable understanding of objects in real spaces, as robots or phones move around.
  • Autonomous driving and embodied AI: better scene understanding across changing angles.
  • Photo collections: consistent object tracking across casual multi-view images, not just videos.

Big picture: 3AM shows that adding lightweight 3D awareness to 2D segmentation makes tracking much more robust, while staying interactive and efficient. It’s a step toward tools that understand both what things look like and where they are in space—without needing complicated 3D data at test time.
