VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency
Abstract: Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
Explain it Like I'm 14
Overview
This paper is about teaching a computer to figure out where it is and build a map using only a single camera video, even when the camera’s settings are unknown. This task is called “monocular SLAM” (Simultaneous Localization and Mapping). The authors propose a new system, called VGGT-Motion, that stays accurate over very long trips (think kilometers of driving) and runs fast, without needing special camera calibration.
Key Objectives
Here are the main questions the paper tries to answer:
- How can we stop the computer’s map from slowly “drifting” and becoming less accurate during long videos?
- How can we avoid wasting time on frames that don’t add useful information (like when the car isn’t moving)?
- How can we make different chunks of the video line up correctly, even when they were processed separately?
- Can we do all this without knowing the camera’s exact settings (calibration), and still be fast and reliable?
How They Did It
To make this easier to understand, imagine you’re filming a road trip with your phone. The system needs to use the video to figure out your path and build a 3D map of the surroundings. The authors solve three big problems with three ideas:
Idea 1: Motion-aware submaps (smart chunking of the video)
- Everyday analogy: Instead of cutting a movie into equal-length scenes, cut it based on what’s happening—keep turns together as one scene, skip boring still scenes, and slice straight segments into reasonable pieces.
- What they do: They look at “optical flow,” which is the movement of pixels between frames, to decide if the camera is:
- Static (not moving),
- Turning (rotating a lot),
- Linear (moving straight).
- Then they:
- Prune static frames: If the car isn’t moving (like at a stoplight), they keep only the beginning and end frames of that stop. This avoids “hallucinated” motion caused by tiny camera noise.
- Pick keyframes by parallax: Parallax is the apparent shift of objects when the camera moves. They only keep frames that add enough viewpoint change to be informative.
- Encapsulate turns: They keep entire turning segments in one chunk so the system doesn’t lose track of scale and orientation during rotations.
- Use small overlaps: They add a few overlapping frames between chunks for easier alignment later, and reuse past frames if they detect loops (returning to the same place).
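The chunking logic above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the thresholds (`STATIC_THRESH`, `TURN_THRESH`) and the turning heuristic are hypothetical stand-ins, since the paper's exact values and flow model are not reproduced here.

```python
import numpy as np

# Hypothetical thresholds -- the paper uses fixed thresholds whose exact
# values are not given here, so these numbers are illustrative only.
STATIC_THRESH = 0.5    # mean flow magnitude (pixels) below which we call "static"
TURN_THRESH = 4.0      # dominant horizontal flow indicating rotation

def classify_motion(flow: np.ndarray) -> str:
    """Classify a frame pair as 'static', 'turning', or 'linear'
    from a dense optical-flow field of shape (H, W, 2)."""
    mag = np.linalg.norm(flow, axis=-1)
    mean_mag = mag.mean()
    if mean_mag < STATIC_THRESH:
        return "static"
    # During a turn, horizontal flow is large and consistent in sign;
    # this simple proxy checks for a dominant horizontal component.
    u_mean = abs(flow[..., 0].mean())
    if u_mean > TURN_THRESH and u_mean > 0.8 * mean_mag:
        return "turning"
    return "linear"

def prune_static_runs(states: list) -> list:
    """Keep only the first and last frame of each static run
    (e.g., a stoplight), and all frames of non-static runs."""
    keep, i = [], 0
    while i < len(states):
        if states[i] != "static":
            keep.append(i)
            i += 1
            continue
        j = i
        while j < len(states) and states[j] == "static":
            j += 1
        keep.extend([i, j - 1] if j - 1 > i else [i])
        i = j
    return keep
```

On a parked segment, `prune_static_runs` collapses the whole run to its two endpoint frames, which is what suppresses the "hallucinated" zero-motion drift described above.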
Idea 2: Anchor-driven direct Sim(3) registration (lining chunks up using shared “pins”)
- Everyday analogy: Imagine two drawings of the same place from slightly different times. If both drawings share the exact same pin points (anchors), you can line them up precisely without searching.
- What is Sim(3)? It’s a 3D transform that includes rotation, translation (shifting), and scaling. Scaling matters because with a single camera, exact size can be ambiguous.
- What they do:
- Choose “context-balanced anchors” in the overlap area—frames near the center of the overlap that are not biased toward near or far objects.
- Use VGGT (a strong 3D vision model) to get dense 3D points for each pixel in these anchor frames.
- Align chunks directly, pixel by pixel, using those shared anchors. This skips expensive feature matching and works in linear time, which is much faster.
- Filter out unreliable pixels (like sky or low-confidence areas) and use a robust loss to handle noise. If the alignment is good enough, they add it as a constraint in the global system.
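The heart of this step is aligning two point clouds that share pixel indices in the anchor frames. A minimal sketch using the classical closed-form weighted Umeyama solution for Sim(3); the paper wraps its estimate in a robust (Huber) loss, which one could approximate by re-running this solver with iteratively reweighted weights. Function names and the uniform-weight usage are illustrative, not the authors' code.

```python
import numpy as np

def weighted_sim3(P, Q, w):
    """Closed-form similarity alignment (Umeyama-style) P -> Q:
    find s, R, t minimizing sum_i w_i * ||s * R @ P_i + t - Q_i||^2.
    P, Q: (N, 3) 3D points at identical pixel indices; w: (N,) weights
    (e.g., VGGT confidence values)."""
    w = w / w.sum()
    mu_p = (w[:, None] * P).sum(0)          # weighted centroids
    mu_q = (w[:, None] * Q).sum(0)
    Pc, Qc = P - mu_p, Q - mu_q
    Sigma = (w[:, None] * Qc).T @ Pc        # weighted cross-covariance
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                        # guard against reflections
    R = U @ S @ Vt
    var_p = (w * (Pc ** 2).sum(1)).sum()    # weighted variance of source
    s = np.trace(np.diag(D) @ S) / var_p    # optimal scale
    t = mu_q - s * R @ mu_p
    return s, R, t
```

Because the correspondences are given for free by shared pixel indices, no search or feature matching is needed, which is why the alignment cost stays linear in the number of pixels.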
Idea 3: Lightweight pose graph optimization (tightening the whole map)
- Everyday analogy: Picture each chunk as a node in a network, with rubber bands (constraints) connecting them. You adjust the nodes until the bands are comfy—not too stretched, not too slack.
- What they do:
- Treat each chunk (“submap”) as a node.
- Add edges (constraints) using the Sim(3) transforms found above.
- Run a small, efficient optimization to make the whole trajectory consistent.
- Compose final camera poses by combining optimized chunk poses with local estimates, giving a clean, globally consistent path.
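The final composition step is plain Sim(3) algebra: the optimized chunk transform is composed with each locally estimated pose. A minimal sketch, representing each Sim(3) element as a (scale, rotation, translation) triple; the names are illustrative, not the authors' API.

```python
import numpy as np

def sim3_apply(T, x):
    """Apply a Sim(3) transform T = (s, R, t) to a 3D point x."""
    s, R, t = T
    return s * R @ x + t

def sim3_compose(A, B):
    """Compose two Sim(3) transforms so that (A o B)(x) = A(B(x)):
    s = s_A * s_B,  R = R_A @ R_B,  t = s_A * R_A @ t_B + t_A."""
    sA, RA, tA = A
    sB, RB, tB = B
    return (sA * sB, RA @ RB, sA * RA @ tB + tA)

def global_pose(S_submap, T_local):
    """Globally consistent camera pose for a frame: the optimized
    submap transform composed with the locally estimated pose."""
    return sim3_compose(S_submap, T_local)
```

Since only one transform per submap is optimized (not one per frame), the pose graph stays small, which is what keeps the optimization lightweight and linear in the number of chunks.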
Main Findings
The authors tested VGGT-Motion on several outdoor driving datasets (KITTI, Waymo, 4Seasons, Complex Urban, A2D2). They did not retrain the model for these datasets (“zero-shot”), which shows strong generalization.
Key results:
- Much lower trajectory error on long videos: 85–95% reduction compared to a leading method (VGGT-Long).
- Big speedups: 18–36× faster on very long sequences, thanks to pruning useless frames and using direct pixel alignment.
- Strong robustness: Works well with lighting changes, low-texture scenes (like plain walls), and high-speed motion.
- Stable on standard benchmarks (KITTI, Waymo) and outstanding on large, challenging datasets with many thousands of frames.
Why it’s important:
- Long-range consistency matters for navigation, mapping cities, and analyzing long video streams.
- Being calibration-free means it can work “in the wild” on everyday videos without knowing exact camera settings.
- Linear-time alignment and submap-level optimization mean it’s practical and scalable.
Implications and Potential Impact
- Real-world readiness: This approach makes single-camera mapping more reliable for long drives, making it useful for autonomous driving, delivery robots, and AR navigation.
- Works with everyday video: You can process uncalibrated, in-the-wild footage to get consistent maps and trajectories.
- Efficient and scalable: The system avoids heavy computation, making it suitable for long recordings and possibly for on-device processing.
- Foundation for future work: Motion-aware chunking and anchor-driven alignment could be reused in other 3D tasks that need speed, robustness, and long-term stability.
In short, VGGT-Motion shows how to be smart about which frames to keep, how to line up chunks using shared anchors, and how to optimize the whole path lightly—all to keep a single-camera SLAM system accurate and fast over very long distances without needing camera calibration.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what the paper leaves missing, uncertain, or unexplored, with actionable pointers for future work:
- Optical flow dependency: The method’s motion-state estimation hinges on dense optical flow, but the choice of flow model, its robustness to illumination changes, motion blur, and dynamic objects, and its computational cost are not specified or ablated. Evaluate alternative flow estimators (e.g., RAFT, PWC-Net) and quantify sensitivity to flow errors on long sequences.
- Motion-state thresholding: Fixed thresholds for static/turning detection and parallax are used across datasets without sensitivity analysis or auto-tuning. Investigate adaptive or learned thresholding and report how performance degrades/improves under threshold variations.
- Parallax proxy definition: The paper references a “parallax proxy” used for keyframe selection but does not clearly define how it is computed (metric, units, normalization, robustness to depth errors). Provide a formal definition and study its reliability in scenes with low texture or depth discontinuities.
- Turning encapsulation memory bounds: Turning segments are encapsulated into single submaps with no stated maximum length, which may cause memory spikes on prolonged curves. Specify hard bounds for turn submaps and analyze trade-offs between stability and memory/latency.
- Overlap anchor window size: The “compact temporal window” around the midpoint used as the overlap anchor is not quantitatively defined. Introduce a tunable parameter for window size and analyze how anchor window length affects alignment accuracy and drift.
- Dynamic object handling: Reliability filtering excludes sky pixels but not other dynamic elements (vehicles, pedestrians, reflections). Add semantic filtering for dynamic classes and evaluate its impact on Sim(3) estimation and drift in highly dynamic urban scenes.
- Loop closure retrieval method: Loop anchors are obtained via “SALAD features,” but SALAD (as cited) is a 3D shape diffusion method, not a place-recognition model. Clarify the retrieval mechanism (feature type, indexing, distance metric), and report precision/recall curves for loop detection to quantify false positives/negatives.
- Anchor corruption/occlusion: The anchor-driven pixel-indexed correspondence assumes the anchor frame is reliable. Evaluate failure cases where anchors are occluded or corrupted and propose fallback alignment (e.g., patch-based matching, multi-anchor consensus).
- Confidence map calibration: VGGT confidence maps are used as weights, but their calibration to actual geometric error is not validated. Calibrate confidence-to-noise models (e.g., reliability curves) and explore heteroscedastic weighting in Sim(3) optimization.
- Robust estimation details: Sim(3) estimation uses Huber loss and an inlier-ratio gate, but inlier definition and computation are unspecified (thresholds, residual norms). Provide the exact inlier criteria and compare against RANSAC- or M-estimator variants.
- Uncertainty-aware pose graph: Edge weights are set to the inlier ratio, which is a crude proxy for constraint quality. Derive constraint covariance from residuals to form principled information matrices and evaluate improvements in global optimization.
- Absolute scale recovery: The system remains monocular and evaluates ATE after global Sim(3) alignment, leaving metric scale unresolved. Explore integrating ground-plane priors, object-size priors, speedometer cues, GNSS/IMU, or learned scale priors to recover metric scale without external calibration.
- Rolling-shutter and lens distortion: The pipeline assumes a global-shutter, pinhole model, while consumer cameras often have rolling shutter and non-linear distortion. Quantify the impact of these effects and incorporate correction models or learned compensation.
- Intrinsics generalization: Although “calibration-free,” the approach does not test extreme FOVs (fisheye, action cameras) or strong intrinsics variability. Benchmark on diverse camera types and report failure modes when intrinsics are far from the pretraining distribution.
- Dataset diversity: Evaluation is restricted to outdoor driving datasets; indoor, handheld, aerial, and night/rain/snow conditions are untested. Extend benchmarks to these domains and analyze domain-specific failure modes.
- Map fusion and dense reconstruction quality: The paper focuses on trajectory ATE/drift and does not evaluate dense map consistency across submaps (e.g., point cloud alignment quality, surface fidelity). Introduce metrics for dense reconstruction (e.g., point cloud overlap error, completeness) and assess cross-submap fusion.
- Minimal overlap robustness: Alignment relies on shared anchor frames within a small fixed overlap. Quantify the minimal overlap needed for reliable alignment, especially under high speeds (A2D2), and develop strategies for low-overlap regimes (e.g., synthetic re-rendering, learned correspondence across non-identical frames).
- Online/real-time operation: Experiments are offline with desktop GPU; no latency budgets, streaming constraints, or on-the-fly loop closures are reported. Measure end-to-end latency, memory, and throughput under real-time constraints and evaluate on embedded platforms (e.g., Jetson).
- Module-wise runtime/memory profiling: Claims of O(N) alignment and linear complexity are not supported by module-wise breakdowns. Provide detailed profiling (inference, flow, partitioning, alignment, optimization), peak memory usage per submap/sequence, and scaling with image resolution.
- Comparison coverage: The paper does not compare against recent streaming foundation reconstructions (e.g., Stream3r, TTT3R) on long sequences. Include these baselines to contextualize gains and identify complementary strengths/weaknesses.
- Test-time adaptation: No test-time training or adaptation is explored for handling distribution shifts (e.g., seasonal changes). Evaluate light-weight TTA schemes and their effect on drift and robustness.
- Place-recognition under perceptual aliasing: Urban canyons often exhibit aliasing; false loop closures can be catastrophic. Quantify aliasing rates and introduce conservative gating (e.g., multi-modal verification, geometric prechecks) to prevent erroneous constraints.
- Gaussian smoothing parameters: Temporal smoothing of motion scores is used without specifying kernel sizes or variance. Provide these parameters and analyze the effect of smoothing on state misclassification (e.g., missed short turns, misdetected brief stops).
- Reproducibility details: Key implementation components (optical flow model/config, parallax computation, sky segmentation model, anchor window size) lack precise specification. Release these details and code to ensure replicability of the reported gains.
Practical Applications
Immediate Applications
Below is a concise set of deployable use cases that leverage the paper’s motion-aware partitioning, anchor-driven Sim(3) registration, and lightweight submap-level optimization to deliver calibration-free, long-range monocular SLAM.
- Fleet dashcam mapping and analytics [Automotive, Logistics, Smart City]
- Use daily dashcam videos to generate globally consistent trajectories and dense point maps for route auditing, pothole detection, lane marking inventory, and construction monitoring.
- Tools/Workflows: “Calibration-free SLAM” cloud API; batch processing pipeline; GIS export (GeoJSON/CityGML) with loop-closure anchors; dashboard for drift and coverage QA.
- Assumptions/Dependencies: Reliable optical flow and sky segmentation; access to VGGT weights; sufficient server GPU; optional GPS for absolute scales; privacy compliance for public-road video.
- Drop-in SLAM for low-cost mobile robots [Robotics, Warehousing]
- Replace calibrated SLAM with VGGT-Motion to maintain consistency over long warehouse runs (corridors, repeated loops) using a single camera.
- Tools/Workflows: ROS2 node wrapping motion-aware submap construction, anchor-based alignment, pose-graph backend; model monitoring for “zero-motion drift” suppression.
- Assumptions/Dependencies: Stable camera mount; lighting robustness; indoor domain generalization (optical flow quality in low-texture surfaces).
- AR occlusion and persistence from smartphone video [Software, AR/VR]
- Offload monocular video to server to build large-scale, consistent maps that support occlusions and persistent anchors across campus or retail floors.
- Tools/Workflows: Mobile SDK to upload sequences; server-side anchor-driven Sim(3) registration; “cloud anchors” indexed by overlap anchors; AR engine integration (Unity/Unreal).
- Assumptions/Dependencies: Server inference (mobile GPU may be insufficient); consistent parallax (avoid excessive motion blur); optional GNSS for world scale.
- Construction/site progress capture [AEC/BIM]
- Walkthrough video yields drift-free site maps; compare against BIM to flag deviations and quantify progress along long corridors or multi-floor routes.
- Tools/Workflows: BIM plug-in (Navisworks/Revit) to import trajectory and pointmaps; weekly batch alignment with site loops; ticketing integration.
- Assumptions/Dependencies: Adequate texture/structure; method outputs are in Sim(3); absolute scale needs reference objects or GNSS/IMU.
- Rapid post-incident reconstruction from dashcams/CCTV [Insurance, Public Safety]
- Recover globally consistent trajectories and geometry from monocular footage to support claims or investigations without camera calibration.
- Tools/Workflows: Evidence ingestion pipeline; anchor-based loop alignment for multi-camera overlap; report generator with trajectory error bounds.
- Assumptions/Dependencies: Legal chain-of-custody; robust handling of occlusions and motion blur; world-scale inference requires external references.
- UAV field mapping with monocular payloads [Agriculture, Environmental Monitoring]
- Create drift-minimized maps over long flight lines; turning encapsulation helps during waypoint maneuvers; sky masking reduces spurious constraints.
- Tools/Workflows: Flight-log + video post-processing; anchor-driven Sim(3) to close loops; agronomic analytics (row straightness, canopy gaps).
- Assumptions/Dependencies: High sky fraction; altitude changes affect scale; wind-induced motion blur; may need GNSS for absolute scaling.
- Crowd-sourced curb/asset inventory [Smart City]
- From resident or contractor videos, generate consistent maps to track signage, curb ramps, hydrants; combine with detectors for asset geotagging.
- Tools/Workflows: Upload portal; asset detection on pointmaps; loop anchors to merge multiple runs; change detection over time.
- Assumptions/Dependencies: Privacy policies; detector accuracy; optional GNSS to resolve absolute positioning.
- Dataset curation and pre-alignment for research [Academia]
- Convert long, uncalibrated sequences into globally consistent trajectories for benchmarking and training downstream perception models.
- Tools/Workflows: “Motion-aware sampler” for parallax-based keyframe selection; submap export; pose graph statistics; minimal compute due to linear complexity.
- Assumptions/Dependencies: Access to foundation model checkpoints; sequence domains similar to pretraining; scale ambiguity handled via Sim(3).
- Video capture optimization (storage and bandwidth) [Software, Edge]
- Integrate parallax-based keyframe selection to avoid redundant frames in capture apps, reducing storage and post-processing load.
- Tools/Workflows: On-device parallax proxy; motion-state gating (static pruning); variable frame-rate recording.
- Assumptions/Dependencies: On-device optical flow approximation; acceptable latency; user consent for adaptive recording.
- Indoor navigation mapping for campuses/hospitals [Education, Healthcare Operations]
- Use handheld video to produce floor-level maps and consistent trajectories, supporting kiosk navigation and maintenance planning.
- Tools/Workflows: Batch map builder; anchor reuse across building loops; signage overlay and facilities planning layers.
- Assumptions/Dependencies: Repetitive textures and long corridors; scale references or IMU; privacy and security policies for indoor video.
Long-Term Applications
The following use cases are promising but require further research, domain adaptation, scaling, or hardware integration beyond the current paper’s scope.
- City-scale crowdsourced mapping from consumer videos [Smart City, Mobility]
- Persistent, season-robust maps built from millions of trips; anchor-driven loop closures merge contributions; updates for navigation and urban analytics.
- Tools/Workflows: Map backend with anchor indexing; drift auditing at city scale; anonymization and governance layers.
- Assumptions/Dependencies: Privacy-preserving infrastructure; scalable storage and compute; robust domain generalization across weather/lighting.
- Monocular-only autonomy for low-speed robots and micro-vehicles [Robotics, Automotive]
- Rely predominantly on calibration-free monocular SLAM, using anchors and motion-aware submaps for long routes.
- Tools/Workflows: Embedded inference on accelerators; safety monitors for drift; multi-run loop closure.
- Assumptions/Dependencies: Safety certification; redundancy with IMU/LiDAR for edge cases; extreme texture-poor or reflective environments.
- AR “cloud anchors” across seasons and lighting [AR/VR, Mapping]
- Build an AR cloud that remains consistent across months using loop anchors and robust Sim(3) constraints.
- Tools/Workflows: Anchor registry; cross-season alignment services; developer APIs for persistent content placement.
- Assumptions/Dependencies: Domain adaptation for heavy appearance changes; scalable anchor discovery; mobile upload throughput.
- Emergency/disaster response mapping from bodycams and drones [Public Safety]
- Rapid, calibration-free maps assembled from ad-hoc, uncalibrated monocular streams; supports search-and-rescue path planning.
- Tools/Workflows: Edge/cloud hybrid inference; anchor fusion from multiple operators; resilience to smoke/dust.
- Assumptions/Dependencies: Robust optical flow under debris/low light; communications constraints; operational policy approvals.
- Underwater and endoscopic monocular SLAM [Healthcare, Marine Robotics]
- Adapt motion-aware partitioning and anchor-driven registration to high-scatter, low-texture domains (turbid water, biological tissue).
- Tools/Workflows: Domain-specific confidence masks (e.g., specular or fluid regions); tailored optical flow; validation datasets.
- Assumptions/Dependencies: Severe visual degradations; domain pretraining; specialized segmenters replacing “sky” masks.
- Powerline, pipeline, and wind-farm inspection at scale [Energy]
- Long-range, loop-heavy routes mapped consistently with monocular cameras; detect structural changes over time.
- Tools/Workflows: Inspection planner integrating loop anchors; change detection across pointmaps; maintenance ticketing.
- Assumptions/Dependencies: Vibration/motion blur mitigation; high-altitude sky dominance (mask reliability); absolute scale alignment via GNSS.
- Standards and policy for calibration-free video-derived maps [Policy, Standards]
- Define V&V protocols, quality metrics (ATE, drift), privacy/consent frameworks, and procurement guidelines for monocular SLAM solutions.
- Tools/Workflows: Benchmark suites; compliance reports; audit trails for anchor use and loop closure decisions.
- Assumptions/Dependencies: Multi-stakeholder alignment; evolving regulation on public-space imaging and data retention.
- Multi-sensor fusion with anchor-aware constraints [Robotics, Automotive]
- Integrate IMU/GNSS/event cameras with anchor-driven Sim(3) to improve robustness and absolute scale, keeping linear complexity.
- Tools/Workflows: Factor-graph fusion (Sim(3)+SE(3)) with anchor weights; online loop closures; adaptive motion-state blending.
- Assumptions/Dependencies: Sensor synchronization; robust cross-modal calibration; real-time optimization on embedded hardware.
- On-device real-time SLAM for consumer apps [Mobile, Software]
- Port VGGT-Motion to mobile NPUs/accelerators for live AR navigation and mapping without cloud offload.
- Tools/Workflows: Model distillation/quantization; lightweight optical flow; incremental pose-graph maintenance.
- Assumptions/Dependencies: Mobile compute budgets; battery constraints; efficient memory management for long sequences.
- Insurance risk modeling and claims automation [Finance/Insurance]
- Use consistent trajectories and geometry to quantify risky maneuvers, validate claims, and price premiums based on actual driving behavior.
- Tools/Workflows: Risk scoring pipelines; anomaly detection on turns/stops (leveraging motion-state classification); policy dashboards.
- Assumptions/Dependencies: Regulatory approvals for behavioral data use; bias mitigation in dynamic scenes; secure data handling.
Cross-cutting assumptions and dependencies (applies broadly)
- Foundation model availability and licensing (VGGT or successors), plus robust optical flow and semantic masks.
- Domain generalization for extreme conditions (night, heavy rain/snow, tunnels, specular surfaces).
- Absolute scale often requires external references (GNSS, known object dimensions, or multi-sensor fusion).
- Compute pathways: server-side GPUs today; edge/mobile feasibility requires model slimming, hardware acceleration, and careful memory control.
- Privacy, consent, and data governance for public and private spaces; secure handling of video and derived maps.
Glossary
- Absolute Trajectory Error (ATE): A standard metric that measures the deviation between an estimated trajectory and the ground truth after alignment. "We report the Absolute Trajectory Error (ATE) after global Sim(3) alignment"
- Anchor-driven direct Sim(3) registration: An alignment strategy that estimates similarity transforms between submaps using shared anchors to enable efficient, dense alignment without feature matching. "we develop an anchor-driven direct Sim(3) registration module."
- Bundle adjustment: A nonlinear optimization that jointly refines 3D structure and camera parameters to minimize reprojection error. "achieve high reconstruction accuracy via global bundle adjustment (Triggs et al., 1999)"
- Calibration-free: Operating without prior knowledge of camera intrinsic parameters, estimating them directly from data. "a calibration-free SLAM system for efficient and robust global consistency"
- Catastrophic forgetting: The tendency of a model to lose previously learned information when trained sequentially, relevant for streaming architectures. "remaining susceptible to catastrophic forgetting (Kirkpatrick et al., 2017)"
- Confidence map: A per-pixel estimate of prediction reliability used to filter out uncertain regions during alignment. "pixel-wise confidence maps Ck"
- Contextual asymmetry: A mismatch in predictions across submaps due to differing temporal contexts, causing alignment bias. "Contextual Asymmetry and Systematic Alignment Bias."
- Covisibility-based selection: A keyframe selection strategy based on overlapping features between frames, potentially yielding narrow baselines. "Even covisibility-based selection, which triggers keyframe insertion based on feature overlap, reduces redundancy but prioritizes visual continuity over spatial displacement."
- Dense correspondences: Pixel-wise 3D point matches used to impose strong geometric constraints for registration. "dense, search-free geometric correspondences"
- Epipolar search: The process of constraining point correspondences along epipolar lines; bypassed by some foundation models. "Unified 3D foundation models bypass epipolar search by directly regressing dense geometry and camera parameters."
- Gauge drift: Drift arising from unobservable degrees of freedom (gauge), such as global scale in monocular SLAM, between submaps. "rectify inter-submap gauge drift, especially scale ambiguity."
- Huber loss: A robust loss function that is less sensitive to outliers, used in regression and alignment. "ρHuber(·) denotes the Huber loss function (Huber, 1992)."
- Inlier ratio: The fraction of correspondences consistent with an estimated model, used for geometric verification. "geometric verification using the inlier ratio nij of the converged solution."
- Information matrix: A matrix encoding the confidence (inverse covariance) of constraints in optimization. "Rij = WijI is the information matrix."
- Intrinsics: Camera internal parameters (e.g., focal length, principal point) estimated alongside poses and geometry. "VGGT (Wang et al., 2025a) further integrates pose, intrinsics, and feature tracks within a Transformer architecture."
- Levenberg-Marquardt: An iterative optimization algorithm combining gradient descent and Gauss-Newton, used for nonlinear least squares. "solve Eq. 12 via Levenberg-Marquardt on Lie groups."
- Lie algebra: The tangent space associated with a Lie group, used to represent small pose updates in optimization. "the residual rij is defined in the Lie algebra via the logarithm map:"
- Logarithm map: A mapping from a Lie group to its Lie algebra that linearizes group elements for optimization. "the residual rij is defined in the Lie algebra via the logarithm map:"
- Loop closure: The process of detecting and enforcing consistency when the camera revisits a previously seen area to reduce drift. "and efficient loop closure without costly feature matching."
- Monocular scale ambiguity: The inherent inability of monocular vision to determine absolute scale without additional cues. "to account for monocular scale ambiguity."
- Motion-agnostic partitioning: Segmenting sequences without considering motion dynamics, which can harm geometric coherence. "Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift,"
- Optical flow: The per-pixel apparent motion between consecutive images, used here for motion state estimation and partitioning. "uses optical flow to guide adaptive partitioning"
- Out-of-Memory (OOM): A failure mode where models exceed available memory during computation. "Out-of-Memory (OOM)"
- Parallax-based keyframe selection: Selecting keyframes when sufficient viewpoint change (parallax) occurs to ensure informative geometry. "Parallax-based keyframe selection and redundancy pruning suppress zero-motion drift"
- Pixel-Indexed Dense Correspondence: A correspondence scheme that aligns 3D points across submaps by sharing the same pixel indices in an anchor frame. "Pixel-Indexed Dense Correspondence."
- Pose graph optimization: An optimization over nodes (poses) and edges (relative constraints) to achieve global consistency. "a lightweight pose graph optimization enforces global consistency with linear complexity"
- Quadratic self-attention: The O(N²) computational and memory complexity of standard self-attention with sequence length N. "quadratic O(N²) cost of self-attention"
- SALAD features: Features derived from a diffusion-based 3D shape model used here for loop detection/retrieval. "via SALAD features (Koo et al., 2023)"
- Scale drift: Accumulating scale errors over time in monocular trajectories due to ambiguous scale estimation. "scale drift remains severe on long sequences."
- Self-attention: Mechanism in Transformers allowing global contextual interactions, here influencing geometric consistency and cost. "disrupting the global self-attention mechanism, which is essential for maintaining geometric consistency."
- Similarity transform: A 3D transformation comprising scale, rotation, and translation used to align submaps. "We model their geometric relation by a similarity transform Sij ∈ Sim(3):"
- Sim(3): The Lie group of 3D similarity transforms (scale + rotation + translation) used for alignment and optimization. "Sij ∈ Sim(3)"
- Structure-from-Motion (SfM): A class of methods that recover camera motion and 3D structure from images. "Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM)"
- Submap: A locally reconstructed segment of the trajectory/map used as a unit for scalable alignment and optimization. "submap-level pose graph optimization"
- Transformer-based architectures: Deep models using self-attention layers; here used in 3D vision but limited by quadratic complexity. "Transformer-based architectures imposes severe memory constraints and computational bottlenecks"
- Turning Segment Encapsulation: A partitioning strategy that keeps continuous turning motions within a single submap to preserve geometry. "Turning Segment Encapsulation."
- Zero-motion drift: Spurious motion estimates accumulated during truly static intervals due to noise and model priors. "zero-motion drift"