FoVNet: Field-of-View Extrapolation

Updated 31 December 2025
  • FoVNet is a field-of-view extrapolation framework that synthesizes wider outputs from narrow sensor data with temporal consistency and uncertainty estimation.
  • It integrates geometric coordinate warping with attention-based feature aggregation and configurable modules for active vision and speech enhancement.
  • The architecture supports applications in autonomous navigation, wearable smart glasses, and computation-constrained systems by robustly handling unobserved regions.

FoVNet refers to several distinct technical frameworks centered on field-of-view (FoV) modulation, primarily in the vision and speech enhancement domains. Across applications—scene extrapolation, active foveation, and selective audio enhancement—FoVNet designs unify geometric, attention-based, and modular signal processing approaches to leverage narrow-aperture data for the synthesis or enhancement of wider field-of-view outputs, often with built-in uncertainty quantification or configurability.

1. Formal Definition and Problem Scope

FoVNet, in its foundational incarnation, is a temporally consistent field-of-view extrapolation architecture for visual scene prediction. The system infers the current wide-field-of-view scene from narrow-FoV video input, while estimating a per-pixel hallucination uncertainty. The scope extends beyond image stitching, demanding (a) propagation under camera motion, (b) hallucination of unobserved regions, (c) temporal consistency, and (d) principled uncertainty quantification. Similar terminology has been adopted in foveated vision models for active spatial sampling (Killick et al., 2023) and configurable speech enhancement systems for smart glasses (Xu et al., 2024), where FoVNet describes adaptation to spatial or directional constraints.

2. Architecture and Computational Pipeline

FoVNet’s vision pipeline is a two-stage recurrent system:

  • Coordinate Generation Stage: Computes per-pixel warping coordinates for each past frame, leveraging estimated depth $D_\theta(I)$ and relative camera pose $P_\phi$. Rigid 3D flow is given by $f^{rig}_{t\to i}(c_i) = K\,T_{i\to t}\bigl(\mathcal{D}_i(c_i)\,K^{-1}c_i\bigr) - c_i$ and inverted to allow forward warping via feature scatter.
  • Frame Aggregation Stage:

    • AFA fuses attention-weighted features: $A^i = \mathrm{softmax}\bigl(\mathrm{Conv}_{att}(F^i)\bigr), \quad \tilde F = \sum_{i=1}^{k+1} A^i \odot F^i$
    • GSA adaptively gates local self-attended features, distinguishing between hallucination zones and reliably propagated regions.
    • Output includes the wide-FoV RGB image ($O_t$) and a pixelwise uncertainty map ($U_t$).
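The AFA fusion step can be sketched in NumPy; the attention logits here stand in for the output of the learned $\mathrm{Conv}_{att}$ head, which is not modeled:

```python
import numpy as np

def afa_fuse(features, att_logits):
    """Attention-based Feature Aggregation (AFA) sketch.

    features:   (k+1, C, H, W) warped feature maps from past frames
    att_logits: (k+1, C, H, W) per-frame attention logits (stand-in for
                the learned Conv_att output)
    Returns the fused map F~ = sum_i A^i * F^i, where the attention
    maps A^i softmax-normalize across the frame axis.
    """
    # Softmax over the frame dimension: weights at each (c, h, w) sum to 1.
    logits = att_logits - att_logits.max(axis=0, keepdims=True)  # stability
    att = np.exp(logits)
    att /= att.sum(axis=0, keepdims=True)
    return (att * features).sum(axis=0)
```

With equal logits the fusion reduces to a plain average over frames, which is a useful sanity check.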

Adversarial training incorporates both image-level and temporal discriminators.
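As a rough illustration of the coordinate generation stage, the rigid 3D flow above can be computed per pixel from depth, intrinsics, and relative pose. This is a minimal sketch only: in FoVNet the depth and pose are network estimates, and the coordinate inversion used for forward warping is omitted here.

```python
import numpy as np

def rigid_flow(depth, K, T_i_to_t):
    """Rigid 3D flow f_{t<-i}(c_i) = K T_{i->t}(D_i(c_i) K^{-1} c_i) - c_i.

    depth:    (H, W) estimated depth D_i for frame i
    K:        (3, 3) camera intrinsics
    T_i_to_t: (4, 4) relative pose from frame i to frame t
    Returns an (H, W, 2) flow field in pixels.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)  # homogeneous c_i
    # Back-project to 3D: D_i(c_i) * K^{-1} c_i
    cam = depth[..., None] * (pix @ np.linalg.inv(K).T)
    # Apply the rigid transform T_{i->t}
    cam_h = np.concatenate([cam, np.ones((H, W, 1))], axis=-1)
    cam_t = (cam_h @ T_i_to_t.T)[..., :3]
    # Re-project with K and perspective-divide
    proj = cam_t @ K.T
    proj = proj[..., :2] / proj[..., 2:3]
    return proj - pix[..., :2]
```

For an identity relative pose the flow is zero everywhere, as expected.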

In the context of active vision, FoVNet architectures deploy:

  • Differentiable Foveated Sensor: Irregularly samples a high-density fovea and sparse periphery, parameterized by radius $r$, sample count $d$, and a sunflower spatial layout.
  • Graph Convolutional Network (GCN): Features are processed on the irregular grid using Gaussian-derivative edge-conditioned graph convolutions.
  • Attention (Fixation) Module: A softmax-based saliency map predicts the next fixation; soft-argmax enables gradient-based end-to-end optimization.
  • Classification Head: Class logits are averaged across $T$ fixations or produced via a localization network for one-shot approaches.
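The soft-argmax step that makes fixation selection differentiable can be sketched as follows; `beta` is a hypothetical sharpness parameter not specified in the source:

```python
import numpy as np

def soft_argmax(saliency, beta=10.0):
    """Differentiable fixation selection via soft-argmax.

    saliency: (H, W) saliency logits. Returns the expected (x, y)
    location under a softmax distribution; as beta grows this
    approaches the hard argmax while remaining differentiable.
    """
    H, W = saliency.shape
    logits = beta * saliency
    p = np.exp(logits - logits.max())  # stable softmax over all pixels
    p /= p.sum()
    ys, xs = np.mgrid[0:H, 0:W]
    # Expected coordinates under the attention distribution
    return float((p * xs).sum()), float((p * ys).sum())
```

Because the output is a probability-weighted average of coordinates rather than a hard index, gradients flow back into the saliency map during end-to-end training.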

FoVNet in audio employs:

  • Front-End Spatial Sampling: A max-DI beamformer “scans” 20 horizontal blocks; beamformed STFTs are mapped to 64-band ERB features.
  • Neural Module (Ultra-Low Computation): FiLM-based FoV embeddings, depth-wise spatial convolutions, a reference branch, and bi-layer GRUs produce ERB-band gains. Masked enhancement ($M_{stft}$) is applied to the reference STFT.
  • Multi-Channel Wiener Filter (MCWF): Covariance estimation via time-recursive smoothing enables minimum-distortion beamforming.
  • Post-Processing (PP): Residual-reduction masks reconcile distortions between beamformer and neural network estimates.
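FiLM conditioning applies a per-channel affine transform derived from the FoV embedding, which is how a single set of network weights serves arbitrary FoV configurations. A minimal sketch, with hypothetical learned projections `W_gamma` and `W_beta`:

```python
import numpy as np

def film_modulate(features, fov_embedding, W_gamma, W_beta):
    """FiLM-style FoV conditioning sketch.

    features:       (C, T) ERB-band feature sequence
    fov_embedding:  (E,) embedding of the configured FoV sector
    W_gamma, W_beta: (C, E) hypothetical learned projections producing
                     the per-channel scale (gamma) and shift (beta)
    FiLM applies y = gamma * x + beta channel-wise.
    """
    gamma = W_gamma @ fov_embedding  # (C,) per-channel scale
    beta = W_beta @ fov_embedding    # (C,) per-channel shift
    return gamma[:, None] * features + beta[:, None]
```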

3. Loss Functions and Uncertainty Modeling

  • Coordinate Generation Loss ($L_{CG}$): masked photometric consistency and smoothness penalties.
  • Frame Aggregation Loss ($L_{FA}$):

    • Uncertainty-aware L1 loss ($L_{1U}$) with a Laplacian per-pixel posterior:

      $L_{1U} = \mathbb{E}\left[\frac{\|O_t - W_t\|_1}{U_t} \odot M + \|O_t - W_t\|_1 \odot (1-M) + U_t\right]$

    • Perceptual loss ($L_{perc}$; VGG-based) and adversarial (LSGAN) losses, with hyperparameters $\lambda_s=10^{-3}$, $\lambda_1=3$, $\lambda_2=10$.
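Assuming the mask and uncertainty conventions above, the $L_{1U}$ term can be written directly. An `eps` guard is added here for numerical safety; it is not part of the published formula:

```python
import numpy as np

def l1u_loss(O_t, W_t, U_t, M, eps=1e-6):
    """Uncertainty-aware L1 loss L_{1U} (Laplacian per-pixel posterior).

    O_t: predicted wide-FoV image, W_t: warped reference,
    U_t: predicted per-pixel uncertainty, M: mask (1 where the
    uncertainty weighting applies, 0 elsewhere).
    """
    err = np.abs(O_t - W_t)
    # Errors are down-weighted where the network admits high uncertainty,
    # while the additive U_t term stops uncertainty from growing unboundedly.
    return np.mean(err / (U_t + eps) * M + err * (1 - M) + U_t)
```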

For the speech-enhancement variant, the composite loss combines SI-SDR with log-magnitude spectral terms:

$\text{loss} = -\text{SI-SDR}(y, \hat{y}) + \lambda_1 \|\log|Y| - \log|\hat{Y}_{fovnet}|\|_1 + \lambda_2\bigl(\|\log|\mathrm{Re}\,Y| - \log|\mathrm{Re}\,\hat{Y}_{fovnet}|\|_1 + \|\log|\mathrm{Im}\,Y| - \log|\mathrm{Im}\,\hat{Y}_{fovnet}|\|_1\bigr)$

Typical values: $\lambda_1 = 0.01$, $\lambda_2 = 1.0$.
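The SI-SDR term of this loss can be sketched as follows (a standard scale-invariant SDR; the log-spectral terms are omitted, and the `eps` guard is an assumption for numerical safety):

```python
import numpy as np

def si_sdr(y, y_hat, eps=1e-8):
    """Scale-invariant SDR in dB (the -SI-SDR term of the composite loss).

    y: reference waveform, y_hat: estimate. The estimate is projected
    onto the reference so gain mismatches are not penalized, then the
    residual energy ratio is measured.
    """
    alpha = np.dot(y_hat, y) / (np.dot(y, y) + eps)
    target = alpha * y          # scaled projection of the estimate onto y
    noise = y_hat - target      # residual distortion
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```

A rescaled copy of the reference scores very high, illustrating the scale invariance.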

4. Empirical Results and Benchmarks

Scene Extrapolation

On KITTI/Cityscapes, FoVNet demonstrates:

| Model | SSIM | LPIPS | FID | FVD |
|---|---|---|---|---|
| Mono | 0.680 | 0.298 | 31.1 | 269.0 |
| VF | 0.718 | 0.305 | 33.5 | 360.8 |
| LGTSM | 0.737 | 0.291 | 52.6 | 495.7 |
| Mono+LGTSM | 0.703 | 0.280 | 19.0 | 201.6 |
| FoVNet | 0.716 | 0.229 | 10.9 | 82.7 |

FoVNet’s uncertainty map achieves AUSE ≈ 0.0049, markedly better than a random uncertainty ordering (0.0197) (Ma et al., 2022).
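AUSE is derived from sparsification curves; a minimal sketch of the curve computation follows (the AUSE itself is the area between this curve and the oracle curve obtained by sorting on the true error instead):

```python
import numpy as np

def sparsification_curve(errors, uncertainty, steps=10):
    """Sparsification curve underlying AUSE.

    errors:      flat array of per-pixel errors
    uncertainty: flat array of predicted per-pixel uncertainties
    Pixels are removed in order of decreasing predicted uncertainty; if
    the uncertainty is informative, the mean error of the remaining
    pixels drops as more pixels are removed.
    """
    order = np.argsort(-uncertainty)   # most uncertain first
    sorted_err = errors[order]
    n = len(errors)
    curve = []
    for k in range(steps):
        keep = sorted_err[int(n * k / steps):]  # drop the top k/steps fraction
        curve.append(keep.mean())
    return np.array(curve)
```

When uncertainty correlates with true error, the curve is monotonically decreasing, which is what a low AUSE reflects.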

Foveated Vision Classification

ImageNet-100 results (input constraint 112² pixels):

| Architecture | Params | Top-1 (%) |
|---|---|---|
| Uniform ConvNeXt | 3.7M | 70.0 |
| Foveated GraphConv | 3.7M | 72.5 |
| Fov STN (ours) | 4.8M | 74.2 |
| Learning-to-Zoom | 4.8M | 75.8 |

Full-resolution ConvNeXt (224² pixels) reaches 78.4% (Killick et al., 2023).

Smart Glasses Speech Enhancement

Single-target, 0 interferer (Xu et al., 2024):

| Model | PESQ | STOI | SI-SDR (dB) |
|---|---|---|---|
| Noisy | 1.46 | 0.64 | -1.25 |
| SC-CRN | 1.78 | 0.67 | +3.41 |
| maxDI+SC-CRN | 2.06 | 0.76 | +4.21 |
| FoVNet | 2.02 | 0.74 | +5.53 |
| FoVNet+MCWF+PP | 2.05 | 0.74 | +5.01 |

The pipeline operates at ≈50 MMACS and 0.206M parameters.

5. Component Analysis and Ablations

FoVNet ablation studies confirm distinct performance gains attributable to each architectural segment (Ma et al., 2022):

  • Forward warping by coordinate inversion yields robust propagation versus naive inverse warping.
  • Local self-attention window (7×7) plus gating balances long-range modeling and detail preservation.
  • Uncertainty prediction under a Laplacian model yields interpretable per-pixel confidence.
  • Removal of each module (3D propagation, AFA, GSA, uncertainty, recurrence/temporal loss) degrades metrics by 5–20%.
  • FiLM-based FoV embedding and ERB spectral compression generalize over arbitrary FoVs in speech enhancement (Xu et al., 2024).

6. Practical Applications and Configurability

FoVNet supports robotic and autonomous systems requiring temporally consistent extrapolation beyond limited sensor FoVs. Key use cases include:

  • Autonomous vehicles negotiating occlusions or limited visibility corners (Ma et al., 2022).
  • Wearable augmented hearing for smart glasses, allowing arbitrary configurability of the angular sector of enhancement without explicit target-talker direction-of-arrival input (Xu et al., 2024).
  • Computation-constrained vision systems in active recognition, focusing resources on critical regions (Killick et al., 2023).

Configurable FoV design—e.g., selecting field blocks in audio or tuning spatial fovea radius in vision—extends task generality and adaptability.

7. Research Directions and Future Work

Future extensions of FoVNet-style approaches involve:

  • Memory mechanisms for past fixations in active vision (Killick et al., 2023).
  • Cross-fixation spatiotemporal attention for feature fusion.
  • Joint optimization strategies for foveal shape, sensor layout, and attention policies.
  • Advanced uncertainty modeling for downstream planners and safety-critical decision-making.
  • Improved distortion suppression and multi-talker separation in audio by integrating more sophisticated post-processing and beamforming modules (Xu et al., 2024).

A plausible implication is the expanded use of modular, uncertainty-aware, and spatially adaptive architectures beyond their current domains, enabling robust extrapolation and enhancement in systems with strategic sensor constraints.
