FoVNet: Field-of-View Extrapolation
- FoVNet is a field-of-view extrapolation framework that synthesizes wider outputs from narrow sensor data with temporal consistency and uncertainty estimation.
- It integrates geometric coordinate warping with attention-based feature aggregation and configurable modules for active vision and speech enhancement.
- The architecture supports applications in autonomous navigation, wearable smart glasses, and computation-constrained systems by robustly handling unobserved regions.
FoVNet refers to several distinct technical frameworks centered around field-of-view (FoV) modulation, primarily in vision and speech enhancement domains. Across applications—scene extrapolation, active foveation, and selective audio enhancement—FoVNet designs unify geometric, attention-based, and modular signal processing approaches to leverage narrow aperture data for the synthesis or enhancement of wider field-of-view outputs, often with built-in uncertainty quantification or configurability.
1. Formal Definition and Problem Scope
FoVNet, in its foundational incarnation, is a temporally consistent field-of-view extrapolation architecture for visual scene prediction. The system infers the current wide-field-of-view scene from narrow-FoV video input, while estimating a per-pixel hallucination uncertainty. The scope extends beyond image stitching, demanding (a) propagation under camera motion, (b) hallucination of unobserved regions, (c) temporal consistency, and (d) principled uncertainty quantification. Similar terminology has been adopted in foveated vision models for active spatial sampling (Killick et al., 2023) and configurable speech enhancement systems for smart glasses (Xu et al., 2024), where FoVNet describes adaptation to spatial or directional constraints.
2. Architecture and Computational Pipeline
Visual Scene Extrapolation (Ma et al., 2022)
FoVNet’s vision pipeline is a two-stage recurrent system:
- Coordinate Generation Stage: computes per-pixel warping coordinates for each past frame from estimated depth and relative camera pose. The rigid 3D flow induced by camera motion is inverted to allow forward warping via feature scatter.
- Frame Aggregation Stage:
  - Attention-based feature aggregation (AFA) fuses attention-weighted features from the warped past frames.
  - The GSA module adaptively gates local self-attended features, distinguishing hallucination zones from reliably propagated regions.
  - The output comprises the wide-FoV RGB image and a pixelwise uncertainty map.
Adversarial training incorporates both image-level and temporal discriminators.
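As a minimal illustration of forward warping via feature scatter (a sketch, not the authors' implementation), each source pixel's feature can be scattered to its predicted target coordinate, averaging collisions and leaving a hit-count mask whose empty cells mark the regions that must be hallucinated:

```python
import numpy as np

def forward_warp_scatter(features, coords, out_h, out_w):
    """Scatter per-pixel features to predicted target coordinates.

    features: (H, W, C) source feature map
    coords:   (H, W, 2) predicted (row, col) target coordinate per pixel
    Returns the warped (out_h, out_w, C) map and a hit-count mask;
    zero-count cells are the unobserved regions to hallucinate."""
    h, w, c = features.shape
    warped = np.zeros((out_h, out_w, c), dtype=float)
    count = np.zeros((out_h, out_w), dtype=np.int32)
    rows = np.round(coords[..., 0]).astype(int).ravel()
    cols = np.round(coords[..., 1]).astype(int).ravel()
    feats = features.reshape(-1, c)
    valid = (rows >= 0) & (rows < out_h) & (cols >= 0) & (cols < out_w)
    for r, cl, f in zip(rows[valid], cols[valid], feats[valid]):
        warped[r, cl] += f          # accumulate contributions
        count[r, cl] += 1
    nonzero = count > 0
    warped[nonzero] /= count[nonzero, None]  # average colliding pixels
    return warped, count

# identity warp leaves the feature map unchanged
f = np.random.rand(4, 4, 3)
rr, cc = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
coords = np.stack([rr, cc], axis=-1).astype(float)
out, cnt = forward_warp_scatter(f, coords, 4, 4)
```

A real system would scatter with sub-pixel (bilinear) splatting rather than rounding, but the collision-averaging and mask logic carry over.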
Foveated Active Vision (Killick et al., 2023)
In the context of active vision, FoVNet architectures deploy:
- Differentiable Foveated Sensor: irregularly samples a high-density fovea and a sparse periphery, parameterized by foveal radius, sample count, and a sunflower spatial layout.
- Graph Convolutional Network (GCN): features are processed on the irregular sampling grid using Gaussian-derivative edge-conditioned graph convolutions.
- Attention (Fixation) Module: a softmax-based saliency map predicts the next fixation; soft-argmax enables gradient-based end-to-end optimization.
- Classification Head: class logits are averaged across fixations, or produced via a localization network in one-shot approaches.
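Two of the pieces above are easy to sketch concretely (an illustrative reconstruction, with a `density_exp` knob that is our own simplification, not the paper's parameterization): a golden-angle "sunflower" sampling pattern whose density can be biased toward a central fovea, and the soft-argmax that turns a saliency map into a differentiable fixation point:

```python
import numpy as np

def sunflower_layout(n_samples, radius, density_exp=0.5):
    """Golden-angle 'sunflower' sampling pattern.

    density_exp = 0.5 gives roughly uniform areal density; smaller
    values concentrate samples toward the center (a denser fovea).
    Returns an (n_samples, 2) array of (x, y) offsets within radius."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    i = np.arange(n_samples)
    r = radius * (i / max(n_samples - 1, 1)) ** density_exp
    theta = i * golden_angle
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

def soft_argmax(saliency):
    """Differentiable fixation: expected (row, col) under softmax weights."""
    h, w = saliency.shape
    p = np.exp(saliency - saliency.max())
    p /= p.sum()
    rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return (p * rr).sum(), (p * cc).sum()
```

Because soft-argmax is a probability-weighted average of coordinates rather than a hard `argmax`, gradients flow from the classification loss back into the saliency map, which is what makes end-to-end training of the fixation policy possible.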
Speech Enhancement for Wearables (Xu et al., 2024)
FoVNet in audio employs:
- Front-End Spatial Sampling: a max-DI (maximum directivity index) beamformer “scans” 20 horizontal blocks; beamformed STFTs are mapped to 64-band ERB features.
- Neural Module (Ultra-Low Computation): FiLM-based FoV embeddings, depth-wise spatial convolutions, a reference branch, and bi-layer GRUs produce ERB-band gains, which are applied as a mask to the reference STFT.
- Multi-Channel Wiener Filter (MCWF): covariance estimation via time-recursive smoothing enables minimum-distortion beamforming.
- Post-Processing (PP): residual-reduction masks reconcile distortions between the beamformer and neural-network estimates.
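The ERB-band masking step can be sketched as follows (a simplified reconstruction using the standard ERB-rate formula and rectangular band membership; the actual system's filterbank may differ): STFT bins are pooled into 64 bands on the ERB-rate scale, and the predicted per-band gains are expanded back to bins and applied to the reference STFT:

```python
import numpy as np

def erb_band_matrix(n_fft_bins, sr, n_bands=64):
    """Rectangular band-membership matrix on the ERB-rate scale.

    Uses the standard ERB-rate mapping E(f) = 21.4*log10(1 + 0.00437 f)
    to place n_bands equal-width bands between 0 Hz and Nyquist.
    Returns an (n_bands, n_fft_bins) 0/1 matrix."""
    freqs = np.linspace(0, sr / 2, n_fft_bins)
    erb = 21.4 * np.log10(1 + 0.00437 * freqs)
    edges = np.linspace(0, erb[-1], n_bands + 1)
    band = np.clip(np.digitize(erb, edges) - 1, 0, n_bands - 1)
    M = np.zeros((n_bands, n_fft_bins))
    M[band, np.arange(n_fft_bins)] = 1.0  # each bin in exactly one band
    return M

def apply_erb_gains(stft, gains, M):
    """Expand per-band gains to bins and mask the reference STFT.

    stft: (n_bins, n_frames) complex; gains: (n_bands, n_frames)."""
    bin_gains = M.T @ gains          # (n_bins, n_frames)
    return bin_gains * stft
```

Predicting 64 band gains instead of per-bin masks is what keeps the neural module's output dimensionality, and hence its compute, small.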
3. Loss Functions and Uncertainty Modeling
Visual FoVNet Loss (Ma et al., 2022)
- Coordinate Generation Loss: masked photometric-consistency and smoothness penalties.
- Frame Aggregation Loss:
  - an uncertainty-aware L1 term with a per-pixel Laplacian posterior,
  - a VGG-based perceptual loss and adversarial (LSGAN) losses, combined with weighting hyperparameters.
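The uncertainty-aware L1 term typically corresponds to the negative log-likelihood of a per-pixel Laplacian; the exact parameterization in the paper may differ, but a common form is:

```python
import numpy as np

def laplacian_l1_loss(pred, target, log_b):
    """Per-pixel negative log-likelihood under a Laplacian posterior.

    pred, target: (H, W) images; log_b: (H, W) predicted log-scale.
    Minimizing |x - x_hat|/b + log(2b) lets the network down-weight
    the L1 error where it predicts high uncertainty b, at the cost of
    a log penalty that discourages claiming uncertainty everywhere."""
    b = np.exp(log_b)
    return np.mean(np.abs(pred - target) / b + np.log(2.0 * b))
```

Predicting the log-scale rather than the scale itself keeps the optimization unconstrained while guaranteeing b > 0.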
Speech Enhancement Loss (Xu et al., 2024)
The speech model is trained with a composite loss whose terms are balanced by tunable weighting hyperparameters.
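SI-SDR, reported in the benchmarks below and widely used both as a metric and as a training objective in speech enhancement, can be computed as (a standard definition, not specific to this system):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    The estimate is projected onto the reference, so a global
    rescaling of the estimate leaves the score unchanged."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled reference component
    noise = estimate - target           # everything else is distortion
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))
```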
4. Empirical Results and Benchmarks
Scene Extrapolation
On KITTI/Cityscapes, FoVNet demonstrates:
| Model | SSIM | LPIPS | FID | FVD |
|---|---|---|---|---|
| Mono | 0.680 | 0.298 | 31.1 | 269.0 |
| VF | 0.718 | 0.305 | 33.5 | 360.8 |
| LGTSM | 0.737 | 0.291 | 52.6 | 495.7 |
| Mono+LGTSM | 0.703 | 0.280 | 19.0 | 201.6 |
| FoVNet | 0.716 | 0.229 | 10.9 | 82.7 |
FoVNet’s uncertainty map is well calibrated, with AUSE ≈ 0.0049 versus 0.0197 for a random ordering (Ma et al., 2022).
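AUSE (area under the sparsification error curve) measures how well predicted uncertainty ranks true errors; a minimal sketch of the standard computation (the paper's exact fraction grid may differ):

```python
import numpy as np

def ause(errors, uncertainties, n_steps=20):
    """Area Under the Sparsification Error curve (lower is better).

    Pixels are removed in order of predicted uncertainty and the mean
    error of the remaining pixels is compared against an oracle that
    removes pixels in order of true error; AUSE integrates the gap."""
    errors = errors.ravel()
    by_pred = errors[np.argsort(-uncertainties.ravel())]  # most uncertain first
    by_orc = errors[np.argsort(-errors)]                  # largest error first
    n = errors.size
    fracs = np.linspace(0.0, 0.95, n_steps)
    gap = np.array([by_pred[int(f * n):].mean() - by_orc[int(f * n):].mean()
                    for f in fracs])
    # trapezoidal integration of the sparsification gap
    return float(np.sum((gap[1:] + gap[:-1]) * 0.5 * np.diff(fracs)))
```

A perfect uncertainty map reproduces the oracle ordering and scores 0, which is why the 0.0049 figure is compared against the 0.0197 of a random ordering.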
Foveated Vision Classification
ImageNet-100 results (input constraint 112² pixels):
| Architecture | Params | Top-1 (%) |
|---|---|---|
| Uniform ConvNeXt | 3.7M | 70.0 |
| Foveated GraphConv | 3.7M | 72.5 |
| Fov STN (ours) | 4.8M | 74.2 |
| Learning-to-Zoom | 4.8M | 75.8 |
Full-resolution ConvNeXt (224² pixels) reaches 78.4% (Killick et al., 2023).
Smart Glasses Speech Enhancement
Single-target, 0 interferer (Xu et al., 2024):
| Model | PESQ | STOI | SI-SDR (dB) |
|---|---|---|---|
| Noisy | 1.46 | 0.64 | -1.25 |
| SC-CRN | 1.78 | 0.67 | +3.41 |
| maxDI+SC-CRN | 2.06 | 0.76 | +4.21 |
| FoVNet | 2.02 | 0.74 | +5.53 |
| FoVNet+MCWF+PP | 2.05 | 0.74 | +5.01 |
The pipeline operates at ≈50 MMACS and 0.206M parameters.
5. Component Analysis and Ablations
FoVNet ablation studies confirm distinct performance gains attributable to each architectural segment (Ma et al., 2022):
- Forward warping by coordinate inversion yields robust propagation versus naive inverse warping.
- Local self-attention window (7×7) plus gating balances long-range modeling and detail preservation.
- Uncertainty prediction under a Laplacian model yields interpretable per-pixel confidence.
- Removal of each module (3D propagation, AFA, GSA, uncertainty, recurrence/temporal loss) degrades metrics by 5–20%.
- FiLM-based FoV embedding and ERB spectral compression generalize over arbitrary FoVs in speech enhancement (Xu et al., 2024).
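The FiLM-based FoV conditioning can be illustrated with a toy sketch (the embedding features and linear projections `fov_embedding`, `W_g`, `W_b` here are our own hypothetical stand-ins for the system's learned conditioning network): the FoV descriptor is embedded, projected to per-channel scale and shift, and applied to the backbone features, so one set of backbone weights serves arbitrary FoV configurations:

```python
import numpy as np

rng = np.random.default_rng(0)

def fov_embedding(start_deg, width_deg):
    """Toy FoV descriptor (hypothetical): sin/cos of the sector edges."""
    a0 = np.deg2rad(start_deg)
    a1 = np.deg2rad(start_deg + width_deg)
    return np.array([np.sin(a0), np.cos(a0), np.sin(a1), np.cos(a1)])

def film_modulate(features, embed, W_g, W_b):
    """FiLM: per-channel scale (gamma) and shift (beta) from the embedding.

    features: (T, C) frames; W_g, W_b: (4, C) toy projection weights.
    Only gamma/beta change with the FoV; the backbone stays fixed."""
    gamma = embed @ W_g
    beta = embed @ W_b
    return gamma * features + beta
```

Because the FoV enters only through a cheap affine modulation, the marginal cost of configurability is a handful of multiply-adds per channel, consistent with the ultra-low-computation budget.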
6. Practical Applications and Configurability
FoVNet supports robotic and autonomous systems requiring temporally consistent extrapolation beyond limited sensor FoVs. Key use cases include:
- Autonomous vehicles negotiating occlusions or limited visibility corners (Ma et al., 2022).
- Wearable augmented hearing for smart glasses, allowing arbitrary configurability of the angular sector of enhancement without explicit target-talker direction-of-arrival input (Xu et al., 2024).
- Computation-constrained vision systems in active recognition, focusing resources on critical regions (Killick et al., 2023).
Configurable FoV design—e.g., selecting field blocks in audio or tuning spatial fovea radius in vision—extends task generality and adaptability.
7. Research Directions and Future Work
Future extensions of FoVNet-style approaches involve:
- Memory mechanisms for past fixations in active vision (Killick et al., 2023).
- Cross-fixation spatiotemporal attention for feature fusion.
- Joint optimization strategies for foveal shape, sensor layout, and attention policies.
- Advanced uncertainty modeling for downstream planners and safety-critical decision-making.
- Improved distortion suppression and multi-talker separation in audio by integrating more sophisticated post-processing and beamforming modules (Xu et al., 2024).
A plausible implication is the expanded use of modular, uncertainty-aware, and spatially adaptive architectures beyond their current domains, enabling robust extrapolation and enhancement in systems with strategic sensor constraints.