Dense Gaussian Processes for Few-Shot Segmentation

Published 7 Oct 2021 in cs.CV | (2110.03674v2)

Abstract: Few-shot segmentation is a challenging dense prediction task, which entails segmenting a novel query image given only a small annotated support set. The key problem is thus to design a method that aggregates detailed information from the support set, while being robust to large variations in appearance and context. To this end, we propose a few-shot segmentation method based on dense Gaussian process (GP) regression. Given the support set, our dense GP learns the mapping from local deep image features to mask values, capable of capturing complex appearance distributions. Furthermore, it provides a principled means of capturing uncertainty, which serves as another powerful cue for the final segmentation, obtained by a CNN decoder. Instead of a one-dimensional mask output, we further exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP. Our approach sets a new state-of-the-art on the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, achieving an absolute gain of $+8.4$ mIoU in the COCO-20$^i$ 5-shot setting. Furthermore, the segmentation quality of our approach scales gracefully when increasing the support set size, while achieving robust cross-dataset transfer. Code and trained models are available at \url{https://github.com/joakimjohnander/dgpnet}.

Abstract PDF Upgrade to Chat

Citations (24)

View on Semantic Scholar

Summary

The paper introduces DGPNet, a novel approach that integrates dense Gaussian Process regression into CNNs for few-shot segmentation.
It captures complex appearance variations and uncertainty, achieving an 8.4 mIoU gain on benchmarks like PASCAL-5^i and COCO-20^i.
The method scales with larger support sets through learning the GP output space and effective uncertainty modeling for improved mask prediction.

Dense Gaussian Processes for Few-Shot Segmentation

The paper "Dense Gaussian Processes for Few-Shot Segmentation" (2110.03674) introduces a novel approach to few-shot segmentation (FSS) that leverages dense Gaussian process (GP) regression to address the challenges of limited annotated data and significant variations in appearance and context. The proposed method, DGPNet, learns the mapping from local deep image features to mask values using a dense GP, effectively capturing complex appearance distributions and providing a principled means of capturing uncertainty. By integrating the GP as a layer in a CNN and learning the output space of the GP, DGPNet achieves state-of-the-art results on the PASCAL- $5^i$ and COCO- $20^i$ benchmarks.

Addressing Challenges in Few-Shot Segmentation

FSS aims to segment novel query images given only a small annotated support set. A key challenge is designing a mechanism that effectively extracts and aggregates information from the support set while remaining robust to variations in appearance and context. Specifically, the design of this module must address three key challenges:

Flexibility of $f$ : The few-shot learner must be able to learn and represent complex functions to segment a wide range of classes unseen during training.
Scalability in support set size $K$ : The method should effectively leverage additional support samples to achieve better accuracy and robustness for larger support sets.
Uncertainty Modeling: The network needs to handle unseen appearances in the query images and assess the relevance of the information in the support images.

Dense Gaussian Process Few-Shot Learner

To address these challenges, the paper proposes a dense Gaussian Process (GP) learner for few-shot segmentation. A GP is used to learn a mapping between dense local deep feature vectors and their corresponding mask values. This approach provides several advantages:

Flexibility: GPs can represent flexible and highly non-linear functions by selecting an appropriate kernel, such as the squared exponential (SE) kernel.
Scalability: GPs effectively benefit from additional support samples since all given data is retained.
Uncertainty Modeling: GPs explicitly model the uncertainty in the prediction by learning a probabilistic distribution over a space of functions.
Figure 1: Performance of the proposed DGPNet approach on the PASCAL- $5^i$ and COCO- $20^i$ benchmarks, compared to the state-of-the-art.

The dense GP learner predicts a distribution of functions $f$ from the input features $x$ to an output $y$ . The key assumption is that the support and query outputs are jointly Gaussian:

$\begin{pmatrix}\mathbf{y}_\support \ \mathbf{y}_\query\end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \bm{\mu}_\support \ \bm{\mu}_\query \end{pmatrix}, \begin{pmatrix} \mathbf{K}_{\support\support} & \mathbf{K}_{\support\query} \ \mathbf{K}_{\support\query}\tp & \mathbf{K}_{\query\query} \end{pmatrix} \right)$.

The posterior probability distribution of the query outputs is then inferred based on the rules for conditioning in a joint Gaussian:

$\mathbf{y}_\query|\mathbf{y}_\support, \mathbf{x}_\support, \mathbf{x}_\query \sim \mathcal{N}(\bm{\mu}_{\query|\support},\bm{\Sigma}_{\query|\support})$,

where

$\bm{\mu}_{\query|\support} = \mathbf{K}_{\support\query}\tp(\mathbf{K}_{\support\support}+\sigma_y^2 \mathbf{I})^{-1} \mathbf{y}_\support$,

$\bm{\Sigma}_{\query|\support} = \mathbf{K}_{\query\query} - \mathbf{K}_{\support\query}\tp(\mathbf{K}_{\support\support}+\sigma_y^2 \mathbf{I})^{-1} \mathbf{K}_{\support\query}$.

Learning the GP Output Space and Final Mask Prediction

To further improve the performance, the paper introduces a method for learning the GP output space during the meta-training stage. A mask-encoder is trained to construct the outputs, allowing additional information to be encoded into the outputs to guide the decoder. The decoder transforms the query outputs predicted by the few-shot learner into an accurate mask by processing the GP output mean and covariance, as well as shallow image features.

Figure 2: Overview of the proposed approach, showing how support and query images are processed to infer the probability distribution of query mask encodings using Gaussian process regression.

The final mask is predicted by feeding the GP output mean $z_\mu$ , GP output covariance $z_\Sigma$ , and shallow image features $\mathbf{x}_{\query,\shallow}$ into a decoder $U$ :

$\widehat{M}_\query = U(z_\mu, z_\Sigma, \mathbf{x}_{\query,\shallow})$.

Experimental Results

The proposed DGPNet was evaluated on the PASCAL- $5^i$ and COCO- $20^i$ benchmarks. The results demonstrate that DGPNet outperforms existing methods for 5-shot by a significant margin, achieving a new state-of-the-art on both benchmarks. Specifically, DGPNet achieves an absolute gain of 8.4 mIoU for 5-shot segmentation on the COCO- $20^i$ benchmark compared to the best reported results in the literature when using the ResNet101 backbone.

Figure 3: Qualitative results of the approach with 1, 5, and 10 support samples, showing how performance improves with more support.

The paper also demonstrates the cross-dataset transfer capabilities of DGPNet from COCO- $20^i$ to PASCAL. Ablative studies are conducted to probe the effectiveness of different components, including the choice of kernel, the incorporation of predictive covariance, and the benefits of learning the GP output space. The study shows that the SE kernel greatly outperforms the linear kernel and that incorporating uncertainty and learning the GP output space further improves the results.

Figure 4: Qualitative comparison between the proposed model and a baseline in the 1-shot setting from the COCO- $20^i$ benchmark.

Figure 5: Qualitative results on challenging episodes in the 1-shot setting from the PASCAL- $5^i$ benchmark.

Figure 6: Qualitative results on challenging episodes in the 1-shot setting from the COCO- $20^i$ benchmark.

Conclusion

The dense GP learner based on Gaussian process regression offers a promising approach for few-shot segmentation (2110.03674). The GP models the support set in deep feature space, and its flexibility permits it to capture complex feature distributions. It makes probabilistic predictions on the query image, providing both a point estimate and additional uncertainty information. These predictions are fed into a CNN decoder that predicts the final segmentation. The resulting approach sets a new state-of-the-art on 5-shot PASCAL- $5^i$ and COCO- $20^i$ , with absolute improvements of up to 8.4 mIoU. The approach also scales well with larger support sets during inference, even when trained for a fixed number of shots. This work highlights the potential of GPs for few-shot learning and opens up new avenues for future research. Future work could explore more advanced kernel functions, investigate different architectures for the mask encoder and decoder, and apply the approach to other few-shot learning tasks.