Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for Ultra-High Resolution Segmentation
Abstract: Most existing ultra-high resolution (UHR) segmentation methods struggle to balance memory cost against local characterization accuracy. Our proposed Guided Patch-Grouping Wavelet Transformer (GPWFormer) takes both into account. GPWFormer is a Transformer ($\mathcal{T}$)-CNN ($\mathcal{C}$) mutual learning framework, where $\mathcal{T}$ takes the whole UHR image as input and harvests both local details and fine-grained long-range contextual dependencies, while $\mathcal{C}$ takes a downsampled image as input to learn category-wise deep context. For high inference speed and low computational complexity, $\mathcal{T}$ partitions the original UHR image into patches, groups them dynamically, and then learns low-level local details with the lightweight multi-head Wavelet Transformer (WFormer) network. Fine-grained long-range contextual dependencies are captured in the same process, since patches that are far apart in the spatial domain can be assigned to the same group. In addition, masks produced by $\mathcal{C}$ guide the patch grouping process, providing a heuristic decision. Congruence constraints between the two branches are also exploited to maintain spatial consistency among the patches. Overall, we stack the multi-stage process in a pyramid fashion. Experiments show that GPWFormer outperforms existing methods with significant improvements on five benchmark datasets.
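To make the mask-guided grouping idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the coarse mask from $\mathcal{C}$ has already been resized to the patch grid's pixel resolution, and it uses a majority-label rule to assign each patch to a group; the abstract does not specify the actual grouping rule. The function name `group_patches_by_mask` and its parameters are hypothetical.

```python
from collections import Counter, defaultdict


def group_patches_by_mask(mask, patch_size):
    """Group non-overlapping square patches by the majority label in a mask.

    mask: 2D list of integer class labels (hypothetical stand-in for the
          CNN branch output, assumed resized to the image grid).
    patch_size: patch side length; must divide both mask dimensions.
    Returns a dict mapping label -> list of (patch_row, patch_col) indices.
    """
    height, width = len(mask), len(mask[0])
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both mask dimensions")

    groups = defaultdict(list)
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Count the labels covered by this patch.
            labels = Counter(
                mask[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            )
            # Assumption: the majority label decides the group.
            majority_label = labels.most_common(1)[0][0]
            groups[majority_label].append((top // patch_size, left // patch_size))
    return dict(groups)


# Toy 4x4 mask: the two diagonal patches share label 0 although they are
# not adjacent, so they land in the same group.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
groups = group_patches_by_mask(mask, patch_size=2)
print(groups)
```

In this toy case, patches (0, 0) and (1, 1) fall into one group despite being spatially separated, which illustrates how grouping can expose long-range context to whatever network processes each group.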