On depth prediction for autonomous driving using self-supervised learning

Published 10 Mar 2024 in cs.CV (arXiv:2403.06194v1)

Abstract: Perception of the environment is a critical component for enabling autonomous driving. It provides the vehicle with the ability to comprehend its surroundings and make informed decisions. Depth prediction plays a pivotal role in this process, as it aids understanding of the geometry and motion of the environment. This thesis focuses on the challenge of depth prediction using monocular self-supervised learning techniques. The problem is first approached from a broader perspective, exploring conditional generative adversarial networks (cGANs) as a potential technique to achieve better generalization. In doing so, a fundamental contribution to conditional GANs, the a contrario cGAN, was proposed. The second contribution is a single image-to-depth self-supervised method that addresses the rigid-scene assumption with a novel transformer-based model that outputs a pose for each dynamic object. The third contribution introduces a video-to-depth-map forecasting approach, extending self-supervised techniques to predict future depths via a novel transformer model capable of predicting the future depth of a given scene. Finally, the limitations of the aforementioned methods were addressed and a video-to-video depth-map model was proposed; it leverages the spatio-temporal consistency of the input and output sequences to predict a more accurate depth sequence. These methods have significant applications in autonomous driving (AD) and advanced driver assistance systems (ADAS).
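The self-supervised training signal the abstract refers to can be illustrated with a minimal sketch (not the thesis's actual implementation): given a predicted depth map and a relative camera pose between two video frames, pixels of the target frame are back-projected to 3D, re-projected into the source frame, and the photometric difference between the target and the warped source serves as the loss — no ground-truth depth is required. The function name, nearest-neighbour sampling, and the simple L1 error below are illustrative assumptions; published methods typically use bilinear sampling and an SSIM-weighted loss.

```python
import numpy as np

def photometric_reprojection_error(target, source, depth, K, R, t):
    """Warp `source` into the target view using predicted depth and the
    relative pose (R, t), then return the mean L1 photometric error
    against `target` over pixels that land inside the source image."""
    H, W = depth.shape
    K_inv = np.linalg.inv(K)

    # Pixel grid of the target view in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])

    # Back-project each pixel to 3D in the target camera, then move it
    # into the source camera frame with the relative pose.
    cam = (K_inv @ pix) * depth.ravel()
    cam_src = R @ cam + t[:, None]

    # Project the 3D points onto the source image plane.
    proj = K @ cam_src
    z = np.clip(proj[2], 1e-6, None)
    us, vs = proj[0] / z, proj[1] / z

    # Nearest-neighbour sampling (bilinear in practice) with a validity mask
    # for points that project outside the source image.
    ui, vi = np.round(us).astype(int), np.round(vs).astype(int)
    valid = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    warped = np.zeros(H * W)
    warped[valid] = source[vi[valid], ui[valid]]

    return np.abs(target.ravel() - warped)[valid].mean()
```

With the identity relative pose the warp maps every pixel to itself, so the error between a frame and itself is zero; in training, minimizing this error jointly over a depth network and a pose network provides the supervision.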
