Context Enhanced Transformer for Single Image Object Detection

Published 22 Dec 2023 in cs.CV (arXiv:2312.14492v2)

Abstract: With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that exploit temporal information. Existing video object detection (VOD) techniques employ various strategies to address this challenge, but they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple input frames and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single-image object detection, called Context Enhanced TRansformer (CETR), which incorporates temporal context into DETR using a newly designed memory module. To store temporal information efficiently, we construct a class-wise memory that collects contextual information across the data. Additionally, we present a classification-based sampling technique that selectively retrieves the memory entries relevant to the current image. At test time, we introduce a memory adaptation method that updates individual memory entries to reflect the test distribution. Experiments on the CityCam and ImageNet VID datasets demonstrate the efficiency of the framework across various video settings. The project page and code are available at: https://ku-cvlab.github.io/CETR.
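The abstract's class-wise memory, classification-based sampling, and test-time update can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the per-class EMA update rule, the top-k retrieval, and all names (`ClassWiseMemory`, `momentum`, `top_k`) are assumptions for exposition only.

```python
class ClassWiseMemory:
    """Hypothetical sketch: one prototype feature per class, updated with an
    exponential moving average; retrieval is driven by classification scores."""

    def __init__(self, num_classes, feat_dim, momentum=0.9):
        self.memory = [[0.0] * feat_dim for _ in range(num_classes)]
        self.momentum = momentum

    def update(self, class_id, feature):
        # EMA update of the stored prototype; at test time the same rule can
        # adapt the memory toward the test distribution (assumed behavior).
        m = self.momentum
        self.memory[class_id] = [m * old + (1 - m) * new
                                 for old, new in zip(self.memory[class_id], feature)]

    def sample(self, class_scores, top_k=2):
        # Classification-based sampling: return memories of the classes the
        # classifier deems most likely present in the current image.
        top = sorted(range(len(class_scores)),
                     key=lambda c: class_scores[c], reverse=True)[:top_k]
        return [self.memory[c] for c in top]


# Usage: update class 1's memory, then retrieve context for an image whose
# classifier assigns class 1 the highest score.
mem = ClassWiseMemory(num_classes=3, feat_dim=2)
mem.update(1, [2.0, 2.0])
ctx = mem.sample([0.1, 0.8, 0.05], top_k=1)
```

In the actual model the retrieved memories would condition the DETR decoder; here `ctx` simply holds the selected per-class prototypes.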


GitHub

  1. KU-CVLAB/CETR (17 stars)