What Makes Training Multi-Modal Classification Networks Hard?

Published 29 May 2019 in cs.CV and cs.LG (arXiv:1905.12681v5)

Abstract: Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
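
As a rough illustration of the blending idea in the abstract, the sketch below combines per-modality losses using weights derived from each modality's overfitting behavior. The abstract does not state the weighting rule, so the rule used here (generalization gain divided by squared overfitting increase, then normalized) and all names (`blend_weights`, `blended_loss`, `gains`, `overfits`) are assumptions for illustration, not the authors' implementation.

```python
def blend_weights(gains, overfits, eps=1e-8):
    """Return one weight per modality head, normalized to sum to 1.

    ASSUMPTION (not from the abstract): each head is scored as
    gain / overfit**2, where `gains` are hypothetical estimates of
    validation-loss improvement and `overfits` are hypothetical estimates
    of how much the train/validation gap grew over the same interval.
    """
    # Negative gains are clipped to zero so a head that got worse gets no weight.
    raw = [max(g, 0.0) / (o * o + eps) for g, o in zip(gains, overfits)]
    total = sum(raw)
    if total <= 0.0:
        # No head improved: fall back to a uniform blend.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def blended_loss(losses, weights):
    """Weighted sum of per-modality losses (e.g. audio, video, fused)."""
    return sum(w * l for w, l in zip(weights, losses))


# Hypothetical numbers for three heads: audio, video, audio+video.
weights = blend_weights(gains=[0.2, 0.1, 0.3], overfits=[0.1, 0.4, 0.2])
loss = blended_loss([1.0, 2.0, 3.0], weights)
```

Under this assumed rule, a head that generalizes well while overfitting little receives most of the weight, which matches the abstract's description of blending modalities according to how they overfit.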


Citations (394)


Summary

The paper provides detailed formatting guidelines for CVPR submissions, emphasizing uniform structure and strict page limits.
It outlines key specifications including two-column layouts, precise font choices, and margin settings to maintain professional consistency.
It highlights best practices for anonymizing manuscripts during blind review and suggests future automation for compliance checks.

An Analysis of the CVPR LaTeX Author Guidelines Paper

The document under consideration is a comprehensive guideline for authors preparing manuscripts for submission to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). This paper serves as a vital resource, detailing the formatting and submission requirements to ensure uniformity and consistency across conference proceedings.

Structural and Formatting Specifications

The guideline lays out the essential structural components of a CVPR manuscript. It requires papers to be written in English and sets an eight-page limit, excluding references. The limit is strictly enforced to protect the integrity of the review process, so that reviewers focus on the manuscript's core content rather than on extensive appendices. It also specifies formatting rules, including a two-column layout with precise dimensions and margins, to give all submitted papers a standardized appearance.

The document also covers font choice, typography, and spacing for titles, headings, and body text. Authors are directed to use Times-type fonts and are given specific font sizes for the main text, footnotes, and captions, all of which help maintain visual uniformity.

Typesetting and Anonymization

For mathematics and figures, the guidelines recommend using appropriate LaTeX commands and environments to seamlessly integrate them into the text. This integration is crucial as it maintains the document's readability while adhering to typographical standards. The included LaTeX commands serve as practical examples to support authors in achieving these formatting requirements effectively.

The paper also stresses the blind review process and the need for authors to anonymize their submissions properly. It corrects common misconceptions about anonymization, such as the belief that self-citations must be removed, and offers strategies for referencing previous work without compromising the impartiality of the review.

Practical and Theoretical Implications

From a practical perspective, the document gives authors detailed instructions for formatting their submissions accurately and efficiently. This consistency streamlines the review process and helps produce professional, cohesive conference proceedings. For researchers, especially those new to computer vision conferences, following such guidelines makes the peer review process go more smoothly.

Theoretically, the paper serves as a meta-commentary on the role of standardized guidelines in academic publishing. It reflects on the evolving norms of scholarly communication and the need for rigorous adherence to format standards to maintain the scientific record's credibility and accessibility.

Future Directions

While this document focuses on formatting requirements tailored to CVPR, it points to areas for future exploration in academic publishing. Automated tools for checking compliance could further assist authors. In addition, as visual data becomes more complex, formats that accommodate such material without departing from the standard could be a valuable extension.

Overall, the guideline is a fundamental part of the CVPR submission process and underscores the role of structured formatting in scholarly communication within the computer vision research community. Its detailed instructions reflect ongoing efforts to maintain quality in academia and to disseminate scientific knowledge efficiently.
