Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

Published 26 Jul 2021 in cs.CV | (2107.12090v2)

Abstract: Although text recognition has significantly evolved over the years, state-of-the-art (SOTA) models still struggle in the wild because of complex backgrounds, varying fonts, uncontrolled illumination, distortions and other artefacts. This is because such models depend solely on visual information for text recognition and thus lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a role complementary to visual information. More specifically, we additionally utilize semantic information by proposing a multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that for text recognition, the prediction should be refined in a stage-wise manner. Our key contribution is therefore the design of a stage-wise unrolling attentional decoder in which the non-differentiability introduced by discretely predicted character labels must be bypassed for end-to-end training. While the first stage predicts using visual features, subsequent stages refine that prediction using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention along with dense and residual connections between different stages to deal with varying scales of character sizes, for better performance and faster convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
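
The abstract says that the non-differentiability caused by discrete character predictions must be bypassed, but it does not say how. One widely used option is a Gumbel-softmax relaxation, and the sketch below assumes that choice purely for illustration; it is not taken from the abstract. The function names, the 37-class alphabet, and the embedding size are all hypothetical. The idea is that a later stage receives a soft mixture of character embeddings instead of a hard argmax, so the step stays differentiable in a framework with automatic differentiation (NumPy is used here only to show the arithmetic).

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed sample from a categorical distribution (assumed technique)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
    u = np.clip(rng.uniform(size=logits.shape), 1e-10, 1.0 - 1e-10)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def soft_char_embedding(logits, embedding_table, tau=1.0, rng=None):
    """Probability-weighted mix of character embeddings, replacing argmax + lookup."""
    probs = gumbel_softmax(logits, tau, rng)  # shape (T, C)
    return probs @ embedding_table            # shape (T, D)

# Hypothetical sizes: 5 character positions, 37 classes, 16-dim embeddings.
rng = np.random.default_rng(0)
stage1_logits = rng.normal(size=(5, 37))
embedding_table = rng.normal(size=(37, 16))
semantic_input = soft_char_embedding(stage1_logits, embedding_table, tau=0.5, rng=rng)
```

A lower `tau` pushes each row of probabilities closer to one-hot, which brings the soft embedding closer to a hard lookup at the cost of noisier gradients.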


Citations (54)


Summary

The paper provides detailed LaTeX author guidelines for submitting to ICCV proceedings, covering formatting, paper length limits, and submission policies.
It specifies technical requirements like two-column layout, font sizes, margins, handling figures, mathematical notation, and adherence to blind review rules.
These guidelines ensure standardization across submissions, improving readability for reviewers and contributing to the quality of academic discourse at the conference.

Overview of LaTeX Author Guidelines for ICCV Proceedings

The presented paper offers meticulous guidelines for authors preparing submissions for the International Conference on Computer Vision (ICCV), with specific attention to using LaTeX for document creation. While the paper primarily serves a practical role in facilitating adherence to conference formatting protocols, it also provides insight into the broader considerations involved in academic publication processes, including dual submission policies, blind review nuances, and mathematical notation.

Content Specification

The document highlights several critical aspects of formatting and submission:

Language and Submission Policies: It mandates English as the submission language and provides detailed instructions on dual submission protocols to maintain the integrity and originality of conference materials.
Paper Length and Review Process: Authors are advised on the eight-page limit for the main content, excluding references, with stern warnings against attempting to manipulate formatting to extend beyond the prescribed limit. This ensures equitable treatment of all submissions within the review process.
Formatting and Style Guidelines: The guidelines specify a two-column layout, font types and sizes, and precise margin settings. The printed ruler included in the LaTeX template lets reviewers refer to specific lines without ambiguity.
Mathematical Notation and Blind Review Details: From rigorous section and equation numbering to appropriate citation practices during blind review, the guidelines ensure clarity and impartiality in scientific communication.
Technical Elements: The paper addresses handling figures and illustrations, emphasizing the importance of ensuring clarity in printed formats, which is particularly pertinent given the technical nature of ICCV contributions.

Implications and Future Perspectives

While the paper itself does not present novel research findings, its utility in ensuring standardization across submissions should not be underestimated. Consistent formatting improves accessibility and readability, permitting reviewers and the broader scientific community to focus on content quality and contribution without being distracted by inconsistencies in presentation.

The guidelines also serve an educational role, equipping both novice and seasoned researchers with frameworks essential for successful scientific writing. As the landscape of AI and computer vision continues to evolve, future iterations of these guidelines may incorporate considerations for ethical AI research, open-access dissemination, or the inclusion of multimedia elements, reflecting changing standards in technology and publication.

In conclusion, the guidelines in this document are straightforward, but they set out the basic practices needed for high-quality academic discourse and ensure that the technical rigor of ICCV submissions is matched by equally rigorous presentation standards.
