- The paper proposes the Primitive REpresentation learning Network (PREN), which uses pooling and weighted aggregators to extract both global and sample-specific features from scene text images.
- It converts primitive representations into visual text representations using graph convolutional networks, preserving spatial relationships and enabling efficient, parallel decoding.
- Experiments on English and Chinese datasets show that PREN offers a strong speed-accuracy trade-off, while its enhanced version, PREN2D, achieves state-of-the-art recognition accuracy.
Primitive Representation Learning for Scene Text Recognition
The paper presents a novel approach to scene text recognition (STR) based on primitive representation learning. The method aims to overcome challenges typical of STR, particularly the diverse appearances and orientations of text found in natural images.
Core Contributions and Methodology
The study proposes a shift away from conventional CNN-RNN frameworks and attention-based encoder-decoder models toward a Primitive REpresentation learning Network (PREN). PREN uses graph convolutional networks (GCNs) to transform primitive representations extracted from scene text images into high-level visual text representations. The primitive representations serve as an intermediate step that captures the complex variability of scene text, enabling efficient recognition.
Aggregator Models: The methodology employs two types of aggregators, pooling and weighted, to learn primitive representations. The pooling aggregator extracts shared intrinsic structural information through global average pooling, while the weighted aggregator captures sample-specific information by learning dynamic aggregation weights. Together, these aggregators let the model capture both global features and the idiosyncratic characteristics of individual text instances, as sketched below.
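To make the two aggregators concrete, here is a minimal PyTorch sketch. All module names, dimensions, and design details are illustrative assumptions rather than the paper's exact implementation: the pooling aggregator applies global average pooling followed by a linear projection, and the weighted aggregator predicts one spatial weight map per primitive and takes weighted sums over the feature map.

```python
import torch
import torch.nn as nn

class PoolingAggregator(nn.Module):
    """Global primitive representations via global average pooling.
    Shapes are hypothetical: c_in feature channels -> (n_primitives, d_model)."""
    def __init__(self, c_in, n_primitives, d_model):
        super().__init__()
        self.proj = nn.Linear(c_in, n_primitives * d_model)
        self.n, self.d = n_primitives, d_model

    def forward(self, feat):                # feat: (B, C, H, W)
        pooled = feat.mean(dim=(2, 3))      # (B, C): global average pooling
        return self.proj(pooled).view(-1, self.n, self.d)  # (B, n, d)

class WeightedAggregator(nn.Module):
    """Sample-specific primitives: learn one spatial weight map per primitive
    and aggregate feature vectors with those dynamic weights."""
    def __init__(self, c_in, n_primitives, d_model):
        super().__init__()
        self.weight_conv = nn.Conv2d(c_in, n_primitives, kernel_size=1)
        self.value_conv = nn.Conv2d(c_in, d_model, kernel_size=1)

    def forward(self, feat):                # feat: (B, C, H, W)
        attn = self.weight_conv(feat).flatten(2).softmax(dim=-1)  # (B, n, H*W)
        vals = self.value_conv(feat).flatten(2)                   # (B, d, H*W)
        return torch.einsum('bnl,bdl->bnd', attn, vals)           # (B, n, d)
```

The key contrast is that the pooling aggregator's weights are uniform over spatial locations (shared structure), whereas the weighted aggregator's softmax maps depend on the input image (sample-specific structure).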
Graph-Based Visual Representations: The primitive representations are converted into visual text representations via GCNs, which preserve the spatial relationships and textual cues necessary for accurate STR. The resulting representations are then decoded in parallel rather than autoregressively, significantly improving computational efficiency; see the sketch following this paragraph.
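Below is a sketch of the GCN step and the parallel decoder, reusing the aggregator classes above. The single-layer design, shapes, and names are simplifying assumptions: a learned adjacency matrix mixes the n primitive representations into L character slots (L being the maximum text length), and a shared classifier scores all positions at once instead of stepping through them sequentially.

```python
class GCNDecoder(nn.Module):
    """Map n primitive representations to L visual text representations with a
    graph-convolution-style layer, Z = relu(A X W), where A (L x n) is a
    learned aggregation/adjacency matrix. A single layer is an assumption."""
    def __init__(self, n_primitives, max_len, d_model, n_classes):
        super().__init__()
        self.adj = nn.Parameter(torch.randn(max_len, n_primitives) * 0.02)
        self.mix = nn.Linear(d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, primitives):                       # (B, n, d)
        z = torch.relu(self.adj @ self.mix(primitives))  # (B, L, d)
        return self.classifier(z)  # (B, L, n_classes): all positions in parallel

# Usage with hypothetical sizes: fuse both aggregators, then decode in parallel.
feat = torch.randn(2, 256, 8, 32)            # stand-in for a backbone feature map
pool_agg = PoolingAggregator(256, n_primitives=5, d_model=384)
wgt_agg = WeightedAggregator(256, n_primitives=5, d_model=384)
primitives = pool_agg(feat) + wgt_agg(feat)  # global + sample-specific primitives
logits = GCNDecoder(5, max_len=25, d_model=384, n_classes=37)(primitives)
print(logits.shape)                          # torch.Size([2, 25, 37])
```

Because every output position is computed in one matrix product, decoding cost does not grow with sequence length the way recurrent or autoregressive attention decoders do, which is the efficiency argument the paper makes for parallel decoding.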
Experimental Results
Experimental validation was performed on several English and Chinese datasets, demonstrating the efficacy of the proposed network. PREN achieved competitive accuracy while maintaining efficiency, and the enhanced model, PREN2D, which integrates the visual text representations into a 2D-attention-based encoder-decoder framework, achieved further accuracy gains. The results show PREN's balanced trade-off between speed and accuracy, and PREN2D's state-of-the-art results across diverse datasets, particularly those featuring irregular text.
Implications and Future Directions
This primitive representation learning framework offers a viable way to handle the diverse orientations and appearances of text in natural scenes. By shifting toward graph-based representations and parallel decoding, the study opens avenues for further refinement of representation learning on high-dimensional visual data.
Potential extensions for future research include:
- Integration with Other Modalities: Combining primitive representation learning with other sensory data to enrich the text recognition process.
- Expansion in Different Languages: Extending the model’s applicability to more languages with complex character sets, leveraging its ability to handle diversity.
- Optimizations in Computational Costs: Further reducing computational overhead, especially for deployment on edge devices.
The paper thus lays the groundwork for advancing scene text recognition methodologies, emphasizing efficient representation learning with graph networks.