Robust Scene Text Recognition with Automatic Rectification

Published 12 Mar 2016 in cs.CV | (1603.03915v2)

Abstract: Recognizing text in natural images is a challenging task with many unsolved problems. Different from those in documents, words in natural images often possess irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text. RARE is a specially-designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN). In testing, an image is firstly rectified via a predicted Thin-Plate-Spline (TPS) transformation, into a more "readable" image for the following SRN, which recognizes text through a sequence recognition approach. We show that the model is able to recognize several types of irregular text, including perspective text and curved text. RARE is end-to-end trainable, requiring only images and associated text labels, making it convenient to train and deploy the model in practical systems. State-of-the-art or highly-competitive performance achieved on several benchmarks well demonstrates the effectiveness of the proposed model.

Summary

The paper introduces RARE, which pairs a Spatial Transformer Network that rectifies irregular text in natural images with an attention-based Sequence Recognition Network that reads the rectified result.
It is trained end to end from images and text labels alone, with no extra geometric annotations, and performs well on benchmarks such as IIIT5K and SVT-Perspective.
By handling perspective distortion and curved text, RARE reaches state-of-the-art or highly competitive accuracy against earlier scene text recognizers, which makes it practical to deploy in real systems.

Robust Scene Text Recognition with Automatic Rectification

The paper "Robust Scene Text Recognition with Automatic Rectification" introduces RARE, a model for recognizing irregular text in natural images. Unlike document OCR, which mostly deals with regular, horizontally aligned text, scene text is often distorted by perspective or laid out along a curve. RARE addresses this by placing a Spatial Transformer Network (STN) in front of a Sequence Recognition Network (SRN), so that the image is rectified before it is read.

Core Contributions

RARE's architecture brings several key innovations:

Integration of STN and SRN: The STN predicts a Thin-Plate-Spline (TPS) transformation from the input image, by regressing a set of fiducial points, and uses it to rectify the image into a form that is easier to recognize. Because a TPS is a non-rigid warp, the same mechanism covers perspective distortion as well as curved text.
End-to-End Trainability: RARE is trained end to end from images and their text labels alone, with no additional geometric annotations. This works because every stage of the STN and the SRN is differentiable, so the gradient of the recognition loss back-propagates through the image sampler and the TPS grid into the network that predicts the transformation.
Attention-Based Sequence Recognition: The SRN is an encoder-decoder. A convolutional-recurrent encoder turns the rectified image into a sequence of feature vectors, and an attention-based recurrent decoder generates the character sequence, focusing on the relevant part of that sequence at each step.
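
The rectification step can be illustrated with a minimal NumPy sketch. It assumes fiducial points are given in pixel coordinates as (x, y) pairs, with half of them evenly spaced along the top edge of the rectified image and half along the bottom. In RARE the source points on the input image would be predicted by the STN; here they are supplied by hand. All function names (`base_fiducials`, `fit_tps`, `tps_map`, `bilinear_sample`, `rectify`) are hypothetical and chosen for illustration, and this is not the authors' implementation.

```python
import numpy as np

def base_fiducials(k, out_h, out_w):
    """k fixed points on the rectified image: k/2 on the top edge, k/2 on the bottom."""
    xs = np.linspace(0.0, out_w - 1.0, k // 2)
    top = np.stack([xs, np.zeros_like(xs)], axis=1)
    bottom = np.stack([xs, np.full_like(xs, out_h - 1.0)], axis=1)
    return np.vstack([top, bottom])  # shape (k, 2), columns are (x, y)

def tps_kernel(sq_dist):
    """TPS radial basis d^2 * log(d^2), taken to be 0 where d = 0."""
    safe = np.where(sq_dist > 0, sq_dist, 1.0)
    return safe * np.log(safe)

def fit_tps(base, src):
    """Solve for TPS parameters that map each base point onto its source point."""
    k = base.shape[0]
    sq = ((base[:, None, :] - base[None, :, :]) ** 2).sum(-1)
    affine = np.hstack([np.ones((k, 1)), base])   # columns [1, x, y]
    system = np.zeros((k + 3, k + 3))
    system[:k, :k] = tps_kernel(sq)
    system[:k, k:] = affine
    system[k:, :k] = affine.T                     # side conditions on the weights
    target = np.zeros((k + 3, 2))
    target[:k] = src
    return np.linalg.solve(system, target)        # shape (k + 3, 2)

def tps_map(params, base, query):
    """Map query points on the rectified image to locations on the input image."""
    sq = ((query[:, None, :] - base[None, :, :]) ** 2).sum(-1)
    feats = np.hstack([tps_kernel(sq), np.ones((len(query), 1)), query])
    return feats @ params

def bilinear_sample(img, pts):
    """Sample a grayscale image at real-valued (x, y) points, clamped to the border."""
    h, w = img.shape
    x = np.clip(pts[:, 0], 0, w - 1)
    y = np.clip(pts[:, 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)

def rectify(img, src_pts, out_h, out_w):
    """Warp img so that src_pts land on the evenly spaced base fiducial points."""
    base = base_fiducials(len(src_pts), out_h, out_w)
    params = fit_tps(base, src_pts)
    gx, gy = np.meshgrid(np.arange(out_w), np.arange(out_h))
    query = np.stack([gx.ravel(), gy.ravel()], axis=1).astype(float)
    return bilinear_sample(img, tps_map(params, base, query)).reshape(out_h, out_w)

if __name__ == "__main__":
    image = np.random.default_rng(0).random((32, 120))
    # Hand-placed source points that bend upward in the middle, imitating curved text.
    src = base_fiducials(10, 32, 100)
    src[:, 1] += 4.0 * np.sin(np.pi * src[:, 0] / 99.0)
    print(rectify(image, src, 32, 100).shape)
```

Every operation here (the linear solve, the kernel features, the bilinear interpolation) is differentiable with respect to the source points, which is the property that lets a recognition loss train the point predictor without geometric labels.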

Experimental Insights

Extensive evaluations on several benchmarks underscore RARE's efficacy:

Performance: RARE reaches state-of-the-art or highly competitive accuracy across the evaluated benchmarks. On general-purpose benchmarks such as IIIT5K it is competitive with prior methods, and its clearest gains appear on benchmarks built around irregular text.
Comparison with Baselines: Without a lexicon, RARE outperformed previous architectures such as CRNN, especially on SVT-Perspective and CUTE80, which target perspective and curved text respectively.
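
The difference between lexicon-free and lexicon-based recognition can be sketched as follows. This is a toy illustration under simplifying assumptions, not the paper's decoder: `step_probs` is a hypothetical list of per-step character distributions treated as independent, whereas a real attention decoder conditions each step on the characters already emitted. Lexicon-free decoding takes the most likely symbol at each step, while lexicon-based decoding picks the candidate word with the highest total log-probability.

```python
import math

EOS = "<eos>"  # hypothetical end-of-sequence symbol

def word_log_prob(step_probs, word):
    """Sum of per-step log-probabilities for spelling `word` and then stopping."""
    symbols = list(word) + [EOS]
    if len(symbols) > len(step_probs):
        return -math.inf
    total = 0.0
    for probs, symbol in zip(step_probs, symbols):
        p = probs.get(symbol, 0.0)
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total

def decode_free(step_probs):
    """Lexicon-free: take the most likely symbol at each step until EOS."""
    chars = []
    for probs in step_probs:
        best = max(probs, key=probs.get)
        if best == EOS:
            break
        chars.append(best)
    return "".join(chars)

def decode_with_lexicon(step_probs, lexicon):
    """Lexicon-based: choose the lexicon word with the highest log-probability."""
    return max(lexicon, key=lambda w: word_log_prob(step_probs, w))

# Toy distributions in which the third character is ambiguous between "l" and "t".
step_probs = [
    {"c": 0.6, "e": 0.4},
    {"a": 0.9, "o": 0.1},
    {"l": 0.55, "t": 0.45},
    {EOS: 1.0},
]
print(decode_free(step_probs))                                 # cal
print(decode_with_lexicon(step_probs, ["cat", "eat", "cot"]))  # cat
```

The lexicon-free setting is the harder one, since a single wrong character makes the whole word count as an error, which is why results without a lexicon are the more telling comparison.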

Implications and Future Directions

The findings suggest several implications:

Practical Deployments: Because it needs only images and text labels for training and tolerates distorted input, RARE suits real-world applications where text appears skewed or curved because of viewing angle or layout.
Potential Integrations: Future work could explore integrating RARE with text detection systems to build comprehensive end-to-end solutions for scene text reading.

The research advances irregular text recognition by unifying rectification and recognition in a single trainable model. Its combination of an STN extended with TPS transformations and attention-based sequence recognition provides a foundation for further work on scene text recognition.