- The paper presents Platypus, a unified OCR model that leverages prompt-based methods to accurately process natural scenes, document images, cropped text, and formulas.
- It employs an encoder-decoder framework combining a Swin-B Transformer, a Feature Pyramid Network, and a novel prompt encoder for task-specific recognition.
- Experimental results demonstrate that Platypus outperforms specialized models on benchmarks like STS, STR, HTR, and MER, setting a new standard in OCR versatility.
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Overview
The paper introduces "Platypus," a novel model designed for reading text from images across various forms, whether in natural scenes or documents. This work addresses a known challenge in Optical Character Recognition (OCR): the dichotomy between specialist models, which exhibit high accuracy within narrow domains, and generalist models, which can handle diverse text forms but generally at the cost of accuracy and efficiency. Platypus aims to unify these strengths by being a generalized specialist model capable of achieving high accuracy and efficiency across multiple text reading tasks.
Motivation and Background
The task of reading text from images spans numerous applications, from archiving documents to real-time translation. Traditional OCR systems often involve separate models specialized for sub-tasks, such as scene text recognition, handwritten text recognition, and mathematical expression recognition. Despite their effectiveness within their domains, these specialist models lack the flexibility to generalize across the broader spectrum of text reading scenarios. The emergence of multimodal large language models (MLLMs) has shown potential for more holistic text reading, but with trade-offs in computational efficiency and accuracy. Platypus combines these approaches into a single, unified model leveraging their complementary advantages.
Methodology
Platypus operates on an encoder-decoder framework incorporating a novel prompt-based system for task specification. The model divides text reading tasks into four broad categories: natural scene full images, document full images, cropped text, and cropped formulas. It uses different prompts to specify recognition modes such as recognizing all text (RAT), point prompt recognition (PPR), and box prompt recognition (BPR).
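The four image categories and three recognition modes described above can be sketched as a small task-specification interface. This is an illustrative sketch only; the class and field names (`TaskPrompt`, `Scene`, `PromptType`) are assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Scene(Enum):
    """The four broad image categories Platypus distinguishes (names assumed)."""
    NATURAL_FULL = "natural_scene_full_image"
    DOCUMENT_FULL = "document_full_image"
    CROPPED_TEXT = "cropped_text"
    CROPPED_FORMULA = "cropped_formula"

class PromptType(Enum):
    """Recognition modes: read everything, or target a point / a box."""
    RAT = "recognize_all_text"
    PPR = "point_prompt_recognition"
    BPR = "box_prompt_recognition"

@dataclass
class TaskPrompt:
    scene: Scene
    prompt_type: PromptType
    point: Optional[Tuple[float, float]] = None          # (x, y) for PPR
    box: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2) for BPR

    def validate(self) -> None:
        """Spatial modes require the matching spatial argument."""
        if self.prompt_type is PromptType.PPR and self.point is None:
            raise ValueError("PPR requires a point")
        if self.prompt_type is PromptType.BPR and self.box is None:
            raise ValueError("BPR requires a box")

# e.g. "read the text inside this box of a natural-scene image"
prompt = TaskPrompt(Scene.NATURAL_FULL, PromptType.BPR, box=(10, 20, 200, 60))
prompt.validate()
```

One benefit of making the prompt an explicit structured object is that invalid combinations (e.g. a point-prompt request without a point) can be rejected before any inference runs.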
Image Encoder
The model employs a Swin-B Transformer, pretrained on ImageNet 22k, to encode images into multi-scale visual features enhanced by a Feature Pyramid Network (FPN).
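To make the FPN's role concrete, here is a minimal NumPy sketch of the standard top-down pathway: each backbone stage is projected to a shared channel width by a lateral 1x1 convolution (here a plain matrix multiply), then fused coarse-to-fine by upsampling and addition. The channel/resolution pyramid mimics Swin-B's four stages; this is a generic FPN illustration, not the paper's exact implementation.

```python
import numpy as np

def lateral(feat, w):
    """1x1 conv as channel mixing: (C_in, H, W) -> (C_out, H, W)."""
    c, h, wd = feat.shape
    return (w @ feat.reshape(c, -1)).reshape(w.shape[0], h, wd)

def upsample2x(feat):
    """Nearest-neighbour 2x spatial upsampling."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(feats, dims, out_dim=256, seed=0):
    """Top-down FPN fusion over a coarsest-to-finest feature pyramid."""
    rng = np.random.default_rng(seed)
    ws = [rng.standard_normal((out_dim, d)) * 0.01 for d in dims]
    merged = lateral(feats[-1], ws[-1])      # start at the coarsest level
    outs = [merged]
    for feat, w in zip(feats[-2::-1], ws[-2::-1]):
        merged = upsample2x(merged) + lateral(feat, w)
        outs.append(merged)
    return outs[::-1]                        # finest level first

# Swin-B-like pyramid: channels double while resolution halves each stage
stages = [(128, 64), (256, 32), (512, 16), (1024, 8)]
feats = [np.zeros((c, s, s)) for c, s in stages]
pyramid = fpn_merge(feats, [c for c, _ in stages])
print([f.shape for f in pyramid])
# every level now shares 256 channels at its original resolution
```

The key point the sketch shows is that the FPN leaves each scale at its own resolution while giving all scales a common channel width, so the decoder can attend to fine and coarse features uniformly.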
Prompt Encoder
The prompt encoder ingests various task-specific prompts, including scene categories and granularity levels. These prompts help the model correctly interpret and process the text reading task at hand, adding a layer of flexibility and specificity that traditional models often lack.
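One simple way such a prompt encoder can work, sketched below under assumed names and sizes: each discrete prompt attribute (scene category, granularity, recognition mode) indexes a learned embedding table, and the resulting vectors are summed into a single prompt embedding for the decoder. The paper's actual prompt encoding may differ.

```python
import numpy as np

class PromptEncoder:
    """Illustrative prompt encoder: sum of per-attribute embeddings."""

    def __init__(self, dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # table sizes are assumptions: 4 scene categories, 2 granularity
        # levels, 3 recognition modes (RAT / PPR / BPR)
        self.scene_emb = rng.standard_normal((4, dim))
        self.granularity_emb = rng.standard_normal((2, dim))
        self.mode_emb = rng.standard_normal((3, dim))

    def encode(self, scene_id, granularity_id, mode_id):
        """Combine the attribute embeddings into one prompt vector."""
        return (self.scene_emb[scene_id]
                + self.granularity_emb[granularity_id]
                + self.mode_emb[mode_id])

enc = PromptEncoder()
e = enc.encode(scene_id=0, granularity_id=1, mode_id=2)
print(e.shape)  # a single fixed-size prompt embedding
```

Because the prompt is just another embedding, switching tasks at inference time costs nothing more than swapping a few table lookups.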
Recognition Decoder
Built on the Transformer decoder architecture, the recognition decoder generates the output text sequence autoregressively. It integrates visual features from the image encoder with the prompt embeddings to produce the final text recognition result.
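The generation loop of such a decoder can be sketched as greedy autoregressive decoding. The `step` function below is a deliberate stub standing in for the real decoder (which would cross-attend to the visual features and prompt embedding); only the surrounding loop structure reflects how decoding proceeds.

```python
import numpy as np

VOCAB = ["<eos>", "c", "a", "t"]  # toy vocabulary for illustration

def step(visual_feats, prompt_emb, prefix_ids):
    """Stub next-token scorer that spells out 'cat' then stops.
    In the real model this is a Transformer decoder step conditioned on
    the encoder features and the prompt embedding."""
    target = [1, 2, 3]  # token ids for "c", "a", "t"
    nxt = target[len(prefix_ids)] if len(prefix_ids) < len(target) else 0
    logits = np.full(len(VOCAB), -1e9)
    logits[nxt] = 0.0
    return logits

def greedy_decode(visual_feats, prompt_emb, max_len=16):
    """Generate tokens one at a time until <eos> or the length cap."""
    ids = []
    for _ in range(max_len):
        logits = step(visual_feats, prompt_emb, ids)
        nxt = int(np.argmax(logits))
        if nxt == 0:  # <eos>
            break
        ids.append(nxt)
    return "".join(VOCAB[i] for i in ids)

print(greedy_decode(None, None))  # -> "cat"
```

The loop makes explicit why a single decoder can serve every task: only the conditioning inputs (image features, prompt embedding) change between tasks, not the generation procedure.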
Experimental Evaluation
The Platypus model was trained on a comprehensive dataset named "Worms," composed of various subsets tailored to the distinct types of text reading tasks. The training involved a two-phase process: an initial pre-training phase focused on full-image data and a subsequent joint training phase integrating cropped-image data.
Datasets and Benchmarks
Experiments were conducted on widely-recognized benchmarks for Scene Text Spotting (STS), Scene Text Recognition (STR), Handwritten Text Recognition (HTR), and Mathematical Expression Recognition (MER). The evaluation employed various performance metrics, such as H-mean for STS and word accuracy ignoring case and symbols (WAICS) for STR and HTR.
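The two metrics named above are straightforward to sketch. WAICS lowercases both strings and strips non-alphanumeric characters before exact-match comparison, and H-mean is the harmonic mean of precision and recall. The normalization below is the commonly used convention; the paper's exact evaluation scripts may differ in detail.

```python
import re

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^0-9a-z]", "", text.lower())

def waics(preds, gts):
    """Word accuracy ignoring case and symbols over paired predictions."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, gts))
    return hits / len(gts)

def h_mean(precision, recall):
    """Harmonic mean of precision and recall, as used for spotting."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

preds = ["Hello!", "W0rld", "OCR."]
gts = ["hello", "world", "ocr"]
print(waics(preds, gts))   # 2/3: "W0rld" still mismatches "world"
print(h_mean(0.8, 0.6))    # ≈ 0.686
```

Note that WAICS forgives punctuation and casing errors but not character substitutions, which is why "W0rld" still counts as a miss.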
Key Findings
Platypus demonstrated superior performance across diverse text reading tasks. Notably, it outperformed state-of-the-art models specialized in STS, STR, HTR, and MER on multiple benchmarks. For instance, Platypus excelled on the Curated Artistic Text (CAT) Benchmark, showcasing its adeptness at handling multi-orientation, occlusion, and artistically rendered text images—a testament to its superior generalization capabilities.
Implications and Future Work
The findings from this research have significant implications for the development of more versatile and robust OCR systems. Platypus sets a new standard for text reading models by achieving high accuracy and efficiency across a broad range of tasks and text formats. Future research could extend Platypus's capabilities to multilingual text recognition and further refine its efficiency and accuracy for real-world applications.
In summary, Platypus bridges a crucial gap in OCR technology, providing a unified solution that does not compromise on accuracy or efficiency, whether dealing with printed text, handwritten text, or complex mathematical expressions. The introduction of Platypus is a meaningful contribution to the field, offering both practical and theoretical advancements in text understanding from images.