- The paper presents Platypus, a unified OCR model that leverages prompt-based methods to accurately process natural scenes, document images, cropped text, and formulas.
- It employs an encoder-decoder framework combining a Swin-B Transformer, a Feature Pyramid Network, and a novel prompt encoder for task-specific recognition.
- Experimental results demonstrate that Platypus outperforms specialized models on benchmarks like STS, STR, HTR, and MER, setting a new standard in OCR versatility.
Platypus: A Generalized Specialist Model for Reading Text in Various Forms
Overview
The paper introduces "Platypus," a novel model designed for reading text from images across various forms, whether in natural scenes or documents. This work addresses a known challenge in Optical Character Recognition (OCR): the dichotomy between specialist models, which exhibit high accuracy within narrow domains, and generalist models, which can handle diverse text forms but generally at the cost of accuracy and efficiency. Platypus aims to unify these strengths by being a generalized specialist model capable of achieving high accuracy and efficiency across multiple text reading tasks.
Motivation and Background
The task of reading text from images spans numerous applications, from archiving documents to real-time translation. Traditional OCR systems often involve separate models specialized for sub-tasks, such as scene text recognition, handwritten text recognition, and mathematical expression recognition. Despite their effectiveness within their domains, these specialist models lack the flexibility to generalize across the broader spectrum of text reading scenarios. The emergence of multimodal large language models (MLLMs) has shown potential for more holistic text reading, but with trade-offs in computational efficiency and accuracy. Platypus combines these approaches into a single, unified model leveraging their complementary advantages.
Methodology
Platypus operates on an encoder-decoder framework incorporating a novel prompt-based system for task specification. The model divides text reading tasks into four broad categories: natural scene full images, document full images, cropped text, and cropped formulas. It uses different prompts to specify recognition modes such as recognizing all text (RAT), point prompt recognition (PPR), and box prompt recognition (BPR).
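The four image categories and three recognition modes described above can be sketched as a small task-specification interface. This is an illustrative sketch only; the class and field names (`TaskPrompt`, `Scene`, `PromptType`) are assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Scene(Enum):
    """The four broad image categories Platypus distinguishes (names assumed)."""
    NATURAL_FULL = "natural_scene_full_image"
    DOCUMENT_FULL = "document_full_image"
    CROPPED_TEXT = "cropped_text"
    CROPPED_FORMULA = "cropped_formula"

class PromptType(Enum):
    """Recognition modes: read everything, or target a point / a box."""
    RAT = "recognize_all_text"
    PPR = "point_prompt_recognition"
    BPR = "box_prompt_recognition"

@dataclass
class TaskPrompt:
    scene: Scene
    prompt_type: PromptType
    point: Optional[Tuple[float, float]] = None          # (x, y) for PPR
    box: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2) for BPR

    def validate(self) -> None:
        """Spatial modes require the matching spatial argument."""
        if self.prompt_type is PromptType.PPR and self.point is None:
            raise ValueError("PPR requires a point")
        if self.prompt_type is PromptType.BPR and self.box is None:
            raise ValueError("BPR requires a box")

# e.g. "read the text inside this box of a natural-scene image"
prompt = TaskPrompt(Scene.NATURAL_FULL, PromptType.BPR, box=(10, 20, 200, 60))
prompt.validate()
```

One benefit of making the prompt an explicit structured object is that invalid combinations (e.g. a point-prompt request without a point) can be rejected before any inference runs.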
Image Encoder
The model employs a Swin-B Transformer, pretrained on ImageNet 22k, to encode images into multi-scale visual features enhanced by a Feature Pyramid Network (FPN).
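To make the FPN's role concrete, here is a minimal NumPy sketch of the standard top-down pathway: each backbone stage is projected to a shared channel width by a lateral 1x1 convolution (here a plain matrix multiply), then fused coarse-to-fine by upsampling and addition. The channel/resolution pyramid mimics Swin-B's four stages; this is a generic FPN illustration, not the paper's exact implementation.

```python
import numpy as np

def lateral(feat, w):
    """1x1 conv as channel mixing: (C_in, H, W) -> (C_out, H, W)."""
    c, h, wd = feat.shape
    return (w @ feat.reshape(c, -1)).reshape(w.shape[0], h, wd)

def upsample2x(feat):
    """Nearest-neighbour 2x spatial upsampling."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(feats, dims, out_dim=256, seed=0):
    """Top-down FPN fusion over a coarsest-to-finest feature pyramid."""
    rng = np.random.default_rng(seed)
    ws = [rng.standard_normal((out_dim, d)) * 0.01 for d in dims]
    merged = lateral(feats[-1], ws[-1])      # start at the coarsest level
    outs = [merged]
    for feat, w in zip(feats[-2::-1], ws[-2::-1]):
        merged = upsample2x(merged) + lateral(feat, w)
        outs.append(merged)
    return outs[::-1]                        # finest level first

# Swin-B-like pyramid: channels double while resolution halves each stage
stages = [(128, 64), (256, 32), (512, 16), (1024, 8)]
feats = [np.zeros((c, s, s)) for c, s in stages]
pyramid = fpn_merge(feats, [c for c, _ in stages])
print([f.shape for f in pyramid])
# every level now shares 256 channels at its original resolution
```

The key point the sketch shows is that the FPN leaves each scale at its own resolution while giving all scales a common channel width, so the decoder can attend to fine and coarse features uniformly.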
Prompt Encoder
The prompt encoder ingests various task-specific prompts, including scene categories and granularity levels. These prompts help the model correctly interpret and process the text reading task at hand, adding a layer of flexibility and specificity that traditional models often lack.
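One simple way such a prompt encoder can work, sketched below under assumed names and sizes: each discrete prompt attribute (scene category, granularity, recognition mode) indexes a learned embedding table, and the resulting vectors are summed into a single prompt embedding for the decoder. The paper's actual prompt encoding may differ.

```python
import numpy as np

class PromptEncoder:
    """Illustrative prompt encoder: sum of per-attribute embeddings."""

    def __init__(self, dim=256, seed=0):
        rng = np.random.default_rng(seed)
        # table sizes are assumptions: 4 scene categories, 2 granularity
        # levels, 3 recognition modes (RAT / PPR / BPR)
        self.scene_emb = rng.standard_normal((4, dim))
        self.granularity_emb = rng.standard_normal((2, dim))
        self.mode_emb = rng.standard_normal((3, dim))

    def encode(self, scene_id, granularity_id, mode_id):
        """Combine the attribute embeddings into one prompt vector."""
        return (self.scene_emb[scene_id]
                + self.granularity_emb[granularity_id]
                + self.mode_emb[mode_id])

enc = PromptEncoder()
e = enc.encode(scene_id=0, granularity_id=1, mode_id=2)
print(e.shape)  # a single fixed-size prompt embedding
```

Because the prompt is just another embedding, switching tasks at inference time costs nothing more than swapping a few table lookups.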
Recognition Decoder
Built on the Transformer decoder architecture, the recognition decoder generates the output text sequence autoregressively. It integrates visual features from the image encoder with the prompt embeddings to produce the final text recognition result.
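The generation loop of such a decoder can be sketched as greedy autoregressive decoding. The `step` function below is a deliberate stub standing in for the real decoder (which would cross-attend to the visual features and prompt embedding); only the surrounding loop structure reflects how decoding proceeds.

```python
import numpy as np

VOCAB = ["<eos>", "c", "a", "t"]  # toy vocabulary for illustration

def step(visual_feats, prompt_emb, prefix_ids):
    """Stub next-token scorer that spells out 'cat' then stops.
    In the real model this is a Transformer decoder step conditioned on
    the encoder features and the prompt embedding."""
    target = [1, 2, 3]  # token ids for "c", "a", "t"
    nxt = target[len(prefix_ids)] if len(prefix_ids) < len(target) else 0
    logits = np.full(len(VOCAB), -1e9)
    logits[nxt] = 0.0
    return logits

def greedy_decode(visual_feats, prompt_emb, max_len=16):
    """Generate tokens one at a time until <eos> or the length cap."""
    ids = []
    for _ in range(max_len):
        logits = step(visual_feats, prompt_emb, ids)
        nxt = int(np.argmax(logits))
        if nxt == 0:  # <eos>
            break
        ids.append(nxt)
    return "".join(VOCAB[i] for i in ids)

print(greedy_decode(None, None))  # -> "cat"
```

The loop makes explicit why a single decoder can serve every task: only the conditioning inputs (image features, prompt embedding) change between tasks, not the generation procedure.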
Experimental Evaluation
The Platypus model was trained on a comprehensive dataset named "Worms," composed of various subsets tailored to the distinct types of text reading tasks. The training involved a two-phase process: an initial pre-training phase focused on full-image data and a subsequent joint training phase integrating cropped-image data.
Datasets and Benchmarks
Experiments were conducted on widely-recognized benchmarks for Scene Text Spotting (STS), Scene Text Recognition (STR), Handwritten Text Recognition (HTR), and Mathematical Expression Recognition (MER). The evaluation employed various performance metrics, such as H-mean for STS and word accuracy ignoring case and symbols (WAICS) for STR and HTR.
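The two metrics named above are straightforward to sketch. WAICS lowercases both strings and strips non-alphanumeric characters before exact-match comparison, and H-mean is the harmonic mean of precision and recall. The normalization below is the commonly used convention; the paper's exact evaluation scripts may differ in detail.

```python
import re

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^0-9a-z]", "", text.lower())

def waics(preds, gts):
    """Word accuracy ignoring case and symbols over paired predictions."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, gts))
    return hits / len(gts)

def h_mean(precision, recall):
    """Harmonic mean of precision and recall, as used for spotting."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

preds = ["Hello!", "W0rld", "OCR."]
gts = ["hello", "world", "ocr"]
print(waics(preds, gts))   # 2/3: "W0rld" still mismatches "world"
print(h_mean(0.8, 0.6))    # ≈ 0.686
```

Note that WAICS forgives punctuation and casing errors but not character substitutions, which is why "W0rld" still counts as a miss.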
Key Findings
Platypus demonstrated superior performance across diverse text reading tasks. Notably, it outperformed state-of-the-art models specialized in STS, STR, HTR, and MER on multiple benchmarks. For instance, Platypus excelled on the Curated Artistic Text (CAT) Benchmark, showcasing its adeptness at handling multi-orientation, occlusion, and artistically rendered text images—a testament to its superior generalization capabilities.
Implications and Future Work
The findings from this research have significant implications for the development of more versatile and robust OCR systems. Platypus sets a new standard for text reading models by achieving high accuracy and efficiency across a broad range of tasks and text formats. Future research could extend Platypus's capabilities to multilingual text recognition and further refine its efficiency and accuracy for real-world applications.
In summary, Platypus bridges a crucial gap in OCR technology, providing a unified solution that does not compromise on accuracy or efficiency, whether dealing with printed text, handwritten text, or complex mathematical expressions. The introduction of Platypus is a meaningful contribution to the field, offering both practical and theoretical advancements in text understanding from images.