- The paper introduces PP-FormulaNet, a formula recognition model that achieves state-of-the-art accuracy (PP-FormulaNet-L) and high efficiency (PP-FormulaNet-S) through a formula mining system, weight interpolation, knowledge distillation, and multi-token parallel prediction.
- The model significantly boosts inference speed (up to 4.63x faster) using a multi-token parallel prediction technique in its decoder.
- Knowledge distillation enables the efficient PP-FormulaNet-S model to run 16 times faster than the previous state of the art while maintaining high accuracy.
PP-FormulaNet is a formula recognition model that balances accuracy and efficiency using several innovations. The model comes in two variants, PP-FormulaNet-L and PP-FormulaNet-S, designed for high-accuracy and high-efficiency scenarios, respectively (2503.18382).
The PP-FormulaNet architecture is based on an encoder-decoder framework and incorporates a formula mining system, weight interpolation, knowledge distillation, and multi-token prediction techniques.
A key component of PP-FormulaNet is its system for creating a large dataset of formula source code and image pairs, crucial for training a robust model. The system addresses the complexities of formula extraction from arXiv papers through several steps:
- Source Code Indexing and Sorting: This process ensures the correct order of source code, especially important for handling user-defined commands within LaTeX documents.
- Formula Source Code Extraction: The system expands the range of recognized formula identifiers (e.g., \begin{align}, \begin{eqnarray}) to capture a broader variety of formulas, while filtering out irrelevant content such as figures and tables from the source code.
- User-Defined Command Recovery: Custom commands (e.g., \newcommand) are parsed and restored using regular expressions and a stack algorithm to accommodate non-standard LaTeX syntax.
- Formula Source Code Normalization: The extracted formula source code is normalized using KaTeX to ensure consistency and compatibility.
- Formula Source Code Rendering: The cleaned formulas are rendered into images using pdflatex and fitz, creating pairs of source code and corresponding images.
This mining system constructs a dataset of 4 million formulas.
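The user-defined command recovery step can be illustrated with a minimal Python sketch. The function name, the example macro, and the restriction to zero-argument macros with brace-free bodies are simplifications for illustration; the paper's actual implementation uses a stack algorithm to handle macros with arguments and nested braces:

```python
import re

def expand_newcommands(preamble: str, formula: str) -> str:
    """Expand zero-argument \\newcommand macros found in a LaTeX preamble."""
    # Collect definitions of the form \newcommand{\name}{body};
    # bodies containing nested braces are out of scope for this sketch.
    macros = dict(re.findall(r"\\newcommand\{\\(\w+)\}\{([^{}]*)\}", preamble))

    # Substitute until the formula stops changing, so macros defined
    # in terms of other macros are fully expanded as well.
    prev = None
    while prev != formula:
        prev = formula
        for name, body in macros.items():
            # (?![A-Za-z]) prevents \RR from matching inside a longer command
            formula = re.sub(r"\\" + name + r"(?![A-Za-z])",
                             lambda m, b=body: b, formula)
    return formula

preamble = r"\newcommand{\RR}{\mathbb R}"
print(expand_newcommands(preamble, r"f: \RR \to \RR"))
# f: \mathbb R \to \mathbb R
```

Restoring macros this way before normalization ensures that formulas using author-specific shorthand still render with standard LaTeX commands.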
PP-FormulaNet is implemented with two specialized models:
- PP-FormulaNet-L: Designed for high accuracy, this model uses the Vary-VIT-B backbone from GOT-OCR2.0 as the vision encoder, paired with a 512-dimensional MBart Decoder.
- PP-FormulaNet-S: Optimized for high efficiency, this model employs the distilled PP-HGNetV2-B4 as the vision encoder and a 384-dimensional MBart Decoder.
Weight Interpolation
To address the challenges of directly loading pre-trained weights after model dimensionality reduction, PP-FormulaNet employs a weight interpolation technique. This is particularly relevant when reducing the dimensionality of the decoder (e.g., from 1024 dimensions to 512 or 384).
The method is based on the assumption that adjacent dimensions of attention weights in the decoder encode similar semantic information, allowing for the mixing of dimensional features through interpolation. The weights of linear and normalization layers are adaptively adjusted using nearest-neighbor interpolation techniques to ensure compatibility with the reduced-dimension model.
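The core idea can be sketched in a few lines of NumPy. This is an illustrative reduction under the stated assumption (adjacent dimensions carry similar semantics), not the paper's exact procedure:

```python
import numpy as np

def nn_interpolate_weight(w: np.ndarray, out_rows: int, out_cols: int) -> np.ndarray:
    """Shrink a pre-trained weight matrix by nearest-neighbor interpolation.

    Each row/column of the smaller matrix is taken from the nearest
    row/column of the original, so the reduced-dimension model can load
    an approximation of the pre-trained weights directly.
    """
    rows = np.round(np.linspace(0, w.shape[0] - 1, out_rows)).astype(int)
    cols = np.round(np.linspace(0, w.shape[1] - 1, out_cols)).astype(int)
    return w[np.ix_(rows, cols)]

# e.g. adapt a 1024x1024 attention projection for a 512-dimensional decoder
w_pretrained = np.random.randn(1024, 1024)
w_reduced = nn_interpolate_weight(w_pretrained, 512, 512)
print(w_reduced.shape)  # (512, 512)
```

The same row-index selection applies to normalization-layer parameters, which are one-dimensional.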
Weight interpolation facilitates the efficient utilization of pre-trained weights without requiring complete model retraining. Experiments have demonstrated that this technique improves the CPE-BLEU score from 0.7970 to 0.8445 on complex formula datasets. Using Vary-VIT-B as the backbone further increases the CPE-BLEU score to 0.9148.
Knowledge Distillation
Knowledge distillation is used to enhance the PP-FormulaNet-S backbone (PP-HGNetV2-B4), which has limited parameters and may lack domain-specific knowledge of mathematical notation. Knowledge from the Vary-VIT-B (teacher) is transferred to PP-HGNetV2-B4 (student) by freezing the teacher network and training the student network using feature-level supervision via a fully connected layer. A distillation loss is calculated as the L2 norm between the teacher's feature tensor and a linear projection of the student's feature tensor.
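The distillation objective described above can be written as a short NumPy sketch; the feature shapes in the example are illustrative, not the networks' actual dimensions:

```python
import numpy as np

def feature_distillation_loss(teacher_feat, student_feat, proj_w, proj_b):
    """L2 feature-distillation loss between teacher and projected student.

    The student's features are mapped through a fully connected layer
    (proj_w, proj_b) into the teacher's feature space; the loss is the
    L2 norm of the difference, averaged over the batch. The teacher is
    frozen, so gradients flow only into the student and the projection.
    """
    projected = student_feat @ proj_w + proj_b           # -> teacher's space
    diff = (projected - teacher_feat).reshape(teacher_feat.shape[0], -1)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Illustrative shapes: batch of 2, 196 patch tokens, teacher dim 768, student dim 512
teacher = np.random.randn(2, 196, 768)
student = np.random.randn(2, 196, 512)
w, b = np.random.randn(512, 768) * 0.01, np.zeros(768)
loss = feature_distillation_loss(teacher, student, w, b)
```

Minimizing this loss pushes the small backbone's features toward the teacher's, which is how the limited-capacity student acquires the teacher's domain knowledge of mathematical notation.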
This distillation process is trained on a corpus of 500,000 document samples. After knowledge distillation, PP-HGNetV2-B4 achieves effective feature extraction with only 15.6 million parameters, improving the formula recognition BLEU score from 80.87 to 84.32.
Multi-Token Parallel Prediction
To address the inference inefficiency of autoregressive models, which predict formula sequences token by token, PP-FormulaNet introduces a multi-token parallel prediction technique. A parallel causal mask enables the decoder to predict multiple tokens simultaneously in each step.
During training, the decoder predicts the characters at positions step+1 through 2*step, then 2*step+1 through 3*step, and so on, conditioned on the preceding characters. During inference, the initial characters are replicated step times as input, and the model predicts the next sequence of length step, reducing the number of decoding steps.
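One plausible construction of such a mask (an assumption for illustration; the paper does not spell out the exact form here) groups positions into blocks of `step` tokens and lets each position attend to its own block and all earlier ones:

```python
import numpy as np

def parallel_causal_mask(seq_len: int, step: int) -> np.ndarray:
    """Block-wise causal attention mask for multi-token parallel prediction.

    Positions are grouped into blocks of `step` tokens; mask[i, j] is True
    where query position i may attend to key position j. All tokens of a
    block share the same visible earlier context, so the whole block can
    be predicted in one decoding pass instead of one token at a time.
    """
    block = np.arange(seq_len) // step
    return block[:, None] >= block[None, :]

m = parallel_causal_mask(6, 3)
# tokens 0-2 form block 0 and tokens 3-5 block 1;
# block 1 attends to block 0 and to itself
```

With step = 1 this reduces to the standard lower-triangular causal mask, i.e., ordinary token-by-token autoregressive decoding.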
Experiments show that inference speed increases by 2.05 times, 2.86 times, 3.77 times, and 4.63 times when the parallel steps are set to 2, 3, 4, and 5, respectively. PP-FormulaNet-S employs this technique for faster inference, setting a parallel step count of 3 to balance accuracy and speed.
PP-FormulaNet-L achieves state-of-the-art accuracy, surpassing UniMERNet by 6% in average BLEU score across several test datasets. PP-FormulaNet-S offers a speed improvement, operating 16 times faster than UniMERNet in GPU inference time (batch size = 15).
Conclusion
PP-FormulaNet advances formula recognition by providing a balance of accuracy and efficiency. The innovations in data mining, model architecture, and training techniques make it suitable for applications with complex formulas and real-time processing needs.