The paper presents Sketchformer, a transformer-based network for learning representations of free-hand sketches captured as sequences of vector strokes. The authors address a limitation of recurrent neural network (RNN) architectures such as SketchRNN by exploiting the transformer's ability to model long-range dependencies within a sequence. Sketchformer is evaluated across multiple applications: sketch classification, sketch-based image retrieval (SBIR), and generative tasks such as sketch reconstruction and interpolation.
Methodology
Sketchformer builds on the Transformer architecture introduced by Vaswani et al., and the authors evaluate three variants of the input representation: a continuous stroke representation, dictionary-learned tokenization, and spatial grid tokenization. These variants trade off the temporal and spatial structure inherent in sketch data in different ways.
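The paper does not reproduce its tokenization code, but the spatial grid variant can be illustrated with a minimal sketch: each stroke point is mapped to the ID of the grid cell it falls into. The function name, the assumption of coordinates normalized to [0, 1], and the default grid size are illustrative choices, not the paper's implementation.

```python
import numpy as np

def grid_tokenize(points, grid_size=16):
    """Map normalized (x, y) stroke points in [0, 1]^2 to discrete cell
    token IDs on a grid_size x grid_size grid, numbered row-major."""
    pts = np.clip(np.asarray(points, dtype=float), 0.0, 1.0 - 1e-9)
    cells = (pts * grid_size).astype(int)          # (N, 2) column/row indices
    return cells[:, 1] * grid_size + cells[:, 0]   # row-major token IDs
```

Quantizing points this way yields a fixed, compact vocabulary (here 256 tokens) at the cost of spatial precision, which is the central trade-off among the three representations.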
The core innovation lies in employing the transformer's self-attention mechanism to create a dense sketch embedding from stroke sequences. By adapting the encoder-decoder structure of the transformer, Sketchformer learns a compact and effective representation that encodes the salient features of sketches within a multi-task learning framework.
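To make the mechanism concrete, the following is a minimal numpy sketch of scaled dot-product self-attention pooled into a fixed-length embedding. It omits the learned query/key/value projections, multiple heads, and positional encodings of the real architecture; the mean-pooling step is an illustrative simplification, not the paper's exact embedding scheme.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity projections:
    every stroke step attends to every other step, giving each position
    access to long-range context in one operation."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # (T, T) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ X                              # (T, d) context-mixed features

def sketch_embedding(X):
    """Pool the attended sequence into one fixed-length sketch vector."""
    return self_attention(X).mean(axis=0)
```

Because attention connects all positions directly, path length between distant strokes is constant, unlike an RNN where information must survive many recurrent steps.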
Key Results
- Sketch Classification: Sketchformer outperforms prior methods on the publicly available QuickDraw dataset. The variant using dictionary-learned tokenization (TForm-Tok-Dict) achieves a mean average precision (mAP) increase of 6% over RNN-based methods such as SketchRNN and its derivatives, confirming the transformer's effectiveness at capturing intricate structure within sketches.
- Generative Sketch Modeling: The authors report significant improvements in reconstruction accuracy for complex sketches with long stroke sequences. Tokenized representations, particularly TForm-Tok-Dict, enable stable and plausible interpolations both within and between sketch classes. These findings underscore the transformer's strength on generative tasks that are traditionally challenging for LSTM-based models, whose capacity to model long sequences is limited.
- Sketch-Based Image Retrieval (SBIR): For SBIR, Sketchformer's sketch embedding is paired with an image embedding, yielding improved retrieval performance on large datasets such as Stock10M. Training uses a triplet-based approach that aligns sketch and image data within a shared embedding space, improving retrieval precision over competing methods.
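The triplet objective behind the shared sketch/image embedding space can be sketched in a few lines: a sketch (anchor) is pulled toward its matching image (positive) and pushed away from a non-matching image (negative) by at least a margin. The Euclidean distance and the margin value are generic choices for illustration, not the paper's exact configuration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on embedding vectors: zero once the matching
    pair is closer than the non-matching pair by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # sketch-to-matching-image distance
    d_neg = np.linalg.norm(anchor - negative)  # sketch-to-wrong-image distance
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this loss over many (sketch, image+, image-) triplets is what places both modalities in a common space where nearest-neighbor search performs retrieval.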
Implications and Future Directions
Sketchformer demonstrates the promise of transformer architectures for tasks beyond language modeling, particularly in domains that require modeling complex temporal-spatial interactions. The findings suggest promising applications in sketch synthesis, cross-modal retrieval, and beyond.
Transformer-based stroke sequence modeling could progress further by exploring the broader applicability of the continuous variant and by carrying advances from natural language processing over into vector graphics processing. This could in turn enable novel applications that fuse sketches with other modalities, such as text or photographs, providing fertile ground for research into sketch-driven generative tasks like image creation.
Overall, Sketchformer offers a compelling alternative to traditional RNN-based approaches, unlocking new possibilities for sketch understanding and manipulation through a robust, transformer-based modeling paradigm.