
Pyramid VisionLLaMA: Hierarchical Vision Transformer

Updated 28 December 2025
  • Pyramid VisionLLaMA is a hierarchical vision transformer that adapts the LLaMA block for handling high-resolution images with a multi-stage pyramid structure.
  • It interleaves local and global self-attention with SwiGLU and LayerNorm, achieving competitive performance in tasks like classification, segmentation, detection, and image generation.
  • The architecture supports flexible input sizes using 2D rotary position encoding and efficient downsampling, paving the way for unified and scalable vision models.

Pyramid VisionLLaMA refers to a hierarchical vision transformer architecture designed as an adaptation of the LLaMA transformer block for vision tasks, employing a pyramid (multi-stage) spatial hierarchy. Its design enables processing of high-resolution images and supports a wide range of vision applications including classification, segmentation, detection, and image generation. Pyramid VisionLLaMA originates from the work by the Meituan-AutoML group, unifying vision model design under a LLaMA-style backbone while leveraging the strengths of hierarchical feature representations pervasive in earlier convolutional and transformer-based vision systems (Chu et al., 2024).

1. Architectural Design and Pyramid Hierarchy

Pyramid VisionLLaMA’s architecture is distinctly hierarchical, organizing computation into four spatial resolution stages, drawing from canonical convolutional and transformer pyramid patterns. Each stage reduces the spatial dimension while increasing the channel capacity, forming a progression from fine-grained, high-resolution features in early layers to semantically rich, low-resolution features in later stages.

  • Stage 1: Resolution (H/4)×(W/4), Channels: {64/96/128}
  • Stage 2: Resolution (H/8)×(W/8), Channels: {128/192/256}
  • Stage 3: Resolution (H/16)×(W/16), Channels: {256/384/512}
  • Stage 4: Resolution (H/32)×(W/32), Channels: {512/768/1024}

The hierarchy is constructed using consecutive patch embedding layers with specified downsampling strides ($P_k$), followed by alternating blocks of local self-attention (LSA) and global self-attention (GSA). Each stage produces progressively coarser representations, consistent with common CNN and modern transformer practice (Chu et al., 2024).
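As a quick sanity check, the stage geometry above follows directly from the downsampling strides. The sketch below is illustrative only (the channel widths shown are those of the Base variant); it is not the reference implementation:

```python
# Per-stage shape progression of the pyramid (illustrative sketch).
# Strides P_k and channels C_k follow the Base variant described in the text.
def stage_shapes(h, w, strides=(4, 2, 2, 2), channels=(96, 192, 384, 768)):
    shapes = []
    for p, c in zip(strides, channels):
        h, w = h // p, w // p  # each patch-embedding layer downsamples by stride P_k
        shapes.append((h, w, c))
    return shapes

# A 224x224 input yields 56x56 -> 28x28 -> 14x14 -> 7x7 feature maps.
print(stage_shapes(224, 224))
```

Note how the cumulative strides 4, 8, 16, 32 reproduce the H/4 … H/32 resolutions listed above.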

2. Adaptation of the LLaMA Block

Pyramid VisionLLaMA generalizes the LLaMA block—originally designed for language modeling—to two-dimensional data and multi-resolution hierarchies. Key architectural points include:

  • Patch Embedding: At the start of each stage, patch tokens are produced via a linear projection of flattened image patches, parameterized by Wₚ in each stage.
  • Positional Encoding: 1D Rotary Position Embedding (RoPE) is extended to two dimensions (AS2DRoPE), where each token at position (i,j) in the 2D grid is rotated by a block-diagonal matrix R_{i,j}. This formulation supports arbitrary input resolutions, since AS2DRoPE can interpolate coordinates beyond the grid sizes seen during training.
  • Block Structure: Each block employs Pre-Norm LayerNorm, attention modules (using AS2DRoPE), and SwiGLU feed-forward layers, with residual connections. Both LSA and GSA blocks are interleaved to balance computational efficiency and global receptive field growth.
  • No [CLS] Token: Unlike ViT, no class token is used; representation pooling is performed via Global Average Pooling (GAP) over the final-stage feature map.
  • Token Flow: Downsampling between stages is handled by patch-embedding layers, i.e., changing Pâ‚– from 4 to 2 across successive stages.
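The block structure described above can be sketched numerically. The following is a minimal single-head toy version (pre-norm LayerNorm, attention, SwiGLU FFN, residual connections); RoPE, multi-head splitting, and the real parameterization are omitted, and all weights are random placeholders rather than the paper's parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Pre-Norm LayerNorm applied before each sub-module.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention (no RoPE in this toy version).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU: SiLU(x W_gate) * (x W_up), projected back down.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ w_up)) @ w_down

def block(x, params):
    # x = x + Attn(LN(x));  x = x + SwiGLU(LN(x))
    x = x + attention(layer_norm(x), *params["attn"])
    x = x + swiglu(layer_norm(x), *params["ffn"])
    return x

rng = np.random.default_rng(0)
d, hidden = 8, 16  # toy dimensions, not the paper's
params = {
    "attn": [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)],
    "ffn": [rng.normal(scale=0.1, size=(d, hidden)),
            rng.normal(scale=0.1, size=(d, hidden)),
            rng.normal(scale=0.1, size=(hidden, d))],
}
tokens = rng.normal(size=(4, d))  # 4 tokens of width d
out = block(tokens, params)
print(out.shape)  # residual block preserves the token shape
```

The residual-plus-pre-norm arrangement means the block is shape-preserving, which is what allows LSA and GSA blocks to be freely interleaved within a stage.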

3. Mathematical Formulation

Let $x^{(k-1)}_{\mathrm{img}}[n]$ denote the input feature at patch $n$ in stage $k-1$. The stage-$k$ patch embedding computes

$$x^{(k)}_{\mathrm{patch}}[n] = W^{(k)}_{p}\, x^{(k-1)}_{\mathrm{img}}[n] + b^{(k)}_{p}$$

Self-attention with 2D RoPE:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

with $Q$ and $K$ first rotated by $R_{i,j}$, such that:

$$(R_{i_1,j_1}\, q_{i_1,j_1})^T (R_{i_2,j_2}\, k_{i_2,j_2}) = q_{i_1,j_1}^T\, R_{i_1-i_2,\, j_1-j_2}\, k_{i_2,j_2}$$

This approach supports flexible input shape generalization and robustly encodes spatial context.
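The identity above says the attention score depends only on the relative offset between two positions, not their absolute coordinates. The demo below checks this numerically for a hypothetical 4-dimensional head with a single placeholder frequency (the real AS2DRoPE uses a spectrum of rotation frequencies per head):

```python
import math
import numpy as np

def rot(theta):
    # 2x2 plane rotation by angle theta.
    return np.array([[math.cos(theta), -math.sin(theta)],
                     [math.sin(theta),  math.cos(theta)]])

def R(i, j, theta=0.1):
    # Block-diagonal 4x4 rotation: the first 2 dims encode the row index i,
    # the last 2 dims encode the column index j (single toy frequency theta).
    out = np.zeros((4, 4))
    out[:2, :2] = rot(i * theta)
    out[2:, 2:] = rot(j * theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

def score(i1, j1, i2, j2):
    # Inner product of the rotated query and key.
    return (R(i1, j1) @ q) @ (R(i2, j2) @ k)

# Same relative offset (2, -1) measured from two different absolute positions:
a = score(5, 3, 3, 4)
b = score(9, 7, 7, 8)
print(abs(a - b) < 1e-9)  # the score is translation-invariant
```

Because only relative offsets matter, a model trained on one grid size can be evaluated on another, which is the flexibility the text attributes to AS2DRoPE.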

4. Training Paradigms and Losses

Pyramid VisionLLaMA supports a variety of pre-training and fine-tuning paradigms relevant for diverse vision tasks:

  • Supervised Classification: Trained with cross-entropy loss on ImageNet-1K.
  • Masked Autoencoding (MAE): Pre-training with pixel reconstruction ($\ell_2$ loss) on randomly masked patches, followed by fine-tuning.
  • Semantic Segmentation: Using standard segmentation heads (e.g., UPerNet) with cross-entropy and auxiliary losses; performance measured via mean IoU.
  • Object Detection: Integrated with Mask R-CNN head, trained on bounding box regression (GIoU) and mask cross-entropy losses.
  • Diffusion-based Image Generation: Score-matching losses (e.g., for DiT and SiT frameworks), measured with FID, sFID, IS, etc.
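Of the losses listed above, GIoU is self-contained enough to sketch directly. The following illustrative implementation uses `(x1, y1, x2, y2)` axis-aligned boxes; in detection training, `1 - giou` serves as the box-regression loss:

```python
# Illustrative generalized IoU (GIoU) for axis-aligned boxes (x1, y1, x2, y2).
def giou(a, b):
    # Intersection rectangle (clamped to zero if the boxes are disjoint).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box C; the GIoU penalty is the empty fraction of C.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c - union) / c

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # negative for disjoint boxes
```

Unlike plain IoU, GIoU stays informative (negative and non-constant) for non-overlapping boxes, which is why it is preferred for bounding-box regression.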

5. Empirical Performance and Ablation Results

Comprehensive benchmarks on classification, segmentation, detection, and generation reveal the competitive or superior performance of Pyramid VisionLLaMA compared to contemporary vision transformers:

| Task | Metric | Pyramid VisionLLaMA | Baseline |
|---|---|---|---|
| ImageNet-1K supervised (Base) | Top-1 acc. | 83.2% | Swin-S 83.0%, Twins-B 83.2% |
| ADE20K segmentation (Base) | mIoU | 49.1 | Swin-S 47.6, Twins-S 47.7 |
| COCO detection (Base) | box AP / mask AP | 49.1 / 43.8 | Swin-S box AP 47.6 |
| ViTDet detection (MAE, 800 epochs) | box AP | 52.2 | ViT-B (MAE, 1600 epochs) 51.6 |
| Diffusion image generation (XL/2) | FID | 9.84 | DiT-XL/2 10.67 |

Ablation studies highlight:

  • AS2DRoPE is essential for generalizing to resolutions beyond 224×224; plain 1D RoPE degrades sharply at larger input sizes.
  • Replacing the standard FFN with SwiGLU introduces no performance drop.
  • LayerNorm slightly outperforms RMSNorm.

6. Per-Stage Architectural Hyperparameters

The pyramid structure for the Base variant is as follows:

| Stage | Output Size | Patch Stride $P_k$ | Channels $C_k$ | Blocks per Stage |
|---|---|---|---|---|
| 1 | ⌊H/4⌋ × ⌊W/4⌋ | 4 | 96 | [LSA→GSA] × 1 |
| 2 | ⌊H/8⌋ × ⌊W/8⌋ | 2 | 192 | [LSA→GSA] × 1 |
| 3 | ⌊H/16⌋ × ⌊W/16⌋ | 2 | 384 | [LSA→GSA] × 9 |
| 4 | ⌊H/32⌋ × ⌊W/32⌋ | 2 | 768 | [GSA] × 2 |

Other hyperparameters include:

  • Drop-path rates per model size (e.g., 0.3 for Base).
  • AdamW optimizer, 0.05 weight decay, cosine learning rate decay, five-epoch warmup, 300 epochs for supervised settings.
  • 12 attention heads at embedding dimension 768.
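The cosine schedule with warmup mentioned above can be written in a few lines. This is a generic sketch, not the paper's exact schedule: the peak and floor learning rates below are placeholder values, and only the 5-epoch warmup and 300-epoch horizon come from the text:

```python
import math

def lr_at(epoch, total=300, warmup=5, peak=1e-3, floor=1e-5):
    # Linear warmup over the first `warmup` epochs, then cosine decay to `floor`.
    if epoch < warmup:
        return peak * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)  # decay progress in [0, 1]
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

print(lr_at(4))    # end of warmup: peak learning rate
print(lr_at(299))  # final epoch: close to the floor
```

Warmup avoids unstable early updates at full learning rate, and the cosine tail anneals smoothly toward the floor rather than stepping down abruptly.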

7. Significance and Impact

Pyramid VisionLLaMA establishes a unified, LLaMA-block-based foundation for vision backbones, yielding versatile applicability across perception and generative domains (Chu et al., 2024). The architectural adaptation preserves the computational scaling benefits of transformer-based designs while enabling locality and multi-scale representation, achieving faster convergence and superior or equivalent accuracy to established vision transformers such as Swin and Twins-SVT.

The pyramid formulation with 2D rotary position encodings resolves key challenges in handling arbitrary input sizes and transferring pretrained models across tasks. Its flexibility and competitive empirical profile position it as a strong baseline for future vision transformer research.
