HyperMixer: An MLP-based Low Cost Alternative to Transformers
Abstract: Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
Summary
- The paper introduces HyperMixer, an MLP-based architecture that dynamically generates token mixing weights via hypernetworks to replace conventional attention mechanisms.
- It achieves linear computational complexity and competitive performance on NLU benchmarks like GLUE, excelling in low-resource settings and on tasks such as QNLI.
- The approach simplifies hyperparameter tuning and promotes energy-efficient Green AI practices, making it a practical and low-cost alternative for real-world applications.
This paper introduces HyperMixer (arXiv:2203.03691), an MLP-based architecture proposed as a low-cost alternative to Transformers for Natural Language Understanding (NLU) tasks. The motivation stems from the significant computational cost, data requirements, and hyperparameter tuning effort associated with large Transformer models, aligning with the concept of "Green AI".
HyperMixer builds upon the MLPMixer architecture, which uses separate MLPs for feature mixing (applied per token) and token mixing (applied per feature across tokens). However, the standard MLPMixer has limitations for NLP due to its fixed-size token mixing MLP and position-specific weights, making it unsuitable for variable-length inputs and lacking position invariance necessary for generalization in language tasks.
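A toy sketch (assuming NumPy, with made-up dimensions) illustrates why MLPMixer's static token mixing matrix cannot handle variable-length inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 16, 8

# MLPMixer-style static token mixing: the weight matrix is tied to a
# fixed training length N and mixes across tokens for each feature
W_tok = rng.standard_normal((N, N))
x = rng.standard_normal((N, d))
mixed = W_tok @ x               # works: output is [N, d]

x_longer = rng.standard_normal((2 * N, d))
try:
    W_tok @ x_longer            # fails: the static weights expect exactly N tokens
except ValueError:
    print("static token mixing cannot process a longer sequence")
```

The learned weights are also tied to absolute positions, so even sequences of the right length do not generalize across positions, which is the second limitation HyperMixer addresses.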
HyperMixer addresses these limitations by employing hypernetworks to dynamically generate the weights of the token mixing MLP. This "HyperMixing" mechanism acts as a drop-in replacement for the attention mechanism in a standard Transformer encoder layer (Figure 1 in the paper illustrates this). Instead of learning a fixed weight matrix for token mixing, HyperMixer learns smaller hypernetworks that take token representations as input and output the weights for the token mixing MLP.
The core idea of HyperMixing is to generate the weight matrices W1 and W2 of the token mixing MLP, TM-MLP(x) = W2 σ(W1^T x), dynamically based on the input tokens. The paper describes this dynamic generation using two hypernetworks, h1 and h2. Specifically, h1 and h2 are implemented such that the rows of the weight matrices are generated independently from each input token's representation, optionally combined with token information such as position embeddings.
The pseudocode provided in the paper for the HyperMixing layer demonstrates this process:
```python
class HyperMixing(nn.Module):
    def __init__(self, d, d_prime):
        # learnable parameters: hypernetworks that generate W1 and W2
        self.hypernetwork_in = MLP([d, d, d_prime])   # generates W1
        self.hypernetwork_out = MLP([d, d, d_prime])  # generates W2
        # layer normalization improves training stability
        self.layer_norm = LayerNorm(d)

    def forward(self, queries, keys, values):
        # queries: [B, M, d]; keys / values: [B, N, d]

        # add token information (e.g., position embeddings) so the
        # hypernetworks are aware of position or other token properties
        hyp_in = add_token_information(keys)
        hyp_out = add_token_information(queries)

        # generate the weight matrices dynamically, one row per token
        W1 = self.hypernetwork_in(hyp_in)    # [B, N, d']
        W2 = self.hypernetwork_out(hyp_out)  # [B, M, d']

        # compose the token mixing MLP: per feature dimension,
        # x -> W2 @ GELU(W1^T @ x), mapping [B, d, N] -> [B, d, M]
        token_mixing_mlp = compose_TM_MLP(W1, W2)

        # apply the TM-MLP across the sequence dimension
        values = values.transpose(1, 2)    # [B, d, N]
        output = token_mixing_mlp(values)  # [B, d, M]
        output = output.transpose(1, 2)    # [B, M, d]

        # layer normalization on the mixing component's output
        return self.layer_norm(output)
```
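As a concrete sketch, the core computation can be written in plain NumPy. This is a minimal single-example illustration, not the paper's implementation: the hypernetworks are reduced to single linear maps (the paper uses two-layer MLPs), position information and layer normalization are omitted, and `d_p` stands for the TM-MLP hidden dimension d'.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def hyper_mixing(values, W1, W2):
    """Token mixing MLP with dynamically generated weights.
    values: [N, d]; W1: [N, d_p]; W2: [M, d_p]; returns [M, d]."""
    v = values.T             # [d, N]: mix over the sequence dimension
    hidden = gelu(v @ W1)    # [d, d_p] = GELU(W1^T x) per feature column x
    out = hidden @ W2.T      # [d, M]  = W2 @ hidden per feature
    return out.T             # [M, d]

rng = np.random.default_rng(0)
N, d, d_p = 5, 8, 4
tokens = rng.standard_normal((N, d))

# toy "hypernetworks": one linear map each, applied per token,
# so each token contributes one row of W1 and one row of W2
H_in = rng.standard_normal((d, d_p))
H_out = rng.standard_normal((d, d_p))
W1 = tokens @ H_in    # [N, d_p]
W2 = tokens @ H_out   # [N, d_p] (M = N in the self-mixing case)

out = hyper_mixing(tokens, W1, W2)
print(out.shape)      # (5, 8)
```

Because W1 and W2 each have one row per input token, the same code handles any sequence length without retraining, which is the property the static MLPMixer lacks.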
This dynamic weight generation provides HyperMixer with the necessary inductive biases for NLP:
- Adaptive Size: The hypernetworks generate weights of size proportional to the input sequence length, handling variable inputs.
- Position Invariance: The hypernetworks' MLPs process each token independently (after potentially adding position info), and the generated weights are used across the sequence dimension in a consistent manner, making the core mixing operation position-invariant similar to attention.
- Global Receptive Field: The token mixing MLP mixes information across the entire sequence length.
- Dynamicity: The mixing weights are a function of the input sequence itself.
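The position-invariance point can be checked numerically. In this self-contained NumPy sketch (the hypernetwork is again a toy linear map, and position embeddings are deliberately omitted), permuting the input tokens simply permutes the output rows, i.e., the core mixing operation is permutation-equivariant, like unmasked attention:

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mix(tokens, H):
    # tied toy hypernetwork: W1 = W2 = tokens @ H, one row per token
    W = tokens @ H                 # [N, d_p]
    hidden = gelu(tokens.T @ W)    # [d, d_p]
    return (hidden @ W.T).T        # [N, d]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))
H = rng.standard_normal((8, 4))
perm = rng.permutation(6)

out = mix(tokens, H)
out_perm = mix(tokens[perm], H)
# permuting the inputs permutes the outputs in exactly the same way
print(np.allclose(out_perm, out[perm]))  # True
```

Adding position embeddings before the hypernetworks (as in the paper) deliberately breaks this symmetry so the model can still use word order.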
Empirical evaluation on GLUE benchmark tasks shows that HyperMixer performs better than other MLP-based models and achieves performance on par with or slightly better than vanilla Transformers, particularly excelling on QNLI.
More importantly, HyperMixer demonstrates substantially lower costs according to the Green AI metric Cost(R) = E * D * H:
- Processing Time (E): HyperMixer has linear complexity O(N·d·d') in the input length N (compared to O(N²·d) for standard self-attention), leading to faster wall-clock time, especially for long sequences (Figure 2). This makes it suitable for applications requiring low-latency inference or processing very long documents.
- Training Data (D): HyperMixer shows a larger relative performance improvement over Transformers in low-resource settings (using only 10% of training data) (Figure 3), suggesting better data efficiency. This is beneficial when training data is limited.
- Hyperparameter Tuning (H): HyperMixer is empirically shown to be easier to tune than Transformers, achieving higher expected validation performance at lower tuning budgets (Figure 4). This reduces the computational cost and time spent on hyperparameter optimization.
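The processing-time gap can be illustrated with a back-of-envelope multiply-accumulate count (a rough sketch: constant factors and the feature mixing MLP, which both models share, are ignored, and the hypernetworks are counted as single linear maps):

```python
def attention_macs(N, d):
    # QK^T and scores @ V: two products of size roughly N x N x d
    return 2 * N * N * d

def hypermixing_macs(N, d, d_p):
    # weight generation (two toy linear hypernetworks) plus the two
    # TM-MLP products -- every term is linear in the sequence length N
    return 2 * N * d * d_p + 2 * N * d * d_p

d, d_p = 256, 256
for N in (512, 4096):
    print(N, attention_macs(N, d), hypermixing_macs(N, d, d_p))

# an 8x longer sequence multiplies attention's cost by 64,
# but HyperMixing's cost only by 8
print(attention_macs(4096, d) // attention_macs(512, d))                # 64
print(hypermixing_macs(4096, d, d_p) // hypermixing_macs(512, d, d_p))  # 8
```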
Experiments on a synthetic task further illustrate that HyperMixer learns attention-like patterns of interaction between tokens, supporting the idea that its architecture captures effective inductive biases for modeling relationships in sequences.
Implementation Considerations:
- Architecture: HyperMixer can generally replace the multi-head self-attention block in a standard Transformer encoder layer. It typically consists of alternating HyperMixing and feature mixing (MLP) blocks, with Layer Normalization and skip connections.
- Hypernetwork Implementation: The hypernetworks (`hypernetwork_in` and `hypernetwork_out`) are standard MLPs (e.g., two linear layers with a GELU activation). They process individual token representations independently and output the vectors that form the rows of the TM-MLP weight matrices.
- TM-MLP Implementation: The `compose_TM_MLP` step requires careful implementation to apply the dynamic weights: transpose the input values, perform the matrix multiplications with the generated W1 and W2 (batched across features), and apply the GELU non-linearity in between. Ensure correct handling of batch dimensions and matrix shapes.
- Position Information: Adding positional embeddings (learned or fixed) to the token representations before feeding them to the hypernetworks is crucial for the model to utilize positional information.
- Normalization and Layout: Layer Normalization significantly improves training stability for HyperMixer, especially when using different Transformer layer layouts (pre-norm, post-norm, etc.). Adding LayerNorm after the HyperMixing component is recommended (Appendix F).
- Tied Hypernetworks: Tying `hypernetwork_in` and `hypernetwork_out` (i.e., W1 = W2) reduces the parameter count and was found beneficial in the paper's low-resource setting.
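A batched sketch of the `compose_TM_MLP` step can be written with `einsum`, which avoids explicit transposes (assuming NumPy; the einsum formulation and the name `tm_mlp_batched` are illustrative, not the paper's code):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def tm_mlp_batched(values, W1, W2):
    """values: [B, N, d]; W1: [B, N, d_p]; W2: [B, M, d_p] -> [B, M, d]."""
    # per feature j: hidden[b, j, :] = GELU(W1^T x_j), x_j the j-th feature column
    hidden = gelu(np.einsum('bnd,bnp->bdp', values, W1))   # [B, d, d_p]
    # per feature j: out[b, :, j] = W2 @ hidden[b, j, :]
    return np.einsum('bdp,bmp->bmd', hidden, W2)           # [B, M, d]

rng = np.random.default_rng(0)
B, N, M, d, d_p = 2, 5, 5, 8, 4
values = rng.standard_normal((B, N, d))
W1 = rng.standard_normal((B, N, d_p))
W2 = rng.standard_normal((B, M, d_p))
print(tm_mlp_batched(values, W1, W2).shape)  # (2, 5, 8)
```

For the tied configuration, simply pass the same generated matrix as both `W1` and `W2`.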
Trade-offs:
- Parameter Count: While HyperMixer can be configured to have a similar parameter count to Transformers, the number of parameters in the hypernetworks and the hidden dimension d′ of the TM-MLP affect efficiency and capacity. The tied hypernetwork configuration offers a good balance.
- Complexity vs. Expressiveness: HyperMixer's linear complexity comes from its MLP-based mixing. While shown to learn attention-like patterns, whether this mechanism is as universally powerful or expressive as quadratic attention in all scenarios (especially very large scale pretraining) remains an open question.
Limitations and Future Work:
The study primarily focuses on small models trained on limited data. Scaling HyperMixer to billions of parameters and pretraining on massive corpora, akin to LLMs, is necessary to confirm its efficiency benefits in that regime. Adapting HyperMixing for generative tasks requiring causal masking (like standard language modeling with decoder-only architectures) needs significant modeling advancements, which is highlighted as promising future work. Evaluating HyperMixer on a wider range of NLP tasks and domains is also needed to establish its versatility compared to Transformers.
In summary, HyperMixer presents a compelling MLP-based approach to NLU that achieves competitive performance with Transformers while offering significant advantages in terms of computational cost, data efficiency, and ease of tuning, particularly valuable for low-resource scenarios and promoting Green AI principles.