
Adaptively Sparse Transformers

Published 30 Aug 2019 in cs.CL and stat.ML | arXiv:1909.00015v2

Abstract: Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with $\alpha$-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the $\alpha$ parameter -- which controls the shape and sparsity of $\alpha$-entmax -- allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets. Findings of the quantitative and qualitative analysis of our approach include that heads in different layers learn different sparsity preferences and tend to be more diverse in their attention distributions than softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.


Summary

  • The paper introduces a method where the model learns to adjust sparsity in attention heads using a differentiable alpha-entmax function.
  • It demonstrates that learned, context-dependent sparsity yields more diverse and specialized attention heads, improving interpretability on neural machine translation tasks.
  • Empirical results show that the adaptively sparse Transformer maintains or improves performance compared to traditional models without added complexity.

Adaptively Sparse Transformers

The paper "Adaptively Sparse Transformers" introduces a novel modification to the Transformer architecture, aimed at enhancing the sparsity of attention mechanisms. This work presents the adaptively sparse Transformer, which offers flexibility in attention head sparsity that is both context-dependent and learnable.

Introduction and Motivation

The Transformer model, prominent in NLP tasks and particularly in Neural Machine Translation (NMT), uses multi-head attention to derive context-aware word representations. With conventional softmax normalization, every word in the context necessarily receives a non-zero attention weight, however low its score. The paper argues that such dense attention obscures interpretability and limits the model's flexibility.
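A quick numerical illustration of this density property (plain NumPy, not code from the paper): even a strongly negative score is mapped to a small but strictly positive softmax weight.

```python
import numpy as np

# Softmax makes low-scoring weights tiny, but never exactly zero:
# every context word always receives some probability mass.
scores = np.array([5.0, 0.0, -5.0, -10.0])
weights = np.exp(scores - scores.max())  # shift by max for stability
weights /= weights.sum()

assert np.all(weights > 0)  # no weight is ever exactly zero
```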

Methodology

The key innovation in this study is the replacement of softmax with $\alpha$-entmax, a differentiable generalization that permits sparse attention distributions by setting the weights of low-scoring words exactly to zero. Sparsity is controlled by the $\alpha$ parameter, which the authors propose to learn automatically. This allows each attention head to move dynamically between dense, softmax-like behavior ($\alpha = 1$) and sparse, sparsemax-like behavior ($\alpha = 2$), adapting its sparsity pattern to the context.
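As a sketch of the mechanism, the following minimal NumPy implementation computes $\alpha$-entmax by bisection on the normalization threshold $\tau$; the paper's released implementation provides exact, differentiable algorithms, so treat this purely as an illustration of the definition.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection: find tau such that
    sum_i max((alpha-1)*z_i - tau, 0)^(1/(alpha-1)) = 1."""
    z = np.asarray(z, dtype=float)
    if alpha == 1.0:  # softmax is recovered in the alpha -> 1 limit
        e = np.exp(z - z.max())
        return e / e.sum()
    zs = (alpha - 1.0) * z
    # At tau = max(zs) the sum is 0; at tau = max(zs) - 1 it is >= 1.
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.maximum(zs - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau  # too much mass -> raise the threshold
        else:
            hi = tau
    p = np.maximum(zs - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()
```

For $\alpha = 2$ this reduces to sparsemax, and low-scoring entries come out exactly zero rather than merely small.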

Numerical Results

The adaptively sparse Transformer was evaluated on several machine translation datasets, where it matches or slightly surpasses the performance of the standard Transformer. Notably, it does so without increasing model complexity: accuracy is maintained while sparsity is added, which in turn encourages diverse specialization across attention heads.

Analysis and Implications

An in-depth analysis demonstrates that different heads adopt varying sparsity patterns, thereby improving head diversity. This diversity is quantitatively measured using Jensen-Shannon Divergence, showing greater disagreement among heads compared to the softmax baseline. Moreover, certain heads exhibit clear specializations, such as positional awareness or BPE-merging capabilities, which enhance interpretability.
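Head disagreement of this kind can be quantified as the Jensen-Shannon divergence between two heads' attention distributions over the same positions. A minimal sketch (plain NumPy; not the paper's exact evaluation code):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two attention distributions.
    Symmetric and bounded by log(2); higher means more disagreement."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0  # 0 * log(0) = 0 by convention
        return float(np.sum(a[nz] * np.log(a[nz] / b[nz])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Two identical heads score 0; two heads attending to disjoint positions reach the maximum log(2), which is only attainable with exact zeros, as softmax attention never produces fully disjoint distributions.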

Theoretical and Practical Implications

From a theoretical standpoint, adaptively sparse attention suggests that sparsity in attention mechanisms can be beneficial and learned dynamically rather than fixed in advance. Practically, the findings point to potential efficiency gains: positions that receive exactly zero weight need not contribute to the output computation, which could improve speed without loss of accuracy.
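For example, once a head's attention weights contain exact zeros, the weighted sum over the value vectors only needs the non-zero positions. This is a hypothetical sketch of that idea, not the paper's implementation; real speedups would depend on kernels and hardware.

```python
import numpy as np

def attention_output(weights, values):
    """Weighted sum of value vectors, computed only over the
    positions that received non-zero attention weight."""
    nz = np.nonzero(weights)[0]      # indices entmax kept
    return weights[nz] @ values[nz]  # equals the dense weights @ values
```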

Future Directions

The paper opens avenues for further exploration into static variations inspired by dynamic behaviors identified in this model, such as deterministic positional heads. Moreover, the methodology for adaptively learning sparsity parameters can be explored in other architectures beyond Transformers to assess its general utility in deep learning.

In summary, this paper contributes a valuable perspective on managing attention sparsity in Transformers, providing insights into both improving model interpretability and maintaining performance through adaptive approaches.
