- The paper introduces R-Transformer, which combines a LocalRNN for modeling local structures with multi-head attention for capturing long-term dependencies.
- It achieves notable improvements on tasks such as pixel-by-pixel MNIST classification, polyphonic music modeling, and language modeling, without relying on position embeddings.
- Empirical results demonstrate superior accuracy and efficiency: the architecture balances local cues with global context and can be fully parallelized.
Introduction
The paper introduces the R-Transformer model, which unifies the strengths of Recurrent Neural Networks (RNNs) and Transformers to address their respective limitations in sequence modeling. RNNs suffer from vanishing/exploding gradients and cannot be parallelized across time steps, while Transformers capture long-term dependencies well but rely on position embeddings and struggle to represent local structures. R-Transformer aims to capture both local structures and global dependencies efficiently, without relying on position embeddings.
Sequence Modeling Problem
Sequence modeling maps an input sequence of length $N$, $x_1, x_2, \ldots, x_N$, to a label $y \in \mathcal{Y}$ through a function $f: \mathcal{X}^N \rightarrow \mathcal{Y}$, so that $y = f(x_1, x_2, \ldots, x_N)$. For example, pixel-by-pixel MNIST classification treats each image as a length-784 sequence of pixel values whose label is the digit class. The paper does not address sequence-to-sequence learning directly, though R-Transformer could be extended to such tasks.
LocalRNN: Modeling Local Structures
R-Transformer introduces LocalRNN, which processes signals within a fixed-size local window, allowing the model to capture local structures effectively. Unlike traditional RNNs that run over the entire sequence, LocalRNN processes each short window independently, so it retains the sequential order within local windows without position embeddings while avoiding the gradient issues that long recurrences incur.
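As a concrete illustration, here is a minimal PyTorch sketch of the idea (an assumption of this summary, not the authors' code): the sequence is left-padded, split into overlapping windows of an assumed size `window_size`, and a small GRU is run over each window independently, with the final hidden state serving as that position's local representation.

```python
import torch
import torch.nn as nn

class LocalRNN(nn.Module):
    """Sketch of a LocalRNN: an RNN applied independently to each local window."""
    def __init__(self, d_model, window_size):
        super().__init__()
        self.window_size = window_size
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        # Left-pad so every position has a full window of preceding steps.
        pad = x.new_zeros(b, self.window_size - 1, d)
        x_padded = torch.cat([pad, x], dim=1)    # (batch, n + window - 1, d)
        # Gather the window ending at each position: (batch, n, d, window).
        windows = x_padded.unfold(1, self.window_size, 1)
        windows = windows.permute(0, 1, 3, 2).reshape(b * n, self.window_size, d)
        # Run the RNN over each short window; keep only the last hidden state.
        _, h_last = self.rnn(windows)            # h_last: (1, b * n, d)
        return h_last.squeeze(0).view(b, n, d)   # one local representation per position
```

Because every window has the same short length, the windows can be processed in parallel as a single large batch, which is how the recurrence avoids the sequential bottleneck of a full-length RNN.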
Multi-Head Attention for Global Dependencies
The model uses multi-head attention to capture global, long-term dependencies, allowing any position in the sequence to interact directly with any other. Each position's representation is refined through attention-weighted combinations of all positions, which makes long-term dependency modeling efficient. This component plays a role similar to pooling in CNNs, but uses attention to ensure comprehensive information flow across the sequence.
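A minimal sketch of this step, assuming PyTorch's built-in `nn.MultiheadAttention` and illustrative sizes (the values of `d_model` and `nhead` below are not from the paper):

```python
import torch
import torch.nn as nn

d_model, nhead = 64, 4
attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

h = torch.randn(2, 100, d_model)   # LocalRNN outputs: (batch, seq_len, d_model)
refined, _ = attn(h, h, h)         # every position attends to every other position
print(refined.shape)               # torch.Size([2, 100, 64])
```

Because attention connects all pairs of positions directly, the path length between distant positions stays constant, which is what makes long-term dependency modeling efficient.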
Architecture and Components
R-Transformer consists of hierarchical layers, each stacking a LocalRNN for local structure modeling, multi-head attention for long-term dependencies, and a position-wise feed-forward network for feature transformation. Residual connections and layer normalization are applied around each sublayer for stability and efficiency. The model can be fully parallelized, making it well suited to modern computing architectures.
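Putting the components together, the following rough sketch shows one such layer, reusing the `LocalRNN` class sketched above; the exact placement of residual connections and layer normalization here is an assumption for illustration, not the authors' implementation.

```python
import torch.nn as nn

class RTransformerBlock(nn.Module):
    """Sketch of one layer: LocalRNN -> multi-head attention -> feed-forward,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model, nhead, window_size, d_ff):
        super().__init__()
        self.local_rnn = LocalRNN(d_model, window_size)   # class sketched above
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        x = self.norm1(x + self.local_rnn(x))    # local structures
        # For autoregressive tasks a causal attention mask would also be passed.
        a, _ = self.attn(x, x, x)
        x = self.norm2(x + a)                    # global dependencies
        return self.norm3(x + self.ff(x))        # position-wise transformation
```

Stacking several such blocks gives the full hierarchical model described in the paper.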
Experimental Evaluation
R-Transformer was evaluated on a range of sequential-data tasks, including pixel-by-pixel MNIST classification, polyphonic music modeling, and language modeling on the Penn Treebank dataset. The paper reports significant improvements over TCN, the standard Transformer, and canonical RNN architectures, highlighting its ability to model both local and global dependencies effectively.
Pixel-by-Pixel MNIST Classification
R-Transformer achieved a test accuracy of 99.1%, outperforming TCN and Transformer by incorporating local and global cues. Its success underscores the importance of modeling local structures in sequence classification tasks.
Nottingham: Polyphonic Music Modeling
On the Nottingham polyphonic music dataset, R-Transformer achieved a negative log-likelihood of 2.37, demonstrating superior performance on sequences with strong local structure.
Language Modeling
In character-level language modeling, R-Transformer achieved a negative log-likelihood of 1.24, outperforming the Transformer with stable performance. In word-level language modeling, it reached a perplexity of 84.38, better than every other architecture except the LSTM, which performed best.
Related Work
RNNs have traditionally dominated sequence modeling but face gradient issues and limited parallelism. Recent models have shifted toward convolution and attention mechanisms to overcome these limitations: TCNs have gained popularity for their convolutional approach to sequences, and Transformers have excelled at tasks requiring long-term dependency modeling, though they still rely on position embeddings and model local information weakly. R-Transformer addresses these challenges by combining its LocalRNN and multi-head attention components.
Conclusion
R-Transformer presents a significant advance in sequence modeling, combining the strengths of RNNs and Transformers while alleviating their respective weaknesses. Its architecture eliminates the need for position embeddings, captures both local and global sequence information efficiently, and supports full parallelization. Empirical evaluations confirm its advantages over state-of-the-art architectures, paving the way for further research and application across diverse sequential domains.