Shortcut Sequence Tagging in Deep RNNs
- Shortcut sequence tagging is a method that uses vertical skip connections in deep BiLSTMs to mitigate vanishing gradients in token-level labeling tasks.
- It incorporates various skip connection types — applied to gates, cell states, and outputs — to provide clean identity mappings for effective information flow.
- The shortcut block architecture eliminates the traditional LSTM cell state, simplifying computation and delivering state-of-the-art performance on CCG supertagging and POS tagging tasks.
Shortcut sequence tagging refers to the use of vertical skip (shortcut) connections in deep, stacked recurrent neural networks—predominantly bidirectional LSTMs—to facilitate information and gradient flow across layers in sequence labeling tasks. The term encompasses both (a) the integration of various skip connection types (to gates, cell states, outputs) in deep stacked LSTM models for token-level tagging (Wu et al., 2016), and (b) a specialized architectural framework that replaces the self-recurrent cell state of LSTMs with gated cross-layer shortcuts, termed the "shortcut block" (Wu et al., 2017). These mechanisms have yielded state-of-the-art or near state-of-the-art results on CCG supertagging and POS tagging by improving gradient propagation and simplifying the training of very deep RNN stacks.
1. Underlying Motivation and Problem Setting
Deeply stacked RNNs, particularly bidirectional LSTMs (BiLSTMs), are the backbone of many sequence tagging pipelines. However, as network depth increases, these models suffer from vanishing/exploding gradients in the layer-wise (vertical) direction, which impedes effective learning and degrades trainability. Skip (shortcut) connections across layers can ameliorate this issue by providing identity paths for gradients and signals, thereby enabling deeper, more expressive networks. Classic LSTM cell structures rely on self-connected cell states to carry information across time steps ("horizontal memory"), but for sequence tagging—where inputs and outputs are aligned one-to-one and long-distance temporal dependencies are less critical—such horizontal memory is less essential, motivating simplification via cross-layer shortcuts (Wu et al., 2017).
2. Forms of Skip (Shortcut) Connections in Stacked BiLSTM Taggers
The empirical investigation by Wu et al. (2016) considers the following three primary points of injection for skip connections across layers in stacked BiLSTM models:
- Skip connections to the gates: The output of layer $l-1$, $h_t^{l-1}$, acts as an additional input (via identity mapping) to the gate computations (input, forget, output) and the candidate at layer $l$. This extends the usual input to the LSTM cell with a direct path from a lower layer.
- Skip connections to the internal cell state: Here, $h_t^{l-1}$ is added directly into the internal memory cell state $c_t^l$, either with or without an additional gating mechanism. This modifies the standard cell state update equation.
- Skip connections to the cell outputs: The output $h_t^{l-1}$ from layer $l-1$ is added directly to the output of the LSTM cell at layer $l$, potentially modulated by a gate:
  $h_t^l = o_t^l \odot \tanh(c_t^l) + g_t^l \odot h_t^{l-1}$,
  where $g_t^l$ is a sigmoid-based gating vector computed as
  $g_t^l = \sigma(W_g h_t^{l-1} + U_g h_{t-1}^l + b_g)$.
  Gating controls the extent to which the skip carries signal, preventing uncontrolled accumulation.
Empirically, skip connections to cell outputs—especially when combined with a gated-identity function—lead to the cleanest gradient path and best empirical results on both CCG supertagging and POS tagging tasks (Wu et al., 2016).
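As a concrete illustration, the best-performing variant (a gated identity skip added to the cell output) can be sketched in NumPy. The function and weight names below are illustrative, not taken from the papers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_output_skip(o_t, c_t, h_below, h_prev, W_g, U_g, b_g):
    """Gated identity skip applied to the LSTM cell output.

    h_t = o_t * tanh(c_t) + g_t * h_below, with
    g_t = sigmoid(W_g @ h_below + U_g @ h_prev + b_g).
    """
    g_t = sigmoid(W_g @ h_below + U_g @ h_prev + b_g)
    return o_t * np.tanh(c_t) + g_t * h_below

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)
W_g, U_g = rng.normal(scale=0.1, size=(2, d, d))
b_g = np.zeros(d)

# o_t and c_t would normally come from the layer-l LSTM cell; random here.
h_t = gated_output_skip(rng.random(d), rng.random(d),
                        rng.random(d), rng.random(d), W_g, U_g, b_g)
print(h_t.shape)
```

When the gate saturates toward zero, the layer reduces to a plain LSTM output; toward one, it approaches a pure identity path from the layer below.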
3. Shortcut Block: Eliminating the LSTM Cell State for Tagging
Wu et al. (2017) introduce the "shortcut block", which further simplifies the stacked architecture for sequence tagging by discarding the LSTM's self-recurrent cell state. Instead, information is propagated vertically by gated skip connections only:
- Each shortcut block computes four vectors—input gate $i_t^l$, shortcut gate $s_t^l$, output gate $o_t^l$, and candidate $\hat{c}_t^l$—from the lower-layer output $h_t^{l-1}$ and the previous hidden state $h_{t-1}^l$, e.g. $i_t^l = \sigma(W_i h_t^{l-1} + U_i h_{t-1}^l + b_i)$, and analogously for the other gates (with $\tanh$ for the candidate)
- Internal state: $c_t^l = i_t^l \odot \hat{c}_t^l + s_t^l \odot h_t^{l-1}$, with no self-recurrent $c_{t-1}^l$ term
- Final output: $h_t^l = o_t^l \odot \tanh(c_t^l)$
The shortcut gate can be parameterized via deterministic sigmoids, linear mappings, or stochastic Bernoulli variables, with non-linear deterministic gates performing best. This structure removes the requirement to store and gate a long-term cell memory, thus simplifying both forward and backward computation in deep stacked BiLSTMs for sequence tagging.
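A minimal NumPy sketch of one shortcut block time step follows, under the assumption that the gated cross-layer shortcut takes the place of the usual forget-gated cell recurrence. Names and shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shortcut_block_step(h_below, h_prev, params):
    """One time step of a shortcut block (sketch).

    The gated cross-layer term s_t * h_below replaces the LSTM's
    self-recurrent f_t * c_{t-1}: no long-term cell memory is stored.
    """
    W, U, b = params                     # W, U: (4, d, d); b: (4, d)
    pre = W @ h_below + U @ h_prev + b   # four pre-activations, shape (4, d)
    i_t, s_t, o_t = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
    cand = np.tanh(pre[3])               # candidate activation
    c_t = i_t * cand + s_t * h_below     # internal state: no c_{t-1} term
    return o_t * np.tanh(c_t)

rng = np.random.default_rng(1)
d = 8  # hidden size (illustrative)
params = (rng.normal(scale=0.1, size=(4, d, d)),
          rng.normal(scale=0.1, size=(4, d, d)),
          np.zeros((4, d)))
h_t = shortcut_block_step(rng.random(d), rng.random(d), params)
print(h_t.shape)
```

Note that the backward pass through `step` involves no chain along the time axis through a cell state, which is the simplification the shortcut block is after.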
4. Shortcut Topologies and Composition Strategies
Shortcut connections can span varying numbers of layers and be composed in different patterns. Wu et al. (2017) evaluate several such topologies, including:
- Type 1 (fan-in): Connecting the first hidden layer to all subsequent layers
- Type 2 (dense span-1): Each layer $l$ receives a skip connection from layer $l-1$
- Type 3 (span-2): Skips span two layers ($l-2 \to l$)
- Type 4 (nested): Combining shorter and longer skips (e.g., both $l-1 \to l$ and $l-2 \to l$)
- Type 5 (fully dense): All possible span-1 and span-2 connections
Empirical results indicate that the dense span-1 strategy (each layer receives a gated shortcut from the layer immediately below) strikes the optimal balance between simplicity and performance.
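These topologies can be enumerated as (source, target) layer pairs. The sketch below uses illustrative names for the types; Type 4's exact edge subset is configuration-dependent, so it is omitted:

```python
def shortcut_edges(n_layers, topology):
    """Return (source, target) layer pairs for a shortcut topology.

    Layers are 0-indexed; topology names are illustrative labels,
    not the papers' own. Type 4 (nested) mixes a chosen subset of
    span-1 and span-2 edges and is not enumerated here.
    """
    span1 = [(l - 1, l) for l in range(1, n_layers)]
    span2 = [(l - 2, l) for l in range(2, n_layers)]
    if topology == "fan_in":       # Type 1: first layer feeds all later layers
        return [(0, l) for l in range(1, n_layers)]
    if topology == "dense_span1":  # Type 2: every layer skips from one below
        return span1
    if topology == "span2":        # Type 3: skips jump over one layer
        return span2
    if topology == "fully_dense":  # Type 5: all span-1 and span-2 edges
        return sorted(span1 + span2)
    raise ValueError(topology)

print(shortcut_edges(5, "fan_in"))  # [(0, 1), (0, 2), (0, 3), (0, 4)]
```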
5. Training Protocols and Empirical Results
Training procedures for both skip-BiLSTM and shortcut block-based taggers are similar:
- Input encoding: Each token is embedded using pre-trained word vectors, fixed-length character prefix/suffix encodings, and capitalization features. Context windows are used to incorporate neighboring tokens.
- Hidden layers: Networks are constructed with 7–13 bidirectional layers, each containing several hundred cells per direction.
- Regularization: Dropout is applied to context windows and boundary (first/last) hidden layers.
- Optimization: SGD with an initial learning rate of 0.02, with learning rate reduction on plateau; no momentum or gradient clipping. Recurrent weights are initialized as random orthogonal matrices.
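Two of these protocol details can be sketched directly. The function names and the plateau patience/factor values below are illustrative assumptions, not reported hyperparameters:

```python
import numpy as np

def orthogonal_init(n, rng):
    """Random orthogonal matrix for recurrent weights, via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution uniform

def reduce_on_plateau(lr, dev_history, factor=0.5, patience=2):
    """Shrink the learning rate when dev accuracy stops improving.

    factor and patience are illustrative; the papers only state that the
    rate is reduced on plateau from an initial value of 0.02.
    """
    if (len(dev_history) > patience
            and max(dev_history[-patience:]) <= max(dev_history[:-patience])):
        return lr * factor
    return lr

rng = np.random.default_rng(2)
W = orthogonal_init(4, rng)
print(np.allclose(W.T @ W, np.eye(4)))  # True: columns are orthonormal
```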
On CCGbank supertagging:
| Model | Test Accuracy (%) |
|---|---|
| 9-stacked Bi-LSTM (no skip) | 94.69 |
| 9-stacked shortcut block | 94.99 |
This corresponds to an absolute gain of +0.30% and a relative error reduction of approximately 6% over the stacked BiLSTM baseline (Wu et al., 2017). For POS tagging (WSJ), a 9-layer shortcut block achieves 97.53% test accuracy, on par with or exceeding previous systems—without task-specific retuning.
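The relative error reduction follows directly from the table's accuracies:

```python
def relative_error_reduction(acc_base, acc_new):
    """Fraction of the baseline's error eliminated by the new model."""
    err_base, err_new = 100.0 - acc_base, 100.0 - acc_new
    return (err_base - err_new) / err_base

# CCGbank figures from the table above: 94.69% -> 94.99%,
# i.e. error shrinks from 5.31% to 5.01%.
print(round(relative_error_reduction(94.69, 94.99), 3))
```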
For skip connection variants (Wu et al., 2016):
| Configuration | CCG Test Acc. (%) |
|---|---|
| Vanilla 7-layer BiLSTM | 94.26 |
| Skip→gates | ≈93.90 |
| Skip→internals (ungated) | 94.63 |
| Skip→internals + gate | 94.52 |
| Skip→outputs (ungated) | 93.89 |
| Skip→outputs + gate (forget-bias=0) | 94.67 |
Ablation studies reveal the necessity of character-level features, dropout, and identity skip mappings for optimal performance.
6. Interpretation, Mechanism, and Limitations
The success of vertical skip connections—specifically, gated identity skips to cell outputs—appears to derive from preserving clean identity paths for gradients in the vertical stack. This avoids interference with internal LSTM gating dynamics and prevents both vanishing gradients and uncontrolled activation growth. Gating enables the network to dynamically learn when to utilize shortcut signals or rely on learned depth transformations.
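A toy scalar model makes this mechanism concrete. The per-layer gain and gate values are made up for illustration and are not quantities from the papers:

```python
# Depth-wise gradient scale through a 12-layer stack, modeled with scalars.
depth, w = 12, 0.8  # w: per-layer "transform" gain below 1 (illustrative)

plain = w ** depth               # plain stack: shrinks geometrically (~0.069)
ungated = (1 + w) ** depth       # ungated identity skip: uncontrolled growth

g = 0.9                          # gate mostly favoring the identity path
gated = (g + (1 - g) * w) ** depth  # gated skip: stays O(1) (~0.78)

print(plain, ungated, gated)
```

The plain product vanishes, the ungated sum of paths blows up, and the gate lets the network sit between the two regimes, matching the qualitative argument above.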
With shortcut blocks, the elimination of horizontal cell memory both simplifies recurrent step updates and acts as a form of regularization through richer vertical information mixing. This suggests that LSTM-style horizontal memory may be dispensable for aligned sequence tagging tasks, though its necessity in auto-regressive or unaligned tasks remains untested.
Limitations include:
- Scope restricted to token-level tagging (CCG, POS); application to structured prediction or generation is unexplored.
- Only span-1 shortcut patterns and simple gate parameterizations are studied; multi-scale or attention-based gating may yield further improvement.
- The interaction with alternative regularization (zoneout, batch-norm) or optimization (Adam) has not been systematically investigated.
7. Impact and Prospects
Shortcut sequence tagging architectures, incorporating vertical skip connections with effective gating, have advanced the state of the art in CCG supertagging and achieved robust results in POS tagging without significant task-specific tuning (Wu et al., 2016, Wu et al., 2017). These approaches mitigate longstanding optimization difficulties in deep RNNs. A plausible implication is that similar vertical skip designs could benefit other vertically deep, non-generative sequence processing architectures, although this awaits comprehensive study. Future explorations into more general skip patterns, sophisticated gating schemes, and application to broader tasks represent promising avenues for research.