FRAE: Feedback Recurrent AutoEncoder
- FRAE is a recurrent autoencoder architecture that leverages decoder-to-encoder feedback to efficiently compress sequential data by exploiting temporal redundancies.
- It integrates convolutional GRUs, quantized bottleneck representations, and learned priors to optimize rate-distortion tradeoffs across video and speech domains.
- Empirical evaluations demonstrate FRAE's advantage over traditional and prior learned codecs, achieving lower distortion at lower bitrates while addressing issues such as temporal consistency.
The Feedback Recurrent AutoEncoder (FRAE) is a recurrent autoencoding architecture for online sequential data compression, designed to efficiently exploit temporal redundancies by explicitly feeding back the decoder's hidden state into the encoder at each timestep. FRAE has demonstrated state-of-the-art effectiveness in domains such as video compression (Golinski et al., 2020) and speech spectrogram compression (Yang et al., 2019), yielding compact discrete representations that are amenable to entropy coding, and achieving significant improvements over previous learned and traditional codecs in rate-distortion tradeoffs.
1. Architecture and Operational Principles
FRAE operates as an online, fully causal sequence-to-sequence autoencoder with three core design elements: (i) recurrent connections in both encoder and decoder, (ii) explicit decoder-to-encoder hidden state feedback at every timestep, and (iii) a quantized bottleneck representation facilitating low bit-rate compression.
Video Compression FRAE
For video, compression is organized in a Group-of-Pictures (GoP) structure. Each GoP starts with an I-frame, encoded by a stand-alone image autoencoder, followed by P-frames processed by a recurrent P-frame network. At each timestep t:
- The encoder receives three inputs: the current frame x_t, the previous reconstruction x̂_{t-1} warped via an optical flow estimator (MENet), and the decoder hidden state h_{t-1}.
- The encoder latent z_t is quantized to ẑ_t and decoded jointly with h_{t-1}, and the decoder state h_t is updated via a convolutional GRU (ConvGRU).
- The decoded flow and residual components together reconstruct the new frame x̂_t.
Speech Compression FRAE
For audio, at each spectrogram frame x_t:
- The encoder consumes x_t and the previous decoder state h_{t-1}, outputting a continuous code z_t.
- z_t is quantized into a discrete code ẑ_t via a learned codebook.
- The decoder receives ẑ_t and h_{t-1}, updates the state to h_t with a GRU, and reconstructs x̂_t.
In both domains, h_t provides a summary of past reconstructions, enabling the encoder to focus capacity on innovations not predictable from previous context. The system is strictly online and supports both fixed- and variable-bitrate operation via quantization and entropy coding.
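The per-timestep loop common to both domains can be sketched in a few lines of plain Python. The scalar "encoder", "decoder", and state update below are toy stand-ins for the actual networks, and all weights are purely illustrative:

```python
import math

def frae_step(x, h_prev, codebook):
    """One FRAE timestep: encode with decoder-state feedback,
    quantize, decode, and update the recurrent state."""
    # Encoder: toy scalar map conditioned on the input and the fed-back state.
    z = 0.7 * x + 0.3 * h_prev
    # Quantizer: nearest codebook entry (the discrete bottleneck).
    z_q = min(codebook, key=lambda c: abs(c - z))
    # Decoder: reconstruct from the quantized code and the previous state.
    x_hat = z_q + 0.1 * h_prev
    # Recurrent update (stand-in for a GRU cell).
    h = math.tanh(0.5 * h_prev + 0.5 * z_q)
    return x_hat, h

codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
h = 0.0
for x in [0.2, 0.4, 0.45, 0.5]:   # slowly varying input sequence
    x_hat, h = frae_step(x, h, codebook)
```

Because h_{t-1} already summarizes what the decoder will reconstruct, the encoder only needs to transmit what the state cannot predict, which is the mechanism behind FRAE's rate savings.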
2. Mathematical Formulation
The recurrent feedback and rate-distortion objectives for FRAE are formalized as follows.
Recurrent Feedback
For time t, let ẑ_t denote the quantized latent and h_{t-1} the previous decoder hidden state.
- Encoding: z_t = f_enc(x_t, h_{t-1}), with ẑ_t = Q(z_t)
- Decoding: (x̂_t, h_t) = f_dec(ẑ_t, h_{t-1})
In the video setting, optical flow estimation and warping provide additional encoder inputs (Golinski et al., 2020). In both cases, the decoder state is updated by a GRU cell (convolutional for video): h_t = GRU(h_{t-1}, ẑ_t).
Rate-Distortion Objective
The objective is a sum over sequence positions:

L = Σ_t [ D(x_t, x̂_t) + β · R(ẑ_t) ]

where D is the domain-specific distortion metric (e.g., MS-SSIM for video, mel-weighted MSE for audio), R denotes the rate under a learned prior, and β controls the rate-distortion tradeoff (Golinski et al., 2020; Yang et al., 2019).
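A minimal numerical sketch of this objective, with squared error standing in for the domain-specific distortion and −log2 p(ẑ_t) as the rate term:

```python
import math

def rd_loss(xs, x_hats, probs, beta):
    """Sum over timesteps of distortion + beta * rate.
    Distortion: squared error (stand-in for MS-SSIM / mel-weighted MSE).
    Rate: -log2 of the prior probability of each transmitted code."""
    loss = 0.0
    for x, x_hat, p in zip(xs, x_hats, probs):
        distortion = (x - x_hat) ** 2
        rate = -math.log2(p)          # bits under the learned prior
        loss += distortion + beta * rate
    return loss

# A code the prior predicts well (p=0.5) costs 1 bit;
# a surprising one (p=0.125) costs 3 bits.
loss = rd_loss([1.0, 2.0], [0.9, 2.2], [0.5, 0.125], beta=0.01)
```

Raising β pushes training toward cheaper codes at the cost of reconstruction quality, which is how a family of models sweeps the rate-distortion curve.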
3. Neural Components and Quantization Strategies
The FRAE architecture leverages modular neural blocks tailored to the data modality.
Encoder/Decoder Backbone
- Stacks of convolutional layers and residual blocks implement feature extraction in both encoder and decoder.
- Downsampling is performed via strided convolutions in the encoder; upsampling by transposed convolutions in the decoder.
- Batch normalization is managed carefully, switching to inference mode after warmup to avoid train-test mismatch across unrolled time steps (Golinski et al., 2020).
Recurrent Module
- ConvGRU in video, standard GRU in speech; both receive feedback from the decoder state h_{t-1}.
- In both cases, state feedback is crucial for modeling long-range dependencies.
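For reference, the GRU update underlying both recurrent modules can be written out in scalar form (toy weights; the real cells operate on vectors or feature maps, with the ConvGRU replacing matrix products by convolutions):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_cell(x, h, w):
    """Scalar GRU update: h_t = (1-u)*h_{t-1} + u*h_cand."""
    r = sigmoid(w["wr"] * x + w["ur"] * h)               # reset gate
    u = sigmoid(w["wu"] * x + w["uu"] * h)               # update gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h))  # candidate state
    return (1.0 - u) * h + u * h_cand

w = dict(wr=1.0, ur=0.5, wu=1.0, uu=0.5, wh=1.0, uh=0.5)
h = 0.0
for x in [0.3, 0.3, 0.3]:
    h = gru_cell(x, h, w)
```

The gated convex combination in the last line is what lets the state carry long-range context forward without vanishing across unrolled timesteps.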
Quantization
- Vector quantization to a learned codebook, with nearest-neighbor selection in the forward pass and straight-through gradients in backpropagation (Mentzer et al., 2018).
- In audio, a small codebook per dimension is typical, producing discrete codes ẑ_t.
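Nearest-neighbor vector quantization can be sketched as follows (toy 2-D codebook; in practice the codebook is learned jointly with the networks):

```python
def vq(z, codebook):
    """Nearest-neighbor vector quantization: return the index and entry of
    the codebook vector closest (in squared L2 distance) to the code z."""
    def sq_dist(c):
        return sum((zi - ci) ** 2 for zi, ci in zip(z, c))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return idx, codebook[idx]

# In training, the backward pass copies gradients through the quantizer
# unchanged (straight-through), so only the forward pass is discrete.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, z_q = vq([0.9, 0.2], codebook)
```

Only the index needs to be transmitted; the decoder holds the same codebook and looks the entry back up.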
Prior Models
- A gated PixelCNN models the code distribution p(ẑ_t) for video.
- In audio, learned MLPs conditioned on past codes produce categorical priors used for entropy coding.
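To sketch why an autoregressive prior reduces rate, consider a hypothetical conditional distribution over three codes: under ideal entropy coding, codes the prior predicts well cost few bits.

```python
import math

# Hypothetical autoregressive prior p(code_t | code_{t-1}):
# codes tend to repeat, so repeats are cheap to transmit.
cond_prior = {
    None: {"a": 0.5, "b": 0.25, "c": 0.25},   # first-step prior
    "a":  {"a": 0.8, "b": 0.1,  "c": 0.1},
    "b":  {"a": 0.1, "b": 0.8,  "c": 0.1},
    "c":  {"a": 0.1, "b": 0.1,  "c": 0.8},
}

def bits_for_sequence(codes):
    """Ideal entropy-coded length: sum of -log2 p under the conditional prior."""
    prev, total = None, 0.0
    for c in codes:
        total += -math.log2(cond_prior[prev][c])
        prev = c
    return total

bits = bits_for_sequence(["a", "a", "a", "b"])   # repeats cost ~0.32 bits each
```

An unconditional uniform prior would charge log2(3) ≈ 1.58 bits per code regardless of context; the conditional prior exploits temporal structure in the code stream.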
Additional Modules
- MENet: U-Net architecture for optical flow estimation in video.
- Neural vocoder (e.g., WaveNet) for waveform synthesis from decoded spectrograms in speech (Yang et al., 2019).
4. Training Regimes and Experimental Setup
Video
- Training on subsets of Kinetics400, validation on separate datasets, final evaluation on the HD UVG set (1080p, 3,900 frames).
- GoP of length 8, P-frame network unrolled for 7 steps.
- Loss combines the rate-distortion objective with auxiliary optical flow losses during early iterations.
- Adam optimizer, batch size 16, up to 250,000 iterations.
Speech
- Data: single-speaker and multi-speaker corpora (LibriVox, WSJ1).
- Preprocessing via STFT of 16 kHz audio, mapped to magnitude spectrograms on the dB scale at a 100 Hz frame rate.
- Loss is Mel-scale weighted MSE; Adam optimization, standard minibatch size.
- The β schedule is adjusted to sweep the rate-distortion curve; quantization gradients are handled straight-through.
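The framing arithmetic implied by these settings (16 kHz audio at a 100 Hz spectrogram rate) works out as follows; the dB conversion and floor value are illustrative:

```python
import math

sample_rate = 16000                    # Hz
frame_rate = 100                       # spectrogram frames per second
hop = sample_rate // frame_rate        # samples between successive STFT frames

def to_db(magnitude, floor=1e-5):
    """Magnitude -> dB scale, clamped at a small floor to avoid log(0)."""
    return 20.0 * math.log10(max(magnitude, floor))

db = to_db(0.1)   # a magnitude of 0.1 maps to -20 dB
```

One discrete code is transmitted per 10 ms hop, so the bitrate is (bits per code) × 100 per second.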
5. Empirical Performance and Comparative Analysis
Video Compression
- FRAE outperforms (in MS-SSIM at matched bitrate) leading learned compression methods (Lu et al., 2019; Liu et al., 2020; Rippel et al., 2018; Habibian et al., 2019) and established codecs (H.264/x264, H.265/x265) within the streaming-relevant 0.05–0.35 bpp range (Golinski et al., 2020).
- Ablations:
- Removing feedback: ΔMS-SSIM ≈ −0.002.
- Removing the ConvGRU: ≈ −0.003.
- Removing MENet: ≈ −0.005, with major visual artifacts.
Speech Coding
- At 1.6 kbps, FRAE+WaveNet achieves POLQA ≈ 3.2–3.4, compared with 2.0 for Opus at the same bitrate (Yang et al., 2019).
- Decoder-to-encoder state feedback outperforms alternative recurrent architectures, with 25% lower MSE and 0.3 higher POLQA at constant rate.
- Learned autoregressive priors enable savings of up to 1.5 kbps at fixed quality.
| Domain | State Feedback Gain | Best-Reported Distortion Metrics | Traditional Baseline |
|---|---|---|---|
| Video | MS-SSIM +0.002–0.005 | Outperforms H.265 at 0.35 bpp | H.265, x264/x265 |
| Speech | 25% MSE reduction | POLQA up to 3.4 at 1.6 kbps | Opus, Griffin-Lim |
General Findings
- Decoder-to-encoder state feedback is crucial for high compression ratios and quality in both domains.
- Quantized bottlenecks and learned priors enable practical variable-rate transmission.
- Online, strictly forward operation allows for causal, low-latency decoding in streaming scenarios.
6. Practical Limitations and Proposed Remedies
Temporal Consistency and Flicker (Video)
MS-SSIM and bit allocation for P-frames decline over the course of each GoP, manifesting as perceptible quality "flicker" at I-frame boundaries. This is attributed to per-frame loss averaging. Proposed mitigations include adding temporal consistency losses (e.g., frame-difference SSIM, optical flow-based warping loss) or enforcing uniform bit allocation.
Color Shift and Artifacts
MS-SSIM’s partial insensitivity to color drift leads to hue shifts in low-rate models. Incorporating color-sensitive distortion terms (e.g., in YUV, chroma PSNR) or adjusted metrics is suggested to address this issue.
Edge Artifacts
Unpadded Gaussian filtering in MS-SSIM underweights corners, leading the network to minimize bitrate allocation at image edges. Applying replicate-padding before computing local statistics resolves this bias (Golinski et al., 2020).
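The edge bias can be illustrated with a 1-D, 3-tap stand-in for the Gaussian filtering inside MS-SSIM: zero padding drags edge means toward zero, while replicate padding leaves a constant signal unbiased.

```python
def local_means(xs, pad):
    """3-tap moving average over xs with the given 1-D padding mode."""
    if pad == "zero":
        padded = [0.0] + xs + [0.0]
    elif pad == "replicate":
        padded = [xs[0]] + xs + [xs[-1]]
    else:
        raise ValueError(pad)
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(xs))]

row = [1.0, 1.0, 1.0, 1.0]
zero_edge = local_means(row, "zero")        # edge means biased toward 0
repl_edge = local_means(row, "replicate")   # edge means unbiased
```

With zero padding, edge statistics are systematically attenuated, so any metric built on them underweights the image border, which is exactly the incentive the replicate-padding fix removes.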
Stability in Recurrent Architectures (Speech)
While output-feedback FRAE architectures can slightly outperform state feedback in MSE, they are more prone to instability. Decoder-to-encoder state feedback offers robust convergence and consistent improvements in spectrogram coding (Yang et al., 2019).
7. Broader Impact and Future Directions
The FRAE's reliance on recurrent feedback—operationalized via decoder-to-encoder hidden state transmission—offers a general mechanism for exploiting temporal dependencies in learned compression systems for both video and speech. This approach enables high compression ratios even under strict causality constraints, supporting real-time applications. Extensions via learned priors open avenues for flexible variable-rate deployment. Open empirical issues, particularly regarding temporal artifacts and perceptual color consistency, highlight potential refinements in rate-distortion objectives and loss functions. Further integration with neural synthesis frameworks (e.g., for audio: WaveNet vocoders) broadens the practical utility of FRAE in end-to-end learned compression pipelines (Golinski et al., 2020, Yang et al., 2019).