
FRAE: Feedback Recurrent AutoEncoder

Updated 10 February 2026
  • FRAE is a recurrent autoencoder architecture that leverages decoder-to-encoder feedback to efficiently compress sequential data by exploiting temporal redundancies.
  • It integrates convolutional GRUs, quantized bottleneck representations, and learned priors to optimize rate-distortion tradeoffs across video and speech domains.
  • Empirical evaluations demonstrate FRAE's advantages over prior learned methods and traditional codecs, achieving lower distortion at comparable or lower bitrates while addressing issues such as temporal consistency.

The Feedback Recurrent AutoEncoder (FRAE) is a recurrent autoencoding architecture for online sequential data compression, designed to efficiently exploit temporal redundancies by explicitly feeding back the decoder's hidden state into the encoder at each timestep. FRAE has demonstrated state-of-the-art effectiveness in domains such as video compression (Golinski et al., 2020) and speech spectrogram compression (Yang et al., 2019), yielding compact discrete representations that are amenable to entropy coding, and achieving significant improvements over previous learned and traditional codecs in rate-distortion tradeoffs.

1. Architecture and Operational Principles

FRAE operates as an online, fully causal sequence-to-sequence autoencoder with three core design elements: (i) recurrent connections in both encoder and decoder, (ii) explicit decoder-to-encoder hidden state feedback at every timestep, and (iii) a quantized bottleneck representation facilitating low bit-rate compression.

Video Compression FRAE

For video, compression is organized in a Group-of-Pictures (GoP) structure. Each GoP starts with an I-frame ($x_0$), encoded via a stand-alone image autoencoder, followed by P-frames ($x_1, \dots, x_{N-1}$) processed by a recurrent P-frame network. At each timestep $t$:

  • The encoder receives three inputs: the current frame $x_t$, the previous reconstruction $\hat x_{t-1}$ warped via an optical flow estimator (MENet), and the decoder hidden state $h_{t-1}$.
  • The latent $z_t$ is quantized and decoded jointly with $h_{t-1}$, updating $h_t$ via a convolutional GRU (ConvGRU).
  • The decoded flow and residual components together reconstruct the new frame.
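The GoP bookkeeping above can be sketched as follows; `split_into_gops` is a hypothetical helper illustrating the I-frame/P-frame split per group, not the paper's implementation:

```python
import numpy as np

def split_into_gops(frames, gop_len=8):
    """Split a frame sequence into Groups of Pictures.

    Each GoP is (i_frame, p_frames): the first frame goes to a
    stand-alone image autoencoder, the rest to the recurrent P-frame net.
    """
    gops = []
    for start in range(0, len(frames), gop_len):
        chunk = frames[start:start + gop_len]
        gops.append((chunk[0], chunk[1:]))
    return gops

frames = [np.zeros((16, 16, 3)) for _ in range(20)]
gops = split_into_gops(frames, gop_len=8)
# 20 frames -> GoPs of 8, 8, 4 frames; each full GoP has gop_len - 1 P-frames
```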

Speech Compression FRAE

For audio, at each spectrogram frame $x_t$:

  • The encoder consumes $x_t$ and the previous decoder state $h_{t-1}$, outputting a continuous code $c_t$.
  • $c_t$ is quantized into a discrete $z_t$ via a learned codebook.
  • The decoder receives $z_t$ and $h_{t-1}$, updates the state with a GRU, and reconstructs $\hat x_t$.

In both domains, $h_{t-1}$ provides a summary of past reconstructions, enabling the encoder to focus capacity on innovations not predictable from previous context. The system is strictly online and supports both fixed and variable bitrate operation via quantization and entropy coding.
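A minimal sketch of one FRAE timestep, using random linear maps as stand-ins for the encoder, decoder, and quantizer (all names, shapes, and operations here are illustrative, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_z = 8, 4, 2  # toy frame, state, and code sizes

# Hypothetical linear stand-ins for Enc, Q, DecRNN, DecOut.
W_enc = rng.normal(size=(d_z, d_x + d_h))
W_dec = rng.normal(size=(d_x, d_h))
W_rnn = rng.normal(size=(d_h, d_z + d_h)) * 0.1

def frae_step(x_t, h_prev):
    c_t = W_enc @ np.concatenate([x_t, h_prev])  # encoder sees frame AND decoder state
    z_t = np.round(c_t)                          # scalar quantization stand-in
    h_t = np.tanh(W_rnn @ np.concatenate([z_t, h_prev]))  # recurrent state update
    x_hat = W_dec @ h_t                          # reconstruction from new state
    return z_t, h_t, x_hat

h = np.zeros(d_h)
codes = []
for t in range(5):
    x = rng.normal(size=d_x)
    z, h, x_hat = frae_step(x, h)
    codes.append(z)
```

Only the discrete codes need to be transmitted; the decoder rebuilds the same state trajectory from them, which is what lets the encoder skip information the decoder can already predict.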

2. Mathematical Formulation

The recurrent feedback and rate-distortion objectives for FRAE are formalized as follows.

Recurrent Feedback

For time $t$, let $z_t$ be the quantized latent and $h_{t-1}$ the previous decoder hidden state.

  • Encoding: $c_t = \mathrm{Enc}(x_t, h_{t-1})$, $z_t = Q(c_t)$
  • Decoding: $h_t = \mathrm{DecRNN}(h_{t-1}, z_t)$, $\hat x_t = \mathrm{DecOut}(h_t)$

In the video setting, this includes optical flow estimation and warping as additional inputs (Golinski et al., 2020). In both cases, $h_t$ is updated per a GRU cell (possibly convolutional for video):

$$r_t = \sigma(W_r [z_t, h_{t-1}] + b_r), \quad u_t = \sigma(W_u [z_t, h_{t-1}] + b_u)$$

$$\tilde{h}_t = \tanh(W_h [z_t, r_t \odot h_{t-1}] + b_h), \quad h_t = (1-u_t)\odot h_{t-1} + u_t\odot \tilde{h}_t$$
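The gate equations above can be checked with a minimal NumPy GRU cell (a toy sketch; `x_in` stands for the decoder's input code at each step, and the parameters are random placeholders):

```python
import numpy as np

def gru_cell(x_in, h_prev, Wr, Wu, Wh, br, bu, bh):
    """One GRU update on the concatenated input and previous state."""
    xh = np.concatenate([x_in, h_prev])
    r = 1.0 / (1.0 + np.exp(-(Wr @ xh + br)))   # reset gate
    u = 1.0 / (1.0 + np.exp(-(Wu @ xh + bu)))   # update gate
    h_cand = np.tanh(Wh @ np.concatenate([x_in, r * h_prev]) + bh)
    return (1 - u) * h_prev + u * h_cand        # convex blend of old state and candidate

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
params = [rng.normal(size=(d_h, d_in + d_h)) for _ in range(3)] + [np.zeros(d_h)] * 3
h = np.zeros(d_h)
for t in range(10):
    h = gru_cell(rng.normal(size=d_in), h, *params)
# The state stays in (-1, 1): each update blends the previous state with a
# tanh-bounded candidate, which keeps the recurrence numerically stable.
```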

Rate-Distortion Objective

The objective is the sum over sequence positions:

$$L_{\mathrm{RD}} = \sum_{t} D(x_t, \hat x_t) + \beta R(z_t)$$

where $D$ is the domain-specific distortion metric (e.g., $1 - \mathrm{MS\text{-}SSIM}$ for video, Mel-weighted MSE for audio), $R(z_t) = -\log P_Z(z_t)$ denotes the rate under a learned prior, and $\beta$ controls the rate-distortion tradeoff (Golinski et al., 2020, Yang et al., 2019).
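A toy version of this objective, using plain MSE as a distortion stand-in (the papers use 1 − MS-SSIM or Mel-weighted MSE instead):

```python
import numpy as np

def rd_loss(x, x_hat, log_prior, beta):
    """L_RD = D(x, x_hat) + beta * R(z), with R(z) = -log P_Z(z).

    Here D is plain MSE as a placeholder; log_prior is the learned
    prior's log-probability of the quantized code z.
    """
    distortion = np.mean((x - x_hat) ** 2)
    rate = -log_prior                 # nats needed to entropy-code z
    return distortion + beta * rate

x = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.0, 2.0, 2.0])
loss = rd_loss(x, x_hat, log_prior=np.log(0.25), beta=0.1)
# distortion = 1/3, rate = -log(0.25) = log(4)
```

Raising `beta` penalizes rate more heavily, pushing the model toward coarser codes; sweeping it traces out the rate-distortion curve.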

3. Neural Components and Quantization Strategies

The FRAE architecture leverages modular neural blocks tailored to the data modality.

Encoder/Decoder Backbone

  • Stacks of convolutional layers and residual blocks implement feature extraction in both encoder and decoder.
  • Downsampling is performed via strided convolutions in the encoder; upsampling by transposed convolutions in the decoder.
  • Batch normalization is managed carefully, switching to inference mode after warmup to avoid train-test mismatch across unrolled time steps (Golinski et al., 2020).

Recurrent Module

  • ConvGRU in video, standard GRU in speech; always receives feedback from $h_{t-1}$.
  • In both cases, state feedback is crucial for modeling long-range dependencies.

Quantization

  • Vector quantization to a learned codebook, with nearest-neighbor selection in the forward pass and straight-through gradients in backpropagation [Mentzer et al. 2018].
  • In audio, a codebook of size $K=4$ per dimension is typical, producing discrete codes $z_t \in \{1,\dots,K\}^d$.
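The forward-pass assignment can be sketched as a nearest-neighbor lookup (a per-dimension scalar example with an illustrative codebook; straight-through handling is noted in the comment):

```python
import numpy as np

def vector_quantize(c, codebook):
    """Nearest-neighbor codebook assignment (forward pass).

    In training, gradients bypass the argmin via the straight-through
    estimator: the backward pass treats the quantized output as if it
    were the continuous code c.
    """
    # c: (d,) continuous code; codebook: (K, d) learned entries
    dists = np.sum((codebook - c) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

codebook = np.array([[0.0], [1.0], [2.0], [3.0]])  # K = 4, per-dimension example
idx, z = vector_quantize(np.array([1.4]), codebook)
# 1.4 is closest to entry 1.0 -> idx == 1
```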

Prior Models

  • Gated PixelCNN produces $P_Z(z_t)$ for video.
  • In audio, learned MLPs conditioned on $h_{t-1}$ produce categorical priors used for entropy coding.
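The ideal code length under such a categorical prior is the negative log-probability of the chosen symbols; a small sketch (the probabilities here are made up):

```python
import numpy as np

def code_length_bits(z_indices, probs):
    """Ideal entropy-coded length of a code sequence under a categorical prior.

    probs[t] is the prior P(z_t | context) predicted at step t; an
    arithmetic coder approaches -sum log2 P(z_t) bits in total.
    """
    return -sum(np.log2(p[z]) for z, p in zip(z_indices, probs))

sharp = [np.array([0.7, 0.1, 0.1, 0.1])] * 3   # a confident prior
flat = [np.full(4, 0.25)] * 3                  # a uniform prior
bits_sharp = code_length_bits([0, 0, 0], sharp)
bits_flat = code_length_bits([0, 0, 0], flat)  # exactly 2 bits per symbol
```

A prior that predicts the next code well assigns it high probability, so the same symbols cost fewer bits; this is what makes the learned, state-conditioned priors pay off.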


4. Training Regimes and Experimental Setup

Video

  • Training on subsets of Kinetics400, validation on separate datasets, final evaluation on the HD UVG set (1080p, 3,900 frames).
  • GoP of length 8, P-frame network unrolled for 7 steps.
  • Loss combines rate-distortion with auxiliary optical flow losses for early iterations:

$$L_{fe} = D(\mathrm{warp}(x_{t-1}, f_t), x_t), \quad L_{fd} = D(\mathrm{warp}(\hat x_{t-1}, \hat f_t), x_t)$$

where $f_t$ is the estimated flow and $\hat f_t$ the decoded flow.

  • Adam optimizer, batch size 16, up to 250,000 iterations.

Speech

  • Data: single-speaker and multi-speaker corpora (LibriVox, WSJ1).
  • Preprocessing via 16 kHz STFT, mapped to magnitude spectrograms on the dB scale at a 100 Hz frame rate.
  • Loss is Mel-scale weighted MSE; Adam optimization, standard minibatch size.
  • Schedule adjusts $\beta$ to sweep the rate-distortion curve; quantization gradients handled straight-through.

5. Empirical Performance and Comparative Analysis

Video Compression

  • FRAE outperforms (in MS-SSIM versus bitrate) leading learned compression methods (Lu et al. 2019, Liu et al. 2020, Rippel et al. 2018, Habibian et al. 2019) and established codecs (H.264/x264, H.265/x265) within the streaming-relevant 0.05–0.35 bpp range (Golinski et al., 2020).
  • Ablations:
    • Removing feedback: $\Delta$MS-SSIM ≈ −0.002.
    • Removing ConvGRU: −0.003.
    • Removing MENet: −0.005 and major visual artifacts.

Speech Coding

  • At 1.6 kbps, FRAE+WaveNet achieves POLQA ≈ 3.2–3.4, compared to Opus' <2.0 at the same bitrate (Yang et al., 2019).
  • Decoder-to-encoder feedback outperforms alternative recurrent architectures by >25% lower MSE and >0.3 higher POLQA at constant rate.
  • Learned autoregressive priors ($p(z_t \mid h_{t-1})$) enable up to 1.5 kbps savings at fixed quality.

| Domain | State Feedback Gain | Best-Reported Distortion Metrics | Traditional Baseline |
|--------|---------------------|----------------------------------|----------------------|
| Video  | $\Delta$MS-SSIM +0.002–0.005 | Outperforms H.265 at <0.35 bpp | H.265, x264/x265 |
| Speech | >25% MSE reduction  | POLQA up to 3.4 at 1.6 kbps      | Opus, Griffin-Lim |

General Findings

  • Decoder-to-encoder state feedback is crucial for high compression ratios and quality in both domains.
  • Quantized bottlenecks and learned priors enable practical variable-rate transmission.
  • Online, strictly forward operation allows for causal, low-latency decoding in streaming scenarios.

6. Practical Limitations and Proposed Remedies

Temporal Consistency and Flicker (Video)

MS-SSIM and bit allocation for P-frames decline across each GoP, manifesting as perceptible quality "flicker" at I-frame boundaries. This is attributed to per-frame loss averaging. Proposed mitigations include adding temporal consistency losses (e.g., frame-difference SSIM or optical-flow-based warping loss) or enforcing more uniform bit allocation.
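One such temporal-consistency term can be sketched as distortion on frame differences (a hypothetical formulation for illustration, not the paper's exact loss):

```python
import numpy as np

def temporal_consistency_loss(x_prev, x_cur, xhat_prev, xhat_cur):
    """Penalize mismatch between true and reconstructed frame differences.

    Distortion is computed on temporal deltas rather than individual
    frames, so per-frame errors that move coherently are not punished,
    while frame-to-frame flicker is.
    """
    return np.mean(((x_cur - x_prev) - (xhat_cur - xhat_prev)) ** 2)

x0, x1 = np.zeros((4, 4)), np.ones((4, 4))
# Reconstructions offset by the same constant still match the true
# temporal delta, so this term is zero even though per-frame MSE is not.
loss = temporal_consistency_loss(x0, x1, x0 + 0.1, x1 + 0.1)
```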

Color Shift and Artifacts

MS-SSIM’s partial insensitivity to color drift leads to hue shifts in low-rate models. Incorporating color-sensitive distortion terms (e.g., $L_2$ in YUV, chroma PSNR) or adjusted metrics is suggested to address this issue.

Edge Artifacts

Unpadded $11 \times 11$ Gaussian filtering in MS-SSIM underweights corners, leading the network to minimize bitrate allocation at image edges. Applying replicate-padding before computing local statistics resolves this bias (Golinski et al., 2020).
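The padding fix can be illustrated with a windowed local mean (the first moment used by SSIM statistics; a naive sketch with a box window, not an MS-SSIM implementation):

```python
import numpy as np

def local_mean(img, win=11):
    """Windowed mean with replicate padding, so corners get full support.

    Without padding, an 11x11 window covers each corner pixel far less
    often than interior pixels, underweighting edges in the statistics.
    """
    pad = win // 2
    padded = np.pad(img, pad, mode="edge")   # replicate border values
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + win, j:j + win].mean()
    return out

img = np.ones((16, 16))
m = local_mean(img)
# With replicate padding, a constant image has constant local means
# everywhere, including the corners.
```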

Stability in Recurrent Architectures (Speech)

While output-feedback FRAE architectures can slightly outperform state feedback in MSE, they are more prone to instability. Decoder-to-encoder state feedback offers robust convergence and consistent improvements in spectrogram coding (Yang et al., 2019).

7. Broader Impact and Future Directions

FRAE's reliance on recurrent feedback, operationalized via decoder-to-encoder hidden state transmission, offers a general mechanism for exploiting temporal dependencies in learned compression systems for both video and speech. This approach enables high compression ratios even under strict causality constraints, supporting real-time applications. Extensions via learned priors open avenues for flexible variable-rate deployment. Open empirical issues, particularly regarding temporal artifacts and perceptual color consistency, highlight potential refinements in rate-distortion objectives and loss functions. Further integration with neural synthesis frameworks (e.g., WaveNet vocoders for audio) broadens the practical utility of FRAE in end-to-end learned compression pipelines (Golinski et al., 2020, Yang et al., 2019).
