FRAE: Feedback Recurrent AutoEncoder
- FRAE is a recurrent autoencoder architecture that leverages decoder-to-encoder feedback to efficiently compress sequential data by exploiting temporal redundancies.
- It integrates convolutional GRUs, quantized bottleneck representations, and learned priors to optimize rate-distortion tradeoffs across video and speech domains.
- Empirical evaluations demonstrate FRAE's advantage over traditional and prior learned codecs, achieving lower distortion at lower bitrates while addressing issues such as temporal consistency.
The Feedback Recurrent AutoEncoder (FRAE) is a recurrent autoencoding architecture for online sequential data compression, designed to efficiently exploit temporal redundancies by explicitly feeding back the decoder's hidden state into the encoder at each timestep. FRAE has demonstrated state-of-the-art effectiveness in domains such as video compression (Golinski et al., 2020) and speech spectrogram compression (Yang et al., 2019), yielding compact discrete representations that are amenable to entropy coding, and achieving significant improvements over previous learned and traditional codecs in rate-distortion tradeoffs.
1. Architecture and Operational Principles
FRAE operates as an online, fully causal sequence-to-sequence autoencoder with three core design elements: (i) recurrent connections in both encoder and decoder, (ii) explicit decoder-to-encoder hidden state feedback at every timestep, and (iii) a quantized bottleneck representation facilitating low bit-rate compression.
Video Compression FRAE
For video, compression is organized in a Group-of-Pictures (GoP) structure. Each GoP starts with an I-frame, encoded by a stand-alone image autoencoder, followed by P-frames processed by a recurrent P-frame network. At each timestep t:
- The encoder receives three inputs: the current frame x_t, the previous reconstruction x̂_{t-1} warped via an optical flow estimator (MENet), and the decoder hidden state h_{t-1}.
- The encoder latent z_t is quantized to ẑ_t and decoded jointly with h_{t-1}, and the decoder state h_t is updated via a convolutional GRU (ConvGRU).
- The decoded flow and residual components together reconstruct the new frame x̂_t.
Speech Compression FRAE
For audio, at each spectrogram frame x_t:
- The encoder consumes x_t and the previous decoder state h_{t-1}, outputting a continuous code z_t.
- z_t is quantized into a discrete code ẑ_t via a learned codebook.
- The decoder receives ẑ_t and h_{t-1}, updates the state to h_t with a GRU, and reconstructs x̂_t.
In both domains, h_t provides a summary of past reconstructions, enabling the encoder to focus capacity on innovations not predictable from previous context. The system is strictly online and supports both fixed- and variable-bitrate operation via quantization and entropy coding.
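The per-timestep loop common to both domains can be sketched in a few lines of plain Python. The scalar "encoder", "decoder", and state update below are toy stand-ins for the actual networks, and all weights are purely illustrative:

```python
import math

def frae_step(x, h_prev, codebook):
    """One FRAE timestep: encode with decoder-state feedback,
    quantize, decode, and update the recurrent state."""
    # Encoder: toy scalar map conditioned on the input and the fed-back state.
    z = 0.7 * x + 0.3 * h_prev
    # Quantizer: nearest codebook entry (the discrete bottleneck).
    z_q = min(codebook, key=lambda c: abs(c - z))
    # Decoder: reconstruct from the quantized code and the previous state.
    x_hat = z_q + 0.1 * h_prev
    # Recurrent update (stand-in for a GRU cell).
    h = math.tanh(0.5 * h_prev + 0.5 * z_q)
    return x_hat, h

codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
h = 0.0
for x in [0.2, 0.4, 0.45, 0.5]:   # slowly varying input sequence
    x_hat, h = frae_step(x, h, codebook)
```

Because h_{t-1} already summarizes what the decoder will reconstruct, the encoder only needs to transmit what the state cannot predict, which is the mechanism behind FRAE's rate savings.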
2. Mathematical Formulation
The recurrent feedback and rate-distortion objectives for FRAE are formalized as follows.
Recurrent Feedback
For time t, let ẑ_t denote the quantized latent and h_{t-1} the previous decoder hidden state.
- Encoding: z_t = f_enc(x_t, h_{t-1}), with ẑ_t = Q(z_t)
- Decoding: (x̂_t, h_t) = f_dec(ẑ_t, h_{t-1})
In the video setting, optical flow estimation and warping provide additional encoder inputs (Golinski et al., 2020). In both cases, the decoder state is updated by a GRU cell (convolutional for video): h_t = GRU(h_{t-1}, ẑ_t).
Rate-Distortion Objective
The objective is a sum over sequence positions:

L = Σ_t [ D(x_t, x̂_t) + β · R(ẑ_t) ]

where D is the domain-specific distortion metric (e.g., MS-SSIM for video, mel-weighted MSE for audio), R denotes the rate under a learned prior, and β controls the rate-distortion tradeoff (Golinski et al., 2020; Yang et al., 2019).
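A minimal numerical sketch of this objective, with squared error standing in for the domain-specific distortion and −log2 p(ẑ_t) as the rate term:

```python
import math

def rd_loss(xs, x_hats, probs, beta):
    """Sum over timesteps of distortion + beta * rate.
    Distortion: squared error (stand-in for MS-SSIM / mel-weighted MSE).
    Rate: -log2 of the prior probability of each transmitted code."""
    loss = 0.0
    for x, x_hat, p in zip(xs, x_hats, probs):
        distortion = (x - x_hat) ** 2
        rate = -math.log2(p)          # bits under the learned prior
        loss += distortion + beta * rate
    return loss

# A code the prior predicts well (p=0.5) costs 1 bit;
# a surprising one (p=0.125) costs 3 bits.
loss = rd_loss([1.0, 2.0], [0.9, 2.2], [0.5, 0.125], beta=0.01)
```

Raising β pushes training toward cheaper codes at the cost of reconstruction quality, which is how a family of models sweeps the rate-distortion curve.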
3. Neural Components and Quantization Strategies
The FRAE architecture leverages modular neural blocks tailored to the data modality.
Encoder/Decoder Backbone
- Stacks of convolutional layers and residual blocks implement feature extraction in both encoder and decoder.
- Downsampling is performed via strided convolutions in the encoder; upsampling by transposed convolutions in the decoder.
- Batch normalization is managed carefully, switching to inference mode after warmup to avoid train-test mismatch across unrolled time steps (Golinski et al., 2020).
Recurrent Module
- ConvGRU in video, standard GRU in speech; both receive feedback from the decoder state h_{t-1}.
- In both cases, state feedback is crucial for modeling long-range dependencies.
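For reference, the GRU update underlying both recurrent modules can be written out in scalar form (toy weights; the real cells operate on vectors or feature maps, with the ConvGRU replacing matrix products by convolutions):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_cell(x, h, w):
    """Scalar GRU update: h_t = (1-u)*h_{t-1} + u*h_cand."""
    r = sigmoid(w["wr"] * x + w["ur"] * h)               # reset gate
    u = sigmoid(w["wu"] * x + w["uu"] * h)               # update gate
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h))  # candidate state
    return (1.0 - u) * h + u * h_cand

w = dict(wr=1.0, ur=0.5, wu=1.0, uu=0.5, wh=1.0, uh=0.5)
h = 0.0
for x in [0.3, 0.3, 0.3]:
    h = gru_cell(x, h, w)
```

The gated convex combination in the last line is what lets the state carry long-range context forward without vanishing across unrolled timesteps.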
Quantization
- Vector quantization to a learned codebook, with nearest-neighbor selection in the forward pass and straight-through gradients in backpropagation (Mentzer et al., 2018).
- In audio, a small codebook per dimension is typical, producing discrete codes ẑ_t.
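Nearest-neighbor vector quantization can be sketched as follows (toy 2-D codebook; in practice the codebook is learned jointly with the networks):

```python
def vq(z, codebook):
    """Nearest-neighbor vector quantization: return the index and entry of
    the codebook vector closest (in squared L2 distance) to the code z."""
    def sq_dist(c):
        return sum((zi - ci) ** 2 for zi, ci in zip(z, c))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return idx, codebook[idx]

# In training, the backward pass copies gradients through the quantizer
# unchanged (straight-through), so only the forward pass is discrete.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, z_q = vq([0.9, 0.2], codebook)
```

Only the index needs to be transmitted; the decoder holds the same codebook and looks the entry back up.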
Prior Models
- A gated PixelCNN models the code distribution p(ẑ_t) for video.
- In audio, learned MLPs conditioned on past codes produce categorical priors used for entropy coding.
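To sketch why an autoregressive prior reduces rate, consider a hypothetical conditional distribution over three codes: under ideal entropy coding, codes the prior predicts well cost few bits.

```python
import math

# Hypothetical autoregressive prior p(code_t | code_{t-1}):
# codes tend to repeat, so repeats are cheap to transmit.
cond_prior = {
    None: {"a": 0.5, "b": 0.25, "c": 0.25},   # first-step prior
    "a":  {"a": 0.8, "b": 0.1,  "c": 0.1},
    "b":  {"a": 0.1, "b": 0.8,  "c": 0.1},
    "c":  {"a": 0.1, "b": 0.1,  "c": 0.8},
}

def bits_for_sequence(codes):
    """Ideal entropy-coded length: sum of -log2 p under the conditional prior."""
    prev, total = None, 0.0
    for c in codes:
        total += -math.log2(cond_prior[prev][c])
        prev = c
    return total

bits = bits_for_sequence(["a", "a", "a", "b"])   # repeats cost ~0.32 bits each
```

An unconditional uniform prior would charge log2(3) ≈ 1.58 bits per code regardless of context; the conditional prior exploits temporal structure in the code stream.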
Additional Modules
- MENet: U-Net architecture for optical flow estimation in video.
- Neural vocoder (e.g., WaveNet) for waveform synthesis from decoded spectrograms in speech (Yang et al., 2019).
4. Training Regimes and Experimental Setup
Video
- Training on subsets of Kinetics400, validation on separate datasets, final evaluation on the HD UVG set (1080p, 3,900 frames).
- GoP of length 8, P-frame network unrolled for 7 steps.
- Loss combines the rate-distortion objective with auxiliary optical flow losses during early iterations.
- Adam optimizer, batch size 16, up to 250,000 iterations.
Speech
- Data: single-speaker and multi-speaker corpora (LibriVox, WSJ1).
- Preprocessing via STFT of 16 kHz audio, mapped to magnitude spectrograms on the dB scale at a 100 Hz frame rate.
- Loss is Mel-scale weighted MSE; Adam optimization, standard minibatch size.
- The β schedule is adjusted to sweep the rate-distortion curve; quantization gradients are handled straight-through.
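The framing arithmetic implied by these settings (16 kHz audio at a 100 Hz spectrogram rate) works out as follows; the dB conversion and floor value are illustrative:

```python
import math

sample_rate = 16000                    # Hz
frame_rate = 100                       # spectrogram frames per second
hop = sample_rate // frame_rate        # samples between successive STFT frames

def to_db(magnitude, floor=1e-5):
    """Magnitude -> dB scale, clamped at a small floor to avoid log(0)."""
    return 20.0 * math.log10(max(magnitude, floor))

db = to_db(0.1)   # a magnitude of 0.1 maps to -20 dB
```

One discrete code is transmitted per 10 ms hop, so the bitrate is (bits per code) × 100 per second.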
5. Empirical Performance and Comparative Analysis
Video Compression
- FRAE outperforms (in MS-SSIM at matched bitrate) leading learned compression methods (Lu et al., 2019; Liu et al., 2020; Rippel et al., 2018; Habibian et al., 2019) and established codecs (H.264/x264, H.265/x265) within the streaming-relevant 0.05–0.35 bpp range (Golinski et al., 2020).
- Ablations:
- Removing feedback: ΔMS-SSIM ≈ −0.002.
- Removing the ConvGRU: ≈ −0.003.
- Removing MENet: ≈ −0.005, with major visual artifacts.
Speech Coding
- At 1.6 kbps, FRAE+WaveNet achieves POLQA ≈ 3.2–3.4, compared with 2.0 for Opus at the same bitrate (Yang et al., 2019).
- Decoder-to-encoder state feedback outperforms alternative recurrent architectures, with 25% lower MSE and 0.3 higher POLQA at constant rate.
- Learned autoregressive priors enable savings of up to 1.5 kbps at fixed quality.
| Domain | State Feedback Gain | Best-Reported Distortion Metrics | Traditional Baseline |
|---|---|---|---|
| Video | MS-SSIM +0.002–0.005 | Outperforms H.265 at 0.35 bpp | H.265, x264/x265 |
| Speech | 25% MSE reduction | POLQA up to 3.4 at 1.6 kbps | Opus, Griffin-Lim |
General Findings
- Decoder-to-encoder state feedback is crucial for high compression ratios and quality in both domains.
- Quantized bottlenecks and learned priors enable practical variable-rate transmission.
- Online, strictly forward operation allows for causal, low-latency decoding in streaming scenarios.
6. Practical Limitations and Proposed Remedies
Temporal Consistency and Flicker (Video)
MS-SSIM and bit allocation for P-frames decline over the course of each GoP, manifesting as perceptible quality "flicker" at I-frame boundaries. This is attributed to per-frame loss averaging. Proposed mitigations include adding temporal consistency losses (e.g., frame-difference SSIM, optical flow-based warping loss) or enforcing uniform bit allocation.
Color Shift and Artifacts
MS-SSIM’s partial insensitivity to color drift leads to hue shifts in low-rate models. Incorporating color-sensitive distortion terms (e.g., in YUV, chroma PSNR) or adjusted metrics is suggested to address this issue.
Edge Artifacts
Unpadded Gaussian filtering in MS-SSIM underweights corners, leading the network to minimize bitrate allocation at image edges. Applying replicate-padding before computing local statistics resolves this bias (Golinski et al., 2020).
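The edge bias can be illustrated with a 1-D, 3-tap stand-in for the Gaussian filtering inside MS-SSIM: zero padding drags edge means toward zero, while replicate padding leaves a constant signal unbiased.

```python
def local_means(xs, pad):
    """3-tap moving average over xs with the given 1-D padding mode."""
    if pad == "zero":
        padded = [0.0] + xs + [0.0]
    elif pad == "replicate":
        padded = [xs[0]] + xs + [xs[-1]]
    else:
        raise ValueError(pad)
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(xs))]

row = [1.0, 1.0, 1.0, 1.0]
zero_edge = local_means(row, "zero")        # edge means biased toward 0
repl_edge = local_means(row, "replicate")   # edge means unbiased
```

With zero padding, edge statistics are systematically attenuated, so any metric built on them underweights the image border, which is exactly the incentive the replicate-padding fix removes.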
Stability in Recurrent Architectures (Speech)
While output-feedback FRAE architectures can slightly outperform state feedback in MSE, they are more prone to instability. Decoder-to-encoder state feedback offers robust convergence and consistent improvements in spectrogram coding (Yang et al., 2019).
7. Broader Impact and Future Directions
The FRAE's reliance on recurrent feedback—operationalized via decoder-to-encoder hidden state transmission—offers a general mechanism for exploiting temporal dependencies in learned compression systems for both video and speech. This approach enables high compression ratios even under strict causality constraints, supporting real-time applications. Extensions via learned priors open avenues for flexible variable-rate deployment. Open empirical issues, particularly regarding temporal artifacts and perceptual color consistency, highlight potential refinements in rate-distortion objectives and loss functions. Further integration with neural synthesis frameworks (e.g., for audio: WaveNet vocoders) broadens the practical utility of FRAE in end-to-end learned compression pipelines (Golinski et al., 2020, Yang et al., 2019).