
MPEG FCM Test Model (FCTM) Overview

Updated 14 January 2026
  • MPEG FCM Test Model (FCTM) is a standardized reference pipeline that compresses, transmits, and reconstructs neural-network features for split inference.
  • It employs a six-stage encoder–decoder design combining neural transforms, feature selection, quantization, pruning, and entropy coding to optimize rate–accuracy trade-offs.
  • FCTM achieves significant BD-Rate savings (up to 95%) with minimal accuracy loss, supporting interoperable and privacy-preserving edge-cloud machine vision.

The MPEG Feature Coding Test Model (FCTM) is the standardized reference pipeline and software implementation for compressing, transmitting, and reconstructing intermediate neural-network features in split-inference systems. Standardized within the MPEG Feature Coding for Machines (FCM) framework, FCTM enables low-bitrate, privacy-preserving, and interoperable machine vision by adapting multidimensional DNN feature tensors into a compact bitstream—compatible with established video codecs—for collaborative edge-cloud inference with minimal accuracy degradation. The FCTM design combines neural transforms, feature selection, quantization, channel/energy-based pruning, and entropy coding, aligning with both task-centric (rate–accuracy) and classical (rate–distortion) optimization.

1. System Architecture and Encoding Pipeline

FCTM organizes the split-inference coding process as a six-stage encoder–decoder pipeline, adapting feature tensors produced at a selected DNN split point into a bitstream that is efficiently decoded by a remote node to resume inference. The encoder stages, each invertible at the decoder, are:

  1. Feature Reduction
    • Calculation of global statistics ($\mu_x$, $\sigma_x$) over input layers.
    • (Optional) Temporal downsampling (1× or 2×).
    • Learned neural transform (FENet) that fuses layers, spatially downsamples, and reweights channels.
    • Selective Learning Strategy (SLS): channel reordering by importance and energy.
    • Channel adjustment: truncation (pruning) of low-activation channels.
  2. Feature Conversion
    • Recalculation of ($\mu_z$, $\sigma_z$) on the reduced feature $z$.
    • Packing $C \times H \times W$ tensors into a single-channel 2D frame.
    • Min–max normalization and uniform $b$-bit quantization.
  3. Feature Inner Encoding
    • Treating the quantized frame as a 4:0:0 monochrome YUV image.
    • Compression using a standard video codec (typically VVC in low-delay mode).
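The Feature Conversion step above can be sketched as a simple raster tiling of channels into one monochrome frame. The helper below is illustrative only (the function name and the near-square layout are assumptions; FCTM may use other, entropy-minimizing tilings), assuming NumPy:

```python
import numpy as np

def pack_channels(z: np.ndarray) -> np.ndarray:
    """Tile a C x H x W feature tensor into a single-channel 2D frame.

    Channels are laid out row-major on a near-square grid; unused tiles
    are zero-padded. This is the simplest raster layout, not the
    normative FCTM packing.
    """
    c, h, w = z.shape
    cols = int(np.ceil(np.sqrt(c)))
    rows = int(np.ceil(c / cols))
    frame = np.zeros((rows * h, cols * w), dtype=z.dtype)
    for i in range(c):
        r, q = divmod(i, cols)
        frame[r * h:(r + 1) * h, q * w:(q + 1) * w] = z[i]
    return frame
```

The resulting 2D frame can then be handed to the inner video codec as a 4:0:0 monochrome picture.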

The decoder inverts these stages:

  1. Feature Inner Decoding: Video codec decoding (to a 10-bit frame).
  2. Feature Inverse Conversion: Dequantization, unpacking, and distribution matching via ($\mu_z$, $\sigma_z$).
  3. Feature Restoration: Inverse channel adjustment, DRNet transform, temporal upsampling (if used), and final restoration to match input statistics ($\mu_x$, $\sigma_x$) (Eimon et al., 11 Dec 2025).

2. Bitstream Syntax and Signaling Semantics

FCTM employs a bitstream structure compatible with the NAL-unit model of established video codecs. Three main classes of syntax elements are defined:

| Element | Frequency | Core Fields |
| --- | --- | --- |
| Feature Parameter Set (FPS) | Infrequent | fps_id, version, codec_profile, max_layers, etc. |
| Frame Header | Per packed frame | fps_id_ref, frame_type, num_layers, dims, $\mu_x$, $\sigma_x$ |
| Slice Header | Per inner slice | bitdepth, $z_{min}$, $z_{max}$, $\mu_z$, $\sigma_z$, channel map, refine interval |

Payload data follows in the standard video codec NAL-unit sequence. Decoder operation is determined by parsing these headers and correctly associating payload and side information for accurate feature restoration (Eimon et al., 11 Dec 2025).
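To make the payload/side-information association concrete, the sketch below models the three header classes as plain Python dataclasses. Field names follow the table above but are illustrative placeholders, not normative FCTM syntax:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FeatureParameterSet:   # signaled infrequently
    fps_id: int
    version: int
    codec_profile: int
    max_layers: int

@dataclass
class FrameHeader:           # one per packed frame
    fps_id_ref: int          # must reference an active FPS
    frame_type: int
    num_layers: int
    dims: List[int]
    mu_x: float              # global input statistics
    sigma_x: float

@dataclass
class SliceHeader:           # one per inner slice
    bitdepth: int
    z_min: float             # range for dequantization
    z_max: float
    mu_z: float              # reduced-feature statistics
    sigma_z: float
    channel_map: List[bool]  # pruning mask
    refine_interval: int
```

A conforming decoder resolves `fps_id_ref` against the active parameter set before interpreting the slice payload.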

3. Transform, Quantization, Pruning, and Coding Modules

Neural Feature Transform and Fusion

A learned FENet CNN processes potentially multi-layered input ($x_n \in \mathbb{R}^{C_n \times H_n \times W_n}$, $n = 1, \dots, N$) into a lower-dimensional tensor $z = \mathrm{FENet}(X; \Theta)$, reducing redundancy and decorrelating features. FENet is built from stacked blocks, each halving spatial resolution and incorporating residual connections (Eimon et al., 11 Dec 2025).

Channel Pruning

FCTM implements energy-based or range-based channel pruning. In one approach, the range statistic per channel,

$$r_c = \max(z_c) - \min(z_c)$$

is compared against a tunable threshold:

$$T = \alpha \, \frac{1}{C} \sum_{c=1}^{C} r_c$$

Channels with $r_c < T$ are pruned, reducing bitrate with minimal impact on accuracy ($\alpha$ typically in $[0.66, 0.75]$). The resulting mask is signaled periodically in the bitstream (Eimon et al., 11 Dec 2025, Merlos et al., 11 Dec 2025).
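A minimal NumPy sketch of this range-based rule (the function name is an assumption, not part of the test model):

```python
import numpy as np

def prune_channels(z: np.ndarray, alpha: float = 0.7):
    """Range-based channel pruning on a C x H x W feature tensor.

    Computes the per-channel range r_c = max(z_c) - min(z_c), compares
    it against T = alpha * mean(r_c), and drops channels below T.
    Returns the kept channels and the boolean keep-mask that would be
    signaled in the bitstream.
    """
    r = z.max(axis=(1, 2)) - z.min(axis=(1, 2))  # per-channel range r_c
    threshold = alpha * r.mean()                 # T = alpha * (1/C) * sum(r_c)
    keep = r >= threshold                        # prune channels with r_c < T
    return z[keep], keep
```

Low-activation channels (small dynamic range) are removed before packing, and the mask lets the decoder reinsert zero channels at the right positions.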

Packing and Quantization

Post-pruning, channels are tiled into 2D frames for video codec compatibility, with optional Z-order or entropy-minimizing tiling. Quantization applies uniform scalar quantization after min–max normalization:

$$z_{\mathrm{norm}}[i] = \mathrm{clip}\left(\frac{z[i] - z_{min}}{z_{max} - z_{min}},\, 0,\, 1\right), \qquad z_q[i] = \min\left(\max\left(\lfloor z_{\mathrm{norm}}[i] \cdot L \rfloor,\, 0\right),\, L-1\right)$$

with $L = 2^b$ for $b$-bit depth. Dequantization simply rescales to $[0, 1]$ (Eimon et al., 11 Dec 2025, Merlos et al., 11 Dec 2025).
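The quantizer equation above maps directly to a few lines of NumPy. This sketch follows the formula literally; the mid-point reconstruction in `dequantize` is one common choice, not necessarily the test model's exact inverse:

```python
import numpy as np

def quantize(z: np.ndarray, b: int = 10):
    """Min-max normalize, then uniform b-bit quantize (per the equation
    above). Returns indices plus (z_min, z_max) as side information."""
    z_min, z_max = float(z.min()), float(z.max())
    L = 2 ** b
    z_norm = np.clip((z - z_min) / (z_max - z_min), 0.0, 1.0)
    z_q = np.clip(np.floor(z_norm * L), 0, L - 1).astype(np.int64)
    return z_q, z_min, z_max

def dequantize(z_q: np.ndarray, b: int = 10) -> np.ndarray:
    """Rescale indices back to [0, 1] using mid-point reconstruction."""
    L = 2 ** b
    return (z_q.astype(np.float64) + 0.5) / L
```

Distribution matching via ($\mu_z$, $\sigma_z$) then restores the original scale at the decoder.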

Entropy Coding

The packed and quantized frames are coded using CABAC/arithmetic coding of the parent video codec (VVC/HEVC/AVC). Context models are instantiated per bitplane or coefficient band, and coding is applied subblock-wise (e.g., 64×64 CTUs), supporting adaptive histograms (Eimon et al., 11 Dec 2025, Merlos et al., 11 Dec 2025).

4. Rate–Distortion, Tuning, and Optimization

FCTM's target metric is typically rate–accuracy, not classical pixel-wise rate–distortion. The Lagrangian cost to minimize is:

$$J = D_{task}(\hat{Y}; \hat{Y}_{refined}) + \lambda R_{total}$$

where $D_{task}$ is the decrease in end-task accuracy (e.g., mAP, MOTA), $R_{total}$ combines payload and side-information bits, and $\lambda$ is controlled via both the codec QP and the pruning/quantization parameters. Bit allocation is subject to explicit trade-offs.

Empirical tuning guidance is provided: a higher $\alpha$ raises the pruning threshold $T$ and discards more channels (a lower $\alpha$ preserves more), and temporal downsampling is discouraged for high-motion applications.
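The Lagrangian selection can be read as a simple operating-point search over candidate configurations. The sketch below is illustrative (the values are invented, not CTTC measurements); each candidate corresponds to, say, one (QP, $\alpha$) setting:

```python
def best_operating_point(points, lam):
    """Pick the (rate_bits, accuracy_drop) pair minimizing
    J = D_task + lam * R_total over a set of candidate configurations.

    points: iterable of (rate_bits, accuracy_drop) tuples.
    lam:    Lagrange multiplier trading rate against task accuracy.
    """
    return min(points, key=lambda p: p[1] + lam * p[0])
```

Sweeping `lam` traces out the rate–accuracy curve used for BD-Rate comparisons.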

5. Performance, Codec Compatibility, and Evaluation

Empirical results under the MPEG Common Test and Training Conditions (CTTC) demonstrate the following:

  • Bitrate Reduction: FCTM (v6.1) realizes an average 85.14% BD-Rate reduction over pixel/video-based remote inference. Instance segmentation and detection tasks exhibit ~95% savings; tracking tasks achieve ~94% savings (Eimon et al., 11 Dec 2025).
  • Accuracy: Task accuracy drop (mask AP, mAP, MOTA) is within 0.1–0.3 points relative to lossless feature baselines, even at aggressive quantization settings.
  • Inner Codec Selection:
    • VVC (H.266) is the anchor codec; HEVC (H.265) achieves an average BD-Rate penalty of 1.39%, while AVC (H.264) incurs a much higher cost (32.28% on average).
    • On tracking, HEVC can outperform VVC (–1.81% BD-Rate), suggesting codec flexibility depending on use-case and available hardware (Eimon et al., 11 Dec 2025).
  • Computational Complexity: Encoder cost is dominated by the FENet forward path; decoder overhead is much smaller (encoder/decoder complexity ratio: 4.39 / 0.27). The system is highly suitable for resource-constrained edge nodes.
| Task/Dataset | BD-Rate Savings (vs. remote inference) |
| --- | --- |
| Instance Seg. (OIV6) | –94.24% |
| Detection (OIV6) | –95.45% |
| Tracking (TVD) | –94.57% |
| Average (5 tasks) | –85.14% |

(Eimon et al., 11 Dec 2025).

6. Scalability, Interoperability, and Privacy Considerations

FCTM is engineered for broad deployment scenarios:

  • Interoperability: Standardized bitstream structure (parameter sets, headers, NAL units), profile-based restrictions for low-complexity or hardware-constrained decoders, and clear separation of feature metadata enable wide adoption (Eimon et al., 11 Dec 2025).
  • Scalability:
    • Spatial: Multi-layer packing and channel subsetting.
    • Quality: Layered quality via QP sublayers in the same bitstream (base and enhancement layers).
    • Temporal: Configurable downsampling/interpolation for dynamic bitrate and latency tradeoffs.
  • Privacy: By transmitting only intermediate feature activations, never raw RGB pixels, FCTM substantially reduces the risk of reconstructing the source image, since the refinement process requires only global feature statistics, not per-pixel maps. This makes it difficult for an inversion network to recover faces or backgrounds, addressing privacy and GDPR-oriented constraints (Eimon et al., 11 Dec 2025).

7. Recent Extensions and Research Directions

Recent proposals have further optimized the FCTM pipeline:

  • Global-Statistics Preservation: Z-score normalization and re-scaling at the decoder ensure that reconstructed features precisely match the encoder's global distribution. Transmission of means and standard deviations every refresh period allows accurate restoration with minimized side-information overhead. This extension results in additional BD-Rate savings: avg. –17.09%, up to –65.69% for tracking (Eimon et al., 10 Dec 2025).
  • Range-Based Channel Truncation and Packing: Runtime range statistics allow dropping low-activation channels dynamically. Packed channel masks are signaled infrequently (e.g., every 128 frames), and tiling minimizes codec footprint. This achieves a further –10.59% mean BD-Rate gain with negligible computational overhead (Merlos et al., 11 Dec 2025).
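The global-statistics preservation step reduces to a plain z-score match at the decoder: normalize the decoded features by their own (drifted) statistics, then rescale with the signaled encoder-side values. A minimal sketch (function name and epsilon guard are assumptions):

```python
import numpy as np

def restore_statistics(z_hat: np.ndarray, mu_z: float, sigma_z: float):
    """Re-impose the encoder's signaled global statistics on decoded
    features whose distribution drifted through quantization and inner
    coding: z-score-normalize by the decoded statistics, then rescale
    to (mu_z, sigma_z)."""
    mu_hat = float(z_hat.mean())
    sigma_hat = float(z_hat.std()) + 1e-8  # guard against zero variance
    return (z_hat - mu_hat) / sigma_hat * sigma_z + mu_z
```

Only two scalars per refresh period need to be transmitted, which keeps the side-information overhead negligible relative to the BD-Rate gain.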

A plausible implication is that ongoing development will continue to focus on optimizing encoder complexity, codec integration, and support for diverse DNN architectures. Standardization ensures that any conforming decoder can recover features for downstream machine vision tasks, achieving robust, low-latency, and privacy-friendly distributed inference.


References:

(Eimon et al., 10 Dec 2025; Eimon et al., 11 Dec 2025; Merlos et al., 11 Dec 2025)
