MPEG FCM Test Model (FCTM) Overview
- MPEG FCM Test Model (FCTM) is a standardized reference pipeline that compresses, transmits, and reconstructs neural-network features for split inference.
- It employs a six-stage encoder–decoder design combining neural transforms, feature selection, quantization, pruning, and entropy coding to optimize rate–accuracy trade-offs.
- FCTM achieves significant BD-Rate savings (up to 95%) with minimal accuracy loss, supporting interoperable and privacy-preserving edge-cloud machine vision.
The MPEG Feature Coding Test Model (FCTM) is the standardized reference pipeline and software implementation for compressing, transmitting, and reconstructing intermediate neural-network features in split-inference systems. Standardized within the MPEG Feature Coding for Machines (FCM) framework, FCTM enables low-bitrate, privacy-preserving, and interoperable machine vision by adapting multidimensional DNN feature tensors into a compact bitstream—compatible with established video codecs—for collaborative edge-cloud inference with minimal accuracy degradation. The FCTM design combines neural transforms, feature selection, quantization, channel/energy-based pruning, and entropy coding, aligning with both task-centric (rate–accuracy) and classical (rate–distortion) optimization.
1. System Architecture and Encoding Pipeline
FCTM organizes the split-inference coding process as a six-stage encoder–decoder pipeline, adapting feature tensors produced at a selected DNN split point into a bitstream that a remote node efficiently decodes to resume inference. The encoder stages, each mirrored by a corresponding decoder stage, are:
- Feature Reduction
  - Calculation of global statistics (μ, σ) over the input layers.
  - (Optional) Temporal downsampling (1× or 2×).
  - Learned neural transform (FENet) fusing, spatially downsampling, and channel-reweighting the input layers.
  - Selective Learning Strategy (SLS): channel reordering by importance and energy.
  - Channel adjustment: truncation (pruning) of low-activation channels.
- Feature Conversion
  - Recalculation of the statistics (μ, σ) on the reduced feature tensor.
  - Packing tensors into a single-channel 2D frame.
  - Min–max normalization and uniform b-bit quantization.
- Feature Inner Encoding
  - Treating the quantized frame as a 4:0:0 monochrome YUV image.
  - Compression using a standard video codec (typically VVC in low-delay mode).
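The final encoder step, handing the packed frame to a 4:0:0 monochrome inner codec, amounts to serializing a single 10-bit luma plane. A minimal sketch follows; the function name and raw-file layout (little-endian 16-bit Y-only samples, as e.g. ffmpeg's `gray10le` expects) are illustrative assumptions, not the normative FCTM interface:

```python
import os
import tempfile
import numpy as np

def write_y400_10bit(frame: np.ndarray, path: str) -> None:
    """Serialize a 10-bit single-channel frame as little-endian 16-bit raw
    Y-only samples, the layout a 4:0:0 monochrome inner codec would ingest.
    Illustrative sketch only."""
    assert frame.dtype == np.uint16 and int(frame.max()) < 1024
    frame.astype("<u2").tofile(path)

# Round-trip demo with a small synthetic packed frame.
frame = (np.arange(16, dtype=np.uint16) % 1024).reshape(4, 4)
path = os.path.join(tempfile.mkdtemp(), "packed.yuv")
write_y400_10bit(frame, path)
back = np.fromfile(path, dtype="<u2").reshape(4, 4)
```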
The decoder inverts these stages:
- Feature Inner Decoding: Video codec decoding, recovering the 10-bit packed frame.
- Feature Inverse Conversion: Dequantization, unpacking, and distribution matching via the signaled (μ, σ).
- Feature Restoration: Inverse channel adjustment, the DRNet inverse transform, temporal upsampling (if used), and final restoration to match the input statistics (μ, σ) (Eimon et al., 11 Dec 2025).
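The Feature Conversion packing step in the pipeline above can be sketched as follows; `pack_channels` and the raster tiling order are illustrative assumptions, since the normative tiling is defined by the test model software:

```python
import numpy as np

def pack_channels(feat: np.ndarray, cols: int) -> np.ndarray:
    """Tile a (C, H, W) feature tensor into a single-channel 2D frame,
    row-major, `cols` channel tiles per row. Sketch of the Feature
    Conversion packing step."""
    c, h, w = feat.shape
    rows = -(-c // cols)  # ceiling division
    frame = np.zeros((rows * h, cols * w), dtype=feat.dtype)
    for i in range(c):
        r, q = divmod(i, cols)
        frame[r * h:(r + 1) * h, q * w:(q + 1) * w] = feat[i]
    return frame

feat = np.random.rand(8, 4, 4).astype(np.float32)
frame = pack_channels(feat, cols=4)  # 8 channels -> 2x4 grid of 4x4 tiles
```

The decoder unpacks by slicing the same tile grid in reverse, which is why the frame header must carry the layer dimensions.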
2. Bitstream Syntax and Signaling Semantics
FCTM employs a bitstream structure compatible with the NAL-unit model of established video codecs. Three main classes of syntax elements are defined:
| Element | Frequency | Core Fields |
|---|---|---|
| Feature Parameter Set (FPS) | Infrequent | fps_id, version, codec_profile, max_layers, etc. |
| Frame Header | Per packed frame | fps_id_ref, frame_type, num_layers, dims, global statistics (μ, σ) |
| Slice Header | Per inner slice | bitdepth, normalization parameters, channel map, refine interval |
Payload data follows in standard video codec NAL-unit sequence. Decoder operation is determined by parsing these headers and correctly associating payload and side information for accurate feature restoration (Eimon et al., 11 Dec 2025).
3. Transform, Quantization, Pruning, and Coding Modules
Neural Feature Transform and Fusion
A learned FENet CNN processes the (potentially multi-layered) input features into a lower-dimensional tensor, reducing redundancy and decorrelating features. FENet is built from stacked blocks, each halving the spatial resolution and incorporating residual connections (Eimon et al., 11 Dec 2025).
Channel Pruning
FCTM implements energy-based or range-based channel pruning. In the range-based approach, the per-channel range statistic

r_c = max(F_c) − min(F_c)

is compared against a tunable threshold τ: channels with r_c < τ are pruned, reducing bitrate with minimal impact on accuracy. The resulting keep-mask is signaled periodically in the bitstream (Eimon et al., 11 Dec 2025; Merlos et al., 11 Dec 2025).
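Range-based pruning as described above can be sketched in a few lines; the function and threshold value are illustrative, with the normative statistic handling defined by the test model software:

```python
import numpy as np

def prune_channels(feat: np.ndarray, tau: float):
    """Range-based channel pruning: drop channels whose per-channel range
    max - min falls below threshold tau; return the kept channels plus the
    boolean keep-mask that would be signaled periodically in the bitstream."""
    rng = feat.max(axis=(1, 2)) - feat.min(axis=(1, 2))
    keep = rng >= tau
    return feat[keep], keep

feat = np.random.rand(4, 8, 8).astype(np.float32)
feat[2] = 0.5  # constant channel: zero range, should be pruned
kept, mask = prune_channels(feat, tau=0.1)
```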
Packing and Quantization
Post-pruning, channels are tiled into 2D frames for video codec compatibility, with possible Z-order or optimized tiling to minimize entropy. After min–max normalization, uniform scalar quantization is applied:

q(x) = round((x − min) / (max − min) · (2^b − 1)),

with 2^b − 1 code levels at b-bit depth. Dequantization simply rescales back to [min, max] (Eimon et al., 11 Dec 2025; Merlos et al., 11 Dec 2025).
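The quantization and dequantization pair above can be sketched directly; this is a minimal illustration assuming min/max travel as side information, not the normative implementation:

```python
import numpy as np

def quantize(x: np.ndarray, bits: int = 10):
    """Min-max normalize, then uniform b-bit scalar quantization,
    as in the Feature Conversion stage."""
    lo, hi = float(x.min()), float(x.max())
    q = np.round((x - lo) / (hi - lo) * (2**bits - 1)).astype(np.uint16)
    return q, lo, hi

def dequantize(q: np.ndarray, lo: float, hi: float, bits: int = 10):
    """Inverse Feature Conversion: rescale code levels back to [lo, hi]."""
    return q.astype(np.float32) / (2**bits - 1) * (hi - lo) + lo

x = np.random.randn(16, 16).astype(np.float32)
q, lo, hi = quantize(x)
xhat = dequantize(q, lo, hi)
step = (hi - lo) / (2**10 - 1)  # reconstruction error bounded by step / 2
```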
Entropy Coding
The packed and quantized frames are coded using CABAC/arithmetic coding of the parent video codec (VVC/HEVC/AVC). Context models are instantiated per bitplane or coefficient band, and coding is applied subblock-wise (e.g., 64×64 CTUs), supporting adaptive histograms (Eimon et al., 11 Dec 2025, Merlos et al., 11 Dec 2025).
4. Rate–Distortion, Tuning, and Optimization
FCTM's target metric is typically a rate–accuracy optimization, not classical pixel-wise distortion. The Lagrangian cost to be minimized is

J = ΔA + λ · R,

where ΔA is the decrease in end-task accuracy (e.g., mAP, MOTA), R combines payload and side-information bits, and the operating point is controlled via both the inner-codec QP and the pruning/quantization parameters. Bit allocation is subject to explicit trade-offs:
- Higher λ: lower bitrate via more aggressive pruning, coarser quantization, and less frequent intra-frames.
- Lower λ: higher accuracy at increased bitrate (Eimon et al., 11 Dec 2025).
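The λ trade-off above can be made concrete with a toy operating-point selection; the specific points and λ values are hypothetical, and real FCTM tuning sweeps codec QP and pruning/quantization parameters rather than a precomputed list:

```python
def best_operating_point(points, lam):
    """Pick the (rate, accuracy_drop) pair minimizing J = dA + lam * R.
    Toy illustration of the rate-accuracy Lagrangian."""
    return min(points, key=lambda p: p[1] + lam * p[0])

# Hypothetical operating points: (rate in kbit, accuracy drop in points).
points = [(100, 0.5), (50, 1.0), (10, 3.0)]
```

A small λ favors the high-rate, high-accuracy point; a large λ pushes the selection toward the aggressive low-rate configuration, mirroring the two bullets above.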
Empirical tuning guidance is also provided; for example, configurations targeting higher accuracy disable aggressive channel pruning (preserving more channels), and temporal downsampling is discouraged for high-motion applications.
5. Performance, Codec Compatibility, and Evaluation
Empirical results under the MPEG Common Test and Training Conditions (CTTC) demonstrate the following:
- Bitrate Reduction: FCTM (v6.1) realizes an average 85.14% BD-Rate reduction over pixel/video-based remote inference. Instance segmentation and detection tasks exhibit savings of roughly 94–95%; tracking achieves about 94.6% (Eimon et al., 11 Dec 2025).
- Accuracy: Task accuracy drop (mask AP, mAP, MOTA) is within 0.1–0.3 points relative to lossless feature baselines, even at aggressive quantization settings.
- Inner Codec Selection:
- VVC (H.266) is the anchor codec; HEVC (H.265) achieves an average BD-Rate penalty of 1.39%, while AVC (H.264) incurs a much higher cost (32.28% on average).
- On tracking, HEVC can outperform VVC (–1.81% BD-Rate), suggesting codec flexibility depending on use-case and available hardware (Eimon et al., 11 Dec 2025).
- Computational Complexity: Encoder cost is dominated by the FENet forward path; decoder overhead is much smaller (relative encoder/decoder complexity: 4.39 / 0.27). The system is therefore well suited to resource-constrained edge-cloud deployments.
| Task/Dataset | BD-Rate Savings (vs. remote inference) |
|---|---|
| Instance Seg. (OIV6) | –94.24% |
| Detection (OIV6) | –95.45% |
| Tracking (TVD) | –94.57% |
| Average (5 tasks) | –85.14% |
(Eimon et al., 11 Dec 2025).
6. Scalability, Interoperability, and Privacy Considerations
FCTM is engineered for broad deployment scenarios:
- Interoperability: Standardized bitstream structure (parameter sets, headers, NAL units), profile-based restrictions for low-complexity or hardware-constrained decoders, and clear separation of feature metadata enable wide adoption (Eimon et al., 11 Dec 2025).
- Scalability:
- Spatial: Multi-layer packing and channel subsetting.
- Quality: Layered quality via QP sublayers in the same bitstream (base and enhancement layers).
- Temporal: Configurable downsampling/interpolation for dynamic bitrate and latency tradeoffs.
- Privacy: By transmitting only intermediate feature activations—never raw RGB pixels—FCTM substantially reduces the risk of reconstructing the source image, since restoration requires only global feature statistics rather than per-pixel maps. This makes it considerably harder for an inversion network to recover faces or backgrounds, addressing privacy- and GDPR-oriented constraints (Eimon et al., 11 Dec 2025).
7. Recent Extensions and Research Directions
Recent proposals have further optimized the FCTM pipeline:
- Global-Statistics Preservation: Z-score normalization and re-scaling at the decoder ensure that reconstructed features precisely match the encoder's global distribution. Transmission of means and standard deviations every refresh period allows accurate restoration with minimized side-information overhead. This extension results in additional BD-Rate savings: avg. –17.09%, up to –65.69% for tracking (Eimon et al., 10 Dec 2025).
- Range-Based Channel Truncation and Packing: Runtime range statistics allow dropping low-activation channels dynamically. Packed channel masks are signaled infrequently (e.g., every 128 frames), and tiling minimizes codec footprint. This achieves a further –10.59% mean BD-Rate gain with negligible computational overhead (Merlos et al., 11 Dec 2025).
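The global-statistics-preservation extension above can be sketched as a decoder-side re-scaling; the function name and epsilon guard are illustrative assumptions:

```python
import numpy as np

def restore_global_statistics(decoded: np.ndarray, mu_enc: float,
                              sigma_enc: float, eps: float = 1e-6):
    """Z-score-normalize the decoded features, then re-scale so their
    global mean/std match the encoder-signaled statistics (mu_enc,
    sigma_enc), which are transmitted once per refresh period."""
    z = (decoded - decoded.mean()) / (decoded.std() + eps)
    return z * sigma_enc + mu_enc

# Demo: decoded features with drifted statistics are matched back to
# hypothetical encoder-side values mu_enc=0.0, sigma_enc=2.0.
decoded = np.random.rand(1000).astype(np.float64) * 3 + 7
out = restore_global_statistics(decoded, mu_enc=0.0, sigma_enc=2.0)
```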
A plausible implication is that ongoing development will continue to focus on optimizing encoder complexity, codec integration, and support for diverse DNN architectures. Standardization ensures that any conforming decoder can recover features for downstream machine vision tasks, achieving robust, low-latency, and privacy-friendly distributed inference.
References:
- Eimon et al., 10 Dec 2025
- Eimon et al., 11 Dec 2025
- Merlos et al., 11 Dec 2025