FCM Codec: Efficient Feature Coding
- FCM Codec is a standardized framework that compresses intermediate DNN features using modular encoder–decoder pipelines, enabling efficient split inference.
- It employs feature reduction, conversion, and inner encoding to achieve an average bitrate reduction of roughly 85% while maintaining task accuracy.
- The codec integrates mature video coding tools and rate–distortion optimization to ensure interoperability and secure, bandwidth-efficient machine vision transmission.
Feature Coding for Machines (FCM) Codec
Feature Coding for Machines (FCM) is a standardized codec framework, initiated by the Moving Picture Experts Group (MPEG), for compressing intermediate feature tensors produced by deep neural networks (DNNs) within split-inference pipelines. The FCM codec addresses the challenge of efficiently transmitting intermediate DNN activations from resource-constrained edge devices to servers or cloud back-ends, enabling collaborative intelligence while reducing bandwidth, improving privacy, and preserving task accuracy. The FCM standard and its Feature Coding Test Model (FCTM) are foundational technologies for scalable and interoperable deployment of machine vision applications in bandwidth- and privacy-constrained environments, achieving on average 85% bitrate reduction compared to equivalent pixel-streaming approaches while maintaining inference accuracy (Eimon et al., 11 Dec 2025).
1. Architecture and Bitstream Syntax
The FCM codec specification uses a modular encoder–decoder pipeline, with a clearly defined bitstream syntax and interface for interoperability. The architecture is divided into three main encoder stages, with a parallel decoding pipeline at the server:
- Encoder Stages:
- 1. Feature Reduction:
- Global-statistic signaling: Encodes the mean (μ) and standard deviation (σ) of each original and fused tensor.
- Temporal downsampling: Optionally drops every other frame (2× sampling), with decoder-side linear interpolation.
- Learned transform (FENet): Employs a multi-scale fusion encoder with residual and attention blocks, executing spatial downsampling and channel-wise gain scaling.
- Selective learning and channel pruning: Reorders and truncates channels using activity/range thresholds (τ) and binary channel masks (m).
- 2. Feature Conversion:
- Packing: Raster-scans reduced feature tensor z into a 2D single-channel (monochrome) frame, signaling original shapes.
- Normalization and quantization: Applies min–max normalization and uniform quantization to a 10-bit integer range, with the bitdepth, minimum, and maximum signaled as parameters.
- Side information: Shape metadata, statistical parameters, and quantization info embedded in SEI-like syntax elements.
- 3. Feature Inner Encoding: Feeds 10-bit monochrome feature frames into a standard video codec (e.g., VVC in low-delay mode), leveraging intra/inter prediction, block transforms, and CABAC entropy coding.
- Bitstream Elements:
- Feature Parameter Set NAL (FPS): Stores global statistics (means and variances) in 16- or 32-bit formats, original tensor shapes, and channel masks.
- Temporal Sampling Flag: Indicates 1× or 2× frame rate.
- Quantization Field: Contains bit-depth, min, and max for each quantized tensor.
- Packed Feature Frame NALs: Each NAL is a monochrome frame with rasterized, quantized feature samples.
- Metadata Signaling: All side information (dimensions, statistics, quantization parameters) is encapsulated in SEI fields at fixed intervals (per-sequence or intra period), ensuring decoder synchronization and eliminating changes to legacy video slice syntax except for payload typing.
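The packing, normalization, and quantization steps above can be sketched in pure Python. This is an illustrative toy, not the normative FCM/FCTM code: the function names, the near-square channel tiling, and the dict-based side information are assumptions for demonstration, standing in for the standardized bitstream syntax.

```python
# Illustrative sketch (not normative FCM code): tile a C x H x W feature
# tensor into one monochrome frame, then min-max normalize and uniformly
# quantize to a 10-bit range, keeping the side information a decoder needs.
import math

def pack_features(tensor):
    """Raster-pack C feature channels (each H x W) into a 2D frame."""
    c, h, w = len(tensor), len(tensor[0]), len(tensor[0][0])
    cols = math.ceil(math.sqrt(c))          # near-square channel grid (an assumption)
    rows = math.ceil(c / cols)
    frame = [[0.0] * (cols * w) for _ in range(rows * h)]
    for ch in range(c):
        r0, c0 = (ch // cols) * h, (ch % cols) * w
        for y in range(h):
            for x in range(w):
                frame[r0 + y][c0 + x] = tensor[ch][y][x]
    return frame, {"shape": (c, h, w), "grid": (rows, cols)}

def quantize(frame, bitdepth=10):
    """Min-max normalize to [0, 1], then uniform scalar quantization."""
    flat = [v for row in frame for v in row]
    vmin, vmax = min(flat), max(flat)
    rng = (vmax - vmin) or 1.0              # guard against constant frames
    scale = (1 << bitdepth) - 1
    q = [[round((v - vmin) / rng * scale) for v in row] for row in frame]
    # Side information the decoder needs to invert the normalization.
    return q, {"bitdepth": bitdepth, "min": vmin, "max": vmax}

def dequantize(q, meta):
    """Invert quantization and min-max normalization using signaled metadata."""
    scale = (1 << meta["bitdepth"]) - 1
    rng = meta["max"] - meta["min"]
    return [[v / scale * rng + meta["min"] for v in row] for row in q]
```

In this sketch the returned metadata dicts play the role of the FPS and quantization syntax elements: they travel with the frame so the receiver can invert each step.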
2. Feature Coding Pipeline Details
The end-to-end FCM pipeline is designed to compress feature representations for efficient, interoperable transmission. The principal steps are as follows:
- Feature Reduction:
- Global means and standard deviations for each feature are computed and signaled.
- Temporal downsampling is used for bandwidth reduction, with interpolation restoring frame rate at the receiver.
- FENet, a multi-scale neural encoder, spatially compresses and fuses features, applying per-channel gain to concentrate information.
- Channels are reordered by informativeness, with truncation masking near-constant or low-range channels (those below the threshold τ), allowing a lower channel count without impairing task accuracy.
- Feature Conversion:
- Reduced features are flattened and packed into a 2D array.
- Normalization (min–max) rescales tensor values to [0,1], followed by uniform quantization at the signaled bitdepth.
- All relevant side information is transmitted as bitstream metadata.
- Feature Inner Encoding:
- The quantized packed feature frame is encoded with a conventional video codec. The codec operates in monochrome mode (Y-only), reusing standard prediction and coding tools.
- No modification to core video coding components is necessary; only the payload type is specialized to indicate feature data.
- Decoder Operations:
- Reverse each encoder step: video decode → unpacking/recovery → dequantization/denormalization using signaled global statistics → restoration of original feature tensor shapes and channels using metadata.
- Temporal upsampling fills in dropped frames using interpolation.
- The fully reconstructed intermediate features are fed into the remaining stages of the neural network for task completion.
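The decoder-side temporal upsampling step can be sketched as follows. This is a minimal illustration, assuming 2× downsampling and flat per-frame feature vectors; the function name is hypothetical and the actual FCTM operates on full reconstructed tensors.

```python
# Sketch of decoder-side temporal upsampling: when the encoder drops every
# other frame (2x sampling), the receiver restores the original frame rate
# by linearly interpolating between each pair of transmitted neighbors.
def temporal_upsample(frames):
    """frames: list of flat feature vectors for the kept (even) time steps."""
    out = []
    for i, cur in enumerate(frames):
        out.append(cur)                       # pass the transmitted frame through
        if i + 1 < len(frames):
            nxt = frames[i + 1]
            # synthesize the dropped frame as the midpoint of its neighbors
            out.append([(a + b) / 2.0 for a, b in zip(cur, nxt)])
    return out
```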
3. Rate–Distortion Formulation and Optimization
FCM employs explicit rate–distortion (RD) optimization tailored for machine inference:
- Bitrate Estimate: R = −Σ_q p(q) log₂ p(q), where p is the empirical distribution of quantized values.
- Distortion Metric: Either pixel/feature-level MSE (D = ‖z − ẑ‖²) or a task-driven distortion (e.g., drop in mAP or MSE in DNN feature space).
- Lagrangian Tradeoff: J = D + λR, where λ is varied to generate RD curves. Bitdepth and pruning parameters are optimized to minimize the drop in task accuracy at a given bitrate.
- Quantizer: Uniform scalar quantization is standard; signaled side information enables feature reconstruction with preserved statistical moments.
Adaptive coding tools, including dynamic channel pruning and richer learned entropy models, are recommended to approach optimal rate–distortion performance more closely.
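The RD bookkeeping above can be made concrete with a short sketch: an empirical entropy estimate of the quantized symbols as the rate proxy, MSE as the distortion, and the Lagrangian J = D + λR. The function names are illustrative, not part of the standard.

```python
# Sketch of the rate-distortion formulation: empirical entropy as the rate
# estimate, MSE as distortion, combined via the Lagrangian J = D + lambda * R.
import math
from collections import Counter

def rate_bits_per_symbol(symbols):
    """Shannon entropy of the empirical symbol distribution, in bits/symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mse(ref, rec):
    """Feature-level mean squared error between reference and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)

def lagrangian_cost(ref, rec, symbols, lam):
    """J = D + lambda * R; sweeping lam traces out an RD curve."""
    return mse(ref, rec) + lam * rate_bits_per_symbol(symbols)
```

Sweeping `lam` from small to large trades distortion against rate: small values favor fidelity (high bitdepth, little pruning), large values favor bitrate savings.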
4. Performance Evaluation and Computational Complexity
Comprehensive experiments subject FCM to rigorous benchmarks under standard MPEG CTTC protocols:
- Datasets/Tasks: OpenImagesV6 (MaskRCNN-X101-FPN, FasterRCNN-X101-FPN), SFU v1 (detection), TVD and HiEve (multi-object tracking, JDE splits).
- Video Codec: VTM-23.0, operated in all-intra mode for images and low-delay mode for video.
- Results: Average BD-rate savings are 85.14% over pixel-based remote inference; best-case reductions exceed 94% for segmentation, detection, and tracking on select tasks.
- Task Accuracy: Maintained at parity with edge/cloud baselines, with <0.1% mAP/MOTA drop.
- Complexity Ratios:
- Encoder: 4.39× that of the back-end NN segment (NN-Part 2).
- Decoder: 0.27× that of the edge segment (NN-Part 1).
- The largest cost is the encoder's FENet, motivating future lightweight architectures.
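The BD-rate figures above compare the average bitrate of two codecs at matched task quality. A simplified sketch of that computation is below; note the reference BD metric fits a cubic polynomial to the log-rate curves, whereas this illustration uses piecewise-linear interpolation, so it is an approximation for intuition only.

```python
# Simplified BD-rate sketch: average bitrate change of `test` vs `anchor`
# at matched quality, integrating log-rate over the shared quality range.
# The standard Bjontegaard metric uses a cubic fit; this uses piecewise-
# linear interpolation for clarity.
import math

def bd_rate_percent(anchor, test):
    """anchor/test: lists of (bitrate, quality) points. Returns percent
    bitrate change of `test` relative to `anchor` (negative = savings)."""
    def log_rate_at(points, q):
        pts = sorted(points, key=lambda p: p[1])
        for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
            if q0 <= q <= q1:
                t = (q - q0) / (q1 - q0)
                return math.log(r0) + t * (math.log(r1) - math.log(r0))
        raise ValueError("quality outside curve range")
    # overlap of the two quality ranges
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    n = 100
    diffs = [log_rate_at(test, lo + (hi - lo) * i / n) -
             log_rate_at(anchor, lo + (hi - lo) * i / n)
             for i in range(n + 1)]
    return (math.exp(sum(diffs) / len(diffs)) - 1.0) * 100.0
```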
5. Insights, Challenges, and Future Directions
Empirical analysis identifies key components and open challenges:
- Effectiveness of FENet: Provides >90% spatial–channel entropy reduction but dominates encoder complexity; lighter alternatives are needed.
- Channel Pruning: Pruned channels are signaled infrequently, adding negligible overhead and yielding savings with minimal accuracy loss.
- Statistical Metadata: Accurate conveyance of global statistics (mean, variance) preserves feature distribution alignment post-quantization, enabling improved feature recovery at the decoder.
- Reuse of Mature Video Codecs: By raster-packing features and leveraging VVC/HEVC codecs, FCM inherits robust, hardware-accelerated coding pipelines without modification.
Recommendations for Future Work:
- Develop lightweight, task/network-agnostic transforms to reduce device-side load.
- Explore adaptive bit allocation across both scale and channel, potentially integrating entropy models with richer priors or context.
- Standardize bitstream syntax for feature parameter sets to facilitate universal interoperability.
- Address privacy concerns by protecting side information and enabling secure feature transmission.
- Investigate advanced entropy coding and further reduction in encoder bottlenecks.
6. Significance and Broader Context
The FCM codec represents a paradigm shift in machine vision data flows:
- Bandwidth Efficiency: Delivers an average 85% reduction in transmission cost, without compromising accuracy, compared to conventional cloud-offloading approaches.
- Privacy: Intermediate features generally lack a direct mapping to the input's appearance, offering inherent obfuscation of sensitive content.
- Interoperability: By adopting existing video codec infrastructure and standardized metadata, FCM is suitable for integration into existing networks, devices, and cloud services.
- Scalability and Extensibility: Supports continued evolution through modular reference implementations (e.g., FCTM), consistent with MPEG-AI standards.
FCM enables efficient, secure, and accurate split-inference deployment, forming the technical foundation for intelligent consumer devices, IoT endpoints, and distributed machine vision systems in contemporary and future applications (Eimon et al., 11 Dec 2025).