IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs

Published 18 Sep 2025 in cs.LG | (2509.15455v1)

Abstract: LLMs promise impressive capabilities, yet their multi-billion-parameter scale makes on-device or low-resource deployment prohibitive. Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits, as they rely on isolated, layer-specific metrics that overlook critical inter-layer interactions affecting overall performance. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE) to efficiently obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. Second, building upon SPQE, we propose Interaction-aware Mixed-Precision Quantization (IMPQ), which translates these Shapley estimates into a binary quadratic optimization formulation, assigning either 2- or 4-bit precision to layers under strict memory constraints. Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate IMPQ's scalability and consistently superior performance compared to methods relying solely on isolated metrics. Across average precisions spanning 4 bits down to 2 bits, IMPQ cuts perplexity by 20 to 80 percent relative to the best baseline, with the margin growing as the bit-width tightens.

Summary

  • The paper introduces IMPQ, a framework that leverages Shapley-based Progressive Quantization Estimation to capture inter-layer interactions in LLMs.
  • It formulates mixed precision quantization as a binary quadratic optimization, assigning optimal 2-bit and 4-bit precisions under memory constraints.
  • Evaluation on models like Llama and Qwen shows a 20–80% perplexity reduction over baselines, demonstrating improved scalability and stability.

Introduction

LLMs offer strong capabilities across NLP tasks, yet their scale imposes prohibitive memory and computational requirements, impeding deployment in resource-constrained environments. Existing mixed-precision quantization strategies falter at low bit precisions because they rely on isolated, per-layer metrics that overlook inter-layer interactions. This paper introduces Interaction-aware Mixed-Precision Quantization (IMPQ), a framework that uses Shapley-based Progressive Quantization Estimation (SPQE) to estimate layer sensitivities and inter-layer interactions, then translates these estimates into globally informed precision assignments via binary quadratic optimization.

Method

The paper reframes mixed-precision quantization as a cooperative game among layers, using Shapley value analysis to model layer interactions. SPQE quantifies each layer's contribution through progressive quantization, which keeps the model stable during evaluation and yields low-variance Shapley estimates. Building on these estimates, IMPQ formulates precision assignment as a binary quadratic optimization problem that assigns 2- or 4-bit precision to each layer under strict memory constraints. This global allocation significantly outperforms traditional layer-isolated metrics (Figure 1).
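The Shapley estimation step can be illustrated with a standard Monte Carlo permutation sampler. This is a simplified sketch, not the paper's SPQE procedure: the coalition value function below is a synthetic stand-in (hypothetical per-layer penalties and one pairwise interaction term), whereas SPQE would evaluate the actual progressively quantized model.

```python
import random

# Synthetic stand-in for the paper's sensitivity measure. A coalition S is
# the set of layer indices kept at 4-bit (the rest at 2-bit); its "value"
# is the quality retained. All numbers here are illustrative only.
PENALTY = {0: 3.0, 1: 1.0, 2: 2.0}   # hypothetical per-layer contributions
INTERACTION = {(0, 2): 1.5}          # hypothetical pairwise interaction

def coalition_value(S):
    """Value of keeping the layers in S at 4-bit precision."""
    v = sum(PENALTY[i] for i in S)
    v += sum(w for (a, b), w in INTERACTION.items() if a in S and b in S)
    return v

def shapley_monte_carlo(layers, value, n_perms=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled layer orderings."""
    rng = random.Random(seed)
    phi = {i: 0.0 for i in layers}
    for _ in range(n_perms):
        perm = list(layers)
        rng.shuffle(perm)
        S, prev = set(), value(set())
        for i in perm:
            S.add(i)
            cur = value(S)
            phi[i] += cur - prev   # marginal contribution of layer i
            prev = cur
    return {i: phi[i] / n_perms for i in layers}

phi = shapley_monte_carlo([0, 1, 2], coalition_value)
print(phi)  # layer 0 and 2 split the interaction term; layer 1 gets 1.0
```

By construction, the estimates satisfy the efficiency property: they sum to the value of the full coalition, with the interaction weight split between the interacting layers.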

Figure 1: Wikitext-2 perplexity comparison of quantization methods across Gemma, Llama, and Qwen models with the GPTQ backend.
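The precision-assignment step can be sketched as a small binary quadratic program solved by brute force. The sensitivities, interaction weights, and memory figures below are made up for illustration; the paper's actual formulation and solver may differ, and real models have far too many layers for exhaustive search.

```python
from itertools import product

# Choose x_i in {0, 1} per layer (0 = 2-bit, 1 = 4-bit) to minimize the
# predicted loss under a memory budget. Linear terms model per-layer
# sensitivity; quadratic terms model pairwise interactions between layers
# that are both quantized aggressively. All values are illustrative.
sens = [3.0, 1.0, 2.0, 2.5]            # loss if layer i stays at 2-bit
inter = {(0, 2): 1.5, (1, 3): 0.5}     # extra loss if both i and j are 2-bit
mem = [4, 4, 4, 4]                     # extra memory (MB) per 4-bit layer
budget = 8                             # budget above the all-2-bit floor

def loss(x):
    """Predicted loss of assignment x (x[i] = 1 means layer i gets 4-bit)."""
    l = sum(s for i, s in enumerate(sens) if x[i] == 0)
    l += sum(w for (i, j), w in inter.items() if x[i] == 0 and x[j] == 0)
    return l

# Exhaustive search over feasible assignments (fine for a toy problem).
best = min(
    (x for x in product((0, 1), repeat=len(sens))
     if sum(m for i, m in enumerate(mem) if x[i]) <= budget),
    key=loss,
)
print(best, loss(best))
```

Note how the interaction terms change the answer: layer 2 has a higher isolated sensitivity than layer 3, yet promoting layers 0 and 3 to 4-bit is optimal here because it avoids the large (0, 2) interaction penalty more cheaply, which is precisely the effect isolated per-layer metrics miss.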

Results

The evaluation across Llama-3, Gemma-2, and Qwen-3 demonstrated IMPQ's consistent superiority. It reduces perplexity by 20–80% relative to the best baselines at constrained bit-widths, illustrating the critical role of interaction-aware strategies in mixed-precision quantization. IMPQ not only outperforms isolated-metric approaches but also shows greater scalability and robustness as precision constraints tighten (Figure 2).

Figure 2: Comparison of perplexity for SPQE and layer pruning-based Shapley estimation on Llama 3.1-8B using Quanto. Layer pruning causes perplexity to diverge after 5 layers, while progressive quantization remains stable.

Discussion

IMPQ marks a shift from isolated per-layer assessment to modeling quantization as cooperation among layers via Shapley values. This explicitly accounts for errors that propagate across layers, yielding precision assignments well beyond what existing heuristics achieve. The computational cost of SPQE is amortized: it is a one-time procedure whose estimates benefit all subsequent deployments.

Conclusion

IMPQ, through its Shapley-based quantization strategy, significantly advances low-bit quantization by accounting for inter-layer interactions. It sets a new benchmark for deploying LLMs on resource-limited platforms and points the way toward extending cooperative game-theoretic methods to other model compression techniques.
