Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Published 27 Nov 2025 in cs.CL and cs.MA | (2512.17914v1)

Abstract: Multi-agent LLM systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.


Summary

The paper introduces a novel protocol for transmitting compressed KV caches between agents, achieving 5-6x compression while preserving coherence quality above 0.77.
It employs adaptive layer-wise quantization and hybrid information extraction to optimize semantic fidelity with variable bit-widths across transformer layers.
The study demonstrates cross-architecture communication via heterogeneous model calibration, paving the way for efficient LLM deployment in bandwidth-constrained environments.

Overview of Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Introduction

The paper "Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression" (2512.17914) addresses a communication bottleneck in multi-agent LLM systems. Conventional approaches transmit raw text between agents, so each receiving agent must recompute semantic representations that the sender has already produced, which costs both bandwidth and computation. Q-KVComm proposes a protocol that instead transmits compressed key-value (KV) cache representations directly between agents, moving from text-based to representation-based communication. The goal is to reduce bandwidth requirements while avoiding this redundant recomputation.

Technical Innovations

Q-KVComm introduces three principal innovations:

Adaptive Layer-wise Quantization: This method uses sensitivity profiling to allocate variable bit-widths across transformer layers. Unlike uniform quantization, which applies one bit-width everywhere, it assigns each layer between 4 and 8 bits according to how sensitive that layer is to reconstruction error. The aim is to compress aggressively where the error is tolerable and to preserve semantic fidelity where it is not.
Hybrid Information Extraction: The paper introduces a pipeline that combines keyword extraction and named entity recognition with content-type detection to preserve critical information across diverse domains. It is designed to transmit the essential semantic elements without substantial quality loss, for content types ranging from API documentation to narrative texts.
Heterogeneous Model Calibration: This component translates KV caches between different architectures through learned statistical mappings, enabling cross-architecture communication without requiring identical sender and receiver models.
Figure 1: Overview of Q-KVComm Architecture.
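
The adaptive layer-wise quantization idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the uniform min-max quantizer, the relative-error sensitivity metric, the rank-based allocation rule, the intermediate 6-bit option, and all function names are assumptions made for illustration. Each layer's KV cache is represented as a flat list of floats to keep the example dependency-free; a real cache would be a tensor.

```python
import math


def quantize_dequantize(values, bits):
    """Uniform min-max quantization to `bits` bits, followed by dequantization."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: nothing to quantize
        return list(values)
    scale = (hi - lo) / levels
    # Map each value to the nearest integer code, then back to a float.
    return [round((v - lo) / scale) * scale + lo for v in values]


def relative_error(values, recon):
    """L2 norm of the reconstruction error divided by the L2 norm of the input."""
    num = math.sqrt(sum((v - r) ** 2 for v, r in zip(values, recon)))
    den = math.sqrt(sum(v * v for v in values))
    return num / (den + 1e-12)


def profile_sensitivity(layers, probe_bits=4):
    """Assumed sensitivity metric: relative error when a layer is quantized
    at the lowest candidate bit-width."""
    return [relative_error(kv, quantize_dequantize(kv, probe_bits)) for kv in layers]


def allocate_bits(errors, choices=(4, 6, 8)):
    """Assumed allocation rule: rank layers by sensitivity and split the ranking
    into equal buckets, giving the most sensitive bucket the most bits."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])  # least sensitive first
    bits = [0] * len(errors)
    for rank, layer_idx in enumerate(order):
        bucket = rank * len(choices) // len(errors)
        bits[layer_idx] = choices[bucket]
    return bits


if __name__ == "__main__":
    # Toy "layers": one well-behaved, one with a single large outlier.
    smooth = [i / 100 for i in range(-100, 101)]
    with_outlier = smooth + [50.0]
    errors = profile_sensitivity([smooth, with_outlier])
    print("sensitivities:", errors)
    print("bit allocation:", allocate_bits(errors, choices=(4, 8)))
```

In this sketch a layer containing outliers has a wide value range, so a coarse quantizer reconstructs it poorly and it is ranked as more sensitive. A rank-based rule is only one possible design; a threshold on the error or a global bit-budget optimization would be equally plausible readings of "sensitivity profiling".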

Experimental Results

The experimental evaluation spans three question-answering datasets: SQuAD, HotpotQA, and NarrativeQA. Q-KVComm achieves 5-6x compression ratios while maintaining coherence quality scores above 0.77 in all scenarios, and the results hold across the tested model sizes, which range from 1.1B to 1.5B parameters. The protocol also adapts to applications including conversational QA and multi-hop reasoning. Overall, the reported results indicate a substantial reduction in transmitted data without significant degradation in output quality.
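
To make the relationship between per-layer bit-widths and a compression ratio concrete, the following sketch computes the ratio for a given allocation. It rests on stated assumptions: the function name is hypothetical, the baseline precision is left as a parameter because this summary does not state it, and quantization metadata such as scales and offsets is ignored. The reported 5-6x figure may also reflect other parts of the pipeline, so this is an arithmetic illustration rather than a reproduction of the paper's measurement.

```python
def compression_ratio(bits_per_layer, elems_per_layer, baseline_bits=16):
    """Ratio of uncompressed to quantized KV cache size, in bits.

    bits_per_layer:  bit-width assigned to each layer
    elems_per_layer: number of KV cache elements in each layer
    baseline_bits:   assumed precision of the uncompressed cache
    """
    if len(bits_per_layer) != len(elems_per_layer):
        raise ValueError("one bit-width is required per layer")
    original = sum(n * baseline_bits for n in elems_per_layer)
    compressed = sum(n * b for n, b in zip(elems_per_layer, bits_per_layer))
    return original / compressed


if __name__ == "__main__":
    # Four equally sized toy layers with a mixed 4/6/8-bit allocation.
    bits = [4, 4, 6, 8]
    elems = [1000, 1000, 1000, 1000]
    for baseline in (16, 32):
        print(baseline, "bit baseline:", compression_ratio(bits, elems, baseline))
```

When all layers have the same size, the ratio reduces to the baseline precision divided by the average bit-width, which is why the choice of baseline matters as much as the allocation itself.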

Implications and Future Directions

The research has both practical and theoretical implications. Practically, Q-KVComm could ease the deployment of LLM systems in bandwidth-constrained environments such as edge computing and mobile applications. Theoretically, it points toward representation-based inter-agent communication, questions whether text must be the medium of exchange between agents, and opens avenues for more efficient large-scale LLM collaboration.

Future research could explore scaling Q-KVComm to larger models, examining its implications for federated learning, and strengthening security protocols to safeguard internal model representations. Additionally, learned compression techniques might offer further improvements by exploiting more complex statistical structure within KV caches.

Conclusion

Q-KVComm targets a specific inefficiency in multi-agent LLM systems: agents exchange raw text and then recompute representations that already exist. By reducing bandwidth usage and redundant computation, it offers a template for scalable communication protocols in distributed AI systems. Its results across question answering, conversational QA, and multi-hop reasoning suggest the approach could carry over to real-world multi-agent collaboration. These contributions lay the groundwork for future work on more sophisticated, yet still efficient, compression strategies.
