Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

Published 5 May 2025 in cs.DC | (2505.02351v1)

Abstract: In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) combining the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.
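
As a rough illustration of two ideas named in the abstract, the sketch below shows how grouped query attention maps several query heads onto one shared key-value head, how that shrinks the key-value cache relative to multi-head attention, and how ALiBi adds a per-head linear distance penalty to attention scores. It is a minimal sketch of the general techniques, not the paper's Opt-GQA kernels or its paging scheme. All function names, the head counts, and the 2-byte value size are assumptions chosen for illustration, and the slope formula is the standard ALiBi one for a power-of-two number of heads.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the shared key-value head read by a query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


def kv_cache_bytes(seq_len, num_kv_heads, head_dim, num_layers, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 counts one key tensor and one value tensor.
    bytes_per_value=2 assumes 16-bit storage (an illustrative choice).
    """
    return 2 * seq_len * num_kv_heads * head_dim * num_layers * bytes_per_value


def alibi_slopes(num_heads):
    """Per-head ALiBi slopes, assuming num_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(slope, query_pos, key_pos):
    """Linear penalty added to an attention score before the softmax."""
    return -slope * (query_pos - key_pos)


# Hypothetical configuration: 32 query heads sharing 8 key-value heads.
mha_bytes = kv_cache_bytes(seq_len=4096, num_kv_heads=32, head_dim=128, num_layers=32)
gqa_bytes = kv_cache_bytes(seq_len=4096, num_kv_heads=8, head_dim=128, num_layers=32)
print(kv_head_for_query_head(5, 32, 8))  # query head 5 reads KV head 1
print(mha_bytes // gqa_bytes)            # cache shrinks by the group size, 4
print(alibi_slopes(8)[:3])               # [0.5, 0.25, 0.125]
```

In this configuration the key-value cache is four times smaller than with one key-value head per query head, which is the memory saving that grouping is meant to deliver. Because ALiBi biases depend only on the distance between positions, they need no learned position embeddings.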
