- The paper presents Inferflow as an inference engine that significantly enhances LLM deployment efficiency through modular design and dynamic batching.
- It introduces a novel 3.5-bit quantization method that combines much of the memory savings of 3-bit schemes with accuracy close to that of 4-bit quantization.
- The engine employs hybrid partitioning to optimize GPU workload distribution, ensuring rapid inference and reduced VRAM consumption.
Introduction to Inferflow
The paper introduces Inferflow, an inference engine designed to optimize the deployment and serving of LLMs. Inferflow addresses challenges such as deployment size, system requirements, and latency that arise from the substantial parameter counts of contemporary LLMs, which can extend into the billions. By focusing on inference speed, throughput, result quality, VRAM consumption, and extensibility, Inferflow positions itself as a solution for diverse applications, especially those demanding quick response times and a reduced hardware footprint.
Modular Configuration and Extended Support
A key advantage of Inferflow stems from its modular framework and extensibility. Unlike most inference engines, which require source-code changes to accommodate new models, Inferflow simplifies the process: users can deploy a new model by editing configuration files, because the engine is composed of atomic building blocks and technologies. Inferflow also offers an extensive feature set, including support for the major transformer model types: encoder-only, decoder-only, and encoder-decoder. The engine reads an array of weight file formats directly, mitigating the security concerns associated with formats like pickle by parsing them in a secure manner rather than executing them. Compatibility with both GPU and CPU inference further underlines Inferflow's versatility.
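As a rough illustration of this configuration-driven approach, registering a new decoder-only model might look like the sketch below. The section and key names are hypothetical, chosen to illustrate the idea, and do not reflect Inferflow's actual configuration schema:

```ini
; hypothetical model section -- key names are illustrative only
[model.my_llm]
type = decoder_only          ; encoder_only | decoder_only | encoder_decoder
model_file = models/my_llm/weights.safetensors
tokenizer_file = models/my_llm/tokenizer.model
quantization = 3.5bit        ; e.g. the 3.5-bit scheme described below
device = gpu                 ; gpu | cpu
```

The point is that adding such a section, rather than writing and recompiling model code, is enough to serve a new model.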
Quantization and Efficiency
Inferflow introduces a novel quantization strategy that bridges the gap between the efficiency of 4-bit quantization and the accuracy loss seen in standard 3-bit schemes. Its 3.5-bit approach encodes two adjacent weights together using 7 bits, striking a balance between memory efficiency and computational accuracy. This method yields a tangible reduction in quantization error compared to traditional 3-bit quantization, without a significant loss of model performance.
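The pairing idea can be sketched as follows. This is a minimal illustration, not Inferflow's actual codec: it assumes a per-block affine quantizer and 11 levels per weight, since 11 × 11 = 121 codes fit within the 128 values a 7-bit code can hold.

```python
LEVELS = 11  # 11 levels per weight: 11 * 11 = 121 codes fit in 7 bits

def _quantize(w, scale, zero):
    """Affine-quantize one weight to an integer level in [0, LEVELS)."""
    return max(0, min(LEVELS - 1, round((w - zero) / scale)))

def quantize_pair(w0, w1, scale, zero):
    """Pack two adjacent weights into a single 7-bit code (0..120)."""
    return _quantize(w0, scale, zero) * LEVELS + _quantize(w1, scale, zero)

def dequantize_pair(code, scale, zero):
    """Recover approximate weight values from a 7-bit pair code."""
    q0, q1 = divmod(code, LEVELS)
    return q0 * scale + zero, q1 * scale + zero
```

With 11 levels per weight instead of the 8 that plain 3-bit quantization allows, each weight's rounding error shrinks while the storage cost stays at 3.5 bits per weight.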
Hybrid Partitioning and Dynamic Batching
To leverage multi-GPU systems, Inferflow introduces hybrid model partitioning. This strategy optimizes throughput and inference speed by blending the standard layer-wise and tensor-wise partitioning methods: some parts of the model are split across GPUs at the tensor level, while others are assigned whole, layer by layer, to individual GPUs. The result is a balanced distribution of the computational workload across the devices serving a given LLM.
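A toy placement planner conveys the idea. The round-robin heuristic and the plan's data layout are assumptions for illustration, not Inferflow's implementation:

```python
def hybrid_partition(num_layers, num_gpus, tensor_split_layers):
    """Build a toy placement plan mixing layer-wise and tensor-wise splits.

    Returns {layer_index: [(gpu_index, fraction_of_layer), ...]}.
    Layers in `tensor_split_layers` are sliced evenly across all GPUs
    (tensor-wise); every other layer is placed whole on one GPU,
    assigned round-robin (layer-wise).
    """
    plan = {}
    for layer in range(num_layers):
        if layer in tensor_split_layers:
            plan[layer] = [(g, 1.0 / num_gpus) for g in range(num_gpus)]
        else:
            plan[layer] = [(layer % num_gpus, 1.0)]
    return plan
```

Tensor-wise splitting keeps all GPUs busy on the heaviest layers at the cost of inter-GPU communication, while layer-wise placement avoids that communication for the rest of the model.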
Inferflow also implements dynamic batching, which lets the engine process requests with inputs of varying length in real time, avoiding the latency incurred by static batching methods that wait for an entire batch to complete before admitting new requests. This real-time processing capacity is particularly advantageous for applications demanding prompt generation of responses.
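The contrast with static batching can be seen in a simplified step-level simulation. The scheduling details below are illustrative assumptions, not Inferflow's scheduler:

```python
from collections import deque

def dynamic_batch_run(requests, max_batch):
    """Simulate dynamic batching: finished sequences leave the batch
    immediately, and waiting requests join at the next decode step."""
    queue = deque(requests)   # (request_id, tokens_to_generate)
    active = {}               # request_id -> tokens still to generate
    completed = []
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
        steps += 1
    return completed, steps
```

With requests needing 1, 3, and 2 tokens and a batch size of 2, the third request slips into the slot freed after the first finishes, so all three complete in 3 decode steps; a static batcher that waited for each full batch to drain would need 5.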
Conclusion
The technical report presents Inferflow as an innovative and adaptable inference engine for LLMs, designed to improve efficiency without compromising performance or functionality. Its modularity, support for various model types, advanced quantization techniques, and strategies such as hybrid partitioning and dynamic batching significantly enhance the usability of LLMs across a range of applications. The source code and further details regarding Inferflow are available for those interested in exploring its capabilities.