
Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry

Published 7 Mar 2024 in cs.AI, cs.CL, cs.DC, and cs.IR | (2403.04311v2)

Abstract: Compound AI applications chain together subcomponents such as generative LLMs, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints on the granularity and type of data it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents into sentences), which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems for serving compound AI queries do not fully exploit parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework, matching or improving latency by 10-30%.


Summary

  • The paper presents ALTO, leveraging intermediate output streaming to boost throughput up to 3x and reduce tail latency by 1.8x.
  • The paper details a methodology using aggregation-aware routing and distributed prompt-aware scheduling to ensure correctness and efficient load balancing.
  • The paper outlines future directions, including automating aggregation inference and refining scheduling algorithms to further optimize AI pipeline performance.

Efficient Orchestration of AI Pipelines with ALTO: Key Challenges and Future Directions

Introduction

The advent of compound AI systems, which combine generative LMs with other AI components into complex pipelines, has introduced new challenges and opportunities for AI serving systems. The ALTO paper proposes an approach to managing these systems efficiently, focusing on streaming intermediate outputs to achieve high throughput and low latency. ALTO introduces aggregation-aware routing and distributed prompt-aware scheduling to tackle the intricacies of orchestrating compound AI systems.

Streaming Improves Pipeline Performance

The fundamental observation that generative LMs produce partial outputs sequentially opens up the opportunity for streaming intermediate data between pipeline stages. In practice, this dramatically improves the performance of AI serving pipelines by reducing latency and increasing throughput. For instance, an evaluation of ALTO on a FacTool-inspired fact-checking pipeline achieved up to 3x higher throughput at a fixed latency target and reduced tail latency by 1.8x compared to a non-streaming baseline.
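To make the streaming idea concrete, here is a minimal Python sketch (not ALTO's actual API) of a downstream stage consuming an LM's token stream and re-fragmenting it into sentences, forwarding each sentence as soon as it completes instead of waiting for the full generation. The token list and stage names are illustrative.

```python
import re
from typing import Iterator

def generate_tokens() -> Iterator[str]:
    """Stand-in for a generative LM that emits tokens incrementally."""
    for tok in ["The", " claim", " is", " false", ".", " Sources", " agree", "."]:
        yield tok

def split_sentences(tokens: Iterator[str]) -> Iterator[str]:
    """Downstream stage: buffer the token stream and emit each sentence
    the moment its terminator arrives, so later stages start early."""
    buf = ""
    for tok in tokens:
        buf += tok
        while (m := re.search(r"[.!?]", buf)) is not None:
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush any trailing partial sentence

sentences = list(split_sentences(generate_tokens()))
```

Because `split_sentences` is itself a generator, a verifier or retriever stage downstream can begin work on the first sentence while the LM is still producing the second, which is exactly the overlap that streaming exploits.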

Challenges in Streaming Partial Outputs

Despite the benefits, streaming partial outputs introduces significant challenges, particularly concerning correctness and efficient load balancing:

  • Correctness: Accurately aggregating streamed partial outputs in stateful stages requires a sophisticated routing strategy. This motivates ALTO's aggregation-aware routing, which ensures that all partial outputs belonging to a request are consistently routed to the correct instances of pipeline stages.
  • Efficient Load Balancing: The diverse and dynamic nature of partial output generation complicates load balancing across distributed stage instances. ALTO takes preliminary steps toward addressing this with a speculative distributed prompt-aware scheduling algorithm, which aims to balance load while preserving the benefit of caching mechanisms in LLM serving engines.
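The correctness requirement above can be sketched in a few lines of Python. This is an illustrative simplification, not ALTO's implementation: each fragment carries ancestry metadata (request id, parent id, sibling index and count, echoing the paper's nested-ancestry idea), and the router hashes the key the downstream stage aggregates over, so every fragment of one request lands on the same aggregator instance.

```python
import hashlib
from dataclasses import dataclass

NUM_AGGREGATORS = 4

@dataclass(frozen=True)
class Fragment:
    request_id: str   # root of the ancestry: the originating request
    parent_id: str    # e.g. the document this sentence was split from
    index: int        # position among siblings
    total: int        # sibling count, so the aggregator knows when it is done
    payload: str

def route(frag: Fragment, num_instances: int = NUM_AGGREGATORS) -> int:
    """Aggregation-aware routing: hash the ancestry key that the next
    stage aggregates over (here the request id), so all fragments of a
    request are co-located on one aggregator instance."""
    digest = hashlib.sha256(frag.request_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_instances

frags = [Fragment("req-7", "doc-0", i, 3, f"sentence {i}") for i in range(3)]
targets = {route(f) for f in frags}  # every fragment of req-7 maps to one target
```

A purely load-based router (round-robin, least-loaded) would scatter a request's fragments across instances, making stateful aggregation impossible; hashing on the aggregation key trades some load skew for correctness.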

ALTO's System Design

ALTO’s approach to these challenges involves a centralized runtime orchestrator and an asynchronous queue-based communication system between pipeline stages. Application developers interact with ALTO through a straightforward interface that allows specifying pipeline stages, their connections, and any necessary aggregation constraints. ALTO introduces a unique interface for expressing these constraints, providing a foundation for future enhancements to enable automatic inference of aggregation logic and a comprehensive library of aggregation operators.
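The queue-based stage communication described above can be illustrated with a small asyncio sketch. This mirrors the general pattern (stages decoupled by asynchronous queues, with a sentinel for shutdown) rather than ALTO's concrete interface; the stage functions are placeholders.

```python
import asyncio

async def stage(inbox: asyncio.Queue, outbox: asyncio.Queue, fn) -> None:
    """A pipeline stage: consume items from its input queue, transform
    them, and push results downstream without blocking other stages."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: propagate shutdown downstream
            await outbox.put(None)
            break
        await outbox.put(fn(item))

async def run_pipeline(inputs: list[str]) -> list[str]:
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stage(q1, q2, str.upper)),       # placeholder stage
        asyncio.create_task(stage(q2, q3, lambda s: s + "!")),  # placeholder stage
    ]
    for item in inputs:
        await q1.put(item)
    await q1.put(None)
    results = []
    while (out := await q3.get()) is not None:
        results.append(out)
    await asyncio.gather(*tasks)
    return results

results = asyncio.run(run_pipeline(["hello", "world"]))
```

Because each stage awaits its own queue independently, the second stage can process item N while the first stage is already working on item N+1, giving the pipelining overlap the paper targets.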

Future Directions

While ALTO lays down a solid framework for efficient AI pipeline serving, there is ample room for development. The current aggregation-aware routing interface can be improved by automating constraint inference based on pipeline structure and adopting a generalized set of aggregation operators. Additionally, a fully realized distributed prompt-aware scheduling algorithm would represent a significant advancement, providing dynamic load balancing that accounts for prompt locality to optimize the use of computational resources.
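One plausible shape for the prompt-aware scheduling mentioned above is sketched below. This is speculative and not from the paper's implementation: the scheduler prefers the replica whose KV cache already holds the longest prefix of the incoming prompt, falling back to the least-loaded replica on ties, so cache locality and load balancing are traded off in one ranking. Replica names and the `cached`/`load` state layout are invented for illustration.

```python
def pick_replica(prompt: str, replicas: dict[str, dict]) -> str:
    """Prompt-aware scheduling sketch: rank replicas by (shared prompt
    prefix length, lighter load). Longer cached prefixes win; among
    equals, the replica with fewer queued requests wins."""
    def shared_prefix(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    best, best_key = None, None
    for name, state in replicas.items():
        overlap = max((shared_prefix(prompt, p) for p in state["cached"]), default=0)
        key = (overlap, -state["load"])
        if best_key is None or key > best_key:
            best, best_key = name, key
    return best

replicas = {
    "r0": {"cached": ["You are a fact checker. Claim:"], "load": 3},
    "r1": {"cached": [""], "load": 1},
}
choice = pick_replica("You are a fact checker. Claim: water is wet", replicas)
```

Here `r0` is chosen despite its heavier load because its cached prefix covers the shared system prompt; with no overlap anywhere, the ranking degenerates to least-loaded routing.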

Concluding Thoughts

ALTO represents a significant step forward in the field of AI pipeline orchestration, specifically for compound systems integrating generative LLMs. By enabling intermediate output streaming, ALTO offers a path to dramatically improved pipeline performance. However, realizing the full potential of this approach necessitates overcoming unique challenges in correctness and load balancing — challenges that ALTO begins to address with novel solutions. Moving forward, advancements in automatic aggregation rule inference, aggregation operator libraries, and distributed prompt-aware scheduling are anticipated to further refine and enhance the capabilities of AI pipeline serving systems.
