
Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry

Published 7 Mar 2024 in cs.AI, cs.CL, cs.DC, and cs.IR | (2403.04311v2)

Abstract: Compound AI applications chain together subcomponents such as generative LLMs, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints on the granularity and type of data it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents into sentences), which may then be re-aggregated at later parts of the computation. Due to this complexity, existing systems for serving compound AI queries do not fully exploit parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework, matching or improving latency by 10-30%.


Summary

  • The paper presents ALTO, leveraging intermediate output streaming to boost throughput up to 3x and reduce tail latency by 1.8x.
  • The paper details a methodology using aggregation-aware routing and distributed prompt-aware scheduling to ensure correctness and efficient load balancing.
  • The paper outlines future directions, including automating aggregation inference and refining scheduling algorithms to further optimize AI pipeline performance.

Efficient Orchestration of AI Pipelines with ALTO: Key Challenges and Future Directions

Introduction

The advent of compound AI systems, which combine generative LMs with other AI components into complex pipelines, has introduced new challenges and opportunities for AI serving systems. The ALTO paper proposes an approach to managing these systems efficiently, focusing on streaming intermediate outputs to achieve high throughput and low latency. ALTO introduces aggregation-aware routing and distributed prompt-aware scheduling to tackle the intricacies of orchestrating compound AI systems.

Streaming Improves Pipeline Performance

The fundamental observation that generative LMs produce partial outputs sequentially opens up the opportunity for streaming intermediate data between pipeline stages. In practice, this dramatically improves the performance of AI serving pipelines by reducing latency and increasing throughput. For instance, an evaluation of ALTO on a FacTool-inspired fact-checking pipeline achieved up to 3x higher throughput at a fixed latency target and reduced tail latency by 1.8x compared to a non-streaming baseline.
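To make the streaming idea concrete, here is a minimal Python sketch (not ALTO's actual API) of a downstream stage consuming an LM's token stream and re-fragmenting it into sentences, forwarding each sentence as soon as it completes instead of waiting for the full generation. The token list and stage names are illustrative.

```python
import re
from typing import Iterator

def generate_tokens() -> Iterator[str]:
    """Stand-in for a generative LM that emits tokens incrementally."""
    for tok in ["The", " claim", " is", " false", ".", " Sources", " agree", "."]:
        yield tok

def split_sentences(tokens: Iterator[str]) -> Iterator[str]:
    """Downstream stage: buffer the token stream and emit each sentence
    the moment its terminator arrives, so later stages start early."""
    buf = ""
    for tok in tokens:
        buf += tok
        while (m := re.search(r"[.!?]", buf)) is not None:
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush any trailing partial sentence

sentences = list(split_sentences(generate_tokens()))
```

Because `split_sentences` is itself a generator, a verifier or retriever stage downstream can begin work on the first sentence while the LM is still producing the second, which is exactly the overlap that streaming exploits.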

Challenges in Streaming Partial Outputs

Despite the benefits, streaming partial outputs introduces significant challenges, particularly concerning correctness and efficient load balancing:

  • Correctness: Accurately aggregating streamed partial outputs in stateful stages requires a sophisticated routing strategy. This motivates ALTO's aggregation-aware routing, which ensures that all partial outputs belonging to a request are consistently routed to the correct instances of pipeline stages.
  • Efficient Load Balancing: The diverse and dynamic nature of partial output generation complicates load balancing across distributed stage instances. ALTO takes preliminary steps toward addressing this with a speculative distributed prompt-aware scheduling algorithm, which aims to balance load while preserving the benefit of caching mechanisms in LLM serving engines.
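The correctness requirement above can be sketched in a few lines of Python. This is an illustrative simplification, not ALTO's implementation: each fragment carries ancestry metadata (request id, parent id, sibling index and count, echoing the paper's nested-ancestry idea), and the router hashes the key the downstream stage aggregates over, so every fragment of one request lands on the same aggregator instance.

```python
import hashlib
from dataclasses import dataclass

NUM_AGGREGATORS = 4

@dataclass(frozen=True)
class Fragment:
    request_id: str   # root of the ancestry: the originating request
    parent_id: str    # e.g. the document this sentence was split from
    index: int        # position among siblings
    total: int        # sibling count, so the aggregator knows when it is done
    payload: str

def route(frag: Fragment, num_instances: int = NUM_AGGREGATORS) -> int:
    """Aggregation-aware routing: hash the ancestry key that the next
    stage aggregates over (here the request id), so all fragments of a
    request are co-located on one aggregator instance."""
    digest = hashlib.sha256(frag.request_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_instances

frags = [Fragment("req-7", "doc-0", i, 3, f"sentence {i}") for i in range(3)]
targets = {route(f) for f in frags}  # every fragment of req-7 maps to one target
```

A purely load-based router (round-robin, least-loaded) would scatter a request's fragments across instances, making stateful aggregation impossible; hashing on the aggregation key trades some load skew for correctness.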

ALTO's System Design

ALTO’s approach to these challenges involves a centralized runtime orchestrator and an asynchronous queue-based communication system between pipeline stages. Application developers interact with ALTO through a straightforward interface that allows specifying pipeline stages, their connections, and any necessary aggregation constraints. ALTO introduces a unique interface for expressing these constraints, providing a foundation for future enhancements to enable automatic inference of aggregation logic and a comprehensive library of aggregation operators.
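The queue-based stage communication described above can be illustrated with a small asyncio sketch. This mirrors the general pattern (stages decoupled by asynchronous queues, with a sentinel for shutdown) rather than ALTO's concrete interface; the stage functions are placeholders.

```python
import asyncio

async def stage(inbox: asyncio.Queue, outbox: asyncio.Queue, fn) -> None:
    """A pipeline stage: consume items from its input queue, transform
    them, and push results downstream without blocking other stages."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: propagate shutdown downstream
            await outbox.put(None)
            break
        await outbox.put(fn(item))

async def run_pipeline(inputs: list[str]) -> list[str]:
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stage(q1, q2, str.upper)),       # placeholder stage
        asyncio.create_task(stage(q2, q3, lambda s: s + "!")),  # placeholder stage
    ]
    for item in inputs:
        await q1.put(item)
    await q1.put(None)
    results = []
    while (out := await q3.get()) is not None:
        results.append(out)
    await asyncio.gather(*tasks)
    return results

results = asyncio.run(run_pipeline(["hello", "world"]))
```

Because each stage awaits its own queue independently, the second stage can process item N while the first stage is already working on item N+1, giving the pipelining overlap the paper targets.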

Future Directions

While ALTO lays down a solid framework for efficient AI pipeline serving, there is ample room for development. The current aggregation-aware routing interface can be improved by automating constraint inference based on pipeline structure and adopting a generalized set of aggregation operators. Additionally, a fully realized distributed prompt-aware scheduling algorithm would represent a significant advancement, providing dynamic load balancing that accounts for prompt locality to optimize the use of computational resources.
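One plausible shape for the prompt-aware scheduling mentioned above is sketched below. This is speculative and not from the paper's implementation: the scheduler prefers the replica whose KV cache already holds the longest prefix of the incoming prompt, falling back to the least-loaded replica on ties, so cache locality and load balancing are traded off in one ranking. Replica names and the `cached`/`load` state layout are invented for illustration.

```python
def pick_replica(prompt: str, replicas: dict[str, dict]) -> str:
    """Prompt-aware scheduling sketch: rank replicas by (shared prompt
    prefix length, lighter load). Longer cached prefixes win; among
    equals, the replica with fewer queued requests wins."""
    def shared_prefix(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    best, best_key = None, None
    for name, state in replicas.items():
        overlap = max((shared_prefix(prompt, p) for p in state["cached"]), default=0)
        key = (overlap, -state["load"])
        if best_key is None or key > best_key:
            best, best_key = name, key
    return best

replicas = {
    "r0": {"cached": ["You are a fact checker. Claim:"], "load": 3},
    "r1": {"cached": [""], "load": 1},
}
choice = pick_replica("You are a fact checker. Claim: water is wet", replicas)
```

Here `r0` is chosen despite its heavier load because its cached prefix covers the shared system prompt; with no overlap anywhere, the ranking degenerates to least-loaded routing.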

Concluding Thoughts

ALTO represents a significant step forward in the field of AI pipeline orchestration, specifically for compound systems integrating generative LLMs. By enabling intermediate output streaming, ALTO offers a path to dramatically improved pipeline performance. However, realizing the full potential of this approach necessitates overcoming unique challenges in correctness and load balancing — challenges that ALTO begins to address with novel solutions. Moving forward, advancements in automatic aggregation rule inference, aggregation operator libraries, and distributed prompt-aware scheduling are anticipated to further refine and enhance the capabilities of AI pipeline serving systems.
