Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Abstract: Compound AI applications chain together subcomponents such as generative LLMs, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints on the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents into sentences) which may then be re-aggregated at later stages of the computation. Due to this complexity, existing systems that serve compound AI queries do not take full advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto match or outperform implementations in LangGraph, a popular existing AI programming framework, matching latency or improving it by 10-30%.