Okay, I need to summarize the paper titled "Chain of Draft: Thinking Faster by Writing Less." Let's start by understanding the main contributions and structure of the paper.
First, the abstract introduces Chain of Draft (CoD) as a method inspired by human cognitive processes, aiming to reduce verbosity in intermediate reasoning steps of LLMs while maintaining or improving accuracy. They report that CoD can use as little as 7.6% of the tokens required by CoT, with corresponding latency improvements.
Looking at the introduction, the motivation is clear: traditional Chain-of-Thought (CoT) is effective but verbose, leading to higher computational costs and latency. CoD is proposed to mimic human-like concise drafting. The contributions are threefold: introducing CoD, empirical validation of efficiency, and implications for LLM deployment.
The related work section covers existing structured reasoning frameworks like CoT, Tree of Thoughts, and others. It also discusses prior efforts to reduce latency, such as Skeleton-of-Thought (SoT) and token-budget methods, but points out their limitations. CoD is positioned as a better alternative with per-step token limits and adaptability.
The CoD section contrasts the method with CoT through an example problem. CoD uses minimalistic equations instead of verbose explanations, reducing token count. The authors provide system prompts for CoD, emphasizing concise steps (at most five words per step).
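The per-step constraint is enforced purely through the system prompt rather than any decoding-time mechanism. A minimal sketch of the contrasting prompts (the wording below is a paraphrase for illustration, not quoted from the paper) might look like:

```python
# Paraphrased sketch of the two prompting styles; the paper's exact
# system-prompt wording may differ.
COT_PROMPT = (
    "Think step by step to answer the following question. "
    "Return the answer at the end of the response after a separator ####."
)

COD_PROMPT = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most. "
    "Return the answer at the end of the response after a separator ####."
)

def build_messages(system_prompt: str, question: str) -> list[dict]:
    """Assemble a chat-style message list for either prompting strategy."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
```

Because the only difference is the system message, CoD can be A/B tested against CoT on the same questions without any model or pipeline changes.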
In experiments, they evaluate on arithmetic (GSM8K), commonsense (BIG-bench), and symbolic reasoning (coin flip) tasks. Models tested are GPT-4o and Claude 3.5 Sonnet. Key results show CoD achieving comparable or better accuracy than CoT with much lower token usage. For instance, on GSM8K, CoD uses ~40 tokens vs. CoT's ~200, with accuracy around 91% vs. 95%. Latency reductions are notable, e.g., 76.2% for GPT-4o.
Tables in the results section back these claims. The discussion highlights CoD's cost benefits and potential for real-time applications. They also mention combining CoD with other methods like parallel decoding for further improvements.
I need to ensure the summary is technical, avoids sensational language, and highlights numerical results. Also, since the user is an expert, I should use technical terms without explanations. Mention the tasks, models, token reduction percentages, accuracy numbers, and latency improvements. Compare CoD with CoT and standard prompting. Note cases where CoD even outperforms CoT in accuracy, like sports understanding. Emphasize the per-step token limitation and adaptability over previous budget methods. Avoid mentioning arXiv IDs as per the user's request.
Check if all key points are covered: problem statement (CoT's verbosity), CoD's approach, experimental setups, results across tasks, comparison with existing methods, and implications. Make sure to present this in a structured, academic tone with lists for readability where appropriate.
Need to avoid phrases like "our paper" or "recent." Use formal language, technical terms, and ensure all claims are supported by the provided data from the paper. Include specific numbers like 7.6% token usage, 80% token reduction, 76.2% latency reduction. Highlight where CoD's accuracy is slightly lower than CoT but significantly more efficient, and cases where it's better (e.g., sports understanding).
Also, mention that CoD doesn't require additional LLM calls for budget estimation, unlike TALE, making it more efficient. The ability to handle unlimited reasoning steps with per-step budgets is a key differentiator.
Make sure the summary flows logically, starting with the problem, introducing CoD, methodology, experiments, results, and discussion points. Use bold for key terms like Chain of Draft (CoD) when first mentioned, and possibly for the models and datasets. Ensure that all acronyms are expanded on first mention (e.g., LLM).
Check for any formatting guidelines: markdown, use of lists, no headers at the start. The user says "Do not include a heading at the beginning of your response. Your response should start with a paragraph." So start with a paragraph, then use lists, bold, and (if needed) later section headings, since headings are allowed after the opening paragraph.
Putting it all together: The summary should start with an overview of CoD, its motivation, methodology, key results, and implications. Use bullet points for contributions, experimental results, and comparisons. Highlight numerical results in bold where appropriate. Mention the tasks, models, token reductions, accuracy, latency. Discuss how CoD compares to prior work like SoT, TALE, and CCoT. Conclude with implications for real-world applications and future work.