TimeBill: Time-Budgeted Inference for Large Language Models
Abstract: LLMs are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
Explaining “TimeBill: Time-Budgeted Inference for LLMs”
Overview: What is this paper about?
This paper is about making LLMs give good answers fast enough to meet strict time limits. Think of situations like a self-driving car, a robot, or an assembly line machine: the AI must decide quickly and correctly. The challenge is that LLMs write answers one token (small chunks of text) at a time, so it’s hard to guess exactly how long they’ll take. The authors propose “TimeBill,” a system that predicts how long an LLM will take and adjusts how it runs so it meets a deadline without hurting the quality of its response too much.
Key Questions the Paper Tries to Answer
- How can we accurately predict how long an LLM’s full answer will be before it finishes?
- Given a time limit (a “time budget”), how can we estimate the total time the LLM will need?
- How can we change the way the LLM works—while it’s running—so it stays within the time limit and still gives a useful answer?
How TimeBill Works (Methods and Approach)
You can think of an LLM’s process in two simple steps:
- Prefill phase: The model reads the prompt (like “the question”) and prepares to answer.
- Decoding phase: The model writes the answer token by token (like typing one word at a time).
Two main tools make TimeBill work:
- A response length predictor (RLP)
- Analogy: A “sidekick” model that guesses how long the LLM’s answer will be, before the LLM finishes.
- How it predicts: Instead of guessing the exact number of tokens (which is hard), it predicts a “bucket” (a range). For example, “between 64 and 80 tokens.”
- Why a small model: The predictor is a smaller LLM that’s trained (“distilled”) to understand the behavior of the big model. This keeps it fast and able to handle long inputs.
- Safety step: If the predicted length is bigger than the system’s allowed maximum, it clamps it down to that maximum.
- An execution time estimator (ETE)
- Analogy: A smart stopwatch that can predict how long each part takes based on “how much work” is done.
- FLOPs explained: FLOPs are like counting how many “push-ups” the computer needs to do. More tokens or more memory to look back on means more work.
- What matters most:
- In the prefill phase, time grows quickly with the length of the prompt.
- In the decoding phase, time depends on how many past tokens the model remembers (its “KV cache,” a kind of memory).
- KV cache eviction (the key trick):
- The model can drop some memory of older tokens to speed up decoding. This is called “eviction,” controlled by a ratio α (alpha). Higher α means more memory thrown away, which runs faster but may hurt answer quality.
- Profiling and fitting:
- The authors measure real run times on hardware and fit simple equations to predict future runs. This makes the estimator accurate and practical.
- Worst-case time (WCET):
- For safety, they multiply the predicted length by a “pessimistic factor” k (like planning for the worst). This helps ensure the model won’t miss the deadline.
- Choosing the best α (eviction ratio) under a time limit
- Goal: Use the smallest α that still lets the model finish on time.
- Why smallest? Because less eviction usually keeps the answer quality higher.
- Process:
- Predict response length with the small model.
- Estimate total time (prefill + decoding) under different α values.
- Pick the α that meets the time budget while trying to preserve quality.
- Efficiency trick:
- The predictor and estimator run in parallel with the prefill phase so they don’t slow things down.
- If needed, they compress the prompt for the predictor so it finishes quickly.
Main Findings and Why They Matter
Here are the key results the authors report from tests using the Qwen2.5-7B LLM and the LongBench dataset:
- The response length predictor (RLP) is accurate:
- It beats other length prediction methods (based on BERT).
- Using more “buckets” (finer ranges) improves accuracy.
- Trying to predict an exact number (regression) was worse than predicting a range.
- The execution time estimator (ETE) is reliable:
- Errors for estimating the prefill and decoding times were small (roughly 1–2%).
- The end-to-end predictions were close to real times.
- The worst-case estimate stayed above actual times (good for safety).
- TimeBill improves performance under deadlines:
- Baselines:
- Vanilla (no special strategy) often missed deadlines and scored poorly.
- Fixed eviction (one α for all tasks) showed a trade-off: higher α finished more jobs but hurt quality.
- Weight quantization (AWQ) helped a bit, but didn’t solve timing issues alone.
- TimeBill achieved:
- Higher average answer quality while still finishing a similar number of tasks as aggressive fixed-eviction settings.
- Good results under different “overrun strategies,” such as:
- Kill: cancel the job if it would miss the deadline.
- Skip-Next: let the current job finish but skip upcoming ones.
- Choosing the pessimistic factor k affects results:
- Small to moderate k (around 5) improves reliability and completion without hurting quality too much.
- Very large k can force too much eviction, lowering answer quality.
Why This Research Is Important
TimeBill makes LLMs more dependable in situations where timing really matters—like robots, cars, and factory controls—without giving up too much answer quality. It predicts, plans, and adjusts on the fly to meet different time budgets. It also works alongside other speed-up methods (like quantization), so it can fit into many real systems.
Implications and Potential Impact
- Safer systems: Better timing predictability reduces risk in safety-critical tasks (e.g., autonomous driving decisions).
- Smarter scheduling: Controllers can use TimeBill to assign time budgets per task and trust that the model will adapt.
- Broader adoption: Since TimeBill is model- and hardware-aware (it profiles), it can be applied across different platforms and LLMs.
- Future directions: Combine with more compression, smarter memory management, or dynamic quality controls to further boost speed without hurting accuracy.
Overall, TimeBill shows a practical way to balance speed and quality for LLMs under strict deadlines: predict how long answers will be, estimate total time, and adjust memory usage so responses arrive on time and still make sense.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, intended to guide future research.
- Generalizability across models and hardware: Results are demonstrated on a single LLM (Qwen2.5-7B-Instruct) and a single GPU (NVIDIA A40). It is unclear how the RLP and ETE perform across different architectures (e.g., LLaMA, Mistral), model sizes, vocabularies, and heterogeneous hardware (consumer GPUs, TPUs, CPUs, multi-GPU setups).
- Assumptions about KV eviction behavior: The time model assumes that evicting a fraction α yields an effective KV length of
(1-α)·N_x + i - 1uniformly across layers/heads. This may not hold for content- or importance-aware eviction (e.g., SnapKV), where retained KV positions are non-uniform. How sensitive is ETE accuracy to the specific eviction policy? - Limited exploration of configuration factors θ: The framework only adapts the KV cache eviction ratio. Other decoding parameters (temperature, top-p/top-k, repetition penalty, beam search, early stopping), which affect both response length and latency, are not modeled or controlled.
- Lack of uncertainty quantification in RLP: The response length predictor outputs a single bucket index with no confidence intervals or predictive distributions. Without calibrated uncertainty, choosing the pessimistic factor k is ad hoc and may be either too conservative or unsafe.
- Fixed pessimistic factor k: k is set globally (default k=5) and tuned empirically. There is no method to adapt k per input, per task, or based on predictor confidence, nor a formal guarantee that k produces valid worst-case bounds under platform variability.
- No mid-decoding reconfiguration: α is set once after prefill. The framework does not adjust α during decoding in response to observed step times or remaining budget, missing a potentially valuable feedback loop to avoid overruns or preserve quality.
- Handling tight budgets: The method requires
t_predict ≤ t_prefill. There is no fallback strategy when time budgets are smaller than prefill time (e.g., extremely tight T), or when predictor+WCET estimation cannot complete in time. - Prompt compression details and impacts: The paper relies on prompt compression to bound
t_predictbut does not specify the algorithm, compression ratio selection, or quantify its effect on RLP accuracy and downstream deadline adherence. - FLOPs-based time modeling limitations: The ETE maps FLOPs to time with polynomial fits, ignoring memory bandwidth limits, kernel launch overheads, attention kernel specialization, and cache/memory effects. Robustness under different backend implementations, batching, and quantization levels is untested.
- Ignored non-matmul costs: Embedding, normalization, tokenization, and I/O overheads are assumed negligible. For certain models/hardware, these components can be non-trivial; their omission might skew ETE predictions.
- Concurrency and serving under load: The evaluation assumes single-request inference. There is no analysis of multi-request serving, batching, queueing, GPU contention, or interference from other workloads, all of which can invalidate WCET assumptions.
- Overrun strategies are limited: Only Kill and Skip-Next are considered. Quality-aware truncation, partial-output acceptance, dynamic deadline extension, or task criticality-based policies remain unexplored.
- Response quality under safety constraints: The study uses aggregate metrics (F1, ROUGE-L, Levenshtein) on LongBench but does not evaluate safety-critical scenarios (robotics, autonomous driving). How to guarantee decision quality under deadlines is not specified.
- Domain and data generalization for RLP: The RLP is trained on Arena-Human-Preference-100k and evaluated on LongBench. There is no analysis of cross-domain generalization, distribution shifts, or continual learning to maintain length prediction accuracy as tasks change.
- Bucketization design: The RLP uses fixed bucket sizes (e.g., B=16 tokens) and a fixed number of buckets (512). Guidelines for choosing bucket granularity, trade-offs with predictor latency, and adaptive bucket sizing based on task or prompt length are not provided.
- α_max selection: The maximum eviction ratio is fixed at 95% without principled justification. α_max likely depends on model, task, and context length; methods to select or adapt α_max to preserve quality are missing.
- Integration with other online methods: TimeBill is orthogonal to KV quantization and pruning, yet combined optimization (e.g., co-tuning α with KV quantization level, selective layer eviction) is not evaluated.
- Sensitivity to decoding hyperparameters: Experiments do not report or vary temperature/top-p settings. Since these affect length distributions and per-step compute, ETE and RLP robustness under different sampling regimes are unknown.
- Streaming output semantics: The framework treats inference as a single deadline. In practice, users may consume streamed tokens; modeling per-chunk deadlines, partial-output utility, and adaptive α for streaming is not addressed.
- Formal WCET guarantees: The approach relies on fitted models plus k for pessimism; there is no formal WCET proof under worst-case inputs, GPU scheduler behavior, or kernel variability.
- Deadline adherence metrics: Completion rate is reported, but tail-latency distributions, tardiness (by how much deadlines are missed), and the tightness of the WCET bound over diverse inputs are not analyzed.
- Energy and cost efficiency: The trade-off between meeting deadlines and energy consumption or cloud cost is not measured; energy-aware α selection remains open.
- Security and privacy considerations: Running predictors and compression on separate processors may introduce data movement and privacy risks; secure scheduling and isolation are unaddressed.
- Real-world deployment and control loop integration: The paper motivates robotics/autonomous driving but does not integrate TimeBill into a real control system to assess end-to-end safety, stability, and deadline compliance in situ.
Practical Applications
Practical Applications Derived from the Paper
The paper introduces TimeBill, a framework for time-budgeted inference in LLMs, combining a fine-grained response length predictor (RLP), a workload-guided execution time estimator (ETE), and an adaptive KV cache eviction controller (alpha*). The following applications translate these findings into actionable use cases across industry, academia, policy, and daily life.
Immediate Applications
- SLA-aware LLM serving in cloud platforms
- Sector: Software, cloud infrastructure
- Use case: Enforce per-request deadlines for multi-tenant LLM services (e.g., vLLM, TGI, Triton) by predicting end-to-end latency and adapting KV cache eviction to meet SLAs.
- Potential tools/products/workflows:
- “TimeBill Controller” microservice (RLP+ETE) running alongside LLM backends.
- Per-model/hardware calibration toolkit for profiling and fitting ETE coefficients.
- Deadline-aware routing layer that assigns
alpha*at runtime and logs completion rates and quality metrics (ROUGE/F1). - Assumptions/dependencies: Requires per-model, per-hardware profiling to fit ETE coefficients; a KV eviction method (e.g., SnapKV/DuoAttention) must be available; time budgets must be provided per request.
- Real-time conversational agents with predictable latency
- Sector: Customer support, fintech, telecom
- Use case: Contact center bots and financial chat assistants that must respond within fixed times (e.g., 2–5 seconds) without severely degrading quality.
- Potential tools/products/workflows:
- SDK for integrating
alpha*into existing bot frameworks. - Policy for overrun strategies (“Kill” vs “Skip-Next”) to balance throughput and service continuity.
- Assumptions/dependencies: Availability of short-term quality metrics and acceptance of modest quality trade-offs when eviction is applied; message priority queues to apply different time budgets.
- On-device assistants with bounded latency
- Sector: Mobile, IoT, consumer electronics
- Use case: Voice assistants and note-taking apps on laptops/phones that must return answers before a user-defined timeout, using prompt compression to keep predictor overhead below prefill time.
- Potential tools/products/workflows:
- Embedded RLP model (small LLM) compiled for CPU/GPU/NPU to run in parallel with prefill.
- Local profiler to calibrate ETE on-device (initial setup wizard).
- Assumptions/dependencies: Predictive overhead must be ≤ prefill duration; device variability requires lightweight calibration and conservative pessimistic factor
k.
- Robotics and industrial automation task execution
- Sector: Robotics, manufacturing
- Use case: LLM-in-the-loop controllers for task planning or instruction parsing that must deliver outputs within hard/firm deadlines to avoid halting an assembly line or robot.
- Potential tools/products/workflows:
- ROS/Autoware integration plugin that attaches TimeBill to LLM planning nodes.
- Runtime policy engine mapping task criticality to time budgets and allowed
alpha*bounds. - Assumptions/dependencies: Deterministic scheduling requires conservative WCET via
k; tight integration with KV eviction implementations and safe overrun policies.
- Real-time explainability outputs for ADAS/autonomous driving
- Sector: Automotive
- Use case: Generate concise, bound-latency explanations (e.g., decisions from DriveGPT-like modules) during driving without overruns, prioritizing completion under strict time budgets.
- Potential tools/products/workflows:
- “Explanation Budgeter” wrapping LLM inference with
alpha*and max generation length enforcement. - Quality guardrails (templates, minimal viable response length targets).
- Assumptions/dependencies: Safety teams must validate quality degradation bounds; model variants may require retraining RLP for domain-specific prompts.
- Throughput and goodput optimization in LLM serving
- Sector: Cloud optimization, AIOps
- Use case: Increase effective goodput by converting overruns (incomplete responses) into completed responses with controlled performance degradation, especially under “Kill” or “Skip-Next.”
- Potential tools/products/workflows:
- Adaptive scheduling that selects
alpha*usingETEand aggregates completion rate/quality trade-offs per tenant. - Integration with quantization/pruning pipelines (AWQ, GPTQ) for orthogonal gains.
- Assumptions/dependencies: Quality-vs-eviction curve must be monitored; workloads change, requiring periodic recalibration.
- Academic benchmarking and reproducibility in real-time LLMs
- Sector: Academia
- Use case: Establish reproducible latency/quality benchmarks for LLMs with hard deadlines on public datasets (e.g., LongBench) under different overrun policies.
- Potential tools/products/workflows:
- Open-source ETE/RLP evaluation suite with configurable bucket sizes and pessimistic factors.
- Public profiles for common models (Qwen, LLaMA) across GPUs (A40, A100, H100).
- Assumptions/dependencies: Access to profiling data; standardized metrics for completion rate and quality.
- Compliance-ready operational policies for AI services
- Sector: Policy, enterprise governance
- Use case: Define and enforce overrun strategies, maximum
alphathresholds, and per-request deadlines in regulated industries (finance, healthcare). - Potential tools/products/workflows:
- Governance policies and audit logs capturing deadline adherence, alpha settings, and outcome quality.
- Risk registers mapping tasks to acceptable pessimistic factor
k. - Assumptions/dependencies: Domain regulators must accept bounded-degradation strategies; internal validation of safety/accuracy impacts.
Long-Term Applications
- Certified WCET for safety-critical LLM deployments
- Sector: Automotive, robotics, industrial control
- Use case: Formal certification pipelines that include
RLP+ETEcalibration, conservativekselection, and documented KV eviction impacts, enabling LLMs to be used in safety-critical loops. - Potential tools/products/workflows:
- Standardized “LLM WCET Certification Kit” with test vectors, profiles, and stress scenarios.
- Assumptions/dependencies: Industry consensus on WCET practices; rigorous third-party validation; model/hardware updates require re-certification.
- OS/runtime-level deadline-aware LLM schedulers
- Sector: Systems software
- Use case: Built-in real-time schedulers in LLM runtimes that expose timing APIs, predict
t_e2e, and allocate compute while negotiatingalpha*and max generation lengths across concurrent requests. - Potential tools/products/workflows:
- Deadline-aware attention kernels; scheduler hooks in frameworks (PyTorch, vLLM).
- GPU/accelerator firmware support for timing guarantees.
- Assumptions/dependencies: Hardware and kernel co-design to reduce jitter; standard APIs for timing hints.
- Energy-latency co-optimization in edge/cloud
- Sector: Energy, green computing
- Use case: Jointly optimize deadline adherence and energy use, tuning
alpha*, quantization levels, and batching to meet both energy budgets and SLAs. - Potential tools/products/workflows:
- “Energy-Aware TimeBill” mode with dynamic policies based on telemetry (power draw, thermal headroom).
- Assumptions/dependencies: Reliable energy telemetry; multi-objective controllers; workload stability.
- Adaptive multi-model orchestration with personalized RLPs
- Sector: Software platforms, personalization
- Use case: Select among multiple LLMs (small/large) per request, with per-user or per-task RLPs predicting response lengths and quality to meet personalized latency/quality targets.
- Potential tools/products/workflows:
- Model router using
ETEto choose model+alpha*+N_maxper request. - AutoML pipeline for training domain-specific RLPs.
- Assumptions/dependencies: Availability of diverse models; data for per-domain RLP training; monitoring to prevent drift.
- Standardized timing and overrun policies for AI services
- Sector: Policy, cross-industry consortia
- Use case: Define industry-wide standards for deadline declarations, pessimistic factors, acceptable overrun handling (Kill vs Skip-Next), and reporting of completion/quality metrics.
- Potential tools/products/workflows:
- Compliance scorecards and machine-readable timing manifests attached to AI APIs.
- Assumptions/dependencies: Multi-stakeholder coordination; mapping standards to diverse sectors.
- Real-time human-AI collaboration tools
- Sector: Education, productivity
- Use case: Meeting assistants and classroom tools that guarantee concise, on-time outputs, adapting
alpha*andN_maxdynamically to fit speaking turns or timeboxes. - Potential tools/products/workflows:
- “Timeboxed Assistant” feature with explicit user-set deadlines and visible trade-off controls (quality vs speed).
- Assumptions/dependencies: UX design for transparency; user tolerance for shorter answers under tight deadlines.
- Deadline-aware LLMs in AR/VR and gaming
- Sector: Media, entertainment
- Use case: In-game NPC dialogue and AR/VR guidance systems that must respond within frame budgets to avoid motion sickness or breaking immersion.
- Potential tools/products/workflows:
- Integration with render loops to schedule LLM decoding within frame time and adjust
alpha*when needed. - Assumptions/dependencies: Stable frame-time prediction; careful balance of content richness and timing.
- Safety guardrails for clinical triage and emergency dispatch
- Sector: Healthcare, public safety
- Use case: Time-bounded suggestions for triage and dispatch that degrade gracefully under tight deadlines, with conservative
kand strictalpha_maxcap to maintain safety-critical quality thresholds. - Potential tools/products/workflows:
- “Clinical TimeGuard” policy layer enforcing minimum content requirements and maximum degradation.
- Assumptions/dependencies: Extensive clinical validation; liability frameworks; human oversight in the loop.
Notes on Assumptions and Dependencies
- Model/hardware profiling: ETE coefficients must be fit per model and hardware; changes (driver, kernel, firmware, batching) can invalidate prior fits.
- Predictor alignment: RLP performance depends on training data representativeness and bucket configuration; long-context domains must ensure RLP can handle prompt compression without losing key signals.
- KV eviction effects: Quality degradation varies by task; eviction methods (SnapKV, DuoAttention) must be validated for target domains;
alpha_maxshould be conservative. - Pessimistic factor
k: Balances deadline assurance vs quality; sector-specific risk tolerances will dictate conservative defaults (e.g., k≈5 in hard real-time). - Overrun strategies: “Kill” vs “Skip-Next” have operational impacts; selection depends on service design (throughput vs fairness vs task criticality).
- Integration with other efficiency methods: TimeBill is orthogonal to quantization/pruning; joint optimization needs careful QA to avoid compounding quality loss.
- Operational variability: Queueing delays, co-location interference, memory bandwidth, and thermal throttling introduce jitter; monitoring and dynamic recalibration are necessary.
- Regulatory acceptance: Safety-critical and regulated domains require formal validation of timing guarantees and controlled degradation policies before deployment.
Glossary
- AFAP (As-Fast-As-Possible): A strategy that prioritizes speed over output quality, often degrading the final response under constraints. "The As-Fast-As-Possible (AFAP) strategy will degrade the response performance"
- Auto-regressive generation: A token-by-token generation process where each new token depends on previously generated tokens, making runtime uncertain. "However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time."
- AWQ (Activation-aware Weight Quantization): A quantization technique that reduces model weights to low-bit precision while accounting for activation patterns to preserve accuracy. "Activation-aware Weight Quantization (AWQ), where the model weights is quantized to 4 bits~\cite{AWQ}."
- BERT: A bidirectional transformer-based LLM often used for classification or prediction tasks. "ProxyModel~\cite{ProxyModel-5class} builds a 5-class classifier based on BERT~\cite{BERT} to predict which bucket the response length will fall into."
- Causal attention mask: A masking mechanism in attention that prevents tokens from attending to future positions, enforcing causality in generation. "and is the causal attention mask~\cite{Attention}."
- CausalAttention: An attention mechanism that applies a causal mask so each position attends only to current and past tokens. "Each decoder layer includes four layers in sequence: an RMSNorm layer~\cite{RMSNorm}, a CausalAttention layer, another RMSNorm layer, and an FFN with SwiGLU layer~\cite{SwiGLU}."
- Decoding phase: The autoregressive stage of inference where the model generates subsequent tokens after prefill. "The inference process of LLMs consists of two phases, namely the prefill phase and the decoding phase~\cite{DistServe}."
- End-to-end execution time: The total time from input arrival to final output generation for a model inference. "Therefore, it is essential to model the end-to-end execution time of LLM inference and balance between inference efficiency and response performance under the given time budget."
- Execution Time Estimator (ETE): A component that predicts the overall runtime by combining analytical modeling and profiling. "we propose a fine-grained response length predictor (RLP) and a workload-guided execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs."
- F1 score: A metric combining precision and recall to assess response performance. "the response performance score is evaluated using the official evaluation metrics, such as F1 score, ROUGE-L~\cite{ROUGE}, and Levenshtein distance."
- FeedForward layer: The position-wise fully connected sub-network in a transformer layer that processes each token independently. "Similarly, the FeedForward layer handles the input prompt consisting of tokens during the prefill phase, and processes the last generated token in each decoding step."
- FLOPs (Floating point operations): A measure of computational workload used to model and estimate runtime. "we develop a workload-guided ETE, integrating floating point operations (FLOPs) -based analytical modeling and profiling-based fitting."
- Hard real-time systems: Systems with strict deadlines where missing a time constraint is considered a failure. "In time-critical systems with hard deadline constraints (hard real-time systems), inference that exceeds the time budget is considered a system failure~\cite{Overrun}."
- Key-value (KV) cache: Stored keys and values from attention used to speed up subsequent decoding steps in transformers. "Online methods achieve efficient inference primarily through runtime eviction and quantization of the key-value (KV) cache."
- KV cache eviction: Discarding portions of the KV cache to reduce memory and computation during decoding. "The KV cache eviction is performed using SnapKV~\cite{SnapKV}."
- KV cache eviction ratio: The fraction of cached KV entries removed to meet timing constraints. "existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets"
- KV cache quantization: Reducing the precision of KV cache entries to lower memory usage and latency. "KVQuant~\cite{KVQuant} and KIVI~\cite{KIVI} quantize the KV cache to 4 bits or lower."
- Kill (overrun strategy): A policy that terminates a running job if it is about to miss its deadline. "Kill. The current job will be terminated and considered incomplete."
- Language modeling head (LMHead): The final linear projection in a transformer that produces logits for next-token prediction. "and the language modeling head LMHead."
- Least Squares (LS): A fitting method used to estimate coefficients in runtime models from profiling data. "Well-established fitting methods, such as the Least Squares (LS), can be applied."
- Levenshtein distance: A string similarity metric measuring the minimal edits needed to transform one string into another. "the response performance score is evaluated using the official evaluation metrics, such as F1 score, ROUGE-L~\cite{ROUGE}, and Levenshtein distance."
- Overrun: The event of exceeding the allotted time budget during inference. "When an inference job is about to overrun and miss the deadline, the hard real-time system will apply an overrun strategy."
- Overrun strategies: System policies applied when a job is likely to miss its deadline (e.g., Kill, Skip-Next). "We apply two of the most commonly used overrun strategies~\cite{Overrun},"
- Pessimistic factor: A multiplicative safety margin applied to predicted response length to estimate worst-case time. "we introduce the pessimistic factor , , to ."
- Prefill phase: The initial stage that processes the entire input prompt to produce the first output token. "The inference process of LLMs consists of two phases, namely the prefill phase and the decoding phase~\cite{DistServe}."
- Profiling-based fitting: Estimating runtime model coefficients using measured execution data. "To this end, we propose a profiling-based fitting method to estimate the execution time."
- Prompt compression: Reducing prompt length to fit scheduling or runtime constraints for auxiliary predictors. "we integrate our ETE in Sec.~\ref{sec:time_prediction} with prompt compression~\cite{PromptCompression}."
- Pruning: Removing weights or structures from a model to reduce compute and memory costs. "Offline efficient inference methods, including quantization~\cite{Smoothquant, AWQ, GPTQ} and pruning~\cite{SparseGPT, LLM-Pruner}, compress the model before deployment to reduce resource consumption."
- Quantization: Lowering numerical precision of model parameters or caches to accelerate inference. "Offline efficient inference methods, including quantization~\cite{Smoothquant, AWQ, GPTQ} and pruning~\cite{SparseGPT, LLM-Pruner}, compress the model before deployment to reduce resource consumption."
- Response length predictor (RLP): A model that predicts the length of a generated response to inform timing estimates. "we propose a fine-grained response length predictor (RLP) and a workload-guided execution time estimator (ETE)"
- RMSNorm: A normalization layer variant that uses root mean square statistics. "Each decoder layer includes four layers in sequence: an RMSNorm layer~\cite{RMSNorm}, a CausalAttention layer, another RMSNorm layer, and an FFN with SwiGLU layer~\cite{SwiGLU}."
- ROUGE-L: A text evaluation metric based on longest common subsequence overlap. "the response performance score is evaluated using the official evaluation metrics, such as F1 score, ROUGE-L~\cite{ROUGE}, and Levenshtein distance."
- Roofline model: A performance model relating operational intensity to achievable computation on hardware. "RLM-ML~\cite{RLM-predicting} and LLMStation~\cite{LLMStation} combine the roofline model with machine learning to predict execution time through data collection and optimization."
- Skip-Next (overrun strategy): A policy that skips upcoming jobs until the current one completes. "Skip-Next. Skip the next few jobs until the current job completes."
- Small LLM (SLM): A compact LLM used as an auxiliary predictor due to lower computational cost and longer input handling. "The fine-grained RLP based on Small LLM (SLM) is proposed to predict response lengths of the target LLM inference"
- SwiGLU: A gated activation function variant used in transformer feed-forward networks. "an FFN with SwiGLU layer~\cite{SwiGLU}."
- Time budget: The maximum allowed duration for an inference to complete in time-critical systems. "LLMs are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial"
- Time-budgeted inference: An approach that adapts inference to meet a specified time constraint while maintaining performance. "we propose TimeBill, a novel time-budgeted inference framework for LLMs"
- Worst-case execution time (WCET): A conservative upper bound on runtime used to ensure deadlines are met. "Modeling the worst-case execution time (WCET) during system design ensures that each LLM inference meets deadlines."
Collections
Sign up for free to add this paper to one or more collections.