
TimeBill Framework for Deadline-Driven LLMs

Updated 15 January 2026
  • TimeBill is a time-budgeted inference framework that adapts KV cache eviction ratios to meet hard deadlines while optimizing response quality.
  • It integrates response length prediction, execution time estimation, and closed-form optimization to balance latency and fidelity in LLM outputs.
  • Experimental benchmarks demonstrate up to 15% higher average scores and robust deadline compliance in real-time, safety-critical applications.

TimeBill is a time-budgeted inference framework for LLMs, designed to guarantee hard deadline compliance while maximizing LLM response quality in time-critical applications. It introduces fine-grained runtime prediction and analytic modeling tailored to autoregressive LLMs, enabling per-inference adaptation of the key-value (KV) cache eviction ratio. This method overcomes the inefficiency of prior approaches using global or fixed eviction ratios, especially for tasks with diverse real-time constraints and variable prompt/response structures (Fan et al., 26 Dec 2025).

1. Problem Definition and Objectives

TimeBill addresses the challenge of deploying LLMs in scenarios with stringent deadlines (e.g., robotics, autonomous vehicles, industrial automation), where the inherent uncertainty in autoregressive decoding causes unpredictable execution times. The central objectives of the TimeBill framework are:

  • To guarantee that the predicted worst-case latency $\hat{t}_{\mathrm{WCET}}$ does not exceed a user-specified budget $T$.
  • To choose the minimal possible eviction ratio $\alpha$ for the KV cache, ensuring maximal fidelity of the generated response while respecting the deadline.

The core difficulty arises from the linear time complexity of autoregressive generation, coupled with response length variability and complex prompt effects on latency. Traditional fixed-ratio cache strategies cannot simultaneously optimize quality and deadline compliance (Fan et al., 26 Dec 2025).

2. Architectural Components and Workflow

TimeBill is structured into three tightly integrated components:

  • Response Length Predictor (RLP): Casts response length prediction as a multi-class classification task. Given prompt $x$ (length $N_x$), it predicts a bucket $\hat n$ (of size $B$), outputting a response length estimate $\hat N = \min(\hat n B, N_{\max})$.
  • Execution Time Estimator (ETE): Uses offline floating-point profiling to fit closed-form models for prefill and decode phases. Prefill time is modeled as $t_{\rm prefill}(N_x) = a N_x^2 + b N_x + c$. Single-step decode time is $t_{\rm decode\_step}(N_{\rm kv}) = p N_{\rm kv} + q$. Total decode time sums over all output tokens, incorporating dynamic KV cache length under the current eviction ratio.
  • Time-Budgeted Decoder: Solves for the minimal $\alpha$ such that total predicted execution time (computed with the response length inflated to its worst case) plus RLP overhead does not exceed $T$. The optimal $\alpha^*$ is then used to evict the corresponding fraction of the KV cache after the prefill phase.

The workflow is highly parallelized: RLP and worst-case time estimation are run concurrently with the LLM prefill phase, utilizing available CPU/GPU resources.
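
The length and time models above can be illustrated with a minimal Python sketch. The names `TimeModel`, `predicted_length`, and `total_time` are hypothetical, and the rule that eviction keeps $(1 - \alpha) N_x$ prompt entries, with each generated token adding one cache entry, is an assumption made for illustration rather than a detail confirmed by the source.

```python
from dataclasses import dataclass


@dataclass
class TimeModel:
    """Fitted coefficients of the prefill and decode-step models (hypothetical container)."""
    a: float  # prefill quadratic coefficient
    b: float  # prefill linear coefficient
    c: float  # prefill constant
    p: float  # decode-step slope per cached entry
    q: float  # decode-step constant

    def prefill(self, n_x: int) -> float:
        # t_prefill(N_x) = a * N_x^2 + b * N_x + c
        return self.a * n_x**2 + self.b * n_x + self.c

    def decode_step(self, n_kv: float) -> float:
        # t_decode_step(N_kv) = p * N_kv + q
        return self.p * n_kv + self.q


def predicted_length(bucket: int, bucket_size: int, n_max: int) -> int:
    """Map a predicted bucket to a length: N_hat = min(n_hat * B, N_max)."""
    return min(bucket * bucket_size, n_max)


def total_time(model: TimeModel, n_x: int, n_out: int, alpha: float) -> float:
    """Prefill time plus the sum of decode-step times over n_out tokens.

    Assumption: after prefill the cache holds (1 - alpha) * n_x entries,
    and every generated token appends one more entry.
    """
    kept = (1.0 - alpha) * n_x
    decode = sum(model.decode_step(kept + i) for i in range(n_out))
    return model.prefill(n_x) + decode
```

Under this sketch, raising `alpha` shortens every decode step by the same amount, which is why the total time falls linearly in the eviction ratio.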

3. Mathematical Models and Optimization

The essential mathematical formulations running through TimeBill include:

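  • Response Length Prediction: $\hat N = \min(\hat n B, N_{\max})$, where the bucket $\hat n$ is selected from the bucket probabilities produced by a transformer-based classifier.
  • Prefill and Decode Time Estimation: $t_{\rm prefill}(N_x) = a N_x^2 + b N_x + c$ and $t_{\rm decode\_step}(N_{\rm kv}) = p N_{\rm kv} + q$.
  • Worst-case Length Inflation: the predicted length $\hat N$ is scaled up by a pessimism factor to obtain a worst-case response length.
  • Total Predicted Latency: $\hat{t}_{\mathrm{WCET}}$ is the prefill time plus the sum of decode-step times over the worst-case response length, with the KV cache length at each step determined by the eviction ratio $\alpha$.
  • Optimization for Eviction Ratio: choose the minimal $\alpha$ such that $\hat{t}_{\mathrm{WCET}}$ plus the RLP overhead does not exceed $T$. This is solved in closed form.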
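
To illustrate how a closed form can arise from linear decode-step costs, the following sketch solves the budget constraint for the eviction ratio. It assumes that the cache holds $(1 - \alpha) N_x$ entries after prefill and grows by one entry per generated token, so the total time is linear in $\alpha$. The function `min_eviction_ratio`, its parameters `overhead` and `alpha_max`, and the cache-growth rule are all assumptions for illustration; this is not necessarily the exact expression given by Fan et al.

```python
def min_eviction_ratio(T, n_x, n_w, a, b, c, p, q, overhead=0.0, alpha_max=1.0):
    """Smallest alpha in [0, alpha_max] that meets budget T, or None if infeasible.

    Assumed model (not taken verbatim from the source):
      total(alpha) = prefill(n_x) + overhead
                     + sum_{i=0}^{n_w-1} [p * ((1 - alpha) * n_x + i) + q]
    which is linear in alpha and can be solved directly.
    """
    prefill = a * n_x**2 + b * n_x + c
    # Decode cost that does not depend on alpha: constants plus cache growth.
    fixed = n_w * q + p * n_w * (n_w - 1) / 2
    # Time left to spend on attending to retained prompt entries.
    slack = T - overhead - prefill - fixed
    # Decode cost of keeping the entire prompt in the cache (alpha = 0).
    full_prompt_cost = p * n_w * n_x
    if full_prompt_cost <= 0:
        return 0.0 if slack >= 0 else None
    alpha = 1.0 - slack / full_prompt_cost
    if alpha > alpha_max:
        return None  # even the maximum allowed eviction cannot meet the budget
    return max(0.0, alpha)
```

Here `n_w` stands for the worst-case response length after inflation. A return value of `None` signals that the deadline cannot be met within the allowed eviction range, which a caller would have to handle separately.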

4. Implementation and Deployment Aspects

  • Model Choices: The framework targets LLMs such as Qwen2.5-7B-Instruct (context length 32,768 tokens, maximum generation 8,192 tokens) and an RLP model based on Qwen2.5-0.5B-Instruct with 512 buckets.
  • Profiling for ETE: Empirical measurements are made for various prompt lengths $N_x$ for prefill and for varying KV-cache sizes $N_{\rm kv}$, fitting $(a, b, c)$ for prefill and $(p, q)$ for decode steps. The mean absolute percentage errors are 1.22% (prefill) and 1.69% (decode step), indicating a close fit.
  • Resource Utilization: TimeBill is implemented with PyTorch and custom CUDA kernels for efficient KV cache eviction. Hardware includes Intel Xeon Platinum 8350C CPUs and NVIDIA A40 GPUs.
  • Prompt Compression: If the RLP overhead would exceed the prefill computation window, any prompt compression method can be used to produce a compressed prompt for which the RLP overhead fits within that window, ensuring that the RLP does not delay inference.
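
The profiling step can be sketched as an ordinary least-squares fit followed by a mean absolute percentage error check. The helpers `polyfit` and `mape` below are hypothetical, written in pure Python so the example is self-contained; the fitting procedure is an assumption, since the source states only that closed-form models are fitted to measurements. Lengths are best expressed in scaled units (for example thousands of tokens) to keep the normal equations well conditioned.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients from highest power to constant."""
    m = degree + 1
    # Normal equations (V^T V) w = V^T y for the Vandermonde matrix V.
    A = [[sum(x ** (2 * degree - i - j) for x in xs) for j in range(m)]
         for i in range(m)]
    rhs = [sum(y * x ** (degree - i) for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for k in range(col, m):
                A[r][k] -= f * A[col][k]
            rhs[r] -= f * rhs[col]
    # Back substitution.
    w = [0.0] * m
    for i in reversed(range(m)):
        s = sum(A[i][k] * w[k] for k in range(i + 1, m))
        w[i] = (rhs[i] - s) / A[i][i]
    return w


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - e) / t) for t, e in zip(actual, predicted)) / len(actual)
```

A degree-2 fit over measured prefill times yields $(a, b, c)$, and a degree-1 fit over measured decode-step times yields $(p, q)$; `mape` then compares the fitted predictions against the measurements.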

5. Experimental Results and Benchmarks

TimeBill was evaluated on LongBench (bilingual, multi-task long context) using the following metrics:

  • Quality Metrics: F1, ROUGE-L, Levenshtein distance, aggregated as “average score.”
  • Timing Strategies and Overrun Policies:
    • Kill: any job overrun is dropped (score = 0).
    • Skip-Next: if an overrun is imminent, subsequent prompts are skipped until completion.
  • Completion Rate: The fraction of tasks finishing before the deadline.
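
The two overrun policies can be contrasted with a small simulation. This is a sketch under stated assumptions, not the evaluation harness of the source: prompts are assumed to arrive periodically with a period equal to the time budget, an overrunning job under Skip-Next is assumed to keep its score, and the function names are hypothetical.

```python
def kill_scores(jobs, budget):
    """Kill: a job whose runtime exceeds the budget is dropped and scores 0."""
    return [score if runtime <= budget else 0.0 for runtime, score in jobs]


def skip_next_scores(jobs, budget):
    """Skip-Next: an overrunning job runs to completion; prompts that arrive
    while it is still running are skipped and score 0.

    Assumption: job k arrives at time k * budget.
    """
    scores = []
    busy_until = 0.0
    for k, (runtime, score) in enumerate(jobs):
        arrival = k * budget
        if arrival < busy_until:
            scores.append(0.0)  # previous job still running, so skip this prompt
            continue
        busy_until = arrival + runtime
        scores.append(score)
    return scores


def completion_rate(jobs, budget):
    """Fraction of jobs whose runtime fits within the budget."""
    return sum(runtime <= budget for runtime, _ in jobs) / len(jobs)


# Each job is (runtime in seconds, quality score); the values are made up.
jobs = [(1.0, 0.8), (2.5, 0.6), (1.0, 0.9), (1.0, 0.7), (0.5, 0.5)]
print(kill_scores(jobs, 2.0))
print(skip_next_scores(jobs, 2.0))
print(completion_rate(jobs, 2.0))
```

In this toy trace the second job overruns: Kill zeroes its own score, while Skip-Next keeps it and instead sacrifices the prompt that arrives before it finishes.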

Baselines include vanilla inference (no KV cache eviction) and SnapKV with a fixed eviction ratio.

Key Findings:

  • RLP achieves MAE ≈ 42.7 tokens and RMSE ≈ 78.1, outperforming 5- or 10-class BERT models for this task.
  • End-to-end predicted latency closely tracks actual runtime, with $\hat{t}_{\mathrm{WCET}}$ always upper bounding the true runtime.
  • Under time budgets of up to 10 s, TimeBill achieves up to 15% higher average score than vanilla inference and matches the completion rate of SnapKV with a fixed eviction ratio $\alpha$.
  • Performance peaks at a pessimism factor of 5 for length inflation, consistent with the “5× pessimism” rule common in hard real-time systems.

6. Significance and Impact

TimeBill establishes a systematic approach for meeting hard deadlines with LLMs, leveraging runtime modeling and analytic optimization to balance latency and answer quality. By integrating a fine-grained, LLM-tailored response length predictor, closed-form execution time models based on empirical hardware profiling, and an effective cache management scheme, TimeBill demonstrates robust empirical improvements in deadline completion rates and output fidelity. Its framework generalizes to any scenario with stringent real-time LLM requirements and has direct applicability to industrial, robotic, and safety-critical deployments (Fan et al., 26 Dec 2025).
