DevPiolt: LLM-based IoT Recommender

Updated 25 November 2025
  • DevPiolt is a large language model-based recommendation framework explicitly designed for personalized IoT operation suggestions in complex environments.
  • It integrates continual domain-adaptive pre-training, multi-task fine-tuning, direct preference optimization, and a confidence-based exposure control mechanism to enhance recommendation precision.
  • Deployed at scale in the Xiaomi Home ecosystem, DevPiolt demonstrates significant performance gains in user acceptance and operational accuracy compared to traditional models.

DevPiolt is an LLM-based recommendation framework engineered for personalized operation suggestions in Internet-of-Things (IoT) environments, implemented and deployed at scale within the Xiaomi Home ecosystem. The system addresses three core technical challenges in IoT operation recommendation: complex multi-step device logic, heterogeneous and evolving user preferences, and adverse user reactions to suboptimal or low-confidence recommendations. DevPiolt integrates continual domain-adaptive pre-training, multi-task fine-tuning, direct preference optimization (DPO), and a confidence-based exposure control mechanism to optimize both offline accuracy and real-world user experience (Wang et al., 18 Nov 2025).

1. System Architecture and Core Modules

DevPiolt’s architecture is an LLM-centric pipeline, partitioned into four sequential modules, each addressing a distinct challenge in the operational recommendation workflow:

  1. Pre-training for Domain Knowledge: The base LLM is adapted with IoT operation logic by continual pre-training on a composite of Xiaomi Home’s recent operation logs and a general-domain corpus. This ensures preservation of general language capability while enabling domain-specific reasoning.
  2. Multi-task Fine-tuning: Using a manually annotated corpus mapping user prompts to both structured device control actions and corresponding natural language explanations, DevPiolt is fine-tuned through the LoRA (Low-Rank Adaptation) paradigm (rank = 16, α = 32, dropout = 0.05 applied to all projection layers). Fine-tuning optimizes a multi-task loss, first predicting structured actions and subsequently generating user-oriented descriptions.
  3. Direct Preference Optimization (DPO): To finely align recommendations with empirical user behavior, DevPiolt employs DPO. Preference pairs—constituted from real acceptance and contextually defined rejections—directly optimize the likelihood ratio between the current model and a reference checkpoint, amplifying the probability of user-preferred actions in nuanced operational contexts.
  4. Confidence-based Exposure Control: To mitigate user aversion to incorrect suggestions, each candidate operation is scored by the model’s output probabilities over key action attributes. Only actions exceeding an adaptively tuned threshold are exposed, automatically trading off between recommendation coverage and acceptance quality.
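
As an illustrative sketch (all class and function names here are hypothetical, not from the paper), the serving-time flow reduces to: the deployed model, which already embeds stages 1–3, proposes an action, and the stage-4 exposure gate decides whether it is shown.

```python
class StubModel:
    """Stand-in for the deployed DevPiolt checkpoint (pre-trained,
    fine-tuned, and DPO-aligned); returns canned outputs for illustration."""

    def propose_action(self, prompt):
        # Structured device action plus the model's confidence in it.
        return "light.power=on", 0.82

    def describe(self, prompt, action):
        # Action-first decoding: the explanation conditions on the action.
        return f"Turning on the light ({action})."


def recommend(prompt, model, threshold=0.7):
    action, conf = model.propose_action(prompt)
    if conf < threshold:   # stage 4: confidence-based exposure control
        return None        # suppress rather than risk a bad suggestion
    return {"action": action, "description": model.describe(prompt, action)}
```

Raising the threshold trades coverage for acceptance quality, which is exactly the knob the exposure-control module tunes.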

2. Continual Pre-Training for IoT Contextualization

DevPiolt’s knowledge acquisition for domain intricacies initiates with continual pre-training. The pre-training corpus comprises three months of Xiaomi Home device operation logs (approximately 45,000 monthly samples), incorporating both sensor-driven environment logs (temperature, humidity, etc.) and explicit user control actions. Data are mixed 1:1 with samples from general LLM corpora (e.g., ShareGPT, WuDao) to balance specificity and language generality.

Each pre-training instance is serialized as:

$$[\mathcal{O}_i^{time};\,\mathcal{O}_i^{env};\,\mathcal{O}_i^{act}]$$

The next-action prediction task optimizes:

$$\mathcal{L}_{pt} = -\sum_{\mathcal{O}_i\in\mathcal{H}} \log P\bigl(\mathcal{O}_i^{act}\,\big|\,\mathcal{O}_{<i},\,\mathcal{O}_i^{time},\,\mathcal{O}_i^{env}\bigr)$$

This instills serialization-specific knowledge, device constraints (such as legal value ranges), and canonical operation orderings (e.g., power-on followed by mode setting).
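
A minimal sketch of this serialization, with illustrative field formats that are assumptions rather than the paper's exact layout:

```python
def serialize_instance(time_ctx, env_ctx, action):
    """Flatten one log entry into the [O_time; O_env; O_act] layout
    used for next-action pre-training (field formats are assumed)."""
    env = "; ".join(f"{k}={v}" for k, v in sorted(env_ctx.items()))
    return f"[time: {time_ctx}] [env: {env}] [act: {action}]"


sample = serialize_instance(
    "2025-08-01 21:30",
    {"temperature": "26C", "humidity": "55%"},
    "air_conditioner.power=on",
)
# Pre-training is then ordinary next-token prediction with the loss
# restricted to the [act: ...] span, conditioned on history and context.
```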

3. Multi-task Fine-tuning and Action-First Decoding

The fine-tuning phase employs a 15,000-sample manually curated set. Each sample contains:

  • User prompt (operation history, current timestamp, environment readouts, device candidate list)
  • Ground-truth device action set (1–n quadruples)
  • Human-friendly textual explanation

The LoRA-adapted LLM is optimized under an “action-first” two-stage decoding regime, shown to yield higher exact-match accuracy than joint or description-first alternatives. The objective:

$$\mathcal{L}_{ad} = -\left[\log P\bigl(\mathcal{O}_l^{act}\mid \mathcal{P}\bigr) + \log P\bigl(\mathcal{O}_l^{desc}\mid \mathcal{P},\,\mathcal{O}_l^{act}\bigr)\right]$$

This approach first infers the structured device action, then generates the associated explanation, closely reflecting user-executed behaviors.
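
Assuming per-sequence probabilities are available, the two-term objective reduces to a toy computation (a sketch, not the actual training code):

```python
import math


def action_first_loss(p_action_given_prompt, p_desc_given_prompt_and_action):
    """L_ad = -[log P(act | P) + log P(desc | P, act)].
    Decoding the structured action first lets the natural-language
    explanation condition on it, rather than the reverse."""
    return -(math.log(p_action_given_prompt)
             + math.log(p_desc_given_prompt_and_action))
```

The loss is zero only when both the action and its explanation are predicted with certainty, and it grows as either term becomes less likely.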

4. Recommendation Refinement via Direct Preference Optimization

Fine-tuning alone remains insensitive to context-specific rejection signals (e.g., a suggestion ignored within ±1 hour of delivery, or reversed by an inverse operation shortly afterward). DevPiolt addresses this with explicit preference-signal extraction.

Preference pairs are constructed as:

  • Positive: $S^{pos}=(\mathcal{P},\,\mathcal{O}_l^{act})$, where the action was taken by the user
  • Negative: $S^{neg}=(\mathcal{P},\,\mathcal{O}_l^{act'})$, where the action was not performed in proximity or was overridden

The DPO loss, comparing the current model $P_{\theta}$ against a frozen reference $P_{ref}$, is:

$$\mathcal{L}_{DPO} = -\log \sigma\!\left[\beta \log \frac{P_{\theta}(\mathcal{O}_l^{act}\mid \mathcal{P})}{P_{ref}(\mathcal{O}_l^{act}\mid \mathcal{P})} - \beta \log \frac{P_{\theta}(\mathcal{O}_l^{act'}\mid \mathcal{P})}{P_{ref}(\mathcal{O}_l^{act'}\mid \mathcal{P})}\right]$$

This enforces higher likelihood on actions manifestly preferred by users in situ, reinforcing temporally and contextually sensitive operational patterns.
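
A numerical sketch of the DPO loss in pure Python (β and the log-probabilities below are placeholder values, not figures from the paper):

```python
import math


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigmoid(beta * [(log Pθ(pos) - log Pref(pos))
                            - (log Pθ(neg) - log Pref(neg))])"""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; raising the preferred action's log-probability relative to the reference shrinks the loss toward zero.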

5. Confidence-based Exposure Control Mechanism

To suppress low-confidence recommendations and the negative user experiences they cause, DevPiolt scores each recommendation with a multi-factor confidence:

$$\mathrm{Conf}(\mathcal{O}_i) = \frac{1}{a_i} \sum_{j=1}^{a_i} \sum_{k\in\{\text{device},\,\text{field},\,\text{value}\}} \alpha_k\, P\bigl(\mathrm{attr}_k(\mathcal{O}_{i,j}^{act})\mid \mathcal{P}\bigr)$$

where $a_i$ is the number of recommended actions, $\mathrm{attr}_k$ the predicted attribute value, and $\alpha_k$ pre-set attribute weights. The display threshold is dynamically lowered by 10% for large candidate sets or continuous-valued actions, and candidates are further refined through cascade pruning: a device-level failure excludes the action immediately.

In deployment, a confidence threshold of 0.7 balances the tradeoff between maximizing user exposure (coverage) and reducing rejection/negative experience rates.
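
A minimal scoring-and-gating sketch of this mechanism (the attribute weights, the device-level cutoff, and the data layout are illustrative assumptions, not values from the paper):

```python
WEIGHTS = {"device": 0.4, "field": 0.3, "value": 0.3}  # assumed alpha_k


def confidence(actions):
    """Mean weighted attribute probability over the a_i recommended actions.
    Each action maps attribute name -> model probability for that attribute."""
    per_action = (sum(WEIGHTS[k] * act[k] for k in WEIGHTS) for act in actions)
    return sum(per_action) / len(actions)


def should_expose(actions, threshold=0.7, relax=False, device_cutoff=0.5):
    # Cascade pruning: a weak device prediction excludes the candidate outright.
    if any(act["device"] < device_cutoff for act in actions):
        return False
    if relax:             # large candidate set or continuous-valued action
        threshold *= 0.9  # deployed threshold lowered by 10%
    return confidence(actions) >= threshold
```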

6. Empirical Evaluation and Ablation

DevPiolt’s performance was benchmarked against CGC (Customized Gate Control for IoT), DeepSeek-V3 (prompt-driven LLM recommender), and GPT-4o. The test dataset comprises 4,882 unique user-operation instances, subdivided by device count and device category.

| Method | EM-Acc (%) | LM-F1 (%) | Rule (%) |
|---|---|---|---|
| CGC | 19.95 | 17.72 | 23.48 |
| DeepSeek-V3 | 32.97 | 44.68 | 41.45 |
| GPT-4o | 33.43 | 45.19 | 42.32 |
| DevPiolt | 44.33 | 57.49 | 53.03 |

DevPiolt achieved:

  • +32.6% relative improvement in exact-match accuracy over the strongest baseline, GPT-4o (44.33% vs. 33.43%)
  • +27.2% relative gain in loose-match F1 (57.49% vs. 45.19%)
  • +25.3% relative gain in rule-based match (53.03% vs. 42.32%)
  • Gains over the weakest baseline (CGC) exceed 120% on every metric.

Ablation studies demonstrate cumulative gains with each architectural component:

| Configuration | EM-Acc | LM-F1 | Rule |
|---|---|---|---|
| Base Qwen2.5-14B | 32.04 | 44.05 | 41.92 |
| + fine-tuning | 42.36 | 56.12 | 51.23 |
| + pre-training (IoT corpus) | 43.57 | 56.99 | 52.74 |
| + DPO refinement | 44.33 | 57.49 | 53.03 |

Each step—fine-tuning, pre-training, DPO—contributes to the final performance envelope.

7. Production Deployment and Impact

DevPiolt was deployed for a full quarter in the Xiaomi Home App, delivering personalized device operation recommendations to 255,000 daily active users and generating approximately 326,000 suggestions per day.

A/B experiments conducted online reveal:

  • +21.6% relative increase in unique-visitor device coverage
  • +29.1% rise in user acceptance rate of recommended page views

These results indicate that the offline accuracy gains translate into measurably higher user engagement and device utilization at production scale.


DevPiolt exemplifies the integration of domain-adapted LLMs with user-aligned optimization and exposure control mechanisms within IoT recommendation pipelines, achieving significant advances over contemporary LLM and classic algorithmic baselines in both prediction fidelity and operational deployment impact (Wang et al., 18 Nov 2025).
