DomusFM: Smart-Home Sensor Foundation Model
- DomusFM is a foundation model for smart-home sensor data that uses dual contrastive self-supervised learning to capture semantic and temporal dependencies.
- It integrates specialized semantic, status, and temporal encoders with Transformer layers to effectively process sparse binary events in smart-home environments.
- The model demonstrates strong transferability across diverse datasets, enhancing ADL recognition and event prediction with minimal labeled data while ensuring privacy.
DomusFM is a foundation model specifically designed and pretrained for smart-home sensor data, addressing the limitations of supervised and existing foundation approaches in activity recognition and event analysis. Distinct from prior models, it is optimized for the sparse, discrete binary events generated by smart-home ambient sensors, capturing both their semantic richness and temporal dependencies through a hybrid, dual-contrastive self-supervised learning paradigm. DomusFM exhibits robust transferability across diverse home environments and enables practical, privacy-preserving deployment for applications such as healthcare monitoring and assistive technologies (Fiori et al., 2 Feb 2026).
1. Model Architecture and Data Representation
DomusFM operates on a global, time-ordered stream of binary sensor events, denoted e_1, e_2, …, where each event e_i = (t_i, s_i, v_i) comprises a timestamp t_i, a sensor identifier s_i, and a binary sensor state v_i. Each event is described by five discrete attributes: HouseItem, Room, Sensor Type, Status, and timestamp.
The architecture consists of two core stages:
- Event-level Feature Extraction: Each event's five attributes are embedded through specialized mechanisms:
- Semantic encoders (LLM-based): HouseItem, Room, and Type are encoded via Sentence-BERT (all-MiniLM-L6-v2) into 384-dimensional vectors.
- Status encoder: A dedicated learned embedding for ON, OFF, and MASK states (384d).
- Temporal encoder: Day-of-week and hour-of-day are mapped using cyclic sin/cos encoding, and intra-hour seconds use a categorical embedding, all projected to 384d.
- The five embeddings are aggregated via an attribute-wise self-attention layer, yielding a single fused 384-dimensional representation per event.
- Contextualization: Events are segmented into sliding windows of 30 events (stride 1). For each window, a stack of 12 Transformer encoder layers (12 attention heads each) is applied to the sequence of fused event embeddings. This produces context-aware embeddings that capture the temporal and relational dependencies within each window; the feature extractor outputs the window's contextualized representations.
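The cyclic sin/cos time encoding used by the temporal encoder can be sketched as follows. This is a minimal illustration, not the paper's code: the function name and feature layout are assumptions, and the real encoder further projects these features (plus an intra-hour seconds embedding) to 384 dimensions.

```python
import numpy as np

def cyclic_time_features(weekday: int, hour: int) -> np.ndarray:
    """Encode day-of-week (0-6) and hour-of-day (0-23) as sin/cos pairs,
    so that adjacent times (e.g. hour 23 and hour 0) stay close in feature space."""
    dow = 2 * np.pi * weekday / 7.0
    hod = 2 * np.pi * hour / 24.0
    return np.array([np.sin(dow), np.cos(dow), np.sin(hod), np.cos(hod)])

# Hour 23 and hour 0 are maximally far apart as raw integers,
# but nearly identical in the cyclic encoding.
near = np.linalg.norm(cyclic_time_features(0, 0) - cyclic_time_features(0, 23))
far = np.linalg.norm(cyclic_time_features(0, 0) - cyclic_time_features(0, 12))
```

This avoids the discontinuity a plain integer encoding would introduce at midnight and at the week boundary.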
2. Self-supervised Dual Contrastive Pretraining
DomusFM is pretrained using a dual contrastive learning scheme formulated as two phases, employing InfoNCE losses so that the model learns both semantic invariance and sequence-level temporal structure:
- Attribute-level (Token-level) Contrastive Learning: Events in a window are randomly selected (each with some fixed probability); for each selected event, one attribute is masked (replaced with MASK). Let z = Pool(masked window) and z+ = Pool(original window), where Pool is an aggregation such as mean-pooling. The InfoNCE loss is
  L_attr = -log( exp(sim(z, z+)/τ) / [ exp(sim(z, z+)/τ) + Σ_{z− ∈ N} exp(sim(z, z−)/τ) ] ),
  where sim(·, ·) is cosine similarity, τ is a temperature hyperparameter, and N denotes the set of negatives (pooled embeddings of other windows).
- Event-level (Sequence-level) Contrastive Learning: After attribute-level training, the event-level feature extractor and its self-attention are frozen; with some fixed probability, whole events are masked. Pooling as above produces the masked-window representation, and the same contrastive loss is applied, denoted L_event.
The overall pretraining objective is additive: L = L_attr + L_event.
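The InfoNCE objective above can be sketched for a single anchor window with NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: `info_nce` and its argument names are hypothetical, and real training would batch this over many windows.

```python
import numpy as np

def info_nce(z: np.ndarray, z_pos: np.ndarray, z_negs: list, tau: float = 0.1) -> float:
    """InfoNCE for one anchor: pull the masked-window embedding z toward its
    original (positive) view z_pos, push it away from other windows z_negs."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(z, z_pos) / tau)
    negs = sum(np.exp(cos(z, n) / tau) for n in z_negs)
    return float(-np.log(pos / (pos + negs)))

# A well-aligned positive yields a lower loss than a mismatched one.
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
loss_aligned = info_nce(np.array([1.0, 0.0]), np.array([1.0, 0.0]), negatives)
loss_mismatch = info_nce(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                         [np.array([1.0, 0.0]), np.array([-1.0, 0.0])])
```

Lower temperature τ sharpens the softmax, penalizing hard negatives more aggressively.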
3. Pretraining, Data Pipeline, and Optimization
DomusFM is trained on an aggregated collection of seven public smart-home datasets, encompassing both binary sensors and binarized continuous channels. Preprocessing removes redundant events and encodes HouseItem, Room, and Type using their text labels. Input streams are segmented into overlapping windows (30 events, stride 1), with batches composed for parallel training.
Attribute and event masking serve as augmentations. To address dataset imbalance, random oversampling is applied at the dataset level. Training uses the Adam optimizer with learning-rate schedules standard in Transformer pretraining. Pretraining proceeds in two stages: initial optimization of the attribute-level contrastive loss for a fixed number of epochs, followed by freezing the early event-level layers and optimizing the event-level contrastive loss for a further fixed number of epochs.
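The two masking augmentations can be sketched as follows. This is a minimal illustration with assumed names (`mask_window`, the attribute keys): attribute-level masking replaces one random attribute per selected event, while event-level masking replaces all of them.

```python
import random

MASK = "MASK"
ATTRS = ["house_item", "room", "sensor_type", "status", "timestamp"]

def mask_window(window, p=0.15, whole_event=False, rng=random):
    """Return an augmented copy of a window (list of attribute dicts).
    Each event is selected with probability p; a selected event has either
    one random attribute masked (attribute-level phase) or all attributes
    masked (event-level phase). The input window is left untouched."""
    out = []
    for event in window:
        ev = dict(event)
        if rng.random() < p:
            targets = ATTRS if whole_event else [rng.choice(ATTRS)]
            for a in targets:
                ev[a] = MASK
        out.append(ev)
    return out

window = [{a: a for a in ATTRS}]
attr_masked = mask_window(window, p=1.0, whole_event=False)   # one attribute masked
event_masked = mask_window(window, p=1.0, whole_event=True)   # whole event masked
```

The original and masked views of the same window then form the positive pair for the contrastive losses.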
4. Transfer, Fine-tuning, and Evaluation
DomusFM is evaluated for few-shot transferability and generalization through a leave-one-dataset-out protocol across the following datasets: CASAS Milan, CASAS Aruba, Van Kasteren A, Van Kasteren C, UCI B, Orange4Home, and MuRAL. Each evaluation iteration holds out one dataset for fine-tuning and downstream testing, with the remaining six used for pretraining, ensuring strict data separation.
DomusFM supports multiple downstream tasks:
- ADL (Activities of Daily Living) Recognition: A linear classifier maps the last contextualized embedding of a window to activities; performance is scored using weighted F1.
- Next-K Events Prediction: Dual heads predict the set of sensor-state combinations and their frequency counts for the next K events (K = 30), evaluated by a bag-of-events F1 that matches the multiset nature of event sequences.
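The source does not spell out the bag-of-events F1 computation; a plausible multiset formulation, under the assumption that events are compared as (sensor, state) pairs with multiplicity, is:

```python
from collections import Counter

def bag_of_events_f1(predicted, actual) -> float:
    """Multiset F1: a predicted (sensor, state) event earns credit only up to
    the number of times it actually occurs in the next-K window."""
    pred, act = Counter(predicted), Counter(actual)
    overlap = sum((pred & act).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(act.values())
    return 2 * precision * recall / (precision + recall)

perfect = bag_of_events_f1([("door", "ON")], [("door", "ON")])
over_predicted = bag_of_events_f1([("door", "ON"), ("door", "ON")], [("door", "ON")])
```

Treating predictions as a multiset rather than a sequence avoids penalizing correct events that merely appear in a different order.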
Fine-tuning is conducted under varying data scarcity, with only 5%, 10%, 15%, or 30% of labeled windows from the held-out dataset. Five-fold cross-validation (10 epochs, no early stopping) is employed.
Performance comparisons demonstrate that at 5% labeled data, DomusFM outperforms DeepCASAS on ADL recognition (e.g., F1 = 0.72 vs. 0.54 on CASAS Milan) and GPT-2–based baselines on next-30 prediction (e.g., F1 = 0.59 vs. 0.46 on Van Kasteren C). Margins persist up to 30% labeled data, confirming the efficiency of DomusFM under few-shot calibration (Fiori et al., 2 Feb 2026).
5. Practical Deployment and Systems Considerations
DomusFM contains approximately 36 million parameters and requires less than 500 MB of memory. Inference latency is approximately 9.6 ms (next-K prediction) to 9.9 ms (ADL recognition) per window of 30 events on a standard CPU-based smart-home gateway. No external APIs, GPU, or cloud services are required, enabling on-site, low-latency, privacy-preserving deployment. Raw sensor data is never transmitted off-site, mitigating privacy, bandwidth, and recurring-cost concerns.
6. Key Contributions, Limitations, and Open Challenges
DomusFM, as the first foundation model natively pretrained on smart-home binary sensor data, introduces the following innovations:
- Dual contrastive learning for robust, generalizable feature extraction at both semantic (attribute) and temporal (event sequence) levels.
- Hybrid integration of LLM-derived semantic priors with specialized binary and temporal encoding for ambient sensor streams.
- Rigorous, dataset-agnostic evaluation confirming the model’s ability to generalize with limited calibration data.
- Suitability for real-world edge deployment due to its compute and memory efficiency.
Limitations include the reliance on single-occupancy or perfect person attribution (multi-occupant segmentation is not addressed), binarization of continuous sensors (with possible information loss), and the need for at least limited labeled target data (the model is not strictly zero-shot). DomusFM has only been validated on ADL recognition and next-K event prediction; other tasks such as anomaly detection and behavior-change detection are prospective areas for future research.
7. Application Domains
DomusFM enables multiple downstream smart-home analytics and control scenarios, including:
- Healthcare Monitoring: Early detection of deviations from individuals’ routine for elderly care and risk management.
- Assistive Technologies: Context-adaptive prompting, reminders, and risk alerts for vulnerable populations.
- Energy Management & Automation: Predictive control of home devices for efficiency and comfort.
- Anomaly/Fraud Detection: Identification of unusual usage patterns reflecting security or integrity concerns.
These domains benefit from DomusFM’s ability to operate under stringent privacy requirements and data scarcity, enabling practical deployment of intelligent systems in everyday living environments (Fiori et al., 2 Feb 2026).