DomusFM: Smart-Home Sensor Foundation Model
- DomusFM is a foundation model for smart-home sensor data that uses dual contrastive self-supervised learning to capture semantic and temporal dependencies.
- It integrates specialized semantic, status, and temporal encoders with Transformer layers to effectively process sparse binary events in smart-home environments.
- The model demonstrates strong transferability across diverse datasets, enhancing ADL recognition and event prediction with minimal labeled data while ensuring privacy.
DomusFM is a foundation model specifically designed and pretrained for smart-home sensor data, addressing the limitations of supervised and existing foundation approaches in activity recognition and event analysis. Distinct from prior models, it is optimized for the sparse, discrete binary events generated by smart-home ambient sensors, capturing both their semantic richness and temporal dependencies through a hybrid, dual-contrastive self-supervised learning paradigm. DomusFM exhibits robust transferability across diverse home environments and enables practical, privacy-preserving deployment for applications such as healthcare monitoring and assistive technologies (Fiori et al., 2 Feb 2026).
1. Model Architecture and Data Representation
DomusFM operates on a global, time-ordered stream of binary sensor events, denoted e_1, e_2, …, where each event e_i = (t_i, s_i, v_i) comprises a timestamp t_i, a sensor identifier s_i, and a binary sensor state v_i. Each event is described by five discrete attributes: HouseItem, Room, Sensor Type, Status, and timestamp.
The architecture consists of two core stages:
- Event-level Feature Extraction: Each event's five attributes are embedded through specialized mechanisms:
- Semantic encoders (LLM-based): HouseItem, Room, and Type are encoded via Sentence-BERT (all-MiniLM-L6-v2) into 384-dimensional vectors.
- Status encoder: A dedicated learned embedding for ON, OFF, and MASK states (384d).
- Temporal encoder: Day-of-week and hour-of-day are mapped using cyclic sin/cos encoding, and intra-hour seconds use a categorical embedding, all projected to 384d.
- The five embeddings are aggregated via an attribute-wise self-attention layer, yielding a single fused 384-dimensional representation per event.
- Contextualization: Events are segmented into sliding windows of 30 events (stride 1). For each window, a stack of 12 Transformer encoder layers (12 attention heads each) is applied to the sequence of fused event embeddings. This produces context-aware embeddings that capture the temporal and relational dependencies within each window; the feature extractor outputs the window's contextualized representations.
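The cyclic sin/cos time encoding used by the temporal encoder can be sketched as follows. This is a minimal illustration, not the paper's code: the function name and feature layout are assumptions, and the real encoder further projects these features (plus an intra-hour seconds embedding) to 384 dimensions.

```python
import numpy as np

def cyclic_time_features(weekday: int, hour: int) -> np.ndarray:
    """Encode day-of-week (0-6) and hour-of-day (0-23) as sin/cos pairs,
    so that adjacent times (e.g. hour 23 and hour 0) stay close in feature space."""
    dow = 2 * np.pi * weekday / 7.0
    hod = 2 * np.pi * hour / 24.0
    return np.array([np.sin(dow), np.cos(dow), np.sin(hod), np.cos(hod)])

# Hour 23 and hour 0 are maximally far apart as raw integers,
# but nearly identical in the cyclic encoding.
near = np.linalg.norm(cyclic_time_features(0, 0) - cyclic_time_features(0, 23))
far = np.linalg.norm(cyclic_time_features(0, 0) - cyclic_time_features(0, 12))
```

This avoids the discontinuity a plain integer encoding would introduce at midnight and at the week boundary.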
2. Self-supervised Dual Contrastive Pretraining
DomusFM is pretrained using a dual contrastive learning scheme formulated as two phases, employing InfoNCE losses so that the model learns both semantic invariance and sequence-level temporal structure:
- Attribute-level (Token-level) Contrastive Learning: Events in a window are randomly selected (each with some fixed probability); for each selected event, one attribute is masked (replaced with MASK). Let z = Pool(masked window) and z+ = Pool(original window), where Pool is an aggregation such as mean-pooling. The InfoNCE loss is
  L_attr = -log( exp(sim(z, z+)/τ) / [ exp(sim(z, z+)/τ) + Σ_{z− ∈ N} exp(sim(z, z−)/τ) ] ),
  where sim(·, ·) is cosine similarity, τ is a temperature hyperparameter, and N denotes the set of negatives (pooled embeddings of other windows).
- Event-level (Sequence-level) Contrastive Learning: After attribute-level training, the event-level feature extractor and its self-attention are frozen; with some fixed probability, whole events are masked. Pooling as above produces the masked-window representation, and the same contrastive loss is applied, denoted L_event.
The overall pretraining objective is additive: L = L_attr + L_event.
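The InfoNCE objective above can be sketched for a single anchor window with NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: `info_nce` and its argument names are hypothetical, and real training would batch this over many windows.

```python
import numpy as np

def info_nce(z: np.ndarray, z_pos: np.ndarray, z_negs: list, tau: float = 0.1) -> float:
    """InfoNCE for one anchor: pull the masked-window embedding z toward its
    original (positive) view z_pos, push it away from other windows z_negs."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(z, z_pos) / tau)
    negs = sum(np.exp(cos(z, n) / tau) for n in z_negs)
    return float(-np.log(pos / (pos + negs)))

# A well-aligned positive yields a lower loss than a mismatched one.
negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
loss_aligned = info_nce(np.array([1.0, 0.0]), np.array([1.0, 0.0]), negatives)
loss_mismatch = info_nce(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                         [np.array([1.0, 0.0]), np.array([-1.0, 0.0])])
```

Lower temperature τ sharpens the softmax, penalizing hard negatives more aggressively.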
3. Pretraining, Data Pipeline, and Optimization
DomusFM is trained on an aggregated collection of seven public smart-home datasets, encompassing both binary sensors and binarized continuous channels. Preprocessing removes redundant events and encodes HouseItem, Room, and Type using their text labels. Input streams are segmented into overlapping windows (30 events, stride 1), with batches composed for parallel training.
Attribute and event masking serve as augmentations. To address dataset imbalance, random oversampling is applied at the dataset level. Training uses the Adam optimizer with learning-rate schedules standard in Transformer pretraining. Pretraining proceeds in two stages: initial optimization of the attribute-level contrastive loss for a fixed number of epochs, followed by freezing the early event-level layers and optimizing the event-level contrastive loss for a further fixed number of epochs.
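The two masking augmentations can be sketched as follows. This is a minimal illustration with assumed names (`mask_window`, the attribute keys): attribute-level masking replaces one random attribute per selected event, while event-level masking replaces all of them.

```python
import random

MASK = "MASK"
ATTRS = ["house_item", "room", "sensor_type", "status", "timestamp"]

def mask_window(window, p=0.15, whole_event=False, rng=random):
    """Return an augmented copy of a window (list of attribute dicts).
    Each event is selected with probability p; a selected event has either
    one random attribute masked (attribute-level phase) or all attributes
    masked (event-level phase). The input window is left untouched."""
    out = []
    for event in window:
        ev = dict(event)
        if rng.random() < p:
            targets = ATTRS if whole_event else [rng.choice(ATTRS)]
            for a in targets:
                ev[a] = MASK
        out.append(ev)
    return out

window = [{a: a for a in ATTRS}]
attr_masked = mask_window(window, p=1.0, whole_event=False)   # one attribute masked
event_masked = mask_window(window, p=1.0, whole_event=True)   # whole event masked
```

The original and masked views of the same window then form the positive pair for the contrastive losses.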
4. Transfer, Fine-tuning, and Evaluation
DomusFM is evaluated for few-shot transferability and generalization through a leave-one-dataset-out protocol across the following datasets: CASAS Milan, CASAS Aruba, Van Kasteren A, Van Kasteren C, UCI B, Orange4Home, and MuRAL. Each evaluation iteration holds out one dataset for fine-tuning and downstream testing, with the remaining six used for pretraining, ensuring strict data separation.
DomusFM supports multiple downstream tasks:
- ADL (Activities of Daily Living) Recognition: A linear classifier maps the last contextualized embedding of a window to activities; performance is scored using weighted F1.
- Next-K Events Prediction: Dual heads predict the set of sensor-state combinations and their frequency counts for the next K events (K = 30), evaluated by a bag-of-events F1 that matches the multiset nature of event sequences.
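The source does not spell out the bag-of-events F1 computation; a plausible multiset formulation, under the assumption that events are compared as (sensor, state) pairs with multiplicity, is:

```python
from collections import Counter

def bag_of_events_f1(predicted, actual) -> float:
    """Multiset F1: a predicted (sensor, state) event earns credit only up to
    the number of times it actually occurs in the next-K window."""
    pred, act = Counter(predicted), Counter(actual)
    overlap = sum((pred & act).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(act.values())
    return 2 * precision * recall / (precision + recall)

perfect = bag_of_events_f1([("door", "ON")], [("door", "ON")])
over_predicted = bag_of_events_f1([("door", "ON"), ("door", "ON")], [("door", "ON")])
```

Treating predictions as a multiset rather than a sequence avoids penalizing correct events that merely appear in a different order.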
Fine-tuning is conducted under varying data scarcity, with only 5%, 10%, 15%, or 30% of labeled windows from the held-out dataset. Five-fold cross-validation (10 epochs, no early stopping) is employed.
Performance comparisons demonstrate that at 5% labeled data, DomusFM outperforms DeepCASAS on ADL recognition (e.g., F1 = 0.72 vs. 0.54 on CASAS Milan) and GPT-2–based baselines on next-30 prediction (e.g., F1 = 0.59 vs. 0.46 on Van Kasteren C). Margins persist up to 30% labeled data, confirming the efficiency of DomusFM under few-shot calibration (Fiori et al., 2 Feb 2026).
5. Practical Deployment and Systems Considerations
DomusFM contains approximately 36 million parameters and requires less than 500 MB of memory. Inference latency is approximately 9.6 ms (next-K prediction) to 9.9 ms (ADL recognition) per window of 30 events on a standard CPU-based smart-home gateway. No external APIs, GPU, or cloud services are required, enabling on-site, low-latency, privacy-preserving deployment. Raw sensor data is never transmitted off-site, mitigating privacy, bandwidth, and recurring-cost concerns.
6. Key Contributions, Limitations, and Open Challenges
DomusFM, as the first foundation model natively pretrained on smart-home binary sensor data, introduces the following innovations:
- Dual contrastive learning for robust, generalizable feature extraction at both semantic (attribute) and temporal (event sequence) levels.
- Hybrid integration of LLM-derived semantic priors with specialized binary and temporal encoding for ambient sensor streams.
- Rigorous, dataset-agnostic evaluation confirming the model’s ability to generalize with limited calibration data.
- Suitability for real-world edge deployment due to its compute and memory efficiency.
Limitations include the reliance on single-occupancy or perfect person attribution (multi-occupant segmentation is not addressed), binarization of continuous sensors (with possible information loss), and the need for at least limited labeled target data (the model is not strictly zero-shot). DomusFM has only been validated on ADL recognition and next-K event prediction; other tasks such as anomaly detection and behavior-change detection are prospective areas for future research.
7. Application Domains
DomusFM enables multiple downstream smart-home analytics and control scenarios, including:
- Healthcare Monitoring: Early detection of deviations from individuals’ routine for elderly care and risk management.
- Assistive Technologies: Context-adaptive prompting, reminders, and risk alerts for vulnerable populations.
- Energy Management & Automation: Predictive control of home devices for efficiency and comfort.
- Anomaly/Fraud Detection: Identification of unusual usage patterns reflecting security or integrity concerns.
These domains benefit from DomusFM’s ability to operate under stringent privacy requirements and data scarcity, enabling practical deployment of intelligent systems in everyday living environments (Fiori et al., 2 Feb 2026).