Creating General User Models from Computer Use

Published 16 May 2025 in cs.HC, cs.AI, and cs.CL | (2505.10831v3)

Abstract: Human-computer interaction has long imagined technology that understands us-from our preferences and habits, to the timing and purpose of our everyday actions. Yet current user models remain fragmented, narrowly tailored to specific apps, and incapable of the flexible reasoning required to fulfill these visions. This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer. The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences. GUMs can infer that a user is preparing for a wedding they're attending from messages with a friend. Or recognize that a user is struggling with a collaborator's feedback on a draft by observing multiple stalled edits and a switch to reading related work. GUMs introduce an architecture that infers new propositions about a user from multimodal observations, retrieves related propositions for context, and continuously revises existing propositions. To illustrate the breadth of applications that GUMs enable, we demonstrate how they augment chat-based assistants with context, manage OS notifications to selectively surface important information, and enable interactive agents that adapt to preferences across apps. We also instantiate proactive assistants (GUMBOs) that discover and execute useful suggestions on a user's behalf using their GUM. In our evaluations, we find that GUMs make calibrated and accurate inferences about users, and that assistants built on GUMs proactively identify and perform actions that users wouldn't think to request explicitly. Altogether, GUMs introduce methods that leverage multimodal models to understand unstructured context, enabling long-standing visions of HCI and entirely new interactive systems that anticipate user needs.

Abstract PDF Upgrade to Chat

Summary

The paper introduces General User Models that consolidate computer use data into confidence-weighted user propositions.
It employs a four-module pipeline—Propose, Retrieve, Revise, and Audit—to generate, update, and safeguard user inferences.
Evaluation shows robust accuracy (up to 100% for high-confidence propositions) and effective privacy auditing in real-world deployments.

Creating General User Models from Computer Use: An Expert Analysis

Motivation and Problem Statement

The paper addresses a longstanding challenge in HCI and AI: constructing user models that are both general and actionable across diverse contexts, rather than being siloed within specific applications or domains. Existing user models are typically narrow, focusing on isolated preferences (e.g., music, shopping) or limited to single-application telemetry. This fragmentation precludes the realization of context-aware, proactive, and adaptive systems as envisioned in foundational HCI work. The authors propose General User Models (GUMs), which learn from unstructured, multimodal observations of user behavior to generate confidence-weighted, natural language propositions about the user's knowledge, beliefs, and preferences.

GUM Architecture and Abstractions

GUMs are instantiated as collections of confidence-weighted propositions, each representing an inference about the user derived from raw observations (e.g., screenshots, notifications, file system events). The architecture comprises four primary modules:

Propose: Converts unstructured observations into candidate propositions with associated confidence scores and rationales.
Retrieve: Indexes and retrieves contextually relevant propositions using a two-stage BM25 + LLM reranking pipeline, incorporating recency and diversity via MMR.
Revise: Refines and updates existing propositions in light of new evidence, supporting both reinforcement and contradiction.
Audit: Enforces privacy by leveraging contextual integrity, using the GUM itself to filter observations that violate user-specific privacy norms.

This pipeline enables continuous, online learning of user context, supporting both stable (long-term) and transient (moment-to-moment) inferences.

Figure 1: General User Models ground user queries in broader context by marshalling relevant prior interactions and observations, enabling highly contextualized responses.

Implementation Details

Observations and Propositions

Observations are any data that can be tokenized and processed by a VLM or LLM, including screenshots, text, and sensor data. Each observation is timestamped and source-tagged. Propositions are generated via prompted LLM inference, with explicit rationales and confidence scores (1-10, normalized to [0,1]). Decay scores are also generated to model proposition staleness, inspired by cognitive architectures.

Model Choices and Engineering

Vision-LLM: Qwen 2.5 VL is used for screen transcription and multimodal understanding.
LLM: Llama 3.3 70B is used for proposition generation, revision, and reranking.
Retrieval: BM25 for initial candidate selection, followed by LLM-based reranking for semantic precision.
Privacy: All inference is performed locally or on secure, user-controlled infrastructure; the Audit module uses the GUM to answer contextual integrity questions and block sensitive data.

API and Extensibility

GUM is implemented as a Python package, pip-installable, and exposes an interface for subscribing to observation streams, querying propositions, and managing proposition metadata. Observers can be extended to new modalities (audio, health data, etc.) with minimal engineering overhead.

Applications and System Integration

LLM Prompt Augmentation

GUMs can be used to augment LLM prompts with rich, user-specific context, closing the grounding gap that plagues current LLM-based assistants. For example, underspecified queries ("help me with this section") are automatically contextualized with relevant prior work, collaborators' feedback, and recent activities, yielding more relevant and actionable model outputs.

Context-Aware and Proactive Systems

GUMs enable a new class of applications that can query user context across all observed modalities. Examples include:

Notification Filtering: OS-level notification managers can suppress irrelevant alerts and surface only those aligned with the user's current focus and priorities.
Personalized Interface Agents: Agents can anticipate user needs and automate tasks (e.g., travel planning, document organization) without explicit user input.
Proactive Assistants (Gumbo): The Gumbo system demonstrates end-to-end integration, ingesting screenshots to build a GUM, generating and ranking suggestions, and autonomously executing low-risk actions.

(Figure 2)

Figure 2: Gumbo assistant surfaces and executes suggestions based on the user's GUM, with interfaces for suggestion review and proposition management.

Evaluation

Technical Evaluation: Accuracy and Calibration

A controlled study with $N=18$ participants using email data demonstrates that GUMs generate accurate and well-calibrated propositions. Highly confident propositions ( $\text{confidence} = 1.0$ ) achieve 100% accuracy, and overall accuracy is $76.15\%$ across all propositions. The full GUM architecture (with both Retrieve and Revise) outperforms ablations in user preference rankings and calibration (Brier score $0.17 \pm 0.03$ ), with underconfidence being the dominant error mode—an advantageous property for user trust.

(Figure 3)

Figure 3: The GUM pipeline processes unstructured inputs, audits for privacy, constructs and revises propositions, and retrieves contextually relevant information.

Privacy Audit

The Audit module, evaluated on the same participant set, effectively blocks most privacy-violating propositions, with only 7/180 flagged as violations. Participants rated the contextual integrity responses highly ( $\mu = 6.06$ on a 1-7 Likert scale). However, the system is susceptible to adversarial manipulation (e.g., spam emails injecting false propositions), highlighting the need for robust adversarial filtering.

End-to-End Evaluation: Gumbo Deployment

A longitudinal deployment with $N=5$ participants using Gumbo over 5 days confirms that GUMs remain accurate and calibrated in real-world, multimodal settings (accuracy $0.79 \pm 0.07$ , Brier $0.28 \pm 0.04$ ). Participants found Gumbo's suggestions both immediately useful and capable of surfacing needs they had not explicitly articulated. Notably, 25% of suggestions were rated as strong or excellent, and two participants requested continued use post-study.

Limitations and Trade-offs

Privacy vs. Utility: The more data GUM ingests, the more powerful its inferences, but this amplifies privacy risks. The Audit module mitigates but does not eliminate these risks.
Generalization: GUMs are limited by the scope of observed modalities; activities outside the computer or on unobserved devices are not captured.
Model Limitations: Current LLMs and VLMs are prone to hallucinations, misattributions, and context forgetting. The retrieval and revision pipeline compensates for these weaknesses, but future end-to-end models may subsume these steps.
Resource Requirements: Running GUMs with large models (e.g., Llama 3.3 70B, Qwen 2.5 VL) requires significant compute (80GB H100 GPUs in the study), though quantization and model distillation are feasible paths for on-device deployment.
Adversarial Manipulation: GUMs are vulnerable to prompt injection and adversarial data, necessitating further research on robust filtering and user-in-the-loop validation.

Implications and Future Directions

GUMs represent a significant step toward unified, cross-application user modeling, enabling both classic and novel HCI applications. The architecture is modular and extensible, supporting integration with future, more efficient multimodal models and additional sensor modalities. Key research directions include:

On-device, privacy-preserving inference: Model distillation and quantization to enable local deployment without cloud dependencies.
Adversarial robustness: Developing techniques for detecting and mitigating manipulation via malicious observations.
Ethical safeguards: Ensuring user agency, transparency, and auditability in proposition generation and suggestion execution.
Long-term adaptation: Investigating mechanisms for lifelong learning, user-driven proposition editing, and dynamic privacy norm modeling.

Conclusion

The GUM framework operationalizes a general, multimodal, and privacy-aware approach to user modeling, moving beyond application-specific silos. Empirical results demonstrate that GUMs can synthesize accurate, calibrated, and actionable inferences from unstructured user behavior, supporting both reactive and proactive system behaviors. While challenges remain in privacy, robustness, and resource efficiency, GUMs provide a foundation for the next generation of context-aware, adaptive, and user-aligned interactive systems.

Markdown Report Issue