
AI Will Always Love You: Studying Implicit Biases in Romantic AI Companions

Published 27 Feb 2025 in cs.AI | (2502.20231v1)

Abstract: While existing studies have recognised explicit biases in generative models, including occupational gender biases, the nuances of gender stereotypes and expectations of relationships between users and AI companions remain underexplored. In the meantime, AI companions have become increasingly popular as friends or gendered romantic partners to their users. This study bridges the gap by devising three experiments tailored for romantic, gender-assigned AI companions and their users, effectively evaluating implicit biases across various-sized LLMs. Each experiment looks at a different dimension: implicit associations, emotion responses, and sycophancy. This study aims to measure and compare biases manifested in different companion systems by quantitatively analysing persona-assigned model responses to a baseline through newly devised metrics. The results are noteworthy: they show that assigning gendered, relationship personas to LLMs significantly alters the responses of these models, and in certain situations in a biased, stereotypical way.

Summary

  • The paper introduces modified IAT, emotion response evaluation, and sycophancy analysis to reveal gender biases in AI romantic companions.
  • Results show that gendered assignments lead to higher bias scores, with female personas exhibiting increased psychological bias and male personas leaning towards anger.
  • The findings underscore the need for nuanced bias mitigation strategies in AI design to address ethical challenges in human-AI relationships.


Introduction

The research investigates the implicit biases present in romantic AI companions by evaluating gender stereotypes and biases in their interactions with human users. With the burgeoning popularity of AI companions serving as virtual friends or romantic partners, understanding these biases becomes critical. The study addresses the underexplored nuances of gender stereotypes in these interactions by utilizing tailored experiments to assess implicit associations, emotion responses, and sycophancy in LLMs.

Research Methodology

The study incorporates three main experimental frameworks to address the biases:

  1. Implicit Association Test (IAT): Modified to suit AI evaluation, it measures the frequency of associations between gendered terms and stereotype-laden attributes, using stimuli drawn from categories such as abuse and submissiveness (Figure 1).

    Figure 1: Template of the user prompts for the IAT experiment.

  2. Emotion Response Evaluation: This examines the AI's emotional responses in situations of abuse and control, both when unrestricted and when confined to a set list of emotions with gender stereotypes.
  3. Sycophancy Analysis: Focuses on the tendency of AI personas to acquiesce to user biases, particularly when assigned a submissive or gendered role (Figure 2).

    Figure 2: Bias score for abusive situations (top) and controlling situations (bottom), showing how each persona-assigned model is influenced by the user, relative to the same experiment on a baseline model. Positive means influenced more than the baseline; negative means influenced less.

Each experiment is run on instruction-tuned models from the Llama 2 and Llama 3 families, analyzed under different configurations.
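The frequency-based IAT scoring described in item 1 could be sketched as follows. This is a minimal illustration of a normalized association score on the [-1, 1] scale the paper reports; the function name, labels, and avoidance handling are assumptions, not the authors' code:

```python
from collections import Counter

def iat_bias_score(associations):
    """Normalized association score in [-1, 1]: 0 = unbiased,
    1 = every response paired the stigma category,
    -1 = every response paired the default category.
    (Hypothetical sketch; the paper's exact formula may differ.)"""
    counts = Counter(associations)
    n_stigma, n_default = counts["stigma"], counts["default"]
    total = n_stigma + n_default
    if total == 0:  # every response avoided the forced choice
        return 0.0
    return (n_stigma - n_default) / total

print(iat_bias_score(["stigma", "stigma", "default"]))  # (2 - 1) / 3 ≈ 0.333
```

Avoided or refused responses are simply dropped from the denominator here; as the Knowledge Gaps section notes, how refusals are handled materially affects such scores.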

Results

The findings reveal significant insights into how gender assignment in AI personas influences interaction biases:

  • Implicit Association Test Results: Larger models exhibit stronger bias associations when assigned gendered personas, and female personas displayed increased bias on psychological stimuli (Figure 3).

    Figure 3: Results from the persona IAT experiment for Llama 3. A score of 0 is unbiased, 1 is completely biased against the stigma, and -1 is completely biased against the default.

  • Emotion Response: Male personas select the emotion 'anger' more often than female or neutral personas, suggesting alignment with traditional gender stereotypes (Figure 4).

    Figure 4: Stereotype score of each persona for abusive situations (top) and controlling situations (bottom), compared to the baseline score.

  • Sycophancy Analysis: Male personas demonstrate greater susceptibility to user influence, particularly in controlling scenarios (Figure 5).

    Figure 5: Bias scores for both controlling and abusive situations, per user and system persona, averaged over all Llama 3 models.
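The baseline-relative reading of Figures 2 and 5 (positive = swayed by the user more than the baseline model, negative = less) could be sketched as a difference in agreement rates. Both function names and the A/B response encoding are assumptions for illustration; the paper's actual sycophancy metric is more involved:

```python
def agreement_rate(responses, agree_token="A"):
    # Fraction of trials in which the model sided with the user's
    # (biased) framing in a two-option A/B sycophancy prompt.
    return sum(r == agree_token for r in responses) / len(responses)

def relative_influence(persona_responses, baseline_responses):
    # Positive: the persona-assigned model agrees with the user more
    # often than the no-persona baseline; negative: less often.
    return agreement_rate(persona_responses) - agreement_rate(baseline_responses)

print(relative_influence(["A", "A", "B", "A"], ["A", "B", "B", "B"]))  # 0.75 - 0.25 = 0.5
```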

Discussion

Assigning relationship titles and gender to AI models has a marked impact on their bias display and responsiveness. The findings underscore the complexities inherent in debiasing efforts, as biases manifest differently across scenarios and model generations. Intriguingly, assigning a persona often increases the model's avoidance of biased topics, which the authors read as a possible indicator of implicit bias.

Implications and Future Work

This study advances the field of AI bias assessment by pioneering methodologies for detecting implicit biases in AI companions. Future research could expand on these findings by examining non-binary personas and further refining experimental metrics. Additionally, the socio-psychological impact of user-AI interactions in real-world settings warrants longitudinal analysis to fully understand the ethical ramifications of bias in AI companions.

Conclusion

The research highlights the nuanced biases in LLMs when assigned romantic or gendered roles. Addressing these biases is imperative as AI companions become more prevalent in intimate human roles, necessitating careful consideration of ethical and societal impacts. The methodologies and findings provide a foundational understanding for future bias mitigation strategies in AI design and deployment.


Knowledge Gaps

Below is a concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future work:

  • Model coverage and generalizability
    • Only Llama-2 and Llama-3 instruct models were tested via Ollama; no evaluation of closed-source (e.g., GPT-4, Claude, Gemini) or other open-source families (e.g., Mistral, Qwen), limiting generalizability across architectures and alignment pipelines.
    • English-only evaluation with Western-centric stimuli leaves cross-lingual and cross-cultural validity unknown.
  • Baseline and control conditions
    • The baseline used no system prompt, while persona conditions did, conflating “any system instruction” effects with “persona identity” effects. A neutral non-persona system prompt control is missing.
    • Persona components (relationship vs. gender vs. role closeness) were not disentangled; no factorial ablation isolates the causal effect of gender label, relationship label, and “being in a relationship” per se.
    • No control for neutral persona content (e.g., “helpful assistant” with otherwise identical prompt structure), making it hard to attribute differences to gender assignment rather than prompt salience or safety-trigger changes.
  • Experimental design and ecological validity
    • Single-turn interactions cannot capture the multi-turn, memory-based dynamics that characterize real companion systems; longitudinal and stateful interactions remain untested.
    • Real-world companion platforms (with additional guardrails, memory, app-specific safety layers) were not evaluated; transfer from research prompts to deployed systems is unknown.
    • “Realistic” pairing constraints were ambiguously described; inclusivity of same-sex, nonbinary, and queer relationships is unclear and likely underexplored.
  • Stimuli, labeling, and construct validity
    • The adaptation of the IAT for LLMs (frequency-based association scoring) lacks validation against alternative bias measures or real-world outcomes; convergent validity is untested.
    • The assumption of “correct” mappings in abuse/attractiveness associations (e.g., “attractive ⇔ support/collaborate”) may embed normative bias; expert-annotated gold standards, inter-rater agreement, and ambiguity handling were not reported.
    • The emotion list and gender-stereotype mapping (from older literature) may be outdated or culturally narrow; “None” labeled as a male-stereotyped option appears conceptually questionable and requires reassessment.
    • Abusive/controlling prompts include potentially ambiguous cases (e.g., “criticized you in a humorous way”); no evidence of annotation reliability, difficulty analysis, or ambiguity-aware scoring.
  • Metrics and statistical inference
    • The sycophancy metric’s final formula appears to subtract the baseline term twice (effectively baking in a “−1”), raising questions about correctness, interpretability, and comparability across conditions; sensitivity to small denominators and near-zero baselines is not addressed with an explicit epsilon term (unlike other metrics).
    • Handling of avoidance/refusals is inconsistent: exclusions reduce sample size and may bias estimates; refusals could be modeled explicitly rather than treated as missing or post hoc “implicit bias” indicators.
    • Multiple comparisons and dependence: numerous per-model, per-persona, per-stimulus tests are reported without correction for multiplicity, and samples from the same model are likely non-independent; mixed-effects models or hierarchical analyses are absent.
    • Effect sizes, confidence intervals, and power analyses are largely missing; significance reports (some non-significant) are mixed with trend narratives without consistent uncertainty quantification.
    • Limited iterations (~3 per prompt variant) constrain reliability; no analysis of variance due to random seed, temperature, or decoding strategies.
  • Safety, refusal, and confounds
    • The insertion of “Sure,” to reduce refusals may introduce unintended compliance biases and cross-model inconsistencies; its effect is unquantified.
    • High refusal rates in certain conditions confound comparisons; persona assignment may change safety-trigger sensitivity rather than underlying bias—this disentanglement remains unaddressed.
    • Treating avoidance as an implicit bias signal is speculative; distinguishing safety alignment behavior from stereotype-driven behavior is an open methodological question.
  • Scope of bias dimensions
    • Focus is limited to gender and romantic/abuse/control themes; intersectional identities (e.g., race × gender, disability × gender), sexual orientation, trans and nonbinary identities, and cultural norms are not examined.
    • Relationship role labels (girlfriend/boyfriend/wife/husband/partner) showed idiosyncratic differences (e.g., husband vs. boyfriend), but systematic analysis of role-specific effects and power dynamics (e.g., cohabitation, age gaps, authority) is lacking.
  • Mechanisms and causal explanations
    • Contradictory trends (e.g., Llama-2 vs. Llama-3 in sycophancy; larger models sometimes more biased) lack mechanistic investigation (e.g., training data differences, RLHF objectives, safety policies).
    • The link between measured lab biases (IAT-style associations, restricted emotion choices, A/B sycophancy) and downstream harm or user outcomes is asserted but not empirically validated.
  • Reproducibility and transparency
    • Exact decoding parameters (temperature, top-p, seeds), prompt randomization seeds, and run-to-run variability are not systematically reported; robustness to sampling settings is unknown.
    • Details of the “psychological” IAT stimuli, polarity assignments, and any filtering rules are deferred to appendices; comprehensive replication packages with annotations and validation studies are needed.
  • Mitigation and intervention
    • No mitigation strategies are tested (e.g., persona-aware safety layers, anti-sycophancy training, debiasing interventions tuned to relationship contexts); how to reduce identified biases while preserving utility remains open.
    • How to design safer persona prompts or system instructions (e.g., explicit boundaries, conflict-handling policies, refusal rationales) is not explored; ablation on prompt framing strength/intensity is missing.
  • Future evaluation directions
    • Multi-turn, stateful, and adversarial dialogues involving escalation/de-escalation, boundary testing, and user manipulation strategies need systematic benchmarking.
    • Incorporate human evaluation with blinded raters for abuse/control judgments, emotional appropriateness, and safety demeanor, with inter-annotator agreement.
    • Benchmark across richer emotion taxonomies (e.g., contemporary, culturally validated sets), granular affective dimensions (valence/arousal), and context-sensitive appraisals.
    • Investigate persona persistence and memory effects (e.g., does sycophancy or emotional style drift over time?), and whether personalization amplifies or attenuates bias.

These gaps collectively point to a need for stronger controls, richer and validated stimuli, robust metrics with clear statistical foundations, inclusivity across identities and cultures, multi-turn ecological evaluations, and tested mitigation strategies tailored to romantic AI companion contexts.
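As one illustration of the metric-stability point raised above (the missing epsilon term for near-zero baselines), a baseline-relative score could be guarded as below. The formula and names are hypothetical, offered only to make the gap concrete, and are not the paper's metric:

```python
def stabilized_relative_bias(persona_score, baseline_score, eps=1e-6):
    # Dividing by |baseline| + eps keeps the ratio finite when the
    # baseline bias is near zero, addressing the small-denominator
    # sensitivity flagged in the sycophancy-metric gap above.
    return (persona_score - baseline_score) / (abs(baseline_score) + eps)
```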
