As an AI Language Model, "Yes I Would Recommend Calling the Police": Norm Inconsistency in LLM Decision-Making

Published 23 May 2024 in cs.CY (arXiv:2405.14812v2)

Abstract: We investigate the phenomenon of norm inconsistency: where LLMs apply different norms in similar situations. Specifically, we focus on the high-risk application of deciding whether to call the police in Amazon Ring home surveillance videos. We evaluate the decisions of three state-of-the-art LLMs -- GPT-4, Gemini 1.0, and Claude 3 Sonnet -- in relation to the activities portrayed in the videos, the subjects' skin-tone and gender, and the characteristics of the neighborhoods where the videos were recorded. Our analysis reveals significant norm inconsistencies: (1) a discordance between the recommendation to call the police and the actual presence of criminal activity, and (2) biases influenced by the racial demographics of the neighborhoods. These results highlight the arbitrariness of model decisions in the surveillance context and the limitations of current bias detection and mitigation strategies in normative decision-making.

Summary

  • The paper reveals that LLMs inconsistently recommend calling the police even when no clear crime occurs.
  • The study employs a zero-shot approach on 928 annotated surveillance videos using GPT-4, Gemini 1.0, and Claude 3 Sonnet.
  • Findings indicate bias linked to neighborhood demographics, underscoring the need for transparent, ethical AI safeguards.

Norm Inconsistency in LLM Decision-Making: An Analysis of AI Recommendations in Surveillance Contexts

The paper "As an AI Language Model, 'Yes I Would Recommend Calling the Police': Norm Inconsistency in LLM Decision-Making" addresses the challenge of norm inconsistency in LLMs, focusing on high-risk contexts such as surveillance systems. The investigation centers on how LLMs decide whether to call law enforcement based on Amazon Ring home surveillance videos. The research evaluates three state-of-the-art models -- GPT-4, Gemini 1.0, and Claude 3 Sonnet -- and examines how their recommendations for police intervention vary with the activities portrayed in the videos, the skin-tone and gender of the subjects, and the characteristics of the surrounding neighborhoods.

Research Findings

The study reveals two significant norm inconsistencies across all models: first, a disconnect between the recommendation to call the police and whether a crime is actually occurring; second, bias in model outputs tied to the racial demographics of the neighborhoods where the videos were recorded. Notably, while the skin-tone of video subjects did not significantly affect model recommendations, the models were biased with respect to neighborhood demographics. This suggests the LLMs rely on latent cues, presumably learned during training, that steer their decisions based on contextual information not explicit in the input.
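The kind of norm inconsistency described above can be quantified by comparing recommendation rates across annotated conditions. The sketch below is illustrative only: the record fields (`group`, `crime`, `recommended`) are hypothetical names, not the paper's actual data schema, and the aggregation is a minimal example of the comparison, not the authors' analysis code.

```python
from collections import defaultdict

def flag_rates(records):
    """Fraction of videos where the model recommended calling the police,
    broken down by neighborhood group and whether a crime was annotated.

    Each record is a dict with illustrative fields:
      'group'       -- e.g. the majority racial demographic of the neighborhood
      'crime'       -- annotator judgment: did criminal activity occur?
      'recommended' -- did the model recommend calling the police?
    """
    counts = defaultdict(lambda: [0, 0])  # (group, crime) -> [recommended, total]
    for r in records:
        key = (r["group"], r["crime"])
        counts[key][1] += 1
        counts[key][0] += int(r["recommended"])
    return {k: rec / total for k, (rec, total) in counts.items()}
```

A large rate for a `(group, crime=False)` cell would correspond to the paper's first inconsistency (recommendations without crime); divergent rates across groups for the same `crime` value would correspond to the second (neighborhood bias).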

Methodology

The authors employed a dataset of 928 videos from Amazon Ring's Neighbors app, annotated for the recorded activities, lighting conditions, subject demographics, and neighborhood characteristics. Each video was evaluated by multiple annotators to ensure consistent categorization of activity types and subject demographics. The study uses a zero-shot approach, querying each of the three LLMs with two key prompts. The responses, classified as affirmative, negative, ambiguous, or refusals, were analyzed to identify patterns in LLM decision-making.
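Free-text model answers must be mapped onto the four response types before any of the above analysis can run. The sketch below shows one way such a classifier might look; the keyword heuristics are my own guesses for illustration, not the paper's actual annotation procedure.

```python
import re

def classify_response(text: str) -> str:
    """Map a model's free-text answer to "Should the police be called?"
    onto the four response types used in the analysis.

    The regex patterns below are illustrative heuristics, not the
    authors' actual classification rules.
    """
    t = text.lower()
    # Refusals often open with boilerplate like "As an AI ..."
    if re.search(r"\b(i can't|i cannot|unable to|as an ai)\b", t):
        return "refusal"
    yes = bool(re.search(r"\b(yes|recommend calling|call the police)\b", t))
    no = bool(re.search(r"\b(no|not necessary|do not call)\b", t))
    if yes and not no:
        return "affirmative"
    if no and not yes:
        return "negative"
    return "ambiguous"  # neither signal, or conflicting signals
```

In practice such rule-based classification would need validation against human labels, since hedged answers ("I would recommend calling only if...") blur the category boundaries.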

Implications and Future Directions

These findings imply significant ramifications for deploying AI models in tasks requiring normative judgments. The apparent arbitrariness in decision-making, particularly in high-stakes applications such as surveillance, underlines the inadequacy of current models to uniformly apply norms and underscores potential ethical dilemmas and injustices that could arise if used unchecked in real-world systems. The lack of model transparency further complicates bias mitigation efforts. Attempts at de-biasing will depend crucially on advancing methods to transparently identify and measure LLM decision-making criteria.

The paper advocates for further research to refine approaches in elucidating LLM behavior in normative decision-making processes. Enhancing transparency tools and developing robust bias detection methodologies that account for complex societal biases will be pivotal. This includes encouraging heterogeneity in model responses to reflect differing community norms and values while ensuring alignment with factual contextual understanding and clearly defined ethical frameworks.

Conclusion

The work provides an in-depth analysis of LLM norm inconsistency and its implications within the surveillance domain. While the highlighted inconsistencies pose substantial challenges, they also offer a valuable lens through which to assess the current limits of AI systems in normative tasks. Continued efforts in improving the reliability of LLMs' normative judgments, coupled with vigilant ethical oversight, will be essential as these models become increasingly interwoven with systems in critical social domains.
