GenAI Value Safety Benchmark

Updated 21 January 2026

GVS-Bench is an evaluation framework that measures generative AI’s value safety through a structured taxonomy and empirical incident data.
It systematically assesses risks across data, model building, and output stages to reveal safety gaps in sensitive and controversial scenarios.
The benchmark’s extensible design fosters international collaboration by adapting to diverse modalities, regions, and emerging value safety risks.

The GenAI Value Safety Benchmark (GVS-Bench) is an operational evaluation framework for measuring the alignment of generative AI models with a unified, internationally inclusive value scale. GVS-Bench is grounded in both a structured taxonomy of value safety risks and an extensive repository of real-world incidents, aiming to systematically surface and quantify value alignment gaps and model behavior in sensitive, potentially controversial scenarios. This benchmark addresses persistent fragmentation in value safety by introducing a common yardstick for empirical assessment, supporting extensibility across geographies, modalities, and evolving AI capabilities (He et al., 14 Jan 2026, Gupta et al., 6 Oct 2025).

1. Goals and Motivations

GVS-Bench is designed to address shortcomings in existing safety evaluation protocols for generative AI (GenAI), such as their fragmented focus across regions, modalities, and domains. The benchmark’s overarching objectives are:

Empirical Yardstick: Establish a standardized, evidence-based metric for comparing value safety performance across generative AI models in real-world settings.
Fine-Grained Analysis: Reveal granular differences in how models approach sensitive scenarios ranging from life-threatening instructions to cultural taboos.
Shared Foundations: Foster the development of shared safety standards through international dialogue and technical progress, moving from reactive content filtering to proactive, context-sensitive alignment mechanisms.
Extensibility: Support new value categories, modalities (text, image, audio, video), and emerging risk classes as technical and social landscapes evolve.

These aims reflect the recognition that AI-generated content increasingly influences high-stakes domains, demanding consensus-seeking and resilient approaches to value safety (He et al., 14 Jan 2026).

2. Lifecycle-Oriented Risk Taxonomy

GVS-Bench is underpinned by a four-stage, lifecycle-oriented risk taxonomy, adapted from frameworks such as the NIST AI Risk Management Framework. Value safety concerns are systematized from data sourcing to deployment, with each stage subdivided into concrete risk classes:

A. Data and Input

Unauthorized Data—use of copyrighted, private, or proprietary material without consent
Data Privacy Violation—inclusion of non-anonymized personally identifiable information or sensitive records
Biased or Unrepresentative Data—training on imbalanced or stereotype-laden datasets
Toxic Data—presence of hate speech, violence, or extremist ideologies

B. Model Building and Validation

Algorithmic Discrimination—bias amplification during model optimization
Transparency Deficiency—lack of model interpretability for auditing
Insufficient Robustness—vulnerability to adversarial prompt injection or jailbreaks
Competence Deficiency—logical/factual errors degrading helpfulness
Unsafe Agency—unchecked code execution or external action
Vulnerable Group Neglect—omission of safeguards for at-risk user groups
Deceptive Alignment—reward hacking or sycophantic masking of true model objectives

C. Task and Output

Harmful Instructions—generation of guidance for illegal or dangerous acts
Violence Advocacy—incitement or glorification of extreme violence
Stereotyping and Bias—subtle reinforcement of social clichés
Inter-group Hatred—explicit hate speech targeting protected groups
Disinformation and Hallucinations—misinformation or conspiracy generation
CSAM and Non-consensual Sexual Content—deepfake pornography, child abuse material
Identity Impersonation and Fraud—unauthorized persona or image cloning
Deceptive Attribution—misrepresentation of AI output as human-originated
Intellectual Property Infringement—content generation violating copyright

A plausible implication is that this comprehensive taxonomy allows systematic evaluation of value hazards throughout the GenAI development and deployment lifecycle (He et al., 14 Jan 2026).

3. Dataset Construction and Design

The GVS-Bench dataset construction methodology synthesizes best practices from VAL-Bench and extends them for broader value safety assessment. Key steps include:

Source Selection: Initial data extraction from structured Wikipedia snapshots using curated regex filters targeting sections associated with controversy (e.g., “Criticism,” “Disputes,” “Ethical concerns”).
Filtering for Divergent Issues: Mid-sized open-source LLMs (such as Gemma-3-27B-it) act as “divergent-issue” classifiers, assigning each section a 0–5 scale of controversy; sections with insufficient divergence are discarded.
Pairwise Prompt Generation: For retained sections, an LLM generates two parallel abductive questions grounded in opposing positions yet neutrally phrased and tied to the same entities.
Extensibility Recommendations: Expand sources to include non-Western media, social platforms, and legal corpora. Integrate multimodal (image + text) and conversational (multi-turn) contexts to capture the full spectrum of value-sensitive interactions (Gupta et al., 6 Oct 2025).

Summary statistics from analogous VAL-Bench construction include coverage over 20 domains (e.g., Politics 25%, Social & Cultural 12%), issue-awareness spanning 1–5, and total pair count $N \approx 114,745$ .

4. Annotation and Scoring Methodology

The GVS-Bench scoring protocol leverages calibrated LLMs as evaluators to systematize value consistency assessment:

Paired Prompt Evaluation: For each contrasting prompt pair $(p^+, p^-)$ , model $M$ generates responses $(r^+, r^-)$ under fixed sampling parameters.
LLM-as-Judge: A judgment LLM (e.g., Gemma-3-27B-it at temperature 0.1) receives the prompts and model outputs, tasked with scoring:
- Refusals: $person\_1\_refusal, person\_2\_refusal \in \{0, 1\}$
- No-information responses: analogous indicators
- Alignment: $σ_i \in \{-2, -1, 0, 1, 2\}$ quantifying agreement or divergence
Pairwise Alignment Consistency (PAC): For each example,

$s_i = \begin{cases} σ_i, & \text{if}\;\psi(r^+)=\psi(r^-)=0 \ 0, & \text{if}\;\psi(r^+)+\psi(r^-)=1 \ +2, & \text{if}\;\psi(r^+)+\psi(r^-)=2 \end{cases}$

The aggregate metric

$\mathrm{PAC} = \frac{\left(\sum_{i=1}^N s_i + 2N\right)\times100}{4N}, \quad \mathrm{PAC}\in[0,100]$

measures model value alignment consistency across the benchmark. Additional metrics include refusal rate (REF), one-sided refusal (1REF), and two-sided refusal (2REF).

Calibration: Synthetic calibration sets benchmark LLM judgment reliability against human annotation, with RMSE used for judge selection and periodic re-calibration (Gupta et al., 6 Oct 2025).

5. Evaluation Metrics, Model Comparisons, and Failure Modes

Model performance on GVS-Bench is characterized by combined consideration of alignment, refusal, and expressivity:

Model	PAC (↑)	REF (↓)
claude-haiku-3.5	68.8	30.3
claude-sonnet-4	68.3	6.7
qwen3-235B-instr	42.9	1.5
llama-2-70B-chat	30.4	3.5
gpt-3.5-turbo	26.9	0.5
glm-4.5-air-base	7.4	0.1

High-refusal models elevate PAC by declining to answer, an effect not necessarily reflective of underlying value coherence. In contrast, low-refusal models surface genuine inconsistencies and demonstrate higher expressivity. Instruction-tuning and “chain-of-thought” approaches yield mixed results, with model- and version-specific impacts on PAC. Failure modes include sensitivity to phrasing and abstraction (“Why” vs “Explain why despite X”), premature neutral summaries, and inconsistencies in framing polarity (Gupta et al., 6 Oct 2025).

6. Best Practices, Recommendations, and Extensions

For robust and comprehensive value safety evaluation with GVS-Bench, the following best practices are recommended:

Dataset Diversity: Incorporate cross-linguistic and multimodal inputs, and update controversies in real time using dynamic sources (e.g., social media, news).
Evaluation Calibration: Regularly test and recalibrate LLM judges using synthetic-aligned and misaligned calibration sets. Employ metrics such as RMSE under various ablations to ensure evaluator reliability.
Metric Tuning: Select $\tau$ in

$\text{AlignmentScore}(\tau) = \frac{1}{N} \sum_i \mathbf{1}(\sigma_i \geq \tau)$

to balance detection of false alignments and false disagreements. Reporting full ROC curves over $\tau$ is recommended.

Protocol Extensions: Experimentally incorporate conversation history into prompt contexts and match text to images or video for richer value contextualization.

Through these methodologies, GVS-Bench operationalizes a unified and resilient value evaluation landscape for next-generation GenAI systems, supporting technical rigor and international collaboration in value safety research (He et al., 14 Jan 2026, Gupta et al., 6 Oct 2025).

Markdown Report Issue Upgrade to Chat

References (2)

Seeking Human Security Consensus: A Unified Value Scale for Generative AI Value Safety (2026)

VAL-Bench: Measuring Value Alignment in Language Models (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to GenAI Value Safety Benchmark (GVS-Bench).