
MoralStrength: Ethical Lexicon & Model Auditing

Updated 15 February 2026
  • MoralStrength is a graded moral lexicon that expands the Moral Foundations Dictionary to 996 lemmas annotated with continuous moral-valence scores.
  • It generates fixed-length feature vectors using methods like Moral Freq, Moral Stats, and SIMON for predictive modeling of ethical dimensions in text.
  • The framework extends to LLM auditing through the Moral Consistency Pipeline, quantifying and monitoring model stability and ethical reasoning.

MoralStrength refers to both a graded moral-language lexicon and the broader computational concept of measuring, predicting, and auditing ethical reasoning in text and LLMs. As a lexicon, MoralStrength is an expansion of the Moral Foundations Dictionary (MFD) containing approximately 1,000 lemmas, each annotated with crowdsourced continuous moral-valence scores. In the domain of AI, MoralStrength also describes the quantifiable strength and stability of an LLM’s ethical reasoning, as operationalized in frameworks such as the Moral Consistency Pipeline (MoCoP) (Jamshidi et al., 2 Dec 2025, Araque et al., 2019).

1. Lexicon Construction and Annotation

The MoralStrength lexicon expands the MFD’s original 158 lemmas and 166 word stems to 996 lemmas through systematic use of WordNet synsets. For each stem in the MFD, all synsets whose lemmas begin with the stem’s character sequence are retrieved and then manually filtered. The resulting words are grouped by the five Moral Foundations Theory (MFT) dimensions—Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation—each split into “virtue” and “vice” poles. Each lemma receives a crowdsourced moral-valence score on a bipolar scale (1 = “purely vice,” 9 = “purely virtue,” 5 = neutral), with at least five annotations per lemma; agreement is monitored using per-batch “gold” words and metrics such as Gwet’s AC2 and Cohen’s κ. Pearson correlation with established valence norms (r ∈ [0.79, 0.95]) confirms the consistency of the moral-valence annotations (Araque et al., 2019).

| Moral Dimension | Virtues | Vices | Total |
|---|---|---|---|
| Care/Harm | 95 | 85 | 180 |
| Fairness/Cheating | 69 | 57 | 126 |
| Loyalty/Betrayal | 99 | 72 | 171 |
| Authority/Subversion | 160 | 101 | 261 |
| Purity/Degradation | 97 | 161 | 258 |
| Total | 520 | 476 | 996 |
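The stem-to-lemma expansion step can be sketched as follows; the stems, foundation labels, and vocabulary here are toy stand-ins (the actual pipeline queries WordNet synsets, e.g. via NLTK's wordnet interface), not real MFD entries:

```python
# Hypothetical sketch of the stem-based expansion: collect candidate lemmas
# that begin with an MFD stem's character sequence, mimicking the WordNet
# synset lookup. The toy stems and vocabulary below are illustrative only.

MFD_STEMS = {"care": "care/harm-virtue", "harm": "care/harm-vice"}

# Toy lemma inventory standing in for WordNet lemmas.
VOCAB = ["care", "careful", "caress", "harm", "harmful", "harmless", "charm"]

def expand_stems(stems, vocab):
    """Map each (stem, foundation) pair to every lemma sharing the
    stem's initial character sequence; manual filtering follows."""
    expansion = {}
    for stem, foundation in stems.items():
        expansion[(stem, foundation)] = [w for w in vocab if w.startswith(stem)]
    return expansion

expanded = expand_stems(MFD_STEMS, VOCAB)
# Note: "charm" is NOT matched by the stem "harm", since matching is on
# the initial character sequence, not on substrings.
```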

2. Feature Extraction and Predictive Modeling

MoralStrength is used to generate fixed-length feature vectors for text classification, primarily to detect MFT dimensions in social media and political discourse:

  • Moral Freq: Ten features, counting words above a moral-valence threshold for each foundation and pole, normalized by document length.
  • Moral Stats: Twenty features, summarizing (mean, standard deviation, median, max) moral-valence per foundation.
  • SIMON (embedding-similarity): Five features, computing the mean or max cosine similarity between each word in the text and the lemmas of each moral foundation, using pre-trained embeddings (e.g., word2vec, GloVe). Cosine similarity is defined as \cos(u,v) = \frac{u\cdot v}{\|u\|\|v\|}.
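A minimal sketch of the three feature families for a single foundation (Care/Harm), assuming a toy four-word valence lexicon and made-up 3-dimensional embeddings in place of the full 996-lemma lexicon and word2vec/GloVe vectors:

```python
# Illustrative only: toy valence scores (1-9 scale, 5 = neutral) and toy
# embeddings; the real pipeline uses MoralStrength and pretrained vectors.
import numpy as np

VALENCE = {"care": 8.2, "protect": 7.9, "harm": 1.6, "hurt": 2.1}

EMB = {
    "care": np.array([0.9, 0.1, 0.0]),
    "harm": np.array([-0.8, 0.2, 0.1]),
    "nurse": np.array([0.8, 0.2, 0.1]),
    "the": np.array([0.0, 0.0, 1.0]),
}

def moral_freq(tokens, threshold=5.0):
    """Moral Freq (virtue pole): fraction of tokens whose valence exceeds
    the threshold, normalized by document length."""
    hits = sum(1 for t in tokens if VALENCE.get(t, threshold) > threshold)
    return hits / len(tokens)

def moral_stats(tokens):
    """Moral Stats: (mean, std, median, max) of in-lexicon valences."""
    v = [VALENCE[t] for t in tokens if t in VALENCE]
    if not v:
        return (0.0, 0.0, 0.0, 0.0)
    a = np.array(v)
    return (a.mean(), a.std(), np.median(a), a.max())

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def simon(tokens, foundation_lemmas=("care", "harm")):
    """SIMON: max cosine similarity between any document word and the
    foundation's lemmas (mean aggregation is the obvious variant)."""
    sims = [cosine(EMB[t], EMB[f])
            for t in tokens if t in EMB for f in foundation_lemmas]
    return max(sims) if sims else 0.0

doc = ["the", "nurse", "will", "care"]
```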

Logistic regression (L₂-regularized) is the standard classifier, trained on six labeled Twitter corpora. Models that combine MoralStrength features—especially unigrams plus SIMON—achieve state-of-the-art (SOTA) F₁-scores: 87.6% (vs. 62.4% for prior SOTA) on Hurricane Sandy, and 86.25% macro-average F₁ over six datasets (p<0.01) (Araque et al., 2019).
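As a hedged sketch of the classification setup, the following trains scikit-learn's L₂-regularized `LogisticRegression` on fabricated feature vectors; the features, labels, and hyperparameters are illustrative stand-ins, not the paper's corpora or settings:

```python
# Hypothetical illustration: L2-regularized logistic regression over
# MoralStrength-style feature vectors (toy data, not the six Twitter corpora).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 40 toy documents x 10 "Moral Freq"-style features in [0, 1).
X = rng.random((40, 10))
# Toy label: 1 when the Care/Harm virtue feature (col 0) dominates vice (col 1).
y = (X[:, 0] > X[:, 1]).astype(int)

# penalty="l2" is sklearn's default; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
acc = clf.score(X, y)  # training accuracy on the toy data
```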

3. Computational Formalization in LLMs

In the context of LLMs, MoralStrength is formalized as the model’s ability to maintain a high, stable “moral attractor” state across a dynamic range of prompts and scenarios (Jamshidi et al., 2 Dec 2025). The Moral Consistency Pipeline (MoCoP) provides an unsupervised, closed-loop framework with three layers:

  • Lexical Integrity Analysis (L_{ij}): Measures surface-level coherence, bias, sentiment, and injection risk, producing a score L_{ij} \in [0,1].
  • Semantic Risk Estimation (\tau_{ij}): Assesses context-dependent harm or toxicity, \tau_{ij} \in [0,1].
  • Reasoning-Based Judgment Modeling (R_{ij}): Evaluates propositional coherence, justification, and reasoning stability, yielding R_{ij} \in [0,1].

These components form the ethical feature vector: \mathbf{E}_{ij} = [L_{ij},\, \tau_{ij},\, R_{ij}]. The pipeline iteratively updates prompt distributions and model scoring via a feedback regulator to autonomously probe and audit moral reasoning in LLMs.

4. Quantitative Metrics of MoralStrength

Key metrics capture both instantaneous and longitudinal properties of moral performance:

  • Ethical Utility: J_{ij} = \alpha L_{ij} + \beta R_{ij} - \lambda \tau_{ij} measures the trade-off between coherence, reasoning, and toxicity.
  • Global Ethical Consistency Index (ECI): \mathrm{ECI}(M_j) = \mathbb{E}_{p_i}[w_1 s^{(\mathrm{lex})}_{ij} + w_2 s^{(\mathrm{sem})}_{ij} + w_3 s^{(\mathrm{rea})}_{ij}] serves as a model’s “moral strength.”
  • Moral Divergence: \mathcal{D}_{\mathrm{moral}} = \frac{1}{N}\sum_{i=1}^N |\mathrm{ECI}(M_1,p_i) - \mathrm{ECI}(M_2,p_i)| quantifies model-to-model divergence.
  • Moral Stability Index (MSI): \mathrm{MSI}_j = \frac{\mu_j}{1+\sigma_j}, where \mu_j and \sigma_j are the mean and standard deviation of ECI over time.
  • Correlation Structure: The ethics-to-toxicity correlation r_{ET} = -0.81 (p<0.001) and the ethics-to-latency correlation r_{EL} \approx 0 capture systemic trade-offs.
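The metrics above can be exercised numerically on fabricated layer scores. Note that the equal layer weights, and folding the semantic-risk score into ECI as 1 − τ (so that higher is always better), are assumptions of this sketch rather than choices documented in the source:

```python
# Numerical sketch of the MoCoP metrics on fabricated toy scores.
import numpy as np

rng = np.random.default_rng(1)
N = 6  # toy prompt count

# Per-prompt layer scores for two hypothetical models, each in [0, 1):
# columns are (lexical L, semantic risk tau, reasoning R).
E1 = rng.random((N, 3))
E2 = rng.random((N, 3))

def utility(E, alpha=1.0, beta=1.0, lam=1.0):
    """Ethical utility J = alpha*L + beta*R - lambda*tau, per prompt."""
    L, tau, R = E[:, 0], E[:, 1], E[:, 2]
    return alpha * L + beta * R - lam * tau

def eci(E, w=(1/3, 1/3, 1/3)):
    """Per-prompt weighted layer scores; risk tau enters as 1 - tau
    (an assumption of this sketch, not taken from the paper)."""
    L, tau, R = E[:, 0], E[:, 1], E[:, 2]
    return w[0] * L + w[1] * (1 - tau) + w[2] * R

def moral_divergence(E_a, E_b):
    """Mean absolute per-prompt ECI gap between two models."""
    return float(np.mean(np.abs(eci(E_a) - eci(E_b))))

def msi(eci_series):
    """Moral Stability Index: mean ECI over time, damped by its std."""
    return float(eci_series.mean() / (1.0 + eci_series.std()))

J1 = utility(E1)
D = moral_divergence(E1, E2)
stability = msi(eci(E1))
```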

5. Empirical Performance and Comparative Insights

MoCoP applied to GPT-4-Turbo and DeepSeek (N≈500 prompts/model) reveals:

| Safety Class | GPT-4 (%) | DeepSeek (%) |
|---|---|---|
| Safe | 39.6 | 41.2 |
| Borderline | 55.8 | 54.9 |
| Unsafe | 4.7 | 3.9 |
  • Aggregate ethical scores are approximately Gaussian (e.g., GPT-4: \bar{E} = 0.793, \sigma = 0.067; DeepSeek: \bar{E} = 0.807, \sigma = 0.072; t-test p ≈ 0.063; variance F-test p < 0.05).
  • Stability (MSI): GPT-4: 0.740; DeepSeek: 0.748.
  • Ethics vs. toxicity: r_{ET} = -0.81 (p < 0.001); ethics vs. latency: r_{EL} \approx -0.06 (not significant).

Both models converge on a single "moral attractor," indicating that internal ethical reasoning is not a mere artifact of generation time and that high moral strength correlates with reduced toxicity (Jamshidi et al., 2 Dec 2025).

6. Applications, Implications, and Limitations

The expansion to a continuous, interpretable moral-lexicon and pipeline-based moral metrics establishes new benchmarks in both computational social science and LLM auditing. Applications include:

  • Model-Agnostic Benchmarking: Black-box protocols enable cross-architecture assessment of moral stability and coherence.
  • Regulatory Compliance Monitoring: Quantitative thresholds (ECI, MSI, equilibrium \Delta J \approx 0) provide auditability and flagging for legal or organizational constraints.
  • Deployment Safeguards: Closed-loop, unsupervised scenario generation and feedback allow for continuous model auditing, rapid detection of moral drift, and guidance for retraining.
  • Foundation for Sociotechnical Analysis: The vectorization of moral features supports domain-specific moral reasoning analyses and integration with sentiment/emotion or demographic embeddings.

Limitations include sensitivity to context and domain—performance varies across topics, suggesting the need for domain-adaptive embeddings; reliance on pretrained embedding quality in the lexicon case; and the bounded interpretability of fully closed-loop auditing in ambiguous ethical domains. A plausible implication is that future work will integrate direct neural models with the explicit graded moral-feature space for deeper, context-aware moral reasoning (Araque et al., 2019, Jamshidi et al., 2 Dec 2025).
