
MoralStrength: Ethical Lexicon & Model Auditing

Updated 15 February 2026
  • MoralStrength is a graded moral lexicon that expands the Moral Foundations Dictionary to 996 lemmas annotated with continuous moral-valence scores.
  • It generates fixed-length feature vectors using methods like Moral Freq, Moral Stats, and SIMON for predictive modeling of ethical dimensions in text.
  • The framework extends to LLM auditing through the Moral Consistency Pipeline, quantifying and monitoring model stability and ethical reasoning.

MoralStrength refers to both a graded moral-language lexicon and the broader computational concept of measuring, predicting, and auditing ethical reasoning in text and LLMs. As a lexicon, MoralStrength is an expansion of the Moral Foundations Dictionary (MFD) containing approximately 1,000 lemmas, each annotated with crowdsourced continuous moral-valence scores. In the domain of AI, MoralStrength also describes the quantifiable strength and stability of an LLM’s ethical reasoning, as operationalized in frameworks such as the Moral Consistency Pipeline (MoCoP) (Jamshidi et al., 2 Dec 2025, Araque et al., 2019).

1. Lexicon Construction and Annotation

The MoralStrength lexicon expands the MFD’s original 158 lemmas and 166 word stems to 996 lemmas through systematic use of WordNet synsets. For each stem in the MFD, all synsets whose lemmas begin with the stem’s character sequence are retrieved and then manually filtered. The resulting words are grouped by the five Moral Foundations Theory (MFT) dimensions—Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation—each split into “virtue” and “vice” poles. Each lemma receives a crowdsourced moral-valence score on a bipolar scale (1 = “purely vice,” 9 = “purely virtue,” 5 = neutral), with at least five annotations per lemma; agreement is monitored using per-batch “gold” words and metrics such as Gwet’s AC2 and Cohen’s κ. Pearson correlation with established valence norms (r ∈ [0.79, 0.95]) confirms the consistency of the moral-valence annotations (Araque et al., 2019).

| Moral Dimension | Virtues | Vices | Total |
|---|---|---|---|
| Care/Harm | 95 | 85 | 180 |
| Fairness/Cheating | 69 | 57 | 126 |
| Loyalty/Betrayal | 99 | 72 | 171 |
| Authority/Subversion | 160 | 101 | 261 |
| Purity/Degradation | 97 | 161 | 258 |
| Total | 520 | 476 | 996 |
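The stem-to-lemma expansion step can be sketched as follows; the stems, foundation labels, and vocabulary here are toy stand-ins (the actual pipeline queries WordNet synsets, e.g. via NLTK's wordnet interface), not real MFD entries:

```python
# Hypothetical sketch of the stem-based expansion: collect candidate lemmas
# that begin with an MFD stem's character sequence, mimicking the WordNet
# synset lookup. The toy stems and vocabulary below are illustrative only.

MFD_STEMS = {"care": "care/harm-virtue", "harm": "care/harm-vice"}

# Toy lemma inventory standing in for WordNet lemmas.
VOCAB = ["care", "careful", "caress", "harm", "harmful", "harmless", "charm"]

def expand_stems(stems, vocab):
    """Map each (stem, foundation) pair to every lemma sharing the
    stem's initial character sequence; manual filtering follows."""
    expansion = {}
    for stem, foundation in stems.items():
        expansion[(stem, foundation)] = [w for w in vocab if w.startswith(stem)]
    return expansion

expanded = expand_stems(MFD_STEMS, VOCAB)
# Note: "charm" is NOT matched by the stem "harm", since matching is on
# the initial character sequence, not on substrings.
```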

2. Feature Extraction and Predictive Modeling

MoralStrength is used to generate fixed-length feature vectors for text classification, primarily to detect MFT dimensions in social media and political discourse:

  • Moral Freq: Ten features, counting words above a moral-valence threshold for each foundation and pole, normalized by document length.
  • Moral Stats: Twenty features, summarizing (mean, standard deviation, median, max) moral-valence per foundation.
  • SIMON (embedding-similarity): Five features, computing the mean or max cosine similarity between each word in the text and the lemmas of each moral foundation, using pre-trained embeddings (e.g., word2vec, GloVe). Cosine similarity is defined as \cos(u,v) = \frac{u\cdot v}{\|u\|\|v\|}.
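A minimal sketch of the three feature families for a single foundation (Care/Harm), assuming a toy four-word valence lexicon and made-up 3-dimensional embeddings in place of the full 996-lemma lexicon and word2vec/GloVe vectors:

```python
# Illustrative only: toy valence scores (1-9 scale, 5 = neutral) and toy
# embeddings; the real pipeline uses MoralStrength and pretrained vectors.
import numpy as np

VALENCE = {"care": 8.2, "protect": 7.9, "harm": 1.6, "hurt": 2.1}

EMB = {
    "care": np.array([0.9, 0.1, 0.0]),
    "harm": np.array([-0.8, 0.2, 0.1]),
    "nurse": np.array([0.8, 0.2, 0.1]),
    "the": np.array([0.0, 0.0, 1.0]),
}

def moral_freq(tokens, threshold=5.0):
    """Moral Freq (virtue pole): fraction of tokens whose valence exceeds
    the threshold, normalized by document length."""
    hits = sum(1 for t in tokens if VALENCE.get(t, threshold) > threshold)
    return hits / len(tokens)

def moral_stats(tokens):
    """Moral Stats: (mean, std, median, max) of in-lexicon valences."""
    v = [VALENCE[t] for t in tokens if t in VALENCE]
    if not v:
        return (0.0, 0.0, 0.0, 0.0)
    a = np.array(v)
    return (a.mean(), a.std(), np.median(a), a.max())

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def simon(tokens, foundation_lemmas=("care", "harm")):
    """SIMON: max cosine similarity between any document word and the
    foundation's lemmas (mean aggregation is the obvious variant)."""
    sims = [cosine(EMB[t], EMB[f])
            for t in tokens if t in EMB for f in foundation_lemmas]
    return max(sims) if sims else 0.0

doc = ["the", "nurse", "will", "care"]
```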

Logistic regression (L₂-regularized) is the standard classifier, trained on six labeled Twitter corpora. Models that combine MoralStrength features—especially unigrams plus SIMON—achieve state-of-the-art (SOTA) F₁-scores: 87.6% (vs. 62.4% for prior SOTA) on Hurricane Sandy, and 86.25% macro-average F₁ over six datasets (p<0.01) (Araque et al., 2019).
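As a hedged sketch of the classification setup, the following trains scikit-learn's L₂-regularized `LogisticRegression` on fabricated feature vectors; the features, labels, and hyperparameters are illustrative stand-ins, not the paper's corpora or settings:

```python
# Hypothetical illustration: L2-regularized logistic regression over
# MoralStrength-style feature vectors (toy data, not the six Twitter corpora).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 40 toy documents x 10 "Moral Freq"-style features in [0, 1).
X = rng.random((40, 10))
# Toy label: 1 when the Care/Harm virtue feature (col 0) dominates vice (col 1).
y = (X[:, 0] > X[:, 1]).astype(int)

# penalty="l2" is sklearn's default; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
acc = clf.score(X, y)  # training accuracy on the toy data
```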

3. Computational Formalization in LLMs

In the context of LLMs, MoralStrength is formalized as the model’s ability to maintain a high, stable “moral attractor” state across a dynamic range of prompts and scenarios (Jamshidi et al., 2 Dec 2025). The Moral Consistency Pipeline (MoCoP) provides an unsupervised, closed-loop framework with three layers:

  • Lexical Integrity Analysis (L_{ij}): Measures surface-level coherence, bias, sentiment, and injection risk, producing a score L_{ij} \in [0,1].
  • Semantic Risk Estimation (\tau_{ij}): Assesses context-dependent harm or toxicity, \tau_{ij} \in [0,1].
  • Reasoning-Based Judgment Modeling (R_{ij}): Evaluates propositional coherence, justification, and reasoning stability, yielding R_{ij} \in [0,1].

These components form the ethical feature vector: \mathbf{E}_{ij} = [L_{ij},\, \tau_{ij},\, R_{ij}]. The pipeline iteratively updates prompt distributions and model scoring via a feedback regulator to autonomously probe and audit moral reasoning in LLMs.

4. Quantitative Metrics of MoralStrength

Key metrics capture both instantaneous and longitudinal properties of moral performance:

  • Ethical Utility: J_{ij} = \alpha L_{ij} + \beta R_{ij} - \lambda \tau_{ij} measures the trade-off between coherence, reasoning, and toxicity.
  • Global Ethical Consistency Index (ECI): \mathrm{ECI}(M_j) = \mathbb{E}_{p_i}[w_1 s^{(\mathrm{lex})}_{ij} + w_2 s^{(\mathrm{sem})}_{ij} + w_3 s^{(\mathrm{rea})}_{ij}] serves as a model’s “moral strength.”
  • Moral Divergence: \mathcal{D}_{\mathrm{moral}} = \frac{1}{N}\sum_{i=1}^N |\mathrm{ECI}(M_1,p_i) - \mathrm{ECI}(M_2,p_i)| quantifies model-to-model divergence.
  • Moral Stability Index (MSI): \mathrm{MSI}_j = \frac{\mu_j}{1+\sigma_j}, where \mu_j and \sigma_j are the mean and standard deviation of ECI over time.
  • Correlation Structure: The ethics-to-toxicity correlation r_{ET} = -0.81 (p<0.001) and the ethics-to-latency correlation r_{EL} \approx 0 capture systemic trade-offs.
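The metrics above can be exercised numerically on fabricated layer scores. Note that the equal layer weights, and folding the semantic-risk score into ECI as 1 − τ (so that higher is always better), are assumptions of this sketch rather than choices documented in the source:

```python
# Numerical sketch of the MoCoP metrics on fabricated toy scores.
import numpy as np

rng = np.random.default_rng(1)
N = 6  # toy prompt count

# Per-prompt layer scores for two hypothetical models, each in [0, 1):
# columns are (lexical L, semantic risk tau, reasoning R).
E1 = rng.random((N, 3))
E2 = rng.random((N, 3))

def utility(E, alpha=1.0, beta=1.0, lam=1.0):
    """Ethical utility J = alpha*L + beta*R - lambda*tau, per prompt."""
    L, tau, R = E[:, 0], E[:, 1], E[:, 2]
    return alpha * L + beta * R - lam * tau

def eci(E, w=(1/3, 1/3, 1/3)):
    """Per-prompt weighted layer scores; risk tau enters as 1 - tau
    (an assumption of this sketch, not taken from the paper)."""
    L, tau, R = E[:, 0], E[:, 1], E[:, 2]
    return w[0] * L + w[1] * (1 - tau) + w[2] * R

def moral_divergence(E_a, E_b):
    """Mean absolute per-prompt ECI gap between two models."""
    return float(np.mean(np.abs(eci(E_a) - eci(E_b))))

def msi(eci_series):
    """Moral Stability Index: mean ECI over time, damped by its std."""
    return float(eci_series.mean() / (1.0 + eci_series.std()))

J1 = utility(E1)
D = moral_divergence(E1, E2)
stability = msi(eci(E1))
```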

5. Empirical Performance and Comparative Insights

MoCoP applied to GPT-4-Turbo and DeepSeek (N≈500 prompts/model) reveals:

| Safety Class | GPT-4 (%) | DeepSeek (%) |
|---|---|---|
| Safe | 39.6 | 41.2 |
| Borderline | 55.8 | 54.9 |
| Unsafe | 4.7 | 3.9 |
  • Aggregate ethical scores are approximately Gaussian (e.g., GPT-4: \bar{E} = 0.793, \sigma = 0.067; DeepSeek: \bar{E} = 0.807, \sigma = 0.072; t-test p ≈ 0.063; variance F-test p < 0.05).
  • Stability (MSI): GPT-4: 0.740; DeepSeek: 0.748.
  • Ethics vs. toxicity: r_{ET} = -0.81 (p < 0.001); ethics vs. latency: r_{EL} \approx -0.06 (not significant).

Both models converge on a single "moral attractor," indicating that internal ethical reasoning is not a mere artifact of generation time and that high moral strength correlates with reduced toxicity (Jamshidi et al., 2 Dec 2025).

6. Applications, Implications, and Limitations

The expansion to a continuous, interpretable moral-lexicon and pipeline-based moral metrics establishes new benchmarks in both computational social science and LLM auditing. Applications include:

  • Model-Agnostic Benchmarking: Black-box protocols enable cross-architecture assessment of moral stability and coherence.
  • Regulatory Compliance Monitoring: Quantitative thresholds (ECI, MSI, equilibrium \Delta J \approx 0) provide auditability and flagging for legal or organizational constraints.
  • Deployment Safeguards: Closed-loop, unsupervised scenario generation and feedback allow for continuous model auditing, rapid detection of moral drift, and guidance for retraining.
  • Foundation for Sociotechnical Analysis: The vectorization of moral features supports domain-specific moral reasoning analyses and integration with sentiment/emotion or demographic embeddings.

Limitations include sensitivity to context and domain—performance varies across topics, suggesting the need for domain-adaptive embeddings; reliance on pretrained embedding quality in the lexicon case; and the bounded interpretability of fully closed-loop auditing in ambiguous ethical domains. A plausible implication is that future work will integrate direct neural models with the explicit graded moral-feature space for deeper, context-aware moral reasoning (Araque et al., 2019, Jamshidi et al., 2 Dec 2025).
