MoralStrength: Ethical Lexicon & Model Auditing
- MoralStrength is a graded moral lexicon that expands the Moral Foundations Dictionary to 996 lemmas annotated with continuous moral-valence scores.
- It generates fixed-length feature vectors using methods like Moral Freq, Moral Stats, and SIMON for predictive modeling of ethical dimensions in text.
- The framework extends to LLM auditing through the Moral Consistency Pipeline, quantifying and monitoring model stability and ethical reasoning.
MoralStrength refers to both a graded moral-language lexicon and the broader computational concept of measuring, predicting, and auditing ethical reasoning in text and LLMs. As a lexicon, MoralStrength is an expansion of the Moral Foundations Dictionary (MFD) containing 996 lemmas, each annotated with a crowdsourced continuous moral-valence score. In the domain of AI, MoralStrength also describes the quantifiable strength and stability of an LLM’s ethical reasoning, as operationalized in frameworks such as the Moral Consistency Pipeline (MoCoP) (Araque et al., 2019; Jamshidi et al., 2 Dec 2025).
1. Lexicon Construction and Annotation
The MoralStrength lexicon expands the MFD’s original 158 lemmas and 166 word stems to 996 lemmas through systematic use of WordNet synsets. For each stem in the MFD, all synsets whose lemmas share the stem’s initial character sequence are included and then manually filtered. The resulting words are grouped by the five Moral Foundations Theory (MFT) dimensions—Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Purity/Degradation—each split into “virtue” and “vice” poles. Each lemma receives a crowdsourced moral-valence score on a bipolar scale (1 = “purely vice,” 9 = “purely virtue,” 5 = neutral), with at least five independent annotator ratings per lemma; agreement is monitored using batch “gold” words and metrics such as Gwet’s AC2 and Cohen’s κ. Pearson correlation with established valence norms (r ∈ [0.79, 0.95]) confirms the consistency of the moral-valence annotation (Araque et al., 2019).
| Moral Dimension | Virtues | Vices | Total |
|---|---|---|---|
| Care/Harm | 95 | 85 | 180 |
| Fairness/Cheating | 69 | 57 | 126 |
| Loyalty/Betrayal | 99 | 72 | 171 |
| Authority/Subversion | 160 | 101 | 261 |
| Purity/Degradation | 97 | 161 | 258 |
| Total | 520 | 476 | 996 |
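As a minimal sketch of how such a graded lexicon can be consumed, the toy dictionary below maps lemmas to a foundation and a 1–9 valence score; all entries and helper names are illustrative stand-ins, not the published resource:

```python
# Toy sketch of the MoralStrength lexicon layout: each lemma maps to an MFT
# foundation and a crowdsourced moral-valence score on the bipolar 1-9 scale
# (1 = purely vice, 9 = purely virtue, 5 = neutral). Entries are illustrative.
LEXICON = {
    "compassion": {"foundation": "care", "valence": 8.4},
    "cruelty":    {"foundation": "care", "valence": 1.6},
    "loyal":      {"foundation": "loyalty", "valence": 8.1},
    "betray":     {"foundation": "loyalty", "valence": 1.9},
}

def polarity(lemma, neutral=5.0):
    """Classify a lemma as 'virtue' or 'vice' relative to the neutral midpoint."""
    entry = LEXICON.get(lemma)
    if entry is None:
        return None
    return "virtue" if entry["valence"] > neutral else "vice"

print(polarity("compassion"), polarity("betray"))  # virtue vice
```

Because the scores are continuous rather than binary category memberships, downstream features can threshold or aggregate them rather than merely counting dictionary hits.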
2. Feature Extraction and Predictive Modeling
MoralStrength is used to generate fixed-length feature vectors for text classification, primarily to detect MFT dimensions in social media and political discourse:
- Moral Freq: Ten features, counting, for each foundation and pole, the words whose moral valence crosses a threshold, normalized by document length.
- Moral Stats: Twenty features, summarizing moral valence per foundation with four statistics (mean, standard deviation, median, maximum).
- SIMON (embedding-similarity): Five features, computing the mean or max cosine similarity between each word in the text and the lemmas of each moral foundation, using pre-trained embeddings (e.g., word2vec, GloVe). Cosine similarity between word vectors u and v is defined as cos(u, v) = (u · v) / (‖u‖ ‖v‖).
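A minimal sketch of the SIMON feature for one foundation, with 3-dimensional toy vectors standing in for word2vec/GloVe embeddings (all names and values below are illustrative assumptions):

```python
import math

# Toy stand-ins for pre-trained word embeddings (real vectors are 100-300 dims).
EMB = {
    "help":   [0.9, 0.1, 0.2],
    "rescue": [0.8, 0.2, 0.1],
    "donate": [0.7, 0.3, 0.2],
    "table":  [0.1, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def simon_feature(doc_tokens, foundation_lemmas):
    """One feature per foundation: mean over document words of the max cosine
    similarity from each word to the foundation's lemma set."""
    sims = []
    for t in doc_tokens:
        if t not in EMB:
            continue
        sims.append(max(cosine(EMB[t], EMB[l])
                        for l in foundation_lemmas if l in EMB))
    return sum(sims) / len(sims) if sims else 0.0

care_lemmas = ["help", "rescue"]
print(round(simon_feature(["donate", "table"], care_lemmas), 3))
```

Words semantically close to a foundation's lemmas ("donate" near "help"/"rescue") pull the feature toward 1, while unrelated words ("table") pull it down.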
Logistic regression (L₂-regularized) is the standard classifier, trained on six labeled Twitter corpora. Models that combine MoralStrength features—especially unigrams plus SIMON—achieve state-of-the-art (SOTA) F₁-scores: 87.6% (vs. 62.4% for prior SOTA) on Hurricane Sandy, and 86.25% macro-average F₁ over six datasets (p<0.01) (Araque et al., 2019).
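The Moral Freq and Moral Stats extractors described above can be sketched as follows; the toy lexicon entries and thresholds are assumptions for illustration, not the published resource:

```python
import statistics

# Illustrative lexicon: lemma -> (foundation, moral valence on the 1-9 scale).
LEXICON = {
    "help": ("care", 8.2), "hurt": ("care", 1.8),
    "fair": ("fairness", 8.5), "cheat": ("fairness", 1.5),
}
FOUNDATIONS = ("care", "fairness", "loyalty", "authority", "purity")

def scores_by_foundation(tokens):
    out = {f: [] for f in FOUNDATIONS}
    for t in tokens:
        if t in LEXICON:
            foundation, valence = LEXICON[t]
            out[foundation].append(valence)
    return out

def moral_freq(tokens, virtue_thr=6.0, vice_thr=4.0):
    """10 features: per foundation and pole, count of threshold-crossing
    words, normalized by document length."""
    n = max(len(tokens), 1)
    by_f = scores_by_foundation(tokens)
    feats = []
    for f in FOUNDATIONS:
        feats.append(sum(v >= virtue_thr for v in by_f[f]) / n)  # virtue pole
        feats.append(sum(v <= vice_thr for v in by_f[f]) / n)    # vice pole
    return feats

def moral_stats(tokens):
    """20 features: (mean, stdev, median, max) of valence per foundation."""
    by_f = scores_by_foundation(tokens)
    feats = []
    for f in FOUNDATIONS:
        s = by_f[f] or [5.0]  # back off to the neutral midpoint if no hits
        feats.extend([statistics.mean(s), statistics.pstdev(s),
                      statistics.median(s), max(s)])
    return feats

doc = "they cheat and hurt people".split()
print(len(moral_freq(doc)), len(moral_stats(doc)))  # 10 20
```

The resulting fixed-length vectors can be concatenated with unigram counts and fed to an L₂-regularized logistic regression, mirroring the combined configurations reported above.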
3. Computational Formalization in LLMs
In the context of LLMs, MoralStrength is formalized as the model’s ability to maintain a high, stable “moral attractor” state across a dynamic range of prompts and scenarios (Jamshidi et al., 2 Dec 2025). The Moral Consistency Pipeline (MoCoP) provides an unsupervised, closed-loop framework with three layers:
- Lexical Integrity Analysis: Measures surface-level coherence, bias, sentiment, and injection risk, producing a lexical-integrity score.
- Semantic Risk Estimation: Assesses context-dependent harm or toxicity, producing a semantic-risk score.
- Reasoning-Based Judgment Modeling: Evaluates propositional coherence, justification, and reasoning stability, yielding a reasoning-quality score.
These three layer scores are stacked into an ethical feature vector. The pipeline iteratively updates prompt distributions and model scoring via a feedback regulator to autonomously probe and audit moral reasoning in LLMs.
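A schematic of such a closed loop might look like the following; the three layer scorers are stand-in stubs and the prompt-reweighting rule is an illustrative assumption, not MoCoP's actual regulator:

```python
import random

# Stand-in layer scorers, each returning a score in [0, 1]. Real MoCoP layers
# run lexical, semantic, and reasoning analyses over model outputs.
def lexical_integrity(text):
    toks = text.split()
    return min(1.0, len(set(toks)) / max(len(toks), 1))  # crude coherence proxy

def semantic_risk(text):
    return 0.1  # stub: context-dependent harm estimate

def reasoning_judgment(text):
    return 0.8  # stub: propositional coherence / justification quality

def ethical_vector(text):
    """Stack the three layer scores into the ethical feature vector."""
    return (lexical_integrity(text), semantic_risk(text), reasoning_judgment(text))

def audit_loop(model, prompts, rounds=3):
    """Closed-loop audit: score responses, then re-weight prompt sampling
    toward prompts that produced weaker ethical vectors (assumed regulator)."""
    weights = [1.0] * len(prompts)
    history = []
    for _ in range(rounds):
        i = random.choices(range(len(prompts)), weights=weights)[0]
        vec = ethical_vector(model(prompts[i]))
        history.append(vec)
        # feedback: low coherence/reasoning or high risk boosts re-probing
        weights[i] += 1.0 - (vec[0] - vec[1] + vec[2]) / 2
    return history
```

Because the loop needs only model outputs, this style of audit remains black-box and unsupervised: no labeled moral judgments are required to keep probing.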
4. Quantitative Metrics of MoralStrength
Key metrics capture both instantaneous and longitudinal properties of moral performance:
- Ethical Utility: Measures the trade-off between coherence, reasoning, and toxicity.
- Global Ethical Consistency Index (ECI): An aggregate consistency score that serves as a model’s “moral strength.”
- Moral Divergence: Quantifies model-to-model divergence in ethical scoring.
- Moral Stability Index (MSI): A stability measure computed from the mean and standard deviation of ECI over time.
- Correlation Structure: Ethics-to-toxicity correlation (p<0.001) and ethics-to-latency correlation capture systemic trade-offs.
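A sketch of the longitudinal stability computation follows; since the paper's exact MSI formula is not reproduced here, a generic dispersion-based form MSI = μ/(μ+σ) is assumed purely for illustration (a constant ECI series yields MSI = 1):

```python
import statistics

def moral_stability_index(eci_series):
    """Illustrative MSI: mean ECI over (mean + standard deviation), so lower
    temporal dispersion pushes the index toward 1. Assumed form, not the
    paper's definition."""
    mu = statistics.mean(eci_series)
    sigma = statistics.pstdev(eci_series)
    return mu / (mu + sigma) if (mu + sigma) > 0 else 0.0

# Synthetic ECI time series for a single model under repeated probing.
eci = [0.71, 0.74, 0.69, 0.76, 0.72]
print(round(moral_stability_index(eci), 3))
```

Any monotone function of μ and σ with the same qualitative behavior (high mean, low spread implies high stability) would serve the same auditing role.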
5. Empirical Performance and Comparative Insights
MoCoP applied to GPT-4-Turbo and DeepSeek (N≈500 prompts/model) reveals:
| Safety Class | GPT-4 (%) | DeepSeek (%) |
|---|---|---|
| Safe | 39.6 | 41.2 |
| Borderline | 55.8 | 54.9 |
| Unsafe | 4.7 | 3.9 |
- Aggregate ethical scores are approximately Gaussian for both models; their means are statistically indistinguishable (t-test p≈0.063), while their variances differ significantly (F-test p<0.05).
- Stability (MSI): GPT-4: 0.740, DeepSeek: 0.748.
- Ethics vs. toxicity: a significant negative correlation (p<0.001); ethics vs. latency: no significant correlation.
Both models converge on a single "moral attractor," indicating that internal ethical reasoning is not a mere artifact of generation time and that high moral strength correlates with reduced toxicity (Jamshidi et al., 2 Dec 2025).
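The reported ethics-toxicity relationship can be illustrated with a plain Pearson correlation over paired scores; the data below are synthetic, invented for illustration, not the paper's measurements:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance over the product of
    per-variable standard deviations (computed from sums of squares)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Synthetic paired scores: as aggregate ethical score falls, toxicity rises.
ethics   = [0.82, 0.74, 0.60, 0.55, 0.43]
toxicity = [0.05, 0.09, 0.18, 0.22, 0.31]
print(round(pearson_r(ethics, toxicity), 2))  # strongly negative
```

In an actual audit, significance would also be tested (e.g., a t-test on the correlation), matching the p<0.001 result reported for ethics vs. toxicity.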
6. Applications, Implications, and Limitations
The expansion to a continuous, interpretable moral-lexicon and pipeline-based moral metrics establishes new benchmarks in both computational social science and LLM auditing. Applications include:
- Model-Agnostic Benchmarking: Black-box protocols enable cross-architecture assessment of moral stability and coherence.
- Regulatory Compliance Monitoring: Quantitative thresholds on ECI, MSI, and equilibrium metrics provide auditability and flagging for legal or organizational constraints.
- Deployment Safeguards: Closed-loop, unsupervised scenario generation and feedback allow for continuous model auditing, rapid detection of moral drift, and guidance for retraining.
- Foundation for Sociotechnical Analysis: The vectorization of moral features supports domain-specific moral reasoning analyses and integration with sentiment/emotion or demographic embeddings.
Limitations include sensitivity to context and domain, since performance varies across topics and suggests the need for domain-adaptive embeddings; reliance on the quality of pretrained embeddings in the lexicon setting; and the bounded interpretability of fully closed-loop auditing in ambiguous ethical domains. A plausible implication is that future work will integrate direct neural models with the explicit graded moral-feature space for deeper, context-aware moral reasoning (Araque et al., 2019; Jamshidi et al., 2 Dec 2025).