LibertyMFD: Quantifying Liberty in Text
- LibertyMFD is a computational lexicon that operationalizes the liberty/oppression dimension of Moral Foundations Theory using quantitative methods.
- It employs word-embedding and compositional semantics models to assign normalized liberty association scores to lemmas derived from diverse corpora.
- The resource supports large-scale analysis of moral narratives in contexts like policy debates and social media, demonstrating robust classification performance.
LibertyMFD is a computational lexicon explicitly constructed to operationalize the “liberty/oppression” dimension of Moral Foundations Theory (MFT) within natural language text. As the first dedicated resource for quantifying liberty-related moral narratives, LibertyMFD enables rigorous large-scale analysis of how concerns about autonomy, domination, and coercion are articulated in diverse discursive contexts, such as news media, policy debates, and social media conversations (Araque et al., 2022).
1. Theoretical Motivation and Background
Moral Foundations Theory (Haidt 2004; Graham et al. 2009) originally categorized core human moral intuitions along five axes: care, fairness, loyalty, authority, and sanctity. Later expansions by Haidt (2012) and Iyer et al. (2012) introduced a sixth axis, the “liberty/oppression” foundation, which captures moral concerns centered on individual freedom, resistance to domination, and aversion to coercion, a domain empirically associated with distinctive libertarian moral profiles. Existing lexical resources such as the original Moral Foundations Dictionary (MFD), MoralStrength (2020), and eMFD (2021) either predate this axis or represent it insufficiently. LibertyMFD was developed to address this gap and to quantify how liberty-framed morality is expressed, giving researchers and policymakers a tool for understanding contentious topics (such as vaccination policy, government regulation, or civil uprisings) through the lens of liberty discourse (Araque et al., 2022).
2. Corpus Selection and Preprocessing
LibertyMFD was constructed from two ideologically contrasting news corpora:
- Reason Magazine (libertarian-leaning): 14,319 full-length articles (June 1968 – June 2021), ≈14,269 after filtering, averaging 1,235 words per article. Labeled as positive for liberty/oppression content.
- AllSides News Roundups: 5,584 roundups aggregating 11,379 original articles (June 2013 – April 2021), spanning left, center, and right; Reason magazine excluded. Each roundup includes headline, lead paragraph, and description (≈71 words average). Labeled as “other” (non-liberty-moral/miscellaneous content).
The combined “News” dataset comprised 20,593 articles, randomly split 80%/20% for training and test sets. For out-of-domain validation, a Facebook vaccine stance dataset was employed (≈607,000 posts/comments, with 1,576 comments manually annotated for liberty/oppression presence). Standard NLP preprocessing—tokenization, lemmatization, and stop-word removal—was uniformly applied (Araque et al., 2022).
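The random 80%/20% split described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `split_corpus`, the fixed `seed`, and the toy documents are all assumptions made for the example.

```python
import random

def split_corpus(documents, test_fraction=0.2, seed=0):
    """Shuffle a copy of `documents` and return (train, test) lists."""
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the combined News dataset: (text, label) pairs.
docs = [(f"article {i}", "liberty" if i % 2 else "other") for i in range(10)]
train, test = split_corpus(docs)
```

With ten toy documents this yields eight training items and two test items; every document lands in exactly one of the two sets.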
3. Lexicon Construction Methodologies
Two distinct data-driven methodologies underpin LibertyMFD, designed to robustly identify and score lexicon entries for liberty/oppression salience:
3.1 Word-Embedding Similarity (WE) Model
- Seed Word Selection: Candidate lemmas were selected based on statistically significant frequency shifts between Reason and AllSides articles. Only lemmas that passed the frequency-shift threshold and the significance criterion were retained, and these were partitioned into “liberty” and “other” seed sets.
- Embedding Training and Scoring: A word2vec model (dimension = 100) was trained on the training corpus. Each lemma was scored by averaging its cosine similarity to each seed set; a higher score indicates a stronger liberty/oppression association.
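The seed-similarity step can be illustrated with a small pure-Python sketch. It assumes that embeddings are available as a plain dictionary of vectors; the 2-dimensional toy vectors, the seed words, and the helper names `cosine` and `seed_similarity` are hypothetical stand-ins for the 100-dimensional word2vec model. How the two per-seed-set averages are combined into the final WE score is not specified here, so the sketch only computes the averages.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def seed_similarity(lemma, seeds, vectors):
    """Mean cosine similarity between `lemma` and each seed that has a vector."""
    sims = [cosine(vectors[lemma], vectors[s])
            for s in seeds if s in vectors and s != lemma]
    return sum(sims) / len(sims) if sims else 0.0

# Toy 2-d vectors standing in for trained word2vec embeddings (hypothetical values).
vectors = {
    "freedom": [1.0, 0.0],
    "coercion": [0.8, 0.6],
    "weather": [0.0, 1.0],
    "mandate": [0.96, 0.28],
}
liberty_seeds = ["freedom", "coercion"]  # hypothetical seed sets
other_seeds = ["weather"]

liberty_sim = seed_similarity("mandate", liberty_seeds, vectors)
other_sim = seed_similarity("mandate", other_seeds, vectors)
```

In this toy setup "mandate" sits closer to the liberty seeds than to the other seed, so `liberty_sim` exceeds `other_sim`.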
3.2 Compositional Semantics (CS) Model
- Adapted from DepecheMood++ (Staiano 2014). A document-moral matrix was constructed, marking each Reason document with 1 under liberty/oppression and each AllSides document with 1 under other.
- A normalized word-document matrix was built.
- A word-moral association matrix was computed from these two matrices, then normalized and scaled so that each lemma’s scores sum to 1. The liberty/oppression column yields the CS polarity score.
After evaluation, the CS model’s output was adopted as the definitive LibertyMFD lexicon (Araque et al., 2022).
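The compositional-semantics pipeline can be sketched on a toy corpus as follows. This is an illustration under stated assumptions rather than the published implementation: the function name `cs_scores`, normalizing word counts by document length, and the column normalization step before the final row normalization are all choices assumed for this example.

```python
from collections import Counter

def cs_scores(docs, labels, classes=("liberty", "other")):
    """Return {lemma: {class: score}}, with each lemma's scores summing to 1."""
    vocab = sorted({w for doc in docs for w in doc})
    # Word-moral association: for each word, add its relative frequency in a
    # document to the column of that document's class label.
    assoc = {w: {c: 0.0 for c in classes} for w in vocab}
    for doc, label in zip(docs, labels):
        for w, n in Counter(doc).items():
            assoc[w][label] += n / len(doc)
    # Column normalization (an assumption), so a larger class does not dominate.
    totals = {c: sum(assoc[w][c] for w in vocab) for c in classes}
    for w in vocab:
        for c in classes:
            if totals[c]:
                assoc[w][c] /= totals[c]
    # Row normalization: each lemma's class scores sum to 1.
    for w in vocab:
        row_total = sum(assoc[w].values())
        if row_total:
            for c in classes:
                assoc[w][c] /= row_total
    return assoc

# Toy lemmatized documents and their corpus labels (hypothetical data).
docs = [["coercion", "tax", "coercion"], ["tax", "weather"], ["weather", "weather"]]
labels = ["liberty", "other", "other"]
lexicon = cs_scores(docs, labels)
```

Here "coercion" appears only in the liberty-labeled document and receives a liberty score of 1, "weather" appears only in other-labeled documents and receives 0, and "tax" falls in between.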
4. Lexicon Properties, Format, and Performance
LibertyMFD comprises 10,237 unique lemma–score pairs. Each entry encodes a normalized liberty/oppression association score for its lemma (values closer to 1 signify a stronger liberty association). Representative entries:
| High-liberty lemma | Score | Low-liberty (“other”) lemma | Score |
|---|---|---|---|
| corporations | 0.92 | financial | 0.02 |
| coercion | 0.89 | citizens | 0.04 |
| advice | 0.85 | suspect | 0.05 |
| budgets | 0.83 | racial | 0.07 |
| expense | 0.82 | suburban | 0.08 |
Coverage and Overlap:
- Coverage (lemmas in document found in lexicon): CS ≈96.7% (news), ≈90.9% (vaccine).
- Lexicon Overlap (Welter 2020): Simple LOS = 77.2% (vocabulary overlap), Binary LOS (agreed direction) = 39.5%.
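Coverage as defined above can be computed with a short helper. The sketch assumes coverage is counted over lemma occurrences (tokens) rather than distinct lemmas, which the text does not specify; the function name and the toy lexicon (two entries borrowed from the table above) are for illustration only.

```python
def coverage(lemmas, lexicon):
    """Fraction of a document's lemmas that have an entry in the lexicon."""
    if not lemmas:
        return 0.0
    return sum(1 for w in lemmas if w in lexicon) / len(lemmas)

# Two-entry toy lexicon; the real lexicon has 10,237 entries.
toy_lexicon = {"coercion": 0.89, "citizens": 0.04}
doc_coverage = coverage(["coercion", "of", "citizens", "today"], toy_lexicon)
```

Two of the four lemmas in the toy document are in the lexicon, so `doc_coverage` is 0.5.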
Score Distributions:
- WE: unimodal, slight negative skew.
- CS: bimodal, with majority near-zero and a long tail toward 1 (liberty associated).
Classification Performance:
- Unsupervised (avg. lexicon score, vaccine data): Macro-F1 WE=23.5%, CS=45.3%.
- Supervised (linear-SVM, News): WE=97.5% (F1), CS=97.4%; with summary stats, WE=98.4%, CS=98.3%.
- Supervised (Vaccine): WE=73.9%, CS=76.7%; with stats, WE=74.2%, CS=77.5%.
- The CS model plus feature statistics demonstrated the best cross-domain generalization.
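The unsupervised setting (average lexicon score per document, evaluated with macro-F1) might look like the sketch below. The decision threshold of 0.5, the function names, and the toy comments and gold labels are assumptions for illustration; the lexicon values are taken from the representative entries listed earlier.

```python
def predict_liberty(tokens, lexicon, threshold=0.5):
    """Label a document 1 (liberty) if its mean lexicon score reaches `threshold`."""
    scores = [lexicon[w] for w in tokens if w in lexicon]
    mean = sum(scores) / len(scores) if scores else 0.0
    return int(mean >= threshold)  # the threshold value is an assumption

def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

lexicon = {"coercion": 0.89, "corporations": 0.92, "financial": 0.02, "citizens": 0.04}
comments = [["coercion", "corporations"], ["financial", "citizens"],
            ["coercion", "financial"], ["citizens"]]
y_true = [1, 0, 1, 0]  # hypothetical gold annotations
y_pred = [predict_liberty(c, lexicon) for c in comments]
score = macro_f1(y_true, y_pred)
```

The third toy comment mixes a high and a low entry, so its mean falls below the threshold and it is misclassified, which lowers the macro-F1 below 1.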
5. Usage Guidelines and Analytical Protocols
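LibertyMFD supports both document-level scoring and feature-based classification. The canonical document-level liberty score is the mean lexicon score over the tokens of a document that appear in the lexicon:

$$\mathrm{liberty}(d) = \frac{1}{|M_d|} \sum_{w \in M_d} s(w)$$

where $M_d$ is the list of tokens of document $d$ that have a lexicon entry and $s(w)$ is the lexicon score of lemma $w$; the score is taken to be 0 when $M_d$ is empty.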
Example (Python):

```python
def liberty_score(tokens, lexicon):
    """Mean lexicon score of the tokens found in `lexicon` (0.0 if none match)."""
    scores = [lexicon[w] for w in tokens if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```
For classification, statistical summaries such as mean, max, median, variance, and peak-to-peak of matched lexicon entry scores may be combined as features for downstream models (e.g., SVM, logistic regression, random forest):
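
```python
import numpy as np

# Scores of the tokens that have a lexicon entry
# (`tokens` and `lexicon` as in liberty_score).
values = [lexicon[w] for w in tokens if w in lexicon]
# np.max, np.ptp, etc. raise on an empty list; falling back to a single
# 0.0 score for documents with no matches is an assumed convention.
if not values:
    values = [0.0]
features = {
    'mean': np.mean(values),
    'max': np.max(values),
    'median': np.median(values),
    'var': np.var(values),
    'ptp': np.ptp(values),  # peak-to-peak: max - min
}
```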
6. Empirical Case Study: Vaccination Discourse
Applied to Facebook vaccine-related comments (2012–2019), LibertyMFD revealed that liberty-themed rhetoric permeates both pro- and anti-vaccination communities, exhibiting distinct framing strategies. In unsupervised classification, a macro-F1 of 45.3% was achieved; supervised approaches reached F1=77.5%—demonstrating measurable signal beyond unigram baselines. Manual analysis identified liberty arguments such as “my body, my choice” and “forced mandates” in anti-vaccine narratives, while pro-vaccine advocates emphasized “public health freedom” and “informed consent,” evidencing the lexicon’s ability to capture nuanced expressions of liberty (Araque et al., 2022).
7. Limitations and Prospective Developments
- Language and Domain Constraints: LibertyMFD is grounded in U.S. English news media; transfer to other languages or informal registers requires new corpora or validated translation.
- Seed Data Bias: The Reason vs. AllSides alignment may not fully represent informal, non-Western, or multivalent liberty expressions.
- Polarity Granularity: The lexicon’s current unidirectional scale does not explicitly distinguish virtue (“liberty”) from vice (“oppression”).
- Contextual Signal Loss: Lexicon scores do not account for negation, sarcasm, or phrase-level compositionality.
- Domain Adaptability: Expanding to other genres (e.g., protest discourse, legislative debates), integrating BERT-style contextual embeddings, and employing human-in-the-loop calibration are articulated as future enhancements.
Planned development areas encompass crowdsourcing polarity calibration, creation of multilingual variants leveraging local corpora, and extension to deeper contextual modeling architectures (Araque et al., 2022).
LibertyMFD’s data and codebase are open-access, facilitating its deployment for real-time, high-throughput analysis of liberty discourses in policy, journalism, and social platforms.