Representation Injustice in Data Systems
- Representation injustice denotes systematic distortions, omissions, or misrepresentations in data, models, and knowledge systems that marginalize particular social groups.
- It manifests across modalities such as structured knowledge bases, language technology, and generative AI, leading to biased outputs that reinforce stereotypes and exclusion.
- Practical interventions include data-level adjustments, fairness-aware representation learning, and participatory design approaches to promote more equitable and accurate depictions.
Representation injustice refers to systematic distortions, omissions, or unfairness in how individuals, social groups, cultures, events, and concepts are encoded, depicted, or made accessible in data, models, algorithms, or knowledge systems. Unlike allocative harms, which concern the unequal distribution of resources or opportunities, representation injustice concerns both who is made visible and, crucially, how entities and communities are portrayed or silenced. These harms can propagate through data acquisition, knowledge organization, language technology, machine learning pipelines, generative AI, and algorithmic content curation, thereby influencing social imaginaries, reinforcing stereotypes, and enabling epistemic exclusion.
1. Conceptual Foundations and Core Definitions
Representation injustice encompasses disparities and distortions in the inclusion, portrayal, and semantic richness of groups or viewpoints within sociotechnical systems. It manifests across multiple modalities—structured datasets, commonsense knowledge bases, search engine outputs, LLMs, and generative AI—each with specific axes along which presence and depiction can be imbalanced (Ma et al., 2023, Rohrbach et al., 2024, Mehrabi et al., 2021, Mickel et al., 2025).
Technically, representation injustice can be indexed along:
- Coverage/Presence: Whether a group, language, or concept appears with frequency commensurate with its real-world significance (Ma et al., 2023, Rohrbach et al., 2024, Deng et al., 2021).
- Quality of Representation: The richness, fidelity, and absence of stereotypical/categorical reduction in portrayals (Mickel et al., 2025, Abbasi et al., 2019).
- Semantic Adequacy: Whether unique or culturally specific concepts are made expressible in the system (e.g., linguistic or ontological gaps) (Helm et al., 2023, Kay et al., 2024).
Formally, metrics such as the representation disparity $\Delta_g = \hat{p}_g - p_g$ (where $p_g$ is a group's true share and $\hat{p}_g$ its observed share in algorithmic output) and a relative bias index (e.g., the ratio $\hat{p}_g / p_g$) are used to quantify deviation from parity (Rohrbach et al., 2024). In knowledge bases, disparities in “page components” (labels, claims, edit counts) directly measure representational inequality (Ma et al., 2023). For language and knowledge resources, overgeneralization rates and representation variances provide quantitative indices of stereotyping and underrepresentation (Mehrabi et al., 2021, Abbasi et al., 2019).
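As a minimal sketch (with hypothetical shares, and the difference/ratio formulations assumed rather than taken from the cited audits), these parity metrics can be computed directly from true and observed group shares:

```python
# Minimal sketch of parity-based representation metrics.
# p_true: a group's share of the reference population; p_obs: its share of system output.
# The difference/ratio formulations and the example numbers are illustrative assumptions.

def representation_disparity(p_true: float, p_obs: float) -> float:
    """Signed deviation from parity; negative values indicate underrepresentation."""
    return p_obs - p_true

def relative_bias_index(p_true: float, p_obs: float) -> float:
    """Ratio of observed to true share; 1.0 indicates parity."""
    return p_obs / p_true

# Example: a group that holds 30% of legislative seats but 21% of image-search results.
p_true, p_obs = 0.30, 0.21
print(representation_disparity(p_true, p_obs))  # ≈ -0.09
print(relative_bias_index(p_true, p_obs))       # ≈ 0.70
```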
2. Modalities and Manifestations
Representation injustice arises throughout the machine learning and data value chain:
- Structured Knowledge Bases: Wikidata exemplifies coverage and edit disparities across countries due to unequal community sizes and bot-driven imports. German human items, for example, have approximately twice as many multilingual labels as Vietnamese items, and receive disproportionately more attention from human and automated editors (Ma et al., 2023).
- Language and Lexical Technology: Techno-linguistic bias results when AI systems encode the worldview of pivot/dominant languages, collapsing rich conceptual distinctions (e.g., unique kin terms or food types) in “underserved” languages, producing systematic expressivity loss and hermeneutic silencing (Helm et al., 2023).
- Commonsense and Graph-based AI: In commonsense KBs such as ConceptNet, representational harms include both overgeneralization (disproportionate negative or positive attributions to targets) and representation disparity (variance in the number and sentiment of statements per demographic group), which then propagate into downstream model outputs (Mehrabi et al., 2021).
- Generative Models and LLMs: Even with interventions boosting the numerical representation of underrepresented groups ("who"), output language and framing ("how") often remain stereotyped, with stubborn persistence of gendered, racial, or ableist tropes in model-generated biographies and dialogue (Mickel et al., 2025, Kay et al., 2024).
- Algorithmic Mediation (Search Engines and Platforms): Audits of image search results for legislatures revealed both underrepresentation (e.g., women's share in results systematically lower than real-world presence) and misrepresentation (skewed, masculinized visual narratives), leading to measurable impact on users’ perceptions of political reality and candidate viability (Rohrbach et al., 2024).
- Mobility and Disaster Data: During Hurricane Harvey, smartphone-derived mobility datasets systematically underrepresented Black, Hispanic, and poor neighborhoods by up to a factor of two, making real-time analytics and disaster response unrepresentative of the full population (Deng et al., 2021).
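The representativeness ratio used in mobility-data audits of this kind can be sketched as a neighborhood's share of observed devices divided by its share of the census population, with values well below 1 flagging the underrepresentation described in the last item above; the figures here are invented for illustration:

```python
# Sketch of a per-neighborhood representativeness ratio for mobility data.
# Values < 1 mean the neighborhood is underrepresented in the device sample.
# The numbers below are invented for illustration, not taken from Deng et al. (2021).

census_population = {"A": 12000, "B": 8000, "C": 20000}  # residents per neighborhood
observed_devices  = {"A": 900,   "B": 250,  "C": 1850}   # sampled devices per neighborhood

total_pop = sum(census_population.values())
total_dev = sum(observed_devices.values())

for hood in census_population:
    pop_share = census_population[hood] / total_pop
    dev_share = observed_devices[hood] / total_dev
    print(f"Neighborhood {hood}: representativeness ratio = {dev_share / pop_share:.2f}")
```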
3. Theoretical Results and Limits of Fair Representational Learning
Mathematical analyses demonstrate fundamental impossibility results for universal fair representation. No fixed representation can ensure Demographic Parity (DP) or Equalized Odds (EO) for all downstream tasks and distributional shifts:
- DP cannot be maintained under arbitrary changes to the marginal data distribution; any nontrivial representation (allowing for nonconstant classifiers) is susceptible to fairness violations when subject to worst-case group composition (Lechner et al., 2021).
- EO cannot be simultaneously achieved across distinct tasks with differing group label prevalences unless the representation collapses to triviality or the tasks themselves have coinciding base rates (Lechner et al., 2021).
These results underline that fairness is an emergent property of joint data, representation, classifier, labeling rules, and group statistics, not of the representation alone.
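The DP result can be illustrated with a toy calculation (not drawn from Lechner et al., 2021): fix a representation with two codes and the nonconstant classifier h(z) = z; whether demographic parity holds then depends entirely on how each group's mass falls on the codes, which a fixed representation cannot control under distribution shift.

```python
# Toy illustration: a fixed two-code representation z in {0, 1} and the
# nonconstant classifier h(z) = z. The demographic-parity gap is determined by
# the group composition over the codes, not by the representation itself.

def dp_gap(p_a_z1: float, p_b_z1: float) -> float:
    """|P(h=1 | group A) - P(h=1 | group B)| for the classifier h(z) = z."""
    return abs(p_a_z1 - p_b_z1)

# Distribution 1: both groups place half their mass on code z = 1 -> DP holds.
print(dp_gap(0.5, 0.5))  # 0.0

# Distribution 2: the group marginals shift over the same representation -> DP violated.
print(dp_gap(0.8, 0.2))  # ≈ 0.6
```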
4. Measurement, Auditing, and Metrics
To operationalize representation injustice, a suite of formal metrics and methodologies is deployed:
| Modality | Core Metrics/Tests | Notable Papers |
|---|---|---|
| Structured Data | Representation disparity ($\Delta_g$), relative bias index, RMSE, t-tests, Cohen's d | (Ma et al., 2023, Rohrbach et al., 2024) |
| Commonsense KBs | Overgeneralization rates, representation variance (statement counts and sentiment per group), sentiment/regard taggers | (Mehrabi et al., 2021) |
| Language Technology | Lexical gap coverage, back-translation expressivity loss | (Helm et al., 2023) |
| Generative AI | Subset Similarity, Representational Bias Score, word-level analyses | (Mickel et al., 2025) |
| Mobility Data | Representativeness ratio, Mood's test, Pearson correlation | (Deng et al., 2021) |
These metrics capture both quantitative (coverage, statistical parity, variance) and qualitative (semantic loss, stereotype frequency) dimensions of harm.
5. Practical Interventions, Approaches, and Limitations
Mitigation strategies for representation injustice include:
- Preprocessing and Data-Level Fixes: Oversampling, synthetic data augmentation (e.g., GAN-based contrastive examples (Sharmanska et al., 2020)), reweighting, integration of supplementary knowledge bases, counter-stereotypical additions (Shahbazi et al., 2022, Mehrabi et al., 2021).
- Representation Learning with Guarantees: Provably fair representations seek to reduce the dependence of latent encodings on sensitive attributes, balancing reconstruction error and adversarial loss terms, with high-confidence, user-specified bounds on fairness violations for all downstream classifiers (McNamara et al., 2017, Luo et al., 2023); a minimal sketch follows this list.
- Structural and Sociotechnical Approaches: Participatory co-design—embedding impacted communities in data curation, ontology construction, and model alignment—plus institutional reforms to address root sources of allocative and representational harm (Helm et al., 2023, Kay et al., 2024, Madaio et al., 2021).
- Auditing and Transparency: Large-scale algorithm audits, explicit reporting of representation metrics, and development of observability tools for model outputs, especially in politically or socially critical domains (Rohrbach et al., 2024, Ma et al., 2023).
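As a minimal sketch of the adversarial variant of fair representation learning mentioned above (an assumed encoder/decoder/adversary architecture, not the specific formulation of McNamara et al., 2017 or Luo et al., 2023), an encoder is trained to reconstruct the input while an adversary tries to recover the sensitive attribute from the latent code, and the encoder is penalized whenever the adversary succeeds:

```python
# Sketch of adversarially trained fair representations (assumed architecture).
# The encoder learns z = enc(x) that supports reconstruction of x while an adversary
# tries to predict the sensitive attribute s from z; the encoder loss rewards fooling
# the adversary, reducing the dependence of z on s.
import torch
import torch.nn as nn

d_in, d_z = 16, 4
enc = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_z))
dec = nn.Sequential(nn.Linear(d_z, 32), nn.ReLU(), nn.Linear(32, d_in))
adv = nn.Sequential(nn.Linear(d_z, 16), nn.ReLU(), nn.Linear(16, 1))  # predicts s from z

opt_main = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-3)
recon_loss, adv_loss = nn.MSELoss(), nn.BCEWithLogitsLoss()
lam = 1.0  # trade-off between reconstruction fidelity and attribute removal

x = torch.randn(256, d_in)                 # placeholder features
s = torch.randint(0, 2, (256, 1)).float()  # placeholder sensitive attribute

for step in range(200):
    # Adversary step: learn to recover s from the current (detached) representation.
    z = enc(x).detach()
    opt_adv.zero_grad()
    adv_loss(adv(z), s).backward()
    opt_adv.step()

    # Encoder/decoder step: reconstruct x while making s hard to predict from z.
    z = enc(x)
    loss = recon_loss(dec(z), x) - lam * adv_loss(adv(z), s)
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()
```

The trade-off weight `lam` controls how aggressively attribute information is removed; setting it too high risks the utility loss and trivialization noted below.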
Limitations persist. Narrowly targeted, quota-based interventions can exacerbate disparities along omitted, correlated axes (the "Debiasing Paradox"), necessitating intersectional or multivariate bias correction (Smirnov et al., 2020). Stripping representations of all sensitive information risks utility loss or trivialization. Algorithms or toolkits blind to semantic or cultural variance (e.g., optimizing for scalable rather than meaningful diversity) compound epistemic silencing (Helm et al., 2023).
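A toy numeric illustration of this quota paradox (invented numbers and a hypothetical gender/region correlation, not data from Smirnov et al., 2020): enforcing parity on one attribute with a replacement pool that is skewed on a correlated, omitted attribute widens the disparity on that attribute.

```python
# Toy illustration of the "Debiasing Paradox": a gender quota applied with a
# replacement pool skewed on a correlated, omitted attribute (region) reduces the
# selection's Global-South share. All numbers are invented for illustration.
import random

random.seed(0)

# Original selection of 100 people: 80% men; region split 55% North / 45% South.
original = ([("man", "North")] * 44 + [("man", "South")] * 36 +
            [("woman", "North")] * 11 + [("woman", "South")] * 9)

# Replacement pool of women available to satisfy the quota, skewed 90% North.
pool = [("woman", "North")] * 90 + [("woman", "South")] * 10

# Quota intervention: swap randomly chosen men for pool members until women reach 50%.
selection = list(original)
men = [p for p in selection if p[0] == "man"]
needed = len(selection) // 2 - sum(p[0] == "woman" for p in selection)
for _ in range(needed):
    selection.remove(men.pop(random.randrange(len(men))))
    selection.append(pool.pop(random.randrange(len(pool))))

def south_share(people):
    return sum(p[1] == "South" for p in people) / len(people)

print("South share before quota:", south_share(original))   # 0.45
print("South share after quota: ", south_share(selection))  # noticeably lower
```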
6. Broader Implications and Societal Impact
Representation injustice entrenches hierarchical access to knowledge, reinforces stereotypes, and perpetuates cycles of marginalization at scale—whether by reifying normative political archetypes in search engines (Rohrbach et al., 2024), impoverishing language expressivity (Helm et al., 2023), or rendering the experiences of minority, poor, or Global South communities invisible in data-driven systems (Ma et al., 2023, Deng et al., 2021). It is implicated in epistemic and hermeneutical injustice, democratic exclusion, and the propagation of digitally mediated structural inequality (Kay et al., 2024, Madaio et al., 2021). Meaningful mitigation requires both algorithmic and institutional innovation, with explicit attention to intersectionality, participatory governance, and ongoing, multi-metric evaluation.
7. Open Challenges and Future Directions
Ongoing research must advance:
- Metrics capturing semantic adequacy and multidimensional/intersectional coverage;
- Participatory, community-led resource and ontology construction;
- Fairness-aware learning with formal guarantees under adversarial and shifting group/distribution scenarios;
- Auditing frameworks sensitive to how, not just who, is represented;
- Structural interventions addressing the root causes of underrepresentation rather than its superficial symptoms (Ma et al., 2023, Mickel et al., 2025, Smirnov et al., 2020, Helm et al., 2023).
Representation injustice thus operates at the intersection of technical, representational, and sociopolitical domains, requiring integrated, formal, and structural responses across the machine learning and data ecosystem.