Advancing Risk Gene Discovery Across the Allele Frequency Spectrum

Published 6 Nov 2025 in q-bio.GN | (2511.04637v1)

Abstract: The discovery of genetic risk factors has transformed human genetics, yet the pace of new gene identification has slowed despite the exponential expansion of sequencing and biobank resources. Current approaches are optimized for the extremes of the allele frequency spectrum: rare, high-penetrance variants identified through burden testing, and common, low-effect variants mapped by genome-wide association studies. Between these extremes lie variants of intermediate frequency and effect size where statistical power is limited, pathogenicity is often misclassified, and gene discovery lags behind empirical evidence of heritable contribution. This 'missing middle' represents a critical blind spot across disease areas, from neurodevelopmental and psychiatric disorders to cancer and aging. In this review, we organize strategies for risk gene identification by variant frequency class, highlighting methodological strengths and constraints at each scale. We draw on lessons across fields to illustrate how innovations in variant annotation, joint modeling, phenotype refinement, and network-based inference can extend discovery into the intermediate range. By framing the frequency spectrum as a unifying axis, we provide a conceptual map of current capabilities, their limitations, and emerging directions toward more comprehensive risk gene discovery.

Authors (2)

Summary

The review argues that current methods primarily capture rare high-impact and common low-effect variants while systematically missing intermediate-frequency variants.
It outlines innovative joint modeling and network-based inference techniques to boost discovery of moderate-effect alleles in complex disorders.
The study emphasizes integrating refined phenotype strategies and experimental validation to enhance genetic risk mapping and therapeutic targeting.

Advancements and Challenges in Risk Gene Discovery Across the Allele Frequency Spectrum

Overview and Motivation

The genetic dissection of human disease has, over the past two decades, shifted from Mendelian paradigms to highly polygenic models. Despite the exponential growth of sequencing compendia and biobank resources, the discovery rate of new risk genes for complex disorders has plateaued. This paper examines the causes of this deceleration and argues that current statistical approaches primarily capture high-penetrance rare variants and low-effect common variants, leaving a systematically underexplored "missing middle": variants of intermediate frequency (minor allele frequency, MAF, of 0.1–5%) and moderate effect size (odds ratio, OR, of 1–2), which are poorly captured by existing frameworks. The review synthesizes advances in variant annotation, joint modeling, phenotype refinement, and network-based inference to propose strategic directions for comprehensive risk gene discovery.

The Landscape of Coding Variation: Rare Variants

Rare variation (MAF < 0.1%), including copy number variants (CNVs) and small coding variants (protein-truncating variants, PTVs, and rare missense variants), has yielded the majority of individually interpretable gene-disease associations owing to high effect sizes. CNVs, particularly recurrent microdeletions and duplications, represent one of the largest sources of effect, notably in neurodevelopmental disorders. However, the identification of precise driver genes in gene-dense CNV regions has proved elusive, due to both breakpoint uniformity and pleiotropy. Somatic CNV profiling in disease-relevant tissues (notably studied in the cancer genomics domain) is proposed as a promising adjunct to increase gene-level resolution, especially for complex, multi-genic events.

Small coding variation, especially PTVs, has produced actionable insights both for diagnosis and therapeutics, exemplified by the PCSK9 cholesterol-lowering paradigm. However, PTVs and rare missense variants explain a minority of aggregate risk in most complex disorders, and their penetrance is profoundly modulated by gene constraint, dosage-sensitivity, and alternative splicing isoforms. Interpretability of missense variants remains a central limitation; current variant effect predictors (VEPs), even those informed by regional structural constraint or pathogenicity, are hampered by context dependence and limited performance on variants falling outside known pathogenic gene sets. Experimental approaches, such as deep mutational scanning and integration with 3D structural data, are vital for future progress but remain limited in scope due to assay scalability and context specificity.

Atypical classes (tandem repeats, in-frame indels, and multinucleotide variants, MNVs) represent significant functional diversity that is poorly accounted for by standard variant calling and annotation pipelines. Their increased mutational rate, complex inheritance, and strong context effects bring substantial interpretative and statistical challenges, but these classes are increasingly being incorporated via improved haplotype-aware calling and post-variant refinement.

Intermediate Frequency Variants: The "Missing Middle"

A key finding is the systematic underrepresentation and probable misclassification of potentially pathogenic intermediate-frequency alleles as benign, a phenomenon reflected by an anomalous excess of "benign" ClinVar annotations in the 0.1–5% MAF range relative to adjacent bins. Statistical power limitations, owing to the sample requirements for moderate effect size detection, and methodological dependencies on constraint metrics, exacerbate this gap.
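
The power constraint can be made concrete with a rough calculation. The sketch below approximates the power of a two-sided allelic case-control test using a normal approximation for the difference in allele frequencies. It is an illustration under simplifying assumptions (a single-variant test, independent alleles, a genome-wide threshold of 5e-8), not a method taken from the review, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by the control frequency and allelic odds ratio."""
    case_odds = odds_ratio * control_freq / (1.0 - control_freq)
    return case_odds / (1.0 + case_odds)


def allelic_test_power(control_freq: float, odds_ratio: float,
                       n_cases: int, n_controls: int,
                       alpha: float = 5e-8) -> float:
    """Approximate power of a two-proportion z-test on allele counts.

    Assumption: only the tail in the direction of the true effect is
    counted, which is accurate unless the effect is close to zero.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    alleles_cases = 2 * n_cases        # two alleles per diploid individual
    alleles_controls = 2 * n_controls
    pooled = (alleles_cases * p1 + alleles_controls * p0) / (alleles_cases + alleles_controls)
    se_null = sqrt(pooled * (1.0 - pooled) * (1.0 / alleles_cases + 1.0 / alleles_controls))
    se_alt = sqrt(p1 * (1.0 - p1) / alleles_cases + p0 * (1.0 - p0) / alleles_controls)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)


if __name__ == "__main__":
    # Hypothetical scenarios: same odds ratio and cohort, different frequencies.
    for maf in (0.005, 0.01, 0.05, 0.2):
        print(maf, allelic_test_power(maf, 1.5, 5000, 5000))
```

Under this approximation, a variant at 0.5% frequency with an odds ratio of 1.5 has far lower power than a 20% frequency variant with the same odds ratio and cohort size, which is the gap the review describes.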

Joint Genetic Approaches

Joint modeling of rare and intermediate-frequency variants, either within genes (aggregative burden) or across genes (pairwise or higher-order combinations), offers increased discovery power for moderate effect size alleles. While such approaches have been successful in somatic mutational landscapes (notably cancer), their power in germline studies is limited by allele rarity, combinatorial explosion, and lack of multiplexed transmission datasets. Empirical results from high-throughput double-mutant screens in yeast and CRISPR studies in human cell lines indicate that most gene-gene interactions are approximately additive, simplifying statistical modeling under the null yet mandating large-scale observation for detection of rare, synergistic interactions.
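
The within-gene aggregation idea can be sketched as a simple collapsing burden test: individuals are marked as carriers if they hold any qualifying variant in the gene, and carrier counts are compared between cases and controls with a one-sided Fisher's exact test. This is a minimal sketch of one common form of burden testing, assuming unweighted collapsing; the review does not prescribe a specific test, and the data layout and function names here are hypothetical.

```python
from math import comb


def carrier_flags(genotypes, qualifying):
    """Mark each individual as a carrier if any qualifying variant has an alternate allele.

    genotypes: one list per individual of alternate-allele counts per variant.
    qualifying: indices of the variants included in the burden.
    """
    return [any(person[i] > 0 for i in qualifying) for person in genotypes]


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """P(at least `case_carriers` carriers fall among cases) under the hypergeometric null."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    tail = sum(comb(n_cases, k) * comb(n_controls, carriers - k)
               for k in range(case_carriers, upper + 1))
    return tail / comb(total, carriers)


def gene_burden_pvalue(case_genotypes, control_genotypes, qualifying):
    """Collapse qualifying variants per individual, then test carrier enrichment in cases."""
    case_carriers = sum(carrier_flags(case_genotypes, qualifying))
    control_carriers = sum(carrier_flags(control_genotypes, qualifying))
    return fisher_one_sided(case_carriers, len(case_genotypes),
                            control_carriers, len(control_genotypes))


if __name__ == "__main__":
    # Toy data (hypothetical): two variants in one gene, three cases, three controls.
    cases = [[1, 0], [0, 1], [0, 1]]
    controls = [[0, 0], [0, 0], [0, 0]]
    print("variant 0 alone:", gene_burden_pvalue(cases, controls, [0]))
    print("variant 1 alone:", gene_burden_pvalue(cases, controls, [1]))
    print("collapsed gene: ", gene_burden_pvalue(cases, controls, [0, 1]))
```

In the toy data, the collapsed test gives a smaller p-value than either variant alone, because carriers of different variants in the same gene are pooled. Real analyses typically add variant weighting and covariate adjustment, which this sketch omits.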

The interplay between common variant burden (polygenic risk score, PRS) and rare variant penetrance is mediated, in part, via background liability models. While PRS can modulate phenotypic expressivity, integration with rare variant analyses in gene discovery remains underdeveloped, suggesting a clear avenue for future methodological innovation.
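
A liability threshold model gives one way to see how polygenic background could shift rare variant penetrance. The sketch below assumes a standard normal liability in the population, a disease threshold set by prevalence, a rare variant that shifts mean liability by a fixed amount, and a polygenic score that explains a fraction of liability variance. All parameter values and names are hypothetical, and the rare variant's own contribution to population variance is ignored.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def carrier_penetrance(prevalence: float, variant_shift: float,
                       prs_z: float, prs_r2: float) -> float:
    """Probability that liability exceeds the disease threshold.

    prevalence: population disease prevalence, which sets the threshold.
    variant_shift: mean liability shift (in SD units) from the rare variant.
    prs_z: the individual's polygenic score as a z-score.
    prs_r2: fraction of liability variance explained by the polygenic score.
    """
    threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
    mean_liability = variant_shift + prs_z * sqrt(prs_r2)
    residual_sd = sqrt(1.0 - prs_r2)  # variance left after conditioning on the score
    return 1.0 - STD_NORMAL.cdf((threshold - mean_liability) / residual_sd)


if __name__ == "__main__":
    # Hypothetical carrier of a variant shifting liability by 1 SD, at 1% prevalence.
    for z in (-2.0, 0.0, 2.0):
        print(z, carrier_penetrance(0.01, 1.0, z, 0.1))
```

In this model the same rare variant has higher penetrance in carriers with a high polygenic score than in carriers with a low one, which is the kind of modulation described above.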

The dichotomization of affected and unaffected individuals imposes artificial thresholds inconsistent with the continuous liability model of most human phenotypes. Quantitative trait and within-case association approaches (e.g., using continuous metrics like social responsivity for autism or molecular biomarkers for aging) have proven to substantially increase association power, especially for lower-frequency or moderate-effect alleles. Gene mapping focusing on outlier expression or molecular phenotypes (such as in Watershed) shows that intermediate frequency/impact variants can drive distinctive regulatory or splicing signatures, detectable even in the absence of robust case-control associations.
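
The cost of dichotomizing a continuous trait can be illustrated with a standard result for normally distributed liability. If a predictor has a small correlation r with the continuous trait, its correlation with the thresholded (affected or unaffected) version shrinks by a factor of φ(t)/√(K(1−K)), where t is the threshold, φ the standard normal density, and K the affected fraction. The squared factor acts as a relative efficiency in terms of effective sample size. The sketch below assumes a population-sampled cohort with no case enrichment; the function name is hypothetical.

```python
from statistics import NormalDist


def dichotomization_efficiency(case_fraction: float) -> float:
    """Relative efficiency of a thresholded trait versus the underlying continuous trait.

    Assumes a normally distributed trait, a small effect, and a random
    population sample in which `case_fraction` of individuals exceed the threshold.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - case_fraction)
    density = std_normal.pdf(threshold)
    return density ** 2 / (case_fraction * (1.0 - case_fraction))


if __name__ == "__main__":
    for fraction in (0.5, 0.1, 0.01):
        print(fraction, dichotomization_efficiency(fraction))
```

Splitting at the median retains about 64% of the information (2/π), and a threshold that labels only 1% of a population sample as affected retains less than 10%. Case-control designs recover much of this loss through ascertainment, so the figure applies to unselected cohorts such as biobanks.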

Within-case stratification, accounting for sex or disease subtype, refines risk signals but is susceptible to confounding from secondary effects unrelated to the variant itself (e.g., hormonal or environmental modifiers).

Network-Based Inference

Integrative network approaches (DAWN, VBASS, DYNATE) utilize co-expression, physical interaction, and regulatory data to prioritize candidate risk genes, especially those with marginal direct signals. These methods increase discovery power for lower impact variants but are critically dependent on availability of high-quality, context-matched expression and molecular annotation data.
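
The general guilt-by-association principle behind such methods can be sketched with a random walk with restart over a gene network. This is a generic propagation scheme chosen for illustration, not the algorithm used by DAWN, VBASS, or DYNATE; the toy graph, gene labels, and function name are hypothetical.

```python
def propagate_scores(adjacency, seed_scores, restart=0.5, n_iter=100):
    """Spread seed evidence over an undirected gene network.

    adjacency: dict mapping each gene to a list of neighbours (every
               neighbour must also appear as a key).
    seed_scores: dict of non-negative direct-evidence scores per gene.
    restart: probability of returning to the seed distribution each step.
    """
    genes = list(adjacency)
    total = sum(seed_scores.get(g, 0.0) for g in genes)
    if total <= 0:
        raise ValueError("at least one gene needs a positive seed score")
    prior = {g: seed_scores.get(g, 0.0) / total for g in genes}
    current = dict(prior)
    for _ in range(n_iter):
        spread = {g: 0.0 for g in genes}
        for g in genes:
            neighbours = adjacency[g]
            if not neighbours:
                spread[g] += current[g]  # isolated gene keeps its own mass
                continue
            share = current[g] / len(neighbours)
            for n in neighbours:
                spread[n] += share
        current = {g: restart * prior[g] + (1.0 - restart) * spread[g] for g in genes}
    return current


if __name__ == "__main__":
    # Toy network: A has direct evidence; B and C are connected to it; D is isolated.
    network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
    print(propagate_scores(network, {"A": 1.0}))
```

Genes with no direct signal (B and C here) receive scores in proportion to their network proximity to the seed gene, while the isolated gene D receives none. The result depends entirely on the quality and relevance of the network, which mirrors the dependence on context-matched data noted above.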

Common Variant Approaches: Fine-Mapping and Aggregative Gene Discovery

GWAS has mapped a large catalog of common, low-effect SNPs associated with complex disease risk, though translation into causal gene-level understanding remains an ongoing challenge. Fine-mapping methods, especially those utilizing ancestry-specific LD reference panels (e.g., SuSiEx, Tractor), have improved causal inference and risk resolution in admixed populations. The residual lack of overlap between rare and common variant-implicated genes (as noted in recent schizophrenia and breast cancer studies) underscores that these approaches capture distinct biological architectures.
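
The basic logic of statistical fine-mapping can be sketched with approximate Bayes factors computed from summary statistics, under the assumption of exactly one causal variant per locus. This is a much simpler technique than SuSiEx or Tractor: it ignores LD structure, multiple causal signals, and ancestry. The prior effect-size standard deviation of 0.2, the variant labels, and the function names are assumptions made for illustration.

```python
from math import exp, log


def log_abf(beta: float, se: float, prior_sd: float = 0.2) -> float:
    """Log approximate Bayes factor (association versus null) from an effect estimate and its SE."""
    variance = se ** 2
    prior_variance = prior_sd ** 2
    shrinkage = prior_variance / (variance + prior_variance)
    z = beta / se
    return 0.5 * (log(1.0 - shrinkage) + shrinkage * z * z)


def posterior_inclusion(summary):
    """Posterior probability that each variant is the single causal one.

    summary: dict mapping variant label to (beta, se). Equal prior
    probability per variant is assumed.
    """
    log_bfs = {v: log_abf(beta, se) for v, (beta, se) in summary.items()}
    largest = max(log_bfs.values())  # subtract the maximum for numerical stability
    weights = {v: exp(value - largest) for v, value in log_bfs.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def credible_set(pips, coverage=0.95):
    """Smallest set of top-ranked variants whose posterior probabilities reach `coverage`."""
    chosen, cumulative = [], 0.0
    for variant, pip in sorted(pips.items(), key=lambda item: item[1], reverse=True):
        chosen.append(variant)
        cumulative += pip
        if cumulative >= coverage:
            break
    return chosen


if __name__ == "__main__":
    # Hypothetical locus with three variants: (effect estimate, standard error).
    locus = {"rs_a": (0.30, 0.05), "rs_b": (0.10, 0.05), "rs_c": (0.02, 0.05)}
    pips = posterior_inclusion(locus)
    print(pips)
    print(credible_set(pips))
```

When one variant has much stronger evidence than its neighbours, the credible set shrinks to that variant alone. In practice correlated variants share evidence through LD, which is why ancestry-matched reference panels matter for resolution.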

Inclusion of tandem repeats and structural variants as exposure classes for GWAS and QTL mapping is increasing but remains limited by detection accuracy, reference panel completeness, and complex LD structure.

cis-QTL colocalization and TWAS approaches have made progress in associating regulatory variation with disease traits, providing functional anchor points for otherwise ambiguous GWAS loci. Massively parallel reporter assays enable direct functional screening of common variants at scale, although cell-type context and experimental throughput remain limiting factors.

Implications and Future Directions

Theoretical and Practical Implications

The observation that only a minority of risk genes are detectable under prevailing methods, and that the overlap between rare and common variant gene sets is low, points to intrinsic limitations of the classic gene discovery toolkit. The review asserts that variants in the intermediate frequency space are not depleted due to evolutionary loss or biological irrelevance but are systematically under-detected due to statistical and methodological constraints.

The consequences for disease risk modeling, heritability estimates, and therapeutic targeting are significant. Intermediate-frequency variants, being more likely to segregate within pedigrees and define shared heritability within populations, are crucial for refining the genetic architecture underlying most disorders. Precision medicine initiatives must incorporate these classes into both risk prediction and gene-based therapeutic strategies.

Methodological Opportunities

Expansion of cohort sizes alone will yield sublinear improvements for intermediate frequency variants; methodological innovation is essential.
Integrative frameworks uniting rare, intermediate, and common variations on a continuous penetrance axis, especially utilizing biobank-scale EHR-linked datasets, offer a pathway toward resolving the "missing middle".
Experimental validation (deep mutational scanning, MPRA) and high-resolution molecular phenotyping will be central for functional interpretation, particularly for non-classical variant classes.
Analytical models explicitly incorporating polygenic background, environment, and dynamic phenotype (beyond case-control status) are likely to provide gains in both detection power and biological insight.
Increasing diversity in reference panels, especially for LD and QTL mapping in non-European populations, will reduce current ancestry-driven biases and improve fine-mapping precision.

Future Developments

Improvements in sequencing technology (especially long-read), broader multi-omics profiling, and statistical frameworks capable of integrating complex inheritance (e.g., parent-of-origin, mosaicism, multi-locus models) will define the next advancements in risk gene mapping. Treating penetrance as a continuous, quantifiable trait rather than a binary category enables identification of genetic, environmental, and demographic modifiers of risk and offers a foundation for improved individualized disease modeling.

Conclusion

The deceleration in new risk gene discovery is not due to biological depletion but reflects the limits of currently applied statistical frameworks, which have reached saturation at the extremes of allele frequency and effect size. The systematic underdetection of intermediate frequency and moderate effect variants constitutes a critical blind spot in ongoing work. Methodological refinements in joint modeling, phenotype quantification, and integrative annotation are required to fill this gap. Progress on these fronts will deepen our understanding of genomic risk architecture, refine disease models, and inform future translational efforts in both diagnostics and therapeutics.
