Structure-Agnostic Screening
- Structure-agnostic screening is a computational approach that selects chemical, material, and biological candidates without relying on fixed atomic structures, using descriptors and surrogate models.
- It integrates machine learning, cheminformatics, and physical modeling to systematically explore high-dimensional compositional and configurational spaces with minimal human bias.
- This methodology enables rapid virtual screening, unbiased candidate selection, and ensemble generation for high-fidelity re-ranking in various discovery contexts.
Structure-agnostic screening encompasses computational methodologies that enable high-throughput, hypothesis-free selection or ranking of candidates in chemical, materials, and biological discovery without reference to a fixed or known atomistic structure. These approaches bypass a priori assumptions about binding modes, geometry, or atomic coordinates, enabling the systematic exploration of compositional or configurational space even when structural information is incomplete, noisy, or unknown. Incorporating algorithms from cheminformatics, machine learning, structural bioinformatics, and physical modeling, structure-agnostic screening has become a critical paradigm in virtual screening, materials discovery, and high-dimensional data analysis.
1. Conceptual Foundations and General Definitions
Structure-agnostic screening is defined by the absence of constraints or user-imposed bias regarding atom arrangements, bonding topology, or predefined geometric motifs. In this context, candidates (molecules, crystal structures, variable features, or chemical compositions) are assessed solely on the basis of descriptors, model-derived features, or rapidly computed surrogate objectives. In contrast to traditional workflows that require curated initial structures, symmetry assignments, or manually assigned variables, structure-agnostic methods rely on unbiased enumeration, randomized perturbation, or composition-based representations to generate candidate sets and employ automated, data-driven metrics for selection or prioritization (Ota et al., 2016; Tokuyama et al., 23 Dec 2025; Vinogradov et al., 2024; Cheng et al., 2013; Yang et al., 2019; Zhu et al., 6 Jun 2025; Kameda et al., 2024).
This agnosticism confers several advantages: it enables exploration of large or ambiguous search spaces, facilitates automation, reduces human bias, and improves applicability to datasets or scenarios where ground-truth structures are unreliable or unavailable.
2. Structure-Agnostic Screening in Molecular and Reaction Discovery
For chemical structure and pathway searches, automated, structure-agnostic workflows have been developed to identify relevant minima, candidate intermediates, or binding motifs without recourse to pre-enumerated conformers or human-guessed initial geometries. A prototypical approach is the two-step structural-screening method coupling Global Reaction Route Mapping (GRRM) with semiempirical force calculations via MOPAC (Ota et al., 2016). This process operates as follows:
- Unbiased Pathway Generation: GRRM's Artificial Force Induced Reaction (AFIR) module applies random perturbations and artificial association forces to an initial ensemble of fragments or reactants (e.g., cation + cellulose), systematically sampling configurations.
- Rapid Screening with Surrogate Model: Semiempirical methods (e.g., PM6, PM3 in MOPAC) supply atomic gradients and energies at each step, enabling high-throughput descent to local energy minima.
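- Mathematical Energy Filtering: Minima are filtered by the rule $E_i - E_{\min} \le \Delta E$, where $E_i$ is the surrogate energy of minimum $i$, $E_{\min}$ is the lowest energy found, and $\Delta E$ is a user-defined window (e.g., 10–20 kcal/mol). Only candidates within this window are retained for more expensive downstream calculations.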
- Final High-Fidelity Re-Ranking: Shortlisted structures undergo full ab initio or DFT geometry optimization and property evaluation.
- Ensemble Generation: The protocol yields an unbiased library of low-lying, structurally diverse candidates, suitable for thermodynamic analysis or for further reactivity screening.
This approach eliminates the need for guesswork regarding initial binding geometries, enabling uniform and systematic exploration of possible molecular interactions and configurations (Ota et al., 2016).
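The energy-window step of this workflow can be sketched in a few lines. This is a minimal illustration, assuming the window is measured relative to the lowest surrogate energy in the set; the `Minimum` record, the labels, and the energies are hypothetical and do not come from GRRM or MOPAC output.

```python
from typing import NamedTuple


class Minimum(NamedTuple):
    label: str     # hypothetical identifier for a located local minimum
    energy: float  # surrogate (e.g., semiempirical) energy in kcal/mol


def filter_by_energy_window(minima: list[Minimum], delta_e: float) -> list[Minimum]:
    """Keep minima whose energy lies within delta_e of the lowest one."""
    if not minima:
        return []
    e_min = min(m.energy for m in minima)
    kept = [m for m in minima if m.energy - e_min <= delta_e]
    # Sort so the shortlist is ordered from most to least stable.
    return sorted(kept, key=lambda m: m.energy)


# Illustrative values only.
minima = [
    Minimum("A", -120.0),
    Minimum("B", -112.5),
    Minimum("C", -95.0),
    Minimum("D", -118.2),
]
shortlist = filter_by_energy_window(minima, delta_e=10.0)
print([m.label for m in shortlist])  # ['A', 'D', 'B']
```

Only the shortlisted entries would then be passed to the more expensive ab initio or DFT stage.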
3. Composition- and Descriptor-Based Structure-Agnostic Screening
In materials and molecule discovery, embedding structure-agnostic principles into machine learning and data mining frameworks allows for highly scalable candidate identification without explicit atomistic modeling.
Composition-based primary screening: The screening of superconducting ternary hydrides under high pressure illustrates this paradigm (Tokuyama et al., 23 Dec 2025). An ensemble of 30 XGBoost models, trained on curated datasets comprising ∼2000 binary and ternary hydrides and using a filtered set of ∼69 compositional descriptors, predicts superconducting transition temperatures directly from elemental statistics (e.g., atomic radius, ionization energy) and synthesis pressure. The ensemble mean, the sample standard deviation, and a rank-ordering by the lower bound of a 95% confidence interval together yield a statistically robust ranking of compositions. This approach enables rapid identification of promising chemical spaces, agnostic to underlying crystal structures, for subsequent computational or experimental validation.
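Ranking by a lower confidence bound can be illustrated as follows. The exact interval used in the cited work is not given here, so this sketch assumes a normal-approximation bound on the ensemble mean, i.e. mean minus 1.96 standard errors; the composition names and predicted values are hypothetical.

```python
import math
import statistics


def lower_confidence_bound(predictions: list[float], z: float = 1.96) -> float:
    """Lower bound of a normal-approximation 95% CI on the ensemble mean."""
    mean = statistics.mean(predictions)
    std = statistics.stdev(predictions)  # sample standard deviation (n - 1)
    return mean - z * std / math.sqrt(len(predictions))


def rank_compositions(ensemble_preds: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Sort compositions from highest to lowest lower confidence bound."""
    scored = [(name, lower_confidence_bound(p)) for name, p in ensemble_preds.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical per-model predictions (in K) for three compositions.
ensemble_preds = {
    "comp_A": [200.0, 210.0, 190.0, 205.0],
    "comp_B": [220.0, 120.0, 260.0, 180.0],  # high mean, but models disagree
    "comp_C": [150.0, 152.0, 149.0, 151.0],
}
ranking = rank_compositions(ensemble_preds)
print([name for name, _ in ranking])  # ['comp_A', 'comp_C', 'comp_B']
```

In this toy case `comp_B` has the second-highest mean but falls to last place because its predictions are widely scattered, which is the behaviour a lower-bound ranking is meant to produce.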
Soft bond valence, descriptor-driven cell construction: Screening of ionic conductor materials can be carried out using parameterized surrogates such as the Soft Bond Valence (SoftBV) approach, where descriptors specific to each cation–anion combination (e.g., bond length at unit valence, softness parameter, coordination number, ionic radius) inform a machine-learned surrogate for the screening factor. The most effective instantiation, the GPR-NN model, predicts the optimal screening factor for a given composition, enabling full structure relaxation and evaluation of properties (e.g., lattice constants, global instability index) prior to any quantum-level calculation. Nonlinear dependence on descriptor features, more so than multi-feature coupling, is found to be essential for accuracy at the 1% error level (Kameda et al., 2024).
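For orientation, the quantities behind the global instability index can be computed with the standard bond valence expression $s = \exp((R_0 - R)/b)$. The sketch below does not reproduce the SoftBV force field, the screening factor, or the GPR-NN surrogate; it only shows how bond valence sums and the global instability index are evaluated, and all parameter values and distances are illustrative assumptions.

```python
import math


def bond_valence(r: float, r0: float, b: float) -> float:
    """Bond valence s = exp((R0 - R) / b) for one cation-anion contact."""
    return math.exp((r0 - r) / b)


def bond_valence_sum(distances: list[float], r0: float, b: float) -> float:
    """Sum of bond valences over all contacts of one atom."""
    return sum(bond_valence(r, r0, b) for r in distances)


def global_instability_index(bvs_values: list[float], formal_valences: list[float]) -> float:
    """Root-mean-square deviation of bond valence sums from formal valences."""
    squared = [(v - f) ** 2 for v, f in zip(bvs_values, formal_valences)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative parameters: r0 and b are placeholders, not tabulated values.
r0, b = 2.0, 0.37
ideal = bond_valence_sum([2.0, 2.0], r0, b)      # two bonds at R0 give a sum of 2
stretched = bond_valence_sum([2.1, 2.1], r0, b)  # longer bonds carry less valence
gii = global_instability_index([ideal, stretched], [2.0, 2.0])
print(round(ideal, 3), round(stretched, 3), round(gii, 3))
```

A relaxed cell with a small global instability index is one in which the bond valence sums stay close to the formal oxidation states.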
4. Structure-Agnostic Model Screening in Materials Characterization
The structure-mining approach (Yang et al., 2019) operationalizes structure-agnostic screening by automating the process of model selection against experimental atomic pair distribution function (PDF) data:
- Database Query: Candidate structures are retrieved from external repositories (e.g., the Materials Project Database, the Crystallography Open Database) based on minimal chemical heuristics.
- Automated Refinement: Each model is subjected to staged, automated Levenberg–Marquardt optimization in DiffPy-CMI, refining parameters such as scale, lattice constants, isotropic atomic displacement parameters, and, for nanostructures, particle diameters.
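- Objective Function: The weighted residual factor $R_w$ quantifies fit quality without user intervention: $R_w = \sqrt{\dfrac{\sum_i w_i \left[ G_{\mathrm{obs}}(r_i) - G_{\mathrm{calc}}(r_i; P) \right]^2}{\sum_i w_i \, G_{\mathrm{obs}}(r_i)^2}}$, where $G_{\mathrm{obs}}$ and $G_{\mathrm{calc}}$ are the observed and calculated PDFs, $w_i$ are the point weights, and $P$ is the set of refined parameters.
- Ranking and Interpretation: Refined models are ranked by $R_w$, enabling high-throughput, hypothesis-free identification of plausible structural motifs, including nanocrystals, low-symmetry or distorted structures, and previously unreported phases.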
This approach does not depend on refining atomic coordinates, and its staged parameterization allows even multi-phase or doped samples to be analyzed without explicit structural assignments (Yang et al., 2019).
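The ranking step can be sketched independently of any refinement engine. The example below assumes each candidate model has already been refined and its calculated PDF sampled on the same grid as the data; it uses the standard weighted-residual definition with unit weights by default. The model names and PDF values are hypothetical, and no DiffPy-CMI calls are shown.

```python
import math


def weighted_residual(g_obs, g_calc, weights=None):
    """R_w = sqrt( sum w (G_obs - G_calc)^2 / sum w G_obs^2 )."""
    if weights is None:
        weights = [1.0] * len(g_obs)
    num = sum(w * (o - c) ** 2 for w, o, c in zip(weights, g_obs, g_calc))
    den = sum(w * o ** 2 for w, o in zip(weights, g_obs))
    return math.sqrt(num / den)


def rank_models(g_obs, candidates):
    """Return (name, R_w) pairs sorted from best (lowest R_w) to worst."""
    scored = [(name, weighted_residual(g_obs, g_calc)) for name, g_calc in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


# Toy PDFs on a shared four-point grid.
g_obs = [0.0, 1.0, -0.5, 0.8]
candidates = {
    "model_a": [0.0, 0.9, -0.4, 0.8],  # close to the data
    "model_b": [0.5, 0.2, 0.1, 0.0],   # poor match
}
model_ranking = rank_models(g_obs, candidates)
print([name for name, _ in model_ranking])  # ['model_a', 'model_b']
```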
5. Virtual Screening and Target-/Pocket-Agnostic Molecular Search
Recent advances in cheminformatics and protein–ligand modeling have produced approaches where both ligand-centric and protein-centric structure assumptions are relaxed.
- Potency-focused, structure-agnostic molecular screening: The BIOPTIC B1 platform (Vinogradov et al., 2024) uses a RoBERTa-style transformer to build a 60-dimensional, potency-preserving embedding from SMILES strings. A single, target-agnostic encoder allows exhaustive cosine-similarity search over 40 billion molecules (Enamine REAL library) without access to target protein structures. The approach outperforms classical fingerprint and GNN-based models on novel, scaffold-dissimilar actives, achieving ROC AUCs up to 82.5 on challenging benchmarks; because retrieval is brute-force and SIMD-accelerated rather than approximate, it attains 100% recall.
- Structure-agnostic virtual screening with ambiguous protein data: AANet (Zhu et al., 6 Jun 2025) disentangles ligand–pocket recognition from the need for high-quality holo structures by employing tri-modal contrastive alignment across ligand, true holo pocket, and detected candidate pockets from experimental or predicted (AlphaFold2) apo structures. Soft attention is used to aggregate candidate cavity embeddings, allowing the model to rank ligands in a fully pocket-agnostic fashion. On blind apo benchmarks, AANet's EF1% (the early enrichment factor for actives in the top 1% of the ranked list) reaches up to 37.2, compared with 11.8 for the state-of-the-art DrugCLIP, indicating strong performance under high uncertainty and structural ambiguity.
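The early enrichment factor quoted above follows the usual definition: the fraction of actives in the top-ranked slice divided by the fraction of actives in the whole library. A minimal sketch, with a hypothetical library of 1000 compounds and made-up scores:

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """EF at the given fraction: active rate in the top slice over the overall active rate."""
    n_total = len(scores)
    n_top = max(1, int(round(n_total * fraction)))
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    # Rank compound indices from best to worst score.
    order = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    hits_top = sum(is_active[i] for i in order[:n_top])
    return hits_top * n_total / (n_top * total_actives)


# Hypothetical screen: 1000 compounds, 10 actives, 4 of them ranked in the top 10.
n = 1000
scores = [float(n - i) for i in range(n)]  # index 0 has the best score
active_idx = {0, 1, 2, 3, 500, 501, 502, 503, 504, 505}
is_active = [i in active_idx for i in range(n)]
ef1 = enrichment_factor(scores, is_active)
print(ef1)  # 40.0
```

A random ranking gives an enrichment factor near 1, so values in the tens indicate that actives are strongly concentrated at the top of the list.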
A key advantage in both domains is that structural input, in the traditional sense of 3D atomic coordinates, is replaced or supplemented by learned embeddings, detected geometric features, or compositional fingerprints.
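Exhaustive similarity search over learned embeddings can be sketched as below. The encoder itself is not reproduced: the three-dimensional vectors and molecule names are toy placeholders standing in for learned embeddings, and a production system would use vectorized or SIMD kernels rather than pure Python loops.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_top_k(query, library, k):
    """Score every library embedding against the query and keep the k best."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in library.items()
    )
    return heapq.nlargest(k, scored)


# Toy embeddings (hypothetical).
library = {
    "mol_1": [1.0, 0.0, 0.0],
    "mol_2": [0.9, 0.1, 0.0],
    "mol_3": [0.0, 1.0, 0.0],
    "mol_4": [-1.0, 0.0, 0.0],
}
hits = cosine_top_k([1.0, 0.05, 0.0], library, k=2)
print([name for _, name in hits])  # ['mol_1', 'mol_2']
```

Because every library entry is scored, no true nearest neighbour can be missed, which is the sense in which brute-force retrieval gives complete recall relative to approximate indexes.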
6. Structure-Agnostic Screening in High-Dimensional Data Analysis
Structure-agnostic principles extend into high-dimensional covariate screening in statistical modeling. The nonparametric independence screening methodology in ultra-high-dimensional longitudinal analysis (Cheng et al., 2013) demonstrates:
- Marginal Model Construction: For each covariate, a B-spline marginal varying-coefficient model is fitted under working independence, irrespective of the true internal coefficient structure (constant, linear, or nonparametric).
- Screening Statistic: The empirical energy of each marginal fit is calculated and compared against a data-driven threshold to select significant predictors.
- Theoretical Guarantees: Under weak assumptions, with probability tending to 1, all relevant variables are retained while reducing the predictor set to a tractable subset, agnostic to the nature or smoothness of their contribution.
This decoupling of the variable selection process from strong modeling assumptions exemplifies the generality of structure-agnostic screening.
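The marginal-fit-then-threshold pattern can be illustrated with a deliberately simplified stand-in. In this sketch a univariate least-squares line replaces the B-spline varying-coefficient fit, the threshold is fixed by hand rather than data-driven, and the design is a deterministic toy (mutually orthogonal sinusoids) so the outcome is reproducible. None of these choices come from the cited method.

```python
import math
import statistics


def marginal_energy(x, y):
    """Mean squared centred fitted value from a univariate least-squares fit."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return sum((slope * (xi - x_bar)) ** 2 for xi in x) / len(x)


def screen(columns, y, threshold):
    """Fit each covariate on its own and keep those whose energy passes the threshold."""
    energies = {name: marginal_energy(col, y) for name, col in columns.items()}
    selected = sorted(name for name, e in energies.items() if e >= threshold)
    return selected, energies


# Toy design: 20 orthogonal sinusoidal covariates, of which only x0 and x7 drive y.
n = 200
columns = {
    f"x{j}": [math.sin(2.0 * math.pi * (j + 1) * i / n) for i in range(n)]
    for j in range(20)
}
y = [2.0 * a - 1.5 * b for a, b in zip(columns["x0"], columns["x7"])]
selected, energies = screen(columns, y, threshold=0.5)
print(selected)  # ['x0', 'x7']
```

Each covariate is assessed on its own, so the cost grows linearly with the number of predictors, which is what makes this kind of screening usable in ultra-high dimensions.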
7. Limitations, Applicability, and Outlook
While structure-agnostic screening offers substantial flexibility and efficiency, it has inherent limitations and domains of optimal applicability:
- It is best suited to large problem spaces where a priori structural hypotheses are impractical or unavailable, e.g., novel organic materials, extreme composition spaces, intractable molecular libraries, or ambiguous experimental datasets.
- The quality of downstream results depends on the adequacy of surrogate models, the breadth and bias of databases or training data, and the reliability of descriptors or proxies (e.g., soft bond valence parameters, compositional statistics).
- For problems involving transition-metal complexes, transition-state (TS) searches, or cases requiring explicit consideration of electronic structure, direct ab initio or DFT computation, manual curation, or combined approaches may be required (Ota et al., 2016).
- Current approaches may be limited by database coverage, absence of correlated disorder, and the neglect of local distortions unless further post-processing is incorporated (Yang et al., 2019).
A plausible implication is that, as data availability and surrogate accuracy continue to increase, the role of structure-agnostic screening will expand further into automated discovery frameworks, real-time experimental feedback, and integration with generative design pipelines. In summary, structure-agnostic screening constitutes a rigorous, automated, and generalizable approach for hypothesis-free candidate selection across chemistry, materials science, and data-driven modeling.