A20/E17 Molecular Graph Features
- A20/E17 features are a domain-informed encoding combining 20 atom-level and 17 bond-level descriptors to capture chemical structure and stereochemistry.
- They integrate key properties like element identity, bond order, aromaticity, and ring membership to support advanced GNN architectures such as GINE and PNA.
- Performance gains of 10–12% VP MSE and 5–8% OP MSE under scaffold-split settings demonstrate improved out-of-distribution generalization.
A20/E17 molecular graph features define a high-fidelity, domain-informed encoding for vertices and edges in molecular graphs, optimized for graph neural network (GNN) architectures targeting robust chemical property prediction under strict out-of-distribution (OOD) regimes. The A20 set incorporates 20-dimensional atom-level descriptors reflecting element identity, connectivity, electronic structure, and stereochemistry, while the E17 set specifies 17-dimensional bond-level descriptors integrating bond order, conjugation, ring status, stereochemistry, and ring-size indicators. This feature regime has demonstrated substantial improvements in property regression and multitask settings over lighter featurizations, with superior scaffold-split generalization, especially when integrated into advanced GNN layers such as GINE and PNA (Wu et al., 23 Jan 2026).
1. Formal Specification of A20/E17 Feature Sets
The A20 atom-level feature vector encodes:
- Element identity (10-dim one-hot): a fixed element vocabulary; rare or unsupported elements map to "other".
- Degree (1-dim scalar): Graph-theoretic degree, clipped to a fixed range, then standardized.
- Formal charge (1-dim scalar): Integer charge, clipped to a fixed range.
- Hybridization (4-dim one-hot): one slot per hybridization class.
- Aromaticity (1-dim binary): 1 if atom is aromatic.
- Ring membership (1-dim binary): 1 if atom is in any ring.
- Total hydrogen count (1-dim scalar): Sum of explicit and implicit hydrogens, clipped and standardized.
- Chirality center (1-dim binary): 1 if atom is a stereocenter.
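The layout of the atom vector can be illustrated with a minimal sketch. The element and hybridization vocabularies below are hypothetical placeholders (the text fixes only their sizes, 10 and 4), and `AtomRecord`, `one_hot`, and `a20_vector` are illustrative names, not part of any published implementation. Scalar fields are assumed to arrive already clipped and standardized.

```python
from dataclasses import dataclass

# Hypothetical vocabularies: only the slot counts (10 and 4) come from the text.
ELEMENTS = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "other"]
HYBRIDIZATIONS = ["SP", "SP2", "SP3", "other"]


@dataclass
class AtomRecord:
    """Raw per-atom values, e.g. as read from a sanitized molecular graph."""
    symbol: str
    degree: float          # assumed already clipped and standardized
    formal_charge: float   # assumed already clipped
    hybridization: str
    is_aromatic: bool
    in_ring: bool
    total_hs: float        # assumed already clipped and standardized
    is_stereocenter: bool


def one_hot(value, vocabulary):
    """One-hot encode `value`; unknown values fall into the last slot."""
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary) - 1
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def a20_vector(atom):
    """Concatenate the eight atom fields in the order listed above."""
    vec = (
        one_hot(atom.symbol, ELEMENTS)                 # 10 dims
        + [float(atom.degree)]                         # 1 dim
        + [float(atom.formal_charge)]                  # 1 dim
        + one_hot(atom.hybridization, HYBRIDIZATIONS)  # 4 dims
        + [float(atom.is_aromatic)]                    # 1 dim
        + [float(atom.in_ring)]                        # 1 dim
        + [float(atom.total_hs)]                       # 1 dim
        + [float(atom.is_stereocenter)]                # 1 dim
    )
    assert len(vec) == 20
    return vec


aromatic_carbon = AtomRecord("C", 0.5, 0.0, "SP2", True, True, -0.2, False)
selenium = AtomRecord("Se", 0.0, 0.0, "SP3", False, False, 0.0, False)

carbon_vec = a20_vector(aromatic_carbon)
selenium_vec = a20_vector(selenium)   # "Se" is outside the toy vocabulary
print(len(carbon_vec))                # 20
```

The field widths sum to 10 + 1 + 1 + 4 + 1 + 1 + 1 + 1 = 20, which is where the name A20 comes from.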
The E17 bond-level feature vector encodes:
- Bond order (4-dim one-hot): one slot per bond-order class.
- Conjugation (1-dim binary): RDKit conjugation flag.
- Ring membership (1-dim binary): 1 if bond is in any ring.
- Stereochemistry (6-dim one-hot): one slot per bond-stereo class.
- Ring-size indicators (5-dim multi-hot): one slot per tracked ring size; a bond shared by fused rings sets several slots at once.
All features are computed from sanitized RDKit molecular graphs. Scalar fields are standardized with the mean and variance estimated on the training data, categorical fields are one-hot encoded, and binary flags are stored as 0/1.
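A corresponding sketch for the bond vector follows. The class lists are assumptions: the four bond orders and six stereo classes are a guess modeled on RDKit's bond-type and bond-stereo enumerations, and the five ring sizes (3 to 7) are purely hypothetical, since the text states only the slot counts. The function names are illustrative.

```python
# Hypothetical class lists: only the slot counts (4, 6, 5) come from the text.
BOND_ORDERS = ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]
STEREO_CLASSES = ["NONE", "ANY", "Z", "E", "CIS", "TRANS"]
RING_SIZES = [3, 4, 5, 6, 7]


def one_hot(value, vocabulary):
    """Strict one-hot; raises ValueError for values outside the vocabulary."""
    index = vocabulary.index(value)
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def multi_hot(values, vocabulary):
    """Set one slot per listed value; values outside the vocabulary are ignored."""
    present = set(values)
    return [1.0 if item in present else 0.0 for item in vocabulary]


def e17_vector(order, conjugated, in_ring, stereo, ring_sizes):
    """Concatenate the five bond fields in the order listed above."""
    vec = (
        one_hot(order, BOND_ORDERS)          # 4 dims
        + [float(conjugated)]                # 1 dim
        + [float(in_ring)]                   # 1 dim
        + one_hot(stereo, STEREO_CLASSES)    # 6 dims
        + multi_hot(ring_sizes, RING_SIZES)  # 5 dims
    )
    assert len(vec) == 17
    return vec


# A bond shared by a fused five- and six-membered ring sets two ring-size slots.
fused = e17_vector("AROMATIC", True, True, "NONE", ring_sizes=[5, 6])
print(fused[12:])  # [0.0, 0.0, 1.0, 1.0, 0.0]
```

The multi-hot ring-size block is the only field in which more than one slot can be active, which is how a single bond can report membership in several fused rings.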
2. Preprocessing and Embedding Protocol
Molecules are processed as follows:
- SMILES normalization and removal of explicit hydrogens (RDKit sanitization).
- Extraction of all per-atom and per-bond feature values.
- Clipping and standardization of scalar atom descriptors (degree, formal charge, hydrogen count), and mapping of categorical values to feature bins prior to one-hot encoding.
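- Concatenation of categorical, scalar, and binary attributes into an atom vector $x_v \in \mathbb{R}^{20}$ and a bond vector $e_{uv} \in \mathbb{R}^{17}$ for all atoms $v$ and bonds $(u, v)$.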
For message-passing GNN architectures, initial node embeddings are assigned from the A20 atom vectors, with the E17 edge features available to edge-MLPs.
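The clip-then-standardize step can be sketched as follows. The key point is that the mean and standard deviation are estimated on training data only and then reused unchanged for validation and test molecules. `ClippedStandardizer` is a hypothetical helper, and the clip range `[0, 4]` for degree is an assumption chosen for illustration, not a value taken from the source.

```python
from statistics import fmean, pstdev


def clip(value, low, high):
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


class ClippedStandardizer:
    """Clip to [low, high], then z-score with statistics from training data."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.mean, self.std = 0.0, 1.0

    def fit(self, train_values):
        clipped = [clip(v, self.low, self.high) for v in train_values]
        self.mean = fmean(clipped)
        # Fall back to 1.0 so a constant column does not divide by zero.
        self.std = pstdev(clipped) or 1.0
        return self

    def transform(self, values):
        return [(clip(v, self.low, self.high) - self.mean) / self.std for v in values]


# Hypothetical clip range for atom degree; fit on training atoms only.
degree_scaler = ClippedStandardizer(low=0, high=4).fit([1, 2, 3, 4, 9])

# Held-out atoms reuse the training statistics; 7 is clipped to 4 first.
test_degrees = degree_scaler.transform([2, 7])
```

Fitting the statistics on the training split alone keeps information about held-out scaffolds out of the features, which matters for the scaffold-split evaluation described later.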
3. Integration into Graph Neural Network Architectures
A20/E17 features are directly injected into GNN layers:
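- GINE: Each GINE layer executes an update of the standard GINE form
  $$h_v^{(\ell+1)} = \mathrm{MLP}^{(\ell)}\Big((1+\epsilon^{(\ell)})\,h_v^{(\ell)} + \sum_{u \in \mathcal{N}(v)} \mathrm{ReLU}\big(h_u^{(\ell)} + \phi^{(\ell)}(e_{uv})\big)\Big),$$
  where $h_v^{(\ell)}$ is the state of atom $v$ at layer $\ell$, $e_{uv}$ is the E17 vector of bond $(u, v)$, and $\phi^{(\ell)}$ is an edge-MLP that projects it to the hidden width, giving edge-modified neighbor aggregation.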
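- PNA: Each PNA layer computes, in the standard PNA form,
  $$m_v^{(\ell)} = \bigoplus_{u \in \mathcal{N}(v)} M^{(\ell)}\big(h_v^{(\ell)}, h_u^{(\ell)}, e_{uv}\big), \qquad \bigoplus = \begin{bmatrix} 1 \\ S(d_v, 1) \\ S(d_v, -1) \end{bmatrix} \otimes \begin{bmatrix} \mathrm{mean} \\ \mathrm{std} \\ \max \\ \min \end{bmatrix}, \qquad S(d, \alpha) = \left(\frac{\log(d+1)}{\delta}\right)^{\alpha},$$
  where $d_v$ is the degree of atom $v$ and $\delta$ is the mean of $\log(d+1)$ over training nodes, followed by concatenation with the center node and an MLP update, enabling multi-statistic, degree-aware aggregation.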
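As an illustration of the GINE bullet above, here is a minimal pure-Python sketch of one update in the commonly published GINE form: each neighbor message is `ReLU(h_u + proj(e_uv))`, the messages are summed with `(1 + eps) * h_v`, and an MLP is applied to the result. The function names, the dictionary-based graph layout, and the toy 3-dimensional edge features are assumptions for readability; a real model would use 17-dimensional edge vectors and a tensor library.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def gine_layer(h, edges, edge_feats, edge_proj, mlp, eps=0.0):
    """One GINE-style update.

    h:          {node: hidden vector}
    edges:      directed (u, v) pairs, meaning a message flows from u to v
    edge_feats: {(u, v): edge feature vector}
    edge_proj:  matrix mapping edge features to the hidden width
    mlp:        callable applied to the aggregated vector
    """
    out = {}
    for v, h_v in h.items():
        agg = [(1.0 + eps) * x for x in h_v]
        for (u, w) in edges:
            if w == v:
                msg = relu(add(h[u], matvec(edge_proj, edge_feats[(u, w)])))
                agg = add(agg, msg)
        out[v] = mlp(agg)
    return out


# Toy graph: two atoms joined by one bond, stored as two directed edges.
h0 = {0: [1.0, 0.0], 1: [0.0, 2.0]}
edges = [(0, 1), (1, 0)]
edge_feats = {(0, 1): [1.0, 0.0, 0.0], (1, 0): [1.0, 0.0, 0.0]}
edge_proj = [[0.5, 0.0, 0.0], [-3.0, 0.0, 0.0]]   # 3-dim edge -> 2-dim hidden

h1 = gine_layer(h0, edges, edge_feats, edge_proj, mlp=lambda x: x)
print(h1)  # {0: [1.5, 0.0], 1: [1.5, 2.0]}
```

Because the edge projection is added before the ReLU, the bond features can gate which components of a neighbor's state pass through, which is the sense in which the aggregation is edge-modified.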
The message functions are configured to process both node-state and edge features, and the edge-MLPs consume the full E17 vectors.
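The multi-statistic, degree-aware aggregation of PNA can be sketched in the same spirit. The version below follows the commonly published PNA recipe: four aggregators (mean, max, min, standard deviation) combined with three degree scalers (identity, amplification, attenuation). It covers only the aggregation step and assumes the per-neighbor messages have already been computed from node states and E17 vectors. `pna_aggregate` and its argument names are illustrative.

```python
import math
from statistics import fmean, pstdev


def pna_aggregate(messages, delta):
    """Combine incoming messages with 4 aggregators x 3 degree scalers.

    messages: non-empty list of equal-length message vectors for one node
    delta:    mean of log(degree + 1) over training nodes
    Returns a vector of length 12 * len(messages[0]).
    """
    columns = list(zip(*messages))
    stats = (
        [fmean(c) for c in columns]      # mean
        + [max(c) for c in columns]      # max
        + [min(c) for c in columns]      # min
        + [pstdev(c) for c in columns]   # population standard deviation
    )
    log_deg = math.log(len(messages) + 1)
    # identity, amplification, attenuation
    scalers = [1.0, log_deg / delta, delta / log_deg]
    return [s * x for s in scalers for x in stats]


# Two incoming messages of width 2; delta chosen so all scalers equal 1.
incoming = [[1.0, 4.0], [3.0, 0.0]]
aggregated = pna_aggregate(incoming, delta=math.log(3))
print(len(aggregated))  # 24
```

In a full layer, the aggregated vector would be concatenated with the center node's state and passed through an MLP, as described above.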
4. Comparative Performance and Ablation
A20/E17 features deliver measurable gains:
| Model | VP MSE (↓) | OP MSE (↓) |
|---|---|---|
| GINE + light (e4/e6) | 0.255 ± 0.008 | 0.670 ± 0.022 |
| GINE + A20/E17 | 0.223 ± 0.006 | 0.612 ± 0.018 |
| PNA + light (e4/e6) | 0.236 ± 0.007 | 0.632 ± 0.021 |
| PNA + A20/E17 | 0.210 ± 0.005 | 0.598 ± 0.020 |
Under scaffold splits, A20/E17 reduces MSE by roughly 10–12% (VP) and 5–8% (OP) relative to the light featurization, indicating that richer descriptors support better generalization and OOD robustness (Wu et al., 23 Jan 2026).
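The relative reductions can be recomputed directly from the table means. The short sketch below does so; it uses the mean values only and ignores the reported ± spreads, so it is a sanity check rather than a significance test. The names `TABLE` and `relative_reduction` are illustrative.

```python
# (light-feature MSE, A20/E17 MSE), copied from the table means above.
TABLE = {
    ("GINE", "VP"): (0.255, 0.223),
    ("GINE", "OP"): (0.670, 0.612),
    ("PNA", "VP"): (0.236, 0.210),
    ("PNA", "OP"): (0.632, 0.598),
}


def relative_reduction(baseline, improved):
    """Percentage reduction in MSE relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline


reductions = {key: relative_reduction(*pair) for key, pair in TABLE.items()}
for (model, task), pct in reductions.items():
    print(f"{model} {task}: {pct:.1f}% lower MSE")
# GINE VP: 12.5% lower MSE
# GINE OP: 8.7% lower MSE
# PNA VP: 11.0% lower MSE
# PNA OP: 5.4% lower MSE
```

With the table means this gives about 12.5% and 11.0% for VP and about 8.7% and 5.4% for OP, broadly in line with the quoted ranges.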
5. Out-of-Distribution Diagnostics and OOD Generalization
OOD settings are characterized by ECFC4 Tanimoto similarity distributions (median ∼0.4, tail <0.3), scaffold-split evaluation, and similarity-binned residual analysis:
| MaxSim bin | [0, 0.3) | [0.3, 0.5) | [0.5, 0.7) | [0.7, 1.0] |
|---|---|---|---|---|
| ST-VP (PNA) | 0.324 | 0.241 | 0.194 | 0.162 |
| Safe-MT (PNA) | 0.305 | 0.232 | 0.190 | 0.165 |
Error grows smoothly as the maximum similarity to the training set decreases, with no abrupt jump in the lowest-similarity bin; parity plots reveal tighter alignment to ground truth for PNA+A20/E17 models. Flat residuals vs. similarity indicate that models extrapolate rather than memorize.
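The similarity-binned analysis can be sketched as follows: each test molecule is assigned its maximum Tanimoto similarity to any training molecule, placed into one of the four bins from the table, and the squared errors are averaged per bin. As a simplification, fingerprints are represented here as plain sets of feature identifiers (a binary Tanimoto), whereas count-based fingerprints would need a count-aware similarity. All function names and toy values are illustrative.

```python
BIN_EDGES = [0.0, 0.3, 0.5, 0.7, 1.0]  # bin boundaries from the table above


def tanimoto(a, b):
    """Binary Tanimoto similarity between two sets of feature identifiers."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_sim(query, train_fps):
    """Highest similarity between a query and any training fingerprint."""
    return max(tanimoto(query, fp) for fp in train_fps)


def bin_index(sim):
    """Index of the half-open bin containing sim; the last bin includes 1.0."""
    for i, upper in enumerate(BIN_EDGES[1:-1]):
        if sim < upper:
            return i
    return len(BIN_EDGES) - 2


def binned_mse(test_fps, train_fps, y_true, y_pred):
    """Mean squared error per MaxSim bin; None marks an empty bin."""
    n_bins = len(BIN_EDGES) - 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for fp, target, pred in zip(test_fps, y_true, y_pred):
        b = bin_index(max_sim(fp, train_fps))
        sums[b] += (target - pred) ** 2
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy data: one identical, one partially similar, one unrelated test molecule.
train_fps = [{1, 2, 3, 4}, {5, 6, 7, 8}]
test_fps = [{1, 2, 3, 4}, {1, 2, 9, 10}, {20, 21}]
per_bin = binned_mse(test_fps, train_fps, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(per_bin)  # [1.0, 0.25, None, 0.0]
```

Reading the per-bin errors from left to right shows how quickly accuracy deteriorates as test molecules move away from the training distribution.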
6. Comparison with Other Molecular-Graph Feature Regimes
A20/E17 provides:
- Explicit, compact, and chemically grounded descriptors exceeding the minimal four-dimensional light regimes (e.g., element, aromaticity, bond order, ring flag).
- Enhanced structural and stereoelectronic information compared with common RDKit atom/bond tables (Chang, 2019).
- Comprehensive coverage relative to surveyed standard feature strategies (atomic number, hybridization, bond type, stereochemistry, ring flags) (Guo et al., 2022).
- Efficient representation for task-optimized GNN and multitask pipelines, avoiding high-dimensionality and redundancy while retaining interpretability.
A plausible implication is that, for applications requiring strong OOD property prediction and chemical interpretability, A20/E17 features offer a pragmatic optimum between information richness and computational tractability.
7. Significance and Practical Impact
A20/E17 molecular graph features are presented as a robust default for molecular GNN applications that target physical properties across chemically diverse, OOD-heavy datasets. Integration into advanced message-passing schemes (GINE, PNA) yields consistent improvements in molecular property regression, multitask learning, and diagnostics. These gains are supported by reproducible experiments and scaffold-split error analyses, giving practitioners and methodologists a transferable template for chemically meaningful graph representations (Wu et al., 23 Jan 2026).