A20/E17 Molecular Graph Features

Updated 30 January 2026

A20/E17 features are a domain-informed encoding combining 20 atom-level and 17 bond-level descriptors to capture chemical structure and stereochemistry.
They integrate key properties like element identity, bond order, aromaticity, and ring membership to support advanced GNN architectures such as GINE and PNA.
Reductions of roughly 10–12% in VP MSE and 5–8% in OP MSE under scaffold-split settings indicate improved out-of-distribution generalization.

A20/E17 molecular graph features define a high-fidelity, domain-informed encoding for vertices and edges in molecular graphs, optimized for graph neural network (GNN) architectures targeting robust chemical property prediction under strict out-of-distribution (OOD) regimes. The A20 set incorporates 20-dimensional atom-level descriptors reflecting element identity, connectivity, electronic structure, and stereochemistry, while the E17 set specifies 17-dimensional bond-level descriptors integrating bond order, conjugation, ring status, stereochemistry, and ring-size indicators. This feature regime has demonstrated substantial improvements in property regression and multitask settings over lighter featurizations, with superior scaffold-split generalization, especially when integrated into advanced GNN layers such as GINE and PNA (Wu et al., 23 Jan 2026).

1. Formal Specification of A20/E17 Feature Sets

The A20 atom-level feature vector $\mathbf{x}_v \in \mathbb{R}^{20}$ encodes:

Element identity (10-dim one-hot): $\{\mathrm{C},\mathrm{N},\mathrm{O},\mathrm{F},\mathrm{Cl},\mathrm{Br},\mathrm{I},\mathrm{S},\mathrm{P},\textit{other}\}$; rare or unsupported elements map to "other".
Degree (1-dim scalar): Graph-theoretic degree, clipped to $\{0,1,2,3,4,5+\}$, then standardized.
Formal charge (1-dim scalar): Integer charge, clipped to $\{-2,-1,0,1,2\}$.
Hybridization (4-dim one-hot): $\{\mathrm{sp}, \mathrm{sp}^2, \mathrm{sp}^3, \textit{other}\}$.
Aromaticity (1-dim binary): 1 if atom is aromatic.
Ring membership (1-dim binary): 1 if atom is in any ring.
Total hydrogen count (1-dim scalar): Sum of explicit and implicit hydrogens, clipped and standardized.
Chirality center (1-dim binary): 1 if atom is a stereocenter.
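The layout above can be sketched as a plain-Python assembly of the 20 slots. This is a minimal illustration, not the authors' implementation: `AtomRecord`, `a20_raw`, and the helpers are hypothetical names, the hydrogen-count clip bound of 4 is an assumption (the source does not state it), and standardization of the scalar slots is omitted. In practice the fields would likely be read from RDKit atoms (for example `GetSymbol()`, `GetDegree()`, `GetFormalCharge()`, `GetHybridization()`, `GetIsAromatic()`, `IsInRing()`, `GetTotalNumHs()`).

```python
from dataclasses import dataclass

ELEMENTS = ["C", "N", "O", "F", "Cl", "Br", "I", "S", "P"]  # slot 9 = "other"
HYBRIDIZATIONS = ["sp", "sp2", "sp3"]                        # slot 3 = "other"


def one_hot_with_other(value, vocab):
    """One-hot over vocab plus a trailing 'other' slot for unlisted values."""
    vec = [0.0] * (len(vocab) + 1)
    idx = vocab.index(value) if value in vocab else len(vocab)
    vec[idx] = 1.0
    return vec


def clip(value, lo, hi):
    """Clamp value into the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


@dataclass
class AtomRecord:
    # Hypothetical container for values that would come from a sanitized molecule.
    symbol: str
    degree: int
    formal_charge: int
    hybridization: str
    is_aromatic: bool
    in_ring: bool
    total_hs: int
    is_stereocenter: bool


def a20_raw(atom):
    """Return the 20-dim atom vector before scalar standardization."""
    return (
        one_hot_with_other(atom.symbol, ELEMENTS)                # slots 0-9
        + [float(clip(atom.degree, 0, 5))]                       # slot 10
        + [float(clip(atom.formal_charge, -2, 2))]               # slot 11
        + one_hot_with_other(atom.hybridization, HYBRIDIZATIONS) # slots 12-15
        + [float(atom.is_aromatic)]                              # slot 16
        + [float(atom.in_ring)]                                  # slot 17
        + [float(clip(atom.total_hs, 0, 4))]                     # slot 18 (bound assumed)
        + [float(atom.is_stereocenter)]                          # slot 19
    )
```

The slot widths sum to 10 + 1 + 1 + 4 + 1 + 1 + 1 + 1 = 20, matching the stated dimensionality.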

The E17 bond-level feature vector $\mathbf{e}_{uv} \in \mathbb{R}^{17}$ encodes:

Bond order (4-dim one-hot): $\{\text{single}, \text{double}, \text{triple}, \text{aromatic}\}$.
Conjugation (1-dim binary): RDKit conjugation flag.
Ring membership (1-dim binary): 1 if bond is in any ring.
Stereochemistry (6-dim one-hot): $\{\mathrm{NONE}, \mathrm{ANY}, \mathrm{Z}, \mathrm{E}, \mathrm{CIS}, \mathrm{TRANS}\}$.
Ring-size indicators (5-dim multi-hot): Ring sizes $\{3,4,5,6,\geq7\}$; a bond shared by fused rings of different sizes sets one bit per ring size.
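A comparable sketch for the bond vector shows how the multi-hot ring-size block differs from the one-hot blocks. The function names and string labels are hypothetical, and the ring-membership flag is derived here from the ring-size list as a simplifying assumption. With RDKit, the inputs would likely come from `GetBondType()`, `GetIsConjugated()`, `IsInRing()`, `GetStereo()`, and `IsInRingSize(n)`.

```python
BOND_ORDERS = ["single", "double", "triple", "aromatic"]
STEREO = ["NONE", "ANY", "Z", "E", "CIS", "TRANS"]
RING_SIZES = [3, 4, 5, 6]  # a fifth slot covers rings of size >= 7


def strict_one_hot(value, vocab):
    """One-hot without an 'other' slot; raises ValueError for unknown values."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(value)] = 1.0
    return vec


def ring_size_multi_hot(ring_sizes):
    """Set one bit per distinct ring size the bond participates in."""
    vec = [0.0] * (len(RING_SIZES) + 1)
    for size in ring_sizes:
        if size >= 7:
            vec[-1] = 1.0
        elif size in RING_SIZES:
            vec[RING_SIZES.index(size)] = 1.0
    return vec


def e17(order, conjugated, ring_sizes, stereo="NONE"):
    """Return the 17-dim bond vector: 4 + 1 + 1 + 6 + 5 slots."""
    in_ring = len(ring_sizes) > 0
    return (
        strict_one_hot(order, BOND_ORDERS)   # slots 0-3
        + [float(conjugated)]                # slot 4
        + [float(in_ring)]                   # slot 5
        + strict_one_hot(stereo, STEREO)     # slots 6-11
        + ring_size_multi_hot(ring_sizes)    # slots 12-16
    )


# A fusion bond shared by a five- and a six-membered aromatic ring.
fusion_bond = e17("aromatic", conjugated=True, ring_sizes=[5, 6])
```

For the fusion bond, both the size-5 and size-6 bits are set, which a one-hot encoding could not express.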

All features are computed from sanitized RDKit graphs; scalar fields are standardized using the mean and variance of the training data, categorical fields are one-hot encoded, and binary flags are stored as 0/1.
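The standardization step can be illustrated with a short sketch that fits statistics on the training split only and reuses them for every other split. The function names are hypothetical, and the use of the population standard deviation plus a small floor `eps` are assumptions rather than details from the source.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values, lo, hi, eps=1e-8):
    """Clip training values to [lo, hi], then return (mean, std) of the clipped data."""
    clipped = [min(hi, max(lo, v)) for v in train_values]
    # Floor the std so constant features do not cause division by zero.
    return fmean(clipped), max(pstdev(clipped), eps)


def transform(value, lo, hi, mean, std):
    """Apply the same clipping, then standardize with training-set statistics."""
    return (min(hi, max(lo, value)) - mean) / std


# Degrees observed in a toy training split, clipped to the range 0..5.
mean, std = fit_standardizer([1, 2, 3, 4], lo=0, hi=5)
z_unseen = transform(9, 0, 5, mean, std)  # a test-time degree of 9 is clipped to 5 first
```

Fitting on training data alone keeps validation and test molecules from leaking into the feature scaling, which matters under scaffold splits.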

2. Preprocessing and Embedding Protocol

Molecules are processed as follows:

SMILES normalization with RDKit sanitization, followed by removal of explicit hydrogens.
Extraction of all per-atom and per-bond feature values.
Clipping and standardization of scalar atom descriptors (degree, formal charge, hydrogen count), and mapping of categorical descriptors to their feature bins prior to one-hot encoding.
Concatenation of categorical, scalar, and binary attributes into $\mathbf{x}_v$ and $\mathbf{e}_{uv}$ for all $v\in\mathcal V$, $(u,v)\in\mathcal E$.

For message-passing GNN architectures, initial embeddings are assigned $\mathbf{h}_v^{(0)} = \mathbf{x}_v$, with edge features available to edge-MLPs.
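One way to hold these inputs for message passing is sketched below. It assumes, as is common in GNN libraries but not stated in the source, that each undirected bond is stored as two directed edges sharing the same $\mathbf{e}_{uv}$; `build_graph` is a hypothetical helper.

```python
def build_graph(atom_features, bonds):
    """Build message-passing inputs from featurized atoms and bonds.

    atom_features: list of x_v vectors, indexed by atom.
    bonds: list of (u, v, e_uv) tuples, one per undirected bond.
    """
    h0 = [list(x) for x in atom_features]  # h_v^(0) = x_v
    edge_index, edge_attr = [], []
    for u, v, e in bonds:
        # Store both directions so each endpoint receives a message.
        for src, dst in ((u, v), (v, u)):
            edge_index.append((src, dst))
            edge_attr.append(list(e))
    return h0, edge_index, edge_attr


# Toy three-atom chain with 2-dim atom and 1-dim bond features for brevity.
h0, edge_index, edge_attr = build_graph(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [(0, 1, [1.0]), (1, 2, [0.0])],
)
```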

3. Integration into Graph Neural Network Architectures

A20/E17 features are directly injected into GNN layers:

GINE: Each GINE layer executes

$\mathbf{h}_v^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1+\epsilon^{(k)}\right)\mathbf{h}_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} \psi^{(k)}(\mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv})\right)$

with edge-modified neighbor aggregation.
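A minimal pure-Python sketch of this update follows. It assumes the commonly used message function $\psi(\mathbf{h}_u, \mathbf{e}_{uv}) = \mathrm{ReLU}(\mathbf{h}_u + W_e \mathbf{e}_{uv})$, where the hypothetical matrix `W_edge` projects edge features to the node dimension (17 to 20 in the A20/E17 setting). The source does not specify $\psi$, so this choice is illustrative only, and the MLP is passed in as an arbitrary callable.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gine_layer(h, edges, edge_attr, W_edge, mlp, eps=0.0):
    """One GINE-style update.

    h: list of node state vectors.
    edges: directed (u, v) pairs, meaning a message flows from u to v.
    edge_attr: one edge feature vector per directed edge.
    """
    dim = len(h[0])
    agg = [[0.0] * dim for _ in h]
    for (u, v), e in zip(edges, edge_attr):
        # psi(h_u, e_uv): add the projected edge features, then apply ReLU.
        msg = relu([a + b for a, b in zip(h[u], matvec(W_edge, e))])
        agg[v] = [a + m for a, m in zip(agg[v], msg)]
    # Combine the scaled center state with the summed messages, then apply the MLP.
    return [
        mlp([(1.0 + eps) * hv + av for hv, av in zip(h[v], agg[v])])
        for v in range(len(h))
    ]


# Two nodes joined by one bond, with an identity "MLP" to keep the arithmetic visible.
out = gine_layer(
    h=[[1.0, 0.0], [0.0, 1.0]],
    edges=[(0, 1), (1, 0)],
    edge_attr=[[1.0], [1.0]],
    W_edge=[[0.5], [-2.0]],
    mlp=lambda x: x,
)
```

Because the edge term enters before the nonlinearity, identical neighbors connected by different bond types contribute different messages.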

PNA: Each PNA layer computes

$\mathbf{m}_v^{(k)} = \mathrm{SCALE}_v\left(\operatorname{concat}\{\mathrm{mean}, \mathrm{max}, \mathrm{min}, \mathrm{std}\}_{u \in \mathcal{N}(v)} [\phi^{(k)}(\mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv})]\right)$

followed by concatenation with center node and MLP update, enabling multi-statistic, degree-aware aggregation.
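The multi-statistic aggregation can be sketched for a single node as below. The messages $\phi^{(k)}(\mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv})$ are assumed to be precomputed, and $\mathrm{SCALE}_v$ is taken to be the amplification scaler $\log(d_v + 1)/\delta$ from the original PNA formulation, with $\delta$ a normalizer estimated on training graphs. The source does not say which scalers are used, so this is one plausible instance; `pna_aggregate` is a hypothetical name.

```python
import math


def pna_aggregate(messages, delta):
    """Aggregate neighbor messages for one node with mean, max, min, and std.

    messages: non-empty list of equal-length message vectors.
    delta: degree normalizer, e.g. the mean of log(d + 1) over training nodes.
    """
    d = len(messages)
    dim = len(messages[0])
    cols = [[m[i] for m in messages] for i in range(dim)]
    mean = [sum(c) / d for c in cols]
    mx = [max(c) for c in cols]
    mn = [min(c) for c in cols]
    std = [
        math.sqrt(sum((x - mu) ** 2 for x in c) / d)
        for c, mu in zip(cols, mean)
    ]
    scale = math.log(d + 1) / delta  # degree-aware amplification
    return [scale * x for x in mean + mx + mn + std]


# Two neighbors; delta chosen so that the scaler equals 1 for degree 2.
m_v = pna_aggregate([[1.0, 2.0], [3.0, 2.0]], delta=math.log(3))
```

The output has four times the message dimension, one block per statistic, before concatenation with the center node.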

Message functions ($\psi$, $\phi$) are configured to process both node-state and edge features. Edge-MLPs consume full E17 vectors.

4. Comparative Performance and Ablation

A20/E17 features deliver measurable gains:

Model	VP MSE (↓)	OP MSE (↓)
GINE + light (e4/e6)	0.255 ± 0.008	0.670 ± 0.022
GINE + A20/E17	0.223 ± 0.006	0.612 ± 0.018
PNA + light (e4/e6)	0.236 ± 0.007	0.632 ± 0.021
PNA + A20/E17	0.210 ± 0.005	0.598 ± 0.020

Relative MSE reductions of roughly 10–12% (VP) and 5–8% (OP) are achieved under scaffold splits, indicating that richer descriptors support better generalization and OOD robustness (Wu et al., 23 Jan 2026).
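The relative reductions can be recomputed from the table means as a quick check. The snippet below uses only the tabulated values and ignores the reported spreads. On these means it gives about 12.5% (GINE) and 11.0% (PNA) for VP and about 8.7% (GINE) and 5.4% (PNA) for OP, so the GINE figures sit slightly above the upper ends of the quoted ranges.

```python
# (light-feature MSE, A20/E17 MSE) per model and task, copied from the table.
results = {
    "GINE": {"VP": (0.255, 0.223), "OP": (0.670, 0.612)},
    "PNA": {"VP": (0.236, 0.210), "OP": (0.632, 0.598)},
}


def relative_reduction(baseline, improved):
    """Percentage by which the improved MSE is lower than the baseline MSE."""
    return 100.0 * (baseline - improved) / baseline


for model, tasks in results.items():
    for task, (light, rich) in tasks.items():
        print(f"{model} {task}: {relative_reduction(light, rich):.1f}% lower MSE")
```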

5. Out-of-Distribution Diagnostics and OOD Generalization

OOD settings are characterized by ECFC4 Tanimoto similarity distributions (median ∼0.4, tail <0.3), scaffold-split evaluation, and similarity-binned residual analysis:

MaxSim bin	[0, 0.3)	[0.3, 0.5)	[0.5, 0.7)	[0.7, 1.0]
ST-VP (PNA)	0.324	0.241	0.194	0.162
Safe-MT (PNA)	0.305	0.232	0.190	0.165

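Performance degrades smoothly as maximum similarity to the training set decreases, with error roughly doubling between the highest and lowest similarity bins; parity plots reveal tighter alignment to ground truth for PNA+A20/E17 models. Residuals that stay flat across similarity bins are taken to indicate that models extrapolate rather than memorize.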
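The similarity-binned analysis can be sketched as follows. For simplicity the example uses binary Tanimoto similarity on sets of fingerprint bits, whereas count-based fingerprints such as ECFC4 would use the sum-of-minima over sum-of-maxima form. The error metric (squared error) and all function names are assumptions, since the source does not label the tabulated quantity.

```python
def tanimoto(a, b):
    """Binary Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_sim(query, train_fps):
    """Highest similarity between a query and any training fingerprint."""
    return max(tanimoto(query, fp) for fp in train_fps)


BINS = [(0.0, 0.3), (0.3, 0.5), (0.5, 0.7), (0.7, 1.0)]


def bin_index(sim):
    for i, (lo, hi) in enumerate(BINS):
        if lo <= sim < hi:
            return i
    return len(BINS) - 1  # sim == 1.0 belongs to the closed last bin


def binned_mse(test_fps, y_true, y_pred, train_fps):
    """Mean squared error per MaxSim bin; None marks an empty bin."""
    sums = [0.0] * len(BINS)
    counts = [0] * len(BINS)
    for fp, yt, yp in zip(test_fps, y_true, y_pred):
        i = bin_index(max_sim(fp, train_fps))
        sums[i] += (yt - yp) ** 2
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy data: one training fingerprint and three test molecules.
per_bin = binned_mse(
    test_fps=[{1, 2, 3, 4}, {1, 2, 5, 6}, {7, 8}],
    y_true=[1.0, 2.0, 3.0],
    y_pred=[1.5, 2.0, 1.0],
    train_fps=[{1, 2, 3, 4}],
)
```

Reporting `None` for empty bins avoids presenting a bin with no test molecules as if it had zero error.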

6. Comparison with Other Molecular-Graph Feature Regimes

A20/E17 provides:

Explicit, compact, and chemically grounded descriptors exceeding the minimal four-dimensional light regimes (e.g., element, aromaticity, bond order, ring flag).
Enhanced structural and stereoelectronic information compared with common RDKit atom/bond tables (Chang, 2019).
Comprehensive coverage relative to surveyed standard feature strategies (atomic number, hybridization, bond type, stereochemistry, ring flags) (Guo et al., 2022).
Efficient representation for task-optimized GNN and multitask pipelines, avoiding high-dimensionality and redundancy while retaining interpretability.

A plausible implication is that, for applications requiring strong OOD property prediction and chemical interpretability, A20/E17 features offer a pragmatic optimum between information richness and computational tractability.

7. Significance and Practical Impact

A20/E17 molecular graph features are presented as a robust feature set for molecular GNN applications targeting physical properties under substantial chemical diversity and OOD demands. Integration into advanced message-passing schemes (GINE, PNA) yields consistent improvements in molecular property regression and multitask learning. These gains are supported by reproducible experiments and scaffold-split error analyses, giving practitioners and methodologists a transferable template for chemically meaningful graph representations (Wu et al., 23 Jan 2026).