A20/E17 Molecular Graph Features

Updated 30 January 2026
  • A20/E17 features are a domain-informed encoding combining 20 atom-level and 17 bond-level descriptors to capture chemical structure and stereochemistry.
  • They integrate key properties like element identity, bond order, aromaticity, and ring membership to support advanced GNN architectures such as GINE and PNA.
  • Relative MSE reductions of 10–12% (VP) and 5–8% (OP) under scaffold-split evaluation indicate improved out-of-distribution generalization.

A20/E17 molecular graph features define a high-fidelity, domain-informed encoding for vertices and edges in molecular graphs, optimized for graph neural network (GNN) architectures targeting robust chemical property prediction under strict out-of-distribution (OOD) regimes. The A20 set incorporates 20-dimensional atom-level descriptors reflecting element identity, connectivity, electronic structure, and stereochemistry, while the E17 set specifies 17-dimensional bond-level descriptors integrating bond order, conjugation, ring status, stereochemistry, and ring-size indicators. This feature regime has demonstrated substantial improvements in property regression and multitask settings over lighter featurizations, with superior scaffold-split generalization, especially when integrated into advanced GNN layers such as GINE and PNA (Wu et al., 23 Jan 2026).

1. Formal Specification of A20/E17 Feature Sets

The A20 atom-level feature vector $\mathbf{x}_v \in \mathbb{R}^{20}$ encodes:

  • Element identity (10-dim one-hot): $\{\mathrm{C},\mathrm{N},\mathrm{O},\mathrm{F},\mathrm{Cl},\mathrm{Br},\mathrm{I},\mathrm{S},\mathrm{P},\textit{other}\}$; rare or unsupported elements map to "other".
  • Degree (1-dim scalar): Graph-theoretic degree, clipped to $\{0,1,2,3,4,5+\}$, then standardized.
  • Formal charge (1-dim scalar): Integer charge, clipped to $\{-2,-1,0,1,2\}$.
  • Hybridization (4-dim one-hot): $\{\mathrm{sp}, \mathrm{sp}^2, \mathrm{sp}^3, \textit{other}\}$.
  • Aromaticity (1-dim binary): 1 if atom is aromatic.
  • Ring membership (1-dim binary): 1 if atom is in any ring.
  • Total hydrogen count (1-dim scalar): Sum of explicit and implicit hydrogens, clipped and standardized.
  • Chirality center (1-dim binary): 1 if atom is a stereocenter.
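
The A20 layout above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the `AtomRecord` container, the helper names, and the hydrogen clip bound `max_hs` are assumptions introduced here, and in practice the field values would be read from a sanitized RDKit atom. Scalars are only clipped; standardization with training-set statistics is left as a separate step.

```python
from dataclasses import dataclass

ELEMENTS = ["C", "N", "O", "F", "Cl", "Br", "I", "S", "P"]  # a 10th slot holds "other"
HYBRIDIZATIONS = ["sp", "sp2", "sp3"]                        # a 4th slot holds "other"


@dataclass
class AtomRecord:
    """Hypothetical container for per-atom values (e.g. read from an RDKit atom)."""
    symbol: str
    degree: int
    formal_charge: int
    hybridization: str
    is_aromatic: bool
    in_ring: bool
    total_hs: int
    is_stereocenter: bool


def one_hot_with_other(value, vocabulary):
    """One-hot over `vocabulary` plus a trailing 'other' slot for unseen values."""
    vec = [0.0] * (len(vocabulary) + 1)
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vec[index] = 1.0
    return vec


def clip(value, low, high):
    return max(low, min(high, value))


def a20_features(atom, max_hs=4):
    """Return the 20-dim atom vector with clipped, not yet standardized, scalars."""
    return (
        one_hot_with_other(atom.symbol, ELEMENTS)                  # dims 0-9
        + [float(clip(atom.degree, 0, 5))]                         # dim 10, 5 means "5+"
        + [float(clip(atom.formal_charge, -2, 2))]                 # dim 11
        + one_hot_with_other(atom.hybridization, HYBRIDIZATIONS)   # dims 12-15
        + [float(atom.is_aromatic)]                                # dim 16
        + [float(atom.in_ring)]                                    # dim 17
        + [float(clip(atom.total_hs, 0, max_hs))]                  # dim 18, bound assumed
        + [float(atom.is_stereocenter)]                            # dim 19
    )


# An aromatic ring carbon with two heavy-atom neighbors and one hydrogen.
aromatic_carbon = AtomRecord("C", 2, 0, "sp2", True, True, 1, False)
x_v = a20_features(aromatic_carbon)
```

The block widths (10 + 1 + 1 + 4 + 1 + 1 + 1 + 1) sum to 20, matching the stated dimensionality.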

The E17 bond-level feature vector $\mathbf{e}_{uv} \in \mathbb{R}^{17}$ encodes:

  • Bond order (4-dim one-hot): $\{\text{single}, \text{double}, \text{triple}, \text{aromatic}\}$.
  • Conjugation (1-dim binary): RDKit conjugation flag.
  • Ring membership (1-dim binary): 1 if bond is in any ring.
  • Stereochemistry (6-dim one-hot): $\{\mathrm{NONE}, \mathrm{ANY}, \mathrm{Z}, \mathrm{E}, \mathrm{CIS}, \mathrm{TRANS}\}$.
  • Ring-size indicators (5-dim multi-hot): Ring sizes $\{3,4,5,6,\geq 7\}$; multi-hot because a bond shared by fused rings can belong to rings of several sizes.
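
A comparable sketch for the E17 bond vector follows. The function and constant names are hypothetical. The inputs (bond-order label, conjugation flag, stereo tag, and the list of ring sizes containing the bond) are assumed to have been extracted from RDKit beforehand. Raising an error for an unknown bond order or stereo tag is a choice made here, since the specification lists no "other" slot for those fields.

```python
BOND_ORDERS = ["single", "double", "triple", "aromatic"]
STEREO_TAGS = ["NONE", "ANY", "Z", "E", "CIS", "TRANS"]


def one_hot(value, vocabulary):
    """Strict one-hot; unknown values raise because no 'other' slot is specified."""
    if value not in vocabulary:
        raise ValueError(f"unsupported value: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def ring_size_multi_hot(ring_sizes):
    """Slots for ring sizes 3, 4, 5, 6 and >=7; several slots may be set at once."""
    vec = [0.0] * 5
    for size in ring_sizes:
        if size < 3:
            raise ValueError("rings have at least three members")
        vec[min(size, 7) - 3] = 1.0
    return vec


def e17_features(bond_order, is_conjugated, stereo, ring_sizes):
    """Return the 17-dim bond vector; `ring_sizes` lists every ring containing the bond."""
    in_ring = len(ring_sizes) > 0
    return (
        one_hot(bond_order, BOND_ORDERS)      # dims 0-3
        + [float(is_conjugated)]              # dim 4
        + [float(in_ring)]                    # dim 5
        + one_hot(stereo, STEREO_TAGS)        # dims 6-11
        + ring_size_multi_hot(ring_sizes)     # dims 12-16
    )


# A bond shared by a fused five- and six-membered aromatic ring system.
e_uv = e17_features("aromatic", True, "NONE", [5, 6])
```

For the fused-ring bond in the example, both the size-5 and size-6 slots are set, which is the multi-hot case the ring-size field is designed for.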

All features are instantiated from sanitized RDKit graphs; scalar fields are standardized by mean and variance from training data, categorical fields one-hot encoded, and binary flags as 0/1.
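
The standardization step can be illustrated as follows. This is a sketch under assumptions: the helper names are invented here, the population standard deviation is used, and a zero-variance field falls back to a divisor of 1. The essential point is that the statistics are fitted on the training split only and then reused unchanged for validation and test molecules.

```python
import math


def fit_standardizer(train_values):
    """Mean and population standard deviation of one scalar field on the training split."""
    n = len(train_values)
    mean = sum(train_values) / n
    variance = sum((v - mean) ** 2 for v in train_values) / n
    std = math.sqrt(variance)
    # Guard against constant fields so the transform never divides by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(value, mean, std):
    """Apply training-set statistics to a value from any split."""
    return (value - mean) / std


# Clipped atom degrees observed in a toy training set.
train_degrees = [1, 2, 3, 2]
degree_mean, degree_std = fit_standardizer(train_degrees)

# A test-set atom is transformed with the training statistics, not its own.
z = standardize(3, degree_mean, degree_std)
```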

2. Preprocessing and Embedding Protocol

Molecules are processed as follows:

  • SMILES normalization and removal of explicit hydrogens (RDKit sanitization).
  • Extraction of all per-atom and per-bond feature values.
  • Clipping and standardization of scalar atom descriptors (degree, formal charge, hydrogen count); mapping of categorical fields to their bins prior to one-hot encoding.
  • Concatenation of categorical, scalar, and binary attributes into $\mathbf{x}_v$ and $\mathbf{e}_{uv}$ for all $v\in\mathcal{V}$, $(u,v)\in\mathcal{E}$.

For message-passing GNN architectures, initial embeddings are assigned $\mathbf{h}_v^{(0)} = \mathbf{x}_v$, with edge features available to edge-MLPs.

3. Integration into Graph Neural Network Architectures

A20/E17 features are directly injected into GNN layers:

  • GINE: Each GINE layer executes

$$\mathbf{h}_v^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1+\epsilon^{(k)}\right)\mathbf{h}_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} \psi^{(k)}\left(\mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv}\right)\right)$$

with edge-modified neighbor aggregation.

  • PNA: Each PNA layer computes

$$\mathbf{m}_v^{(k)} = \mathrm{SCALE}_v\left(\operatorname{concat}\{\mathrm{mean}, \mathrm{max}, \mathrm{min}, \mathrm{std}\}_{u \in \mathcal{N}(v)} \left[\phi^{(k)}\left(\mathbf{h}_u^{(k-1)}, \mathbf{e}_{uv}\right)\right]\right)$$

followed by concatenation with center node and MLP update, enabling multi-statistic, degree-aware aggregation.

Message functions ($\psi$, $\phi$) are configured to process both state and edge features. Edge-MLPs consume full E17 vectors.
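
To make the two aggregation rules concrete, here is a dependency-free sketch of the quantity inside the GINE MLP and of the PNA multi-statistic aggregation. Several choices are assumptions rather than details from the source. The message function is taken as $\psi(\mathbf{h}_u, \mathbf{e}_{uv}) = \mathrm{ReLU}(\mathbf{h}_u + W\mathbf{e}_{uv})$, a common GINE form. $\mathrm{SCALE}_v$ is taken as the amplification scaler $\log(d+1)/\delta$ from the PNA literature. The toy example uses 2-dim edge vectors in place of full E17 vectors, and all function names are hypothetical.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def gine_pre_mlp(h, edges, v, w_edge, eps=0.0):
    """(1 + eps) * h_v + sum_u ReLU(h_u + W e_uv); MLP^(k) would be applied to the result."""
    total = [(1.0 + eps) * x for x in h[v]]
    for u, e_uv in edges[v]:
        total = add(total, relu(add(h[u], matvec(w_edge, e_uv))))
    return total


def pna_aggregate(messages, delta=1.0):
    """Concatenate mean, max, min, std over neighbor messages, then scale by log(d+1)/delta."""
    d = len(messages)
    dims = range(len(messages[0]))
    mean = [sum(m[i] for m in messages) / d for i in dims]
    maximum = [max(m[i] for m in messages) for i in dims]
    minimum = [min(m[i] for m in messages) for i in dims]
    std = [math.sqrt(sum((m[i] - mean[i]) ** 2 for m in messages) / d) for i in dims]
    scale = math.log(d + 1) / delta  # assumed amplification scaler for SCALE_v
    return [scale * x for x in mean + maximum + minimum + std]


# Toy graph: node 0 has neighbors 1 and 2, each edge carrying a 2-dim feature.
h = {0: [1.0, -1.0], 1: [0.5, 0.5], 2: [-2.0, 1.0]}
edges = {0: [(1, [1.0, 0.0]), (2, [0.0, 1.0])]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stands in for a learned edge projection

gine_out = gine_pre_mlp(h, edges, 0, identity)
pna_out = pna_aggregate([[1.0, 4.0], [3.0, 0.0]], delta=math.log(3))
```

In both functions the edge vector enters every message, which is how the E17 descriptors influence node updates. The PNA output is four times the message width, one block per statistic.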

4. Comparative Performance and Ablation

A20/E17 features deliver measurable gains:

| Model | VP MSE (↓) | OP MSE (↓) |
|---|---|---|
| GINE + light (e4/e6) | 0.255 ± 0.008 | 0.670 ± 0.022 |
| GINE + A20/E17 | 0.223 ± 0.006 | 0.612 ± 0.018 |
| PNA + light (e4/e6) | 0.236 ± 0.007 | 0.632 ± 0.021 |
| PNA + A20/E17 | 0.210 ± 0.005 | 0.598 ± 0.020 |

Relative improvements of 10–12% MSE (VP) and 5–8% MSE (OP) are achieved under scaffold splits, confirming that richer descriptors support enhanced generalization and OOD robustness (Wu et al., 23 Jan 2026).
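
The relative improvements can be recomputed from the table means as (baseline − improved) / baseline. The short script below is an illustrative check that uses only the reported means and ignores the ± spreads. The variable and function names are introduced here.

```python
# (VP MSE, OP MSE) means copied from the comparison table.
results = {
    "GINE": {"light": (0.255, 0.670), "A20/E17": (0.223, 0.612)},
    "PNA": {"light": (0.236, 0.632), "A20/E17": (0.210, 0.598)},
}


def relative_reduction(baseline, improved):
    """Percentage reduction in MSE relative to the baseline featurization."""
    return 100.0 * (baseline - improved) / baseline


for model, rows in results.items():
    vp = relative_reduction(rows["light"][0], rows["A20/E17"][0])
    op = relative_reduction(rows["light"][1], rows["A20/E17"][1])
    print(f"{model}: VP {vp:.1f}%, OP {op:.1f}%")
# GINE: VP 12.5%, OP 8.7%
# PNA: VP 11.0%, OP 5.4%
```

These point estimates are close to the quoted ranges, with the GINE values landing slightly above the upper ends when computed from the rounded table means.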

5. Out-of-Distribution Diagnostics and OOD Generalization

OOD settings are characterized by ECFC4 Tanimoto similarity distributions (median ∼0.4, tail <0.3), scaffold-split evaluation, and similarity-binned residual analysis:

| MaxSim bin | [0, 0.3) | [0.3, 0.5) | [0.5, 0.7) | [0.7, 1.0] |
|---|---|---|---|---|
| ST-VP (PNA) | 0.324 | 0.241 | 0.194 | 0.162 |
| Safe-MT (PNA) | 0.305 | 0.232 | 0.190 | 0.165 |

Error rises smoothly as similarity to the training set decreases; parity plots reveal tighter alignment to ground truth for PNA+A20/E17 models. Flat residuals vs. similarity indicate that models extrapolate rather than memorize.
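
A similarity-binned error analysis of this kind can be sketched as below. The code assumes count fingerprints stored as dictionaries from feature id to count, a count-based Tanimoto (sum of minima over sum of maxima), and squared error as the per-bin metric. None of these details are confirmed by the source, and the function names are hypothetical. Fingerprint generation itself (e.g. with RDKit) is omitted.

```python
BIN_LABELS = ["[0, 0.3)", "[0.3, 0.5)", "[0.5, 0.7)", "[0.7, 1.0]"]
UPPER_EDGES = [0.3, 0.5, 0.7]


def tanimoto(a, b):
    """Count-based Tanimoto between two {feature_id: count} fingerprints."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 0.0


def max_sim(query, train_fingerprints):
    """Highest similarity of a test molecule to any training molecule."""
    return max(tanimoto(query, fp) for fp in train_fingerprints)


def bin_label(similarity):
    """Map a MaxSim value to its bin; the last bin is closed at 1.0."""
    for index, upper in enumerate(UPPER_EDGES):
        if similarity < upper:
            return BIN_LABELS[index]
    return BIN_LABELS[-1]


def binned_mse(records):
    """records: iterable of (max_sim, y_true, y_pred); returns MSE per non-empty bin."""
    squared_errors = {}
    for similarity, y_true, y_pred in records:
        squared_errors.setdefault(bin_label(similarity), []).append((y_true - y_pred) ** 2)
    return {label: sum(errs) / len(errs) for label, errs in squared_errors.items()}


train = [{1: 2, 2: 1}, {3: 1}]
query_sim = max_sim({1: 1, 2: 1}, train)
per_bin = binned_mse([(0.1, 1.0, 2.0), (0.2, 0.0, 1.0), (0.9, 1.0, 1.0)])
```

Reporting error per MaxSim bin, rather than a single aggregate, shows whether accuracy holds up for test molecules that are far from everything in the training set.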

6. Comparison with Other Molecular-Graph Feature Regimes

A20/E17 provides:

  • Explicit, compact, and chemically grounded descriptors exceeding the minimal four-dimensional light regimes (e.g., element, aromaticity, bond order, ring flag).
  • Enhanced structural and stereoelectronic information compared with common RDKit atom/bond tables (Chang, 2019).
  • Comprehensive coverage relative to surveyed standard feature strategies (atomic number, hybridization, bond type, stereochemistry, ring flags) (Guo et al., 2022).
  • Efficient representation for task-optimized GNN and multitask pipelines, avoiding high-dimensionality and redundancy while retaining interpretability.

A plausible implication is that, for applications requiring strong OOD property prediction and chemical interpretability, A20/E17 features offer a pragmatic optimum between information richness and computational tractability.

7. Significance and Practical Impact

A20/E17 molecular graph features are presented as a robust feature standard for molecular GNN applications targeting physical properties with substantial chemical diversity and OOD demands. Integration into advanced message-passing schemes (GINE, PNA) yields consistent improvements in molecular property regression and multitask learning. These gains are supported by scaffold-split error analyses and similarity-binned diagnostics, providing both practitioners and methodologists with a transferable template for chemically meaningful graph representations (Wu et al., 23 Jan 2026).
