
SMILES-GRPO: Group-Aware Molecular Pretraining

Updated 1 February 2026
  • The paper introduces a novel group-aware masking strategy that occludes entire functional groups in SMILES representations using curated SMARTS patterns, enhancing chemical coherence.
  • It details the implementation of a 12-layer Transformer with fixed mask-rate scheduling and optimized training to achieve superior performance on 11 molecular property benchmarks.
  • The paper extends its approach with Group Relative Policy Optimization (GRPO) for reinforcement learning in OCSR, yielding a 7–9.6% improvement in stereo-exact match accuracy.

SMILES-GRPO ("SMILES with Group-Aware Random Partial Occlusion") is a pre-training and model optimization paradigm developed for molecular LLMs that utilize Simplified Molecular-Input Line-Entry System (SMILES) representations. SMILES-GRPO advances molecular property prediction and cheminformatics by leveraging chemically meaningful, group-aware masking strategies during language-model pre-training and, in related contexts, by using Group Relative Policy Optimization (GRPO) for reinforcement learning-based finetuning. The approach has demonstrated superior empirical results across multiple molecular property prediction and optical chemical structure recognition (OCSR) benchmarks.

1. Group-Aware Masking in Molecular LLM Pre-training

The defining methodological innovation of SMILES-GRPO is the masking operator $M(\cdot)$, which targets entire functional groups within SMILES sequences instead of random or arbitrary token spans. Given a SMILES sequence $S = (s_1, \ldots, s_L)$, the method uses an RDKit-based substructure matcher to identify a set $\mathcal{G}(S) = \{g_1, \ldots, g_K\}$ of functional group subsequences (using a curated library of ~40 SMARTS patterns covering carboxyl, ester, nitro, and amine groups, among others). The group-aware masking operator then selects a subset $\mathcal{G}_m \subseteq \mathcal{G}(S)$: if $K = 0$, atom-level random masking is used; if $1 \leq K < 10$, one group is masked uniformly at random; if $K \geq 10$, $\lceil 0.1 K \rceil$ groups are masked at random. Overlapping masks are prohibited to ensure each masked region is chemically coherent.
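The selection policy above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the group token spans are assumed to have been precomputed (in practice an RDKit SMARTS substructure matcher would supply them), and the function names are hypothetical.

```python
import math
import random

MASK = "[MASK]"

def num_groups_to_mask(k: int) -> int:
    """Number of functional groups to mask, per the stated policy:
    K = 0  -> 0 (fall back to atom-level random masking elsewhere);
    1 <= K < 10 -> mask exactly one group;
    K >= 10 -> mask ceil(0.1 * K) groups."""
    if k == 0:
        return 0
    if k < 10:
        return 1
    return math.ceil(0.1 * k)

def apply_group_mask(tokens, group_spans, rng=random):
    """Replace every token of each selected (non-overlapping) group span
    with [MASK]. `group_spans` is a list of (start, end) token indices,
    end exclusive."""
    n = num_groups_to_mask(len(group_spans))
    selected = rng.sample(group_spans, n) if n else []
    masked = list(tokens)
    for start, end in selected:
        for t in range(start, end):
            masked[t] = MASK
    return masked
```

For example, masking the carboxyl span of a character-tokenized `CC(=O)O` (acetic acid) replaces the five tokens of that group while leaving the carbon backbone intact.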

For each selected group $g \in \mathcal{G}_m$, all tokens of $g$ are replaced with a special [MASK] token, producing a masked sequence $M(S) = (\hat{s}_1, \ldots, \hat{s}_L)$. The resulting pre-training objective is a masked language modeling loss:

$$L(\theta) = \mathbb{E}_{S \sim D} \left[ -\sum_{t=1}^{L} \mathbb{1}[\hat{s}_t = \texttt{[MASK]}] \, \log p_\theta(s_t \mid \hat{s}_{1:L}) \right]$$

This encourages the model pθp_\theta to infer chemically valid replacements for occluded functional groups, thereby compelling structure-aware representations (Peng et al., 2024).
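A minimal numeric sketch of this objective follows, with the model's per-position probabilities of the true tokens assumed given. The loss here is averaged over masked positions rather than summed, a common normalization choice; the function name is illustrative.

```python
import math

def masked_lm_loss(true_ids, masked_ids, probs, mask_id):
    """Masked-LM loss: average -log p over positions where the input was
    replaced by [MASK]; unmasked positions contribute nothing.
    `probs[t]` is the model's probability of the true token at position t."""
    losses = [-math.log(probs[t])
              for t in range(len(true_ids))
              if masked_ids[t] == mask_id]
    return sum(losses) / max(len(losses), 1)
```

With probability 0.5 assigned to each of two masked tokens, the loss is $-\log 0.5 \approx 0.693$ per position, and it falls toward zero as the model grows confident in the correct reconstructions.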

2. Transformer Architecture and Implementation Details

SMILES-GRPO employs a 12-layer Transformer encoder congruent with MoLFormer or RoBERTa, characterized by a hidden size $D_h = 768$, 12 attention heads per layer, rotary (MoLFormer) or absolute (RoBERTa) positional embeddings, and a SMILES-appropriate vocabulary (atom tokens, bond symbols, brackets, plus [MASK]). The approach requires no architectural modifications for group awareness; the information is introduced solely by the masking protocol.

Key optimization and training details include:

  • A learnable embedding for [MASK] of dimension $D_h$.
  • Pre-training using AdamW (lr $= 3 \times 10^{-5}$) and LambdaLR decay, batch size 1024 on 16 V100 GPUs; 50 epochs on 10M–20M SMILES, or 20 epochs on 100M.
  • Mask-rate scheduling is fixed by the group masking policy, yielding an average 10%–25% of tokens masked.
  • Fine-tuning employs FusedLAMB (lr $= 3 \times 10^{-5}$), batch size 64 per task.
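The exact decay shape of the LambdaLR schedule is not specified above, so the following multiplier is purely an illustrative assumption (linear warmup followed by linear decay), scaling the stated base learning rate of 3e-5:

```python
def lr_lambda(step, warmup=1000, total=100000):
    """Hypothetical LambdaLR-style multiplier: linear warmup to 1.0 over
    `warmup` steps, then linear decay to 0.0 at `total` steps. The decay
    shape is an assumption; the source only names LambdaLR decay."""
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

base_lr = 3e-5
lrs = [base_lr * lr_lambda(s) for s in (0, 500, 1000, 100000)]
```

In PyTorch this kind of function would be passed directly to `torch.optim.lr_scheduler.LambdaLR` as the `lr_lambda` argument.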

3. Downstream Evaluation and Benchmarking

Evaluation follows MoleculeNet’s scaffold split protocol to rigorously assess out-of-distribution generalization. The benchmark suite comprises seven classification tasks (BBBP, BACE, ClinTox, Tox21, SIDER, HIV, MUV), evaluated by ROC-AUC, and five regression tasks: ESOL, FreeSolv, Lipophilicity, and QM7, evaluated by RMSE, plus QM8, evaluated by MAE. The scaffold split ensures training, validation, and test sets contain non-overlapping Bemis–Murcko scaffolds.
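A scaffold split can be sketched as follows. Scaffold strings are assumed precomputed here (in practice RDKit's MurckoScaffold utilities would supply them), whole scaffold groups are assigned greedily by size so that no scaffold crosses a split boundary, and the function name is hypothetical; this is the common greedy variant, not necessarily the paper's exact procedure.

```python
from collections import defaultdict

def scaffold_split(mol_scaffolds, frac_train=0.8, frac_valid=0.1):
    """Deterministic scaffold split: group molecule indices by their
    Bemis-Murcko scaffold string, then assign whole groups, largest
    first, to train / valid / test so no scaffold appears in more
    than one split."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(mol_scaffolds):
        groups[scaf].append(idx)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mol_scaffolds)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(valid) + len(g) <= frac_valid * n:
            valid += g
        else:
            test += g
    return train, valid, test
```

Because assignment is by scaffold group rather than by molecule, the test set contains only scaffolds never seen in training, which is what makes the protocol a probe of out-of-distribution generalization.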

Empirical results show that SMILES-GRPO pre-trained on 100M molecules outperforms SMILES-only (MoLFormer, RoBERTa) and graph-based models (MolCLR, GEM, GROVER) in 9 out of 11 tasks and ranks a close second in the others (Peng et al., 2024).

Classification (ROC-AUC, higher is better):

Model                BBBP    BACE    ClinTox  Tox21   HIV     MUV
MolCLR-gin           0.931   0.787   0.801    0.764   0.777   0.739
GEM (3D GNN)         0.910   0.860   0.851    0.779   0.750   0.725
MoLFormer (SMILES)   0.904   0.828   0.945    0.773   0.763   0.760
SMILES-GRPO (100M)   0.924   0.798   0.961    0.790   0.781   0.799

Regression (RMSE, lower is better; QM8 reported as MAE):

Model                ESOL    FreeSolv  Lipo    QM7      QM8
MolCLR-gin           1.472   2.712     0.741   96.547   0.0205
GEM (3D GNN)         0.761   2.458     0.686   65.007   0.0179
MoLFormer (SMILES)   0.661   4.449     0.446   69.070   0.0177
SMILES-GRPO (100M)   0.390   3.049     0.398   64.167   0.0202

4. Methodological Analyses and Ablation Studies

Ablation experiments substantiate the design decisions of SMILES-GRPO:

  • Masking strategy: Group-aware masking achieves a 30% reduction in ESOL RMSE (0.343 vs. 0.491 for random subsequence masking at 10M molecules). Notably, group-aware pre-training with 10M samples outperforms random masking with 100M, highlighting sample efficiency.
  • Mask-rate sensitivity: Optimal downstream results occur when masking ~10% of groups per molecule; excessive masking (>20%) degrades accuracy.
  • Structural inference: UMAP projections of learned representations reflect chemical properties (e.g., molecular weight gradients), while pairwise attention patterns correlate highly (cosine similarity ≈ 0.84) with actual inter-atomic 3D distances. This suggests SMILES-GRPO enables the model to internalize significant structural information from purely 1D SMILES input.
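The attention-geometry probe in the last bullet can be approximated by a cosine similarity between flattened pairwise matrices. This is a rough sketch of that kind of analysis, not the paper's exact pipeline, and the function name is illustrative.

```python
import math

def attn_distance_similarity(attn, dist):
    """Cosine similarity between the flattened off-diagonal entries of a
    pairwise attention matrix and a pairwise 3D distance matrix, as a
    rough probe of how much molecular geometry the attention encodes."""
    n = len(attn)
    a = [attn[i][j] for i in range(n) for j in range(n) if i != j]
    d = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    dot = sum(x * y for x, y in zip(a, d))
    na = math.sqrt(sum(x * x for x in a))
    nd = math.sqrt(sum(x * x for x in d))
    return dot / (na * nd)
```

A value near 1.0 would indicate that attention weights track inter-atomic distances up to scale, consistent with the reported similarity of roughly 0.84.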

5. GRPO in Reinforcement Learning for Optical Chemical Structure Recognition

In addition to language modeling, "Group Relative Policy Optimization" (GRPO), as deployed in the MolSight pipeline, represents another context for group-based optimization in SMILES understanding (Zhang et al., 21 Nov 2025). Here, GRPO is a group-normalized policy gradient method designed for reinforcement learning-based OCSR, directly optimizing for chemical-semantic correctness in generated SMILES, particularly stereochemistry.

The key elements are:

  • Markov Decision Process formalization, with reward combining Tanimoto fingerprint overlap and InChIKey-based stereochemistry match.
  • Group-normalized advantage computation (using $G = 4$ completions per image) with a KL-divergence penalty to regularize the policy.
  • No explicit critic; advantage normalization suffices for baseline reduction.
  • When applied post-fine-tuning, GRPO yields +7 to +9.6% absolute gain in stereo-exact match accuracy on challenging OCSR benchmarks.
  • Empirical ablations indicate both Tanimoto and stereo reward are required for maximal gains.
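The group-normalized advantage at the heart of GRPO is simple to sketch. This is illustrative only: the full objective also includes the KL penalty and the policy-gradient update itself, which are omitted here.

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages as used in GRPO: for the G completions
    sampled for one input (G = 4 in the setup above), subtract the group
    mean reward and divide by the group standard deviation. The group
    statistics serve as the baseline, so no learned critic is needed."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

For a group with rewards [1, 1, 3, 3], the advantages come out to [-1, -1, 1, 1]: above-average completions are reinforced and below-average ones suppressed, regardless of the absolute reward scale.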

6. Implications and Significance

The SMILES-GRPO approach demonstrates that chemically informed, group-aware occlusion in masking or reinforcement learning yields markedly improved structure-sensitive representations across molecular machine learning tasks. The protocol’s reliance on functional group masking, rather than architectural innovations, suggests that careful choice of pre-training objectives can significantly advance performance even without explicit graph or geometric input. The ability to recover or infer structural and physicochemical properties implies that group-aware approaches may bridge the gap between sequence-only and full-graph molecular representations.

A plausible implication is that such strategies could generalize to other domains where functional units govern structure–function relationships, provided masking can be aligned with domain-specific substructures. The integration of GRPO for OCSR demonstrates extensibility of the group normalization principle beyond standard LLM pre-training, particularly when task rewards are non-decomposable and structure-sensitive. For the most challenging tasks in molecular AI—including stereochemistry resolution and out-of-distribution property prediction—SMILES-GRPO and related group-aware objectives currently set the empirical benchmark (Peng et al., 2024, Zhang et al., 21 Nov 2025).
