Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Published 12 Jun 2025 in cs.AI and cs.CL | (2506.10912v2)

Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal LLMs (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.


Authors (8)

Summary

Breaking Bad Molecules: Structure-Level Molecular Detoxification with MLLMs

The paper "Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?" addresses molecular toxicity, a leading cause of early-stage failure in drug development, and evaluates whether general-purpose Multimodal LLMs (MLLMs) can perform molecular toxicity repair. Despite advances in molecular design and property prediction, drug candidates still fail early because of poor toxicity profiles. Traditional approaches to toxicity mitigation rely heavily on expert-driven structural modifications and extensive experimentation, both of which are resource-intensive and time-consuming.

ToxiMol Benchmark Development

To address this, the authors introduce ToxiMol, the first benchmark task designed to assess general-purpose MLLMs on molecular toxicity repair: generating structurally valid molecular alternatives with reduced toxicity. The task had not been systematically defined or benchmarked before this work. The accompanying standardized dataset covers 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. Prompts are produced by a mechanism-aware, task-adaptive annotation pipeline informed by expert toxicological knowledge.

Evaluation Framework: ToxiEval

The researchers further propose ToxiEval, an automated evaluation framework integrating multiple criteria essential for assessing repair success. These include toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity. The framework aims to deliver a high-throughput, objective assessment of the detoxification capabilities of MLLMs, reflecting real-world constraints in drug development scenarios.
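The idea of chaining several criteria into one pass/fail decision can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names, field names, and every threshold value are hypothetical placeholders, and the scores are assumed to have been computed beforehand by external predictors.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateScores:
    """Precomputed scores for one candidate molecule (hypothetical fields)."""
    is_valid: bool           # candidate parses as a valid structure
    toxicity_prob: float     # predicted probability of the toxic endpoint, 0..1
    synthetic_access: float  # synthetic accessibility score, lower is easier
    drug_likeness: float     # drug-likeness score, 0..1, higher is better
    similarity: float        # structural similarity to the original, 0..1


@dataclass(frozen=True)
class Thresholds:
    """Placeholder cut-offs for illustration; NOT values taken from the paper."""
    max_toxicity_prob: float = 0.5
    max_synthetic_access: float = 6.0
    min_drug_likeness: float = 0.5
    min_similarity: float = 0.4


def failed_criteria(c: CandidateScores, t: Thresholds = Thresholds()) -> List[str]:
    """Return the names of the criteria the candidate fails (empty if none)."""
    if not c.is_valid:
        # An invalid structure cannot be scored meaningfully, so stop here.
        return ["validity"]
    failures = []
    if c.toxicity_prob > t.max_toxicity_prob:
        failures.append("toxicity")
    if c.synthetic_access > t.max_synthetic_access:
        failures.append("synthetic_accessibility")
    if c.drug_likeness < t.min_drug_likeness:
        failures.append("drug_likeness")
    if c.similarity < t.min_similarity:
        failures.append("similarity")
    return failures


def is_repaired(c: CandidateScores, t: Thresholds = Thresholds()) -> bool:
    """A candidate counts as a successful repair only if it passes every check."""
    return not failed_criteria(c, t)


good = CandidateScores(True, 0.2, 3.1, 0.7, 0.6)
bad = CandidateScores(True, 0.8, 3.1, 0.3, 0.6)
invalid = CandidateScores(False, 0.0, 0.0, 1.0, 1.0)
```

Returning the list of failed criteria, rather than a bare boolean, also gives a simple handle on failure attribution: counting which names appear most often shows where candidates tend to break down.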

Performance Evaluation of MLLMs

The paper evaluates nearly 30 mainstream general-purpose MLLMs on the ToxiMol benchmark and runs multiple ablation studies on factors such as structural validity, evaluation criteria, candidate diversity, and failure attribution. Current models still face significant challenges: repair success rates remain low, which reflects both the difficulty of the task and the limitations of current models in this domain. Even so, the models begin to show promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.
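To make the link between candidate diversity and repair success concrete, here is a small sketch of one possible aggregation. It assumes, purely for illustration, that each toxic molecule receives several candidate repairs and counts as repaired when at least one candidate passes all checks; this any-candidate rule, the function name, and the molecule identifiers are assumptions, not details confirmed by the source.

```python
from typing import Dict, List


def repair_success_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of molecules with at least one passing candidate.

    `outcomes` maps a molecule identifier to the pass/fail flags of its
    candidate repairs. A molecule with no candidates counts as a failure.
    """
    if not outcomes:
        return 0.0
    repaired = sum(1 for flags in outcomes.values() if any(flags))
    return repaired / len(outcomes)


# Hypothetical results for four molecules.
example = {
    "mol_a": [False, True, False],   # repaired by the second candidate
    "mol_b": [False, False, False],  # all candidates failed
    "mol_c": [],                     # model produced no candidates
    "mol_d": [True],                 # repaired by its only candidate
}
rate = repair_success_rate(example)  # 2 of 4 molecules repaired
```

Under this rule, generating more candidates per molecule can only keep the success rate equal or raise it, which is one reason the number of candidates matters when comparing models.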

Implications and Future Directions

The results suggest that MLLMs could become useful tools for automated detoxification in drug discovery, even though current performance leaves substantial room for improvement. The study identifies directions for further model development and advocates more refined detoxification strategies tailored to complex toxicity endpoints.

Additionally, the evaluation framework paves the way for deeper AI integration into pharmaceutical sciences, with potential extensions into broader chemical domains beyond small molecules, including macromolecular entities like peptides and proteins.

Conclusion

Overall, the paper helps bridge the gap between AI and drug discovery. It offers a foundational step towards systematic molecular detoxification using language models, although further advances are needed before the approach is practically applicable. The research encourages future work on optimizing AI-driven repair processes, on iterative testing, and on collaboration between computational toxicology and synthetic chemistry experts to improve the efficiency and reliability of drug development pipelines.