One molecular generator to rule them all?
Algorithms that generate molecules similar to a given molecule can help medicinal chemists ideate novel compounds that fix liabilities of the original. We compared several algorithms, an LLM (Claude), MolMIM, REINVENT4, and a classical fragment-based approach (CReM), to determine whether any is inherently better at designing similar, novel, and valid molecules. Most generative algorithms, except MolMIM, behaved similarly well, and no clear advantage could be attributed to any one of them. Surprisingly, the LLM did not seem to be at a disadvantage against specialist algorithms like MolMIM and REINVENT4. We conclude that current generative algorithms appear to have sufficient sampling capabilities and suggest that combining generation with scoring of molecules for drug-likeness and target activity may be key to the widespread adoption of generative algorithms.
When optimising drug molecules for ADMET and activity, medicinal chemists face a daunting challenge: the vast chemical space of drug-like small molecules. There are ~10^33 possible molecular structures (Polishchuk, 2020), and no chemist can make and test all of them. Consequently, developing sophisticated algorithms that can intelligently sample 'relevant' compounds from this astronomical number of possibilities, i.e. molecules that are similar (but not too similar) to molecules a chemist has already made and tested, is crucial for maximising the chances of identifying the next blockbuster drug.
Generative molecular AI is a powerful tool for sampling novel and relevant molecular structures, and medicinal chemists can use these technologies to generate new ideas that may solve developmental liabilities (e.g. poor solubility) or legal liabilities (e.g. patentability). However, with new methods emerging on a weekly basis, it becomes daunting to figure out which to use: where should medicinal chemists start, and which methods sample the large chemical space effectively enough to generate only relevant molecules?
In this blog post, we compared 3 AI-based molecular generators with a classic fragment-based approach to investigate whether any one approach is more likely to generate relevant molecules. We assumed that relevant molecules are (i) sufficiently different from the starting molecules not to be considered trivial, (ii) not so different as to be out of scope, and (iii) valid molecules that do not contain structural alerts. Where possible, we used the methods 'as they come', i.e. without much tuning of their settings, to simulate the experience of someone trying them for the first time. Note that sampling relevant molecules is only the first step towards solving liabilities: afterwards, molecules need to be scored for other important properties, such as drug-likeness and on/off-target activity, either by a person or by another algorithm.
Comparing different Generative AI methods
In drug discovery, chemists frequently need to ideate new molecules with desired properties, as summarised in one of our blog posts. Here, we specifically focus on the hit-to-lead and lead optimisation stages, where the chemist's goal is to generate new molecules inspired by ones they have already made and tested for activity and/or drug-likeness. Most of the time, projects have tens, if not hundreds, of closely related compounds, making it difficult for chemists to escape innate biases and generate new molecules that meaningfully differ. For this post, we curated three small datasets (Adenosine A2A, Aryl Piperazine, and SIRT2) as starting points for generation. You can find references to the datasets at the bottom of the article.
We applied 4 generative approaches to each of the 3 datasets:
- REINVENT4: the latest in a series of algorithms released by AstraZeneca, based on recurrent neural network and transformer models (Loeffler et al., 2024). We used the Mol2Mol function with the temperature set to 1 and the medium-similarity priors to generate molecules.
- Claude 3.5 Sonnet: one of the top-performing large language models (LLMs), from Anthropic, tested by providing specific prompts for molecule generation. Since LLM output can depend strongly on the prompt, we provided two different versions, a technique known as prompt engineering: one prompt only asks Claude to generate 'similar' molecules, while the other explicitly asks it to 'scaffold hop' (see the repository cited at the end of the post and the sketch after this list).
- MolMIM: a latent-variable model based on the Mutual Information Machine and a recent addition to Nvidia's BioNeMo platform (Reidenbach et al., 2022), which provides new models to the life-science industry. We used the MolMIM API with a minimum similarity of 0.7.
- CReM: a fragment-based generator that takes a classical, medicinal-chemistry-driven approach (Polishchuk, 2020).
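For illustration, the two-prompt setup for Claude might look like the sketch below, written against the Anthropic Python SDK. The prompt wording and model identifier are our own stand-ins; the exact prompts used for this post are in the linked repository.

```python
# Illustrative sketch only: the exact prompts live in the repository linked
# at the end of the post; the wording and model identifier are stand-ins.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT_SIMILAR = (
    "You are a medicinal chemist. Given the molecule {smiles}, propose 10 "
    "similar molecules. Return only SMILES strings, one per line."
)
PROMPT_SCAFFOLD_HOP = (
    "You are a medicinal chemist. Given the molecule {smiles}, propose 10 "
    "scaffold-hopped analogues that retain the pharmacophore but replace the "
    "core ring system. Return only SMILES strings, one per line."
)

def generate(smiles: str, prompt_template: str) -> list[str]:
    """Ask Claude for analogues of a single input molecule."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt_template.format(smiles=smiles)}],
    )
    # One SMILES per line; invalid strings are filtered downstream with RDKit.
    return [s.strip() for s in message.content[0].text.splitlines() if s.strip()]
```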
Do the generated molecules make any sense?
For each method, we generated ~10 molecules per input molecule by cycling through all molecules in each dataset. Note that not every method returned 10 molecules per request, potentially due to filtering of invalid molecules. First, we determined whether the molecules generated by each method are realistic and do not violate established structural alerts by calculating the frequency of rare ring systems in each dataset (Figure 1). Rare ring systems are ring systems found fewer than 100 times in ChEMBL and can be easily computed (Useful rdkit Utils documentation; a sketch follows Figure 1).
Figure 1: Frequency of rare ring systems.
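As a rough illustration, the rare-ring flag can be computed as in the sketch below, which assumes the RingSystemLookup helper from useful_rdkit_utils (the exact API may differ between versions):

```python
# Sketch of the rare-ring filter; assumes the RingSystemLookup helper from
# useful_rdkit_utils (API details may differ between versions).
import useful_rdkit_utils as uru

lookup = uru.RingSystemLookup.default()  # ring-system frequencies from ChEMBL

def has_rare_ring(smiles: str, min_count: int = 100) -> bool:
    """Flag molecules whose least common ring system occurs fewer than
    min_count times in ChEMBL."""
    ring_counts = lookup.process_smiles(smiles)  # [(ring_smiles, count), ...]
    if not ring_counts:
        return False  # acyclic molecule: nothing to flag
    _, min_freq = uru.get_min_ring_frequency(ring_counts)
    return min_freq < min_count
```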
MolMIM and Claude (when prompted to scaffold hop) generated a significant number of rare ring systems relative to the original data. Furthermore, when visualising a sample of molecules containing rare rings, we can see that both Claude and MolMIM generate some very unusual scaffolds, with Claude producing one molecule that superficially resembles a PROTAC (Figure 2).
Figure 2: Sample of compounds containing rare rings for Claude and MolMIM.
After filtering out all compounds with rare ring systems, we wanted to determine whether any generator is more likely to violate other structural alerts. To do so, we used the Lilly Medchem Rules (Lilly MedChem rules, GitHub) and determined the pass rate for each model and dataset combination (Figure 3).
Figure 3: Pass rate of the Lilly Medchem Filters for each dataset/model combination.
CReM exhibited lower pass rates than the other models, possibly because the fragment library used to mutate and grow molecules contained a few fragments with structural alerts. However, overall pass rates were acceptable in comparison to the original data, so filtering the outputs of each model for rare rings and structural alerts, such as the Lilly Medchem Rules, is a feasible approach to cleaning the generated molecules.
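A minimal sketch of this filtering step, assuming the rules repository has been built locally as described in its README (passing molecules are written to stdout by the Ruby driver script):

```python
# Sketch: compute a Lilly Medchem Rules pass rate by shelling out to the
# repository's Ruby driver script; assumes the C++ tools are already built.
import subprocess

def lilly_pass_rate(smiles_file: str, n_input: int) -> float:
    """Fraction of molecules in a .smi file that pass the Lilly rules."""
    result = subprocess.run(
        ["ruby", "Lilly_Medchem_Rules.rb", smiles_file],
        capture_output=True, text=True, check=True,
    )
    # Each surviving molecule is echoed to stdout, one per line.
    n_passed = sum(1 for line in result.stdout.splitlines() if line.strip())
    return n_passed / n_input
```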
How similar are generated molecules to the original ones?
For the following analysis, we removed all molecules that had rare ring systems and/or did not pass the Lilly Medchem Rules filter, removing 2,401 of the 15,285 generated molecules in total. Next, we qualitatively and quantitatively assessed structural differences between molecules generated by the different methods. Figure 4 shows a few example molecules for each method. Some contained motifs more commonly seen as protecting groups in an intermediate en route to a final product (e.g. a Boc-protected amine) or groups that would hydrolyse in the human body; a simple substructure check for such motifs is sketched below. Overall, however, the generated molecules all appear similar to the original ones.
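As an example of such a check, the sketch below flags Boc-protected amines with a simple SMARTS pattern (the pattern is our own illustration, not part of the original analysis):

```python
# Flag one of the motifs mentioned above: a Boc-protected amine
# (tert-butyl carbamate). The SMARTS pattern is our own illustration.
from rdkit import Chem

BOC_AMINE = Chem.MolFromSmarts("CC(C)(C)OC(=O)N")

def has_boc_amine(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and mol.HasSubstructMatch(BOC_AMINE)

print(has_boc_amine("CC(C)(C)OC(=O)NCCc1ccccc1"))  # True
```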
For a more quantitative analysis, we calculated the maximum Tanimoto similarity between each generated molecule and its most closely related original one, comparing Morgan fingerprints (bit vectors representing the presence or absence of molecular substructures), to determine how similar the generated molecules were to the original dataset. A Tanimoto similarity of 0 means a generated molecule shares no fingerprint bits with any original molecule, while a score of 1 means an exact fingerprint match, i.e. the generated compound is effectively present in the original dataset (Figure 5).
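A minimal sketch of this calculation with RDKit; the fingerprint parameters (radius 2, 2048 bits) are assumptions, as the post does not state them:

```python
# Maximum Tanimoto similarity of a generated molecule to the original set.
# Fingerprint parameters (radius 2, 2048 bits) are assumptions.
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def fingerprint(smiles: str):
    return fpgen.GetFingerprint(Chem.MolFromSmiles(smiles))

def max_tanimoto(generated: str, originals: list[str]) -> float:
    """Similarity of one generated molecule to its nearest original neighbour."""
    gen_fp = fingerprint(generated)
    orig_fps = [fingerprint(s) for s in originals]
    return max(DataStructs.BulkTanimotoSimilarity(gen_fp, orig_fps))
```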
MolMIM consistently generated significantly different molecules, with a median Tanimoto similarity of <0.5 for all datasets, while Claude with the default prompt stayed closest to the original data. Apart from those outliers, the distributions of the other generators looked roughly similar, ranging between 0.3 and 0.8, indicating 'relevant' sampling across both similar and not-so-similar molecules. Note that MolMIM was run with a minimum similarity of 0.7 and no property maximisation; it was therefore surprising that similarity remained so low.
Although the generated molecules (except those from MolMIM) appeared to follow similar distributions, they might still have been sampled from different chemical spaces. To investigate this, we performed dimensionality reduction on both the original and generated molecules to visualise the chemical space produced by each generator in a 2D plot. In these plots, proximity between two molecules indicates similarity of their molecular fingerprints, which represent the presence or absence of specific molecular substructures.
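The post does not name the embedding method; as a stand-in, the sketch below projects Morgan fingerprints into 2D with PCA:

```python
# 2D chemical-space projection of Morgan fingerprints. PCA is a stand-in
# here; any embedding (t-SNE, UMAP, ...) could be used instead.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from sklearn.decomposition import PCA

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def to_array(smiles: str) -> np.ndarray:
    fp = fpgen.GetFingerprint(Chem.MolFromSmiles(smiles))
    arr = np.zeros((fp.GetNumBits(),))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def embed_2d(smiles_list: list[str]) -> np.ndarray:
    """Stack fingerprints into a matrix and project to two components."""
    X = np.stack([to_array(s) for s in smiles_list])
    return PCA(n_components=2).fit_transform(X)
```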
As the aryl piperazine dataset was taken from a real chemical series, we examined it in more detail. Most generative methods appeared to sample roughly similar chemical spaces (Figure 6). The only obvious outlier was MolMIM, which, as expected from the Tanimoto scoring, generated significantly different molecules (Figure 5). Claude with explicit scaffold hopping did the same, albeit to a lesser extent, in line with the results in Figure 5.
Does any method generate novel scaffolds with a higher rate?
Frequently, especially in early-stage projects, chemists want to generate novel ideas for molecules to make and test that exhibit different scaffolds, i.e., the generator ‘hops’ from one scaffold to the next. Are any of the generative methods more likely to come up with truly novel scaffolds?
To investigate this, we extracted each compound's Murcko scaffold and skeleton scaffold (a generalised form of the Murcko scaffold, with all heteroatoms substituted by carbons and all bonds by single bonds; see Scaffold Analysis, OpenMolecules). We then compared the scaffolds in the original and generated datasets and calculated the fraction of novel Murcko/skeleton scaffolds for each generative method and each dataset (Figure 7). A value closer to 0 indicates that all scaffolds are shared between the original and generated datasets, while a value closer to 1 indicates no shared scaffolds.
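Both scaffold types, and the resulting novelty fraction, can be derived with RDKit's Murcko utilities, as in this minimal sketch:

```python
# Murcko and skeleton scaffolds via RDKit, plus the novelty fraction.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))

def skeleton(smiles: str) -> str:
    """Generic ('skeleton') scaffold: heteroatoms -> C, all bonds single."""
    scaffold = MurckoScaffold.GetScaffoldForMol(Chem.MolFromSmiles(smiles))
    return Chem.MolToSmiles(MurckoScaffold.MakeScaffoldGeneric(scaffold))

def novel_fraction(generated: list[str], originals: list[str], fn=murcko) -> float:
    """Fraction of generated scaffolds absent from the original dataset."""
    known = {fn(s) for s in originals}
    gen_scaffolds = [fn(s) for s in generated]
    return sum(s not in known for s in gen_scaffolds) / len(gen_scaffolds)
```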
Surprisingly, all methods generated molecules with high novelty rates (>0.6). As seen in Figure 7, MolMIM generated almost exclusively novel Murcko and skeleton scaffolds, while Claude with the default prompt generated the lowest fraction of novel scaffolds. All methods likely generate enough novel scaffolds to be interesting to a medicinal chemist. One might almost argue that the generated molecules tend to be too novel, but CReM, REINVENT4, and Claude can easily be adjusted to generate more 'boring' molecules.
Limitations of the analysis
We want to highlight several limitations of this blog post.
First, we did not explore goal-directed or reinforcement-based generation of molecules. Algorithms such as REINVENT4 and MolMIM allow scoring functions to steer generation towards high activity or drug-likeness, but in this work we only investigated similarity to a set of given molecules. Goal-directed generation was extensively reviewed by Du et al., 2024. Since building good models for target activity and ADMET remains challenging due to limited data volumes and noise, we decided to focus on sampling relevant molecules only.
We were also explicitly concerned with iterating over molecules in hit-to-lead and lead optimisation, not with de novo generation. Hence, our insights do not necessarily apply when trying to find binders for a target with no starting molecules. A related limitation is our choice to generate from one molecule at a time: REINVENT4 and Claude can, in theory, be conditioned on a series of molecules to combine knowledge from several compounds and traverse chemical space more effectively.
Finally, we did not spend much time optimising each method, which may explain the odd results with MolMIM. However, most methods seemed to generate relevant molecules without much optimisation.
Where to go from here?
Overall, our work suggests that molecular generators generally have no advantage over one another when the sole goal is generating relevant molecules, i.e. molecules that are diverse but not too different from the starting set.
Apart from MolMIM, all methods generated relevant molecules at a high rate and could thus be easily integrated into drug-design processes with appropriate filtering. A safe bet is CReM, which uses tried-and-tested medicinal chemistry rules to generate molecules. Claude may be an interesting option for the more adventurous, as the LLM alters its behaviour dramatically with different prompts and generates molecules with slightly higher scaffold diversity. Overall, it may make sense to use Claude for 'wild' ideation during hit-to-lead and CReM for lead optimisation, where you do not want to stray too far from the well-trodden path.
We believe that this post highlights the difficulties of benchmarking generative molecular AI: often, only a chemist can genuinely judge a molecule, and metrics are imperfect approximations. Evidence suggests that chemists show weak agreement when asked to rank chemical structures, both between and within individuals (Choung et al., 2023). Therefore, the best generative method will likely need to be selected by medicinal chemists based on their own preferences. In the end, AI does not seem to be inherently better at generating relevant molecules. As generators behave similarly, building good models for target activity and ADMET properties and combining them with robust generators may matter more than the choice of generative algorithm.
Contributions
Cecilia Cabrera, Max Jakobs, and Ryan Greenhalgh for data analysis and writing. Andrea Dimitracopoulos for editing. Pat Walters, Leonard Wossnig, Nelson Lam for comments and suggestions.
Code and datasets
GitHub: https://github.com/deepmirror/small-mol-gen
- AdenosineA2A: compounds with known IC50 against adenosine A2A, curated by DeepMirror from ChEMBL and containing 63 compounds.
- Aryl Piperazine: data from the Malaria Libre consortium, containing 175 compounds annotated for anti-malarial activity. Of all the datasets, this is the only one from a real chemical series.
- SIRT2: compounds with known IC50 against SIRT2, curated by DeepMirror from ChEMBL and containing 70 compounds.
References
Choung O-H, Vianello R, Segler M, Stiefl N, Jiménez-Luna J. Learning chemical intuition from humans in the loop. ChemRxiv. 2023 (preprint). doi:10.26434/chemrxiv-2023-knwnv-v2
Claude 3.5 Sonnet, Anthropic: https://www.anthropic.com/news/claude-3-family
CReM GitHub repository: https://github.com/DrrDom/crem
Du, Y., Jamasb, A.R., Guo, J. et al. Machine learning-aided generative molecular design. Nat Mach Intell 6, 589–604 (2024). https://doi.org/10.1038/s42256-024-00843-5
Lilly Medchem Rules: https://github.com/IanAWatson/Lilly-Medchem-Rules
Loeffler, H.H., He, J., Tibo, A. et al. Reinvent 4: Modern AI–driven generative molecule design. J Cheminform 16, 20 (2024). https://doi.org/10.1186/s13321-024-00812-5
Polishchuk, P. CReM: chemically reasonable mutations framework for structure generation. J Cheminform 12, 28 (2020). https://doi.org/10.1186/s13321-020-00431-w
Reidenbach, D., Livne, M., Ilango, R. K., Gill, M., & Israeli, J. (2022). Improving Small Molecule Generation using Mutual Information Machine. https://arxiv.org/abs/2208.09016v2
Useful rdkit Utils documentation: https://useful-rdkit-utils.readthedocs.io/en/latest/