The DeepMirror Engine put to the test in the Therapeutics Data Commons benchmark
Accurate prediction of molecular properties such as ADMET is key to designing better molecules. The AI Engine powering DeepMirror (DeepMirror Engine) is a unique synergy of generative and predictive algorithms that helps chemists move from hit compound to clinical candidate with unmatched precision and speed. But how does the predictive capability of the DeepMirror Engine compare against the many other approaches available to chemists? To offer a quantitative answer to this question, we tested the DeepMirror Engine on the popular Therapeutics Data Commons (TDC) benchmark. Our results show that, in most cases, the DeepMirror Engine outperforms easily accessible algorithms, the top algorithms on the TDC leaderboard, and an alternative commercial solution. We believe that benchmarks, while often flawed, can be informative and help build trust with our customers and the wider scientific community when used rigorously.
Background
Effective drugs need to be potent against their target of interest, exhibit good drug-like properties, and be safe. Achieving sufficient concentration in the desired region of the body to produce the therapeutic effect—and then being cleared without generating toxic metabolites or causing significant adverse effects—is crucial when progressing from a hit molecule to a clinical candidate (ref). This drug-like behaviour and safety profile are determined by ADMET properties: Absorption, Distribution, Metabolism, Excretion, and Toxicity. ADMET is an umbrella term that encompasses a set of properties describing a drug molecule's pharmacokinetics and pharmacodynamics, both contributing to the drug's efficacy and safety.
Medicinal chemists explore thousands of ideas when trying to improve the ADMET properties of their lead candidates, meaning they must make educated guesses about which molecules are worth bringing to the lab to avoid ballooning costs for their programmes (ref). Having access to predictive tools that prioritise the synthesis and testing of molecules with the highest chances of success is therefore critical, and a founding principle of DeepMirror (ref).
Previously (ref), we demonstrated that no single predictive approach excels across all use cases. To address this, the DeepMirror Engine combines many different methods of representing molecular structures (featurisers) with various machine learning algorithms, and dynamically selects the best-performing combination for each specific dataset. This ensures that the most appropriate model is used to predict each molecular property relevant to drug discovery.
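To make the idea concrete, here is a minimal sketch of this select-the-best-combination strategy, written with RDKit and scikit-learn. It is a hypothetical simplification, not the DeepMirror Engine itself: ECFP and MACCS keys stand in for the featurisers, and an SVM and gradient boosting stand in for the algorithms.

```python
# Hypothetical sketch of featuriser x algorithm selection by cross-validation;
# a simplification of the idea, not the DeepMirror Engine itself.
from itertools import product

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def ecfp(smiles, radius=2, n_bits=2048):
    """Morgan/ECFP bit-vector featuriser."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

def maccs(smiles):
    """166-bit MACCS keys featuriser."""
    return np.array(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smiles)))

FEATURISERS = {"ECFP": ecfp, "MACCS": maccs}
ALGORITHMS = {"SVM": SVR, "GBM": GradientBoostingRegressor}

def select_best(smiles_list, y):
    """Pick the featuriser/algorithm pair with the best cross-validated MAE."""
    best = None
    for (f_name, feat), (a_name, algo) in product(FEATURISERS.items(), ALGORITHMS.items()):
        X = np.stack([feat(s) for s in smiles_list])
        score = cross_val_score(algo(), X, y, cv=5,
                                scoring="neg_mean_absolute_error").mean()
        if best is None or score > best[0]:
            best = (score, f_name, a_name)
    return best
```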
Here, we discuss our recent progress on the development of the DeepMirror Engine in the context of the Therapeutics Data Commons (TDC) benchmark, and compare it to:
- The top of the TDC leaderboard
- A Baseline approach that combines readily available open-source algorithms
- An Industry Standard approach that corresponds to a well-established commercial solution.
Methods
We put the DeepMirror Engine to the test against the top algorithms on the TDC leaderboard, accessible open-source algorithms, and an alternative commercial solution. Performance was assessed on the test sets provided by the TDC for each dataset, ensuring that the data used for testing was not involved in training. The evaluation metric was either the Mean Absolute Error of the predictions (MAE) or Spearman's rank correlation coefficient (Spearman), as specified by the TDC. We used a total of 9 benchmarking datasets for which prediction performance metrics are available from individuals or groups that submit their entries to the leaderboard (https://tdcommons.ai/benchmark/ADMET_group/overview/). See the end of this blog post for more details on the different datasets.
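For readers who want to reproduce this setup, the TDC Python package (PyTDC) exposes the ADMET benchmark group directly. The sketch below follows the loading and evaluation loop from the TDC documentation, with the model-training step left as a placeholder:

```python
from tdc.benchmark_group import admet_group

group = admet_group(path="data/")
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get("Caco2_Wang")  # one of the 9 regression datasets
    name = benchmark["name"]
    train_val, test = benchmark["train_val"], benchmark["test"]
    train, valid = group.get_train_valid_split(benchmark=name,
                                               split_type="default", seed=seed)

    # --- train a model on `train`, tune it on `valid` (placeholder) ---
    y_pred_test = model.predict(test["Drug"])  # `model` is hypothetical here

    predictions_list.append({name: y_pred_test})

# Reports the mean and standard deviation of the TDC metric
# (MAE or Spearman) across the five seeds for each dataset.
results = group.evaluate_many(predictions_list)
```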
The baseline open-source approaches represented simple off-the-shelf solutions and included: ECFP + Support Vector Machine, ChemProp, and RDKit2D + LightGBM. To make the comparison fair, for each ADMET dataset in the TDC we compared the DeepMirror Engine to the best-performing model among these open-source algorithms (Baseline).
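As an example, here is a minimal sketch of the RDKit2D + LightGBM baseline, reusing the `train`, `test`, and `seed` names from the loading loop above; our actual baseline configuration may differ.

```python
import numpy as np
from lightgbm import LGBMRegressor
from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit2d(smiles):
    """Featurise a molecule with RDKit's full 2D descriptor set."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([fn(mol) for _, fn in Descriptors.descList])

X_train = np.stack([rdkit2d(s) for s in train["Drug"]])
X_test = np.stack([rdkit2d(s) for s in test["Drug"]])

# Hyperparameters here are illustrative defaults, not our tuned settings.
model = LGBMRegressor(n_estimators=500, random_state=seed)
model.fit(X_train, train["Y"].values)
y_pred_test = model.predict(X_test)
```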
While we were not able to run predictions with other commercial software ourselves, we found a useful white paper from an industry leader reporting the performance of their platform on the TDC, which we used as the Industry Standard comparison.
If the performance metric of a lower-ranked model was within the standard deviation of the metric of a higher-ranked model, we considered their performances similar. In such cases, we assigned both models a "joint" ranking. The mean and standard deviation were calculated based on the test set performance after running predictions using five different random seeds (see tables in appendix).
For the open-source approaches and our Engine, we also built 5 models with different random seeds to enable statistical comparisons. Note that this was not possible for the top leaderboard approaches or the commercial solution, as we did not have access to the underlying algorithms.
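A minimal sketch of this comparison logic, with hypothetical per-seed test MAEs standing in for real results (lower MAE is better):

```python
import numpy as np
from scipy.stats import ttest_ind

# Per-seed test MAEs for two methods (hypothetical numbers for illustration).
engine_maes = np.array([0.271, 0.268, 0.275, 0.270, 0.269])
baseline_maes = np.array([0.289, 0.301, 0.295, 0.292, 0.298])

engine_mean, engine_sd = engine_maes.mean(), engine_maes.std(ddof=1)
baseline_mean = baseline_maes.mean()

# Joint-ranking rule: the two methods are "similar" if the lower-ranked
# mean falls within one standard deviation of the higher-ranked mean.
joint_rank = abs(baseline_mean - engine_mean) <= engine_sd

# Independent t-test, Bonferroni-corrected for one test per benchmark dataset.
n_comparisons = 9
t_stat, p_value = ttest_ind(engine_maes, baseline_maes)
significant = (p_value * n_comparisons) < 0.05
```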
Results
When comparing to the TDC leaderboard, the DeepMirror Engine ranked 1st or joint 1st in 5 out of 9 datasets, and joint 3rd or better for all other datasets (Figure below).
When compared to the open-source baselines, the DeepMirror Engine matched or outperformed the 3 open-source algorithms in 8 out of 9 benchmarked datasets.
Finally, the DeepMirror Engine matched or outperformed the Industry Standard in all 9 benchmarked datasets.
Conclusions
When compared to open-source and commercial alternatives, the DeepMirror Engine shows a remarkable ability to deliver top accuracy while automatically and dynamically adapting to very different ADMET property prediction tasks, offering the best overall predictive performance on this benchmark.
While the results of our tests show clear advantages of using the DeepMirror Engine over the alternatives, two limitations in our testing are worth highlighting; addressing them is part of our mission to continuously improve our predictive AI framework:
- Absence of classification datasets in our TDC benchmark: In this project, we focussed on how the DeepMirror Engine performs on regression tasks. Subsequent work will adapt the Engine to the classification datasets available in the benchmark.
- Single benchmark: While the TDC benchmark allowed us to test the DeepMirror Engine across many relevant ADMET properties, in the future we plan on evaluating its performance on more benchmarks, such as the new version of TDC (https://www.biorxiv.org/content/10.1101/2024.06.12.598655v2), and the new Polaris benchmark (https://polarishub.io).
Acknowledgements
We thank Jacob Green and Cecilia Cabrera for data analysis, Andrea Dimitracopoulos and Max Jakobs for writing and data analysis, and Leonard Wossnig for helpful comments and suggestions.
For the keen reader…
- How did you decide the data splits? Splits were provided by TDC (train and test). Only the train set was used for all model training and selection.
- Where does the Standard Deviation come from? Model training was run 5 times, each with a different random seed, and we report the mean and standard deviation of the test set metrics across those runs.
- What featuriser-algorithm combinations are included in the DeepMirror Engine? Featurisers include: ECFP, Mordred, ChemBERTa, InfoMax2D, Graph, MTL-BERT, and more. Algorithms include: LGBM, SVM, KNN, GP, MLP, GNN, and more.
- How was the best model selected? The best model was selected using cross-validation on the train set, and the results were then evaluated on the test set. Usually, the best model was consistent across random seeds; where it wasn't, we selected the model that was best for most random seeds and used that model's metrics from all random seeds. We consider this approach the most representative of how our training process works in practice.
- Did you preprocess any of the datasets? Two datasets have target measurements that are exponentially distributed: Half Life and VDss. In keeping with common ML practice and other published TDC results, we trained on the base-10 logarithm of the target variables for these datasets (see the sketch after this list).
- How come some Standard Deviation values are 0.000? The main sources of randomness in our pipeline are (1) the data splits and (2) hyperparameter selection. The baselines do not perform (2), so zero variance for them is expected. The remaining values are in line with other TDC results, showing robustness to different data splits on these large datasets.
- Why did you use the bolded table method to report results? While imperfect, bolding provides a quick way to highlight differences between the methods we compared. To overcome the limitations of comparing single values across methods, we considered a method better only when the second best was not within its error range, and carried out statistical testing when possible.
- Why did you not report MSE and Pearson's Correlation Coefficient? These additional metrics were not reported as they are not part of the TDC benchmark.
- How do these results impact a real-world project? Please keep an eye out for our case studies on our blog, in which we provide details on how DeepMirror has impacted our customers when designing better molecules.
- If the DeepMirror Engine includes all the open-source baselines, why does it not outperform them in all TDC datasets? The devil is in the details: differences in the training pipelines and hyperparameters can lead to variability in the results.
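To illustrate the preprocessing answer above, here is a minimal sketch of the log-transform for the Half Life and VDss targets, continuing from the TDC loading loop in Methods (`model` and `X_test` are placeholders):

```python
import numpy as np

# Half Life and VDss targets are heavy-tailed, so train on log10(Y).
train["Y"] = np.log10(train["Y"].values)
valid["Y"] = np.log10(valid["Y"].values)

# ... fit `model` on the transformed targets (placeholder) ...

# 10**pred recovers the original scale; for these two datasets the TDC
# metric is Spearman, which is unchanged by this monotone transform.
y_pred_test = 10 ** model.predict(X_test)
```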
Data tables
Performance of the DeepMirror Engine against the top performing model, and overall ranking across all models on the TDC leaderboard. Better models show lower MAE and higher Spearman. Note that statistical comparison was not possible as the leaderboard only reported mean ± sd.
Performance of the DeepMirror Engine against Open-Source Software. Better models show lower MAE and higher Spearman. * denotes p<0.05 for independent t-test with Bonferroni correction for multiple comparisons.
Performance of the DeepMirror Engine against the Industry Standard. Better models show lower MAE and higher Spearman. Note that statistical comparison was not possible as the Industry Standard white paper only reported mean ± sd.
About the Therapeutics Data Commons
The TDC is an open science initiative that provides a unified platform with comprehensive datasets, AI models, and benchmarks. This resource supports research across various therapeutic modalities and stages of drug discovery and development. We focussed on the ADMET properties that could be represented as regression tasks, both because our Engine was first optimised for those tasks and because the thresholds that divide results into acceptable/not acceptable often vary with the stage of a drug discovery programme. Regression tasks involve predicting continuous values — such as exact measurements of absorption rates, distribution volumes, metabolic half-lives, excretion rates, or toxicity levels — rather than simple categorical outcomes (e.g., toxic vs. non-toxic). This continuous data provides nuanced insights into how well the Engine can estimate the specific magnitudes of these properties, which is crucial for drug discovery decisions.
These ADMET properties include:
- Absorption, measuring how a drug reaches its site of action:
- Caco2 Permeability (Caco2, https://tdcommons.ai/single_pred_tasks/ADMET/#caco-2-cell-effective-permeability-wang-et-al)
- Lipophilicity (LogD, https://tdcommons.ai/single_pred_tasks/ADMET/#lipophilicity-astrazeneca)
- Solubility (https://tdcommons.ai/single_pred_tasks/ADMET/#solubility-aqsoldb)
- Distribution, measuring how a drug is distributed across the various tissues of the body:
- Plasma Protein Binding Rate (PPBR, https://tdcommons.ai/single_pred_tasks/ADMET/#ppbr-plasma-protein-binding-rate-astrazeneca)
- Volume of Distribution at steady state (VDss, https://tdcommons.ai/single_pred_tasks/ADMET/#vdss-volumn-of-distribution-at-steady-state-lombardo-et-al)
- Excretion, measuring how a drug is removed from the body:
- Half Life (https://tdcommons.ai/single_pred_tasks/ADMET/#half-life-obach-et-al)
- Clearance (Hepatocyte and Microsome, https://tdcommons.ai/single_pred_tasks/ADMET/#clearance-astrazeneca)
- Toxicity, measuring whether a drug could cause severe adverse effects:
- Lethal Dose (LD50, https://tdcommons.ai/single_pred_tasks/tox/#acute-toxicity-ld50)