Machine learning inference of natural product chemistry across biosynthetic gene cluster types

bioRxiv – March 13, 2025

Source: medRxiv/bioRxiv/arXiv

Summary

With the surge of genomic data, understanding how genes produce valuable natural compounds is more important than ever. A new machine learning tool, CHAMOIS, predicts chemical properties of metabolites from biosynthetic gene clusters. It identifies links between genes and chemical traits, offering insights into unexplored biochemical functions. Notably, it can often identify which gene cluster in a genome is responsible for a specific metabolite, which supports the discovery of new natural products.

Abstract

With ever-increasing volumes of sequencing data for biosynthetic gene clusters (BGCs), computational methods that accurately predict which secondary metabolites these clusters produce are critically lacking. Here, we present CHAMOIS, a machine learning-based tool for predicting chemical properties of secondary metabolites from protein domains annotated in the input BGCs. CHAMOIS infers 485 chemical properties from the ChemOnt ontology using logistic regression. It accurately predicts 111 such properties (AUPRC > 0.5) in cross-validation against known instances. Although CHAMOIS is not explicitly trained on biosynthetic knowledge, many of the inferred links between protein domains and metabolite properties are consistent with the scientific literature, while others suggest new biochemical functions of uncharacterized biosynthetic domains. Finally, CHAMOIS can pinpoint which BGC within a given genome produces a pre-specified metabolite (the correct BGC ranks among the top 5 in 69% of cases), which holds great potential for prioritising experimental BGC characterisation and discovery of novel biosynthetic enzymes.
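The two evaluation ideas in the abstract, per-property AUPRC and top-5 ranking of candidate BGCs, can be illustrated with a minimal Python sketch. This is not the CHAMOIS code or its API. The function names, the toy data, and in particular the agreement score used to compare predicted property probabilities with a query metabolite are assumptions made for illustration; the abstract does not state how BGCs are scored against a metabolite. Average precision is used here as a common step-wise estimate of AUPRC.

```python
from typing import Dict, List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one property: a common estimate of AUPRC."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Visit examples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


def agreement(probs: Sequence[float], query: Sequence[int]) -> float:
    """Hypothetical score: expected number of properties that match the query."""
    return sum(p if q else 1.0 - p for p, q in zip(probs, query))


def rank_bgcs(predicted: Dict[str, List[float]], query: Sequence[int]) -> List[str]:
    """Order BGC identifiers from best to worst agreement with the query."""
    return sorted(predicted, key=lambda name: agreement(predicted[name], query),
                  reverse=True)


def in_top_k(ranking: Sequence[str], true_bgc: str, k: int = 5) -> bool:
    """True if the known producer BGC appears among the first k candidates."""
    return true_bgc in ranking[:k]


if __name__ == "__main__":
    # Toy data: predicted probabilities for three properties per BGC.
    predicted = {
        "bgc_a": [0.9, 0.1, 0.8],
        "bgc_b": [0.2, 0.7, 0.3],
        "bgc_c": [0.6, 0.4, 0.5],
    }
    query = [1, 0, 1]  # binary property vector of the metabolite of interest
    ranking = rank_bgcs(predicted, query)
    print(ranking, in_top_k(ranking, "bgc_a", k=2))
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Under this reading, a property would count as accurately predicted when its average precision across held-out BGCs exceeds 0.5, and the 69% figure would correspond to the fraction of genomes for which `in_top_k(..., k=5)` holds for the known producer BGC.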