Machine Learning in Biomolecular Research

Team leader: dr hab. Dominik Gront, prof. ucz.

Team leader’s e-mail address: dgront@chem.uw.edu.pl

Brief description of the research topic:

Individual biomacromolecules can consist of tens of thousands of atoms. Modeling their structure and dynamics has always been a challenge that requires the development of new algorithms. For many years, the Laboratory of Theory of Biopolymers has been working on coarse-grained modeling methods, which simplify the description of molecules and thereby allow faster computations and the analysis of large systems.

The development of machine learning (ML) methods has changed the approach to biomolecular modeling. Although ML techniques have been present in bioinformatics since the 1990s, their effectiveness has increased significantly in recent years. Modern training algorithms and new neural network architectures now enable the prediction of protein structures and the analysis of their properties with an accuracy surpassing that of classical computational chemistry methods.

My research group currently focuses on two main projects:

1) Classification of enzymes from the P450 superfamily.

The result of several years of work is the P450 Atlas (https://p450atlas.org/), a portal collecting information about known protein sequences from this superfamily. The gathered data enabled us to develop a predictor (algorithm) that classifies a given sequence into one of more than 10,000 known families.

This tool has been made available as a web service. Its popularity is steadily growing; in April, the P450 Atlas classified nearly 1,000 sequences submitted by anonymous users. The model is now used to search protein sequence databases to identify additional P450 enzymes. The next step will be to use these data to train a large language model. Another upcoming research task is the development of a model capable of predicting the function (substrate and product) of a given P450 amino acid sequence.

2) Machine learning methods in protein structure modeling.

For several years, we have been developing machine learning methods that support or replace classical molecular modeling tools. One such task is reconstructing an all-atom representation of a protein from its coarse-grained model. This problem was addressed in two publications describing the machine learning models HECA and deepBBQ. In an ongoing project, we are working on a generative AI model capable of generating random polypeptide chains with conformations similar to those of natural proteins.