Periodic Reporting for period 1 - LIGATE (LIgand Generator and portable drug discovery platform AT Exascale)
Reporting period: 2021-01-01 to 2022-06-30
The proposed LIGATE solution, as a fully integrated workflow, makes it possible to deliver the results of a drug discovery campaign with the best possible trade-off between speed and accuracy, auto-tuning the solution's parameters to meet the given time and resource constraints.
This predictability, together with the full automation of the solution and the availability of the Exascale system, will make it possible to run a full in silico drug discovery campaign in less than one day, allowing a prompt response to, for example, novel worldwide pandemic crises.
Since the evolution of HPC architectures is heading toward specialization and extreme heterogeneity, including future Exascale architectures, the LIGATE solution also focuses on code portability, with the possibility of deploying the CADD platform on any available type of architecture so as to avoid being locked into legacy hardware.
The project will also make the platform openly available to support the discovery of novel treatments to fight virus infections and multidrug-resistant bacteria, and will provide the research community with the outcome of a final simulation.
The LIGATE Consortium, coordinated by Dompé farmaceutici, is composed of 11 institutions from 5 European countries.
Together with this, we analyzed how the different components of the drug discovery pipeline should interact with each other and exchange data, and eventually demonstrated how both the LiGen and GROMACS pipelines can be run within HyperQueue.
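As an illustration of this kind of integration, the sketch below submits a docking task and a dependent GROMACS task through the HyperQueue Python API. The executable names, command-line arguments and exact API usage are assumptions for illustration only, not the project's actual scripts.

```python
# Minimal sketch: chaining a docking task and a GROMACS task in HyperQueue.
# Assumes a running HyperQueue server; commands, paths and the exact API
# parameters (e.g. deps) are illustrative assumptions.
from hyperqueue import Client, Job

client = Client()  # connect to the default HyperQueue server directory

job = Job()
# Hypothetical docking step producing MOL2 poses for one ligand.
dock = job.program(
    ["ligen-dock", "--ligand", "ligand_0001.mol2", "--out", "poses_0001.mol2"],
)
# Hypothetical free-energy step that depends on the docking output.
job.program(
    ["gmx", "mdrun", "-deffnm", "ligand_0001_fep"],
    deps=[dock],
)

submitted = client.submit(job)
client.wait_for_jobs([submitted])
```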
The current LiGen-GROMACS integration allows LiGen MOL2 poses to be used to create GROMACS coordinate and topology files suitable for the free energy calculations. Work has focused on the requirements for automatic topology and parameter generation, and on automatically choosing the simulation length needed to reach a target precision.
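The logic for choosing the simulation length automatically can be summarised as an extend-and-check loop: run a segment, re-estimate the free-energy error, and stop once the target precision is reached. The sketch below illustrates that idea; the `run_segment` and `estimate_error` helpers, as well as the numeric thresholds, are hypothetical stand-ins for the actual GROMACS wrappers.

```python
# Illustrative extend-and-check loop: keep extending the free-energy simulation
# until the estimated statistical error falls below a target precision or a
# hard step budget is reached. All constants and helpers are assumptions.
TARGET_ERROR_KJ_MOL = 1.0   # target precision (assumption)
SEGMENT_STEPS = 500_000     # extension per iteration (assumption)
MAX_STEPS = 10_000_000      # safety cap (assumption)

def run_until_converged(run_segment, estimate_error):
    """run_segment(n_steps) advances the simulation; estimate_error() returns
    the current free-energy error estimate in kJ/mol. Both are hypothetical."""
    total_steps = 0
    while total_steps < MAX_STEPS:
        run_segment(SEGMENT_STEPS)
        total_steps += SEGMENT_STEPS
        error = estimate_error()
        if error <= TARGET_ERROR_KJ_MOL:
            return total_steps, error
    return total_steps, estimate_error()
```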
As a result, the GROMACS Accelerated Weight Histogram (AWH) pipeline for evaluating the binding free energy of the system has been ported into a Python library integrated with HyperQueue through an API, and has been run successfully on the Karolina cluster.
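For context, AWH is configured through GROMACS .mdp options that couple a pull coordinate to an adaptive bias. The helper below is only a sketch of writing such a fragment: the option names follow the GROMACS manual, but the values and group names are illustrative placeholders rather than the settings used in the LIGATE pipeline.

```python
# Illustrative helper writing the AWH-related part of a GROMACS .mdp file.
# Option names follow the GROMACS manual (pull coupling + AWH bias); the
# values and group names are placeholders, not LIGATE production settings.
AWH_MDP_OPTIONS = {
    "pull": "yes",
    "pull-ncoords": "1",
    "pull-ngroups": "2",
    "pull-group1-name": "protein",      # group names are assumptions
    "pull-group2-name": "ligand",
    "pull-coord1-groups": "1 2",
    "pull-coord1-geometry": "distance",
    "pull-coord1-type": "external-potential",
    "pull-coord1-potential-provider": "awh",
    "awh": "yes",
    "awh-nbias": "1",
    "awh1-ndim": "1",
    "awh1-dim1-coord-index": "1",
    "awh1-dim1-start": "0.3",           # nm, illustrative
    "awh1-dim1-end": "2.0",             # nm, illustrative
    "awh1-dim1-force-constant": "128000",
    "awh1-dim1-diffusion": "5e-5",
}

def write_awh_mdp(path: str, options: dict = AWH_MDP_OPTIONS) -> None:
    with open(path, "w") as handle:
        for key, value in options.items():
            handle.write(f"{key:40s} = {value}\n")
```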
We evaluated the benefits of applying machine learning techniques at two main points of the LIGATE drug-discovery pipeline. Starting from an analysis of the pipeline's needs, we identified two main modules where machine learning approaches can be used: one supporting the pose filtering phase, needed to remove docking poses that are not promising, and one supporting the scoring of the ligand-protein interaction by predicting their binding affinity.
While only an initial analysis has been carried out for the machine learning based binding affinity predictor, several pose prediction techniques have already been evaluated and tested. The main step for the second half of the project will be to build a dataset for training next-generation pose selectors at scale on ligand poses labelled pointwise with free energy values derived automatically using the finalized version of the LiGen-GROMACS pipeline described above.
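As a sketch of what the pose-filtering module could look like, the snippet below trains a simple classifier on precomputed per-pose descriptors. The synthetic features, the random-forest model and the probability cutoff are all assumptions for illustration, not the techniques actually evaluated in the project.

```python
# Minimal sketch of a pose-filtering classifier: given per-pose descriptors,
# predict whether a docking pose is worth keeping for free-energy evaluation.
# Features, labels, model choice and threshold are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder data: in the real pipeline these would be descriptors computed
# from LiGen poses, labelled using free-energy results as described above.
X = rng.normal(size=(1000, 8))                   # e.g. docking score, contacts
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic keep/discard label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Keep only poses whose predicted probability of being good exceeds a cutoff.
keep_mask = clf.predict_proba(X_test)[:, 1] > 0.5
print(f"kept {keep_mask.sum()} of {len(keep_mask)} poses")
```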
The data files being used vary considerably in both size and format, and this requires the project to adopt a selection of management strategies. For example, LiGen needs to process a very large number of small ligand files, whereas the free-energy simulations in GROMACS involve a smaller number of files, some of which, especially the dynamics trajectories, can be very large. As well as describing the data used, we have performed experiments to measure data transfer between partner sites with iRODS and analyzed the various methods available for data compression. In the case of iRODS, we successfully demonstrated that data transfer between CINECA, E4 and the IT4I infrastructure was not only possible but also fast. Finally, we showed that the Burrows-Wheeler method gives the best all-round performance in terms of speed and compression ratio.
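The compression comparison can be reproduced in spirit with the standard Python codecs, since bz2 implements a Burrows-Wheeler based scheme while gzip (DEFLATE) and lzma serve as reference points. The snippet below is a minimal benchmark sketch with a placeholder file name, not the exact methodology used in the project.

```python
# Minimal sketch of a compression benchmark on a trajectory-like binary file.
# bz2 is a Burrows-Wheeler based compressor; gzip and lzma are included for
# comparison. The input file name is a placeholder.
import bz2, gzip, lzma, time

def benchmark(path: str) -> None:
    data = open(path, "rb").read()
    codecs = {"gzip": gzip.compress, "bz2 (BWT)": bz2.compress, "lzma": lzma.compress}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        ratio = len(data) / len(compressed)
        print(f"{name:10s} ratio={ratio:5.2f} time={elapsed:6.2f}s")

benchmark("trajectory.xtc")  # placeholder file name
```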
Finally, we experimented with the use of NVIDIA RAPIDS and Dask technologies for data management and analysis over multiple GPUs but, despite support from NVIDIA, we could not demonstrate the feasibility of the approach. We concluded that the technology is perhaps not yet mature enough to deal with large (gigabyte-scale) data sets, but since this may change with updated software libraries we will try again later in the project. In the second half of the project, we will study further I/O solutions, including the relative merits of object storage and of relational and NoSQL databases.
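For reference, the multi-GPU approach we attempted follows the standard dask-cuda pattern sketched below, with one Dask worker per GPU and cuDF dataframes distributed across them. The file name, column names and aggregation are placeholders illustrating the intent, not the actual analysis code.

```python
# Sketch of the attempted multi-GPU data-analysis setup: one Dask worker per
# GPU via dask-cuda, with cuDF dataframes distributed across them.
# File and column names are illustrative placeholders.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()        # spawns one worker per visible GPU
client = Client(cluster)

# Load a large (gigabyte-scale) CSV of docking results across the GPUs.
df = dask_cudf.read_csv("docking_results.csv")
summary = df.groupby("ligand_id")["score"].min().compute()
print(summary.head())
```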
Furthermore, progress has been made in system runtime optimization, data distribution and energy efficiency.
Another significant outcome is the HyperQueue HPC job scheduler. This software has been developed to support the LIGATE use cases together with requirements provided by the pre-exascale LUMI consortium and by users of the petascale Karolina system. Thanks to this, we are now prepared to evaluate the execution of complex LIGATE workflows (e.g. the CADD pipeline) on the abovementioned systems and on the LEONARDO pre-exascale system. These achievements will thus allow for an integrated, efficient, hardware-agnostic workflow that can easily be deployed at several supercomputing centers, which will be particularly relevant in case of future health emergencies.