CORDIS - EU research results

Machine learning prediction for breast cancer therapy

Periodic Reporting for period 1 - PredAlgoBC (Machine learning prediction for breast cancer therapy)

Reporting period: 2019-10-01 to 2021-09-30

Breast cancer is a leading cause of cancer death worldwide. In Europe, it caused about 138,000 deaths in 2018. This high death rate is mainly due to metastatic cancers, for which treatment is less effective than for non-metastatic cancer (5-year survival of 27% for metastatic cancer versus 90% for all breast cancers). Metastatic cancer arises when cells from the initial breast tumor spread throughout the body. Some patients already have metastatic cancer when they are first diagnosed, but for most it develops when the initial disease escapes treatment.
To reduce tumor escape from treatment, patients need to receive the drugs best suited to the characteristics of their own tumor. To do that, we need to find which tumor characteristics, measured before treatment, help define which treatment is most appropriate for each person. This is called personalized (or precision) medicine. The search for these measurable characteristics (called biomarkers) relies on large databases built from collected patient tumor information. More precisely, we have to search among several thousand tumor characteristics (measured before treatment) for those that help predict a patient's response to treatment. To analyze thousands of characteristics from thousands of patients, we need suitable mathematical tools that extract the relevant information from the large amount of irrelevant information. These tools are machine learning (ML) algorithms.

The search for biomarkers of response to treatment is a growing field, but only a few biomarker "signatures" have reached the clinic. One reason for this limited success is the so-called curse of dimensionality: health data contain a large number of measured characteristics, but there are often too few patient samples to represent the variation in the population correctly, and ML algorithms do not perform well in this situation. The goal of this project is to combine mathematical approaches with thorough biological analysis so that the information given by the algorithms becomes usable in the clinic.

PredAlgoBC reached its objectives: we obtained two ready-to-use signatures (sets of biomarkers) that can predict response to hormonotherapy in breast cancer.
First, we gathered data from public databases to build a dataset large enough to represent the entire population. We collected about 12,500 tumor characteristics from about 4,000 patients with breast cancer, together with their follow-up information.

Then, we tested mathematical tools and their tuning to analyze the dataset. Among existing ML algorithms, we used predictive algorithms. These perform supervised analysis, which consists of building a mathematical model from known variables in order to predict an output. Here, the variables are the gene expression levels that reflect the components (characteristics) of the tumor, and the output is the response to treatment. We built hundreds of prediction models based on the tumor characteristics of the 4,000 patients, testing several ML algorithms and several tuning parameters for each.
To test whether the models performed well, we split our dataset into two parts. The first part was used to train the algorithms to predict the outcome. Once training was complete, we tested the prediction performance of each model on the second part of the dataset (for which we also know the response to treatment). In this way, we can compare the response predicted by the mathematical models with the known treatment response and determine whether the models perform well.
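
The train-and-test procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cohort is synthetic (200 toy "patients", 20 toy "genes", two of which carry signal), the classifier is a simple nearest-centroid rule rather than any algorithm used in the project, and the 70/30 split ratio is arbitrary.

```python
import random

def train_test_split(samples, labels, test_fraction=0.3, seed=0):
    """Shuffle patient indices and hold out a fraction for testing."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_test = int(round(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pick = lambda ids, seq: [seq[i] for i in ids]
    return (pick(train_idx, samples), pick(test_idx, samples),
            pick(train_idx, labels), pick(test_idx, labels))

def fit_centroids(X, y):
    """'Training': compute the average expression profile of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose average profile is closest to this sample."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def accuracy(centroids, X, y):
    """Fraction of samples whose predicted response matches the known one."""
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)

def make_toy_cohort(n_patients=200, n_genes=20, seed=1):
    """Synthetic data (hypothetical): only genes 0 and 1 depend on the response."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_patients):
        label = rng.randint(0, 1)            # 1 = responder, 0 = non-responder
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[0] += 2.0 * label
        row[1] -= 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

X, y = make_toy_cohort()
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = fit_centroids(X_train, y_train)          # learn on the first part only
test_accuracy = accuracy(model, X_test, y_test)  # score on the unseen second part
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Because the held-out patients play no role in training, their accuracy is an estimate of how the model would behave on new patients. Repeating the same split-train-score loop for each algorithm and each tuning value is one plausible way to compare many candidate models.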

Then, we extracted candidate biomarkers from the best prediction models. When a model is built, not all variables have the same importance for predicting the outcome. For each model, the variables are ranked according to their importance in the prediction. The best-ranked variables are the ones that can be tested as potential biomarkers. Using this importance ranking, we selected about 350 characteristics, among the 12,500 measured, that the ML analysis identified as important. By analyzing these characteristics, we identified actors of neural development as tumor components linked to poor response to hormonotherapy in breast cancer; they had never been highlighted as such before. We confirmed the importance of these components in response to treatment in several independent clinical datasets. This indicates that our models are representative of the entire population, and not only of the specific group that we studied.
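
As a rough illustration of selecting candidates from importance rankings, the sketch below ranks the variables within each model and keeps those with the best average rank across models. The averaging rule, the gene names and the importance scores are all hypothetical; the report does not say how rankings from different models were combined. The sketch also assumes every model scores the same set of variables.

```python
def rank_features(importances):
    """Order feature names from most to least important (rank 1 = best)."""
    return sorted(importances, key=importances.get, reverse=True)

def consensus_top_k(per_model_importances, k):
    """Average each feature's rank across models and keep the k best."""
    rank_sums = {}
    for importances in per_model_importances:
        for rank, name in enumerate(rank_features(importances), start=1):
            rank_sums[name] = rank_sums.get(name, 0) + rank
    n_models = len(per_model_importances)
    mean_rank = {name: total / n_models for name, total in rank_sums.items()}
    # Ties in mean rank are broken alphabetically to keep the result deterministic.
    return sorted(mean_rank, key=lambda name: (mean_rank[name], name))[:k]

# Made-up importance scores from three hypothetical models.
models = [
    {"gene_A": 0.40, "gene_B": 0.05, "gene_C": 0.30, "gene_D": 0.25},
    {"gene_A": 0.25, "gene_B": 0.05, "gene_C": 0.30, "gene_D": 0.40},
    {"gene_A": 0.50, "gene_B": 0.02, "gene_C": 0.28, "gene_D": 0.20},
]
candidates = consensus_top_k(models, k=2)
print(candidates)  # ['gene_A', 'gene_C']
```

Working with ranks rather than raw scores keeps models with different importance scales comparable, which matters when several algorithms are pooled.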

We also implemented a deep learning algorithm to create virtual patients in order to enlarge patient cohorts. For this we used a type of algorithm called a generative adversarial network (GAN). GANs were initially used to create images that look like photographs of human faces, even though the faces do not belong to any real person (see examples at https://thispersondoesnotexist.com/). We adapted the algorithm to create synthetic tumor datasets mimicking the 12,500 measured tumor characteristics, and were able to create virtual tumor samples that could not be differentiated from real tumor samples.
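
The adversarial idea behind a GAN can be sketched on a deliberately tiny problem. The example below is not the project's model: it works on a single made-up measurement instead of 12,500, uses a linear generator and a logistic discriminator with hand-written gradients instead of deep networks, and all parameter values (`real_mean`, `steps`, `lr`) are illustrative assumptions. It only shows the alternating training loop, in which a discriminator learns to tell real from generated values while a generator learns to fool it.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(real_mean=3.0, steps=2000, lr=0.01, seed=0):
    """Alternate one discriminator update and one generator update per step."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: G(z) = a*z + b
    w, c = 0.0, 0.0   # discriminator: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        x_real = rng.gauss(real_mean, 1.0)   # one "real" toy measurement
        z = rng.gauss(0.0, 1.0)              # random noise fed to the generator
        x_fake = a * z + b                   # one "virtual" measurement

        # Discriminator step: minimise -log D(x_real) - log(1 - D(x_fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = -(1.0 - d_real) * x_real + d_fake * x_fake
        grad_c = -(1.0 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimise -log D(G(z)) (non-saturating loss).
        d_fake = sigmoid(w * x_fake + c)
        grad_x = -(1.0 - d_fake) * w         # gradient of the loss w.r.t. x_fake
        a -= lr * grad_x * z
        b -= lr * grad_x
    return a, b, w, c

a, b, w, c = train_toy_gan()
sample_rng = random.Random(42)
virtual_samples = [a * sample_rng.gauss(0.0, 1.0) + b for _ in range(5)]
print(virtual_samples)
```

GAN training is not guaranteed to converge, and a linear discriminator like this one can only react to a shift in the average value, so the sketch should not be read as evidence of sample quality. Judging whether virtual samples are indistinguishable from real ones requires a separate evaluation.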


The PredAlgoBC project provides clinicians with a ready-to-use biomarker signature that helps distinguish patients who should receive hormonotherapy from those who should not. It also led to the identification of neuronal components as potential new targets for breast cancer therapy.
The results are currently under review for publication in a scientific journal.
The algorithm pipelines that allow others to run the same analysis on other datasets are publicly available.
The GAN algorithm was developed by a master's student in the lab, who obtained a grant to start a PhD to further develop the use of GANs in the cancer field.
The identification of the nervous system as an important tumor component for response to hormonotherapy led us to start a collaboration to investigate this involvement more deeply through bench work.
The purpose of our project was to implement a mathematical pipeline to define a clinic-compatible gene signature that could be used for personalized medicine. We obtained a biomarker signature that helps predict response to hormonotherapy, and showed for the first time that the tumor nervous system is involved in response to hormonotherapy.

The next step will be to implement this new signature for clinical use. This requires clinically usable technology for reliable and affordable measurement of gene expression. We therefore have to determine the best way to evaluate these components in the clinic using the tools available at our comprehensive cancer center (ICO), so that the test can be run routinely (with assays such as PCR or immunohistochemistry). Validation for PCR and immunohistochemistry can be performed on paraffin-embedded breast tumor blocks. These are tumor samples systematically harvested and stored at the ICO tumor bank.

Once the best clinical assay has been chosen, we will carry out a retrospective analysis on patients at the ICO to validate these new markers in the clinic and to confirm with clinicians whether our biomarker signature can be used routinely to help stratify patients so that each receives the best-adapted treatment.
PredAlgoBC pipeline for predictive biomarker discovery