Learning from massive, incompletely annotated, and structured data

Whether it is from DNA databases, online social networks or solar farms, Big Data is being used to train artificial intelligence systems to solve real-world problems. But vast datasets, or fast data streams, do not always produce information that is in a form machine learning systems can cope with. Smart software developed by the MAESTRA project aims to fix this.

Artificial intelligence is hot news right now, with its stellar game playing, speech recognition and health diagnosis feats regularly hitting the headlines. But building learning systems is not as easy as some of the media coverage might suggest: the machine learning technology at the heart of AI faces computationally difficult tasks in a great many applications. A major reason for this is that the data such systems operate on come from many disparate sources, such as video, DNA, medical images, sensors or social networks, so they cannot always be moulded into the well-structured formats that machine learning (ML) systems need if they are to be trained well enough to make useful and accurate predictions when fed new raw data.

For instance, to train predictive models, software engineers often need to handle data that are unlabelled (or only partially labelled) with the values to be predicted; datasets that are massive, unwieldy, or streaming at rates too high to cope with; or data being generated concurrently by sensors in an extensive, spatially distributed network. Adding to this complexity, the data can sometimes have a combination of some, or all, of these properties, making efficient data mining extremely difficult.

Time to make data make sense

‘The simultaneous presence of several of these data complexities is a hard, currently insurmountable, challenge. And it's one that severely limits the applicability of machine learning and data mining approaches,’ says Sašo Džeroski at the Jožef Stefan Institute in Ljubljana, Slovenia. So Džeroski, project coordinator of the EU-funded MAESTRA project, and colleagues in Croatia, Italy, Macedonia and Portugal have been working to clean up this messy data mining situation.
After analysing the problems of mining complex data in great detail, they designed tree-based and rule-based machine learning methods and developed intelligent software that is able to take in massive sets of data, or streams of data, including incompletely labelled data and network data, and make sense of them. Most of the methods they developed can now make complex predictions, such as the values of several data variables simultaneously. And it's not just theory: to prove that their software methods work, the MAESTRA team have also successfully tested them on a number of ‘showcase’ problems in a variety of fields.

Success is in the genes

The MAESTRA data mining methods were applied to genomic datasets containing DNA sequences from both individual organisms and diverse communities of them, such as human gut flora. The complex genomic data were so thoroughly analysed by the ML systems that they were able to successfully predict gene functions in thousands of bacterial species from data derived only from their DNA sequences. They also predicted the phenotypes of micro-organisms from their genotypes, and identified compounds that may help treat tuberculosis and salmonella.

In the solar energy arena, the MAESTRA methods were used to help ML systems predict both the production and the consumption of energy from different kinds of sensor data in different contexts, such as the production of solar energy in photovoltaic power plants and the consumption of solar energy to heat the Mars Express orbiter. In addition, Džeroski's team predicted both equipment failures in trains and taxi demand from transport datasets. They also improved the accuracy of sentiment analysis and image annotation in social networks.
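
To give a feel for what predicting the values of several variables simultaneously with a tree-based method can look like, here is a minimal sketch in Python. It is not the MAESTRA software: the function names (`fit_stump`, `predict`), the single-split "stump" design and the toy numbers are all assumptions made purely for illustration. The sketch picks the one threshold on a single input feature that minimises the summed variance of all target variables on both sides of the split, then predicts every target at once.

```python
from statistics import fmean


def total_variance(rows):
    """Sum, over all target columns, of squared deviations from the column mean."""
    if not rows:
        return 0.0
    total = 0.0
    for column in zip(*rows):
        mean = fmean(column)
        total += sum((value - mean) ** 2 for value in column)
    return total


def column_means(rows):
    """Mean of each target column; used as the prediction in a leaf."""
    return [fmean(column) for column in zip(*rows)]


def fit_stump(xs, ys):
    """Fit a one-split multi-target regression tree on one numeric feature.

    xs: list of feature values; ys: list of equally long target vectors.
    Returns (threshold, left_prediction, right_prediction).
    """
    pairs = sorted(zip(xs, ys), key=lambda pair: pair[0])
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = total_variance(left) + total_variance(right)
        if best is None or score < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (score, threshold, column_means(left), column_means(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2], best[3]


def predict(stump, x):
    """Return the predicted vector of all targets for feature value x."""
    threshold, left_prediction, right_prediction = stump
    return left_prediction if x <= threshold else right_prediction


# Hypothetical toy data (invented for this sketch): one sensor reading per
# example, and two targets to predict together, e.g. production and consumption.
xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
ys = [[10.0, 50.0], [12.0, 52.0], [11.0, 48.0],
      [40.0, 20.0], [42.0, 22.0], [41.0, 18.0]]

stump = fit_stump(xs, ys)
print(predict(stump, 2.5))  # both targets predicted in one call
print(predict(stump, 9.5))
```

A full tree would apply the same split search recursively and over many features; the point of the sketch is only that one model, and one split criterion, serves all targets together rather than training a separate model per variable.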

Applications set to proliferate

Many of the general-purpose data mining methods developed in MAESTRA have already been open-sourced, but Džeroski nevertheless expects several of them to be harnessed in commercial AI projects, with organisations customising them for particular applications and adding their own user interfaces. ‘This will allow MAESTRA partners to develop secondary products in the form of tools and services that are easier to use for potential customers,’ he says.

Pharmaceutical companies, Džeroski suggests, could employ customised MAESTRA tools to let AI identify new applications for older drugs, a practice known as drug repurposing. In further ongoing research, MAESTRA ideas are also being harnessed in projects using machine learning in the study of gene function and health, tumour mutation, personalised medicine, brain informatics, sustainable food production and biodiversity.