Periodic Reporting for period 3 - LIFEPLAN (A Planetary Inventory of Life – a New Synthesis Built on Big Data Combined with Novel Statistical Methods)
Reporting period: 2022-07-01 to 2023-12-31
There are two main reasons why we understand biodiversity and its drivers so poorly. First, we lack the relevant data, since for the vast majority of species we have either no data or very sporadic data. Second, the processes underlying biodiversity dynamics are complex, and we lack the tools for converting the data that we do have into a true understanding of the processes behind them.
Through LIFEPLAN, we will overcome both hurdles. We bring together the key expertise needed to generate and interpret Big Ecological Data for a global synthesis of biotic patterning across our planet, uniting community ecology, methods for automated species recognition, and Bayesian statistics for very large datasets.
As a basis for the whole LIFEPLAN venture, we will generate a well-standardized global dataset for a substantial proportion of all species. Such standardization is achieved through semi-automated methods, producing comparable data independent of the exact expertise of the person or team conducting the sampling. Based on a recent revolution in sampling methodology, such a sampling design is now finally achievable.
To identify the species in the samples, images and sounds that are being collected, we have developed machine learning models. The massive amounts of data we are collecting make it impossible for individual human experts to go through it all, which is why we need machine learning methods. We have also set up websites for collecting the training data that will be required for the identification task. The training data will be sound, image, and DNA barcode libraries of known species.
A major challenge is that we are discovering many new species. We have addressed this challenge by developing a classification approach that uses probabilities to represent uncertainty in both classification and taxa discovery. We have also developed a new approach for predicting the number of new taxa that would be discovered if a given number of additional samples were processed, which provides valuable information for the design of sampling and the prediction of biodiversity. This approach also adds to the statistical literature on species sampling models, which have broad applications beyond ecology.
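To make the question of "how many new taxa would further sampling reveal" concrete, the following minimal sketch uses the classical Good-Toulmin estimator. This is a long-established textbook estimator, not the new approach developed in LIFEPLAN, and the function name and the toy abundance counts are illustrative assumptions only. Given the number of taxa seen exactly k times in a sample of n individuals, it estimates how many previously unseen taxa would appear in t*n further individuals, for 0 <= t <= 1.

```python
from collections import Counter


def good_toulmin_new_taxa(abundances, t):
    """Good-Toulmin estimate of unseen taxa found in t*n further individuals.

    abundances: iterable of per-taxon counts in the current sample
                (hypothetical toy input, not LIFEPLAN data).
    t: size of the additional sample as a fraction of the current one;
       the estimator is only reliable for 0 <= t <= 1.
    """
    if not 0 <= t <= 1:
        raise ValueError("t must lie in [0, 1] for the plain estimator")
    # phi[k] = number of taxa observed exactly k times
    phi = Counter(a for a in abundances if a > 0)
    # U(t) = sum_k (-1)^(k+1) * t^k * phi_k
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in phi.items())


# Toy sample: three singletons, two doubletons, one taxon seen five times.
toy_counts = [1, 1, 1, 2, 2, 5]
estimate_same_size = good_toulmin_new_taxa(toy_counts, 1.0)
estimate_half_size = good_toulmin_new_taxa(toy_counts, 0.5)
```

The estimate is driven mostly by the rarest taxa: many singletons suggest that much diversity remains unsampled. The plain estimator becomes unstable when extrapolating beyond a doubling of the sample, which is one reason model-based approaches such as the one described above are attractive.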
Beyond collecting and analyzing our own data, a major part of LIFEPLAN is developing new methods for big data statistics. We have developed multiple new modelling frameworks that can flexibly adapt to the types of structure common in spatial ecology data, as well as in many other applications. We have produced multiple algorithms for more efficient computation in the modelling of large spatial data; these algorithms can handle a broad range of data types and models. We have also developed two new classes of algorithms that enable much faster Bayesian statistical analyses of very long time series, while maintaining theoretical guarantees on the accuracy of the approximations employed.
A number of the statistical developments go significantly beyond the state of the art in the field. Particularly notable are new structured frameworks for factor analysis, new paradigms for Gaussian process modelling in which the input domain has unknown restrictions, and a new framework for species sampling modelling that improves upon the broad Bayesian nonparametric statistical literature on this topic. Our articles on scalable algorithms for spatial and temporal data likewise advance the state of the art.
As both the data collection and the statistical work continue, we will next be able to apply the new methods we develop to our new data. We expect this to lead to a transformative new understanding of life on Earth as we put together new models of how species are distributed across the globe.