Reducing data uncertainty
Many software applications must manage and make decisions using highly uncertain data. While certain tools can fill in the gaps to some degree, such tools are generally simplistic and limited. The EU-funded 'Heisendata - towards a next-generation uncertain-data management system' (HEISENDATA) project aimed to improve matters.

The team planned to design and build new probabilistic database systems (PDBSs) that support statistical models and probabilistic reasoning in addition to conventional database structures. The project intended to address the challenges of combining the two, including the redesign of key system components. HEISENDATA ran for four years to February 2014.

Project work covered three main branches: new probabilistic data synopses for query optimisation, new PDBS algorithms and architectures, and scalable algorithms and tools.

The work on data synopses involved defining and creating algorithms for building histograms. For various error metrics, the new algorithms constructed optimal or near-optimal histograms and wavelet synopses. Further work introduced probabilistic histograms, which allowed a more accurate representation of the data's uncertainty characteristics.

Additionally, the team addressed problems related to unstructured text containing units of structured information. The solutions extended a leading information extraction (IE) model by developing two query approaches, whose efficiency and effectiveness were compared using real-life data sets. The result was a set of rules for choosing appropriate inference algorithms under various conditions, yielding up to 10-fold speed improvements.

The project also devised a framework for scaling any generic entity resolution algorithm, and demonstrated the framework's effectiveness. Further work helped to integrate the IE pipeline with probabilistic query processing.
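To make the idea of an error-optimal histogram concrete, the sketch below builds a classic V-optimal histogram by dynamic programming: it splits a sequence into a fixed number of contiguous buckets so that the sum of squared errors around each bucket's mean is minimal. This is a textbook construction for deterministic values, shown only as an illustration and not the project's own algorithm. It assumes that each input value is the expected frequency of an uncertain item, and the function name v_optimal_histogram is hypothetical.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, float]  # (start index, end index exclusive, bucket mean)


def v_optimal_histogram(values: List[float], num_buckets: int) -> Tuple[float, List[Bucket]]:
    """Return (minimum sum of squared errors, buckets) for a contiguous partition.

    Assumption for illustration: `values` holds expected frequencies of
    uncertain items, so the histogram summarises expectations only.
    """
    n = len(values)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(values)")

    # Prefix sums let us compute the squared error of any bucket in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i: int, j: int) -> float:
        """Squared error of values[i:j] around its mean (requires j > i)."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[b][j]: best error when the first j values are covered by b buckets.
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0

    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # the last bucket is values[i:j]
                candidate = cost[b - 1][i] + sse(i, j)
                if candidate < cost[b][j]:
                    cost[b][j] = candidate
                    split[b][j] = i

    # Walk the recorded split points backwards to recover the buckets.
    buckets: List[Bucket] = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = split[b][j]
        buckets.append((i, j, (prefix[j] - prefix[i]) / (j - i)))
        j = i
    buckets.reverse()
    return cost[num_buckets][n], buckets
```

For example, v_optimal_histogram([1, 1, 1, 5, 5, 5], 2) places the bucket boundary between the third and fourth values. A probabilistic histogram of the kind the project introduced would store a distribution per bucket rather than a single mean, which this sketch does not attempt.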
HEISENDATA developed new statistical methods for processing highly uncertain data, and integrated the methods into conventional database structures. The work addressed a topic of interest to both the academic and commercial sectors.
Keywords
Data uncertainty, data systems, data management, probabilistic database systems