Periodic Reporting for period 2 - HOBBIT (Holistic Benchmarking of Big Linked Data)
Reporting period: 2017-06-01 to 2018-11-30
The major problem addressed by HOBBIT is the lack of uniform solutions to benchmarking Big Linked Data across its lifecycle (see Figure 1). HOBBIT addresses the need for better solutions for benchmarking Big Linked Data through the following objectives:
- An open task-driven benchmarking platform to evaluate the performance of Big Linked Data processing systems. The platform is designed as a scalable distributed solution for benchmarking large-scale systems. It needs to be highly portable and run both on single machines and on computer clusters to ensure that it supports benchmarking at any scale. Its main features must also include the generation of open, human- and machine-readable reports on the evaluation campaign results. The published data should include configuration data, experimental results, and fine-grained results for the different KPIs. In addition, the platform is to provide diagnostic mechanisms to support both developers and users in their quest for better solutions and tools.
- Benchmarks of industrial relevance in Europe. Data are one of the key assets of an increasing number of European companies. Making industrial data public is hence a difficult and partly counterproductive endeavor. To ensure that our platform still returns results of industrial relevance, we need to circumvent the hurdle of making real industrial data public by deploying mimicking algorithms. These will allow configuring synthetic data generators so as to compute data streams that display the same characteristics as industry data while being open and available for evaluation without restrictions.
- Reference implementations for industry-relevant key performance indicators (KPIs). Open-source implementations of widely accepted measures are to be provided to ensure that the results generated within the project can be understood and checked by any organisation.
- Data and measure collection: We gathered input on relevant datasets and quality measures from members of the European industry landscape through surveys. To this end, we (1) joined the EU project DataBench in the creation of the HOBBIT association and (2) co-organized and participated in meetups around Europe (including, e.g., EBDVF 2018). During these meetups, we presented the idea behind HOBBIT and engaged with the participants to gather their requirements for a Big Linked Data benchmarking platform. The main results of HOBBIT’s dissemination and engagement were (1) the creation of a HOBBIT association as Special Group 7 of Task Force 6 of the Big Data Value Association, (2) surveys to gather information from European companies and academics pertaining to their use and evaluation of Big Linked Data and corresponding platforms and (3) datasets for the HOBBIT data repository. Overall, HOBBIT compiled a contact list with more than 300 members. The 25 datasets and dataset generators available through the HOBBIT CKAN repository at https://hobbit.ilabt.imec.be/ encompass industrially relevant datasets partly provided by HOBBIT.
- Benchmark creation: The measures and the datasets collected formed the basis for the 8 HOBBIT benchmarks, which were made available in 2 versions over the course of the project. Each benchmark comprises three components: a deterministic data source, a number of tasks and a set of KPIs. In addition, 5 scalable mimicking algorithms (which generate data of industrial relevance) were created in the project to ensure that the benchmarks reflect realistic use cases and to circumvent the problem of not being given access to real datasets from industry. A number of evaluations showed that the mimicking algorithms provided by the project generate synthetic data close to real data w.r.t. features such as temporal and spatial distribution.
- The HOBBIT evaluation platform (see Figure 3) is the third core result of HOBBIT. It is built to support the benchmarking of Big Linked Data solutions at both small and large scale. The platform is developed as an open-source solution (see https://github.com/hobbit-project/platform) and supported 14 challenges over the project runtime. Extensions to remote computation facilities such as AWS and an SDK complete the package. A mix of contributions from HOBBIT and from external users has now led to the platform containing 52 benchmarks and more than 300 Docker images. The more than 200 users and 12,600 experiments run over the runtime of the project suggest that the HOBBIT platform is turning into a crystallization point for benchmarking Big Linked Data.
- Evaluation campaigns: HOBBIT ran evaluation campaigns for all benchmarks within 14 challenges (including the Mighty Storage Challenge -MOCHA-, the Question Answering on Linked Data Challenge -QALD- and the Open Knowledge Extraction Challenge -OKE- at ESWC2017 as well as the DEBS grand challenges 2017 and 2018). The results show that the HOBBIT benchmarking platform scales to the requirements of large-scale benchmarking. Limitations of existing solutions at scale (e.g. completeness for storage, recall for question answering, F-measure for machine learning) could be unveiled through the scalable benchmarking provided by HOBBIT. Moreover, the lack of scalability of a large number of Linked Data solutions was made evident.
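To make KPIs such as recall and F-measure concrete, the following minimal Python sketch computes precision, recall and F1 for a single task by comparing the answers a system returns against the expected answers. It is only an illustration under stated assumptions, not the HOBBIT reference implementation: the function name `precision_recall_f1`, the `Scores` type and the set-based comparison of answers are all hypothetical choices made for this example.

```python
from typing import Hashable, Iterable, NamedTuple


class Scores(NamedTuple):
    """Hypothetical container for the three KPI values of one task."""
    precision: float
    recall: float
    f1: float


def precision_recall_f1(expected: Iterable[Hashable],
                        returned: Iterable[Hashable]) -> Scores:
    """Compare returned answers with expected answers as sets.

    Assumption: answers are hashable and duplicates do not matter.
    Empty inputs yield 0.0 instead of raising a division error.
    """
    expected_set, returned_set = set(expected), set(returned)
    true_positives = len(expected_set & returned_set)

    # Precision: share of returned answers that are correct.
    precision = true_positives / len(returned_set) if returned_set else 0.0
    # Recall: share of expected answers that were actually returned.
    recall = true_positives / len(expected_set) if expected_set else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


if __name__ == "__main__":
    # Illustrative answers only; these are not HOBBIT data.
    gold = ["a", "b", "c", "d"]
    system_output = ["a", "b", "x"]
    print(precision_recall_f1(gold, system_output))
```

Treating answers as sets keeps the sketch short; a benchmark that scores ranked or streamed results would need a different comparison.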
1) Benchmarks for RDF data backends, including benchmarks for data ingestion, data storage and querying, all of which measure how quickly and correctly systems deal with streams of data at industrial scales;
2) Benchmarks for knowledge extraction
3) Entity matching and linking benchmarks
The innovation behind the solutions generated by this project is underpinned by the 28 HOBBIT publications which have already been accepted at high-ranking conferences.
HOBBIT has collaborated and is collaborating with a large number of related research projects (e.g. BigDataEurope, BigDataOcean, SLIPO, SAKE and GEISER). The project aims to support the benchmarking efforts carried out in these projects through the HOBBIT platform. The expected societal impact of the project lies mainly in supporting the development of more efficient Big Data processing platforms to address societal challenges such as energy and transport.