Managing, preserving and computing with big research data

Specific challenge: The development and deployment of integrated, secure, permanent, on-demand, service-driven, privacy-compliant and sustainable e-infrastructures incorporating advanced computing resources and software are essential to increase the capacity to manage, store and analyse extremely large, heterogeneous and complex datasets[1], including text mining of large corpora. These e-infrastructures need to provide services that cut across a wide range of scientific communities and address the diversity of computational requirements, legal constraints, system and service architectures, formats, types, vocabularies and legacy practices of the scientific communities that generate, analyse and use the data.

Scope: Proposals should address at least one of activities (1) to (5), or address activity (6), (7) or (8) individually. Proposers are encouraged to build on prior work on open prototype services and to use discoverable service catalogues, common APIs, service-level agreements (SLAs) and transparent billing.

(1) Establishing a federated pan-European data e-infrastructure to provide cost-effective and interoperable solutions for data management and long-term preservation. The needs for data access, storage, replication, annotation, search, compute, analysis and reuse of information across disciplines should be accommodated in different research and education contexts. All these functions should expose standard interfaces for interoperation with other data sources, so that they can aggregate them or be aggregated by them, while also taking into account ethical and regulatory requirements for sensitive data (e.g. patient data). Sustainability is of paramount importance; robust business models should therefore be proposed to encourage investment from all stakeholders. Foreseen challenges are technical, legal and organisational, including engaging e-infrastructure operators and other service providers (such as those receiving support under topics EINFRA-2-2014, EINFRA-3-2014, and EINFRA-7-2014).

(2) Services to ensure the quality and reliability of the e-infrastructure, including certification mechanisms for repositories and certification services to test and benchmark capabilities in terms of resilience and service continuity of e-infrastructures.

(3) Federating institutional and, if possible, private data management and curation tools and services used across, or at some point in, the full data lifecycle, including approaches for identifying open data sources and data collected with sensitive or restricted access features. Services and tools should be federated on the basis of an open architecture and should offer or coordinate support for the development of Data Management Plans, in particular for Horizon 2020 project participants.

(4) Large-scale virtualisation of data/compute centre resources to achieve on-demand compute capacities, improve flexibility for data analysis and avoid unnecessary and costly large data transfers.

(5) Development and adoption of a standards-based computing platform (with open software stack) that can be deployed on different hardware and e-infrastructures (such as clouds providing infrastructure-as-a-service (IaaS), HPC, grid infrastructures…) to abstract application development and execution from available (possibly remote) computing systems. This platform should be capable of federating multiple commercial and/or public cloud resources or services and of delivering Platform-as-a-Service (PaaS) adapted to the scientific community with a short learning curve. Adequate coordination and interoperability with existing e-infrastructures (including GÉANT, EGI, PRACE and others) is recommended.

(6) Support for the evolution of EGI (European Grid Infrastructure) towards a flexible compute/data infrastructure capable of federating and enabling the sharing of resources of any kind (public or private, grid or cloud, etc.) in order to offer computing and storage services to the whole European scientific community. The proposal will address operations for supplying services (IaaS, PaaS, SaaS) at European level, the engagement of new user communities and the tailoring of services to them, and dissemination activities.

(7) Proof of concept and prototypes of data infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets scaling to zettabytes and trillions of objects. Clean-slate approaches to data management targeting 2020+ 'data factory' requirements of research communities and large-scale facilities (e.g. ESFRI projects) are encouraged.

(8) Enabling the creation of a platform and infrastructure for mining text aggregated from different sources/publishers that responds to the needs of users (researchers). This includes the definition of technical requirements (e.g. on interoperability, metadata standards and aggregation of new services) as well as addressing legal and contractual issues to serve the needs of text mining communities. The project should also provide consulting and counselling services to solve problems related to the legal framework and permissions to text mine collections, and to advise researchers on the benefits and practice of text mining. The development of the proposed platform and services should be informed by the studies on policy and licensing issues associated with Text and Data Mining that will be funded from the Call for “Developing governance for the advancement of Responsible Research and Innovation” in the “Science with and for Society” Work Programme (topic GARRI.3.2014 - Scientific Information in the Digital Age: Text and Data Mining). The successful proposals in these two calls are therefore expected to engage in a mutual dialogue and establish synergies in their work.

A maximum of EUR 8 million of the total budget for this topic is foreseen for activity (6).

This topic is complementary to topic INFRADEV-4-2014/2015, as it addresses services that are potentially transversal and generic, whereas INFRADEV-4-2014/2015 addresses interoperability of services and common solutions for clusters of ESFRI and other research infrastructure initiatives in thematic areas.

Expected impact:

Increased availability of scientific data for scientific communities, regardless of whether they have already embraced e-science; this will be measured using cross-border data traffic over the research networks in Europe as a proxy.

Better optimisation of the use of IT equipment for research.

Avoiding lock-in to particular hardware or software platforms in the development of science.

Scientific communities embrace storage and computing infrastructures as state-of-the-art services become available and the learning curve for their use becomes less steep; this will be measured by the storage capacity available for pan-European use as well as by the number of users of EGI and other production e-infrastructures in this area.

Through the development of large pooled and interoperable text mining infrastructures, efficiencies of scale will reduce overall costs, and more open licensing schemes will spread the use of such licences and boost the exchange of text mining resources and practices.

Type of action: Research and innovation actions

[1] Research data include large datasets collected, developed or generated for/by research, integration of small distributed datasets, as well as data not originally collected for research, which may include environmental, social and humanities data.
