Adaptive multi-tier intelligent data manager for Exascale

Periodic Reporting for period 1 - ADMIRE (Adaptive multi-tier intelligent data manager for Exascale)

Reporting period: 2021-04-01 to 2022-09-30

The growing need to process and access extremely large, heterogeneous data sets, the rise of data-intensive applications, and the steep growth in data volume call into question the traditional compute-centric view of HPC. The flat storage hierarchies found in classic HPC architectures, uncoordinated file accesses, and limited bandwidth make the centralized back-end parallel file system a serious bottleneck in traditional systems. At the same time, the underlying storage technology is undergoing a disruptive change: emerging multi-tier storage hierarchies based on fast non-volatile memory can significantly lower the pressure on the back-end file system. Maximizing performance, however, still requires careful control to avoid congestion and to balance compute and storage performance. Unfortunately, appropriate interfaces and policies for managing such an enhanced I/O stack are still lacking.

The main objective of the ADMIRE project is the creation of an active I/O stack that dynamically adjusts computation and storage requirements through intelligent global coordination, elasticity of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy, while offering quality-of-service (QoS), energy efficiency, and resilience for accessing extremely large data sets in very heterogeneous computing and storage environments.

The specific scientific-technical objectives of ADMIRE are:

Objective 1: Enable the efficient use of new storage tiers by subjecting storage to HPC scheduling decisions and establishing a distributed control that, based on global monitoring, can dynamically adapt storage allocations to changing application demands.
Objective 2: Increase application throughput of HPC systems by leveraging the performance advantage of fast, node-local storage tiers through novel, European ad-hoc storage systems, and in-transit/in-situ processing facilities.
Objective 3: Balance computation and data transfers by providing elastic mechanisms to dynamically adjust the ratio between the allocations of compute and storage resources.
Objective 4: Reduce I/O interference via globally coordinated minimization of data transfers between storage tiers, while conveying and enforcing end-to-end QoS needs.
Objective 5: Provide tools to co-design applications and storage systems with the goal of minimizing data movement, targeting different HPC architectures.
Objective 6: Increase power-efficiency in data management operations by reducing data movement and adopting low-power storage and CPU technologies.

An integrated and operational prototype will be validated and demonstrated with several real-world data-intensive applications from various domains, including climate/weather, life sciences, physics, remote sensing, and deep learning. The consortium comprises leading European companies, research organizations and universities, bringing together several PRACE members and Centres of Excellence for HPC applications.

In Period 1, we focused on activities related to the successful launch of the ADMIRE project, including setting up the required mailing lists and software repositories, establishing the project steering committee, convening regular steering committee meetings, and establishing working groups to ensure good initial technical progress on each work package. We have defined the project handbook, the data management plan, and the dissemination and exploitation plan. We have set up the project web page (admire-eurohpc.eu) and made a strong effort in the dissemination and communication of the ADMIRE project. A series of webinars related to the project can be found on the web page and on YouTube (ADMIRE EUROHPC).

At the scientific level, in Period 1 we worked on implementing the application-tailored ad-hoc storage systems used in the project (Gekko, DataClay, EXPAND and Hercules), which make use of the new storage tiers, and on developing a common API for ad-hoc storage systems. We have also designed and developed base mechanisms for the malleability of I/O and computing resources, and researched I/O scheduling algorithms to maximize system throughput and improve the response times of individual applications.

We have developed the monitoring layer, which collects real-time information from the use cases to track the computing and I/O needs of the applications. Its design seeks a constructive trade-off among all constraints: it provides as much information as possible while remaining lean and reconfigurable, so that measurements can be adapted continuously. An open data set has been published on Zenodo.
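
As an illustration of what "lean and reconfigurable" monitoring could mean in practice, the following minimal Python sketch keeps a bounded window of I/O samples and changes its sampling period according to the observed bandwidth. It is not the ADMIRE monitoring layer: the class names, fields, thresholds, and periods are all hypothetical choices made for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class IOSample:
    """One observation for an application (hypothetical fields)."""
    timestamp: float   # seconds since the start of the run
    bytes_moved: int   # bytes read + written since the previous sample


@dataclass
class LeanMonitor:
    """Bounded sample window plus a sampling period that can be changed."""
    period_s: float = 1.0   # current sampling period (assumed default)
    window: int = 4         # maximum number of samples retained
    samples: deque = field(default_factory=deque)

    def record(self, sample: IOSample) -> None:
        # Stay lean: drop the oldest samples once the window is full.
        self.samples.append(sample)
        while len(self.samples) > self.window:
            self.samples.popleft()

    def bandwidth(self) -> float:
        """Average bytes/s over the retained window (0.0 if too few samples)."""
        if len(self.samples) < 2:
            return 0.0
        elapsed = self.samples[-1].timestamp - self.samples[0].timestamp
        # The first sample only marks the start of the interval.
        moved = sum(s.bytes_moved for s in list(self.samples)[1:])
        return moved / elapsed if elapsed > 0 else 0.0

    def reconfigure(self, busy_threshold: float) -> None:
        # Sample faster under heavy I/O, slower when the application is quiet.
        self.period_s = 0.1 if self.bandwidth() >= busy_threshold else 1.0
```

A caller would feed `record()` from whatever probe is available and call `reconfigure()` periodically. The two fixed periods (0.1 s and 1.0 s) are placeholders; a real deployment would presumably tune them against the monitoring overhead it can tolerate.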

All of these components are orchestrated by an Intelligent Controller (IC). The design includes control-plane and data-plane architectural blocks for orchestrating all system components, as well as the API that connects the IC with the components of the ADMIRE architectural framework and with external connectors.
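
To make the idea of a control-plane decision concrete, here is a small Python sketch of one coordination step: monitored bandwidth demand per application is turned into a storage-node allocation under a global node budget. This is only an assumed policy for illustration, not the Intelligent Controller's actual algorithm or API; the function names, the per-node bandwidth figure, and the "shrink the largest allocation first" rule are all hypothetical.

```python
import math
from typing import Dict


def nodes_for_demand(demand_bps: float, per_node_bps: float) -> int:
    """Smallest number of storage nodes (at least 1) that covers the demand."""
    if per_node_bps <= 0:
        raise ValueError("per_node_bps must be positive")
    return max(1, math.ceil(demand_bps / per_node_bps))


def control_step(demands_bps: Dict[str, float],
                 per_node_bps: float,
                 total_nodes: int) -> Dict[str, int]:
    """One control-plane decision: map monitored demand to node allocations."""
    if total_nodes < len(demands_bps):
        raise ValueError("need at least one storage node per application")
    plan = {app: nodes_for_demand(bw, per_node_bps)
            for app, bw in demands_bps.items()}
    # Global coordination (assumed policy): if the plan exceeds the budget,
    # shrink the largest allocation one node at a time until it fits.
    while sum(plan.values()) > total_nodes:
        largest = max(plan, key=plan.get)
        plan[largest] -= 1
    return plan
```

In this sketch the function only computes a plan; actually resizing an ad-hoc storage system would be the job of the data plane and is not modelled here.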

Finally, during Period 1 we collected the requirements of the application case studies, using the monitoring facilities provided, to drive the co-design activities of the ADMIRE technologies.

The ADMIRE project addresses a fundamental gap in current large-scale HPC computing infrastructures: the lack of mechanisms for the global coordination and scheduling of storage I/O accesses to the emerging storage hierarchy. ADMIRE aims to close this gap and build a framework that offers an open programming environment in which data movement, scheduling and coordination can be addressed holistically for all levels of the storage hierarchy, regardless of where the actual storage devices are located with respect to the compute nodes. An open programmable framework will facilitate the integration of novel storage technologies and the development of novel tiering approaches. Our approach is based on scalable monitoring and control, separation of control and data, and an open control API that can be used for implementing and deploying global policies.

ADMIRE will provide the contributions described below:
• Ad-hoc storage systems developed in ADMIRE that provide three main distinctive contributions: malleability; end-to-end QoS guarantees; and control points inside the ad-hoc storage system that enable the intelligent controller to implement novel global optimizations.
• Active scheduling of I/O resources alongside compute resources, extending the notion of malleability to I/O at the global level, provided by a malleable runtime connected to the system job scheduler.
• Support for end-to-end QoS guarantees for the whole storage I/O stack, supporting in-situ/in-transit computation in the I/O stack.
• An intelligent controller that collects cross-layer information from the system, applications, and users to optimize the throughput of the system and the performance of the applications.

The key contributions of ADMIRE with a significant innovation potential are:
• A Software Defined Storage solution for the whole storage hierarchy of an HPC system.
• New services for decentralized distributed control and monitoring to facilitate global system optimization using AI techniques.
• Multi-criteria job scheduling techniques covering emerging optimization dimensions such as I/O, data locality, and malleability of both computation and I/O.
• Ad-hoc file systems integrated with resource managers for improving the balance between computation, I/O, storage performance and energy consumption.

ADMIRE framework architecture and use cases
