Periodic Reporting for period 1 - ADMIRE (Adaptive multi-tier intelligent data manager for Exascale)
Reporting period: 2021-04-01 to 2022-09-30
The main objective of the ADMIRE project is the creation of an active I/O stack that dynamically adjusts computation and storage requirements through intelligent global coordination, elasticity of computation and I/O, and the scheduling of storage resources along all levels of the storage hierarchy, while offering quality-of-service (QoS), energy efficiency, and resilience for accessing extremely large data sets in very heterogeneous computing and storage environments.
The specific scientific-technical objectives of ADMIRE are:
Objective 1: Enable the efficient use of new storage tiers by subjecting storage to HPC scheduling decisions and establishing a distributed control that, based on global monitoring, can dynamically adapt storage allocations to changing application demands.
Objective 2: Increase application throughput of HPC systems by leveraging the performance advantage of fast, node-local storage tiers through novel, European ad-hoc storage systems, and in-transit/in-situ processing facilities.
Objective 3: Balance computation and data transfers by providing elastic mechanisms to dynamically adjust the ratio between the allocations of compute and storage resources.
Objective 4: Reduce I/O interference via globally coordinated minimization of data transfers between storage tiers, while conveying and enforcing end-to-end QoS needs.
Objective 5: Provide tools to co-design applications and storage systems with the goal of minimizing data movement, targeting different HPC architectures.
Objective 6: Increase power-efficiency in data management operations by reducing data movement and adopting low-power storage and CPU technologies.
An integrated and operational prototype will be validated and demonstrated with several real-world data-intensive applications from various domains, including climate/weather, life sciences, physics, remote sensing, and deep learning. The consortium comprises leading European companies, research organizations and universities, bringing together several PRACE members and Centres of Excellence for HPC applications.
At the scientific level, during Period 1 we worked on implementing application-tailored ad-hoc storage systems (Gekko, DataClay, EXPAND and Hercules) that make use of new storage tiers, and on developing a common API for ad-hoc storage systems. We also designed and developed base mechanisms for the malleability of I/O and computing resources, and carried out research on I/O scheduling algorithms that maximize system throughput and improve the response times of individual applications.
We have developed the monitoring layer, which collects real-time information from the use cases in order to track the computing and I/O needs of the applications. Its design seeks a constructive trade-off among all constraints: it provides as much information as possible while remaining lean and reconfigurable, so that measurements can be adapted continuously. An open data set has been published on ZENODO.
All of the above components are orchestrated by an Intelligent Controller (IC). Its design includes control-plane and data-plane architectural blocks for orchestrating all system components, as well as the API between the IC and the components of the ADMIRE architectural framework and external connectors.
Finally, during Period 1 we collected the requirements of the application case studies, using the monitoring facilities provided, to drive the co-design activities of the ADMIRE technologies.
ADMIRE will provide the contributions described below:
• Ad-hoc storage systems developed in ADMIRE, providing three main distinctive contributions: malleability; end-to-end QoS guarantees; and control points inside the ad-hoc storage system that enable the implementation of novel global optimizations by the intelligent controller.
• Active scheduling of I/O resources alongside compute resources, extending the notion of malleability to I/O at the global level, provided by a malleable runtime connected to the system job scheduler.
• Support for end-to-end QoS guarantees for the whole storage I/O stack, supporting in-situ/in-transit computation in the I/O stack.
• An intelligent controller collecting cross-layer information from system, applications, and users to optimize the throughput of the system and the performance of the applications.
The key contributions of ADMIRE with a significant innovation potential are:
• A Software Defined Storage solution for the whole storage hierarchy of an HPC system.
• New services for decentralized distributed control and monitoring to facilitate global system optimization using AI techniques.
• Multi-criteria job scheduling techniques covering emerging optimization dimensions such as I/O, data locality, and malleability of both computation and I/O.
• Ad-hoc file systems integrated with resource managers for improving the balance between computation, I/O, storage performance and energy consumption.