Periodic Reporting for period 2 - CloudLightning (Self-Organising, Self-Managing Heterogeneous Cloud)
Reporting period: 2016-08-01 to 2018-01-31
The CloudLightning architecture is designed to be highly scalable and extensible (via a novel Plug and Play mechanism), embracing different types of heterogeneity. A unique feature of this approach is that it facilitates both the incorporation and the dynamic construction of HPC environments. In the former case, HPC machines can be added to the CloudLightning resource fabric by registering the resource manager of the HPC machine as a CloudLightning resource. In the latter case, HPC-like environments can be dynamically constructed to support a particular service, using resources co-located on the same low-latency network. Together, these two capabilities provide a mechanism for offering HPC-as-a-Service.
An important objective of CloudLightning was to remove the burdens of low-level service provisioning, optimisation and orchestration from the cloud consumer. A related objective was to locate decisions pertaining to resource usage with individual resource components, where optimal decisions could be made. To achieve these objectives, a system was created that is composed of a hierarchy of resource managers and employs self-organisation and self-management strategies. By addressing the inefficient use of resources, CloudLightning can facilitate savings for both the cloud provider and the cloud consumer through reduced power consumption and improved service delivery, with hyperscale systems particularly in mind.
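The idea of a hierarchy of resource managers that each make local decisions can be illustrated with a minimal sketch. This is not the SOSM algorithm, which the report does not specify; it is a simple greedy descent chosen for illustration, and every name in it (`Node`, `capacity`, `route`, the example tree) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A manager (has children) or a resource (a leaf). Hypothetical model."""
    name: str
    kind: Optional[str] = None   # meaningful on leaves only, e.g. "GPU"
    free_units: int = 0          # meaningful on leaves only
    children: List["Node"] = field(default_factory=list)

    def capacity(self, kind: str) -> int:
        """Total free units of `kind` reachable at or below this node."""
        if not self.children:
            return self.free_units if self.kind == kind else 0
        return sum(child.capacity(kind) for child in self.children)


def route(node: Node, kind: str, units: int) -> Optional[Node]:
    """Route a request down the hierarchy using only local information.

    Each manager tries its children in order of decreasing free capacity
    and falls back to the next child if a subtree cannot serve the request.
    """
    if not node.children:
        fits = node.kind == kind and node.free_units >= units
        return node if fits else None
    ranked = sorted(node.children, key=lambda c: c.capacity(kind), reverse=True)
    for child in ranked:
        found = route(child, kind, units)
        if found is not None:
            return found
    return None


# Example hierarchy (assumed for illustration): one root, two managers.
root = Node("root", children=[
    Node("manager-a", children=[
        Node("cpu-1", kind="CPU", free_units=8),
        Node("gpu-1", kind="GPU", free_units=1),
    ]),
    Node("manager-b", children=[
        Node("gpu-2", kind="GPU", free_units=4),
        Node("dfe-1", kind="DFE", free_units=2),
    ]),
])
```

In this sketch the consumer only states what it needs (a resource kind and a size); which manager and which resource serve the request is decided inside the hierarchy, which is the separation of concerns the paragraph above describes.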
The main components developed to deliver the CloudLightning system include: a Gateway Service, comprising a User Interface for Blueprint creation and Service Lifecycle Management; the Self-Organising and Self-Managing system (SOSM), comprising a novel, sophisticated routing network designed to allow service requests to navigate autonomously towards the most appropriate resource, or set of resources, for provisioning that request; a Plug and Play Mechanism to allow for the dynamic registration (and subsequent deregistration, if required) of resources and associated telemetry endpoints; and a Universal Telemetry Interface, which allows the different telemetry systems associated with the various resources in the CloudLightning fabric to be queried in a uniform manner.
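The interplay between dynamic registration and uniform telemetry queries can be sketched as follows. This is an illustrative model under stated assumptions, not the CloudLightning API: the class and function names, the metric names `power_w` and `utilisation`, and the two example telemetry formats are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A telemetry adapter returns metrics in one common shape (assumed here):
# {"power_w": watts, "utilisation": fraction between 0 and 1}.
TelemetryReader = Callable[[], Dict[str, float]]


@dataclass
class Resource:
    resource_id: str
    kind: str                        # e.g. "CPU", "GPU", "MIC", "DFE"
    read_telemetry: TelemetryReader  # adapter over the native telemetry system


class ResourceFabric:
    """Hypothetical registry supporting plug-and-play style (de)registration."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def register(self, resource: Resource) -> None:
        if resource.resource_id in self._resources:
            raise ValueError(f"already registered: {resource.resource_id}")
        self._resources[resource.resource_id] = resource

    def deregister(self, resource_id: str) -> None:
        # Removing an unknown id is treated as a no-op.
        self._resources.pop(resource_id, None)

    def query_all(self) -> Dict[str, Dict[str, float]]:
        """Query every registered resource through the same interface."""
        return {rid: r.read_telemetry() for rid, r in self._resources.items()}


def gpu_adapter() -> Dict[str, float]:
    # Assumed native format: milliwatts and percent.
    native = {"power_mw": 150000.0, "util_percent": 50.0}
    return {
        "power_w": native["power_mw"] / 1000.0,
        "utilisation": native["util_percent"] / 100.0,
    }


def cpu_adapter() -> Dict[str, float]:
    # Assumed native format: already watts and a fraction.
    return {"power_w": 90.0, "utilisation": 0.25}


fabric = ResourceFabric()
fabric.register(Resource("gpu-1", "GPU", gpu_adapter))
fabric.register(Resource("cpu-1", "CPU", cpu_adapter))
snapshot = fabric.query_all()
```

The design choice shown is that unit conversion lives in a per-resource adapter supplied at registration time, so callers of `query_all` never need to know which telemetry system sits behind a given resource.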
CloudLightning was realised and exercised end to end on a testbed of heterogeneous resources comprising CPUs, Graphics Processing Units (GPUs), Many Integrated Core (MIC) processors and Data Flow Engines (DFEs). The system was evaluated using three primary use cases: Oil and Gas, Genomics, and Ray Tracing. These HPC-like use cases were containerised and converted to cloud applications, and traces from their execution were gathered and used as input to the large-scale simulation of the SOSM system. The large-scale simulation activity examined scalability, power consumption, computational efficiency and resource utilisation, and used this information to compare the efficiency of the SOSM system with traditional cloud resource allocation schemes. An analysis of the results shows that the SOSM system compares very favourably with traditional methods and is thus a viable approach for future cloud resource management, particularly given the added complexity of managing the emerging heterogeneous cloud.
The results of the CloudLightning project have been disseminated through: an open access book, 'Heterogeneity, High Performance Computing, Self-Organization and the Cloud'; 47 peer-reviewed scientific publications; presentations of the work at 33 conferences, in addition to one organised by the CloudLightning consortium; participation in 15 workshops, in addition to 4 workshops organised by the consortium; the development and delivery of a MOOC, 'High Performance Computing in the Cloud', in collaboration with FutureLearn; the organisation of 4 industry briefings to engage with industry stakeholders; and the publication of 90 articles in non-scientific and non-peer-reviewed publications.
The novel contributions of the project include the design and implementation of a self-organising, self-managing, heterogeneous, service-oriented cloud architecture and its constituent components, based on a blueprint-as-a-service deployment model and supporting separation of concerns. The CloudLightning use cases were made cloud-friendly through containerisation and made available through an HPC-as-a-Service deployment model. Finally, a bespoke simulation framework was constructed for simulating dynamic resource allocation schemes in complex hyperscale heterogeneous cloud infrastructures.
The impacts of the CloudLightning project include: a lower operational overhead for deploying services; reduced complexity in deploying HPC workloads, both on traditional cloud resources and on heterogeneous resources; and the potential to increase energy efficiency, resulting in an attractive cost structure for service providers. Service providers can also make use of the directed evolution present in the CloudLightning system to evolve a cloud configuration that is appropriate to their business needs.