MANGO: exploring Manycore Architectures for Next-GeneratiOn HPC systems

Periodic Reporting for period 2 - MANGO (MANGO: exploring Manycore Architectures for Next-GeneratiOn HPC systems)

Reporting period: 2017-04-01 to 2019-03-31

HPC plays a key role in today’s IT-based society. Platform customization is a fundamental enabler for improved energy efficiency in HPC. Customization often relies on heterogeneous compute technologies, where different types of devices (CPUs, GPUs, custom accelerators, FPGAs) are mixed together and each is used only for the computations where it is most efficient. As these devices have different power/performance trade-offs, a significant effort is needed to match resources to applications. MANGO explores new architectures for future HPC systems, taking heterogeneity as the central point. MANGO tries to answer the question: how can heterogeneous components be combined, programmed, and managed to achieve the best computational efficiency?

In addition, MANGO targets two emerging HPC requirements: guaranteeing predictability for new applications (as HPC merges with Big Data) and providing capacity computing (running as many applications as possible). MANGO explores the so-called PPP design space: Performance, Power consumption and Predictability.

MANGO built a prototype that enables rapid architecture exploration. While it guarantees flexibility, building the prototype is in itself a major challenge.

Short-term goals:

- A flexible prototype for architecture exploration
- New heterogeneous architectures
- Real-time support in the PPP design space
- Unified and simple access via a smart interconnect
- Programming models and compilers for new architectures
- Resource manager
- Monitoring tools
- Cooling techniques
- Impact on a set of real applications (video transcoding, medical imaging, security and surveillance)

During the first half of the project, the Consortium focused on the design and integration of components, defining the specifications for Applications, Hardware, and Software (D1.1, D1.2, D1.3). The Phase 1 platform was delivered. Partly overlapping with the definition of the system, the Consortium made progress towards implementing the basic software and hardware components that make up the complete system: manycore accelerators, interconnection network, compiler support and resource manager support. An initial integration process was conducted in order to get a first coupling of all system components in the single-node context.

During the second half of the project, the Consortium added more functionality and support: interrupts, efficient memory transfer, and system compatibility with larger multi-FPGA settings. For cooling, we designed and implemented a prototype of a micro-scale two-phase thermosyphon cooling device. The global resource manager and the local resource manager (LRM), together with system monitoring tools, were developed and adapted to each other. The prototype was finally built and configured thanks to an incremental integration effort that leveraged all the preparatory work carried out at the beginning of the second period. During the last third of the project, the integration and validation process was carried out and took most of the effort. Indeed, the complexity of the integration and validation processes proved a real challenge for the Consortium. In the end, a final integrated and operational solution, including all specific components, was achieved. The system is now able to run multiple concurrent applications, which can trigger kernels on the heterogeneous components via the resource manager. The Global Resource Manager (GRM) acts as a single entry point for applications, allocating workload to the MANGO nodes depending on temperature, power and availability. The GRM delegates control to the LRM in a coordinated way.
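As a rough illustration of allocating workload by temperature, power and availability, the following minimal sketch picks a node for an incoming request. It is not the MANGO resource manager: the `NodeStatus` fields, the `pick_node` function, the thresholds and the "coolest eligible node" policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical snapshot of one node as seen by a global manager."""
    name: str
    temperature_c: float   # current temperature in degrees Celsius
    power_w: float         # current power draw in watts
    free_units: int        # accelerator units currently available


def pick_node(nodes: Sequence[NodeStatus], units_needed: int,
              max_temperature_c: float = 85.0,
              max_power_w: float = 300.0) -> Optional[NodeStatus]:
    """Return the coolest eligible node, or None if no node qualifies.

    A node is eligible when it has enough free units and stays under the
    assumed temperature and power limits. Ties on temperature are broken
    by the lower power draw.
    """
    eligible = [
        n for n in nodes
        if n.free_units >= units_needed
        and n.temperature_c < max_temperature_c
        and n.power_w < max_power_w
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda n: (n.temperature_c, n.power_w))
```

In a design of this kind, the global layer would only choose the node; the mapping of kernels onto specific units inside that node would be left to the local manager.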

PEAK manycore architecture with coherent interconnect (UPV). PEAK is being targeted as a baseline for new European project initiatives in the domain of security and space applications. Cloud-computing projects in which manycore architectures can run on edge devices (FPGAs) are also being targeted.

Multi-FPGA communication NoC (UPV) with support for concurrent communication between accelerators and memories. A TLB-like approach has also been deployed to help virtualize the memory segments located in different DDR memory modules across the HN cluster.
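The idea of a TLB-like lookup over memory segments spread across several DDR modules can be sketched as follows. This is only an illustrative software model, not the hardware mechanism used in the project: the `SegmentEntry` and `SegmentTable` names, the field layout and the linear search are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SegmentEntry:
    """Hypothetical translation entry: a virtual range backed by one DDR module."""
    virt_base: int    # first virtual address of the segment
    size: int         # segment length in bytes
    module_id: int    # DDR module that holds the segment
    phys_base: int    # offset of the segment inside that module


class SegmentTable:
    """Small translation table searched linearly, as a TLB-like lookup sketch."""

    def __init__(self) -> None:
        self._entries: List[SegmentEntry] = []

    def map(self, entry: SegmentEntry) -> None:
        # Reject overlapping virtual ranges so each address has one owner.
        for e in self._entries:
            if (entry.virt_base < e.virt_base + e.size
                    and e.virt_base < entry.virt_base + entry.size):
                raise ValueError("virtual range overlaps an existing segment")
        self._entries.append(entry)

    def translate(self, virt_addr: int) -> Tuple[int, int]:
        """Return (module_id, physical offset) for a virtual address."""
        for e in self._entries:
            if e.virt_base <= virt_addr < e.virt_base + e.size:
                return e.module_id, e.phys_base + (virt_addr - e.virt_base)
        raise KeyError(f"unmapped virtual address {virt_addr:#x}")
```

With such a table, an accelerator can address one contiguous virtual range while the segments behind it live in different memory modules.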

HN API and library (UPV) with transparent and efficient access to units and memories.

Open-source GPU-like hardware core (CERICT). It supports three levels of parallelism: vector, hardware threads, and manycore parallelism. The core comes with an LLVM-based toolchain, including a custom backend that was developed from scratch. It is mainly meant as an open platform enabling broad-spectrum architectural exploration, but it can also be used in production environments as a fully configurable heterogeneous accelerator. Among other results, the customization capabilities of the GPU-like core have been exercised in the application domain of cryptographic computing, by extending the core with a custom functional unit and a related set of instructions supporting particularly demanding cryptographic primitives.

Programming model and resource management (POLIMI). It allows the programmer to express QoS requirements and to provide kernel implementations for different accelerators. In this way, resources can be managed system-wide, with awareness of the coexistence of multiple applications, while sparing the programmer the burden inherent in programming models such as OpenCL.
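To make the combination of per-accelerator kernel implementations and a QoS requirement more concrete, here is a minimal sketch. It does not reproduce the MANGO programming model or its API: `KernelVariant`, `choose_variant`, the unit names and the latency-based QoS target are hypothetical, chosen only to show the selection step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass(frozen=True)
class KernelVariant:
    """Hypothetical implementation of one kernel for one accelerator type."""
    run: Callable[[List[float]], List[float]]
    est_latency_ms: float   # assumed per-call latency estimate


def choose_variant(variants: Dict[str, KernelVariant],
                   available: Set[str],
                   max_latency_ms: float) -> Optional[str]:
    """Return the unit type whose variant best meets the latency target.

    Only unit types that are currently available and whose estimated
    latency is within the target are considered; among those, the one
    with the lowest estimate wins. Returns None if nothing qualifies.
    """
    candidates = [
        (v.est_latency_ms, unit)
        for unit, v in variants.items()
        if unit in available and v.est_latency_ms <= max_latency_ms
    ]
    if not candidates:
        return None
    return min(candidates)[1]


def scale(xs: List[float]) -> List[float]:
    # Stand-in kernel body; a real variant would target its own device.
    return [2.0 * x for x in xs]


# The same kernel offered for three hypothetical unit types.
variants = {
    "cpu": KernelVariant(scale, 40.0),
    "gpu_like": KernelVariant(scale, 5.0),
    "dct_hw": KernelVariant(scale, 2.0),
}
```

The application states what it needs and supplies the variants; which unit actually runs the kernel is decided at run time from what is available, which is the part a resource manager can coordinate across applications.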

Micro-scale two-phase thermosyphon prototype (EPFL). This new cooling device is able to handle heat fluxes and dynamic heat-peak transients that are not manageable with air-based cooling, and it can be integrated in real data centers. Extensive tests, together with the advanced machine-learning-based thermal management strategies developed in this project, have proven its usability not only for FPGAs but also for High-Performance Computing (HPC) systems based on x86 processors.

Heterogeneous accelerator (UNIZG). Results show that the DCT hardware accelerator, even when running at 40 MHz, outperforms an AVX-optimized implementation on an Intel processor (running at 3.3 GHz) in processing time for block sizes of 16x16 or larger.

THALES. The MANGO architecture made it possible to identify the key points to address when algorithms such as LDPC need to be parallelized in a capacity computing context.

PHILIPS. Valuable insights were gained into the scalability of the volume rendering algorithm. The PHILIPS HealthSuite Digital Platform (HSDP) was applied in the project, where we aim to provide volume-rendered images to patients and care providers, who require access to volume rendering at greater scale and increasingly expect personal access to their images anywhere and at any time.

EATON. Direct low-voltage DC distribution to the rack is a new emerging trend in the data center market.

PROD. With the flexible and adjustable HN nodes and the communication possibilities to GN nodes via physical interfaces (GBit Ethernet, USB, PCI Express) including IPs and APIs, improvements in a multiplicity of societal areas can be achieved (e.g. education, health care, aged care services) with new products.

MANGO results are used in new projects: IDEAS (FETOPEN) to create a genuine innovation with socio-economic impacts. RECIPE (H2020) to enable the HPC system to improve its reliability via predictive reliability techniques. DeepHealth (ICT) to deploy a European Distributed Deep Learning library and to use it on heterogeneous hardware, where MANGO is one central pillar.

Basic FPGA motherboard used in MANGO platform
