Periodic Reporting for period 2 - The European PILOT (Pilot using Independent Local & Open Technologies)
Reporting period: 2022-10-01 to 2024-05-31
The EUPILOT project aims to build an end-to-end demonstrator of accelerators that could be used in a pre-exascale system. The project will produce three chip tapeouts. The first will be a test-chip to validate the use of the 12nm technology node. The second and third will contain a vector accelerator (VEC) with up to 16 cores and a machine learning and stencil accelerator (MLS) with up to eight cores, respectively.
The chips will be mounted alongside LPDDR memory on modules, which will be installed on accelerator boards to form systems. Paired with host servers, these systems will be deployed in liquid immersion tanks. These tanks support high power densities efficiently and are an increasingly adopted technology for the future of HPC.
EUPILOT contributes to a sustainable exascale HPC ecosystem in Europe, helping lay the groundwork for long-term technical independence by delivering an end-to-end proof of concept. The resulting know-how, the boost in industrial competitiveness and closer cooperation will all support the goal of establishing European digital autonomy.
Regarding numerical libraries, the team has been working on the implementation of BLIS and FFTW kernels for both VEC and MLS, to support the HPC applications. To support the AI frameworks and models, the team has successfully implemented a version of oneDNN targeting the VEC accelerator. For MLS, support for converting ONNX models to DaCe and an initial version of the MLS backend have been released. The team has been developing a memory management solution and a version of TensorFlow that dynamically links with oneDNN, to feed Arax. Arax has been ported to a RISC-V QEMU environment. The team has provided a oneDNN library optimised for VEC to assist the integration with TensorFlow. Work has started on integrating Tarantella with DaCe/TensorFlow.
Co-design work has been performed to start developing tests for verifying the functionality of OpenMPI's data transfer engine (DTE). The team has worked towards the final goal of porting and optimising the OpenMP runtime for VEC, with a focus on locality awareness and better energy efficiency. Similarly for MPI, a new OpenMPI component has been developed to optimise collective operations using the hardware extensions in VEC. Effort has been devoted to developing an optimised version of the TAMPI (Task-Aware MPI) library that manages all concurrent MPI requests internally, as well as to porting the DLB (Dynamic Load Balancing) library to RISC-V.
Work was done on node- and cluster-level resource management, based on the integration of three components: SLURM, Konro and DROM. The malleability work is completed with the deployment of DMRLib on RISC-V. The team has worked on porting a recent Linux kernel and root file system with the appropriate customizations (device drivers), on an environment to ease image file generation and deployment, and on a new Fast Context Switch module to better support OpenMP free-agent threads and DLB.
In terms of tools, the team has been working on integrating the Fortran front-end of LLVM with the EPI compiler to pave the way to vectorisation and optimized code generation for both VEC and MLS. An initial release of the memory interference analysis engine, supporting the analysis of scalar and vector memory instructions, has been implemented in LLVM as a RISC-V back-end pass.
The hardware team focused on two main areas in parallel.
The tapeout of the so-called test-chip is done, and the chip is expected to arrive in June 2024. The characterization and debugging of all critical structures to be used in future chips will take around three months.
Work was performed on implementing parts of the uncore, including the C2C controller, the LPDDR controller and the CXL controller (with their corresponding PHYs), in addition to the power management controller, PLLs, etc. Most of this work was included in the test-chip.
In parallel, work was done to reach freeze milestones in VEC and MLS designs.
In terms of the memory hierarchy, design work was performed on cache improvements and feature upgrades in the intra-chip coherency mechanisms. In the RISC-V/VEC cores, performance increases can be expected from a 4x increase in the number of outstanding misses that can be handled. Work was also performed to extend the AMBA5 CHI. The first interface specification for the I/O coherent data-transfer engine (DTE) was created, along with the DMA engine.
Work was performed to improve the VEC core from a 2-way in-order design to a 3-way out-of-order core. The interface between the core and the VPU (OVI) was upgraded to version 2, with changes in both the core and the VPU. There are also improvements to the virtual channels of the NoC, enabling inter-chip routing.
On the MLS side, improvements have been made to the integration of the SPU with the Snitch integer core and to the memory mapping of the SPU, along with further integration improvements.
Verification efforts have been devoted to transitioning from version 0.7 to version 1.0 of the V extension. Effort continued on the multi-FPGA environments with C2C protocol extensions.
In the systems area, development of the testboard that will host the test-chip was completed, and work on the system specifications and the definition of the requirements has continued.
Finally, planning for the deployment and operation of liquid immersion cooling tanks continued.
EUPILOT will provide implementations of relevant AI/ML frameworks, leveraging a pool of European technologies (e.g. DaCe and Tarantella), that will enable a broad class of applications.
The improvements over the prior VPU will be substantial, in addition to compatibility with RISC-V V v1.0. The VEC core will improve, and both accelerator chips will deliver performance and energy-efficiency gains as well as increased memory capacity and potentially higher memory throughput. The NoC will be improved to match the increase in memory bandwidth. The CXL controller brings improvements in latency and throughput.
The availability of computing platforms like EUPILOT, bringing accelerated HPC and AI/ML workloads to future exascale systems, has the potential to benefit endeavours such as drug discovery and protein folding and, in general, to shorten the time to scientific discoveries. These have the potential to save billions and many years of development across many fields. These efforts aim for large-scale societal improvements by enabling users to model what was previously difficult to perform physically, and to mitigate the impact of detrimental or catastrophic events upon society at large.