Periodic Reporting for period 2 - SPARCITY (SparCity: An Optimization and Co-design Framework for Sparse Computation)
Reporting period: 2022-10-01 to 2024-03-31
- Develop a comprehensive characterization mechanism for sparse computations.
- Create advanced node-level optimizations for modern parallel architectures.
- Devise topology-aware partitioning algorithms and communication optimizations for system-level parallelism.
- Create digital SuperTwins of supercomputers to evaluate what-if hardware scenarios.
- Demonstrate the SparCity framework's effectiveness and usability on challenging real-life applications.
Overall, SparCity significantly contributes to Europe's strengths in HPC applications, low-energy processing technologies, and advanced software and services development.
**WP2:** The focus was on expanding tensor and matrix generators in Task 2.2 and using ML models for matrix reorderings. The source-to-source compiler and automatic graph kernel fusion in Task 2.3 were completed. Extensive experiments on data and computation reordering in Task 2.5 led to a Supercomputing 2023 publication, with reordering algorithms made available through SparseBase, offering performance recommendations for CPUs and GPUs.
**WP3:** System-level optimizations included a new interface specification for collective communication and scalable techniques for partitioning sparse matrices, tensors, and graphs in Task 3.2 applied to deep learning workloads on GPUs and IPUs. A software infrastructure for communication offloading in multithreaded code was implemented in Task 3.3. Hierarchical partitioning and graph embedding for sparse matrix-vector multiplication on IPUs were explored in Task 3.4.
**WP4:** SuperTwin-related tasks were completed, enabling microbenchmarks, probing, and visualization of monitored applications for detecting performance anomalies. SuperTwin, an open-source framework for digital twins of HPC systems, uses a Knowledge Base for performance metric monitoring, real-time visualization, linked-data connections, and advanced analysis. The Abstraction Layer automates low-level profiling, including live cache-aware roofline modeling, validated on various multicore architectures.
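The cache-aware roofline modeling mentioned above rests on a simple bound: attainable performance is limited by either peak compute or memory bandwidth, depending on a kernel's arithmetic intensity. A minimal sketch of that idea follows; the machine numbers and the `roofline` helper are illustrative placeholders, not measurements or APIs from the project.

```python
# Sketch of the roofline bound behind CARM-style modeling. Peak compute and
# bandwidth figures below are illustrative, not from any SparCity system.
def roofline(ai_flops_per_byte, peak_gflops=500.0, bandwidth_gbs=100.0):
    """Attainable GFLOP/s at a given arithmetic intensity (FLOPs/byte)."""
    return min(peak_gflops, ai_flops_per_byte * bandwidth_gbs)

# CSR SpMV moves roughly 12 bytes per nonzero (8-byte value + 4-byte column
# index) for 2 FLOPs, so its arithmetic intensity sits far below the ridge
# point -- SpMV is memory-bound on typical hardware.
spmv_ai = 2 / 12
ridge = 500.0 / 100.0  # intensity where the compute and bandwidth roofs meet
print(round(roofline(spmv_ai), 2), ridge)
```

Sparse kernels landing left of the ridge point is precisely why the project's data-locality and reordering work targets memory traffic rather than raw FLOP throughput.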
**WP5:** Task 5.1 optimized cardiac electrophysiology simulators. Task 5.2 developed new datasets and algorithms for detecting digital wildfires on social networks, including an incremental clustering strategy. Task 5.3 extended epistasis detection implementations to new hardware such as Google TPUs, Nvidia tensor cores, and Graphcore IPUs, leveraging transformer neural networks for better performance. Task 5.4 ported GNNs for multi-pedestrian tracking to Graphcore IPUs, achieving superior performance over GPUs. Task 5.5 created SparseUtils.com, which offers prepartitioned matrices and matrix operation tools. Two new applications, sparsified large language models and sequence alignment, were added during the second period.
The SparCity project has significantly advanced the state-of-the-art in characterizing application behavior for sparse computations. The project achieved comprehensive feature extraction for common sparse data structures, such as matrices, graphs, and tensors, which facilitated the development of machine learning systems to predict the efficacy of reordering algorithms for SpMV. This predictive capability represents a substantial optimization in sparse computations. The feature extraction tool is publicly available at https://github.com/sparcityeu/feaTen. Enhanced performance and energy-efficiency modeling were also achieved, including the development of the Mansard Roofline Model (MaRM) and enhancements to the Cache-Aware Roofline Model (CARM). These models are designed for sparse kernels and input matrices, providing a framework for optimizing performance across various hardware architectures, as detailed in several peer-reviewed publications. Data locality tools were expanded to support a broader set of architectures, including AMD machines, enhancing comprehensive communication modeling capabilities.
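The kind of structural features that feed such a reordering-prediction model can be sketched in a few lines. The feature set and function below are illustrative only, assuming SciPy's CSR format; they are not the actual feaTen interface.

```python
# Hypothetical sketch of structural feature extraction for a sparse matrix,
# in the spirit of a tool like feaTen (feature names are illustrative).
import numpy as np
from scipy.sparse import random as sparse_random

def extract_features(A):
    """Compute simple structural features of a sparse matrix in CSR form."""
    A = A.tocsr()
    n_rows, n_cols = A.shape
    nnz = A.nnz
    row_nnz = np.diff(A.indptr)  # nonzeros per row
    rows, cols = A.nonzero()
    bandwidth = int(np.abs(rows - cols).max()) if nnz else 0
    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "nnz": nnz,
        "density": nnz / (n_rows * n_cols),
        "avg_row_nnz": float(row_nnz.mean()),
        "row_nnz_std": float(row_nnz.std()),  # load-imbalance indicator
        "bandwidth": bandwidth,               # sensitive to row/col ordering
    }

A = sparse_random(1000, 1000, density=0.01, format="csr", random_state=0)
print(extract_features(A))
```

Features such as bandwidth and the row-length spread are order-dependent, which is what makes them useful inputs for predicting whether a given reordering algorithm will pay off for SpMV.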
**Node-Level Optimizations for Sparse Computation**
Key contributions include the first use of order-dependent features for predicting reordering effectiveness in specific matrices, optimizing sparse matrix operations by tailoring the reordering process. An extensive performance study involving 490 large matrices, six reordering algorithms, and eight modern multicore architectures was published at Supercomputing 2023. Other notable developments include genTen, a smart sparse tensor generator available at https://github.com/sparcityeu/genTen, and SparseBase, a preprocessing library for sparse computation available at https://github.com/sparcityeu/sparsebase. Additionally, a framework for fusing multiple graph algorithms significantly improved performance over executing algorithms individually.
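To make the effect of reordering concrete, the sketch below applies Reverse Cuthill-McKee, one of the classic bandwidth-reducing orderings typically included in such comparisons, and measures the bandwidth change. It uses SciPy's built-in RCM; SparseBase exposes comparable preprocessing functionality.

```python
# Illustrative example: Reverse Cuthill-McKee reordering reduces matrix
# bandwidth, which tends to improve cache locality for SpMV.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    """Maximum distance of a nonzero from the diagonal."""
    rows, cols = A.nonzero()
    return int(np.abs(rows - cols).max())

# Small symmetric matrix with a scattered nonzero pattern.
A = csr_matrix(np.array([
    [4, 0, 0, 1, 0],
    [0, 4, 1, 0, 1],
    [0, 1, 4, 0, 0],
    [1, 0, 0, 4, 0],
    [0, 1, 0, 0, 4],
], dtype=float))

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm][:, perm]  # permute rows and columns consistently
print(bandwidth(A), "->", bandwidth(B))
```

Whether such a reordering pays off for a given matrix and architecture is exactly the question the project's ML-based recommendation models answer, since reordering itself has a nontrivial cost.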
**System-Level Optimizations for Sparse Computation**
The project achieved substantial system-level optimizations, notably through innovative partitioning algorithms, dynamic topology information systems, and advanced communication offloading infrastructure. A dynamic topology information system, available at https://github.com/sparcityeu/yloc, integrates dynamic data and makes it available at runtime. A communication offloading software infrastructure was developed to integrate communication operations efficiently into multithreaded code, available at https://github.com/sparcityeu/mmcso. Additionally, fast GPU-based hypergraph partitioning schemes were implemented, available at https://github.com/sparcityeu/FastPartitioner.
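As a baseline for what partitioners must improve upon, the sketch below shows a greedy 1D row-block split that balances nonzeros across parts. This simple heuristic balances load but ignores communication volume; the hypergraph-based schemes in FastPartitioner optimize both. The function name and approach are illustrative, not the FastPartitioner API.

```python
# Minimal load-balanced 1D row partitioning for SpMV: split rows into
# contiguous blocks holding roughly equal nonzero counts. A greedy sketch
# only -- real partitioners also minimize communication volume.
import numpy as np
from scipy.sparse import random as sparse_random

def rowblock_partition(A, k):
    """Greedy contiguous split of rows into k parts balancing nnz."""
    A = A.tocsr()
    row_nnz = np.diff(A.indptr)
    target = A.nnz / k
    bounds, acc = [0], 0
    for i, c in enumerate(row_nnz):
        acc += c
        if acc >= target and len(bounds) < k:
            bounds.append(i + 1)  # close the current part after row i
            acc = 0
    bounds.append(A.shape[0])
    return bounds  # part p owns rows bounds[p]:bounds[p+1]

A = sparse_random(100, 100, density=0.05, format="csr", random_state=1)
bounds = rowblock_partition(A, 4)
loads = [A[bounds[p]:bounds[p + 1]].nnz for p in range(4)]
print(bounds, loads)
```

In a distributed SpMV, each part would also need the input-vector entries indexed by its columns; minimizing that exchange is the communication-volume objective that hypergraph partitioning captures and simple row blocking does not.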
**Digital SuperTwin and SuperViz**
Significant progress was made in developing SuperTwin, an open-source framework for generating digital twins of HPC systems. SuperTwin enables fine- or coarse-grain profiling and visualization with minimal code overhead and is available at https://github.com/sparcityeu/Digital-SuperTwin. SparseViz, a no-code software for sparse-data ordering and visualization, allows users to evaluate kernel performance with different orderings, available at https://github.com/sparcityeu/sparseviz.
**Demonstration with Real-Life Applications**
Progress was demonstrated across six real-life applications, including unprecedented simulations with the cardiac simulator, advanced graph processing algorithms, and improvements in epistasis detection. The use of Graphcore IPUs across multiple applications delivered better performance than state-of-the-art CPU/GPU implementations. Open collections of sparse problem instances are maintained to facilitate research and reproducibility, available at https://datasets.simula.no/sparcity/ and http://sparseutils.com/.