Abstract

For the past several decades, optimizing compilers have been a primary area of focus in both industry and academia. This continued research interest is a testament to the complexity of this task, primarily stemming from the vast number of parameters that must be explored to attain near-optimal results. One of the key compiler optimizations is "Register Blocking (RB)", also known as "Register-level Tiling" or "unroll-and-jam". RB can strongly reduce the number of executed Load/Store (L/S) instructions, and as a consequence the number of data accesses in the memory hierarchy, but due to its inherent complexities, fine-tuning is essential for its effective implementation. To address this problem, in this work a new methodology is proposed for RB. The RB factors, the loops to apply RB, the number of allocated variables/registers per array reference, and the loops’ ordering are generated by an analytical model, leveraging the target hardware (HW) architecture details and loop kernel characteristics. The proposed methodology has been evaluated on both embedded and general-purpose CPUs across seven well-known loop kernels, achieving high speedups and L/S instruction gains over the GCC compiler, handwritten optimized codes, and the popular Pluto tool.
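To make the idea of register blocking concrete, the following is a minimal sketch in C of unroll-and-jam applied to a matrix-vector product. It is not the paper's analytical model or its generated code: the kernel choice, the fixed RB factor of 2 on the outer loop, and the names `mvm_baseline`, `mvm_rb2`, and `mvm_rb2_matches_baseline` are assumptions made purely for illustration. As written in the source, the blocked version reads `x[j]` once for two rows and keeps two accumulators in scalar variables, which is the kind of L/S reduction the abstract refers to; how many loads and stores a given compiler actually emits for either version depends on its own optimizations.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: y = A * x for a row-major n x n matrix (illustrative kernel). */
void mvm_baseline(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}

/* Register-blocked variant: the i loop is unrolled by an assumed factor
 * of 2 and the two copies of the j loop are fused (unroll-and-jam). */
void mvm_rb2(size_t n, const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);

    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double y0 = 0.0, y1 = 0.0;      /* accumulators kept in scalars */
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];           /* read once, reused for both rows */
            y0 += A[i * n + j] * xj;
            y1 += A[(i + 1) * n + j] * xj;
        }
        y[i] = y0;                      /* one write per output element */
        y[i + 1] = y1;
    }

    /* Remainder row when n is odd. */
    for (; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* Returns 1 if both versions agree on a small 5 x 5 integer-valued input. */
int mvm_rb2_matches_baseline(void)
{
    enum { N = 5 };
    double A[N * N], x[N], ref[N], out[N];

    for (size_t i = 0; i < N; i++) {
        x[i] = (double)(i + 1);
        for (size_t j = 0; j < N; j++)
            A[i * N + j] = (double)(i * N + j);
    }

    mvm_baseline(N, A, x, ref);
    mvm_rb2(N, A, x, out);

    for (size_t i = 0; i < N; i++)
        if (ref[i] != out[i])
            return 0;
    return 1;
}
```

The factor 2 is hard-coded here only to keep the sketch short. Choosing the factor, the loop to block, and the number of registers per array reference for a given CPU is the tuning problem the abstract describes.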

DOI

10.1145/3649153.3649194

Publication Date

2024-07-02

Keywords

Compiler Optimization, Register Blocking, Register Tiling, Unroll-and-Jam, High Performance Computing, Data Reuse, CPUs

First Page

71

Last Page

79
