Abstract

Register Blocking (RB), also known as ‘Register-level Tiling’ or ‘unroll-and-jam,’ is a key compiler optimization for developing efficient micro-kernels. However, applying RB effectively is a complex task due to several challenges. First, the exploration space of possible RB configurations is vast. Second, RB and loop permutation are interdependent; therefore, addressing both optimizations simultaneously further inflates the exploration space. Third, the effectiveness of RB is highly dependent on the target hardware platform and the specific loop kernel being optimized. As a result, an extensive and time-consuming fine-tuning process is necessary for achieving an efficient implementation. To address these challenges, a source-to-source analytical modelling approach is proposed. The RB factors, the loops to apply RB, the number of allocated variables/registers per array reference, and the loops’ ordering are generated by an analytical model, leveraging the target hardware architecture details and loop kernel characteristics. The proposed methodology has been evaluated on both embedded and general-purpose CPUs, using seven well-known loop kernels and three machine learning applications. The results show significant speedups over the GCC compiler, the Pluto tool, and related work.

Publication Date

2025-09-13

Publication Title

ACM Transactions on Embedded Computing Systems

Volume

24

Issue

5

ISSN

1539-9087

Acceptance Date

2025-01-01

Deposit Date

2025-12-10

Funding

This work is part of the R-PODID project, supported by the Chips Joint Undertaking and its members, including the top-up funding by the National Authorities of Italy, Turkey, Portugal, the Netherlands, Czech Republic, Latvia, Greece, and Romania, under grant agreement No 101112338.

Keywords

CPUs, Compiler optimizations, data reuse, high performance computing, register blocking, register tiling, unroll-and-jam

First Page

1

Last Page

24
