Design and Implementation of 2D Convolution on x86/x64 Processors

In this paper, a new method for accelerating the 2D direct Convolution operation on x86/x64 processors is presented. It includes efficient vectorization by using SIMD intrinsics, bit-twiddling optimizations, the optimization of the division operation, multi-threading using OpenMP, register blocking and the shortest possible bit-width value of the intermediate results. The proposed method, which is provided as open-source, is general and can be applied to other processor families too, e.g., Arm. The proposed method has been evaluated on two different multi-core Intel CPUs, by using twenty different image sizes, 8-bit integer computations and the most commonly used kernel sizes (3x3, 5x5, 7x7, 9x9). It achieves from 2.8 to 40 speedup over the Intel IPP library (OpenCV GaussianBlur and Filter2D routines), from 105 to 400 speedup over the gemm-based convolution method (by using Intel MKL int8 matrix multiplication routine), and from 8.5 to 618 speedup over the vslsConvExec Intel MKL direct convolution routine. The proposed method is superior as it achieves far fewer arithmetical and load/store instructions.


INTRODUCTION
T HE 2-D Convolution is a widely used operation in image and video processing for filtering, smoothing, sharpening, edge detection and differentiation. Furthermore, it is the most computationally intensive building block in Convolutional Neural Networks (CNNs).
Speeding up the convolution operation is a challenging and non-trivial task. This is due to two main reasons. First, the optimization process must take into account multiple parameters: input image dimensions and image bit-depth, kernel size, kernel values, and target hardware architecture. Second, modern compilers fail to provide optimized code, therefore different manually vectorized (e.g., using SSE/AVX intrinsics) and optimized routines are needed, and most importantly these routines must be customized to the above mentioned parameters.
Three main strategies are typically used to speedup the convolution operation: a) optimizing the direct convolution method [1], e.g., Intel IPP and Intel MKL libraries provide such routines, b) using the highly optimized Matrix-Matrix Multiplication (MMM) routines of Intel MKL or BLAS optimized libraries [2] [3], c) using either the Fast Fourier Transform (FFT) [4] or the Winograd algorithm [5] [6]. According to [7] an efficient direct convolution implementation achieves higher performance than the MMM-based and FFT based convolution methods on CNNs, as the non-direct methods introduce computational and memory overheads. The proposed method uses the direct method.
The aim of this work is to speedup the 2D convolution in the context of image and video processing algorithms. The extension of the proposed method in CNNs is left for future work.
To speedup the convolution operation, CPU vendors provide optimized libraries such as Intel IPP (Integrated Performance Primitives) and Intel MKL that are commer-Manuscript received Oct 19, 2021; revised XXX cial proprietary libraries working only on Intel processors. OpenCV (Open Source Computer Vision), which is a very popular library of programming functions for real time computer vision applications, uses Intel IPP (specified in the installation phase) to accelerate its routines on Intel CPUs. Its optimized routines use multiple threads and AVX/SSE intrinsics (IPP uses the direct method). Intel MKL provides a set of routines to efficiently execute the direct convolution operation for single and double precision real and complex data. Furthermore, Intel MKL provides highly optimized int8/int16/FP32 MMM routines that are widely used to compute the convolution layer of CNNs.
In this paper, the design and implementation of the 2D convolution operation on x86/x64 processors is delivered, for different kernel sizes, kernel values, vectorization technologies, number of physical CPU cores, image bitdepths and image sizes as well as for separable kernels. The proposed method achieves far fewer arithmetical and Load/Store (L/S) vector instructions than the state of the art libraries/methods as it includes efficient vectorization by using SIMD intrinsics, bit-twiddling optimizations, efficient vector division, register blocking and the shortest possible bit-width value for the intermediate results. In particular, our method offers high SIMD utilization. Fig. 1 shows the instruction gains over the Intel libraries for an 1024x1024 image. The results are extracted by using the Valgrind tool [8]. The MMM-based method gives the highest number of instructions mainly because of the extra computation/memory overhead of the im2col operation (it is explained in Section 2). The two direct methods achieve fewer instructions as there are no overheads.
The proposed method is evaluated over a) the Intel IPP / OpenCV and in particular GaussianBlur and Filter2D routines [9], b) the MMM-based convolution method, by using Intel MKL 'cblas gemm s8u8s32()' int8 MMM routine [10], c) 'vslsConvExec' Intel's MKL direct convolution routine (it does not support integer data, thus floating point (FP) input data are used) [11], on two multi-core Intel CPUs, by using a wide range of 8-bit greyscale images and the most commonly used kernel sizes (3x3, 5x5, 7x7, 9x9). Our method achieves high performance and energy consumption gains in all cases.
The main contributions of this paper are: a) a new method for computing the 2D convolution operation on modern CPUs, achieving far fewer arithmetical and L/S instructions compared to the state of the art software commercial libraries, b) a research work providing the theoretical background and source code 1 , to efficiently design and implement the 2D convolution operation on different CPUs and for different kernels and images, c) an experimental procedure showcasing that the proposed method achieves high performance gains on two different CPUs.
The remainder of this paper is organized as follows. In Section 2, the related work is reviewed. The proposed method is presented in Section 3, while the experimental results are discussed in Section 4. Finally, Section 5 is dedicated to discussion while Section 6 is dedicated to conclusions and future work.

RELATED WORK
The main strategies that have been followed to improve the performance of the convolution operation on CPUs can be classified into three main groups. The proposed method belongs to the first group. The first group of works uses the direct algorithm. The main optimizations used are vectorization, improving the memory layout to enable vectorization, parallelization (by using OpenMP) and register blocking. Although parallelization can be easily applied by using OpenMP pragmas, the efficient application of vectorization is far more challenging and necessitates the manual development of low-level software; furthermore, if applied efficiently, it gives by far the highest performance gain.
There are three main vectorization strategies for the convolution operation [12] [13] [14] [7]. The first strategy (aka coefficient propagation) broadcasts each kernel's scalar coefficient to a separate vector variable [12] [13] [14] (each vector variable has multiple copies of just one scalar coefficient). In the second strategy, all the scalar coefficients which are located in a kernel's row are copied into a vector 1. https://github.com/kelefouras/2D Convolution variable, multiple times [12] [13]. According to [12], the first method above is faster than the second (when using 16/32-bit instructions), as it achieves fewer arithmetical instructions. The third strategy, which is used in CNNs only, vectorizes the depth or filter loop [1] [15] [7], and data are processed in a way similar to Method2; note that in CNNs the convolution layer includes three more loops (both the input feature map and the weights are tensors) and thus a comparison with our method is hard to be realized. Most of the works use 32-bit FP instructions, some use 16bit integer instructions [12] [13] [14] [16], but no related work uses 8-bit instructions on x86/x64 processors. 8-bit implementations are far more challenging as first, there is no x86/x64 8-bit vector multiplication instruction, second, there is no vector instruction for integer division, and third, packing/unpacking becomes complicated and extra instructions are required. Thus, none of the current methods can be directly extended to 8-bit (serious modifications are required), e.g., the first vectorization strategy cannot be efficiently implemented in 8-bit because of the first reason above.
To the best of our knowledge, the proposed method achieves fewer L/S and arithmetical instructions compared to all the currently published methods for the following reasons. First, it leverages the existing available 8-bit integer instructions. Second, the proposed method efficiently packs and stores the output pixels in memory; in Subsection 4.1.3, we show that this step boosts performance by about x2.5. Third, it uses optimizations that the existing works do not; the number of instructions is further reduced by optimizing the division operation and the bit-width of the intermediate results (in Subsection 4.1.3, we show that these optimizations boost performance by about x1.6, each).
In [17], a parallel implementation of Harris corner detection algorithm is proposed for NUMA architectures. In [18], the Harris operator is optimized using a number of optimizations such as vectorization, data interleaving and parallelization, on both x86/x64 and Arm processors. In [19], different ways of vectorizing the 3D convolution are shown. In [20], a white paper for a Gaussian Blur implementation on Intel processors is proposed, using FP computations. In [21], a performance comparison is applied between x86 SSE and Arm NEON intrinsics for four image processing routines, including Gaussian Blur. In [15], authors propose a method to accelerate CNNs by using Arm NEON intrinsics and 16-bit computations. In [22], the 2D convolution is optimized by using AVX512 and OpenMP. In [23], authors study the vectorization process in CNNs, using Matlab code. Last, in [24], an implementation for canny edge detection algorithm is delivered. The proposed method achieves fewer L/S and arithmetical vector instructions than [17][18][19][20][21][22][23][24] for the three reasons explained above.
Intel IPP and Intel MKL provide high performance direct methods for computing the convolution operation; these optimized routines have been implemented by using AVX and multiple threads. A comparison with these methods is provided in Section 4.
Last, there are methods that apply auto-vectorization by using image processing domain-specific languages (DSL) such as Halide [26] [27].
The second group of works implement the convolution operation by using the highly optimized Matrix-Matrix Multiplication (MMM) routines of Intel MKL and BLAS optimized libraries [2] [3]. In these methods, convolution is implemented on top of MMM, leveraging the aforementioned highly optimized libraries. This approach is mainly used in CNNs and has been employed in mainstream deep learning frameworks [3].
MMM-based algorithms rely on 'im2col' or 'im2row' memory transformations [2] to convert the Convolution problem into an MMM problem, introducing a non-trivial overhead in memory storage and bandwidth, which is proportional to the kernel size. In the 2D convolution case, the 'im2col' / 'im2row' operation copies the N ×M input image to a new 2D array of size (N × M ) × (k × k), where k × k is the kernel size. This is a slow and memory-bound operation. Furthermore, the MMM operation is applied on the new tall and skinny matrix whose dimensions are dissimilar from the matrices arising from traditional high performance computing (HPC) applications and according to [7] the MMM routines do not achieve their peak performance in this case. Last, the memory size needed is k × k larger compared to the direct method.
An indirection buffer is introduced in [2] to avoid reshuffling the data. In [3], im2col conversion and MMM packing operations are merged to one, reducing the memory footprint. In [28], an efficient CPU implementation for convolution-pooling in CNNs is presented, by using convolution interchange and vectorization.
To better utilize the CPU's vector extensions in CNNs, quantization is used. In [29], authors optimize CNNs by using extremely low-bit quantization on ARM CPU (from 2 up to 8-bit) for the MMM and Winograd methods. Intel oneAPI Deep Neural Network Library (oneDNN) [30] supports 8-bit quantization and three algorithms (direct, MMM-based and Winograd-based).
Regarding the third group, either the Fast Fourier Transform (FFT) [4] or the Winograd algorithm [5] [6] are used, to reduce the number of FP computations required. The Winograd-based methods are mainly used for small kernel sizes (e.g., 3x3), while the FFT-based methods are mainly used for larger kernel sizes. However, the reduced number of operations does not always align with performance as several challenges make it hard to fully utilize the hardware resources on modern CPUs. The Winograd's based convolution algorithms cannot efficiently utilize the memory hierarchy and the wide vector registers that modern CPUs support [6]. FFT-based approaches suffer from significant performance overheads too [7]. Note that both algorithms suffer from a lack of precision.
In [31], a distributed implementation for the IBM Cell Broadband Engine processor is proposed. [5] extends and optimizes the Winograd-class of convolutional algorithms to the N-dimensional case of CNN on x86/x64 CPUs. In [6], another Winograd implementation is presented for manycore CPUs. In [32], a new class of Winograd algorithms for integer arithmetic is presented.

PROPOSED METHOD
The inputs to our method are the following: kernel size, kernel values, vectorization technology (e.g., AVX,SSE), number of physical CPU cores, image bit-depth and image dimensions; for different parameters, different code implementations are proposed. The proposed method is broken down into seven Subsections. For ease of presentation, we assume AVX technology (256-bit operations), 8-bit input/output images and a 2D kernel of size 3 × 3.

Bit-width selection of the Intermediate Results (IRs)
In 2D convolution, a k × k kernel 'slides' over the input image, performing k × k multiplication, (k × k − 1) addition and 1 division (integer coefficients are assumed here) operations and the result is stored into a single output pixel. A naive implementation is shown on the left of Fig. 2 (for ease of presentation, the border pixels are not computed here).
Most of the image formats use 8-bit pixels and therefore to achieve maximum performance 8-bit processing is needed so as to fully utilize the wide vector instructions. This means that the 'input', 'output' and 'coef' arrays in Fig. 2, should be 8-bit. However, 'tmp' variable is 32-bits.
Given that the range of the output pixel value is always within [0, 255] (uses 8-bits), the range of the 'tmp/divisor' value in Fig. 2 Conclusion: In most cases, 16-bit IRs bit-width is adequate for kernels of size 3x3 and 5x5; for larger kernel sizes, 32-bit width is normally needed, depending on the coefficients' values. Note that for separable kernels, e.g., Gaussian Blur, 16-bit width is adequate for larger kernel sizes too, as fewer operations are executed in this case; two 1D kernels are used instead of one 2D (one for the X and another for the Y dimension), and thus the number of operations being performed is lower. In Section 4, we show that 16-bit width provides a speedup from x1.05 to x2.0 over 32-bit width (about x1.5 on average).
Hardware Limitations: Accuracy loss is likely when using 8-bit inputs, regardless of the IRs bit width; the reason follows. In x86/x64 processors there is just one 8bit vector multiplication instruction available (maddubs); maddubs, includes both multiplication and addition of the IRs (Fig. 5). Its first operand is of type unsigned 8bit integer (used for storing the pixels), while its second operand is of type signed 8-bit integer (used for storing the coefficients); the output contains signed 16-bit integers and thus (−32768 ≤ i0 × C00 + i1 × C01 ≤ 32767). When both maddub's operand values are near to their maximum values, an overflow occurs, which leads to accuracy loss. Given that the maximum value of the input (e.g., i0, i1) is 255, an overflow will never occur if the sum of each coefficient pair (e.g., C00+C01) is smaller or equal to 128.

Vectorization
The optimization of the vectorization process is the key to achieve high performance in convolution operation. The proposed vectorization method minimizes the number of arithmetical instructions as well as the number of store and division operations. It also reduces the number of load operations; to minimize the number of load operations, the optimization in Subsection 3.3. must be also applied. Fig. 3 shows a high level illustration of the proposed method. For a kernel of size k × k, k vector load operations are needed to load the first working set of the input image (r0,r1,r2 in Fig. 3). Each vector contains simd.length/pixel.size pixel elements, e.g., for AVX technology (256-bit) and 8-bit pixels, each vector contains 32 pixel elements. The first output result is calculated by processing the first k elements (0:k-1 elements) of the k previously loaded vector variables, e.g., 0:2 elements in Fig. 3. The second output result is calculated by processing the elements 1:k, the third 2:k+1, etc. Loading k ×(simd.length/pixel.size) pixels and calculating just one result is very inefficient as first, the hardware vectorization unit is underutilized, and second, more vector instructions are required. The two main objectives of the proposed method are first, to utilize the vector instructions as much as possible (and as a consequence the number of vector instructions is reduced), second, to select the appropriate vector instructions (note that modern CPUs support a rich instruction set with diverse latency and throughput values).
The proposed vectorization algorithm for kernels of size 3x3 is shown in Algorithm 1 (for ease of presentation, the calculation of the border pixels is not shown here). To ease the presentation of Algorithm 1, Fig. 6 is also provided. Firstly, the following coefficients (V C00, V C10, V C20) are multiplied by the input vectors, by using maddubs instruction (lines 18-20 in Algorithm 1). Then, the results are vertically and horizontally added (lines [23][24][25][26][27][28]. The Horizontal addition is needed so as to add all the pairs in the 'out' variable in Fig.5; it is implemented by using 'shift' and 'add' instructions as the easy 'hadd' instruction to use is an expensive operation. Note that for larger kernel sizes, more than one horizontal addition is needed. The reason that 12, instead of 3 (one for each kernel's row), vector coefficients are used, follows. By using multiple vector coefficient variables (Fig. 4), less load (or no shift) vector operations are required. Instead of loading the input image again and again (e.g., loading elements 0:31, then 1:32, then 2:33 etc) and multiplying the input vectors by a single vector coefficient (e.g.,V C00), the input is loaded once and it is multiplied by the four pre-computed shifted coefficients (Fig. 4). It is preferable to use multiple coefficient variables rather than loading the input image multiple times because first, the coefficients can be loaded just once and remain into hardware registers and second, less memory address calculations are required. Furthermore, instead of loading the input to a vector and then shifting it multiple times (AVX shift instructions are expensive), the pre-computed shifted coefficients are used instead (Fig. 4). Note that shift is an expensive operation in AVX because the existing 256bit shift instructions are implemented as two 128-bit shift instructions and therefore the 16th vector position (there are 32 in total) is always filled with zero (two separate 128bit shift operations occur); therefore additional instructions are needed to fill that empty position with the appropriate value. However, using multiple vector coefficients is not efficient for large kernel sizes, as their number becomes too high; therefore a slightly different algorithm (Algorithm 2) is proposed in this case (explained later).
Note that after line 28, either the odd or the even elements of 'm0' must be discarded (depending on the X value), as these elements do not contain any useful data; this is realized by using the 'and' operations in lines 33,36,39,43; two different masks are needed, i.e., 'mask1' and 'mask2', because in X = 0 and X = 1, the results are located in the even positions of 'm0', while in X = 2 and X = 3 the results are located in the odd positions of 'm0'. Note that the least significant elements are the rightmost.
As far as the number of division operations is concerned, instead of applying division just after the horizontal addition step in line 28 (which is the easiest thing to do), the results are efficiently packed into out even and out odd vector variables, and division is applied outside X loop (lines 49-50). This way, the number of division operations is reduced from four to two (the IRs are first packed and then divided).
The IRs are packed in an elegant and efficient way, and therefore a significant amount of arithmetical vector instructions is saved. In Subsection 4.1.3, we show that this step has a high positive impact on performance. By separating the even and odd results of X loop into separate vectors (out even and out odd), makes the process of extracting and blending the output 8-bit values into one vector variable (lines 53,54) efficient, saving lots of arithmetical vector instructions. Last, the X for loop is fully unrolled to avoid using the if-conditions in lines 31-46.
For kernels of size 7 × 7 or larger, a slightly modified algorithm is used instead (Algorithm 2), because in this case Algorithm 1 gives a very high number of vector coefficients, e.g., 56 vector coefficients are required for the general case of 7×7. When the number of vector variables is higher than the number of the CPU's hardware registers, then the variables cannot remain into the registers and they are loaded/stored many times from/to memory (a.k.a. register spills), degrading performance. Algorithm 2 uses just k coefficients (when the kernel is of size k × k -one vector coefficient is used for each kernel's row), addressing this problem. Algorithm 2 loads the input more times (lines 12-14) than Algorithm 1 (more load instructions), but it computes more output pixels per iteration (32 instead of 30 in Algorithm 1). Algorithm 2 is always faster than Algorithm 1 for kernels of size 7x7 and larger, as in this case Algorithm 1 gives a very high number of vector coefficients. On the contrary, Algorithm 1 is normally faster for kernels of size 5x5 and smaller.
32-bit IRs: The bit-width of the IRs has a strong impact on vectorization. The most efficient way to make the 16-bit IRs 32-bit, is by using madd epi16 instruction. In Algorithm 3, we show the implementation of Algorithm 2 for 32-bit IRs. First, maddubs instruction multiplies 8-bit numbers and outputs signed 16-bit integers (lines 16-18 in Algorithm 3). Then, madd epi16 instruction multiplies the 16-bit output of maddubs by the value of '1' and adds the adjacent results (lines 21-23), generating 32-bit integers. The results are then vertically added and afterwards directly divided as 'm0' contains only useful data now (eight 32-bit values). In the 32-bit case the number of division operations doubles. The implementation of Algorithm 1 using 32-bit IRs is similar.
The aforementioned algorithms can be easily extended to larger kernel sizes, 16-bit computations and different vectorization technologies, e.g., AVX-512, SSE, Arm SVE, Arm NEON (see Section 5).

Register Blocking
Convolution is a data intensive loop kernel and therefore the number of load/store (L/S) operations performed is critical. So far, the number of load operations is reduced by packing as many scalar coefficients as possible into the vector coefficients and by using multiple coefficient vector variables (Algorithm 1). The number of store operations is minimized by efficiently extracting and blending the IRs. In Algorithm 1, 30 output pixels are calculated and stored by using just 3 load operations and a single store instruction.
To further reduce the number of load operations of Algorithm 1, register blocking optimization is applied. Register blocking consists of loop unroll and scalar replacement optimizations. For ease of presentation, we show the register blocking optimization using scalar code (Fig. 2), but its application on Algorithm 1 is trivial. In Fig. 2, the application of register blocking is shown with a factor of 2. Firstly, loop unrolling is applied to 'row' loop, exposing common array references in the loop body. Then, the common array references are replaced by variables (scalar replacement). On the left of Fig. 2, every time the row iterator is incremented, two of the three input image rows are loaded again, which is not efficient; on the right, the number of load operations is reduced by about 1.5×.
Register blocking optimization is applied to Algorithm 1 only; it cannot be applied to Algorithm 2, as different pixel values are loaded in each X iteration. We found experimentally that register blocking can boost performance by Algorithm 2 Proposed vectorization algorithm for kernel sizes 7 × 7 and larger . For presentation purposes the implementation of the 3 × 3 kernel is shown. All the x86/x64 intrinsics can be found in this link [33]. 1: unsigned short int a=65535 //in binary a = (11111111 11111111) 2 2: m256i r0,r1,r2,m0,m1,m2,out even, out odd 3: m256i mask1 = set epi16 (0, a, 0, a, 0, a, 0, a, 0, a, 0, a, 0, a, 0, a) //odd positions are zero 4: m256i V C00, V C10, V C20 //just three vector coefficients are required 5: 6: for (row = 1; row < N − 1; row + +) do 7: for (col = 32; col <= M − 34; col+ = 32) do 8: 9: //This loop must be fully unrolled to avoid the if-conditions below 10: for (X = 0; X < 4; X + +) do 11: //Load 32 8-bit pixels 12: Algorithm 3 Implementation of Algorithm 2 for 32-bit IRs 1: m256i r0,r1,r2,m0,m1,m2, ones = set1 epi16(1) 2: m256i V C00, V C10, V C20 //vector coefficients 3: m256i out [4] //For ease of presentation an array is shown here, but four variables must be used instead 4: 5: for (row = 1; row < N − 1; row + +) do 6: for ( up to 1.2×. The register blocking factor needs to be found experimentally and it depends on the kernel size, image size and hardware platform. The kernel size affects the blocking factor as the smaller the kernel size is, the fewer the vector coefficients are; if the number of the vector variables is larger than the number of the available hardware vector registers (in x64 there are normally 16 registers), then the reused variables cannot remain into the registers and they are spilled to L1 data cache (aka register spilling). However, it is performance efficient to apply register blocking even when register spills occur as the number of saved load operations is high, normally outweighing the cost of register spilling (this is hardware dependent). The image size also affects the blocking factor; when the width of the input image is very large (e.g., M=5184), then L1 data cache almost fills up (with the input/output image rows) and thus register spilling causes useful data to be evicted from the cache, degrading performance. Note that more image rows are processed at a time when register blocking is applied.

Optimizing the division operation
Division is a long-latency operation and its optimization highly impacts the overall performance. Its optimization is broken down into two steps.
Firstly, the division operation is transformed into multiplication and shift operations. The (x/d) operation is transformed into (x * ceil(f )) >> (w + b) [34], where w is the word size in bits (e.g., w = 16, if 16-bit division is applied), b = f loor(log 2 (d)) and f = 2 (w+b) /d. The unsigned scalar division with fixed divisor is shown in Algorithm 4. Note that the divisor is always a constant value; therefore the computationally expensive procedure that calculates the ceil(f ) and b values, is applied just once for each input image.
The second step includes the optimization of the multiplication operation in (x * ceil(f )) >> (w + b). Let us assume that 16-bit width is used for the IRs. In this case, the aforementioned multiplication is a 16-bit multiplication and its result needs 32-bits to be stored. Therefore, two vector multiplication instructions are needed ( 'mullo' and 'mulhi'). 'Mullo' is needed to calculate the low 16-bits, while 'mulhi' is needed to calculate the high 16-bits; then, extra instructions are also needed to pack the results and process the 32-bit results. This procedure, which is computationally expensive, is optimized by using the bit twiddling optimizations shown in Algorithm 5; this way, only one multiplication operation is performed and no packing of the results is needed [34]. All the x86/x64 intrinsics can be found in [33]. The 32-bit division is shown in Algorithm 6.

Separable kernels
A separable k × k 2D kernel can be broken down into two 1D kernels, a kernel of size k × 1 and another of size 1 × k; this can reduce the number of arithmetical instructions, especially for medium/large kernel sizes (Gaussian Blur is separable). The most efficient way of implementing the separable kernels is by merging the two 1D convolution operations. First, the k × 1 kernel is applied on the first row and its results are stored into a new array ( 'temp' in Algorithm 7) which is then processed by the 1 × k (Algorithm 7). The procedure is repeated row by row.
The optimization process of the 1 × k kernel is a simplified version of Subsection 3.2. Regarding the k × 1 kernel, there is no 8-bit multiplication vector instruction and thus the multiplication of the even and odd elements is performed separately (lines 24-26 and 34-36).
It has been found experimentally that separable kernels perform better for kernels of size 7 × 7 or larger (as the gain in arithmetical instructions is high), worse for kernels of size 3 × 3, and both methods perform almost equally for kernels of size 5 × 5.

Image Boundaries
In the aforementioned algorithms, the calculation of the border pixels (first and last columns/rows) is not included, to ease presentation. The first columns are processed by shifting the vector coefficients by one position and filling the gap (first position element) with a special value depending on the border type policy needed, e.g., zero. Regarding the last column pixels, they are processed separately to avoid the cost of array padding. A separate routine is used for the loop reminder that stores the output pixel elements either in shorter vectors or one by one. Note that to achieve best performance, different loop reminder routines are needed for different M values, e.g., if just one column needs to be computed, an optimized scalar version would be faster. Last, the computation of the first and last row output pixels can be simplified as normally the border pixel policy is filling with zeros.

PERFORMANCE EVALUATION
The experimental results are performed on two host PCs; an Intel quad-core i7-4790 CPU at 3.60GHz with DDR3 1600MHz (PC1) and an Intel quad-core i5-7500 CPU at 3.40GHz with DDR4 2666MHz (PC2), both running Ubuntu 20.04. Both processors support AVX2 instructions. PC2 achieves higher memory bandwidth than PC1 and this is why all the implementations run faster on PC2, apart from the largest three image sizes where PC1 performs better in some cases; this is because PC1 has a larger L3 cache, i.e. 8MB compared to 6MB of PC2 (this is explained below).
The proposed method is compared to the latest version of OpenCV (version 4.5.1) with Intel IPP, AVX and multithreading enabled as well as to the latest version of Intel MKL library (version 2021.3). Note that the OpenCV routines being evaluated use the Intel IPP library. All the codes are compiled by using gcc 9.3.0 and '-O3' option.
The performance metrics used are speedup and Calculated Pixels per Second (CPS); we have used CPS instead of execution time, as CPS better shows how well the hardware is utilized. The CPS formula is given by (M × N × times.run)/measured.ex.time, where M is the image width, N is the image height and times.run is the number of times each code version has run. To get accurate execution time values, each version runs multiple times (times.run) so as its execution time is about one minute.
The register blocking factors used are 2 and 3, for the 5x5 and 3x3 case, respectively. Register blocking is not efficient for the largest image size (5184x3456) for the reason explained in the end of Subsection 3.3. Note that register blocking is not applicable to Algorithm 2 (7x7 and 9x9 case).

Image Size (M,N) Computed Pixels per second (CPS)
Gaussian Blur 3x3

Evaluation over Intel IPP / OpenCV
The proposed method is evaluated using two different convolution loop kernels for image smoothing. The first one is the Gaussian Blur (OpenCV GaussianBlur routine [9]); the Gaussian Blur is separable and its coefficient values are symmetrical, e.g., in a 3 × 3 kernel the coefficients of the 1st and 3rd row are identical. The second loop kernel is nonsymmetrical and non-separable (OpenCV filter2D routine

Speedup over Intel IPP
Gaussian Blur 9x9 Fig. 10: Evaluation over Intel IPP / OpenCV on 9 × 9 Gaussian Blur [9]) and all its coefficients are different. Four different kernel sizes have been used for each case, i.e., 3x3, 5x5, 7x7 and 9x9, and a wide range of greyscale 8-bit image sizes. Algorithm 1 is used for the 3x3 and 5x5 cases, while Algorithm 2 for the 7x7 and 9x9. Fig. 7-10 show the evaluation of the proposed method for the Gaussian Blur. To improve readability, the results on 'PC1' and 'PC2' are shown using the same colours; 'PC2' uses a continuous line while 'PC1' a discontinuous line. The 'Unoptimized (O3)' line refers to a naive single threaded implementation. Although OpenMP gives average speedup of 3.4× here, we did not use OpenMP in the unoptimized version as we would like to evaluate the code that a nonexpert would use. One implementation is used for all the kernel sizes. '16/32' refers to the bit-width used for the IRs, 'A/B/C' refers to the division case (see Algorithm 5/6) while 'sep' refers to separable kernels (in this case Algorithm 7 has been used). The Gaussian Blur is separable and thus Algorithm 7 can be used to reduce the number of arithmetical instructions. Algorithm 7 performs better than Algorithm 1 for the 7x7 and 9x9 kernels, but Algorithm 1 performs better for the 3x3 case as in this case there is no gain in arithmetical instructions. For the 5x5 case, both algorithms provide roughly the same performance. Regarding the 3x3 case (Fig. 7), all the implementations achieve their highest CPS values, as the 3x3 case includes fewer calculations than the 5x5,7x7 and 9x9 cases. The unoptimized single-threaded code ('O3' option is used) achieves about 69 and 59 Mega CPS (MCPS) and is about 233 and 272 times slower than the proposed method, on 'PC1' and 'PC2', respectively. IPP achieves from 2.22 to 3.52 Giga CPS (GCPS) and from 1.88 to 3.86 GCPS, on 'PC1' and 'PC2', respectively, delivering high speedup values over the unoptimized method (Fig. 7). Note that IPP uses separable kernels for the GaussianBlur routine.

Gaussian Blur
Gcc cannot auto-vectorize the code because of the ifcondition in the loop body; an if-condition is required to process the border pixels differently. Note that if we process the border pixels separately (in another loop kernel), then no if-condition is needed and gcc is capable of auto-vectorizing the code; still, the speedup achieved is quite low (around 1.8×). The 'Ofast' option does not provide any further performance gain here. The divisor value of the unoptimized code is always a power of two and therefore gcc replaces the division operation with a shift operation.
The performance of the proposed method is shown for different divisor values (Fig. 7). As it was explained in Subsection 3.4, there are three different division cases (Algorithm 5/Algorithm 6). If the divisor is a power of 2 (case A), then the division is realized by using just a single shift instruction, while in case B and case C, 7/11 and 5/9 instructions are executed, for the 16/32 case, respectively. This is why case A is faster than case C and case C is faster than case B, at all times.
Regarding the three largest input sizes, there is a performance drop for both IPP and the proposed method, because the input/output images are larger than L3 cache. Performance is degraded more on 'PC2' than on 'PC1' because 'PC2' has a smaller L3 size (6MB compared to 8MB). The image size 2400x2400 is the most extreme case where PC1 is 1.8× faster than PC2; in this case, the input image fits into the PC1's L3 cache, but it cannot remain into the PC2's L3 cache. The drop is much higher for the 3x3 case compared to the other kernel sizes because the arithmetic intensity (the ratio of arithmetical instructions and bytes loaded/stored) of the 3x3 case is lower than the others, which means that the 3x3 case is more memory-bound; for large input sizes, the overall performance is bounded by the DDR bandwidth. The proposed method achieves high speedup values in all cases.
Both the proposed method and IPP, cannot achieve their best performance for very small images, as the scalability is lower. The number of threads used is four at all cases. Note that performance is degraded by using either more or fewer threads, even in the smallest studied image sizes. Fig. 8 shows the evaluation of the proposed method for the kernel size of 5x5. We have not included a figure showing the CPS because of the limited page size. As it was expected, all methods are slower in the 5x5 case (compared to the 3x3 case), as the overall number of instructions is higher. IPP achieves from 1.34 to 1.59 and from 1.79 to 2.76, Giga CPS (GCPS), on PC1 and PC2, respectively. The proposed method achieves from 5.61 to 7.74 and from 5.62 to 7.97 GCPS, on PC1 and PC2, respectively. The proposed method achieves high speedup values in all cases.
In the 5x5 case (Fig. 8), some of the proposed implementations include a separable kernel (Algorithm 7) and some not (Algorithm 1). However, both methods provide roughly the same performance here. It is important to note that the separable case (Algorithm 7) executes more division operations and therefore its performance is degraded more in the 'B/C' division cases.
Regarding the 7x7 case (Fig. 9), our method achieves more CPS than the 5x5 case. It achieves from 6.0 to 9.1 and from 6.0 to 9.2 GCPS, on PC1 and PC2, respectively. Although the 5x5 case theoretically requires fewer arithmetical instructions than the 7x7 case, in practice it executes more, as the process of the last horizontal addition (line 26 in Algorithm 2) becomes complicated. In the 3x3 and 7x7 cases, sets of 4 and 8 scalar coefficients (including the zeros) are used to fill the vector coefficients (Fig.3), and thus each set of scalar coefficients is located either on the lower or on the upper half of the AVX registers. However, in the 5x5 and 9x9 case a set always uses both AVX parts, and this makes the horizontal-add operation more complicated, requiring extra vector instructions. This is because in AVX, the existing 256bit shift instructions are implemented as two 128-bit shift instructions and therefore the 16th vector position is always filled with zero; therefore additional instructions are needed to fill that empty position with the appropriate value. This is the reason that the 5x5 and 9x9 cases achieve lower speedup values over IPP, compared to the 3x3 and 7x7 cases.
For the 9x9 case, the proposed method achieves from 2.72 to 3.25 and from 2.72 to 3.38 GCPS, on PC1 and PC2, respectively. Fig. 11-12 show the evaluation of the proposed method for the non-symmetrical convolution kernel (Filter2D). The kernels are not separable in this case and therefore 32-bit IRs are needed for the 7x7 and 9x9 case. We have not included a figure showing the CPS because of the limited page size; this is why we also show the speedup values for the 5x5,7x7 and 9x9 cases in a single figure.

Filter2D
We were surprised when we found out that Filter2D routine is several times slower than the GaussianBlur routine, e.g., about x10 and x15 times slower for the 3x3 and 5x5 case, respectively. The main reasons follow. Firstly, Filter2D routine is single-threaded; we are not aware of the reason that Filter2D routine is not parallelized. Secondly, the Gaus-sianBlur routine uses separable kernels.
The proposed method achieves speedup from 16× to 40× over Filter2D routine. The performance of the proposed method is lower for small M values and large images, for the reason explained in Subsection 4.1.1.
The proposed method achieves fewer CPS for the nonsymmetrical Filter2D kernel compared to the Gaussian Blur since a) all the scalar coefficients are different now and as a consequence the number of vector coefficients required by the proposed method is higher, b) Algorithm 7 is not applicable, c) 32-bit IRs are needed for the 7x7 and 9x9 case. Regarding the 3x3 case, the number of vector coefficients has been increased from 8 to 12 and this why performance is degraded by about 1.76× and 1.4×, on PC1 and PC2, respectively. Note that the arithmetic intensity is even lower for non-symmetrical kernels. In the 5x5 case, the number of coefficients has been increased from 18 to 30; by using more coefficients, a larger number of register spills occurs, and for this reason the proposed method achieves better performance in the Gaussian Blur case (about 1.46× on PC1 and 1.25× on PC2). Regarding the 7x7 and 9x9 cases, Algorithm 3 is used instead of Algorithm 7; our method performs about 3× times faster on the GaussianBlur case because of the (b)-(c) reasons above.

Performance Breakdown
In this Subsection, we provide a performance analysis of different optimizations of the proposed method on PC1. First, a performance evaluation of different IRs bit-width values is performed (Fig. 13); the first bar always refers to the 'A' case, the second to the 'B', while the third to the 'C' case. As it was expected, the implementation using 32bit IRs is always slower than the one using 16-bit, as extra arithmetical instructions are being executed. Performance degradation is lower when the divisor is a power of 2 (case A), as in this case both 16-bit and 32-bit division is realized via a single shift instruction. However, when the actual division is computed (case B/C), an extra overhead occurs for the 32-bit case. This is because the 32-bit division (Algorithm 6) uses more arithmetical instructions compared to the 16-bit division (shown in Algorithm 5). For large input images and 3x3 kernel sizes, the performance drop is insignificant, as the code becomes memory-bound and the execution time highly depends on the time needed to load/store the data.  Second, a performance evaluation of the optimizations shown in Subsection 3.4, Subsection 3.2, and Subsection 3.1 is made, for kernels of size 3x3 (Fig. 14).
To evaluate the performance gain of Subsection 3.4, we have replaced the optimized division operation with a routine that uses the ' mm256 div ps()' instruction. We did that because this is the only division instruction supported in x86/x64. ' mm256 div ps()' divides FP values and thus, the 16-bit integer intermediate results are converted to FP to perform the division and then they are converted to integer values again. The division operation described in Subsection 3.4 gives a speedup of about x1.6. For large input images the code becomes memory-bound and this is why there is no performance gain; however, this is not true for larger kernel sizes where performance gain occurs even for large images.
Next, we have evaluated the performance of the last step of Algorithm 1 (lines 31-57) (see 'L/S not optimized' line in Fig. 14). When the vertical and horizontal additions are completed (line 28), the results (m0), need to be first divided and then stored to memory into positions (col, col+4, col+8, col+12, col+16, col+20, col+24, col+28); the latter includes multiple store instructions as the elements are not stored into consecutive memory locations (this is further explained in Subsection 3.2). To optimize this process, in Algorithm 1, more IRs are calculated and then packed in an elegant and efficient way (lines 31-46); this allows for a single store operation as well as to less division operations.
To evaluate the impact of lines 31-57 in Algorithm 1, instead of executing the code in lines 31-57, we divide m0 just after line 28 and then its useful data (e.g., data in positions (col, col+4, col+8, col+12, col+16, col+20, col+24, col+28)) are extracted and stored into memory. This results into multiple extract/store operations which degrade performance by a factor of about x2.5. Note that dividing m0 just after line 28, also results into extra division operations, but this does not really affect the results here as we have chosen Case A only, where division is realized by using a single shift instruction.

Evaluation over Intel MKL library
The proposed method is evaluated over Intel MKL library on PC2, by using two different convolution methods, an Intel MKL MMM based method and an Intel MKL direct method (Fig. 15, Fig. 16). For a fair comparison, nonsymmetrical kernels are used here. Note that in the case   where symmetrical kernels are used, the proposed method is significantly faster, especially for the 7x7 and 9x9 cases. Firstly, the proposed method is evaluated over the MMM-based convolution method. This process consists of two steps. First, the input array (of size N × M ) is copied into another bigger 2d array of size ((N ×M )×(k×k)), a.k.a. im2col operation (k × k is the size of the kernel). Second, the 'cblas gemm s8u8s32()' Intel MKL routine [10] multiplies the new array (2d array of size (N × M ) × (k × k)) by the kernel (1d array of size (k × k)). 'cblas gemm s8u8s32()' routine takes as input 8-bit integer values and returns 32-bit integer values; this is a highly optimized routine which includes AVX intrinsics and multiple threads. The first routine (im2col) has been manually developed and is not optimized. Although, the convolution operation cannot be realized without the 'im2col' step, we show the performance of the MMM-based convolution with and without 'im2col' step in order to show that the proposed method achieves high speedup values even without running the 'im2col' step. The MMM-based method does not perform division and therefore just one speedup line is shown for each case.
The proposed method provides high speedup values over the MMM-based convolution method in all cases (Fig. 15); it achieves speedup values from 105× to 400×, from 261× to 372×, from 298× to 375× and from 160× to 181×, for kernels of sizes 3x3, 5x5, 7x7 and 9x9, respectively. The proposed method achieves a lower speedup value for the 9x9 case, because a) the process of the horizontal additions (Algorithm 2) requires more arithmetical instructions in this case, b) fewer pixels are calculated per iteration. Note that the performance bottleneck of the MMM-based convolution is the 'im2col' routine, which is not optimized. Although, the performance of the 'im2col' routine can be improved (e.g., by parallelizing it), the proposed method provides superior performance even when this routine is not used at all (Fig. 15). The proposed method achieves speedup values from 9.7× to 35×, from 10.0× to 14.8×, from 6.2× to 14.7× and from 2.1× to 6.3×, over 'cblas gemm s8u8s32()' routine when the 'im2col' routine is not used, for kernels of sizes 3x3, 5x5, 7x7 and 9x9, respectively (Fig. 15).
The proposed method achieves high performance gains for three main reasons. First, it uses the direct method and thus there are no computational/memory overheads ( 'im2col'). Second, 'cblas gemm s8u8s32()' routine cannot achieve its peak performance for tall and skinny matrices [7]. Third and most importantly, the proposed method achieves far fewer L/S and arithmetical instructions than 'cblas gemm s8u8s32()' routine (and less memory); this is because a) 'cblas gemm s8u8s32()' routine operates on an array which is k × k times larger than the input image, b) the proposed method better utilizes the AVX instructions and computes multiple output pixels in each iteration.
The second Intel MKL routine that we evaluated our work is 'vslsConvExec()' [11] (Fig. 16). Intel MKL provides a set of routines to perform linear convolution for single and double precision real and complex data. Given that there is no convolution routine for integer numbers, we have used the 'vslsConvExec()' routine, which uses numbers of type 'float' (2d convolution is applied). 'vslsConvExec()' is a highly optimized routine which supports AVX intrinsics and multiple threads. The experimental results show that the 'vslsConvExec()' direct convolution method achieves high performance gains over the MMM-based, in all cases ( Fig. 15/Fig. 16). It is important to note that the execution time of the 'vslsConvExec()' routine is not affected by the kernel size but it is stronly affected by the input size; it achieves from 23 MCPS to 131 MCPS. The proposed method achieves speedup values from 98× to 618×, from 50× to 291×, from 23× to 146× and from 8.5× to 47×, for kernels of sizes 3x3, 5x5, 7x7 and 9x9, respectively.
The proposed method achieves high performance gains for the following reasons: a) it uses int8 instead of FP input data; int8 can give a performance gain up to 4×, compared to FP, b) it achieves fewer L/S and arithmetical instructions as it better utilizes the AVX instructions and computes multiple output pixels in each iteration.

Energy Consumption
The proposed method is also evaluated over Intel IPP and Intel MKL in terms of power and energy consump-tion (Fig. 17), on PC2 (a power meter is used). The average power/energy consumption and execution time reduction values are calculated by using the 20 studied image sizes and the following formula (value1-value2)/value2 x 100%. The UnOpt.(O3), Filter2D and im2col implementations consume on average 32, 35 and 40.5 Watts, respectively. gemm s8u8s32 and vslsConvExec consume on average 56.6 and 57.2 Watts, respectively. The latter two implementations run on all the four cores and this is why power consumption is higher. The Gaussian.Blur routine (symmetrical case), which is the second fastest routine after the proposed, consumes more power (on average 73.8 Watts) as it better utilizes the processing capacity of the processor. The proposed method consumes on average 74.8 and 70 Watts, for the non-symmetrical and symmetrical case, respectively. For symmetrical kernels, the proposed method consumes less power than the non-symmetrical, as there are fewer load instructions and fewer memory accesses in memory hierarchy. The proposed method achieves high energy gains in all cases.

Different kernel sizes and kernel values:
The proposed vectorization method is more efficient for even kernel sizes, as no zeros are inserted in this case and thus more pixels are calculated by using the same number of instructions.
In the general case, the kernel values cannot fit into 8bit variables for kernels of size 11 × 11 or larger, and in this case 16-bit computations are needed. This is not true for the separable kernels (e.g., Gaussian Blur) as their kernel values are smaller and thus 8-bit width is adequate for large kernel sizes too. Extending Algorithm 1 and Algorithm 2 by using 16-bit computations is straightforward. Although we have evaluated the proposed method by using 8-bit kernel values only, we expect the 16-bit case to be up to two times slower, and therefore significant performance gains are expected in the 16-bit case too. Note that the Intel IPP and the MMM-based methods are expected to run slower in this case too, e.g., 'cblas gemm s16s16s32()' is needed instead of 'cblas gemm s8u8s32()'.
The kernel size and values also affect the bit-width of the intermediate results; 16-bit width is adequate for the 3 × 3 and 5 × 5 case, while for larger kernel sizes 32-bit width is normally needed (see Subsection 3.1). For separable kernels, 16-bit width is adequate for larger kernel sizes too.
The proposed method provides impressive speedup values in both cases.
Furthermore, for kernel sizes larger or equal to 17x17/9x9 when 8-bit/16-bit inputs are used, the vector coefficients in Fig. 4 can hold just one set of scalar coefficients. In this case just one output pixel is computed in each iteration and thus Algorithm 1 and 2 are simplified.

CNN Convolution Layers:
The proposed method can be extended to compute the CNN convolution layer. However, CNNs introduce extra parameters that need to be considered in the optimization process. First, CNNs include different stride and dilation values. The application of Algorithm 1/Algorithm 2 is straightforward when stride=1 and dilation=1, but for different stride and dilation values, the proposed algorithms need to be appropriately extended. Furthermore, the number of instructions required to calculate an output value is way higher in CNNs, and this affects the vectorization process. This also affects the bitwidth selection process in Subsection 3.1; we expect that 16-bit IRs would not be adequate in most cases, unless the quantization level is high. The CNN layer introduces three additional loops, where vectorization and register blocking can be applied; in CNNS, the depth or filter loop is normally vectorized [1] [15] [7]. Another difference relates to the size of the kernel. In CNNs, there are many kernels and not one, and each kernel requires more memory. Given that multiple kernels exist, the following question arises, should we traverse each kernel on the input array separately, or traverse multiple kernels together? Different Processors: The proposed algorithms can be easily extended to different vector lengths (e.g., AVX-512) and vectorization technologies (e.g., Arm NEON, Arm SVE); however, different manually vectorized routines are needed for each case. The proposed method is also expected to scale well to processors with many CPU cores, as the studied algorithm is data parallel.

CONCLUSIONS AND FUTURE WORK
In this paper, a fast method for computing the direct 2D convolution operation on x86/x64 processors is delivered. We provide important insight on how to design and implement the 2D convolution for different kernel sizes, kernel values, vectorization technologies, number of physical CPU cores, image bit-depths and image sizes. The optimized routines, which are provided as open-source, include efficient vectorization using SIMD intrinsics, bit-twiddling optimizations, the optimization of the division operation, multi-threading using OpenMP and others.
The proposed method has been evaluated by using twenty different size images, both 16-bit and 32-bit intermediate results, four different kernel sizes (3x3,5x5,7x7,9x9) and two different processors, and it provides high speedup values over Intel IPP and Intel MKL in all cases. To the best of our knowledge, the proposed method is superior as it achieves far fewer arithmetical and L/S instructions.
As far as our future work is concerned, the first step includes the evaluation of the proposed method to larger kernel sizes. In the longer term, we are planning to apply and evaluate the proposed method to the convolution layer of CNNs. Although the proposed method can be extended to multiple channels and different dilation/stride values (with a reasonable effort), we expect that several changes might be required in order to achieve peak performance.

Vasilios Kelefouras
Dr. Vasilios Kelefouras is an Assistant Professor in computer science at University of Plymouth. His main research interests are in the areas of code optimization, High Performance Computing and optimizing compilers. He has strong R&D experience in optimizing software applications, in terms of execution time, energy consumption and memory size, in a wide range of different hardware platforms. He has published more than 35 research papers.

Georgios Keramidas
Dr. Georgios Keramidas is an Assistant Professor at the School of Informatics of the Aristotle University of Thessaloniki, Greece. His main research interests are in the areas of low-power processor/memory design,multicore systems, VLIW/multi-threaded architectures, graphic processors, power modelling methodologies, FPGA prototyping, and compiler optimizations techniques. He has published more than 70 papers, two books, and he also holds 11 US patents.