Design and Implementation of 2D Convolution on x86/x64 Processors
dc.contributor.author | Kelefouras, Vasileios | |
dc.contributor.author | Keramidas, G | |
dc.date.accessioned | 2022-05-01T14:19:16Z | |
dc.date.available | 2022-05-01T14:19:16Z | |
dc.date.issued | 2022-04-29 | |
dc.identifier.issn | 1045-9219 | |
dc.identifier.issn | 1558-2183 | |
dc.identifier.uri | http://hdl.handle.net/10026.1/19129 | |
dc.description.abstract |
In this paper, a new method for accelerating the 2D direct convolution operation on x86/x64 processors is presented. It includes efficient vectorization using SIMD intrinsics, bit-twiddling optimizations, optimization of the division operation, multi-threading using OpenMP, register blocking, and the use of the shortest possible bit-width for intermediate results. The proposed method, which is provided as open source, is general and can be applied to other processor families too, e.g., Arm. The proposed method has been evaluated on two different multi-core Intel CPUs, using twenty different image sizes, 8-bit integer computations and the most commonly used kernel sizes (3x3, 5x5, 7x7, 9x9). It achieves speedups of 2.8× to 40× over the Intel IPP library (OpenCV GaussianBlur and Filter2D routines), of 105× to 400× over the gemm-based convolution method (using the Intel MKL int8 matrix multiplication routine), and of 8.5× to 618× over the vslsConvExec Intel MKL direct convolution routine. The proposed method is superior because it executes far fewer arithmetic and load/store instructions. | |
dc.format.extent | 3800-3815 | |
dc.language.iso | en | |
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | |
dc.subject | Convolution | |
dc.subject | gaussian blur | |
dc.subject | code optimization | |
dc.subject | vectorization | |
dc.subject | AVX | |
dc.subject | OpenMP | |
dc.subject | OpenCV | |
dc.subject | Intel MKL | |
dc.subject | Intel IPP | |
dc.subject | high performance computing (HPC) | |
dc.subject | image processing | |
dc.title | Design and Implementation of 2D Convolution on x86/x64 Processors | |
dc.type | journal-article | |
dc.type | Journal Article | |
plymouth.author-url | https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000831139000004&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=11bb513d99f797142bcfeffcc58ea008 | |
plymouth.issue | 12 | |
plymouth.volume | 33 | |
plymouth.publication-status | Published | |
plymouth.journal | IEEE Transactions on Parallel and Distributed Systems | |
dc.identifier.doi | 10.1109/tpds.2022.3171471 | |
plymouth.organisational-group | /Plymouth | |
plymouth.organisational-group | /Plymouth/Faculty of Science and Engineering | |
plymouth.organisational-group | /Plymouth/Faculty of Science and Engineering/School of Engineering, Computing and Mathematics | |
plymouth.organisational-group | /Plymouth/REF 2021 Researchers by UoA | |
plymouth.organisational-group | /Plymouth/REF 2021 Researchers by UoA/UoA11 Computer Science and Informatics | |
plymouth.organisational-group | /Plymouth/Users by role | |
plymouth.organisational-group | /Plymouth/Users by role/Academics | |
dcterms.dateAccepted | 2022-04-27 | |
dc.rights.embargodate | 2022-5-5 | |
dc.identifier.eissn | 1558-2183 | |
dc.rights.embargoperiod | Not known | |
rioxxterms.versionofrecord | 10.1109/tpds.2022.3171471 | |
rioxxterms.licenseref.uri | http://www.rioxx.net/licenses/all-rights-reserved | |
rioxxterms.type | Journal Article/Review |
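
For readers unfamiliar with the operation named in the abstract, the following is a minimal scalar sketch of direct 2D convolution on 8-bit pixels. It is only a plain baseline, not the paper's optimized method: it uses none of the SIMD, OpenMP, register-blocking or division optimizations the abstract lists. The function name `conv2d_u8`, its parameters, and the zero-padded border handling are assumptions made for illustration.

```c
#include <stdint.h>

/* Scalar reference for direct 2D convolution on 8-bit pixels.
 * Hypothetical helper for illustration; NOT the paper's optimized code.
 * Assumptions: row-major images, odd square kernel of size k x k,
 * pixels outside the image treated as zero, divisor > 0.
 * The kernel is applied without flipping, which gives the same result
 * as true convolution for symmetric kernels such as a Gaussian. */
static void conv2d_u8(const uint8_t *in, uint8_t *out, int rows, int cols,
                      const uint8_t *kernel, int k, unsigned divisor)
{
    int r = k / 2; /* kernel radius */

    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            unsigned acc = 0; /* wide accumulator avoids 8-bit overflow */

            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    int yy = y + i - r;
                    int xx = x + j - r;
                    if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                        acc += (unsigned)in[yy * cols + xx] * kernel[i * k + j];
                }
            }

            /* Normalize by the kernel weight sum (e.g. 16 for a 3x3 Gaussian).
             * The caller must choose divisor so the result fits in 8 bits. */
            out[y * cols + x] = (uint8_t)(acc / divisor);
        }
    }
}
```

Calling it with the 3x3 kernel {1,2,1, 2,4,2, 1,2,1} and a divisor of 16 gives a simple Gaussian blur. The per-pixel integer division and the bounds checks in the inner loop are the kind of overhead that an optimized implementation would aim to remove.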