Abstract

Deep neural networks (DNNs) have become indispensable in many real-life applications like natural language processing and autonomous systems. However, deploying DNNs on resource-constrained devices, e.g., RISC-V platforms, remains challenging due to the high computational and memory demands of fully connected (FC) layers, which dominate resource consumption. Low-rank factorization (LRF) offers an effective approach to compressing FC layers, but the vast design space of LRF solutions involves complex tradeoffs among FLOPs, memory size, inference time, and accuracy, making the LRF process complex and time-consuming. This article introduces an end-to-end LRF design space exploration methodology and a specialized design tool for optimizing FC layers on RISC-V processors. Using Tensor Train Decomposition (TTD) offered by the TensorFlow T3F library, the proposed work prunes the LRF design space by excluding, first, inefficient decomposition shapes and, second, solutions with poor inference performance on RISC-V architectures. Compiler optimizations are then applied to enhance custom T3F layer performance, minimizing inference time and boosting computational efficiency. On average, our TT-decomposed layers run 3× faster than IREE and 8× faster than Pluto on the same compressed model. This work provides an efficient solution for deploying DNNs on edge and embedded devices powered by RISC-V architectures.
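For context, the following is a minimal sketch (not the authors' code) of the kind of TT compression of an FC layer that the abstract describes, using the TensorFlow T3F library; the matrix size, decomposition shape, and TT-rank below are illustrative assumptions, not values taken from the article.

    import tensorflow as tf
    import t3f

    # Illustrative dense FC weight matrix: 784 inputs -> 625 outputs.
    W = tf.random.normal((784, 625))

    # Factor each dimension (784 = 7*4*7*4, 625 = 5*5*5*5) and convert the
    # dense weights into a TT-matrix, capping the TT-rank to bound the
    # number of parameters; this shape/rank choice is illustrative only.
    W_tt = t3f.to_tt_matrix(W, shape=((7, 4, 7, 4), (5, 5, 5, 5)),
                            max_tt_rank=8)

    # Inference replaces the dense matmul with a TT matmul over the cores,
    # which needs far fewer FLOPs and less memory than the full matrix.
    x = tf.random.normal((2, 784))   # a batch of two input vectors
    y = t3f.matmul(x, W_tt)          # result has shape (2, 625)

Choosing among decomposition shapes and TT-ranks such as these is precisely the design space that the article's exploration methodology prunes.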

Publication Date

2025-10-24

Publication Title

ACM Transactions on Embedded Computing Systems

Volume

24

Issue

6

ISSN

1539-9087

Acceptance Date

2025-08-31

Deposit Date

2025-12-09

First Page

1

Last Page

34
