Calculating the Singular Values of Many Small Matrices on GPUs / Capozzoli, A.; Curcio, C.; Di Donna, S.; Liseno, A. - In: ELECTRONICS. - ISSN 2079-9292. - 14:16(2025). [10.3390/electronics14163217]
Calculating the Singular Values of Many Small Matrices on GPUs
Capozzoli A.; Curcio C.; Di Donna S.; Liseno A.
2025
Abstract
This paper presents a fast and robust approach for computing the singular values (SVs) of small (e.g., (Formula presented.), (Formula presented.)) matrices on single- and multi-GPU (Graphics Processing Unit) systems, with a tunable accuracy–speed trade-off. The method targets applications that need only the SVs, such as electromagnetics (e.g., Multiple Input Multiple Output (MIMO) link capacity optimization) and emerging deep-learning kernels. Unlike existing GPU singular value decomposition (SVD) routines, it computes the singular values only, which reduces overhead compared to full-SVD libraries such as cuSOLVER’s gesvd and MKL’s dgesvd. The method consists of four steps: interlaced storage of the matrices in GPU global memory, bidiagonalization via Householder transformations, symmetric tridiagonalization, and root finding by bisection using Sturm sequences. We implemented the algorithm in CUDA and evaluated it on several single- and multi-GPU systems. The approach is particularly suited to the analysis and design of MIMO communication links, where thousands of tiny SVDs must be computed rapidly. For example, for large batches of (Formula presented.) matrices, the speed-up over cuSOLVER’s gesvd is about 20×. Near-linear scaling across multi-GPU systems is also achieved, while root mean square errors remain below (Formula presented.) in single precision and below (Formula presented.) in double precision. Tightening the tolerance from (Formula presented.) to (Formula presented.) increased the total runtime by only about 10%.
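The last two steps named in the abstract (symmetric tridiagonalization and bisection with Sturm sequences) can be illustrated with a minimal CPU sketch for a single matrix. It is not the authors' CUDA implementation. It assumes the matrix has already been reduced to an upper-bidiagonal matrix B with diagonal `d` and superdiagonal `e`, and that the symmetric tridiagonal matrix is formed as T = BᵀB; the abstract does not state which tridiagonal form the paper uses. The function names, the `PIVMIN` guard, and the Gershgorin starting interval are illustrative choices, not taken from the paper.

```python
import math

PIVMIN = 1e-150  # assumed guard value that keeps the Sturm recurrence away from division by zero


def tridiagonal_from_bidiagonal(d, e):
    """Return diagonal a and off-diagonal b of T = B^T B for upper-bidiagonal B."""
    n = len(d)
    a = [d[0] ** 2] + [d[i] ** 2 + e[i - 1] ** 2 for i in range(1, n)]
    b = [d[i] * e[i] for i in range(n - 1)]
    return a, b


def count_below(a, b, x):
    """Sturm count: number of eigenvalues of the tridiagonal (a, b) smaller than x."""
    count = 0
    q = 1.0
    for i in range(len(a)):
        off2 = b[i - 1] ** 2 if i > 0 else 0.0
        q = a[i] - x - off2 / q
        if abs(q) < PIVMIN:
            q = -PIVMIN
        if q < 0.0:
            count += 1
    return count


def gershgorin_bounds(a, b):
    """Interval guaranteed to contain every eigenvalue of the tridiagonal (a, b)."""
    n = len(a)
    lo, hi = float("inf"), float("-inf")
    for i in range(n):
        r = (abs(b[i - 1]) if i > 0 else 0.0) + (abs(b[i]) if i < n - 1 else 0.0)
        lo = min(lo, a[i] - r)
        hi = max(hi, a[i] + r)
    return lo, hi


def kth_eigenvalue(a, b, k, tol, max_iter=200):
    """Bisection for the k-th smallest eigenvalue (k is 0-based)."""
    lo, hi = gershgorin_bounds(a, b)
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if count_below(a, b, mid) > k:
            hi = mid  # at least k + 1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def singular_values(d, e, tol=1e-12):
    """Singular values of B, largest first; tol applies to the eigenvalues of T."""
    a, b = tridiagonal_from_bidiagonal(d, e)
    eig = [kth_eigenvalue(a, b, k, tol) for k in range(len(a))]
    return [math.sqrt(max(v, 0.0)) for v in reversed(eig)]


# B = [[1, 1], [0, 1]] has singular values (sqrt(5) + 1) / 2 and (sqrt(5) - 1) / 2.
print(singular_values([1.0, 1.0], [1.0]))
```

In this sketch each singular value is found independently and the number of bisection steps grows only logarithmically as `tol` shrinks. That is one plausible reading of why a tolerance can trade accuracy for speed at modest cost. A batched GPU version would additionally need the interlaced memory layout and the Householder bidiagonalization, which are not shown here.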


