Conference
Extending sparse patterns to improve inverse preconditioning on GPU architectures
Title: Extending sparse patterns to improve inverse preconditioning on GPU architectures
Authors: Laut Turón, Sergi; Borrell Pol, Ricard; Casas, Marc
Contributors: Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors; Barcelona Supercomputing Center
Publisher: Association for Computing Machinery (ACM)
Publication year: 2024
Collection: Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Subject terms: Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors; Sparse linear algebra; GPUs; Memory coalescing; Spatial locality
Description: Graphics Processing Units (GPUs) have become a key component of high-end computing infrastructures due to their massively parallel architecture, which delivers high floating-point operation rates per cycle. Many scientific workloads benefit from GPUs and, in particular, numerical methods solving linear systems of equations $Ax=b$ typically run on GPUs. Among them, the Conjugate Gradient (CG) method, which targets linear systems with Symmetric and Positive Definite (SPD) matrices, runs on GPUs in its preconditioned form. However, state-of-the-art preconditioning techniques like the Factorized Sparse Approximate Inverse (FSAI) preconditioner ignore the benefits of data coalescence and locality on GPU architectures and leave substantial performance on the table. These approaches are exclusively based on numerical criteria. This paper proposes the GPU-aware Factorized Sparse Approximate Inverse (GFSAI) preconditioner. GFSAI generates sparse patterns that enhance the numerical benefits of FSAI and improve data locality and coalescence on GPU architectures. We evaluate GFSAI on NVIDIA V100 and AMD MI50 GPUs with a set of 47 sparse matrices. GFSAI improves the state of the art by reducing the average CG iteration count by 27.48% and 31.25% on NVIDIA and AMD, respectively, which leads to average decreases in execution time of 23.83% and 26.07% on these two GPU architectures. ; Marc Casas has been partially supported by Grant RYC-2017-23269, which is funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future. This research was supported by grant PID2019-107255GB-C21 funded by MCIN/AEI/10.13039/501100011033. The authors thank the Departament de Recerca i Universitats de la Generalitat de Catalunya for supporting the Research Group "Performance understanding, analysis, and simulation/emulation of novel architectures" (Code: 2021 SGR 00865) ; Peer Reviewed ; Postprint (author's final draft)
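For context on the method the abstract refers to, the following is a minimal sketch of the preconditioned Conjugate Gradient iteration for an SPD system $Ax=b$. It uses a simple Jacobi (diagonal) preconditioner purely as an illustrative stand-in; the paper's FSAI and GFSAI preconditioners, and their GPU-aware sparse patterns, are not implemented here.

```python
# Preconditioned Conjugate Gradient (PCG) sketch in pure Python.
# Assumption: a Jacobi (diagonal) preconditioner stands in for the
# FSAI/GFSAI preconditioners described in the paper.

def mat_vec(A, x):
    """Dense matrix-vector product A @ x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for SPD A with a Jacobi-preconditioned CG iteration."""
    n = len(b)
    m_inv = [1.0 / A[i][i] for i in range(n)]  # M^{-1} = diag(A)^{-1}
    x = [0.0] * n
    r = b[:]                                   # r = b - A x, with x = 0
    z = [m_inv[i] * r[i] for i in range(n)]    # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = mat_vec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * ap_i for r_i, ap_i in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z_i + beta * p_i for z_i, p_i in zip(z, p)]
        rz = rz_new
    return x

# Small SPD system for demonstration.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
```

A better preconditioner reduces the iteration count of this loop, which is where the paper's reported 27-31% iteration reductions come from; GFSAI additionally shapes the preconditioner's sparsity pattern so that its application coalesces well in GPU memory.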
Document type: conference object
File description: 14 p.; application/pdf
Language: English
Relation: info:eu-repo/grantAgreement/AEI//RYC-2017-23269; info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107255GB-C21/ES/BSC - COMPUTACION DE ALTAS PRESTACIONES VIII/; http://hdl.handle.net/2117/419540
DOI: 10.1145/3625549.3658683
Availability: http://hdl.handle.net/2117/419540 https://doi.org/10.1145/3625549.3658683
Rights: Open Access
Accession number: edsbas.EB1CA023
Database: BASE