Electronic Resource

Predictive reliability and fault management in exascale systems: State of the art and perspectives

Bibliographic Details
Title: Predictive reliability and fault management in exascale systems: State of the art and perspectives
Authors: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Barcelona Supercomputing Center; Universitat Politècnica de Catalunya. VIRTUOS - Virtualisation and Operating Systems; Canal Corretger, Ramon; Hernández Luz, Carles; Tornero Gavilá, Rafael; Cilardo, Alessandro; Massari, Giuseppe; Reghenzani, Federico; Fornaciari, William; Zapater Sancho, Marina; Atienza, David; Oleksiak, Ariel; Wojciech Piatek, Poznan; Abella Ferrer, Jaume
Publication Information: 2020-09
Document Type: Electronic Resource
Abstract: Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.
This work has received funding from the European Union’s Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.
Peer Reviewed
Postprint (author's final draft)
Index Terms: Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, High performance computing, Fault location (Engineering), Semiconductors, HPC, Supercomputing, Exascale, Reliability, Prediction, Survey, Faults, Failures, Superordinadors, Article
URL: http://hdl.handle.net/2117/330352
https://dl.acm.org/doi/abs/10.1145/3403956
info:eu-repo/grantAgreement/EC/H2020/801137/EU/REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems/RECIPE
info:eu-repo/grantAgreement/MINECO//RYC-2013-14717/ES/RYC-2013-14717
Availability: Open access content
Note: application/pdf
English
Other Numbers: HGF oai:upcommons.upc.edu:2117/330352
Canal, R. [et al.]. Predictive reliability and fault management in exascale systems: State of the art and perspectives. "ACM computing surveys", September 2020, vol. 53, no. 5, p. 95:1-95:32.
0360-0300
10.1145/3403956
1224047827
Contributing Source: UNIV POLITECNICA DE CATALUNYA
From OAIster®, provided by the OCLC Cooperative.
Accession Number: edsoai.on1224047827
Database: OAIster