Legio: fault resiliency for embarrassingly parallel MPI applications

Roberto Rocco, Davide Gadioli, Gianluca Palermo
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
2021, English

Abstract

As HPC machines grow in size, faults become more frequent, and dealing with them is becoming mandatory. Natively, MPI cannot handle faults: it stops the execution prematurely when one occurs. With the introduction of ULFM, it is possible to continue the execution, but doing so requires complex integration with the application. In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. Legio exposes its features to the application transparently, removing any integration difficulty. After a fault, the execution continues with only the non-failed processes. We also propose a hierarchical alternative, which features lower repair costs on large communicators. We evaluated our solutions on the Marconi100 cluster at CINECA with benchmarks and real-world applications, showing that the overhead introduced by the library is negligible and that it does not limit the scalability properties of MPI.
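
The idea of continuing with only the non-failed processes can be illustrated with a toy model. The sketch below is plain Python, not MPI, and it is not Legio's or ULFM's API: the function `run_with_shrink`, its parameters, and the sample workload are all hypothetical names introduced for illustration. It assumes an embarrassingly parallel job in which each rank computes an independent piece of work, so losing a rank loses only that rank's contribution.

```python
from typing import Callable, Dict, List, Set


def run_with_shrink(
    ranks: List[int],
    failed: Set[int],
    work: Callable[[int], int],
) -> Dict[int, int]:
    """Toy model (hypothetical, not Legio's API): discard failed ranks
    and let each surviving rank compute its own independent share."""
    # Loosely analogous to shrinking a communicator to its survivors.
    survivors = [r for r in ranks if r not in failed]
    # Each survivor works independently; no survivor waits on a failed rank.
    return {r: work(r) for r in survivors}


# Four ranks, rank 2 has failed; each rank squares its own rank number.
results = run_with_shrink(list(range(4)), failed={2}, work=lambda r: r * r)

# A final reduction sees only the survivors' contributions.
total = sum(results.values())
```

In this model the result is partial by design: the failed rank's share is simply missing from the reduction, which is acceptable only when the application can tolerate losing independent pieces of work.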

