dc.contributor.author
Montezanti, Diego Miguel  
dc.contributor.author
Rucci, Enzo  
dc.contributor.author
de Giusti, Armando Eduardo  
dc.contributor.author
Naiouf, Ricardo Marcelo  
dc.contributor.author
Rexachs, Dolores  
dc.contributor.author
Luque, Emilio  
dc.date.available
2023-11-22T19:08:42Z  
dc.date.issued
2020-12  
dc.identifier.citation
Montezanti, Diego Miguel; Rucci, Enzo; de Giusti, Armando Eduardo; Naiouf, Ricardo Marcelo; Rexachs, Dolores; et al.; Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing; Elsevier Science; Future Generation Computer Systems; 113; 12-2020; 240-254  
dc.identifier.issn
0167-739X  
dc.identifier.uri
http://hdl.handle.net/11336/218511  
dc.description.abstract
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.  
dc.format
application/pdf  
dc.language.iso
eng  
dc.publisher
Elsevier Science  
dc.rights
info:eu-repo/semantics/restrictedAccess  
dc.rights.uri
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/  
dc.rights.uri
https://creativecommons.org/licenses/by/2.5/ar/  
dc.subject
AUTOMATIC RECOVERY  
dc.subject
SOFT ERROR DETECTION  
dc.subject
SYSTEM-LEVEL CHECKPOINT  
dc.subject
USER-LEVEL CHECKPOINT  
dc.subject.classification
Other Computer and Information Sciences  
dc.subject.classification
Computer and Information Sciences  
dc.subject.classification
NATURAL AND EXACT SCIENCES  
dc.title
Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing  
dc.type
info:eu-repo/semantics/article  
dc.type
info:ar-repo/semantics/artículo  
dc.type
info:eu-repo/semantics/publishedVersion  
dc.date.updated
2023-11-21T11:52:27Z  
dc.journal.volume
113  
dc.journal.pagination
240-254  
dc.journal.pais
Netherlands  
dc.journal.ciudad
Amsterdam  
dc.description.fil
Fil: Montezanti, Diego Miguel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina  
dc.description.fil
Fil: Rucci, Enzo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina  
dc.description.fil
Fil: de Giusti, Armando Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina  
dc.description.fil
Fil: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina  
dc.description.fil
Fil: Rexachs, Dolores. Universidad Autónoma de Barcelona. Dto Arquitectura Computadoras y Sist/operativos; Spain  
dc.description.fil
Fil: Luque, Emilio. Universidad Autónoma de Barcelona. Dto Arquitectura Computadoras y Sist/operativos; Spain  
dc.journal.title
Future Generation Computer Systems  
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/url/https://www.sciencedirect.com/science/article/pii/S0167739X19308404  
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/doi/http://dx.doi.org/10.1016/j.future.2020.07.003