The development of reliable and efficient scientific software in distributed computing environments requires the identification and analysis of issues related to the design and deployment of algorithms for high-performance computing architectures and their integration in distributed contexts. In these environments, resource efficiency and availability can change unexpectedly because of overloading or failure of both the computing nodes and the interconnection network. This scenario requires the design of mechanisms that enable the software to "survive" such unexpected events while ensuring, at the same time, an effective use of the computing resources. Although many researchers have been working on these problems for years, fault tolerance remains an open issue for some classes of applications. Here we focus on the design and deployment of a checkpointing/migration system to enable fault tolerance in parallel applications running in distributed environments. In particular, we describe HADAB, a new hybrid checkpointing strategy, and its deployment in a meaningful case study: the PETSc Conjugate Gradient algorithm implementation. The related testing phase was performed on the University of Naples distributed infrastructure (S.Co.P.E. infrastructure).

HADAB: enabling fault tolerance in parallel applications in distributed environments / Boccia, V.; Carracciuolo, L.; Laccetti, Giuliano; Lapegna, Marco; Mele, Valeria. - 7203 (2012), pp. 700-709. (Paper presented at the International Conference on Parallel Processing and Applied Mathematics 2011, held in Torun (Poland), 11-14 September 2011) [10.1007/978-3-642-31464-3_71].

HADAB: enabling fault tolerance in parallel applications in distributed environments

V. Boccia; L. Carracciuolo; Giuliano Laccetti; Marco Lapegna; Valeria Mele
2012

ISBN: 9783642314636


Use this identifier to cite or link to this document: https://hdl.handle.net/11588/400458