Many critical services are now provided by large and complex software systems. However, this increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but some of its services are perceived as unresponsive. Online monitoring is the only way to detect and promptly react to such failures. However, for off-the-shelf-based systems, online detection can be difficult, since instrumentation and log data collection may not be feasible in practice. This paper proposes a detection framework to cope with software hangs. The framework enables non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. The collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that combining several monitors at the OS level is effective at detecting hang failures in terms of coverage and false positives, with a negligible impact on performance.

OS-Level Hang Detection in Complex Software Systems / Bovenzi, Antonio; Cinque, Marcello; Cotroneo, Domenico; Natella, Roberto; Carrozza, G. - In: INTERNATIONAL JOURNAL OF CRITICAL COMPUTER-BASED SYSTEMS. - ISSN 1757-8779. - 2:3/4(2011), pp. 352-377. [10.1504/IJCCBS.2011.042333]

OS-Level Hang Detection in Complex Software Systems

Bovenzi, Antonio; Cinque, Marcello; Cotroneo, Domenico; Natella, Roberto
2011


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/411896