Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, which hinders timely detection and recovery. In this work, we propose an approach to run-time failure detection tailored for monitoring multi-tenant and concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, which requires no manual changes to the system’s internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach in detecting failures in the OpenStack cloud computing platform, a complex and “off-the-shelf” distributed system, by executing a campaign of fault injection experiments in a multi-tenant scenario. Our experiments show that the approach detects failures with an F1 score (0.85) and accuracy (0.77) higher than those provided by the OpenStack failure logging mechanisms (0.53 and 0.50) and by two non-session-aware run-time verification approaches (both lower than 0.15). Moreover, the approach significantly decreases the average time to detect failures at run-time (∼114 seconds) compared to the OpenStack logging mechanisms.

Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform / Cotroneo, Domenico; De Simone, Luigi; Liguori, Pietro; Natella, Roberto. - In: The Journal of Systems and Software. - ISSN 0164-1212. - (2023). [10.1016/j.jss.2023.111611]

Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform

Domenico Cotroneo (co-first author);
Luigi De Simone (co-first author);
Pietro Liguori (co-first author);
Roberto Natella (co-first author)
2023

Files in this item:
File: 1-s2.0-S0164121223000067-main.pdf
Access: authorized users only
Type: Publisher's version (PDF)
License: Publisher's copyright
Size: 990.36 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/905937