In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store that preserves the cluster state, and iii) compare the results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the stored data can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.

Mutiny! How Does Kubernetes Fail, and What Can We Do about It? / Barletta, M.; Cinque, M.; Di Martino, C.; Kalbarczyk, Z. T.; Iyer, R. K. - (2024), pp. 1-14. (54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024, Australia, 2024) [10.1109/DSN58291.2024.00016].

Mutiny! How Does Kubernetes Fail, and What Can We Do about It?

Barletta M.; Cinque M.
2024

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/990426