Mutiny! How Does Kubernetes Fail, and What Can We Do about It?

Barletta, M.; Cinque, M.; Di Martino, C.; Kalbarczyk, Z. T.; Iyer, R. K.

doi:10.1109/DSN58291.2024.00016

In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store that preserves the cluster state, and iii) compare the results of our fault/error injection experiments with real-world failures, showing that our injections can recreate many real-world failure patterns. The paper addresses the lack, to date, of systematic analyses of Kubernetes failures. Our results show that even a single fault/error (e.g., a bit-flip) in the stored data can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
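
As an illustration of the kind of single-bit error the abstract mentions, the following minimal sketch flips one bit in a serialized record and shows how the decoded value changes. It is not the authors' injection framework: the `flip_bit` helper, the JSON encoding, and the `replicas` field are assumptions chosen only to keep the example self-contained.

```python
import json


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted (hypothetical helper)."""
    byte_index, offset = divmod(bit_index, 8)
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << offset
    return bytes(corrupted)


# Hypothetical stored record; JSON is an assumption made for readability.
record = json.dumps({"kind": "Deployment", "replicas": 3}).encode()

# Target the byte holding the ASCII digit "3" (0x33) and flip its bit 2,
# which turns it into "7" (0x37).
target_byte = record.index(b"3")
corrupted = flip_bit(record, target_byte * 8 + 2)

original_replicas = json.loads(record)["replicas"]
corrupted_replicas = json.loads(corrupted)["replicas"]
print(original_replicas, "->", corrupted_replicas)
```

In this toy record the corrupted value still parses cleanly, so nothing downstream would flag it as invalid. A replica count that silently changes in this way corresponds to the under/overprovisioning category reported in the abstract.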

Mutiny! How Does Kubernetes Fail, and What Can We Do about It? / Barletta, M.; Cinque, M.; Di Martino, C.; Kalbarczyk, Z. T.; Iyer, R. K. - (2024), pp. 1-14. (54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2024, Australia, 2024) [10.1109/DSN58291.2024.00016].