Don’t push the button! Exploring data leakage risks in machine learning and transfer learning

Apicella, Andrea; Isgrò, Francesco; Prevete, Roberto
2025

Abstract

Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, there is growing evidence in the literature that ML approaches are not always used appropriately, leading to incorrect and sometimes overly optimistic results. One reason for this inappropriate use may be the increasing availability of ML tools, which encourages what we call the “push the button” approach. While this approach is convenient, it raises concerns about the reliability of outcomes and can lead to problems such as incorrect performance evaluation. This paper addresses a critical issue in ML known as data leakage, where unintended information contaminates the training data and distorts the evaluation of model performance. Crucial steps in the ML pipeline can be inadvertently overlooked, leading to optimistic performance estimates that may not hold in real-world scenarios; this discrepancy between evaluated and actual performance on new data is a significant concern. Specifically, the paper categorizes data leakage in ML and discusses how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in the Transfer Learning framework, and compares the standard inductive ML paradigm with the transductive one. The conclusion summarizes the key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications, taking into account tasks and generalization goals.
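As an illustration of the kind of leakage the abstract describes, the sketch below contrasts a preprocessing step fitted on all available data with the same step fitted on the training split only. It is a minimal toy example and not taken from the paper: the data values and the helper names `fit_scaler` and `transform` are hypothetical, and standardization is just one of many steps through which test information can reach the training side.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate (mean, population std) from the given values only."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Standardize values with previously estimated statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: four training samples and one held-out test sample.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]

# Leaky: the statistics are computed on train + test, so the held-out
# sample influences the preprocessing used during training.
leaky_scaler = fit_scaler(train + test)

# Leakage-free: the statistics come from the training split only and
# are then reused unchanged on the test split.
clean_scaler = fit_scaler(train)

leaky_test = transform(test, leaky_scaler)
clean_test = transform(test, clean_scaler)

print("leaky scaler (mean, std):", leaky_scaler)
print("clean scaler (mean, std):", clean_scaler)
print("test sample, leaky scaling:", leaky_test)
print("test sample, clean scaling:", clean_test)
```

With the leaky scaler the test sample looks closer to the training distribution than it really is, which is one way an evaluation can end up more optimistic than performance on genuinely new data.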
Don’t push the button! Exploring data leakage risks in machine learning and transfer learning / Apicella, Andrea; Isgrò, Francesco; Prevete, Roberto. - In: ARTIFICIAL INTELLIGENCE REVIEW. - ISSN 0269-2821. - 58:11(2025). [10.1007/s10462-025-11326-3]
Files in this item:
File: s10462-025-11326-3.pdf
Access: open access
Type: Publisher's version (PDF)
License: Not specified
Size: 7.9 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/1014914
Citations
  • PMC: N/A
  • Scopus: 24
  • Web of Science (ISI): 14