Standard multivariate techniques like Principal Component Analysis (PCA) are based on the eigendecomposition of a matrix and therefore require complete data sets. Recent comparative reviews of PCA algorithms for missing data showed the regularised iterative PCA algorithm (RPCA) to be effective. This paper presents two chunk-wise implementations of RPCA suitable for the imputation of “tall” data sets, that is, data sets with many observations. A “chunk” is a subset of the whole set of available observations. In particular, one implementation is suitable for distributed computation as it imputes each chunk independently. The other implementation, instead, is suitable for incremental computation, where the imputation of each new chunk is based on all the chunks analysed that far. The proposed procedures were compared to batch RPCA considering different data sets and missing data mechanisms. Experimental results showed that the distributed approach had similar performance to batch RPCA for data with entries missing completely at random. The incremental approach showed appreciable performance when the data is missing not completely at random, and the first analysed chunks contain sufficient information on the data structure.

Chunk-wise regularised PCA-based imputation of missing data / IODICE D'ENZA, Alfonso; Markos, Angelos; Palumbo, Francesco. - In: STATISTICAL METHODS & APPLICATIONS. - ISSN 1618-2510. - 31:2(2021), pp. 365-386. [10.1007/s10260-021-00575-5]

Chunk-wise regularised PCA-based imputation of missing data

Alfonso Iodice D'Enza
;
Francesco Palumbo
2021

Abstract

Standard multivariate techniques like Principal Component Analysis (PCA) are based on the eigendecomposition of a matrix and therefore require complete data sets. Recent comparative reviews of PCA algorithms for missing data showed the regularised iterative PCA algorithm (RPCA) to be effective. This paper presents two chunk-wise implementations of RPCA suitable for the imputation of “tall” data sets, that is, data sets with many observations. A “chunk” is a subset of the whole set of available observations. In particular, one implementation is suitable for distributed computation as it imputes each chunk independently. The other implementation, instead, is suitable for incremental computation, where the imputation of each new chunk is based on all the chunks analysed that far. The proposed procedures were compared to batch RPCA considering different data sets and missing data mechanisms. Experimental results showed that the distributed approach had similar performance to batch RPCA for data with entries missing completely at random. The incremental approach showed appreciable performance when the data is missing not completely at random, and the first analysed chunks contain sufficient information on the data structure.
2021
Chunk-wise regularised PCA-based imputation of missing data / IODICE D'ENZA, Alfonso; Markos, Angelos; Palumbo, Francesco. - In: STATISTICAL METHODS & APPLICATIONS. - ISSN 1618-2510. - 31:2(2021), pp. 365-386. [10.1007/s10260-021-00575-5]
File in questo prodotto:
File Dimensione Formato  
03_CW_RPCA.pdf

solo utenti autorizzati

Tipologia: Documento in Post-print
Licenza: Non specificato
Dimensione 2.62 MB
Formato Adobe PDF
2.62 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11588/874425
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? 3
social impact