Please use this identifier to cite or link to this item: http://hdl.handle.net/10773/32545
Title: An empirical comparison of two approaches for CDPCA in high-dimensional data
Author: Freitas, Adelaide
Macedo, Eloísa
Vichi, Maurizio
Keywords: Principal component analysis
Clustering of objects
Partitioning of attributes
Semidefinite programming
Issue Date: Sep-2021
Publisher: Springer
Abstract: Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive due to their usefulness for interpretation purposes, in particular in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. In particular, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components and has, simultaneously, the particularity of identifying clusters of objects. Based on simulated and real gene expression data sets where the number of variables is higher than the number of objects, we empirically compare the performance of two different heuristic iterative procedures, namely the ALS and two-step-SDP algorithms proposed in the specialized literature to perform CDPCA. To avoid a possible effect of different variance values among the original variables, all the data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish their performance, in particular related to the two-step-SDP algorithm: it provides faster results than ALS and, since it employs a clustering procedure (k-means) on the variables, outperforms the ALS algorithm in recovering the true variable partitioning unveiled by the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, in which two-step-SDP outperforms the ALS approach for data sets with lower sample size and more structure complexity (i.e., error level in the CDPCA model). The proportion of explained variance by the components estimated by both algorithms is affected by the data structure complexity (the higher the error level, the lower the variance) and presents similar values for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher variance.
Moreover, experimental tests suggest that the two-step-SDP approach, in general, presents more ability to recover the true number of object clusters, while the ALS algorithm is better in terms of quality of object clustering with more homogeneous, compact and well separated clusters in the reduced space of the CDPCA components.
Peer review: yes
URI: http://hdl.handle.net/10773/32545
DOI: 10.1007/s10260-020-00546-2
ISSN: 1618-2510
Appears in Collections: TEMA - Artigos
CIDMA - Artigos
DMat - Artigos
PSG - Artigos

Files in This Item:
File: 2021FreitasEtAl_An empirical comparison of two approaches for CDPCA in high-dimensional data.pdf
Size: 1.6 MB
Format: Adobe PDF
Access: restricted



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.