Please use this identifier to cite or link to this item: http://hdl.handle.net/10773/35321
Title: Outlier detection: a procedure to capture atypical groups of observations
Author: Tavares, Ana Helena
Afreixo, Vera
Brito, Paula
Keywords: Outlyingness
Clustering
Distributional data
Functional data
Issue Date: 2022
Publisher: CLAD, FEUP
Abstract: In this work, we introduce the concept of an atypical group of observations and propose a procedure for its identification. By atypical group, we mean a cluster of observations whose ‘mean’ pattern stands out from the majority of the ‘mean’ patterns of the remaining clusters. Two challenges arise in atypical group detection: first, identifying a meaningful segmentation of the data; second, flagging the atypical segments. Our work focuses on data whose elements are discrete distributions. If heterogeneous datasets, where distinct patterns coexist, can validly be clustered, then the class prototypes provide a simplified description of the data. Thus, the key idea of our proposal is to combine a clustering method with a functional outlyingness criterion to capture atypical class prototypes. To identify a segmentation of the distributional data, we iteratively combine two steps. The first creates a hierarchy of clusters, while the second flags atypical curves within each cluster, based on a measure of functional outlyingness that accounts for the shape of the distributions [1]. Segments with atypical curves are forwarded for (sub)clustering, and the procedure is repeated until no outlying curves are identified in any cluster. Once the final partition is obtained, each cluster is represented by a class prototype, whose outlyingness is evaluated according to the same functional approach. Clusters with an atypical class prototype are flagged as atypical. We apply our procedure to investigate clusters of genomic words in human DNA by studying their inter-word lag distributions. These experiments demonstrate the potential of the new method for identifying clusters of words with outlying patterns.
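The iterative procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Two simplified stand-ins are assumed: the outlyingness measure of [1] is replaced by an L1 distance to the pointwise median with a median-plus-MAD threshold, and the hierarchical clustering step is replaced by a two-way split seeded by the farthest pair of curves. All function names and the parameters `k` and `min_size` are hypothetical.

```python
from statistics import median

def l1(p, q):
    """L1 distance between two discrete distributions of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))

def mean(vals):
    return sum(vals) / len(vals)

def pointwise(curves, stat):
    """Apply stat (e.g. median or mean) coordinate by coordinate."""
    return [stat(vals) for vals in zip(*curves)]

def outlying(curves, k=3.0, eps=1e-12):
    """Indices of curves far from the pointwise median.
    Assumed stand-in criterion, not the shape-based measure cited as [1]."""
    center = pointwise(curves, median)
    d = [l1(c, center) for c in curves]
    med = median(d)
    mad = median(abs(x - med) for x in d)
    return [i for i, x in enumerate(d) if x > med + k * 1.4826 * mad + eps]

def split(ids, data):
    """Two-way split seeded by the farthest pair of curves.
    Assumed stand-in for the hierarchical clustering step."""
    a, b = max(((i, j) for i in ids for j in ids if i < j),
               key=lambda ij: l1(data[ij[0]], data[ij[1]]))
    left = [i for i in ids if l1(data[i], data[a]) <= l1(data[i], data[b])]
    right = [i for i in ids if i not in left]
    return left, right

def atypical_groups(data, k=3.0, min_size=3):
    """Return a list of (cluster member indices, is_atypical) pairs."""
    pending, final = [list(range(len(data)))], []
    while pending:
        ids = pending.pop()
        # Flag outlying curves inside the cluster (skipped for tiny clusters).
        flagged = outlying([data[i] for i in ids], k) if len(ids) >= min_size else []
        # Clusters containing outlying curves are forwarded for sub-clustering.
        parts = split(ids, data) if flagged else (ids, [])
        if parts[1]:
            pending.extend(parts)
        else:
            final.append(ids)
    # Each final cluster is represented by its pointwise mean (class prototype),
    # and prototypes are screened with the same outlyingness criterion.
    prototypes = [pointwise([data[i] for i in ids], mean) for ids in final]
    atypical = set(outlying(prototypes, k))
    return [(ids, n in atypical) for n, ids in enumerate(final)]
```

Calling `atypical_groups(data)` on a list of equal-length probability vectors returns the final partition together with a flag per cluster. Because this stand-in criterion compares each prototype with the majority, it needs at least three clusters before it can flag any of them.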
Peer review: yes
URI: http://hdl.handle.net/10773/35321
Appears in Collections:CIDMA - Comunicações
ESTGA - Comunicações
PSG - Comunicações

Files in This Item:
File: IFCS2022_AnaTavares.pdf
Size: 101.17 kB
Format: Adobe PDF
Access: restricted



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.