Please use this identifier to cite or link to this item: http://hdl.handle.net/10773/30312
Full metadata record
DC Field | Value | Language
dc.contributor.author | Tavares, A. H. | pt_PT
dc.contributor.author | Afreixo, V. | pt_PT
dc.contributor.author | Brito, P. | pt_PT
dc.date.accessioned | 2021-01-13T17:50:07Z | -
dc.date.available | 2021-01-13T17:50:07Z | -
dc.date.issued | 2017 | -
dc.identifier.isbn | 978-972-99289-3-2 | -
dc.identifier.uri | http://hdl.handle.net/10773/30312 | -
dc.description.abstract | Functional data appear in several domains of science, for example in biomedical, meteorological or engineering studies. A functional observation can exhibit atypical behaviour over a short or a large part of the domain, and this may be due to magnitude or to shape features. Over the last ten years many outlier detection methods have been proposed. In this work we use the functional data framework to investigate the existence of DNA words with an outlying distance distribution, which may be related to biological motifs. A DNA word is a sequence defined over the genome alphabet {A, C, G, T}. Distances between successive occurrences of the same word define the inter-word distance distribution, which can be interpreted as a discrete function. Each word length k is associated with a functional dataset formed by 4^k distance distributions. As the word length increases, the diversity of patterns observed in the functional dataset grows, as does the number of distributions displaying strong frequency peaks. We propose a two-step procedure to detect words with an outlying pattern of distances: first, the functions are clustered according to their global trend; then, an outlier detection method is applied within each cluster. Each distribution trend is obtained by data smoothing, which smooths out the peaks of some distributions, and similarities between the smoothed data are explored through hierarchical complete-linkage clustering. The dissimilarity between functions is evaluated using the Euclidean distance or the Generalized Minimum distance [1], which takes into account the dependence between domain points. The resulting dendrograms are then cut, leading to a partition of the distance distributions. For the second step we use the Directional Outlyingness measure, which assigns a robust measure of outlyingness to each domain point and is the building block of a graphical tool for visualizing the centrality of the curves [2]. We focus on the human genome and words of length k ≤ 7. Results are compared with those obtained by applying only the second step of the procedure [3]. | pt_PT
dc.language.iso | eng | pt_PT
dc.publisher | IST Press | pt_PT
dc.relation | info:eu-repo/grantAgreement/FCT/PD/PD%2FBD%2F105729%2F2014/PT | pt_PT
dc.rights | openAccess | pt_PT
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | pt_PT
dc.subject | Distance distribution | pt_PT
dc.subject | DNA word | pt_PT
dc.subject | Directional outlyingness | pt_PT
dc.title | Clustering DNA words through distance distributions | pt_PT
dc.type | conferenceObject | pt_PT
dc.description.version | published | pt_PT
dc.peerreviewed | yes | pt_PT
ua.event.date | 12-14 July 2017 | pt_PT
degois.publication.firstPage | 31 | pt_PT
degois.publication.lastPage | 32 | pt_PT
degois.publication.title | Data Science, Statistics & Visualisation 2017: book of abstracts | pt_PT
dc.relation.publisherversion | https://dssv2017.github.io/book.pdf | pt_PT
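
The first step described in the abstract (building inter-word distance distributions, smoothing them, and grouping them by complete-linkage clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names are hypothetical, overlapping occurrences of a word are assumed to count, a centred moving average stands in for the unspecified smoothing method, only the Euclidean distance is used, and "cutting the dendrogram" is approximated by stopping the merging at a chosen number of clusters. The second step (Directional Outlyingness) is not covered.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def all_words(k):
    """Return the 4**k words of length k over the alphabet {A, C, G, T}."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def distance_distribution(seq, word, max_dist):
    """Relative frequencies of distances 1..max_dist between successive
    occurrences of `word` in `seq` (overlaps allowed: an assumption).
    Distances above max_dist are dropped, so the result may sum to < 1."""
    k = len(word)
    positions = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]
    gaps = Counter(b - a for a, b in zip(positions, positions[1:]))
    total = sum(gaps.values())
    if total == 0:
        return [0.0] * max_dist
    return [gaps.get(d, 0) / total for d in range(1, max_dist + 1)]


def moving_average(values, window=3):
    """Centred moving average, truncated at the borders (assumed smoother)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def euclidean(f, g):
    """Euclidean distance between two discrete functions on the same domain."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def complete_linkage(curves, n_clusters):
    """Agglomerative clustering with complete linkage: repeatedly merge the
    two clusters whose farthest members are closest, until n_clusters remain.
    Returns clusters as lists of indices into `curves`."""
    clusters = [[i] for i in range(len(curves))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(euclidean(curves[i], curves[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy usage on a made-up sequence (not real genomic data).
toy_seq = "ACGACGTACGTTACGACG"
words = all_words(2)  # 16 words of length 2
smoothed = [moving_average(distance_distribution(toy_seq, w, 8)) for w in words]
partition = complete_linkage(smoothed, n_clusters=3)
```

For real sequences and k up to 7, the quadratic pair search in this sketch would be slow; an established hierarchical clustering library would be the more practical choice.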
Appears in Collections: CIDMA - Comunicações
IBIMED - Comunicações
PSG - Comunicações

Files in This Item:
File | Description | Size | Format
DSSV2017_tavares.pdf | | 161.18 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.