Binary auto-regressive geometric modelling in a DNA context

Gouveia, Sónia; Scotto, Manuel G.; Weiss, Christian H.; Ferreira, Paulo Jorge S. G.

doi:10.1111/rssc.12172

Please use this identifier to cite or link to this item: http://hdl.handle.net/10773/17427

Full metadata record

DC Field	Value	Language
dc.contributor.author	Gouveia, Sónia	pt
dc.contributor.author	Scotto, Manuel G.	pt
dc.contributor.author	Weiss, Christian H.	pt
dc.contributor.author	Ferreira, Paulo Jorge S. G.	pt
dc.date.accessioned	2017-05-15T13:00:24Z	-
dc.date.issued	2017-02	-
dc.identifier.issn	1467-9876	pt
dc.identifier.uri	http://hdl.handle.net/10773/17427	-
dc.description.abstract	Symbolic or categorical sequences occur in many contexts and can be characterized, for example, by integer-valued intersymbol distances or binary-valued indicator sequences. The analysis of these numerical sequences often sheds light on the properties of the original symbolic sequences. This work introduces new statistical tools for exploring auto-correlation structure in the indicator sequences, for the specific case of deoxyribonucleic acid (DNA) sequences. It is known that the probability distribution of internucleotide distances of DNA sequences deviates significantly from the distribution obtained by assuming independent random placement (i.e. the geometric distribution) and that the deviations can be used either to discriminate between species or to build phylogenetic trees. To investigate the extent to which auto-correlation structure explains these deviations, the 0–1 indicator sequence of each nucleotide (A, C, G and T) is endowed with a binary auto-regressive (AR) model of optimum order. The corresponding binary AR geometric distribution is derived analytically and compared with the observed internucleotide distance distribution by appropriate goodness-of-fit testing. Results in 34 mitochondrial DNA sequences show that the hypothesis of equal observed/expected frequencies is rejected less often when a binary AR model is considered instead of independence (76/136 versus 125/136 rejections at the 1% level), in spite of chi-square testing tending to reject for large samples, regardless of how close observed/expected values are. Furthermore, binary AR structure also leads to a median discrepancy reduction of 90% for G, 80% for C, 60% for T and 30% for nucleotide A. Therefore, these models are useful to describe the dependences within a given nucleotide and encourage the development of a model-based framework to compact internucleotide distance information and to understand DNA differences among species further.	pt
dc.language.iso	eng	pt
dc.publisher	Wiley	pt
dc.relation	FCT-UID/CEC/00127/2013	pt
dc.relation	FCT-UID/MAT/04106/2013	pt
dc.relation	FCT - SFRH/BPD/87037/2012	pt
dc.rights	restrictedAccess	por
dc.subject	Binary auto-regressive models	pt
dc.subject	χ2-testing	pt
dc.subject	DNA sequence analysis	pt
dc.subject	Geometric distribution	pt
dc.subject	Internucleotide distances	pt
dc.title	Binary auto-regressive geometric modelling in a DNA context	pt
dc.type	article	pt
dc.peerreviewed	yes	pt
ua.distribution	international	pt
degois.publication.firstPage	253	pt
degois.publication.issue	2	pt
degois.publication.lastPage	271	pt
degois.publication.title	Journal of the Royal Statistical Society: Series C	pt
degois.publication.volume	66	pt
dc.date.embargo	10000-01-01	-
dc.identifier.doi	10.1111/rssc.12172	pt
Appears in Collections:	CIDMA - Artigos; IEETA - Artigos; PSG - Artigos
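
The abstract refers to three objects: the 0–1 indicator sequence of a nucleotide, its internucleotide distances, and the geometric distribution expected under independent random placement. The following is a minimal Python sketch of that independence baseline only, not the authors' binary AR model or their goodness-of-fit procedure. The function names and the toy sequence are hypothetical, and the sequence is assumed to be linear, so no distance is counted across its ends.

```python
from collections import Counter


def indicator(seq, nucleotide):
    """Return the 0-1 indicator sequence of one nucleotide."""
    return [1 if s == nucleotide else 0 for s in seq]


def internucleotide_distances(seq, nucleotide):
    """Distances between consecutive occurrences of one nucleotide.

    Assumption: the sequence is treated as linear, so the gap between
    the last and the first occurrence is not counted.
    """
    positions = [i for i, s in enumerate(seq) if s == nucleotide]
    return [b - a for a, b in zip(positions, positions[1:])]


def geometric_pmf(k, p):
    """P(D = k) for k = 1, 2, ... under independent placement."""
    return p * (1 - p) ** (k - 1)


# Toy sequence, for illustration only (not data from the paper).
seq = "ACGTAACGTTAGCA"
x = indicator(seq, "A")
d = internucleotide_distances(seq, "A")

# Estimate the occurrence probability from the indicator sequence.
p_hat = sum(x) / len(x)

# Observed versus expected counts at each observed distance.
observed = Counter(d)
expected = {k: len(d) * geometric_pmf(k, p_hat) for k in sorted(observed)}

for k in sorted(observed):
    print(k, observed[k], round(expected[k], 3))
```

A comparison such as the one reported in the abstract would replace `geometric_pmf` with the distance distribution implied by a fitted binary AR model and then test the observed against the expected frequencies.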

Files in This Item:

File	Description	Size	Format
2017 Gouveiaetal.pdf		778.66 kB	Adobe PDF
