An integrative approach for codon repeats evolutionary analyses

Abstract: The relationship between genome characteristics and several human diseases has been a central research goal in genomics. Many studies have shown that specific gene patterns, such as amino acid repetitions, are associated with human diseases. However, several questions remain open, such as how these tandem repeats appeared along the evolutionary path and how they have evolved in orthologous genes of related organisms. In this paper, we present a computational solution that facilitates comparative studies of orthologous genes from various organisms. The application uses web services to gather gene sequence information, local algorithms to identify tandem repeats, and similarity measures for gene clustering.


Introduction
The analysis of protein primary structures, and of their evolution over time, has been studied extensively from an evolutionary perspective. Many studies (Ali et al., 1998; George et al., 2006; Jones and Pevzner, 2006; Fu and Jiang, 2008) have shown a relationship between some human genes and various illnesses, such as cancer (Pearson, 2007), neurodegenerative disorders, and others (Bowen et al., 2000; Brameier and Wiuf, 2007). Many other studies focus on parts of the genome that have been important for the survival of the human species (Pestova et al., 1994; Ferro et al., 2002; Freed et al., 2005; Herishanu et al., 2009). These studies refer specifically to the repetition of certain codons and/or amino acids, and have made it possible to predict possible diseases and to identify useful treatments, namely patient-oriented medication (Hsueh, 2006; Bogaerts et al., 2008; Mena et al., 2008; Tarini et al., 2009). Pearson and Cleary (2005) identified a set of genes with repetitions that are related to several diseases.
One way of determining how these repeated regions evolved is to track the orthologous genes from related species so that they can be aligned with the sequence from the human gene. Two types of homologous genes can be distinguished: orthologues, which are defined as genes in different species that have evolved from a common ancestor, and paralogues, which originated from duplication in the same species.
When determining the extent to which repetitions are present in orthologous genes across several organisms, a set of biological questions arises. How did these amino acid sequences evolve over time? Are they under some kind of negative pressure? Could this phenomenon have influenced speciation and the evolution of organisms?
The OMIM database (OMIM, 2009) can be used to identify the association between diseases and genes (Hamosh et al., 2005). Additionally, orthologous sequences can be retrieved from KEGG (Ogata et al., 1999;KEGG, 2010). However, querying and relating data from a set of orthologous genes (especially those with repetitions) are extremely time-consuming tasks, exacerbated by the fact that there is no integrative tool that allows performing these comparisons in an automated manner for a large number of organisms. As such, developing a bioinformatics application to perform these tasks in an autonomous and integrated way is important to allow rapid data analysis.
In Lousado et al. (2009) we presented an algorithm that allows the identification of codon repetition regions in genome sequences. In the present paper we have extended this work by developing an integrative software application that facilitates the comparison of disease-related genes from humans with the respective orthologous gene sequences from other organisms, so as to perform evolutionary studies. In order to illustrate this idea, we focused on determining whether existing repetitions that cause diseases in humans have propagated from less evolved organisms, and how that propagation occurred, that is, with a decrease or an increase in the number of repetitions along the evolutionary chain.
Special emphasis was given in this work to facilitating the retrieval of orthologous genes. For example, they are downloaded in a versatile format so that the data may be saved locally as text files for later off-line use. The goal was to achieve automatic analysis of amino acid repetitions in the various orthologous genes. The multi-window functionality introduced in the application is also important: the user does not lose data from query to query and can open up to ten completely independent windows for each of the options. Moreover, even if windows are closed, they remain easily accessible from a drop-down menu.

Integration workflow
For this approach we used web services that are already available in several biological databases, such as KEGG (2010). This web-based technology facilitates data gathering through a standardised programmatic interface. A software application can retrieve information on demand, extracting only the small part of the available data that it needs and avoiding the download of extensive data sets, which is typically done through file transfer.
In order to carry out this work we have developed a standalone application, following a specific workflow (Figure 1). It starts with genes that have previously been identified as implicated in diseases and iteratively constructs a relationship between them and their orthologous genes from various organisms, making it possible to study codon/amino acid repetitions. Following the work presented by Jones and Pevzner (2006), we assume a default value of 10 as the minimum number of consecutive repeated codons and/or amino acids for a repetition to be considered representative.
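The threshold described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm (that is described in Lousado et al., 2009); it is a simple stand-in that only finds exact runs of a single repeated amino acid or codon. The function names and the assumption that codons are read in frame 0 are ours.

```python
import re


def find_residue_runs(sequence, min_length=10):
    """Return (residue, start, length) for each run of one repeated residue."""
    # (.)\1{n,} matches a symbol followed by at least n copies of itself,
    # so n = min_length - 1 gives runs of at least min_length symbols.
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.group(1), m.start(), len(m.group(0)))
            for m in pattern.finditer(sequence)]


def find_codon_runs(nucleotides, min_length=10):
    """Return (codon, codon_index, length) for runs of one repeated codon.

    Assumption: the sequence is read in frame 0; a trailing partial
    codon is ignored.
    """
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, 3)]
    runs = []
    start = 0
    for i in range(1, len(codons) + 1):
        # A run ends at the end of the sequence or when the codon changes.
        if i == len(codons) or codons[i] != codons[start]:
            if i - start >= min_length:
                runs.append((codons[start], start, i - start))
            start = i
    return runs
```

For example, `find_residue_runs("MA" + "Q" * 12 + "LK")` reports one Gln run of length 12 starting at position 2, while a run of only five residues is discarded under the default threshold.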

Figure 1 Data integration workflow
The amino acid and codon data are extracted from the KEGG reference database. From this database, the application identifies the genes that have at least 10 consecutive repeated codons, the predefined size threshold.
Once the genes and their repetitive sequences are identified, a new phase is initiated to determine whether the genes are associated with diseases. We then use the OMIM database to isolate gene-disease associations (Hamosh et al., 2005).
Using this information (genes associated with diseases and containing repetitive sequences), the genes whose repetitions are responsible for diseases are isolated from the remaining genes, in which the repetitions are not known to be related to diseases. The purpose of this separation is to create a control group of genes to validate the study. At the end, the results obtained from the test group can be compared with the results from the control group.
For each gene in the set, the application looks for orthologous genes in a group of previously selected organisms. From that point, the process compares the orthologous gene set, creating a database of human genes and the respective orthologues found.
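The separation into a test group and a control group can be sketched as a simple set partition. This is only an illustration of the step described above. The function name is hypothetical, the gene identifiers 367, 3064 and 1822 are the ones discussed later in this paper, and `ctrl_a` and `ctrl_b` are placeholder names rather than real genes.

```python
def split_study_and_control(genes_with_repeats, disease_repeat_genes):
    """Partition genes with repeats into a study list and a control list.

    genes_with_repeats: genes that passed the repetition threshold.
    disease_repeat_genes: genes whose repetitions are reported to cause
    a disease (in the paper, this information comes from OMIM and the
    literature).
    """
    study = sorted(g for g in genes_with_repeats if g in disease_repeat_genes)
    control = sorted(g for g in genes_with_repeats if g not in disease_repeat_genes)
    return study, control


# Placeholder input: three genes named in the paper plus two made-up controls.
study, control = split_study_and_control(
    {"367", "3064", "1822", "ctrl_a", "ctrl_b"},
    {"367", "3064", "1822"},
)
```

Every gene with repeats ends up in exactly one of the two lists, which is what allows the later comparison between groups.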

Implementation
The application was developed using the .NET platform integrated with Office Web Components. Using the KEGG web services from the .NET platform required some modifications to the default parameters to avoid timeouts and transfer breaks. The user interface is depicted in Figure 2. The framework is made up of two main modules: "Orthologous data retrieval" and "Orthologous advanced search".
Additionally, the framework incorporates a web explorer, pointing to the KEGG web page by default. The user may create up to ten instances of each module (corresponding to ten separate frames), where each instance is able to access independent data. To manage these work frames, common visualisation features are available (tile, cascade, hide, copy, …), and all visualisation preferences are kept across the open frames.

Data retrieval
Starting from a gene ID, a KEGG orthology identifier (ko) or a pathway (e.g., hsa:367, ko:K08557 or path:hsa05215), the software returns the respective orthologues. The information collected is then displayed in the respective frame. The orthologous genes found are displayed in the left-hand list, together with the information on diseases and pathways related to that gene, if any. The user may then access an orthologue by simply selecting the respective gene from the list on the left.
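The three kinds of starting identifier can be told apart by their prefix. The sketch below is our own illustration of that idea, not code from the application. The function name and the returned labels are hypothetical, and any prefix other than `ko` or `path` is assumed to be an organism code introducing a gene ID.

```python
def classify_kegg_identifier(identifier):
    """Classify a 'prefix:value' identifier as gene, orthology or pathway."""
    prefix, separator, value = identifier.partition(":")
    if not separator or not prefix or not value:
        raise ValueError("expected 'prefix:value', got %r" % identifier)
    if prefix == "ko":
        return "orthology"
    if prefix == "path":
        return "pathway"
    # Assumption: anything else is an organism code such as 'hsa'.
    return "gene"
```

With the examples given in the text, `hsa:367` is classified as a gene, `ko:K08557` as an orthology identifier and `path:hsa05215` as a pathway.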
The data being viewed in the amino acid and nucleotide frames are held in memory by default. The user may save the data by simply selecting the respective option "Add sequences and store into memory" (Figure 3). By accessing the list of diseases, the window shows all available information including bibliographic references for each disease (Figure 3).

Figure 3
Two independent instances of KEGG orthologous data retrieval interface (see online version for colours)

Advanced search
The module for orthologous advanced search is essentially an application of batch processing, that is, once the user creates the list of genes to be analysed and the list of organisms to be compared, the system will submit data to the KEGG database, automatically and iteratively. It extracts the information and saves the respective file in the folder that has been previously selected for that purpose.
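The batch process can be sketched as a nested loop over genes and organisms. The sketch deliberately does not call any real KEGG service: `fetch_orthologue` is a hypothetical callable supplied by the caller, and the FASTA-like file layout and file naming are assumptions made for illustration only.

```python
import os


def batch_retrieve(gene_ids, organisms, fetch_orthologue, out_dir):
    """Fetch orthologues for every gene/organism pair and save one file per gene.

    fetch_orthologue(gene_id, organism) is a hypothetical callable that
    returns a sequence string, or None when no orthologue is found.
    Returns a log of (gene_id, organism, status) tuples.
    """
    os.makedirs(out_dir, exist_ok=True)
    log = []
    for gene_id in gene_ids:
        records = []
        for organism in organisms:
            try:
                sequence = fetch_orthologue(gene_id, organism)
            except Exception as error:
                # A failed request is logged and does not stop the batch.
                log.append((gene_id, organism, "failed: %s" % error))
                continue
            if sequence is None:
                log.append((gene_id, organism, "not found"))
                continue
            records.append(">%s|%s\n%s\n" % (gene_id, organism, sequence))
            log.append((gene_id, organism, "ok"))
        # Assumed naming scheme: 'hsa:367' is written to 'hsa_367.fasta'.
        path = os.path.join(out_dir, gene_id.replace(":", "_") + ".fasta")
        with open(path, "w") as handle:
            handle.writelines(records)
    return log
```

Passing the fetch function as a parameter keeps the loop independent of the data source, so the same code can run against a web service or against local files.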
As the data are being processed, a list of orthologous genes from the selected organisms is created. The gene order in the final list depends on the degree of similarity to the original sequence.
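The ordering by similarity can be sketched as follows. The paper does not state which similarity measure the application uses, so this example substitutes the ratio computed by `difflib.SequenceMatcher` from the Python standard library. The function name is ours.

```python
from difflib import SequenceMatcher


def rank_by_similarity(reference, candidates):
    """Sort candidate sequences by decreasing similarity to the reference.

    candidates: dict mapping a name to a sequence string.
    Returns a list of (name, score) pairs with scores in [0, 1].
    Assumption: SequenceMatcher.ratio() stands in for the (unspecified)
    similarity measure of the original application.
    """
    scored = [
        (name, SequenceMatcher(None, reference, sequence, autojunk=False).ratio())
        for name, sequence in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A candidate identical to the reference scores 1.0 and is listed first, while a sequence sharing no residues with it scores 0.0 and is listed last.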
The tool also incorporates other functionalities, including the search for non-exact repetitions, i.e., some errors are allowed in the detection of tandem repeats. Throughout the whole process, besides the web service connections, one can also resort to local data to conduct the analysis.
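One possible way to tolerate errors in a repeat is a sliding window that allows a bounded number of mismatching residues. The sketch below is an assumption about how such a search could work, not a description of the tool's actual method. The function name and the `max_errors` parameter are ours.

```python
def longest_approximate_run(sequence, residue, max_errors=1):
    """Find the longest stretch of `residue` containing at most max_errors
    other symbols. The stretch must start and end on `residue`.

    Returns (start, length), or (0, 0) if the residue does not occur.
    """
    best_start, best_length = 0, 0
    left = 0
    errors = 0
    for right, symbol in enumerate(sequence):
        if symbol != residue:
            errors += 1
        # Shrink the window until it holds at most max_errors mismatches.
        while errors > max_errors:
            if sequence[left] != residue:
                errors -= 1
            left += 1
        # A stretch may not begin with a mismatch, so skip leading ones.
        while left <= right and sequence[left] != residue:
            errors -= 1
            left += 1
        # Only windows that end on the residue are candidates.
        if symbol == residue and right - left + 1 > best_length:
            best_start, best_length = left, right - left + 1
    return best_start, best_length
```

In `"AQQQQHQQQQQA"`, allowing one error merges the two Gln runs around the His residue into a single stretch of length 10, whereas `max_errors=0` reduces the search to the longest exact run of length 5.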
After processing, several spreadsheets are created with the results of the analysis on each of the orthologous genes of the set (Figure 4). These spreadsheets can alternatively be processed locally as a single file in XLS format.

Web explorer
The application includes a web browser that gives complementary information about the genes under study and that allows filtering and parsing to retrieve more information from the gathered pages. We also aim to endow this module with annotation features so that the more relevant concepts can be highlighted, making the presented information easier to read.

Results
To test the application, we conducted an analysis in accordance with the previously presented workflow (Figure 1). For this, we compared human genes under the referred conditions with the respective orthologous genes from several organisms. Results were obtained almost immediately from the local source (offline), and within a few seconds or minutes, depending on bandwidth and the amount of data, when the web server was used directly as the data source (online). Since it integrates scattered data, whether via the Web (several sources) or by post-processing (offline files), the developed tool is a useful aid for researchers, mainly in situations where massive data extraction is needed.
Table 1 presents a list of 15 human genes identified in the literature (Panzer et al., 1995; Hamosh et al., 2005; Pearson and Cleary, 2005) that have either codon or amino acid repetitions directly related to human diseases. This was our set of genes under study. Table 2 shows a list of 16 human genes with repeats, but with no described direct association with diseases. According to the workflow presented above, this list represents the control set for the study.
In the next phase, we selected 20 organisms randomly distributed along the evolutionary chain (Table 3). Then, using two separate instances of the "Advanced search" module (study and control), we retrieved from the KEGG database all orthologous genes from the selected organisms. Extracting the genes orthologous to those identified in Table 1 from the genomes of the organisms shown in Table 3 required approximately 15 min. This delay is due to the use of web services and as such always depends on the amount of processing, data transfer and network efficiency. In this operation, 29 data files were created. Of these, 14 contained the orthologous genes with their nucleotide sequences, one file for each set of genes. Another 14 files contained identical data, but with the amino acid sequences. A log file was also created, recording the success or failure in obtaining orthologues. Figures 5 and 6 present two examples of post-processing of the data using previous results. Figure 5 refers to genes with repeats that are responsible for diseases, whereas Figure 6 refers to the control gene set.
By comparing the results, we note that the Gln string (23 repeats) of the human gene 367, which is responsible for the emergence of prostate cancer, is also detectable in organisms rather distant on the evolutionary chain, namely in fungi. However, in intermediate species the string is not present and it appears again only in higher organisms (mammals). The same happens with the human gene 3064, responsible for Huntington's disease (Herishanu et al., 2009). The Gln string from gene 1822, that is responsible for dentatorubral-pallidoluysian atrophy (DRPLA), a severe neurodegenerative disease (Pearson, 2007), turns out to be only present in Drosophila melanogaster, as well as in most higher organisms.

Figure 5
The graph represents genes whose repetitions were found to be associated with human diseases. It presents a comparison between the repetition length from three human genes and their retrieved orthologous genes
As for the control group, shown in Figure 6, we can observe, for instance, that the gene KCNMA1, responsible for human neurodegenerative diseases such as epilepsy and paroxysmal dyskinesia (Du et al., 2005), presents a large number of Ser repetitions in Schizosaccharomyces pombe and Kluyveromyces lactis, two fungi, but no significant repetitions in intermediate species. The repetition pattern appears again in higher organisms. Interestingly, repetitions are not present in Pan troglodytes, which is phylogenetically very close to humans. Furthermore, repetitions of the human gene SRRM2, which is related to Parkinson's disease (Shehadeh et al., 2010), are present in Homo sapiens, Mus musculus, Monodelphis domestica and Ornithorhynchus anatinus, but do not appear in Pan troglodytes. From the results, we can observe that the genes in Figure 5 behave similarly in Pan troglodytes and Homo sapiens, which is not the case in the control group, with the exception of the MLLT3 gene (Figure 6). It is perhaps interesting to note that in the test group the largest number of repeats occurs with Gln residues, while in the control group the largest number of repetitions occurs with Ser residues, suggesting a more deleterious effect from Gln repetitions than from Ser ones.
With these results we cannot claim a clear distinction between the two groups, nor conclude that the repetitions responsible for diseases in humans have evolved differently from the repetitions that are not responsible for diseases. Nor can we be sure that the disease associated with a particular gene in the control set is not caused by codon repetitions; there is simply no scientific evidence yet for that relation. In any case, the objective of this study was not to explore the biological meaning of the sample, but only to show the potential of the application for integrating information to study codon/amino acid tandem repeats in orthologous genes.

Figure 6
The graph represents genes whose repetitions could not be linked with diseases. It provides a comparison between the repetition length from three human genes and their retrieved orthologous genes (see online version for colours)

Conclusion
Codon or amino acid repeats have been associated with specific human diseases, and may play a variety of regulatory and evolutionary roles. In this paper we presented a computational application that simplifies the study of genes with this type of pattern along the evolutionary chain. To do so, the software extracts orthologous genes from public resources and performs a comparative analysis that shows how repeats have evolved over time within the species under study.
Using this methodology, we have shown some differences that occurred during the evolution of the orthologues of a group of genes. This evolution was not uniform and shows some differences between genes from the test and control groups. Future work should explore these two groups more extensively, extending the study to other genes and other organisms, so that the full set of capabilities of the application can be exercised.