Computational and Statistical Methodologies for ORFeome Primary Structure Analysis

Codon usage and codon context are biased in the open reading frames (ORFs) of most genomes. Codon usage is largely influenced by biased genome G+C pressure, particularly in prokaryotes, but the general rules that govern the evolution of codon context remain largely elusive. To shed new light on this question, we have developed computational, statistical and graphical tools for the analysis of codon context on an ORFeome-wide scale. Here, we describe these methodologies in detail and show how they can be used to analyse the ORFs of any sequenced genome.


Introduction
Genome sequencing is opening unprecedented ways for understanding the primary structure of open reading frames on a global scale and the evolutionary forces that shape them (ORFeome analysis). Codon usage has been intensively studied in many organisms and one already has a relatively good understanding of the structural and functional constraints that shape its evolution. Conversely, other important features, such as codon context (two neighbor codons), tandem codon repeats and amino acid composition have not been so well studied and we are still far from understanding their importance for gene stability, mRNA decoding efficiency and accuracy (1)(2)(3)(4)(5). Codon context is rather interesting because it is biased and has an important impact on tRNA decoding accuracy but the rules that define good and bad context of neighbor codons are not yet understood. Additionally, it is not yet clear whether codon context is used to regulate speed of mRNA translation, whether it influences ribosome drop out during elongation and how genes with bad codon context are translated under physiological stress. Considering that mRNA decoding accuracy is critical to ensure correct flow of genetic information from DNA to protein, understanding those rules is likely to provide new insight on the constraints imposed by the mRNA translation machinery on gene evolution. More importantly, codon context rules would allow one to redesign the open reading frames of genes for optimal expression in heterologous hosts (6,7). This is of practical relevance since previous studies carried out in our laboratory have shown that codon context is species specific and consequently heterologous genes do not have the most appropriate context for translation by the host translational machinery.
Traditional methods used for codon usage and context analysis do not provide user-friendly tools for detailed gene primary structure analysis on a genomic scale.
Codon usage tables, using absolute metrics, are available in public databases for any sequenced gene or genome (http://www.kazusa.or.jp/codon/), and freeware for multivariate analysis (correspondence analysis) of codon and amino acid usage is also readily available (http://bioweb.pasteur.fr/seqanal/interfaces/codonw.html). However, sophisticated statistical and data visualization tools are clearly lacking. To study context bias in complete ORFeomes, we have constructed a bioinformatics system, herein named ANACONDA, which imports FASTA files and performs a series of analyses that reveal how codons are associated in consecutive pairs, in either coding sequences or non-coding regions. This methodology makes it possible to distinguish biases imposed by general rules of genome evolution, which are related to DNA replication biases (8)(9)(10)(11), from biases imposed by the mRNA translational machinery (1,2,12-18). Here, we describe the architecture of ANACONDA and how it can be used to analyse gene primary structure on an ORFeome scale.

Materials
ANACONDA is a software package specially developed for the study of genes' primary

Methods
In this section, we describe the main tools of ANACONDA, which are divided into four main parts, namely: i) uploading and validation of DNA sequence data into the local database; ii) building ORFeome maps for two-codon context bias; iii) visualization and analysis of the two-codon context biases in individual sequences; and iv) comparison of the codon biases across multiple ORFeomes. In addition, because ANACONDA interprets DNA sequences as sequences of trinucleotides (codons), several tools for codon usage analysis have been implemented, as explained below.

Statistical methods
ANACONDA uses contingency tables as its basic statistical methodology and identifies preferred and rejected codon pairs of an ORFeome through the analysis of the adjusted residual values of the contingency tables. The following list highlights the main statistical procedures performed by the software.
1. ANACONDA uploads ORFeome sequences from any genome and reads them in the 5′ to 3′ direction, fixing each codon in turn (ribosomal P-site) and recording its neighbouring codons (E-site codon and A-site codon).
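The counting step and the residual analysis described above can be sketched as follows. This is a minimal illustration and not ANACONDA's own code: it assumes the standard adjusted residual of a two-way contingency table, (observed - expected) / sqrt(expected x (1 - row total / N) x (1 - column total / N)), and it pairs each P-site codon only with its A-site neighbour. The function names `codon_pairs` and `adjusted_residuals` are hypothetical.

```python
from collections import Counter
from math import sqrt


def codon_pairs(orf):
    """Split an ORF into codons and return consecutive (P-site, A-site) pairs."""
    usable = len(orf) - len(orf) % 3  # ignore an incomplete trailing codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    return list(zip(codons, codons[1:]))


def adjusted_residuals(orfs):
    """Build a codon-pair contingency table and return its adjusted residuals.

    Assumption: the standard adjusted residual of a contingency table is used;
    the source does not spell out the exact formula.
    """
    counts = Counter()
    for orf in orfs:
        counts.update(codon_pairs(orf.upper()))

    total = sum(counts.values())
    row = Counter()  # totals for the first codon of each pair
    col = Counter()  # totals for the second codon of each pair
    for (first, second), n in counts.items():
        row[first] += n
        col[second] += n

    residuals = {}
    for first in row:
        for second in col:
            expected = row[first] * col[second] / total
            variance = expected * (1 - row[first] / total) * (1 - col[second] / total)
            if variance > 0:  # skip degenerate cells
                observed = counts[(first, second)]
                residuals[(first, second)] = (observed - expected) / sqrt(variance)
    return residuals
```

Under these assumptions, a positive residual marks a codon pair that occurs more often than expected from the marginal codon frequencies (a preferred pair), and a negative residual marks a pair that occurs less often (a rejected pair).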

Data processing (Quantification).
The imported sequences are then processed according to the statistical methodology that reveals irregularities in codon context along the genome. In this phase, sequence processing can be skipped if the aim is to apply data from a previous statistical analysis to the current analysis. Also, sequences with particular characteristics, or groups of genes, can be excluded (at the beginning or at the end) from quantification. The length of the codon context can also be modified, i.e., instead of analyzing codon pairs, triplets of codons or long-range context effects can be studied.
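As an illustration of a variable context length, the sketch below counts codon contexts formed by a codon and the codon located a chosen number of positions downstream. It is a hedged example only: the function name `context_counts` and the `gap` parameter are hypothetical and do not correspond to ANACONDA options.

```python
from collections import Counter


def context_counts(orf, gap=1):
    """Count contexts formed by each codon and the codon `gap` positions downstream.

    gap=1 gives adjacent codon pairs; larger values probe longer-range context.
    """
    if gap < 1:
        raise ValueError("gap must be at least 1")
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore an incomplete trailing codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    return Counter(zip(codons, codons[gap:]))
```

For example, `context_counts("ATGAAAGGGTAA", gap=2)` pairs ATG with GGG and AAA with TAA, skipping the codon in between.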

Evaluation of sequence quality.
Once the raw data are processed, ANACONDA generates a report showing the rejected ORFs and a short description of the reason for each rejection. Valid ORFs, i.e. those that pass a particular set of filters, are shown in a specific menu, the "Valid Tab", on the left panel of the screen (Fig 1). ORFs excluded from the analysis appear in the "Rejected Tab" of the same panel. This allows simple visual inspection of all sequences present in the original FASTA files.
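The kind of validation described above can be sketched as follows. The source does not list the filters that ANACONDA applies, so every rule below (ATG start, terminal stop codon of the standard genetic code, length divisible by three, no internal stop codons, only A/C/G/T characters) is an assumption chosen for illustration, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def rejection_reason(seq):
    """Return a short reason for rejecting an ORF, or None if it passes all filters."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        return "non-ACGT characters"
    if len(seq) % 3 != 0:
        return "length not a multiple of three"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if not codons or codons[0] != "ATG":
        return "no ATG start codon"
    if codons[-1] not in STOP_CODONS:
        return "no terminal stop codon"
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "internal stop codon"
    return None


def partition_orfs(orfs):
    """Split a {name: sequence} mapping into valid ORFs and rejected ORFs with reasons."""
    valid, rejected = {}, {}
    for name, seq in orfs.items():
        reason = rejection_reason(seq)
        if reason is None:
            valid[name] = seq
        else:
            rejected[name] = reason
    return valid, rejected
```

The two dictionaries returned by `partition_orfs` play the role of the valid and rejected lists, with the rejection reason kept alongside each excluded ORF.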

Working with genomic maps of two-codon context
1. Creating an ORFeome context map. After processing of valid sequences, an entry with the species' name, as given by the user, will appear on the left panel of the main window of ANACONDA (Fig 1).

Data from individual contexts.
To facilitate the interpretation and analysis of genomic maps, individual two-codon contexts can be selected with the cursor, and information about the selected context is displayed in the status bar of the software's window.
This information includes: i) the number of genes used to calculate the bias; ii) the full names of both axes of the map; iii) the residual value for that context; iv) the number of occurrences of that codon pair in the genome under analysis.
6. Exporting data. The numerical data underlying a map can be exported as an Excel worksheet through the option File->Save Matrix. The export includes the raw data and the residuals of all map layouts, i.e. 64 x 61 codons, 21 x 21 amino acids, etc.

Working with individual ORF sequences
1. Mapping ORFs. To detect the impact of codon context bias (as well as the presence of rare codons) on coding sequences, ANACONDA has additional tools for sequence mapping. These can be activated by selecting individual ORFs in the hierarchical left panel of the software's main window (Fig 2).
iii) looking for ORFs rich in bad contexts or rare codons; iv) finding ORFs whose G+C % falls within a chosen interval. This filter tool is very useful for studying the distribution of these variables across an entire ORFeome. It also helps to find specific sequences or ORFs with extreme values for a particular variable (See Note 12).
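Two of the filters mentioned above, G+C % within a chosen interval and the proportion of rare codons, can be sketched as follows. This is an illustrative sketch, not ANACONDA's implementation: the function names are hypothetical, the set of rare codons is assumed to be supplied by the user, and sequences are assumed to be non-empty.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def rare_codon_percent(seq, rare_codons):
    """Percentage of codons in the ORF that belong to a user-supplied rare set."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return 100.0 * sum(codon in rare_codons for codon in codons) / len(codons)


def filter_by_gc(orfs, low, high):
    """Keep the ORFs of a {name: sequence} mapping whose G+C % lies in [low, high]."""
    return {name: seq for name, seq in orfs.items()
            if low <= gc_percent(seq) <= high}
```

Sorting the ORFs by either percentage would similarly expose the sequences with extreme values for that variable.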

Image and data exporting.
As with genomic maps, any part of the gene view layout can be selected and copied into another application. In addition, numerical data associated with filtered ORFs can be exported as an Excel worksheet by right-clicking on the ORF set in the Tab Filtered window.

Comparing maps.
ORFeome maps for two-codon context bias can be compared in pairs using the Processing->Compare Genomes option. This tool produces a Differential Display Map (DDM) that results from subtracting one map from the other cell-by-cell. DDMs can also be manipulated by the user as described for normal ORFeome maps.
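The cell-by-cell subtraction can be sketched as follows, representing each map as a list of rows of residual values. This is a minimal sketch under the assumption that both maps use the same codon order on both axes and are already on comparable scales; the function name `differential_display_map` is hypothetical.

```python
def differential_display_map(map_a, map_b):
    """Subtract two residual maps cell by cell (map_a - map_b).

    Assumption: both maps share the same shape and the same codon order.
    """
    same_rows = len(map_a) == len(map_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(map_a, map_b)):
        raise ValueError("maps must have the same shape")
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(map_a, map_b)]
```

Cells far from zero in the result mark codon pairs whose bias differs most between the two ORFeomes.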

Clustering.
Alternatively, all open maps can be compared in a single display to detect overall patterns of two-codon context. This can be achieved with the option Processing->Compare all genomes. When this option is selected, ANACONDA transforms the 64 x 61 map of each open ORFeome into a single column of 3904 lines, one for each possible codon pair. In a second step, all columns are set side by side to allow immediate comparison of patterns. As with all 64 x 61 maps, it is possible to rearrange this large-scale comparative map through cluster analysis of both axes to highlight major common patterns (Fig. 3).
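The flattening step, and a distance measure that could feed a cluster analysis of the ORFeome axis, can be sketched as follows. This is an illustration only: the source does not state which distance or clustering method ANACONDA uses, so the Euclidean distance is an assumption, and all function names are hypothetical.

```python
from math import sqrt


def flatten_map(residual_map):
    """Turn a map (list of rows) into one column, row after row."""
    return [value for row in residual_map for value in row]


def comparison_table(maps):
    """Flatten each ORFeome map in a {name: map} mapping into a column of equal length."""
    columns = {name: flatten_map(m) for name, m in maps.items()}
    if len({len(column) for column in columns.values()}) > 1:
        raise ValueError("all maps must have the same number of cells")
    return columns


def euclidean(u, v):
    """Euclidean distance between two columns (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(columns):
    """Distance between every pair of ORFeome columns, keyed by sorted name pairs."""
    names = sorted(columns)
    return {(p, q): euclidean(columns[p], columns[q])
            for i, p in enumerate(names) for q in names[i + 1:]}
```

A 64 x 61 map flattened in this way yields the 3904-line column mentioned above, and the pairwise distances could be passed to any hierarchical clustering routine to reorder the ORFeome axis.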

Exporting data.
As with the 64 x 61 maps, the adjusted residuals of large-scale comparison maps can be exported as CSV files for further mathematical analysis.

Notes
8. A sequence that has been opened with no quantification can be analysed with residual data extracted from other ORFeomes. To do this, the user must select the sequence in the hierarchical left panel and right-click on its name. Then the option "re-direct" must be selected, together with the genome whose residual data are to be used. The sequence will then appear in the gene view layout, coloured as if it belonged to the host genome.
9. The header of the gene view layout includes: 1) the ORF name; 2) the total number of codons of that ORF; 3) the number of codons whose frequency is below the chosen threshold for rare codons; 4) the percentage of rare codons in the ORF; 5) the type of map and how data were quantified to reach the residuals used; 6) the count and the percentage of two-codon contexts whose calculated residuals fall within each colour of the scale shown in the layout. Additionally, ANACONDA allows counting the total number of particular codons, as specified in the gene view options.
10. Alternatively, the same information can be obtained using the "i" button of the
13. Workspaces can be named by the user and saved at any location in the file system.
14. Some windows allow the ORFeome to be analyzed to be selected through a drop-down menu located in a field called "genome". Usually, the default ORFeome is the first one that was opened, and care must be taken to change this selection in order to analyze the intended ORFeome. ORFeomes are normalized to a given size and aligned using the same context order.
