Distributional properties of inversions and segmentation algorithms for RNA sequences

Sameera Dhananjaya Viswakula, University of Texas at El Paso

Abstract

Ribonucleic acid (RNA) is a long single-stranded molecule made up of four types of nucleotide bases: Adenine (A), Cytosine (C), Guanine (G) and Uracil (U). It folds back on itself and forms C-G and A-U complementary base pairs. The set of such hydrogen-bonded pairs in an RNA molecule is called its secondary structure. Knowing the secondary structure of RNA is useful for understanding its biological function. Prediction of RNA secondary structure from the nucleotide sequence has been an important bioinformatics problem for over two decades. The work in this thesis is motivated by the need to improve the accuracy and efficiency of secondary structure prediction for long RNA molecules. It involves investigating the distribution of inversions in random nucleotide sequences. An inversion is a string of nucleotide bases in an RNA sequence followed closely downstream by its inverted complementary sequence. It is the essential building block of any secondary structure. In this study, I focused on a random variable representing the number of inversions in an independent and identically distributed (i.i.d.) letter sequence, sampled from the nucleotide alphabet {A, C, G, U} with base composition {pA, pC, pG, pU}. I derived a recursive expression for calculating its mean, obtained simulated values for its variance, and demonstrated that this random variable can be reasonably approximated by a Poisson random variable over a range of inversion parameter values. Predicting RNA secondary structure is a complicated process. It requires so much computer time and memory that detailed predictions are often impractical even for sequences only several hundred bases long. Yet, there exist RNA molecules (e.g., in some viral genomes) of biological interest that contain over a thousand bases. In order to overcome the limitations in computing resources, Taufer et al. (2008) developed an approach based on grid computing technology.
The idea of this approach is to segment a long RNA sequence into smaller chunks and send them to different computers on the grid for individual predictions. Then, the individual predictions are assembled to give the prediction for the original molecule. I developed two algorithms for segmenting a long RNA sequence into small chunks. Both algorithms attempt to identify areas in the sequence with a high concentration of inversions and preserve these areas within a single chunk in order to reduce the information loss. The effect of these two segmentation algorithms on secondary structure prediction accuracy is tested on a set of data from the Rfam database of RNA sequences with known secondary structures.
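
As a rough illustration of the random variable studied here, the following Python sketch counts inversions in simulated i.i.d. sequences and reports the sample mean and variance of the count. It is not the thesis's method: the parameterization of an inversion by a stem length and a minimum and maximum gap, and the names `count_inversions` and `simulate`, are assumptions made for this example only. For a Poisson-like count, the sample mean and variance would be expected to be close to each other.

```python
import random

# Watson-Crick pairs named in the abstract: C-G and A-U.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the inverted complementary sequence of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_inversions(seq, stem_len, min_gap, max_gap):
    """Count (start, gap) pairs where a stem of length stem_len is followed,
    after a gap of min_gap..max_gap bases, by its reverse complement.
    This parameterization is an assumption of the sketch."""
    n = len(seq)
    count = 0
    for i in range(n - 2 * stem_len - min_gap + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for gap in range(min_gap, max_gap + 1):
            j = i + stem_len + gap  # start of the candidate downstream stem
            if j + stem_len > n:
                break
            if seq[j:j + stem_len] == target:
                count += 1
    return count


def simulate(n_seqs, length, probs, stem_len, min_gap, max_gap, seed=0):
    """Sample i.i.d. sequences with base composition probs (order A, C, G, U)
    and return the sample mean and variance of the inversion count."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_seqs):
        seq = "".join(rng.choices("ACGU", weights=probs, k=length))
        counts.append(count_inversions(seq, stem_len, min_gap, max_gap))
    mean = sum(counts) / n_seqs
    var = sum((c - mean) ** 2 for c in counts) / (n_seqs - 1)
    return mean, var


if __name__ == "__main__":
    mean, var = simulate(
        n_seqs=500, length=200, probs=[0.25, 0.25, 0.25, 0.25],
        stem_len=6, min_gap=3, max_gap=10,
    )
    print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

The parameter values in the usage block are arbitrary; how well a Poisson approximation holds for any particular choice is a question the thesis itself addresses, not this sketch.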

Subject Area

Statistics|Bioinformatics

Recommended Citation

Viswakula, Sameera Dhananjaya, "Distributional properties of inversions and segmentation algorithms for RNA sequences" (2011). ETD Collection for University of Texas, El Paso. AAI1498327.
https://scholarworks.utep.edu/dissertations/AAI1498327
