Assessing accuracies and improving efficiency for segmentation-based RNA secondary structure prediction methods
RNA secondary structure prediction has become an important area of interest in biology and medicine because it helps in understanding the mechanisms of many biological processes, such as gene regulation and viral replication, and in designing RNA-based therapies to treat diseases such as cancers and AIDS. Several thermodynamics-based computational algorithms for RNA structure prediction exist and have been used to help understand disease mechanisms and design treatments. However, most of the computational tools that can predict complex pseudoknot structures are limited to sequences of a few hundred nucleotide bases because of their high demands on computer resources. Yet many RNA molecules, such as those making up viral genomes, are thousands of bases long. To overcome this sequence length limitation, a segmentation approach called the chunk concatenation method was previously proposed. It cuts a long RNA sequence into shorter chunks at strategic positions that conserve inversion patterns in the nucleotide sequence, predicts the structure of each chunk independently with existing software such as pknotsRG or RNAstructure, and then combines the results to build the final prediction for the entire RNA. In the present study, we investigated whether the prediction accuracy over 136 sequences with known structures, obtained from a collection of Rfam non-coding RNA families, could be improved by capturing possible structures formed between two neighboring chunks that the chunk concatenation method would miss. We compared the overall prediction accuracies of the regular, chunk concatenation, and two-chunk elimination prediction methods to determine whether they differ statistically.
In addition, a high-throughput distributed batch computing system called HTCondor was used to reduce the waiting time for the RNA secondary structure prediction of 14 Nodavirus sequences compared with their sequential prediction, easing the heavy demand on computer resources that arises when processing long RNA sequences.
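The cut, predict, and combine steps of the chunk concatenation method can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The cut positions are taken as given rather than derived from inversion patterns, `toy_predict` is a hypothetical stand-in for a real predictor such as pknotsRG or RNAstructure, and only nested dot-bracket structures (no pseudoknots) are handled. All function names are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]

# Watson-Crick pairs used only by the toy predictor below.
WC = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def split_at(seq: str, cuts: List[int]) -> List[Tuple[int, str]]:
    """Split seq at 0-based cut positions; return (offset, chunk) pairs."""
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [(a, seq[a:b]) for a, b in zip(bounds, bounds[1:]) if b > a]


def pairs_from_dot_bracket(db: str) -> List[Pair]:
    """Convert a nested dot-bracket string into a sorted list of base pairs."""
    stack: List[int] = []
    pairs: List[Pair] = []
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def toy_predict(chunk: str) -> str:
    """Hypothetical predictor: pairs the two end bases if complementary."""
    if len(chunk) >= 5 and (chunk[0], chunk[-1]) in WC:
        return "(" + "." * (len(chunk) - 2) + ")"
    return "." * len(chunk)


def chunk_concatenation(
    seq: str, cuts: List[int], predict: Callable[[str], str]
) -> List[Pair]:
    """Predict each chunk independently and merge pairs into global indices."""
    merged: List[Pair] = []
    for offset, chunk in split_at(seq, cuts):
        db = predict(chunk)
        if len(db) != len(chunk):
            raise ValueError("predictor returned a structure of the wrong length")
        # Shift chunk-local indices by the chunk's offset in the full sequence.
        merged.extend((i + offset, j + offset) for i, j in pairs_from_dot_bracket(db))
    return sorted(merged)


if __name__ == "__main__":
    print(chunk_concatenation("GAAACAUUUUCCC", [5, 10], toy_predict))
```

Because each chunk is predicted in isolation, no base pair in the merged result can span a cut position. That is exactly the kind of structure between neighboring chunks that the study asks about.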
Cardenas James, Gerardo Alberto, "Assessing accuracies and improving efficiency for segmentation-based RNA secondary structure prediction methods" (2016). ETD Collection for University of Texas, El Paso. AAI10250763.