A Computational Method for Predicting Functional Effects of Cancer-Related Genetic Sequence Variants

Bofei Wang, University of Texas at El Paso


Rapid advances in next-generation sequencing (NGS) technologies provide many opportunities to identify associations between genetic sequence variants (GSVs) and diseases, which may lead to better clinical diagnosis and treatment. OncoMiner is a bioinformatics pipeline developed at UTEP (OncoMiner.utep.edu) for mining NGS data. It can identify exonic sequence variants, link them to the associated literature, visualize their genomic locations, and compare their occurrence frequencies among different groups. However, the current version of OncoMiner accepts only a specific input file format provided by the Otogenetics NGS Lab Services. The main objectives of my current work are (1) to develop a Python script that preprocesses NGS files in the more widely used variant call format (VCF) and converts them to the OncoMiner input (OMI) format, and (2) to evaluate the performance of the script. Most of the required data fields in the OMI file can be extracted directly from the VCF file. The genomic region type, however, needs to be determined by comparison with a reference genome. Since I will be working on human cancer data, the reference genome used for this work is the human genome assembly hg38 obtained from the UCSC Genome Browser. To improve efficiency, the script splits the VCF file and the reference genome by chromosome into smaller files for parallel processing. The script has been tested on 148 VCF files, containing data from prostate cancer patients, downloaded from The Cancer Genome Atlas (TCGA). Parallelization of the script yielded average speedups of 1.50, 2.28, 3.14, 3.84, and 4.00 using 2, 4, 8, 16, and 24 cores, respectively. To test the program's capability of handling big datasets, 35 larger files with sizes ranging from 193.8 MB to 3.7 GB were used. These files contain data from leukemia patients, cell lines, and normal individuals collected at local hospitals and UTEP.
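The per-chromosome splitting step described above can be illustrated with a minimal Python sketch. This is not the script developed in this work; the function name `split_vcf_by_chromosome` and the sample records are hypothetical, and the sketch assumes only that VCF header lines begin with `#` and that the first tab-separated column of each data line is the chromosome.

```python
from collections import defaultdict


def split_vcf_by_chromosome(vcf_lines):
    """Group VCF data lines by their CHROM column (hypothetical helper).

    Returns the header lines and a dict mapping chromosome name to the
    list of data lines for that chromosome.
    """
    header = []
    by_chrom = defaultdict(list)
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)  # meta-information and column header lines
            continue
        chrom = line.split("\t", 1)[0]  # CHROM is the first column
        by_chrom[chrom].append(line)
    return header, dict(by_chrom)


# Illustrative records only; not data from this study.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    "chr2\t300\t.\tG\tA\t50\tPASS\t.",
]
header, chunks = split_vcf_by_chromosome(sample)
```

Each per-chromosome chunk is independent of the others, so the chunks could be written to separate files or handed to a pool of worker processes, which is the property that makes this kind of split suitable for parallel processing.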
Both the number of variants and the number of samples in the VCF file were found to be significantly correlated with runtime. A multiple linear regression indicated that 83% of the variation in runtime can be explained by its relationship with the numbers of variants and samples. We plan to incorporate this preprocessing script into the OncoMiner pipeline and use it for downstream analyses of a collection of 500 prostate cancer VCF files from TCGA and of the local leukemia dataset, in order to identify GSVs associated with these diseases and to prioritize risky variants based on their predicted functional effects.
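To make the regression statement concrete, the following sketch shows how a coefficient of determination (R²) can be computed for a model with two predictors, here standing in for the number of variants and the number of samples. It is a plain-Python illustration under stated assumptions: the function names `fit_linear` and `r_squared` are hypothetical, and the data are synthetic values constructed to be exactly linear, not measurements from this study.

```python
def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]  # [intercept, b1, b2, ...]


def r_squared(X, y, coef):
    """Fraction of the variance in y explained by the fitted model."""
    pred = [coef[0] + sum(b * x for b, x in zip(coef[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic predictors (x1, x2) and a response built as 2 + 3*x1 + 0.5*x2.
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in X]
coef = fit_linear(X, y)
r2 = r_squared(X, y, coef)
```

Because the synthetic response is an exact linear function of the predictors, R² here is essentially 1; with real runtimes the value would be lower, and a value of 0.83 corresponds to the 83% figure reported above.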

Subject Area

Bioinformatics|Public health|Pathology|Health sciences|Oncology|Genetics|Epidemiology|Physiology

Recommended Citation

Wang, Bofei, "A Computational Method for Predicting Functional Effects of Cancer-Related Genetic Sequence Variants" (2019). ETD Collection for University of Texas, El Paso. AAI13886012.