Document Type

Thesis

Degree

Master of Science

Major

Computer Science

Date of Defense

4-5-2021

Graduate Advisor

Badri Adhikari

Committee

Sharlee Climer

Uday Chakraborty

Abstract

Protein structure prediction and its associated key sub-problems, such as distance map prediction, are of significant importance in biology and bioinformatics. The inter-residue distance prediction problem, or distance prediction for short, is to predict the physical distances between amino acids in three-dimensional (3D) space, given a protein's one-dimensional sequence information. While there exist many methods to predict distance maps, there are currently no methods that can take those predicted distance maps and build 3D models from them in an ab initio way, i.e., without using any other information. This work aims to fill this gap by (a) developing a method that accepts predicted distance maps (2D information) as input and builds 3D models, and (b) investigating the prospects and limitations of distance-guided 3D modeling (reconstruction). DISTFOLD is a Perl- and Python-based script that wraps around a well-established 3D modeling tool known as CNS-Suite. To test our DISTFOLD implementation, we first benchmarked it on a small subset of the SCOPe dataset representative of the entire Protein Data Bank. In addition to developing DISTFOLD, we also investigated (a) how various distance thresholds for selecting distance restraints impact the reconstruction accuracy, (b) how secondary structure information influences the reconstruction accuracy, and (c) how the reconstruction accuracy changes when predicted distances are used instead of true distances. Using two representative sets consisting of 1583 proteins and 259 proteins, we show that our method, DISTFOLD, is capable of building accurate models in an array of settings. Our results also show that the threshold value chosen to filter out or keep distances can drastically affect reconstruction accuracy. We also show that including secondary structures, when available, can benefit reconstruction in the absence of local distance information.
When predicted distances are used instead of true distances, we found that the reconstruction accuracy drops significantly and that distances predicted at thresholds higher than 11 or 12 angstroms are not significantly useful for reconstruction. DISTFOLD is publicly available at https://github.com/ba-lab/distfold/.
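
The threshold-based selection of distance restraints described in the abstract can be illustrated with a minimal sketch. This is not DISTFOLD's actual code: the function name `select_restraints`, the `min_seq_sep` parameter, the 1-based residue numbering, and the toy distance map are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def select_restraints(
    dist_map: Sequence[Sequence[float]],
    threshold: float,
    min_seq_sep: int = 1,
) -> List[Tuple[int, int, float]]:
    """Keep residue pairs whose distance is at or below `threshold` (angstroms).

    Hypothetical helper, not part of DISTFOLD. Returns (i, j, distance)
    tuples with i < j, using 1-based residue numbers (an assumption).
    Pairs closer than `min_seq_sep` positions in sequence are skipped.
    """
    restraints = []
    n = len(dist_map)
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            d = dist_map[i][j]
            if d <= threshold:
                restraints.append((i + 1, j + 1, d))
    return restraints


# Toy symmetric distance map for a 4-residue chain (made-up values).
toy_map = [
    [0.0, 3.8, 6.0, 13.5],
    [3.8, 0.0, 3.8, 9.0],
    [6.0, 3.8, 0.0, 3.8],
    [13.5, 9.0, 3.8, 0.0],
]

kept_at_8 = select_restraints(toy_map, threshold=8.0)
kept_at_12 = select_restraints(toy_map, threshold=12.0)
non_local_at_12 = select_restraints(toy_map, threshold=12.0, min_seq_sep=2)
```

Raising the threshold keeps more pairs as restraints, while a larger `min_seq_sep` discards local (sequence-adjacent) pairs, which loosely mirrors the setting in which local distance information is absent.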
