PROTEIN STRUCTURE PREDICTION
Bioinformatic Approach

edited by
IGOR F. TSIGELNY


Contents
Preface     xv
List of Contributors     xxi
Part I. CONCEPTS OF PROTEIN STRUCTURE PREDICTION     1
A. Prediction Methods and Systems     3
1. Computational Studies of Protein Structure and Function Using Threading Program PROSPECT     5
Dong Xu and Ying Xu
   1.1. Introduction     5
   1.2. Method of PROSPECT     10
          1.2.1. Threading Templates     11
          1.2.2. Energy Function     12
          1.2.3. Threading Algorithm     13
          1.2.4. Confidence Assessment of Threading Results     15
   1.3. Protocols of Using PROSPECT     17
          1.3.1. Pre-Processing before Running PROSPECT     18
          1.3.2. Running PROSPECT     20
          1.3.3. Human Evaluation     21
          1.3.4. Manual Refinement     25
          1.3.5. Structure-Based Functional Inference     26
   1.4. Performance of PROSPECT     29
          1.4.1. Testing of PROSPECT Using Known Structures in PDB     29
          1.4.2. Blind Test in CASP     30
   1.5. Application of PROSPECT in Protein Studies     34
          1.5.1. Human Vitronectin     34
          1.5.2. Human DNA-Activated Protein Kinase     35
          1.5.3. Yeast PTR3 Protein     36
   1.6. Summary     37
2. Bayesian Approach to Protein Fold Recognition: Building Protein Structural Models from Bits and Pieces     43
Jadwiga Bienkowska, Hongxian He, Robert G. Rogers Jr., and Lihua Yu
   2.1. Introduction     45
   2.2. Fundamentals of DSMs and HMMs     49
          2.2.1. Representation of Protein Structure by a DSM     50
          2.2.2. Mathematical Representation of a DSM     51
          2.2.3. Measures of Compatibility of a Protein Sequence with a DSM     52
   2.3. Automated Generation of Protein Structural Templates     53
         2.3.1. Criteria for Selecting Structural Information     54
         2.3.2. Candidate Structural Quantities     55
         2.3.3. Classification of Structural States     57
   2.4. Automated Design of a Structural DSM from a Structural Template     60
          2.4.1. Design Principles     60
          2.4.2. Secondary Structure Submodels     62
          2.4.3. Construction of DSM from the Structural Template     65
          2.4.4. Using Structural Alignments and Multiple Structural Templates in Building DSM     65
   2.5. Automatic Pattern Embedding in a DSM     66
          2.5.1. Automated Pattern Generation and Selection     67
          2.5.2. Look-Ahead     69
   2.6. A Bayesian Approach to Fold Recognition     70
          2.6.1. The Filtering Algorithm     70
          2.6.2. Prior Model Probabilities     72
   2.7. Results     75
          2.7.1. Comparing the Bayesian Approach and Total Alignment Probability with Other Methods     75
          2.7.2. Results of Automatic Pattern Embedding     77
          2.7.3. Comparison of Different Assignments of Prior Probabilities     79
   2.8. Strategies for Defeating the Combinatorial Explosion     80
3. Three-Dimensional Structure Prediction Using Simplified Structure Models and Bayesian Block Fragments     85
Jun Zhu and Roland Lüthy
   3.1. Introduction     87
   3.2. Methods     89
          3.2.1. Simplified Backbone Angle Representation of 3D Structures     89
          3.2.2. Block Selection     90
          3.2.3. Energy Functions     94
          3.2.4. Energy Minimization     101
          3.2.5. Using Information from Bayesian Blocks     102
          3.2.6. Enforcing Secondary Structures     103
   3.3. Examples     103
4. Protein Structure Prediction Using Hidden Markov Model Structural Libraries      109
Igor Tsigelny, Yuriy Sharikov, and Lynn F. Ten Eyck
   4.1. Introduction     111
   4.2. Structural Hidden Markov Model Libraries     112
   4.3. Decision Tree     114
          4.3.1. Search for the Best HMM     114
          4.3.2. Searching within the Structural Alignment     117
   4.4. Program Testing     120
   4.5. Prediction of Unsolved Structures     121
5. The Role of Sequence Information in Protein Structure Prediction     125
Damien Devos, Florencio Pazos, Osvaldo Olmea, David de Juan, Osvaldo Graña, Jose M. Fernández, and Alfonso Valencia
   5.1. Introduction     127
          5.1.1. Information Contained in Multiple Sequence Alignments in Protein Families     127
   5.2. Automated Generation of Protein Structural Templates     128
   5.3. Distribution of Informative Positions in Protein Structures     130
   5.4. Informative Positions in Protein Structure Models     132
   5.5. A Threading Server That Filters Models with Multiple Sequence Alignments Information     128
   5.6. A First Field Evaluation of the Server, the CAFASP Results     130
   5.7. A CAFASP Example of the Use of Sequence Information     132
   5.8. Training Neural Networks for the Discrimination of Wrong Threading Models Using Sequence     130
   5.9. Conclusions     132
6. Protein Fold Recognition and Comparative Modeling Using HOMSTRAD, JOY, and FUGUE     143
Ricardo Núñez Miguel, Jiye Shi, and Kenji Mizuguchi
   6.1. Introduction     145
   6.2. Overview     149
   6.3. Identification of Homologues     150
   6.4. Generating Sequence-Structure Alignment     152
   6.5. Example     153
          6.5.1. Searching for Homologues     153
          6.5.2. Alignment     157
          6.5.3. Modeling     161
          6.5.4. Heteroatoms     161
          6.5.5. Refinements     162
          6.5.6. Model Validation     162
          6.5.7. Model     163
   6.6. Conclusion     165
7. Fully Automated Protein Tertiary Structure Prediction Using Fourier Transform Spectral Methods     171
Carlos Adriel Del Carpio Muñoz and Atsushi Yoshimori
   7.1. Sequence Alignment and Protein Structure Modeling     173
   7.2. Protein Function and Structure Elucidation by Spectral Analysis     176
   7.3. Spectral Analysis and Folding Pattern Recognition     179
          7.3.1. Spectral Representation of Protein Primary Structures     180
          7.3.2. Spectral Alignment and Protein Structure Similarity     184
          7.3.3. Automatic Protein Folding Pattern Recognition     186
   7.4. Automatic Classification of Protein Foldings     188
          7.4.1. Dominant Physicochemical Parameters     188
          7.4.2. Classification of Protein Folding by Spectral Analysis     191
   7.5. Protein Folding Pattern Recognition by Spectral Analysis     195
8. From the Building Blocks Folding Model to Protein Structure Prediction     201
Nurit Haspel, Chung-Jung Tsai, Haim Wolfson, and Ruth Nussinov
   8.1. Introduction     203
   8.2. Protein Folding: A Process of Intra-Molecular Building Block Recognition     205
   8.3. Experimental and Theoretical Support for the Building Block Concept     206
   8.4. The Building Block Cutting Algorithm     209
   8.5. The Scoring Function     210
   8.6. The Cutting Procedure     211
   8.7. Critical Building Blocks     213
   8.8. From the Building Block Folding Model to Structure Prediction: The Scheme     214
   8.9. Conclusions     220
9. Protein Threading Statistics: An Attempt to Assess the Significance of a Fold Assignment to a Sequence     227
Antoine Marin, Joël Pothier, Karel Zimmermann, and Jean-François Gibrat
   9.1. Introduction     229
   9.2. Method     232
          9.2.1. Library of “Cores”     232
          9.2.2. Development of a Score Function     233
          9.2.3. Combinatorial Optimization Algorithm     239
          9.2.4. Empirical Distribution of Scores     241
          9.2.5. Development of a Benchmark Database     244
   9.3. Results     247
   9.4. Discussion     254
          9.4.1. Use of Filters     254
          9.4.2. Difficulty of the Benchmark     255
          9.4.3. Statistical Criterion     256
          9.4.4. Present Limits of the Method     258
   9.5. Conclusion     259
10. Protein Structure Prediction by Threading: Force Field Philosophy, Approaches to Alignment     263
Thomas Huber and Andrew E. Torda
      10.1. Introduction     265
               10.1.1. Common Methodology     267
      10.2. Force Field Based Scoring     269
      10.3. Parameterizing Force Fields     271
               10.3.1. Physically-Based Potential Energies     271
               10.3.2. Potentials of Mean Force     272
               10.3.3. Optimized Force Fields     273
      10.4. Alignment Philosophy     278
               10.4.1. Common Alignment and Score Methods     278
               10.4.2. Sausage Alignments     279
      10.5. Beyond Pairwise Terms     280
      10.6. Template Libraries    285
      10.7. Further Outlook and Speculation      289
11. Predicting Protein Structure Using SAM, UCSC’s Hidden Markov Model Tools      297
Kevin Karplus
      11.1. A Naive View of Protein Structure Prediction     299
      11.2. Fold Recognition     301
      11.3. Hidden Markov Models     302
              11.3.1. Multitrack Hidden Markov Models     305
              11.3.2. Statistical Significance for Hidden Markov Models     307
      11.4. Using SAM-T2K for Superfamily Modeling     308
      11.5. Improved Verification of Homology     312
      11.6. Family-Level Multiple Alignments     314
      11.7. Modeling Non-Contiguous Domains     315
      11.8. Building an HMM from a Structural Alignment     316
      11.9. Improving Existing Multiple Alignments     319
      11.10. Creating a Multiple Alignment from Unaligned Sequences     319
      11.11. Conclusions     320
12. Local Genome Organization, Gene Expression, and Structural Genomics: Evolution at Work     325
Wayne Volkmuth and Nickolai Alexandrov
      12.1. Introduction     327
      12.2. Methods     329
               12.2.1. Genomes     329
               12.2.2. Microarray Expression Data     329
               12.2.3. Fold Assignment     331
               12.2.4. Non-Redundant Set of Proteins     333
               12.2.5. Fold Enrichment Along the Genome     333
               12.2.6. Fold Enrichment for Genes with Similar Patterns of Expression     333
      12.3. Results     333
               12.3.1. Fold Enrichment Along the Genome     333
               12.3.2. Fold Enrichment for Genes with Similar Patterns of Expression     333
      12.4. Summary and Conclusions     334
13. Protein Structure Prediction on the Basis of Combinatorial Peptide Library Screening     341
Igor Tsigelny, Yuriy Sharikov, Vladimir Kotlovyi, Michael Kelner, and Lynn F. Ten Eyck
      13.1. Concept of the Comprehensive System     343
      13.2. HMM-ELONGATOR     345
               13.2.1. Problem Description     345
               13.2.2. Elongation Strategies     346
B. Consensus Structure Prediction     353
14. A User’s Guide to Fold Recognition     355
Naomi Siew and Daniel Fischer
      14.1. Introduction     357
      14.2. Examples of Using Fold Recognition for Biological Research     358
               14.2.1. Plant Resistance Gene Products     359
               14.2.2. Acetohydroxyacid Synthase     360
               14.2.3. Endothelial Cell Protein C/Activated Protein C Receptor     361
      14.3. How to Fold Recognize?     363
               14.3.1. Searching for Homologues of Known Structure     364
               14.3.2. Running Your Favorite Fold Recognition Method     365
               14.3.3. Running Other Methods     368
               14.3.4. Why Run More Than One Method?     369
               14.3.5. 3D-Shotgun Meta-Predictor     370
      14.4. Summary     370
15. Structure Prediction Meta Server     377
Leszek Rychlewski
      15.1. Introduction     379
      15.2. The Meta Server     381
               15.2.1. User Input and Job Status Display     382
               15.2.2. Job Deposition and Administration     382
               15.2.3. Request Submission Queuing     384
               15.2.4. Blast-Filter     385
               15.2.5. Local and Remote Prediction Services     385
               15.2.6. Raw Output Converters     387
               15.2.7. Visualization and Linking     389
               15.2.8. Interfaces     389
      15.3. Discussion     390
Part II. METHODS OF STRUCTURE AND SEQUENCE ALIGNMENT     395
16. Improved Fold Recognition by Using the PCONS Consensus Approach     397
Huisheng Fang, Björn Wallin, Jesper Lundström, Christer von Wowern, and Arne Elofsson
      16.1. Introduction     399
      16.2. Why are Manual Predictions Better?     401
               16.2.1. Biological Knowledge     401
               16.2.2. Structural Analysis     401
               16.2.3. Consensus Analysis     402
      16.3. Consensus Predictions in CASP4     403
      16.4. Pcons     405
               16.4.1. Collection of Publicly Available Models     406
               16.4.2. Structural Comparison     406
               16.4.3. Prediction of Quality of the Models     407
      16.5. Performance of Pcons     408
               16.5.1. Performance in LiveBench-2     409
               16.5.2. Why Does Pcons Perform Better?     411
      16.6. Pcons-II     412
               16.6.1. Improvements Using More Servers     412
               16.6.2. Speed-Up of Structural Comparisons     412
               16.6.3. Using Better Statistics     413
               16.6.4. Improvements Using Linear Regression     413
      16.7. Summary     414
17. New Insights into Protein Fold Space and Sequence-Structure Relationships     417
Ilya N. Shindyalov and Philip E. Bourne
      17.1. Introduction     419
      17.2. Overview of CE Sequence-Structure Space     420
      17.3. Scop vs. CE Fold Space Comparison     421
      17.4. Analysis of Structure Redundancy     422
               17.4.1. Size of NR Set as a Function of Criteria Used     423
               17.4.2. Characterization of Chains Excluded from the Set     423
               17.4.3. Characterization of Similarity Between Chains in the Set     424
               17.4.4. Complementary Sequence and Structure NR Sets     428
               17.4.5. Combined NR Set     428
18. A Flexible Method for Structural Alignment: Applications to Structure Prediction Assessments     431
Vladimir Kotlovyi, Igor Tsigelny, and Lynn Ten Eyck
      18.1. Introduction     433
      18.2. Theoretical Background     435
      18.3. Algorithms and Their Implementation     438
      18.4. Representation of Data in XML Forms     440
      18.5. Timing     442
      18.6. Web-Servers     443
      18.7. Illustrative Examples     447
19. Comparative Analysis of Protein Structure: New Concepts and Approaches for Multiple Structure Alignment     449
Chittibabu Guda, Eric D. Scheeff, Philip E. Bourne, and Ilya N. Shindyalov
      19.1. Introduction     451
      19.2. Algorithm for Aligning Multiple Protein Structures Using Monte Carlo Optimization     452
               19.2.1. Scoring Function     452
      19.3. Approaches for Optimization of Multiple Structure Alignment     453
               19.3.1. Effect of Weights Based on Number of Residues on Alignment Length and Alignment Distance     453
      19.4. Analysis of Specific Protein Families     455
               19.4.1. Analysis of an Alignment of Protein Kinases     455
               19.4.2. Analysis of an Alignment of Aspartic Proteinases     458
      19.5. Summary     459
20. Comparative Analysis of Protein Structure: Automated vs. Manual Alignment of the Protein Kinase Family     463
Eric D. Scheeff, Philip E. Bourne, and Ilya N. Shindyalov
      20.1. Introduction     465
      20.2. The Challenge of Automated Protein Structure Alignment     466
      20.3. A Case Study: Alignment of the Eukaryotic Protein Kinases and Their Relatives     467
      20.4. An Example of an Automated Alignment: The Combinatorial Extension Algorithm     468
      20.5. Parameters for the Determination of an “Optimal” Structure Alignment     470
      20.6. Comparison of CE Alignments with Manual Alignments     471
      20.7. Conclusion     475
Index     479

Preface

Prediction of protein structure is very important today. Whereas more than 17,000 protein structures are stored in PDB, more than 110,000 proteins are stored in SWISSPROT alone. The ratio of solved crystal structures to the number of discovered proteins is about 0.15, and I do not see any improvement of this value in the future. At the same time, the development of genomics has brought an overwhelming amount of DNA sequencing information, which can be, and already is, used for constructing hypothetical proteins. This situation shows the great importance of protein structure prediction. The field is growing very rapidly. A simple analysis of publications shows that the number of articles containing the words ‘protein structure prediction’ has almost doubled since 1995.
     Many great ideas serve as a basis for the current prediction systems. Some of them will evolve into the next generation of prediction software, but some, even very promising ones, will be lost and rediscovered in the future. Here we have tried to include a variety of methods representing the most interesting concepts of current protein structure prediction. This compendium of ideas makes this book an invaluable source for scientists developing prediction methods. In many cases the authors describe successful prediction methods and programs, which also makes this book a valuable source of information for the numerous users of prediction software.
     The first chapter describes the protein structure prediction program PROSPECT, which produces a globally optimal threading alignment for a typical threading energy function and allows users to easily incorporate experimental data as constraints into the threading process. PROSPECT also provides a confidence assessment of a threading result based on a neural network. The second chapter presents a protein fold-recognition method that selects the best fold model for a given protein sequence from a library of structural hidden Markov models (HMMs). The HMMs are built from protein structures by decomposing them into modular secondary structure elements and representing those elements by a pre-designed set of submodels. The third chapter describes a method to fold proteins into simplified three-dimensional structures constructed from small fragments cut out of a representative set of known three-dimensional structures. The three-dimensional protein structures and fragments are represented in a simplified form as a sequence of angle pairs, one angle pair per residue. Chapter 4 describes the application of HMMs constructed on the basis of structural alignments to protein structure prediction. An example system, HMM-SPECTR, is given with a description of different types of HMMs based on structural alignments. Chapter 5 reviews the different methods of extracting information from multiple sequence alignments and illustrates how to use them as a primary source of information. The chapter describes the application of rarely used features such as sequence conservation, variations between sub-families, correlation between the patterns of mutation of pairs of positions, and the distribution of apolar residues for structure prediction. Chapter 6 illustrates how knowledge of protein three-dimensional structure can be used to identify homologues of known structure, generate sequence-structure alignments, and assist model building. It describes three programs: HOMSTRAD, a database of structure-based alignments for protein families of known structure; JOY, a program to annotate local environments in structure-based alignments; and FUGUE, a program to perform sequence-structure homology recognition. Chapter 7 proposes a different concept of sequence homology. This concept is derived from a periodicity analysis of the physicochemical properties of the residues constituting proteins' primary structures. The analysis is performed using a front-end processing technique from automatic speech recognition, by means of which the cepstrum (a measure of the periodic wiggliness of a frequency response) is computed; this leads to a spectral envelope that depicts the subtle periodicity in the physicochemical characteristics of the sequence.
     Chapter 8 describes the building block protein folding model. Via a building block assigning algorithm, sequence comparisons, and a weighting scheme, building blocks are assigned to a target protein sequence. The problem of the ‘building block’ is very important for both protein folding modeling and protein structure prediction. Authors of several chapters in this book propose different ‘building blocks’ for discretization of the prediction process. In most cases they do not discuss the physical properties of these blocks, paying attention only to the information coding. The approach of the authors of Chapter 8, which clearly defines the physical and informational properties of the building blocks, looks very promising.
     Chapter 9 describes a new fold recognition method called FROST (Fold Recognition Oriented Search Tool). It includes 1D and 3D comparison and a database of representative three-dimensional structures. The chapter uses information theory concepts to embed a number of sequence and structure parameters in one scoring function. This approach makes the chapter very elegant and useful for the developers of protein structure prediction systems. Chapter 10 continues the discussion of how to combine different levels of resolution and representation of a protein and the rationalizations of score functions for protein structure prediction. The statistical mechanical parameters are used together with purely empirical and even ad hoc parameters.
     Chapter 11 describes one of the most effective HMM systems for biological applications, SAM. The chapter shows an approach to fold recognition that relies on HMMs both for selecting the template and for aligning the target to the template. The technique has been used successfully in three of the Critical Assessment of Structure Prediction (CASP) experiments.
     Chapter 12 discusses the important link between genomic information and protein structure. The chapter describes the clues that could be used to help infer the evolutionary relationship via structural similarity and improve the ability to predict the biochemical function. The first such clue is a positional conservation along the genome, i.e., nearby genes tend to be structurally related more often than expected by chance alone. The second such clue is present in expression data: genes that are correlated in expression are more apt to share a common fold than two randomly chosen genes.
     Chapter 13 proposes a comprehensive system for computer-based drug design. The chapter describes the program HMM-ELONGATOR, which predicts putative protein targets based on a set of peptides from combinatorial libraries that have been shown to bind a drug molecule.
     Chapter 14 shows, on the basis of three examples, how the use of fold recognition helped biologists in planning and devising experiments and in generating verifiable hypotheses. This chapter also describes the meta-predictor approach to protein structure prediction. Chapter 15 describes in detail the Structure Prediction Meta Server, which collects prediction models from many high-quality services and translates them into standard formats, enabling convenient analysis of the results. The Meta Server offers an infrastructure for the creation of automated jury algorithms, which analyze the set of results for the user and calculate the reliability score for a consensus prediction. Chapter 16 describes a new method for fold recognition, Pcons, that utilizes “consensus analysis.” This chapter shows the advantages of Pcons based on large-scale benchmarking.
     Chapter 17 starts the part of the book devoted to the concepts of structural alignment. Proper structural alignment of proteins is the cornerstone of the majority of prediction methods. This chapter introduces several new views of protein fold space that will help to further understand protein evolution and interpret structural similarities. Differences between the manual (SCOP) and automated (CE) approaches to the structural classification problem are described. Chapter 18 discusses the design principles of a structure alignment system that can be used for structure prediction assessments. This system is based on a hierarchical representation of a protein shape. Such a representation makes the system suitable for effective alignments of structures with low similarity. Chapter 19 describes the Monte Carlo approach to the construction of multiple structural alignments. Chapter 20 describes a specific example in which an alignment of eukaryotic protein kinases generated using the combinatorial extension algorithm (CE) is compared with a manually derived alignment. Implications for CE are discussed, as well as implications for automated structural alignment in general.
     Overwhelmed by current errands, proposals, and papers, we mostly do not think in global terms about our place in the building of knowledge, the building of science. Nevertheless, it is going on, and in one way or another we build the structure of scientific knowledge. If scientific articles are the ‘bricks’ in this building, books are the cornerstones.
     I would like to thank all the authors for devoting their time to the writing of the chapters. I hope this book will be useful to professionals and students in the field.



Igor F. Tsigelny
La Jolla