PROTEIN STRUCTURE PREDICTION Bioinformatic Approach edited by IGOR F. TSIGELNY
Contents

Preface
List of Contributors

Part I. CONCEPTS OF PROTEIN STRUCTURE PREDICTION

A. Prediction Methods and Systems

1. Computational Studies of Protein Structure and Function Using Threading Program PROSPECT
   Dong Xu and Ying Xu
    1.1. Introduction
    1.2. Method of PROSPECT
        1.2.1. Threading Templates
        1.2.2. Energy Function
        1.2.3. Threading Algorithm
        1.2.4. Confidence Assessment of Threading Results
    1.3. Protocols of Using PROSPECT
        1.3.1. Pre-Processing before Running PROSPECT
        1.3.2. Running PROSPECT
        1.3.3. Human Evaluation
        1.3.4. Manual Refinement
        1.3.5. Structure-Based Functional Inference
    1.4. Performance of PROSPECT
        1.4.1. Testing of PROSPECT Using Known Structures in PDB
        1.4.2. Blind Test in CASP
    1.5. Application of PROSPECT in Protein Studies
        1.5.1. Human Vitronectin
        1.5.2. Human DNA-Activated Protein Kinase
        1.5.3. Yeast PTR3 Protein
    1.6. Summary

2. Bayesian Approach to Protein Fold Recognition: Building Protein Structural Models from Bits and Pieces
   Jadwiga Bienkowska, Hongxian He, Robert G. Rogers Jr., and Lihua Yu
    2.1. Introduction
    2.2. Fundamentals of DSMs and HMMs
        2.2.1. Representation of Protein Structure by a DSM
        2.2.2. Mathematical Representation of a DSM
        2.2.3. Measures of Compatibility of a Protein Sequence with a DSM
    2.3. Automated Generation of Protein Structural Templates
        2.3.1. Criteria for Selecting Structural Information
        2.3.2. Candidate Structural Quantities
        2.3.3. Classification of Structural States
    2.4. Automated Design of a Structural DSM from a Structural Template
        2.4.1. Design Principles
        2.4.2. Secondary Structure Submodels
        2.4.3. Construction of DSM from the Structural Template
        2.4.4. Using Structural Alignments and Multiple Structural Templates in Building DSM
    2.5. Automatic Pattern Embedding in a DSM
        2.5.1. Automated Pattern Generation and Selection
        2.5.2. Look-Ahead
    2.6. A Bayesian Approach to Fold Recognition
        2.6.1. The Filtering Algorithm
        2.6.2. Prior Model Probabilities
    2.7. Results
        2.7.1. Comparing the Bayesian Approach and Total Alignment Probability with Other Methods
        2.7.2. Results of Automatic Pattern Embedding
        2.7.3. Comparison of Different Assignments of Prior Probabilities
    2.8. Strategies for Defeating the Combinatorial Explosion

3. Three-Dimensional Structure Prediction Using Simplified Structure Models and Bayesian Block Fragments
   Jun Zhu and Roland Lüthy
    3.1. Introduction
    3.2. Methods
        3.2.1. Simplified Backbone Angle Representation of 3D Structures
        3.2.2. Block Selection
        3.2.3. Energy Functions
        3.2.4. Energy Minimization
        3.2.5. Using Information from Bayesian Blocks
        3.2.6. Enforcing Secondary Structures
    3.3. Examples

4. Protein Structure Prediction Using Hidden Markov Model Structural Libraries
   Igor Tsigelny, Yuriy Sharikov, and Lynn F. Ten Eyck
    4.1. Introduction
    4.2. Structural Hidden Markov Model Libraries
    4.3. Decision Tree
        4.3.1. Search for the Best HMM
        4.3.2. Searching within the Structural Alignment
    4.4. Program Testing
    4.5. Prediction of Unsolved Structures

5. The Role of Sequence Information in Protein Structure Prediction
   Damien Devos, Florencio Pazos, Osvaldo Olmea, David de Juan, Osvaldo Graña, Jose M. Fernández, and Alfonso Valencia
    5.1. Introduction
        5.1.1. Information Contained in Multiple Sequence Alignments in Protein Families
    5.2. Automated Generation of Protein Structural Templates
    5.3. Distribution of Informative Positions in Protein Structures
    5.4. Informative Positions in Protein Structure Models
    5.5. A Threading Server That Filters Models with Multiple Sequence Alignments Information
    5.6. A First Field Evaluation of the Server, the CAFASP Results
    5.7. A CAFASP Example of the Use of Sequence Information
    5.8. Training Neural Networks for the Discrimination of Wrong Threading Models Using Sequence
    5.9. Conclusions

6. Protein Fold Recognition and Comparative Modeling Using HOMSTRAD, JOY, and FUGUE
   Ricardo Núñez Miguel, Jiye Shi, and Kenji Mizuguchi
    6.1. Introduction
    6.2. Overview
    6.3. Identification of Homologues
    6.4. Generating Sequence-Structure Alignment
    6.5. Example
        6.5.1. Searching for Homologues
        6.5.2. Alignment
        6.5.3. Modeling
        6.5.4. Heteroatoms
        6.5.5. Refinements
        6.5.6. Model Validation
        6.5.7. Model
    6.6. Conclusion

7. Fully Automated Protein Tertiary Structure Prediction Using Fourier Transform Spectral Methods
   Carlos Adriel Del Carpio Muñoz and Atsushi Yoshimori
    7.1. Sequence Alignment and Protein Structure Modeling
    7.2. Protein Function and Structure Elucidation by Spectral Analysis
    7.3. Spectral Analysis and Folding Pattern Recognition
        7.3.1. Spectral Representation of Protein Primary Structures
        7.3.2. Spectral Alignment and Protein Structure Similarity
        7.3.3. Automatic Protein Folding Pattern Recognition
    7.4. Automatic Classification of Protein Foldings
        7.4.1. Dominant Physicochemical Parameters
        7.4.2. Classification of Protein Folding by Spectral Analysis
    7.5. Protein Folding Pattern Recognition by Spectral Analysis

8. From the Building Blocks Folding Model to Protein Structure Prediction
   Nurit Haspel, Chung-Jung Tsai, Haim Wolfson, and Ruth Nussinov
    8.1. Introduction
    8.2. Protein Folding: A Process of Intra-Molecular Building Block Recognition
    8.3. Experimental and Theoretical Support for the Building Block Concept
    8.4. The Building Block Cutting Algorithm
    8.5. The Scoring Function
    8.6. The Cutting Procedure
    8.7. Critical Building Blocks
    8.8. From the Building Block Folding Model to Structure Prediction: The Scheme
    8.9. Conclusions

9. Protein Threading Statistics: An Attempt to Assess the Significance of a Fold Assignment to a Sequence
   Antoine Marin, Joël Pothier, Karel Zimmermann, and Jean-François Gibrat
    9.1. Introduction
    9.2. Method
        9.2.1. Library of “Cores”
        9.2.2. Development of a Score Function
        9.2.3. Combinatorial Optimization Algorithm
        9.2.4. Empirical Distribution of Scores
        9.2.5. Development of a Benchmark Database
    9.3. Results
    9.4. Discussion
        9.4.1. Use of Filters
        9.4.2. Difficulty of the Benchmark
        9.4.3. Statistical Criterion
        9.4.4. Present Limits of the Method
    9.5. Conclusion

10. Protein Structure Prediction by Threading: Force Field Philosophy, Approaches to Alignment
    Thomas Huber and Andrew E. Torda
    10.1. Introduction
        10.1.1. Common Methodology
    10.2. Force Field Based Scoring
    10.3. Parameterizing Force Fields
        10.3.1. Physically-Based Potential Energies
        10.3.2. Potentials of Mean Force
        10.3.3. Optimized Force Fields
    10.4. Alignment Philosophy
        10.4.1. Common Alignment and Score Methods
        10.4.2. Sausage Alignments
    10.5. Beyond Pairwise Terms
    10.6. Template Libraries
    10.7. Further Outlook and Speculation

11. Predicting Protein Structure Using SAM, UCSC’s Hidden Markov Model Tools
    Kevin Karplus
    11.1. A Naive View of Protein Structure Prediction
    11.2. Fold Recognition
    11.3. Hidden Markov Models
        11.3.1. Multitrack Hidden Markov Models
        11.3.2. Statistical Significance for Hidden Markov Models
    11.4. Using SAM-T2K for Superfamily Modeling
    11.5. Improved Verification of Homology
    11.6. Family-Level Multiple Alignments
    11.7. Modeling Non-Contiguous Domains
    11.8. Building an HMM from a Structural Alignment
    11.9. Improving Existing Multiple Alignments
    11.10. Creating a Multiple Alignment from Unaligned Sequences
    11.11. Conclusions

12. Local Genome Organization, Gene Expression, and Structural Genomics: Evolution at Work
    Wayne Volkmuth and Nickolai Alexandrov
    12.1. Introduction
    12.2. Methods
        12.2.1. Genomes
        12.2.2. Microarray Expression Data
        12.2.3. Fold Assignment
        12.2.4. Non-Redundant Set of Proteins
        12.2.5. Fold Enrichment Along the Genome
        12.2.6. Fold Enrichment for Genes with Similar Patterns of Expression
    12.3. Results
        12.3.1. Fold Enrichment Along the Genome
        12.3.2. Fold Enrichment for Genes with Similar Patterns of Expression
    12.4. Summary and Conclusions

13. Protein Structure Prediction on the Basis of Combinatorial Peptide Library Screening
    Igor Tsigelny, Yuriy Sharikov, Vladimir Kotlovyi, Michael Kelner, and Lynn F. Ten Eyck
    13.1. Concept of the Comprehensive System
    13.2. HMM-ELONGATOR
        13.2.1. Problem Description
        13.2.2. Elongation Strategies

B. Consensus Structure Prediction

14. A User’s Guide to Fold Recognition
    Naomi Siew and Daniel Fischer
    14.1. Introduction
    14.2. Examples of Using Fold Recognition for Biological Research
        14.2.1. Plant Resistance Gene Products
        14.2.2. Acetohydroxyacid Synthase
        14.2.3. Endothelial Cell Protein C/Activated Protein C Receptor
    14.3. How to Fold Recognize?
        14.3.1. Searching for Homologues of Known Structure
        14.3.2. Running Your Favorite Fold Recognition Method
        14.3.3. Running Other Methods
        14.3.4. Why Run More Than One Method?
        14.3.5. 3D-Shotgun Meta-Predictor
    14.4. Summary

15. Structure Prediction Meta Server
    Leszek Rychlewski
    15.1. Introduction
    15.2. The Meta Server
        15.2.1. User Input and Job Status Display
        15.2.2. Job Deposition and Administration
        15.2.3. Request Submission Queuing
        15.2.4. Blast-Filter
        15.2.5. Local and Remote Prediction Services
        15.2.6. Raw Output Converters
        15.2.7. Visualization and Linking
        15.2.8. Interfaces
    15.3. Discussion

Part II. METHODS OF STRUCTURE AND SEQUENCE ALIGNMENT

16. Improved Fold Recognition by Using the PCONS Consensus Approach
    Huisheng Fang, Björn Wallin, Jesper Lundström, Christer von Wowern, and Arne Elofsson
    16.1. Introduction
    16.2. Why are Manual Predictions Better?
        16.2.1. Biological Knowledge
        16.2.2. Structural Analysis
        16.2.3. Consensus Analysis
    16.3. Consensus Predictions in CASP4
    16.4. Pcons
        16.4.1. Collection of Publicly Available Models
        16.4.2. Structural Comparison
        16.4.3. Prediction of Quality of the Models
    16.5. Performance of Pcons
        16.5.1. Performance in LiveBench-2
        16.5.2. Why Does Pcons Perform Better?
    16.6. Pcons-II
        16.6.1. Improvements Using More Servers
        16.6.2. Speed-Up of Structural Comparisons
        16.6.3. Using Better Statistics
        16.6.4. Improvements Using Linear Regression
    16.7. Summary

17. New Insights into Protein Fold Space and Sequence-Structure Relationships
    Ilya N. Shindyalov and Philip E. Bourne
    17.1. Introduction
    17.2. Overview of CE Sequence-Structure Space
    17.3. Scop vs. CE Fold Space Comparison
    17.4. Analysis of Structure Redundancy
        17.4.1. Size of NR Set as a Function of Criteria Used
        17.4.2. Characterization of Chains Excluded from the Set
        17.4.3. Characterization of Similarity Between Chains in the Set
        17.4.4. Complementary Sequence and Structure NR Sets
        17.4.5. Combined NR Set

18. A Flexible Method for Structural Alignment: Applications to Structure Prediction Assessments
    Vladimir Kotlovyi, Igor Tsigelny, and Lynn Ten Eyck
    18.1. Introduction
    18.2. Theoretical Background
    18.3. Algorithms and Their Implementation
    18.4. Representation of Data in XML Forms
    18.5. Timing
    18.6. Web-Servers
    18.7. Illustrative Examples

19. Comparative Analysis of Protein Structure: New Concepts and Approaches for Multiple Structure Alignment
    Chittibabu Guda, Eric D. Scheeff, Philip E. Bourne, and Ilya N. Shindyalov
    19.1. Introduction
    19.2. Algorithm for Aligning Multiple Protein Structures Using Monte Carlo Optimization
        19.2.1. Scoring Function
    19.3. Approaches for Optimization of Multiple Structure Alignment
        19.3.1. Effect of Weights Based on Number of Residues on Alignment Length and Alignment Distance
    19.4. Analysis of Specific Protein Families
        19.4.1. Analysis of an Alignment of Protein Kinases
        19.4.2. Analysis of an Alignment of Aspartic Proteinases
    19.5. Summary

20. Comparative Analysis of Protein Structure: Automated vs. Manual Alignment of the Protein Kinase Family
    Eric D. Scheeff, Philip E. Bourne, and Ilya N. Shindyalov
    20.1. Introduction
    20.2. The Challenge of Automated Protein Structure Alignment
    20.3. A Case Study: Alignment of the Eukaryotic Protein Kinases and Their Relatives
    20.4. An Example of an Automated Alignment: The Combinatorial Extension Algorithm
    20.5. Parameters for the Determination of an “Optimal” Structure Alignment
    20.6. Comparison of CE Alignments with Manual Alignments
    20.7. Conclusion

Index
Preface
Prediction of protein structure is very important today. Whereas more than 17,000 protein structures are stored in the PDB, more than 110,000 proteins are stored in SWISSPROT alone. The ratio of solved crystal structures to the number of discovered proteins is about 0.15, and I do not expect this value to improve in the future. At the same time, the development of genomics has brought an overwhelming amount of DNA sequencing information, which can be, and already is, used for constructing hypothetical proteins. This situation shows the great importance of protein structure prediction.

The field is growing very rapidly. A simple analysis of publications shows that the number of articles containing the words ‘protein structure prediction’ has almost doubled since 1995. Many really great ideas serve as a basis for the current prediction systems. Some of them will evolve into the next generation of prediction software, but some, even very promising ones, will be lost and rediscovered in the future. Here we have tried to include a variety of methods representing the most interesting concepts of current protein structure prediction. This compendium of ideas makes the book an invaluable source for scientists developing prediction methods. In many cases the authors describe successful prediction methods and programs, which also makes the book an invaluable source of information for the numerous users of prediction software.

The first chapter describes the protein structure prediction program PROSPECT, which produces a globally optimal threading alignment for a typical threading energy function and allows users to easily incorporate experimental data as constraints into the threading process. PROSPECT also provides a confidence assessment of a threading result based on a neural network.

The second chapter presents a protein fold-recognition method that selects the best fold model for a given protein sequence from a library of structural hidden Markov models (HMMs). The HMMs are built from protein structures by decomposing them into secondary structure elements and representing those elements by a pre-designed set of submodels.

The third chapter describes a method to fold proteins into simplified three-dimensional structures constructed from small fragments cut out of a representative set of known three-dimensional structures. The three-dimensional protein structures and fragments are represented in a simplified form as a sequence of angle pairs, one angle pair per residue.

Chapter 4 describes the application of HMMs constructed on the basis of structural alignments to protein structure prediction. An example system, HMM-SPECTR, is presented, together with a description of the different types of HMMs based on structural alignments.

Chapter 5 reviews the different methods of extracting information from multiple sequence alignments and illustrates how to use them as a primary source of information. The chapter describes the application to structure prediction of rarely used features such as sequence conservation, variations between sub-families, correlation between the patterns of mutation of pairs of positions, and the distribution of apolar residues.

Chapter 6 illustrates how knowledge of protein three-dimensional structure can be used to identify homologues of known structure, generate sequence-structure alignments, and assist model building. It describes HOMSTRAD, a database of structure-based alignments for protein families of known structure; JOY, a program to annotate local environments in structure-based alignments; and FUGUE, a program to perform sequence-structure homology recognition.

Chapter 7 proposes a different concept of sequence homology. This concept is derived from a periodicity analysis of the physicochemical properties of the residues constituting the primary structures of proteins. The analysis is performed using a front-end processing technique from automatic speech recognition, by which the cepstrum (a measure of the periodic wiggliness of a frequency response) is computed; this leads to a spectral envelope that depicts the subtle periodicity in the physicochemical characteristics of the sequence.

Chapter 8 describes the building block protein folding model. Via a building block assigning algorithm, sequence comparisons, and a weighting scheme, building blocks are assigned to a target protein sequence. The problem of the ‘building block’ is very important for both protein folding modeling and protein structure prediction. Authors of several chapters in this book propose different ‘building blocks’ for discretization of the prediction process. In most cases they do not discuss the physical properties of these blocks, paying attention only to the information coding. The approach of the authors of Chapter 8, which clearly defines both the physical and the informational properties of the building blocks, looks very promising.

Chapter 9 describes a new fold recognition method called FROST (Fold Recognition Oriented Search Tool). It includes 1D and 3D comparison and a database of representative three-dimensional structures. The chapter uses information theory concepts to embed a number of sequence and structure parameters in one scoring function. This approach makes the chapter very elegant and useful for the developers of protein structure prediction systems.

Chapter 10 continues the discussion of how to combine different levels of resolution and representation of a protein, and of the rationale behind score functions for protein structure prediction. Statistical mechanical parameters are used together with purely empirical and even ad hoc parameters.

Chapter 11 describes SAM, one of the most effective HMM systems for biological applications. The chapter shows an approach to fold recognition that relies on HMMs both for selecting the template and for aligning the target to the template. The technique has been used successfully in three of the Critical Assessment of Structure Prediction (CASP) experiments.

Chapter 12 discusses the important link between genomic information and protein structure. The chapter describes clues that could be used to help infer evolutionary relationships via structural similarity and to improve the ability to predict biochemical function. The first such clue is positional conservation along the genome, i.e., nearby genes tend to be structurally related more often than expected by chance alone. The second clue is present in expression data: genes that are correlated in expression are more apt to share a common fold than two randomly chosen genes.

Chapter 13 proposes a comprehensive system for computer-based drug design. The chapter describes the program HMM-ELONGATOR, which predicts putative protein targets based on a set of peptides, drawn from combinatorial libraries, that have been shown to bind a drug molecule.

Chapter 14 shows, on the basis of three examples, how the use of fold recognition has helped biologists in planning and devising experiments and in generating verifiable hypotheses. This chapter also describes the meta-predictor approach to protein structure prediction.

Chapter 15 describes in detail the Structure Prediction Meta Server, which collects prediction models from many high-quality services and translates them into standard formats, enabling convenient analysis of the results. The Meta Server offers an infrastructure for the creation of automated jury algorithms, which analyze the set of results for the user and calculate a reliability score for a consensus prediction.

Chapter 16 describes a new method for fold recognition, Pcons, which uses “consensus analysis.” The chapter demonstrates the advantages of Pcons on the basis of large-scale benchmarking.

Chapter 17 starts the part of the book devoted to the concepts of structural alignment. Proper structural alignment of proteins is the cornerstone of the majority of prediction methods. This chapter introduces several new views of protein fold space that will help to further understand protein evolution and interpret structural similarities. Differences between the manual (SCOP) and automated (CE) approaches to the structural classification problem are described.

Chapter 18 discusses the design principles of a structure alignment system that can be used for structure prediction assessments. This system is based on a hierarchical representation of protein shape. Such a representation makes the system suitable for effective alignment of structures with low similarity.

Chapter 19 describes the Monte Carlo approach to the construction of multiple structural alignments.

Chapter 20 describes a specific example, in which an alignment of eukaryotic protein kinases generated using the combinatorial extension (CE) algorithm is compared with a manually derived alignment. Implications for CE are discussed, as well as implications for automated structural alignment in general.

Overwhelmed by current errands, proposals, and papers, we mostly do not think in global terms about our place in the building of knowledge, the building of science. Nevertheless, it goes on, and in one way or another we build the structure of scientific knowledge. If scientific articles are the ‘bricks’ in this building, books are the cornerstones. I would like to thank all the authors for devoting their time to the writing of the chapters. I hope this book will be useful to professionals and students in the field.
Russell F. Doolittle
Professor, University of California, San Diego; Member of the National Academy of Sciences
Moore’s Law, which predicts the rapidity with which the speed and capacity of computer chips are increasing, is a well-known concept. Less well appreciated is how fast new generations of software follow in its wake to harness all the expanded power. A good example occurs in the field of protein structure prediction, where new approaches to fold recognition, structural alignment and threading continue to appear at a rate that leaves the individual investigator at a loss as to which way to turn to solve any particular problem.
Now there is a very timely book on the subject, edited by Igor Tsigelny, that serves as an excellent guide to the very newest approaches. A wide variety of programs and strategies is discussed, including new applications of hidden Markov models and novel Bayesian approaches for building up models from block fragments. In one way or other, the theme that holds throughout has to do with determining three-dimensional structures for the plethora of newly determined amino acid sequences.
The 20 chapters are from groups around the world. They are well chosen and present sufficient perspective that the interested reader can seek out an appropriate path for his own needs. Having said that, I must add that these are very meaty and detailed renderings.
The reader who immerses himself heavily in them is bound to emerge an experienced modeler.
The advent of structural genomics—aka proteomics—demands resources like this volume.