Structural Genomics
1. Structural genomics: an emerging scientific field.
Structural genomics is an emerging area of biological research aimed at determining a complete representative set of protein structures through the application of high-throughput structure determination techniques. Particular areas of work include exhaustive structural analysis of all proteins from a number of model organisms with completely sequenced genomes, and the class-based approach, which focuses on protein groups of special medical or biotechnological interest (Terwilliger et al., 1998; Frishman & Mewes, 1999). A number of large-scale projects of this type have been launched in the USA, Japan, and Germany.
2. Structural genomics and bioinformatics.
Computational molecular biology created the rationale for structural genomics by deriving the general principles of protein structure organization, by providing a tentative upper bound on the total number of existing protein folds, and by supplying efficient methods for fold prediction and classification. Comparative protein sequence and structure analysis is a major cost-saving factor in high-throughput structure determination, because it leads to the most economical selection of targets for X-ray crystallography or NMR studies. Bioinformatics also plays a crucial role in the assessment and classification of the new structural data obtained. Bioinformatics research, in turn, benefits directly from the flood of data generated by structural genomics projects, which results in improved algorithms, software, and databanks.
3. Structural genomics at MIPS.
3.1 The PEDANT genome database as a structural genomics resource.
PEDANT is a software system for high-throughput analysis of large biological sequence sets (Riley et al., 2005). The principal features of PEDANT are: (i) completely automatic processing of data using a wide range of bioinformatics methods, (ii) manual refinement of annotation, (iii) automatic and manual assignment of gene products to a number of functional and structural categories, (iv) extensive hyperlinked protein reports, and (v) advanced DNA and protein viewers. The system is easily extensible: custom methods, databases, and categories can be included with minimal or no programming effort. PEDANT is actively used as a collaborative environment to support several ongoing genome sequencing projects. The main purpose of the PEDANT genome database is to quickly disseminate well-organized information on completely sequenced and unfinished genomes. It currently includes >300 genomic sequences and in many cases serves as the only source of exhaustive information on a given genome. In particular, the availability of structural predictions for over 500000 genomic proteins makes PEDANT one of the most extensive structural genomics resources available in the public domain.
3.2 Comparative analysis of protein folds in complete genomes.
One of our mainstream activities is the assessment of the phylogenetic distribution of protein architectures using secondary structure prediction, threading, and similarity-based fold recognition methods. Structural information became an important part of genome annotation at MIPS very early on. Secondary structure predictions and assignments were provided for the entire yeast genome released in 1997 (Mewes et al., 1997). Analysis of these predictions and comparison with other genomes allowed us to establish previously unknown tendencies in protein fold occurrence in complete genomes (Frishman & Mewes, 1997). In that article we also proposed conducting an exhaustive structure determination for all soluble proteins from a small microbial organism. More recently, sensitive sequence searches led to the identification of at least one known structural domain in up to 50% of genomic proteins (Frishman et al., 2001). These data, combined with the complete functional characterization of proteins available in PEDANT, provide a solid foundation for further research on the structural census of different organism groups. One specific result, obtained from the comparison of the A.thaliana genome with three other model organisms, is that there is a significant difference in the structural properties of proteins in uni- and multicellular organisms (Mayer et al., 1999). Another major distinction is between prokaryotic and eukaryotic globular proteins. Eukaryotic proteomes contain many more long, multi-domain proteins and are thus expected to show different patterns of structure conservation, packing, and interaction with the solvent. Finally, even among prokaryotic species, significant variations in structural features, such as amino acid composition, secondary structure element length, size of internal cavities, and hydrogen bonding, can be expected, especially in organisms adapted to extreme living conditions.
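As a rough illustration of the kind of per-genome tabulation such a comparison rests on, the sketch below computes the fraction of proteins with at least one structural assignment and the distribution of structural classes among the assigned proteins. It is a minimal sketch, not the PEDANT implementation: the function names, the class labels, and the toy data are all hypothetical.

```python
from collections import Counter


def coverage(assignments):
    """Fraction of proteins that have at least one structural assignment.

    `assignments` maps a protein id to a structural class label, or to
    None when no assignment is available (hypothetical input format).
    """
    if not assignments:
        return 0.0
    assigned = sum(label is not None for label in assignments.values())
    return assigned / len(assignments)


def class_fractions(assignments):
    """Distribution of structural classes among the assigned proteins."""
    labels = [label for label in assignments.values() if label is not None]
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Toy, made-up data for two imaginary genomes.
genomes = {
    "genome_A": {"a1": "alpha", "a2": "alpha/beta", "a3": None, "a4": "alpha/beta"},
    "genome_B": {"b1": "beta", "b2": None},
}

for name, assignments in genomes.items():
    print(name, coverage(assignments), class_fractions(assignments))
```

Comparing the resulting fractions across genomes is the simplest form of the analysis described above; any statistical treatment of the differences is left out of this sketch.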
Comparative analysis of protein folds encoded in complete genomes also provides insights into the mechanisms of individual cellular processes, such as chaperone-mediated protein folding. We have conducted a fold prediction analysis of a set of experimentally derived GroEL substrates from E.coli and compared the results with the predicted structural properties of the complete E.coli proteome (Houry et al., 1999; Kerner et al., 2005). Structural investigations of the evolution of disease-related proteins have also been conducted (Wong et al., 2005).
3.3 Knowledge-based target selection by STRUDEL.
The key to quick progress in structure determination projects lies in an optimal target selection strategy. Current approaches typically rely on all-against-all sequence comparison with subsequent clustering and domain boundary analysis. However, such methods do not explicitly incorporate the main requirement for efficient target selection, namely the need to choose the minimal number of soluble proteins covering the maximal number of currently unknown folds. We have developed a flexible computational procedure called STRUDEL (Structure Determination Logic) that directly incorporates the whole body of annotation available in the PEDANT genome database into the sequence clustering and selection process in order to identify proteins that are likely to possess currently unknown structural domains (Frishman, 2002). Filtering out gene products based on predicted structural features, such as known 3D structures, regions of low complexity, and transmembrane regions, reduces the complexity of neighbor relationships between sequences and all but eliminates the need for further partitioning of single-linkage clusters into disjoint protein groups corresponding to homologous families. The algorithm was tested in a large-scale computational experiment in which exemplary target selection was conducted for 32 prokaryotic genomes. An important prerequisite for the success of this work is to keep the data collection up to date and to monitor the progress of other high-throughput projects. For this reason, the system must be designed as a dynamic resource that can re-assess the selection of targets quickly when new data become available. Further potential improvements include the development of better methods to distinguish between soluble and membrane-bound proteins and to predict the propensity of proteins to crystallize.
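The general idea of filtering first and clustering second can be sketched as follows. This is a minimal illustration under stated assumptions, not the STRUDEL implementation: the annotation flags (`known_3d`, `low_complexity`, `transmembrane`), the input format, and the rule of picking the shortest protein as cluster representative are all hypothetical choices made for the example. Similar pairs are assumed to come from a prior all-against-all sequence comparison.

```python
from collections import defaultdict


def select_targets(proteins, similar_pairs):
    """Filter proteins by annotation, then pick one target per cluster.

    proteins: dict id -> annotation dict with a "length" entry and
        optional boolean flags (hypothetical names).
    similar_pairs: iterable of (id_a, id_b) pairs judged similar.
    """
    # Step 1: drop proteins with features that make them poor targets.
    excluded_flags = ("known_3d", "low_complexity", "transmembrane")
    candidates = {
        pid for pid, ann in proteins.items()
        if not any(ann.get(flag) for flag in excluded_flags)
    }

    # Step 2: single-linkage clustering of the remaining proteins,
    # implemented with a union-find structure.
    parent = {pid: pid for pid in candidates}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in similar_pairs:
        if a in candidates and b in candidates:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for pid in candidates:
        clusters[find(pid)].append(pid)

    # Step 3: one representative per cluster (shortest protein here,
    # an arbitrary rule chosen only for this sketch).
    return sorted(
        min(members, key=lambda p: (proteins[p]["length"], p))
        for members in clusters.values()
    )


# Toy, made-up input: p1 has a known structure, p4 is a membrane protein.
example_proteins = {
    "p1": {"length": 300, "known_3d": True},
    "p2": {"length": 250},
    "p3": {"length": 180},
    "p4": {"length": 400, "transmembrane": True},
    "p5": {"length": 220},
}
example_pairs = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]

print(select_targets(example_proteins, example_pairs))
```

In the toy input, all five proteins would form a single chain-like single-linkage cluster if clustered directly. Removing p1 and p4 beforehand breaks the chain, so p5 ends up in its own cluster, which illustrates why filtering simplifies the neighbor relationships between sequences.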
3.4 Structural genomics of membrane-bound proteins.
Membrane proteins play a central role in many important cellular processes. In a project funded by BMBF and conducted in collaboration with MPI für Biochemie, Forschungsinstitut für Molekulare Pharmakologie, MPI für Biophysik, and MPI für Infektionsbiologie, we aim to increase the number of known 3D structures of integral membrane proteins through a combination of innovative bioinformatics, biotechnological, crystallographic, and NMR techniques. Carefully selected proteins from S.typhimurium and H.pylori will be subjected to structure determination. In parallel, two specific membrane protein classes - fumarate reductases and multiple drug resistance transporters - from a wide spectrum of prokaryotic organisms will be systematically investigated. It is anticipated that the experimental complications in structure determination of membrane proteins will be compensated for by intelligent application of structural genomics approaches. In particular, it will be crucial to distinguish genuine membrane proteins from small soluble proteins that are part of membrane-bound multiprotein complexes. Furthermore, current expression strategies do not make it possible to study multiprotein complexes residing in the membrane in intact form. An important task for bioinformatics will be to predict the probability that each candidate protein can be expressed and crystallized and to determine which method of structure determination - X-ray crystallography, NMR spectroscopy, or electron microscopy - is the most appropriate.
3.5 Large-scale structure determination of M.tuberculosis proteins and their ligand complexes.
This project is also funded by BMBF and is conducted in collaboration with EMBL-Hamburg, MPI für Infektionsbiologie, and several industrial partners. The goal of the project is to determine up to 50 structures of selected M.tuberculosis proteins and up to 100 complexes with low molecular weight ligands. For 20 of these targets, directed development of active agents will be undertaken. Among the potential candidates for structure determination are gene products essential for the viability of this organism, factors determining resistance to antibiotics and other therapeutic agents, and factors determining the pathogenicity of M.tuberculosis. Such targets will be identified using both bioinformatics analysis and experimental proteomics approaches. The bioinformatics infrastructure will consist of the PEDANT software and database of systematically annotated genomes as well as the integrated system for target tracking developed in collaboration with Biomax Informatics AG.
3.6 Structural genomics of the plant Arabidopsis thaliana.
In a further structural genomics effort, initiated in collaboration with several partners from the USA, Japan, and Israel (Principal Investigator - Prof. J.Markley, University of Wisconsin), we are providing bioinformatics support for high-throughput structure determination of A.thaliana proteins. The A.thaliana genome database developed at the Institute for Bioinformatics already contains a wealth of manually verified data on the functional and structural properties of gene products from this organism. Since the genome displays a very high degree of duplication, efficient clustering of amino acid sequences and protein family analysis will be especially challenging. The main goal of this project is the structural characterization of A.thaliana gene products relevant to human disease.
4. Publications.
- Frishman D, Mewes HW (1997) Protein structural classes in five complete genomes. Nature Struct. Biol. 4, 626-628.
- Mewes HW, Albermann K, Baehr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, Pfeiffer F, Zollner A (1997) The yeast genome directory. Nature 387, 7-65.
- Terwilliger TC, Waldo G, Peat TS, Newman JM, Chu K, Berendzen J (1998) Class-directed structure determination: foundation for a protein structure initiative. Protein Sci. 7, 1851-1856.
- Mayer K, Schueller C, Wambutt R, ..., Frishman D, ... (1999) Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402, 769-777.
- Houry WA, Frishman D, Eckerskorn C, Lottspeich F, Hartl FU (1999) Identification of in vivo substrates of the chaperonin GroEL. Nature 402, 147-154.
- Frishman D, Mewes HW (1999) Genome Based Structural Biology. Progr. Biophys. Mol. Biol. 72, 1-17.
- Frishman D, Albermann K, Hani J, Heumann K, Metanomski A, Zollner A, Mewes HW (2001) Functional and structural genomics using PEDANT. Bioinformatics 17, 44-57.
- Frishman D (2002) Knowledge-based selection of targets for structural genomics. Protein Engineering 15, 169-183.
- Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D (2005) The PEDANT genome database in 2005. Nucleic Acids Res. 33 (Database issue), D308-D310.
- Wong P, Fritz A, Frishman D (2005) Designability, Aggregation Propensity and Duplication of Disease-related Proteins. Protein Eng. Des. Sel. 18, 503-508.
- Kerner MJ, Naylor DJ, Ishihama Y, Maier T, Chang HC, Stines AP, Georgopoulos C, Frishman D, Hayer-Hartl M, Mann M, Hartl FU (2005) Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli. Cell 122, 209-220.
