The GenRE Project - Development of a Genome Research Environment
1 Overview
GenRE will be implemented as a workhorse for the annotation of genome information as done by MIPS (e.g. BFAM, Plants, Fungi). All genomes annotated by the MIPS group will be moved to GenRE, which will allow comprehensive annotation of complex genomic features.
1.1 Introduction
Systematic sequencing of genomes has resulted in substantial amounts of sequence data encoding the structures of all biological macromolecules in a species. Traditionally, this information has been translated into sequence data representing genetic elements, i.e. the functional segments of the DNA such as coding open reading frames or exons, various kinds of RNAs, and, last but not least, promoter elements responsible for condition-dependent, controlled expression.
Complex annotation associates biological information with sequence data. A substantial part of this information can be inferred by computational methods applied systematically, as in the PEDANT sequence analysis package, but other, more specific information needs careful manual curation. Computational methods can, for example, transfer information from known sequence properties such as membrane-spanning segments, secondary structures, folds, PFAM domains, motifs and the like, whereas annotating the most important information, such as sub-cellular location, protein/protein interaction, co-regulation, and membership in pathways or cellular networks, requires expert skills.
Analysis of high-throughput experimental data requires advanced data structures that give computational methods access to information beyond the individual genetic element. On the one hand, basic information as compiled by the generic public data collections must be included; on the other hand, dynamic, inferred data such as homology, fold, and protein/protein data have to be integrated. The conceptual challenge meets the technical one. For instance, a system should be able to cope with complex queries such as the following:
- visualize all known genetic elements and their individual evidence of a small genome based only on the primary information from the sequence contig and the database information
- map all assigned functional classifications to a set of expression analysis data and correlate them to all putative protein/protein interactions
- compare two genomes dynamically for functional neighbourhood relations
1.2 Status and Current Problems
The annotation management systems currently employed have a number of conceptual shortcomings. Because they are based on individual entities represented by database entries, relations such as functional interactions are not annotated. Databases rely on classical flat file structures that are integrated by various passive data integration engines (e.g. SRS, BioRS), which separate data structures from applications. Database integration of this kind is limited to supporting complex queries for the comprehensive and comparative analysis of genome data; it is not suited to representing biological knowledge in terms of data structure and functionality.
1.3 Providing Solutions for Complex Data Representation
Biological knowledge is based on experimental evidence and its molecular interpretation. Functional information is, as long as the molecules involved can be identified, associated with the primary sequence and the corresponding 3-dimensional structure. The correlation between the primary sequence and its functional properties is hardly ever represented by straightforward relations; in most cases it cannot be inferred from the sequence properties. Thus, the information deduced from experimental data is descriptive and needs to be structured to allow for computational operations. For instance, the different types of protein/protein interactions can be represented as coloured graphs, where the colours of the edges may represent the type of interaction as well as the type of experimental evidence.
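The coloured-graph idea can be illustrated with a minimal Java sketch. Everything named here is an assumption made for illustration: the classes Interaction and InteractionGraph, the two enums, and the placeholder protein identifiers are not part of GenRE. The "colour" of an edge is modelled as the pair of interaction type and evidence type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical edge colours; the categories are illustrative assumptions.
enum InteractionType { PHYSICAL, GENETIC }
enum EvidenceType { EXPERIMENTAL, PREDICTED }

// One edge of the graph; its "colour" is the pair (type, evidence).
final class Interaction {
    final String proteinA;
    final String proteinB;
    final InteractionType type;
    final EvidenceType evidence;

    Interaction(String proteinA, String proteinB,
                InteractionType type, EvidenceType evidence) {
        this.proteinA = proteinA;
        this.proteinB = proteinB;
        this.type = type;
        this.evidence = evidence;
    }

    // True if the given protein sits at either end of the edge.
    boolean involves(String protein) {
        return proteinA.equals(protein) || proteinB.equals(protein);
    }

    // The protein at the other end of the edge.
    String partnerOf(String protein) {
        return proteinA.equals(protein) ? proteinB : proteinA;
    }
}

final class InteractionGraph {
    private final List<Interaction> edges = new ArrayList<>();

    void add(Interaction edge) {
        edges.add(edge);
    }

    // Partners of a protein, restricted to edges with the given evidence colour.
    List<String> partners(String protein, EvidenceType evidence) {
        List<String> result = new ArrayList<>();
        for (Interaction edge : edges) {
            if (edge.evidence == evidence && edge.involves(protein)) {
                result.add(edge.partnerOf(protein));
            }
        }
        return result;
    }
}
```

Filtering by colour, as in partners, is the kind of operation that becomes possible once descriptive interaction data is structured in this way.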
One of the main challenges for predictive bioinformatics is to establish methods for the reliable transfer of information from one object to another. For instance, reconstruction of known pathways from one organism to another is one of the primary tasks in the comparative analysis of genomes. Partial information on a common pathway may come from both genomes compared; one might like to know the pathway information in A, its overlaps with pathways in B, and the possible extension in both organisms if the orthologs involved can be identified. To answer this question, structured information must be retrieved not only for the organisms compared, but also for any other organism available. Such an organisation is a prerequisite for any exhaustive search to estimate the correlation function for the quality of the functional prediction (in this case the metabolic context). The computation must weight the available information according to whether the evidence is direct (experimental) or indirect (from comparative analysis).
Computation on genome data is required for any combinatorial approach that tries to combine different types of information, such as relating functional classification to protein/protein interaction or expression analysis data. The GenRE system will allow not only the representation of known and deduced data, but also the inclusion of precomputed predictive information obtained by applying a variety of computational methods such as sequence comparison, structure prediction, co-regulation, and membership in pathways or regulatory networks. The basic principle on the "atomic" level has already been implemented in the PEDANT genome analysis system (Frishman et al., 2001). However, for PEDANT to cope with higher levels of information such as interactions, the current data model has to be extended.
2 Technical Detail
2.1 Status
At present, the annotation of genomes (the information associated with the sequence data) is highly heterogeneous. The various annotation systems differ in coverage and in the quality of the information included. Only a few are regularly updated. Data models and functional properties of the databases differ widely. Thus, the comparative analysis of genome information is handicapped by the heterogeneity of genome annotation.
Problems arise on the client/server side as well as in data management. Examples are:
- Comprehensive interfaces to all genomes are missing for users and application developers
- Different data management approaches are in use, such as flat files and relational database management systems (MySQL, Postgres, ORACLE, Sybase, DB2, ...)
- Insufficient and outdated data models
The combination of the latter two in particular is a serious problem for appropriate data management. While the syntax of the database content can be controlled with reasonable effort by data typing, the semantics are much more difficult to define. Nevertheless, dynamic treatment of semantics is essential for scientific information in order to handle dynamic ontologies representing the current knowledge. Ontologies try not only to build defined vocabularies, but also to assign bidirectional relations (e.g. "is part of" vs. "belongs to"). Seminal data management systems should be able to handle the semantics of their information content.
Two related problems arise from this requirement. The first is the conceptual representation of the data: conventional relational models require a defined ontology representing the semantics already at design time. The second is the dynamic development of biological knowledge, in which the meaning of the data may change. Conventional DBMSs, with their rigid treatment of data models, have serious difficulty modifying the schema because data content and the underlying model are strongly coupled. For example, the content of any table is always tied to its name. Changing the name requires changes in the corresponding code, which creates a strong dependency between the data model and the software.
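As a rough illustration of relations that are defined at run time rather than fixed at design time, the following Java sketch stores each relation together with an inverse, so that a fact asserted in one direction can be queried in the other. The class Ontology, its methods, and the relation pair "is part of" / "has part" are assumptions made for this example; they do not describe the GenRE implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: relation names are data, not part of the class design.
final class Ontology {
    // relation name -> name of its inverse relation
    private final Map<String, String> inverseOf = new HashMap<>();
    // subject -> relation -> objects
    private final Map<String, Map<String, Set<String>>> facts = new HashMap<>();

    // Registers a relation and its inverse; can be called at any time.
    void defineRelation(String relation, String inverse) {
        inverseOf.put(relation, inverse);
        inverseOf.put(inverse, relation);
    }

    // Asserts a fact and, automatically, the inverse fact.
    void relate(String subject, String relation, String object) {
        String inverse = inverseOf.get(relation);
        if (inverse == null) {
            throw new IllegalArgumentException("Unknown relation: " + relation);
        }
        store(subject, relation, object);
        store(object, inverse, subject);
    }

    // All objects linked to the subject by the given relation.
    Set<String> objectsOf(String subject, String relation) {
        Map<String, Set<String>> bySubject = facts.get(subject);
        if (bySubject == null) {
            return Collections.<String>emptySet();
        }
        Set<String> objects = bySubject.get(relation);
        if (objects == null) {
            return Collections.<String>emptySet();
        }
        return Collections.unmodifiableSet(objects);
    }

    private void store(String subject, String relation, String object) {
        facts.computeIfAbsent(subject, k -> new HashMap<>())
             .computeIfAbsent(relation, k -> new LinkedHashSet<>())
             .add(object);
    }
}
```

Because relations are ordinary entries in a map, a new relation can be introduced without changing any class or schema.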
2.2 System Architecture and Design
A suitable system able to overcome the problems mentioned above requires at least three major components:
- A Semantic Database Management System
- A suitable middleware architecture for the integration of existing data sources
- A reliable and modular system for network (Web) based access
2.2.1 Semantic Database Management
The representation of complex data using object-oriented technologies has clear advantages compared to the dissection of data into relational tables. OO techniques allow a more flexible management of data due to their ability to represent inheritance, aggregation and association. A protein, for example, inherits general information from the biomolecule object definition and aggregates with other proteins into a protein-protein complex. Nevertheless, like the conventional approaches, OO databases also require a data model at design time.
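The protein example can be written down as a small Java sketch. The class names Biomolecule, Protein and ProteinComplex and their fields are assumptions chosen for illustration; the sketch only shows inheritance and aggregation, and it also shows the limitation mentioned above, since the model is fixed when the classes are written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// General information shared by all biomolecules (hypothetical fields).
abstract class Biomolecule {
    private final String identifier;
    private final String sequence;

    Biomolecule(String identifier, String sequence) {
        this.identifier = identifier;
        this.sequence = sequence;
    }

    String getIdentifier() { return identifier; }
    String getSequence() { return sequence; }
    int length() { return sequence.length(); }
}

// Inheritance: a protein is a biomolecule and reuses its general information.
final class Protein extends Biomolecule {
    Protein(String identifier, String sequence) {
        super(identifier, sequence);
    }
}

// Aggregation: a complex is built from several proteins.
final class ProteinComplex {
    private final String name;
    private final List<Protein> members = new ArrayList<>();

    ProteinComplex(String name) {
        this.name = name;
    }

    String getName() { return name; }

    void addMember(Protein protein) {
        members.add(protein);
    }

    // Read-only view of the aggregated proteins.
    List<Protein> getMembers() {
        return Collections.unmodifiableList(members);
    }
}
```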
To avoid problems caused by the dynamic reassignment of data to their corresponding semantics, we decided to apply a modified approach. The core concepts of our Semantic Database Management System are:
- Semantics are represented in the form of knowledge objects.
- Knowledge objects are associated by dynamic references with attributes containing the real data.
- Attributes may contain simple data types like strings, numbers, integers… but also more complex data like database queries to external systems. Attributes may also carry methods containing the complex algorithms necessary for semantic checking. This is possible because all attributes are derived from an abstract base class.
With this approach a logical separation of data and semantics is introduced. On the one hand there is a descriptive rule manager dealing with inheritance, association, and aggregation and on the other hand the storage context manager makes sure that data is organized in the current context of the ontological definition.
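The separation of semantics (knowledge objects) from data (attributes) might look roughly like the following Java sketch. It is a simplified illustration under stated assumptions: the names Attribute, StringAttribute, ExternalQueryAttribute and KnowledgeObject are hypothetical, the external query is replaced by a placeholder Supplier, and the rule manager and storage context manager are not modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Abstract base class: every attribute can return its value and check itself.
abstract class Attribute {
    abstract Object value();
    abstract boolean isSemanticallyValid();
}

// A simple data type with a trivial semantic check (non-empty text).
final class StringAttribute extends Attribute {
    private final String text;

    StringAttribute(String text) { this.text = text; }

    @Override Object value() { return text; }
    @Override boolean isSemanticallyValid() { return text != null && !text.isEmpty(); }
}

// Stands in for a query to an external system; the Supplier is a placeholder.
final class ExternalQueryAttribute extends Attribute {
    private final Supplier<String> query;

    ExternalQueryAttribute(Supplier<String> query) { this.query = query; }

    @Override Object value() { return query.get(); }
    @Override boolean isSemanticallyValid() { return query.get() != null; }
}

// A knowledge object carries the semantics; attributes are bound at run time.
final class KnowledgeObject {
    private final String concept;
    private final Map<String, Attribute> attributes = new LinkedHashMap<>();

    KnowledgeObject(String concept) { this.concept = concept; }

    String concept() { return concept; }

    // Dynamic reference: binding a role again replaces the previous attribute.
    void bind(String role, Attribute attribute) { attributes.put(role, attribute); }

    Attribute get(String role) { return attributes.get(role); }

    // Runs the semantic check of every bound attribute.
    boolean check() {
        for (Attribute attribute : attributes.values()) {
            if (!attribute.isSemanticallyValid()) {
                return false;
            }
        }
        return true;
    }
}
```

Since the bindings live in a map rather than in fields, attributes can be reassigned to a knowledge object without changing any class definition.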
2.2.2 Middleware and Webservices
To cope with the given complexity, the system requires a suitable multi-tier layout integrating a strong middleware architecture for the annotation database system as well as all other components like web-services and external access. The layered approach as a basic design architecture ensures functional independence. The system should be able to handle data in a uniform way. Although not completely adoptable, enterprise application integration (EAI) technologies are an excellent starting point for the GenRE development.
For this reason, our design relies heavily on J2EE and CORBA as the middleware and on XML as the markup language for data encapsulation and transport. The J2EE system runs within an application server (JBoss) delivering and driving the Web server as well as the database management systems. J2EE provides ready-to-use components such as naming services (JNDI), security (JAAS), database access (JDBC) etc. Web services are delivered with a special Model-View-Controller architecture based on a reusable Java framework for independent web design.
Description of the Tiers
- Client Tier: most commonly a web browser.
- Web Tier: responsible for accepting user requests and sending pages back to the users.
- Middle Tier: handles the workflow-related issues within the system.
- EJB Tier: in this tier, Enterprise Java Beans handle the business logic of the attached systems. The Beans are responsible for manipulating and retrieving data according to the logic of the user requests. For instance, to perform the semantic checking they read out an algorithm from a GenRE attribute and start its execution on the related data. The Beans are also responsible for the integration of our former databases like MatDB, CYGD, etc. Because EJBs are standardized, Beans attached to different data sources can be used in the same way, which makes it possible to integrate (compare) data from various systems.
- Enterprise Information System (EIS) Tier: contains all data systems like GenRE and MatDB, but also CORBA interfaces for external access to services like BioRS, etc.
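The idea that Beans attached to different data sources can be used in the same way can be sketched in plain Java. This is not EJB code and does not use any J2EE API: the interface AnnotationSource, the in-memory implementation, and the class AnnotationIntegrator are hypothetical stand-ins, with maps taking the place of real databases such as MatDB or CYGD.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Uniform contract that every data source adapter would fulfil (assumption).
interface AnnotationSource {
    String name();
    Optional<String> annotationFor(String elementId);
}

// In-memory stand-in for a component attached to one concrete data system.
final class InMemorySource implements AnnotationSource {
    private final String name;
    private final Map<String, String> entries;

    InMemorySource(String name, Map<String, String> entries) {
        this.name = name;
        this.entries = entries;
    }

    @Override public String name() { return name; }

    @Override public Optional<String> annotationFor(String elementId) {
        return Optional.ofNullable(entries.get(elementId));
    }
}

// Collects what each source knows about one genetic element, keyed by source name.
final class AnnotationIntegrator {
    static Map<String, String> collect(List<AnnotationSource> sources, String elementId) {
        Map<String, String> bySource = new LinkedHashMap<>();
        for (AnnotationSource source : sources) {
            source.annotationFor(elementId)
                  .ifPresent(annotation -> bySource.put(source.name(), annotation));
        }
        return bySource;
    }
}
```

The integrator depends only on the shared interface, so a further data system could be attached by adding another implementation without touching the comparison logic.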
3 General
3.1 Project related information
3.1.1 Project members
- Dr. Volker Stümpflen (project manager)
- Matthias Oesterheld
