The European Molecular Biology Linked Original Resources (TEMBLOR)

ProViz is a visualisation tool for protein-protein interaction graphs. It uses the Tulip library to display graphs.

Features:
- Load and save graphs in Tulip and PSI-MI formats
- Efficient and fast navigation through the graphs
- Uses ontologies described in XML format
- Clustering and exploration tools
- Multiple views of the same graph, each of which may be filtered
- Possibility to annotate each node and each edge with a comment

Public: Biologists who wish to explore interaction graphs between biomolecules. These graphs can be produced by database search, conversion of PSI files, or computation (predictive methods). ProViz provides facilities to navigate through large graphs and biologically relevant exploration functions.

ProViz is developed for the IntAct project by the LaBRI, Bordeaux, France. Contact: David Sherman. Web: http://cbi.labri.fr/eng/proviz.html

Partner 22 (CCLRC) has responsibility for data harvesting from X-ray structure solution, principally via programs in the CCP4 suite under WP9.9, "X-ray structure harvesting development". In broad terms, we have done this by improving the data management within CCP4, increasing the amount of data and metadata exported by CCP4 programs, and by providing utilities to ease the deposition process for crystallographers. In parallel with this, there are major changes being made to the CCP4 suite, in particular to its architecture and to the software libraries used by the applications. The groundwork for these changes has been laid by release 5.0 of the CCP4 suite in March 2004. Release 5.1, expected in early 2005, will begin the process of providing new applications and new automation schemes, with older programs being phased out (including some of those that provide harvesting information). From the point of view of Data Harvesting, it is necessary to transfer existing processes to the new components of the suite. The harvesting mechanism must also be tied in closely with the developing data model, which will be used for automation within CCP4. This work is clearly open-ended. However, it would be useful to begin the process under the auspices of Temblor by including the necessary changes in CCP4 5.1. As stated above, this implies a timescale of early 2005 for completion of this work. Participant 22 has made further additions to the data harvesting system and the Data Harvesting Management Tool contains some validation of data harvesting files. Version 4 of the AutoDep software allows local validation of deposition data. This will be distributed by CCP4. Data Harvesting is now fully supported by the CCP4 GUI, with data harvesting files being organised by ccp4i project. Tools have been added for the import and analysis of protein sequence information. Version 5.0 of the CCP4 software suite was released in May 2004, and included a number of developments from Temblor. 
There has been a major effort to improve and rationalise the harvesting information output by the CCP4 refinement program REFMAC.

A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. iHOP provides this network as a natural way of accessing millions of PubMed abstracts. By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource, bringing all advantages of the Internet to scientific literature research.
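
The idea of using genes and proteins as links between sentences can be illustrated with a minimal Python sketch. It is not iHOP's implementation: the gene dictionary `GENE_NAMES`, the function `cooccurrence_network` and the example sentences are all hypothetical, and the tokenisation is deliberately naive (a real system would need synonym and ambiguity handling).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical gene/protein dictionary, for illustration only.
GENE_NAMES = {"TP53", "MDM2", "BRCA1"}


def cooccurrence_network(sentences):
    """Map each pair of gene names to the indices of sentences mentioning both."""
    links = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        # Naive tokenisation: split on whitespace and strip common punctuation.
        tokens = {tok.strip(".,;()") for tok in sentence.split()}
        found = sorted(tokens & GENE_NAMES)
        for a, b in combinations(found, 2):
            links[(a, b)].append(idx)
    return dict(links)


# Made-up example sentences standing in for abstract text.
sentences = [
    "MDM2 binds TP53 and promotes its degradation.",
    "BRCA1 is involved in DNA repair.",
    "Loss of TP53 or BRCA1 increases tumour risk.",
]
network = cooccurrence_network(sentences)
```

Each key of `network` is an edge between two genes, and the stored sentence indices are the "hyperlinks" a reader could follow from one gene to the sentences that connect it to another.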

As of December 2005, ArrayExpress contained gene expression and other micro-array data from almost 35,000 hybridisations, comprising over 1200 studies and covering 70 different species (see figure below). This is more than twice the amount of data predicted in the DESPRAD proposal. Most of the data are related to peer-reviewed publications. The available studies cover a wide variety of experiment types, such as gene expression related to compound treatments, disease states, organism part comparisons, or developmental studies. For instance, the experiment with accession number E-TOXM-16 investigates whether genotoxic carcinogens, at doses known to induce liver tumours in the rat bioassay, deregulate a common set of genes in a short-term in vivo study. Raw and normalised data are provided. The experiment uses 137 hybridisations on 126 different samples on the Affymetrix U34A array. It combines the experimental factors compound, dose and time. The experiment E-UMCU-12 studies exit from and entry into quiescence in a nine-day glucose-starvation stationary-phase culture of the yeast Saccharomyces cerevisiae. It provides time series data for 34 time points, with raw, normalised, and normalised smoothed data. Among the other gene expression datasets in the database are human and mouse tissue expression data (e.g., E-AFMX-4, E-AFMX-5), and Arabidopsis thaliana development and differentiation expression data (e.g., E-AFMX-8). Slightly over 20% of the gene expression experiments provide time course data. Roughly a third of the experiments have been performed on the Affymetrix platform.

Temblor partners have produced protein-protein interaction data sets based on novel experimental methods for complex and phosphorylation analysis, respectively. These were integrated into the IntAct database as a practical test of the defined standards and the tools developed. The generated datasets have been published as Lasserre et al. (PMID 16858726) and Gruhler et al. (PMID 16088002).

FEMME database v1.0 stores the topological and geometrical features of medium resolution data solved by 3D-EM or high-medium resolution data simulated from atomic co-ordinates regardless of the resolution achieved. http://www.biocomp.cnb.uam.es/FEMME/ http://www.biocomp.cnb.uam.es/FEMME/SEARCH/search_form.html

MIAMExpress is a tool for submitting data to the ArrayExpress database. The tool takes you through a series of forms, where you will be asked to provide information about your experiment. Full documentation is available from http://www.ebi.ac.uk/miamexpress/help/. The ArrayExpress curation tool consists of two components: the Submission Tracker and the MAGE Object Editor. The Submission Tracker is a Java-based tracking system that automates the loading and processing of MAGE-ML format FTP submissions to ArrayExpress. The tracker detects new submissions, validates them and automatically loads those that are valid. The system provides error reports to the ArrayExpress curators and external submitters and allows submissions to be re-queued for loading once corrected. The MAGE Object Editor is part of the curation tool and provides an interface for making changes to submissions deposited in the ArrayExpress database. It retrieves data from the ArrayExpress database as MAGE objects associated with Experiment, Array, and Protocol, and displays them through templates in which the data can be modified and submitted back to ArrayExpress.

The redesigned website for the EBI-MSD incorporates all the work undertaken under Temblor, including the new search systems linked to the relational database. We continue to maintain, develop and build this site to complete all the deliverables for education, tutorials and documentation. Extensive documentation has been completed and made available, together with on-line tutorials.

MIAME describes the data and metadata that authors must provide to support conclusions drawn from a micro-array experiment, so that the experiment can be interpreted unambiguously and potentially reproduced. The MIAME document has two quite distinct aspects that serve two quite distinct purposes. The first is a vocabulary to describe the logical structure that is common to most micro-array experiments, and the second is a simple checklist that can be consulted by those involved in the publication of micro-array-based data. This structure provides a simple framework for describing a micro-array experiment, which might easily be adapted for the development of micro-array data management software. The final version of the Minimum Information About a Micro-array Experiment (MIAME) was released in January 2005. The full description is available from http://www.mged.org/Workgroups/MIAME/miame_checklist.html. During the TEMBLOR project, and largely due to TEMBLOR dissemination activities, MIAME was accepted by most of the major scientific journals as a requirement for publishing micro-array data. MAGE consists of three parts: the object model MAGE-OM, the data exchange format MAGE-ML and the toolkit MAGEstk. MAGE-ML has been automatically derived from the Microarray Gene Expression Object Model (MAGE-OM), which is developed and described using the Unified Modelling Language (UML), a standard language for describing object models. Models described using UML have advantages over pure XML technologies (DTDs or XML Schemas) in many respects, especially for didactic purposes.

Within Work Package 9.6, the work on access to reaction mechanisms (De23) and the interface to metabolic pathways (De24) was done with Partner 23. Prior to TEMBLOR, Professor Gasteiger's group had converted the well-known Biochemical Pathways wall chart from Roche Applied Science into a reaction database, BioPath, and interfaced a web-based retrieval system called C@ROL to it. TEMBLOR contributed to enhancing the wide variety of search methods for chemical structures, enzymes, and reactions that allow one to explore the endogenous metabolism of different species. Major features of this database are that each molecule is represented by lists of all its atoms and bonds (as connection tables), and that in each reaction the reaction centres, that is, the atoms and bonds directly involved in the bond rearrangement process, are marked. The information in the database has been enriched by a set of diverse 3D structure conformations generated by the programs CORINA and ROTATE. The database is accessible on the Internet at http://www2.chemie.uni-erlangen.de/services/biopath/index.html and http://www.mol-net.de/databases/biopath.html (see also http://www.mol-net.de/software/carol/index.html). The information in this database can be used to explore enzyme inhibitors as transition state mimics. Furthermore, it was shown how the classification of biochemical reactions based on physicochemical effects at the reaction site corresponds with the classification of enzymes by the EC code (see Reitz, M., Sacher, O., Tarkhov, A., Trumbach, D. & Gasteiger, J. Enabling the exploration of biochemical pathways. Org. Biomol. Chem. 2, 3226-3237 (2004)). The current system covers prokaryotes, plants, yeasts and animals across all pathways, with 2175 biochemical transformations.

AUTODEP is a metadata-driven deposition interface and server that prepares entries for loading into the EBI-MSD database. The system is flexible, extensible and easy to maintain and manage. XML-based dictionaries drive the deposition interface, which stores data in XML format. Having the data in XML format is extremely beneficial, as it is easy to parse and can be transformed into any other required format, such as PDB.

The GOA project provides high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Since joining the GO Consortium (GOC) in 2001, GOA has been primarily responsible for the integration and release of GO annotation to the human proteome. Starting in 2006, as part of a new GOC reference genome working group, GOA is committed to the comprehensive annotation of a set of disease-related gene products in human, mouse and rat. It is hoped that by generating a reliable set of GO annotations to these genomes, the GOC will empower comparative methods used in first-pass annotation of other proteomes. Because of the multi-species nature of UniProtKB, GOA also assists in the curation of another 100,000 species. This involves electronic annotation and the integration of high-quality GO annotation from many model organism and specialist groups. This effort ensures that the GOA datasets remain a key reference and a comprehensive source of annotations for UniProtKB.

IntAct provides an open source database and toolkit for the storage, presentation and analysis of protein interactions. The web interface provides both textual and graphical representations of protein interactions, and allows exploration of interaction networks in the context of the GO annotations of the interacting proteins. A web service allows direct computational access to retrieve interaction networks in XML format. As of October 2005, IntAct contains approximately 65,000 binary interactions imported from the literature and curated in collaboration with the Swiss-Prot team, making intensive use of controlled vocabularies to ensure data consistency. All IntAct software, data and controlled vocabularies are available at http://www.ebi.ac.uk/intact.

The IMEx consortium is a group of major public interaction data providers sharing curation effort and exchanging completed records on molecular interaction data, similar to successful global collaborations for protein and DNA sequences and for macromolecular structures. For details please see http://imex.sf.net

IMEx defines three types of membership:
- Archival: IMEx partner commits to producing relevant numbers of records curated to IMEx standard, and commits to importing all IMEx records provided by IMEx partners, to provide a full dataset of globally available IMEx data.
- Topical: IMEx partner commits to producing relevant numbers of records curated to IMEx standard, but does not commit to importing all available IMEx records. This is suitable e.g. for a model organism database only interested in relevant IMEx records.
- Observer: Prospective future IMEx consortium member.

IMEx partners:
- DIP (http://dip.doe-mbi.ucla.edu) (Archival)
- IntAct (http://www.ebi.ac.uk/intact) (Archival)
- MINT (http://mint.bio.uniroma2.it/mint) (Topical)
- MPact (http://mips.gsf.de/genre/proj/mpact) (Topical)
- BioGRID (http://www.thebiogrid.org/) (Observer)
- BIND (http://www.blueprint.org) (Currently inactive)

The IMEx consortium is open to the participation of additional partners.

In the framework of the Integr8 project, we have developed a database of homologous genes dedicated to the comparative analysis of completely sequenced genomes: HOGENOM. The HOGENOM database is directly based on the sequence data from UniProt, and notably on the Genome Review section, that provides a coherent and non-redundant collection of proteins from complete genomes. Homologous protein genes are classified into families on the basis of BLAST similarity searches between protein sequences and, for each family, a multiple alignment and a phylogenetic tree are computed.
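
As a rough illustration of grouping proteins into families from pairwise similarity hits, the following Python sketch applies single-linkage grouping with a union-find structure. This is an assumption made for the example: the text above only states that families are built from BLAST similarity searches, and the actual HOGENOM criteria are not described here. The protein identifiers, the `hits` list and the function `build_families` are hypothetical.

```python
def build_families(proteins, hits):
    """Group proteins into families: any two proteins joined by a chain
    of similarity hits end up in the same family (single linkage)."""
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())


# Hypothetical identifiers and pre-filtered significant hits.
proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
families = build_families(proteins, hits)
```

In this toy input, P1 and P3 share a family only through P2. Single linkage is simple but can chain unrelated proteins together through multi-domain sequences, which is one reason production pipelines typically add coverage or identity thresholds before grouping.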

ArrayExpress consists of the database itself, a data loader and a data access interface. ArrayExpress runs on the Oracle RDBMS. However, we use very few Oracle-specific features, so porting to other DBMS platforms is possible, and only the DDL scripts would have to be adapted to a different syntax. MAGEloader uses an Oracle sequence for generating unique object identifiers; where the underlying RDBMS does not provide sequences, some methods (localised inside a single class) would need to be changed to generate identifiers in some other way. We have been contacted by groups who intend to port ArrayExpress to other RDBMSs and we know of several such efforts. The database E/R model was auto-generated from a modified MAGE-OM by our own tool. The database contains more than 200 tables, derived from around 150 classes in the MAGE-OM. The mapping used is relatively straightforward: classes are mapped to tables one-to-one, each object can be distributed across several tables according to the inheritance hierarchy, 1-to-1 and 1-to-many associations are mapped to foreign keys, and many-to-many associations are mapped to link tables. Some local modifications of the object model were made to improve the performance of common queries. The database has been described in the publication U. Sarkans, H. Parkinson, G. Garcia Lara, A. Oezcimen, A. Sharma, N. Abeygunawardena, S. Contrino, E. Holloway, P. Rocca-Serra, G. Mukherjee, M. Shojatalab, M. Kapushesky, S. Sansone, A. Farne, T. Rayner, and A. Brazma. The ArrayExpress gene expression database: a software engineering and implementation perspective. Bioinformatics, 2005, Vol. 21 No. 8: 1495-1501. The database and all detailed documentation are available from http://www.ebi.ac.uk/arrayexpress
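
The mapping rules described above can be sketched in a few lines of Python. This is an illustrative toy generator, not the tool used for ArrayExpress: the `Cls` and `Assoc` types, the column types and the four-class model are assumptions, and the class names are a simplified stand-in rather than the actual ArrayExpress schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Assoc:
    name: str
    target: str
    many_to_many: bool = False  # False covers 1-to-1 and 1-to-many


@dataclass
class Cls:
    name: str
    parent: Optional[str] = None
    attrs: List[str] = field(default_factory=list)
    assocs: List[Assoc] = field(default_factory=list)


def generate_ddl(classes):
    """Emit one CREATE TABLE per class, plus one link table per many-to-many."""
    tables, links = [], []
    for cls in classes:
        # A subclass row shares its id with the parent row, so one object is
        # spread over the tables of its inheritance chain.
        pk = "id INTEGER PRIMARY KEY"
        if cls.parent:
            pk += f" REFERENCES {cls.parent}(id)"
        cols = [pk] + [f"{a} VARCHAR(255)" for a in cls.attrs]
        for assoc in cls.assocs:
            if assoc.many_to_many:
                links.append(
                    f"CREATE TABLE {cls.name}_{assoc.name} ("
                    f"{cls.name}_id INTEGER REFERENCES {cls.name}(id), "
                    f"{assoc.target}_id INTEGER REFERENCES {assoc.target}(id));"
                )
            else:
                cols.append(f"{assoc.name}_id INTEGER REFERENCES {assoc.target}(id)")
        tables.append(f"CREATE TABLE {cls.name} ({', '.join(cols)});")
    return tables + links


# Simplified, hypothetical model; not the real MAGE-OM class definitions.
model = [
    Cls("Identifiable", attrs=["name"]),
    Cls("Contact", parent="Identifiable", attrs=["email"]),
    Cls("Experiment", parent="Identifiable",
        assocs=[Assoc("providers", "Contact", many_to_many=True)]),
    Cls("BioAssay", parent="Identifiable",
        assocs=[Assoc("experiment", "Experiment")]),
]
ddl = generate_ddl(model)
```

Here the four classes yield four tables and one link table, `Experiment_providers`. The sketch also suggests why the real schema has more tables (over 200) than classes (around 150): every many-to-many association adds a table of its own.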

New algorithms have been developed and published in high-impact journals: for predicting gene function from micro-array data, for identifying periodic genes in micro-array data, for identifying co-expressed blocks of genes and samples, for data normalisation, for K-medoids clustering, and for comparing the results of different clusterings. Jointly with a wide range of collaborators, numerous existing algorithms have been modified and implemented in Expression Profiler. These include the "signature" algorithm (developed at the Weizmann Institute), the "between group analysis" algorithm (developed at Cork University), and various normalisation algorithms. All are available via the Expression Profiler interface (see http://www.ebi.ac.uk/expressionprofiler/).

Hierarchical and non-hierarchical clustering are among the most widely used gene expression data analysis methods. Different clustering methods and different parameters often produce different results. Understanding how different clustering results relate to each other is important if we are to understand the biological relevance of different clusters and the underlying data. The clustering results can sometimes be rather different, in which case the problem often goes beyond a one-to-one relationship between the clusters in each clustering. We developed a clustering comparison method, and an implementation of it, that finds the correspondence between groups of clusters in two different clustering results. The number of clusters in the results to be compared may be very different, and we allow the comparison of either two non-hierarchical clusterings (such as K-means), or a flat and a hierarchical clustering. Using simulated data we show that our method can be used to approximate the "true" clusters, while using real gene expression data we show that we can restore biologically meaningful clusters. The method is available online as part of the Expression Profiler tool-set at http://www.ebi.ac.uk/expressionprofiler/. The software is open source.
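
To show why comparing two clusterings can go beyond a one-to-one relationship, here is a minimal Python sketch based on a contingency table of shared items. It is a simplified stand-in, not the published comparison method: the functions `contingency` and `matches`, the `min_shared` threshold and the cluster labels are all hypothetical.

```python
from collections import Counter


def contingency(labels_a, labels_b):
    """Count how many items fall into each pair (cluster in A, cluster in B)."""
    return Counter(zip(labels_a, labels_b))


def matches(labels_a, labels_b, min_shared=2):
    """For each cluster in A, list the clusters in B sharing at least
    `min_shared` items with it."""
    result = {}
    for (a, b), n in sorted(contingency(labels_a, labels_b).items()):
        if n >= min_shared:
            result.setdefault(a, []).append(b)
    return result


# Two hypothetical clusterings of the same eight genes (position i = gene i).
labels_a = ["A1", "A1", "A1", "A1", "A2", "A2", "A2", "A2"]
labels_b = ["B1", "B1", "B2", "B2", "B3", "B3", "B3", "B1"]
correspondence = matches(labels_a, labels_b)
```

In this toy case cluster A1 is split between B1 and B2, while A2 corresponds mainly to B3, so a strict one-to-one matching would misrepresent A1. Handling such splits systematically, including against a hierarchical clustering, is the problem the method described above addresses.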

Molecular interaction data is a key resource in modern biomedical research, and molecular interaction datasets are currently generated on a large scale, providing from one to tens of thousands of interactions per experiment. These interaction data sets are provided in many different forms, from simple pairs of protein names to detailed textual descriptions and XML formats, and are collected in different databases, each with its own database schema. In 2004, the HUPO Proteomics Standards Initiative developed and published the PSI MI XML format for molecular interactions as a community format for the exchange of protein interaction data. This format has been jointly developed by major producers and providers of protein interaction data, among them BIND, DIP, IntAct, MINT, and MIPS. The PSI MI 1.0 format is now widely implemented and supported by tool and data providers. The PSI MI format was explicitly intended to develop in an incremental fashion. Version 1.0 focussed exclusively on protein interactions, and provided only very limited support for quantitative parameters, in particular kinetics. Based on experience with version 1.0, and on requests from both databases and data providers, the HUPO PSI work group for molecular interactions has evolved the PSI MI format to version 2.5, available from http://psidev.sf.net

The Genome Reviews database provides an up-to-date, standardised and comprehensively annotated view of the genomic sequence of organisms with completely deciphered genomes. Currently, Genome Reviews contains the genomes of archaea, bacteria, baker's yeast and Arabidopsis thaliana. Genome Reviews is available as a MySQL relational database, or a flat file format derived from that in the EMBL Nucleotide Sequence Database.

The Integr8 web portal provides easy access to integrated information about deciphered genomes and their corresponding proteomes. Available data includes DNA sequences (from databases including the EMBL nucleotide sequence database, Genome Reviews, and Ensembl); protein sequences (from databases including the UniProt Knowledgebase and IPI); statistical genome and proteome analysis (performed using InterPro, CluSTr, and GOA); and information about orthology, paralogy, and synteny.
