Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees.
Nucleic Acids Res. 2016 Apr 19. pii: gkw290.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. The current version was completely redesigned and rewritten, utilizing current web technologies for speedy and streamlined processing. Numerous new features were introduced and several new data types are now supported. Trees with up to 100,000 leaves can now be efficiently displayed. Full interactive control over precise positioning of various annotation features and an unlimited number of datasets allow the easy creation of complex tree visualizations. iTOL 3 is the first tool that supports direct visualization of the recently proposed phylogenetic placements format. Finally, iTOL's account system has been redesigned to simplify the management of trees in user-defined workspaces and projects, as it is heavily used and already handles more than 500,000 trees from more than 10,000 individual users.
The SIDER database of drugs and side effects.
Nucleic Acids Res. 2016 Jan 4;44(D1):D1075-9. doi: 10.1093/nar/gkv1075. Epub 2015
Unwanted side effects of drugs are a burden on patients and a severe impediment in the development of new drugs. At the same time, adverse drug reactions (ADRs) recorded during clinical trials are an important source of human phenotypic data. It is therefore essential to combine data on drugs, targets and side effects into a more complete picture of the therapeutic mechanisms of action of drugs and the ways in which they cause adverse reactions. To this end, we have created the SIDER ('Side Effect Resource', http://sideeffects.embl.de) database of drugs and ADRs. The current release, SIDER 4, contains data on 1430 drugs, 5880 ADRs and 140 064 drug-ADR pairs, which is an increase of 40% compared to the previous version. For more fine-grained analyses, we extracted the frequency with which side effects occur from the package inserts. This information is available for 39% of drug-ADR pairs, 19% of which can be compared to the frequency under placebo treatment. SIDER furthermore contains a data set of drug indications, extracted from the package inserts using Natural Language Processing. These drug indications are used to reduce the rate of false positives by identifying medical terms that do not correspond to ADRs.
The InterPro protein families database: the classification resource after 15 years.
Nucleic Acids Res. 2015 Jan; 43(Database issue):D213-21. doi: 10.1093/nar/gku1243.
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36,766 member database signatures integrated into 26,238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
PTMcode v2: a resource for functional associations of post-translational modifications within and between proteins.
Nucleic Acids Res. 2015 Jan; 43(Database issue):D494-502. doi:
The post-translational regulation of proteins is mainly driven by two molecular events: their modification by several types of moieties and their interaction with other proteins. These two processes are interdependent and together are responsible for the function of the protein in a particular cell state. Several databases focus on the prediction and compilation of protein-protein interactions (PPIs) and no less on the collection and analysis of protein post-translational modifications (PTMs); however, there are no resources that concentrate on describing the regulatory role of PTMs in PPIs. We developed several methods based on residue co-evolution and proximity to predict the functional associations of pairs of PTMs that we apply to modifications in the same protein and between two interacting proteins. In order to make data available for understudied organisms, PTMcode v2 (http://ptmcode.embl.de) includes a new strategy to propagate PTMs from validated modified sites through orthologous proteins. The second release of PTMcode covers 19 eukaryotic species from which we collected more than 300,000 experimentally verified PTMs (>1,300,000 propagated) of 69 types extracting the post-translational regulation of >100,000 proteins and >100,000 interactions. In total, we report 8 million associations of PTMs regulating single proteins and over 9.4 million interplays tuning PPIs.
SMART: recent updates, new developments and status in 2015.
Nucleic Acids Res. 2015 Jan; 43(Database issue):D257-60. doi: 10.1093/nar/gku949.
SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de/) providing simple identification and extensive annotation of protein domains and the exploration of protein domain architectures. In the current version, SMART contains manually curated models for more than 1200 protein domains, with approximately 200 new models since our last update article. The underlying protein databases were synchronized with UniProt, Ensembl and STRING, bringing the total number of annotated domains and other protein features above 100 million. SMART's 'Genomic' mode, which annotates proteins from completely sequenced genomes, was greatly expanded and now includes 2031 species, compared to 1133 in the previous release. SMART analysis results pages have been completely redesigned and include links to several new information sources. A new, vector-based display engine has been developed for protein schematics in SMART, which can also be exported as high-resolution bitmap images for easy inclusion into other documents. Taxonomic tree displays in SMART have been significantly improved, and can be easily navigated using the integrated search engine.
PTMcode: a database of known and predicted functional associations between post-translational modifications in proteins.
Nucleic Acids Res. 2013 Jan; 41(Database issue):D306-11. doi: 10.1093/nar/gks1230.
Post-translational modifications (PTMs) are involved in the regulation and structural stabilization of eukaryotic proteins. The combination of individual PTM states is a key to modulate cellular functions as became evident in a few well-studied proteins. This combinatorial setting, dubbed the PTM code, has been proposed to be extended to whole proteomes in eukaryotes. Although we are still far from deciphering such a complex language, thousands of protein PTM sites are being mapped by high-throughput technologies, thus providing sufficient data for comparative analysis. PTMcode (http://ptmcode.embl.de) aims to compile known and predicted PTM associations to provide a framework that would enable hypothesis-driven experimental or computational analysis of various scales. In its first release, PTMcode provides PTM functional associations of 13 different PTM types within proteins in 8 eukaryotes. They are based on five evidence channels: a literature survey, residue co-evolution, structural proximity, PTMs at the same residue and location within PTM highly enriched protein regions (hotspots). PTMcode is presented as a protein-based searchable database with an interactive web interface providing the context of the co-regulation of nearly 75 000 residues in >10 000 proteins.
eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges.
Nucleic Acids Res. 2012 Jan; 40(Database issue):D284-9. doi: 10.1093/nar/gkr1060.
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721,801 orthologous groups, encompassing a total of 4,396,591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101,208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450,904 orthologous groups (62.5%).
InterPro in 2011: new developments in the family and domain prediction database.
Nucleic Acids Res. 2012 Jan; 40(Database issue):D306-12. doi: 10.1093/nar/gkr948.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
SMART 7: recent updates to the protein domain annotation resource.
Nucleic Acids Res. 2012 Jan; 40(Database issue):D302-5. doi: 10.1093/nar/gkr931.
SMART (Simple Modular Architecture Research Tool) is an online resource (http://smart.embl.de/) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 7 contains manually curated models for 1009 protein domains, 200 more than in the previous version. The current release introduces several novel features and a streamlined user interface resulting in a faster and more comfortable workflow. The underlying protein databases were greatly expanded, resulting in a 2-fold increase in number of annotated domains and features. The database of completely sequenced genomes now includes 1133 species, compared to 630 in the previous release. Domain architecture analysis results can now be exported and visualized through the iTOL phylogenetic tree viewer. 'metaSMART' was introduced as a novel subresource dedicated to the exploration and analysis of domain architectures in various metagenomics data sets. An advanced full text search engine was implemented, covering the complete annotations for SMART and Pfam domains, as well as the complete set of protein descriptions, allowing users to quickly find relevant information.
iPath2.0: interactive pathway explorer.
Nucleic Acids Res. 2011 Jul;39(Web Server issue):W412-5. doi: 10.1093/nar/gkr313.
iPath2.0 is a web-based tool (http://pathways.embl.de) for the visualization and analysis of cellular pathways. Its primary map summarizes the metabolism in biological systems as annotated to date. Nodes in the map correspond to various chemical compounds and edges represent series of enzymatic reactions. In two other maps, iPath2.0 provides an overview of secondary metabolite biosynthesis and a hand-picked selection of important regulatory pathways and other functional modules, allowing a more general overview of protein functions in a genome or metagenome. iPath2.0's main interface is an interactive Flash-based viewer, which allows users to easily navigate and explore the complex pathway maps. In addition to the default pre-computed overview maps, iPath offers several data mapping tools. Users can upload various types of data and completely customize all nodes and edges of iPath2.0's maps. These customized maps give users an intuitive overview of their own data, guiding the analysis of various genomics and metagenomics projects.
Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy.
Nucleic Acids Res. 2011 Jul;39(Web Server issue):W475-8. doi: 10.1093/nar/gkr201.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. In addition to classical tree viewer functions, iTOL offers many novel ways of annotating trees with various additional data. The current version introduces numerous new features and greatly expands the number of supported data set types. Trees can be interactively manipulated and edited. A free personal account system is available, providing management and sharing of trees in user-defined workspaces and projects. Export to various bitmap and vector graphics formats is supported. A batch access interface is available for programmatic access or inclusion of interactive trees into other web services.
Toward molecular trait-based ecology through integration of biogeochemical, geographical and metagenomic data.
Mol Syst Biol. 2011 Mar 15; 7:473. doi: 10.1038/msb.2011.6.
Using metagenomic 'parts lists' to infer global patterns on microbial ecology remains a significant challenge. To deduce important ecological indicators such as environmental adaptation, molecular trait dispersal, diversity variation and primary production from the gene pool of an ecosystem, we integrated 25 ocean metagenomes with geographical, meteorological and geophysicochemical data. We find that climatic factors (temperature, sunlight) are the major determinants of the biomolecular repertoire of each sample and the main limiting factor on functional trait dispersal (absence of biogeographic provincialism). Molecular functional richness and diversity show a distinct latitudinal gradient peaking at 20 degrees N and correlate with primary production. The latter can also be predicted from the molecular functional composition of an environmental sample. Together, our results show that the functional community composition derived from metagenomes is an important quantitative readout for molecular trait-based biogeography and ecology.
Visualization of multiple alignments, phylogenies and gene family evolution.
Nat Methods. 2010 Mar;7(3 Suppl):S16-25. doi: 10.1038/nmeth.1434.
Software packages for visualizing sequence alignments and trees are essential tools for life scientists. In this review, we describe the major features and capabilities of a selection of stand-alone and web-based applications useful when investigating the function and evolution of a gene family. These range from simple viewers to systems that provide sophisticated editing and analysis functions. We conclude with a discussion of the challenges that these tools now face due to the flood of next generation sequence data and the increasingly complex network of bioinformatics information sources.
A side effect resource to capture phenotypic effects of drugs.
Mol Syst Biol. 2010; 6:343. doi: 10.1038/msb.2009.98. Epub 2010 Jan 19.
The molecular understanding of phenotypes caused by drugs in humans is essential for elucidating mechanisms of action and for developing personalized medicines. Side effects of drugs (also known as adverse drug reactions) are an important source of human phenotypic information, but so far research on this topic has been hampered by insufficient accessibility of data. Consequently, we have developed a public, computer-readable side effect resource (SIDER) that connects 888 drugs to 1450 side effect terms. It contains information on frequency in patients for one-third of the drug-side effect pairs. For 199 drugs, the side effect frequency of placebo administration could also be extracted. We illustrate the potential of SIDER with a number of analyses. The resource is freely available for academic research at http://sideeffects.embl.de.
Impact of genome reduction on bacterial metabolism and its regulation.
Science. 2009 Nov 27; 326(5957):1263-8. doi: 10.1126/science.1177263.
To understand basic principles of bacterial metabolism organization and regulation, but also the impact of genome size, we systematically studied one of the smallest bacteria, Mycoplasma pneumoniae. A manually curated metabolic network of 189 reactions catalyzed by 129 enzymes allowed the design of a defined, minimal medium with 19 essential nutrients. More than 1300 growth curves were recorded in the presence of various nutrient concentrations. Measurements of biomass indicators, metabolites, and 13C-glucose experiments provided information on directionality, fluxes, and energetics; integration with transcription profiling enabled the global analysis of metabolic regulation. Compared with more complex bacteria, the M. pneumoniae metabolic network has a more linear topology and contains a higher fraction of multifunctional enzymes; general features such as metabolite concentrations, cellular energetics, adaptability, and global gene expression responses are similar, however.
eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations.
Nucleic Acids Res. 2010 Jan; 38(Database issue):D190-5. doi: 10.1093/nar/gkp951.
The identification of orthologous relationships forms the basis for most comparative genomics studies. Here, we present the second version of the eggNOG database, which contains orthologous groups (OGs) constructed through identification of reciprocal best BLAST matches and triangular linkage clustering. We applied this procedure to 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes), which is a 2-fold increase relative to the previous version. The pipeline yielded 224,847 OGs, including 9724 extended versions of the original COG and KOG. We computed OGs for different levels of the tree of life; in addition to the species groups included in our first release (i.e. fungi, metazoa, insects, vertebrates and mammals), we have now constructed OGs for archaea, fishes, rodents and primates. We automatically annotate the non-supervised orthologous groups (NOGs) with functional descriptions, protein domains, and functional categories as defined initially for the COG/KOG database. In-depth analysis is facilitated by precomputed high-quality multiple sequence alignments and maximum-likelihood trees for each of the available OGs. Altogether, eggNOG covers 2,242,035 proteins (built from 2,590,259 proteins) and provides a broad functional description for at least 1,966,709 (88%) of them. Users can access the complete set of orthologous groups via a web interface at: http://eggnog.embl.de.
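The two ideas named in the abstract above, reciprocal best matches and triangular linkage, can be illustrated with a small sketch. This is not the eggNOG pipeline: the gene naming scheme ("genome|gene"), the score table, and the function names are all hypothetical, and the scores are made up rather than taken from BLAST output. The sketch only shows how pairs of genes that are each other's best hit across genomes can be found, and how three such pairs closing a triangle form a seed for a group.

```python
from itertools import combinations


def genome_of(gene):
    """Assumed naming scheme: 'genome|gene' (hypothetical, for illustration)."""
    return gene.split("|")[0]


def reciprocal_best_hits(scores):
    """Return unordered gene pairs that are each other's best cross-genome hit.

    `scores` maps (query, subject) -> similarity score (higher is better).
    """
    best = {}  # (query, subject genome) -> (score, subject)
    for (query, subject), score in scores.items():
        if genome_of(query) == genome_of(subject):
            continue  # only compare genes from different genomes
        key = (query, genome_of(subject))
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    pairs = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, genome_of(query)))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def rbh_triangles(pairs):
    """Return triples of genes in which all three pairs are reciprocal best hits."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    triangles = set()
    for a, nbrs in neighbours.items():
        for b, c in combinations(sorted(nbrs), 2):
            if c in neighbours.get(b, ()):
                triangles.add(frozenset((a, b, c)))
    return triangles


# Made-up scores for three genomes A, B, C; B|g2 is a weaker paralog-like hit.
scores = {
    ("A|g1", "B|g1"): 200, ("A|g1", "B|g2"): 90,
    ("B|g1", "A|g1"): 190, ("B|g2", "A|g1"): 85,
    ("A|g1", "C|g1"): 150, ("C|g1", "A|g1"): 150,
    ("B|g1", "C|g1"): 140, ("C|g1", "B|g1"): 145,
    ("B|g2", "C|g1"): 60,
}
pairs = reciprocal_best_hits(scores)
triangles = rbh_triangles(pairs)
```

In this toy input, B|g2 never appears in a reciprocal pair because A|g1 and C|g1 both prefer B|g1, so it is left out of the seed triangle. How triangles are subsequently merged into larger groups is not described in the abstract and is not attempted here.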
Quantifying environmental adaptation of metabolic pathways in metagenomics.
Proc Natl Acad Sci U S A. 2009 Feb 3; 106(5):1374-9. doi: 10.1073/pnas.0808022106.
Recently, approaches have been developed to sample the genetic content of heterogeneous environments (metagenomics). However, by what means these sequences link distinct environmental conditions with specific biological processes is not well understood. Thus, a major challenge is how the usage of particular pathways and subnetworks reflects the adaptation of microbial communities across environments and habitats, i.e., how network dynamics relates to environmental features. Previous research has treated environments as discrete, somewhat simplified classes (e.g., terrestrial vs. marine), and searched for obvious metabolic differences among them (i.e., treating the analysis as a typical classification problem). However, environmental differences result from combinations of many factors, which often vary only slightly. Therefore, we introduce an approach that employs correlation and regression to relate multiple, continuously varying factors defining an environment to the extent of particular microbial pathways present in a geographic site. Moreover, rather than looking only at individual correlations (one-to-one), we adapted canonical correlation analysis and related techniques to define an ensemble of weighted pathways that maximally covaries with a combination of environmental variables (many-to-many), which we term a metabolic footprint. Applied to available aquatic datasets, we identified footprints predictive of their environment that can potentially be used as biosensors. For example, we show a strong multivariate correlation between the energy-conversion strategies of a community and multiple environmental gradients (e.g., temperature). Moreover, we identified covariation in amino acid transport and cofactor synthesis, suggesting that limiting amounts of cofactor can (partially) explain increased import of amino acids in nutrient-limited conditions.
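The many-to-many idea in the abstract above rests on canonical correlation analysis (CCA). The following is a minimal sketch of textbook CCA, not the authors' adapted method: it finds one weighted combination of environmental variables and one weighted combination of pathway abundances whose correlation across sites is maximal. The function name, the small `ridge` term added for numerical stability, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def first_canonical_pair(X, Y, ridge=1e-8):
    """Return (correlation, x_weights, y_weights) for the first canonical pair.

    X: (n_sites, p) matrix, e.g. environmental variables.
    Y: (n_sites, q) matrix, e.g. pathway abundances.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance blocks; the ridge keeps the Cholesky factors well defined.
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    # Whiten both blocks: with Sxx = Lx Lx^T and Syy = Ly Ly^T,
    # the singular values of M = Lx^{-1} Sxy Ly^{-T} are the canonical correlations.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Map the leading singular vectors back to weights on the original variables.
    x_weights = np.linalg.solve(Lx.T, U[:, 0])
    y_weights = np.linalg.solve(Ly.T, Vt[0])
    return s[0], x_weights, y_weights


# Synthetic example: one hidden gradient z drives one environmental variable
# and (with opposite sign) one pathway; everything else is noise.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
env = np.column_stack([
    z + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])
pathways = np.column_stack([
    rng.normal(size=200),
    -z + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])
rho, env_weights, pathway_weights = first_canonical_pair(env, pathways)
```

Here `pathway_weights` plays the role of a weighted ensemble of pathways and `env_weights` the matching combination of environmental variables. Real metagenomic data typically has far fewer sites than pathways, which is one reason plain CCA needs regularization or related techniques in practice.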
SMART 6: recent updates and new developments.
Nucleic Acids Res. 2009 Jan; 37(Database issue):D229-32. doi: 10.1093/nar/gkn808.
Simple modular architecture research tool (SMART) is an online tool (http://smart.embl.de/) for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments were focused on further data integration and improving user friendliness. The underlying protein database based on completely sequenced genomes was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathways resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using distributed annotation system (DAS) or through a simple object access protocol (SOAP) based web service.
InterPro: the integrative protein signature database.
Nucleic Acids Res. 2009 Jan; 37(Database issue):D211-5. doi: 10.1093/nar/gkn785.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates predictive models or 'signatures' representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the roughly 58,000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein-protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
metaTIGER: a metabolic evolution resource.
Nucleic Acids Res. 2009 Jan; 37(Database issue):D531-8. doi: 10.1093/nar/gkn826.
Metabolic networks are a subject that has received much attention, but existing web resources do not include extensive phylogenetic information. Phylogenomic approaches (phylogenetics on a genomic scale) have been shown to be effective in the study of evolution and processes like horizontal gene transfer (HGT). To address the lack of phylogenomic information relating to eukaryotic metabolism, metaTIGER (www.bioinformatics.leeds.ac.uk/metatiger) has been created, using genomic information from 121 eukaryotes and 404 prokaryotes and sensitive sequence search techniques to predict the presence of metabolic enzymes. These enzyme sequences were used to create a comprehensive database of 2257 maximum-likelihood phylogenetic trees, some containing over 500 organisms. The trees can be viewed using iTOL, an advanced interactive tree viewer, enabling straightforward interpretation of large trees. Complex high-throughput tree analysis is also available through user-defined queries, allowing the rapid identification of trees of interest, e.g. containing putative HGT events. metaTIGER also provides novel and easy-to-use facilities for viewing and comparing the metabolic networks in different organisms via highlighted pathway images and tables. metaTIGER is demonstrated through evolutionary analysis of Plasmodium, including identification of genes horizontally transferred from chlamydia.
Discovering functional novelty in metagenomes: examples from light-mediated processes.
J Bacteriol. 2009 Jan; 191(1):32-41. doi: 10.1128/JB.01084-08. Epub 2008 Oct 10.
The emerging coverage of diverse habitats by metagenomic shotgun data opens new avenues of discovering functional novelty using computational tools. Here, we apply three different concepts for predicting novel functions within light-mediated microbial pathways in five diverse environments. Using phylogenetic approaches, we discovered two novel deep-branching subfamilies of photolyases (involved in light-mediated repair) distributed abundantly in high-UV environments. Using neighborhood approaches, we were able to assign seven novel functional partners in luciferase synthesis, nitrogen metabolism, and quorum sensing to BLUF domain-containing proteins (involved in light sensing). Finally, by domain analysis, for RcaE proteins (involved in chromatic adaptation), we predict 16 novel domain architectures that indicate novel functionalities in habitats with little or no light. Quantification of protein abundance in the various environments supports our findings that bacteria utilize light for sensing, repair, and adaptation far more widely than previously thought. While the discoveries illustrate the opportunities in function discovery, we also discuss the immense conceptual and practical challenges that come along with this new type of data.
Evolution of the phospho-tyrosine signaling machinery in premetazoan lineages.
Proc Natl Acad Sci U S A. 2008 Jul 15; 105(28):9680-4. doi:
Multicellular animals use a three-part molecular toolkit to mediate phospho-tyrosine signaling: Tyrosine kinases (TyrK), protein tyrosine phosphatases (PTP), and Src Homology 2 (SH2) domains function, respectively, as "writers," "erasers," and "readers" of phospho-tyrosine modifications. How did this system of three components evolve, given their interdependent function? Here, we examine the usage of these components in 41 eukaryotic genomes, including the newly sequenced genome of the choanoflagellate, Monosiga brevicollis, the closest known unicellular relative to metazoans. This analysis indicates that SH2 and PTP domains likely evolved earliest: a handful of these domains are found in premetazoan eukaryotes lacking tyrosine kinases, most likely to deal with limited tyrosine phosphorylation cross-catalyzed by promiscuous Ser/Thr kinases. Modern TyrK proteins, however, are only observed in two lineages, metazoans and choanoflagellates. These two lineages show a dramatic coexpansion of all three domain families. Concurrent expansion of the three domain families is consistent with a stepwise evolutionary model in which preexisting SH2 and PTP domains were of limited utility until the appearance of the TyrK domain in the last common ancestor of metazoans and choanoflagellates. The emergence of the full three-component signaling system, with its dramatically increased encoding potential, may have contributed to the advent of metazoan multicellularity.
iPath: interactive exploration of biochemical pathways and networks.
Trends Biochem Sci. 2008 Mar; 33(3):101-3. doi: 10.1016/j.tibs.2008.01.001. Epub
iPath is an open-access online tool (http://pathways.embl.de) for visualizing and analyzing metabolic pathways. An interactive viewer provides straightforward navigation through various pathways and enables easy access to the underlying chemicals and enzymes. Customized pathway maps can be generated and annotated using various external data. For example, by merging human genome data with two important gut commensals, iPath can pinpoint the complementarity of the host-symbiont metabolic capacities.
The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans.
Nature. 2008 Feb 14; 451(7180):783-8. doi: 10.1038/nature06617.
Choanoflagellates are the closest known relatives of metazoans. To discover potential molecular mechanisms underlying the evolution of metazoan multicellularity, we sequenced and analysed the genome of the unicellular choanoflagellate Monosiga brevicollis. The genome contains approximately 9,200 intron-rich genes, including a number that encode cell adhesion and signalling protein domains that are otherwise restricted to metazoans. Here we show that the physical linkages among protein domains often differ between M. brevicollis and metazoans, suggesting that abundant domain shuffling followed the separation of the choanoflagellate and metazoan lineages. The completion of the M. brevicollis genome allows us to reconstruct with increasing resolution the genomic changes that accompanied the origin of metazoans.
4DXpress: a database for cross-species expression pattern comparisons.
Nucleic Acids Res. 2008 Jan; 36(Database issue):D847-53. Epub 2007 Oct 4.
In the major animal model species like mouse, fish or fly, detailed spatial information on gene expression over time can be acquired through whole mount in situ hybridization experiments. In these species, expression patterns of many genes have been studied and data has been integrated into dedicated model organism databases like ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse. However, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. Therefore, we have integrated expression patterns for zebrafish, Drosophila, medaka and mouse into a central public repository called 4DXpress (expression database in four dimensions). Users can query anatomy ontology-based expression annotations across species and quickly jump from one gene to the orthologues in other species. Genes are linked to public microarray data in ArrayExpress. We have mapped developmental stages between the species to be able to compare developmental time phases. We store the largest collection of gene expression patterns available to date in an individual resource, reflecting 16 505 annotated genes. 4DXpress will be an invaluable tool for developmental as well as for computational biologists interested in gene regulation and evolution. 4DXpress is available at http://ani.embl.de/4DXpress.
Quantitative assessment of protein function prediction from metagenomics shotgun sequences.
Proc Natl Acad Sci U S A. 2007 Aug 28; 104(35):13913-8. Epub 2007 Aug 23.
To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well-characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
New developments in the InterPro database.
Nucleic Acids Res. 2007 Jan; 35(Database issue):D224-8.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro). The InterProScan search tool is now also available via a web service at http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html.
Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation.
Bioinformatics. 2007 Jan 1; 23(1):127-8. Epub 2006 Oct 18.
Interactive Tree Of Life (iTOL) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. Trees can be interactively pruned and re-rooted. Various types of data such as genome sizes or protein domain repertoires can be mapped onto the tree. Export to several bitmap and vector graphics formats is supported. AVAILABILITY: iTOL is available at http://itol.embl.de.
Insights into social insects from the genome of the honeybee Apis mellifera.
Nature. 2006 Oct 26; 443(7114):931-49.
Here we report the genome sequence of the honeybee Apis mellifera, a key model for social behaviour and essential to global ecology through pollination. Compared with other sequenced insect genomes, the A. mellifera genome has high A+T and CpG contents, lacks major transposon families, evolves more slowly, and is more similar to vertebrates for circadian rhythm, RNA interference and DNA methylation genes, among others. Furthermore, A. mellifera has fewer genes for innate immunity, detoxification enzymes, cuticle-forming proteins and gustatory receptors, more genes for odorant receptors, and novel genes for nectar and pollen utilization, consistent with its ecology and social organization. Compared to Drosophila, genes in early developmental pathways differ in Apis, whereas similarities exist for functions that differ markedly, such as sex determination, brain function and behaviour. Population genetics suggests a novel African origin for the species A. mellifera and insights into whether Africanized bees spread throughout the New World via hybridization or displacement.
SMART 5: domains in the context of genomes and networks.
Nucleic Acids Res. 2006 Jan 1; 34(Database issue):D257-60.
The Simple Modular Architecture Research Tool (SMART) is an online resource (http://smart.embl.de/) used for protein domain identification and the analysis of protein domain architectures. Many new features were implemented to make SMART more accessible to scientists from different fields. The new 'Genomic' mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes. Domain annotation has been updated with a detailed taxonomic breakdown and a prediction of the catalytic activity for 50 SMART domains is now available, based on the presence of essential amino acids. Furthermore, intrinsically disordered protein regions can be identified and displayed. The network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.
Nonsense-mediated mRNA decay factors act in concert to regulate common mRNA targets.
RNA. 2005 Oct; 11(10):1530-44.
Nonsense-mediated mRNA decay (NMD) is a surveillance pathway that degrades mRNAs containing nonsense codons, and regulates the expression of naturally occurring transcripts. While NMD is not essential in yeast or nematodes, UPF1, a key NMD effector, is essential in mice. Here we show that NMD components are required for cell proliferation in Drosophila. This raises the question of whether NMD effectors diverged functionally during evolution. To address this question, we examined expression profiles in Drosophila cells depleted of all known metazoan NMD components. We show that UPF1, UPF2, UPF3, SMG1, SMG5, and SMG6 regulate in concert the expression of a cohort of genes with functions in a wide range of cellular activities, including cell cycle progression. Only a few transcripts were regulated exclusively by individual factors, suggesting that these proteins act mainly in the NMD pathway and their role in mRNA decay has not diverged substantially. Finally, the vast majority of NMD targets in Drosophila are not orthologs of targets previously identified in yeast or human cells. Thus phenotypic differences observed across species following inhibition of NMD can be largely attributed to changes in the repertoire of regulated genes.
Consistency of genome-based methods in measuring Metazoan evolution.
FEBS Lett. 2005 Jun 13; 579(15):3355-61. Epub 2005 Apr 18.
Seven distinct genome-wide divergence measures were applied pairwise to the nine sequenced animal genomes of human, mouse, rat, chicken, pufferfish, fruit fly, mosquito, and two nematode worms (Caenorhabditis briggsae and Caenorhabditis elegans). Qualitatively, all of these divergence measures are found to correlate with the estimated time since speciation; however, marked deviations are observed in a few lineages. The distinct genome divergence measures also correlate well among themselves, indicating that most of the processes shaping genomes are dominated by neutral events. The deviations from the clock-like scenario in some lineages are observed consistently by several measures, implicitly confirming their reliability.
Computational analysis of Modular Protein Architectures.
Book chapter in: Modular Protein Domains. 2005; Chapter 21.
InterPro, progress and status in 2005.
Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D201-5.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created to integrate the major protein signature databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping and links to specialized databases. New features of InterPro include extended protein match views, taxonomic range information and protein 3D structure data. One of the new match views is the InterPro Domain Architecture view, which shows the domain composition of protein matches. Two new entry types were introduced to better describe InterPro entries: these are active site and binding site. PIRSF and the structure-based SUPERFAMILY are the latest member databases to join InterPro, and CATH and PANTHER are soon to be integrated. InterPro release 8.0 contains 11 007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translational modification sites. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
Gene expression profiling of the rat superior olivary complex using serial analysis of gene expression.
Eur J Neurosci. 2004 Dec; 20(12):3244-58.
The superior olivary complex (SOC) is an auditory brainstem region that represents a favourable system to study rapid neurotransmission and the maturation of neuronal circuits. Here we performed serial analysis of gene expression (SAGE) on the SOC in 60-day-old Sprague-Dawley rats to identify genes specifically important for its function and to create a transcriptome reference for the subsequent identification of age-related or disease-related changes. Sequencing of 31 035 tags identified 10 473 different transcripts. Fifty-seven per cent of the unique tags with a count greater than four were statistically more highly represented in the SOC than in the hippocampus. Among them were genes encoding proteins involved in energy supply, the glutamate/glutamine shuttle, and myelination. Approximately 80 plasma membrane transporters, receptors, channels, and vesicular transporters were identified, and 25% of them displayed a significantly higher expression level in the SOC than in the hippocampus. Some of the plasma membrane proteins were not previously characterized in the SOC, e.g. the purinergic receptor subunit P2X(6) and the metabotropic GABA receptor Gpr51. Differential gene expression between SOC and hippocampus was confirmed using RNA in situ hybridization or immunohistochemistry. The extensive gene inventory presented here will facilitate the dissection of the molecular mechanisms underlying specific SOC functions, and the comparison with other SAGE libraries from brain will ease the identification of promoters to generate region-specific transgenic animals. The analysis will be part of the publicly available database ID-GRAB.
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.
Nature. 2004 Dec 9; 432(7018):695-716.
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
Fast identification of folded human protein domains expressed in E. coli suitable for structural analysis.
BMC Struct Biol. 2004 Mar 8; 4:4.
BACKGROUND: High-throughput protein structure analysis of individual protein domains requires analysis of large numbers of expression clones to identify suitable constructs for structure determination. For this purpose, methods need to be implemented for fast and reliable screening of the expressed proteins as early as possible in the overall process from cloning to structure determination. RESULTS: 88 different E. coli expression constructs for 17 human protein domains were analysed using high-throughput cloning, purification and folding analysis to obtain candidates suitable for structural analysis. After 96 deep-well microplate expression and automated protein purification, protein domains were directly analysed using 1D 1H-NMR spectroscopy. In addition, analytical hydrophobic interaction chromatography (HIC) was used to detect natively folded protein. With these two analytical methods, six constructs (representing two domains) were quickly identified as being well folded and suitable for structural analysis. CONCLUSION: The described approach facilitates high-throughput structural analysis. Clones expressing natively folded proteins suitable for NMR structure determination were quickly identified upon small scale expression screening using 1D 1H-NMR and/or analytical HIC. This procedure is especially effective as a fast and inexpensive screen for the 'low hanging fruits' in structural genomics.
SMART 4.0: towards genomic data integration.
Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D142-4.
SMART (Simple Modular Architecture Research Tool) is a web tool (http://smart.embl.de/) for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. The January 2004 release of SMART contains 685 protein domains. New developments in SMART are centred on the integration of data from completed metazoan genomes. SMART now uses predicted proteins from complete genomes in its source sequence databases, and integrates these with predictions of orthology. New visualization tools have been developed to allow analysis of gene intron-exon structure within the context of protein domain structure, and to align these displays to provide schematic comparisons of orthologous genes, or multiple transcripts from the same gene. Other improvements include the ability to query SMART by Gene Ontology terms, improved structure database searching and batch retrieval of multiple entries.
Alternative splicing and evolution.
Bioessays. 2003 Nov; 25(11):1031-4.
Alternative splicing (AS) is a critical post-transcriptional event that increases transcriptome diversity. Recent bioinformatics studies revealed a high frequency of alternative splicing. Although the extent of AS conservation among mammals is still being discussed, it has been argued that major forms of alternatively spliced transcripts are much better conserved than minor forms. This suggests that alternative splicing plays a major role in genome evolution, allowing new exons to evolve with less constraint.
ELM server: A new resource for investigating short functional sites in modular eukaryotic proteins.
Nucleic Acids Res. 2003 Jul 1; 31(13):3625-30.
Multidomain proteins predominate in eukaryotic proteomes. Individual functions assigned to different sequence segments combine to create a complex function for the whole protein. While on-line resources are available for revealing globular domains in sequences, there has hitherto been no comprehensive collection of small functional sites/motifs comparable to the globular domain resources, yet these are as important for the function of multidomain proteins. Short linear peptide motifs are used for cell compartment targeting, protein-protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. ELM, the Eukaryotic Linear Motif server at http://elm.eu.org/, is a new bioinformatics resource for investigating candidate short non-globular functional motifs in eukaryotic proteins, aiming to fill the void in bioinformatics tools. Sequence comparisons with short motifs are difficult to evaluate because the usual significance assessments are inappropriate. Therefore the server is implemented with several logical filters to eliminate false positives. Current filters are for cell compartment, globular domain clash and taxonomic range. In favourable cases, the filters can reduce the number of retained matches by an order of magnitude or more.
The InterPro Database, 2003 brings increased coverage and new features.
Nucleic Acids Res. 2003 Jan 1; 31(1):315-8.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
Initial sequencing and comparative analysis of the mouse genome.
Nature. 2002 Dec 5; 420(6915):520-62.
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
The genome sequence of the malaria mosquito Anopheles gambiae.
Science. 2002 Oct 4; 298(5591):129-49.
Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency ("dual haplotypes") in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.
Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster.
Science. 2002 Oct 4; 298(5591):149-59.
Comparison of the genomes and proteomes of the two diptera Anopheles gambiae and Drosophila melanogaster, which diverged about 250 million years ago, reveals considerable similarities. However, numerous differences are also observed; some of these must reflect the selection and subsequent adaptation associated with different ecologies and life strategies. Almost half of the genes in both genomes are interpreted as orthologs and show an average sequence identity of about 56%, which is slightly lower than that observed between the orthologs of the pufferfish and human (diverged about 450 million years ago). This indicates that these two insects diverged considerably faster than vertebrates. Aligned sequences reveal that orthologous genes have retained only half of their intron/exon structure, indicating that intron gains or losses have occurred at a rate of about one per gene per 125 million years. Chromosomal arms exhibit significant remnants of homology between the two species, although only 34% of the genes colocalize in small "microsyntenic" clusters, and major interarm transfers as well as intra-arm shuffling of gene order are detected.
Immunity-related genes and gene families in Anopheles gambiae.
Science. 2002 Oct 4; 298(5591):159-65.
We have identified 242 Anopheles gambiae genes from 18 gene families implicated in innate immunity and have detected marked diversification relative to Drosophila melanogaster. Immune-related gene families involved in recognition, signal modulation, and effector systems show a marked deficit of orthologs and excessive gene expansions, possibly reflecting selection pressures from different pathogens encountered in these insects' very different life-styles. In contrast, the multifunctional Toll signal transduction pathway is substantially conserved, presumably because of counterselection for developmental stability. Representative expression profiles confirm that sequence diversification is accompanied by specific responses to different immune challenges. Alternative RNA splicing may also contribute to expansion of the immune repertoire.
InterPro: an integrated documentation resource for protein families, domains and functional sites.
Brief Bioinform. 2002 Sep; 3(3):225-35.
The exponential increase in the submission of nucleotide sequences to the nucleotide sequence database by genome sequencing centres has resulted in a need for rapid, automatic methods for classification of the resulting protein sequences. There are several signature and sequence cluster-based methods for protein classification, each resource having distinct areas of optimum application owing to the differences in the underlying analysis methods. In recognition of this, InterPro was developed as an integrated documentation resource for protein families, domains and functional sites, to rationalise the complementary efforts of the individual protein signature database projects. The member databases - PRINTS, PROSITE, Pfam, ProDom, SMART and TIGRFAMs - form the InterPro core. Related signatures from each member database are unified into single InterPro entries. Each InterPro entry includes a unique accession number, functional descriptions and literature references, and links are made back to the relevant member database(s). Release 4.0 of InterPro (November 2001) contains 4,691 entries, representing 3,532 families, 1,068 domains, 74 repeats and 15 sites of post-translational modification (PTMs) encoded by different regular expressions, profiles, fingerprints and hidden Markov models (HMMs). Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (2,141,621 InterPro hits from 586,124 SWISS-PROT and TrEMBL protein sequences). The database is freely accessible for text- and sequence-based searches.
Common exon duplication in animals and its role in alternative splicing.
Hum Mol Genet. 2002 Jun 15; 11(13):1561-7.
When searching the genomes of human, fly and worm for cases of exon duplication, we found that about 10% of all genes contain tandemly duplicated exons. In the course of the analyses, 2438 unannotated exons were identified that are not currently included in genome databases and that are likely to be functional. The vast majority of them are likely to be involved in mutually exclusive alternative splicing events. The common nature of recent exon duplication indicates that it might have a significant role in the fast evolution of eukaryotic genes. It also provides a general mechanism for the regulation of protein function.
Protein domain analysis in the era of complete genomes.
FEBS Lett. 2002 Feb 20; 513(1):129-34.
Domains present one of the most useful levels at which to understand protein function, and domain family-based analysis has had a profound impact on the study of individual proteins. Protein domain discovery has been progressing steadily over the past 30 years. What are the realistically achievable goals of sequence-based domain analysis, and how far off are they for the sequences encoded in eukaryotic genomes? Here we address some of the issues involved in better coverage of sequence-based domain annotation, and the integration of these results within the wider context of genomes, structures and function.
Genome and protein evolution in eukaryotes.
Curr Opin Chem Biol. 2002 Feb; 6(1):39-45.
The past year has seen the completion of the genome sequence of the flowering plant Arabidopsis thaliana and the initial sequence reports of the human genome. The availability of completely sequenced eukaryotic genomes from disparate phylogenetic lineages has opened the door to comparative analyses and a better understanding of the evolutionary processes shaping genomes. Complex many-to-many relationships between genes from different species appear to be the norm, suggesting that transfer of detailed functional annotation will not be straightforward. In addition to expansion and contraction of gene families, new genes evolve from recombination of pre-existing domains, although some domain families do appear to have evolved recently and to be specific to restricted phylogenetic lineages. The overall picture is of a huge diversity of gene content within eukaryotic genomes, reflecting different functional demands in different species.
Recent improvements to the SMART domain-based sequence annotation resource.
Nucleic Acids Res. 2002 Jan 1; 30(1):242-4.
SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users' documents. A SMART mirror has been created at http://smart.ox.ac.uk.
* equal contribution