| 1. |
Visualization of multiple alignments,
phylogenies and gene family evolution.
Nat Methods. 2010 Mar;
7(3 Suppl):S16-25.
Software for visualizing
sequence alignments and trees are essential tools for life
scientists. In this review, we describe the major features and
capabilities of a selection of stand-alone and web-based
applications useful when investigating the function and evolution
of a gene family. These range from simple viewers, to systems that
provide sophisticated editing and analysis functions. We conclude
with a discussion of the challenges that these tools now face due
to the flood of next generation sequence data and the increasingly
complex network of bioinformatics information sources.
|
| 2. |
A side effect resource to capture phenotypic
effects of drugs.
Mol Syst Biol. 2010; 6:343.
Epub 2010 Jan 19.
The molecular understanding
of phenotypes caused by drugs in humans is essential for
elucidating mechanisms of action and for developing personalized
medicines. Side effects of drugs (also known as adverse drug
reactions) are an important source of human phenotypic information,
but so far research on this topic has been hampered by insufficient
accessibility of data. Consequently, we have developed a public,
computer-readable side effect resource (SIDER) that connects 888
drugs to 1450 side effect terms. It contains information on
frequency in patients for one-third of the drug-side effect pairs.
For 199 drugs, the side effect frequency of placebo administration
could also be extracted. We illustrate the potential of SIDER with
a number of analyses. The resource is freely available for academic
research at http://sideeffects.embl.de.
|
| 3. |
Impact of genome reduction on bacterial
metabolism and its regulation.
Science. 2009 Nov 27;
326(5957):1263-8.
To understand basic
principles of bacterial metabolism organization and regulation, but
also the impact of genome size, we systematically studied one of
the smallest bacteria, Mycoplasma pneumoniae. A manually curated
metabolic network of 189 reactions catalyzed by 129 enzymes allowed
the design of a defined, minimal medium with 19 essential
nutrients. More than 1300 growth curves were recorded in the
presence of various nutrient concentrations. Measurements of
biomass indicators, metabolites, and 13C-glucose experiments
provided information on directionality, fluxes, and energetics;
integration with transcription profiling enabled the global
analysis of metabolic regulation. Compared with more complex
bacteria, the M. pneumoniae metabolic network has a more linear
topology and contains a higher fraction of multifunctional enzymes;
general features such as metabolite concentrations, cellular
energetics, adaptability, and global gene expression responses are
similar, however.
|
| 4. |
eggNOG v2.0: extending the evolutionary
genealogy of genes with enhanced non-supervised orthologous groups,
species and functional annotations.
Nucleic Acids Res. 2010 Jan;
38(Database issue):D190-5. Epub 2009 Nov 9.
The identification of
orthologous relationships forms the basis for most comparative
genomics studies. Here, we present the second version of the eggNOG
database, which contains orthologous groups (OGs) constructed
through identification of reciprocal best BLAST matches and
triangular linkage clustering. We applied this procedure to 630
complete genomes (529 bacteria, 46 archaea and 55 eukaryotes),
which is a 2-fold increase relative to the previous version. The
pipeline yielded 224,847 OGs, including 9724 extended versions of
the original COG and KOG. We computed OGs for different levels of
the tree of life; in addition to the species groups included in our
first release (i.e. fungi, metazoa, insects, vertebrates and
mammals), we have now constructed OGs for archaea, fishes, rodents
and primates. We automatically annotate the non-supervised
orthologous groups (NOGs) with functional descriptions, protein
domains, and functional categories as defined initially for the
COG/KOG database. In-depth analysis is facilitated by precomputed
high-quality multiple sequence alignments and maximum-likelihood
trees for each of the available OGs. Altogether, eggNOG covers
2,242,035 proteins (built from 2,590,259 proteins) and provides a
broad functional description for at least 1,966,709 (88%) of them.
Users can access the complete set of orthologous groups via a web
interface at: http://eggnog.embl.de.
|
| 5. |
Quantifying environmental adaptation of
metabolic pathways in metagenomics.
Proc Natl Acad Sci U S A. 2009 Feb 3;
106(5):1374-9. Epub 2009 Jan 22.
Recently, approaches have
been developed to sample the genetic content of heterogeneous
environments (metagenomics). However, by what means these sequences
link distinct environmental conditions with specific biological
processes is not well understood. Thus, a major challenge is how
the usage of particular pathways and subnetworks reflects the
adaptation of microbial communities across environments and
habitats, i.e., how network dynamics relates to environmental
features. Previous research has treated environments as discrete,
somewhat simplified classes (e.g., terrestrial vs. marine), and
searched for obvious metabolic differences among them (i.e.,
treating the analysis as a typical classification problem).
However, environmental differences result from combinations of many
factors, which often vary only slightly. Therefore, we introduce an
approach that employs correlation and regression to relate
multiple, continuously varying factors defining an environment to
the extent of particular microbial pathways present in a geographic
site. Moreover, rather than looking only at individual correlations
(one-to-one), we adapted canonical correlation analysis and related
techniques to define an ensemble of weighted pathways that
maximally covaries with a combination of environmental variables
(many-to-many), which we term a metabolic footprint. Applied to
available aquatic datasets, we identified footprints predictive of
their environment that can potentially be used as biosensors. For
example, we show a strong multivariate correlation between the
energy-conversion strategies of a community and multiple
environmental gradients (e.g., temperature). Moreover, we
identified covariation in amino acid transport and cofactor
synthesis, suggesting that limiting amounts of cofactor can
(partially) explain increased import of amino acids in
nutrient-limited conditions.
|
| 6. |
SMART 6: recent updates and new
developments.
Nucleic Acids Res. 2009 Jan;
37(Database issue):D229-32. Epub 2008 Oct 31.
Simple modular architecture
research tool (SMART) is an online tool (http://smart.embl.de/) for the
identification and annotation of protein domains. It provides a
user-friendly platform for the exploration and comparative study of
domain architectures in both proteins and genes. The current
release of SMART contains manually curated models for 784 protein
domains. Recent developments were focused on further data
integration and improving user friendliness. The underlying protein
database based on completely sequenced genomes was greatly expanded
and now includes 630 species, compared to 191 in the previous
release. As an initial step towards integrating information on
biological pathways into SMART, our domain annotations were
extended with data on metabolic pathways and links to several
pathways resources. The interaction network view was completely
redesigned and is now available for more than 2 million proteins.
In addition to the standard web access to the database, users can
now query SMART using distributed annotation system (DAS) or
through a simple object access protocol (SOAP) based web
service.
|
| 7. |
InterPro: the integrative protein signature
database.
Nucleic Acids Res. 2009 Jan;
37(Database issue):D211-5. Epub 2008 Oct 21.
The InterPro database
(http://www.ebi.ac.uk/interpro/)
integrates together predictive models or 'signatures' representing
protein domains, families and functional sites from multiple,
diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS,
ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is
performed manually and approximately half of the total
approximately 58,000 signatures available in the source databases
belong to an InterPro entry. Recently, we have started to also
display the remaining un-integrated signatures via our web
interface. Other developments include the provision of
non-signature data, such as structural data, in new XML files on
our FTP site, as well as the inclusion of matchless UniProtKB
proteins in the existing match XML files. The web interface has
been extended and now links out to the ADAN predicted
protein-protein interaction database and the SPICE and Dasty
viewers. The latest public release (v18.0) covers 79.8% of
UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may
be accessed either via the web address above, via web services, by
downloading files by anonymous FTP or by using the InterProScan
search software (http://www.ebi.ac.uk/Tools/InterProScan/).
|
| 8. |
metaTIGER: a metabolic evolution resource.
Nucleic Acids Res. 2009 Jan;
37(Database issue):D531-8. Epub 2008 Oct 25.
Metabolic networks are a
subject that has received much attention, but existing web
resources do not include extensive phylogenetic information.
Phylogenomic approaches (phylogenetics on a genomic scale) have
been shown to be effective in the study of evolution and processes
like horizontal gene transfer (HGT). To address the lack of
phylogenomic information relating to eukaryotic metabolism,
metaTIGER (www.bioinformatics.leeds.ac.uk/metatiger) has been
created, using genomic information from 121 eukaryotes and 404
prokaryotes and sensitive sequence search techniques to predict the
presence of metabolic enzymes. These enzyme sequences were used to
create a comprehensive database of 2257 maximum-likelihood
phylogenetic trees, some containing over 500 organisms. The trees
can be viewed using iTOL, an advanced interactive tree viewer,
enabling straightforward interpretation of large trees. Complex
high-throughput tree analysis is also available through
user-defined queries, allowing the rapid identification of trees of
interest, e.g. containing putative HGT events. metaTIGER also
provides novel and easy-to-use facilities for viewing and comparing
the metabolic networks in different organisms via highlighted
pathway images and tables. metaTIGER is demonstrated through
evolutionary analysis of Plasmodium, including identification of
genes horizontally transferred from chlamydia.
|
| 9. |
Discovering functional novelty in metagenomes:
examples from light-mediated processes.
J Bacteriol. 2009 Jan;
191(1):32-41. Epub 2008 Oct 10.
The emerging coverage of
diverse habitats by metagenomic shotgun data opens new avenues of
discovering functional novelty using computational tools. Here, we
apply three different concepts for predicting novel functions
within light-mediated microbial pathways in five diverse
environments. Using phylogenetic approaches, we discovered two
novel deep-branching subfamilies of photolyases (involved in
light-mediated repair) distributed abundantly in high-UV
environments. Using neighborhood approaches, we were able to assign
seven novel functional partners in luciferase synthesis, nitrogen
metabolism, and quorum sensing to BLUF domain-containing proteins
(involved in light sensing). Finally, by domain analysis, for RcaE
proteins (involved in chromatic adaptation), we predict 16 novel
domain architectures that indicate novel functionalities in
habitats with little or no light. Quantification of protein
abundance in the various environments supports our findings that
bacteria utilize light for sensing, repair, and adaptation far more
widely than previously thought. While the discoveries illustrate
the opportunities in function discovery, we also discuss the
immense conceptual and practical challenges that come along with
this new type of data.
|
| 10. |
Evolution of the phospho-tyrosine signaling
machinery in premetazoan lineages.
Proc Natl Acad Sci U S A. 2008 Jul 15;
105(28):9680-4. Epub 2008 Jul 3.
Multicellular animals use a
three-part molecular toolkit to mediate phospho-tyrosine signaling:
Tyrosine kinases (TyrK), protein tyrosine phosphatases (PTP), and
Src Homology 2 (SH2) domains function, respectively, as "writers,"
"erasers," and "readers" of phospho-tyrosine modifications. How did
this system of three components evolve, given their interdependent
function? Here, we examine the usage of these components in 41
eukaryotic genomes, including the newly sequenced genome of the
choanoflagellate, Monosiga brevicollis, the closest known
unicellular relative to metazoans. This analysis indicates that SH2
and PTP domains likely evolved earliest: a handful of these domains
are found in premetazoan eukaryotes lacking tyrosine kinases, most
likely to deal with limited tyrosine phosphorylation
cross-catalyzed by promiscuous Ser/Thr kinases. Modern TyrK
proteins, however, are only observed in two lineages, metazoans and
choanoflagellates. These two lineages show a dramatic coexpansion
of all three domain families. Concurrent expansion of the three
domain families is consistent with a stepwise evolutionary model in
which preexisting SH2 and PTP domains were of limited utility until
the appearance of the TyrK domain in the last common ancestor of
metazoans and choanoflagellates. The emergence of the full
three-component signaling system, with its dramatically increased
encoding potential, may have contributed to the advent of metazoan
multicellularity.
|
| 11. |
iPath: interactive exploration of biochemical
pathways and networks.
Trends Biochem Sci. 2008 Mar;
33(3):101-3. Epub 2008 Feb 13.
iPath is an open-access
online tool (http://pathways.embl.de) for
visualizing and analyzing metabolic pathways. An interactive viewer
provides straightforward navigation through various pathways and
enables easy access to the underlying chemicals and enzymes.
Customized pathway maps can be generated and annotated using
various external data. For example, by merging human genome data
with two important gut commensals, iPath can pinpoint the
complementarity of the host-symbiont metabolic capacities.
|
| 12. |
The genome of the choanoflagellate Monosiga
brevicollis and the origin of metazoans.
Nature. 2008 Feb 14;
451(7180):783-8.
Choanoflagellates are the
closest known relatives of metazoans. To discover potential
molecular mechanisms underlying the evolution of metazoan
multicellularity, we sequenced and analysed the genome of the
unicellular choanoflagellate Monosiga brevicollis. The genome
contains approximately 9,200 intron-rich genes, including a number
that encode cell adhesion and signalling protein domains that are
otherwise restricted to metazoans. Here we show that the physical
linkages among protein domains often differ between M. brevicollis
and metazoans, suggesting that abundant domain shuffling followed
the separation of the choanoflagellate and metazoan lineages. The
completion of the M. brevicollis genome allows us to reconstruct
with increasing resolution the genomic changes that accompanied the
origin of metazoans.
|
| 13. |
4DXpress: a database for cross-species
expression pattern comparisons.
Nucleic Acids Res. 2008 Jan;
36(Database issue):D847-53. Epub 2007 Oct 4.
In the major animal model
species like mouse, fish or fly, detailed spatial information on
gene expression over time can be acquired through whole mount in
situ hybridization experiments. In these species, expression
patterns of many genes have been studied and data has been
integrated into dedicated model organism databases like ZFIN for
zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse.
However, a central repository that allows users to query and
compare gene expression patterns across different species has not
yet been established. Therefore, we have integrated expression
patterns for zebrafish, Drosophila, medaka and mouse into a central
public repository called 4DXpress (expression database in four
dimensions). Users can query anatomy ontology-based expression
annotations across species and quickly jump from one gene to the
orthologues in other species. Genes are linked to public microarray
data in ArrayExpress. We have mapped developmental stages between
the species to be able to compare developmental time phases. We
store the largest collection of gene expression patterns available
to date in an individual resource, reflecting 16 505 annotated
genes. 4DXpress will be an invaluable tool for developmental as
well as for computational biologists interested in gene regulation
and evolution. 4DXpress is available at
http://ani.embl.de/4DXpress.
|
| 14. |
Quantitative assessment of protein function
prediction from metagenomics shotgun sequences.
Proc Natl Acad Sci U S A. 2007 Aug 28;
104(35):13913-8. Epub 2007 Aug 23.
To assess the potential of
protein function prediction in environmental genomics data, we
analyzed shotgun sequences from four diverse and complex habitats.
Using homology searches as well as customized gene neighborhood
methods that incorporate intergenic and evolutionary distances, we
inferred specific functions for 76% of the 1.4 million predicted
ORFs in these samples (83% when nonspecific functions are
considered). Surprisingly, these fractions are only slightly
smaller than the corresponding ones in completely sequenced genomes
(83% and 86%, respectively, by using the same methodology) and
considerably higher than previously thought. For as many as 75,448
ORFs (5% of the total), only neighborhood methods can assign
functions, illustrated here by a previously undescribed gene
associated with the well characterized heme biosynthesis operon and
a potential transcription factor that might regulate a coupling
between fatty acid biosynthesis and degradation. Our results
further suggest that, although functions can be inferred for most
proteins on earth, many functions remain to be discovered in
numerous small, rare protein families.
|
| 15. |
New developments in the InterPro database.
Nucleic Acids Res. 2007 Jan;
35(Database issue):D224-8.
InterPro is an integrated
resource for protein families, domains and functional sites, which
integrates the following protein signature databases: PROSITE,
PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D
and PANTHER. The latter two new member databases have been
integrated since the last publication in this journal. There have
been several new developments in InterPro, including an additional
reading field, new database links, extensions to the web interface
and additional match XML files. InterPro has always provided
matches to UniProtKB proteins on the website and in the match XML
file on the FTP site. Additional matches to proteins in UniParc
(UniProt archive) are now available for download in the new match
XML files only. The latest InterPro release (13.0) contains more
than 13 000 entries, covering over 78% of all proteins in
UniProtKB. The database is available for text- and sequence-based
searches via a webserver (http://www.ebi.ac.uk/interpro),
and for download by anonymous FTP
(ftp://ftp.ebi.ac.uk/pub/databases/interpro). The InterProScan
search tool is now also available via a web service at
http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html.
|
| 16. |
Interactive Tree Of Life (iTOL): an online tool
for phylogenetic tree display and annotation.
Bioinformatics. 2007 Jan 1;
23(1):127-8. Epub 2006 Oct 18.
Interactive Tree Of Life
(iTOL) is a web-based tool for the display, manipulation and
annotation of phylogenetic trees. Trees can be interactively pruned
and re-rooted. Various types of data such as genome sizes or
protein domain repertoires can be mapped onto the tree. Export to
several bitmap and vector graphics formats is supported.
AVAILABILITY: iTOL is available at http://itol.embl.de.
|
| 17. |
Insights into social insects from the genome of
the honeybee Apis mellifera.
Nature. 2006 Oct 26;
443(7114):931-49.
Here we report the genome
sequence of the honeybee Apis mellifera, a key model for social
behaviour and essential to global ecology through pollination.
Compared with other sequenced insect genomes, the A. mellifera
genome has high A+T and CpG contents, lacks major transposon
families, evolves more slowly, and is more similar to vertebrates
for circadian rhythm, RNA interference and DNA methylation genes,
among others. Furthermore, A. mellifera has fewer genes for innate
immunity, detoxification enzymes, cuticle-forming proteins and
gustatory receptors, more genes for odorant receptors, and novel
genes for nectar and pollen utilization, consistent with its
ecology and social organization. Compared to Drosophila, genes in
early developmental pathways differ in Apis, whereas similarities
exist for functions that differ markedly, such as sex
determination, brain function and behaviour. Population genetics
suggests a novel African origin for the species A. mellifera and
insights into whether Africanized bees spread throughout the New
World via hybridization or displacement.
|
| 18. |
SMART 5: domains in the context of genomes and
networks.
Nucleic Acids Res. 2006 Jan 1;
34(Database issue):D257-60.
The Simple Modular
Architecture Research Tool (SMART) is an online resource (http://smart.embl.de/) used for protein
domain identification and the analysis of protein domain
architectures. Many new features were implemented to make SMART
more accessible to scientists from different fields. The new
'Genomic' mode in SMART makes it easy to analyze domain
architectures in completely sequenced genomes. Domain annotation
has been updated with a detailed taxonomic breakdown and a
prediction of the catalytic activity for 50 SMART domains is now
available, based on the presence of essential amino acids.
Furthermore, intrinsically disordered protein regions can be
identified and displayed. The network context is now displayed in
the results page for more than 350 000 proteins, enabling easy
analyses of domain interactions.
|
| 19. |
Nonsense-mediated mRNA decay factors act in
concert to regulate common mRNA targets.
RNA. 2005 Oct;
11(10):1530-44.
Nonsense-mediated mRNA decay
(NMD) is a surveillance pathway that degrades mRNAs containing
nonsense codons, and regulates the expression of naturally
occurring transcripts. While NMD is not essential in yeast or
nematodes, UPF1, a key NMD effector, is essential in mice. Here we
show that NMD components are required for cell proliferation in
Drosophila. This raises the question of whether NMD effectors
diverged functionally during evolution. To address this question,
we examined expression profiles in Drosophila cells depleted of all
known metazoan NMD components. We show that UPF1, UPF2, UPF3, SMG1,
SMG5, and SMG6 regulate in concert the expression of a cohort of
genes with functions in a wide range of cellular activities,
including cell cycle progression. Only a few transcripts were
regulated exclusively by individual factors, suggesting that these
proteins act mainly in the NMD pathway and their role in mRNA decay
has not diverged substantially. Finally, the vast majority of NMD
targets in Drosophila are not orthologs of targets previously
identified in yeast or human cells. Thus phenotypic differences
observed across species following inhibition of NMD can be largely
attributed to changes in the repertoire of regulated genes.
|
| 20. |
Consistency of genome-based methods in measuring
Metazoan evolution.
FEBS Lett. 2005 Jun 13;
579(15):3355-61. Epub 2005 Apr 18.
Seven distinct genome-wide
divergence measures were applied pairwise to the nine sequenced
animal genomes of human, mouse, rat, chicken, pufferfish, fruit
fly, mosquito, and two nematode worms (Caenorhabditis briggsae and
Caenorhabditis elegans). Qualitatively, all of these divergence
measures are found to correlate with the estimated time since
speciation; however, marked deviations are observed in a few
lineages. The distinct genome divergence measures also correlate
well among themselves, indicating that most of the processes
shaping genomes are dominated by neutral events. The deviations
from the clock-like scenario in some lineages are observed
consistently by several measures, implicitly confirming their
reliability.
|
| 21. |
Computational analysis of Modular Protein
Architectures.
2005; Chapter 21; Book chapter in: Modular
Protein Domains
|
| 22. |
InterPro, progress and status in 2005.
Nucleic Acids Res. 2005 Jan 1;
33(Database issue):D201-5.
InterPro, an integrated
documentation resource of protein families, domains and functional
sites, was created to integrate the major protein signature
databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom,
SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually
integrated into InterPro entries that are curated to provide
biological and functional information. Annotation is provided in an
abstract, Gene Ontology mapping and links to specialized databases.
New features of InterPro include extended protein match views,
taxonomic range information and protein 3D structure data. One of
the new match views is the InterPro Domain Architecture view, which
shows the domain composition of protein matches. Two new entry
types were introduced to better describe InterPro entries: these
are active site and binding site. PIRSF and the structure-based
SUPERFAMILY are the latest member databases to join InterPro, and
CATH and PANTHER are soon to be integrated. InterPro release 8.0
contains 11 007 entries, representing 2573 domains, 8166 families,
201 repeats, 26 active sites, 21 binding sites and 20
post-translational modification sites. InterPro covers over 78% of
all proteins in the Swiss-Prot and TrEMBL components of UniProt.
The database is available for text- and sequence-based searches via
a webserver (http://www.ebi.ac.uk/interpro),
and for download by anonymous FTP
(ftp://ftp.ebi.ac.uk/pub/databases/interpro).
|
| 23. |
Gene expression profiling of the rat superior
olivary complex using serial analysis of gene expression.
Eur J Neurosci. 2004 Dec;
20(12):3244-58.
The superior olivary complex
(SOC) is an auditory brainstem region that represents a favourable
system to study rapid neurotransmission and the maturation of
neuronal circuits. Here we performed serial analysis of gene
expression (SAGE) on the SOC in 60-day-old Sprague-Dawley rats to
identify genes specifically important for its function and to
create a transcriptome reference for the subsequent identification
of age-related or disease-related changes. Sequencing of 31 035
tags identified 10 473 different transcripts. Fifty-seven per cent
of the unique tags with a count greater than four were
statistically more highly represented in the SOC than in the
hippocampus. Among them were genes encoding proteins involved in
energy supply, the glutamate/glutamine shuttle, and myelination.
Approximately 80 plasma membrane transporters, receptors, channels,
and vesicular transporters were identified, and 25% of them
displayed a significantly higher expression level in the SOC than
in the hippocampus. Some of the plasma membrane proteins were not
previously characterized in the SOC, e.g. the purinergic receptor
subunit P2X(6) and the metabotropic GABA receptor Gpr51.
Differential gene expression between SOC and hippocampus was
confirmed using RNA in situ hybridization or immunohistochemistry.
The extensive gene inventory presented here will alleviate the
dissection of the molecular mechanisms underlying specific SOC
functions and the comparison with other SAGE libraries from brain
will ease the identification of promoters to generate
region-specific transgenic animals. The analysis will be part of
the publicly available database ID-GRAB.
|
| 24. |
Sequence and comparative analysis of the chicken
genome provide unique perspectives on vertebrate evolution.
Nature. 2004 Dec 9;
432(7018):695-716.
We present here a draft
genome sequence of the red jungle fowl, Gallus gallus. Because the
chicken is a modern descendant of the dinosaurs and the first
non-mammalian amniote to have its genome sequenced, the draft
sequence of its genome--composed of approximately one billion base
pairs of sequence and an estimated 20,000-23,000 genes--provides a
new perspective on vertebrate genome evolution, while also
improving the annotation of mammalian genomes. For example, the
evolutionary distance between chicken and human provides high
specificity in detecting functional elements, both non-coding and
coding. Notably, many conserved non-coding sequences are far from
genes and cannot be assigned to defined functional classes. In
coding regions the evolutionary dynamics of protein domains and
orthologous groups illustrate processes that distinguish the
lineages leading to birds and mammals. The distinctive properties
of avian microchromosomes, together with the inferred patterns of
conserved synteny, provide additional insights into vertebrate
chromosome architecture.
|
| 25. |
Fast identification of folded human protein
domains expressed in E. coli suitable for structural
analysis.
BMC Struct Biol. 2004 Mar 8;
4:4.
BACKGROUND: High-throughput
protein structure analysis of individual protein domains requires
analysis of large numbers of expression clones to identify suitable
constructs for structure determination. For this purpose, methods
need to be implemented for fast and reliable screening of the
expressed proteins as early as possible in the overall process from
cloning to structure determination. RESULTS: 88 different E. coli
expression constructs for 17 human protein domains were analysed
using high-throughput cloning, purification and folding analysis to
obtain candidates suitable for structural analysis. After 96
deep-well microplate expression and automated protein purification,
protein domains were directly analysed using 1D 1H-NMR
spectroscopy. In addition, analytical hydrophobic interaction
chromatography (HIC) was used to detect natively folded protein.
With these two analytical methods, six constructs (representing two
domains) were quickly identified as being well folded and suitable
for structural analysis. CONCLUSION: The described approach
facilitates high-throughput structural analysis. Clones expressing
natively folded proteins suitable for NMR structure determination
were quickly identified upon small scale expression screening using
1D 1H-NMR and/or analytical HIC. This procedure is especially
effective as a fast and inexpensive screen for the 'low hanging
fruits' in structural genomics.
|
| 26. |
SMART 4.0: towards genomic data
integration.
Nucleic Acids Res. 2004 Jan 1;
32(Database issue):D142-4.
SMART (Simple Modular
Architecture Research Tool) is a web tool (http://smart.embl.de/) for the
identification and annotation of protein domains, and provides a
platform for the comparative study of complex domain architectures
in genes and proteins. The January 2004 release of SMART contains
685 protein domains. New developments in SMART are centred on the
integration of data from completed metazoan genomes. SMART now uses
predicted proteins from complete genomes in its source sequence
databases, and integrates these with predictions of orthology. New
visualization tools have been developed to allow analysis of gene
intron-exon structure within the context of protein domain
structure, and to align these displays to provide schematic
comparisons of orthologous genes, or multiple transcripts from the
same gene. Other improvements include the ability to query SMART by
Gene Ontology terms, improved structure database searching and
batch retrieval of multiple entries.
|
| 27. |
Alternative splicing and evolution.
Bioessays. 2003 Nov;
25(11):1031-4.
Alternative splicing (AS) is a
critical post-transcriptional event leading to an increase in
transcriptome diversity. Recent bioinformatics studies revealed a
high frequency of alternative splicing. Although the extent of AS
conservation among mammals is still being discussed, it has been
argued that major forms of alternatively spliced transcripts are
much better conserved than minor forms. This suggests that
alternative splicing plays a major role in genome evolution,
allowing new exons to evolve with less constraint.
|
| 28. |
ELM server: A new resource for investigating
short functional sites in modular eukaryotic proteins.
Nucleic Acids Res. 2003 Jul 1;
31(13):3625-30.
Multidomain proteins
predominate in eukaryotic proteomes. Individual functions assigned
to different sequence segments combine to create a complex function
for the whole protein. While on-line resources are available for
revealing globular domains in sequences, there has hitherto been no
comprehensive collection of small functional sites/motifs
comparable to the globular domain resources, yet these are as
important for the function of multidomain proteins. Short linear
peptide motifs are used for cell compartment targeting,
protein-protein interaction, regulation by phosphorylation,
acetylation, glycosylation and a host of other post-translational
modifications. ELM, the Eukaryotic Linear Motif server at http://elm.eu.org/, is a new
bioinformatics resource for investigating candidate short
non-globular functional motifs in eukaryotic proteins, aiming to
fill the void in bioinformatics tools. Sequence comparisons with
short motifs are difficult to evaluate because the usual
significance assessments are inappropriate. Therefore the server is
implemented with several logical filters to eliminate false
positives. Current filters are for cell compartment, globular
domain clash and taxonomic range. In favourable cases, the filters
can reduce the number of retained matches by an order of magnitude
or more.
|
| 29. |
The InterPro Database, 2003 brings increased
coverage and new features.
Nucleic Acids Res. 2003 Jan 1;
31(1):315-8.
InterPro, an integrated
documentation resource of protein families, domains and functional
sites, was created in 1999 as a means of amalgamating the major
protein signature databases into one comprehensive resource.
PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been
manually integrated and curated and are available in InterPro for
text- and sequence-based searching. The results are provided in a
single format that rationalises the results that would be obtained
by searching the member databases individually. The latest release
of InterPro contains 5629 entries describing 4280 families, 1239
domains, 95 repeats and 15 post-translational modifications.
Currently, the combined signatures in InterPro cover more than 74%
of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15%
since the inception of InterPro. New features of the database
include improved searching capabilities and enhanced graphical user
interfaces for visualisation of the data. The database is available
via a webserver (http://www.ebi.ac.uk/interpro)
and anonymous FTP
(ftp://ftp.ebi.ac.uk/pub/databases/interpro).
|
| 30. |
Initial sequencing and comparative analysis of
the mouse genome.
Nature. 2002 Dec 5;
420(6915):520-62.
The sequence of the mouse
genome is a key informational tool for understanding the contents
of the human genome and a key experimental tool for biomedical
research. Here, we report the results of an international
collaboration to produce a high-quality draft sequence of the mouse
genome. We also present an initial comparative analysis of the
mouse and human genomes, describing some of the insights that can
be gleaned from the two sequences. We discuss topics including the
analysis of the evolutionary forces shaping the size, structure and
sequence of the genomes; the conservation of large-scale synteny
across most of the genomes; the much lower extent of sequence
orthology covering less than half of the genomes; the proportions
of the genomes under selection; the number of protein-coding genes;
the expansion of gene families related to reproduction and
immunity; the evolution of proteins; and the identification of
intraspecies polymorphism.
|
| 31. |
The genome sequence of the malaria mosquito
Anopheles gambiae.
Science. 2002 Oct 4;
298(5591):129-49.
Anopheles gambiae is the
principal vector of malaria, a disease that afflicts more than 500
million people and causes more than 1 million deaths each year.
Tenfold shotgun sequence coverage was obtained from the PEST strain
of A. gambiae and assembled into scaffolds that span 278 million
base pairs. A total of 91% of the genome was organized in 303
scaffolds; the largest scaffold was 23.1 million base pairs. There
was substantial genetic variation within this strain, and the
apparent existence of two haplotypes of approximately equal
frequency ("dual haplotypes") in a substantial fraction of the
genome likely reflects the outbred nature of the PEST strain. The
sequence produced a conservative inference of more than 400,000
single-nucleotide polymorphisms that showed a markedly bimodal
density distribution. Analysis of the genome sequence revealed
strong evidence for about 14,000 protein-encoding transcripts.
Prominent expansions in specific families of proteins likely
involved in cell adhesion and immunity were noted. An expressed
sequence tag analysis of genes regulated by blood feeding provided
insights into the physiological adaptations of a hematophagous
insect.
|
| 32. |
Comparative genome and proteome analysis of
Anopheles gambiae and Drosophila melanogaster.
Science. 2002 Oct 4;
298(5591):149-59.
Comparison of the genomes
and proteomes of the two diptera Anopheles gambiae and Drosophila
melanogaster, which diverged about 250 million years ago, reveals
considerable similarities. However, numerous differences are also
observed; some of these must reflect the selection and subsequent
adaptation associated with different ecologies and life strategies.
Almost half of the genes in both genomes are interpreted as
orthologs and show an average sequence identity of about 56%, which
is slightly lower than that observed between the orthologs of the
pufferfish and human (diverged about 450 million years ago). This
indicates that these two insects diverged considerably faster than
vertebrates. Aligned sequences reveal that orthologous genes have
retained only half of their intron/exon structure, indicating that
intron gains or losses have occurred at a rate of about one per
gene per 125 million years. Chromosomal arms exhibit significant
remnants of homology between the two species, although only 34% of
the genes colocalize in small "microsyntenic" clusters, and major
interarm transfers as well as intra-arm shuffling of gene order are
detected.
|
| 33. |
Immunity-related genes and gene families in
Anopheles gambiae.
Science. 2002 Oct 4;
298(5591):159-65.
We have identified 242
Anopheles gambiae genes from 18 gene families implicated in innate
immunity and have detected marked diversification relative to
Drosophila melanogaster. Immune-related gene families involved in
recognition, signal modulation, and effector systems show a marked
deficit of orthologs and excessive gene expansions, possibly
reflecting selection pressures from different pathogens encountered
in these insects' very different life-styles. In contrast, the
multifunctional Toll signal transduction pathway is substantially
conserved, presumably because of counterselection for developmental
stability. Representative expression profiles confirm that sequence
diversification is accompanied by specific responses to different
immune challenges. Alternative RNA splicing may also contribute to
expansion of the immune repertoire.
|
| 34. |
InterPro: an integrated documentation resource
for protein families, domains and functional sites.
Brief Bioinform. 2002 Sep;
3(3):225-35.
The exponential increase in
the submission of nucleotide sequences to the nucleotide sequence
database by genome sequencing centres has resulted in a need for
rapid, automatic methods for classification of the resulting
protein sequences. There are several signature and sequence
cluster-based methods for protein classification, each resource
having distinct areas of optimum application owing to the
differences in the underlying analysis methods. In recognition of
this, InterPro was developed as an integrated documentation
resource for protein families, domains and functional sites, to
rationalise the complementary efforts of the individual protein
signature database projects. The member databases - PRINTS,
PROSITE, Pfam, ProDom, SMART and TIGRFAMs - form the InterPro core.
Related signatures from each member database are unified into
single InterPro entries. Each InterPro entry includes a unique
accession number, functional descriptions and literature
references, and links are made back to the relevant member
database(s). Release 4.0 of InterPro (November 2001) contains 4,691
entries, representing 3,532 families, 1,068 domains, 74 repeats and
15 sites of post-translational modification (PTMs) encoded by
different regular expressions, profiles, fingerprints and hidden
Markov models (HMMs). Each InterPro entry lists all the matches
against SWISS-PROT and TrEMBL (2,141,621 InterPro hits from 586,124
SWISS-PROT and TrEMBL protein sequences). The database is freely
accessible for text- and sequence-based searches.
|
| 35. |
Common exon duplication in animals and its role
in alternative splicing.
Hum Mol Genet. 2002 Jun 15;
11(13):1561-7.
When searching the genomes
of human, fly and worm for cases of exon duplication, we found that
about 10% of all genes contain tandemly duplicated exons. In the
course of the analyses, 2438 unannotated exons were identified that
are not currently included in genome databases and that are likely
to be functional. The vast majority of them are likely to be
involved in mutually exclusive alternative splicing events. The
common nature of recent exon duplication indicates that it might
have a significant role in the fast evolution of eukaryotic genes.
It also provides a general mechanism for the regulation of protein
function.
|
| 36. |
Protein domain analysis in the era of complete
genomes.
FEBS Lett. 2002 Feb 20;
513(1):129-34.
Domains present one of the
most useful levels at which to understand protein function, and
domain family-based analysis has had a profound impact on the study
of individual proteins. Protein domain discovery has been
progressing steadily over the past 30 years. What are the
realistically achievable goals of sequence-based domain analysis,
and how far off are they for the sequences encoded in eukaryotic
genomes? Here we address some of the issues involved in better
coverage of sequence-based domain annotation, and the integration
of these results within the wider context of genomes, structures
and function.
|
| 37. |
Genome and protein evolution in
eukaryotes.
Curr Opin Chem Biol. 2002 Feb;
6(1):39-45.
The past year has seen the
completion of the genome sequence of the flowering plant
Arabidopsis thaliana and the initial sequence reports of the human
genome. The availability of completely sequenced eukaryotic genomes
from disparate phylogenetic lineages has opened the door to
comparative analyses and a better understanding of the evolutionary
processes shaping genomes. Complex many-to-many relationships
between genes from different species appear to be the norm,
suggesting that transfer of detailed functional annotation will not
be straightforward. In addition to expansion and contraction of
gene families, new genes evolve from recombination of pre-existing
domains, although some domain families do appear to have evolved
recently and to be specific to restricted phylogenetic lineages.
The overall picture is of a huge diversity of gene content within
eukaryotic genomes, reflecting different functional demands in
different species.
|
| 38. |
Recent improvements to the SMART domain-based
sequence annotation resource.
Nucleic Acids Res. 2002 Jan 1;
30(1):242-4.
SMART (Simple Modular
Architecture Research Tool, http://smart.embl-heidelberg.de)
is a web-based resource used for the annotation of protein domains
and the analysis of domain architectures, with particular emphasis
on mobile eukaryotic domains. Extensive annotation for each domain
family is available, providing information relating to function,
subcellular localization, phyletic distribution and tertiary
structure. The January 2002 release has added more than 200
hand-curated domain models. This brings the total to over 600
domain families that are widely represented among nuclear,
signalling and extracellular proteins. Annotation now includes
links to the Online Mendelian Inheritance in Man (OMIM) database in
cases where a human disease is associated with one or more
mutations in a particular domain. We have implemented new analysis
methods and updated others. New advanced queries provide direct
access to the SMART relational database using SQL. This database
now contains information on intrinsic sequence features such as
transmembrane regions, coiled-coils, signal peptides and internal
repeats. SMART output can now be easily included in users'
documents. A SMART mirror has been created at
http://smart.ox.ac.uk.
|
* equal contribution
