Progress 10/01/08 to 09/30/09
Outputs OUTPUTS: Our first set of research activities has centered on an extraordinary data set consisting of 22 newly sequenced eutherian mammalian genomes. We have participated in an international consortium that will soon publish an initial analysis of these sequences (2x Mammalian Sequencing Consortium, in prep.), and have taken the lead on three auxiliary projects that either have been, or will be, published separately. We developed a new statistical phylogenetic method, called phyloP, that can detect possible negative or positive selection at individual sites in mammalian genomes, either across all branches of the phylogeny, or in individual clades or lineages (Pollard et al., 2010). PhyloP supports four different statistical tests for conservation or acceleration, and is the basis of a series of new tracks in the UCSC Genome Browser. In addition, we have performed a comprehensive analysis of sequencing error in the new low-coverage genome assemblies, considering the overall amount of error, its implications for various phylogenomic analyses of interest, and strategies for automatically processing the sequences to mitigate the effects of error (Hubisz et al., submitted). Finally, we have developed a computational method, called dmotif, that identifies cases of transcription factor binding site gain or loss along the branches of a phylogeny, and we have applied this method to the new genome-wide alignments of mammalian genomes (Diehl et al., in prep.).
Our second set of research activities has focused on phylogenomic analyses of the new primate genome sequences that are rapidly becoming available. A major component of this work was our involvement in the orangutan genome project, which began in the summer of 2008 and is just now nearing completion (the genome paper will be submitted to Nature in January). Our primary role in this project was to carry out a genome-wide scan for genes under positive selection (similar to our previous work with the rhesus macaque genome). In addition, we analyzed several gene families that have undergone recent expansions in primates, using methods that we have developed over the past few years (Yu et al., 2009; Vinar et al., 2009). During our work on the orangutan genome project, we became interested in an emerging research area at the boundary between phylogenetics and population genetics, which is concerned with making inferences about ancestral populations based on the variation in genealogies across loci. We published a detailed review on this subject (Siepel, 2009) and then employed existing tools in an analysis of several human genomes (including one from a San Bushman individual) with a chimpanzee outgroup. Work in our group is underway to improve these methods to accommodate intralocus recombination and migration between populations.
In addition, we made significant improvements to our PHAST software package, while keeping up with general maintenance and user support. PHAST has been downloaded more than 300 times (counting unique IP addresses only) since our last report, or nearly once per day -- a considerable increase in the download rate compared with previous years.
PARTICIPANTS: Members of the Siepel Lab contributing to the project during this reporting period. Staff members: Melissa Hubisz. Postdoctoral associates: Charles Danko, Ilan Gronau. Graduate students: Adam Diehl, Mike Phillips, Andre Luis Martins.
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
Impacts The project to sequence and analyze 22 additional eutherian genomes, in which we have participated, will have a major impact on mammalian genomics by providing many new opportunities for functional and evolutionary analyses. We expect our related work on the problem of detecting possible negative or positive selection, at single-base resolution, to be especially useful in helping researchers to identify and characterize functional elements in these genomes. The new tracks we have developed for the UCSC Genome Browser, based on the phyloP and phastCons programs, are already being widely used. In addition, we expect our work on sequencing error to be important in helping researchers to understand the limitations of these new data sets. We have made available a version of the current alignments that has been processed by our automatic methods for "sequencing error mitigation," and we expect these alignments will be useful to researchers who are concerned about the effects of error on downstream analyses. Our work on turnover of regulatory elements in mammalian genomes is still in progress, but we plan to make our predictions available as browser tracks, for use by the research community in understanding the evolutionary dynamics of regulatory elements in mammals. Similarly, our work on phylogenomics of primates will help to shed light on the evolutionary history of humans and our closest relatives. Our work on positive selection in protein-coding genes, and on recent gene family expansions, helps to illuminate the genetic basis of differences between primate species, and the role of natural selection in species diversification. We anticipate that our newer work, at the intersection of population genetics and phylogenetics, will shed additional light on primate evolution, particularly on the nature of the ancestral populations that gave rise to today's primate species. My review article on this topic appears to have reached a fairly wide audience, and I am hopeful that it will raise awareness of the challenges of applying phylogenomics to groups such as the primates, in which the intervals between speciation events are small relative to the sizes of ancestral populations.
Publications
- Pollard, KS, Hubisz, MJ, Rosenbloom, K. and Siepel, A. 2010. Detection of non-neutral substitution rates on Mammalian phylogenies. Genome Res, 10.1101/gr.097857.109.
- Siepel, A. 2009. Phylogenomics of primates and their ancestral populations. Genome Res, 19:1929-1941.
- Siepel, A. 2009. Darwinian alchemy: human genes from noncoding DNA. Genome Res, 19:1693-1695.
- The Mammalian Gene Collection Project Team. 2009. The completion of the Mammalian Gene Collection (MGC). Genome Res, 10.1101/gr.095976.109.
- Vinar, T, Brejova, B, Song, G, and Siepel, A. 2009. Reconstructing histories of complex gene clusters on a phylogeny. In RECOMB Comparative Genomics.
- Zhang, Y, Song, G, Vinar, T, Green, ED, Siepel, A, and Miller, W. 2009. Evolutionary history reconstruction for Mammalian complex gene clusters. J Comput Biol, 16(8):1051-1070.
Progress 10/01/07 to 09/30/08
Outputs OUTPUTS: In our work on gene finding, we have used our Exoniphy computer program to identify thousands of sequences in the human genome that show strong evolutionary signatures of protein-coding function, but are not annotated as genes. These predictions, together with predictions by our collaborators, have recently enabled the validation by RT-PCR of 2188 novel human protein-coding exons, corresponding to an estimated 563 genes, of which >160 are completely absent from the major gene catalogs (Siepel et al., 2007). All sequenced RT-PCR products have been submitted to GenBank and are available to the public. We have recently applied similar methods to identify 200 RT-PCR-supported novel single-exon genes in the human genome. Our latest efforts in comparative gene finding have focused on the difficult problem of modeling the evolution of gene structure as well as gene sequence. We have begun a project focused on arrays of duplicated genes in primates. This project is still in the model development stage, but it has progressed significantly in recent months. We have developed a Markov chain Monte Carlo approach for sampling gene duplication histories consistent with a given species tree, and we have shown using simulated data that it is capable of reconstructing the complex histories of these clusters with high accuracy. This method solves an intriguing and difficult version of the phylogeny reconstruction problem -- essentially a series of coupled gene-tree reconstruction problems, with constraints imposed by a species tree. Our efforts to identify and characterize noncoding elements have also gained momentum recently. Our focus has been on identifying evolutionary turnover events of regulatory elements in mammals. We have developed a new method, called dmotif, that takes a position-specific weight matrix, a neutral phylogenetic model, and a multiple alignment as input and predicts gain and loss events of regulatory elements along the branches of the tree. The method is fully Bayesian and is based on an efficient Gibbs sampling algorithm. At present it has only been applied to simulated data, but its performance is very encouraging. We have continued to maintain and improve the PHAST software package, and have begun work on improving its documentation and accessibility (objective 6). The package now has its own web site (http://compgen.bscb.cornell.edu/phast), with help pages for all programs and an open download link. PHAST is now officially "open source" and is available under a BSD-style license. A "phast-users" mailing list is also in place. A major expansion of the phyloP program was recently completed, to support conservation scoring at the level of individual bases by several different algorithms. A new conservation track based on phyloP will soon be released in the UCSC Genome Browser, and will complement the widely used phastCons track. We have developed a prototype of the proposed interface between PHAST and the R programming environment (called RPHAST) and plan to refine it over the next several months.
PARTICIPANTS: Assigned lab members: 1. Tomas Vinar is a postdoctoral associate devoted full-time to the project during the reporting period. His focus has been on the development of new methods for reconstructing the evolutionary histories of complex tandem arrays of duplicated genes. 2. Bronislava (Brona) Brejova is a postdoctoral associate devoted full-time to the project since December 1, 2007. Her focus has been on the development of new statistical models for the evolution of both gene sequences and exon-intron structure, with applications in gene finding and molecular evolution. Collaborators: Dr. Katherine Pollard (UC San Francisco) is collaborating with us on the development of PHAST-R, an R interface to our Phylogenetic Analysis with Space/Time models (PHAST) software package.
TARGET AUDIENCES: Not relevant to this project.
PROJECT MODIFICATIONS: Not relevant to this project.
Impacts Our work on human gene discovery will have major consequences in genetics and molecular biology. The new genes we have found are gradually being incorporated into the major gene databases (RefSeq and ENSEMBL), and are becoming available to thousands of researchers doing applied and basic science. Among other things, they will help to make global studies of gene expression, gene evolution, and genetic disease associations more accurate and more complete. Our work has also demonstrated that many important genes can be missed by traditional approaches to gene discovery but are identifiable by comparative genomics, paving the way for similar efforts in mammals and in other groups of organisms. Our work on complex tandem arrays of duplicated genes is just getting underway but we are optimistic that our new methods will provide valuable insights into the evolution and function of many important gene clusters in the human genome. We have recently identified about a dozen clusters that appear to have undergone major duplication and/or loss events during primate evolution, and that also appear to be associated with human diseases. Our collaborator, Eric Green of the National Human Genome Research Institute (NHGRI), is in the process of sequencing these regions in 8-10 primates. We plan to apply our new computational tools to reconstruct the evolutionary histories of these regions, to find signatures of selection, and to shed light on the roles of these clusters in disease and in the divergence of primate species. Similarly, our work on turnover of regulatory elements is still in the method development phase, but we are preparing to apply our new methods to a comparative genomic data set of unprecedented scale, consisting of 44 aligned vertebrate genomes (in collaboration with investigators at the Broad Institute, UC Santa Cruz, and elsewhere). 
We plan to incorporate the latest chromatin immunoprecipitation (ChIP/chip and ChIP/seq) data into this analysis, to improve our ability to identify binding sites in one or more species. We expect this work to provide many valuable insights into the evolutionary dynamics of regulatory elements in mammals. With collaborators, we have also continued to study noncoding elements in non-mammalian species, and have shed light on both conserved and rapidly evolving elements in Drosophila (Holloway et al., 2008) and Solanaceae (Wang et al., 2008) species.
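As a rough illustration of the position-specific weight matrix input mentioned above for dmotif, the following Python sketch builds a log-odds matrix from per-position base counts and finds the best-scoring window in a single sequence. It is not the dmotif algorithm: it ignores the phylogenetic model and the multiple alignment entirely, and the function names, the uniform background, and the pseudocount of 1 are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, background=None, pseudocount=1.0):
    """Turn per-position base counts into log2 odds against a background.

    counts: list of dicts mapping base -> observed count, one per motif position.
    background: dict of base frequencies; uniform 0.25 is assumed if omitted.
    """
    background = background or {b: 0.25 for b in BASES}
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + pseudocount * len(BASES)
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return matrix

def best_site(seq, matrix):
    """Return (score, offset) of the highest-scoring window, or None if no window fits."""
    width = len(matrix)
    best = None
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing gaps or ambiguous bases
        score = sum(matrix[j][window[j]] for j in range(width))
        if best is None or score > best[0]:
            best = (score, i)
    return best

# Hypothetical four-column motif that strongly prefers T, A, T, A.
example_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
example_matrix = log_odds_matrix(example_counts)
example_hit = best_site("GGTATAGG", example_matrix)
```

Scoring each species' sequence independently in this way is the simple baseline that a phylogenetic method is meant to improve on, since it cannot distinguish a true gain or loss along a branch from noise in a single sequence.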
Publications
- Kosiol C, Vinar T, da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A, "Patterns of positive selection in six mammalian genomes", PLoS Genetics, p. e1000144, vol. 4, (2008).
- Wang Y, Diehl A, Wu F, Vrebalov J, Giovannoni J, Siepel A, Tanksley SD, "Sequencing and comparative analysis of a conserved syntenic segment in the Solanaceae", Genetics, p. 391, vol. 180, (2008).
- Holloway A, Begun D, Siepel A, Pollard KS, "Adaptive evolution drives accelerated sequence divergence of conserved genomic elements in Drosophila melanogaster", Genome Research, p. 1592, vol. 18, (2008).
- Siepel A, Diekhans M, Brejova B, Langton L, Stevens M, Comstock CLG, Davis C, Ewing B, Oommen S, Lau C, Yu H-C, Li J, Roe BA, Green P, Gerhard DS, Temple G, Haussler D, Brent MR, "Targeted discovery of novel human exons by comparative genomics", Genome Research, p. 1763, vol. 17, (2007).
Progress 03/01/07 to 09/30/07
Outputs OUTPUTS: As proposed, we have continued to maintain our widely used PHAST (PHylogenetic Analysis with Space/Time models) software package, and have begun work on improving its documentation and accessibility. The package has now been downloaded more than 200 times, and we continue to provide routine support and bug fixes to its users. One program in particular, called phastCons, has gained a large following. This program is the basis of the conservation tracks in the UC Santa Cruz Genome Browser, a public community resource used by about 6,000 scientists per day. In addition, we have begun work on developing PHAST into a resource for education as well as research. Several students in my computational genomics class have used PHAST for their class projects. We have developed two courses on computational genomics: a rigorous introductory course and a graduate seminar. The introductory course covers topics such as sequence alignment, gene and motif finding, and phylogeny reconstruction, with an emphasis on a comparative and evolutionary perspective on genomics. The students complete five challenging homework assignments, a midterm, and a six-week project involving probabilistic modeling, software implementation, and data analysis. Some class projects have developed into undergraduate honors theses or lab rotation projects. The graduate seminar is structured around recent papers from the literature on computational genomics. One to two papers per week are read and discussed, with students taking turns presenting papers. Each course has now been taught twice, with enrollment ranging from 11 to 25 -- reasonably large numbers for courses of this kind. Our predictions of genes under positive selection have been used by the Allen Institute for Brain Science in developing its map of gene expression in the human cortex (http://humancortex.alleninstitute.org/has/).
PARTICIPANTS: Assigned lab members: 1. Tomas Vinar. Tomas is a postdoctoral associate devoted full-time to the project since July 2007. His focus so far has been on our comprehensive analysis of genes under positive selection in mammals. 2. Carolin Kosiol. Carolin is a postdoctoral associate who worked on the project part-time early in 2007. Her focus has been on the analysis of genes under positive selection in mammals. Collaborators: Cornell: Carlos D. Bustamante is collaborating with us on the analysis of positively selected genes in mammals. University of California, Davis: Drs. Katherine Pollard and Duncan Temple Lang of the Statistics Dept. are collaborating with us on the development of PHAST-R, an R interface to our Phylogenetic Analysis with Space/Time models (PHAST) software package. University of Copenhagen: Dr. Rasmus Nielsen of the Center for Comparative Genomics is collaborating with us in our analysis of positive selection in mammalian genomes.
Impacts Our efforts on modeling of protein-coding genes and genome-wide analysis have so far focused on the detection of genes under positive Darwinian selection in mammals. We have conducted the most comprehensive analysis to date of mammalian positively selected genes (PSGs) using the six high-quality, high-coverage eutherian mammalian genome assemblies now available (human, chimpanzee, rhesus macaque, mouse, rat, dog). We developed a computational pipeline to identify high-confidence orthologs, to mask out regions of low-quality sequence or uncertain alignment, and to test for selection using likelihood ratio tests (LRTs) based on codon models of sequence evolution. The identified PSGs were then analyzed for enriched functional categories, gene expression patterns, and relationships with known biological pathways. This analysis identified 544 genes with strong evidence of positive selection on one or more mammalian lineages. This is a substantial increase over previous studies, owing to the increased phylogenetic depth of our alignments, and our improved pipeline for ortholog identification. The identified PSGs were enriched for roles in defense/immunity, chemosensory perception, and reproduction, consistent with previous studies, but our larger data set allowed for a much finer-grained analysis of gene function. For example, we found significant enrichments for genes involved in complement-mediated immunity and bitter taste reception. In addition, we were able for the first time to compare the sets of genes under positive selection in primates and rodents. Primate PSGs primarily tend to function in sensory perception while rodent PSGs tend to have roles in immunity and defense. We also found several cases of multiple PSGs from the same pathway, and we found some evidence of co-evolution of these genes. PSGs were found to be expressed at significantly lower levels, and in a more tissue-specific manner, than non-PSGs. Our analysis has a number of important implications. For example, the finding that genes of the same pathway often appear to evolve under positive selection in concert suggests that more can be learned about positive selection, and perhaps more PSGs can be identified, by performing these analyses at the level of whole pathways instead of with individual genes. The striking relationship we found between gene expression patterns and positive selection -- with PSGs showing much lower expression levels and much greater tissue specificity than non-PSGs -- suggests that adaptive evolution may be much more likely for genes that have greater evolutionary 'flexibility' because they are active in a narrower range of conditions. Our analysis also helps to reveal the genetic basis of differences between species. It was particularly interesting to see strong evidence of positive selection in genes involved in the perception of bitter tastes, which play a critical role in how organisms avoid toxic and harmful substances.
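The likelihood ratio tests mentioned above compare the fit of two nested models of sequence evolution, one that allows positive selection and one that does not. The Python sketch below shows the generic calculation only, assuming the maximized log-likelihoods have already been obtained from some codon-model software and that a plain chi-square null distribution applies. The function name and the example log-likelihood values are hypothetical, and the report does not state which null distribution or multiple-testing correction the pipeline used.

```python
import math

def lrt_pvalue(lnl_null, lnl_alt, df=1):
    """Generic likelihood ratio test for nested models.

    lnl_null: maximized log-likelihood under the null (no positive selection).
    lnl_alt:  maximized log-likelihood under the alternative.
    df:       difference in number of free parameters (1 or 2 in this sketch).
    Returns (test statistic, chi-square upper-tail p-value).
    """
    # The statistic is floored at zero to absorb small optimizer errors.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        # Chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
        p = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        # Chi-square with 2 df: P(X > x) = exp(-x / 2).
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p

# Hypothetical log-likelihoods for a single gene.
example_stat, example_p = lrt_pvalue(-1000.0, -998.0, df=1)
```

In a genome-wide scan the test is repeated for every gene, so the per-gene p-values would normally be adjusted for multiple comparisons before a set of PSGs is reported.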
Publications
- Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk A, Weinstock GM, Lindblad-Toh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ. 2007. 28-way vertebrate alignment and conservation track in the UCSC genome browser. Genome Res., 17:1797-1808.