Progress 02/01/06 to 01/31/08
Outputs OUTPUTS: Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. The analysis suggests that the number of such missing genes is in the hundreds, not the thousands. Thus the current protein gene catalogs are likely to contain the vast majority of protein-coding gene loci. PARTICIPANTS: Not relevant to this project. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: Not relevant to this project.
Impacts This project identified 734 novel gene fragments (NGFs) containing 2,188 exons with at most weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner. Functional analysis indicates that they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Over the course of this project, we identified 351 candidate novel single exon genes (SEGs) in four batches between June 2006 and June 2007. The first three batches, totaling 180 candidate SEGs, were described in detail in our previous report. To summarize briefly, the first two batches were selected by hand-crafted rules taking into account comparative evidence (reading frame conservation, synteny) and homology with known proteins. For the third batch, we developed a logistic regression classifier that combines a number of features into a single score, with the goal of distinguishing real genes from pseudogenes and other false positives. We applied this classifier to all open reading frames (ORFs) of length at least 200 bp in the human genome and selected the highest-ranking candidates for testing.
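The idea of combining several features into a single logistic score and ranking candidates by it can be sketched as follows. This is a minimal illustration only: the feature names, weights, and bias are hypothetical placeholders, since the report does not list the classifier's actual features or fitted coefficients.

```python
import math

# Hypothetical weights and bias, chosen for illustration only.
# The report does not state the actual features or coefficients.
WEIGHTS = {
    "frame_conservation": 2.0,  # assumed: reading frame conservation in [0, 1]
    "synteny": 1.0,             # assumed: 1.0 if syntenic, else 0.0
    "protein_homology": 1.5,    # assumed: homology evidence in [0, 1]
}
BIAS = -2.5


def logistic_score(features):
    """Combine feature values into a single score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates):
    """Return candidate ids sorted from highest to lowest score."""
    return sorted(
        candidates,
        key=lambda cid: logistic_score(candidates[cid]),
        reverse=True,
    )


# Two made-up candidate ORFs with made-up feature values.
candidates = {
    "orf_a": {"frame_conservation": 0.9, "synteny": 1.0, "protein_homology": 0.8},
    "orf_b": {"frame_conservation": 0.2, "synteny": 0.0, "protein_homology": 0.1},
}
ranking = rank_candidates(candidates)
```

In practice the weights would be fitted on labeled examples (known genes versus pseudogenes and other false positives) rather than set by hand, and only the top of the ranking would go forward to testing.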
Publications
- A. Siepel, M. Diekhans, B. Brejova, L. Langton, M. Stevens, C. L. Comstock, C. Davis, B. Ewing, S. Oommen, C. Lau, H.-C. Yu, J. Li, B. A. Roe, P. Green, D. S. Gerhard, G. Temple, D. Haussler, and M. R. Brent. Targeted discovery of novel human exons by comparative genomics. Genome Res., 17(12):1763-1773, 2007.
|
Progress 10/01/06 to 09/30/07
Outputs OUTPUTS: The mRNA sequence data produced in this project is publicly available from GenBank, and predicted and validated genes are available as tracks in the Mammalian Gene Collection browser (http://mgc.ucsc.edu/).
PARTICIPANTS: Cornell: Brona Brejova, postdoctoral associate; UC Santa Cruz: David Haussler, Mark Diekhans; Dana Farber Cancer Institute: Kourosh Salehi-Ashtiani
Impacts We have conducted a computational and experimental screen for candidate novel single exon genes (SEGs) in the human genome. We combined several sources of information to computationally select the most promising SEG candidates and tested them for expression by RT-PCR. Over the course of this project, we identified 351 candidate novel SEGs in four batches between June 2006 and June 2007. To summarize briefly, the first two batches were selected by hand-crafted rules taking into account comparative evidence (reading frame conservation, synteny) and homology with known proteins. For the third batch, we developed a logistic regression classifier that combines a number of features into a single score, with the goal of distinguishing real genes from pseudogenes and other false positives. We applied this classifier to all open reading frames (ORFs) of length at least 200 bp in the human genome and selected the highest-ranking candidates for testing.
This classifier was further refined for the fourth batch. In addition, we have completed a large-scale effort to identify novel multi-exon genes in the human genome by comparative genomic methods, and to test them systematically by RT-PCR. This project led to the identification of 734 novel gene fragments (NGFs) containing 2,188 exons with at most weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which about 200 are completely absent from the major gene catalogs, while the rest represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development.
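The ORF enumeration step that feeds the single exon gene classifier can be sketched as a six-frame scan with a minimum length cutoff. This is a rough sketch under stated assumptions: the report does not define an ORF precisely, so the code assumes an ORF runs from an ATG to the first in-frame stop codon (stop included), and the function names are illustrative rather than taken from the project's software.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=200):
    """Yield (strand, start, end) for ORFs of at least min_length bp.

    Assumption: an ORF is ATG through the first in-frame stop codon.
    Coordinates are 0-based, end-exclusive, relative to the scanned strand.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the strand one codon at a time in this frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_length:
                        yield strand, start, end
                    start = None


# Toy sequence: start codon, 70 alanine codons, stop codon (216 bp).
toy = "ATG" + "GCA" * 70 + "TAA"
orfs = list(find_orfs(toy, min_length=200))
```

A genome-scale run would stream each chromosome rather than hold one string in memory, and each surviving ORF would then be annotated with comparative features before scoring.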
Publications
- Siepel A, Diekhans M, Brejova B, Langton L, Stevens M, Comstock CLG, Davis C, Ewing B, Oommen S, Lau C, Yu H-C, Li J, Roe BA, Green P, Gerhard DS, Temple G, Haussler D, Brent MR. 2007. Targeted discovery of novel human exons by comparative genomics. Genome Res., 17:1763-1773.
|
Progress 01/01/06 to 12/31/06
Outputs This was a pilot project to determine the feasibility of identifying novel single exon genes in the human genome by comparative genomics. The mouse, rat, dog, and other mammalian genomes were used in the analysis. Our initial work identified 300-500 candidate genes. Twelve high-confidence candidates were submitted to the Mammalian Gene Collection project for RT-PCR validation. The results of the pilot project were generally regarded as promising, and a one-year extension was granted to continue with single exon gene identification. So far this extension has resulted in the identification and RT-PCR validation of about 50 new human genes. A publication is in preparation.
Impacts This work will identify perhaps 100 new human genes, will produce an improved estimate of how many human genes may remain to be identified, and will produce a high-quality publication.
Publications
- No publications reported this period
|