Development of a Web Interface for Mammalian Gene Collection Program Data

Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA,NY 14853

Performing Department
BIOLOGICAL STATISTICS & COMPUTATIONAL BIOLOGY

Non Technical Summary
As we get closer and closer to a complete catalog of genes in the human genome, single exon genes (SEGs) are emerging as perhaps the single most neglected category of genes. About 10 percent of known human genes are SEGs, but because these genes are hard to identify computationally and hard to validate experimentally, it is likely that they make up a substantially larger percentage of genes yet to be identified. This project will lead to improved methods for predicting SEGs and will identify a set of candidate SEGs that can be tested experimentally.

Animal Health Component

25%

Research Effort Categories

Basic

(N/A)

Applied

25%

Developmental

75%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
901	7310	1080	100%

Knowledge Area
901 - Program and Project Design, and Statistics;

Subject Of Investigation
7310 - Experimental design and statistical methods;

Field Of Science
1080 - Genetics;

Keywords

single exon genes

computational gene prediction

comparative genomics

phylogenetic modeling

Goals / Objectives
This is a pilot project focused on the identification of single-exon protein-coding genes using phylogenetic models and comparative sequence data. The primary objectives are: (1) to identify a subset of existing exon predictions that potentially represent novel single exon genes (SEGs); (2) to develop a set of filters for candidate SEGs that remove likely pseudogenes and other false positive predictions; (3) to develop a method to rank candidate SEGs based on phylogenetic evidence; and (4) to submit a set of high-confidence candidate SEGs to collaborators in the Mammalian Gene Collection project for experimental testing.

Project Methods
The initial set of candidate single exon genes will come from genome-wide predictions recently made with the ExoniPhy program. Filters will be based on overlap with predictions of pseudogenes and recent segmental duplications, on end-to-end homology with known genes, on large-scale synteny between species, and on manual inspection in the UCSC Genome Browser. For ranking predictions, a log-odds score will be developed based on ExoniPhy's evolutionary model. The final predictions will be submitted electronically in the form of a General Feature Format (GFF) file.
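The filter, rank, and export steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the likelihood values are placeholders rather than ExoniPhy output, the only filter shown is pseudogene overlap, and the function names, dictionary keys, and the `SEG_pipeline` source label are hypothetical.

```python
import math

def log_odds(lik_coding, lik_noncoding):
    """Base-2 log-odds of a coding model versus a noncoding model.
    Both likelihoods are placeholders for values an evolutionary model would supply."""
    return math.log2(lik_coding / lik_noncoding)

def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end

def filter_and_rank(candidates, pseudogenes):
    """Drop candidates that overlap a predicted pseudogene on the same
    chromosome, then sort the remainder by descending score."""
    kept = []
    for c in candidates:
        hit = any(
            p["chrom"] == c["chrom"]
            and overlaps(c["start"], c["end"], p["start"], p["end"])
            for p in pseudogenes
        )
        if not hit:
            kept.append(c)
    return sorted(kept, key=lambda c: c["score"], reverse=True)

def to_gff(c, source="SEG_pipeline", feature="CDS"):
    """Format one candidate as a nine-column, tab-separated GFF line
    (seqname, source, feature, start, end, score, strand, frame, group)."""
    fields = [
        c["chrom"], source, feature,
        str(c["start"]), str(c["end"]),
        f"{c['score']:.2f}", c["strand"], "0", c["name"],
    ]
    return "\t".join(fields)

# Hypothetical candidates; coordinates and likelihoods are made up.
candidates = [
    {"name": "seg1", "chrom": "chr1", "start": 100, "end": 400,
     "strand": "+", "score": log_odds(4.0, 1.0)},
    {"name": "seg2", "chrom": "chr1", "start": 900, "end": 1300,
     "strand": "-", "score": log_odds(8.0, 1.0)},
    {"name": "seg3", "chrom": "chr2", "start": 50, "end": 350,
     "strand": "+", "score": log_odds(16.0, 1.0)},
]
pseudogenes = [{"chrom": "chr2", "start": 300, "end": 600}]

ranked = filter_and_rank(candidates, pseudogenes)
gff_lines = [to_gff(c) for c in ranked]
```

In this sketch `seg3` has the highest score but is discarded because it overlaps a pseudogene prediction, which shows why filtering is applied before ranking.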

Progress 02/01/06 to 01/31/08

Outputs
OUTPUTS: Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. The analysis suggests that the number of such missing genes is in the hundreds, but not in the thousands. Thus the current protein gene catalogs are likely to contain the vast majority of protein-coding gene loci. PARTICIPANTS: Not relevant to this project. TARGET AUDIENCES: Not relevant to this project. PROJECT MODIFICATIONS: Not relevant to this project.

Impacts
This project identified 734 novel gene fragments (NGFs) containing 2,188 exons with at most weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner. Functional analysis indicates that they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Over the course of this project, we identified 351 candidate novel single exon genes (SEGs) in four batches between June 2006 and June 2007. The first three batches, totaling 180 candidate SEGs, were described in detail in our previous report. To summarize briefly, the first two batches were selected by hand-crafted rules taking into account comparative evidence (reading frame conservation, synteny) and homology with known proteins. For the third batch, we developed a logistic regression classifier that combines a number of features into a single score, with the goal of distinguishing real genes from pseudogenes and other false positives. We applied this classifier to all open reading frames (ORFs) of length at least 200 bp in the human genome and selected the highest-ranking candidates for testing.
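As an illustration of the ORF enumeration step, the following Python sketch scans both strands of a DNA sequence for ORFs of at least 200 bp. It rests on assumptions the report does not state: an ORF is taken to run from the first in-frame ATG to the next in-frame stop codon, the length includes the stop codon, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_len=200):
    """Yield (start, end) spans, 0-based and half-open, for each ATG-to-stop
    ORF of at least min_len bases in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:   # length includes the stop codon
                    yield (start, i + 3)
                start = None                   # look for the next ORF

def find_orfs(seq, min_len=200):
    """Return (start, end, strand) for ORFs on both strands, with
    coordinates always given on the forward strand."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in orfs_in_strand(seq, min_len)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_len):
        found.append((n - e, n - s, "-"))      # map back to forward coordinates
    return found

# A made-up 216 bp ORF: ATG, 70 alanine codons, then a stop codon.
example = "ATG" + "GCT" * 70 + "TAA"
forward_hits = find_orfs(example)
reverse_hits = find_orfs(reverse_complement(example))
```

A genome-scale scan would stream chromosomes rather than hold strings in memory, but the frame-by-frame logic would be the same.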

Publications

A. Siepel, M. Diekhans, B. Brejova, L. Langton, M. Stevens, C. L. Comstock, C. Davis, B. Ewing, S. Oommen, C. Lau, H.-C. Yu, J. Li, B. A. Roe, P. Green, D. S. Gerhard, G. Temple, D. Haussler, and M. R. Brent. Targeted discovery of novel human exons by comparative genomics. Genome Res., 17(12):1763-1773, 2007.

Progress 10/01/06 to 09/30/07

Outputs
OUTPUTS: The mRNA sequence data produced in this project is publicly available from GenBank, and predicted and validated genes are available as tracks in the Mammalian Gene Collection browser (http://mgc.ucsc.edu/). PARTICIPANTS: Cornell: Brona Brejova, postdoctoral associate; UC Santa Cruz: David Haussler, Mark Diekhans; Dana-Farber Cancer Institute: Kourosh Salehi-Ashtiani.

Impacts
We have conducted a computational and experimental screen for candidate novel single exon genes (SEGs) in the human genome. We combined several sources of information to computationally select the most promising SEG candidates and tested them for expression by RT-PCR. Over the course of this project, we identified 351 candidate novel SEGs in four batches between June 2006 and June 2007. To summarize briefly, the first two batches were selected by hand-crafted rules taking into account comparative evidence (reading frame conservation, synteny) and homology with known proteins. For the third batch, we developed a logistic regression classifier that combines a number of features into a single score, with the goal of distinguishing real genes from pseudogenes and other false positives. We applied this classifier to all open reading frames (ORFs) of length at least 200 bp in the human genome and selected the highest-ranking candidates for testing. This classifier was further refined for the fourth batch. In addition, we have completed a large-scale effort to identify novel multi-exon genes in the human genome by comparative genomic methods, and to test them systematically by RT-PCR. This project led to the identification of 734 novel gene fragments (NGFs) containing 2,188 exons with at most weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which about 200 are completely absent from the major gene catalogs, while the rest represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development.
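The way a logistic regression classifier collapses several features into one score can be sketched as follows. The feature names, weights, and bias are hypothetical and chosen only for illustration; the report does not list the features or fitted coefficients actually used. In practice the weights would be fit to labeled examples of real genes and false positives.

```python
import math

def logistic_score(features, weights, bias):
    """Combine named feature values into a single score in (0, 1):
    sigmoid(bias + sum of weight * value)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates, weights, bias, top_n):
    """Score every candidate ORF and return the top_n (score, name) pairs,
    highest score first."""
    scored = [
        (logistic_score(feats, weights, bias), name)
        for name, feats in candidates.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_n]

# Hypothetical coefficients (not fitted values from the project).
weights = {"frame_conservation": 2.0, "synteny": 1.0, "protein_homology": 1.5}
bias = -2.0

# Hypothetical binary feature values for three candidate ORFs.
candidates = {
    "orfA": {"frame_conservation": 1, "synteny": 1, "protein_homology": 1},
    "orfB": {"frame_conservation": 0, "synteny": 1, "protein_homology": 0},
    "orfC": {"frame_conservation": 1, "synteny": 0, "protein_homology": 0},
}

top_two = rank_candidates(candidates, weights, bias, top_n=2)
```

Because the score is a monotone function of the weighted sum, ranking by score and ranking by the linear term give the same order; the sigmoid only makes the values readable as probabilities.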

Publications

Siepel A, Diekhans M, Brejova B, Langton L, Stevens M, Comstock CLG, Davis C, Ewing B, Oommen S, Lau C, Yu H-C, Li J, Roe BA, Green P, Gerhard DS, Temple G, Haussler D, Brent MR. 2007. Targeted discovery of novel human exons by comparative genomics. Genome Res., 17:1763-1773.

Progress 01/01/06 to 12/31/06

Outputs
This was a pilot project to determine the feasibility of identifying novel single exon genes in the human genome by comparative genomics. The mouse, rat, dog, and other mammalian genomes were used in the analysis. Our initial work identified 300-500 candidate genes. Twelve high-confidence candidates were submitted to the Mammalian Gene Collection project for RT-PCR validation. The results of the pilot project were generally regarded as promising, and a one-year extension was granted to continue with single exon gene identification. So far this extension has resulted in the identification and RT-PCR validation of about 50 new human genes. A publication is in preparation.

Impacts
This work will identify perhaps 100 new human genes, will produce an improved estimate of how many human genes may remain to be identified, and will produce a high-quality publication.

Publications

No publications reported this period