Developing an accurate computer program to identify potential genes in tomato genome sequence

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

NRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

0211422

Grant No.

2007-35300-18460

Cumulative Award Amt.

$398,000.00

Proposal No.

2007-02761

Multistate No.

(N/A)

Project Start Date

Sep 1, 2007

Project End Date

Aug 31, 2009

Grant Year

2007

Program Code

[52.1]- (N/A)

Recipient Organization
DONALD DANFORTH PLANT SCIENCE CENTER
975 NORTH WARSON ROAD
ST. LOUIS,MO 63132

Performing Department
(N/A)

Non Technical Summary
A broad understanding of the genes present in tomato would provide the identities of many of the genes responsible for controlling growth and development that ultimately influence agronomically important traits. In recognition of the scientific and agricultural significance of tomato, its genome is currently being sequenced by a consortium of 10 countries funded by their respective governments. However, the resulting sequence data alone will provide no knowledge of tomato gene content. Gene identification will rely on high-throughput computational tools to identify and predict the structure of the tomato genes. The purpose of this project is to improve upon the current computational tools for tomato gene prediction by developing a comprehensive "training set" of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. The training set will consist of tomato gene models identified by assembling the large collection of public ESTs with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full-length-enriched cDNA pool, and aligning these assemblies to available tomato genomic sequences. Tomato-trained TWINSCAN will be freely available to the scientific community under the Open-Source software agreement.

Animal Health Component

(N/A)

Research Effort Categories

Basic

100%

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	1460	1040	100%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
1460 - Tomato;

Field Of Science
1040 - Molecular biology;

Keywords

tomato

genome annotation

gene prediction;

Goals / Objectives
The central goal of this research is to provide a computer gene prediction program that has been optimized for identifying genes in tomato genome sequence. The Solanaceae encompass diverse crop species including the tuber-bearing potato, a number of fruit-bearing vegetables, ornamental flowers, edible leaves, and medicinals. In addition, the Solanaceae are the primary source of vegetables in the U.S. and the world. In recognition of the scientific and agricultural significance of tomato, a consortium of 10 countries funded by their respective governments is currently sequencing its genome. A broad understanding of the genes present in tomato would provide the identities, and eventually the map positions, of many of the genes responsible for controlling growth and development that ultimately influence agronomically important traits. The sequence data from the tomato genome sequencing project will consist of a substantial collection of large contiguous nucleotide segments about whose content little or nothing will be known a priori. Therefore, high-throughput computational tools that can identify and accurately predict the structure of the tomato genes will be necessary for annotating and understanding the tomato genome and relating it to other solanaceous plants and related taxa. We will improve upon the current computational tools for tomato gene prediction by developing a comprehensive training set of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. TWINSCAN incorporates cross-species similarity, when available, into its probability model, and outperforms other gene finders on vertebrate genomic DNA sequence.
The training set will consist of EST-validated tomato gene models identified by assembling a large collection of public ESTs with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full-length-enriched cDNA pool, and aligning these assemblies to tomato genomic sequences produced by the International Tomato Sequencing project. We will use this training set to re-train TWINSCAN to provide an accurate tomato gene finder to support tomato genome annotation. Tomato-trained TWINSCAN will be freely available under the Open-Source software agreement.
Specific objectives:
1. Generate 200 Mb of tomato transcript sequence by sequencing a normalized, full-length-transcript-enriched tomato cDNA pool.
2. Use the sequence generated above to construct a comprehensive collection of well-annotated tomato genes that will serve as a substrate for the determination of a tomato gene probability model.
3. Utilize the annotated gene models collected to re-train the TWINSCAN probability model for tomato.
4. Use tomato-trained TWINSCAN to perform gene prediction on tomato genome sequence, and integrate this program into the tomato sequencing project annotation pipeline.
5. Select a subset of tomato hypothetical gene models predicted by TWINSCAN for verification.
6. Release the tomato-specific version of TWINSCAN to the scientific community.

Project Methods
The central aim of this project is to provide an accurate computational gene finder for tomato. We will train TWINSCAN for tomato to do this, which requires a collection of well-annotated tomato gene models of known structure (start codon, intron-exon boundaries, stop codon, etc.). Aligning full-length cDNAs (FLcDNAs) to genomic sequence provides the most unambiguous path to identifying these. However, there is little full-length tomato cDNA sequence available in the public sector. RNA from 5 tissues will be collected by Co-PI Giovannoni, and this will be used by Evrogen to construct a normalized cDNA pool enriched for FLcDNAs that will be subjected to 454 sequencing. A computational pipeline for identification of potentially full-length transcripts by aligning maize Sanger-sequence-derived ESTs, 454-ESTs, FLcDNAs and genomic DNA sequence has been developed during the course of NSF PGRP award DBI-0501758. The pipeline includes the Program to Assemble Spliced Alignments (PASA). PASA, a genome annotation tool developed at TIGR, can model complete and partial gene structures based on assembled spliced alignments. Potential full-length transcript sequence will be computationally derived by combining the 454 ESTs generated during this project with all available ESTs in the public domain, and assembling through the GMAP + PASA pipeline mentioned above using available tomato genomic sequence as alignment anchors. Modification of the TWINSCAN probability model: Given a genomic DNA sequence, the goal of gene prediction is to assign each position within the sequence to a functional state: intergenic, untranslated region (UTR), exon, intron, etc. The current TWINSCAN gene model incorporates several types of features including splice signal models, intron and exon length distributions, and promoter and polyadenylation signals.
Each of these fundamental elements is termed a state in the probability model, and transitions from state to state may occur in any biologically consistent order. Re-training TWINSCAN to predict tomato genes requires that the parameters of the gene model be re-estimated from the features of tomato genes. Initial probabilities of each state will be determined by their frequency in bulk tomato DNA; transition probabilities between states will likewise be proportional to the observed frequency of these transitions in the training set. Intron and exon length distributions will be modeled on the observed distributions in the training set, and the composition of coding and non-coding DNA will be empirically measured by frame-specific hexamer composition. Ab initio gene prediction by TWINSCAN will be performed on tomato genomic DNA sequences. The resulting predictions will be assessed for predicted positives, predicted negatives, actual positives, actual negatives, true positives and true negatives, and TWINSCAN sensitivity and specificity will be determined. Finally, validation of hypothetical gene predictions will be accomplished by TWINSCAN gene prediction followed by RT-PCR and direct sequencing.
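The parameter estimation and accuracy assessment described above can be illustrated with a minimal sketch. It is not TWINSCAN code and does not reproduce TWINSCAN's actual model. It assumes that each base of a training sequence carries a one-letter state label (N, U, E, I are hypothetical labels chosen here), and the function names are likewise invented for illustration. It shows three things: transition probabilities taken as observed transition frequencies, hexamer counts kept separately for each codon frame, and nucleotide-level accuracy. Because gene-finding literature often reports "specificity" as TP/(TP+FP), the sketch returns that value (named precision) alongside the true negative rate TN/(TN+FP).

```python
from collections import Counter, defaultdict

# Hypothetical per-base state labels, chosen for illustration only:
# N = intergenic, U = UTR, E = coding exon, I = intron.


def transition_probabilities(paths):
    """Estimate P(next state | current state) as observed frequencies."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    probs = {}
    for state, row in counts.items():
        total = sum(row.values())
        probs[state] = {nxt: n / total for nxt, n in row.items()}
    return probs


def frame_hexamer_counts(cds):
    """Count hexamers in a coding sequence, keyed by the codon position
    (0, 1 or 2) of the hexamer's first base."""
    counts = {0: Counter(), 1: Counter(), 2: Counter()}
    for i in range(len(cds) - 5):
        counts[i % 3][cds[i:i + 6]] += 1
    return counts


def _ratio(numerator, denominator):
    # Return None rather than dividing by zero when a class is empty.
    return numerator / denominator if denominator else None


def coding_accuracy(actual, predicted, coding="E"):
    """Compare per-base labels and score prediction of the coding state."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == coding and a == coding:
            tp += 1
        elif p == coding:
            fp += 1
        elif a == coding:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),         # TP / actual positives
        "precision": _ratio(tp, tp + fp),           # TP / predicted positives
        "true_negative_rate": _ratio(tn, tn + fp),  # TN / actual negatives
    }


if __name__ == "__main__":
    annotated = "NNEEIIEENN"  # toy annotation, not real tomato data
    predicted = "NEEEIIENNN"  # toy prediction
    print(transition_probabilities([annotated]))
    print(frame_hexamer_counts("ATGGCCATGGCC")[0])
    print(coding_accuracy(annotated, predicted))
```

A real gene finder would also model state durations explicitly and would smooth counts for rarely observed transitions and hexamers. Both are omitted here to keep the frequency-counting idea visible.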

Progress 09/01/07 to 08/31/08

Outputs
OUTPUTS: The central goal of this research is to provide a computer gene prediction program that has been optimized for identifying genes in tomato genome sequence. This project will improve upon the current computational tools for tomato gene prediction by developing a comprehensive "training set" of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. The training set will consist of tomato gene models identified by assembling the large collection of public ESTs with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full-length-enriched cDNA pool, and aligning these assemblies to available tomato genomic sequences. Tomato-trained TWINSCAN will be freely available to the scientific community under the Open-Source software agreement. The specific objectives are to 1) generate ~200 Mb of tomato transcript sequence by sequencing a normalized, full-length-transcript-enriched tomato cDNA pool, 2) construct a comprehensive collection of well-annotated tomato genes with the sequence generated, and 3) train TWINSCAN and use it to help annotate the tomato genome project sequence. To date, RNA has been isolated from 5 tissues: whole seedling, flowers, root, leaf and fruit. cDNA was synthesized from each RNA sample, pooled and then normalized. The cDNA was subjected to two 454-FLX runs at the UF genomics service center. Over 767,000 tomato transcript sequences (454 reads) were obtained, representing more than 175 Mbp. The pipeline for EST to genomic DNA alignments for gene model discovery has been implemented and tested.
PARTICIPANTS: W. Brad Barbazuk (PD) - Donald Danforth Plant Science Center. Michael R. Brent (Co-PD) - Computer Science, Washington University in St. Louis. Lukas Mueller (Co-PD) - Boyce Thompson Institute for Plant Research. James Giovannoni (Co-PD) - USDA-ARS and the Boyce Thompson Institute for Plant Research. Yan Fu (Post-doctoral Researcher) - Donald Danforth Plant Science Center. Chenhong Zhang (Computer technician) - Donald Danforth Plant Science Center. Daniel Schachtman (Collaborator) - Donald Danforth Plant Science Center.
TARGET AUDIENCES: Not relevant to this project.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
This research has generated a significant tomato transcript sequence database, and will generate an accurate gene finder for gene discovery in tomato genomic sequence. The transcript collection and the advanced gene finder will have a significant impact on tomato gene discovery, will improve annotation accuracy, and will provide a foundation for understanding gene organization and function in the tomato genome. Furthermore, it is likely that tomato-trained TWINSCAN will perform well as a gene finder in other Solanaceae species. To date, over 175 Mbp of short-read (200 bp average) tomato transcriptome sequence has been obtained. This sequence data has been provided to Lukas Mueller (curator of the SOL Genomics Network) and is being included in the most recent build of the tomato unigene set, which is currently under construction.

Publications

Yan Fu, Oliver Bannach, Hao Chen, Jan-Hendrik Teune, Axel Schmitz, Gerhard Steger, Liming Xiong, and W. Brad Barbazuk. 2009. Alternative Splicing of Anciently Exonized 5S rRNA Regulates Plant Transcription Factor TFIIIA. Genome Research - In Press.
W. Brad Barbazuk and Patrick S. Schnable. 2008. Whole genome SNP discovery by pyrosequencing. In Genome Profiling for Genetic Marker Discovery (Tong Zhu, Ed.). Humana Press Inc.
W. Brad Barbazuk, Yan Fu, and Karen M. McGinnis. 2008. Genome-wide analyses of alternative splicing in plants: Opportunities and Challenges. Genome Research Sep;18(9):1381-92.