Recipient Organization
DONALD DANFORTH PLANT SCIENCE CENTER
975 NORTH WARSON ROAD
ST. LOUIS, MO 63132
Performing Department
(N/A)
Non Technical Summary
A broad understanding of the genes present in tomato would provide the identities of many of the genes that control growth and development and ultimately influence agronomically important traits. In recognition of the scientific and agricultural significance of tomato, its genome is currently being sequenced by a consortium of 10 countries funded by their respective governments. However, the resulting sequence data alone will provide no knowledge of tomato gene content. Gene identification will rely on high-throughput computational tools to identify and predict the structure of the tomato genes. The purpose of this project is to improve upon the current computational tools for tomato gene prediction by developing a comprehensive "training set" of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene-finding program. The training set will consist of tomato gene models identified by assembling the large collection of public ESTs together with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full-length-enriched cDNA pool, and then aligning these assemblies to available tomato genomic sequences. Tomato-trained TWINSCAN will be freely available to the scientific community under the Open-Source software agreement.
Animal Health Component
(N/A)
Research Effort Categories
Basic
100%
Applied
(N/A)
Developmental
(N/A)
Goals / Objectives
The central goal of this research is to provide a computer gene prediction program that has been optimized for identifying genes in tomato genome sequence. The Solanaceae encompass diverse crop species including the tuber-bearing potato, a number of fruit-bearing vegetables, ornamental flowers, edible leaves, and medicinal plants. In addition, the Solanaceae are the primary source of vegetables in the U.S. and the world. In recognition of the scientific and agricultural significance of tomato, a consortium of 10 countries funded by their respective governments is currently sequencing its genome. A broad understanding of the genes present in tomato would provide the identities, and eventually the map positions, of many of the genes that control growth and development and ultimately influence agronomically important traits. The sequence data from the tomato genome sequencing project will consist of a substantial collection of large contiguous nucleotide segments about whose content little or nothing will be known a priori. Therefore, high-throughput computational tools that can identify and accurately predict the structure of the tomato genes will be necessary for annotating and understanding the tomato genome and relating it to other solanaceous plants and related taxa. We will improve upon the current computational tools for tomato gene prediction by developing a comprehensive training set of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene-finding program. TWINSCAN incorporates cross-species similarity, when available, into its probability model, and outperforms other gene finders on vertebrate genomic DNA sequence. The training set will consist of EST-validated tomato gene models identified by assembling a large collection of public ESTs together with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full-length-enriched cDNA pool, and then aligning these assemblies to tomato genomic sequences produced by the International Tomato Sequencing project. We will use this training set to re-train TWINSCAN to provide an accurate tomato gene finder to support tomato genome annotation. Tomato-trained TWINSCAN will be freely available under the Open-Source software agreement.
Specific objectives:
1. Generate 200 Mb of tomato transcript sequence by sequencing a normalized, full-length-transcript-enriched tomato cDNA pool.
2. Use the sequence generated above to construct a comprehensive collection of well-annotated tomato genes that will serve as a substrate for the determination of a tomato gene probability model.
3. Utilize the annotated gene models collected to re-train the TWINSCAN probability model for tomato.
4. Use tomato-trained TWINSCAN to perform gene prediction on tomato genome sequence, and integrate this program into the tomato sequencing project annotation pipeline.
5. Select a subset of tomato hypothetical gene models predicted by TWINSCAN for verification.
6. Release the tomato-specific version of TWINSCAN to the scientific community.
Project Methods
The central aim of this project is to provide an accurate computational gene finder for tomato. To do this we will train TWINSCAN for tomato, which requires a collection of well-annotated tomato gene models of known structure (start codon, intron-exon boundaries, stop codon, etc.). Aligning full-length cDNAs (FLcDNAs) to genomic sequence provides the most unambiguous path to identifying these. However, there is little full-length tomato cDNA sequence available in the public sector. RNA from five tissues will be collected by Co-PI Giovannoni, and this will be used by Evrogen to construct a normalized cDNA pool enriched for FLcDNAs that will be subjected to 454 sequencing. A computational pipeline for identifying potentially full-length transcripts by aligning maize Sanger-sequence-derived ESTs, 454 ESTs, FLcDNAs, and genomic DNA sequence was developed during the course of NSF PGRP award DBI-0501758. The pipeline includes the Program to Assemble Spliced Alignments (PASA). PASA is a genome annotation tool developed at TIGR that models complete and partial gene structures based on assembled spliced alignments. Potential full-length transcript sequences will be computationally derived by combining the 454 ESTs generated during this project with all available ESTs in the public domain and assembling them through the GMAP and PASA pipeline described above, using available tomato genomic sequence as alignment anchors.
Modification of the TWINSCAN probability model: Given a genomic DNA sequence, the goal of gene prediction is to assign each position within the sequence to a functional state: intergenic, untranslated region (UTR), exon, intron, etc. The current TWINSCAN gene model incorporates several types of features, including splice signal models, intron and exon length distributions, and promoter and polyadenylation signals. Each of these fundamental elements is termed a state in the probability model, and transitions from state to state may occur in any biologically consistent order. Re-training TWINSCAN to predict tomato genes requires the parameters of the gene model to be re-estimated from the features of tomato genes. Initial probabilities of each state will be determined by their frequency in bulk tomato DNA; transition probabilities between states will likewise be proportional to the observed frequency of these transitions in the training set. Intron and exon length distributions will be modeled on the observed distributions in the training set, and the composition of coding and non-coding DNA will be empirically measured by frame-specific hexamer composition.
Ab initio gene prediction by TWINSCAN will be performed on tomato genomic DNA sequences. The resulting predictions will be assessed by counting predicted positives, predicted negatives, actual positives, actual negatives, true positives, and true negatives, from which TWINSCAN sensitivity and specificity will be determined. Finally, a subset of the hypothetical genes predicted by TWINSCAN will be validated by RT-PCR and direct sequencing.
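The parameter estimation and evaluation steps described above can be illustrated with a minimal Python sketch. It is not TWINSCAN code and does not use TWINSCAN file formats. The function names, the state labels, and the toy inputs are all hypothetical. It assumes that training gene models are already reduced to per-position state labels, and that accuracy is scored at the nucleotide level with the convention common in gene-finder evaluation: sensitivity is true positives over actual positives, and specificity is true positives over predicted positives.

```python
from collections import Counter, defaultdict


def transition_probabilities(state_paths):
    """Estimate P(next state | current state) by relative frequency.

    state_paths: list of state-label sequences taken from annotated
    gene models (hypothetical labels such as "exon" or "intron").
    """
    counts = defaultdict(Counter)
    for path in state_paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


def frame_hexamer_counts(cds):
    """Count hexamers in a coding sequence, keyed by the codon
    position (0, 1, or 2) of each hexamer's first base."""
    counts = {0: Counter(), 1: Counter(), 2: Counter()}
    for i in range(len(cds) - 5):
        counts[i % 3][cds[i:i + 6]] += 1
    return counts


def sensitivity_specificity(actual, predicted):
    """Nucleotide-level accuracy from two sets of coding positions.

    Assumed convention: sensitivity = TP / actual positives,
    specificity = TP / predicted positives.
    """
    true_positives = len(actual & predicted)
    sensitivity = true_positives / len(actual) if actual else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy inputs for illustration only; these are not tomato data.
paths = [
    ["intergenic", "exon", "intron", "exon", "intergenic"],
    ["intergenic", "exon", "intergenic"],
]
probs = transition_probabilities(paths)
hexamers = frame_hexamer_counts("ATGAAATGA")
sn, sp = sensitivity_specificity(set(range(10, 20)), set(range(15, 20)))
```

A real training run would also need pseudocounts for transitions and hexamers that never occur in the training set, and would normally report accuracy at the exon and gene levels as well as per nucleotide.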