DEVELOPING AN ACCURATE COMPUTER PROGRAM TO IDENTIFY POTENTIAL GENES IN TOMATO GENOME SEQUENCE

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

NRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

0218208

Grant No.

2007-35300-19739

Cumulative Award Amt.

(N/A)

Proposal No.

2009-01463

Multistate No.

(N/A)

Project Start Date

Sep 1, 2008

Project End Date

Aug 31, 2012

Grant Year

2009

Program Code

[52.1]- Plant Genome

Recipient Organization
UNIVERSITY OF FLORIDA
G022 MCCARTY HALL
GAINESVILLE,FL 32611

Performing Department
Biology

Non Technical Summary
A broad understanding of the genes present in tomato would provide the identities of many of the genes responsible for controlling growth and development that ultimately influence agronomically important traits. In recognition of the scientific and agricultural significance of tomato, its genome is currently being sequenced by a consortium of 10 countries funded by their respective governments. However, the resultant sequence data itself will provide no knowledge of tomato gene content. Gene identification will rely on high throughput computational tools to identify and predict the structure of the tomato genes. The purpose of this project is to improve upon the current computational tools for tomato gene prediction by developing a comprehensive "training set" of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. The training set will consist of tomato gene models identified by assembling the large collection of public ESTs with 200MB of tomato EST sequence generated in this project by 454 sequencing of a normalized, full length enriched cDNA pool; and, aligning these assemblies to available tomato genomic sequences. Tomato trained TWINSCAN will be freely available to the scientific community under the Open-Source software agreement.

Animal Health Component

(N/A)

Research Effort Categories

Basic

100%

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	1460	1040	50%
201	1460	2080	50%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
1460 - Tomato;

Field Of Science
1040 - Molecular biology; 2080 - Mathematics and computer sciences;

Keywords

genomics

computational biology

transcript validation

gene models

training gene finders

Goals / Objectives
The central goal of this research is to provide a computer gene prediction program that has been optimized for identifying genes in tomato genome sequence. The Solanaceae encompass diverse crop species including the tuber-bearing potato, a number of fruit-bearing vegetables, ornamental flowers, edible leaves, and medicinals. In addition, the Solanaceae are the primary source of vegetables in the U.S. and the world. In recognition of the scientific and agricultural significance of tomato, a consortium of 10 countries funded by their respective governments is currently sequencing its genome. A broad understanding of the genes present in tomato would provide the identities, and eventually the map positions, of many of the genes responsible for controlling growth and development that ultimately influence agronomically important traits. The resultant sequence data from the tomato genome sequencing project will consist of a substantial collection of large contiguous nucleotide segments about which little or no knowledge of content will be available a priori. Therefore, high throughput computational tools that can identify and accurately predict the structure of the tomato genes will be absolutely necessary for annotating and understanding the tomato genome and relating it to other solanaceous plants and related taxa. We will improve upon the current computational tools for tomato gene prediction by developing a comprehensive training set of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. TWINSCAN incorporates cross-species similarity, when available, into its probability model, and outperforms other gene finders on vertebrate genomic DNA sequence.
The training set will consist of EST validated tomato gene models identified by assembling a large collection of public ESTs with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full length enriched cDNA pool, and aligning these assemblies to tomato genomic sequences produced by the International Tomato Sequencing project. We will use this training set to re-train TWINSCAN to provide an accurate tomato gene finder to support tomato genome annotation. Tomato trained TWINSCAN will be freely available under the Open-Source software agreement. Specific objectives:
1. Generate 200 Mb of tomato transcript sequence by sequencing a normalized, full length transcript enriched tomato cDNA pool.
2. Use the sequence generated above to construct a comprehensive collection of well annotated tomato genes that will serve as a substrate for the determination of a tomato gene probability model.
3. Utilize the annotated gene models collected to re-train the TWINSCAN probability model for tomato.
4. Use tomato trained TWINSCAN to perform gene prediction on tomato genome sequence, and integrate this program into the tomato sequencing project annotation pipeline.
5. Select a subset of tomato hypothetical gene models predicted by TWINSCAN for verification.
6. Release the tomato specific version of TWINSCAN to the scientific community.

Project Methods
The central aim of this project is to provide an accurate computational gene finder for tomato. We will train TWINSCAN for tomato to do this, and this requires a collection of well annotated tomato gene models of known structure (start codon, intron-exon boundaries, stop codon, etc.). Aligning full length cDNAs to genomic sequence provides the most unambiguous path to identifying these. However, there is little full length tomato cDNA sequence available in the public sector. RNA from 5 tissues will be collected by Co-PI Giovannoni, and this will be used by Evrogen to construct a normalized cDNA pool enriched for full length cDNAs (FLcDNAs) that will be subjected to 454 sequencing. A computational pipeline for identification of potentially full-length transcripts by aligning maize Sanger sequence derived ESTs, 454-ESTs, FLcDNAs and genomic DNA sequence has been developed during the course of NSF PGRP award DBI-0501758. The pipeline includes the Program to Assemble Spliced Alignments (PASA). PASA, a genome annotation tool developed at TIGR, is able to model complete and partial gene structures based on assembled spliced alignments. Potential full length transcript sequence will be computationally derived by combining the 454 ESTs generated during this project with all available ESTs in the public domain, and assembling through the GMAP + PASA pipeline mentioned above using available tomato genomic sequence as alignment anchors. Modification of the TWINSCAN probability model: Given a genomic DNA sequence, the goal of gene prediction is to assign each position within the sequence to a functional state: intergenic, untranslated regions (UTR), exon, intron, etc. The current TWINSCAN gene model incorporates several types of features including splice signal models, intron and exon length distributions, and promoter and polyadenylation signals.
Each of these fundamental elements is termed a state in the probability model, and transitions from state to state may occur in any biologically consistent order. To re-train TWINSCAN to predict tomato genes requires the parameters of the gene model to be re-evaluated based on the features of tomato genes. Initial probabilities of each state will be determined by their frequency in bulk tomato DNA; transition probabilities between states will likewise be proportional to the observed frequency of these transitions in the training set. Intron and exon length distributions will be modeled on the observed distributions in the training set, and the composition of coding and non-coding DNA will be empirically measured by frame-specific hexamer composition. Ab initio gene prediction by TWINSCAN will be performed on tomato genomic DNA sequences. The resultant predictions will be assessed for predicted positives, predicted negatives, actual positives, actual negatives, true positives and true negatives and TWINSCAN sensitivity and specificity will be determined. Finally, validation of hypothetical gene predictions will be accomplished by TWINSCAN gene prediction followed by RT-PCR and direct sequencing.
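
The assessment step described above can be illustrated with a minimal Python sketch. It assumes the convention common in the gene finding literature, in which sensitivity is the fraction of annotated genes predicted exactly and "specificity" is the fraction of predictions that are exactly correct (TP / (TP + FP)); the report does not state which definition was used, so this is an assumption. The function name and the representation of a gene as a tuple of exon coordinates are hypothetical and are not part of TWINSCAN.

```python
from typing import Iterable, Tuple

# Hypothetical representation: a gene model is a tuple of (start, end) exon coordinates.
Gene = Tuple[Tuple[int, int], ...]


def gene_level_accuracy(predicted: Iterable[Gene],
                        annotated: Iterable[Gene]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the whole-gene level.

    A prediction counts as a true positive only if every exon boundary
    matches an annotated gene exactly (an assumed, strict criterion).
    """
    predicted_set = set(predicted)
    annotated_set = set(annotated)
    true_positives = len(predicted_set & annotated_set)
    # Sensitivity: share of annotated genes that were predicted exactly.
    sensitivity = true_positives / len(annotated_set) if annotated_set else 0.0
    # Specificity (gene finding convention): share of predictions that are correct.
    specificity = true_positives / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Toy data: four annotated genes, three predictions, two of them exact matches.
annotated = [((1, 100), (200, 300)), ((500, 900),), ((1000, 1100), (1200, 1400)), ((2000, 2500),)]
predicted = [((1, 100), (200, 300)), ((500, 900),), ((1000, 1150), (1200, 1400))]
sn, sp = gene_level_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2%} specificity={sp:.2%}")
```

In this toy case two of four annotated genes are recovered and two of three predictions are correct; the same calculation can be repeated at the exon or nucleotide level by changing what is compared.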

Progress 09/01/08 to 08/31/12

Outputs
OUTPUTS: This project improved upon the current computational tools for tomato gene prediction by optimizing the probability model underlying the TWINSCAN gene finding program. We also optimized TWINSCAN to make it effective at predicting genes in potato genome sequence as well. Full length enriched cDNA was synthesized from RNA samples (whole seedling, flowers, root, leaf and fruit). The cDNA was pooled, normalized, and sequenced on the Roche 454-FLX platform to obtain over 767,000 reads (>175Mb) of cDNA sequence. This cDNA sequence collection was combined with 223,441 Sanger sequenced tomato ESTs, and computational pipelines were developed to align the cDNA sequence collection to available tomato genome sequences (initially collections of sequenced BAC clones but ultimately the tomato whole genome shotgun assembly) to identify potential gene models and to filter out unsuitable assemblies (short assemblies, partial genes, redundancy), producing 3342 likely full-length transcripts for the training set. Training and testing were accomplished using fourfold cross validation facilitated by custom PERL scripts. Performance was assessed by measuring prediction sensitivity and specificity. In addition, the popular ab initio gene finder FGENESH was simultaneously compared. Gene predictions were made by running TWINSCAN_EST version 4.1.2 on assembly release 2.1 of the tomato genome sequence using the TAIR9 Arabidopsis genome assembly as an informant sequence and 239,564 Sanger sequenced tomato ESTs as the EST database. 109,644 predicted exons representing 34,600 complete predicted genes were predicted by TWINSCAN EST and provided to the ITGP. Generation of potato gene models for training was done as above, with the following changes. 206,565 available potato ESTs were aligned to the 12 potato pseudomolecules. Analysis produced 1555 gene models for training. 
Gene predictions were made by running TWINSCAN version 4.1.2 on the 12 pseudomolecules for potato using the same informant sequence as above. The EST database for estseq generation was a collection of 206,565 Sanger sequenced potato ESTs. 130,669 predicted exons representing 43,589 complete predicted genes were predicted by TWINSCAN EST and provided to the ITGP. TWINSCAN EST gene finding accuracy is: Tomato Gene Sensitivity: FGENESH 39.07%; TWINSCAN EST 79.56%. Tomato Gene Specificity: FGENESH 22.92%; TWINSCAN EST 52.13%. Potato Gene Sensitivity: FGENESH 39.42%; TWINSCAN EST 66.17%. Potato Gene Specificity: FGENESH 24.35%; TWINSCAN EST 45.23%. All tomato and potato TWINSCAN models were provided to the International tomato genome sequencing consortium. TWINSCAN gene prediction accuracy was also assessed empirically using a sequence based approach. We used a mixed set of 327 million single and paired-end (PE) 50bp Illumina RNA-Seq reads to confirm splice junctions predicted by TWINSCAN. TWINSCAN predicted a total of 34,600 tomato transcripts that overlap RNA-Seq derived CUFFLINKS transcripts. These transcripts contain 240,521 junctions. 22,376 (65%) of these transcripts overlap with RNA-Seq derived junctions, and junction validation is >50%, which is consistent with our expectation based on validated training. PARTICIPANTS: W. Brad Barbazuk (University of Florida). Principal Investigator Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis). Co-PI Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute). Co-PI Giovannoni provides RNA from several mixed tissues for transcriptome sequencing, as well as access to some collections of RNA-Seq for validation. Lukas Mueller (Boyce Thompson Institute). 
Co-PI Mueller provides a portal for the sequence data to be disseminated to the community, provides BAC and EST datasets for training set construction, and ultimately will feed TWINSCAN and TWINSCAN EST predictions into their tomato genome annotation. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, implemented and ran the pipeline for identifying gene models for training. Chamala also created the potato gene training set, and trained TWINSCAN with it. Ruth Davenport MSc, University of Florida, Computer technician, wet-bench scientist. Davenport used trained TWINSCAN to annotate tomato sequence and assisted in testing and validation. Mike Sandford MSc Eng. Computer technician. Sandford maintained computer infrastructure and software for this project. Brandon Walts MSc Computer technician. Walts helped maintain computer infrastructure and software for this project, and used trained TWINSCAN to annotate tomato and potato sequence and assisted in testing and validation. TARGET AUDIENCES: Tomato and potato trained TWINSCAN and TWINSCAN EST, as well as the tomato and potato predicted gene sets provided to the International project, have contributed to the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to all academic and industry researchers in the Solanaceae. PROJECT MODIFICATIONS: Not relevant to this project.
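
The RNA-Seq based splice junction check reported in the outputs above can be illustrated with a minimal Python sketch. It assumes that a predicted junction counts as confirmed only when an RNA-Seq derived junction has exactly the same sequence name, donor position, acceptor position and strand; the report does not describe the matching rule or the tools used for this comparison, so the function, the data layout and the toy coordinates are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical junction key: (sequence name, donor position, acceptor position, strand).
Junction = Tuple[str, int, int, str]


def junction_support(predicted: Dict[str, List[Junction]],
                     rnaseq_junctions: Iterable[Junction]) -> Tuple[float, float]:
    """Return (fraction of predicted junctions confirmed,
               fraction of transcripts with at least one confirmed junction)."""
    observed = set(rnaseq_junctions)
    all_predicted = set()
    supported_transcripts = 0
    for junctions in predicted.values():
        all_predicted.update(junctions)
        # A transcript is counted as supported if any of its junctions is observed.
        if any(j in observed for j in junctions):
            supported_transcripts += 1
    confirmed = all_predicted & observed
    junction_fraction = len(confirmed) / len(all_predicted) if all_predicted else 0.0
    transcript_fraction = supported_transcripts / len(predicted) if predicted else 0.0
    return junction_fraction, transcript_fraction


# Toy data: two predicted transcripts with three junctions, one of them seen in RNA-Seq.
predicted = {
    "t1": [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")],
    "t2": [("chr1", 900, 1000, "-")],
}
rnaseq = [("chr1", 100, 200, "+"), ("chr1", 500, 600, "+")]
print(junction_support(predicted, rnaseq))
```

Exact coordinate matching is a deliberately strict choice; a tolerance of a few bases around each splice site is a common relaxation.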

Impacts
This research project has had four main impacts. 1.) It has generated a significant tomato transcript collection. Greater than 175 Mb of sequence reads from full length enriched cDNA were produced from a comprehensive collection of tomato tissues. The collection is represented by greater than 750 thousand reads with an average length of 250bp. This is a large collection that was placed into the public domain, and had significant impact in annotation and gene discovery during the tomato genome sequencing project. These transcripts have been assembled by Lukas Mueller into a larger collection of tomato unigenes available to the public as part of the tomato genome project. The resultant assemblies further increased accuracy of gene finding during the annotation phase of the international effort. 2.) This project has defined gene sets in tomato and potato that were used to train the TWINSCAN gene finding program to be effective in tomato and potato gene finding. In addition, the TWINSCAN training set for potato was used by the international tomato genome sequencing project team to re-train their tomato gene finding pipeline for potato. 3.) This project has delivered an accurate gene finder for gene discovery which has been specifically trained to predict genes in both tomato and potato genomic sequence. The tomato and potato trained versions of TWINSCAN and TWINSCAN EST are much more accurate than FGENESH, which is the most commonly used gene prediction program for plants. The accuracy of tomato and potato trained TWINSCAN and TWINSCAN EST has had a significant impact on improving annotation accuracy for the public tomato genome sequence project. Ultimately, gene prediction in the potato genome enabled comparative genomics analysis between potato and tomato to be conducted by the international tomato sequencing project. 4.) 
Gene models predicted with tomato and potato trained TWINSCAN and TWINSCAN EST within the tomato and potato whole genome sequence assemblies were predicted by this project and shared with the International Tomato Genome Sequence consortium, and were used in improving the final annotations of these public projects. Therefore, tomato and potato trained TWINSCAN and TWINSCAN EST, as well as the tomato and potato predicted gene sets provided to the International project, have impacted the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to academic and industry researchers in the Solanaceae.

Publications

Barbazuk, W.B., Fu, Y. and McGinnis, K.M. (2008) Genome-wide analyses of alternative splicing in plants: Opportunities and Challenges. Genome Research, 18:1381-92.
Fu, Y., Bannach, O., Chen, H., Teune, JA., Schmitz, A., Steger G., Xiong, L., and W. Brad Barbazuk (2009) Alternative Splicing of Anciently Exonized 5S rRNA Regulates Plant Transcription Factor TFIIIA. Genome Research 19:913-21
Barbazuk, W.B. and Schnable, P.S. (2011) SNP discovery by transcriptome pyrosequencing. In cDNA libraries: methods and applications. (Chaofu Lu Ed.) Humana Press
Tomato Genome Consortium. (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 485:635-41.
Ruzicka, D., Chamala, S., Barrios-Masias F. H., Martin, F., Smith, S., Jackson, L. E., Barbazuk, W. B. and Schachtman D. P. (2012) Inside arbuscular mycorrhizal roots - Molecular probes to understand the symbiosis. The Plant Genome (In Press)
Barbazuk WB. (2010) A conserved alternative splicing event in plants reveals an ancient exonization of 5S rRNA that regulates TFIIIA. RNA Biol. 7:397-402.

Progress 09/01/10 to 08/31/11

Outputs
OUTPUTS: We have developed potato and tomato trained versions of TWINSCAN and TWINSCAN_EST and have provided training sets and predicted genes to the International Tomato Genome Project (ITGP). Over 675,000 454 ESTs and 223,441 publicly available Sanger sequenced tomato ESTs were aligned to the tomato WGS assembly (Ver 1.0.3) to identify putative full length gene models. Initial analysis produced 58,314 assemblies that reduced down to 3342 after filtering out redundancy, short assemblies and those not predicted to encode full length proteins. Genomic segments for each of the 3342 predicted models flanked with up to 2Kb of sequence were cut out of the tomato draft genome assembly and served as a training set. This collection represents >5X increase in our training set size and is a comprehensive representation of gene structures and nucleotide usage in tomato. Of the 3342 assemblies composing the training set, 441 represented single exon genes, while the remaining 2901 genes contain one or more introns (av. 5.86 exons/gene). Partial summary statistics of the training set are: Average transcript length: 1,515bp; Average coding length: 1,057bp; Total length of transcript assembly training sequence: 5,063,031bp; Total length of coding sequence: 3,532,134bp. Training and testing was accomplished using the fourfold cross validation described in Brown et al. Genome Res. 2005 15:742-7. Custom PERL scripts were written to help automate the process. Performance was assessed in terms of the accuracy of predictions and gave rise to measurements of sensitivity and specificity. In addition, the popular ab initio gene finder FGENESH was simultaneously compared against each training and testing slice. Gene predictions were made by running TWINSCAN version 4.1.2 on assembly release 2.1 of the tomato genome sequence using the TAIR9 Arabidopsis genome assembly as an informant sequence. The EST database for estseq generation was a collection of 239,564 Sanger sequenced tomato ESTs. 
109,644 predicted exons representing 34,600 complete predicted genes were predicted by TWINSCAN_EST and provided to the ITGP. Generation of potato gene models for training followed the same procedure as generation of tomato gene models above with the following changes. PASA was used to align 206,565 potato ESTs to the 12 potato pseudomolecules. Analysis produced 20,251 assemblies that reduced down to 1555 for training (see above). Gene predictions were made by running TWINSCAN version 4.1.2 on the 12 pseudomolecules for potato using the same informant sequence as above. The EST database for estseq generation was a collection of 206,565 Sanger sequenced potato ESTs. 130,669 predicted exons representing 43,589 complete predicted genes were predicted by TWINSCAN_EST and provided to the ITGP. TWINSCAN_EST gene finding accuracy is: Tomato Gene Sensitivity: FGENESH 39.07%; TWINSCAN_EST 79.56%. Tomato Gene Specificity: FGENESH 22.92%; TWINSCAN_EST 52.13%. Potato Gene Sensitivity: FGENESH 39.42%; TWINSCAN_EST 66.17%. Potato Gene Specificity: FGENESH 24.35%; TWINSCAN_EST 45.23%. PARTICIPANTS: W. Brad Barbazuk (University of Florida), Principal Investigator. Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis), Co-PI. Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute), Co-PI. Giovannoni provides RNA from several mixed tissues for transcriptome sequencing. 
Lukas Mueller (Boyce Thompson Institute), Co-PI. Mueller provides a portal for the sequence data to be disseminated to the community, provides BAC and EST datasets for training set construction, and ultimately will feed TWINSCAN and TWINSCAN_EST predictions into their tomato genome annotation. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, implemented and ran the pipeline for identifying gene models for training. Ruth Davenport, University of Florida, used trained TWINSCAN to annotate tomato sequence and assisted in testing. Mike Sandford MSc Eng., Computer technician. Sandford maintained computer infrastructure and software for this project. Brandon Walts MSc, Computer technician. Walts helped maintain computer infrastructure and software for this project. TARGET AUDIENCES: Tomato and potato trained TWINSCAN and TWINSCAN_EST, as well as the tomato and potato predicted gene sets provided to the International project, have contributed to the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to all academic and industry researchers in the Solanaceae. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
This research has generated a significant tomato transcript sequence database. These transcripts have been assembled by Lukas Mueller into a larger collection of tomato unigenes available to the public as part of the tomato genome project. These sequences have increased accuracy of gene finding during the annotation phase of the international effort. In addition, this project has delivered an accurate gene finder for gene discovery trained to predict genes in both tomato and potato genomic sequence, as well as a potato gene model training set that was used by the international project to train their gene finders for potato to aid comparative genomics analysis. The tomato and potato trained versions of TWINSCAN and TWINSCAN_EST are much more accurate than FGENESH, which is the most commonly used gene prediction program for plants. The accuracy of tomato and potato trained TWINSCAN and TWINSCAN_EST has had a significant impact on improving annotation accuracy for the public tomato genome sequence project. Gene models predicted with tomato and potato trained TWINSCAN and TWINSCAN_EST within the tomato and potato whole genome sequence assemblies were predicted by this project and shared with the International Tomato Genome Sequence consortium, and were used in improving the final annotations of these public projects. Tomato and potato trained TWINSCAN and TWINSCAN_EST, as well as the tomato and potato predicted gene sets provided to the International project, have contributed to the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to academic and industry researchers in the Solanaceae.

Publications

The Genomes that make Tomatoes. The International Tomato Sequencing project. Letter to Nature, submitted 2011

Progress 09/01/09 to 08/31/10

Outputs
OUTPUTS: The aim of this research project is to optimize the ab initio gene prediction software packages TWINSCAN and TWINSCAN_EST for identifying genes in tomato whole genome sequence. Optimization requires the production of a training set. Training set construction involves identifying putative full length tomato gene models from sequence assemblies provided by combining the large collection of public ESTs with ~200MB of tomato EST sequence generated on the Roche 454 sequencing platform during the first year of this project, and performing a reference based assembly of this against tomato genome sequence. Alignments were made to 856 BAC clones and generated 13,373 putative assemblies. Filtering out short assemblies, partial genes and redundancy, and screening for assemblies that likely represent full-length transcripts, resulted in a final collection of 945 putative full length tomato transcripts. The genomic sequence representing the full gene models associated with each transcript was obtained from the BAC sequences that were used in their reference-based assembly (exons and introns + 2Kb sequence up and downstream), and this genomic sequence was used to train the TWINSCAN and TWINSCAN_EST gene prediction programs. Of the 945 putative assemblies for training, 110 represented single exon genes. Partial summary statistics of the training set are: Average transcript length: 1,496bp; Average coding length: 1,043bp; Total length of transcript assembly training sequence: 1,414,346bp; Total length of coding sequence: 985,929bp; Total number of exons: 5625; Average number of exons: 5.95 per gene; Average exon length: 175bp. Based on these metrics and the non-redundant nature of the predicted proteins, we expected that this should serve as a reasonably good representation of tomato gene complexity, and thus would act as a good training set. Training and testing was accomplished using the fourfold cross validation described in Brown et al. Genome Res. 2005 15:742-7. 
Custom PERL scripts were written to help automate the process. Performance was assessed in terms of the accuracy of predictions and gave rise to measurements of sensitivity and specificity. In addition, the popular ab initio gene finder FGENESH was simultaneously compared against each training and testing slice. Performance is as follows:
	FGENESH	TWINSCAN	TWINSCAN_EST
Gene Sensitivity	39.07%	28.76%	56.69%
Gene Specificity	22.92%	15.90%	31.70%
At this point TWINSCAN native does not perform as well as FGENESH utilizing a Dicot parameter set. However, the results are significantly improved when adding ESTs and using TWINSCAN_EST. We anticipate that future improvements in the depth and complexity of our training set will result in improved performance of both TWINSCAN and TWINSCAN_EST. Currently we are exploring the use of the tomato shotgun assembly produced by the International Tomato Genome Sequencing Project. Our expectation is that the larger portion of the genome covered by this resource in comparison to the relatively small number of completed BACs available (856) when the training set was constructed will enable discovery of a larger set of complete gene models for future iterations of training and testing. PARTICIPANTS: W. Brad Barbazuk (University of Florida), Principal Investigator. Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis), Co-PI. Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute), Co-PI. Giovannoni provides RNA from several mixed tissues for transcriptome sequencing. 
Lukas Mueller (Boyce Thompson Institute), Co-PI. Mueller provides a portal for the sequence data to be disseminated to the community, provides BAC and EST datasets for training set construction, and ultimately will feed TWINSCAN and TWINSCAN_EST predictions into their tomato genome annotation. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, implemented and ran the pipeline for identifying gene models for training. Ruth Davenport, University of Florida, used trained TWINSCAN to annotate tomato sequence and assisted in testing. Mike Sandford MSc Eng., Computer technician. Sandford maintained computer infrastructure and software for this project. Brandon Walts MSc, Computer technician. Walts helped maintain computer infrastructure and software for this project. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
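
The fourfold cross validation used for training and testing in the outputs above can be illustrated with a minimal, generic Python sketch. It shows only the partitioning idea: the training set is divided into four slices, and each slice is held out once for testing while the other three are used for training. The procedure in Brown et al. and the project's PERL scripts may differ in detail; the function name, the random seed and the gene identifiers are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def fourfold_splits(gene_ids: Sequence[str], k: int = 4,
                    seed: int = 0) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training_ids, testing_ids) pairs for k-fold cross validation."""
    ids = list(gene_ids)
    # Shuffle with a fixed seed so the slices are reproducible.
    random.Random(seed).shuffle(ids)
    # Deal the shuffled identifiers into k slices of near-equal size.
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        testing = folds[i]
        training = [g for j, fold in enumerate(folds) if j != i for g in fold]
        yield training, testing


# Toy data: 945 hypothetical gene model identifiers, matching the training set size above.
genes = [f"gene_{n:04d}" for n in range(945)]
for round_number, (training, testing) in enumerate(fourfold_splits(genes), start=1):
    print(round_number, len(training), len(testing))
```

Each gene model is tested exactly once and never appears in the training slice of the round in which it is tested, so accuracy averaged over the four rounds is not inflated by testing on training data.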

Impacts
This research has generated a significant tomato transcript sequence database, and will generate an accurate gene finder for gene discovery in tomato genomic sequence. These transcripts have been assembled by Lukas Mueller into a larger collection of tomato unigenes available to the public as part of the tomato genome project. These sequences will increase accuracy of gene finding during the annotation phase of the international effort. The transcript collection and the advanced gene finder will have a significant impact on tomato gene discovery, will improve accuracy, and will provide a foundation for understanding gene organization and function in the tomato genome. Furthermore, it is likely that tomato trained TWINSCAN and TWINSCAN_EST will perform well as a gene finder in other Solanaceae species.

Publications

Barbazuk WB. A conserved alternative splicing event in plants reveals an ancient exonization of 5S rRNA that regulates TFIIIA. RNA Biol. 2010 7:397-402.

Progress 09/01/08 to 08/31/09

Outputs
OUTPUTS: The central goal of this research is to provide a computer gene prediction program that has been optimized for identifying genes in tomato genome sequence. This project will improve upon the current computational tools for tomato gene prediction by developing a comprehensive "training set" of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. The training set will consist of tomato gene models identified by assembling the large collection of public ESTs with 200MB of tomato EST sequence generated in this project by 454 sequencing of a normalized, full length enriched cDNA pool, and aligning these assemblies to available tomato genomic sequences. Tomato trained TWINSCAN will be freely available to the scientific community under the Open-Source software agreement. The specific objectives are to 1) generate ~200Mb of tomato transcript sequence by sequencing a normalized, full length transcript enriched tomato cDNA pool, 2) construct a comprehensive collection of well annotated tomato genes with the sequence generated, and 3) train TWINSCAN and use it to help annotate the tomato genome project sequence. To date, RNA has been isolated from 5 tissues: whole seedling, flowers, root, leaf and fruit. Full length enriched cDNA was synthesized from each RNA sample, pooled and then normalized. The cDNA was subjected to two 454-FLX runs at the UF genomics service center. Directional tags were added during cDNA synthesis to enable recovery of 454 sequences that are either 3' or 5'. Over 767,000 454 tomato transcript sequences were obtained representing greater than 175 Mbp. These ESTs were combined with 223,441 available ESTs in GenBank, and the whole EST collection was aligned to 735 finished BAC sequences with PASA to define gene models. 
The resulting assemblies were screened for the presence of 454 sequences with directional tags (signifying potential 5' or 3' ends of the original FL enriched cDNA), clustered to remove redundancy, and BLAST aligned to potential full length protein sequences to identify potential structurally complete tomato genes. Currently we have ~1300 potential full length tomato gene models that we identified during this first iteration of the process. These are being confirmed as full length and are being used to train the first iteration of tomato-TWINSCAN. PARTICIPANTS: W. Brad Barbazuk (University of Florida), Principal Investigator. Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis), Co-PI. Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute), Co-PI. Giovannoni provides RNA from several mixed tissues for transcriptome sequencing. Lukas Mueller (Boyce Thompson Institute), Co-PI. Mueller provides a portal for the sequence data to be disseminated to the community. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, implemented and ran the pipeline for identifying gene models for training. Mike Sandford MSc Eng., Computer technician. Sandford maintained computer infrastructure and software for this project. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
This research has generated a significant tomato transcript sequence database, and will generate an accurate gene finder for gene discovery in tomato genomic sequence. The transcript collection and the advanced gene finder will have a significant impact on tomato gene discovery, will improve accuracy, and will provide a foundation for understanding gene organization and function in the tomato genome. Furthermore, it is likely that tomato trained TWINSCAN will perform well as a gene finder in other Solanaceae species.

Publications

Yan Fu, Oliver Bannach, Hao Chen, Jan-Hendrik Teune, Axel Schmitz, Gerhard Steger, Liming Xiong, and W. Brad Barbazuk . Alternative Splicing of Anciently Exonized 5S rRNA Regulates Plant Transcription Factor TFIIIA. 2009; Genome Res. 19:913-921
W. Brad Barbazuk & Patrick S. Schnable 2008. Pyrosequencing the transcriptome for gene annotation and SNP discovery. In cDNA libraries: Methods and Applications. (Chaofu Lu Ed.) Humana Press INC.
W. Brad Barbazuk, Yan Fu, Karen M. McGinnis. Genome-wide analyses of alternative splicing in plants: Opportunities and Challenges. Genome Research 2008 Sep;18(9):1381-92.