Progress 09/01/08 to 08/31/12
Outputs OUTPUTS: This project improved upon the current computational tools for tomato gene prediction by optimizing the probability model underlying the TWINSCAN gene finding program. We also optimized TWINSCAN to make it effective at predicting genes in potato genome sequence. Full length enriched cDNA was synthesized from RNA samples (whole seedling, flowers, root, leaf and fruit). The cDNA was pooled, normalized, and sequenced on the Roche 454-FLX platform to obtain over 767,000 reads (>175 Mb) of cDNA sequence. This cDNA sequence collection was combined with 223,441 Sanger-sequenced tomato ESTs, and computational pipelines were developed to align the combined collection to available tomato genome sequences (initially collections of sequenced BAC clones, but ultimately the tomato whole genome shotgun assembly), identify potential gene models, and filter out contaminants (short assemblies, partial genes, redundancy), producing 3342 likely full-length transcripts for the training set. Training and testing were accomplished using fourfold cross validation facilitated by custom PERL scripts. Performance was assessed by measuring prediction sensitivity and specificity. In addition, the popular ab initio gene finder FGENESH was evaluated on the same data for comparison. Gene predictions were made by running TWINSCAN_EST version 4.1.2 on assembly release 2.1 of the tomato genome sequence using the TAIR9 Arabidopsis genome assembly as an informant sequence and 239,564 Sanger-sequenced tomato ESTs as the EST database. TWINSCAN_EST predicted 109,644 exons representing 34,600 complete genes, which were provided to the International Tomato Genome Project (ITGP). Potato gene models for training were generated as above, with the following changes: 206,565 available potato ESTs were aligned to the 12 potato pseudomolecules. Analysis produced 1555 gene models for training. 
Gene predictions were made by running TWINSCAN version 4.1.2 on the 12 pseudomolecules for potato using the same informant sequence as above. The EST database for estseq generation was a collection of 206,565 Sanger-sequenced potato ESTs. TWINSCAN_EST predicted 130,669 exons representing 43,589 complete genes, which were provided to the ITGP. TWINSCAN_EST gene finding accuracy is: Tomato Gene Sensitivity: FGENESH 39.07%, TWINSCAN_EST 79.56%; Tomato Gene Specificity: FGENESH 22.92%, TWINSCAN_EST 52.13%; Potato Gene Sensitivity: FGENESH 39.42%, TWINSCAN_EST 66.17%; Potato Gene Specificity: FGENESH 24.35%, TWINSCAN_EST 45.23%. All tomato and potato TWINSCAN models were provided to the International Tomato Genome Sequencing Consortium. TWINSCAN gene prediction accuracy was also assessed empirically using a sequence-based approach. We used a mixed set of 327 million single and paired-end (PE) 50 bp Illumina RNA-Seq reads to confirm splice junctions predicted by TWINSCAN. TWINSCAN predicted a total of 34,600 tomato transcripts that overlap RNA-Seq derived CUFFLINKS transcripts. These transcripts contain 240,521 junctions. 22,376 (65%) of these transcripts overlap with RNA-Seq derived junctions, and junction validation is >50%, which is consistent with our expectation based on validated training. PARTICIPANTS: W. Brad Barbazuk (University of Florida). Principal Investigator Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis). Co-PI Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute). Co-PI Giovannoni provides RNA from several mixed tissues for transcriptome sequencing, as well as access to some collections of RNA-Seq for validation. Lukas Mueller (Boyce Thompson Institute). 
Co-PI Mueller provides a portal for the sequence data to be disseminated to the community, provides BAC and EST datasets for training set construction, and ultimately will feed TWINSCAN and TWINSCAN EST predictions into their tomato genome annotation. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, implemented and ran the pipeline for identifying gene models for training. Chamala also created the potato gene training set, and trained TWINSCAN with it. Ruth Davenport MSc, University of Florida, Computer technician, wet-bench scientist. Davenport used trained TWINSCAN to annotate tomato sequence and assisted in testing and validation. Mike Sandford MSc Eng. Computer technician. Sandford maintained computer infrastructure and software for this project. Brandon Walts MSc Computer technician. Walts helped maintain computer infrastructure and software for this project, and used trained TWINSCAN to annotate tomato and potato sequence and assisted in testing and validation. TARGET AUDIENCES: Tomato and potato trained TWINSCAN and TWINSCAN EST, as well as the tomato and potato predicted gene sets provided to the International project, have contributed to the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to all academic and industry researchers in the Solanaceae. PROJECT MODIFICATIONS: Not relevant to this project.
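The gene-level sensitivity and specificity figures reported in the Outputs above can be illustrated with a minimal sketch. It assumes the definitions commonly used when evaluating gene finders: sensitivity is the fraction of annotated genes whose structure is predicted exactly, and specificity is the fraction of predicted genes that exactly match an annotated gene. The report does not state its definitions, so the exact-match rule, the function name and the toy coordinates below are all illustrative assumptions rather than the project's PERL scripts.

```python
def gene_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for exact gene-structure matches.

    Each gene is a hashable description of its structure, here a tuple of
    (start, end) exon coordinates. The representation and the exact-match
    rule are assumptions made for illustration only.
    """
    annotated = set(annotated)
    predicted = set(predicted)
    correct = annotated & predicted  # genes predicted with identical exons
    sensitivity = len(correct) / len(annotated) if annotated else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data (hypothetical coordinates): 4 annotated genes, 5 predictions,
# 2 of which reproduce an annotated gene exactly.
annotated = [
    ((100, 200), (300, 400)),
    ((1000, 1500),),
    ((2000, 2100), (2200, 2300)),
    ((5000, 5200),),
]
predicted = [
    ((100, 200), (300, 400)),
    ((1000, 1500),),
    ((2000, 2100), (2250, 2300)),  # one exon boundary is off, so no match
    ((6000, 6100),),
    ((7000, 7100),),
]

sn, sp = gene_level_accuracy(annotated, predicted)
print(f"gene sensitivity {sn:.2%}, gene specificity {sp:.2%}")
```

Under this exact-match convention a single misplaced exon boundary makes the whole gene count as wrong, which is one reason gene-level figures are typically much lower than exon-level or nucleotide-level figures.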
Impacts This research project has had four main impacts. 1.) It has generated a significant tomato transcript collection. Greater than 175 Mb of sequence reads from full length enriched cDNA were produced from a comprehensive collection of tomato tissues. The collection is represented by greater than 750,000 reads with an average length of approximately 250 bp. This is a large collection that was placed into the public domain, and it had a significant impact on annotation and gene discovery during the tomato genome sequencing project. These transcripts have been assembled by Lukas Mueller into a larger collection of tomato unigenes available to the public as part of the tomato genome project. The resultant assemblies further increased the accuracy of gene finding during the annotation phase of the international effort. 2.) This project has defined gene sets in tomato and potato that were used to train the TWINSCAN gene finding program to be effective in tomato and potato gene finding. In addition, the TWINSCAN training set for potato was used by the international tomato genome sequencing project team to re-train their tomato gene finding pipeline for potato. 3.) This project has delivered an accurate gene finder for gene discovery which has been specifically trained to predict genes in both tomato and potato genomic sequence. The tomato and potato trained versions of TWINSCAN and TWINSCAN_EST are much more accurate than FGENESH, which is the most commonly used gene prediction program for plants. The accuracy of tomato and potato trained TWINSCAN and TWINSCAN_EST has had a significant impact on improving annotation accuracy for the public tomato genome sequence project. Ultimately, gene prediction in the potato genome enabled comparative genomics analysis between potato and tomato to be conducted by the international tomato sequencing project. 4.) 
Gene models were predicted by this project within the tomato and potato whole genome sequence assemblies using tomato and potato trained TWINSCAN and TWINSCAN_EST, shared with the International Tomato Genome Sequence consortium, and used in improving the final annotations of these public projects. Therefore, tomato and potato trained TWINSCAN and TWINSCAN_EST, as well as the tomato and potato predicted gene sets provided to the International project, have impacted the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to academic and industry researchers in the Solanaceae.
Publications
- Barbazuk, W.B., Fu, Y. and McGinnis, K.M. (2008) Genome-wide analyses of alternative splicing in plants: Opportunities and Challenges. Genome Research, 18:1381-92.
- Fu, Y., Bannach, O., Chen, H., Teune, JA., Schmitz, A., Steger G., Xiong, L., and W. Brad Barbazuk (2009) Alternative Splicing of Anciently Exonized 5S rRNA Regulates Plant Transcription Factor TFIIIA. Genome Research 19:913-21
- Barbazuk, W.B. and Schnable, P.S. (2011) SNP discovery by transcriptome pyrosequencing. In cDNA libraries: methods and applications. (Chaofu Lu Ed.) Humana Press
- Tomato Genome Consortium. (2012) The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 485:635-41.
- Ruzicka, D., Chamala, S., Barrios-Masias F. H., Martin, F., Smith, S., Jackson, L. E., Barbazuk, W. B. and Schachtman D. P. (2012) Inside arbuscular mycorrhizal roots - Molecular probes to understand the symbiosis. The Plant Genome (In Press)
- Barbazuk WB. (2010) A conserved alternative splicing event in plants reveals an ancient exonization of 5S rRNA that regulates TFIIIA. RNA Biol. 7:397-402.
Progress 09/01/10 to 08/31/11
Outputs OUTPUTS: We have developed potato and tomato trained versions of TWINSCAN and TWINSCAN_EST and have provided training sets and predicted genes to the International Tomato Genome Project (ITGP). Over 675,000 454 reads and 223,441 publicly available Sanger-sequenced tomato ESTs were aligned to the tomato WGS assembly (Ver 1.0.3) to identify putative full length gene models. Initial analysis produced 58,314 assemblies, which were reduced to 3342 after filtering out redundancy, short assemblies and those not predicted to encode full length proteins. Genomic segments for each of the 3342 predicted models, flanked with up to 2 Kb of sequence, were cut out of the tomato draft genome assembly and served as a training set. This collection represents a >5X increase in our training set size and is a comprehensive representation of gene structures and nucleotide usage in tomato. Of the 3342 assemblies composing the training set, 441 represented single exon genes, while the remaining 2901 genes contain one or more introns (average 5.86 exons/gene). Partial summary statistics of the training set are: Average transcript length: 1,515 bp; Average coding length: 1,057 bp; Total length of transcript assembly training sequence: 5,063,031 bp; Total length of coding sequence: 3,532,134 bp. Training and testing were accomplished using the fourfold cross validation described in Brown et al. Genome Res. 2005 15:742-7. Custom PERL scripts were written to help automate the process. Performance was assessed in terms of the accuracy of predictions, measured as sensitivity and specificity. In addition, the popular ab initio gene finder FGENESH was simultaneously compared against each training and testing slice. Gene predictions were made by running TWINSCAN version 4.1.2 on assembly release 2.1 of the tomato genome sequence using the TAIR9 Arabidopsis genome assembly as an informant sequence. 
The EST database for estseq generation was a collection of 239,564 Sanger-sequenced tomato ESTs. TWINSCAN_EST predicted 109,644 exons representing 34,600 complete genes, which were provided to the ITGP. Generation of potato gene models for training followed the same procedure as generation of tomato gene models above, with the following changes. PASA was used to align 206,565 potato ESTs to the 12 potato pseudomolecules. Analysis produced 20,251 assemblies, which were reduced to 1555 for training (see above). Gene predictions were made by running TWINSCAN version 4.1.2 on the 12 pseudomolecules for potato using the same informant sequence as above. The EST database for estseq generation was a collection of 206,565 Sanger-sequenced potato ESTs. TWINSCAN_EST predicted 130,669 exons representing 43,589 complete genes, which were provided to the ITGP. TWINSCAN_EST gene finding accuracy is: Tomato Gene Sensitivity: FGENESH 39.07%, TWINSCAN_EST 79.56%; Tomato Gene Specificity: FGENESH 22.92%, TWINSCAN_EST 52.13%; Potato Gene Sensitivity: FGENESH 39.42%, TWINSCAN_EST 66.17%; Potato Gene Specificity: FGENESH 24.35%, TWINSCAN_EST 45.23%. PARTICIPANTS: W. Brad Barbazuk (University of Florida), Principal Investigator. Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis), Co-PI. Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute), Co-PI. Giovannoni provides RNA from several mixed tissues for transcriptome sequencing. 
Lukas Mueller (Boyce Thompson Institute), Co-PI. Mueller provides a portal for the sequence data to be disseminated to the community, provides BAC and EST datasets for training set construction, and ultimately will feed TWINSCAN and TWINSCAN_EST predictions into their tomato genome annotation. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, and implemented and ran the pipeline for identifying gene models for training. Ruth Davenport, University of Florida, used trained TWINSCAN to annotate tomato sequence and assisted in testing. Mike Sandford MSc Eng., Computer technician. Sandford maintained computer infrastructure and software for this project. Brandon Walts MSc, Computer technician. Walts helped maintain computer infrastructure and software for this project. TARGET AUDIENCES: Tomato and potato trained TWINSCAN and TWINSCAN_EST, as well as the tomato and potato predicted gene sets provided to the International project, have contributed to the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to all academic and industry researchers in the Solanaceae. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
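The fourfold cross validation used for training and testing above can be sketched generically as follows. This is not the procedure of Brown et al. (2005) or the project's custom PERL scripts, whose details are not given in this report; it only shows the general idea of partitioning the gene models into four slices and, in turn, training on three while testing on the fourth. The function name, the shuffling step and the seed are illustrative assumptions.

```python
import random


def fourfold_splits(gene_models, k=4, seed=0):
    """Yield (train, test) lists for k-fold cross validation.

    The models are shuffled once with a fixed seed (an assumption, made so
    the split is reproducible), dealt into k near-equal slices, and each
    slice serves as the test set exactly once.
    """
    models = list(gene_models)
    random.Random(seed).shuffle(models)
    folds = [models[i::k] for i in range(k)]  # round-robin slices
    for i in range(k):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test


# Hypothetical identifiers standing in for the 3342 tomato training models.
model_ids = [f"model_{n:04d}" for n in range(3342)]
splits = list(fourfold_splits(model_ids))
for train, test in splits:
    print(len(train), len(test))
```

Because every model is tested exactly once and never appears in the training set of its own slice, accuracy figures averaged over the four slices are not inflated by testing on training data.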
Impacts This research has generated a significant tomato transcript sequence database. These transcripts have been assembled by Lukas Mueller into a larger collection of tomato unigenes available to the public as part of the tomato genome project. These sequences have increased the accuracy of gene finding during the annotation phase of the international effort. In addition, this project has delivered an accurate gene finder for gene discovery trained to predict genes in both tomato and potato genomic sequence, as well as a potato gene model training set that was used by the international project to train their gene finders for potato to aid comparative genomics analysis. The tomato and potato trained versions of TWINSCAN and TWINSCAN_EST are much more accurate than FGENESH, which is the most commonly used gene prediction program for plants. The accuracy of tomato and potato trained TWINSCAN and TWINSCAN_EST has had a significant impact on improving annotation accuracy for the public tomato genome sequence project. Gene models were predicted by this project within the tomato and potato whole genome sequence assemblies using tomato and potato trained TWINSCAN and TWINSCAN_EST, shared with the International Tomato Genome Sequence consortium, and used in improving the final annotations of these public projects. Tomato and potato trained TWINSCAN and TWINSCAN_EST, as well as the tomato and potato predicted gene sets provided to the International project, have contributed to the public annotation of the tomato genome. The tomato genome has been selected as the target reference to understand genome evolution and genetic diversity in the Solanaceae, and is of high value to academic and industry researchers in the Solanaceae.
Publications
- The Genomes that make Tomatoes. The International Tomato Sequencing project. Letter to Nature, submitted 2011
Progress 09/01/09 to 08/31/10
Outputs OUTPUTS: The aim of this research project is to optimize the ab initio gene prediction software packages TWINSCAN and TWINSCAN_EST for identifying genes in tomato whole genome sequence. Optimization requires the production of a training set. Training set construction involves identifying putative full length tomato gene models from sequence assemblies produced by combining the large collection of public ESTs with ~200 Mb of tomato EST sequence generated on the Roche 454 sequencing platform during the first year of this project, and performing a reference-based assembly of the combined collection against tomato genome sequence. Alignments were made to 856 BAC clones and generated 13,373 putative assemblies. Filtering out short assemblies, partial genes and redundancy, and screening for assemblies that likely represent full-length transcripts, resulted in a final collection of 945 putative full length tomato transcripts. The genomic sequence representing the full gene model associated with each transcript was obtained from the BAC sequences that were used in the reference-based assembly (exons and introns + 2 Kb of sequence up and downstream), and this genomic sequence was used to train the TWINSCAN and TWINSCAN_EST gene prediction programs. Of the 945 putative assemblies for training, 110 represented single exon genes. Partial summary statistics of the training set are: Average transcript length: 1,496 bp; Average coding length: 1,043 bp; Total length of transcript assembly training sequence: 1,414,346 bp; Total length of coding sequence: 985,929 bp; Total number of exons: 5625; Average number of exons: 5.95 per gene; Average exon length: 175 bp. Based on these metrics and the non-redundant nature of the predicted proteins, we expected that this collection would serve as a reasonably good representation of tomato gene complexity, and thus would act as a good training set. Training and testing were accomplished using the fourfold cross validation described in Brown et al. Genome Res. 2005 15:742-7. 
Custom PERL scripts were written to help automate the process. Performance was assessed in terms of the accuracy of predictions, measured as sensitivity and specificity. In addition, the popular ab initio gene finder FGENESH was simultaneously compared against each training and testing slice. Performance is as follows: Gene Sensitivity: FGENESH 39.07%, TWINSCAN 28.76%, TWINSCAN_EST 56.69%; Gene Specificity: FGENESH 22.92%, TWINSCAN 15.90%, TWINSCAN_EST 31.70%. At this point native TWINSCAN does not perform as well as FGENESH utilizing a Dicot parameter set. However, the results are significantly improved when adding ESTs and using TWINSCAN_EST. We anticipate that future improvements in the depth and complexity of our training set will result in improved performance of both TWINSCAN and TWINSCAN_EST. Currently we are exploring the use of the tomato shotgun assembly produced by the International Tomato Genome Sequencing Project. Our expectation is that the larger portion of the genome covered by this resource, in comparison to the relatively small number of completed BACs available (856) when the training set was constructed, will enable discovery of a larger set of complete gene models for future iterations of training and testing. PARTICIPANTS: W. Brad Barbazuk (University of Florida), Principal Investigator. Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis), Co-PI. Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute), Co-PI. Giovannoni provides RNA from several mixed tissues for transcriptome sequencing. 
Lukas Mueller (Boyce Thompson Institute), Co-PI. Mueller provides a portal for the sequence data to be disseminated to the community, provides BAC and EST datasets for training set construction, and ultimately will feed TWINSCAN and TWINSCAN_EST predictions into their tomato genome annotation. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, and implemented and ran the pipeline for identifying gene models for training. Ruth Davenport, University of Florida, used trained TWINSCAN to annotate tomato sequence and assisted in testing. Mike Sandford MSc Eng., Computer technician. Sandford maintained computer infrastructure and software for this project. Brandon Walts MSc, Computer technician. Walts helped maintain computer infrastructure and software for this project. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
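The training set summary statistics above can be cross-checked against one another with simple arithmetic. The short sketch below recomputes the per-gene averages from the reported totals for the 945-model training set; the variable names are illustrative, and only figures stated in this report are used as inputs.

```python
# Totals reported for the 945-model tomato training set.
n_genes = 945
total_transcript_bp = 1_414_346
total_coding_bp = 985_929
total_exons = 5625

# Averages derived from the totals.
avg_transcript_bp = total_transcript_bp / n_genes
avg_coding_bp = total_coding_bp / n_genes
exons_per_gene = total_exons / n_genes
coding_bp_per_exon = total_coding_bp / total_exons

print(f"average transcript length: {avg_transcript_bp:.1f} bp")
print(f"average coding length:     {avg_coding_bp:.1f} bp")
print(f"average exons per gene:    {exons_per_gene:.2f}")
print(f"coding bp per exon:        {coding_bp_per_exon:.1f} bp")
```

The derived values agree with the reported averages (1,496 bp, 1,043 bp and 5.95 exons per gene) to within rounding. The reported average exon length of 175 bp matches the total coding length divided by the exon count, which suggests, although the report does not say so, that it was computed over coding sequence only.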
Impacts This research has generated a significant tomato transcript sequence database, and will generate an accurate gene finder for gene discovery in tomato genomic sequence. These transcripts have been assembled by Lukas Mueller into a larger collection of tomato unigenes available to the public as part of the tomato genome project. These sequences will increase the accuracy of gene finding during the annotation phase of the international effort. The transcript collection and the advanced gene finder will have a significant impact on tomato gene discovery, will improve accuracy, and will provide a foundation for understanding gene organization and function in the tomato genome. Furthermore, it is likely that tomato trained TWINSCAN and TWINSCAN_EST will perform well as gene finders in other Solanaceae species.
Publications
- Barbazuk WB. A conserved alternative splicing event in plants reveals an ancient exonization of 5S rRNA that regulates TFIIIA. RNA Biol. 2010 7:397-402.
Progress 09/01/08 to 08/31/09
Outputs OUTPUTS: The central goal of this research is to provide a computer gene prediction program that has been optimized for identifying genes in tomato genome sequence. This project will improve upon the current computational tools for tomato gene prediction by developing a comprehensive "training set" of complete and annotated tomato gene models and optimizing the probability model underlying the TWINSCAN gene finding program. The training set will consist of tomato gene models identified by assembling the large collection of public ESTs with 200 Mb of tomato EST sequence generated in this project by 454 sequencing of a normalized, full length enriched cDNA pool, and aligning these assemblies to available tomato genomic sequences. Tomato trained TWINSCAN will be freely available to the scientific community under the Open-Source software agreement. The specific objectives are to 1) generate ~200 Mb of tomato transcript sequence by sequencing a normalized, full length transcript enriched tomato cDNA pool, 2) construct a comprehensive collection of well annotated tomato genes with the sequence generated, and 3) train TWINSCAN and use it to help annotate the tomato genome project sequence. To date, RNA has been isolated from 5 tissues: whole seedling, flowers, root, leaf and fruit. Full length enriched cDNA was synthesized from each RNA sample, pooled and then normalized. The cDNA was subjected to two 454-FLX runs at the UF genomics service center. Directional tags were added during cDNA synthesis to enable recovery of 454 sequences that are either 3' or 5'. Over 767,000 454 tomato transcript sequences were obtained, representing greater than 175 Mbp. These ESTs were combined with 223,441 available ESTs in GenBank, and the whole EST collection was aligned to 735 finished BAC sequences with PASA to define gene models. 
The resulting assemblies were screened for the presence of 454 sequences with directional tags (which signify potential 5' or 3' ends of the original FL enriched cDNA), clustered to remove redundancy, and BLAST aligned to potential full length protein sequences to identify potential structurally complete tomato genes. Currently we have ~1300 potential full length tomato gene models that we identified during this first iteration of the process. These are being confirmed as full length and are being used to train the first iteration of tomato-TWINSCAN. PARTICIPANTS: W. Brad Barbazuk (University of Florida), Principal Investigator. Barbazuk directed the project. Barbazuk designed the computational strategy and computational pipeline to identify potential gene models for training TWINSCAN. Michael Brent (Washington University, St. Louis), Co-PI. Brent provides expertise and software packages for TWINSCAN and its training modules. James Giovannoni (Boyce Thompson Institute), Co-PI. Giovannoni provides RNA from several mixed tissues for transcriptome sequencing. Lukas Mueller (Boyce Thompson Institute), Co-PI. Mueller provides a portal for the sequence data to be disseminated to the community. Srikar Chamala, MSc, Computer Technician. Chamala performed the transcriptome sequence assembly and analysis, and implemented and ran the pipeline for identifying gene models for training. Mike Sandford MSc Eng., Computer technician. Sandford maintained computer infrastructure and software for this project. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
Impacts This research has generated a significant tomato transcript sequence database, and will generate an accurate gene finder for gene discovery in tomato genomic sequence. The transcript collection and the advanced gene finder will have a significant impact on tomato gene discovery, will improve accuracy, and will provide a foundation for understanding gene organization and function in the tomato genome. Furthermore, it is likely that tomato trained TWINSCAN will perform well as a gene finder in other Solanaceae species.
Publications
- Yan Fu, Oliver Bannach, Hao Chen, Jan-Hendrik Teune, Axel Schmitz, Gerhard Steger, Liming Xiong, and W. Brad Barbazuk . Alternative Splicing of Anciently Exonized 5S rRNA Regulates Plant Transcription Factor TFIIIA. 2009; Genome Res. 19:913-921
- W. Brad Barbazuk & Patrick S. Schnable 2008. Pyrosequencing the transcriptome for gene annotation and SNP discovery. In cDNA libraries: Methods and Applications. (Chaofu Ed.) Humana Press INC.
- W. Brad Barbazuk, Yan Fu, Karen M. McGinnis. Genome-wide analyses of alternative splicing in plants: Opportunities and Challenges. Genome Research 2008 Sep;18(9):1381-92.