Progress 11/01/15 to 10/31/16
Target Audience: During the first year of the project we reached researchers and professionals in the pine breeding community. We invited representatives from major forest companies and organizations to the project kick-off meeting at NC State University, Raleigh, on January 21, 2016. We also attended the Cooperative Tree Improvement Program annual meeting on May 11, 2016 and introduced the project to the stakeholders. The project increased interest among pine breeding organizations and companies outside the US (Brazil, Chile, New Zealand, South Africa).
Changes/Problems:
What opportunities for training and professional development has the project provided? Graduate student Trevor Walker and research assistant professor Juan Acosta of NC State University received bioinformatics training in Dr. Jill Wegrzyn's lab at the University of Connecticut in July 2016. Graduate student Brian Smith of the NCSU Bioinformatics program has been trained on bioinformatics tools for sequence alignment and has been hired as a temporary student worker to help with the project.
Table 1: Total read data included in assembly for v2.0
Data type               Number            Total length (bp)      Mean read length   Coverage   Clone coverage
PacBio reads            27,667,399        267,426,106,405        9,665              12X        n/a
Illumina reads          10,563,266,162    1,499,483,795,334      142                68X        96X
5-10 kb paired reads    3,152,047,806     475,959,218,706        151                22X        69X
Super-reads             96,369,476        44,307,329,021         460                2X         2X
Mega-reads              27,986,125        103,129,750,091        3,685              4.7X       4.7X
Table 2: Comparative statistics on assembly improvement from v1.01 to v2.0
Assembly                        Ptaeda 1.01          Ptaeda 2.0
Total size                      20,148,103,497 bp    20,613,845,687 bp
Total scaffold span             22,564,679,219 bp    22,104,209,064 bp
N50 contig size                 8,206 bp             25,361 bp
Number of contigs               16,461,900           2,855,700
Number of contigs > 500 bp      2,527,203            2,445,689
N50 scaffold size               66,920 bp            107,036 bp
Number of scaffolds > 200 bp    7,068,375            1,762,655
Number of scaffolds > 500 bp    2,158,326            1,496,869
How have the results been disseminated to communities of interest? The report has been emailed to stakeholders, mostly researchers and professionals in the field.
What do you plan to do during the next reporting period to accomplish the goals? We are in the process of SNP calling and annotation. This work will continue for several months.
What was accomplished under these goals?
Version 2.0 of the loblolly pine genome sequence was released in August 2016 through the PineRefSeq project. The v2.0 assembly incorporates 11x coverage of PacBio genomic reads as well as fosmid sequence resources (Table 1). Compared with v1.01, the final assembly shows a substantial increase in N50 contig and scaffold length, and the number of scaffolds longer than 200 bp fell from about 7.1 million in v1.01 to about 1.76 million in v2.0 (Table 2). A publication note announcing the latest release of the genome has been submitted.
Following the release, the University of Connecticut team conducted transcriptome assembly from two sources: Illumina short reads generated by Indiana University and TAMU, and PacBio Iso-Seq reads generated by North Carolina State University (Ross Whetten) from the same individual that was genome sequenced (20-1010). For scaffolding purposes, the read sets were each assembled independently (de novo) and aligned to both genome versions (the published v1.01 and v2.0) for comparison. The full reference PacBio set contained 70,064 transcripts, of which 19,418 were used for scaffolding with custom software. This linked 11,951 scaffolds into 4,545 super-scaffolds. The resulting assembly is listed as 2.01 and has been released on the PineRefSeq project page. It serves as the primary reference for annotation of the new repeats and gene models and is the source of the current annotation.
Full characterization of the repeat content is necessary before the gene space can be evaluated, and this was conducted with the REPET pipeline. Gene annotation involved mapping short RNA-Seq reads to the genome in addition to aligning full-length PacBio transcripts. Previous approaches that relied on the MAKER-P pipeline proved unsuccessful on fragmented conifer assemblies with high pseudogene content.
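The N50 values reported for the v1.01 and v2.0 assemblies follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly. The sketch below is a minimal illustration of that definition, not the project's own tooling; the function name `n50` and the toy contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running total first reaches half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical lengths, not project data).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # 80
```

Because the statistic weights long sequences heavily, merging many short contigs into fewer long ones raises N50 even when the total assembly size changes little.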
The new approach uses the BRAKER pipeline, which combines the alignments with self-training of AUGUSTUS for ab initio gene prediction. The resulting gene selections totaled 85,622 multi-exonic, full-length genes. This number is much larger than expected, so the set was further filtered on the basis of protein domain identification as implemented in InterProScan. The current gene set stands at 29,200 and is being examined for proper start site selection and canonical splice sites. A preliminary GFF annotation file has been generated and is currently being refined based on specific observations of incorrect calls from BRAKER. An additional 43,200 mono-exonic genes were identified in the BRAKER analysis. These require specific examination, as most represent pseudogenes or result from fragmentation.
Previously generated NGS datasets originating from the USDA-funded PineMAP project were collected from NC State University, Virginia Tech and Texas A&M University and stored at UConn. All raw data have been quality controlled using Sickle and re-checked for coverage and sequencing constraints. Test runs of genomic read mapping are complete for all data originating from Texas A&M University (370+ individuals). These data are the only set from exome capture and require specific filtering to accept variant calls. The final pipeline for re-mapping all of the read sets is established, with open-source packages installed and operational on the primary server. This pipeline and its associated parameters will be made available to all project members. Students are currently being trained on the specific packages it includes. Preliminary SNP calls will be delivered to Andrew Eckert's team at VCU to validate the approach and parameters. All accepted selections will be loaded with metadata into the TreeGenes database for public release.
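The split between mono-exonic and multi-exonic gene models described above can be illustrated by counting exon features per parent in an annotation file. This is a minimal sketch under stated assumptions, not the project's pipeline: it assumes GFF3-style rows in which each exon carries a `Parent=` attribute, and the function name and example rows are hypothetical. Real BRAKER/AUGUSTUS output may use a different attribute layout and would need adapted parsing.

```python
from collections import Counter


def classify_genes_by_exon_count(gff_lines):
    """Split parent IDs into mono-exonic and multi-exonic sets.

    Assumption: tab-separated GFF3-style rows with nine columns, where
    each exon row has a Parent=<id> entry in the attributes column.
    """
    exon_counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon features are counted
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        parent = attributes.get("Parent")
        if parent:
            exon_counts[parent] += 1
    mono = {pid for pid, n in exon_counts.items() if n == 1}
    multi = {pid for pid, n in exon_counts.items() if n > 1}
    return mono, multi


# Hypothetical example rows (not project data).
example = [
    "##gff-version 3",
    "scaffold1\tAUGUSTUS\texon\t100\t200\t.\t+\t.\tID=e1;Parent=g1.t1",
    "scaffold1\tAUGUSTUS\texon\t300\t400\t.\t+\t.\tID=e2;Parent=g1.t1",
    "scaffold2\tAUGUSTUS\texon\t50\t900\t.\t-\t.\tID=e3;Parent=g2.t1",
]
mono, multi = classify_genes_by_exon_count(example)
print(sorted(mono), sorted(multi))  # ['g2.t1'] ['g1.t1']
```

Exon count alone does not separate true single-exon genes from pseudogenes or fragments, which is why the mono-exonic set still needs the case-by-case examination noted above.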