Progress 11/01/16 to 10/31/17
Target Audience:
We have finalized the bioinformatics work, and quality control of the SNP markers is currently under way. During the last year, we held regular conference calls to discuss project progress. We also talked with stakeholders (breeders and geneticists) about establishing a pine consortium to pool resources for large-scale genotyping. In a survey we conducted, many companies and organizations showed great interest in genotyping their material once the SNP arrays are designed.
Changes/Problems:
What opportunities for training and professional development has the project provided?
Undergraduate students worked with the Wegrzyn, Eckert and Isik labs to gain experience. The Isik lab recruited a new graduate student (Colin Jakson) to work on the bioinformatics pipeline for additional sequence data.
How have the results been disseminated to communities of interest?
The report has been emailed to stakeholders, mostly researchers and professionals in the field. Survey results have been shared with stakeholders.
What do you plan to do during the next reporting period to accomplish the goals?
We will finish quality control of the SNP loci using multiple methods and design the SNP arrays. The next step is to place the SNP markers in the TreeGenes database for public access.
What was accomplished under these goals?
Previously generated NGS datasets originating from the USDA-funded PineMAP project were collected from NC State University, Virginia Tech and Texas A&M University and stored at UConn. In addition, early exome capture data generated at the University of Florida was included. All of these datasets had been aligned to v0.8 or v1.01 of the loblolly pine genome, with an earlier version of the annotation. The most recently released version, v2.01, has been annotated and serves as the new target for variant detection and genotyping array design in this project. All raw data were quality controlled using Sickle and re-checked for coverage and sequencing constraints. These data were processed with the basic pipeline outlined in Figure 1.
Figure 1: Basic workflow for the annotation of variants across datasets
Exome capture data were made available from the University of Florida (24 loblolly pine and 24 slash pine) and Texas A&M (375 loblolly pine individuals). The exome capture datasets were used to test and implement the final pipeline. Two SNP-calling methods were tested for the pipeline: VarScan2 and FreeBayes. Test runs determined that VarScan2 generated consistently reliable calls. Further testing was applied to assess appropriate filtering parameters for the data (Figure 2). Following read alignment and strict and general filtering of SNP calls, the SNP call files (VCF files) were merged and the SNPs were further filtered based on very basic population metrics. The refined workflow for exome capture data is detailed in Figure 3.
Figure 2: Evaluating VarScan2 parameters on exome capture data
Figure 3: Detailed SNP detection pipeline applied to exome capture data
Following the selection of optimal SNP calls, the individual SNPs were evaluated with regard to their flanking regions. Here, we examined 50 bp on either side of each SNP and identified those SNPs with no (or very few) flanking SNPs as priority selections.
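The flanking-region screen described above can be illustrated with a short sketch. This is not the project's actual code: the function name select_isolated_snps, the (contig, position) input format, and the max_neighbors threshold are assumptions introduced only for illustration, and the exact cutoff used for "very few" flanking SNPs is not stated in this report.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def select_isolated_snps(snps, flank=50, max_neighbors=0):
    """Keep SNPs with at most `max_neighbors` other SNPs within `flank` bp.

    `snps` is an iterable of (contig, position) tuples with unique positions
    per contig (hypothetical input format, e.g. parsed from a merged VCF).
    """
    by_contig = defaultdict(list)
    for contig, pos in snps:
        by_contig[contig].append(pos)

    selected = []
    for contig, positions in by_contig.items():
        positions.sort()
        for pos in positions:
            # Index range of SNPs falling in [pos - flank, pos + flank]
            lo = bisect_left(positions, pos - flank)
            hi = bisect_right(positions, pos + flank)
            neighbors = hi - lo - 1  # subtract the SNP itself
            if neighbors <= max_neighbors:
                selected.append((contig, pos))
    return selected
```

Setting max_neighbors=0 keeps only SNPs with clean 50 bp flanks on both sides; raising it relaxes the screen to admit SNPs with a small number of nearby variants.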
These selections were further processed using Illumina's ADT scoring software. This software provides a relative score, and the highest-scoring SNPs were selected for the next stage. This resolved 169K potential targets for array development. The remaining selection of targets is based upon the loblolly pine genome annotation. The new annotation was curated with SnpEff in order to identify synonymous and non-synonymous sites as well as SNPs upstream of high-quality gene annotations. The new annotation released with v2.01 was functionally re-annotated in order to provide comprehensive information on the gene space. Of the ADT-score-filtered SNPs, a total of 52K were within or directly upstream of high-quality gene annotations. This total also includes RAD-Seq data generated at NCSU for an additional 1432 individuals.
Work is under way with the Eckert lab to further refine the SNP selections based upon nucleotide diversity and linkage disequilibrium (LD). Work to date has quantified patterns of diversity for the majority of assembled contigs with at least one SNP in the most recent release of the loblolly pine genome sequence, and has begun to quantify patterns of LD for contigs with at least two SNPs. LD was quantified pairwise, and the average pairwise value per contig was reported. Patterns to date are consistent with previous per-site estimates of nucleotide diversity for loblolly pine (π = 0.0055 versus π = 0.0033 from Eckert et al. 2013), although the variance across contigs is much larger (2.5 times larger than in Eckert et al. 2013). Similar patterns are emerging with LD, although these analyses are not yet complete. Initial analysis of LD decay plots (i.e. pairwise physical distance in bp versus magnitude of LD) shows rapid to moderate rates of decay, with a much larger variance than reported previously. When coupled with emerging patterns of LD, these diversity summaries allow for a set of nuanced selection criteria.
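As a rough illustration of the per-contig summaries described above, the sketch below computes per-site nucleotide diversity and the mean pairwise LD for one contig. The report does not state which estimators were used, so everything here is an assumption: SNPs are treated as biallelic, diversity uses the standard unbiased heterozygosity estimator summed over SNPs and divided by the number of sites surveyed, and LD is taken as the squared Pearson correlation of unphased genotype dosages (0, 1 or 2 copies of the alternate allele). All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def nucleotide_diversity(alt_counts, n_chromosomes, n_sites):
    """Per-site pi from alternate-allele counts at biallelic SNPs.

    alt_counts: alternate-allele count at each SNP on the contig
    n_chromosomes: number of sampled chromosomes (2 x diploid individuals)
    n_sites: number of sites surveyed on the contig (assumed known)
    """
    n = n_chromosomes
    total = 0.0
    for k in alt_counts:
        p = k / n
        total += 2.0 * p * (1.0 - p) * n / (n - 1)  # unbiased heterozygosity
    return total / n_sites


def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return None  # undefined when either SNP is monomorphic in the sample
    return cov * cov / (var1 * var2)


def mean_pairwise_ld(genotypes):
    """Average r^2 over all SNP pairs on a contig; None if no valid pair."""
    values = []
    for g1, g2 in combinations(genotypes, 2):
        r2 = r_squared(g1, g2)
        if r2 is not None:
            values.append(r2)
    return mean(values) if values else None
```

Contigs could then be ranked jointly on the two values, for instance flagging those that are high on both for either tagging or enrichment.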
For example, several contigs had high levels of both diversity and LD. Depending on the needs of the consortium, we could take either a tagSNP approach, given the high levels of LD, so that each such region need only be represented by one to a few SNPs, or an enrichment-type approach, because high LD and high nucleotide diversity are signatures of non-neutral evolution. Previously annotated SNPs identified through NSF- and USDA-funded resequencing projects will also be included on the final array. The final size and selections will be derived from the TreeGenes database, which stores the full profile of quality, coverage, population, and annotation information for each locus.
Literature Cited
Eckert, A. J., J. L. Wegrzyn, J. D. Liechty, J. M. Lee, W. P. Cumbie, J. M. Davis, B. Goldfarb, C. A. Loopstra, S. R. Palle, T. Quesada, C. H. Langley, and D. B. Neale. 2013. The evolutionary genetics of the genes underlying phenotypic associations for loblolly pine (Pinus taeda, Pinaceae). Genetics 195: 1353-1372.
Progress 11/01/15 to 10/31/16
Target Audience:
We reached researchers and professionals in the pine breeding community during the first year of the project. We invited representatives from major forest companies and organizations to the project kick-off meeting at NC State University, Raleigh, on January 21, 2016. We also attended the Cooperative Tree Improvement Program annual meeting on May 11, 2016 and introduced the project to the stakeholders. The project has increased interest among pine breeding organizations and companies outside the US (Brazil, Chile, New Zealand, South Africa).
Changes/Problems:
What opportunities for training and professional development has the project provided?
Graduate student Trevor Walker and research assistant professor Juan Acosta of NC State University received bioinformatics training in Dr. Jill Wegrzyn's lab at the University of Connecticut in July 2016. Graduate student Brian Smith of the NCSU Bioinformatics program has been trained on bioinformatics tools to carry out sequence alignment. He has been hired as a temporary student worker to help with the project.
Table 1: Total read data included in assembly for v2.0
Data type             Number          Total length (bp)   Mean read length (bp)  Coverage  Clone coverage
PacBio reads          27,667,399      267,426,106,405     9,665                  12X       n/a
Illumina reads        10,563,266,162  1,499,483,795,334   142                    68X       96X
5-10 kb paired reads  3,152,047,806   475,959,218,706     151                    22X       69X
Super-reads           96,369,476      44,307,329,021      460                    2X        2X
Mega-reads            27,986,125      103,129,750,091     3,685                  4.7X      4.7X
Table 2: Comparative statistics on assembly improvement from v1.01 to v2.0
Assembly                      Ptaeda 1.01        Ptaeda 2.0
Total size                    20,148,103,497 bp  20,613,845,687 bp
Total scaffold span           22,564,679,219 bp  22,104,209,064 bp
N50 contig size               8,206 bp           25,361 bp
Number of contigs             16,461,900         2,855,700
Number of contigs > 500 bp    2,527,203          2,445,689
N50 scaffold size             66,920 bp          107,036 bp
Number of scaffolds > 200 bp  7,068,375          1,762,655
Number of scaffolds > 500 bp  2,158,326          1,496,869
How have the results been disseminated to communities of interest?
The report has been emailed to stakeholders, mostly researchers and professionals in the field.
What do you plan to do during the next reporting period to accomplish the goals?
We are in the process of SNP calling and annotation. This work will continue for several months.
What was accomplished under these goals?
The genome sequence for v2.0 of the loblolly pine genome was released in August 2016 through the PineRefSeq project. The v2.0 assembly incorporates 11x coverage of PacBio genomic reads as well as fosmid sequence resources (Table 1). The final assembly showed a substantial increase in N50 contig and scaffold length as well as a reduction in the total number of scaffolds (> 200 bp in length) from 7.1 million in v1.01 to 1.8 million in v2.0 (Table 2). A publication note has been submitted announcing the latest release of the genome.
Following the release, the University of Connecticut team conducted transcriptome assembly from two sources of Illumina short reads (generated by Indiana University and TAMU) as well as PacBio Iso-Seq reads from the same individual that was genome sequenced (20-1010), generated by North Carolina State University (Ross Whetten). For scaffolding purposes, the read sets were each assembled independently (de novo) and aligned to both genome versions for comparison (the published v1.01 versus v2.0). The full reference PacBio set contained 70,064 transcripts, of which 19,418 were used for scaffolding with custom software. This linked a total of 11,951 scaffolds into 4,545 super-scaffolds. This assembly is listed as v2.01 and has been released to the PineRefSeq project page. This release serves as the primary reference for annotation of the new repeats and gene models, and is the source of the current annotation.
Prior to evaluating the gene space, full characterization of the repeat content is necessary; this was conducted with the REPET pipeline. Gene annotation involved mapping the short reads (RNA-Seq read data) to the genome in addition to aligning full-length PacBio transcripts. Previous approaches relying on the MAKER-P pipeline proved unsuccessful when applied to fragmented conifer assemblies with high pseudogene content.
The new approach uses the BRAKER pipeline, which combines the alignments with self-training of AUGUSTUS for ab initio gene prediction. The final gene selections totaled 85,622 multi-exonic, full-length genes. This number is much larger than expected, and the set was further filtered based on true domain identification as implemented in InterProScan. The current gene set stands at 29,200 and is being examined for proper start site selection and canonical splice sites. A current GFF annotation file has been generated as preliminary data for the annotation but is being refined based on specific observations of incorrect calls from BRAKER. An additional 43,200 mono-exonic genes were identified in the BRAKER analysis. Specific examination of these is necessary, as most represent pseudogenes or result from fragmentation.
Previously generated NGS datasets originating from the USDA-funded PineMAP project were collected from NC State University, Virginia Tech and Texas A&M University and stored at UConn. All raw data were quality controlled using Sickle and re-checked for coverage and sequencing constraints. Test runs of genomic read mapping are complete for all data originating from Texas A&M University (370+ individuals). These data represent the only exome capture set and require specific filtering to accept variant calls. The final pipeline for re-mapping all of the read sets is established, with open-source packages installed and operational on the primary server. This pipeline and the associated parameters will be made available to all project members. Students are currently being trained on the specific packages included here. Preliminary SNP calls will be delivered to Andrew Eckert's team at VCU to validate the approach and parameters. All accepted selections will be loaded with metadata into the TreeGenes database for public release.
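The split between multi-exonic and mono-exonic gene models mentioned above can be illustrated by counting exon features per transcript in a GFF3 file. This is a minimal sketch, not the project's procedure: it assumes a GFF3-formatted annotation in which each exon line carries a Parent attribute naming its transcript, and the function names exon_counts and split_by_exon_number are hypothetical.

```python
from collections import Counter


def exon_counts(gff3_lines):
    """Count exon features per parent transcript in GFF3-formatted lines."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        # Column 9 holds semicolon-separated key=value attributes
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                counts[parent] += 1
    return counts


def split_by_exon_number(counts):
    """Return (multi_exonic, mono_exonic) transcript ID lists, sorted."""
    multi = sorted(t for t, n in counts.items() if n > 1)
    mono = sorted(t for t, n in counts.items() if n == 1)
    return multi, mono
```

For a file on disk, the lines can be passed directly from an open file handle, for example exon_counts(open(path)), where path is whichever annotation file is being examined.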