Source: NORTH CAROLINA STATE UNIV submitted to
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
Funding Source
Reporting Frequency
Accession No.
Grant No.
Project No.
Proposal No.
Multistate No.
Program Code
Project Start Date
Nov 1, 2015
Project End Date
Oct 31, 2019
Grant Year
Project Director
Isik, F.
Recipient Organization
Performing Department
Tree Improvement Cooperative
Non Technical Summary
Publicly funded research projects have produced vast genomic resources for loblolly and sugar pine. We aim to discover about 100,000 candidate DNA markers based on re-sequencing efforts currently underway. Genomic resources will be organized into the TreeGenes database for community access. Leveraging the TreeGenes database, a PineSNPchip consortium will be established to bring the forest genetics community together to design SNP arrays for genomic breeding approaches in forest trees.
Animal Health Component
Research Effort Categories

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
Knowledge Area
202 - Plant Genetic Resources;

Subject Of Investigation
0611 - Conifer forests of the South;

Field Of Science
1080 - Genetics;
Goals / Objectives
1. Discover candidate single nucleotide polymorphic markers (SNPs) based on the loblolly pine and sugar pine re-sequencing efforts currently underway.
2. Annotate and select SNPs for a medium density (100K) Illumina genotyping array.
3. Organize genomic resources and final SNP selections into the TreeGenes database for community access.
Project Methods
SNP Discovery: Candidate SNP discovery will draw on existing large-scale USDA-funded conifer genomics projects. Datasets standardized for sequence quality will be analysed for patterns of diversity, divergence and linkage disequilibrium. The ability to select markers in regions with characterized patterns of linkage disequilibrium will allow us to calibrate expectations for success.

Annotate and select SNPs for genotyping array: The success of such an assay depends on reliable bioinformatic SNP detection procedures, including even spacing of SNPs across the genome, selection on minor allele frequency, and accuracy of the gene space annotations. Extensive bioinformatic analysis and strict selection criteria will be used to select 90,000 SNPs for the Illumina assay. The final assay will be divided between the two conifers, 80/20 in favor of loblolly pine.

TreeGenes database: The primary information resource for the SNP data will be the TreeGenes database, which will reference all dbSNP accessions. All selected SNPs will be submitted to GenBank's dbSNP repository and associated with physical positions in both genomes. The resulting reference SNP numbers will be updated as the reference sequence evolves over time through additional sequencing and scaffolding efforts.
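The selection criteria named above (minor allele frequency, even spacing, and the 80/20 split of the array) can be illustrated with a minimal Python sketch. It is not the project's pipeline: the `Snp` record, the function names, and the threshold values used in the usage note are hypothetical, chosen only to show the idea of a greedy per-scaffold spacing filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    scaffold: str   # scaffold or contig name (hypothetical record layout)
    position: int   # position on the scaffold, in bp
    maf: float      # minor allele frequency, between 0.0 and 0.5


def select_snps(candidates, min_maf, min_spacing):
    """Keep SNPs with maf >= min_maf that lie at least min_spacing bp
    from the previously kept SNP on the same scaffold (greedy, left to right)."""
    selected = []
    last_kept = {}  # scaffold name -> position of the last SNP kept on it
    for snp in sorted(candidates, key=lambda s: (s.scaffold, s.position)):
        if snp.maf < min_maf:
            continue
        previous = last_kept.get(snp.scaffold)
        if previous is not None and snp.position - previous < min_spacing:
            continue
        selected.append(snp)
        last_kept[snp.scaffold] = snp.position
    return selected


def split_array(total_snps, loblolly_percent):
    """Divide the array between loblolly pine and sugar pine."""
    loblolly = total_snps * loblolly_percent // 100
    return loblolly, total_snps - loblolly
```

For example, `split_array(90000, 80)` gives 72,000 loblolly pine SNPs and 18,000 sugar pine SNPs, matching the 80/20 division described above. A call such as `select_snps(candidates, min_maf=0.05, min_spacing=1000)` uses illustrative thresholds only; the report does not state the actual cut-offs.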

Progress 11/01/15 to 10/31/16

Target Audience: During the first year of the project we reached researchers and professionals in the pine breeding community. We invited representatives from major forest companies and organizations to a kick-off meeting at NC State University, Raleigh, on January 21, 2016. We also attended the Cooperative Tree Improvement Program annual meeting on May 11, 2016, and introduced the project to the stakeholders. The project has increased interest among pine breeding organizations and companies outside the US (Brazil, Chile, New Zealand, South Africa).

Changes/Problems: Nothing Reported

What opportunities for training and professional development has the project provided? Graduate student Trevor Walker and research assistant professor Juan Acosta of NC State University received bioinformatics training in Dr. Jill Wegrzyn's lab at the University of Connecticut in July 2016. Graduate student Brian Smith of the NCSU Bioinformatics program has been trained on bioinformatics tools to carry out sequence alignment. He has been hired as a temporary student worker to help with the project.
Table 1: Total read data included in assembly for v2.0

Data type              Number           Total Length (bp)     Mean read length   Coverage   Clone Coverage
PacBio reads           27,667,399       267,426,106,405       9,665              12X        n/a
Illumina reads         10,563,266,162   1,499,483,795,334     142                68X        96X
5-10 kb paired reads   3,152,047,806    475,959,218,706       151                22X        69X
Super-reads            96,369,476       44,307,329,021        460                2X         2X
Mega-reads             27,986,125       103,129,750,091       3,685              4.7X       4.7X

Table 2: Comparative statistics on assembly improvement from v1.01 to v2.0

Assembly                        Ptaeda 1.01          Ptaeda 2.0
Total size                      20,148,103,497 bp    20,613,845,687 bp
Total scaffold span             22,564,679,219 bp    22,104,209,064 bp
N50 contig size                 8,206 bp             25,361 bp
Number of contigs               16,461,900           2,855,700
Number of contigs > 500 bp      2,527,203            2,445,689
N50 scaffold size               66,920 bp            107,036 bp
Number of scaffolds > 200 bp    7,068,375            1,762,655
Number of scaffolds > 500 bp    2,158,326            1,496,869

How have the results been disseminated to communities of interest? The report has been emailed to stakeholders, mostly researchers and professionals in the field.

What do you plan to do during the next reporting period to accomplish the goals? We are in the process of SNP calling and annotation. This work will continue for several months.
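The Coverage column of Table 1 can be approximately reproduced by dividing each total read length by an estimate of genome size. The report does not say which genome size was used; the sketch below assumes the v2.0 total scaffold span from Table 2 (22,104,209,064 bp), which gives values that round to the reported figures. The names `TOTAL_BP`, `GENOME_SIZE_BP` and `coverage` are illustrative.

```python
# Total read lengths in bp, copied from Table 1.
TOTAL_BP = {
    "PacBio reads": 267_426_106_405,
    "Illumina reads": 1_499_483_795_334,
    "5-10 kb paired reads": 475_959_218_706,
    "Super-reads": 44_307_329_021,
    "Mega-reads": 103_129_750_091,
}

# Assumption: the v2.0 total scaffold span (Table 2) stands in for genome size.
GENOME_SIZE_BP = 22_104_209_064


def coverage(total_bp, genome_size_bp=GENOME_SIZE_BP):
    """Sequencing depth as total sequenced bases divided by genome size."""
    return total_bp / genome_size_bp


for data_type, total_bp in TOTAL_BP.items():
    print(f"{data_type}: {coverage(total_bp):.1f}X")
```

The Clone Coverage column is not reproduced here, since it depends on insert sizes that the report does not list.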

What was accomplished under these goals? Version 2.0 of the loblolly pine genome sequence was released in August 2016 through the PineRefSeq project. The v2.0 assembly incorporates 11x coverage of PacBio genomic reads as well as fosmid sequence resources (Table 1). The final assembly shows a substantial increase in N50 contig and scaffold length, as well as a reduction in the total number of scaffolds (> 200 bp in length) from 7.1 million in v1.01 to 1.7 million in v2.0 (Table 2). A publication note announcing the latest release of the genome has been submitted.

Following the release, the University of Connecticut team conducted transcriptome assembly with two sources (Illumina short reads generated by Indiana University and TAMU) as well as PacBio Iso-Seq reads generated by North Carolina State University (Ross Whetten) from the same individual that was genome sequenced (20-1010). For scaffolding purposes, the read sets were each assembled independently (de novo) and aligned to both genome versions for comparison (the published v1.01 versus v2.0). The full reference PacBio set contained 70,064 transcripts, of which 19,418 were used for scaffolding with custom software. This linked a total of 11,951 scaffolds into 4,545 super scaffolds. This assembly is listed as 2.01 and has been released to the PineRefSeq project page. It serves as the primary reference for annotation of the new repeats and gene models and is the source of the current annotation.

Prior to evaluating the gene space, full characterization of the repeat content is necessary; this was conducted with the REPET pipeline. Gene annotation involved mapping short-read RNA-Seq data to the genome in addition to aligning full-length PacBio transcripts. Previous approaches that relied on the MAKER-P pipeline proved unsuccessful on fragmented conifer assemblies with high pseudogene content.
The new approach uses the BRAKER pipeline, which combines the alignments with self-training and AUGUSTUS ab initio gene prediction. The final gene selections totaled 85,622 multi-exonic, full-length genes. This number is much larger than expected and was further filtered based on true domain identification as implemented in InterProScan. The current gene set stands at 29,200 and is being examined for proper start site selection and canonical splice sites. A current GFF annotation file has been generated as preliminary data for the annotation and is being refined based on specific observations of incorrect calls from BRAKER. An additional 43,200 mono-exonic genes were identified in the BRAKER analysis. Specific examination of these is necessary, as most represent pseudogenes or result from fragmentation.

Previously generated NGS datasets originating from the USDA-funded PineMAP project were collected from NC State University, Virginia Tech and Texas A&M University and stored at UConn. All raw data have been quality controlled using Sickle and re-checked for coverage and sequencing constraints. Test runs of genomic read mapping are complete for all data originating from Texas A&M University (370+ individuals). These data represent the only set from exome capture and require specific filtering to accept variant calls. The final pipeline for re-mapping all of the read sets is established, with open-source packages installed and operational on the primary server. This pipeline and the associated parameters will be made available to all project members. Students are currently being trained on specific packages included in the pipeline. Preliminary SNP calls will be delivered to Andrew Eckert's team at VCU to validate the approach and parameters. All accepted selections will be loaded with metadata into the TreeGenes database for public release.
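For readers unfamiliar with the N50 statistic used to compare the v1.01 and v2.0 assemblies in Table 2, the following minimal Python sketch shows the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name and the toy lengths are illustrative and are not drawn from the PineRefSeq data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least
    half of the summed length of all sequences.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one sequence")


# Toy example with made-up lengths (bp), not real assembly data.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

A higher N50 with fewer sequences, as reported for v2.0, indicates that more of the assembly sits in longer contiguous pieces.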