Towards Genomic Breeding in Forest Trees - NORTH CAROLINA STATE UNIV

TOWARDS GENOMIC BREEDING IN FOREST TREES

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1008075

Grant No.

2016-67013-24469

Cumulative Award Amt.

$370,000.00

Proposal No.

2015-05832

Multistate No.

(N/A)

Project Start Date

Nov 1, 2015

Project End Date

Oct 31, 2019

Grant Year

2016

Program Code

[A1141]- Plant Health and Production and Plant Products: Plant Breeding for Agricultural Production

Recipient Organization
NORTH CAROLINA STATE UNIV
(N/A)
RALEIGH,NC 27695

Performing Department
Tree Improvement Cooperative

Non Technical Summary
Publicly funded research projects have produced vast genomic resources for loblolly and sugar pine. We aim to discover about 100,000 candidate DNA markers based on re-sequencing efforts currently underway. Genomic resources will be organized into the TreeGenes database for community access. Leveraging the TreeGenes database, a PineSNPchip consortium will be established to bring the forest genetics community together to design SNP arrays for genomic breeding approaches in forest trees.

Animal Health Component

20%

Research Effort Categories

Basic

70%

Applied

20%

Developmental

10%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
202	0611	1080	100%

Knowledge Area
202 - Plant Genetic Resources;

Subject Of Investigation
0611 - Conifer forests of the South;

Field Of Science
1080 - Genetics;

Keywords

Goals / Objectives
1. Discover candidate single nucleotide polymorphic markers (SNPs) based on the loblolly pine and sugar pine re-sequencing efforts currently underway.
2. Annotate and select SNPs for a medium density (100K) Illumina genotyping array.
3. Organize genomic resources and final SNP selections into the TreeGenes database for community access.

Project Methods
SNP Discovery: Candidate single nucleotide polymorphism (SNP) markers will be discovered using data from existing large-scale USDA-funded conifer genomics projects. Datasets standardized for sequence quality will be analysed for patterns of diversity, divergence and linkage disequilibrium. The ability to select markers in regions with characterized patterns of linkage disequilibrium will allow us to calibrate expectations for success.
Annotate and select SNPs for genotyping array: The success of such an assay depends on reliable bioinformatics SNP detection procedures, including even spacing of SNPs on the genome, selection of minor allele frequencies, and accuracy of the gene space annotations. Extensive bioinformatics analysis and strict selection criteria will be used to select 90,000 SNPs for the Illumina assay. The final assay will be divided between the two conifers, 80/20 in favor of loblolly pine.
TreeGenes database: The primary information resource for the SNP data will be the TreeGenes database, which will reference all dbSNP accessions. All selected SNPs will be submitted to GenBank's dbSNP repository and associated with physical positions in both genomes. The resulting reference SNP numbers will be updated as the reference sequence evolves over time through additional sequencing and scaffolding efforts.

Progress 11/01/15 to 10/31/19

Outputs
Target Audience: We reached researchers and professionals in the pine breeding community during the project. We invited representatives from major forest companies and organizations to project meetings. We also attended the Cooperative Tree Improvement Program annual meetings in 2016, 2017, 2018 and 2019 and introduced the project to the stakeholders. The project increased interest among pine breeding organizations outside the US (Brazil, Chile, New Zealand, South Africa, Sweden, France). It also increased interest among scientists both in the USA and internationally.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Undergraduate and graduate students were trained on specific packages and on genome assembly, variant detection, and assay design workflows. This included training with basic scripting languages and R. Undergraduates trained at UConn include Madison Caballero, Jeremy Bennett, and Ava Fritz. Graduate students from NC State University and Virginia Commonwealth University, including Trevor Walker, Colin Jackson, and Eddie Lauer, received training on bioinformatics from Dr. Jill Wegrzyn in the Plant Computational Genomics lab at the University of Connecticut. Faculty from NC State University (Ross Whetten and Juan Acosta) received help and training on DNA/RNA sequence data analysis in the same lab. Four Ph.D. students at NC State University worked on the screening phase of the project and helped with DNA extraction and data analysis: Colin Jackson, Eddie Lauer, Trevor Walker and Austin Heine.
How have the results been disseminated to communities of interest? All selected SNP markers and the accession numbers of the screening project were loaded with metadata into the TreeGenes database for public release. The annual reports have been emailed to stakeholders, mostly researchers and professionals in the field. We also gave presentations at scientific conferences and at annual meetings of tree improvement programs in the South. Meetings at which results were presented include the National Association of Plant Breeding Conference (2017, 2018, 2019), Tree Biotechnology Conference (2017, 2019), Plant and Animal Genome Conference (2018), NC State University Cooperative Tree Improvement Program annual meetings (2016, 2017, 2018, 2019), NC State University Camcore Annual Meetings (2016, 2017, 2018, 2019), Texas A&M Western Gulf Tree Improvement Program (2018), and Southern Forest Tree Improvement Conference (2017, 2019).
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? The genome sequence for v2.0 of the loblolly pine genome was released in August 2016 through the PineRefSeq project. Transcriptome assembly and PacBio Iso-Seq reads were used to improve the genome assembly. This new assembly (listed as 2.01) was released to the PineRefSeq project page and served as the primary reference for annotation of the new repeats and gene models in the project. The final pipeline used for remapping all of the read sets was established with open-source packages installed and operational on the primary server. This pipeline and its associated parameters were made available to all project members.
We established the Conifer SNP Consortium in 2017 to bring the community together to negotiate with technology companies to develop genotyping arrays that will benefit both the domestic and the international tree breeding community. The Conifer SNP Consortium met a key milestone in the development of the Axiom 50K application array for loblolly pine (Pita50K). A list of 642,000 probe sequences was delivered to Thermo Fisher for the prediction of marker quality. The Thermo Fisher pipeline delivered a number of quality metrics for each probe, which were aggregated into a single index score for each marker. A screening array was fabricated with the top 423K probes. To validate these probes empirically, a diverse panel of 392 diploid loblolly pine samples, as well as 36 haploid megagametophytes, were assayed. A total of 84,852 SNPs were selected for downstream analysis. As expected, the genetic data revealed a highly skewed allele frequency distribution, with a mean below 0.1. Three main criteria were used for variant selection:
1) SNPs with minimal heterozygosity in the haploid samples,
2) SNPs with intermediate minor allele frequency (q > 0.05) and displaying three genotype clusters,
3) SNPs with genotype ratios in Hardy-Weinberg equilibrium.
This SNP selection process resulted in 46,439 markers for inclusion on the application Pita50K array. The Pita50K was manufactured in September 2019. The community has started to genotype thousands of trees to jumpstart genomic selection in forest trees.
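The three selection criteria above can be sketched as a per-SNP filter. This is only an illustration, not the consortium's actual pipeline: the function names, the use of genotype counts as input, the zero-tolerance threshold for heterozygous calls on haploid samples, and the Hardy-Weinberg p-value cutoff of 0.001 are all assumptions. Only the minor allele frequency threshold (q > 0.05) and the three-cluster requirement come from the report.

```python
import math

def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from diploid genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_filters(n_aa, n_ab, n_bb, haploid_het_calls,
                   max_haploid_het=0,   # assumed threshold
                   min_maf=0.05,        # from the report (q > 0.05)
                   min_hwe_p=1e-3):     # assumed threshold
    """Apply the three criteria to one SNP."""
    if n_aa + n_ab + n_bb == 0:
        return False
    three_clusters = min(n_aa, n_ab, n_bb) > 0
    return (haploid_het_calls <= max_haploid_het
            and minor_allele_frequency(n_aa, n_ab, n_bb) > min_maf
            and three_clusters
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)
```

For example, a SNP with genotype counts 200/160/32 across 392 diploid samples and no heterozygous calls on megagametophytes would pass under these assumed thresholds, while a SNP with a strong excess of heterozygotes would fail the Hardy-Weinberg test.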

Publications

Type: Conference Papers and Presentations Status: Published Year Published: 2018 Citation: Isik et al. 2019. Towards Genomic Selection in Forest Trees. National Association of Plant Breeding Meeting, August 25-29, 2019, Callaway Gardens Pine Mountain, GA.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2017 Citation: Fikret Isik, Juan Acosta, Trevor Walker, Richard Sniezko, Andrew Eckert, Jill Wegrzyn (2017). International pine SNP consortium: designing genotyping resources for the community. IUFRO Tree Biotechnology Conference. June 4-9, 2017. Concepción, Chile.
Type: Conference Papers and Presentations Status: Published Year Published: 2017 Citation: Fikret Isik, Juan Acosta, Trevor Walker, Richard Sniezko, Andrew Eckert, Jill Wegrzyn (2017). Progress on Pine SNP discovery and SNP array design. 34th Southern Tree Improvement Conference, June 19-22, 2017. Melbourne, FL.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2019 Citation: Isik et al. 2019. Developing SNP Arrays for Genomic Applications in Forest Trees. IUFRO Forest Tree Biotechnology Conference, 23-28 June 2019, Raleigh, NC.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2016 Citation: Daniel Gonzalez, Juan Acosta, Andrew Eckert, Richard Sniezko, Jill Wegrzyn, Fikret Isik (2016). Towards Genomic Selection in Forest Trees. The 5th International Conference on Quantitative Genetics. Madison, WI.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2019 Citation: Acosta et al. 2019. SNP Discovery and Probe Design Using RNA-seq and Target Capture in Tropical and Subtropical Pines. IUFRO Forest Tree Biotechnology Conference, 23-28 June 2019, Raleigh, NC.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2018 Citation: Caballero et al. 2018. Mine the Pine: Genotyping Assay Design for Loblolly Pine Populations. University of Georgia Plant Breeding Symposium. Advancing Plant Sciences: Plants in an Evolving World. May 9, 2018, Athens, GA.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2018 Citation: Fikret Isik, Juan Acosta, Andrew Eckert, Richard Sniezko, Jill Wegrzyn. Pine SNP chip Consortium: Progress on pine SNP Discovery and Array Design in Loblolly Pine. The 26th Plant and Animal Genome Conference, San Diego, CA. January 13-16, 2018.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2017 Citation: Fikret Isik, Jill Wegrzyn, Juan Acosta, Trevor Walker, Andrew Eckert, Richard Sniezko (2017). Towards Genomic Selection in Forest Trees. National Association of Plant Breeding Conference. UC Davis, CA. August 7-10, 2017.
Type: Conference Papers and Presentations Status: Accepted Year Published: 2017 Citation: Jill Wegrzyn, Uzay Sezen, Nic Herndon, Emily Grau, Sean Beuhler, Sumaira Zaman, Alex Hart, Alex Trouern-Trend, and Madison Caballero (2017). Computational Approaches to Decode Megagenomes and Develop Database Resources for the Forest Tree Community. 34th Southern Tree Improvement Conference, June 19-22, 2017. Melbourne, FL.

Progress 11/01/16 to 10/31/17

Outputs
Target Audience: We have finalized the bioinformatics work. Currently, quality control of SNP markers is being carried out. During the last year, we held regular conference calls to discuss project progress. We also talked to stakeholders (breeders and geneticists) about establishing a pine consortium to bring resources together for large-scale genotyping. Based on a survey we conducted, many companies and organizations showed great interest in genotyping their material once the SNP arrays are designed.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Undergraduate students worked with the Wegrzyn, Eckert and Isik labs to gain experience. The Isik lab recruited a new graduate student (Colin Jackson) to work on the bioinformatics pipeline for additional sequence data.
How have the results been disseminated to communities of interest? The report has been emailed to stakeholders, mostly researchers and professionals in the field. Survey results have been shared with stakeholders.
What do you plan to do during the next reporting period to accomplish the goals? Finish quality control of SNP loci using multiple methods and design SNP arrays. The next step is to load the SNP markers into the TreeGenes database for public access.

Impacts
What was accomplished under these goals? Previously generated NGS datasets originating from the USDA-funded PineMAP project were collected from NC State University, Virginia Tech and Texas A&M University and stored at UConn. In addition, early data generated at the University of Florida (exome capture) was included. All of these data sets were aligned to v0.8 or v1.01 of the loblolly pine genome, with an earlier version of the annotation. The most recently released version, v2.01, has been annotated and serves as the new target for variant detection and genotyping array design in this project. All raw data has been quality controlled using Sickle and re-checked for coverage and sequencing constraints. This data was processed with the basic pipeline outlined in Figure 1.
Figure 1: Basic workflow for the annotation of variants across datasets
Exome capture data was made available from the University of Florida (24 loblolly pine and 24 slash pine) and Texas A&M (375 loblolly pine individuals). The exome capture datasets were used to test and implement the final pipeline. Two SNP calling methods were tested for the pipeline: VarScan2 and FreeBayes. Test runs determined that VarScan2 generated consistently reliable calls. Further testing was applied to assess appropriate filtering parameters for the data (Figure 2). Following read alignment and strict and general filtering of SNP calls, the SNPs were further filtered based on very basic population metrics after the SNP call files (VCF files) were merged. The refined workflow for exome capture data is detailed in Figure 3.
Figure 2: Evaluating VarScan2 parameters on exome capture data
Figure 3: Detailed SNP detection pipeline applied to exome capture data
Following the selection of optimal SNP calls, the individual SNPs were evaluated with regard to their flanking regions. Here, we examined 50 bp on either side of the SNP and identified those SNPs with no (or very few) flanking SNPs as priority selections.
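The flanking-region check described above can be illustrated with a short sketch. It assumes SNP positions on a single contig are available as integers; the function names and the `max_flanking` parameter are hypothetical and stand in for the report's "no (or very few)" criterion, whose exact cutoff is not stated.

```python
from bisect import bisect_left, bisect_right

def count_flanking(sorted_positions, pos, flank=50):
    """Count other SNPs within `flank` bp on either side of `pos`."""
    lo = bisect_left(sorted_positions, pos - flank)
    hi = bisect_right(sorted_positions, pos + flank)
    return hi - lo - 1  # subtract the SNP itself

def priority_snps(positions, flank=50, max_flanking=0):
    """Return SNP positions with at most `max_flanking` neighbours in the window.

    `max_flanking` is an assumed cutoff; the report does not give one.
    """
    ordered = sorted(set(positions))
    return [p for p in ordered
            if count_flanking(ordered, p, flank) <= max_flanking]
```

With positions `[100, 120, 400, 451, 900]` and the default 50 bp window, the SNPs at 100 and 120 flank each other and are dropped, while 400, 451 and 900 are kept. Raising `max_flanking` to 1 would keep all five.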
These selections were further processed using Illumina's ADT scoring software. This software provides a relative score, and the highest-scoring SNPs were selected for the next stage. This resolved 169K potential targets for array development. The remaining selection of targets is based upon the loblolly pine genome annotation. The new annotation was curated with SnpEff in order to identify synonymous/non-synonymous sites as well as SNPs upstream of high quality gene annotations. The new annotation released with v2.01 was functionally re-annotated in order to provide comprehensive information on the gene space. From the ADT score filtered SNPs, a total of 52K were within or directly upstream of high quality gene annotations. This total also includes RAD-Seq data generated at NCSU for an additional 1432 individuals. Work is underway with the Eckert lab to further refine the SNP selections based upon nucleotide diversity and linkage disequilibrium (LD). Work to date has quantified patterns of diversity for the majority of assembled contigs with at least one SNP in the most recent release of the loblolly pine genome sequence, and has begun to quantify patterns of LD for contigs with at least two SNPs. Patterns of LD were quantified pairwise and the average pairwise value per contig was reported. Patterns to date are consistent with previous per site estimates of nucleotide diversity for loblolly pine (π = 0.0055 versus π = 0.0033 from Eckert et al. 2013), although the variance across contigs is much larger (2.5 times larger than in Eckert et al. 2013). Similar patterns are emerging with LD, although these analyses are not yet complete. Initial analysis of LD decay plots (i.e. pairwise physical distance in bp versus magnitude of LD) shows rapid to moderate rates of decay, with a much larger variance than reported previously. When coupled with emerging patterns of LD, these summaries allow for a set of nuanced selection criteria.
For example, several contigs had high levels of diversity, as well as LD. Depending on the needs of the consortium, we could either take a tagSNP approach given the high levels of LD, so that this region need only be represented by one to a few SNPs, or an enrichment type approach because high LD and high nucleotide diversity are signatures of non-neutral evolution. Previously annotated SNPs identified through NSF and USDA funded resequencing projects will also be included on the final array. The final size and selections will be derived from the TreeGenes database, which stores the full profile of quality, coverage, population, and annotation information for each locus.
Literature Cited
Eckert, A. J., J. L. Wegrzyn, J. D. Liechty, J. M. Lee, W. P. Cumbie, J. M. Davis, B. Goldfarb, C. A. Loopstra, S. R. Palle, T. Quesada, C. H. Langley, and D. B. Neale. 2013. The evolutionary genetics of the genes underlying phenotypic associations for loblolly pine (Pinus taeda, Pinaceae). Genetics 195: 1353-1372.

Publications

Progress 11/01/15 to 10/31/16

Outputs
Target Audience: We reached researchers and professionals in the pine breeding community during the first year of the project. We invited representatives from major forest companies and organizations to start the project (kick-off meeting at NC State University, Raleigh, on January 21st, 2016). We also attended the Cooperative Tree Improvement Program annual meeting on May 11, 2016 and introduced the project to the stakeholders. The project increased interest among pine breeding organizations and companies outside the US (Brazil, Chile, New Zealand, South Africa).
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided? Graduate student Trevor Walker and research assistant professor Juan Acosta of NC State University received training on bioinformatics at Dr. Jill Wegrzyn's lab at the University of Connecticut in July 2016. Graduate student Brian Smith of the NCSU Bioinformatics program has been trained on bioinformatics tools to carry out sequence alignment. He was hired as a temporary student worker to help with the project.
Table 1: Total read data included in assembly for v2.0
Data type	Number	Total Length (bp)	Mean read length	Coverage	Clone Coverage
PacBio reads	27,667,399	267,426,106,405	9,665	12X	n/a
Illumina reads	10,563,266,162	1,499,483,795,334	142	68X	96X
5-10 kb paired reads	3,152,047,806	475,959,218,706	151	22X	69X
Super-reads	96,369,476	44,307,329,021	460	2X	2X
Mega-reads	27,986,125	103,129,750,091	3,685	4.7X	4.7X
Table 2: Comparative statistics on assembly improvement from v1.01 to v2.0
Assembly	Ptaeda 1.01	Ptaeda 2.0
Total size	20,148,103,497 bp	20,613,845,687 bp
Total scaffold span	22,564,679,219 bp	22,104,209,064 bp
N50 contig size	8,206 bp	25,361 bp
Number of contigs	16,461,900	2,855,700
Number of contigs > 500bp	2,527,203	2,445,689
N50 scaffold size	66,920 bp	107,036 bp
Number of scaffolds > 200bp	7,068,375	1,762,655
Number of scaffolds > 500bp	2,158,326	1,496,869
How have the results been disseminated to communities of interest? The report has been emailed to stakeholders, mostly researchers and professionals in the field.
What do you plan to do during the next reporting period to accomplish the goals? We are in the process of SNP calling and annotation. This work will continue for several months.

Impacts
What was accomplished under these goals? The genome sequence for v2.0 of the loblolly pine genome was released in August 2016 through the PineRefSeq project. The v2.0 assembly incorporates 11x coverage of PacBio genomic reads as well as fosmid sequence resources (Table 1). The final assembly reported a substantial increase in N50 contig/scaffold length as well as a reduction in the total number of scaffolds (> 200bp in length) from 7.1 million in v1.01 to 1.7 million in v2.0 (Table 2). A publication note has been submitted announcing the latest release of the genome. Following project release, the University of Connecticut team conducted transcriptome assembly with two sources (Illumina short reads generated by Indiana University and TAMU) as well as PacBio Iso-Seq reads, generated by North Carolina State University (Ross Whetten), from the same individual that was genome sequenced (20-1010). For scaffolding purposes, the read sets were each assembled independently (de novo) and aligned to the genome versions for comparison (the published v1.01 versus v2.0). The full reference PacBio set contained 70,064 transcripts, of which 19,418 were used for scaffolding purposes with custom software. This resulted in a total of 11,951 linked scaffolds, creating 4,545 super scaffolds. This assembly is listed as 2.01 and has been released to the PineRefSeq project page. This release serves as the primary reference for annotation of the new repeats and gene models, and is the source of the current annotation.
Prior to evaluating the gene space, full characterization of the repeat content is necessary and was conducted with the REPET pipeline. Gene annotation involved mapping the short read (RNA-Seq) data to the genome in addition to aligning full length PacBio transcripts. Previous methods relying on the MAKER-P pipeline proved unsuccessful when applied to the fragmented conifer assemblies with high pseudogene content. The new approach uses the BRAKER pipeline, which combines the alignments with self-training with AUGUSTUS for ab initio gene prediction. The final gene selections totaled 85,622 multi-exonic and full-length genes. This number is much larger than expected and was further filtered based on true domain identification as implemented in InterProScan. The current gene set stands at 29,200 and is being examined for proper start site selection and canonical splice sites. A current GFF annotation file has been generated as preliminary data for the annotation but is being refined based on specific observations of incorrect calls from BRAKER. An additional 43,200 mono-exonic genes were identified in the BRAKER analysis. Specific examination of these is necessary as most represent pseudogenes or result from fragmentation.
Previously generated NGS datasets originating from the USDA-funded PineMAP project were collected from NC State University, Virginia Tech and Texas A&M University and stored at UConn. All raw data has been quality controlled using Sickle and re-checked for coverage and sequencing constraints. Test runs of genomic read mapping are complete for all data originating from Texas A&M University (370+ individuals). This data represents the only set from exome capture and requires specific filtering to accept variant calls. The final pipeline for re-mapping all of the read sets is established, with open-source packages installed and operational on the primary server. This pipeline and the associated parameters will be made available to all project members. Students are currently being trained on the specific packages included here. Preliminary SNP calls will be delivered to Andrew Eckert's team at VCU to validate the approach and parameters. All accepted selections will be loaded with metadata into the TreeGenes database for public release.
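The N50 statistic used above to compare assemblies can be computed as follows. This is a minimal sketch using the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the report does not say which tool produced its N50 values, and the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")
```

For the toy set of lengths 2 through 10 bp (total 54 bp), the three longest sequences (10, 9, 8) sum to 27 bp, exactly half the total, so the N50 is 8 bp. A larger N50 for the same total size, as seen between v1.01 and v2.0, indicates a more contiguous assembly.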

Publications