Establishing Translational Genomics for Oklahoma Wheat Improvement

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

1007955

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Sep 19, 2015

Project End Date

Sep 14, 2020

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
OKLAHOMA STATE UNIVERSITY
(N/A)
STILLWATER,OK 74078

Performing Department
Biochemistry & Molecular Biology

Non Technical Summary
Advancements in both sequencing technologies and quantitative/computational methods have created unprecedented opportunities for increasing breeding efficiency and genetic gain. This proposal seeks to establish intellectual know-how for a genomics-driven breeding program for the state of Oklahoma. The main objective is to evaluate the applicability of genomic prediction for crop improvement, using Oklahoma's dual-purpose, high-yielding 'Duster x Billings' breeding population as a test case. By integrating next-generation sequencing technologies and advanced computational algorithms with existing gene discovery efforts and field testing plots, the research objective can be accomplished through three specific aims:
(1) Apply and implement a proven high-throughput genotyping platform for the 'Duster x Billings' population. Illumina-based GBS short-read sequencing will be conducted, with which the project will evaluate important factors that aid the generation of quality SNP genotypic values and link the SNP profile of the 'Duster x Billings' population with the public wheat polymorphism dataset.
(2) Evaluate the performance of genomic prediction algorithms for grain yield components; factors to be studied include sequence read depth, marker density, inclusion of QTLs and composition of training populations. This project will compare advanced predictive algorithms, such as kernel-based semi-parametric procedures and non-parametric artificial neuronal networks, with basic additive linear models, to determine the best model for grain yield components based on prediction accuracy and biological relevance.
(3) Assess the efficiency of adopting genomic prediction by two-generation validation. To measure the realized genetic gain, generation advancement based on genomic prediction and on phenotypic selection will be compared at the same generation.
Unlike other cross-validation methods, the two-generation validation will provide realistic measures of confidence in adopting genomic prediction for crop improvement.
The success of this proposal will expand our understanding of the genomic elements that underlie high yield in crop improvement. Performance assessment of computational algorithms will strengthen confidence in modern agricultural biotechnology prior to its application. Results of this research could also aid the calculation of perceived risks, thus encouraging implementation of regulatory and policy actions to ensure ecological benefit and economic return. As the overarching goal of this proposal, we anticipate establishing a genomics-backed, science-based breeding program that translates genomics knowledge into genetic gain in the field, through close collaboration between research-oriented academia, the Wheat Improvement Team at OSU, and state-run wheat breeding programs.

Animal Health Component

50%

Research Effort Categories

Basic

50%

Applied

50%

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	1544	1080	100%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
1544 - Hard white wheat;

Field Of Science
1080 - Genetics;

Keywords

Goals / Objectives
Traditional means of crop improvement are decade-long, expensive and resource-dependent endeavors. Because climatic uncertainty is developing more rapidly than new varieties can be developed, the need for technological advancement in crop improvement is urgent in every field, including the six million acres of winter wheat in the state of Oklahoma.
This research, tailored to address the challenges posed by anticipated changes in climate and a growing human population, proposes to leverage breeders' knowledge and technologies such as next-generation sequencing and predictive analyses to improve the efficiency and accuracy of selection response for grain yield components. Targeting advancement of the 'Duster x Billings' population, the research goals are to: (1) elucidate the efficiency of genomics-enabled technologies, (2) evaluate predictive model performance, and (3) facilitate the implementation of a knowledge-driven breeding program for the most important agricultural system in the state of Oklahoma.
Winter wheat (Triticum aestivum L.) production in the southern Great Plains plays a vital role in both grain production and the cattle industry in the US. Using a two-pronged system coined GrazenGrain, this unique dual-purpose management aims to provide a winter forage source and grain production from the same crop (Thapa, et al. 2010). Developed cooperatively by the Oklahoma Agricultural Experiment Station (OAES) and the USDA-ARS, the hard red winter wheat cultivar 'Duster' (Reg. No. CV-1065, PI 644016) has been the number one variety in Oklahoma since its release in 2012 (USDA 2015b), demonstrating the superiority of its desirable qualities for this dual-purpose system, including (1) rapid stand establishment for fall forage biomass accumulation, (2) non-precocious winter dormancy release, and more importantly (3) excellent recovery from grazing, measured as grain yield across multiple dual-purpose environments (Edwards, et al. 2012).
In the past few years, drought conditions in the southern Great Plains have been worsening (details at http://droughtmonitor.unl.edu and www.ncdc.noaa.gov/sotc/drought/); selection for early maturity has become the leading criterion for choosing cultivars, to ensure the crop an optimal climate during grain filling (Edwards, 2009). However, selecting for earliness can also be tantamount to selection for yield potential (Hunger, et al. 2012). To meet this breeding challenge, the Oklahoma variety 'Billings', which combines early maturity and disease resistance with superior yielding ability, was selected (Hunger, et al. 2012). Interbreeding Duster and Billings would introduce early maturity into Duster's versatility, ultimately pyramiding desirable traits and yield potential suited to dual-purpose wheat producers.
In summary, this proposal seeks to establish a real-world example that translates genomics knowledge into genetic gain, and ultimately to help address global agricultural challenges.

Project Methods
'Duster x Billings' DH population and field evaluation
Duster and Billings are two hard red winter wheat cultivars released in the southern Great Plains. To bring together high yield potential with the versatility required by Oklahoma's dual-purpose breeding programs, a doubled haploid (DH) population of 'Duster x Billings' has been created. The details of both management systems, which this study will follow, are described fully in Thapa et al. (2010). In brief, field trials will be conducted in a randomized complete block (RCB) design with three replicates per site. The DH population will be arbitrarily divided into 6 sets of 42 DH lines. The 6 sets, each an RCB, will be arranged in the field as replicates-in-sets; the two parents, Duster and Billings, are included in each set as checks. The advantage of this small-plot experimental protocol is that it allows testing of such a large population of lines with a reasonable replicate (or block) size. Field testing will take place at the Agronomy Research Station in Stillwater, Oklahoma.
Genotyping by sequencing (GBS), SNP determination and haplotype building
Owing to the wide versatility of GBS, this project will investigate the applicability as well as the effectiveness of GBS for accessing genomic polymorphism. Quick access to genomic information for this population is also available, because DNA libraries of the current 282 DH lines can be requested from previous studies. These libraries were, however, prepared at 96-plex (see details in Li et al. 2015); given the genome complexity, the resulting average sequence coverage was too shallow to identify heterozygotes with confidence (Spindel et al. 2013). 
This proposal will take advantage of existing material and create multiple short-read profiles at different sequence read depths by re-sequencing, to investigate prediction performance at different read depths and coverage. Genomic DNA of the selected offspring (2nd Generation) from the 1st Generation will be extracted, quantified and purified using the DNeasy Plant Maxi Kit® (Qiagen; Poland et al. 2012). Genomic DNA will be co-digested with two enzymes, PstI (CTGCAG) and MspI (CCGG), and barcode adapters will be ligated to individual samples. The detailed protocol that this proposal will follow in principle can be found in Poland et al. (2012).
SNP determination at different plexing levels will be conducted for both individual Illumina sequencing passes and combined libraries. The SNP calling procedure will largely follow Poland et al. (2012) using TASSEL (Glaubitz et al. 2014). This proposal will generate a SNP Discovery Build for these 282 DH lines by including the diverse collection of raw data from the T-CAP project (www.triticeaecap.org). The Oklahoma SNP Discovery Build, in which SNPs of the 'Duster x Billings' population are anchored to the diverse polymorphism of the wheat genome, will establish a foundation for subsequent Oklahoma SNP Production Builds. The SNP profile of the 2nd Generation will be determined from the Oklahoma SNP Discovery Build using the TASSEL SNP Production Pipeline (Glaubitz et al. 2014).
To anchor SNP polymorphism to gene regions, this research will develop a targeted capture re-sequencing protocol based upon the polymorphism carried in the NimbleGen array (Winfield et al. 2012). Verified with eight UK wheat varieties, the NimbleGen array covers most of the unique gene set of the wheat genome (Brenchley, et al. 2012), and classifies polymorphisms as homoeologous or varietal SNPs, for variation existing among the three homoeologous sub-genomes (A, B and D genomes) and between varieties, respectively. 
SNPs from targeted re-sequencing will be combined with SNPs from NGS-based GBS. SNP filtering will be conducted according to allele frequencies and missing data ratio, to eliminate non-informative SNP markers. Missing SNP values in the filtered SNP table will be imputed using the EM method (Endelman 2011), k-nearest neighbor (Gamal El-Dien, et al. 2015) and other haplotype sorting algorithms (Huang et al. 2009). Genomic similarity matrices will be estimated based on algorithms described in VanRaden (2008). To detect haplotype blocks, two approaches will be tested: a sliding-window approach that identifies recombination break points using Duster-versus-Billings allele ratios (Huang et al. 2009), and an LD-based haplotype construction. To relax the dependency on allele frequency, Lewontin's D' will be used as the LD measure (Lewontin, 1964). Mapping of QTLs and associated SNP variants will be conducted using R/qtl (Arends, et al. 2010) and GAPIT (Lipka, et al. 2012).
Genomics-enabled predictions for grain yield
The standard linear model, in which phenotypic response is explained by genetic factors and residuals (errors), will be fitted as the baseline for comparison. The large variation resulting from hundreds of thousands of genetic markers can be controlled by various shrinkage methods, which can be implemented in either frequentist or Bayesian frameworks such as that of Meuwissen et al. (2001). Since large yield QTLs can be identified in the 'Duster x Billings' population, such functional information will also be included in prediction models as fixed covariates (Bernardo 2013). The computational package "rrBLUP" will be used for fitting this mixed linear model (Endelman, 2011).
In addition to linear predictive models, both semi-parametric and non-parametric approaches will also be evaluated. 
For example, semi-parametric methods like reproducing kernel Hilbert spaces (RKHS) regression, which treat genetic marker effects as the conditional expectation of phenotypic values, capture functional information of SNP markers during the model fitting process (Gianola et al. 2006; de los Campos et al. 2010). Application of RKHS should be valuable for this population. Inspired by the human central nervous system, artificial neuronal networks (ANNs) offer an alternative as universal approximators of complex functions (Bishop 2006). This proposal will investigate a number of artificial neuronal networks within the family of non-parametric approaches. Their strength in capturing non-linear relationships between predictors and responses makes ANNs candidate algorithms for this proposed work (Gianola et al. 2011).
Assessment of prediction performance
Since the introduction of genomics-enabled prediction, cross-validation (CV) has become a popular technique for measuring prediction performance. There are a number of ways to assess predictive performance, such as k-fold validation (Makowsky et al., 2011), repeated random sub-sampling (Gianola et al. 2014), leave-one-out CV (Stone 1977) and Akaike's information criterion (AIC) (Akaike, 1974). Using data from the 1st Generation, this proposal seeks to optimize models based on the methods mentioned above. To complete the performance assessment, this project will take advantage of the high-performance computing clusters at Oklahoma State University (Cowboy supercomputer) to meet the need for intense computation.
By design, two-generation validation provides a more realistic measurement for performance assessment, once phenotype data of the 2nd Generation are available. Typically, a two-generation design constructs the training set from data in the parental generation (i.e. the 1st Generation in this case), and prediction is carried out in a testing set that includes offspring of individuals in the training set. 
After phenotyping in the 2nd year, predictive performance for grain yield will be directly evaluated using both Pearson's correlation and the mean squared error between observed and predicted values. The advantages of two-generation validation are twofold: (1) it provides one single, realized prediction accuracy without a significant computational burden, and (2) it simulates a standard, realistic genetic evaluation scenario specifically for selection responses of the 'Duster x Billings' population.
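The two-generation validation described above can be illustrated with a small, self-contained sketch. It is not the project's pipeline (the proposal names rrBLUP and TASSEL for that work); it is a NumPy-only approximation under stated assumptions: markers are simulated, the relationship matrix follows the first method of VanRaden (2008), the intercept is taken as the training mean rather than a GLS estimate, and the shrinkage value `lam`, the population sizes and the function names are all hypothetical.

```python
import numpy as np

def vanraden_g(markers):
    """Genomic relationship matrix, VanRaden (2008) method 1.

    markers: (n_lines, n_snps) array coded as 0/1/2 copies of one allele.
    """
    p = markers.mean(axis=0) / 2.0            # allele frequency per SNP
    z = markers - 2.0 * p                     # centre each SNP column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(m_train, y_train, m_test, lam=1.0):
    """Ridge-type GBLUP prediction of test-set phenotypes.

    lam is a hypothetical shrinkage value (residual / genetic variance ratio).
    """
    n = m_train.shape[0]
    g = vanraden_g(np.vstack([m_train, m_test]))
    mu = y_train.mean()                       # simple intercept estimate
    alpha = np.linalg.solve(g[:n, :n] + lam * np.eye(n), y_train - mu)
    return mu + g[n:, :n] @ alpha

def accuracy(observed, predicted):
    """Pearson correlation and mean squared error, as named in the methods."""
    r = np.corrcoef(observed, predicted)[0, 1]
    mse = np.mean((observed - predicted) ** 2)
    return float(r), float(mse)

# Simulated data only: 200 first-generation DH lines (homozygous, coded 0 or 2)
# and 100 second-generation lines built as mosaics of two random parents.
rng = np.random.default_rng(0)
n_snps = 500
gen1 = 2 * rng.integers(0, 2, size=(200, n_snps))
parents = rng.integers(0, 200, size=(100, 2))
pick = rng.integers(0, 2, size=(100, n_snps)).astype(bool)
gen2 = np.where(pick, gen1[parents[:, 0]], gen1[parents[:, 1]])

effects = rng.normal(0.0, 1.0, n_snps)        # hypothetical marker effects
y1 = gen1 @ effects + rng.normal(0.0, 10.0, 200)
y2 = gen2 @ effects + rng.normal(0.0, 10.0, 100)

# Train on generation 1, predict generation 2, score against observed values.
pred2 = gblup_predict(gen1, y1, gen2)
r, mse = accuracy(y2, pred2)
print(f"two-generation validation: r = {r:.2f}, MSE = {mse:.1f}")
```

Training on one generation and scoring on the next yields a single realized accuracy, in contrast to repeated resampling within one generation.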

Progress 09/19/15 to 09/14/20

Outputs
Target Audience: Individuals who study plant genomics, genomic technologies, association mapping techniques, and quantitative genetics aspects related to important attributes of yield production, growth, end-use quality, and abiotic and biotic resistance. Scientists working on developing computational algorithms for predictive analysis, optimization, and association genetics. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and scientists and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes. Members and teams in grower associations and forest management sectors.
Changes/Problems: Our local computing resource continues to be the limiting factor for large-scale genomic analysis. As artificial intelligence (AI) is rapidly changing the data-intensive research field, the demand for GPUs (graphics processing units) has also been growing rapidly. Currently, the HPCC at OSU has only two GPU nodes, each at nearly full occupancy. The lack of computing resources and of a management strategy at the HPCC continues to be a struggle. Further, the availability of research capacity in the region has been greatly reduced owing to the expanded demand for data scientists in the job market. To make Oklahoma an attractive place for STEM research and trainees, OSU should consider a significant update to its recruitment strategy and salary structure.
What opportunities for training and professional development has the project provided? Students were exposed to advanced sequencing technologies and associated analytical computation techniques. All these data analyses were conducted using Oklahoma State University's High-Performance Computing Clusters.
How have the results been disseminated to communities of interest? 
Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals? The main objective of crop improvement is to obtain genetic gain in a number of traits of interest while maintaining genetic diversity in the variety development program. This breeding decision involves multiple, sometimes competing breeding objectives, and is usually achieved by applying a selection index. Furthermore, when negative genetic correlations are present among breeding targets, extensive trade-offs would impact the long-term breeding goals. To this end, we will formulate crop improvement as a multi-objective optimization problem, first proposing a mathematical solution to the optimization and then implementing an optimization algorithm that maximizes the breeding values of multiple phenotypes while minimizing inbreeding under selection. We anticipate that the multi-objective optimization framework will provide substantial support for breeding decisions, as well as for the design of crossing blocks.

Impacts
What was accomplished under these goals?
1. Drought-induced methylation and transcription changes in winter wheat genome
Our study showed that, under drought, 719 more differentially expressed genes were detected in DH169 than in the Duster genotype. The majority of the differentially expressed genes were associated with response to oxidative stress and bract morphology. Overall, Duster exhibited more significant methylation changes than DH169 and a greater extent of methylation under drought than under control conditions, whereas methylation in DH169 was higher under the well-watered condition. Finally, gene-body hypermethylation was associated with down-regulation in DH169; however, the association between up-regulation of gene expression and gene-body hypomethylation was seen only in Duster. These findings suggest that under a water deficit, the drought-tolerant DH169 genotype undergoes significant transcriptional changes but fewer epigenetic changes. As a drought 'avoidant' winter wheat, Duster, on the other hand, demonstrated much more extensive genome-wide epigenetic modification compared to the variation identified at the genetic level. To summarize, this study reveals various genetic adaptation mechanisms employed by two closely related winter wheat genotypes. The whole-genome recombination events during the hybridization process might have disrupted the epigenomic regulatory machinery, requiring DH169 to respond to the imposed water deficit with more costly transcriptional variation.
2. H3K4me3 and H3K27me3 histone modification in winter wheat
In this study, we used whole-genome ChIP-seq to study the genome-wide patterns of an active histone mark, histone H3 lysine 4 trimethylation (H3K4me3), and a repressive histone mark, histone H3 lysine 27 trimethylation (H3K27me3), in winter wheat under drought stress. 
We found that, although the chromosomal and genomic distributions were similar in the well-watered (WW) and drought (DT) treatments, the number of genes modified by the H3K4me3 mark increased and the number modified by the H3K27me3 mark decreased under drought. In addition, a substantial portion of genes was newly modified after drought treatment, especially for H3K4me3. About 43% of DT H3K4me3-marked genes were unique to drought conditions, and over half of these drought-specific genes were significantly enriched with H3K4me3 in DT. Surprisingly, we identified 3,819 bivalent genes in DT, and the bivalency of over 70% of these genes was established upon water deficit. Interestingly, these newly formed bivalent genes in DT were established by depleting the repressive marks and obtaining the active marks, whereas the levels of bivalency did not change in the bivalent genes common to WW and DT. These results suggest that drought stress induced H3K4me3 modifications, reduced H3K27me3 modifications, and further enhanced bivalency in winter wheat.
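The bookkeeping behind terms such as "bivalent", "drought-specific" and "newly established" can be made concrete with set operations. The sketch below assumes that a gene counts as bivalent when it carries both H3K4me3 and H3K27me3 in the same condition; the report does not state how marked genes were called from the ChIP-seq data, and the gene identifiers and set contents here are invented for illustration.

```python
def bivalent(h3k4me3_genes, h3k27me3_genes):
    """Genes carrying both the active and the repressive mark."""
    return set(h3k4me3_genes) & set(h3k27me3_genes)

# Hypothetical marked-gene sets for well-watered (WW) and drought (DT) samples.
ww_k4 = {"g1", "g2", "g3", "g4"}
ww_k27 = {"g2", "g5", "g6"}
dt_k4 = {"g1", "g2", "g3", "g5", "g7"}
dt_k27 = {"g2", "g5", "g7"}

ww_bivalent = bivalent(ww_k4, ww_k27)
dt_bivalent = bivalent(dt_k4, dt_k27)

# Bivalent genes present only under drought, and those shared by both conditions.
new_bivalent = dt_bivalent - ww_bivalent
shared_bivalent = dt_bivalent & ww_bivalent

# H3K4me3-marked genes unique to drought, as a share of all DT H3K4me3 genes.
drought_specific_k4 = dt_k4 - ww_k4
frac_k4_specific = len(drought_specific_k4) / len(dt_k4)
frac_new_bivalent = len(new_bivalent) / len(dt_bivalent)

print(sorted(new_bivalent), round(frac_new_bivalent, 2), round(frac_k4_specific, 2))
```

Applied to real gene lists, the same two fractions correspond to the "over 70%" and "about 43%" figures quoted above.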

Publications

Type: Journal Articles Status: Published Year Published: 2020 Citation: Thistlethwaite, F.R., O. Gamal El-Dien, B. Ratcliffe, J. Klápště, I. Porth, C. Chen, M.U. Stoehr, P. Ingvarsson and Y.A. El-Kassaby. 2020. Linkage disequilibrium vs. pedigree: genomic selection prediction accuracy in two conifer species. PLoS One 15: e0232201.
Type: Journal Articles Status: Published Year Published: 2020 Citation: Kehel, Z., M. Sanchez-Garcia, A. El-Baouchi, H. Aberkane, A. Tsivelikas, C. Chen and A. Amri. 2020. Predictive characterization for seed morphometric traits for genebank accessions using genomic selection. Frontiers in Ecology and Evolution 8:32.
Type: Theses/Dissertations Status: Other Year Published: 2020 Citation: Lim, Alexander. 2020. Drought-induced epigenetic modulation and transcriptional variation of winter wheat. Oklahoma State University.
Type: Theses/Dissertations Status: Other Year Published: 2020 Citation: Liao, Chi-Ping. 2020. Genome-wide analysis of drought stress induced histone 3 lysine 4 and histone 3 lysine 27 trimethylation modifications in winter wheat. Oklahoma State University.

Progress 10/01/18 to 09/30/19

Outputs
Target Audience: Individuals who study wheat genomics, genomic sequencing, association mapping techniques and quantitative genetics aspects related to grain yield, drought tolerance, and end-use quality traits. Scientists working on developing parametric and non-parametric algorithms for prediction purposes. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes.
Changes/Problems: Our local computing resource continues to be the limiting factor for large-scale genomic analysis. Further, the availability of research capacity in the region has been greatly reduced owing to the expanded demand for data scientists in the job market. To make Oklahoma an attractive place for STEM research and trainees, OSU should consider a significant update to its recruitment strategy and salary structure.
What opportunities for training and professional development has the project provided? Training and professional development were provided to graduate students, who had the opportunity to experience in-depth training in crop research.
How have the results been disseminated to communities of interest? Results have been disseminated via scholarly publications and presentations at local and national meetings.
What do you plan to do during the next reporting period to accomplish the goals? Duster genome assembly: This is a continuing effort. De novo assembly of Duster ultra-long sequencing reads resulted in ~350,000 contigs with an average genome coverage of 15X. We will seek resources to improve the contiguity of the Duster genome assembly. Further, to understand the functional variants, a number of computational algorithms will be used to identify structural variants uniquely present in the Duster genome.

Impacts
What was accomplished under these goals?
1. Genomic regions identified responsible for drought stress
Serving as a nutritional staple worldwide, hexaploid wheat production is challenged by unpredictable environmental variability in meeting global food demands. Achieving genetic gains will require continued knowledge acquisition to uncover the genetic basis of important agronomic trait performance, as affected by increasing climate uncertainty, in regional breeding populations. A genome-wide SNP profile derived from genotyping-by-sequencing and exome capture was built to study genomic variants segregating in the Duster x Billings doubled-haploid (DH) population, with assessment of yield and end-use quality phenotypes over multiple field years exhibiting variable environments. Transcriptomics for two selected DH individuals was overlaid to determine gene expression within genomic regions modulated by environmental stress. Co-expression modules were used to determine functional contributions to overall agronomic trait heritability. Marker-trait associations demonstrate altered genetic control under drought stress, most notably in yield. Drought-responsive mechanisms were identified in genomic regions modulated by drought stress. Four co-expression modules contribute 10-26% of overall narrow-sense heritability across multiple traits under drought stress; three modules are highly represented by chromosome 1B. Responses to drought stress are localized, suggesting mechanisms of narrowed genetic regulation under stress. Chromosome 1B largely encompasses the drought response, carries associations with multiple agronomic traits, and delivers strong contributions to heritable variation.
2. Drought-induced global changes in genome accessibility
Chromatin structure has a known relationship with gene regulation, and therefore an association between genes and MNase hypersensitive (HS) regions was expected. 
The accessible chromatin accounts for only a small portion of the genome (<1.5%), and although more than 70% of HS regions are found in non-genic space, they are enriched in genes and gene flanks. Of the total genic space, which accounts for approximately 6% of the genome, HS regions represent 7.41% and 4.88% in the well-watered (WW) and drought (DT) conditions, respectively, compared to only 1.2% and 0.96% of non-genic space. Conventional patterns of hypersensitivity can be seen in WW, with average nucleosome occupancy increased immediately upstream of the transcription start sites (TSS) and nucleosome depletion at the start site. In DT, overall average occupancy is decreased and depletion at the TSS is minimal. Similar patterns can be seen at the transcription termination sites (TTS). DT maintains a steady, low level of occupancy throughout the gene body, while increased nucleosome occupancy in WW is evident and may be linked to regulation of transcription elongation. A highly significant relationship was found between MNase HS and gene density across all 21 chromosomes under WW conditions. In contrast, only six chromosomes demonstrate such a relationship in DT (chromosomes 4A, 6A, 1B, 2B, 6B, and 6D). Further testing of the null hypothesis of equal slopes (slopeWW = slopeDT) revealed 13 chromosomes where the relationship between MNase HS and gene density differs between WW and DT.
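The percentages quoted above can be cross-checked with a short calculation. This sketch assumes that the 7.41%/4.88% and 1.2%/0.96% figures are the fractions of genic and non-genic space covered by HS regions, and that genic space is exactly 6% of the genome (the report says "approximately"); the function and variable names are hypothetical.

```python
GENIC_SHARE = 0.06  # approximate share of the genome that is genic (from the text)

# (HS coverage of genic space, HS coverage of non-genic space) per condition
HS_COVERAGE = {
    "WW": (0.0741, 0.012),
    "DT": (0.0488, 0.0096),
}

def hs_summary(genic_cov, nongenic_cov, genic_share=GENIC_SHARE):
    """Combine per-compartment HS coverage into genome-wide quantities."""
    hs_genic = genic_share * genic_cov
    hs_nongenic = (1.0 - genic_share) * nongenic_cov
    total = hs_genic + hs_nongenic
    return {
        "genome_fraction": total,                 # share of genome that is HS
        "nongenic_share": hs_nongenic / total,    # share of HS lying outside genes
        "genic_enrichment": genic_cov / nongenic_cov,
    }

summary = {cond: hs_summary(*cov) for cond, cov in HS_COVERAGE.items()}
for cond, s in summary.items():
    print(cond, {k: round(v, 4) for k, v in s.items()})
```

With these rounded inputs the non-genic share of HS exceeds 70% in both conditions and HS coverage is roughly five to six times higher in genic than in non-genic space, in line with the text. The implied genome-wide HS fraction comes out at roughly 1.2% (DT) to 1.6% (WW), close to the reported <1.5% given that the 6% genic share is only approximate.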

Publications

Type: Book Chapters Status: Published Year Published: 2019 Citation: El-Kassaby, Y.A., B. Ratcliffe, O. Gamal-El-Dien, S. Sun, C. Chen, E.P. Cappa and I. Porth. Genomic selection of wood quality in Canadian spruces. 2019. Springer Nature.
Type: Journal Articles Status: Published Year Published: 2019 Citation: Sun, S., S. Maio, B. Ratcliffe, P. Campbell, Y.A. El-Kassaby, B. Balasundaram and C. Chen*. 2019. Variable Selection by Generalized Graph Domination. PLoS ONE 14(1):e0203242. doi.org/10.1371/journal.pone.0203242.
Type: Journal Articles Status: Published Year Published: 2019 Citation: Hu, X., B.F. Carver, C. Power, L. Yan and C. Chen*. 2019. Genomic selection and response to selection by designed training population for grain yield and end-use quality traits in winter wheat variety development programs. The Plant Genome doi:10.3835/plantgenome2018.11.0090
Type: Journal Articles Status: Published Year Published: 2019 Citation: Naidenov, B., K. Willyerd, A. Lim, N. Torres, W. Johnson, H. J. Hwang, P. Hoyt, J. Gustafson and C. Chen*. 2019. Pan-genomic and polymorphic driven prediction of antibiotic resistance in Elizabethkingia. Frontiers in Microbiology doi.org/10.3389/fmicb.2019.01446.
Type: Journal Articles Status: Published Year Published: 2019 Citation: Lim, A., B. Naidenov, H. Bates, K. Willyerd, T. Snider, M.B. Couger, C. Chen* and A. Ramachandran. 2019. Nanopore ultra-long read sequencing technology for antimicrobial resistance detection in Mannheimia haemolytica. Journal of Microbiological Methods 159:138-147. doi.org/10.1016/j.mimet.2019.03.001.
Type: Journal Articles Status: Published Year Published: 2019 Citation: Ratcliffe, B., F.R. Thistlethwaite, O. Ibrahim, E. Cappa, I. Porth, J. Klápště, C. Chen, T. Wang, M.U. Stoehr and Y.A. El-Kassaby. 2019. Inter- and intra- generation genomic predictions for Douglas-fir growth in unobserved environments. Frontiers in Genetics bioRxiv doi.org/10.1101/540765.
Type: Journal Articles Status: Published Year Published: 2019 Citation: Campbell, P., L. Arévalo, H. Martin, C. Chen, S. Sun, A.H. Rowe, M.S. Webster, J.B. Searle and B. Pasch. 2019. Vocal divergence is concordant with genomic evidence for strong reproductive isolation in grasshopper mice (Onychomys). Ecology and Evolution doi.org/10.1002/ece3.5770
Type: Websites Status: Other Year Published: 2019 Citation: Translational Genomics Forum 2019 https://github.com/transgenomicsosu/Stillwater_Translational_Genomics_Forum_2019

Progress 10/01/17 to 09/30/18

Outputs
Target Audience: Individuals who study wheat genomics, genomic sequencing, mapping techniques and quantitative genetics aspects related to grain yield, drought tolerance, and end-use quality traits. Scientists working on developing parametric and non-parametric algorithms for prediction purposes. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes.
Changes/Problems: Our local computing resource continues to be the limiting factor for genomic analysis. To complete the proposed analyses, we have submitted a research allocation proposal to XSEDE (the National Science Foundation's Extreme Science and Engineering Discovery Environment), requesting 72 TB of data storage and over 750,000 CPU hours on large shared-memory computing nodes.
What opportunities for training and professional development has the project provided? Our research project covers wheat variety improvement, crop genetics, statistics, and bioinformatics, providing multi-disciplinary training opportunities for both graduate students and postdoctoral researchers. Postdoctoral researcher Dr. Willyerd and graduate student Bryan Naidenov attended parallel computing training hosted by the high-performance computing cluster at OSU. Graduate student Xiaowei Hu, who was previously trained in Applied Mathematics, has worked alongside Dr. Willyerd (molecular geneticist) and graduate student Alex Lim (biochemist) to reinforce her knowledge of biology and genetics and to broaden her career outlook. All postdoctoral researchers and graduate students have attended an annual event for data science, plant breeding, and statistical genomics. Dr. Willyerd and graduate students Bryan Naidenov and Alex Lim also attended an annual meeting of the Midsouth Computational Biology and Bioinformatics Society. 
Naidenov and Lim, who both work on informatics and machine learning algorithms, attended the 2018 Coalition for Advancing Digital Research and Education conference.
How have the results been disseminated to communities of interest? The SNP-select algorithm and its data sets are available as stand-alone software, published on the Translational Genomics Laboratory's GitHub site: https://github.com/transgenomicsosu/SNP-SELECT
What do you plan to do during the next reporting period to accomplish the goals?
1. Further development of the Bayesian multivariate model for multi-trait prediction
A multivariate model will be expanded to include multiple phenotypes in one GS run, to fully utilize the genetic correlation arising from shared, common quantitative trait loci (QTL).
2. Epigenomic regulation of drought tolerance in Oklahoma's hard red winter wheat
Three doubled haploid lines derived from the Duster and Billings intercross population have been selected for controlled-environment experiments. The DH lines and the parental varieties have been subjected to a drought condition that simulates the 2014 drought in the region. Leaf samples of three biological replicates of these varieties have been collected at critical developmental stages under the control and drought treatments. The research expects to investigate the impacts of drought-induced variation in methylation and chromatin structure. Protocols for reduced-representation bisulfite sequencing, as well as the titration of chromatin digestion with micrococcal nuclease, have been optimized. When sequencing is complete, genome-guided alignment will be performed using the current bread wheat reference assembly. The resulting hypersensitivity and hyposensitivity of these epigenomic regulations will be mapped together with the current genome-wide association and QTL mapping results.
3. Duster genome assembly
To avoid off-target alignment issues, as well as to uncover unique genomic adaptations of Oklahoma's winter wheat varieties, we will continue the effort to assemble the Duster genome using ultra-long read sequencing technology. Currently, over 40 flow cells' worth of long-read data have been collected; base-calling for generating fast5 files is ongoing on OSU's supercomputer.

Impacts
What was accomplished under these goals?
1. Effectiveness of genomic prediction, measured by response to selection, for grain yield and end-use quality traits under drought conditions. Considering the practicality of applying genomic selection (GS) at the line development stage of a hard red winter (HRW) wheat variety development program, we evaluated the effectiveness of GS by prediction accuracy and by the response to selection across field seasons that posed challenges for crop improvement under significant climate variability. Important breeding targets for HRW wheat improvement in the southern Great Plains of the USA, including grain yield, kernel weight, wheat protein content, and sodium dodecyl sulfate (SDS) sedimentation volume (a rapid test for predicting bread-making quality), were used to estimate the effectiveness of GS across harvest years from 2014 (drought) to 2016 (normal). In general, our results show that the nonparametric algorithms reproducing kernel Hilbert space (RKHS) and random forest (RF) produced higher accuracies in both same-year/environment cross-validations and cross-year/environment predictions for line selection in a bi-parental doubled haploid (DH) HRW population. For OSU's wheat variety development program, accurate and stable selection of superior breeding lines across experimental trials could still be challenging under worsening drought conditions. To ensure a long-term response to selection, our results suggest that there are cases where phenotypic selection is still preferable, or where retraining with updated phenotypes should be performed. In principle, the superiority of GS was most notable when the selection intensity was high and when a large amount of training information was available. Notably, our findings indicate that training conducted under sub-optimal conditions could still provide GS predictability while maintaining a desirable response to selection for both grain yield and SDS sedimentation volume. The reverse, however, is not true: in our case study, predicting line performance under sub-optimal conditions (for example, drought) from models trained under normal growing conditions required additional phenotyping in the target, sub-optimal environment to achieve a desirable selection response; this was most obvious for grain yield when selection intensity was high. In other words, when making selection decisions for trials under unexpected environmental stress, such as the frequent droughts in the southern Great Plains of the USA, GS trained under optimal growing conditions could very likely produce unreliable outcomes. Further, the stability of prediction performance was greatest for SDS sedimentation volume and least for wheat protein content, making SDS sedimentation volume a worthy candidate for GS in wheat variety development programs.
2. Optimization of the training population for genomic prediction in sub-optimal growing environments. This study also evaluated the effectiveness of genomic prediction with respect to the composition of the training population. Our findings suggest that, overall, GS performance can be expected to improve when the training population is optimized. A simple and straightforward way to optimize the training population is to maximize its phenotypic variation. In addition to the conventional two-tailed training population design, our study investigated other approaches to constructing the training population, such as two-tailed selection on genomic estimated breeding values (GEBVs) and a training population formed by the majority vote of GEBVs and raw phenotypes. Using grain yield as an example of a polygenic trait, a broadly appropriate guideline is as follows: when training data come from normal growing conditions, straightforward GS approaches with an intermediate-sized training population should be considered for high selection intensity; when training is performed under stressed growing conditions, at high selection intensity, a training population optimized by majority vote could provide a long-term advantage. The latter scenario was also beneficial for end-use quality traits such as SDS sedimentation volume and kernel weight. In addition, with a heritability estimate of 0.74 and appreciable phenotypic correlation coefficients across environments, the stability of GS performance and of the response to selection across environmental variability makes SDS sedimentation volume a worthy candidate for genomic selection in wheat variety development programs.
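The two-tailed training design and the realized response to selection discussed above can be sketched in a few lines of Python. This is only an illustrative sketch on simulated data, not the project's code: the function names (`two_tailed_indices`, `select_top`, `realized_response`), the population size, and the noise levels are all assumptions, and the majority-vote design is not shown because the report does not define its voting rule.

```python
import numpy as np

def two_tailed_indices(values, n_train):
    """Indices of the n_train most extreme lines: half from the low tail,
    the rest from the high tail (maximizes phenotypic variation)."""
    if n_train < 2:
        raise ValueError("n_train must be at least 2")
    order = np.argsort(values)
    half = n_train // 2
    return np.concatenate([order[:half], order[-(n_train - half):]])

def select_top(scores, intensity):
    """Indices of the top fraction `intensity` of lines ranked by score."""
    n_selected = max(1, int(round(intensity * len(scores))))
    return np.argsort(scores)[-n_selected:]

def realized_response(selected, validation_phenotype):
    """Mean of the selected lines in the validation season minus the mean of all lines."""
    return float(validation_phenotype[selected].mean() - validation_phenotype.mean())

# Simulated example (hypothetical values): 200 lines observed in two seasons.
rng = np.random.default_rng(0)
genetic_value = rng.normal(size=200)
season_1 = genetic_value + rng.normal(scale=0.8, size=200)  # training season
season_2 = genetic_value + rng.normal(scale=0.8, size=200)  # validation season

train_idx = two_tailed_indices(season_1, n_train=60)
chosen = select_top(season_1, intensity=0.10)               # select the top 10% of lines
response = realized_response(chosen, season_2)

print("variance, all lines:        ", np.var(season_1))
print("variance, two-tailed subset:", np.var(season_1[train_idx]))
print("realized response in season 2:", response)
```

In practice, the scores passed to `select_top` would be GEBVs from a model trained on `train_idx`, so that GS and phenotypic selection can be compared at the same selection intensity.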

Publications

Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Hu, X., L. Zhu and C. Chen. 2018. Bayesian Multi-variate Weighted Kernel Genomic Prediction. Joint Statistical Meeting, Vancouver BC.
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Willyerd, K., S. Sun, Y. Gao, X. Hu, C. Powers, L. Yan, B. Carver, and C. Chen 2018. Genomic Variation for Yield Stability and End-Use Quality in Hexaploid Wheat. Mid-South Computational Biology and Bioinformatics Society Conference. Starkville, Mississippi
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Launius, M., C. Chen and K. Willyerd. 2018. Investigating Stomatal Responses to Drought in Hard Red Winter Wheat. Department of Plant and Soil Science Research Symposium. Stillwater, Oklahoma
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Rodriguez, A., C. Chen and K. Willyerd. 2018. Exploring Transcriptional Variation in Drought Tolerant and Susceptible Winter Wheat Lines. Department of Plant and Soil Science Research Symposium. Stillwater, Oklahoma
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Naidenov, B. and C. Chen. 2018. Exposing the Hidden chromatin Regulatory Framework with Recurrent Deep Learning and Genomic Sequence Data. Mid-South Computational Biology and Bioinformatics Society Conference. Starkville, Mississippi
Type: Conference Papers and Presentations Status: Other Year Published: 2018 Citation: Lim, A., B. Naidenov, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2018. Data Driven Genomic Surveillance of Microbial Drug Resistance Using Oxford Nanopore Single Molecular Sequencing Technology. Mid-South Computational Biology and Bioinformatics Society Conference. Starkville, Mississippi
Type: Journal Articles Status: Published Year Published: 2018 Citation: Sun, S., Z. Miao, B. Ratcliffe, P. Campbell, Y.A. El-Kassaby, B. Balasundaram and C. Chen*. 2018. Variable Selection by Generalized Graph Domination. PLoS One. Available as a bioRxiv preprint.
Type: Journal Articles Status: Published Year Published: 2018 Citation: Thistlethwaite, F.R., B. Ratcliffe, J. Klápště, I. Porth, C. Chen, M.U. Stoehr and Y.A. El-Kassaby. 2018. Genomic Selection of Juvenile Height across a Single Generational Gap in Douglas-fir. Heredity

Progress 10/01/16 to 09/30/17

Outputs
Target Audience: Individuals who study wheat genomics, genomic sequencing, mapping techniques, and the quantitative genetics of grain yield, drought tolerance, and end-use quality traits; scientists who develop parametric and non-parametric prediction algorithms; members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station; and students, including graduate students, who are studying the applicability of genomic selection for crop improvement.
Changes/Problems: Since September 2017, a high-quality reference genome sequence of bread wheat (IWGSC RefSeq v1.0) has been available through an agreement with the IWGSC. Taking advantage of IWGSC RefSeq v1.0, we ran genome-guided HISAT alignment for the reads of DH169 and DH173. However, the alignment took just over 11 days to complete on the big-memory nodes of Cowboy, OSU's high-performance computing cluster, and the resulting .sam file was 1.3 TB for DH173 alone and 1.8 TB for DH169. Although the .sam files were removed immediately after the .bam files were generated, our current storage is insufficient even for the .bam files. In addition, estimating read abundance with RSEM requires the raw sequence files (450 GB) along with the Trinity FASTA files; this procedure uses Bowtie and produces .bam files and other intermediates. We estimate that the RSEM read-abundance calculation for RNAseq requires over 800 GB of input files and over 4 TB of intermediates, a data size that by itself presents a great challenge for our current computing resources. To complete this analysis, we need a large storage supplement to our current start-up allocation on XSEDE (the National Science Foundation's Extreme Digital Program). A supplement proposal has been submitted for storage and CPU hours on large-memory computing nodes. From previous experience running these algorithms, our sizeable datasets demand a more powerful cluster than what is locally available. We are currently preparing a full application to XSEDE. The success of this application is critical for our upcoming RNAseq datasets, which include all 24 genotypes.
What opportunities for training and professional development has the project provided? This research project has created multi-disciplinary training opportunities for both graduate students and postdoctoral researchers. Postdoctoral researcher Dr. Willyerd and graduate student Bryan Naidenov attended training in parallel computing hosted by the high-performance computing cluster at OSU. Graduate student Xiaowei Hu has worked alongside a crop geneticist and biochemists to reinforce her knowledge of biology and genetics and to broaden her career outlook. Dr. Willyerd attended an annual event for plant breeding graduate students and postdoctoral scientists. Graduate students Bryan Naidenov and Alex Lim attended the annual meeting of the MidSouth Computational Biology and Bioinformatics Society in Little Rock, Arkansas. Naidenov and Lim, who both work on machine learning algorithms for prediction on supercomputers, attended the 2018 Coalition for Advancing Digital Research and Education conference.
How have the results been disseminated to communities of interest? The SNP-select algorithm is stand-alone software, which we have published on the Translational Genomics Laboratory's GitHub site: https://github.com/transgenomicsosu/SNP-SELECT
What do you plan to do during the next reporting period to accomplish the goals?
1. Genomic prediction for winter wheat improvement. A multivariate model will be developed to incorporate multi-year and multi-location trials and thereby improve prediction accuracy. In a conventional setting, at least three years of yield trials would be conducted before selection for variety development. To account for the evident genotype x environment interaction (GxE), more sophisticated models are required.
2. Drought-tolerance association for Oklahoma's hard red winter wheat population. Three doubled haploid (DH) lines derived from the Duster and Billings intercross population have been selected for controlled-environment experiments. The DH lines and the parental varieties have been subjected to severe drought conditions, and tissues have been sampled at critical developmental stages. After extraction of total RNA, samples were submitted for transcriptome sequencing. We will perform genome-guided alignment and assemble transcripts that are differentially expressed between treatments. The resulting up- or down-regulated transcripts will be mapped alongside the current genome-wide association and QTL mapping results.
3. Duster genome assembly. The laboratory has been investigating the capacity of ultra-long-read sequencing technology. We have also implemented three high-molecular-weight DNA extraction protocols to obtain long DNA fragments. The de novo genome assembly will be conducted with up to 18 flow cells of sequencing capacity. We currently anticipate a sequencing yield of up to 800 million 1D reads with an average read length of 5,000 base pairs. The reads will also be aligned to the current RefSeq v1.0 bread wheat assembly to distinguish the bread wheat core genome from the accessory genome specific to Oklahoma's winter wheat varieties.

Impacts
What was accomplished under these goals?
1. Winter wheat genomic resource and association for drought tolerance. In total, 282 doubled haploid (DH) winter wheat lines derived from the intercross of Duster and Billings, together with members of the Dual Purpose Observation Nursery (DPON), were genotyped with genotyping-by-sequencing (GBS) technology, yielding 289,222 SNPs before filtering. In addition, SNP markers located in functional genes were developed using capture technology. A total of 50K probes were designed for exome capture sequencing using public databases, including high-confidence gene models and CDS (MIPS v2.2), gene models from the wheat D genome (Aegilops tauschii) and A genome (Triticum urartu) progenitors, 454 Titanium sequence reads from wheat cDNA libraries, and probes of the NimbleGen SNP array; this revealed 709,063 SNPs in functional regions before filtering. In total, 702,000 SNPs were derived from exome capture technology. These were merged with the GBS SNPs and anchored on the current bread wheat IWGSC_RefSeq_v1.0 reference assembly. After removal of non-informative and erroneous SNPs, build DB_v1.0.1 for the Duster and Billings DH population contains 16,383 quality SNPs (<25% missing data) with whole-genome coverage of ~96 SNPs/100 Mbp and approximately 50-50 segregation of parental genotypes. Genome-wide association mapping (GWAS) using these 242 DH lines identified SNP variants significantly associated with yield, wheat protein, and kernel hardness traits, although the variation in associated SNPs across the three field seasons signifies the impact of genotype x environment interaction. Significant GWAS associations co-localizing with the previously identified yield QTL on chromosome 1BS were found only in the low-precipitation years 2014 and 2015.
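As a rough illustration of the SNP quality filter described above (less than 25% missing data, non-informative markers removed), the following Python sketch filters a simulated genotype matrix and computes a marker density. It is a sketch under stated assumptions rather than the project's pipeline: the 0/1 parental-allele coding, the function names, the simulated data, and the 17 Gbp genome size used in the density check are all assumptions.

```python
import numpy as np

def filter_snps(geno, max_missing=0.25):
    """Return (keep_mask, parent_b_freq) for a lines x SNPs matrix.

    geno holds 0.0 (parent A allele), 1.0 (parent B allele) or NaN (missing call).
    A SNP is kept when its missing rate is below max_missing and both
    parental alleles are observed (i.e. it is informative).
    """
    called = ~np.isnan(geno)
    missing_rate = 1.0 - called.mean(axis=0)
    n_called = np.maximum(called.sum(axis=0), 1)      # avoid division by zero
    parent_b_freq = np.nansum(geno, axis=0) / n_called
    polymorphic = (parent_b_freq > 0.0) & (parent_b_freq < 1.0)
    keep = (missing_rate < max_missing) & polymorphic
    return keep, parent_b_freq

def snps_per_100mbp(n_snps, genome_bp):
    """Marker density expressed as SNPs per 100 Mbp."""
    return n_snps / genome_bp * 1e8

# Simulated example (hypothetical): 242 DH lines, 1,000 SNPs, variable missingness.
rng = np.random.default_rng(0)
geno = rng.integers(0, 2, size=(242, 1000)).astype(float)
per_snp_missing = rng.uniform(0.0, 0.5, size=1000)
geno[rng.random(geno.shape) < per_snp_missing] = np.nan

keep, parent_b_freq = filter_snps(geno)
print("SNPs kept:", int(keep.sum()), "of", geno.shape[1])
print("mean parent-B allele frequency among kept SNPs:", parent_b_freq[keep].mean())

# Density check for the reported 16,383 SNPs, ASSUMING a genome of about 17 Gbp.
print("SNPs per 100 Mbp:", snps_per_100mbp(16383, 17e9))
```

A parent-B allele frequency near 0.5 among the kept SNPs corresponds to the approximately 50-50 segregation expected in a bi-parental DH population.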
RNAseq of a drought-tolerant genotype, DH169, revealed that drought stress influenced the differential expression of 6,936 transcripts (adjusted p < 0.05; log2FC < -3 or log2FC > 3), 4,989 of which represent the longest single isoform. Differentially expressed transcripts mapped to 318 Arabidopsis and 156 Oryza sp. proteins categorized as stress responsive. Cross-referencing these 474 differentially expressed transcripts with yield-associated genomic sequences from the SNP data revealed nine transcripts aligning to chromosome 2 in the A, B, and D genomes. Our findings of stress-response genes responsible for yield maintenance point to molecular breeding targets for combating food insecurity in a worsening drought climate.
2. Genomic prediction for grain yield and end-use quality traits. Genomic selection performance was evaluated for grain yield and end-use quality traits, including wheat protein content, kernel hardness, and SDS traits. The project examined the predictive ability of eight algorithms: the parametric algorithms ridge regression BLUP, GBLUP, Bayes A, Bayes B, Bayes Cπ, and Bayesian LASSO, and the nonparametric algorithms random forest and reproducing kernel Hilbert space. In within-year cross-validation with 100 replicates, the non-parametric random forest algorithm showed higher prediction accuracy, outperforming the other algorithms by at least 7%. Cross-validation results for wheat protein and sodium dodecyl sulfate (SDS) sedimentation showed no significant difference among prediction algorithms. Variable-selection prediction methods such as Bayesian LASSO (BL) showed an advantage for hardness-related traits such as single kernel characterization system average weight. The observed differences in predictive ability reflect the underlying genetic architecture of phenotypic variation. For example, major genes responsible for hardness can be found on the short arm of chromosome 5D, while a small number of small-effect loci can also be identified. In both within-year and cross-year validation, BL outperformed the other prediction algorithms as measured by Pearson's correlation, suggesting that selecting important SNP variables that may be in close LD with causative QTL improves prediction accuracy for kernel hardness. Further, the greater variability in mean squared error (MSE) seen with the BL method likely reflects whether such important alleles were included in the training population. Compared with cross-year validation, within-year cross-validation produced clearly inflated accuracy estimates, showing the weakness of cross-validation in capturing genotype x environment interaction. Cross-year validation not only better represents forward selection, but its estimates of predictive ability are also more likely to reflect what genomic breeders can realistically anticipate.
3. Optimization of the training population. Selecting the training population has shown an advantage. Across grain yield and end-use quality traits, the performance of genomic prediction can be improved by as much as 20%. This is more evident when selection is made on grain yield.
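The contrast between within-year cross-validation and cross-year validation described above can be illustrated with a small simulation. The sketch below is not the project's analysis: it uses ridge regression on marker effects with a fixed penalty (a simplification of ridge regression BLUP, which estimates the penalty from variance components), a -1/+1 coding for DH genotypes, and simulated phenotypes with a season-specific marker effect standing in for genotype x environment interaction. All names, sizes, and parameter values are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression in dual form; markers and phenotypes are mean-centered."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(len(y)), y - y_mean)
    return Xc.T @ alpha, x_mean, y_mean          # marker effects and centering terms

def ridge_predict(model, X):
    beta, x_mean, y_mean = model
    return (X - x_mean) @ beta + y_mean

def accuracy(observed, predicted):
    """Prediction accuracy as Pearson's correlation."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def validate(X, y_train_season, y_test_season, lam, n_folds=5, seed=0):
    """Fit on training lines using y_train_season; score the held-out lines
    against y_test_season. Passing the same season twice gives ordinary
    within-year cross-validation."""
    rng = np.random.default_rng(seed)
    n = len(y_train_season)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = ridge_fit(X[train_idx], y_train_season[train_idx], lam)
        scores.append(accuracy(y_test_season[test_idx], ridge_predict(model, X[test_idx])))
    return float(np.mean(scores))

# Simulated data (hypothetical): 300 DH lines, 300 markers, two seasons.
rng = np.random.default_rng(1)
n_lines, n_markers = 300, 300
X = rng.choice([-1.0, 1.0], size=(n_lines, n_markers))
main_effect = rng.normal(scale=np.sqrt(1.0 / n_markers), size=n_markers)

def simulate_season():
    gxe_effect = rng.normal(scale=np.sqrt(1.0 / n_markers), size=n_markers)
    return X @ (main_effect + gxe_effect) + rng.normal(scale=0.7, size=n_lines)

season_a, season_b = simulate_season(), simulate_season()

acc_within = validate(X, season_a, season_a, lam=100.0)
acc_cross = validate(X, season_a, season_b, lam=100.0)
print("within-year cross-validation accuracy:", acc_within)
print("cross-year validation accuracy:       ", acc_cross)
```

Because the season-specific effects are shared between training and test lines only in the within-year case, the within-year estimate would be expected to be the more optimistic of the two, which is the inflation the text warns about.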

Publications

Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Chen, C. 2017. Treatment for data uncertainty in genomic prediction. IUFRO, Concepción, Chile
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Hu, Xiaowei, C. Chen and L. Zhu. 2017. Kernel-based Bayesian model for genomic selection. Joint Statistical Meeting, Baltimore
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., A. Lim, W. Johnson, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Predicting antibiotic resistance with Nanopore long-reads and machine learning. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Sun, S., Z. Miao, B. Ratcliffe, P. Campbell, Y. El-Kassaby, B. Balasundaram and C. Chen. 2017. SNP variable selection by generalized graph domination. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Lim, A., B. Naidenov, C.J. Crick and C. Chen. 2017 Predictability of Neural Network Models for Carotenoid Biofortification. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. CADRE conference, Stillwater, OK.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., H. Bates, A. Lim, K. Willyerd, K. Snider, M. Breshears, B.M. Couger, C. Chen and A. Ramachandran. 2017. A small device for a big challenge: surveillance of drug resistance in Mannheimia haemolytica using Nanopore single molecular sequencing technology. AAVLD, San Diego, CA.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Song, S.J., B. Carver, C. Powers, L. Yan, Y. El-Kassaby, J. Klápště and C. Chen. 2017. Practical application of genomic selection in a doubled-haploid winter wheat breeding program. Molecular Breeding 37:117. doi:10.1007/s11032-017-0715-8
Type: Theses/Dissertations Status: Other Year Published: 2017 Citation: Song, S.J. 2017. Genomic selection in a Doubled Haploid Winter Wheat Population. M. Sc. Thesis

Progress 10/01/15 to 09/30/16

Outputs
Target Audience: Individuals who study wheat genomics, genomic sequencing, mapping techniques, and the quantitative genetics of grain yield, drought tolerance, and end-use quality traits; scientists who develop parametric and non-parametric prediction algorithms; members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station; and students, including graduate students, who are studying the applicability of genomic selection for crop improvement.
Changes/Problems: Since September 2017, a high-quality reference genome sequence of bread wheat (IWGSC RefSeq v1.0) has been available through an agreement with the IWGSC. Taking advantage of IWGSC RefSeq v1.0, we ran genome-guided HISAT alignment for the reads of DH169 and DH173. However, the alignment took just over 11 days to complete on the big-memory nodes of Cowboy, OSU's high-performance computing cluster, and the resulting .sam file was 1.3 TB for DH173 alone and 1.8 TB for DH169. Although the .sam files were removed immediately after the .bam files were generated, our current storage is insufficient even for the .bam files. In addition, estimating read abundance with RSEM requires the raw sequence files (450 GB) along with the Trinity FASTA files; this procedure uses Bowtie and produces .bam files and other intermediates. We estimate that the RSEM read-abundance calculation for RNAseq requires over 800 GB of input files and over 4 TB of intermediates, a data size that by itself presents a great challenge for our current computing resources. To complete this analysis, we need a large storage supplement to our current start-up allocation on XSEDE (the National Science Foundation's Extreme Digital Program). A supplement proposal has been submitted for storage and CPU hours on large-memory computing nodes. From previous experience running these algorithms, our sizeable datasets demand a more powerful cluster than what is locally available. We are currently preparing a full application to XSEDE. The success of this application is critical for our upcoming RNAseq datasets, which include all 24 genotypes.
What opportunities for training and professional development has the project provided? This research project has created multi-disciplinary training opportunities for both graduate students and postdoctoral researchers. Postdoctoral researcher Dr. Willyerd and graduate student Bryan Naidenov attended training in parallel computing hosted by the high-performance computing cluster at OSU. Graduate student Xiaowei Hu has worked alongside a crop geneticist and biochemists to reinforce her knowledge of biology and genetics and to broaden her career outlook. Dr. Willyerd attended an annual event for plant breeding graduate students and postdoctoral scientists. Graduate students Bryan Naidenov and Alex Lim attended the annual meeting of the MidSouth Computational Biology and Bioinformatics Society in Little Rock, Arkansas. Naidenov and Lim, who both work on machine learning algorithms for prediction on supercomputers, attended the 2018 Coalition for Advancing Digital Research and Education conference.
How have the results been disseminated to communities of interest? The SNP-select algorithm is stand-alone software, which we have published on the Translational Genomics Laboratory's GitHub site: https://github.com/transgenomicsosu/SNP-SELECT
What do you plan to do during the next reporting period to accomplish the goals?
1. Genomic prediction for winter wheat improvement. A multivariate model will be developed to incorporate multi-year and multi-location trials and thereby improve prediction accuracy. In a conventional setting, at least three years of yield trials would be conducted before selection for variety development. To account for the evident genotype x environment interaction (GxE), more sophisticated models are required.
2. Drought-tolerance association for Oklahoma's hard red winter wheat population. Three doubled haploid (DH) lines derived from the Duster and Billings intercross population have been selected for controlled-environment experiments. The DH lines and the parental varieties have been subjected to severe drought conditions, and tissues have been sampled at critical developmental stages. After extraction of total RNA, samples were submitted for transcriptome sequencing. We will perform genome-guided alignment and assemble transcripts that are differentially expressed between treatments. The resulting up- or down-regulated transcripts will be mapped alongside the current genome-wide association and QTL mapping results.
3. Duster genome assembly. The laboratory has been investigating the capacity of ultra-long-read sequencing technology. We have also implemented three high-molecular-weight DNA extraction protocols to obtain long DNA fragments. The de novo genome assembly will be conducted with up to 18 flow cells of sequencing capacity. We currently anticipate a sequencing yield of up to 800 million 1D reads with an average read length of 5,000 base pairs. The reads will also be aligned to the current RefSeq v1.0 bread wheat assembly to distinguish the bread wheat core genome from the accessory genome specific to Oklahoma's winter wheat varieties.

Impacts
What was accomplished under these goals?
1. Winter wheat genomic resource and association for drought tolerance. In total, 282 doubled haploid (DH) winter wheat lines derived from the intercross of Duster and Billings, together with members of the Dual Purpose Observation Nursery (DPON), were genotyped with genotyping-by-sequencing (GBS) technology, yielding 289,222 SNPs before filtering. In addition, SNP markers located in functional genes were developed using capture technology. A total of 50K probes were designed for exome capture sequencing using public databases, including high-confidence gene models and CDS (MIPS v2.2), gene models from the wheat D genome (Aegilops tauschii) and A genome (Triticum urartu) progenitors, 454 Titanium sequence reads from wheat cDNA libraries, and probes of the NimbleGen SNP array; this revealed 709,063 SNPs in functional regions before filtering. In total, 702,000 SNPs were derived from exome capture technology. These were merged with the GBS SNPs and anchored on the current bread wheat IWGSC_RefSeq_v1.0 reference assembly. After removal of non-informative and erroneous SNPs, build DB_v1.0.1 for the Duster and Billings DH population contains 16,383 quality SNPs (<25% missing data) with whole-genome coverage of ~96 SNPs/100 Mbp and approximately 50-50 segregation of parental genotypes. Genome-wide association mapping (GWAS) using these 242 DH lines identified SNP variants significantly associated with yield, wheat protein, and kernel hardness traits, although the variation in associated SNPs across the three field seasons signifies the impact of genotype x environment interaction. Significant GWAS associations co-localizing with the previously identified yield QTL on chromosome 1BS were found only in the low-precipitation years 2014 and 2015.
RNAseq of a drought-tolerant genotype, DH169, revealed that drought stress influenced the differential expression of 6,936 transcripts (adjusted p < 0.05; log2FC < -3 or log2FC > 3), 4,989 of which represent the longest single isoform. Differentially expressed transcripts mapped to 318 Arabidopsis and 156 Oryza sp. proteins categorized as stress responsive. Cross-referencing these 474 differentially expressed transcripts with yield-associated genomic sequences from the SNP data revealed nine transcripts aligning to chromosome 2 in the A, B, and D genomes. Our findings of stress-response genes responsible for yield maintenance point to molecular breeding targets for combating food insecurity in a worsening drought climate.
2. Genomic prediction for grain yield and end-use quality traits. Genomic selection performance was evaluated for grain yield and end-use quality traits, including wheat protein content, kernel hardness, and SDS traits. The project examined the predictive ability of eight algorithms: the parametric algorithms ridge regression BLUP, GBLUP, Bayes A, Bayes B, Bayes Cπ, and Bayesian LASSO, and the nonparametric algorithms random forest and reproducing kernel Hilbert space. In within-year cross-validation with 100 replicates, the non-parametric random forest algorithm showed higher prediction accuracy, outperforming the other algorithms by at least 7%. Cross-validation results for wheat protein and sodium dodecyl sulfate (SDS) sedimentation showed no significant difference among prediction algorithms. Variable-selection prediction methods such as Bayesian LASSO (BL) showed an advantage for hardness-related traits such as single kernel characterization system average weight. The observed differences in predictive ability reflect the underlying genetic architecture of phenotypic variation. For example, major genes responsible for hardness can be found on the short arm of chromosome 5D, while a small number of small-effect loci can also be identified. In both within-year and cross-year validation, BL outperformed the other prediction algorithms as measured by Pearson's correlation, suggesting that selecting important SNP variables that may be in close LD with causative QTL improves prediction accuracy for kernel hardness. Further, the greater variability in mean squared error (MSE) seen with the BL method likely reflects whether such important alleles were included in the training population. Compared with cross-year validation, within-year cross-validation produced clearly inflated accuracy estimates, showing the weakness of cross-validation in capturing genotype x environment interaction. Cross-year validation not only better represents forward selection, but its estimates of predictive ability are also more likely to reflect what genomic breeders can realistically anticipate.
3. Optimization of the training population. Selecting the training population has shown an advantage. Across grain yield and end-use quality traits, the performance of genomic prediction can be improved by as much as 20%. This is more evident when selection is made on grain yield.

Publications

Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., A. Lim, W. Johnson, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Predicting antibiotic resistance with Nanopore long-reads and machine learning. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Sun, S., Z. Miao, B. Ratcliffe, P. Campbell, Y. El-Kassaby, B. Balasundaram and C. Chen. 2017. SNP variable selection by generalized graph domination. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Chen, C. 2017. Treatment for data uncertainty in genomic prediction. IUFRO, Concepción, Chile
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Hu, Xiaowei, C. Chen and L. Zhu. 2017. Kernel-based Bayesian model for genomic selection. Joint Statistical Meeting, Baltimore
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Lim, A., B. Naidenov, C.J. Crick and C. Chen. 2017. Predictability of Neural Network Models for Carotenoid Biofortification. MCBIOS XIV, Little Rock, Arkansas
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. CADRE conference, Stillwater, OK.
Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation: Naidenov, B., H. Bates, A. Lim, K. Willyerd, K. Snider, M. Breshears, B.M. Couger, C. Chen and A. Ramachandran. 2017. A small device for a big challenge: surveillance of drug resistance in Mannheimia haemolytica using Nanopore single molecular sequencing technology. AAVLD, San Diego, CA.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Song, S.J., B. Carver, C. Powers, L. Yan, Y. El-Kassaby, J. Klápště and C. Chen. 2017. Practical application of genomic selection in a doubled-haploid winter wheat breeding program. Molecular Breeding 37:117. doi:10.1007/s11032-017-0715-8
Type: Theses/Dissertations Status: Other Year Published: 2017 Citation: Song, S.J. 2017. Genomic selection in a Doubled Haploid Winter Wheat Population. M. Sc. Thesis.

Progress 09/19/15 to 09/30/15

Outputs
Target Audience: Individuals studying wheat genomics, genomic sequencing and mapping techniques, and quantitative genetics related to grain yield components. Scientists working on genomic selection algorithms. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students and graduate students who are studying the applicability of genomic selection for crop improvement.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided?
The project has provided postdoctoral research associate Dr. Karyn Willyerd with opportunities to combine her molecular genetics knowledge with computational and statistical tools. Dr. Willyerd was also accepted to and participated in the Workshop on Cereal Genomics at the Cold Spring Harbor Laboratory, New York, held October 19-25, 2016, to further her skills and knowledge in large-scale genomic analysis. PhD student Xiaowei Hu, who works primarily on optimal training design for the Duster x Billings DH population, was selected to participate in the Joint Statistical Meetings in Chicago, July 30 - August 4, 2016. Her thesis work on the optimization was presented in a workshop focused on quantitative and statistical genetics. PhD student Shuzhen Sun, who works on Illumina short-read alignment and the SNP imputation problem, attended an Illumina workshop at the Oklahoma Medical Research Foundation on October 12, 2016. Miss Sun is currently working on SNP calling, imputation and data integration; her training background is in variable reduction algorithms, and the Illumina workshop gave her the knowledge she needed in next-seq data generation. Undergraduate student Suzi Barboza-Pachero also attended the 58th Maize Genetics Conference.
For her undergraduate research in wheat genomic selection simulation, Miss Barboza-Pachero was awarded travel funding to attend this meeting in Jacksonville, Florida, March 17-20, 2016. Miss Barboza-Pachero was also selected for a summer internship at the Boyce Thompson Institute at Cornell University. There she was part of the database testing team, where she worked alongside database programmers and created a MySQL data-loading script to examine the efficiency of querying and filtering large-volume SNP tables.
How have the results been disseminated to communities of interest?
The proposed research program is only one year in, and the results of this project have not yet been disseminated. However, owing to the close relationship between Chen's research program and the Wheat Improvement Team (WIT) at OSU, Dr. Brett Carver, the Regents Professor and OSU Wheat Genetics Chair who leads WIT, has consulted the current results and compared them with selection conducted using previous yield data.
What do you plan to do during the next reporting period to accomplish the goals?
The first set of sequencing short reads used to generate the current SNP profile for the 'Duster x Billings' population will be pooled with the recent GBS1038, GBS1039 and GBS1040 data. SNP determination considering different genome coverage will be executed as proposed. With pooled short-read data, we expect better marker data quality with a lower missing data ratio; also, as a result of increased genome coverage, the impact of sequencing read depth on statistical parameter estimation, such as the Euclidean distance measure (the D matrix) in the RKHS model and the criterion for variable selection, can be investigated. Upon completion of this research, the Illumina short-read raw data will be made available to the community to promote data reusability. We will also provide both the SNP and phenotype tables through publications.
Currently, the 2016 grain yield data from the 'Duster x Billings' population are being processed. When ready, genomic selection on this new phenotype will be evaluated for another generation (grain yield 2016) to complete the two-generation validation proposed in the research proposal, with the 2016 grain yield data serving as the validation population. A number of predictive algorithms, including both parametric and non-parametric methods, as well as the script for cross-validation, have been implemented on our local machine. Models will be trained using phenotypic data from 2015; the predictions for 2016 will be used to evaluate the direct impact of genotypic information on predictability. The effectiveness of adopting genomic selection can also be assessed by directly comparing genomic prediction with the traditional practice of phenotypic selection. Finally, in this year's research plan, we will explore variable selection strategies based on statistical correlation, linkage analysis and machine learning algorithms such as k-domination, with the objective of maximizing predictability for grain yield. The performance of prediction models on the optimal training data set will also be studied.

Impacts
What was accomplished under these goals?
1. Major activities completed:
1.1. Genotypic information from genotyping-by-sequencing (GBS) technology
DNA of the 282 'Duster x Billings' lines was extracted from seedlings in early spring of 2016. Following the protocol of Poland et al. (2012), PstI and MspI were chosen to perform genome complexity reduction. PCR products were amplified using a short extension time (less than 30 seconds) to enrich short fragments suitable for bridge amplification on the Illumina flow cells. The first sequencing run has been completed, and raw reads from the second Illumina pass have now been analyzed. The merging of the 2014 and 2016 SNP data is currently underway; therefore, only results from the first run of SNP calling are included in this report.
1.2. SNP calling and missing data imputation
Three criteria were used to filter quality SNPs for predictive analyses: (1) overall missing ratio < 50%, (2) heterozygosity < 5% and (3) minor allele frequency > 5%, resulting in a total of 7,426 SNPs for the rest of the analysis. Missing SNP data were imputed with both the EM and the k-nearest neighbor algorithms. Only SNPs imputed by k-nearest neighbor were used for building predictive models, owing to the superior imputation accuracy of that method. In addition, read tags were aligned to the most up-to-date wheat pseudo-molecules using the BWA algorithm, with the minimum tag reads per alignment set at 10. Before data processing, 1,137,153 read tags could be aligned; on average, 7,888 SNPs were found per chromosome.
1.3. Comparisons of genomic selection algorithms
In total, seven predictive algorithms were examined using the 'Duster x Billings' population, including linear regression, the Bayesian alphabet and derived methods, and semi- and non-parametric algorithms. To provide an assessment close to breeding practice, cross-validation was examined with 2014 and 2015 grain yield data.
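As an aside on section 1.2, the three SNP quality filters can be sketched as below. This is a hedged illustration rather than the project's script: the 0/1/2/None genotype coding, the function names `snp_passes` and `filter_snps`, and the choice to compute heterozygosity and minor allele frequency over non-missing calls only are assumptions made for the example. Only the three thresholds come from the report.

```python
from typing import List, Optional

# Genotype coding assumed for this sketch (not stated in the report):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, None = missing.
Genotype = Optional[int]


def snp_passes(calls: List[Genotype],
               max_missing: float = 0.50,
               max_het: float = 0.05,
               min_maf: float = 0.05) -> bool:
    """Return True if one SNP (its calls across all lines) passes the three filters."""
    n_total = len(calls)
    observed = [g for g in calls if g is not None]
    if n_total == 0 or not observed:
        return False

    # (1) overall missing ratio < 50%
    missing_ratio = 1.0 - len(observed) / n_total
    if not missing_ratio < max_missing:
        return False

    # (2) heterozygosity < 5%, computed among non-missing calls (assumption)
    het_ratio = sum(1 for g in observed if g == 1) / len(observed)
    if not het_ratio < max_het:
        return False

    # (3) minor allele frequency > 5%, computed among non-missing calls (assumption)
    alt_freq = sum(observed) / (2 * len(observed))
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf > min_maf


def filter_snps(matrix: List[List[Genotype]]) -> List[int]:
    """Return the row indices of SNPs that pass all three filters."""
    return [i for i, calls in enumerate(matrix) if snp_passes(calls)]


# Toy SNP-by-line matrix: one row per SNP, one column per line.
toy = [
    [0, 0, 2, 2, 0, 2, None, 0, 2, 0],  # 10% missing, no hets, balanced alleles
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],     # monomorphic, so MAF = 0
    [None] * 6 + [0, 2, 0, 2],          # 60% missing
    [1, 1, 0, 2, 0, 2, 0, 2, 0, 2],     # 20% heterozygous calls
]
kept = filter_snps(toy)
print(kept)
```

In the toy matrix only the first SNP satisfies all three criteria; the other rows each violate exactly one filter, which makes it easy to check each threshold in isolation.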
In summary, the penalized linear regression model (RR, ridge regression) was the most computationally efficient, outperforming the other algorithms by 10- to 77-fold in computing speed, whereas Bayesian LASSO could take over 12 hours to finish. When conventional 5-fold cross-validation was applied to evaluate performance, the random forest (RF) algorithm gave the highest predictability for 2014, the year in which severe drought occurred a few weeks before harvest; ridge regression was the best model for the 2015 phenotypes.
2. Specific objectives met:
2.1. Genomic prediction accuracy evaluation
Prediction accuracies were tested with 5-fold cross-validation. The GEBVs for each fold were predicted by training the model on the four remaining folds. The procedure iterates five times so that the observations in each fold can be compared with their own predicted values. A total of 10 random partitions were generated for each data set. Prediction performance was assessed by Pearson's correlation, Spearman's rank correlation and mean squared error (MSE) between the observed phenotypic values and the cross-validated GEBVs. We recorded the average and standard deviation of these measurements over the 10 repetitions of 5-fold CV.
2.3. Optimal training population selection strategy established
In this study, five different scenarios (10%, 20%, 30%, 40% and 50%) were investigated for each selection method. The optimal selection (OS) method is expected to select fewer lines than the other three methods, as OS considers only the overlap of observed and predicted values for extreme performers. The statistical power of our OS scheme was calculated by a bootstrapping procedure. We bootstrapped m (e.g., m=100) samples of size n (e.g., n=150) from the original 239 lines. In each bootstrapped sample, these n lines were treated as a new population. The four selection methods were then applied to the new population to find its own new optimal training population (TP).
The prediction performance of each new TP can be evaluated under each scenario. The power of OS is the frequency with which OS beats the other three selection methods across the m bootstrapped samples.
3. Significant results achieved:
3.1. Preliminary genomic selection results
In general, all models performed similarly within a year. For 2014, the average predictability, measured as Pearson's correlation coefficient, was 56.8%, ranging from 55% to 58%. For 2015, the average was 55%, with random forest (RF) the highest at 58% and ridge regression the lowest at 53%. Among all predictive models, RF performed slightly better than the others, regardless of which year the training data came from. RF also outperformed all other models in the 2014 within-year cross-validation, even when ranked phenotypes were used to evaluate performance. Surprisingly, the penalized linear RR model was the best algorithm for ranking the 2015 grain yield phenotypes. Much-reduced predictability was observed in the cross-year results, suggesting strong gene-by-environment variation under field conditions in different years. Overall, RF still outperformed the others in both scenarios (model trained on 2014 predicting 2015, and model trained on 2015 predicting 2014). The lowest predictability for 2015 grain yield came from the penalized linear RR model trained on 2014 data, indicating that a much stronger interaction, arising from unaccounted year effects, was hindering the performance of the predictive analyses. The drought condition in 2015 was not as severe as in 2014, which is reflected in the predictability of 2014 grain yield, where prediction performance reached as high as 40% using the random forest algorithm. It is also worth mentioning that the variability in predictive performance was highest when the penalized linear model was used, which further confirms the lack of strength in linear, additive models.
3.2. Optimal training population selection
Because of its superior performance, results of optimal selection were based only on the random forest algorithm. Predictability was tested using 100 replicates of 5-fold cross-validation. The number of lines selected in the optimal TP increased as selection coefficients decreased. For example, the number of lines from OS ranged from 17 to 174 for 2014 as selection increased from 10% to 50%. As for the power of the optimal selection (OS) method, the accuracy of GS increased from 40% to 81% as the line coverage cut-off value increased from 10% to 50%, depending on the training information used. When training information was optimized from 2015, prediction accuracy reached its highest value of 81%. When 2014 was used as the training population, predictability increased from 36% to 70% after optimization, showing a significant benefit of organizing phenotypes so that the likelihood of including the underlying QTLs is maximized.
4. Key outcomes and other accomplishments:
A significantly reduced prediction performance was observed when two-generation validation was used, indicating apparent gene-by-environment variation in our cross-year study. In both scenarios (models trained on 2014 predicting 2015, and models trained on 2015 predicting 2014), non-parametric models (RKHS and RF) outperformed parametric models (RR and BL). When heading date was included as a covariate, model performance increased. Our preliminary results suggest that, to account for climate variation between growing seasons, non-parametric algorithms capable of modeling interaction should be considered. Important factors such as the span of LD, trait heritability, the genetic architecture underlying trait variation and marker density also need to be considered, as do the models used to assess predictability across environments.
Owing to its greatly reduced biological complexity, a single bi-parental DH population is ideal for investigating the genetic components that influence predictability in scenarios closer to those of a breeding program.
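The evaluation scheme of section 2.1 (10 random partitions of 5-fold cross-validation, scored by Pearson's correlation, Spearman's rank correlation and MSE, then summarized by mean and standard deviation) can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's script: `repeated_kfold` and the helper names are hypothetical, the data are simulated, and `one_marker_model` is a deliberately trivial stand-in for the algorithms compared in the report.

```python
import random
from statistics import mean, stdev
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]
# A model is any callable: (train_x, train_y, test_x) -> predictions for test_x.
FitPredict = Callable[[Matrix, Vector, Matrix], Vector]


def pearson(a: Vector, b: Vector) -> float:
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def ranks(values: Vector) -> Vector:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a: Vector, b: Vector) -> float:
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def mse(obs: Vector, pred: Vector) -> float:
    return mean([(o - p) ** 2 for o, p in zip(obs, pred)])


def repeated_kfold(x: Matrix, y: Vector, fit_predict: FitPredict,
                   k: int = 5, repeats: int = 10, seed: int = 0
                   ) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (mean, standard deviation)} over `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    n = len(y)
    scores: Dict[str, List[float]] = {"pearson": [], "spearman": [], "mse": []}
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        cv_pred = [0.0] * n
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                                [x[i] for i in fold])
            for i, p in zip(fold, preds):
                cv_pred[i] = p  # each line is predicted exactly once per partition
        scores["pearson"].append(pearson(y, cv_pred))
        scores["spearman"].append(spearman(y, cv_pred))
        scores["mse"].append(mse(y, cv_pred))
    return {name: (mean(v), stdev(v)) for name, v in scores.items()}


def one_marker_model(train_x: Matrix, train_y: Vector, test_x: Matrix) -> Vector:
    """Stand-in model (an assumption): simple linear regression on the first marker."""
    xs = [row[0] for row in train_x]
    mx, my = mean(xs), mean(train_y)
    slope = sum((v - mx) * (t - my) for v, t in zip(xs, train_y)) / sum((v - mx) ** 2 for v in xs)
    return [my + slope * (row[0] - mx) for row in test_x]


# Simulated data: 50 lines, one informative marker, small residual noise.
noise = random.Random(42)
x = [[float(i % 7)] for i in range(50)]
y = [2.0 * row[0] + noise.gauss(0, 0.5) for row in x]
summary = repeated_kfold(x, y, one_marker_model)
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.3f} sd={sd:.3f}")
```

Passing the model as a callable keeps the partitioning and scoring identical across algorithms, so any of the parametric or non-parametric methods could be substituted for `one_marker_model` without touching the evaluation loop.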

Publications

Type: Conference Papers and Presentations Status: Other Year Published: 2016 Citation: Chen, C., S. Sun, E.J. Schwarzkoph and Y.A. El-Kassaby. 2016. Missing data interpretation for non-referenced or semi-referenced genomes. MidSouth Computational Biology and Bioinformatics Society 2016 Conference (MCBIOS-XIII), March 03-05, Memphis TN. Abstract Identifying Number: 1006021
Type: Conference Papers and Presentations Status: Other Year Published: 2016 Citation: Hu, X., L. Zhu and C. Chen. 2016. Genomic prediction models on wheat doubled haploid population. The Joint Statistical Meetings, July 30- Aug 04, Chicago, IL https://ww2.amstat.org/meetings/jsm/2016/onlineprogram/AbstractDetails.cfm?abstractid=321241