Progress 09/19/15 to 09/14/20
Outputs Target Audience:
Individuals who study plant genomics, genomic technologies, association mapping techniques, and quantitative genetics as they relate to traits important for yield, growth, end-use quality, and abiotic and biotic resistance. Scientists working on developing computational algorithms for predictive analysis, optimization, and association genetics. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and scientists and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes. Members and teams in grower associations and forest management sectors.
Changes/Problems:
Our local computing resources continue to be the limiting factor for large-scale genomic analysis. As artificial intelligence (AI) rapidly changes data-intensive research, the demand for GPUs (graphics processing units) has also grown rapidly. Currently, the HPCC at OSU has only two GPU nodes, each at nearly full occupancy. The lack of computing resources and of a management strategy at the HPCC continues to be a struggle. Further, the availability of research capacity in the region has been greatly reduced owing to the expanded demand for data scientists in the job market. To make Oklahoma an attractive place for STEM research and trainees, OSU should consider a significant update to its recruitment strategy and salary structure.
What opportunities for training and professional development has the project provided?
Students were exposed to advanced sequencing technologies and associated analytical computation techniques. All of these data analyses were conducted using Oklahoma State University's High-Performance Computing Clusters.
How have the results been disseminated to communities of interest?
Nothing Reported
What do you plan to do during the next reporting period to accomplish the goals?
The main objective of crop improvement is to obtain genetic gain in a number of traits of interest while maintaining genetic diversity in the variety development program. This breeding decision involves multiple, sometimes competing, breeding objectives and is usually achieved by applying a selection index. Furthermore, when negative genetic correlations are present among breeding targets, extensive trade-offs can impact the long-term breeding goals. To this end, we will formulate crop improvement as a multi-objective optimization problem. We will first solve the optimization mathematically, and then implement an optimization algorithm that maximizes the breeding values of multiple phenotypes while minimizing inbreeding during selection. We anticipate that the multi-objective optimization framework will provide substantial support for breeding decisions, as well as for the design of crossing blocks.
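To make the trade-off between genetic gain and inbreeding concrete, the following is a minimal sketch only, not the algorithm planned by the project. It assumes a weighted-sum scalarization of the two objectives and a simple greedy search; the function name `greedy_select`, the `trade_off` parameter, and the use of mean kinship (diagonal included) as the inbreeding proxy are all illustrative assumptions. The `merit` vector could be a selection index combining breeding values of several traits.

```python
import numpy as np

def greedy_select(merit, kinship, k, trade_off):
    """Greedily pick k candidates (hypothetical helper, illustration only).

    Score of a candidate set S:
        trade_off * mean(merit[S]) - (1 - trade_off) * mean(kinship[S, S])
    trade_off = 1.0 selects on merit alone; smaller values penalize relatedness.
    """
    merit = np.asarray(merit, dtype=float)
    kinship = np.asarray(kinship, dtype=float)
    chosen = []
    remaining = list(range(len(merit)))
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            trial = chosen + [i]
            gain = merit[trial].mean()
            # Mean pairwise kinship of the trial set, self-kinship included.
            relatedness = kinship[np.ix_(trial, trial)].mean()
            score = trade_off * gain - (1.0 - trade_off) * relatedness
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)
```

Sweeping `trade_off` between 0 and 1 and recording the resulting gain and relatedness of each selected set traces out an approximate trade-off curve; a greedy search is not guaranteed to find the optimal set.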
Impacts What was accomplished under these goals?
1. Drought-induced methylation and transcription changes in the winter wheat genome
Our study showed that, under drought, 719 more differentially expressed genes were detected in DH169 than in the Duster genotype. The majority of the differentially expressed genes were associated with response to oxidative stress and bract morphology. Overall, Duster exhibited more significant methylation changes than DH169 and a greater extent of methylation under drought than under control conditions, whereas methylation in DH169 was higher under the well-watered condition. Finally, gene-body hypermethylation was associated with down-regulation in DH169; however, the positive association between up-regulation of gene expression and gene-body hypomethylation was seen only in Duster. These findings suggest that, under a water deficit, the drought-tolerant DH169 genotype undergoes significant transcriptional changes but fewer epigenetic changes. As a drought-'avoidant' winter wheat, Duster, on the other hand, demonstrated much more extensive genome-wide epigenetic modification compared with the variation identified at the genetic level. To summarize, this study reveals various genetic adaptation mechanisms employed by two closely related winter wheat genotypes. The whole-genome recombination events during the hybridization process might have disrupted the epigenomic regulatory machinery, requiring DH169 to respond to the imposed water deficit with more costly transcriptional variation.
2. H3K4me3 and H3K27me3 histone modification in winter wheat
In this study, we used whole-genome ChIP-seq to study genome-wide patterns of an active histone mark, histone H3 lysine 4 trimethylation (H3K4me3), and a repressive histone mark, histone H3 lysine 27 trimethylation (H3K27me3), in winter wheat under drought stress.
We found that, although the chromosomal and genomic distributions were similar in the well-watered (WW) and drought (DT) conditions, the number of genes modified by the H3K4me3 mark increased and the number modified by the H3K27me3 mark decreased under drought. In addition, a good portion of genes was newly modified after drought treatment, especially by H3K4me3. About 43% of DT H3K4me3-marked genes were unique to drought conditions, and over half of these drought-specific genes were significantly enriched with H3K4me3 in DT. Surprisingly, we identified 3,819 bivalent genes in DT, and the bivalency of over 70% of these genes was established upon water deficit. Interestingly, these newly formed bivalent genes in DT were established by depleting the repressive marks and gaining the active marks, whereas the levels of bivalency did not change in the bivalent genes common to WW and DT. These results suggest that drought stress induced H3K4me3 modifications and reduced H3K27me3 modifications, further enhancing bivalency during drought treatment in winter wheat.
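The bookkeeping behind "bivalent" and "newly formed bivalent" genes can be illustrated with plain set operations. This is a sketch under stated assumptions: a gene is treated as bivalent in a condition when it carries both H3K4me3 and H3K27me3 peaks in that condition, where WW is the well-watered and DT the drought condition. The function name `classify_bivalent` and the gene identifiers in the usage example are hypothetical, and the peak-calling step that would produce the gene lists is not shown.

```python
def classify_bivalent(k4_ww, k27_ww, k4_dt, k27_dt):
    """Partition bivalent genes by condition (hypothetical helper).

    Each argument is an iterable of gene IDs marked by H3K4me3 (k4) or
    H3K27me3 (k27) in the well-watered (ww) or drought (dt) condition.
    """
    bivalent_ww = set(k4_ww) & set(k27_ww)   # both marks in WW
    bivalent_dt = set(k4_dt) & set(k27_dt)   # both marks in DT
    return {
        "shared": bivalent_ww & bivalent_dt,            # bivalent in both
        "drought_specific": bivalent_dt - bivalent_ww,  # newly formed in DT
        "lost_in_drought": bivalent_ww - bivalent_dt,   # bivalent only in WW
    }
```

For example, `classify_bivalent({"g1", "g2"}, {"g1"}, {"g1", "g2", "g3"}, {"g1", "g3"})` reports `g1` as shared and `g3` as drought-specific. The fraction of drought-specific genes among all DT bivalent genes is the quantity the text reports as over 70%.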
Publications
- Type:
Journal Articles
Status:
Published
Year Published:
2020
Citation:
Thistlethwaite, F.R., O. Gamal El-Dien, B. Ratcliffe, J. Klápště, I. Porth, C. Chen, M.U. Stoehr, P. Ingvarsson and Y.A. El-Kassaby. 2020. Linkage disequilibrium vs. pedigree: genomic selection prediction accuracy in two conifer species. PLoS One 15: e0232201.
- Type:
Journal Articles
Status:
Published
Year Published:
2020
Citation:
Kehel, Z., M. Sanchez, A. El-Baouchi, H. Aberkane, A. Tsivelikas, C. Chen and A. Amri. 2020. Predictive characterization for seed morphometric traits for genebank accessions using genomic selection. Frontiers in Ecology and Evolution 8:32.
- Type:
Theses/Dissertations
Status:
Other
Year Published:
2020
Citation:
Lim, Alexander. 2020. Drought-induced epigenetic modulation and transcriptional variation of winter wheat. Oklahoma State University.
- Type:
Theses/Dissertations
Status:
Other
Year Published:
2020
Citation:
Liao, Chi-Ping. 2020. Genome-wide analysis of drought stress induced histone 3 lysine 4 and histone 3 lysine 27 trimethylation modifications in winter wheat. Oklahoma State University
|
Progress 10/01/18 to 09/30/19
Outputs Target Audience:
Individuals who study wheat genomics, genomic sequencing, association mapping techniques, and quantitative genetics as they relate to grain yield, drought tolerance, and end-use quality traits. Scientists working on developing parametric and non-parametric algorithms for prediction purposes. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes.
Changes/Problems:
Our local computing resources continue to be the limiting factor for large-scale genomic analysis. Further, the availability of research capacity in the region has been greatly reduced due to the expanded demand for data scientists in the job market. To make Oklahoma an attractive place for STEM research and trainees, OSU should consider a significant update to its recruitment strategy and salary structure.
What opportunities for training and professional development has the project provided?
Training and professional development were provided to graduate students, who had the opportunity to experience in-depth training in crop research.
How have the results been disseminated to communities of interest?
Results have been disseminated via scholarly publications and presentations at local and national meetings.
What do you plan to do during the next reporting period to accomplish the goals?
Duster genome assembly
This is a continuing effort. De novo assembly of Duster ultra-long sequencing reads resulted in ~350,000 contigs with an average genome coverage of 15X. We will seek resources to improve the contiguity of the Duster genome assembly. Further, to understand the functional variants, a number of computational algorithms will be used to identify structural variants uniquely present in the Duster genome.
Impacts What was accomplished under these goals?
1. Genomic regions identified as responsible for drought stress
Hexaploid wheat is a nutritional staple worldwide, and its production is challenged by unpredictable environmental variability in meeting global food demands. Achieving genetic gains will require continued knowledge acquisition to uncover the genetic basis of important agronomic trait performance, which is affected by increasing climate uncertainty, in regional breeding populations. A genome-wide SNP profile derived from genotyping-by-sequencing and exome capture was built to study genomic variants segregating in the Duster x Billings doubled-haploid (DH) population, with assessment of yield and end-use quality phenotypes over multiple field years exhibiting variable environments. Transcriptomics for two selected DH individuals was overlaid to determine gene expression within genomic regions modulated by environmental stress. Co-expression modules were used to determine functional heritable contributions to overall agronomic trait heritability. Marker-trait associations demonstrate altered genetic control under drought stress, most notably for yield. Drought-responsive mechanisms were identified in genomic regions modulated by drought stress. Four co-expression modules contribute 10-26% of overall narrow-sense heritability across multiple traits under drought stress; three modules are highly represented on chromosome 1B. Responses to drought stress are localized, suggesting mechanisms of narrowed genetic regulation under stress. Chromosome 1B largely encompasses the drought response, carries associations with multiple agronomic traits, and delivers strong contributions to heritable variation.
2. Drought-induced global changes in genome accessibility
Chromatin structure has a known relationship with gene regulation; therefore, an association between genes and MNase hypersensitive (HS) regions was expected.
Accessible chromatin accounts for only a small portion of the genome (<1.5%), and although more than 70% of HS regions are found in non-genic space, they are enriched in genes and gene flanks. HS regions represent 7.41% and 4.88% of total genic space, which accounts for approximately 6% of the genome, in the well-watered (WW) and drought (DT) conditions, respectively, compared with only 1.2% and 0.96% of non-genic space. Conventional patterns of hypersensitivity can be seen in WW, with average nucleosome occupancy increased immediately upstream of the transcription start site (TSS) and nucleosome depletion at the start site. In DT, overall average occupancy is decreased and depletion at the TSS is minimal. Similar patterns can be seen at the transcription termination sites (TTS). DT maintains a steady, low level of occupancy throughout the gene body, whereas increased nucleosome occupancy in WW is evident and may be linked to the regulation of transcription elongation. A highly significant relationship was found between MNase HS and gene density across all 21 chromosomes under WW conditions. In contrast, only six chromosomes demonstrate such a relationship in DT (chromosomes 4A, 6A, 1B, 2B, 6B, and 6D). Further testing of the null hypothesis that the WW and DT slopes are equal revealed 13 chromosomes where the relationship between MNase HS and gene density differs between WW and DT.
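One common way to test whether two regression slopes are equal is to fit a single linear model with a condition-by-predictor interaction term and examine that term. The sketch below illustrates that approach for one chromosome; it is an assumption about how such a test could be set up, not a description of the analysis actually run. The function name `slope_difference_test` and the idea of using per-window gene density as `x` and MNase HS coverage as `y` are illustrative.

```python
import numpy as np

def slope_difference_test(x_ww, y_ww, x_dt, y_dt):
    """Test H0: slope_WW == slope_DT via an interaction term (sketch).

    Fits y = b0 + b1*x + b2*g + b3*(x*g), where g = 0 for WW and 1 for DT.
    b3 is the DT-minus-WW slope difference. Returns (b3, t statistic, dof).
    """
    x = np.concatenate([x_ww, x_dt]).astype(float)
    y = np.concatenate([y_ww, y_dt]).astype(float)
    g = np.concatenate([np.zeros(len(x_ww)), np.ones(len(x_dt))])
    design = np.column_stack([np.ones_like(x), x, g, x * g])

    beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = len(y) - design.shape[1]
    sigma2 = float(resid @ resid) / dof               # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance
    t_stat = beta[3] / np.sqrt(cov[3, 3])
    return float(beta[3]), float(t_stat), dof
```

The t statistic would be compared against a Student t distribution with `dof` degrees of freedom, and running the test once per chromosome would call for a multiple-testing correction.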
Publications
- Type:
Book Chapters
Status:
Published
Year Published:
2019
Citation:
El-Kassaby, Y.A., B. Ratcliffe, O. Gamal El-Dien, S. Sun, C. Chen, E.P. Cappa and I. Porth. 2019. Genomic selection of wood quality in Canadian spruces. Springer Nature.
- Type:
Journal Articles
Status:
Published
Year Published:
2019
Citation:
Sun, S., S. Maio, B. Ratcliffe, P. Campbell, Y.A. El-Kassaby, B. Balasundaram and C. Chen*. 2019. Variable Selection by Generalized Graph Domination. PLoS ONE 14(1):e0203242. doi.org/10.1371/journal.pone.0203242.
- Type:
Journal Articles
Status:
Published
Year Published:
2019
Citation:
Hu, X., B.F. Carver, C. Powers, L. Yan and C. Chen*. 2019. Genomic selection and response to selection by designed training population for grain yield and end-use quality traits in winter wheat variety development programs. The Plant Genome doi:10.3835/plantgenome2018.11.0090
- Type:
Journal Articles
Status:
Published
Year Published:
2019
Citation:
Naidenov, B., K. Willyerd, A. Lim, N. Torres, W. Johnson, H. J. Hwang, P. Hoyt, J. Gustafson and C. Chen*. 2019. Pan-genomic and polymorphic driven prediction of antibiotic resistance in Elizabethkingia. Frontiers in Microbiology doi.org/10.3389/fmicb.2019.01446.
- Type:
Journal Articles
Status:
Published
Year Published:
2019
Citation:
Lim, A., B. Naidenov, H. Bates, K. Willyerd, T. Snider, M.B. Couger, C. Chen* and A. Ramachandran. 2019. Nanopore ultra-long read sequencing technology for antimicrobial resistance detection in Mannheimia haemolytica. Journal of Microbiological Methods 159:138-147. doi.org/10.1016/j.mimet.2019.03.001.
- Type:
Journal Articles
Status:
Published
Year Published:
2019
Citation:
Ratcliffe, B., F.R. Thistlethwaite, O. Ibrahim, E. Cappa, I. Porth, J. Klápště, C. Chen, T. Wang, M.U. Stoehr and Y.A. El-Kassaby. 2019. Inter- and intra-generation genomic predictions for Douglas-fir growth in unobserved environments. Frontiers in Genetics bioRxiv doi.org/10.1101/540765.
- Type:
Journal Articles
Status:
Published
Year Published:
2019
Citation:
Campbell, P., L. Arévalo, H. Martin, C. Chen, S. Sun, A.H. Rowe, M.S. Webster, J.B. Searle and B. Pasch. 2019. Vocal divergence is concordant with genomic evidence for strong reproductive isolation in grasshopper mice (Onychomys). Ecology and Evolution doi.org/10.1002/ece3.5770
- Type:
Websites
Status:
Other
Year Published:
2019
Citation:
Translational Genomics Forum 2019
https://github.com/transgenomicsosu/Stillwater_Translational_Genomics_Forum_2019
|
Progress 10/01/17 to 09/30/18
Outputs Target Audience:
Individuals who study wheat genomics, genomic sequencing, mapping techniques, and quantitative genetics as they relate to grain yield, drought tolerance, and end-use quality traits. Scientists working on developing parametric and non-parametric algorithms for prediction purposes. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes.
Changes/Problems:
Our local computing resources continue to be the limiting factor for genomic analysis. To complete the proposed analyses, we have submitted a research allocation proposal to XSEDE (the National Science Foundation's Extreme Digital Program), requesting 72 TB of data storage and over 750,000 CPU hours on large shared-memory computing nodes.
What opportunities for training and professional development has the project provided?
Our research project covers wheat variety improvement, crop genetics, statistics, and bioinformatics, providing multi-disciplinary training opportunities for both graduate students and postdoctoral researchers. Postdoctoral researcher Dr. Willyerd and graduate student Bryan Naidenov attended training on parallel computing hosted by OSU's high-performance computing cluster. Graduate student Xiaowei Hu, who was previously trained in applied mathematics, has worked alongside Dr. Willyerd (a molecular geneticist) and graduate student Alex Lim (a biochemist) to reinforce her knowledge of biology and genetics and to broaden her career outlook. All postdoctoral researchers and graduate students attended an annual event for data science, plant breeding, and statistical genomics. Dr. Willyerd and graduate students Bryan Naidenov and Alex Lim also attended an annual meeting of the Midsouth Computational Biology and Bioinformatics Society.
Naidenov and Lim, who both work on informatics and machine learning algorithms, attended the 2018 Coalition for Advancing Digital Research and Education conference.
How have the results been disseminated to communities of interest?
SNP-select, including its data sets and algorithm, is stand-alone software; we have published it on the Translational Genomics Laboratory's GitHub site: https://github.com/transgenomicsosu/SNP-SELECT
What do you plan to do during the next reporting period to accomplish the goals?
1. Further development of the Bayesian multivariate model for multi-trait prediction
A multivariate model will be expanded to include multiple phenotypes in one GS run, to fully utilize the genetic correlation arising from shared quantitative trait loci (QTL).
2. Epigenomic regulation of drought tolerance in Oklahoma's hard red winter wheat
Three doubled haploid (DH) lines derived from the Duster and Billings intercross population have been selected for controlled-environment experiments. The DH lines and the parental varieties have been subjected to drought conditions that simulate the 2014 drought in the region. Leaf samples from three biological replicates of these varieties have been collected at critical developmental stages under both the control and drought treatments. The research will investigate the impacts of drought-induced variation in methylation and chromatin structure. Protocols for reduced-representation bisulfite sequencing, as well as for the titration of chromatin digestion with micrococcal nuclease, have been optimized. When sequencing is complete, genome-guided alignment will be performed using the current bread wheat reference assembly. The resulting hypersensitivity and hyposensitivity of these epigenomic regulations will be mapped together with the current genome-wide association and QTL mapping results.
3. Duster genome assembly
To avoid off-target alignment issues, as well as to uncover the unique genomic adaptations of Oklahoma's winter wheat varieties, we will continue the effort to assemble the Duster genome using ultra-long read sequencing technology. Currently, over 40 flow cells' worth of long-read data has been collected; base-calling to generate fast5 files is ongoing on OSU's supercomputer.
Impacts What was accomplished under these goals?
1. Effectiveness of genomic prediction by response to selection, for grain yield and end-use quality traits under drought conditions
Considering the practicality of applying genomic selection (GS) in the line development stage of a hard red winter (HRW) wheat variety development program, we evaluated the effectiveness of GS by prediction accuracy, as well as by the response to selection, across field seasons that posed challenges for crop improvement under significant climate variability. Important breeding targets for HRW wheat improvement in the southern Great Plains of the USA, including grain yield, kernel weight, wheat protein content, and sodium dodecyl sulfate (SDS) Sedimentation Volume as a rapid test for predicting bread-making quality, were used to estimate the effectiveness of GS across harvest years from 2014 (drought) to 2016 (normal). In general, our GS results show that the nonparametric algorithms RKHS (reproducing kernel Hilbert space) and RF (random forest) produced higher accuracies in both same-year/environment cross-validations and cross-year/environment predictions for line selection in a bi-parental doubled haploid (DH) HRW population. For OSU's wheat variety development program, accurate and stable selection of superior breeding lines across experimental trials could still be challenging under worsening drought conditions. To ensure long-term response to selection, our results suggest that there are cases where phenotypic selection would still be preferable, or where retraining with updated phenotypes should be performed. In principle, the superiority of GS was most notable when the selection intensity was high and when a large amount of training information was available. Notably, our findings indicate that training conducted in sub-optimal conditions could still provide GS predictability while maintaining a desirable response to selection for both grain yield and SDS Sedimentation Volume.
The reverse, however, is not true. In our case study, when predicting line performance under sub-optimal conditions (for example, under drought) from information trained in normal growing conditions, additional phenotyping in the target sub-optimal environment would be required to achieve a desirable selection response; this was most obvious for grain yield when selection intensity was high. In other words, when making selection decisions for trials under unexpected environmental stress, like the frequent drought in the southern Great Plains of the USA, using GS trained in optimal growing conditions could very likely result in unreliable outcomes. Further, the stability of prediction performance was greatest for SDS Sedimentation Volume but least for wheat protein content, making SDS Sedimentation Volume a worthy candidate for GS in wheat variety development programs.
2. Optimization of training population for genomic prediction in sub-optimal growing environments
This study also evaluated the effectiveness of genomic prediction with respect to the composition of the training population. Our findings suggest that, overall, when the training population is optimized, improved GS performance can be expected. A simple and straightforward way to optimize the training population for prediction is to maximize phenotypic variation. In addition to the conventional two-tailed training population design, our study also investigated other approaches to constructing the training population, such as two-tailed selection on genomic estimated breeding values and a training population formed by the majority vote of both genomic estimated breeding values and raw phenotypes.
Using grain yield as an example of a polygenic trait, a broadly appropriate guideline is as follows: when training is performed in normal growing conditions, straightforward GS approaches with an intermediate training population size should be considered for high selection intensity; when training is performed in a stressed growing condition at a high selection intensity, an optimized training population formed by majority vote could result in a long-term advantage. The latter scenario was also beneficial for end-use quality traits like SDS Sedimentation Volume and Kernel Weight. Also, with a heritability estimate of 0.74 and appreciable phenotypic correlation coefficients across environments, the stability of genomic selection performance and of the response to selection across environmental variability makes SDS Sedimentation Volume a worthy candidate for genomic selection in wheat variety development programs.
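The two-tailed training population design referred to above can be sketched in a few lines. This is a minimal illustration under the assumption that "two-tailed" means taking equal numbers of lines from the lowest and highest ends of the ranked phenotype (or genomic estimated breeding value) distribution; the function name `two_tailed_training_set` is hypothetical, and the majority-vote variant is not shown because the report does not define it in detail.

```python
import numpy as np

def two_tailed_training_set(values, n_train):
    """Return sorted indices of the n_train most extreme lines (sketch).

    values  : phenotypes or GEBVs, one per candidate line
    n_train : training set size; split between the lower and upper tails
    """
    values = np.asarray(values, dtype=float)
    if not 0 < n_train <= len(values):
        raise ValueError("n_train must be between 1 and the number of lines")
    order = np.argsort(values)           # ascending rank of each line
    n_low = n_train // 2
    n_high = n_train - n_low             # upper tail gets the extra line if odd
    low_tail = order[:n_low]
    high_tail = order[len(order) - n_high:]
    return np.sort(np.concatenate([low_tail, high_tail]))
```

Because the selected lines come from both extremes, the phenotypic variance of the training set is typically larger than that of the full population, which is the property the text describes as maximizing phenotypic variation.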
Publications
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Hu, X., L. Zhu and C. Chen. 2018. Bayesian Multi-variate Weighted Kernel Genomic Prediction. Joint Statistical Meeting, Vancouver BC.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Willyerd, K., S. Sun, Y. Gao, X. Hu, C. Powers, L. Yan, B. Carver, and C. Chen. 2018. Genomic Variation for Yield Stability and End-Use Quality in Hexaploid Wheat. Mid-South Computational Biology and Bioinformatics Society Conference. Starkville, Mississippi
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Launius, M., C. Chen and K. Willyerd. 2018. Investigating Stomatal Responses to Drought in Hard Red Winter Wheat. Department of Plant and Soil Science Research Symposium. Stillwater, Oklahoma
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Rodriguez, A., C. Chen and K. Willyerd. 2018. Exploring Transcriptional Variation in Drought Tolerant and Susceptible Winter Wheat Lines. Department of Plant and Soil Science Research Symposium. Stillwater, Oklahoma
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Naidenov, B. and C. Chen. 2018. Exposing the Hidden chromatin Regulatory Framework with Recurrent Deep Learning and Genomic Sequence Data. Mid-South Computational Biology and Bioinformatics Society Conference. Starkville, Mississippi
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2018
Citation:
Lim, A., B. Naidenov, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2018. Data Driven Genomic Surveillance of Microbial Drug Resistance Using Oxford Nanopore Single Molecule Sequencing Technology. Mid-South Computational Biology and Bioinformatics Society Conference. Starkville, Mississippi
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Sun, S., S. Maio, B. Ratcliffe, P. Campbell, Y.A. El-Kassaby, B. Balasundaram and C. Chen*. 2018. Variable Selection by Generalized Graph Domination. PLoS One. Available as a bioRxiv preprint.
- Type:
Journal Articles
Status:
Published
Year Published:
2018
Citation:
Thistlethwaite, F.R., B. Ratcliffe, J. Klápště, I. Porth, C. Chen, M.U. Stoehr and Y.A. El-Kassaby. 2018. Genomic Selection of Juvenile Height across a Single Generational Gap in Douglas-fir. Heredity
|
Progress 10/01/16 to 09/30/17
Outputs Target Audience:
Individuals who study wheat genomics, genomic sequencing, mapping techniques, and quantitative genetics as they relate to grain yield, drought tolerance, and end-use quality traits. Scientists working on developing parametric and non-parametric algorithms for prediction purposes. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students, including graduate students, who are studying the applicability of genomic selection for crop improvement purposes.
Changes/Problems:
Since September 2017, a high-quality reference genome sequence of bread wheat (IWGSC RefSeq v1.0) has been available through an agreement with the IWGSC. Taking advantage of IWGSC RefSeq v1.0, genome-guided HISAT alignment of the reads from DH169 and DH173 has been completed. However, the alignment took just over 11 days on the big-memory nodes of Cowboy, OSU's high-performance computing cluster, and the resulting .sam file was 1.3 TB for DH173 alone and 1.8 TB for DH169. Although the .sam files were removed immediately after the .bam files were generated, our current storage is insufficient even for the .bam files. Also, estimating read abundance (using RSEM) requires the raw sequence files (450 GB) along with the Trinity FASTA files; this procedure uses Bowtie and produces .bam files and other intermediates. RSEM therefore requires an estimated 800 GB or more of input files and over 4 TB of intermediates for the RNAseq read abundance calculation, a data size that by itself presents a great challenge for our current computing resources. To complete this analysis, a large storage supplement is needed for our current start-up allocation on XSEDE, the National Science Foundation's Extreme Digital Program. A supplement proposal has been submitted for storage and CPU hours on large-memory computing nodes.
From previous experience running these algorithms, our sizeable datasets demand the use of a more powerful cluster than what is locally available. We are currently preparing a full application to XSEDE. The success of this application is critical for our upcoming RNAseq datasets, which include all 24 genotypes.
What opportunities for training and professional development has the project provided?
This research project has created multi-disciplinary training opportunities for both graduate students and postdoctoral researchers. Postdoctoral researcher Dr. Willyerd and graduate student Bryan Naidenov attended training on parallel computing hosted by OSU's high-performance computing cluster. Graduate student Xiaowei Hu has worked alongside a crop geneticist and biochemists to reinforce her knowledge of biology and genetics and to broaden her career outlook. Dr. Willyerd attended an annual event for plant breeding graduate students and postdoctoral scientists. Graduate students Bryan Naidenov and Alex Lim attended the annual meeting of the Midsouth Computational Biology and Bioinformatics Society in Little Rock, Arkansas. Naidenov and Lim, who both work on machine learning algorithms for prediction on supercomputers, attended the 2018 Coalition for Advancing Digital Research and Education conference.
How have the results been disseminated to communities of interest?
The SNP-select algorithm is stand-alone software; we have published it on the Translational Genomics Laboratory's GitHub site: https://github.com/transgenomicsosu/SNP-SELECT
What do you plan to do during the next reporting period to accomplish the goals?
1. Genomic prediction for winter wheat improvement
A multivariate model will be developed to incorporate multi-year and multi-location trials to improve prediction accuracy. In a conventional setting, at least three years of yield trials would be conducted before selection for variety development.
To account for the evident genotype x environment interaction (GxE), more sophisticated models are required.
2. Drought-tolerance association for Oklahoma's hard red winter wheat population
Three doubled haploid (DH) lines derived from the Duster and Billings intercross population have been selected for controlled-environment experiments. The DH lines and the parental varieties have been subjected to severe drought conditions, and tissues have been sampled at critical developmental stages. After total RNA extraction, samples were submitted for transcriptome sequencing. The research will perform genome-guided alignment and assemble transcripts that are differentially expressed between treatments. The resulting up- or down-regulated transcripts will be mapped together with the current genome-wide association and QTL mapping results.
3. Duster genome assembly
The laboratory has been investigating the capacity of ultra-long read sequencing technology. We have also implemented three high-molecular-weight DNA extraction protocols to obtain long DNA fragments. The de novo genome assembly will be conducted with up to 18 flow cells of sequencing capacity. We currently anticipate a sequencing yield of up to 800 million 1D reads, with an average read length of 5,000 base pairs. The reads will also be aligned to the current RefSeq v1.0 bread wheat assembly to distinguish the bread wheat core genome from the accessory genome specific to Oklahoma's winter wheat varieties.
Impacts What was accomplished under these goals?
1. Winter wheat genomic resource and association for drought tolerance
In total, 282 doubled haploid winter wheat lines derived from the intercross of Duster and Billings and members of the Dual Purpose Observation Nursery (DPON) were genotyped with genotyping-by-sequencing (GBS) technology, resulting in 289,222 SNPs before filtering. In addition, SNP markers located in functional genes were developed using capture technology. A total of 50K probes were designed for exome capture sequencing using public databases, including high-confidence gene models and CDS (MIPS v2.2), gene models in the wheat D genome (Aegilops tauschii) and A genome (Triticum urartu) progenitors, the 454 Titanium sequence reads from wheat cDNA libraries, and probes of the NimbleGen SNP array, which revealed 709,063 SNPs in functional regions before filtering. In total, 702,000 SNPs were derived from exome capture technology. These were merged with the GBS SNPs and anchored on the current bread wheat IWGSC_RefSeq_v1.0 reference assembly. After removing non-informative and erroneous SNPs, the release of build DB_v1.0.1 for the Duster and Billings DH population contains 16,383 quality SNPs (<25% missing data) with whole-genome coverage of ~96 SNPs/100 Mbp and approximately 50-50 segregation of parental genotypes. Genome-wide association mapping (GWAS) of these 242 DH lines identified SNP variants significantly associated with yield, wheat protein, and hardness traits, although the variation in associated SNP variants across the three field seasons signifies the impact of genotype x environment interaction. Significant GWAS associations co-localizing with the previously identified yield QTL on chromosome 1BS were found only in the low-precipitation years 2014 and 2015. RNAseq of a drought-tolerant genotype, DH169, revealed that drought stress influenced the differential expression of 6,936 transcripts (adjusted p < 0.05; log2FC < -3 or log2FC > 3), 4,989 of which represent the longest single isoform.
Differentially expressed transcripts mapped to 318 Arabidopsis and 156 Oryza sp. proteins categorized as stress responsive. Cross-referencing these 474 differentially expressed transcripts with yield-associated genomic sequences from the SNP data revealed nine transcripts aligning to chromosome 2 in the A, B, and D genomes. Our findings of stress-response genes responsible for yield maintenance pinpoint molecular breeding targets for the effort to battle food insecurity in this worsening drought climate.
2. Genomic prediction for grain yield and end-use quality traits
Genomic selection performance was evaluated for grain yield and end-use quality traits, including wheat protein content, hardness, and SDS traits. The project examined the predictability of eight algorithms: parametric algorithms (ridge regression BLUP, GBLUP, Bayes A, Bayes B, Bayes Cpi, and Bayesian LASSO) and non-parametric ones (random forest and reproducing kernel Hilbert space). Using within-year cross-validation with 100 replicates, a non-parametric algorithm such as random forest demonstrated higher prediction accuracy, outperforming the other algorithms by at least 7%. Cross-validation results for wheat protein and sodium dodecyl sulfate sedimentation (SDS) showed no significant difference among prediction algorithms. Variable-selection prediction methods such as Bayesian LASSO (BL) showed an advantage for hardness traits such as single kernel characterization system (SKCS) average weight. The observed difference in predictability reflects the underlying genetic architecture of phenotypic variation. For example, major genes responsible for hardness can be found on the short arm of chromosome 5D, while a small number of small-effect loci can also be identified.
In both within-year and cross-year validation, BL outperformed the other prediction algorithms as measured by Pearson's correlation, suggesting that selecting important SNP variables that may be in close LD with causative QTL improves prediction accuracy for kernel hardness. Further, the greater variability of MSE seen with the BL method indicates whether or not such important alleles were included in the training population. Compared with cross-year validation, within-year cross-validation showed obviously inflated accuracy, which exposes the weakness of cross-validation in capturing genotype x environment interaction. Cross-year validation not only better captures forward selection, but its estimates of predictability are also more likely to reflect what genomic breeders can realistically anticipate.
3. Optimization of the training population
Selecting the training population has a demonstrated advantage. Using both grain yield and end-use quality traits, the performance of genomic prediction can be improved by as much as 20%. This is more evident when selection is made on grain yield.
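The contrast between within-year cross-validation and cross-year validation described in item 2 can be illustrated with a minimal sketch. It is not the project's pipeline: it assumes a simple closed-form ridge regression as a stand-in for RR-BLUP, synthetic 0/2 marker data, and a design in which the held-out lines are scored against either the same year's phenotypes (within-year) or another year's phenotypes (cross-year). All function and variable names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; returns (marker effects, intercept)."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - mu_y))
    return beta, mu_y - mu_x @ beta

def fold_accuracy(X, y_train_year, y_test_year, k=5, lam=10.0, seed=0):
    """k-fold accuracy (Pearson r): train on one year's phenotypes for the
    training lines, score held-out lines against y_test_year."""
    n = len(y_train_year)
    rng = np.random.default_rng(seed)
    pred = np.empty(n)
    for test_idx in np.array_split(rng.permutation(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        beta, b0 = ridge_fit(X[train], y_train_year[train], lam)
        pred[test_idx] = X[test_idx] @ beta + b0
    return float(np.corrcoef(y_test_year, pred)[0, 1])

# Synthetic example (assumed sizes, not the project's data).
rng = np.random.default_rng(1)
n_lines, n_snps = 120, 300
X = rng.choice([0.0, 2.0], size=(n_lines, n_snps))   # DH lines: homozygous calls
effects = np.zeros(n_snps)
effects[:20] = rng.normal(size=20)                   # shared genetic effects
gxe = np.zeros(n_snps)
gxe[:20] = rng.normal(size=20)                       # year-specific deviations
y_year_a = X @ effects + rng.normal(size=n_lines)
y_year_b = X @ (effects + gxe) + rng.normal(size=n_lines)

r_within = fold_accuracy(X, y_year_a, y_year_a)      # within-year cross-validation
r_cross = fold_accuracy(X, y_year_a, y_year_b)       # train on year A, score on year B
print(f"within-year r = {r_within:.2f}, cross-year r = {r_cross:.2f}")
```

In this toy setting any gap between the two correlations comes only from the simulated year-specific marker effects; real field data would also carry environmental covariates that the sketch ignores.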
Publications
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Chen, C. 2017. Treatment for data uncertainty in genomic prediction. IUFRO, Concepción, Chile.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Hu, Xiaowei, C. Chen and L. Zhu. 2017. Kernel-based Bayesian model for genomic selection. Joint Statistical Meeting, Baltimore
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen. 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. MCBIOS XIV, Little Rock, Arkansas.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., A. Lim, W. Johnson, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Predicting antibiotic resistance with Nanopore long-reads and machine learning. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Sun, S., Z. Miao, B. Ratcliffe, P. Campbell, Y. El-Kassaby, B. Balasundaram and C. Chen. 2017. SNP variable selection by generalized graph domination. MCBIOS XIV, Little Rock, Arkansas.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen. 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. MCBIOS XIV, Little Rock, Arkansas
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Lim, A., B. Naidenov, C.J. Crick and C. Chen. 2017. Predictability of Neural Network Models for Carotenoid Biofortification. MCBIOS XIV, Little Rock, Arkansas.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. CADRE conference, Stillwater, OK.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., H. Bates, A. Lim, K. Willyerd, K. Snider, M. Breshears, B.M. Couger, C. Chen and A. Ramachandran. 2017. A small device for a big challenge: surveillance of drug resistance in Mannheimia haemolytica using Nanopore single molecular sequencing technology. AAVLD, San Diego, CA.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Song, S.J., B. Carver, C. Powers, L. Yan, Y. El-Kassaby, J. Klápště and C. Chen. 2017. Practical application of genomic selection in a doubled-haploid winter wheat breeding program. Molecular Breeding 37:117. doi:10.1007/s11032-017-0715-8
- Type:
Theses/Dissertations
Status:
Other
Year Published:
2017
Citation:
Song, S.J. 2017. Genomic selection in a Doubled Haploid Winter Wheat Population. M. Sc. Thesis
Progress 10/01/15 to 09/30/16
Outputs Target Audience:Individuals who study wheat genomics, genomic sequencing, mapping techniques, and quantitative genetics aspects related to grain yield, drought tolerance, and end-use quality traits. Scientists developing parametric and non-parametric algorithms for prediction. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students and graduate students who are studying the applicability of genomic selection for crop improvement. Changes/Problems:Since September 2017, a high-quality reference genome sequence of bread wheat (IWGSC RefSeq v1.0) has been available through an agreement with IWGSC. Taking advantage of IWGSC RefSeq v1.0, genome-guided HISAT alignment of the reads of DH169 and DH173 has been processed. However, the HISAT genome-guided alignment took just over 11 days to complete on the big-memory nodes of Cowboy, OSU's high-performance computing cluster, and the resulting .sam file was 1.3 TB for DH173 alone and 1.8 TB for DH169. Although the .sam files were removed immediately after the .bam files were generated, the current storage of our computing resource is insufficient even for storing the .bam files. Also, estimating read abundance (using RSEM) requires the raw sequence files (450 GB) along with the Trinity FASTA files; this procedure utilizes Bowtie and produces .bam files and other intermediates. To begin RSEM, an estimated 800 GB or more of input files and over 4 TB of intermediates are required for the RNAseq read abundance calculation, and this data size alone presents a great challenge for our current computing resource. To complete this analysis, a large storage supplement is needed for our current start-up allocation on XSEDE, the National Science Foundation's Extreme Digital Program. A supplement proposal has been submitted for storage and CPU hours on large-memory computing nodes.
From previous experience running these algorithms, our sizeable datasets demand a more powerful cluster than what is locally available. We are currently preparing a full application to XSEDE. The success of this application is critical for our upcoming RNAseq data sets, which include all 24 genotypes. What opportunities for training and professional development has the project provided?This research project has created multi-disciplinary training opportunities for both graduate students and postdoctoral researchers. Postdoctoral researcher Dr. Willyerd and graduate student Bryan Naidenov have joined training sessions on parallel computing hosted by OSU's high-performance computing cluster. Graduate student Xiaowei Hu has worked alongside crop geneticists and biochemists to reinforce her knowledge of biology and genetics and to broaden her career outlook. Dr. Willyerd has attended the annual event for plant breeding graduate students and postdoctoral scientists. Graduate students Bryan Naidenov and Alex Lim joined the annual meeting of the Midsouth Computational Biology and Bioinformatics Society in Little Rock, Arkansas. Naidenov and Lim, who both work on machine learning algorithms for prediction on supercomputers, attended the 2018 Coalition for Advancing Digital Research and Education conference. How have the results been disseminated to communities of interest?SNP-select is standalone software; we have published it on the Translational Genomics Laboratory's GitHub site: https://github.com/transgenomicsosu/SNP-SELECT What do you plan to do during the next reporting period to accomplish the goals?
1. Genomic prediction for winter wheat improvement
A multivariate model will be developed to incorporate multi-year and multi-location trials to improve prediction accuracy. In a conventional setting, at least three years of yield trials would be conducted before selection for variety development.
To account for the evident GxE, more sophisticated models are required.
2. Drought-tolerance association for Oklahoma's hard red winter wheat population
Three doubled haploid lines derived from the Duster and Billings intercross population have been selected for controlled-environment experiments. The DH lines and the parental varieties have been subjected to severe drought conditions, and tissues have been sampled at critical developmental stages. After total RNA extraction, samples were submitted for transcriptome sequencing. The research will perform genome-guided alignment and assemble transcripts that are differentially expressed under the treatments. The resulting up- or down-regulated transcripts will be mapped together with the current genome-wide association and QTL mapping results.
3. Duster genome assembly
The laboratory has been investigating the capacity of ultra-long read sequencing technology. We have also implemented three high molecular weight DNA extraction protocols to obtain long DNA fragments. The de novo genome assembly will be conducted with up to 18 flow cells of sequencing capacity. We currently anticipate a sequencing yield of up to 800 million reads, with an average read length of 5,000 base pairs for 1D reads. The reads will also be aligned to the current RefSeq v1.0 bread wheat assembly to distinguish the bread wheat core genome from the accessory genome specific to Oklahoma's winter wheat varieties.
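The anticipated sequencing yield in item 3 can be turned into a rough total-bases and depth estimate. The sketch below only multiplies the upper-bound figures quoted above; the genome size is an assumption, inferred loosely from the SNP density reported elsewhere in this report (16,383 SNPs at ~96 SNPs per 100 Mbp), and the variable names are hypothetical.

```python
# Upper-bound figures quoted in the plan above.
reads = 800_000_000
mean_read_length_bp = 5_000
total_bases = reads * mean_read_length_bp        # total sequenced bases

# Assumption: genome size implied by the reported SNP density
# (16,383 SNPs at ~96 SNPs per 100 Mbp), i.e. roughly 17 Gbp.
genome_size_bp = 16_383 / 96 * 100e6

depth = total_bases / genome_size_bp             # average fold coverage
print(f"total bases: {total_bases / 1e12:.1f} Tbp")
print(f"assumed genome size: {genome_size_bp / 1e9:.1f} Gbp")
print(f"implied depth: {depth:.0f}x")
```

These are ceiling values; the realized depth depends on actual flow cell output and read-length distribution, neither of which is known from the report.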
Impacts What was accomplished under these goals?
1. Winter wheat genomic resource and association for drought tolerance
In total, 282 doubled haploid (DH) winter wheat lines derived from the intercross of Duster and Billings and members of the Dual Purpose Observation Nursery (DPON) were genotyped with genotyping-by-sequencing (GBS) technology, resulting in 289,222 SNPs before filtering. In addition, SNP markers located in functional genes were developed using capture technology. A total of 50K probes were designed for exome capture sequencing using public databases, including high-confidence gene models and CDS (MIPS v2.2), gene models from the wheat D genome (Aegilops tauschii) and A genome (Triticum urartu) progenitors, 454 Titanium sequence reads from wheat cDNA libraries, and probes of the NimbleGen SNP array, which revealed 709,063 SNPs in functional regions before filtering. In total, 702,000 SNPs were derived from exome capture technology. These were merged with the GBS SNPs and anchored on the current bread wheat IWGSC_RefSeq_v1.0 reference assembly. After removing non-informative and erroneous SNPs, the release of build DB_v1.0.1 for the Duster and Billings DH population contains 16,383 quality SNPs (<25% missing data ratio), with whole-genome coverage of ~96 SNPs per 100 Mbp and approximately 50-50 segregation of parental genotypes. Genome-wide association mapping (GWAS) of these 242 DH lines identified SNP variants significantly associated with yield, wheat protein, and hardness traits, although the variation in associated SNP variants across the three field seasons signifies the impact of genotype x environment interaction. Significant GWAS associations co-localizing with the previously identified yield QTL on chromosome 1BS were found only in the low-precipitation years 2014 and 2015. RNAseq of a drought-tolerant genotype, DH169, revealed that drought stress influenced differential expression of 6,936 transcripts (adjusted p < 0.05; log2FC < -3 or > 3), 4,989 of which represent the longest single isoform.
Differentially expressed transcripts mapped to 318 Arabidopsis and 156 Oryza sp. proteins categorized as stress responsive. Cross-referencing these 474 differentially expressed transcripts with yield-associated genomic sequences from the SNP data revealed nine transcripts aligning to chromosome 2 in the A, B, and D genomes. Our findings of stress-response genes responsible for yield maintenance pinpoint molecular breeding targets for the effort to battle food insecurity in this worsening drought climate.
2. Genomic prediction for grain yield and end-use quality traits
Genomic selection performance was evaluated for grain yield and end-use quality traits, including wheat protein content, hardness, and SDS traits. The project examined the predictability of eight algorithms: parametric algorithms (ridge regression BLUP, GBLUP, Bayes A, Bayes B, Bayes Cpi, and Bayesian LASSO) and non-parametric ones (random forest and reproducing kernel Hilbert space). Using within-year cross-validation with 100 replicates, a non-parametric algorithm such as random forest demonstrated higher prediction accuracy, outperforming the other algorithms by at least 7%. Cross-validation results for wheat protein and sodium dodecyl sulfate sedimentation (SDS) showed no significant difference among prediction algorithms. Variable-selection prediction methods such as Bayesian LASSO (BL) showed an advantage for hardness traits such as single kernel characterization system (SKCS) average weight. The observed difference in predictability reflects the underlying genetic architecture of phenotypic variation. For example, major genes responsible for hardness can be found on the short arm of chromosome 5D, while a small number of small-effect loci can also be identified.
In both within-year and cross-year validation, BL outperformed the other prediction algorithms as measured by Pearson's correlation, suggesting that selecting important SNP variables that may be in close LD with causative QTL improves prediction accuracy for kernel hardness. Further, the greater variability of MSE seen with the BL method indicates whether or not such important alleles were included in the training population. Compared with cross-year validation, within-year cross-validation showed obviously inflated accuracy, which exposes the weakness of cross-validation in capturing genotype x environment interaction. Cross-year validation not only better captures forward selection, but its estimates of predictability are also more likely to reflect what genomic breeders can realistically anticipate.
3. Optimization of the training population
Selecting the training population has a demonstrated advantage. Using both grain yield and end-use quality traits, the performance of genomic prediction can be improved by as much as 20%. This is more evident when selection is made on grain yield.
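The differential-expression cut-offs quoted under item 1 (adjusted p < 0.05 together with a log2 fold change beyond 3 in either direction, read here as |log2FC| > 3) amount to a two-condition filter. The sketch below shows that filter on a few made-up records; the record layout, transcript IDs, and values are hypothetical and are not taken from the project's RNAseq output.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    transcript_id: str   # hypothetical identifier
    log2fc: float        # log2 fold change, drought vs. control
    padj: float          # multiple-testing adjusted p-value

def is_differentially_expressed(t, padj_max=0.05, abs_log2fc_min=3.0):
    """Keep a transcript only if it passes both thresholds (strict inequalities)."""
    return t.padj < padj_max and abs(t.log2fc) > abs_log2fc_min

# Illustrative records only.
records = [
    Transcript("tx_up", 4.2, 0.001),     # strongly up-regulated, significant
    Transcript("tx_down", -3.5, 0.01),   # strongly down-regulated, significant
    Transcript("tx_weak", 1.0, 0.001),   # significant but small effect
    Transcript("tx_ns", 5.0, 0.2),       # large effect but not significant
    Transcript("tx_edge", 3.0, 0.01),    # exactly at the cut-off, excluded
]
hits = [t.transcript_id for t in records if is_differentially_expressed(t)]
print(hits)
```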
Publications
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., A. Lim, W. Johnson, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Predicting antibiotic resistance with Nanopore long-reads and machine learning. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Sun, S., Z. Miao, B. Ratcliffe, P. Campbell, Y. El-Kassaby, B. Balasundaram and C. Chen. 2017. SNP variable selection by generalized graph domination. MCBIOS XIV, Little Rock, Arkansas.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen. 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. BMBGSA 14th Annual Research Symposium in Biological Sciences. Stillwater, OK.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. MCBIOS XIV, Little Rock, Arkansas
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Chen, C. 2017. Treatment for data uncertainty in genomic prediction. IUFRO, Concepción, Chile.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Hu, Xiaowei, C. Chen and L. Zhu. 2017. Kernel-based Bayesian model for genomic selection. Joint Statistical Meeting, Baltimore
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Willyerd, K., S. Sun, Y. Gao, Xiaowei Hu, C. Powers, L. Yan, B. Carver, and C. Chen. 2017. Buster_Hmp v0.4.1, an integrated genomic resource for development, exploitation and crop improvement for hard red winter wheat. MCBIOS XIV, Little Rock, Arkansas.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Lim, A., B. Naidenov, C.J. Crick and C. Chen. 2017. Predictability of Neural Network Models for Carotenoid Biofortification. MCBIOS XIV, Little Rock, Arkansas.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., A. Lim, K. Willyerd, H. Hwang, N. Torres, J. Gustafson, P. Hoyt and C. Chen. 2017. Novel gene discovery by genome completion through de novo assembly of long-reads. CADRE conference, Stillwater, OK.
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2017
Citation:
Naidenov, B., H. Bates, A. Lim, K. Willyerd, K. Snider, M. Breshears, B.M. Couger, C. Chen and A. Ramachandran. 2017. A small device for a big challenge: surveillance of drug resistance in Mannheimia haemolytica using Nanopore single molecular sequencing technology. AAVLD, San Diego, CA.
- Type:
Journal Articles
Status:
Published
Year Published:
2017
Citation:
Song, S.J., B. Carver, C. Powers, L. Yan, Y. El-Kassaby, J. Klápště and C. Chen. 2017. Practical application of genomic selection in a doubled-haploid winter wheat breeding program. Molecular Breeding 37:117. doi:10.1007/s11032-017-0715-8
- Type:
Theses/Dissertations
Status:
Other
Year Published:
2017
Citation:
Song, S.J. 2017. Genomic selection in a Doubled Haploid Winter Wheat Population. M. Sc. Thesis.
Progress 09/19/15 to 09/30/15
Outputs Target Audience:Individuals studying wheat genomics, genomic sequencing and mapping techniques, and quantitative genetics related to grain yield components. Scientists working on genomic selection algorithms. Members of the Wheat Improvement Team at the Oklahoma Agricultural Experiment Station, and students and graduate students who are studying the applicability of genomic selection for crop improvement. Changes/Problems:
Nothing Reported
What opportunities for training and professional development has the project provided?The project has provided postdoctoral research associate Dr. Karyn Willyerd with opportunities to combine her molecular genetics knowledge with computational and statistical tools. Dr. Willyerd was also accepted to, and participated in, the Workshop on Cereal Genomics at the Cold Spring Harbor Laboratory, New York, to further her skill set and knowledge for upcoming large-scale genomic analysis. The workshop was held October 19-25, 2016. PhD student Xiaowei Hu, working primarily on optimal training design for the Duster x Billings DH population, was selected to participate in the Joint Statistical Meetings in Chicago, July 30 - August 4, 2016. Her thesis work on the optimization was presented in the workshop targeting quantitative and statistical genetics. Working on Illumina short-read data alignment and the SNP imputation problem, PhD student Shuzhen Sun used this opportunity to attend the Illumina workshop at the Oklahoma Medical Research Foundation on October 12, 2016. Miss Sun is currently working on SNP calling, imputation, and data integration. Her training background is in variable reduction algorithms, and the Illumina workshop gave her the needed knowledge of next-seq data generation. Undergraduate student Suzi Barboza-Pachero also attended the 58th Maize Genetics Conference. For her undergraduate research in wheat genomic selection simulation, Miss Barboza-Pachero was awarded travel funds to join this meeting in Jacksonville, Florida, March 17-20, 2016. Miss Barboza-Pachero was also selected to join the Boyce Thompson Institute at Cornell University for a summer internship. At the Boyce Thompson Institute, she was part of the database testing team, where she worked alongside database programmers and created a MySQL data loading script to examine the efficiency of querying and filtering large-volume SNP tables.
How have the results been disseminated to communities of interest?The proposed research program is only one year old, and the results of this project have not yet been disseminated. However, owing to the close relationship between Chen's research program and the Wheat Improvement Team (WIT) at OSU, Dr. Brett Carver, the Regents Professor and OSU Wheat Genetics Chair who leads WIT, has consulted the current results and compared them with selection conducted using previous yield data. What do you plan to do during the next reporting period to accomplish the goals?The first sequencing short reads used to generate the current SNP profile for the 'Duster x Billings' population will be pooled with the recent GBS1038, GBS1039, and GBS1040 data. SNP determination considering different genome coverage will be executed as proposed. With pooled short-read data, we expect better marker data quality with a lower missing data ratio. In addition, as a result of the increased genome coverage, the impact of sequencing read depth on statistical parameter estimation, such as the Euclidean distance measure (the D matrix) in the RKHS model and the criterion for variable selection, can be investigated. Upon completion of this research, the Illumina short-read raw data will be made available to the community to promote data reusability. We will also provide both the SNP and phenotype tables through publications. Currently, the 2016 grain yield data from the 'Duster x Billings' population are being processed. When ready, genomic selection on this new phenotype will be evaluated for another generation (grain yield 2016) to complete the two-generation validation proposed in the research proposal, with the 2016 grain yield data serving as the validation population. A number of predictive algorithms, including both parametric and non-parametric methods, as well as the script for cross-validation, have been implemented on our local machine.
Models will be trained on phenotypic data from 2015, and their predictions for the 2016 data will be used to evaluate the direct impact of genotypic information on predictability. The effectiveness of adopting genomic selection can also be assessed by directly comparing genomic prediction with the traditional practice of phenotypic selection. Finally, in this year's research plan, we will explore variable selection strategies based on statistical correlation, linkage analysis, and machine learning algorithms such as k-domination, with the objective of maximizing predictability for grain yield. The performance of prediction models on the optimal training data set will also be studied.
Impacts What was accomplished under these goals?
1. Major activities completed:
1.1. Genotypic information from the genotyping-by-sequencing (GBS) technology
DNA of the 282 'Duster x Billings' lines was extracted from seedlings in early spring of 2016. Following the protocols of Poland et al. (2012), Pst I and Msp I were chosen to perform genome complexity reduction. PCR products were amplified using a short extension time (less than 30 seconds) to enrich short fragments suitable for bridge amplification on the Illumina flow cells. The first run of sequencing has been completed, and raw reads from the second Illumina pass have now been analyzed. The merging of the 2014 and 2016 SNPs is currently underway. Therefore, only results from the first run of SNP calling are included in this report.
1.2. SNP calling and missing data imputation
Three criteria were used to filter quality SNP information for predictive analyses: (1) overall missing ratio < 50%, (2) heterozygosity < 5%, and (3) minor allele frequency > 5%, resulting in a total of 7,426 SNPs for the rest of the analysis. Missing SNP data were imputed with both the EM and the k-nearest neighbor algorithms. Only k-nearest neighbor imputed SNPs were used for building predictive models, owing to their superior imputation accuracy. In addition, read tags were aligned to the most up-to-date wheat pseudo-molecules using the BWA algorithm, with a minimum of 10 tag reads per alignment. Before data processing, 1,137,153 read tags could be aligned; on average, 7,888 SNPs were found per chromosome.
1.3. Comparisons of genomic selection algorithms
In total, seven predictive algorithms were examined using the 'Duster x Billings' population, including linear regression, the Bayesian alphabet and derived methods, and semi- and non-parametric algorithms. To provide an assessment that is close to breeding practice, cross-validation was examined with the 2014 and 2015 grain yield data.
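The three SNP quality filters listed in 1.2 (missing ratio < 50%, heterozygosity < 5%, minor allele frequency > 5%) can be sketched as a column mask over a genotype matrix. This is an illustrative sketch, not the project's pipeline: it assumes genotypes coded 0/1/2 (copies of one allele) with NaN for missing calls, computes heterozygosity and allele frequency among called genotypes only, and uses a hypothetical function name.

```python
import numpy as np

def filter_snps(G, max_missing=0.50, max_het=0.05, min_maf=0.05):
    """Return a boolean mask of SNP columns passing all three filters.

    G: lines x SNPs matrix coded 0/1/2, with np.nan for missing calls
    (an assumed encoding).
    """
    G = np.asarray(G, dtype=float)
    n_called = (~np.isnan(G)).sum(axis=0)
    missing = 1.0 - n_called / G.shape[0]
    safe_n = np.maximum(n_called, 1)                 # avoid division by zero
    het = (G == 1).sum(axis=0) / safe_n              # heterozygous call rate
    alt_freq = np.nansum(G, axis=0) / (2 * safe_n)   # frequency of the counted allele
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (missing < max_missing) & (het < max_het) & (maf > min_maf)

# Toy matrix: 10 lines x 4 SNPs, one SNP per failure mode.
good = [0, 2, 0, 2, 0, 2, 0, 2, 0, 2]                # passes all filters
sparse = [np.nan] * 6 + [0, 2, 0, 2]                 # 60% missing
het_rich = [1, 1, 0, 2, 0, 2, 0, 2, 0, 2]            # 20% heterozygous
mono = [0] * 10                                      # monomorphic, MAF = 0
G = np.column_stack([good, sparse, het_rich, mono])

keep = filter_snps(G)
G_filtered = G[:, keep]
print(keep, G_filtered.shape)
```

Whether heterozygosity and allele frequency should be computed over called genotypes or over all lines is a design choice the report does not specify; the sketch uses called genotypes.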
In summary, the penalized linear regression model (RR, ridge regression) was the most computationally efficient, outperforming the other algorithms by 10- to 77-fold in computing speed, whereas Bayesian LASSO might take over 12 hours to finish. When conventional 5-fold cross-validation was applied to evaluate performance, the random forest (RF) algorithm gave the highest predictability for 2014, the year in which severe drought occurred a few weeks before harvest; ridge regression was the best model for the 2015 phenotypes.
2. Specific objectives met:
2.1. Genomic prediction accuracy evaluation
Prediction accuracies were tested with 5-fold cross-validation. The genomic estimated breeding values (GEBVs) for each fold were predicted by training the model on the four remaining folds. The procedure iterates 5 times so that the observations in each fold can be compared with their own predicted values. A total of 10 random partitions were generated for each data set. Prediction performance was assessed by Pearson's correlation, Spearman's (rank) correlation, and mean squared error (MSE) between the observed phenotypic values and the cross-validated GEBVs. We recorded the average and standard deviation of these measurements after 10 rounds of 5-fold CV.
2.3. Optimal training population selection strategy established
In this study, five different scenarios (10%, 20%, 30%, 40%, and 50%) were investigated for each selection method. Optimal selection (OS) is expected to select fewer lines than the other three methods, because OS only considers the overlap of observed and predicted values for the extreme performers. The statistical power of our OS scheme was calculated by a bootstrapping procedure. We bootstrapped m (e.g., m=100) samples of size n (e.g., n=150) from the original 239 lines. In each bootstrapped sample, these n lines were treated as a new population. The above four selection methods were then applied to the new population to find its own new optimal training population (TP).
The prediction performance of each new TP can be evaluated under each scenario. The power of OS is the frequency with which OS beats the other three selection methods across the m bootstrapped samples.
3. Significant results achieved:
3.1. Preliminary genomic selection results
In general, all models performed similarly within a year. In 2014, the average predictability, measured as Pearson's correlation coefficient, was 56.8%, ranging from 55% to 58%. In 2015, the average was 55%, with random forest (RF) the highest at 58% and ridge regression the lowest at 53%. Among all predictive models, RF performed slightly better than the others, regardless of which year the training data came from. RF also outperformed all other models in the 2014 within-year cross-validation, even when ranked phenotype was used to evaluate performance. Surprisingly, the penalized linear RR model was the best algorithm for ranking the 2015 grain yield phenotype. A much-reduced predictability was observed in the cross-year results, suggesting strong gene-by-environment variation in field conditions between years. Overall, RF still outperformed the others in both scenarios (model trained on 2014 predicting 2015, and model trained on 2015 predicting 2014). The lowest predictability for 2015 grain yield resulted from the penalized linear RR model trained on 2014 data, indicating that a much stronger interaction was hindering the performance of predictive analyses owing to unaccounted year effects. The drought condition in 2015 was not as severe as in 2014, which is reflected in the predictability of 2014 grain yield, where prediction performance reached as high as 40% using the random forest algorithm. It is also worth mentioning that the variability of predictions was highest when the penalized linear model was used, which further confirms the lack of strength in linear, additive models.
3.2. Optimal training population selection
Owing to its superiority, results of optimal selection were based only on the random forest algorithm. Predictability was tested using 100 replicates of 5-fold cross-validation. The number of lines selected in the optimal TP increased as selection coefficients decreased. For example, the number of lines from OS ranged from 17 to 174 for year 2014 as selection increased from 10% to 50%. As for the power of the optimal selection (OS) methods, the accuracy of GS can be increased from 40% to 81% as the line coverage cut-off value increases from 10% to 50% and with different selection with respect to training information. When training information was optimized from year 2015, prediction accuracy reached its highest value of 81%. When year 2014 was used as the training population, predictability after optimization increased from 36% to 70%, showing a significant benefit of organizing phenotypes such that the likelihood of including the underlying QTLs is maximized.
4. Key outcomes and other accomplishments:
A significantly reduced prediction performance was observed when two-generation validation was used, indicating apparent gene-by-environment variation in our cross-year study. In both scenarios (data trained on 2014 predicting 2015, and data trained on 2015 predicting 2014), non-parametric models (RKHS and RF) outperformed parametric models (RR and BL). When heading date was included as a covariate, model performance increased. Our preliminary results suggest that, to account for climate variation between growing seasons, non-parametric algorithms capable of modeling interaction should be considered. Important factors such as the span of LD, trait heritability, the genetic architecture underlying trait variation, and marker density also need to be considered, as do the models used to assess predictability across environments.
Owing to its greatly reduced biological complexity, a single bi-parental DH population would be ideal for investigating the genetic components that influence predictability in scenarios closer to those of a breeding program.
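The evaluation procedure in 2.1 (5-fold cross-validation repeated over 10 random partitions, scored by Pearson's correlation, Spearman's rank correlation, and MSE, then summarized by mean and standard deviation) can be sketched as follows. This is a minimal sketch under stated assumptions: the model is a simple closed-form ridge regression standing in for any of the algorithms compared, the data are synthetic, ranks ignore ties, and all names are hypothetical.

```python
import numpy as np

def ranks(v):
    """Ordinal ranks (ties broken by position; adequate for continuous traits)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def ridge_fit_predict(X_train, y_train, X_test, lam=10.0):
    """Stand-in model: closed-form ridge regression on centered markers."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu_x
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X_train.shape[1]),
                           Xc.T @ (y_train - mu_y))
    return (X_test - mu_x) @ beta + mu_y

def repeated_kfold(X, y, fit_predict, k=5, repeats=10, seed=0):
    """Return (mean, sd) over repeats of [Pearson r, Spearman rho, MSE]."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        gebv = np.empty(len(y))
        for test_idx in np.array_split(rng.permutation(len(y)), k):
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            # Each fold is predicted from a model trained on the other k-1 folds.
            gebv[test_idx] = fit_predict(X[train], y[train], X[test_idx])
        scores.append([pearson(y, gebv),
                       pearson(ranks(y), ranks(gebv)),   # Spearman via ranks
                       float(np.mean((y - gebv) ** 2))])
    scores = np.array(scores)
    return scores.mean(axis=0), scores.std(axis=0, ddof=1)

# Synthetic example: 239 lines (as in the report) and an assumed 200 markers.
rng = np.random.default_rng(42)
X = rng.choice([0.0, 2.0], size=(239, 200))
y = X[:, :15] @ rng.normal(size=15) + rng.normal(size=239)
mean_scores, sd_scores = repeated_kfold(X, y, ridge_fit_predict)
print("mean [r, rho, MSE]:", np.round(mean_scores, 3))
print("sd   [r, rho, MSE]:", np.round(sd_scores, 3))
```

Any other learner can be swapped in by passing a different `fit_predict` callable with the same three-argument signature.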
Publications
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2016
Citation:
Chen, C., S. Sun, E.J. Schwarzkoph and Y.A. El-Kassaby. 2016. Missing data interpretation for non-referenced or semi-referenced genomes. Midsouth Computational Biology and Bioinformatics Society 2016 Conference (MCBIOS-XIII), March 03-05, Memphis, TN.
Abstract Identifying Number: 1006021
- Type:
Conference Papers and Presentations
Status:
Other
Year Published:
2016
Citation:
Hu, X., L. Zhu and C. Chen. 2016. Genomic prediction models on wheat doubled haploid population. The Joint Statistical Meetings, July 30- Aug 04, Chicago, IL
https://ww2.amstat.org/meetings/jsm/2016/onlineprogram/AbstractDetails.cfm?abstractid=321241