Non Technical Summary
Many important agricultural species are polyploids, i.e., they carry more than two sets of chromosomes. They range from staple food crops (potato, sweetpotato) to fruits (strawberry, kiwi, blueberry, banana), ornamental flowers (roses, chrysanthemum), forage crops, turfgrass, and sugar and energy crops (sugarcane). The transmission of genetic material across generations is much more intricate and harder to unravel in polyploids than in diploids such as maize, rice, and soybean. Although challenging, understanding inheritance patterns is essential in breeding programs. With these patterns correctly assessed, it is possible to associate specific genomic positions with important traits, or even to find the genes responsible for them, and to use this information in breeding.
In the last few years, we have developed a series of computational tools to help breeders and geneticists answer these questions by analyzing genomic data in polyploid species: VCF2SM and SuperMASSA for processing raw DNA sequence data and identifying genetic markers, MAPpoly for constructing genetic maps, and QTLpoly for locating genes that affect trait phenotypes and for performing prediction. Currently, our tools are implemented for a single, limited genetic design, the full-sib family, and we are extending them to multiple families. This project proposes to extend our polyploid genomic tools further, to the general multiple-generation pedigree breeding populations typically found in practical polyploid breeding programs. Moreover, we propose to develop a new downstream computational tool, DecisionPoly, that is user-friendly and offers clearly illustrated, actionable information to assist polyploid breeders in making short- and long-term breeding decisions, based on the collected and learned information about their breeding populations, for different breeding objectives.
Goals / Objectives
The main goal of this project is to develop a comprehensive, integrated, open-source, and publicly available data analysis pipeline for polyploid species that processes genomic data, infers the complex inheritance patterns from parents to offspring, maps genes that are important for breeding objectives, and offers breeders actionable information for making short- and long-term breeding decisions in practical breeding programs. Upstream, we aim to develop computational tools to handle different types of genomic data, call biallelic and multiallelic markers, combine genetic and genomic information from multiple breeding families, and build genetic models for complex population structures. Downstream, we aim to develop a user-friendly computational tool offering illustrated, actionable information to assist breeders in making breeding decisions, based on the collected and learned information about their breeding populations, for different breeding objectives.
For this project, the specific objectives are:
1. To further develop haplotyping algorithms that consider all the relevant information from complex breeding schemes with multiple generations and partially interconnected multiple families.
2. To extend the genetic models between genotypes and phenotypes to all scenarios presented in item 1, and to build sound and efficient statistical analysis procedures for genetic discovery and more accurate prediction.
3. To implement a user-friendly computational tool in R, using the Shiny package, to help breeders make short- and long-term decisions in their breeding programs.
Breeders will be able to use the results obtained in items 1 and 2 to make informed short-term decisions, such as which individuals to select and mate, as well as long-term decisions, which will be supported by interactive breeding exercises (predicting outcomes for various breeding decisions), optimization of decisions based on breeding objectives, and multiple-generation forward simulation exercises (predicting results for different breeding strategies).
Haplotype inference in complex pedigrees
To infer haplotypes in complex pedigrees, we will extend our previous work on constructing genetic maps in full-sib families. Our solution uses multilocus hidden Markov model (HMM) analysis and works for even ploidy levels, from diploids up to autooctaploids. Our current model takes probability-based marker dosages as input and recovers the multiple polyploid genotypes present in the segregating population, which are otherwise masked by the biallelic nature of SNPs. Given the flexibility of the HMM framework, we also expect to extend our model to use multiallelic markers.
The idea behind HMMs in genetic mapping is to use multiple linked markers to estimate the parental linkage phase and the genetic distances between markers, and to reconstruct the offspring haplotypes. This procedure can correct many potential genotyping errors. By drawing on interconnected sources of evidence (multiple SNPs and multiple individuals), the HMM aggregates information across SNPs and partially compensates for the inherently high error rate of individual marker dosage calls in polyploid species. Multilocus analysis for constructing genetic maps and offspring haplotypes in polyploids is therefore extremely important for information recovery and marker data quality control. We will implement these innovations in our open-source, publicly available R package MAPpoly.
Genetic models between genotypes and phenotypes in complex pedigrees
To integrate all the relevant information (genomic markers and trait phenotypes) in a complex pedigree breeding population for joint and informed analysis, we need a cohesive quantitative genetic model that applies to the whole breeding population. Such a model can be defined in terms of the alleles of the population founders, assuming that all the segregating alleles of breeding individuals can be traced back to the founders in probability.
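As an illustration only, one simple way to write a founder-allele model of this kind is shown below. The notation is an assumption made for this sketch and does not come from MAPpoly or QTLpoly: y_i is the phenotype of individual i, z_{iqk} is the expected number of copies of founder allele k that individual i carries at locus q (obtained from the haplotype probabilities), \alpha_{qk} is the effect of that allele, and only additive effects are included.

```latex
y_i = \mu + \sum_{q=1}^{Q} \sum_{k=1}^{K} z_{iqk}\, \alpha_{qk} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2_e)
```

Here Q is the number of loci considered and K is the total number of founder alleles in the pedigree; dominance and epistatic terms would enter as additional sums over pairs of alleles.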
The challenge is how to perform the genetic analysis efficiently and informatively.
There are two strategies. The first is to extend our current QTL random-effect genetic model for a full-sib family (implemented in QTLpoly) to a complex multi-generational breeding population, with the alleles and their effects defined in terms of the founders' alleles. Like computing the G matrix for GBLUP in genomic selection (GS), a corresponding matrix (a Q matrix) can be computed for each targeted QTL. Multiple Q matrices can be included in one model for multiple QTL. The statistical importance of each QTL can be evaluated through its variance component. This strategy is similar to the one proposed by Amadeu et al. (2020) for QTL mapping in tetraploid diallels in their software DiaQTL. However, a more efficient way to identify multiple QTL is needed in this case, as a sequential genome search is computationally inefficient. The second strategy may help in this respect.
The second strategy is, first, to build a large set of founder allele effects sampled at every specified interval position along the genome; then to submit that set of allelic effects to a LASSO analysis (Tibshirani, 1996), shrinking it to a small set; and finally to identify from this small set the potential QTL positions for evaluation. Some combination of the two strategies may help identify both QTL and their relevant allelic effects in a computationally more manageable way. We can first try to find significant QTL additive effects and then, conditional on those, try to find their essential dominance effects. This analysis can be used for genetic discovery (identification of QTL and of additive, dominance, and epistatic allelic effects). It can also be optimized to predict the performance of individuals in future generations. This prediction can be used for selection and mating design and for optimizing experimental designs (which can be aided by forward simulation).
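The second strategy can be illustrated with a minimal Python sketch. It is not the QTLpoly implementation, and every name in it (simulate_founder_dosages, lasso_cd, the simulation settings) is hypothetical. It assumes a single tetraploid full-sib family rather than a complex pedigree, known 0/1 founder-homolog dosages rather than probabilities, positions sampled independently (no linkage), and a hand-written coordinate-descent LASSO in place of any particular package.

```python
import numpy as np

def simulate_founder_dosages(n_ind, n_pos, rng, ploidy=4):
    """0/1 dosage of each of the 2*ploidy parental homologs at each position.
    Each offspring receives ploidy/2 homologs from each parent; positions are
    sampled independently, which ignores linkage (a simplifying assumption)."""
    n_hom = 2 * ploidy
    Z = np.zeros((n_ind, n_pos, n_hom))
    for i in range(n_ind):
        for q in range(n_pos):
            for parent in range(2):
                picked = rng.choice(ploidy, size=ploidy // 2, replace=False)
                Z[i, q, parent * ploidy + picked] = 1.0
    return Z

def soft_threshold(x, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                      # residual y - X b, with b = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue              # skip constant columns
            r += X[:, j] * b[j]       # remove column j's current contribution
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]       # add the updated contribution back
    return b

rng = np.random.default_rng(1)
n_ind, n_pos, ploidy = 300, 10, 4
n_hom = 2 * ploidy
true_pos, true_hom = 3, 5             # simulated QTL: one homolog at one position

Z = simulate_founder_dosages(n_ind, n_pos, rng, ploidy)
y = 2.0 * Z[:, true_pos, true_hom] + rng.normal(size=n_ind)

# One column per (position, founder homolog); center predictors and response.
X = Z.reshape(n_ind, n_pos * n_hom)
X = X - X.mean(axis=0)
y = y - y.mean()

b = lasso_cd(X, y, lam=0.15)

# Positions whose founder allele effects survive shrinkage are QTL candidates.
score = np.abs(b).reshape(n_pos, n_hom).sum(axis=1)
candidates = np.flatnonzero(score > 0)
print("candidate positions:", candidates)
print("strongest position:", int(np.argmax(score)))
```

In this sketch the candidate positions would then be passed to the first strategy, where each one is evaluated with its own Q matrix and variance component. The penalty lam is fixed by hand here; in practice it would be chosen by cross-validation or a similar criterion.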
Identifying significant dominance allelic effects is particularly important for variety development because they contribute to heterosis or specific combining ability. Of course, our computational tool will also offer an option to compute the standard genome-wide G matrix and use it for GBLUP in GS, at least for comparison. We will implement these innovations in our open-source, publicly available R package QTLpoly.
Development of a terminal tool for breeders - DecisionPoly
To truly help breeders incorporate genomic information in their breeding programs, we need to make an extra effort to put the downstream tool in the hands of breeders and help them make breeding decisions directly. For this purpose, we ask ourselves: what do breeders need? We think they need a computational tool that can assist them in making short- and long-term breeding decisions, based on the collected and learned information about their breeding populations, for different breeding objectives.
Data analysis using the upstream computational tools (SuperMASSA, MAPpoly, and QTLpoly) requires considerable proficiency in R and an understanding of the scientific and technical issues behind the tools. Thus, we will implement DecisionPoly, an easy-to-use interactive application that can access the relevant information from MAPpoly and QTLpoly, display information and results graphically, provide options for users to perform further breeding analyses and simulations, and give breeders direct control over the decision-making process.
We will implement this application using Shiny, an R package that makes it easy to build interactive, cross-platform web apps straight from R, with direct connections to databases and to the upstream R packages (MAPpoly and QTLpoly). It is highly interactive, can perform analyses, communicates results to users graphically, and tells the story behind the data.
We can host this standalone app on a webpage, embed it in R Markdown documents, or use it to build dashboards.
We will develop our tools following the standardized technical specifications of the Breeding API (BrAPI; https://brapi.org/) to facilitate communication between different breeding platforms. We will also implement other ways to import and export datasets, such as CSV and Excel files, since these formats are prevalent in many breeding programs. We will also follow the FAIR guidelines to make our analyses, code, and datasets findable, accessible, interoperable, and reusable for the scientific community, especially breeders.