Source: CORNELL UNIVERSITY submitted to
DEVELOP STATISTICAL GENETIC ANALYSES FOR HIGH-DIMENSIONAL DATA
Sponsoring Institution
Agricultural Research Service/USDA
Project Status
NEW
Funding Source
Reporting Frequency
Annual
Accession No.
0430195
Grant No.
(N/A)
Project No.
8062-21000-038-07A
Proposal No.
(N/A)
Multistate No.
(N/A)
Program Code
(N/A)
Project Start Date
Mar 1, 2016
Project End Date
Nov 30, 2016
Grant Year
(N/A)
Project Director
JANNINK J
Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA, NY 14853
Performing Department
PLANT BREEDING
Non Technical Summary
(N/A)
Animal Health Component
(N/A)
Research Effort Categories
Basic
50%
Applied
50%
Developmental
0%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
201                    1549                              1080                      100%
Goals / Objectives
Genes across the genome interact with one another throughout development to produce the crop traits that we observe. To date, we have modeled these interactions poorly, so our trait predictions do not account for them. There have been two weaknesses in the way we model interactions. First, we model them as occurring uniformly over the whole genome, when in fact interactions may be local to specific segments of the genome. Modeling them as local has two potential benefits: it narrows the search space, making it less likely that we will incorporate noise into the model; and, because whole chromosome arms are often passed from parent to progeny, the effects of local interactions may be transmitted intact. Second, we have primarily used parametric forms of interaction that combine orthogonal modes of gene action, such as dominance and additive-by-additive epistasis, yet we have no compelling evidence that gene action actually follows such parametric forms. We plan to use the machine learning technique of random forests within local genomic segments to model epistasis. Random forests are highly flexible in the form of interaction they can fit, which relieves the model of the constraints imposed by gene action assumptions.
Project Methods
A genomic relationship matrix is a grid in which individuals are represented in both rows and columns. The value at the intersection of a row and a column indicates how closely that pair of individuals is expected to resemble each other. Generally, this expected resemblance is calculated simply from the overall similarity of the two individuals' genomes. This approach ignores the fact that resemblance may depend on combinations of genes rather than on genes singly. Random forests is a machine learning algorithm that identifies influential combinations of genes in addition to single genes. We will apply random forests to segments of the genome and construct relationship matrices from the outputs. We will then use those relationship matrices within the same linear statistical model frameworks that generate predictions from the simpler standard matrices. We plan to develop software for this model and fit it in the summer of 2016, and to test it on wheat, barley, and cassava data in the fall of 2016.
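The contrast between a standard genome-wide matrix and segment-level random forest matrices can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's software: the record does not say how relationship matrices are built from random forest output, so the sketch assumes the common "proximity" construction (the share of trees in which two individuals land in the same terminal leaf). The function names, the fixed-width segments, the simulated 0/1/2 marker data, and all tuning values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def standard_grm(markers):
    """Genome-wide relationship matrix from 0/1/2 allele counts.

    Uses a common allele-frequency-centered cross-product; assumed here
    as one example of an 'overall similarity' matrix.
    """
    freq = markers.mean(axis=0) / 2.0           # allele frequency per marker
    centered = markers - 2.0 * freq             # center each marker column
    scale = 2.0 * np.sum(freq * (1.0 - freq))   # sum of expected marker variances
    return centered @ centered.T / scale


def segment_proximity(markers, phenotype, n_trees=50, seed=0):
    """Random forest proximity matrix for one genomic segment.

    Assumption (not from the source): relatedness is the fraction of
    trees in which two individuals fall in the same terminal leaf.
    """
    forest = RandomForestRegressor(
        n_estimators=n_trees, min_samples_leaf=5, random_state=seed
    )
    forest.fit(markers, phenotype)
    leaves = forest.apply(markers)              # (n_individuals, n_trees) leaf ids
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)


def local_relationship_matrices(markers, phenotype, segment_size):
    """One proximity matrix per consecutive block of `segment_size` markers."""
    starts = range(0, markers.shape[1], segment_size)
    return [
        segment_proximity(markers[:, s:s + segment_size], phenotype, seed=i)
        for i, s in enumerate(starts)
    ]


# Simulated data (hypothetical): 40 individuals, 30 markers, and a trait
# driven by an interaction between two markers in the first segment.
rng = np.random.default_rng(1)
markers = rng.binomial(2, 0.5, size=(40, 30)).astype(float)
phenotype = markers[:, 0] * markers[:, 1] + rng.normal(0.0, 0.5, size=40)

genome_wide = standard_grm(markers)
local_mats = local_relationship_matrices(markers, phenotype, segment_size=10)
print(genome_wide.shape, len(local_mats))
```

Each matrix in `local_mats` could then take the place of the standard matrix as a covariance structure in a linear mixed model. Because the forests are fitted to the phenotype, a prediction study would need to fit them on training individuals only, so that information from the individuals being predicted does not leak into the relationship matrices.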