Bayesian Mixture Models for Quantitative Genetic and Expression Data

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

NRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

0193917

Grant No.

2003-35205-12833

Cumulative Award Amt.

(N/A)

Proposal No.

2002-03389

Multistate No.

(N/A)

Project Start Date

Nov 1, 2002

Project End Date

Oct 31, 2006

Grant Year

2003

Program Code

[43.0]- (N/A)

Recipient Organization
UNIV OF WISCONSIN
21 N PARK ST STE 6401
MADISON,WI 53715-1218

Performing Department
ANIMAL SCIENCES

Non Technical Summary
Mastitis is the most costly disease affecting the dairy cattle industry in the United States and elsewhere. Increasing resistance to mastitis by genetic selection can lead to a reduction in the use of antibiotics and to an increase in the economic efficiency of milk production. The project attempts to develop better tools for measuring quantitative genetic parameters of mastitis indirectly. Models for estimating the probability of disease using somatic cell counts are developed and validated using clinical mastitis records from a large database. The project involves cooperation with Norway. Economic benefits can accrue to the extent that the methods developed are found to be effective and applicable in practice. The dairy industry is of major importance, both in the United States and elsewhere, and even moderate improvements in genetic analysis of mastitis-related data can translate into reductions in production costs.

Animal Health Component

50%

Research Effort Categories

Basic

50%

Applied

50%

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
303	3499	1080	50%
303	3499	2090	50%

Knowledge Area
303 - Genetic Improvement of Animals;

Subject Of Investigation
3499 - Dairy cattle, general/other;

Field Of Science
2090 - Statistics, econometrics, and biometrics; 1080 - Genetics;

Goals / Objectives
The objective is to develop statistical models and methods for inference of quantitative genetic parameters of somatic cell count (SCC) in dairy cattle using Bayesian finite mixture models. The project proposes a major extension of a mixture model for analysis of SCC, an indication of mastitis, which is the most important disease in dairy cattle. Since the USA does not have a national mastitis-recording program, it is essential to make the best use of the available SCC data used currently in genetic evaluation. The extensions encompass cross-sectional and longitudinal settings, Bayesian and Markov chain Monte Carlo implementations, suggest novel model features (e.g., an imperfect genetic correlation between effects of genes in healthy and diseased animals) and allow for use of traits that are correlated with SCC, such as milk yield. Data from the Norwegian health recording system, containing about 0.5 million cases of clinical mastitis, will be used to gauge the predictive ability of the mixture models. The research enhances animal agriculture through: 1) better tools for genetic evaluation of dairy animals for liability to mastitis from SCC records, 2) improved statistical methods for quantitative genetic data, and 3) development of basic statistical methodology for future strategic study of similar traits in genetic selection programs.

Project Methods
This project has two main parts that are inextricable from each other. The first one, essentially theoretical, deals with the development of statistical models, including functional forms, parameterizations, distributional assumptions and consideration of computing strategies for large Markov chain Monte Carlo implementations. The second part can be viewed as one of validation and application of the methods to large dairy cattle data sets. To evaluate the ability of mixture models for somatic cell scores (SCS) to discriminate between infected and non-infected cows, it is essential to have access to a large body of data in which both SCS and mastitis events are recorded. Support from the Babcock Center for International Dairy Research and Development of the University of Wisconsin-Madison has allowed us to establish a cooperative project with scientists from the Agricultural University of Norway (Drs. Gunnar Klemetsdal and Bjorg Heringstad). This cooperation gives us access to large data sets containing information on both SCS and clinical mastitis from the Norwegian nation-wide recording system. A cow cannot be infected and healthy at the same time, but it is reasonable to posit that there may be genes that are expressed when infection takes place, whereas others are expressed in the absence of the disease (or perhaps expressed at a different level). This can be modeled by introducing a genetic correlation in a mixture model, much in the same way that one can think of a genetic covariance between ovulation rate and scrotal circumference in sheep. In the context of mastitis, and in a 2-component mixture, a sufficient condition for statistical identification of the genetic correlation is that some healthy animals have relatives that contract the disease. More generally, if a K-component mixture model is fitted, at least some "families" (in some loosely defined sense) must be represented in all components of the mixture.
In dairy cattle, where large half-sib families are the norm, this requirement can be met unless K is very large, which is seldom the case. We propose fitting this "differential gene expression" model to the Norwegian data. The clinical mastitis information will be used to evaluate the predictive ability of several mixture models via standard cross-validation techniques. In longitudinal settings, we need to match mastitis events to a specific test-day on which both yield and SCC are measured.
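The identification condition above (families represented in both components of the mixture) can be illustrated with a small simulation. This is a minimal sketch under loud assumptions, not the project's Bayesian model: health status is treated as known rather than latent, the correlation is read off family means rather than inferred by Markov chain Monte Carlo, and every name and parameter value (`simulate_family_means`, 200 sires, 100 daughters each, `rho = 0.5`, component means of 2 and 5) is hypothetical and chosen only to make the pattern visible.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_family_means(n_sires=200, n_daughters=100, p_infected=0.25,
                          rho=0.5, seed=1):
    """Return per-sire mean records in the healthy and infected components.

    All parameter values are hypothetical and for illustration only.
    """
    rng = random.Random(seed)
    healthy_means, infected_means = [], []
    for _ in range(n_sires):
        # Sire effects in the two components, correlated by construction.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        s_healthy = z1
        s_infected = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        healthy, infected = [], []
        for _ in range(n_daughters):
            if rng.random() < p_infected:
                infected.append(5.0 + s_infected + rng.gauss(0, 1))
            else:
                healthy.append(2.0 + s_healthy + rng.gauss(0, 1))
        # Only families observed in BOTH components inform the correlation.
        if healthy and infected:
            healthy_means.append(sum(healthy) / len(healthy))
            infected_means.append(sum(infected) / len(infected))
    return healthy_means, infected_means


healthy_means, infected_means = simulate_family_means()
n_informative = len(healthy_means)   # families seen in both components
r_family = pearson(healthy_means, infected_means)
```

With large half-sib families, essentially every sire has daughters in both components, so the across-family correlation of component means carries information about the simulated `rho`. If each family appeared in only one component, the two lists would be empty and the correlation could not be computed at all.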

Progress 11/01/02 to 10/31/06

Outputs
Finite mixture models can uncover heterogeneity due to hidden structure. Quantitative genetics of continuous characters having a finite mixture of Gaussian components was explored. The partition of variance in a mixture, the covariance between relatives under the supposition of an additive genetic model, and the offspring-parent regression were derived. Formulae for assessing the effect of mass selection operating on a mixture were worked out. Expressions for the genetic and phenotypic correlations between mixture and Gaussian traits, and between two mixture traits, were derived as well. Semi-parametric procedures for prediction of total genetic value for quantitative traits that make use of phenotypic and genomic data simultaneously were developed. The methods focus on the treatment of massive information provided by, e.g., single-nucleotide polymorphisms, which can create a mixture of distributions. It was argued that standard parametric methods for quantitative genetic analysis cannot handle the multiplicity of potential interactions arising in models with, e.g., hundreds of thousands of markers, and that most of the assumptions required for an orthogonal decomposition of variance are violated in artificial and natural populations. A fully Bayesian method for quantitative genetic analysis of data consisting of ranks of, e.g., genotypes, scored at a series of events or experiments was developed. The model postulates a latent structure, with an underlying variable realized for each genotype or individual involved in the event. The rank observed is assumed to reflect the order of the values of the unobserved variables, i.e., the classical Thurstonian model of psychometrics. A study was conducted to apply finite mixture models to field data for somatic cell scores (SCS) for estimation of genetic parameters. Data were approximately 170,000 test-day records for SCS from first-parity Holstein cows in Wisconsin, USA.
Five different models were applied, each one with an increasing level of complexity. The best model was one for which genetic and permanent environmental variances were heterogeneous, but residual variances were homogeneous. The genetic effects for the two components suggested that SCS from healthy and infected cattle were different traits, with a genetic correlation between high and low SCS of only 0.13. Robust threshold models with multivariate Student's t or multivariate slash link functions were employed to infer genetic parameters of clinical mastitis at different stages of lactation, with each cow defining a cluster of records. The robust fits were compared with that from a multivariate probit model using a pseudo-Bayes factor and an analysis of residuals. Results suggest that clinical mastitis resistance is not the same trait across periods, corroborating earlier findings with probit models.
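The idea of partitioning variance in a mixture can be illustrated at the phenotypic level with the law of total variance: the variance of a finite mixture is the weighted within-component variance plus the variance among component means. The sketch below shows only this elementary identity, not the genetic partitions derived in the project; the function name `mixture_moments` and the numerical values are hypothetical.

```python
def mixture_moments(weights, means, variances):
    """Mean and variance partition of a finite mixture.

    Returns (mean, within, between, total), where total = within + between
    by the law of total variance.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    mean = sum(p * m for p, m in zip(weights, means))
    # Average variance inside the components.
    within = sum(p * v for p, v in zip(weights, variances))
    # Variance generated by the differences among component means.
    between = sum(p * (m - mean) ** 2 for p, m in zip(weights, means))
    return mean, within, between, within + between


# Hypothetical two-component example: 75% "healthy" records with mean 2,
# 25% "infected" records with mean 5, unit variance in both components.
mean, within, between, total = mixture_moments(
    [0.75, 0.25], [2.0, 5.0], [1.0, 1.0]
)
```

For two components the between-component term reduces to P(1 - P) times the squared difference in means, here 0.1875 x 9 = 1.6875, so the separation of the components contributes more to the total variance than the within-component variance of 1.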

Impacts
This project developed theory and methods for genetic analysis of heterogeneous characters whose statistical distribution requires the specification of a mixture of distributions, such as somatic cell scores in cattle, gene expression data and traits for which there may be major genes segregating, but without knowing the genotypes involved. Models and methods were applied to dairy cattle records on somatic cell scores from the USA and Norway, and found to have a better performance than standard methods of genetic evaluation. The idea of mixtures also underlies a wide class of semi-parametric methods, which may be useful in conjunction with genomic selection. Application of these methods to livestock populations can increase the accuracy of prediction of genetic merit of animals.

Publications

D. Gianola, R. L. Fernando and A. Stella. 2006. Genomic assisted prediction of genetic value with semi-parametric procedures. Genetics 173, 1761-1776.
D. Gianola, B. Heringstad and J. Odegaard. 2006. On the quantitative genetics of mixture characters. Genetics 173, 2247-2255.
D. Gianola and H. Simianer. 2006. A Thurstonian model for quantitative genetic analysis of ranks: A Bayesian approach. Genetics 174, 1613-1624.
P. J. Boettcher, D. Caraviello and D. Gianola. 2007. Genetic analysis of somatic cell scores of Holstein cows with a Bayesian mixture model. Journal of Dairy Science 90, 435-443.
Y. M. Chang, D. Gianola, B. Heringstad and G. Klemetsdal. 2006. A comparison between multivariate slash, Student-t and probit threshold models for analysis of clinical mastitis in first lactation cows. Journal of Animal Breeding and Genetics 123, 290-300.

Progress 01/01/04 to 12/31/04

Outputs
Mastitis elevates SCC, inducing a positive correlation between SCS and the disease. Selection against mastitis has focused on genetic evaluations for low levels of SCC. An observed SCC can be viewed as drawn from a two-component mixture defined by the unknown health status of a cow. A mixture model was developed, assuming that health class membership associated with a test-day record of SCS was fully determined by an underlying variable. The probability of putative mastitis may vary between sub-groups. Further, a baseline SCS may be affected by fixed and random effects. Based on simulations, the model gave unbiased estimates of parameters. We fitted a finite mixture model (FMM) to somatic cell counts in goats and compared the fit to that of a standard linear mixed effects model. Bacteriological information was used to assess the ability of the model to classify records as from healthy or infected goats. Data were 4518 observations of SCS and bacterial infection from both udder halves of 310 goats from 5 herds in Northern Italy. Records were from a complete production season, and were taken monthly from February to November of 2000. Explanatory factors included a three-parameter regression on days in milk; fixed class effects of herd-test-day, parity group, and udder side (left or right); and random effects of goat and udder half within goat. The two-component FMM included a fixed mean for the second component of the model (theoretically corresponding to infected udder halves), and an unknown probability of membership to a given putative infection status. A Bayesian approach was used for the analysis, with Gibbs sampling employed to obtain draws from posterior distributions of parameters of interest. The Deviance Information Criterion (DIC) was used to compare the fit of the two models. The FMM yielded a much lower estimate of residual variance than the standard model (1.28 vs. 3.02 SCS²), and a slightly higher estimate for the between-goat variance (1.79 vs. 1.48).
The DIC was much lower for the FMM, indicating a better fit to the data. The FMM was able to classify correctly 60% and 48% of the healthy and infected observations, respectively. This was slightly greater than what would be expected from random classification, but not high enough for useful mastitis diagnosis. Nevertheless, increased precision of genetic evaluation is the goal of applying the FMM, rather than timely and accurate mastitis diagnosis. The results suggest that more research on FMM for SCS is merited and necessary for proper application. Prediction of random effects with mixtures with Gaussian distributions was studied from a non-Bayesian perspective, assuming that location and dispersion parameters are known. The focus was on calculating the best predictor for several models. Coverage included mixture sampling models, and mixtures for the distribution of the random effects. Longitudinal and cross-sectional specifications such as those arising in animal breeding and genetics, were examined. The best linear predictor and the best linear unbiased predictor were derived for these models.
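Classifying a record as healthy or infected in a two-component Gaussian mixture rests on the conditional probability of membership given the observed score. The sketch below treats all parameters as known and uses a common standard deviation; in the Bayesian analyses described above, such probabilities would instead be averaged over posterior draws. The function names and all numerical values are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_infected(y, p_infected, mean_healthy, mean_infected, sd):
    """Probability that record y belongs to the 'infected' component (Bayes' rule)."""
    num = p_infected * normal_pdf(y, mean_infected, sd)
    den = num + (1.0 - p_infected) * normal_pdf(y, mean_healthy, sd)
    return num / den


def classify(records, threshold=0.5, **params):
    """Flag each record whose membership probability exceeds the threshold."""
    return [prob_infected(y, **params) > threshold for y in records]


# Hypothetical parameter values, for illustration only.
params = dict(p_infected=0.5, mean_healthy=2.0, mean_infected=5.0, sd=1.0)
midpoint_prob = prob_infected(3.5, **params)
flags = classify([1.0, 6.0], **params)
```

When the two components overlap heavily, many records have membership probabilities near the threshold, which is one way to see why classification rates can stay modest even when the mixture fits the data better than a single-distribution model.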

Impacts
Theory and algorithms for genetic analysis of Gaussian mixtures were developed and applied to livestock data, with the primary motivation being mastitis, an udder disease. Application of these procedures to livestock populations can enhance the effectiveness of genetic selection for increased resistance to disease.

Publications

D. Gianola, J. Odegaard, B. Heringstad, G. Klemetsdal, D. Sorensen, P. Madsen, J. Jensen and J. Detilleux. 2004. Mixture model for inferring susceptibility to mastitis in dairy cattle: a procedure for likelihood-based inference. Genetics, Selection, Evolution 36, 3-27.
P. J. Boettcher, P. Moroni, G. Pisoni and D. Gianola. 2005. Application of a finite mixture model to somatic cell scores of Italian goats. Journal of Dairy Science 88, 2209-2216.
D. Gianola. 2005. A primer on prediction of random effects in finite mixture models with Gaussian components. Journal of Animal Breeding and Genetics 122, 145-160.
J. Odegaard, J. Jensen, P. Madsen, D. Gianola, G. Klemetsdal and B. Heringstad. 2005. A Bayesian Liability-Normal mixture model for analysis of a continuous mastitis-related trait. Journal of Dairy Science 88, 2652-2659.

Progress 01/01/03 to 12/31/03

Outputs
The distribution of somatic cell score (SCS) in cows with and without intramammary infection (mastitis) may be different. SCS could be regarded as a mixture of at least two components depending on cow udder health status. A heteroscedastic two-component Bayesian normal mixture model with random effects was developed and implemented via Gibbs sampling. The model was evaluated using simulated data sets. SCS was simulated as a mixture representing two alternative udder health statuses (healthy or mastitic). Animals were assigned randomly to the two components according to the probability of group membership (Pm). Random effects (additive genetic and permanent environment), when included, had identical distributions across mixture components. Posterior probabilities of putative mastitis were estimated for all observations, and model adequacy was evaluated. Fitting different residual variances in the two mixture components seems to cause bias in estimation of parameters. When the components are difficult to disentangle, so are their residual variances; this biases estimation of Pm and of location parameters of the two underlying distributions. When all variance components were identical across mixture components, the mixture model analyses returned parameter estimates without bias and with a high degree of accuracy. Including random effects in the model significantly increased the probability of correct classification. No sizable differences in probability of correct classification were found between models in which a single animal effect (ignoring relationships) was fitted and models where this effect was split into genetic and permanent environmental components utilizing relationship information. Finite mixture models can separate a heterogeneous population into homogeneous components. These models can be used to classify individuals to the unknown member groups, e.g., using somatic cell scores to assign cows into udder health groups. 
A simulation study of a two-component normal mixture model was carried out. Parameters were estimated by maximum likelihood using the EM algorithm. The objective was to evaluate alternative Monte Carlo implementations of the E-step. A total of 400 individuals were randomly assigned to two sub-populations with a mixture parameter (P) of 75%; heritability was 0.2, and the residual variance was homogeneous. The E-step was done with Gibbs sampling. Different lengths of burn-in and of sampling periods were run to study the efficiency of the Monte Carlo implementation. Location parameters and P converged quickly regardless of the length of burn-in and the number of Gibbs samples collected; however, smaller mean-squared errors were observed with longer burn-in. Variance components were less stable, and a larger number of Gibbs samples gave smaller mean-squared errors. A longer burn-in at the beginning stage of the Monte Carlo implementation and a larger number of Gibbs samples collected should result in better performance of the algorithm.
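The structure of an EM fit to a two-component normal mixture can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm evaluated above: it has no random effects (so heritability plays no role), and its E-step is computed exactly rather than by Gibbs sampling. The sample size of 400, the mixing proportion of 0.75 and the homogeneous residual variance follow the description above; the component means (0 and 4), the seed, and all function names are hypothetical.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simulate(n=400, p=0.75, mu1=0.0, mu2=4.0, sd=1.0, seed=7):
    """Draw n records; each comes from component 1 with probability p."""
    rng = random.Random(seed)
    return [rng.gauss(mu1 if rng.random() < p else mu2, sd) for _ in range(n)]


def em_two_component(y, n_iter=200):
    """EM for a two-component normal mixture with a common variance."""
    ys = sorted(y)
    half = len(ys) // 2
    # Crude starting values: means of the lower and upper halves of the data.
    mu1 = sum(ys[:half]) / half
    mu2 = sum(ys[half:]) / (len(ys) - half)
    grand = sum(ys) / len(ys)
    var = sum((v - grand) ** 2 for v in ys) / len(ys)
    p = 0.5
    for _ in range(n_iter):
        sd = math.sqrt(var)
        # E-step: probability that each record belongs to component 1.
        w = []
        for v in y:
            a = p * normal_pdf(v, mu1, sd)
            b = (1.0 - p) * normal_pdf(v, mu2, sd)
            w.append(a / (a + b))
        # M-step: weighted updates of the proportion, means and variance.
        n1 = sum(w)
        n2 = len(y) - n1
        mu1 = sum(wi * v for wi, v in zip(w, y)) / n1
        mu2 = sum((1.0 - wi) * v for wi, v in zip(w, y)) / n2
        var = sum(wi * (v - mu1) ** 2 + (1.0 - wi) * (v - mu2) ** 2
                  for wi, v in zip(w, y)) / len(y)
        p = n1 / len(y)
    return p, mu1, mu2, var


data = simulate()
p_hat, mu1_hat, mu2_hat, var_hat = em_two_component(data)
```

In a Monte Carlo EM, the weights computed in the E-step would be replaced by averages over sampled values, which is where the choices of burn-in length and number of samples enter.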

Impacts
The expectation is that a mixture model may enhance the accuracy of genetic evaluation for resistance to mastitis which, at present, is based on analysis of somatic cell scores, ignoring the type of heterogeneity considered in a mixture specification.

Publications

J. Odegaard, J. Jensen, P. Madsen, D. Gianola, G. Klemetsdal and B. Heringstad. 2003. Detection of mastitis in dairy cattle by use of mixture models for repeated somatic cell scores: A Bayesian approach via Gibbs sampling. Journal of Dairy Science 86, 3694-3703.
Y. M. Chang, D. Gianola, J. Odegaard, J. Jensen, P. Madsen, D. Sorensen, G. Klemetsdal and B. Heringstad. 2003. Evaluation of a Monte Carlo EM algorithm for likelihood inference in a finite normal mixture model with random effects. European Association for Animal Production, 54th Annual Meeting, Rome, Italy.