Linkage Disequilibrium Mapping Using Single Nucleotide Polymorphisms

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

0187577

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Oct 1, 2000

Project End Date

Sep 30, 2005

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA,NY 14853

Performing Department
BIOLOGICAL STATISTICS & COMPUTATIONAL BIOLOGY

Non Technical Summary
There exists currently no method for full multipoint Linkage Disequilibrium Mapping. The development of a full multipoint Linkage Disequilibrium Mapping method will greatly increase mapping power and should find applications in the agricultural sciences and in other genetics studies.

Animal Health Component

(N/A)

Research Effort Categories

Basic

20%

Applied

(N/A)

Developmental

80%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
901	7310	1080	50%
901	7310	2090	50%

Knowledge Area
901 - Program and Project Design, and Statistics;

Subject Of Investigation
7310 - Experimental design and statistical methods;

Field Of Science
2090 - Statistics, econometrics, and biometrics; 1080 - Genetics;

Goals / Objectives
Single Nucleotide Polymorphisms (SNPs) are single base changes in a DNA sequence. Such polymorphisms are thought to occur in approximately one out of every 50-500 base positions in most natural organisms. Much interest has centered on these genetic markers because of their potential use in gene mapping and in evolutionary studies. The recent advent of chip technology gives strength to the idea that SNP data may soon become abundant. For example, Wang et al. (1998) constructed a human genetic map consisting of 2227 SNPs, and similar data sets are being produced in other organisms. The availability of SNP data opens up completely new avenues in genomics. SNPs may be used for large-scale genomic scans in addition to fine-scale mapping and for the mapping of complex traits with multiple genetic factors. However, the great promise of the SNP markers has not been followed by the development of statistical and population genetical methods for analyzing such data. The proposed research will address this gap by devising new statistical methods for data analysis that take the special properties of SNPs into account. The methods are based on the computational approach of Nielsen (2000) that allows likelihood and Bayesian inference on multiple linked SNPs. The emphasis will be on developing a full probabilistic framework for Linkage Disequilibrium Mapping (LDM) based on population samples (association mapping). The full solution to this problem requires integration over all possible ancestral histories (the set of linked genealogies among SNP sites in the sequence). Even stochastic integration over the set of ancestral histories is challenging because of the high dimensionality of the problem. Consequently, there currently exists no method for multipoint (i.e., using multiple linked markers) LDM that provides a full treatment of the problem in a genealogical framework. 
This is unfortunate because a method that can employ all of the information in the genetic data from multiple linked markers will have increased mapping power. The reason for this is that the data from multiple linked markers provide information regarding the underlying structure of the genealogy in a region. Considering the markers one by one cannot retrieve such information. Development of a method for LDM that uses all of the genealogical information in multiple linked markers is therefore highly desirable. The theoretical framework will be based on coalescence theory (Kingman 1982a,b). In coalescence theory the ancestry of the population, or of a sample of genetic markers from the population, is described in terms of a gene genealogy. Recent developments in coalescence theory allow for an accurate description of demographic factors such as population growth (e.g. Slatkin and Hudson 1991) and population subdivision (e.g. Bahlo and Griffiths 1998; Beerli and Felsenstein 1999; Nielsen and Slatkin 2000). Samples from subdivided or growing populations will have an increased degree of linkage disequilibrium. Developing LDM methods applicable to subdivided populations is therefore important in maximizing the power of the statistical method.
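As a rough illustration of the gene genealogy idea, the sketch below simulates the time to the most recent common ancestor (TMRCA) of a sample under the standard neutral coalescent for a single non-recombining locus in a constant-size population. It is not the method proposed in this project: it ignores recombination, growth, and subdivision, and the function names, sample size, and seed are assumptions chosen for illustration. While k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 in coalescent time units, which gives an expected TMRCA of 2(1 - 1/n) for a sample of size n.

```python
import random


def simulate_tmrca(n, rng):
    """Simulate the TMRCA of n lineages under the standard coalescent.

    Time is measured in coalescent units (2N generations for a diploid
    population of constant size N under the usual scaling).
    """
    t = 0.0
    # Lineages merge pairwise; with k lineages left, the next merger
    # occurs after an Exponential(k choose 2) waiting time.
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
    return t


def expected_tmrca(n):
    """Theoretical mean TMRCA, 2 * (1 - 1/n), in the same time units."""
    return 2.0 * (1.0 - 1.0 / n)


# Illustrative settings (assumptions, not values from the project).
rng = random.Random(1)
sample_size = 10
replicates = 20000

mean_tmrca = sum(simulate_tmrca(sample_size, rng) for _ in range(replicates)) / replicates
print(f"simulated mean TMRCA: {mean_tmrca:.3f}")
print(f"theoretical mean TMRCA: {expected_tmrca(sample_size):.3f}")
```

A full multipoint treatment would have to track a set of correlated genealogies along the sequence rather than a single tree, which is what makes the integration problem described above high-dimensional.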

Project Methods
The Markov Chain Monte Carlo (MCMC) approach described in Nielsen (2000) will be extended to provide a full coalescence-based likelihood method for LDM. The advantage of this approach is that it makes stochastic integration over the set of linked genealogies from multiple sites feasible. With the availability of such a method it can be determined exactly how much power is gained by considering multiple markers and how much the accuracy in the estimate of trait location is improved by using a full coalescence model to describe the genetic ancestry of the population. Many other problems regarding LDM are still unsolved. For example, Kaplan et al. (1995) suggested that LDM will only work for simple monogenic diseases. Other problems involve the choice of study populations: to what degree does LDM rely on the use of growing populations, and how sensitive are the methods to assumptions regarding the absence of population subdivision? It is difficult to answer these questions without a method that can take full account of the demographic and ancestral processes in the study population while at the same time employing all of the genealogical information available in multiple linked markers. However, with the availability of such a method it is possible to answer these questions unequivocally. The development of such a method is the major objective of the proposed research. Three problems will be emphasized: (1) Extending the method of Nielsen (2000) to the analysis of SNP data with unknown haplotypic phase (genotypic data). Most of the available data are genotypic, so it is important to develop LDM methods based on this type of data. The availability of a likelihood method appropriate for this type of data will also be instrumental in determining the loss of power associated with the use of genotypic instead of haplotypic data. (2) Devising a full probabilistic, coalescent-based framework for local multipoint LDM applicable to SNPs. 
A method for calculating likelihood ratios based on the joint data from many markers will be provided. This method should have more mapping power than any previously proposed LDM method based on population samples. (3) Providing a Bayesian mapping method. The availability of such a method will alleviate the problems associated with multiple tests, often encountered in mapping studies. Throughout, simulations will be used to compare the new methods with previous methods and illuminate questions regarding power and accuracy. The methods will be tested on previously published data and a computer program will be made publicly available. In addition, a large simulation study will be undertaken to determine the statistical properties of the method.
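To make the MCMC and Bayesian mapping ideas above concrete, here is a minimal sketch of a Metropolis sampler that draws from a posterior distribution over candidate trait locations. Everything specific in it is an assumption for illustration: the grid of 21 candidate positions, the uniform prior, and especially the toy log-likelihood, which merely stands in for a coalescent-based likelihood that would in practice require integrating over genealogies. It is not the method of Nielsen (2000). Because the toy posterior can be computed exactly, the sampler's output can be checked against it.

```python
import math
import random

# Hypothetical grid of candidate trait locations (arbitrary map units).
positions = list(range(21))


def log_likelihood(x):
    """Toy log-likelihood peaked at position 12 (an assumption, not real data)."""
    return -0.5 * ((x - 12) / 2.0) ** 2


def exact_posterior():
    """Posterior over the grid under a uniform prior, by direct normalization."""
    weights = [math.exp(log_likelihood(x)) for x in positions]
    total = sum(weights)
    return [w / total for w in weights]


def metropolis(n_iter, rng, start=10):
    """Random-walk Metropolis sampler on the grid with a symmetric +/-1 proposal.

    Proposals that fall off the grid are rejected, which keeps the
    proposal symmetric. Returns the visit frequency of each position.
    """
    counts = [0] * len(positions)
    x = start
    for _ in range(n_iter):
        proposal = x + rng.choice((-1, 1))
        if 0 <= proposal < len(positions):
            log_ratio = log_likelihood(proposal) - log_likelihood(x)
            # Accept with probability min(1, likelihood ratio).
            if rng.random() < math.exp(min(0.0, log_ratio)):
                x = proposal
        counts[x] += 1
    return [c / n_iter for c in counts]


rng = random.Random(7)
exact = exact_posterior()
sampled = metropolis(50000, rng)
max_error = max(abs(e - s) for e, s in zip(exact, sampled))
print(f"largest gap between exact and sampled posterior: {max_error:.4f}")
```

The output is a single posterior distribution over location rather than a separate test at each marker, which is the sense in which a Bayesian formulation can sidestep the multiple-testing issue mentioned above.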

Progress 10/01/00 to 09/30/05

Outputs
The major objective of the proposed research is to develop new methods for association mapping. A paper has been published in Genetic Epidemiology on some technical aspects of association mapping, particularly related to how markers are chosen. In addition, the research project has led to a study of population genetical models for analyzing inbred populations and other projects in the statistical analysis of population genetical data. In particular, several new methods for analyzing Single Nucleotide Polymorphism (SNP) data have been developed. SNP data are being generated in many organisms for the purpose of providing a resource for gene mapping and to improve our understanding of genomic variation. The methods I have developed deal with several problems that are specific to SNP data because of the protocols used to obtain such data. The new methods will greatly enhance the utility of SNP data and have already found several applications. Although the majority of results of the research project have been theoretical, it has also resulted in numerous data analysis projects, often as a result of collaborations with other faculty members at Cornell University. Among the most important projects are: The development of a new Bayesian method for constructing genetic maps, with application to the eggplant genome. This research has been performed in collaboration with Steven Tanksley at Cornell University and has resulted in a publication (in press) in Genetical Research. A project on data analysis of population genetic data from rice (Oryza sativa). This project has been completed in collaboration with the lab of Susan McCouch, Cornell University, and has resulted in a publication in Genetics (in press) and another publication submitted to Genetics. A project centered on improving the available methods for estimating the effective paternity number and the effective number of alleles, with applications to social insects. 
This project was conducted in collaboration with David Tarpy, Cornell University, and has resulted in a publication in Molecular Ecology and a publication in Insectes Sociaux. Together with Andrew G. Clark, Cornell University, I analyzed a large-scale human SNP data set in order to estimate recombination rates and make predictions regarding the power of association mapping.
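For readers unfamiliar with the effective number of alleles mentioned above, the following sketch shows the textbook definition: the reciprocal of the sum of squared allele frequencies. It also shows a standard sample-size correction that replaces the plug-in sum of squares with its unbiased estimate, (n * sum - 1) / (n - 1). This is a generic illustration and is not claimed to be the estimator published from this project; the function names and example counts are assumptions. The same arithmetic applies to effective paternity number if the counts are offspring per father.

```python
def effective_number(counts):
    """Plug-in effective number: 1 / sum of squared sample frequencies."""
    n = sum(counts)
    return 1.0 / sum((c / n) ** 2 for c in counts)


def effective_number_corrected(counts):
    """Effective number using the unbiased estimate of the sum of squares.

    The plug-in sum of squared frequencies is biased upward in small
    samples; (n * s - 1) / (n - 1) removes that bias. Taking the
    reciprocal reduces, but does not fully remove, bias in the
    effective number itself.
    """
    n = sum(counts)
    s = sum((c / n) ** 2 for c in counts)
    unbiased = (n * s - 1.0) / (n - 1.0)
    if unbiased <= 0.0:
        # Every sampled copy is a distinct allele: no finite estimate.
        return float("inf")
    return 1.0 / unbiased


# Hypothetical sample: two alleles observed five times each.
example_counts = [5, 5]
naive = effective_number(example_counts)
corrected = effective_number_corrected(example_counts)
print(f"plug-in estimate: {naive:.2f}, corrected estimate: {corrected:.2f}")
```

With these counts the plug-in estimate is 2.0 and the corrected estimate is 2.25, showing that the plug-in value tends to understate diversity when the sample is small.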

Impacts
The theoretical research is expected to find applications in gene mapping in plants and other organisms. In addition, our results will help improve our knowledge regarding the way molecular biological and evolutionary forces interact to shape the variation observed among individuals within species and among different species.

Publications

Tarpy, D. R., R. Nielsen and D. I. Nielsen. 2004. Paternity estimation in social insects. Insectes Sociaux. 51: 203-204.
Nielsen, R. 2004. Population genetic analysis of ascertained SNP data. Human Genomics 3: 218-224.
Nielsen, R., M. J. Todd and A. G. Clark. 2004. Reconstituting the frequency spectrum of ascertained SNP data. Genetics 168: 2373-2382.
Zhai, W., M. J. Todd, and R. Nielsen. 2004. Is Haplotype Block Identification Useful for Association Mapping Studies? Genetic Epidemiology 27: 80-83.
Kim, Y. and R. Nielsen. 2004. Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 1513-1524.
Nielsen, R., D. R. Tarpy, and H. K. Reeve. 2003. Estimating the effective paternity number in social insects and the effective number of alleles in a population. Molecular Ecology 12: 3157-3164.
Clark, A.G., Nielsen, R., Signorovitch, J., Matise, T.C., Glanowski, S., Heil, J., Winn-Deen, E. S., Holden, A.L. and Lai, E. 2003. Linkage Disequilibrium and Inference of Ancestral Recombination in 538 Single-Nucleotide Polymorphism Clusters across the Human Genome. Am. J. Hum. Genet. 73:285-300.
Nielsen, R. and Signorovitch, J. 2003. Correcting for Ascertainment Biases when Analyzing SNP Data: Applications to the Estimation of Linkage Disequilibrium. Theor. Pop. Biol. 63:245-255.

Progress 01/01/03 to 12/31/03

Outputs
The major objective of the proposed research is to develop new methods for association mapping. The proposed method is under development and I currently have a computer programmer working on implementing the method. In addition, the research project has led to a study of population genetical models for analyzing inbred populations and other projects in the statistical analysis of population genetical data. In relation to the USDA Hatch grant I have published an article in Heredity on methods for detecting selection from genomic data, an article in Genetics on paternity inference in natural populations and an article in American Journal of Human Genetics on analyzing data from Single Nucleotide Polymorphisms. The latter two articles resulted in part also from research conducted before the initiation of the Hatch grant. Recently, efforts on the grant have concentrated on the analysis of genomic data in humans and the development of methods for estimating recombination rates from large SNP data sets. In particular, together with collaborators I have published an article in the American Journal of Human Genetics regarding linkage disequilibrium in the human genome and a paper in Theoretical Population Biology giving the theoretical underpinnings of the method.

Impacts
The research is expected to find applications in gene mapping in plants and other organisms.

Publications

Clark, A.G., Nielsen, R., Signorovitch, J., Matise, T.C., Glanowski, S., Heil, J., Winn-Deen, E. S., Holden, A.L. and Lai, E. 2003. Linkage Disequilibrium and Inference of Ancestral Recombination in 538 Single-Nucleotide Polymorphism Clusters across the Human Genome. Am. J. Hum. Genet. 73(2):285-300.
Nielsen, R. and Signorovitch, J. 2003. Correcting for Ascertainment Biases when Analyzing SNP Data: Applications to the Estimation of Linkage Disequilibrium. Theor. Pop. Biol. 63(3):245-255.

Progress 01/01/02 to 12/31/02

Outputs
The major objective of the proposed research is to develop new methods for association mapping. The proposed method is under development and I currently have a computer programmer working on implementing the method. In addition, the research project has led to a study of population genetical models for analyzing inbred populations and other projects in the statistical analysis of population genetical data. In relation to the USDA Hatch grant I have published an article in Heredity on methods for detecting selection from genomic data, an article in Genetics on paternity inference in natural populations and an article in American Journal of Human Genetics on analyzing data from Single Nucleotide Polymorphisms. The latter two articles resulted in part also from research conducted before the initiation of the Hatch grant. Recently, efforts on the grant have concentrated on the analysis of genomic data in humans. In particular, together with collaborators I have submitted an article to the American Journal of Human Genetics regarding linkage disequilibrium in the human genome.

Impacts
The research is expected to find applications in gene mapping in plants and other organisms.

Publications

Signorovitch, J. and Nielsen, R. 2002. PATRI - paternity inference in natural populations. Bioinformatics. 18: 341-342.

Progress 01/01/01 to 12/31/01

Outputs
The major objective of the proposed research is to develop new methods for association mapping. The proposed method is under development and I currently have a computer programmer working on implementing the method. In addition, the research project has led to a study of population genetical models for analyzing inbred populations and other projects in the statistical analysis of population genetical data. In relation to the USDA Hatch grant I have published an article in Heredity on methods for detecting selection from genomic data, an article in Genetics on paternity inference in natural populations and an article in American Journal of Human Genetics on analyzing data from Single Nucleotide Polymorphisms. The latter two articles resulted in part also from research conducted before the initiation of the Hatch grant.

Impacts
The research is expected to find applications in gene mapping in plants and other organisms.

Publications

Nielsen, R., Mattila, D.K., Clapham, P.J., and Palsboll, P.J. 2001. Statistical Approaches to Paternity Analysis in Natural Populations and Applications to the North Atlantic Humpback Whale. Genetics, 157: 1673-1682.
Nielsen, R., and Wakeley, J.W. 2001. Distinguishing Migration from Isolation: an MCMC Approach. Genetics 158: 885-895.
Nielsen, R. 2001. Statistical Tests of Selective Neutrality in the Age of Genomics. Heredity, 86:641-647.
Nielsen, R., Wakeley, J., Ardlie, K., and Liu-Cordero, S.N. 2001. The discovery of single nucleotide polymorphisms and inferences about human demographic history. Am. J. Hum. Genet., 69: 1332-1347.