Development and application of a high-density SNP genotyping array for citrus

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

AFRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

1001031

Grant No.

2013-67013-21110

Cumulative Award Amt.

$450,000.00

Proposal No.

2013-01898

Multistate No.

(N/A)

Project Start Date

Sep 1, 2013

Project End Date

Aug 31, 2016

Grant Year

2013

Program Code

[A1141]- Plant Health and Production and Plant Products: Plant Breeding for Agricultural Production

Recipient Organization
UNIVERSITY OF CALIFORNIA, RIVERSIDE
(N/A)
RIVERSIDE,CA 92521

Performing Department
College of Nat & Agr Sciences

Non Technical Summary
Citrus fruits are a major crop in several states of the USA, but the recent introduction of Huanglongbing (HLB), a bacterial disease, and of the Asian citrus psyllid, the insect that transmits it, poses the most severe threat that these industries have ever faced. To address HLB and other exotic diseases, breeders need tools to rapidly characterize citrus varieties and hybrids and to locate genes for disease resistance, fruit quality, and other essential traits. To meet this need, this project will develop a high-density SNP genotyping array for citrus. This tool allows investigators to rapidly determine the particular genetic variants present in a variety or hybrid. The 20,000 genetic variants detected with this SNP array can then be related to the traits displayed by the various individuals studied. We will use this tool to study essentially all trees in the largest and most diverse citrus variety collection in the US and several large families in which individuals vary for traits of economic importance. We will then correlate the particular genetic variants carried by each individual with measured traits such as disease resistance and fruit quality. The most valuable outcomes of this project will be a tool that citrus breeders can use to improve the efficiency of breeding, and a comprehensive understanding of relationships among citrus varieties and how these relate to economically valuable characters.

Animal Health Component

(N/A)

Research Effort Categories

Basic

100%

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	0999	1080	50%
202	0999	1081	50%

Knowledge Area
202 - Plant Genetic Resources; 201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
0999 - Citrus, general/other;

Field Of Science
1081 - Breeding; 1080 - Genetics;

Keywords

Goals / Objectives
Objective 1. Develop sequences for a large and diverse set of citrus varieties. Collect all available citrus sequences and supplement with new Illumina HiSeq data to compile sequences of at least 50 germplasm accessions, each at 12X depth or more. Objective 2. Design a 20,000 SNP Illumina Infinium SNP assay. Analyze the resulting dataset to identify SNPs and design a robust 20,000-SNP assay, principally targeting genes while also maintaining reasonable coverage in gene-poor regions. Objective 3. Apply the SNP assay to Citrus and closely related genera in the citrus germplasm collection to support phylogenetic clarifications and association mapping. Genotype approximately 1000 accessions of Citrus and closely related genera to create a database for association mapping in citrus. Objective 4. Apply the SNP assay to breeding populations used for mapping disease resistance and tolerance, and populations already phenotyped for fruit quality traits. Genotype and map several populations with about 500 total individuals that have already been phenotyped for various traits.

Project Methods
Objective 1. Develop sequences for a large and diverse set of citrus varieties. Collect all available citrus sequences from online databases and by contacting scientists who have reported sequencing but not yet released their sequences. For species and other groups not adequately represented by available sequences, we will isolate DNA and sequence these to at least 12X depth using Illumina HiSeq. A total of about 50 sequences is expected. Objective 2. Design a 20,000 SNP Illumina Infinium SNP assay. Analyze all available sequence data to identify SNPs and design a robust 20,000-SNP assay, principally targeting genes while also maintaining reasonable coverage in gene-poor regions. New sequences will be assembled using SOAPdenovo and aligned to the reference sweet orange and Clementine sequences. SNPs within and between individuals will be called with several algorithms, and a consensus will be used to select SNPs for further analysis. Initially we will select SNPs within genes and then add non-genic SNPs as needed to give good genome coverage. The genotyping array will then be designed by Illumina or another genotyping array provider. Objective 3. Apply the SNP assay to Citrus and closely related genera in the citrus germplasm collection to support phylogenetic clarifications and association mapping. We will isolate DNA from about 1000 accessions from the citrus germplasm collection and breeding parents and analyze these with the SNP array. Prior to association mapping, we will determine population structure by analysis with Structure and similar programs. Genome-wide association mapping will be performed with TASSEL or similar programs with adjustments for population structure. Objective 4. Apply the SNP assay to breeding populations used for mapping disease resistance and tolerance, and populations already phenotyped for fruit quality traits. 
Several populations with about 500 total individuals that have already been phenotyped for various traits will be mapped using the SNP array. Additional array mapping will be made available to collaborators in Florida where plants are being phenotyped for resistance or tolerance to HLB.
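The consensus step described under Objective 2 (calling SNPs with several algorithms and keeping those on which the callers agree) can be illustrated with a minimal Python sketch. The report does not name the callers or the agreement rule, so the function name consensus_snps, the min_callers threshold, and the tuple representation of a SNP are all assumptions made for illustration.

```python
from collections import Counter


def consensus_snps(call_sets, min_callers=2):
    """Return SNPs reported by at least `min_callers` of the call sets.

    Each call set is an iterable of (chromosome, position, alt_allele)
    tuples produced by one (hypothetical) SNP-calling algorithm.
    """
    counts = Counter()
    for calls in call_sets:
        # set() ensures a caller cannot vote twice for the same SNP
        counts.update(set(calls))
    return sorted(site for site, n in counts.items() if n >= min_callers)


# Hypothetical output of three callers for a small region
caller_a = {("chr1", 1050, "T"), ("chr1", 2210, "G"), ("chr2", 77, "A")}
caller_b = {("chr1", 1050, "T"), ("chr2", 77, "A")}
caller_c = {("chr1", 1050, "T"), ("chr1", 2210, "C")}

agreed = consensus_snps([caller_a, caller_b, caller_c], min_callers=2)
print(agreed)
```

Raising min_callers to the number of callers gives a strict intersection; a lower value trades some specificity for more candidate SNPs.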

Progress 09/01/13 to 08/31/16

Outputs
Target Audience: The target audience for this final phase of the project is the community of citrus geneticists and breeders who will benefit from the resources generated. A secondary target is the broader community of plant and animal geneticists who may benefit from methods we have developed for genotyping single pollen grains and a novel method for inferring chromosome-level haplotypes from analysis of a few haploid individuals. Changes/Problems: This two-year project was extended to a third year because it took more time than expected to negotiate access to DNA sequences developed in other laboratories. We did not want to sequence the same genotypes as others. This was eventually resolved. Completion of sequencing also took more time than expected due to queues at the sequencing center and low coverage from some initial sequencing runs. Overall, we note four major improvements to the originally planned project: greater sequencing depth, development of SNP arrays with about 2.5 times the originally proposed SNP density, development of a high density (1.4M) SNP array (although the number of valid SNPs on this array will be less than 1.4M), and development of methods to infer chromosome-level haplotypes for citrus. What opportunities for training and professional development has the project provided? One Ph.D. student worked on the project. She developed improved DNA isolation methods for citrus and used these to collect most of the germplasm collection samples that we analyzed. She also learned bioinformatics methods used in selecting sequences for inclusion on the array. Training in plant genetics and biotechnology was provided to this student through personal meetings and presentations to the lab group. She expects to present a talk on the citrus array project at the PAG meeting in January 2017. A visiting scientist from Italy developed the pollen grain isolation and whole genome amplification methods used. 
He received training in the use of bioinformatics in SNP analysis, presented a poster at PAG in January 2016, and will do so again in 2017. A visiting scientist from India is analyzing array data for loss-of-heterozygosity and copy number variation and will present a poster at PAG in 2017. How have the results been disseminated to communities of interest? Results have been disseminated through talks and posters at scientific meetings. Manuscripts are being prepared for submission to journals. Posters were presented at the PAG meeting in San Diego in 2016 and a talk and two posters will be presented in January 2017. One talk and one poster developed from the project were presented at the International Citrus Congress in Brazil in September 2016. Sequence data will be released through the NCBI Sequence Read Archive and through the Citrus Genome Database (https://www.citrusgenomedb.org/). We expect to sign an agreement with Affymetrix to allow commercialization of the two Citrus arrays. Other citrus researchers will be able to submit larger numbers of samples directly to Affymetrix. The Roose lab will also collect and consolidate smaller numbers of samples from other labs and periodically submit these to Affymetrix for analysis. This will allow the tool to be used by those without sufficient funding to process large numbers of samples. What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? The major objective of this project was to develop a new platform that will allow citrus researchers to rapidly characterize the genetic composition of citrus varieties and hybrids in order to improve the efficiency of citrus breeding, particularly for resistance to Huanglongbing and other exotic diseases. We sequenced 30 citrus varieties and related species and combined this information with available sequences for 11 additional varieties to form a database of genetic variation in citrus. From this we developed two SNP array platforms, with about 1.4 million and 58,000 variants respectively. The larger array was used to analyze over 200 citrus accessions, and the smaller array was used to analyze over 1900 samples from germplasm and breeding collections. The arrays produce very high quality data, with replicate samples being more than 99.8% identical. A project innovation was the development of methods to analyze DNA from single pollen grains, which contain only two identical copies of the genome. This technical breakthrough will allow us to infer ancestry of citrus varieties with greater precision and better locate genes for useful traits. Final data were obtained only near the end of the project and are still being analyzed, but it is clear that we will be able to construct high quality linkage maps that will enable citrus breeders to more rapidly identify genes for disease resistance and other traits. The accomplishments of the project in meeting specific goals are listed below. Objective 1. Develop sequences for a large and diverse set of citrus varieties. Collect all available citrus sequences and supplement with new Illumina HiSeq data to compile sequences of at least 50 germplasm accessions, each at 12X depth or more. We obtained 11 unique genome sequences from public databases and other collaborators and sequenced 30 accessions of citrus and its near relatives. 
Paired-end 100 bp reads were sequenced on an Illumina HiSeq 2500 at the Institute for Integrative Genome Biology at the University of California Riverside. The average sequence depth was increased from the planned 12X to 30X in order to accurately genotype most variants in these highly heterozygous samples. Nominal sequence depth for the 30 accessions ranged from 5.7X to 53X, with a mean of 29.0X. Three samples had less than 20X depth and 60% had greater than 25X depth. Across all samples an average of 7.8% of reads did not align with the 301 Mb Clementine reference genome. Variation in the percentage of reads that did not align was not related to taxonomic distance from the reference. The accessions analyzed included 6 citrus relatives, 4 citrons, 9 pummelos, 13 mandarins, 2 papedas, and 7 major commercial cultivar types (lemon, lime, sour orange, sweet orange, grapefruit, and rough lemon). In the entire dataset of 41 accessions, after filtering, the percentage of missing data ranged from 3% to 75%; the two samples with low nominal read depth (Eremocitrus glauca, Citrus hystrix) had the highest percentage of missing SNP calls. Among the called SNPs, heterozygosity was lowest in the citron group, intermediate in pummelos and mandarins, higher in citrus relatives and mandarins thought to be introgressed with pummelo, and highest in those accessions known to be interspecific hybrids. This sequence information is being used to analyze genetic relationships among these species and cultivars, including the extent and location of introgressed chromosome segments. Objective 2. Design a SNP array from the genome sequence information. We exceeded the project objective of designing a 20,000 SNP array, developing two complementary Affymetrix Axiom arrays for citrus. Sequence data were analyzed to identify almost 20 million variants: 17.7M SNPs and 2.2M indels. 
We then identified about 5.3M variants located in genes or adjacent regions and having minor allele frequencies of 10% or more. Affymetrix recommended 2.0M of these for tiling on arrays and we selected about 1.4M for the first array, called Citrus15. This array was designed for two purposes. First, it was used to validate the SNP markers using replicate samples and parent-offspring trios. Second, it was used to analyze 288 samples including about 200 citrus varieties and various control samples. An objective was to design an array that would be useful for genotyping a wide range of citrus germplasm, including all of the Citrus relatives that can be crossed with Citrus. The sequences deposited on the array are the Clementine mandarin reference sequence except at the SNP position. Analysis of more divergent taxa is a challenge because additional SNPs within about 30 bases of the targeted SNP reduce hybridization and thereby decrease accuracy of SNP calls in those taxa. We reduced this effect by selecting SNPs in more conserved sequences, but it is still a factor. When analyzing all samples, including relatives, about 505,000 SNPs can be considered high quality. This number increases to about 728,000 SNPs for analysis of Citrus samples only. Analysis of the Citrus15 array data for citrus accessions is still in progress. We also developed a chloroplast phylogeny based on 1000 chloroplast markers. Based on SNPs validated with the Citrus15 array, we developed a smaller array with about 58,000 SNP probes (only 20,000 were originally planned) and used this to analyze 1920 samples, including all accessions in the UCR Citrus Variety Collection, several mapping populations, and 384 samples from cooperating citrus geneticists in Florida. We are in early stages of analyzing data from this array but it is clear that it is high quality. 43,322 probes passed Affymetrix's default criteria when all samples were analyzed. 
A higher percentage are useful if analysis is restricted to Citrus samples. Objective 3 is to apply the SNP array to Citrus germplasm to support phylogenetic clarifications and association mapping. Data analysis for phylogeny and association mapping is in early stages. Cultivars within major groups such as navels, Valencia, and blood oranges are more similar to each other than to members of other groups. With the Citrus15 array, analysis of loss-of-heterozygosity and copy number variation can identify fairly large deletions that distinguish, for example, Valencias from other oranges. Such analysis may provide important clues about the genetic differences that confer late maturity on Valencia oranges. An important development in this project was a method to genotype single pollen grains. Mature pollen is collected, single grains are isolated under a dissecting microscope, and immediately used for whole genome amplification (WGA) with a commercial kit. The amplified DNA is then hybridized to the array. We analyzed 221 pollen grains from 39 diploid accessions, about 5 grains per accession. 71% of WGA samples had at least 70% of SNPs called, with 64% having at least 95% of SNPs called. Work in progress, prompted by these new data, is developing efficient methods to infer chromosome-level haplotypes from high-density genotype data on 3-5 pollen grains. Objective 4 is to apply the SNP assay to breeding populations used for mapping various traits including disease resistance and tolerance. The Citrus56 array will allow construction of high density linkage maps of exceptional quality. We initially focused on mapping a cross of two mandarins, Fortune and Fairchild, each of which is heterozygous for approximately 11,000 SNPs. 
The fraction of missing data is typically less than 1% and maps have very few detectable bad calls. Map order of SNP markers generally agrees with positions on the reference genome, but there are some disagreements, likely due either to real differences in order or to errors in the reference, for which only a much lower resolution map was available during assembly.
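The idea of inferring chromosome-level haplotypes from a few pollen grains, mentioned under Objective 3, can be illustrated with a minimal Python sketch. The project's actual method is not described in this report, so this is not that method: it is a simple majority-vote rule on adjacent markers, assuming that each pollen grain is haploid, that markers are heterozygous in the parent and ordered along one chromosome, and that recombination between neighboring markers is rare. The function name phase_from_pollen and the 0/1/None allele coding are hypothetical.

```python
def phase_from_pollen(grains):
    """Infer the two parental haplotypes from haploid pollen genotypes.

    `grains` is a list of genotype lists, one per pollen grain. Each entry
    is 0 or 1 (the two parental alleles at a heterozygous SNP) or None for
    a missing call. SNPs must be in chromosome order.
    """
    if not grains or not grains[0]:
        return [], []
    n_sites = len(grains[0])
    hap1 = [0]  # arbitrary orientation: haplotype 1 starts with allele 0
    for i in range(n_sites - 1):
        same = diff = 0
        for g in grains:
            a, b = g[i], g[i + 1]
            if a is None or b is None:
                continue  # grain is uninformative for this marker pair
            if a == b:
                same += 1
            else:
                diff += 1
        prev = hap1[-1]
        # Majority vote; ties and uninformative pairs default to "same",
        # which is an arbitrary choice in this sketch
        hap1.append(prev if same >= diff else 1 - prev)
    hap2 = [1 - allele for allele in hap1]
    return hap1, hap2


# Three hypothetical grains; the third is recombinant and has a missing call
pollen = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, None, 0, 1],
]
print(phase_from_pollen(pollen))
```

With only 3-5 grains a single miscalled allele can flip a vote, so a practical method would likely weigh evidence over longer marker windows rather than adjacent pairs alone.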

Publications

Type: Journal Articles Status: Submitted Year Published: 2017 Citation: Wu GA, 14 others. Genomics of the origin, evolution and domestication of citrus.

Progress 09/01/14 to 08/31/15

Outputs
Target Audience: The target audience for this initial phase of the project is the community of citrus geneticists and breeders who will benefit from the resources generated. Changes/Problems: This two-year project was extended to a third year because it took more time than expected to negotiate access to DNA sequences developed in other laboratories. We did not want to sequence the same genotypes as others. This was eventually resolved. Completion of sequencing also took more time than expected due to queues at the sequencing center and low coverage from some initial sequencing runs. Overall, we note four major improvements to the originally planned project: greater sequencing depth, development of SNP arrays with about 2.5 times the original SNP density, development of a high density (1.3M) SNP array (although the number of valid SNPs on this array will be less than 1.3M), and development of methods to infer chromosome-level haplotypes for citrus. We expect to collect all planned marker data by the revised end date, but complete analysis of all of the data will likely require additional time. What opportunities for training and professional development has the project provided? One Ph.D. student is working on the project. She is developing improved DNA isolation methods for citrus and using these to collect the large number of samples that we plan to analyze. She has also learned bioinformatics methods used in selecting sequences for inclusion on the array. Training in plant genetics and biotechnology is being provided to this student through personal meetings and presentations to the lab group. Two undergraduate students provided assistance in DNA isolation and received training in lab procedures. How have the results been disseminated to communities of interest? 
Nothing Reported What do you plan to do during the next reporting period to accomplish the goals? Objective 1) During the next reporting period we will conduct a phylogenetic analysis of citrus based on the citrus sequence data, and further analyze evidence for recent and ancient hybridization and introgression in the evolution of citrus. Sequences will be released in December 2015 or January 2016. Objective 2) We have designed a 1.3M SNP array for citrus and will use information from this to validate SNPs and design a robust 50K SNP array. Objective 3) Genetic diversity analysis. We will analyze about 200 diverse citrus accessions using the 1.3M SNP array, and then analyze an additional 800 accessions using the 50K array. Association mapping for various traits will be explored using this database. We have developed methods for whole genome amplification (WGA) of DNA from single pollen grains that are very effective for amplification of SSR markers, and we expect this WGA DNA to also be suitable for SNP array analysis. Analysis of a few pollen grains from each genotype will allow assignment of haplotypes to selected accessions, increase our understanding of citrus evolution, and improve accuracy and resolution of association mapping. Objective 4) The 50K array will be used to generate dense (at least 1000 marker) maps for several mapping populations. The maps will be combined with phenotypic data and QTL analysis performed.

Impacts
What was accomplished under these goals? 1) Develop sequences for a large and diverse set of citrus varieties. Before deciding which samples we would sequence, we first determined what sequence data we could obtain from public repositories and, by agreement, from other researchers. We then identified 30 accessions that, together with the 12 sequenced accessions available from others, would represent diversity in citrus. These include the major commercial cultivar types as well as ancestral species and closely related and interfertile genera. Several accessions with apparent HLB tolerance were included. DNA was isolated and sequenced on the Illumina HiSeq 2500. The citrus genome is about 380 Mb but essentially all individuals are fairly heterozygous. To capture this heterozygosity we chose to sequence 2 x 100 bp reads at 30X nominal depth. Nominal sequence depth for the 30 accessions ranged from 5.7X to 53X, with a mean of 29.0X. Three samples had less than 20X depth and 60% had greater than 25X depth. Across all samples an average of 7.8% of reads did not align with the 301 Mb reference genome. Variation in the percentage of reads that did not align was not related to taxonomic distance from the reference. About 19.9M variants were annotated: 17.7M SNPs, 0.9M insertions, and 1.2M deletions, for a total variant rate of about 1 per 14 bases. Among the variant positions, the percentage of heterozygous variants was 2-3% in citrons, 4-5% in pummelos, trifoliate oranges, and kumquat, 3-8% in mandarins (some introgressed), and 12-19% in known interspecific hybrids. We also have chloroplast genome sequences for all accessions and are using these to develop a definitive chloroplast phylogeny of these accessions. 2) Design a 20,000 SNP Illumina Infinium SNP assay. We investigated SNP platforms from Illumina and Affymetrix and eventually chose Affymetrix as offering greater capacity for the funds available. We are developing two Affymetrix Axiom arrays. 
An initial 1.3M SNP array will be used to genotype 288 samples including about 200 accessions selected to represent diversity, parent-offspring trios, varieties that have diverged by mutation, and whole-genome amplified samples from single pollen grains. Results from this array will be used to validate SNPs selected from the sequence database and also provide high density coverage of selected accessions. We will then select about 50,000 SNPs to be tiled on a lower cost array that will be used for analysis of germplasm and mapping populations as outlined in 3) and 4) below. The 1.3M SNP array is now being manufactured, and the DNA samples for this analysis have been prepared. 3) Apply the SNP assay to Citrus and closely related genera in the citrus germplasm collection to support phylogenetic clarifications and association mapping. We have isolated DNA from nearly all germplasm accessions to be studied, using a new protocol that first reduces the waxy coating on citrus leaves, then dries the leaves with silica gel, and then isolates DNA with a commercial kit. 4) Apply the SNP assay to breeding populations used for mapping disease resistance and tolerance, and populations already phenotyped for fruit quality traits. Most leaf and/or DNA samples for this objective have been obtained.
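The sequencing plan under 1) above (2 x 100 bp reads at 30X nominal depth for a genome of about 380 Mb) implies a read budget that can be worked out with a short Python sketch. It assumes that nominal depth means total sequenced bases divided by genome size; the report does not state which genome size was used for its depth figures, so the 380 Mb value and both function names are assumptions for illustration.

```python
import math


def nominal_depth(read_pairs, read_length_bp, genome_size_bp):
    """Nominal depth = total sequenced bases / genome size (paired-end)."""
    return 2 * read_pairs * read_length_bp / genome_size_bp


def read_pairs_needed(target_depth, read_length_bp, genome_size_bp):
    """Smallest number of read pairs that reaches the target nominal depth."""
    return math.ceil(target_depth * genome_size_bp / (2 * read_length_bp))


GENOME_SIZE_BP = 380_000_000  # "about 380 Mb" from the report
READ_LENGTH_BP = 100          # 2 x 100 bp paired-end reads

pairs = read_pairs_needed(30, READ_LENGTH_BP, GENOME_SIZE_BP)
print(pairs)                                                  # 57000000
print(nominal_depth(pairs, READ_LENGTH_BP, GENOME_SIZE_BP))   # 30.0
```

Under these assumptions, each accession needs about 57 million read pairs to reach 30X, which helps explain why lower-yield runs left some samples well below the target.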

Publications

Progress 09/01/13 to 08/31/14

Outputs
Target Audience: The target audience for this initial phase of the project is the community of citrus geneticists and breeders who will benefit from the resources generated. Changes/Problems: This two-year project is somewhat behind schedule because it took more time than expected to negotiate access to DNA sequences developed in other laboratories. We did not want to sequence the same genotypes as others. This was eventually resolved. Completion of sequencing is also taking more time than expected due to queues at the sequencing center. A concern is that the approved budget does not include funding for personnel to analyze the SNP data to be generated by this project. Therefore we have explored less expensive alternatives to Illumina SNP arrays, specifically a variant of genotyping-by-sequencing that uses sequence capture methods to produce a targeted, representative set of sequences from each sample. However the bioinformatics workload for analyzing this type of data in heterozygous genotypes is likely to be higher than with Illumina SNP data. A final decision on the platform has not been made. What opportunities for training and professional development has the project provided? One Ph.D. student is working on the project. She is developing improved DNA isolation methods for citrus and using these to collect the large number of samples that we plan to analyze. Training in plant genetics and biotechnology is being provided to this student through personal meetings and presentations to the lab group. How have the results been disseminated to communities of interest? No results suitable for dissemination at this point. What do you plan to do during the next reporting period to accomplish the goals? Objective 1) During the next reporting period we will complete DNA sequencing and analyze the DNA sequence database to identify SNPs. Objective 2) Design a 20,000 SNP assay. 
We will analyze the SNP database to identify a set of SNPs expected to maximize information about citrus germplasm. We will likely run a set of already sequenced samples through a sequence-capture/HT Sequencing process to evaluate this system. Objective 3) Genetic diversity analysis. We will complete preparation of DNA from at least 600 of the 1000 germplasm accessions and at least 300 individuals from mapping populations. The SNP assay will be applied to these samples. A preliminary analysis of genetic diversity will be completed. Objective 4) The mapping population data will be analyzed to generate linkage maps and QTL analysis initiated. It is likely that a no-cost extension to the project will be required to complete the remaining samples.

Impacts
What was accomplished under these goals? 1) Develop sequences for a large and diverse set of citrus varieties. Before deciding which samples we would sequence, we first determined what sequence data we could obtain from public repositories and, by agreement, from other researchers. We then identified 30 accessions that, together with the 13 sequenced accessions available from others, would represent diversity in citrus. These include the major commercial cultivar types as well as ancestral species and closely related and interfertile genera. Several accessions with apparent HLB tolerance were included. DNA was isolated and submitted for sequencing on the Illumina HiSeq 2500. The citrus genome is about 380 Mb but essentially all individuals are fairly heterozygous. To capture this heterozygosity we chose to sequence 2 x 100 bp reads at 30X nominal depth. We are currently waiting for the sequencing to be completed. 2) Design a 20,000 SNP Illumina Infinium SNP assay. Mainly because of cost issues, we are investigating alternatives to the Infinium assay, primarily a sequence capture method followed by sequencing. We developed a strategy to allocate the 20,000 SNPs within and between major species groups. Groups, such as mandarins, that are of greater commercial importance and have more diversity have deeper coverage. SNPs will be selected primarily from genes, but we will include at least one SNP per Mb for the major groups. 3) Apply the SNP assay to Citrus and closely related genera in the citrus germplasm collection to support phylogenetic clarifications and association mapping. We have begun to evaluate DNA isolation methods as alternatives to those we have used previously for citrus. For the large number of samples to be studied with the array, a more rapid and reliable method would be quite valuable. 4) Apply the SNP assay to breeding populations used for mapping disease resistance and tolerance, and populations already phenotyped for fruit quality traits. 
No accomplishments yet in this area.
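The allocation rule stated under 2) above (select SNPs primarily from genes while including at least one SNP per Mb) can be illustrated with a minimal two-pass Python sketch. The report does not describe the selection procedure, so the function name select_snps, the dictionary fields, and the use of minor allele frequency (maf) to rank candidates are assumptions made for illustration only.

```python
def select_snps(candidates, budget, window_bp=1_000_000):
    """Pick up to `budget` SNPs: first one per window, then genic SNPs.

    `candidates` is a list of dicts with keys "chrom", "pos", "genic"
    (bool) and "maf" (minor allele frequency, used here only as a
    hypothetical ranking score).
    """
    # Genic SNPs first, then higher MAF first
    ranked = sorted(candidates, key=lambda s: (not s["genic"], -s["maf"]))
    chosen, chosen_ids, covered = [], set(), set()

    # Pass 1: guarantee coverage, one SNP per window (here 1 Mb)
    for snp in ranked:
        if len(chosen) >= budget:
            break
        window = (snp["chrom"], snp["pos"] // window_bp)
        if window not in covered:
            covered.add(window)
            chosen.append(snp)
            chosen_ids.add((snp["chrom"], snp["pos"]))

    # Pass 2: spend the remaining budget on genic SNPs not yet chosen
    for snp in ranked:
        if len(chosen) >= budget:
            break
        key = (snp["chrom"], snp["pos"])
        if snp["genic"] and key not in chosen_ids:
            chosen.append(snp)
            chosen_ids.add(key)
    return chosen


# Four hypothetical candidate SNPs on one chromosome
snps = [
    {"chrom": "chr1", "pos": 100, "genic": True, "maf": 0.3},
    {"chrom": "chr1", "pos": 200, "genic": True, "maf": 0.4},
    {"chrom": "chr1", "pos": 1_500_000, "genic": False, "maf": 0.2},
    {"chrom": "chr1", "pos": 2_100_000, "genic": True, "maf": 0.1},
]
print([s["pos"] for s in select_snps(snps, budget=3)])
```

Allocating separate budgets to each species group, as the text describes for mandarins and other groups, would amount to calling such a function once per group with group-specific candidates.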

Publications