Performing Department
Plant Pathology
Non Technical Summary
Legumes play a vital role in ecological and agricultural systems. Among cultivated crops, legumes are unique in their ability to fix atmospheric nitrogen through symbioses with rhizobial bacteria. Alfalfa (Medicago sativa) is a legume that occupies a key role as forage for livestock production. Alfalfa is the fourth most widely grown crop in the United States, with an annual value exceeding $8 billion. Unfortunately, the genome of alfalfa has not yet been sequenced, and its autotetraploid genetics make it difficult to study in terms of genome organization and genetic analysis. The closely related M. truncatula is therefore often used as a model for genome studies. My research program will investigate the genomics of M. truncatula, emphasizing the use of genome sequence data to explore genome variation and the architecture of gene families important in plant-microbe interactions. We will use functional assays, de novo genome assembly, and bioinformatic analyses to characterize genes underlying symbiosis in M. truncatula. Legumes are noteworthy for the sophisticated symbioses they form with rhizobial bacteria (Sinorhizobium). However, existing knowledge about symbioses comes primarily from knockout mutants, an approach that can miss genes of subtle yet significant effect. This project seeks to discover the genes most likely to experience active selection and therefore to be important in the contemporary evolution of rhizobial and mycorrhizal symbioses. In earlier work, we used genome-wide association analysis (GWAS) to discover several strongly supported candidate loci, often with independent lines of evidence (expression profile, correlation with multiple traits, published symbiotic phenotype). In contrast to earlier GWAS studies, our results were based on whole genome resequencing, which enabled SNP analysis at much higher density and without ascertainment bias.
We will now go on to test ~100 of these candidate loci through reverse genetic experiments involving Tnt1 insertion lines and Agrobacterium rhizogenes-based gene silencing. Structural variants (SVs) and copy number variants (CNVs) are both known to have major impacts on genome variation. This is extremely important in exploring the genomics of symbiosis because certain large gene families, especially NB-ARC disease resistance genes and nodule-specific cysteine rich peptides, play important roles in symbiosis and other plant-microbe interactions. Unfortunately, SVs and CNVs are difficult to discover with confidence using medium-depth next generation sequencing. Therefore, we will deeply sequence and de novo assemble 30 nodal M. truncatula accessions to discover SVs and CNVs with high confidence. Sequence-based variant discovery will be complemented by comparative genome hybridization and experimental validation. Ultimately, SVs and CNVs will be imputed genome-wide for our entire panel of 250 Medicago accessions.
Animal Health Component
0%
Research Effort Categories
Basic
100%
Applied
(N/A)
Developmental
(N/A)
Goals / Objectives
Legumes play a vital role in ecological and agricultural systems. Among cultivated crops, legumes are unique in their ability to fix atmospheric nitrogen through symbioses with rhizobial bacteria. Indeed, fixed nitrogen derived from legume-rhizobia symbioses contributes more than 90 Tg of nitrogen per year worldwide, an amount that would require roughly 300 Tg of fuel (>$30 billion) annually if replaced using the Haber-Bosch process (Kinzig 1994). Since legumes are not constrained for nitrogen, they produce remarkable levels of protein, a property that is both biologically and agriculturally significant. Nearly 33% of all nutritional nitrogen comes from legumes, and in many developing countries legumes serve as the single most important source of vegetable protein (Graham et al 2003). Legumes also provide a significant fraction of the world's edible oil and synthesize an impressive array of isoflavonoid and triterpene saponin compounds, chemicals possessing anti-cancer, anti-inflammatory, or cardiovascular-promoting properties.
Among legumes, alfalfa (Medicago sativa) occupies a key role as forage for livestock production. Alfalfa is the fourth most widely grown crop in the United States, with an annual value exceeding $8 billion (http://www.naaic.org/). When the value of alfalfa as a mixture with other forages is considered, its value is actually equal to that of either wheat or soybeans. Unfortunately, the genome of alfalfa has not yet been sequenced. Moreover, alfalfa displays autotetraploid genetics, making it difficult to study in terms of genome organization and genetic analysis. Consequently, the closely related Medicago truncatula is often used as a model for genome studies.
My research program investigates the genomics of M. truncatula, emphasizing the use of genome sequence data to explore genome variation and the architecture of gene families important in plant-microbe interactions, especially genes critical in symbiotic nitrogen fixation and resistance to disease pathogens. M. truncatula and its microbial partners are superb models to study the biology of symbiosis. M. truncatula is widely recognized as an excellent model for legume genomics and has been the subject of a long and highly productive history of symbiosis research (Young and Udvardi 2009). It played a key role in the initial description of the chemical dialogue underlying nodulation and is the system in which many of the known molecular factors in nodulation were originally described (Geurts et al 2005). The power of M. truncatula as a model for legume research expanded further in 2011, when a high quality, BAC-based sequence for M. truncatula was published (Young 2011).
The research proposed here extends the results of earlier CRIS / MAES work in which my colleagues and I used next generation sequencing to explore sequence variation throughout the M. truncatula genome. We went on to perform genome-wide association analysis (GWAS) to discover candidate genes associated with rhizobial nodulation and other fitness traits. We will now go on to validate these candidates and to explore other types of genome variation and their impact on nodulation and on plant-microbe interactions more generally. This research, which is long-term and broad in scope, relies primarily on NSF Plant Genome funding. The CRIS / MAES funding requested here is an essential complement. Specifically, CRIS / MAES provides partial salary support for the lead PI on the project (Young) plus the associate scientist who acts as overall lab manager.
GOAL 1. Identify and validate candidate genes playing a role in M. truncatula-rhizobium symbioses
In the preceding cycle of CRIS / MAES, my colleagues and I used genome-wide association analysis (GWAS) to discover several strongly supported candidate loci associated with rhizobial symbiosis. Many of these candidates were also supported by independent lines of evidence (expression profile, correlation with multiple traits, published symbiotic phenotype). In contrast to earlier GWAS studies, our results were based on whole genome resequencing, which enabled single nucleotide polymorphism (SNP) analysis at much higher density and without ascertainment bias (Stanton-Geddes et al 2013). As a foundation for this work, we identified more than 6 million SNPs (approximately 750,000 in coding regions) present at a minor allele frequency (MAF) > 0.02 (Branca et al 2011). We used the entire set of SNPs to explore the genetic basis of variation in nodules per plant and rhizobial strain specificity, along with developmental traits such as flowering time. Phenotypic data were collected from 226 accessions grown in replicate, with each plant co-inoculated with two strains of S. meliloti (~2,000 plants assayed). GWAS was conducted using mixed model analyses of variance as implemented in TASSEL, with confounding effects of demographic history (i.e., population structure) minimized by using a kinship matrix as a covariate.
The top 100 candidate SNPs (those with the most significant P values) include several within or beside genes with biological functions that make them promising candidates for validation. Candidate SNPs tagged several nodulation-related examples such as SERK2, MtnodGRP3, MtMMPL1, NFP, CaML3, and MtnodGRP3A. GWAS also identified numerous genes coding for nodule-specific cysteine-rich peptides (NCRs), proteins previously shown to play a role in Sinorhizobial differentiation in nodules (Haag et al 2011). We also identified SNPs that explain a significant portion of trait variance within genes not previously recognized as having a possible nodulation function, as well as within totally uncharacterized genes. These are important candidates for the novel gene discovery that is the focus of our next round of symbiosis research.
GOAL 2. Explore structural variation and characterize the genome architecture of M. truncatula symbiosis-related gene families
Among the plant factors essential for legume symbiosis are two key gene families: the NB-ARCs, which are separately known to play a role in disease resistance, and the nodule cysteine rich peptides (NCRs), which are observed only in Medicago and its close taxonomic relatives (Alunni 2007). Recent studies indicate central roles for both protein families in symbiosis (Haag 2011), while previous studies of genome variation have shown that both exhibit high levels of sequence and structural variation (Branca 2011, Lai 2010). Notably, these gene families are highly diverse not only at the level of SNP variation, but even more so at the level of structural variation and copy number variation. However, better definition of the role of this variation, especially its impact on symbiosis and other plant-microbe interactions, is impossible without deeper resequencing combined with de novo genome assembly.
Our earlier GWAS mapping targeted only SNP variation, even though structural variants (SVs) and copy number variants (CNVs) are both known to have major impacts on phenotypic variation. This gap matters in exploring the genomics of symbiosis and disease resistance because certain large gene families - NB-ARCs and nodule cysteine rich peptides - play especially important roles in plant-microbe interactions. Unfortunately, SVs and CNVs are difficult to discover with confidence using medium-depth next generation (Illumina) sequencing alone. Therefore, we will deeply sequence and de novo assemble 30 nodal M. truncatula accessions and then compare the assemblies directly to discover SVs and CNVs with high confidence.
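The SNP set described under Goal 1 was restricted to variants with MAF > 0.02. The filter itself is simple arithmetic, sketched below under stated assumptions: each SNP is biallelic, each accession contributes a single allele call (reasonable only because M. truncatula is largely homozygous), and missing calls are recorded as None. The function names and the input layout are illustrative and are not taken from the project's actual pipeline.

```python
from collections import Counter


def minor_allele_frequency(calls):
    """Return the minor allele frequency for one SNP.

    `calls` holds one allele per accession (assumption: inbred lines,
    so one call per line); None marks a missing call.
    """
    observed = [c for c in calls if c is not None]
    counts = Counter(observed)
    if len(counts) != 2:
        # Monomorphic or not biallelic: treated as uninformative here.
        return 0.0
    return min(counts.values()) / len(observed)


def filter_by_maf(snps, threshold=0.02):
    """Keep the ids of SNPs whose MAF is strictly greater than `threshold`."""
    return [snp_id for snp_id, calls in snps.items()
            if minor_allele_frequency(calls) > threshold]


if __name__ == "__main__":
    # Hypothetical toy data, not real Medicago genotypes.
    toy = {
        "snp_common": ["A"] * 90 + ["G"] * 10,
        "snp_rare": ["C"] * 99 + ["T"] * 1,
        "snp_fixed": ["T"] * 100,
    }
    print(filter_by_maf(toy))
```

Missing calls are excluded from the denominator, so the frequency is computed over accessions that were actually genotyped at that site.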
Project Methods
METHODS 1. Identify and validate candidate genes playing a role in M. truncatula-rhizobium symbioses
Based on the results described earlier, we will go on to test approximately 100 nodulation candidates through reverse genetic experiments involving Tnt1 insertion lines and Agrobacterium rhizogenes-based gene silencing. This will involve RNA silencing and transposon tagging strategies examined through interaction assays with previously defined Sinorhizobium strains. Currently, more than 22,000 independent Tnt1 insertion lines have been generated at the Noble Foundation, where we have established a collaboration with project leaders Michael Udvardi and Jiangqi Wen. It is estimated that Tnt1-insertion lines contain an average of 25 insertions per genome (Tadege et al 2008), so the Tnt1-tagged mutant collection represents ~525,000 independent insertions altogether, translating to ~90% of Medicago genes with inserts (Tadege et al 2009). We will begin by searching Noble's existing flanking sequence tag (FST) database, but if mutants are lacking, we will work with Noble to design gene-specific primers to screen DNA pools. Direct association between Tnt1 and candidates will be confirmed either by the presence of multiple independent Tnt1 inserts in the same gene model or by co-segregation between the target gene and Tnt1 in test crosses that we will make. In parallel, my colleagues and I will use RNAi silencing technology to deliver hairpins to Medicago roots that generate dsRNA to degrade, in a sequence-specific manner, homologous target mRNA (Fusaro et al 2006). Briefly, a PCR amplicon is amplified from the coding region of the target gene and cloned into a binary vector as an inverted repeat. This inverted repeat (or "hairpin") is then transcribed by a nodule-specific promoter. The transgene is next transformed into a low-virulence Agrobacterium strain so as not to drastically alter root phenotype or confound downstream phenotyping.
Upon Agrobacterium infection, a root is generated that is transformed with the hairpin transgene. The composite plant can then be inoculated with rhizobia to induce nodules. Using these techniques, we will knock down GWAS-discovered candidate gene transcripts in order to test their function in nodulation pathways. The hairpin RNA platform is a robust and rapid assay that has been developed over the past decade by legume researchers (Limpens et al 2004).
METHODS 2. Explore structural variation and characterize the genome architecture of M. truncatula symbiosis-related gene families
Our de novo sequencing will target 30 diverse and informative M. truncatula accessions as a basis for discovering structural variants (SVs) and copy number variants (CNVs) at high resolution and with high confidence. The strategy will use 100X Illumina coverage to overcome short read lengths and bolster quality. We will employ a mix of fragment sizes to mitigate the assembly challenges posed by genomic repeats of differing length (Gnerre 2011). The shortest fragment size will be designed to generate overlapping reads that can be joined to produce longer, higher quality contigs. Moderate and larger fragment sizes provide long-range connectivity for extension of contigs and scaffolds. We have selected assemblers that are compatible with this mix of data types (Gnerre 2011): ALLPATHS-LG merges overlapping reads, ABySS tolerates broad insert size distributions, and both Celera and ABySS filter chimeric mate pair reads. We will evaluate all assemblies by intrinsic and extrinsic (orthogonal) measures. Intrinsic measures include number of incorporated reads, total bases in contigs, combined span of large contigs, contig bases N50, scaffold span N50, and mate constraint satisfaction. Extrinsic measures include concordance to reference sequences, including BACs, organelle sequences, transcript sequences, and other Medicago assemblies, keeping in mind the limitations of each reference.
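Several of the intrinsic measures listed above reduce to arithmetic on contig or scaffold lengths. The sketch below shows one common definition of N50 (the length L such that sequences of length L or longer contain at least half of all assembled bases), together with total bases and the combined span of large contigs. The function names and the `large_cutoff` value of 1,000 bp are assumptions made for illustration; the text does not specify the cutoff used in this project.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until their running
    total reaches half of all bases; the length reached is the N50.
    """
    if not lengths:
        raise ValueError("n50 requires at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def assembly_summary(contig_lengths, large_cutoff=1000):
    """Summarize an assembly from its contig lengths.

    `large_cutoff` (bp) is a hypothetical threshold for a 'large' contig.
    """
    large = [n for n in contig_lengths if n >= large_cutoff]
    return {
        "total_bases": sum(contig_lengths),
        "large_contig_span": sum(large),
        "contig_n50": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths in bp.
    print(assembly_summary([500, 1500, 3000]))
```

The same `n50` function applies to scaffold spans by passing scaffold lengths instead of contig lengths.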
Accurate discovery of structural and copy number variants is a crucial prerequisite to understanding small-effect and large-effect phenotypic variation. Consequently, we will calculate a comprehensive set of SVs and CNVs across all 30 nodal accessions and then choose a subset for experimental validation to estimate our false-positive and false-negative rates. No assembler can correctly assemble all regions of the genome, so we will examine potential regions of misassembly that might lead to inaccurate SV calls. Because M. truncatula has low levels of heterozygosity, regions of clustered heterozygosity are indicative of overassembly, such as occurs when repeats or members of gene families are collapsed. By aligning the reads back to the assembly they created, it is possible to quickly flag regions that should be approached with caution. By assessing heterozygosity density in adjacent genes, we can efficiently identify genomic regions where the number of genes in the assembly may not reflect the number of genes in the actual genome. This not only flags regions where SVs should be interpreted with caution but also gives us the opportunity to explore these regions manually. The reference M. truncatula genome sequence (A17) is not an ideal point of comparison for identifying and comparing SVs among accessions: preliminary data indicate numerous regions where A17 deviates in structure and content from other accessions. A more neutral reference for Medicago accessions would be preferable, so we will begin by creating a pan-Medicago reference (Gan et al 2011). For this, we will generate a "minimal complete representation" reference, where the new reference will include all novel sequence from any of the 30 accessions and where no new sequence must be inserted in order to derive one of those accessions. Our work to create de novo assemblies and SV predictions will be complemented by comparative genomic hybridization (CGH) technology (Haun 2011).
The two platforms are complementary, in part, because they are based on different chemistries and subject to different biases. CGH relies on relative hybridization of genomic sequences to a pre-ascertained set of probes, whereas the sequencing platform relies on library construction and adequate read coverage. SVs detected across both platforms can be called at high confidence. To create a Medicago CGH platform, we will leverage the sequence data from the de novo genomes to build a comprehensive set of probes. This will result in a one-million feature Medicago genome tiling array based on Agilent technology. The array probes, each a 50-70 bp oligo, will be designed from non-repetitive sequence spaces identified in the A17 genome plus the 30 de novo Medicago assemblies. Therefore, the CGH platform should be useful for mapping and cross-validating SVs that are present in at least one of the sequenced accessions. Approximately 100 putative structural and copy number variants will be subject to wet-lab verification. Preference will be given to variants expected to have biological import, namely those that result in alterations to genes, especially symbiosis-related genes or those with nodule expression. Due to their high sequence similarity and highly rearranged nature, tandem clusters of NB-ARC and NCR genes are expected to pose particular challenges to the underlying sequence assemblies on which our SVs are called, so a significant number of the regions we verify will involve these types of gene clusters. Candidate SVs will be verified by PCR using primers spanning the precise breakpoints identified by sequencing, outward-facing primers for suspected tandem duplicates, and far-separated primers for large-scale deletions, which would only amplify sequence if the deletion actually exists. PCR products will be cloned and sequenced to verify breakpoints. Finally, SVs and CNVs will be imputed genome-wide for our entire ~250 accession GWAS panel.
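The idea that SVs detected on both platforms can be called at high confidence can be illustrated with a reciprocal-overlap comparison of two call sets. This is a minimal sketch under stated assumptions, not the project's actual procedure: calls are reduced to half-open intervals on a shared coordinate system, the `SV` type and function names are hypothetical, and the 50% reciprocal-overlap threshold is a commonly used convention rather than a value given in the text.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions between calls a and b."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def high_confidence(seq_calls, cgh_calls, min_overlap=0.5):
    """Keep sequencing-based calls supported by at least one CGH call.

    `min_overlap` is an assumed threshold on reciprocal overlap.
    """
    return [s for s in seq_calls
            if any(reciprocal_overlap(s, c) >= min_overlap for c in cgh_calls)]


if __name__ == "__main__":
    # Hypothetical calls for illustration only.
    seq = [SV("chr1", 100, 200), SV("chr1", 500, 600), SV("chr2", 100, 200)]
    cgh = [SV("chr1", 120, 210), SV("chr1", 590, 900)]
    print(high_confidence(seq, cgh))
```

Requiring the overlap to be reciprocal prevents a very large call on one platform from "confirming" every small call that happens to fall inside it.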