Source: AGRICULTURAL RESEARCH SERVICE submitted to NRP
IMPROVING CROP EFFICIENCY USING GENOMIC DIVERSITY AND COMPUTATIONAL MODELING
Sponsoring Institution
Agricultural Research Service/USDA
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
0434435
Grant No.
(N/A)
Cumulative Award Amt.
(N/A)
Proposal No.
(N/A)
Multistate No.
(N/A)
Project Start Date
Mar 1, 2018
Project End Date
Feb 28, 2023
Grant Year
(N/A)
Program Code
[(N/A)]- (N/A)
Recipient Organization
AGRICULTURAL RESEARCH SERVICE
(N/A)
ITHACA, NY 14853
Performing Department
(N/A)
Non Technical Summary
(N/A)
Animal Health Component
20%
Research Effort Categories
Basic
80%
Applied
20%
Developmental
0%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
201                   1510                             1080                     60%
202                   1520                             1080                     10%
203                   1620                             1080                     10%
201                   1139                             1080                     10%
202                   1549                             1080                     10%
Goals / Objectives
Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A]
Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A]
Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B]
Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A]
Project Methods
Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

Progress 03/01/18 to 02/28/23

Outputs
PROGRESS REPORT Objectives (from AD-416): Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A] Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A] Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B] Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A] Approach (from AD-416): Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. 
This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

Deleterious mutations are common, with each generation of a plant acquiring a dozen or more. Some are mildly harmful, others nearly lethal. Over recent years, we have shown how these mutations affect yield and hybrid vigor. In the last two years, our collaborations have revealed that harmful mutations, particularly in protein sequences, can be detected by machine learning models trained on all proteins found across life forms. These models, calibrated using recent evolutionary comparisons, can also prioritize the impact of these mutations. Our research has led to improved breeding models and the identification of causal variants in maize, cassava, and, with collaboration, in potato. Several seed companies now combine editing with these methods to eliminate these harmful variants.

This year, we examined how transposons, mobile DNA elements that contribute nearly 60% of the maize genome, impact crop yield. Despite their diversity in age, size, and location within the genome, our largest and most sensitive study found that transposons only slightly affect fitness and yield. They have learned to coexist with their host, with minimal impact. This underscores that most harmful mutations occur in protein-coding and regulatory sequences.

This year we continued our genomic efforts, focusing on sequencing wild species in the maize and sorghum clade - the Andropogoneae. Forty species have been assembled and annotated to a high quality. The 12 species more closely related to maize are now publicly available through MaizeGDB, while the remaining ones will be available soon in a second-round release. In addition, we have assembled, using short-read DNA sequencing, the gene space of over 400 genomes, including many from the U.S.
National Plant Germplasm System (NPGS), and we can now see how individual genes have evolved and adapted to various environments across hundreds of closely related species.

The first major application of this resource has been to resolve problems with the gene models of maize. Genome annotation (where we model which DNA sequences are transcribed into RNA and then translated into protein) is a significant bioinformatics challenge. Current methods focusing on functional data within species are often hampered by molecular assay noise and biological system noise. To overcome this, we combined machine learning and evolutionary comparison of nearly 100 sequenced species, enabling us to identify the most functional mRNA and protein. This tool allows us to locate crucial harmful mutations and functional genes, even those that are expressed rarely or only under specific conditions. We also developed a tool that identifies open chromatin across flowering plants, and we are currently developing machine-learning models to predict gene expression.

Our most substantial project this year was the launch of the CERCA (Circular Economy for Reimagining Corn Agriculture) project, aimed at revolutionizing sustainable maize production while increasing productivity and efficiency through improved nitrogen cycling. The project involves 27 labs across the U.S., funded through a combination of USDA-ARS, FFAR, industry, and two foundations. The project's primary goals are nitrogen recycling on farms, reducing grain nitrogen demand, returning nitrogen to the soil at season's end, and extending the growing season through cold tolerance. In the first year, germplasm from a wide range of perennial wild relatives was advanced and field evaluations began.

Complementing the CERCA project, we are exploring opportunities to redesign storage proteins, which are crucial for nitrogen provision in seedlings and winter storage in perennials and are our main source of grain and legume protein.
Using machine learning algorithms, we have scanned the maize genome for evolved vegetative storage proteins. While we found a few candidates, proteomic analysis indicates that maize's perennial relatives use numerous proteins to store nitrogen, unlike some species. These tools, combined with analyses of hundreds of other species' seed storage proteins, are assisting us in redesigning storage proteins for durability, efficiency, nutrition, and digestibility.

Our Practical Haplotype Graph (PHG), a powerful way to represent the haplotype diversity of a crop, has received a number of software improvements that expand the efficiency and accessibility of the system. Several adjustments were made to allow the PHG system to utilize large public supercomputing systems like the USDA SCINet system, as well as to speed up the system and reduce the computing resources required. BioKotlin, another library designed to provide high-performance bioinformatics in a scripting environment, is continually being updated with new functionality to make it more usable. The main updates have been adding Multiple Sequence Alignment support, updating the documentation and examples, and adding initial support for parsing the common GFF file format for storing genomic annotation information.

To make these breeding tools more available, we designed, implemented, and deployed a Breeder Genomics Hub in collaboration with USAID-funded Cornell collaborators. This hub bundles a number of breeding tools into a single computational platform built on the open-source software JupyterHub, giving users immediate access to these tools and to cloud-based computation. JupyterHub supports the R programming language, which many scientists use to analyze data and produce results. To further ease the use of our existing tools like the PHG and TASSEL, a tool for associating traits with genomic information, R interfaces to these tools (rPHG and rTASSEL) have been developed and are available on the Hub.
These R interfaces utilize BrAPI-compliant services, which allow users to load publicly available genotypic and phenotypic data, including our PHG databases, into the JupyterHub environment without needing to download or copy large files. As part of the development of this hub environment, we have given a number of talks, workshops, and poster presentations at various conferences. Currently, we have a test instance of this system publicly available for testing the most recent maize build of the PHG.

We have created a new build of the maize PHG using 84 available assemblies to build the graph. This build (v2.1) was used to impute approximately 2,000 samples spanning a decade of historical data across several different sequencing technologies for the G2F project, in support of a prediction competition (Nov-Dec 2022) organized by ARS in Columbia, Missouri, and Raleigh, North Carolina, in collaboration with the University of Wisconsin. Additionally, nearly 5,000 maize landraces were imputed from the CIMMYT SEEDs project, which is allowing a complete reanalysis of this key resource with complete genome sequence. Importantly, we are now identifying a key group of genes involved in temperature adaptation. The public results from this PHG build are available through a BrAPI service hosted by MaizeGDB. In parallel, substantial efforts have been made to generate assembly-based PHG databases for sorghum, in collaboration with HudsonAlpha and ARS Cold Spring Harbor leveraging USAID funding, and for cassava, with funding from the Bill & Melinda Gates Foundation.

Breeding Insight (BI) is the ARS initiative to increase the adoption of genomics, phenomics, and analytics tools (including data management software) in ARS specialty crop and animal breeding programs, which have lagged behind major crop and animal breeding programs. BI is currently in year 5 (phase II) and its sister program, BI OnRamp, is in year 3.
Together, BI and OnRamp provide breeding support services for 19 ARS species (blueberry, table grape, sweetpotato, alfalfa, rainbow trout, North American Atlantic salmon, honeybee, strawberry, cranberry, oat, pecan, lettuce, cucumber, sorghum, hemp, citrus, sugarcane, soybean, and cotton), with BI providing support to multiple breeder programs for some species. The future goal is expansion out to all ARS specialty crops, animal, and natural resource breeding programs.

In FY 2022-2023, BI's most significant accomplishment was proving the feasibility of using haplotype genotyping (from custom 3K panels) to create genetic maps and run GWAS, QTL analysis, and genomic predictions for two autotetraploid species, blueberry and alfalfa, and one autohexaploid species, sweetpotato. Over 10,000 alfalfa and more than 8,000 blueberry samples were genotyped. This is a substantial and important leap forward in capabilities and genomic insight from 2019, when there were no genetic markers available for these species. We have created a genotyping order database to manage all orders across all species. Furthermore, these data prove that breeders of species with complex polyploid genomes, a common feature in specialty crops and a major blocker for genomic resource creation, can successfully leverage genomic insight in their breeding programs. Both panels are having impact beyond ARS stakeholders, as public breeders in academia and private breeders in industry around the globe genotype their own breeding material and contribute data back to the public upon publication. The adoption of the genotyping platform and pipeline benefits the entire global breeding effort, because all breeders have access to the same markers, which improves data sharing under FAIR data principles.
Given this success, Breeding Insight created additional species-specific 3K marker panels for cranberry, pecan, lettuce, and cucumber, none of which had routine or affordable genotyping platforms available to them. Putting these powerful analyses and genomic tools into the hands of ARS's excellent specialty crop and animal breeders helps to improve breeding decisions and to meet public demand for more nutritious and flavorful foods.

Artificial Intelligence (AI)/Machine Learning (ML)
Machine learning approaches are being used in roughly half of the research being conducted here. Machine learning methods for most of the analyses of the DNA data now include random forests, convolutional neural nets, and transformer models. Breeding Insight additionally uses machine learning extensively for image analysis of fields. Most of the machine learning models requiring advanced graphics processing units (GPUs) were run locally, on collaborators' GPU clusters, or on commercial cloud services. SCINet has been used extensively, but mostly for non-GPU-intensive tasks; enhanced access to GPU clusters is needed. The transformer-based methods are changing the entire way we analyze DNA and protein sequence data and are likely in the next year to allow generalization of models across all plants, a potentially massive increase in scope. Causal variants are about to be discovered for all species, and then AI tools will be key to developing hypotheses on the phenotype impacted and the mechanism.

ACCOMPLISHMENTS
01 Natural and synthetic nitrogen is lost from our food systems before it reaches the consumer. Over 80% of natural and synthetic nitrogen is lost from our food systems before it reaches the consumer, contributing to 97% of US agricultural greenhouse gas emissions (nitrous oxide, methane) and over 60% of water pollution.
The CERCA (Circular Economy for Reimagining Corn Agriculture) project being launched this year focuses on corn, the single largest player in the US agricultural nitrogen system. The goal of this project is to develop corn genetics, in concert with agronomy, that reduce corn's environmental impact by well over 50% by shifting the growing season earlier to capture natural soil nitrogen, reducing corn's demand for nitrogen, and recycling nitrogen back to the soil at the end of the season as perennials do. The CERCA project, led by USDA scientists from across the country and university collaborators (27 total labs), has initiated integrated research covering modeling, agronomy, genetics, and physiology to accomplish these goals.

02 Specialty crops and livestock are central to human nutrition, wellbeing, and cultural preservation. Together their production is responsible for over $150 billion in cash receipts. The USDA-ARS and university partner breeder teams who work on these species develop outstanding biological and practical know-how but often lack specialized expertise in genomic DNA-based breeding or advanced information technologies. USDA-ARS Breeding Insight, in collaboration with Cornell University, centralizes that expertise and adds to it the flexibility to apply advanced genomic and information/automation technologies to these many idiosyncratic species across the country. This year, the project expanded to support 19 species, including 32 breeding teams across 18 states. Genomic markers that accelerate breeding were developed for an additional 5 species (70% increase from last year). Nearly 40,000 potential new varieties were genomically evaluated (80% increase from last year). Information and machine learning technologies deployed to 18 species have integrated historical data and automated the collection of new field data, resulting in a 190% increase in databased records from last year.
Centralization and flexibility are working to enable a scaling not seen before within USDA specialty crops and livestock. Having more data, effectively organized, matters: in sugarcane, blueberry, citrus, and sweetpotato, ARS breeders are collecting data faster and with fewer errors while, for the first time, leveraging all aggregated historical datasets to improve precision in selection. In partnership with ARS alfalfa breeders in St. Paul, Minnesota, and Prosser, Washington, and despite alfalfa's complex genome, the collaboration identified genomic markers for key disease resistance that will accelerate the delivery of highly digestible feed alfalfa to farmers. Breeding Insight helps US breeders accelerate the delivery of nutritious and resilient crops and livestock.
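
Illustrative note on the GFF annotation format mentioned above in connection with BioKotlin: the sketch below is not BioKotlin code and does not reflect its API. It is a minimal Python illustration, under the assumption of well-formed GFF3 input, of what parsing the format involves: nine tab-separated columns, with the last column holding `key=value` attributes separated by semicolons. The names `GffFeature`, `parse_gff3`, and the example records (including `gene0001`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator
from urllib.parse import unquote


@dataclass
class GffFeature:
    """One GFF3 record; start and end are 1-based and inclusive."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3(lines: Iterable[str]) -> Iterator[GffFeature]:
    """Yield features from GFF3 text lines, skipping comments and blank lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        attrs: Dict[str, str] = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 percent-encodes reserved characters in attributes.
                attrs[unquote(key.strip())] = unquote(value.strip())
        yield GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                         cols[5], cols[6], cols[7], attrs)


# Hypothetical records for illustration only.
EXAMPLE = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=hypothetical_gene",
    "chr1\texample\tmRNA\t1000\t2500\t.\t+\t.\tID=mrna0001;Parent=gene0001",
]
features = list(parse_gff3(EXAMPLE))
for f in features:
    print(f.type, f.start, f.end, f.attributes.get("ID"))
```

Keeping the attributes in a plain dictionary makes parent/child lookups (for example, linking an mRNA to its gene through `Parent`) a simple key access.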

Impacts
(N/A)

Publications

  • Wrightsman, T., Marand, A.P., Crisp, P.A., Springer, N.M., Buckler IV, E.S. 2022. Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks. The Plant Genome. 15(3):e20249. https://doi.org/10.1002/tpg2.20249.
  • Khaipho-Burch, M., Ferebee, T., Giri, A., Ramstein, G., Monier, B., Yi, E., Romay, M., Buckler IV, E.S. 2023. Elucidating the patterns of pleiotropy and its biological relevance in maize. PLoS Genetics. 19(3):e1010664. https://doi.org/10.1371/journal.pgen.1010664.
  • Bradbury, P.J., Casstevens, T., Jensen, S.E., Johnson, L.E., Miller, Z.R., Monier, B., Romay, M., Song, B., Buckler IV, E.S. 2022. The practical haplotype graph, a platform for storing and using pangenomes for imputation. Bioinformatics. 38(15):3698-3702. https://doi.org/10.1093/bioinformatics/btac410.
  • Samayoa, L., Olukolu, B.A., Yang, C.J., Chen, Q., Stetter, M.G., York, A.M., Sanchez-Gonzalez, J., Glaubitz, J.C., Bradbury, P., Cinta Romay, M., Sun, Q., Yang, J., Ross-Ibarra, J., Buckler IV, E.S., Doebley, J.F., Holland, J.B. 2021. Domestication reshaped the genetic basis of inbreeding depression in a maize landrace compared to its wild relative, teosinte. PLoS Genetics. 2:1009797. https://doi.org/10.6084/m9.figshare.14750790.
  • Lima, D.C., Washburn, J.D., Varela, J.I., Chen, Q., Gage, J.L., Romay, M.C., Holland, J.B., Ertl, D., Lopez-Cruz, M., Aguate, F.M., De Los Campos, G., Kaeppler, S., Beissinger, T., Bohn, M., Buckler IV, E.S., Edwards, J.W., Flint Garcia, S.A., Gore, M.A., Hirsch, C.N., Knoll, J.E., Mckay, J., Minyo, R., Murray, S.C., Ortez, O.A., Schnable, J., Sekhon, R.S., Singh, M.P., Sparks, E.E., Thompson, A., Tuinstra, M., Wallace, J., Weldekidan, T., Xu, W., De Leon, N. 2023. Genomes to fields 2022 maize genotype by environment prediction competition. BMC Research Notes. 16: Article 148. https://doi.org/10.1186/s13104-023-06421-z.
  • Monier, B., Casstevens, T.M., Bradbury, P., Buckler IV, E.S. 2022. rTASSEL: An R interface to TASSEL for analyzing genomic diversity. Journal of Open Source Software. https://doi.org/10.21105/joss.04530.
  • Washburn, J.D., Cimen, E., Ramstein, G., Reeves, T., O'Briant, P., McLean, G., Cooper, M., Hammer, G., Buckler IV, E.S. 2021. Predicting phenotypes from genetic, environment, management, and historical data using CNNs. Theoretical and Applied Genetics. 134:3997-4011. https://doi.org/10.1007/s00122-021-03943-7.
  • Long, E.K., Romay, M., Ramstein, G., Buckler IV, E.S., Robbins, K.R. 2023. Utilizing evolutionary conservation to detect deleterious mutations and improve genomic prediction in cassava. Frontiers in Plant Science. 13:1041925. https://doi.org/10.3389/fpls.2022.1041925.


Progress 10/01/21 to 09/30/22

Outputs
PROGRESS REPORT Objectives (from AD-416): Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A] Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A] Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B] Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A] Approach (from AD-416): Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. 
This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

This year we continued our genomic efforts, focusing on sequencing wild species in the maize and sorghum clade - the Andropogoneae. Working with USDA collaborators (Stoneville, Mississippi) and others, 40 species have been assembled to a high quality and annotation is nearly complete for most. We have also completed sequencing and annotation for 40 diverse maize inbred lines, selected because of their high diversity and/or importance in the Genomes to Fields and Germplasm Enhancement of Maize (GEM) projects. Finally, we have attempted short-read DNA sequencing on 1,000 Andropogoneae samples from herbarium specimens and other sources, including many from the U.S. National Plant Germplasm System (NPGS). The quality of the DNA has been highly variable and most of the herbarium samples have been challenging, so we are trying to use ancient DNA approaches to fill in key species. Despite this, we have been able to assemble the gene space of over 350 genomes so far and identify the core sets of genes shared across the tribe. For the first time, we can now see how individual genes have evolved and adapted to various environments across hundreds of closely related species.

The first application of this resource has been to resolve problems with the gene models of the key crops maize and sorghum. We have successfully created machine learning models that use evolutionary conservation across species to identify which genes are likely to produce functional proteins. In the case of maize, it now appears that 85% of the non-core genes are likely pseudogenes. This is important because scientists can now focus their attention on the 15% of genes that are likely functional but not shared across varieties.

We have been wrapping up our analysis of deleterious variants in maize and (through collaborations) in other species.
Our previous work has shown that deleterious mutations explain about half of the variation in crop yield between varieties. Our most powerful tool to identify deleterious mutations is the comparison of DNA variation within species to conservation across species. We continue to make progress on tools and pipelines to accurately compare DNA variation within and between species, with the publication of two papers on these approaches this year. Using these alignments, we have now been able to identify deleterious mutations with great precision and show that we can improve genome-wide prediction in studies in three species: maize, tomato, and cassava. These improvements in accuracy are most important when using information across populations. In maize and cassava, we have gone even further and used protein structure machine learning, calibrated against evolutionary conservation machine learning, to further increase prediction accuracy. In the coming year, this work will be pushed further with the high-accuracy structures being determined with AlphaFold2.

We also tested two other hypotheses regarding deleterious genetic load. First, transposons are genomic parasites that can occupy over 85% of a genome, and there has been much debate over the importance of their impact on the fitness of the host. While there is no doubt that active transposition can produce an occasional deleterious effect, the question was whether there is a bulk effect of transposons that is deleterious. In the largest analysis of its kind, we have shown that while maize transposons are statistically significantly deleterious, the effect is only slight, explaining less than 1% of yield variation. The removal of transposons from genomes will not benefit applied agriculture. Second, pleiotropy is when one genetic variant affects two unrelated traits; there are arguments that pleiotropic constraints on the genetic architecture of traits play important roles in productivity.
A massive analysis of maize QTL, looking for pleiotropy at the level of the field, metabolite, and RNA expression, found very little evidence of pleiotropic effects in common standing genetic variation. Again, this suggests most deleterious mutations are likely to be rare variants directly affecting RNA expression, translation, and protein structure.

Our bioinformatic focus is aimed at making tools more usable by molecular breeders while also facilitating repeatable science. We are accomplishing this with three approaches:
(1) We released a new version of TASSEL (our tool for analysis of genetic diversity, which is used in over 800 studies annually), and we have developed user-friendly ways to integrate it with the R statistical environment that is frequently used by breeders (rTASSEL).
(2) Our Practical Haplotype Graph (PHG), a powerful way to represent the haplotype diversity of a crop, has now been developed for 5 major crops. We enhanced the PHG so that it works with the Breeding API (BrAPI), which is the global standard for sharing germplasm, phenotype, and genotype data. Through this BrAPI standard, PHG haplotypes and genotypes are now accessible through the R environment via rPHG, a package we developed for applied researchers.
(3) Setting up the computing environment necessary to do genomic analysis and genome-wide prediction can be challenging for many applied breeders. We developed a prototype Breeder Genomics Hub, which is a JupyterHub that supports the scripting languages used by both breeders (R) and genomicists (Python). We are testing and teaching this hub to breeders this fall.

Fundamentally, this project is beginning to shift its effort from understanding the basis of quantitative genetic variation to the development of applied models and evolutionary datasets applied to develop a new Circular Agricultural System in the US.
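
Illustrative note on BrAPI access: the following is a minimal Python sketch, not part of rPHG, rTASSEL, or the Breeder Genomics Hub, of how a client might address a BrAPI v2 service and unpack its response. It relies on two conventions of the BrAPI v2 specification: endpoints live under `/brapi/v2/`, and list responses carry records in `result.data` with paging details in `metadata.pagination`. The function names, the `example.org` base URL, and the sample germplasm records are hypothetical, and no network call is made.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def brapi_url(base_url: str, endpoint: str, page: int = 0, page_size: int = 100) -> str:
    """Build a paged BrAPI v2 URL, e.g. for the 'germplasm' endpoint."""
    query = urlencode({"page": page, "pageSize": page_size})
    return f"{base_url.rstrip('/')}/brapi/v2/{endpoint.strip('/')}?{query}"


def brapi_records(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the list of records from a BrAPI list response (empty if absent)."""
    return payload.get("result", {}).get("data", [])


def has_more_pages(payload: Dict[str, Any]) -> bool:
    """True when the pagination block says further pages remain."""
    pagination = payload.get("metadata", {}).get("pagination", {})
    return pagination.get("currentPage", 0) + 1 < pagination.get("totalPages", 0)


# Hypothetical response body, shaped like a BrAPI v2 list response.
SAMPLE_RESPONSE = {
    "metadata": {"pagination": {"currentPage": 0, "pageSize": 2,
                                "totalCount": 5, "totalPages": 3}},
    "result": {"data": [
        {"germplasmDbId": "g1", "germplasmName": "example_line_1"},
        {"germplasmDbId": "g2", "germplasmName": "example_line_2"},
    ]},
}

url = brapi_url("https://example.org/", "germplasm", page=0, page_size=2)
names = [rec["germplasmName"] for rec in brapi_records(SAMPLE_RESPONSE)]
print(url)
print(names, has_more_pages(SAMPLE_RESPONSE))
```

Because the data stay on the server and arrive page by page, a client of this kind never needs to download or copy whole genotype files, which is the benefit the report attributes to BrAPI-based access.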
Breeding Insight (BI) is the ARS initiative to increase the adoption of genomics, phenomics, and analytics tools (including data management software) in ARS specialty crop and animal breeding programs, which have lagged behind major crop and animal breeding programs. BI is currently in year 4 (phase II) and its sister program, BI OnRamp, is in year 2. Together, BI and OnRamp provide breeding support services for 19 ARS species (blueberry, table grape, sweetpotato, alfalfa, rainbow trout, North American Atlantic salmon, honeybee, strawberry, cranberry, oat, pecan, lettuce, cucumber, sorghum, hemp, citrus, sugarcane, soybean, and cotton), with BI providing support to multiple breeder programs for some species. The future goal is expansion out to all ARS specialty crops, animal, and natural resource breeding programs.

As COVID restricted travel until March of 2022, BI focused on 1) onboarding the new species admitted as part of phase II and preparing the 2022 timeline of deliverables for each species; 2) using the custom 3K marker panels for blueberry and alfalfa for the routine genotyping needs of these breeding programs across multiple ARS locations, and creating new 3K marker panels for cucumber, lettuce, and sweetpotato; 3) deploying the mobile phenotyping apps Field Book and Smatrix Systems for phenotyping in the 2022 season; 4) loading partial historical breeding data for salmon, grape, sorghum, cranberry, blueberry, lettuce, cucumber, citrus, and sugarcane into their own BreedBase instances (ongoing effort); and 5) scheduling and arranging in-person site visits with each species once COVID restrictions were lifted. As of July 2022, BI has traveled to 8 different breeding program sites, with the goal of visiting 6 more before the end of the year.
Other accomplishments of smaller note include efforts to create an image analysis pipeline for phenotypic data extraction (ongoing); successful deployment of a voice-to-text digital data collection workflow (with the Smatrix mobile app) for animal welfare data logging to replace paper records; creation and maintenance of an active Learning Hub on BI's website, where training material is housed and publicly available for any breeder to learn about the technologies and services that BI provides; and an elevated role for BI in feature requests to Field Book, Smatrix Systems, BreedBase, and BrAPI.

BI's third significant software development accomplishment is the release of version 0.6, which includes germplasm loading functionalities and allows for the expansion of the trait ontology to more general ontology functions such as indicating events and describing the trial environment. The software team has made major improvements to the back-end communications between Field Book and BreedBase through the Breeding API (BrAPI) connection, pushing for BrAPI expansion when necessary. The difficulties experienced by BI staff when importing historical data into BreedBase (while remaining BrAPI compliant) prompted the IT team to create a better and more flexible import/export solution for breeders. Working prototypes of this import tool are under refinement at BI and will be used by BI coordinators to hasten the loading of any type of data into BreedBase. As with all BI's software, it will be BrAPI 2.1 compliant, open-source, and publicly available.

In addition to software development, the IT team manages over 30 servers and databases to support BI's software, BreedBase instances for each species under the Breeding Insight umbrella, and development servers to test new features. This management is done with the help of automated deployment pipelines that the IT team has created, where a new software release can be deployed to all servers in 30 minutes.
ACCOMPLISHMENTS

01 Combining evolution and protein structure identifies deleterious mutations in maize impacting yield. Maize has 37,000 genes that interact to produce the world's highest yielding crop, and natural mutations and disruptions keep the crop from meeting its genetic potential. Combining new machine learning models for protein structure with evolutionary comparisons of maize with other plants, ARS researchers in Ithaca, New York, (along with collaborators) have identified individual mutations affecting yield and used them to improve yield prediction in maize hybrids. In collaboration with colleagues working on cassava, the same approach was also shown to work for yield prediction in that crop, and we expect similar models to be applied to all crops where yield is the primary trait.

02 Breeding Insight expands access to field and genomic tools for 19 specialty crops and animal species. One of the major challenges in breeding is integrating and processing the billions of genomic and field data points needed to make informed decisions. Breeding Insight (a USDA-ARS cooperative agreement with Cornell University) doubled the number of crops in the program, establishing data management systems for all species, new genomic tools for a quarter of the species, and field informatics tools for three-quarters of them. Many of the specialty crops (like blueberry and alfalfa) have genome duplications that make genomic tools challenging to apply, but this year the teams were able to apply these tools to these complex genomes. Putting these powerful analyses and genomic tools into the hands of ARS's excellent specialty crop and animal breeders helps to improve breeding decisions and to meet public demand for more sustainable, nutritious, and flavorful foods.

Impacts
(N/A)

Publications

  • Ferguson, J.N., Fernandes, S.B., Monier, B., Miller, N.D., Allan, D., Dmitrieva, A., Schmuker, P., Lozano, R., Valluru, R., Buckler IV, E.S., Gore, M.A., Brown, P.J., Spalding, E.P., Leakey, A.D. 2021. Machine learning-enabled phenotyping for GWAS and TWAS of WUE traits in 869 field-grown sorghum accessions. Plant Physiology. 187(3):1481-1500. https://doi.org/10.1093/plphys/kiab346.
  • Baseggio, M., Murray, M., Wu, D., Ziegler, G., Kaczmar, N., Chamness, J., Hamilton, J.P., Buell, R.C., Vatamaniuk, O.K., Buckler IV, E.S., Smith, M.E., Baxter, I., Tracy, W.F., Gore, M.A. 2021. Genome-wide association study suggests an independent genetic basis of zinc and cadmium concentrations in fresh sweet corn kernels. G3, Genes/Genomes/Genetics. 11(8). https://doi.org/10.1093/g3journal/jkab186.
  • Willcox, M.C., Burgueño, J.A., Jeffers, D., Rodriguez-Chanona, E., Guadarrama-Espinoza, A., Kehel, Z., Chepetla, D., Shrestha, R., Swarts, K., Hearne, S., Buckler IV, E.S., Chen, N.C. 2022. Mining alleles for tar spot complex resistance from CIMMYT's maize germplasm bank. Frontiers in Sustainable Food Systems. 6:937200. https://doi.org/10.3389/fsufs.2022.937200.
  • Oren, E., Tzuri, G., Dafna, A., Reese, E.R., Song, B., Freilich, S., Elkind, Y., Isaacson, T., Schaffer, A.A., Tadmor, Y., Burger, J., Buckler IV, E.S., Gur, A. 2022. QTL mapping and genomic analyses of earliness and fruit ripening traits in a melon recombinant inbred lines population supported by de novo assembly of their parental genomes. Horticulture Research. https://doi.org/10.1093/hr/uhab081.
  • Lozano, R., Gazave, E., Dos Santos, J., Stetter, M.G., Valluru, R., Bandillo, N., Fernandes, S., Brown, P.J., Shakoor, N., Mockler, T., Cooper, E., Perkins, T., Buckler IV, E.S., Ross-Ibarra, J., Gore, M.A. 2021. Comparative evolutionary genetics of deleterious load in sorghum and maize. Nature Plants. 7:17-24. https://doi.org/10.1038/s41477-020-00834-5.
  • Barnes, A.C., Rodríguez-Zapata, F., Juárez-Núñez, K.A., Gates, D.J., Janzen, G.M., Kur, A., Wang, L., Jensen, S.J., Estévez-Palmas, J.M., Crow, T.M., Kavi, H.S., Pil, H.D., Stokes, R.L., Knizner, K.T., Aguilar-Rangel, M.R., Demesa-Arévalo, E., Skopelitis, T., Pérez-Limón, S., Stutts, W.L., Thompson, P., Chiu, Y., Jackson, D., Muddiman, D.C., Fiehn, O., Runcie, D., Buckler IV, E.S., Ross-Ibarra, J., Hufford, M.B., Sawers, R.J., Rellán-Álvarez, R. 2022. An adaptive teosinte mexicana introgression modulates phosphatidylcholine levels and is associated with maize flowering time. Proceedings of the National Academy of Sciences (PNAS). 119(27). Article e2100036119. https://doi.org/10.1073/pnas.2100036119.
  • Lozano, R., Booth, G.T., Omar, B., Li, B., Buckler IV, E.S., Lis, J.T., Pino Del Carpio, D., Jannink, J. 2021. RNA polymerase mapping in plants identifies intergenic regulatory elements enriched in causal variants. Genes, Genomes, Genetics. jkab273. https://doi.org/10.1093/g3journal/jkab273.
  • Giri, A., Khaipho-Burch, M., Buckler IV, E.S., Ramstein, G.P. 2021. Haplotype associated RNA expression (HARE) improves prediction of complex traits in maize. PLoS Genetics. https://doi.org/10.1371/journal.pgen.1009568.
  • Zhang, X., Zhu, Y., Kremling, K.A., Romay, C., Bukowski, R., Sun, Q., Gao, S., Buckler IV, E.S., Lu, F. 2021. Genome-wide analysis of deletions in maize population reveals abundant genetic diversity and functional impact. Theoretical and Applied Genetics. https://doi.org/10.1007/s00122-021-03965-1.
  • Long, E.M., Bradbury, P., Romay, C.M., Buckler IV, E.S., Robbins, K.R. 2021. Genome-wide imputation using the practical haplotype graph in the heterozygous crop cassava. G3, Genes/Genomes/Genetics. 12(1):jkab383. https://doi.org/10.1093/g3journal/jkab383.
  • Song, B., Marco-Sola, S., Moreto, M., Johnson, L., Buckler IV, E.S., Stitzer, M.C. 2021. AnchorWave: sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication. Proceedings of the National Academy of Sciences (PNAS). 119(1). Article e2113075119. https://doi.org/10.1073/pnas.2113075119.
  • Pignon, C.P., Fernandes, S.B., Valluru, R., Bandillo, N., Lozano, R., Buckler IV, E.S., Gore, M.A., Long, S.P., Brown, P.J., Leakey, A. 2021. Phenotyping stomatal closure by thermal imaging for GWAS and TWAS of water use efficiency-related genes. Plant Physiology. 184(4):2544-2562. https://doi.org/10.1093/plphys/kiab395.
  • Wu, Y., Johnson, L., Song, B., Romay, M.C., Stitzer, M., Siepel, A., Buckler IV, E.S., Scheben, A. 2022. A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants. The Plant Genome. 15(2). Article e20204. https://doi.org/10.1002/tpg2.20204.
  • Gage, J.L., Mali, S., McLoughlin, F., Khaipho-Burch, M., Monier, B., Bailey-Serres, J., Vierstra, R.D., Buckler IV, E.S. 2022. Variation in upstream open reading frames contributes to allelic diversity in maize protein abundance. Proceedings of the National Academy of Sciences (PNAS). 119(14). Article e2112516119. https://doi.org/10.1073/pnas.2112516119.
  • Bradbury, P.J., Casstevens, T., Jensen, S.E., Johnson, L.C., Miller, Z.R., Monier, B., Romay, M.C., Song, B., Buckler IV, E.S. 2022. The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac410.
  • Dafna, A., Halperin, I., Oren, E., Isaacson, T., Tzuri, G., Meir, A., Schaffer, A.A., Burger, J., Tadmor, Y., Buckler IV, E.S., Gur, A. 2021. Underground heterosis for yield improvement in melon. Journal of Experimental Botany. 72(18):6205-6218. https://doi.org/10.1093/jxb/erab219.


Progress 10/01/20 to 09/30/21

Outputs
PROGRESS REPORT Objectives (from AD-416): Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A] Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A] Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B] Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A] Approach (from AD-416): Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. 
This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

This year our genomic efforts were focused on sequencing wild species in the maize and sorghum clade, the Andropogoneae. Working with USDA collaborators (Stoneville, Mississippi) and others, 24 species have been assembled to a high quality, and an additional 21 species are in process. We have also sampled and sequenced 49 diverse maize inbred lines, selected because of their high diversity and/or importance in the Genomes to Fields and Germplasm Enhancement of Maize (GEM) projects. While the current generation of long-read DNA sequencers is extremely powerful, there is still tremendous variation in output from different samples and different runs. The variation in output is caused by carryover contaminants from long DNA preps (plants have a wide range of secondary metabolites) and inconsistencies in DNA sequencer technology and reagents. Finally, we have performed short-read DNA sequencing on 350 Andropogoneae species from herbarium specimens and other field collections.

Along with these massive genomic datasets come two challenges: variation in quality, and alignment. In terms of quality, these datasets, while expansive, have biases and genomic regions that are not fully sequenced. These inconsistencies have pushed us to develop bioinformatic pipelines that scaffold entire genomes accurately using pangenomes as a reference. These pipelines allow us to upgrade both our assemblies and other assemblies to higher quality. As for alignment, with nearly 500 genomes of maize and related species available, sequence alignment is one of the most important steps in making inferences from these data. However, it is still extremely challenging to align and compare the large intergenic regions of two separate species.
By combining a new dynamic programming algorithm with whole genome alignment, we have created the most sensitive whole genome aligner, which is providing critical insights into the evolution and function of regulatory regions of the genome.

How an organism grows and performs is substantially the product of the level and timing of gene expression and the activity of each protein. In three studies this year, we showed that modeling RNA expression, protein translation, and protein structure resulted in substantial improvements in genomic prediction of field performance. First, we used a novel approach to statistically estimate RNA expression variation caused by genetic variants near the gene, and then used these estimates to predict field traits in other populations. Notably, the genetic variation near genes (cis-variation) produces consistent RNA expression effects across tissues and conditions. With these estimated cis-RNA effects, we saw substantial improvements in genomic prediction across all traits, suggesting that modeling RNA expression for all genes is key to accurate genomic prediction. Second, we studied how natural variation impacts translation of RNA to protein, and found that upstream open reading frames both carried the most deleterious mutations and were key for adaptation. Third, our modeling of deleterious mutations has previously focused on the conservation of amino acids. This year, we combined machine learning models to predict deleterious impact using both conservation and structure, and we saw substantial improvements in the prediction accuracy of hybrid vigor and yield.

In seven papers this year, we evaluated how maize, sorghum, and other plants adapt to their local environments. First, using a novel statistical approach that compares population differentiation for RNA expression variation to that for DNA variation, we showed that RNA changes drive substantial local adaptation.
Second, we demonstrated that the adaptation is frequently controlled at conserved non-coding regions, which are shared among related grasses. As our genomic sequencing expands to hundreds of Andropogoneae genomes, this approach will become extremely powerful in helping us understand adaptive variation. Third, in two studies led by collaborators, we saw the impact of RNA expression on adaptation to drought in both maize and sorghum. Finally, in three studies, we have evaluated how evolution adapts proteins to various environmental temperatures. Our strategy was to develop models at the single residue level in microbes and then apply those models to homologous proteins in plants. A third of the amino acids in each protein are temperature sensitive. These microbially developed models correctly predicted how an essential maize gene in phospholipid metabolism helps maize adapt to various temperature regimes. In a comparison of maize, Arabidopsis, and poplar, we also saw evidence for temperature adaptation among different plant organs (e.g., roots versus leaves), among different organelles, and based on the environmental history of each species.

Quantitative genetics over the last two decades has focused on either the individual or the combined impact of genetic variants on phenotype. However, we know there are many synergistic and non-linear effects between genetic variants within a single gene. Now that we can fully reconstruct genomes, we are tackling the bioinformatics needed to model entire gene haplotypes. Our system for modeling haplotypes across a species, the Practical Haplotype Graph (PHG), has been substantially enhanced to work with the massive datasets now typified by maize. It also now supports integration with the public Breeding API standards and works with the R computing environment. We have helped develop PHGs for maize, wheat, sorghum, and cassava.
Our TASSEL software package continues to be among the most popular tools for analysis of functional diversity, and we have begun the development and public release of TASSEL 6, which starts the paradigm shift from focusing on large numbers of genetic variants to focusing on how these variants combine to create different functional haplotypes.

Breeding Insight (BI) is the ARS initiative to increase the adoption of genomics, phenomics, and analytics tools (including data management software) in ARS specialty crop and animal breeding programs, which have lagged behind major crop and animal breeding programs. BI is currently in year three of a pilot phase focused on building support services for six ARS breeding programs (blueberry, table grape, sweetpotato, alfalfa, rainbow trout, and North American Atlantic salmon), with the future goal of expansion to all ARS specialty crop, animal, and natural resource breeding programs. As COVID restricted travel, BI focused on 1) training and support for Field Book deployment in blueberry, alfalfa, and sweetpotato for the 2021 field season, 2) validation of custom 3K marker panels for blueberry, alfalfa, and Atlantic salmon, 3) genome sequencing of ARS sweetpotato for marker creation, and 4) loading of historical breeding program data for grape, sweetpotato, and both salmonids into their own BreedBase instances. Other accomplishments of smaller note include creation of a genotypic analysis pipeline to assist with marker-assisted selection in grape (though the solution is flexible enough to be expanded to other species), the initiation of a sweetpotato weevil sequencing project to allow the breeder to identify wild populations of this endemic pest in his fields, and a completely revamped website to better serve BI stakeholders. BI's second significant software development accomplishment is the release of two open-source applications that allow the public to test and use BI code in a user-friendly interface.
These "sandbox sites" are critical both for testing new code and for refining the interface to better suit the wide variety of breeders that BI serves. The software team has made major improvements to the back-end communications between Field Book and BreedBase through the Breeding API (BrAPI) connection. The difficulties experienced by BI staff when importing historical data into BreedBase (while remaining BrAPI compliant) prompted the IT team to create a better and more flexible import/export solution for breeders. Working prototypes of this import tool are under refinement at BI and will be used by BI coordinators to hasten loading of any type of data into BreedBase. As with all BI's software, it will be BrAPI 2.0 compliant, open-source, and publicly available. The IT team at the Breeding Management System has already expressed a desire to integrate this tool into their software stack. In year 3, BI completed hiring for all the roles detailed in the proposal. These new hires included a new Breeding Coordinator, a Communication and Training Lead, a Phenomics Coordinator, a Software Q/A Specialist, and two new Application Programmers. BI also hired a Product Owner to guide and prioritize software development to align with breeders' needs and complete the BI minimum viable product (MVP).

Record of Any Impact of Maximized Teleworking Requirement: While much of the group's research was unaffected by teleworking, our research has been impacted in three areas. First, we were unable to conduct molecular biology lab work for nearly three months at Cornell facilities, which delayed our sequencing of new genomes from maize and other grasses. Second, decreased access to and intensity of use of the Cornell-run field locations resulted in curtailment of some field work. We had planned nearly a dozen field experiments, but we were only able to carry out our two largest experiments.
Third, Breeding Insight staff could not engage in on-site learning and collaboration with the various ARS breeding programs as planned (suspended travel), which slowed down training and coordination.

ACCOMPLISHMENTS

01 Gene level modeling of RNA production improves prediction of field performance. Maize has 37,000 genes that interact to grow and respond to the environment; thus, one of the key goals of breeding is to be able to predict these interactions from just the DNA. Using over 70 million measurements of how much RNA each of these genes produces under various conditions and novel statistical approaches, ARS researchers in Ithaca, New York, (along with collaborators) have substantially improved our ability to predict how thousands of varieties of maize will grow for over two dozen traits. Importantly, this approach could be developed for many other crops using the powerful genomic tools available. In the long term, this will allow advanced genomic models to be applied to all crops.

02 Breeding Insight deploys field and genomic tools for five of six specialty crop and animal species. While specialty crops and animals are a large portion of gross U.S. agricultural revenue, individually these small programs have not had access to the data capture and genomic innovations that benefit major crop and animal breeding programs and, thus, have lagged behind. The challenge is both in constructing the genomic resources (data) and in integrating and processing the billions of genomic and field data points needed to make informed decisions. This year, Breeding Insight generated genomic resources for breeding of blueberry, alfalfa, sweetpotato, sweetpotato weevil, and North American Atlantic salmon. Additionally, BI developed databases and field data collection systems for each of these species.
Putting this powerful information and these genomic tools into the hands of ARS's excellent specialty crop and animal breeders helps to improve breeding decisions and to meet public demand for more nutritious and flavorful foods.

Impacts
(N/A)

Publications

  • Lozano, R., Gazave, E., Dos Santos, J., Valluru, R., Bandillo, N., Fernandes, S., Brown, P.J., Shakoor, N., Mockler, T., Ross-Ibarra, J., Buckler IV, E.S., Gore, M.A. 2021. Comparative evolutionary genetics of deleterious load in sorghum and maize. Nature Plants. 7:17-24. https://doi.org/10.1038/s41477-020-00834-5.
  • Cimen, E., Jensen, S., Buckler IV, E.S. 2020. Building a tRNA thermometer to estimate microbial adaptation to temperature. Nucleic Acids Research. 48(21):12004-12045. https://doi.org/10.1093/nar/gkaa1030.
  • Chen, S., Mei-Hsiu, S., Kremling, K.A., Lepak, N.K., Romay, M.C., Sun, Q., Bradbury, P., Buckler IV, E.S., Ku, H. 2020. Identification of miRNA-eQTLs in maize mature leaf by GWAS. Biomed Central (BMC) Genomics. 21(689). https://doi.org/10.1186/s12864-020-07073-0.
  • Ding, Y., Weckwerth, P.R., Poretsky, E., Murphy, K.M., Sims, J., Saldivar, E., Christensen, S.A., Char, S., Yang, B., Tong, A., Shen, Z., Kremling, K.A., Buckler IV, E.S., Kono, T., Nelson, D.R., Bohlmann, J., Bakker, M.G., Vaughan, M.M., Khalil, A.S., Betsiashvili, M., Briggs, S.P., Zerbe, P., Schmelz, E.A., Huffaker, A. 2020. Genetic elucidation of interconnected antibiotic pathways mediating maize innate immunity. Nature Plants. 6:1375-1388. https://doi.org/10.1038/s41477-020-00787-9.
  • Tu, X., Majia-Guerra, M., Valdes Franco, J.A., Tzeng, D., Chu, P., Shen, W., Wei, Y., Dai, X., Li, P., Buckler IV, E.S., Zhong, S. 2020. Reconstructing the maize leaf regulatory network using ChIP-seq data of 104 transcription factors. Nature Communications. 11:5089. https://doi.org/10.1038/s41467-020-18832-8.
  • Blanc, J., Kremling, K., Buckler IV, E.S., Josephs, E. 2021. Local adaptation contributes to gene expression divergence in maize. Genes, Genomes, Genetics. 11(2):jkab004. https://doi.org/10.1093/g3journal/jkab004.
  • Jores, T., Tonnies, J., Wrightsman, T., Buckler IV, E.S., Cuperus, J.T., Fields, S., Queitsch, C. 2021. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nature Plants. 7:842-855. https://doi.org/10.1038/s41477-021-00932-y.
  • Jarquin, D., De Leon, N., Romay, M., Bohn, M., Buckler IV, E.S., Ciampitti, I., Edwards, J.W., Ertl, D., Flint Garcia, S.A., Gore, M.A., Graham, C., Hirsch, C.N., Holland, J.B., Hooker, D., Kaeppler, S.M., Knoll, J.E., Lee, E.S., Lawrence-Dill, C.J., Lynch, J.P., Moose, S.P., Murray, S.C., Nelson, R., Rocheford, T., Schnable, J.C., Schnable, P.S., Smith, M., Springer, N., Thomison, P., Tuinstra, M., Wisser, R.J., Xu, W., Lorenz, A. 2021. Utility of climatic information via combining ability models to improve genomic prediction for yield within the genomes to fields maize project. Frontiers in Genetics. 11:592769. https://doi.org/10.3389/fgene.2020.592769.
  • Rogers, A.R., Dunne, J.C., Romay, M.C., Bohn, M., Buckler IV, E.S., Ciampitti, I.C., Edwards, J.W., Ertl, D., Flint Garcia, S.A., Gore, M.A., Graham, C., Hirsch, C.N., Hood, E.C., Hooker, D., Knoll, J.E., Lee, E.C., Lorenz, A., Lynch, J.P., Mckay, J., Moose, S.P., Murray, S.C., Nelson, R., Rocheford, T., Schnable, J.C., Schnable, P.S., Sekhon, R., Singh, M., Smith, M., Springer, N., Thelen, K., Thomison, P., Thompson, A., Tuinstra, M., Wallace, J., Wisser, R., Xu, W., Gilmour, A., Kaeppler, S.M., Deleon, N., Holland, J.B. 2021. The importance of dominance and genotype-by-environment interactions on grain yield variation in a large-scale public cooperative maize experiment. Genes, Genomes, Genetics. 11(2):jkaa050. https://doi.org/10.1093/g3journal/jkaa050.
  • Dos Santos, J.P., Fernandes, S.B., Mccoy, S., Lozano, R., Brown, P.J., Leakey, A.D., Buckler IV, E.S., Garcia, A.A., Gore, M.A. 2020. Novel Bayesian networks for genomic prediction of developmental traits in biomass sorghum. Genes, Genomes, Genetics. 10(2):769-781. https://doi.org/10.1534/g3.119.400759.
  • Jordan, K., Bradbury, P., Miller, Z., Nyine, M., He, F., Guttieri, M.J., Brown Guedira, G.L., Buckler IV, E.S., Jannink, J., Akhunov, E., Ward, B.P., Bai, G., Bowden, R.L., Fiedler, J.D., Faris, J.D. 2021. Development of the Wheat Practical Haplotype Graph Database as a Resource for Genotyping Data Storage and Genotype Imputation. G3 Genes/Genomes/Genetics. https://doi.org/10.1101/2021.06.10.447944.
  • Swarts, K., Bauer, E., Glaubitz, J.C., Ho, T., Johnson, L., Li, Y., Li, Y., Miller, Z., Schon, C., Wang, T., Zhang, Z., Buckler IV, E.S., Bradbury, P. 2021. Joint analysis of days to flowering reveals independent temperate adaptations in maize. Heredity. 126:929-941. https://doi.org/10.1038/s41437-021-00422-z.
  • Wang, L., Huang, Y., Liu, Z., He, J., Jiang, X., He, F., Lu, Z., Yang, S., Chen, P., Yu, H., Zeng, B., Ke, L., Xie, Z., Larkin, R., Jiang, D., Ming, R., Buckler IV, E.S., Xu, Q. 2021. Somatic variations led to the selection of acidic and acidless orange cultivars. Nature Plants. https://doi.org/10.1038/s41477-021-00941-x.
  • Diepenbrock, C.H., Ilut, D.C., Magallanes-Lundback, M., Kandianis, C.B., Lipka, A.E., Bradbury, P., Holland, J.B., Hamilton, J.P., Wooldridge, E., Vaillancourt, B., Góngora-Castillo, E., Wallace, J.G., Cepela, J., Mateos-Hernandez, M., Owens, B.F., Tiede, T., Buckler IV, E.S., Rocheford, T., Buell, C., Gore, M.A., Dellapenna, D. 2021. Eleven biosynthetic genes explain the majority of natural variation in carotenoid levels in maize grain. The Plant Cell. 33(4):882-900. https://doi.org/10.1093/plcell/koab032.
  • Jensen, S., Charles, J., Muleta, K., Bradbury, P., Casstevens, T., Deshpande, S.P., Gore, M.A., Gupta, R., Johnson, L., Lozano, R., Miller, Z., Ramu, P., Rathore, A., Upadhyaya, H.D., Varshney, R., Morris, G.P., Pressoir, G., Buckler IV, E.S., Ramstein, G. 2020. A sorghum practical haplotype graph facilitates genome-wide imputation and cost effective genomic prediction. The Plant Genome. 13(1). Article e20009. https://doi.org/10.1002/tpg2.20009.
  • Wu, X., Feng, H., Wu, D., Yan, S., Zhang, P., Wang, W., Zhang, J., Ye, J., Dai, G., Fan, Y., Li, W., Song, B., Geng, Z., Yang, W., Chen, G., Qin, F., Terzaghi, W., Stitzer, M., Li, L., Xiong, L., Yan, J., Buckler IV, E.S., Dai, M. 2021. Using high-throughput multiple optical phenotyping to decipher the genetic architecture of maize drought tolerance. Genome Biology. 22(185):1-26. https://doi.org/10.1186/s13059-021-02377-0.
  • Song, B., Buckler IV, E.S., Wang, H., Wu, Y., Rees, E., Kellogg, E.A., Gates, D.J., Khaipho-Burch, M., Bradbury, P., Ross-Ibarra, J., Hufford, M.B., Romay, M. 2021. Conserved noncoding sequences provide insights into regulatory sequence and loss of gene expression in maize. Genome Research. 31:1245-1257.


Progress 10/01/19 to 09/30/20

Outputs
Progress Report Objectives (from AD-416): Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A] Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A] Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B] Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A] Approach (from AD-416): Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. 
This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing.

This year we used the pipeline developed last year to continue the assembly of maize genomes, starting with an additional 9 genomes from the Genomes to Fields initiative. As sequencing technologies continue to improve, the project worked closely with USDA scientists in Stoneville, Mississippi, to implement a high-throughput genome assembly pipeline for maize using the newest high-quality, long-read technology. To complement the public release of 26 very high-quality maize assemblies, we selected an additional 25 maize genomes to capture the diversity missing from them. We have assembled 15 of those using this new system, and the other 10 are in progress and expected to be done in the next few months. We have used all this information to generate an improved maize practical haplotype graph database that captures most of maize diversity. In addition, working with our collaborators, we have finished the collection of live clones of Andropogoneae species, wild relatives of maize and sorghum, and started genome sequencing and assembly at scale. At this point we have 15 more genomes of different species available, and 5 more are coming in the next few months. All these newly assembled genomes, combined with the information in the maize graph, will improve our understanding of the functionally constrained regions of the genome and our current estimates of the effects of allele changes on yield.

The prevailing hypothesis for hybrid vigor is that it is the product of having two alleles of each gene, which provides either complementation of deleterious mutations or total expression that is closer to the optimal expression level. For individual genes and pathways, we know that both mechanisms are in operation.
The general question is how the dosage and dominance of a gene relate to phenotype. At the whole-genome scale, we tested a range of hypotheses relating SNP variation to the prediction of hybrid yield and other traits. This showed that the contributors to predicting yield were dosage, genetic variation close to genes, and mutations in conserved nucleotides. In another series of experiments that related the dosage of RNA expression to whole-plant performance, we found that substantial improvements in prediction accuracy could be made if models were trained on 5000 different genotypic observations, but datasets with only a few hundred distinct genotypes provided no improvement over standard genome-wide prediction. This suggests that, given the more than 30,000 genes in the genome, large studies are needed to make progress on directly integrating dosage with whole-plant phenotype. An alternative to building large empirical datasets that connect genotype or expression to phenotype is to develop models that are built on mechanistic processes and are well parameterized with specific datasets. We have made progress on training models to predict expression in maize and Arabidopsis, transcription factor binding in maize and Arabidopsis, and transfer of expression and transcription factor models between maize and Arabidopsis. Through several collaborations, we have also just collected data on protein levels and on expression of a control gene against nearly every maize promoter (STARR-seq). These models are starting to approach the accuracy at which they could rival measured expression in the next couple of years. There has also been tremendous progress by other groups in using machine learning for protein structure prediction. While we have extensively tested these models, at present they do not seem sensitive enough to make actionable predictions. 
Machine learning methods for protein structure are evolving very rapidly in the biomedical context, and we will continue to evaluate these models as they are developed to test for applicability to crop improvement. Overall, we are seeing evidence for good transferability of models among plants, and while protein models have insufficient resolution currently, we expect this to change soon. Within the next several years, mechanistic models that work across eukaryotes seem likely. While the project originally targeted integration of its software with Spark, a platform that facilitates large-scale, parallel computation, it became apparent that the genomics and plant breeding community was not adopting that platform. Instead, two important trends have been the widespread use of R and software containers. R is a computing environment for statistics and graphics. Containers encapsulate complex software environments, making complex packages much easier to distribute and use. We also began development of the Practical Haplotype Graph (PHG) software for organizing pan-genomes and using them for imputation. The project has recently made publicly available R interfaces to both TASSEL and PHG, called rTASSEL and rPHG. These interfaces provide the ability to run analyses, export results, and take advantage of R packages for downstream analysis and visualization. The PHG software is being distributed as a Docker image, which can be downloaded from a website called DockerHub. As another approach to making analysis methods in TASSEL and PHG more easily accessible and more performant, the project has begun investigating the use of GraalVM, a remarkable computing environment that both makes it easier to combine programming languages and provides faster execution than the leading JVMs. A JVM is a Java Virtual Machine, which is the software engine that runs Java and Java-compliant programs. 
TASSEL and PHG are written in a combination of the Java and Kotlin languages, both of which run on JVMs. All of these efforts reflect the project's longer-term objective to stop developing graphical user interfaces and instead to take advantage of widely used notebook-style software for managing complex research workflows. Breeding Insight (BI) is an ARS initiative to increase adoption of genomics, phenolics, and analytics tools (including data management software) in ARS specialty crop and animal breeding programs, which have lagged behind major crop and animal breeding programs. BI is currently in year 2 of a pilot phase focused on building support services for 6 ARS breeding programs (blueberry, table grape, sweet potato, alfalfa, rainbow trout, and North American Atlantic salmon), with the future goal of expansion out to all ARS specialty crops, animal, and natural resource breeding programs. In year 2, we completed most of BI's hiring. The first focus was on understanding the needs of the various breeding programs, their commonalities and their differences, which was accomplished with location visits and meetings held monthly or more frequently. While there is a wide range of informatic platforms to assist breeders, these collaborations facilitate the development of clear tools that are workflow based with high-quality interfaces. The software team is about halfway done creating these initial workflows. BI's first significant accomplishment is the release of open-source software code that allows seamless data transfer between the leading field data collection platform and the leading open-source database system using BrAPI (Breeding API). BI worked with the US grape breeding community to deploy this, and they are using it to improve the efficiency and accuracy of their day-to-day work. 
In the past year, BI completed genome sequencing on ARS alfalfa and blueberry to create a set of markers for breeding efforts and provided a set of 100K markers to create a North American Atlantic salmon genotyping platform available to the public. BI has also supported genotyping and evaluating 4000 grape varieties. Breeding Insight is off to a strong start, but the rest of the year will be key in completing the initial version.
Accomplishments
01 The genomic toolbox for regulating genes is shared across flowering plants and crops. Flowering plants and crops have 20,000 to 60,000 genes, but those genes are controlled by a smaller set of two thousand regulator genes called transcription factors. Are the patterns for how these regulator genes bind DNA and turn on genes consistent across plants? In two large studies, ARS researchers in Ithaca, New York, along with collaborators, have shown that the interaction between regulator genes and DNA is evolutionarily consistent across flowering plants. The tremendous diversity of plants is the product of combining these regulator gene-DNA interactions into numerous new combinations. This suggests that plant scientists should work across species to develop a single model for the regulation of plant genes. In the long term, this will allow advanced genomic models to be applied to all crops.
02 Breeding Insight starts supporting ARS specialty crop and animal breeders. While specialty crops and animals are a large portion of gross US agricultural revenue, individually these small programs have not had access to innovations that benefited major crop and animal breeding programs and thus have lagged behind. ARS specialty breeders are often the sole source of publicly available new crop varieties for farmers and growers across the US and elsewhere. 
Breeding Insight is currently in a pilot phase focused on building support services for 6 ARS breeding programs (blueberry, table grape, sweet potato, alfalfa, rainbow trout, and North American Atlantic salmon), with the future goal of expansion to all ARS specialty crops, animal, and natural resource breeding programs. The project has identified the key workflows common to these diverse programs, and initiated the development of extensive software and genomics to support these efforts. A key early success was integration of the leading field data collection tool with the community's leading database. Genomic support was delivered for all programs. Providing powerful information and genomic tools to ARS's excellent specialty crop and animal breeders is helping to improve breeding decisions, meet public demands for more nutritious and flavorful foods, and improve food security for the US and its trade partners.

Impacts
(N/A)

Publications

  • Valluru, R., Gazave, E.E., Fernandes, S.B., Ferguson, J.N., Lozano, R., Hirannaiah, P., Zuo, T., Brown, P.J., Leakey, A.D., Gore, M., Buckler IV, E.S., Bandillo, N. 2018. Leveraging mutational burden for complex trait prediction in sorghum. bioRxiv.
  • Shaoqun, Z., Kremling, K.A., Bandillo, N., Richter, A., Zhang, Y.K., Ahern, K.R., Artyukhin, A.B., Hui, J.X., Younkin, G.C., Schroeder, F.C., Buckler IV, E.S., Jander, G. 2019. Metabolome-scale genome-wide association studies reveal chemical diversity and genetic control of maize specialized metabolites. The Plant Cell. 31:937-955.
  • Alkhalifah, N., Campbell, D., Falcon, C., Miller, N., Romay, M., Walls, R., Walton, R., Yeh, C., Bohn, M., Buckler IV, E.S., Ciampitti, I., Flint Garcia, S.A., Gore, M., Graham, C., Hirsch, C., Holland, J.B., Hooker, D., Kaeppler, S., Knoll, J.E., Lauter, N.C., Lee, E., Lorenz, A., Lynch, J., Moose, S., Murray, S., Nelson, R., Rocheford, T., Rodriguez, O., Schnable, J., Scully, B.T., Smith, M., Springer, N., Thomison, P., Tuinstra, M., Wisser, R., Xu, W., Ertl, D., Schnable, P., De Leon, N., Spalding, E., Edwards, J.W., Lawrence-Dill, C. 2018. Maize genomes to fields: 2014 and 2015 field season genotype, phenotype, environment, and inbred ear image datasets. Biomed Central (BMC) Plant Biology. 11:452.
  • Ding, Y., Murphy, K., Poretsky, E., Mafu, S., Yang, B., Char, S., Christensen, S.A., Saldivar, E., Wu, M., Wang, Q., Ji, L., Schmitz, R., Kremling, K., Buckler IV, E.S., Shen, Z., Briggs, S., Bohlmann, J., Sher, A., Castro-Falcon, G., Hughes, C., Huffaker, A., Zerbe, P., Schmelz, E. 2019. Multiple genes recruited from hormone pathways partition maize diterpenoid defences. Nature Plants.
  • Yang, C., Samayoa, L., Bradbury, P., Olukolu, B.A., Xue, W., York, A.M., Tuholski, M.R., Wang, W., Daskalska, L.L., Neumeyer, M.A., Sanchez-Gonzales, J., Romay, M.C., Glaubitz, J.C., Sun, Q., Buckler IV, E.S., Holland, J.B., Doebley, J.F. 2019. The genetic architecture of teosinte catalyzed and constrained maize domestication. Proceedings of the National Academy of Sciences. 116:5643-5652.
  • Ramstein, G.P., Larsson, S.J., Cook, J.P., Edwards, J.W., Ersoz, E.S., Flint Garcia, S.A., Gardner, C.A., Holland, J.B., Lorenz, A.J., Mcmullen, M.D., Millard, M.J., Rocheford, T.R., Tuinstra, M.R., Bradbury, P., Buckler IV, E.S., Romay, M.C. 2020. Dominance effects and functional enrichments improve prediction of agronomic traits in hybrid maize. Genetics. 215:215-230.
  • Baseggio, M., Murray, M., Magallanes-Lundback, M., Kaczmar, N., Chamness, J., Buckler IV, E.S., Smith, M.E., Dellapenna, D., Tracy, W.F., Gore, M.A. 2020. Natural variation for carotenoids in fresh kernels is controlled by uncommon variants in sweet corn. The Plant Genome.
  • Kremling, K., Diepenbrock, C., Gore, M., Buckler IV, E.S., Bandillo, N. 2019. Transcriptome-wide association supplements genome-wide association in Zea mays. Genes, Genomes, Genetics. 9(9):3023-3033.
  • Chen, Q., Samayoa, L., Yang, C.J., Bradbury, P., Olukolu, B., Neumeyer, M. A., Tomay, M., Sun, Q., Lorant, A., Buckler IV, E.S., Ross-Ibarra, J., Holland, J.B., Doebley, J.F. 2020. The genetic architecture of the maize progenitor, teosinte, and how it was altered during maize domestication. PLoS Genetics. 16(5):e1008791.
  • Falcon, C.M., Kaeppler, S.M., Spalding, E.P., Miller, N.D., Haase, N., Alkhalifah, N., Bohn, M., Buckler IV, E.S., Campbell, D.A., Ciampitti, I., Coffey, L., Edwards, J.W., Ertl, D., Flint Garcia, S.A., Gore, M.A., Graham, C., Hirsch, C.N., Holland, J.B., Jarquin, D., Knoll, J.E., Lauter, N.C., Lawrence-Dill, C.J., Lee, E.C., Lorenz, A., Lynch, J.P., Murray, S.C., Nelson, R., Romay, M., Rocheford, T., Schnable, P., Scully, B.T., Smith, M.C., Springer, N., Tuinstra, M., Walton, R., Weldekidan, T., Wisser, R.J., Xu, W., De Leon, N. 2020. Relative utility of agronomic, phenological, and morphological traits for assessing genotype-by-environment interaction in maize inbreds. Crop Science. 60:62-81.
  • Mcfarland, B.A., Alkhalifah, N., Bohn, M., Bubert, J., Buckler IV, E.S., Ciampitti, I., Edwards, J.W., Ertl, D., Gage, J.L., Falcon, C.M., Flint Garcia, S.A., Gore, M., Graham, C., Hirsch, C., Holland, J.B., Hood, E., Hooker, D., Jarquin, D., Kaeppler, S., Knoll, J.E., Kruger, G., Lauter, N.C., Lee, E.C., Lima, D.C., Lorenz, A., Lynch, J.P., Mckay, J., Miller, N.D., Moose, S.P., Murray, S.C., Nelson, R., Poudyal, C., Rocheford, T., Rodriguez, O., Romay, M., Schnable, J.C., Schnable, P.S., Scully, B.T., Sekhon, R., Silverstein, K., Singh, M., Smith, M., Spalding, E.P., Springer, N., Thelen, K., Thomison, P., Tuinstra, M., Wallace, J., Walls, R., Wills, D., Wisser, R.J., Xu, W., Yeh, C., De Leon, N. 2020. Maize genomes to fields (G2F): 2014-2017 field seasons: genotype, phenotype, climatic, soil and inbred ear image datasets. BMC Research Notes. 13:71.
  • Ricci, W.A., Lu, Z., Ji, L., Marand, A.P., Ethridge, C.L., Murphy, N.G., Noshay, J.M., Galli, M., Mejia-Guerra, M.K., Colome-Tatche, M., Johannes, F., Rowley, M., Corces, V.G., Zhai, J., Scanlon, M.J., Buckler IV, E.S., Gallavotti, A., Springer, N.M., Schmitz, R.J., Zhang, X. 2019. Widespread long-range cis-regulatory elements in the maize genome. Nature Plants. 5:1237-1249.
  • Gage, J.L., Richards, E., Lepak, N.K., Kaczmar, N., Soman, C., Chowdhary, G., Gore, M.A., Buckler IV, E.S. 2019. In-field whole plant maize architecture characterized by Subcanopy Rovers and Latent Space Phenotyping. The Plant Phenome Journal. 2(1):1-11.
  • Bukowski, R., Guo, X., Lu, Y., Zou, C., He, B., Rong, Z., Yang, B., Wang, B., Xu, D., Xie, C., Fan, L., Gao, S., Xy, X., Zhang, G., Li, Y., Jiao, Y., Doebley, J., Ross-Ibarra, J., Buffalo, V., Romay, C., Buckler IV, E.S., Wu, Y., Lai, J., Ware, D., Sun, Q. 2018. Construction of the third generation Zea mays haplotype map. Gigascience. 7(4):1-12.
  • Sun, S., Zhou, Y., Chen, J., Shi, J., Zhao, H., Zhao, H., Song, W., Zhang, M., Cui, Y., Dong, X., Liu, H., Ma, X., Yinping, J., Bo, W., Wei, X., Stein, J., Glaubitz, J., Lu, F., Yu, G., Liang, C., Fengler, K., Li, B., Rafalski, A., Schnable, P., Ware, D., Buckler IV, E.S., Lai, J. 2018. Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes. Nature Genetics.
  • Mejia-Guerra, M., Buckler IV, E.S. 2019. k-mer grammar uncovers maize regulatory architecture. Biomed Central (BMC) Plant Biology. 19:103.
  • Gault, C., Kremling, K., Buckler IV, E.S. 2018. Tripsacum de novo transcriptome assemblies reveal parallel gene evolution with maize after ancient polyploidy. The Plant Genome.
  • Ramstein, G.P., Jensen, S.E., Buckler IV, E.S. 2019. Breaking the curse of dimensionality to identify causal variants in Breeding 4. Theoretical and Applied Genetics. 132(3):559-567.
  • Wang, H., Cimen, E., Singh, N., Buckler IV, E.S. 2020. Deep learning for plant genomics and crop improvement. Current Opinion in Plant Biology. 54:34-41.
  • Gage, J., Monier, B., Giri, A., Buckler IV, E.S. 2020. Ten years of the maize Nested Association Mapping population: impact, limitations, and future directions. The Plant Cell.
  • Wallace, J., Kremling, K., Buckler IV, E.S. 2019. Quantitative genetic analysis of the maize leaf microbiome. Phytobiomes Journal. 2(4):208-224.
  • Lozano, R., Booth, G.T., Omar, B.Y., Li, B., Buckler IV, E.S., Lis, J.T., Jannink, J., Pino Del Carpio, D. 2018. RNA polymerase mapping in plants identifies enhancers enriched in causal variants. bioRxiv.
  • Washburn, J.D., Mejia Guerra, M., Ramstein, G., Kremling, K., Valluru, R., Buckler IV, E.S., Wang, H. 2019. Evolutionarily informed deep learning methods: Predicting transcript abundance from DNA sequence. Proceedings of the National Academy of Sciences. 116(12):5542-5549.
  • Baseggio, M., Murray, M., Magallanes-Lundback, M., Kaczmar, N., Chamness, J., Buckler IV, E.S., Smith, M.E., Dellapenna, D., Tracy, W.F., Gore, M.A. 2018. Genome-wide association and genomic prediction models of tocochromanols in fresh sweet corn kernels. The Plant Genome. 12:180038.
  • Oren, E., Tzuri, G., Vexler, L., Dafna, A., Meir, A., Faigenboim, A., Kenigswald, M., Portnoy, V., Schaffer, A.A., Levi, A., Buckler IV, E.S., Katzir, N., Burger, J., Tadmor, Y., Gur, A. 2019. The multi-allelic APRR2 Gene is associated with fruit pigment accumulation in melon and watermelon. Journal of Experimental Botany.
  • Valluru, R., Gazave, E., Fernandes, S., Ferguson, J., Lozano, R., Hirannaiah, P., Zuo, T., Brown, P., Leakey, A., Gore, M., Buckler IV, E.S., Bandillo, N. 2019. Deleterious mutation burden and its association with complex traits in sorghum (Sorghum bicolor). Genetics. 211(3):1075-1087.
  • Zhou, S., Zhang, Y., Kremling, K., Ding, Y., Bennett, J., Bae, J., Kim, D., Kolomiets, M., Schmelz, E., Schroeder, F., Buckler IV, E.S., Jander, G. 2018. Ethylene signaling regulates natural variation in the abundance of antifungal acetylated diferuloylsucroses and Fusarium graminearum resistance in maize seedling roots. New Phytologist. 221(4):2096-2111.


Progress 10/01/18 to 09/30/19

Outputs
Progress Report Objectives (from AD-416): Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A] Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A] Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B] Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A] Approach (from AD-416): Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. 
This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing. In order to measure the functional constraint of every nucleotide in the maize and sorghum genomes, this project is comparing these species to the Andropogoneae tribe of over 1000 species. Over the last year, along with our collaborators, we have been collecting samples, propagating them, and beginning to sequence the genomes of these species. Novel sequencing and bioinformatic approaches have been evaluated to do a detailed genome analysis on 10 species and a rough analysis on 8 species. Key advances were made in reducing the cost of long-read sequencing, isolating long DNA fragments from difficult species, assembling these DNA reads together, and developing metrics for assessing genome sequencing quality. These efforts provide a rigorous set of approaches for DNA sequencing the rest of the tribe's species in the coming years. The DNA sequencing technologies used to assemble full genomes have made tremendous progress in the last year, allowing entire genomes to be sequenced for 1/1000th of their previous cost. This project collaborated with several other global efforts to sequence and assemble three genomes, and led the analysis of the community's eighteen genomes into a practical haplotype graph. This represents a substantial portion of the temperate-adapted field maize. Later in the year, an additional 26 maize genomes will be released by collaborators, curated, and publicly shared, which will provide access to tropical, sweet, and popcorn diversity. The graph has been used to identify functionally constrained regions of maize genomes and to support the U.S. Genomes To Field genotypic analyses. The central dogma of molecular biology is that DNA sequence is transcribed into RNA, which in turn is translated into proteins, which do the work of the cell. 
By collecting billions of observations on maize DNA, RNA, and protein levels, researchers are applying the tools of machine learning to this space. We have made substantial progress in developing machine learning models to predict, directly from DNA sequence, its structure, what proteins bind the DNA, whether a region of DNA will produce RNA, how much RNA it produces, and how much protein is likely to be present. Nearly 200 different models for separate processes have been created. These models are providing insight into how variation in DNA sequence produces changes in RNA and protein levels, which subsequently affect field-level variation. This project leads a number of bioinformatic efforts to support the analysis of crop diversity. The TASSEL software tools, which have been a mainstay for plant trait and genotypic analysis, were enhanced with connections to the R analysis platform (R is the pre-eminent statistical analysis environment). This connection improves TASSEL's interconnectedness with other systems and should greatly expand its user base. Plant genomes are frequently extremely diverse, and a graph rather than a linear representation is needed to capture this diversity. This project continued to develop the Practical Haplotype Graph (PHG) to deal with dozens of well-assembled genomes, and continued to apply the graph to maize and sorghum breeding. Finally, this project has released a range of bioinformatic tools for machine learning using two approaches. Most ARS specialty crop and animal breeders that run breeding programs do not have the scale to fully implement modern breeding technologies, practices, and tools, all of which could help them meet the increasing demands for new varieties that can better tolerate pests and diseases, withstand changing weather patterns, and match consumer demands and preferences. 
The Breeding Insight (BI) program will bring integrated breeding software, rapid and efficient genotyping, dynamic real-time trait data collection, and a secure data management system to five ARS breeding programs (Alfalfa, Blueberry, Grape, Sweet potato, and Salmonid fishes) in the pilot phase of the project. In the first year, the accomplishments have been hiring most of the team, setting up the facilities, establishing our key milestones and deliverables, establishing the foundations for the underlying data management systems, and initiating development of genetic marker systems for all 5 species. This provides the foundation for all of USDA-ARS specialty breeding programs to begin leveraging the tools of modern genomics and informatics.
Accomplishments
01 Successful development of two methods for training machine learning models to help researchers. Both genomics and machine learning have advanced remarkably over the last several years, but the application of machine learning to modeling genomic data is frequently confounded by the strong evolutionary signatures in the data, which prevent the development of accurate mechanistic models. These models are needed to identify genetic variation that is likely to be functional and could be used to improve varieties either through genomic selection or editing. ARS researchers in Ithaca, New York, along with collaborators, have developed two methods for training machine learning models without being confounded by evolution, and successfully applied these approaches to the prediction of gene RNA expression. These strategies can be applied to any species and a wide range of genomic problems, which should allow researchers to quickly discover the functional mechanisms and the underlying variants responsible for them.

Impacts
(N/A)

Publications

  • Punnuri, S.M., Wallace, J.G., Knoll, J.E., Hyma, K.E., Mitchell, S.E., Buckler IV, E.S., Varshney, R.K., Singh, B.P. 2016. Development of a high-density linkage map and tagging leaf spot resistance in pearl millet using genotyping-by-sequencing markers. The Plant Genome. 9(2):1-13.
  • Li, B., Kremling, K., Wu, P., Bukowski, R., Romay, M., Xie, E., Buckler IV, E.S., Chen, M. 2018. Co-regulation of ribosomal RNA with hundreds of genes contributes to phenotypic variations. Genome Research.
  • Wang, J., Zhou, Z., Li, H., Liu, D., Zhang, Q., Bradbury, P., Buckler IV, E.S., Zhang, Z. 2018. Expanding the BLUP alphabet for genomic prediction adaptable to the genetic architectures of complex traits. Heredity.
  • Wallace, J.G., Rodgers-Melnick, E., Buckler IV, E.S. 2018. On the road to breeding 4.0: unraveling the good, the bad, and the boring of crop quantitative genomics. Annual Review of Genetics. 52(1):421-444.
  • Yang, J., Mezmouk, S., Baumgarten, A., Buckler IV, E.S., Guill, K.E., McMullen, M., Mumm, R., Ross-Ibarra, J. 2017. Incomplete dominance of deleterious alleles contributes substantially to trait variation and heterosis in maize. PLoS Genetics.
  • Li, Y., Chen, L., Bradbury, P., Shi, Y., Song, Y., Zhang, D., Zhang, Z., Buckler IV, E.S., Li, Y., Wang, T. 2018. Increased experimental conditions and marker densities identified more genetic loci associated with southern and northern leaf blight resistance in maize. Nature Scientific Reports. (8):6848.
  • Zhang, D., Easterling, K., Pitra, N., Coles, M., Buckler IV, E.S., Bass, H., Matthews, P. 2017. Non-mendelian single-nucleotide polymorphism inheritance and atypical meiotic configurations are prevalent in hop. The Plant Genome. 10(3).
  • Dos Santos, J.P., Fernandes, S.B., Lozano, R., Brown, P.K., Buckler IV, E.S., Garcia, A.A., Gore, M.A. 2019. Novel bayesian networks for genomic prediction of developmental traits in biomass sorghum. bioRxiv.
  • He, Y., Wang, M., Dukowic-Schulze, S., Zhou, A., Tiang, C., Shilo, S., Sidh Sidhu, G., Eichten, S., Bradbury, P., Springer, N., Buckler IV, E.S., Levy, A., Sun, Q., Pillardy, J., Kianian, P., Kianian, S., Chen, C., Pawlowski, W. 2017. Genomic features shaping the landscape of meiotic double-strand break hotspots in maize. Proceedings of the National Academy of Sciences. 114(46):12231-12236.
  • Liu, Z., Cook, J., Melia-Hancock, S., Guill, K.E., Bottoms, C., Garcia, A., Ott, O., Nelson, R., Reckerd, J., Balint Kurti, P.J., Larsson, S., Lepak, N.K., Buckler IV, E.S., Trimble, L., Tracy, W., McMullen, M.D., Flint Garcia, S.A. 2016. Expanding maize genetic resources with predomestication alleles: Maize-teosinte introgression populations. The Plant Genome. (9):1.
  • Walters, W.A., Jin, Z., Youngblut, N., Wallace, J.G., Sutter, J., Zhang, W., González-Peña, A., Peiffer, J., Koren, O., Shi, Q., Knight, R., Glavina Del Rio, T., Tringe, S.G., Buckler IV, E.S., Dangl, J.L., Ley, R.E. 2018. Large-scale replicated field study of maize rhizosphere identifies heritable microbes. Proceedings of the National Academy of Sciences. 115(28):7368-7373.
  • Diepenbrock, C., Kandianis, C., Lipka, A., Magallanes-Lundback, M., Vaillancourt, B., Gongora-Castillo, E., Wallace, J., Cepela, J., Mesberg, A., Bradbury, P., Ilut, D., Mateos-Hernandez, M., Hamilton, J., Owens, B., Tiede, T., Buckler IV, E.S., Rocheford, T., Buell, R., Gore, M., Dellapenna, D. 2017. Novel loci underlie natural variation in vitamin E levels in maize grain. The Plant Cell. 29(10):2374-2392.
  • Varshney, R., Shi, C., Thudi, M., Mariac, C., Wallace, J., Qi, P., Zhang, H., Zhao, Y., Wang, X., Rathore, A., Srivastava, R., Chitikineni, A., Fan, G., Bajaj, P., Punnuri, S., Gupta, S., Wang, H., Jiang, Y., Couderc, M., Katta, M., Paudel, D., Mungra, K., Chen, W., Harris-Shultz, K.R., Garg, V., Desai, N., Doddamani, D., Kane, N., Conner, J., Ghatak, A., Chaturvedi, P., Subramaniam, S., Yadav, O., Berthouly-Salazar, C., Hamidou, F., Wang, J., Liang, X., Clotault, J., Upadhyaya, H., Cubry, P., Rhoné, B., Gueye, M., Sunkar, R., Dupuy, C., Sparvoli, F., Cheng, S., Mahala, R., Singh, B., Yadav, R., Lyons, E., Datta, S., Hash, C., Devos, K., Buckler IV, E.S., Bennetzen, J., Paterson, A.H., Ozias-Akins, P., Grando, S., Wang, J., Mohapatra, T., Weckwerth, W., Reif, J.C., Liu, X., Vigouroux, Y., Xu, X. 2017. Pearl millet genome sequence provides a resource to improve agronomic traits in arid environments. Nature Communications. 35(10):969.


Progress 10/01/17 to 09/30/18

Outputs
Progress Report Objectives (from AD-416): Objective 1: Create approaches and tools for identifying causal variants directly from genomic sequencing of diverse germplasm and species of C4 crops. [NP301, C1, PS1A] Objective 2: Identify deleterious mutations, and model their impact on crop efficiency and heterosis in C4 crops. [NP301, C3, PS3A] Objective 3: Identify adaptive variants for drought and temperature tolerance across C4 crops. [NP301, C1, PS1B] Objective 4: Establish community tools for processing and integration of sequence haplotypes to estimate their breeding effects in crop productivity. [NP301, C4, PS4A] Approach (from AD-416): Increasing grass crop productivity is key for feeding the world over the next 50 years and this will require removing the deleterious variants in every genome, as well as adapting the crops to highly variable and stressful environments. This project will build better breeding models for improving and adapting maize and sorghum by surveying the natural variation across their entire group of wild relative species - the Andropogoneae. With over 1,000 species, the Andropogoneae are the most productive and water-use efficient plants in the world. Yet, for applied purposes, we have only tapped the variation from a handful of species. This project will lead an effort to survey DNA-level variation across this entire clade and analyze the variation with statistical and machine learning approaches. This will allow us to develop two sets of applied models for maize and sorghum. First, we will quantitatively estimate the deleterious impact on yield for every nucleotide in the genome. Second, we will identify the genes with a high capacity for adaptation to drought, flooding, temperature tolerance and their properties. These approaches and models will be deployed via integration with big data bioinformatics. 
This project will produce DNA-level knowledge that can be used across breeding programs and crops, and applied through either genomic selection or genome editing. Since this project just began in March 2018, there is no significant progress to report. Please refer to the annual report for project 8062-21000-039-00D, titled "Development and Application of Genetic, Genomic, and Bioinformatic Resources in Maize," for more information.

Impacts
(N/A)

Publications