Source: AGRICULTURAL RESEARCH SERVICE submitted to NRP
MAIZEGDB - DATABASE AND COMPUTATIONAL RESOURCES FOR MAIZE GENETICS, GENOMICS, AND BREEDING RESEARCH
Sponsoring Institution
Agricultural Research Service/USDA
Project Status
ACTIVE
Funding Source
Reporting Frequency
Annual
Accession No.
0444388
Grant No.
(N/A)
Cumulative Award Amt.
(N/A)
Proposal No.
(N/A)
Multistate No.
(N/A)
Project Start Date
Apr 10, 2023
Project End Date
Apr 9, 2028
Grant Year
(N/A)
Program Code
[(N/A)]- (N/A)
Recipient Organization
AGRICULTURAL RESEARCH SERVICE
RR #3 BOX 45B
AMES,IA 50011
Performing Department
(N/A)
Non Technical Summary
(N/A)
Animal Health Component
10%
Research Effort Categories
Basic
90%
Applied
10%
Developmental
0%
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
201                    1510                              1080                      90%
201                    2410                              1080                      10%
Goals / Objectives
Objective 1: Improve maize trait analysis (e.g., drought and cold tolerance, disease and pest resistance), germplasm development, genetic studies, and breeding through stewardship of maize genomes, pan-genomes, genetic data, and phenotype data.
Goal 1.A: Bring in reference-quality genome assemblies of domesticated maize outgroups that include stress-resilient varieties and connect gene-model and genome-browser pan-gene relationships between these genomes and domesticated maize.
Goal 1.B: Represent and integrate maize diversity through hosting maize genomes, pan-genomes, graph information, and whole-genome sequencing data.
Objective 2: Identify and curate key datasets (e.g., 3-D protein structure, pangenome gene functions) that will serve to enhance maize functional genome annotation with an emphasis on the targeted curation of traits related to abiotic and biotic stress and climate change.
Goal 2.A: Integrate maize stress-response expression and trait data with MaizeGDB genomes and functional genome annotation tools.
Goal 2.B: Integrate 3-D gene model protein structures across maize genomes, compare them within a pan-gene framework, and create gene function predictions based on protein structure similarity.
Objective 3: Develop infrastructure to integrate, add value to, and visualize multi-omics data sets, enable comparative genomics, facilitate genome to phenome knowledge discovery, and provide analysis through artificial intelligence approaches and genomic discovery tools.
Goal 3.A: Provide comparative and pan-genome resources to understand diversity and organize genes and develop artificial intelligence approaches to facilitate exploration of the complex relationship between phenotype and genotype.
Objective 4: Provide community support services, build strategic partnerships, and provide database training and outreach activities for user communities and stakeholders.
Goal 4.A: Facilitate communication among maize researchers to support research community needs and create and leverage synergistic activities with other databases and plant research communities.
Project Methods
The Maize Genetics and Genomics Database (MaizeGDB; http://www.maizegdb.org) is the model organism database for maize. MaizeGDB's overall aim is to provide long-term storage, support, and stability to the maize research community's data and to provide informatics services for access, integration, visualization, and knowledge discovery. The MaizeGDB website, database, and underlying resources allow plant researchers to understand basic plant biology, make genetic enhancements, facilitate breeding efforts, and translate those findings into products that increase crop quality and production. To accelerate research and breeding progress, generated data must be made freely and easily accessible. Curation of high-quality and high-impact datasets has been the foundation of the MaizeGDB project since its inception over 25 years ago. MaizeGDB serves as a two-way conduit for getting maize research data to and from our stakeholders. The maize research community uses data at MaizeGDB to facilitate their research, and in return, their published data are curated at MaizeGDB. The information and data provided at MaizeGDB and facilitated through outreach have been used directly in research that has had broad commercial, social, and academic impacts. The MaizeGDB team will make accessible high-quality, actively curated, and reliable genetic, genomic, and phenotypic description datasets. At the root of high-quality genome annotation lie well-supported assemblies and annotations. For this reason, we focus our efforts on benefiting researchers by developing a system to ensure long-term stewardship of both a representative reference genome sequence assembly with associated structural and functional annotations and additional reference-quality genomes that help represent the diversity of maize. In addition, we will enable researchers to access data in a customized and flexible manner by deploying tools that enable direct interaction with the MaizeGDB database.
Continued efforts to engage in education, outreach, and organizational needs of the maize research community will involve the creation and deployment of video and one-on-one tutorials, updating maize Cooperators on developments of interest to the community, and supporting the information technology needs of the Maize Genetics Executive Committee and Annual Maize Genetics Conference Steering Committee.

Progress 10/01/23 to 09/30/24

Outputs
PROGRESS REPORT
ARS scientists from the Maize Genetics and Genomics Database (MaizeGDB) project in Ames, Iowa, provide valuable tools and resources for investigative research and crop improvement by leveraging maize genetics, genomics, and breeding data. In line with Objective 1, MaizeGDB continues to enhance its stewardship efforts to encompass a wide range of high-quality genome sequences, including genome assemblies and annotations from closely related species. This collection enables researchers to explore the rich diversity of maize genetics and genomics. The team has expanded its collection of high-quality genome sequences, including those from maize's wild relatives, to over 100 genomes and 1,400 supporting datasets. New features include tools linking gene data, genomic markers, and protein data across various maize varieties.
A new data center focused on pan-genes (genes shared across multiple species within a genus, allowing for the study of their evolution and variation) offers interactive visualization tools, such as sequence alignments and gene trees, to investigate the evolution of genes across the Zea genus. These advancements will enhance the ability to analyze and utilize data from different sources, ultimately facilitating targeted crop improvement and addressing global food security challenges.
As part of Objective 2, MaizeGDB curates and hosts important datasets that contribute to functional annotation of the maize genome. The recent focus is on datasets related to stress tolerance and climate change, including how maize responds to challenges like drought, heat, pathogens, and pests. A new gold-standard dataset is now available, featuring information on approximately 3,000 genes from 25 studies. This dataset reveals how genes are expressed differently under various stress conditions, providing valuable insights into maize's genetic resilience. Additionally, MaizeGDB has integrated other valuable datasets, including AI-derived gene annotations, a comprehensive protein atlas, and an updated genome-wide association study atlas. These resources enable researchers to explore the genetic basis of stress tolerance and climate resilience in maize, ultimately supporting the development of more resilient and sustainable crops.
Regarding Objective 3, MaizeGDB is building a robust infrastructure to handle large-scale datasets and facilitate knowledge discovery. This updated infrastructure will enable researchers to integrate and visualize diverse types of genetic and genomic data, leading to new insights into the genes underlying agronomically important traits.
Two new tools were developed: PanEffect, an AI-powered workflow that predicts and visualizes the effects of hundreds of millions of potential genetic variations in maize, and SNPversity 2.0, a tool that analyzes whole genome sequencing data from over 1,500 maize lines and wild relatives. These two tools are connected to allow researchers to easily filter, visualize, and download variations in the maize genome based on specific locations and accessions and quickly see the possible functional impact the variations have at the protein level. These enhancements will streamline data analysis, accelerate scientific breakthroughs, and ultimately improve our understanding of maize genetics and genomics.
Objective 4 focuses on MaizeGDB's role as the central hub for maize research, facilitating communication and collaboration among researchers worldwide. The MaizeGDB team actively engages with the maize research community to identify their needs and priorities, enabling tailored resources and services to better serve their requirements. Additionally, strategic partnerships have been formed with dozens of agricultural biological databases, collaborating on data standards and interoperability. Training and outreach activities are provided to empower user communities and stakeholders, ensuring they can maximize the benefits of MaizeGDB. Through stewardship efforts, infrastructure development, curation of key datasets, and community support services, MaizeGDB continues to enhance the landscape of maize genetics, genomics, and breeding research.
Artificial Intelligence (AI)/Machine Learning (ML)
The ARS database for maize, the Maize Genetics and Genomics Database (MaizeGDB), leveraged machine learning techniques to assign functional context to maize genes and gene products. The project used both the Ceres and Atlas HPC Clusters. In one project, protein language models were used to assess the impact of all possible amino acid substitutions on the maize proteome.
The resulting scores, which indicate the functional impact of these substitutions, range from low to strong. To facilitate the exploration of these results, a tool called PanEffect was developed. PanEffect allows users to visualize over 550 million potential amino acid substitutions in the maize reference genome and observe the effects of 2.3 million natural variations in the maize pan-genome. Additionally, the variant effect scores for all 51 maize genomes included in the pan-genome are available for download, totaling over 20 billion variant effect scores. This dataset provides invaluable insights for researchers and breeders working to improve maize.
In a second project, both protein language and diffusion models were employed to predict interspecies protein-protein interactions between maize and the plant pathogenic fungus Fusarium graminearum, which causes ear and stalk rot disease. A fine-tuned protein language model designed to detect interaction regions was used on a curated set of Fusarium effector proteins to identify the amino acids most likely involved in protein binding. Using these predicted positions on the proteins, we generated the protein structure for potential short-binder proteins using diffusion models. Higher-quality binders were identified, and a structure search was performed against a database of maize protein models (generated by AlphaFold). The top proteins were selected and verified using functional enrichment and protein-protein structure modeling. This project highlights the innovative use of AI and machine learning to advance our understanding of plant-pathogen interactions. By predicting these interspecies protein-protein interactions, we can uncover new insights into the mechanisms of plant immune responses and identify potential targets for developing disease-resistant crops. These predicted protein-protein interactions are currently being used as targets for experimental validation.
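The substitution scan described for the first project can be illustrated with a minimal Python sketch. It only enumerates every single amino acid substitution for one protein sequence and passes each to a scoring function. The names single_substitutions, scan, and toy_score are hypothetical, the example sequence is made up, and toy_score is a placeholder that returns a constant; it is not the protein language model used by the project and not the PanEffect code.

```python
from typing import Callable, Dict, Iterator, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_substitutions(protein: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based position, reference residue, alternate residue)."""
    for pos, ref in enumerate(protein, start=1):
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield pos, ref, alt


def scan(protein: str, score_fn: Callable[[str, int, str], float]) -> Dict[str, float]:
    """Score every single substitution, keyed like 'M1A' (ref, position, alt)."""
    return {
        f"{ref}{pos}{alt}": score_fn(protein, pos, alt)
        for pos, ref, alt in single_substitutions(protein)
    }


def toy_score(protein: str, pos: int, alt: str) -> float:
    # Hypothetical placeholder: a real scan would query a protein language model here.
    return 0.0


example = "MKTAYIAK"  # made-up 8-residue sequence for illustration
variants = scan(example, toy_score)
print(len(variants))  # 19 alternatives for each of the 8 residues
```

Each residue contributes 19 alternatives, so the number of scored substitutions grows linearly with total proteome length. If every position is scanned this way, 550 million substitutions would correspond to roughly 29 million residues.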
ACCOMPLISHMENTS
01 Unlocking maize diversity with pan-genomics. Pan-genomes encompass all the genetic sequences found in a collection of genomes, and are therefore more valuable than single-reference genomes for studying species diversity. This is especially true for a species like maize, which has a remarkably diverse and complex genome. Presenting maize pan-genome data, analyses, and visualization is further complicated by the extensive gene functional annotations present at the Maize Genetics and Genomics Database (MaizeGDB). ARS scientists in Ames, Iowa, have enhanced the MaizeGDB database to include genetic information from the entire Zea genus, covering both maize and its wild relative, teosinte. This advancement in maize pan-genomics can help transform our ability to harness the full genetic potential of the Zea genus. The work enables researchers to unlock new insights into maize diversity and gene function, which are essential for developing more resilient and productive crop varieties to meet global food security challenges.

Impacts
(N/A)

Publications

  • Woodhouse, M.R., Cannon, E.K., Portwood II, J.L., Harper, E.C., Gardiner, J.M., Schaeffer, M.L., Andorf, C.M. 2021. A pan-genomic approach to genome databases using maize as a model system. Biomed Central (BMC) Plant Biology. 21. Article 385. https://doi.org/10.1186/s12870-021-03173-5.
  • Sen, S., Woodhouse, M.R., Portwood II, J.L., Andorf, C.M. 2023. Maize feature store: a centralized resource to manage and analyze curated maize multi-omics features for machine learning applications. Database: The Journal of Biological Databases and Curation. 2023. Article baad078. https://doi.org/10.1093/database/baad078.
  • Poretsky, E., Andorf, C.M., Sen, T.Z. 2024. PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models. Plant Direct. 7(12). Article e554. https://doi.org/10.1002/pld3.554.
  • Poretsky, E., Cagirici, H.B., Andorf, C.M., Sen, T.Z. 2024. Harnessing the predicted maize pan-interactome for putative gene function prediction and prioritization of candidate genes for important traits. G3: Genes, Genomes, Genetics. 14(5). Article jkae059. https://doi.org/10.1093/g3journal/jkae059.
  • Andorf, C.M., Haley, O., Hayford, R.K., Portwood II, J.L., Harding, S.F., Sen, S., Cannon, E.K., Gardiner, J.M., Kim, H., Woodhouse, M.R. 2024. PanEffect: a pan-genome visualization tool for variant effects in maize. Bioinformatics. 40(2). Article btae073. https://doi.org/10.1093/bioinformatics/btae073.
  • Cannon, E.K., Portwood II, J.L., Hayford, R.K., Haley, O.C., Gardiner, J.M., Andorf, C.M., Woodhouse, M.R. 2024. Enhanced pan-genomic resources at the maize genetics and genomics database. Genetics. 227(1). https://doi.org/10.1093/genetics/iyae036.


Progress 10/01/22 to 09/30/23

Outputs
PROGRESS REPORT
ARS scientists from the Maize Genetics and Genomics Database (MaizeGDB) team in Ames, Iowa, provide valuable tools and resources for investigative research and crop improvement by leveraging maize genetics, genomics, and breeding data. For Objective 1, MaizeGDB continues to expand its stewardship efforts to encompass a wider range of high-quality genome sequences, including genome assemblies and annotations from closely related species. These resources represent the rich diversity of maize genomic and genetic data available to researchers, comprising over 100 genomes and more than a thousand supporting datasets. The genomes and accompanying tools will enable researchers to explore functional regions in the genome through the genome browser tool. MaizeGDB supports over 50 genome browsers and provides access to over 130 datasets for sequence similarity searches. These efforts will significantly enhance researchers' ability to analyze and leverage data from different sources and data types, facilitating targeted crop improvement.
For Objective 2, MaizeGDB continues to identify, curate, and host key datasets that contribute to the functional genome annotation of maize. Notably, the focus is on datasets related to abiotic and biotic stress, as well as climate change. A gold-standard dataset comprising approximately 3,000 genes, derived from 25 published studies, is being generated by MaizeGDB. This dataset demonstrates differential gene expression under nine biotic and eleven abiotic conditions in maize. These curated datasets serve as valuable references for studying and understanding the genetic basis of stress tolerance and climate resilience in maize. Integrating these high-quality datasets, including 3-D protein structure and pan-genome gene functions, enriches the resources at MaizeGDB for researchers studying the impact of these factors on maize traits.
Regarding Objective 3, MaizeGDB continues to develop robust infrastructure to accommodate multi-omics datasets and facilitate knowledge discovery. The updated infrastructure will enable the integration and visualization of diverse omics data. One notable tool, the Maize Feature Store, streamlines the use of machine-learning features in biological research by providing easy access to thousands of commonly used features derived from various omics data sources. The improved infrastructure empowers researchers to perform comparative genomics and gain insights into the complex relationships between the genome and phenome. Additionally, artificial intelligence approaches and genomic discovery tools have been embraced, providing advanced analysis capabilities for researchers working with maize data. These methods can successfully predict and search 3-D protein structures, identify genes associated with stress response, and assign functional labels to genes. These techniques are especially crucial for identifying functional insights in domains with scarce or absent experimental data.
Specifically, we have used machine learning to predict phosphorylation sites within protein sequences. Phosphorylation is an important post-translational modification that regulates a variety of essential biological processes but has limited experimental data for maize and other important crops. These enhancements promote efficient and comprehensive data analysis, accelerating scientific discoveries in maize genetics and genomics.
Objective 4 focuses on MaizeGDB's role as the central hub for maize research, facilitating communication and collaboration among researchers worldwide. The MaizeGDB team actively engages with the maize research community to identify their needs and priorities, enabling tailored resources and services to better serve their requirements. Additionally, strategic partnerships have been forged with dozens of agricultural biological databases, collaborating on data standards and interoperability. Training and outreach activities are provided to empower user communities and stakeholders, ensuring they can maximize the benefits of MaizeGDB. Through stewardship efforts, infrastructure development, curation of key datasets, and community support services, MaizeGDB continues to enhance the landscape of maize genetics, genomics, and breeding research.
Artificial Intelligence (AI)/Machine Learning (ML)
The ARS database for maize, the Maize Genetics and Genomics Database (MaizeGDB), leveraged machine learning techniques to assign functional context to maize genes and gene products. The project used both the Ceres and Atlas HPC Clusters. In one project, classical, non-neural-network-based approaches are employed to identify genes associated with stress response and climate adaptability. By utilizing these methods, candidate genes will be identified that play crucial roles in coping with various stressors and adapting to changing climates.
In a second project, MaizeGDB is utilizing representations from protein language models to determine which amino acids in a protein are most likely to be phosphorylated. By training the model on extensive protein sequence data, specific amino acids that are highly prone to phosphorylation will be identified. This information will provide valuable insights into post-translational modifications and their potential impact on protein function in maize.
A third project employs protein language models and deep learning techniques to assign functional annotation to maize genes. By training the models on a large amount of proteomic data, MaizeGDB can predict the ontology terms associated with molecular function, cellular localization, and biological processes. This approach can enhance our understanding of the genetic mechanisms underlying important traits and biological processes in maize.
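As a rough illustration of the first step in the phosphorylation project above, the Python sketch below lists candidate residues in a protein sequence together with their local sequence context. It assumes candidates are limited to serine, threonine, and tyrosine, the residues most commonly phosphorylated in eukaryotes; the report does not state which residues the project considers. The function name candidate_sites, the flank size, and the example sequence are hypothetical, and no prediction model is included.

```python
from typing import List, Tuple

# Assumption: restrict candidates to serine (S), threonine (T), and tyrosine (Y).
PHOSPHO_RESIDUES = {"S", "T", "Y"}


def candidate_sites(protein: str, flank: int = 3) -> List[Tuple[int, str, str]]:
    """Return (1-based position, residue, surrounding window) for each candidate."""
    sites = []
    for i, aa in enumerate(protein):
        if aa in PHOSPHO_RESIDUES:
            start = max(0, i - flank)  # clip the window at the N-terminus
            window = protein[start:i + flank + 1]  # slicing clips at the C-terminus
            sites.append((i + 1, aa, window))
    return sites


example = "MSTAYKLS"  # made-up sequence for illustration
for pos, aa, window in candidate_sites(example):
    print(pos, aa, window)
```

In a full pipeline, each candidate position would be paired with a per-residue representation from a protein language model and passed to a trained classifier. That step is omitted here because the report does not describe the model details.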

Impacts
(N/A)

Publications