Source: CORNELL UNIVERSITY submitted to NRP
APPLYING DEEP REASONING TO THE MODELING OF CROSS SPECIES GENE REGULATORY NETWORKS
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
COMPLETE
Funding Source
Reporting Frequency
Annual
Accession No.
1028149
Grant No.
2022-67011-36564
Cumulative Award Amt.
$180,000.00
Proposal No.
2021-09431
Multistate No.
(N/A)
Project Start Date
Jan 1, 2022
Project End Date
Dec 31, 2024
Grant Year
2022
Program Code
[A7101]- AFRI Predoctoral Fellowships
Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA,NY 14853
Performing Department
Computational Biology
Non Technical Summary
In the next decade, hundreds of plant species will need to be bred or genetically modified to withstand the effects of climate change. Unfortunately, research on many of these species lacks the funding needed to carry out breeding and genetic engineering at this scale. For these species, researchers often turn to mathematical models that predict the effects of genetic modifications. Current models often require large amounts of expensive training data and extended computational time, and they rarely use information from other species to inform predictions. This project aims to improve current models by leveraging a modern, biologically informed prediction framework. The framework, called deep reasoning, can use data from multiple better-resourced species in a way that encodes general biological rules into the model's predictions. In addition to applying this model to questions about genetic modification, this project will provide a computational web tool that allows both plant breeders and the greater biological community to explore gene regulation in contexts such as climate change scenarios. Furthermore, this model and the resulting trait prediction tool can provide a low-cost, quick way for genetic engineers to understand what types of modifications a species may need in order to flourish in a variety of conditions.
Animal Health Component
50%
Research Effort Categories
Basic
10%
Applied
50%
Developmental
40%
Classification

Knowledge Area (KA)   Subject of Investigation (SOI)   Field of Science (FOS)   Percent
201                   1510                             1080                     60%
201                   7310                             2080                     30%
201                   2299                             1081                     10%
Goals / Objectives
The major goals of this project are to develop mathematical models that can predict the impact of gene regulation across plant species and to create computational tools that can aid plant breeders' decisions on possible gene targets.
In order to develop a cross-species model, the first objective is to leverage publicly available sequencing data in Arabidopsis, maize, rice, soybean, and sorghum by curating a library for gene expression, open chromatin, and dimerization data.
We aim to improve current gene regulatory network prediction accuracy by at least 10% by designing a deep reasoning method that will predict the expression of transcription factor target genes.
With the improved model, we aim to detect the dysregulation of gene expression within each of the aforementioned species. Gene expression data from the multi-experimental data library will be used to validate the model's detections.
In addition to the dysregulation of gene expression, the model will be used to predict the effects of perturbations in transcription factor expression. Validation of this model will be performed with the use of real transgenic gene expression data obtained from publicly available databases.
A digital tool will be created from the results of this model. The tool will allow synthetic biologists or researchers to explore the possible effects of manipulating the gene expression of transcription factors.
Project Methods
Methods Overview: To determine the effects of genetic manipulation, stress response, or other perturbations in the regulation of gene expression, we will develop a method, informed by cross-species regulatory experimental assays, to predict gene expression profiles. We will use this method to make inferences about the relationship between the variation in a gene's expression and its internal regulators.
We will process, curate, and integrate over 10,000 plant RNA-seq datasets obtained from the EMBL-EBI Expression Atlas database. These experimental data are from maize, rice, Arabidopsis, soybean, sorghum, wheat, and tomato and will form the training and test sets for future gene expression predictions. The curated data will be transferred to the publicly available repository CyVerse and/or Ag Data Commons.
To further place the data in a biological context, we will mathematically integrate the gene expression data with data that the gene regulation community considers useful for understanding gene expression dynamics. These data sources are chromatin accessibility information (ATAC-seq), transcription factor motifs from the JASPAR database, and dimerization information identified from plant proteome resources.
To predict gene expression profiles, we will design a deep reasoning-focused machine learning method. Deep reasoning refers to a portion of the model that, given biological constraints such as transcription factor binding rules, can guide a prediction model toward results that are both biologically tractable and more accurate. We expect this model to perform at least 10% better than current gene regulatory network prediction models.
We will perform a series of tests both to explore the accuracy and to determine the biological interpretability of the model. To test the accuracy of the model, we will compare its predictions with gene expression profiles from plant RNA-seq atlases.
We will also compare the model's predictions with the gene expression profiles in plants not used for training the model. To test the interpretability of the features within the model, we will employ visualization techniques specialized for deep learning models and sensitivity analyses to ensure that the proper representation of biological data is maintained throughout the model.
Understanding the molecular basis for phenotypes at the regulatory level is a goal for modern crop improvement. It has been suggested that variation in gene expression is constrained by the balance of interacting regulators. To examine this phenomenon in each species, we will compare the distribution of genes that are expressed in maize and poplar in correspondence to the transcription factor's presence across maize.
There is a lack of models that can detect causative mutations affecting gene expression and phenotypes while also taking into account the regulatory components that work together. To validate our model's practical application, we will first use tomato transgenic data to predict the expression of genes known to be affected by modified variants. Next, we will use groundcherry gene expression data to detect the variation in gene expression induced by gene editing. Using rank correlations, we should be able to see how well the model predicts the resulting gene expression in these experiments.
Given the model's ability to detect these expression patterns, we will create a web-based computational tool that can be used in further synthetic biology studies. Given an input of transcription factors, the publicly available tool will output the genes that are expressed with those transcription factors.
Just as induced cis-regulatory regions provide a way to connect regulatory regions directly to gene expression, this tool will connect TF expression to gene expression in a wide variety of contexts, such as stress and cold conditions.
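The rank-correlation check mentioned in the methods can be illustrated with a short sketch. This is not the project's evaluation code: the function names and the five expression values are hypothetical, and the example only assumes that predicted and observed expression are available as paired numbers per gene. Spearman's rank correlation is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import numpy as np

def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(values) and sorted_vals[end + 1] == sorted_vals[start]:
            end += 1
        # Sorted positions start..end share the mean of ranks start+1..end+1.
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def spearman(observed, predicted):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    r_obs = rank_with_ties(observed)
    r_pred = rank_with_ties(predicted)
    return float(np.corrcoef(r_obs, r_pred)[0, 1])

# Hypothetical expression values for five genes in one edited line.
observed = [5.0, 1.2, 3.3, 0.4, 8.1]
predicted = [4.1, 0.9, 3.9, 1.0, 7.5]
rho = spearman(observed, predicted)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure suits this kind of validation because it rewards getting the ordering of genes right even when predicted and observed expression are on different scales.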

Progress 01/01/22 to 12/31/24

Outputs
Target Audience:
1. Computational Biology Community: Presented posters and student seminars to the Cornell computational biology program, Plant and Animal Genome Conferences.
2. Applied Plant Breeders in precision agriculture: Presented posters to the Plant and Animal Genome Conference and National Association of Plant Breeders Conference.
3. Underserved communities in science, technology, engineering, and mathematics (STEM): National Convention for National Society of Black Engineers and Society for the Advancement of Chicanos/Hispanics and Native Americans in Science (SACNAS) National Diversity in STEM Conference.
Changes/Problems: Taylor graduated in Spring of 2024, 10 months before this Fellowship ended. Taylor was offered and accepted a position as a Senior Data Scientist working in Applied Biological AI & Strategy & Gene Editing at Corteva Agriscience.
What opportunities for training and professional development has the project provided?
Mentoring: One-on-one mentorship program with Bayer Crop Sciences professional; weekly 2-hour writing group with Black PhD Students
Workshops: Scientific Writing for Journal Articles, Creating an Industry-Ready Resume
Seminars: Cornell Plant Breeding and Genetics Seminar, Cornell Department of Computational Biology Seminars
Attendance of Conferences: National Society of Black Engineers 49th Convention, Plant and Animal Genome Conference, Corteva DELTA, National Association of Plant Breeders, Society for the Advancement of Chicanos/Hispanics and Native Americans in Science conference
How have the results been disseminated to communities of interest? In 2022, I gave poster presentations at the International Conference for Arabidopsis Research, National Society of Black Engineers 49th Convention, and the Plant and Animal Genome Conference. To share my research and give K-12 students an idea of scientific life, I gave a Black History Month presentation for Black and Indigenous students of the Ithaca area middle school. To share findings with the greater computational biology community, I gave a student seminar to the Cornell University Department of Computational Biology and the AfroBiotech Symposium. In 2023, I presented my research at the Plant and Animal Genome Conference, the National Convention for the National Society of Black Engineers, the National Association of Plant Breeders Conference, and the Society for the Advancement of Chicanos/Hispanics and Native Americans in Science (SACNAS) National Diversity in STEM Conference. My paper on regulatory network-based machine learning for gene expression was posted on bioRxiv, a preprint server, to allow open access to my work and results. In 2024, I gave a dissertation seminar to the Cornell Community on my thesis work, "Learning Regulatory Contributions to Gene Expression Variation in Grasses."
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? I developed PLExBench, the first benchmark suite designed for evaluating gene expression prediction in plants, focusing on Arabidopsis thaliana and Zea mays. PLExBench provides a standardized set of tasks that enables rigorous comparison of state-of-the-art prediction methods, offering a critical tool for assessing their strengths and weaknesses. Additionally, I applied deep learning models to dissect cis-regulatory contributions to tissue-specific gene expression in maize. Leveraging perturbation-driven experimental data, I systematically uncovered key regulatory mechanisms and demonstrated the power of deep learning approaches in accurately predicting tissue-specific gene expression profiles.

Publications

  • Type: Other Journal Articles Status: Published Year Published: 2024 Citation: Wrightsman T, Ferebee TH, Romay MC, AuBuchon-Elder T, Phillips AR, Syring M, Kellogg EA, Buckler ES (2024). Current genomic deep learning architectures generalize across grass species but not alleles. bioRxiv https://doi.org/10.1101/2024.04.11.589024
  • Type: Theses/Dissertations Status: Published Year Published: 2024 Citation: Ferebee TH (2024). Learning Regulatory Contributions to Gene Expression Variation in Grasses.


Progress 01/01/23 to 12/31/23

Outputs
Target Audience:
1. Computational Biology Community: Presented posters and student seminars to the Cornell computational biology program, Plant and Animal Genome Conferences.
2. Applied Plant Breeders in precision agriculture: Presented posters to the Plant and Animal Genome Conference and National Association of Plant Breeders Conference.
3. Underserved communities in science, technology, engineering, and mathematics (STEM): National Convention for National Society of Black Engineers and Society for the Advancement of Chicanos/Hispanics and Native Americans in Science (SACNAS) National Diversity in STEM Conference.
Changes/Problems: Nothing Reported
What opportunities for training and professional development has the project provided?
Mentoring: One-on-one mentorship program with Bayer Crop Sciences professional; weekly 2-hour writing group with Black PhD Students
Workshops: Scientific Writing for Journal Articles, Creating an Industry-Ready Resume
Seminars: Cornell Plant Breeding and Genetics Seminar, Cornell Department of Computational Biology Seminars
Attendance of Conferences: National Society of Black Engineers 49th Convention, Plant and Animal Genome Conference, Corteva DELTA, National Association of Plant Breeders, Society for the Advancement of Chicanos/Hispanics and Native Americans in Science conference.
How have the results been disseminated to communities of interest? In 2023, I presented my research at the Plant and Animal Genome Conference, the National Convention for the National Society of Black Engineers, the National Association of Plant Breeders Conference, and the Society for the Advancement of Chicanos/Hispanics and Native Americans in Science (SACNAS) National Diversity in STEM Conference. My paper on regulatory network-based machine learning for gene expression was posted on bioRxiv, a preprint server, to allow open access to my work and results.
What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? I developed PLExBench, the first benchmark suite designed for evaluating gene expression prediction in plants, focusing on Arabidopsis thaliana and Zea mays. PLExBench provides a standardized set of tasks that enables rigorous comparison of state-of-the-art prediction methods, offering a critical tool for assessing their strengths and weaknesses. Additionally, I applied deep learning models to dissect cis-regulatory contributions to tissue-specific gene expression in maize. Leveraging perturbation-driven experimental data, I systematically uncovered key regulatory mechanisms and demonstrated the power of deep learning approaches in accurately predicting tissue-specific gene expression profiles.

Publications

  • Type: Other Journal Articles Status: Published Year Published: 2023 Citation: Schulz AJ, Zhai J, AuBuchon-Elder T, El-Walid M, Ferebee TH, Gilmore EH, Hufford MB, Johnson LC, Kellogg EA, La T, Long E, Miller ZR, Romay MC, Seetharam AS, Stitzer MC, Wrightsman T, Buckler ES, Monier B, Hsu SK (2023). Fishing for a reelGene: evaluating gene models with evolution and machine learning. bioRxiv https://doi.org/10.1101/2023.09.19.558246
  • Type: Other Journal Articles Status: Published Year Published: 2023 Citation: Ferebee TH, Buckler ES (2023). Exploring the utility of regulatory network-based machine learning for gene expression prediction in maize. bioRxiv https://doi.org/10.1101/2023.05.11.540406
  • Type: Peer Reviewed Journal Articles Status: Published Year Published: 2023 Citation: Khaipho-Burch M, Ferebee T, Giri A, Ramstein G, Monier B, Yi E, Romay MC, Buckler ES (2023). Elucidating the patterns of pleiotropy and its biological relevance in maize. PLoS Genetics 19(3):e1010664. https://doi.org/10.1371/journal.pgen.1010664.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2023 Citation: Ferebee, T. H., & Buckler, E. S. On the effectiveness of graph neural networks for maize gene expression prediction [Poster]. Plant and Animal Genome Conference, San Diego, United States of America.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2023 Citation: Ferebee, T. H., & Buckler, E. S. Applications of machine learning for understanding regulatory contributions to gene expression variation in grasses [Poster]. National Society of Black Engineers 49th Convention, Kansas City, United States of America.


Progress 01/01/22 to 12/31/22

Outputs
Target Audience: I have interacted with the following communities:
Computational Biology Community: Presented posters and student seminars to the Cornell computational biology program, Plant and Animal Genome Conferences and the International Conference on Arabidopsis Research.
Applied Plant Breeders in precision agriculture: Presented posters to the Plant and Animal Genome Conference and the International Conference on Arabidopsis Research communities.
Underserved communities in science, technology, engineering, and mathematics (STEM): Gave a 30-minute talk to Black Women in Computational Biology and to Black and Indigenous elementary students of the Ithaca area; presented a poster at the 49th National Convention for the National Society of Black Engineers.
Changes/Problems: I contracted COVID-19, which slowed progress for around 9 weeks. I plan to catch up on the unmet goals during the next reporting period.
What opportunities for training and professional development has the project provided?
Mentoring: One-on-one mentorship program with Bayer Crop Sciences professional; weekly 2-hour writing group with Black PhD Students
LinkedIn Learning Modules: Crafting and Sharing Your Personal Story, Creating the Perfect Elevator Pitch, PowerPoint for Mac Essential Trainings, 8 Easy Ways to Make Your PowerPoint Stand Out, Intermediate SQL and Data Analysis
Workshops: Scientific Writing for Journal Articles, Creating an Industry-Ready Resume
Seminars: Cornell Plant Breeding and Genetics Seminar, Cornell Department of Computational Biology Seminars, Resisting Erasure: Black Woman Scholars In Defense of Themselves, Application of Deep Learning on Graphs
Attendance of Conferences: National Society of Black Engineers 49th Convention, Plant and Animal Genome Conference, International Conference for Arabidopsis Research
How have the results been disseminated to communities of interest? I have given poster presentations at the International Conference for Arabidopsis Research, National Society of Black Engineers 49th Convention, and the Plant and Animal Genome Conference. To share my research and give K-12 students an idea of scientific life, I have given a Black History Month presentation for Black and Indigenous students of the Ithaca area middle school. To share findings with the greater computational biology community, I have given a student seminar to the Cornell University Department of Computational Biology and the AfroBiotech Symposium.
What do you plan to do during the next reporting period to accomplish the goals? Goals for the next reporting period include 1) developing and testing the next set of models for predicting gene expression and 2) preparing journal submissions. For the first goal, we would like to begin by testing feature representations for the regulatory sequences. These representations will allow us to ask questions about the effects of perturbations in regulatory features, which is one of the goals not yet met. The models to be tested are autoencoders, locally linear embeddings, and matrix factorizations. Next, we will use the expression data to determine plant contexts, such as plant tissue. We will then combine the best-performing sequence representations and the plant context expression data in order to predict gene expression between different contexts. This portion of the model will give important information on how a possible dysregulation could impact the plant, as stated in one of the unmet objectives. Finally, to complete the second main goal, these results will be gathered into a research journal article for publication.
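One of the representation families named above, matrix factorization, can be sketched as follows. This is an illustrative assumption rather than the project's pipeline: the report does not say how sequences are featurized, so the sketch uses k-mer counts, and the function names and the four short sequences are hypothetical. Each sequence becomes a row of k-mer counts, and a truncated singular value decomposition of the centered matrix gives a low-dimensional embedding per sequence.

```python
import itertools
import numpy as np

def kmer_counts(seq, k=2):
    """Count every DNA k-mer in seq; returns a vector of length 4**k."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing N or other symbols
            counts[j] += 1
    return counts

def svd_embedding(matrix, n_components):
    """Project rows onto the top singular directions of the centered matrix."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical regulatory fragments; real inputs would be far longer.
sequences = ["ACGTACGTAC", "TTTTAAAATT", "GCGCGCGCGC", "ACGTTTTACG"]
features = np.vstack([kmer_counts(s) for s in sequences])
embedding = svd_embedding(features, n_components=2)
print(embedding.shape)
```

An autoencoder or a locally linear embedding would replace `svd_embedding` with a nonlinear map; the linear factorization is a natural baseline to compare them against.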

Impacts
What was accomplished under these goals? In this reporting period, we aimed to explore the efficacy of using machine learning (ML) to predict gene expression in maize using sorghum, rice, and thale cress. Gene expression prediction models can give genetic engineers and breeders insight into the interactive effects of modifying gene targets. To this end, we created machine learning gene expression prediction models (graph autoencoders) based on the structure of the relationships between genes. As a basis for comparison, we also evaluated a simpler model with no network information. To address our first goal of data collection and processing, we collected RNA sequencing data from Zea mays, Sorghum bicolor, and Oryza sativa. We also collected a gene regulatory network, heavily verified with experimental data, from thale cress. We used these data to predict gene expression within and between the species. Within species, non-network models and network deep learning models predicted abundance with correlations of 0.7 and 0.6, respectively. When predicting out-of-species expression abundance in unseen experiments, the network deep learning model predicted expression with a correlation of 0.5. Our results suggest that gene expression prediction accounting for discrete biological network structure can be used to predict target gene expression values in unseen experiments with moderate accuracy. For the next set of aims, we collected gene expression, transcription factor binding sites, and regulatory sequences from over 30 grass species so that we can begin to predict gene expression in those species. We performed quality control on the sequencing data and built and implemented a gene expression quantification pipeline.
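For readers unfamiliar with graph autoencoders, the following sketch shows the forward pass of a generic one: a single graph-convolution encoder followed by an inner-product decoder. It is not the model used in this project. The four-gene network, the feature and weight matrices, and all function names are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(adj, features, weights):
    """One graph-convolution layer: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalize_adjacency(adj) @ features @ weights, 0.0)

def decode(embeddings):
    """Inner-product decoder: edge probability = sigmoid(z_i . z_j)."""
    return 1.0 / (1.0 + np.exp(-(embeddings @ embeddings.T)))

rng = np.random.default_rng(0)

# Hypothetical undirected network over four genes (1 = regulatory link).
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

features = rng.normal(size=(4, 3))  # hypothetical per-gene input features
weights = rng.normal(size=(3, 2))   # untrained encoder weights

z = encode(adj, features, weights)  # one 2-dimensional embedding per gene
edge_probs = decode(z)              # reconstructed link probabilities
print(z.shape, edge_probs.shape)
```

In training, the weights would be adjusted so that `edge_probs` matches the known network; the resulting embeddings carry network structure that a downstream expression predictor can use, which is what a model without network information lacks.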

Publications

  • Type: Conference Papers and Presentations Status: Accepted Year Published: 2023 Citation: Ferebee, T. H., & Buckler, E. S. Applications of machine learning for understanding regulatory contributions to gene expression variation in grasses [Poster]. National Society of Black Engineers 49th Convention, Kansas City, United States of America.
  • Type: Conference Papers and Presentations Status: Accepted Year Published: 2023 Citation: Ferebee, T. H., & Buckler, E. S. On the effectiveness of graph neural networks for maize gene expression prediction [Poster]. Plant and Animal Genome Conference, San Diego, United States of America.
  • Type: Conference Papers and Presentations Status: Accepted Year Published: 2022 Citation: Ferebee, T. H., & Buckler, E. S. Cross-species prediction of maize gene expression using graph neural networks [Poster]. International Conference for Arabidopsis Research, Belfast, Northern Ireland.
  • Type: Conference Papers and Presentations Status: Accepted Year Published: 2022 Citation: Ferebee, T. H., & Buckler, E. S. Cross-species prediction of maize gene expression using graph neural networks [Talk]. Cornell Computational Biology Seminar Series, Ithaca, United States of America.
  • Type: Conference Papers and Presentations Status: Accepted Year Published: 2022 Citation: Ferebee, T. H., & Buckler, E. S. Cross-species prediction of maize gene expression using graph neural networks [Talk]. AfroBiotech Conference, Virtual, United States of America.