Performing Department
(N/A)
Non Technical Summary
In the U.S., industrial hemp (i.e., Cannabis sativa varieties producing <0.3% total THC, as opposed to drug-type or 'marijuana' varieties) has deep roots as an integral part of the early economy, supporting the production of textiles, rope, paper, etc. U.S. hemp research and production were virtually eradicated in the early 20th century but have been reinvigorated since the crop was legalized by the 2014 and 2018 Farm Bills. Due to prolonged prohibition, hemp has lost nearly 100 years of scientific and agricultural development compared to other major crops. Still, there is strong potential for a growing hemp industry to make significant contributions to the U.S. bioeconomy and GDP per capita because of the variety of industries hemp can support: construction, livestock feed/bedding, nutrition supplements, essential oil, medicine, food, plastic alternatives, and body care. The limited understanding and development of industrial hemp that persists globally highlights an opportunity for the U.S. to grow its bioeconomy and become a global leader in hemp production and research.
However, much of the genetic diversity in hemp remains unsampled, limiting our understanding of the genetic variation available for crop improvement. Hemp is a dioecious crop (i.e., with separate male and female plants) that exhibits extreme genetic variability and complex (X/Y) sex chromosomes, which present challenges in computational analyses and hinder the development of efficient molecular markers that could be used for genetic selection and crop improvement. Knowledge of agriculturally important genes in hemp is generally lacking. This is especially the case for plant traits linked to sex chromosomes, which are important because growers prefer different sex-specific traits: males (XY) produce fiber better suited for textiles, whereas females (XX) produce 'grain' (for food/oil) and flowers enriched with medicinal compounds.
This project aims to accelerate the establishment of hemp as a useful, sustainable crop and an important U.S. commodity by leveraging the USDA Hemp Germplasm Collection and cutting-edge DNA-sequencing techniques to address the issues described above. Germplasm collections are important sources of genetic variation, but the USDA Hemp Germplasm Collection is largely uncharacterized, which limits its value for plant breeding. The collection contains hundreds of unique hemp lines, including some from U.S. feral populations and from the species' diversity hotspot in Asia, both of which are underrepresented in previous genomic sampling efforts despite their potential contributions to crop improvement. Through this project, I will generate and analyze a massive genomic dataset spanning the full array of genetic diversity available in the USDA Hemp Germplasm Collection, and the resulting data and results will be made available to breeding programs and other hemp stakeholders. I will also leverage these data to discover gene variants tied to agriculturally important plant traits, which will benefit molecular marker development for genetic selection and crop improvement.
Animal Health Component
25%
Research Effort Categories
Basic
75%
Applied
25%
Developmental
(N/A)
Goals / Objectives
Goal statement: Germplasm collections are important sources of genetic variation for plant breeding programs, but the USDA Hemp Germplasm Collection (Geneva, NY) is largely uncharacterized, which limits its value to breeders. This project aims to accelerate U.S. hemp (Cannabis sativa producing <0.3% THC) breeding initiatives by analyzing and providing a robust genomic dataset representing the full array of diversity in the USDA Hemp Germplasm Collection. I aim to generate the most diverse and comprehensive long-read sequencing dataset for any crop species to date. This project will demonstrate the utility of germplasm collections for genomic research that may be leveraged for plant breeding and crop improvement. I will also assemble additional reference genomes for lineages underrepresented in the current C. sativa pangenome, which will enable this and future projects to investigate potentially new, adaptive, and/or rare alleles that may benefit hemp breeding programs. The more comprehensive C. sativa pangenome resulting from this study will also enable more robust association tests aimed at identifying genomic variants underlying agronomically important phenotypes, leading to more efficient molecular markers and ultimately increasing the pangenome's utility to hemp breeding programs. Last, I aim to advance our knowledge of overall sex chromosome diversity in C. sativa by testing for major XY karyotype groups and whether potentially distinct groups correspond to specific phenotypes important to growers.
Objectives:
1. Genotype almost the entire USDA Hemp Germplasm Collection (471 unique accessions) using long-read (PacBio HiFi) low-pass DNA sequencing and assess population structure.
2. Assemble five high-quality reference genomes for underrepresented hemp lineages, such as U.S. ferals and others from the species' diversity hotspot (Asia), and use those to build a more representative pangenome for C. sativa.
3. Use the updated pangenome, which will be one of the first to effectively integrate phased-diploid sex chromosome (X/Y) assemblies in plants, for long-read mapping, comprehensive variant detection, and gene discovery in graph-based (trait) association mapping analyses.
4. Develop molecular markers for agronomic, sex-specific traits linked to sex chromosomes and others identified across the pangenome.
Project Methods
Traditionally, low-pass sequencing for large-scale mapping analyses has been performed using Illumina short-reads due to immense cost savings. However, with the advent of new technologies, long-read sequencing is becoming more cost-effective. To characterize and assess the genomic diversity in the USDA Hemp Germplasm Collection, I will sequence at least one individual from each of 471 accessions with long-read low-pass sequencing, aiming for 4x coverage per sample by pooling 24 individuals per PacBio Revio SMRT Cell. This project will be one of the first to use long-reads for large-scale, low-pass genotyping, which will help lay the groundwork for future studies as the field continues transitioning into the era of third-generation sequencing. Additionally, through this project, I aim to illustrate the utility of germplasm collections for these types of studies, which may impact the future of germplasm curation.
If funds allow, I will sequence both sexes for each USDA accession. Otherwise, I will focus on XY males to capture both the X and Y, while acknowledging that autosomal variation may exist between sexes that might reduce power in association mapping. PacBio HiFi long-reads will be used to identify genomic variants and to characterize the structural variation (SV) landscape across the collection. I will identify genomic variants in an initial pass using a single male reference, then use those data to estimate ancestry and relatedness and define core lineage groups in the Germplasm Collection. Five lineages will then be chosen for reference-quality genome assembly and used to construct a more diverse pangenome for the species. Genome assembly samples will be subjected to deeper PacBio sequencing, Hi-C for scaffolding and phasing, and RNA-seq of five tissue types for annotation, using the latest assembly and annotation standards.
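The pooling plan described above (471 accessions, 24 individuals per Revio cell, 4x target coverage) can be checked with simple arithmetic. The sketch below is illustrative only: the haploid genome size (0.8 Gb) and the HiFi yield per cell (90 Gb) are assumed placeholder values, not figures from this project, and the function names are hypothetical.

```python
import math

# Values taken from the sequencing plan in the text.
N_ACCESSIONS = 471
SAMPLES_PER_CELL = 24
TARGET_COVERAGE = 4.0

# ASSUMPTIONS (not from the project description): rough placeholder values.
ASSUMED_GENOME_SIZE_GB = 0.8      # approximate haploid C. sativa genome size
ASSUMED_YIELD_GB_PER_CELL = 90.0  # nominal HiFi yield of one sequencing cell


def cells_needed(n_samples: int, samples_per_cell: int) -> int:
    """Number of cells required when each cell holds a fixed-size pool."""
    return math.ceil(n_samples / samples_per_cell)


def expected_coverage(yield_gb: float, samples_per_cell: int, genome_gb: float) -> float:
    """Mean per-sample coverage if the yield is split evenly across the pool."""
    return yield_gb / (samples_per_cell * genome_gb)


def min_yield_gb_for_target(target_cov: float, samples_per_cell: int, genome_gb: float) -> float:
    """Yield per cell needed for every pooled sample to reach the target coverage."""
    return target_cov * samples_per_cell * genome_gb


if __name__ == "__main__":
    print("cells needed:", cells_needed(N_ACCESSIONS, SAMPLES_PER_CELL))
    print("yield needed per cell (Gb):",
          round(min_yield_gb_for_target(TARGET_COVERAGE, SAMPLES_PER_CELL, ASSUMED_GENOME_SIZE_GB), 1))
    print("expected coverage per sample:",
          round(expected_coverage(ASSUMED_YIELD_GB_PER_CELL, SAMPLES_PER_CELL, ASSUMED_GENOME_SIZE_GB), 2))
```

Even pooling is itself an idealization; in practice per-sample yield varies within a pool, so the result is best read as a mean rather than a guaranteed minimum.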
I will then generate a pangenome graph from all existing diploid-phased, chromosome-scale Cannabis genomes on NCBI's GenBank together with those from this study, map the long-read low-pass data from all sampled individuals to it, and re-test for genomic variants based on the updated graph. I will also test for associations between SVs and nearly 150 key agronomic phenotypes collected for this project.
Our pangenome graph will contain phased diploid assemblies representing the maternal and paternal haplotypes of every individual included, increasing power by capturing more haplotypes for mapping traits. Notably, phased X/Y chromosomes will also be integrated into the graph to make them available for trait mapping, representing a novel contribution to the field; sex chromosomes have historically been removed from analyses in other species (including humans) due to complications in coverage and structural variation. The additional diploid phased assemblies from this study will also widen the representation of diverse genome types in the C. sativa pangenome, taking full advantage of its heterozygosity/variation and enabling the most comprehensive characterization of the USDA Hemp Germplasm Collection by mapping long-reads from nearly the entire collection.
A marker toolkit that accounts for sex chromosome variation across C. sativa will benefit hemp breeding programs, given the substantial resources devoted to selecting for specific XX/XY karyotypes for various purposes. Insights gained regarding sex chromosome variation in this study will lead to more efficient markers for sex-genotyping across diverse collections. I will develop markers that test for sex chromosome karyotypes that result in different sex-expression patterns, using the pangenome to identify conserved and variable regions. Given the importance of sex-linked loci controlling different agronomic traits, along with the importance of plant sex to growers, C. sativa sex chromosomes deserve more attention in genome analysis.
However, sex chromosome analysis in C. sativa and many other crops is largely avoided because computational tools commonly lack the functionality to account for their unique biology compared to autosomes. I will collate all sex chromosome variants and test whether they correlate with changes in different agronomically important phenotypes by mapping long-reads to a pangenome graph explicitly built to incorporate sex chromosomes. This mapping will carefully account for which sex is being mapped (i.e., XX vs. XY) when assessing read coverage/depth in the non-recombining regions of the X and Y relative to the pseudoautosomal regions (PARs, where X-Y recombination still occurs) and the autosomes. For example, XY karyotypes should yield around 50% of read mapping depth across the X- and Y-specific non-recombining regions and 100% depth throughout the PAR and autosomes, whereas females should exhibit 100% depth across the X and all autosomes, with only mismapped reads on the Y, assuming no SVs. Effective integration of both X and Y chromosomes into a pangenome graph will provide a framework for including large hemizygous genome regions in graph-based analyses.
Efforts: Effective and widespread communication of methods and results from this project is a high priority to me and all others involved. Findings from this study will not only be impactful to hemp breeders/stakeholders but will also be of interest to the plant breeding field overall. To reach the most people and extend the impact of this project, I will present these methods and findings in at least three first-author manuscripts and by speaking at several international conferences and various academic and public outreach events across the U.S.
I will also leverage the data and methods generated through this project to continue training the next generation of scientists by providing hands-on experience and mentorship through HudsonAlpha's BioTrain internship program.
Evaluation: Major milestones for this project include:
1. Completing DNA sequencing and initial genotyping for all samples planned for this study by Jul. 2025
2. Generating all five new genome assemblies and finalizing the updated hemp pangenome for the association mapping analysis by Oct. 2025
3. Completing all mapping, sex chromosome, and gene discovery analyses by the end of 2025
4. Beginning marker testing in Jul. 2026
5. Presenting at scientific conferences, including Plant and Animal Genomes Conference 2025/2026, Cannabis Research Conference 2025, CROPS 2026, National Association of Plant Breeders 2026, and USDA-ARS Hemp Field Day 2026
6. Submitting manuscripts for publication in Sept. 2025, Apr. 2026, and Dec. 2026
My progress will be determined according to those dates, for which my Mentor, Collaborating Mentor, and Advisory Board will hold me accountable at recurrent one-on-ones and quarterly group meetings. My findings will also be evaluated by peer review of the three manuscripts planned for this project. Presentations at conferences will be one of the main avenues I will use to communicate findings to the field. As this project would be funded by the public and is driven to benefit the public, dissemination of data and results is a major priority for everyone involved. All data related to the USDA Hemp Germplasm Collection will be disseminated through the USDA-ARS Germplasm Resources Information Network (GRIN) and other USDA online channels with support from Dr. Zach Stansell (Collaborating Mentor). All genome resources will also be made freely available and easily accessible via online databases like NCBI GenBank to encourage the reuse of these valuable data and benefit the scientific community.
Similarly, all genome, phenome, and marker resources will be linked to the corresponding 471 U.S. National Plant Germplasm System accessions analyzed in this study, including recorded phenotype data for many priority traits. These data represent an invaluable resource to the hemp and broader C. sativa breeding community and have the potential to advance hemp cultivation and, in turn, bolster the U.S. bioeconomy and strengthen food security.
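The read-depth expectations stated under Project Methods (XY: about 50% depth on the X- and Y-specific non-recombining regions and 100% on the PAR and autosomes; XX: 100% on the X and autosomes and essentially none on the Y, assuming no SVs) can be sketched as a small lookup table with a nearest-match check. This is a minimal illustration under those stated expectations, not the project's pipeline; the region labels, function names, and example depths are hypothetical.

```python
# Expected depth relative to the autosomal mean, as stated in the text
# (assumes no structural variants and ignores mismapped reads on the Y).
EXPECTED_RELATIVE_DEPTH = {
    "XY": {"autosome": 1.0, "PAR": 1.0, "X_specific": 0.5, "Y_specific": 0.5},
    "XX": {"autosome": 1.0, "PAR": 1.0, "X_specific": 1.0, "Y_specific": 0.0},
}


def normalize_depth(mean_depth_by_region: dict) -> dict:
    """Scale each region's mean depth by the autosomal mean depth."""
    autosome_depth = mean_depth_by_region["autosome"]
    if autosome_depth <= 0:
        raise ValueError("autosomal depth must be positive")
    return {region: depth / autosome_depth
            for region, depth in mean_depth_by_region.items()}


def infer_karyotype(mean_depth_by_region: dict) -> str:
    """Return the karyotype whose expected depth profile is closest
    (by squared distance) to the observed, autosome-normalized profile."""
    relative = normalize_depth(mean_depth_by_region)

    def squared_distance(karyotype: str) -> float:
        expected = EXPECTED_RELATIVE_DEPTH[karyotype]
        return sum((relative[region] - value) ** 2
                   for region, value in expected.items())

    return min(EXPECTED_RELATIVE_DEPTH, key=squared_distance)


if __name__ == "__main__":
    # Hypothetical mean depths for two low-pass (~4x) samples.
    male_like = {"autosome": 4.0, "PAR": 4.1, "X_specific": 2.1, "Y_specific": 1.9}
    female_like = {"autosome": 4.0, "PAR": 3.9, "X_specific": 4.2, "Y_specific": 0.1}
    print(infer_karyotype(male_like), infer_karyotype(female_like))
```

A sample whose profile is far from both expectations would be a candidate for closer inspection (for example, for SVs or sample mix-ups) rather than a forced call; that threshold is left out of this sketch.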