Comprehensive Identification and Mapping and Characterization of Hessian Fly genes using a Innovative Whole Genome Sequencing Approach

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

COMPLETE

Funding Source

NRI COMPETITIVE GRANT

Reporting Frequency

Annual

Accession No.

0212900

Grant No.

2008-35302-18816

Cumulative Award Amt.

$400,000.00

Proposal No.

2007-04624

Multistate No.

(N/A)

Project Start Date

Feb 1, 2008

Project End Date

Jan 31, 2011

Grant Year

2008

Program Code

[51.2C]- Arthropod and Nematode Biology and Management (C): Tools, Resources and Genomics

Recipient Organization
BAYLOR COLLEGE OF MEDICINE
(N/A)
HOUSTON,TX 77030

Performing Department
(N/A)

Non Technical Summary
The Hessian fly is a major pest of US wheat crops and the world's most important wheat pest. Researchers, including many in the US funded by the USDA, are trying to find better ways to control this pest and reduce the damage it does to wheat yields. Both basic and applied research into this pest often require genetic information, specifically about gene and protein structure. Examples include identifying pesticide resistance genes, obtaining the protein sequences of pesticide targets to allow better pesticide design, and expressing pesticide targets in bacteria so that interactions between a pesticide and its target can be studied and better understood. Identifying genes and proteins of interest one gene at a time is expensive and time-consuming. In this proposal we will rapidly and very inexpensively identify, characterize and map every gene in the genome to speed research into this important pest species. We are applying new massively parallel sequencing technologies that dramatically reduce the cost of a sequencing project of this size, from tens of millions of dollars in the late 1990s, to about 5 million dollars around 2003, to $400,000 in this proposal. In other species, the availability of the "toolkit" of genes and proteins that make up an organism has dramatically accelerated research progress and results. For example, laboratories can now study the entire set of ligand-gated ion channels (a target of several major pesticides) with confidence that none are missing, and with the full protein sequence of each gene. Whilst until now the high cost of sequencing has made this global approach uneconomic for species with small communities of researchers, the new lower costs make research on insect species uneconomical without a whole genome sequence and a full description of the gene and protein sets.
Whilst there is always a delay between the acquisition of primary basic data and actual results in the field, we have no doubt that the data produced by this proposal will dramatically speed the efforts of Hessian fly researchers to reduce the damage caused by this important pest.

Animal Health Component

(N/A)

Research Effort Categories

Basic

100%

Applied

(N/A)

Developmental

(N/A)

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
211	3110	1130	100%

Knowledge Area
211 - Insects, Mites, and Other Arthropods Affecting Plants;

Subject Of Investigation
3110 - Insects;

Field Of Science
1130 - Entomology and acarology;

Keywords

Goals / Objectives
OBJECTIVES: We will identify, characterize and map the vast majority of genes of the wheat pest Mayetiola destructor, the Hessian fly.
1. Generate raw sequence data representing 12-fold coverage of the Hessian fly genome with 19 runs (21 attempted, allowing for a 10% failure rate) of the GS-FLX genome sequencer (454 Inc.), each run generating 100Mb of sequence in 250bp reads.
2. Generate 32X clone coverage paired-end data with 3kb and 10kb insert sizes using the 454 GS-20. These paired-end data will be used in the assembly process to determine the order and orientation of the majority of contigs in the assembled sequence.
3. Assemble 2Gb of raw 454 GS-FLX sequence reads and paired-end data into sequence scaffolds of ordered and oriented contigs, followed by placement on the existing physical map.
4. Generate ~1,200,000 EST sequences from a variety of Hessian fly tissues, to provide an extensive transcribed sequence data set to drive automated gene identification and annotation.
5. Produce an automated annotation of the assembled Hessian fly genome sequence based on EST data and protein homologies, using the BCM-HGSC import of the Ensembl gene annotation pipeline and other gene prediction programs, including NCBI Gnomon.
6. Deposit data in public databases and on the BCM-HGSC website; establish database collaborations with FlyBase and the KSU Arthropod Genomics Center.
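The coverage target in objective 1 follows from simple arithmetic, which the short Python sketch below illustrates. It assumes a genome size of 158Mb (the estimate quoted in the 02/01/08 to 01/31/09 progress report, not stated in the objectives themselves); the constant and function names are illustrative only and are not part of the project's software.

```python
# Illustrative sketch only; names are hypothetical, not part of the project pipeline.

GENOME_SIZE_BP = 158_000_000   # assumption: 158Mb estimate from the year-one progress report
BASES_PER_RUN = 100_000_000    # objective 1: 100Mb of sequence per GS-FLX run
RUNS_NEEDED = 19               # objective 1: 19 successful runs
RUNS_ATTEMPTED = 21            # objective 1: 21 runs attempted
FAILURE_RATE = 0.10            # objective 1: 10% failure rate allowed for


def fold_coverage(total_bases: int, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Return sequence coverage as total sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Raw sequence expected from the successful runs, and the coverage it represents.
planned_bases = RUNS_NEEDED * BASES_PER_RUN
# Number of runs expected to succeed if 21 are attempted at a 10% failure rate.
expected_successes = RUNS_ATTEMPTED * (1 - FAILURE_RATE)

print(f"Planned raw sequence: {planned_bases / 1e9:.1f} Gb")
print(f"Planned coverage: {fold_coverage(planned_bases):.1f}X")
print(f"Expected successful runs out of {RUNS_ATTEMPTED}: {expected_successes:.1f}")
```

Under these assumptions, 19 runs yield about 1.9Gb of raw sequence, which is consistent with the 2Gb cited in objective 3 and with roughly 12-fold coverage of a 158Mb genome.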

Project Methods
We will generate 12-fold random sequence coverage of the Hessian fly genome using the pyrosequencing technology platform from 454. Additionally, paired-end sequence data and transcription data will be generated. The random sequence will be assembled into a draft genome sequence using the Atlas assembly suite of software tools developed at the Baylor College of Medicine Human Genome Sequencing Center. Gene sequences will be annotated automatically using existing annotation software pipelines, with reference to the extensive transcription sequence data also generated by this project. All results will be placed in multiple publicly accessible data repositories.

Progress 02/01/08 to 01/31/11

Outputs
OUTPUTS: I am pleased to report that all of the technical and scientific objectives of this proposal are complete. We are now conducting a small community analysis of the completed genome to generate publications that connect the biology of M. destructor with its annotated genome and genes.
Sequence generation and genome assembly. We performed 9 454 Titanium runs, generating 3,694,907,229 bp of fragment sequence, or 23X coverage of the Hessian fly genome, with an average read length of 323.2 bp. To help with assembly we also generated 2,595,291 successful 3kb paired-end reads, giving a total "clone" coverage of 6.48 Gb, or 41X clone coverage of the Hessian fly genome. Additionally, we performed six 20kb paired-end Titanium runs, giving 338X clone coverage of the Hessian fly genome. We assembled this sequence to generate the Mdes 1.0 genome assembly. The assembly comprises 153Mb of sequence and is available from NCBI, the BCM-HGSC website and Agripestbase at KSU. Alignment of EST sequences to the genome showed that it contains the vast majority of Hessian fly genes. The contig N50 length is 14kb, and the scaffold N50 length is 756kb. 60% of the genome assembly was placed on M. destructor chromosomes using a physical map provided by Jeff Stuart. The quality of the assembly was assessed by alignment of RNAseq data: >95% of RNAseq transcripts could be aligned to the genome assembly, supporting its completeness.
Transcript sequencing and genome annotation. We generated 4 Illumina lanes of RNAseq data (95 bp read length, paired-end data, ~250bp insert size) from 4 different life stages: pooled female eggs (57M reads), female first instar larvae (64M reads), male first instar larvae (60M reads) and female third instar larvae (50M reads). These data, plus protein sequences from other species and ab initio gene predictions, were used to run MAKER 2.0 and generate evidence-based gene models for M. destructor. We annotated 13,284 protein-coding genes with an average length of 394 amino acids.
Data dissemination. The genome assembly is available via NCBI GenBank, the BCM-HGSC webpage, and Agripestbase at KSU. BLAST resources are available at all three sites, and a GMOD-based browser for viewing the annotated assembly is also available at Agripestbase. Agripestbase is also running the community annotation of the Hessian fly with the GMOD Apollo annotation tool.
PARTICIPANTS: Nothing significant to report during this reporting period.
TARGET AUDIENCES: The intermediate Hessian fly assembly has already been used extensively by molecular Hessian fly laboratories to accelerate their research into the Hessian fly. The number of investigators is expanding as the annotation consortium scales up. Our current target audience is entomologists, molecular insect scientists and plant pathologists, but through publication we hope to bring wheat breeders and growers into the knowledge circle, so that molecular information about gall formation can be used in the quest to generate wheat strains with long-term resistance to M. destructor.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.
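The contig and scaffold N50 values reported in the Outputs above are a standard assembly summary statistic: the length L such that contigs (or scaffolds) of length L or longer contain at least half of the assembled bases. A minimal Python sketch of that definition follows. The contig lengths are made-up toy values, not project data, and the function name is illustrative rather than taken from the Atlas assembly software.

```python
# Illustrative sketch of the N50 definition; not the project's assembly code.

def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are sorted from longest to shortest and summed; the N50 is the
    length of the sequence at which the running total first reaches at least
    half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths in bp (hypothetical values chosen for illustration).
toy_contigs = [80, 70, 50, 40, 30, 20]
print("Toy N50:", n50(toy_contigs))
```

Because the statistic is weighted by length, a few long scaffolds can raise the scaffold N50 substantially even when the contig N50 changes little, which is why the two values are reported separately.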

Impacts
Whilst we have completed our specific aims of annotating all of the Hessian fly genes, our broader goal is to greatly accelerate Hessian fly (M. destructor) research. The Hessian fly is a wheat pest that lays eggs on young wheat plants. The larvae that hatch trick the plant into growing a gall around them and feeding the growing insect, to the detriment of the plant, which shows stunted growth and poor crop yields. We now have a list of all 13,284 proteins that can possibly enable gall formation. The gene list has already accelerated research into the genes and processes causing gall formation. Small Secreted Salivary Gland Proteins (SSSGPs) have been identified by the Stuart Lab at Purdue University by genome mapping, which was made much faster by the availability of these sequences. The Chen Lab at KSU has additionally found hundreds of these genes in the genome sequence based on sequence similarity. Expression levels from the RNAseq data generated here show that these genes are expressed at the first instar larval stage, at the onset of gall formation. Additional analysis of the genome is focusing on sex determination and on host choice, the latter by comparing sequences from a nearby species that prefers barley to wheat. A small consortium is being formed around the Hessian fly sequence to study all aspects of Hessian fly biology and to prepare multiple publications.

Publications

No publications reported this period

Progress 02/01/09 to 01/31/10

Outputs
OUTPUTS: Progress report. Delays are due to a decision to wait for the 454 Titanium platform and to assembly difficulties. We have completed all sequence generation goals. The original objectives are listed here, interspersed with our current progress.
1. Generate raw sequence data representing 12-fold coverage of the Hessian fly genome: We generated 23X coverage of the Hessian fly genome with an average read length of 323.2 bp on the 454 platform, and an additional 12X coverage on the Illumina platform.
2. Generate 32X "clone" coverage paired-end data with 3kb and 10kb insert sizes: We generated 41X clone coverage of the Hessian fly genome in 3kb insert sizes and 338X clone coverage in 20kb insert sizes.
3. Assemble 2Gb of raw 454 GS-FLX sequence reads and paired-end data into sequence scaffolds of ordered and oriented contigs, followed by placement on the existing physical map: A 0.5 version assembly is available on the BCM-HGSC website (link below) with a contig N50 of 9.8kb and a scaffold N50 of 271kb. Unfortunately, a 9.8kb contig N50 is unlikely to encompass the majority of genes in single contigs. An improved assembly is being prepared; its current statistics include a contig N50 of 14.1kb and a scaffold N50 of 1.06Mb. Whilst these statistics are significantly better, we require further improvements in the contig N50 before accepting a final assembly (the improved scaffold N50 is more than sufficient). Our aim is to increase the contig N50 beyond 20kb, so that most genes are contained within a single contig.
4. Generate ~1,200,000 EST sequences from a variety of Hessian fly tissues: One Illumina paired-end lane with 110bp read length produced 7.15 million clones, each with 220bp of raw sequence data. The vastly increased depth of sequencing on the Illumina platform allows annotation of a higher percentage of Hessian fly genes. We are currently (March 2010) performing additional EST sequencing using RNA from mixed-sex eggs, male and female mid-stage larvae, and mixed-sex late-stage larvae.
5. Produce an automated annotation of the assembled Hessian fly genome sequence: We are waiting on the final assembly and additional EST data to start work on this aim.
6. Deposit data in public databases: We are in the process of placing all raw sequence data into the short read trace archive. The initial version of the assembly is already available to the public via the HGSC website at: http://www.hgsc.bcm.tmc.edu/project-species-i-Hessian_fly.hgscpageLocation=Hessian_fly . As additional assemblies and annotations become available, they will also be made available to the public as described in the original grant.
PARTICIPANTS: Nothing significant to report during this reporting period.
TARGET AUDIENCES: Nothing significant to report during this reporting period.
PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
Impact: The intermediate Hessian fly assembly has already been used extensively by Dr. Stuart's laboratory and other laboratories to accelerate their research into the Hessian fly. In particular, Dr. Stuart's work to clone susceptibility and resistance genes has been greatly accelerated. We expect this impact to grow as the final assembly and annotation are released and advertised more broadly.

Publications

No publications reported this period

Progress 02/01/08 to 01/31/09

Outputs
OUTPUTS: Our year one goals for the Hessian fly genome project were: 1, to generate 12X raw sequence coverage of the Hessian fly genome on the 454 FLX platform; 2, to generate 32X "clone coverage" of the genome in paired-end data; and 3, to assemble these sequence reads into a genome sequence of ordered and oriented contigs. Year 2 goals aim to identify and annotate Hessian fly genes. There has been one biological complication (which, to be honest, should have been foreseen) that has delayed the project. The Hessian fly has an unusual sex determination system, wherein a female gives birth to either all-male or all-female offspring. To allow the creation of an inbred line for sequencing and assembly, Jeff Stuart identified a female that produced mostly male but some female offspring, and an inbred line was created that is >95% male. Unfortunately, the Hessian fly has two X chromosomes accounting for approximately 40% of the total genome, and in males these will be at half coverage (or 6X) with our original sequencing strategy, which would likely produce smaller contigs. To overcome this limitation, we used an updated 454 chemistry (XLR), which produces longer read lengths and more data per run, to produce 20X sequence coverage of the Hessian fly genome, ensuring that the X chromosomes will have at least 10X sequence coverage. The reagents for this upgrade only became widely available in September 2008, which has caused a 5-6 month delay. 11 XLR runs were performed (10 of which were successful), generating a total of 3,582Mb of raw sequence, or 22.6X coverage of the 158Mb genome, fulfilling objective one of the grant. Of these, five 454-XLR runs were of paired-end sequence libraries, and these produced 38.75X paired-end coverage of the Hessian fly genome where both ends could be mapped within the initial assemblies.
Our third goal for the year is the assembly of the raw sequence into ordered and oriented contigs. Because of the delay in obtaining sequence, the assembly process was only started in Jan 09, and at this stage we can only report an intermediate assembly. This initial assembly has a total size of 126Mb and an N50 contig size of 3.5kb.
Initial assembly details:
Number of contigs (> 500bp) = 53,939
Number of bases = 126,251,682
Avg contig size = 2,340
N50 contig size = 3,509
Largest contig size = 62,836
Q40 plus bases = 118,284,880 (93.69%)
Q39 minus bases = 7,966,802 (6.31%)
Whilst this assembly is clearly not good enough, we will be releasing it as an initial assembly on the Human Genome Sequencing Center website and making it web searchable via BLAST to aid researchers on the Hessian fly and anyone else who has an interest. A fuller, improved assembly is in progress, and we expect to produce a higher quality product in the coming year (likely in less than 6 months). Unfortunately this can only count as a partial fulfillment of goal three for the year, and we hope to catch up over year two of the grant.
PARTICIPANTS: Stephen Richards (PD) directed the accumulation of sequence data for this project and performed the initial assembly of the sequence data. Jeff Stuart (co-PD) produced and provided pure isolated Hessian fly DNA from an inbred Hessian fly line.
TARGET AUDIENCES: Not relevant to this project.
PROJECT MODIFICATIONS: We produced sequence using an upgraded version of the 454 pyrosequencing platform (upgrade from FLX to XLR). This allowed more sequence to be generated at the same cost, but delayed the project by 6 months. We increased the sequence coverage of the Hessian fly genome from 12X to 22X to enable proper assembly of the X chromosomes (40% of the genome) from an inbred line consisting mostly of male individuals.
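The coverage reasoning behind the change from 12X to 22X can be sketched in Python as follows. The first function encodes the report's own rule of thumb (X chromosomes in males at half the nominal coverage). The second is a slightly more detailed model that is an assumption of this sketch, not something stated in the report: it treats the sample as entirely male (the line is described as >95% male), with two copies of each autosome and one copy of each X chromosome, and distributes reads in proportion to DNA content. All function names are illustrative.

```python
# Illustrative sketch only; the second model is an assumption, not the report's method.

def x_coverage_rule_of_thumb(nominal_coverage: float) -> float:
    """Report's rule of thumb: X chromosomes in males sit at half the nominal coverage."""
    return nominal_coverage / 2


def coverage_in_males(nominal_coverage: float, x_fraction: float = 0.4) -> tuple:
    """Return (autosome_coverage, x_coverage) for an all-male sample.

    Assumes nominal coverage = total bases / haploid genome size, that the X
    chromosomes make up `x_fraction` of the haploid genome, and that reads are
    sampled in proportion to DNA content (autosomes x2, X chromosomes x1).
    """
    if not 0 < x_fraction < 1:
        raise ValueError("x_fraction must be between 0 and 1")
    # Total DNA per male cell, in units of the haploid genome size:
    # 2 * (1 - x_fraction) for the autosomes plus 1 * x_fraction for the X chromosomes.
    copies = 2 - x_fraction
    autosome_coverage = 2 * nominal_coverage / copies
    x_coverage = nominal_coverage / copies
    return autosome_coverage, x_coverage


for nominal in (12, 20, 22.6):
    autosome, x_chrom = coverage_in_males(nominal)
    print(
        f"nominal {nominal}X: rule of thumb X = {x_coverage_rule_of_thumb(nominal):.1f}X; "
        f"model autosomes = {autosome:.1f}X, X = {x_chrom:.1f}X"
    )
```

Both models agree that the X chromosomes receive half the per-base coverage of the autosomes; they differ only in how that coverage is expressed relative to the nominal figure.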

Impacts
The availability of the genome sequence of the Hessian fly is the first step towards the comprehensive identification, mapping and characterization of Hessian fly genes. We are currently making this genomic information available and searchable via a web-based BLAST service on our website, to accelerate Hessian fly research into virulence genes and other genes that affect the pest's reduction of crop yields.

Publications

No publications reported this period