Source: BAYLOR COLLEGE OF MEDICINE submitted to
IMPROVING REFERENCE GENOMES FOR AGRICULTURALLY IMPORTANT ANIMALS
Sponsoring Institution
National Institute of Food and Agriculture
Project Status
TERMINATED
Funding Source
Reporting Frequency
Annual
Accession No.
1000517
Grant No.
2013-67015-21228
Project No.
TEXW-2013-00978
Proposal No.
2013-00978
Multistate No.
(N/A)
Program Code
A1201
Project Start Date
Sep 1, 2013
Project End Date
Aug 31, 2018
Grant Year
2013
Project Director
Worley, K. C.
Recipient Organization
BAYLOR COLLEGE OF MEDICINE
(N/A)
HOUSTON, TX 77030
Performing Department
Molecular & Human Genetics
Non Technical Summary
High quality reference genomes are the cornerstone of genetic analyses. Even "high quality" draft sequences have thousands of remaining gaps that impact the assessment of genes and other functional elements. We propose to apply our successful methods using Pacific Biosciences (PacBio) sequence and PBJelly software to improve the high quality draft genome sequences for two agriculturally important animals. Specific Objectives: We will (1) Produce whole genome shotgun sequence using the PacBio technology from the Hereford cow and the Texel ram, (2) apply the PBJelly software to address gaps remaining in the current draft genome sequences, (3) in collaboration with other genome improvement efforts, publish improved draft genome sequences for the bovine and ovine genomes. Approach: The PacBio sequence reads are long (2 kb average, up to 10 kb) and therefore useful for spanning gaps in a draft genome. The PBJelly method addresses most gaps and closes many (35 % to 70 %). In all cases the contiguity improves, the contig N50 increases two to four fold. PBJelly takes genome scaffold sequences as input, produces an improved genome scaffold as output, can be used iteratively and combined with other genome improvement efforts to benefit from each method. Potential impact: The stated goal of "Development of community resources and tools, especially focusing on: Improvement of genome assembly and annotation." will be met by our proposal to improve the cow and the sheep references. These better genomes will impact the genetic studies of production animals and breeds around the world.
Animal Health Component
75%
Research Effort Categories
Basic
100%
Applied
(N/A)
Developmental
(N/A)
Classification

Knowledge Area (KA)    Subject of Investigation (SOI)    Field of Science (FOS)    Percent
304                    3399                              1080                      50%
304                    3699                              1080                      50%
Goals / Objectives
High quality reference genomes are the cornerstone of genetic analyses. Even "high quality" draft sequences have thousands of remaining gaps that impact the assessment of genes and other functional elements. We propose to apply our successful methods using Pacific Biosciences (PacBio) sequence and PBJelly software to improve the high quality draft genome sequences for agriculturally important animals. Specific Objectives: We will (1) Produce whole genome shotgun sequence using the PacBio technology from the Hereford cow and the Texel ram, (2) apply the PBJelly software to address gaps remaining in the current draft genome sequences, (3) in collaboration with other genome improvement efforts, publish improved draft genome sequences for the bovine and ovine genomes.
Project Methods
For each genome, we will obtain a high quality DNA sample from the sequenced animal. The DNA will be checked for quality and quantity, libraries will be made using library construction methods appropriate to the PacBio system, and the libraries will be sequenced using the current best practices for PacBio (English, Richards et al. 2012). The current best genome assembly will be obtained from the genome collaborators or GenBank as appropriate. The sequence data will be analyzed and the gaps in the genome filled using the PacBio sequence reads and the PBJelly software (English, Richards et al. 2012). The new, post-PBJelly version of the genome will be quality checked at BCM-HGSC and in collaboration with the genome research communities. Any iteration of improvements with other efforts will be performed, and a final improved assembly version generated and released to GenBank and from there to the appropriate animal genome database. Samples are available from the previously sequenced animals for both the sheep and the cow projects. Mike Heaton at USDA Meat Animal Research Center has blood (stored at 4 °C) and buffy coat samples available from the sheep ram reference animal (Texel breed, animal # 200118011) that was the donor for the BAC library (see letter of collaboration). Lee Alexander of USDA ARS has blood and tissue samples from L1 Dominette 01449, the bovine reference animal (see letter of collaboration). For this proposal, 50 µg of high molecular weight DNA is required. PacBio sequencing is a production technology in the BCM-HGSC (English, Richards et al. 2012). Current sequencing is not considered to be in the experimental or technology development state. Sequence metrics such as number of post filtered reads, number of post filtered bases, number of mapped reads, number of mapped bases, read length, mapped read length, mapped subread length, read quality and mapped subread accuracy are measured and tracked for each sequencing event (SMRT Cell).
Improvements to the PacBio processes have been introduced by the Pacific Biosciences company with regularity over the last year and a half. The BCM-HGSC has a close collaboration with the PacBio development group, with bi-weekly conference calls that maintain excellent communication between BCM-HGSC and PacBio staff working on these improvements. We have been given early access to many new advances. Laboratory procedures used at the BCM-HGSC have evolved with these changes. Future improvements are expected from the Pacific Biosciences company, and we will continue to work with Pacific Biosciences to optimize the implementation at BCM-HGSC. The PBJelly algorithm is described in detail elsewhere (English, Richards et al. 2012). Briefly, reads are aligned using BLASR (Li, Zhu et al. 2010) and gaps in the assembly are evaluated. Reads that are associated with each gap are then assessed and assembled. The resulting gap-filling contigs are then spliced into the assembly. Contigs that address specific gaps but do not span all the way across them may be used to extend the flanking contig ends. Furthermore, the process can be iterated using the newly extended flanking contigs and the remaining reduced gap in a second round of PBJelly gap filling. The quality of the gap-filling alignments is evaluated at each step. The software is robust, and can be run relatively quickly on a mammalian sized data set using the substantial BCM-HGSC computational resources. The processing is embarrassingly parallel, since each gap can be processed separately in the large compute cluster with small memory resources, or bundled into groups to match the available large memory resources. This will allow iterations of PBJelly improvements to alternate with other genome improvement methods. Assembly quality is assessed using a variety of metrics.
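The per-gap independence described above can be illustrated with a minimal sketch. This is not PBJelly code: it assumes, as is conventional for scaffold FASTA files, that gaps are written as runs of N, and the names find_gaps and batch are hypothetical helpers introduced only for this illustration. It locates each gap and bundles gaps into groups that could be dispatched to separate cluster jobs.

```python
import re

# Assumption: gaps in scaffold sequences are represented as runs of N/n.
GAP_RE = re.compile(r"[Nn]+")


def find_gaps(scaffolds, min_len=1):
    """Yield (scaffold_name, start, end) for every run of N.

    Coordinates are 0-based and half-open, so end - start is the gap length.
    """
    for name, seq in scaffolds.items():
        for match in GAP_RE.finditer(seq):
            if match.end() - match.start() >= min_len:
                yield name, match.start(), match.end()


def batch(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Toy scaffolds, invented for illustration only.
    toy = {"scaf1": "ACGTNNNNACGTNNAC", "scaf2": "NNACGT"}
    gaps = list(find_gaps(toy))
    for job_id, group in enumerate(batch(gaps, 2)):
        # Each group could be handed to an independent worker.
        print(job_id, group)
```

Because no gap depends on the result for any other gap, the group size can be tuned to the memory available per job without changing the outcome.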
We measure the contiguity using the contig size distribution, including the contig and scaffold N50, N75, N90, N95, and mean as well as the total assembled genome size. Quality of the assembly is measured by mapping reads back to the assembly and checking for mate pair consistency and by comparison to available ESTs, cDNAs, and genes, looking for high identity matches and assessing the number of ESTs matched over 95%, 90%, 75% and 50% of their length. Syntenic comparisons to genomes of closely related species and to available finished sequences are also useful consistency checks. Global checks like these are useful, but it is also important to work with the research community and assess individual genes and gene families in a deeper analysis.
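As an illustration of the contiguity metrics named above, the following sketch computes Nx values (N50, N75, N90, N95) from a list of contig or scaffold lengths. It uses the standard definition: sort lengths from longest to shortest and report the length at which the running total first reaches x% of the assembled bases. The function name nx and the toy lengths are assumptions made for this example, not part of the project's software.

```python
def nx(lengths, percent=50):
    """Return (Nx, count) for a collection of sequence lengths.

    Nx is the length of the sequence at which the cumulative sum, taken from
    longest to shortest, first reaches `percent` of the total length. `count`
    is how many sequences were needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count


if __name__ == "__main__":
    # Toy contig lengths, invented for illustration only.
    contigs = [80, 70, 50, 40, 30, 20, 10]
    for pct in (50, 75, 90, 95):
        print(f"N{pct}:", nx(contigs, pct))
    print("mean:", sum(contigs) / len(contigs))
    print("total:", sum(contigs))
```

The same function applies to contigs and to scaffolds; only the input list of lengths differs.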

Progress 09/01/13 to 08/31/18

Outputs
Target Audience: The target audience for this project is researchers who rely upon the reference genomes to study the genetics of sheep and cattle. This includes breeders who use genotyping information as well as the scientific community working to understand the biology of these animals. Changes/Problems: Decreases in the cost of long read sequencing allowed additional work, including a de novo long read assembly of a Rambouillet sheep genome and some sequencing for FAANG annotation of the new reference genome. This work is described in the Accomplishments section. What opportunities for training and professional development has the project provided? No students are funded as a part of this project. Young professionals, including a recent graduate, Shwetha Murali, learned techniques for genome assembly improvement and software development with these data. Ms. Murali has gone on to work with Evan Eichler, still using the skills she developed here. How have the results been disseminated to communities of interest? Results from this project have been presented at scientific meetings (see citation list) and released to the public through the NCBI (Table 1). What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? All of the initially proposed goals were completed ahead of schedule. The samples for both the Texel ram and the Hereford cow were acquired and the sequence generated in the first year of the grant. The improved reference genomes and the raw data were submitted to NCBI and available for researchers worldwide in 2015 (Table 1). The improved reference genomes have been annotated by NCBI and Ensembl and are being used by the research communities. In addition, the rapid drop in price for the long-read sequencing allowed for additional goals to be reached. New de novo genome assemblies from long reads were proposed for the Rambouillet sheep and the Hereford cow, with the Rambouillet genome produced as a part of this funding and the Hereford data generated here available for a joint assembly with data generated at USDA MARC by Tim Smith and Juan Medrano. Submission of the Rambouillet data to the SRA and GenBank is in progress. In the meantime, the data are available from the HGSC website (https://www.hgsc.bcm.edu/other-mammals/sheep-genome-project). All the reference genomes are improved, being more complete, and more contiguous with more complete genes and proteins found in the assemblies. Both of these assemblies used the PBJelly method[4] for improvement. The improved bovine genome assembly used the UMD_3.1 genome assembly for the initial scaffolds[5]. The Texel sheep genome assembly used the Oar_v3.1 genome assembly for the initial scaffolds[3]. The Oar_v4.0 improved Texel sheep assembly statistics by chromosome indicate fewer unplaced contigs, fewer gaps on each chromosome resulting in shorter overall length and longer ungapped length. Consistent with our experience with other genomes, the contigN50 values for the short read assemblies improved several fold with the long read gap-filling technique PBJelly[1] and Pacific Biosciences long reads. 
De novo long read assemblies increase the contig N50 values by an additional order of magnitude, as illustrated with the Rambouillet genome assembly. PacBio long read data were used to error-correct themselves and the error-corrected reads were assembled. A number of experiments were performed using both the Falcon and Celera assembly engines and both Arrow and Quiver error correction methods to define the best preliminary assembly contigs. In addition, a Fermi assembly of Illumina data was used to generate high quality but short contigs with which to evaluate the base quality of the different long read assemblies. The Celera assembly had shorter contigs but better quality, so this assembly was used for the subsequent steps. The initial contigs were already much more contiguous and complete than the previous Texel assemblies. Hi-C data were generated from a blood sample and used for scaffolding with the Phase Genomics PGS method. The contig statistics did not change and the genome was scaffolded into fewer pieces than Oar_v4.0 and largely into chromosome groups. Of the 10,441 initial contigs, 8,035 clustered, 7,829 ordered and 2,047 had high quality orientation based upon the Hi-C data. The genome is in less than half the number of scaffolds as Oar_v4.0. The later assembly steps changed the scaffold and contig statistics slightly as scaffolds were broken and merged for consistency with mate pair data and the map information in Oar_v4.0, as gaps were filled with PBJelly, and as the final base quality was polished with long read data using Arrow and short read data using Pilon. The final version has a 2.6 Mb contig N50. The assembly was compared to the BUSCO conserved mammalian ortholog set at each of the stages. The alignment of ESTs, mRNAs, and RefSeq mRNAs to the assemblies showed that more of these transcribed sequences align over more of their length in the final Rambouillet assembly than in the less contiguous Fermi Illumina assembly and Oar_v4.0.
Additional samples from more than 100 tissues have been collected from the same reference animal for FAANG assays. These assays include both long and short read transcript sequencing that will be available to produce an improved annotation for the new Rambouillet genome assembly. Since these data include long read sequence that captures multiple exon splice sites in one read, we anticipate that this will aid the definition of alternative transcript forms for each gene. The deeper sequencing afforded by the Illumina sequencing will sample more of the transcribed exons and provide evidence for exon annotation. Other assays that mark other types of features (e.g. histone acetylation) will also be available for the genome annotation as a result of a new USDA NIFA award 2017-67016-26301 for FAANG analyses. The FAANG sequencing completed to date was accomplished in the extension year of this award. Tissues (65) were received in the summer of 2016 and RNA was prepared and sequencing libraries generated starting in October 2016. Sequence was generated during the first half of 2017 from 20 samples for miRNA sequence, 9 samples for mRNA sequence and 5 samples for long-read mRNA sequence using funding from this award as well as USDA NIFA award 2013-67015-21372/sub #130428-00001237 "Building the Sheep Genome Database" to Noelle Cockett, subcontract to Kim Worley, and carry forward funding from a University of Sydney contract for sheep genomics resources. Additional sequencing and combined analysis are ongoing, with a preliminary data release planned for 2018. Best practices for annotation using long-read data or a combination of long-read and short-read data are not yet defined, so the multiple data types from the same tissues from a single individual available as a part of this project are very useful for determining these methods.
Initial analyses of the long-read sequencing data using the Texel Oar_v4.0 assembly and annotations found that about 2/3 (15,888) of the annotated genes were matched with a transcript from at least one tissue, and only about 1/3 (6,334) were matched with a transcript from all 5 sequenced tissues, suggesting that each additional tissue samples additional genes. The microRNA species sequenced matched annotated sheep, cow and goat microRNAs and also included novel predicted microRNAs in each of the sequenced tissues. Deep mRNA sequencing using Illumina technology on nine tissues has yet to be analyzed. Further evaluation of the different transcripts found in each tissue using the Rambouillet genome developed here is ongoing.

Table 1. Products of this Award

Original
  Improved Texel reference genome                                    GCA_000298735.2
  Improved Hereford reference genome                                 GCA_000003205.6
  PacBio Texel data                                                  SRX1212984
  PacBio Hereford data                                               SRP061777
Additional
  Highest quality, de novo long read Rambouillet reference genome    PEKD00000000
  Highest quality, de novo long read Hereford reference genome       Assembly by USDA MARC
  PacBio Rambouillet data                                            SRA In Progress[6]
  Illumina Rambouillet data                                          SRA In Progress[6]
  Hi-C Illumina Rambouillet data                                     SRA In Progress[6]
  Illumina miRNA for FAANG                                           Analysis in progress

References
1. English AC, et al. PLoS ONE 2012, 7:e47768.
2. Elsik CG, et al. Science 2009, 324:522-528.
3. Jiang Y, et al. Science 2014, 344:1168-1173.
4. English AC, et al. BMC Genomics 2015, 16:286.
5. Zimin AV, et al. Genome Biol 2009, 10:R42.
6. https://www.hgsc.bcm.edu/other-mammals/sheep-genome-project

Publications

  • Type: Conference Papers and Presentations Status: Other Year Published: 2017 Citation:
    21. Yue Liu, Shwetha C. Murali, R. Alan Harris, Adam C. English, Xiang Qin, Evette Skinner, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, James Kijas, Noelle E. Cockett, Eric Boerwinkle, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Sheep Reference Genome Sequence Updates: Texel Improvements and Rambouillet Progress. 35th International Society for Animal Genetics, Salt Lake City, Utah, July 23-27, 2016.
    22. Yue Liu, R. Alan Harris, Xiang Qin, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, Stephen White, Brenda Murdoch, James Kijas, Noelle E. Cockett, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Rambouillet Sheep Genome Resources. Cattle / Sheep / Goat 1 Workshop, International Plant and Animal Genome XXV Conference, January 13-18, 2017, San Diego, California.
    23. Yue Liu, R. Alan Harris, Xiang Qin, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, Stephen White, Brenda Murdoch, James Kijas, Noelle E. Cockett, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Rambouillet Sheep Genome Resources. International Sheep Genomics Workshop, International Plant and Animal Genome XXV Conference, January 13-18, 2017, San Diego, California.
    24. S. Koren, A. M. Phillippy, D. M. Bickhart, A. L. Archibald, J. F. Medrano, M. Watson, A. Warr, A. V. Zimin, R. J. Hall, C.-S. Chin, R. E. Green, N. J. Putnam, E. Tseng, B. D. Rosen, Y. Liu, S. C. Murali, K. C. Worley, B. L. Sayre, A. R. Hastie, S. Chan, J. Lee, E. T. Lam, I. Liachko, S. T. Sullivan, J. N. Burton, H. J. Huson, J. L. Hutchinson, Y. Zhou, J. Sun, A. Crisa, F. A. Ponce de Leon, J. C. Schwartz, J. A. Hammond, G. C. Waldbeiser, S. G. Schroeder, G. E. Liu, M. J. Dunham, J. Shendure, T. S. Sonstegard, C. P. Van Tassell, D. J. Nonneman, G. A. Rohrer, S. J. Schultheiss, C. Dreischer, T. P.L. Smith. Combinations of long read-based contigs, optical maps and DNA crosslinking-based scaffolding produce high quality assemblies of livestock genomes. Advances in Genome Biology and Technology, Hollywood, Florida, February 13-17, 2017.
    25. Yue Liu, R. Alan Harris, Xiang Qin, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, Stephen White, Brenda Murdoch, James Kijas, Noelle E. Cockett, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Rambouillet Sheep Genome Resources. Biology of Genomes, May 9-13, 2017, Cold Spring Harbor, New York.
    26. Yue Liu, R. Alan Harris, Xiang Qin, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, Stephen White, Brenda Murdoch, James Kijas, Noelle E. Cockett, Donna M. Muzny, Richard, Kim C. Worley. Rambouillet Sheep Genome Resources. Second Southeast Texas Evolutionary Genetics and Genomics, June 2, 2017, Galveston, Texas.
    27. Yue Liu, R. A Harris, Xiang Qin, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min Wang, Qingchang Meng, Mike P Heaton, Timothy P Smith, Brian P Dalrymple, Stephen N White, Brenda Murdoch, James Kijas, Noelle E Cockett, Donna M Muzny, Kim C Worley. Rambouillet Sheep Genome and FAANG RNA Resources. 36th Conference of the International Society of Animal Genetics, July 16-21, 2017, Dublin, Ireland.


Progress 09/01/15 to 08/31/16

Outputs
Target Audience: The target audience for this project is researchers who rely upon the reference genomes to study the genetics of sheep and cattle. This includes breeders who use genotyping information as well as the scientific community working to understand the biology of these animals. Changes/Problems: Reduction in the cost of PacBio sequence has allowed the generation of more sequence than initially budgeted. This will result in 4 improved genome references: improved genomes with gaps filled using PBJelly methods for the Texel sheep and the cow, and new de novo assemblies of long-read PacBio data for the Rambouillet sheep and the cow. In collaboration with other projects, 101 tissues were collected from the Rambouillet sheep reference animal. Transcript sequencing using PacBio IsoSeq, Illumina mRNAseq and Illumina miRNAseq is planned. One tissue will be sequenced using all three methods, and 4 additional tissues will be sequenced with the Illumina methods to generate evidence for annotation of the genome. Other tissues will be available for future additional sequencing. What opportunities for training and professional development has the project provided? No students are funded as a part of this project. Young professionals, including a recent graduate, are learning techniques for genome assembly improvement and software development with these data. How have the results been disseminated to communities of interest? Results from this project are publicly available in the international sequence databases (GenBank, EMBL, DDBJ) and have been presented at scientific meetings (see products list and citation list). What do you plan to do during the next reporting period to accomplish the goals? During the remainder of year 3 and the 12-month no-cost extension, we will complete the Rambouillet genome assembly with scaffolding, quality evaluations and submission to GenBank.
In addition, transcriptome sequencing of the FAANG tissues is planned to provide gene model evidence for the genome annotation. A number of these tissues (10 to 15) will be sequenced with the remaining funds using the PacBio IsoSeq protocol for full-length mRNA sequencing. These data will improve the annotation by providing evidence of particular transcript isoforms. Other Illumina sequencing is also planned in complementary efforts to sequence miRNAs and to deeply sequence mRNAs from ~60 tissues. These data will also be available for genome annotation. Using sequence from one animal for both the genomic DNA and the RNA evidence for annotation will minimize the number of sequence differences between the two data types and provide better evidence to annotate different transcripts. The long-read data allow multiple exons to be strung together into transcript variants, while the Illumina data allow sampling of lower abundance transcripts. Together, we expect these data to provide annotations that are among the highest quality available, making the combined genome and annotation resource most useful for the research community. We will complete the PacBio IsoSeq data generation, primary analysis (assembly of transcripts) and submission to the NCBI sequence read archive. This will bring together data to support a high quality annotation through the NCBI RefSeq and Ensembl annotation processes.

Impacts
What was accomplished under these goals? The improved genome assemblies for the Texel sheep and the Hereford cow have been submitted to GenBank and are publicly available under accessions AMGL00000000.2 and AAFC00000000.4, respectively. The raw data are available in the NCBI sequence read archive under accessions SRX1212984 for the Texel sheep and SRX1123975 to SRX1123979 for the Hereford cow. The initial goals for the project were revised in year two, with plans to produce a full de novo assembly using ~50x Pacific Biosciences (PacBio) long-read data from a single Rambouillet individual for the sheep and to collaborate with Tim Smith and Juan Medrano to assemble de novo ~50x PacBio data for the cow. In addition, we added plans to collect samples from the sheep Rambouillet reference animal for FAANG assays to provide very high quality annotation of the reference genome, with the transcriptome sequencing of some of these samples also planned. Progress in the third year toward the initial and revised goals is discussed below. The improved sheep reference genome for the Texel breed is in GenBank under accession AMGL00000000.2. The assembly of these data improved the genome contiguity twelve-fold, from 42 kb to 501 kb contig N50. With the recent decrease in sequencing costs, it became possible to produce enough data to generate a de novo genome assembly of a single individual. A Rambouillet ewe was selected for sequencing and for other types of genomic analysis in the FAANG project. This animal, USMARC ID 200935900 ("Benz 2616" born 9/23/2009), donated blood for the sequencing, which was collected by Tim Smith. Data (Table 1) were generated during year two: 21.7 million reads were produced with a mean subread length of 9.1 kb, and half of the data (5.7 million reads) came from subreads of at least 12.9 kb (the read N50). The longest subread was 68 kb and the total amount of data was 198 Gb, or ~66-fold coverage of a 3 Gb genome.
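The yield, coverage and fold-improvement figures quoted above can be reproduced with simple arithmetic. The sketch below uses only the rounded values reported in this section; the function names are hypothetical and chosen for illustration. Because the inputs are rounded, the estimated yield comes out slightly under the reported 198 Gb.

```python
def estimated_yield(read_count, mean_read_length):
    """Approximate total bases as number of reads times mean read length."""
    return read_count * mean_read_length


def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: total bases divided by genome size."""
    return total_bases / genome_size


def fold_change(before, after):
    """Ratio of a metric after improvement to its value before."""
    return after / before


if __name__ == "__main__":
    # Values as reported in the text (rounded).
    reads = 21.7e6          # subreads
    mean_length = 9.1e3     # bases
    reported_total = 198e9  # bases
    genome = 3e9            # bases, approximate mammalian genome size

    print(f"estimated yield: {estimated_yield(reads, mean_length) / 1e9:.1f} Gb")
    print(f"coverage: {fold_coverage(reported_total, genome):.0f}x")
    print(f"contig N50 gain: {fold_change(42e3, 501e3):.1f}-fold")
```

The small gap between the estimated and reported yield reflects rounding of the read count and mean length, not a discrepancy in the data.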
The high error rate in the PacBio data is addressed by using the shorter reads to correct errors in the longer reads prior to or during the assembly process. We have used both the Falcon and Celera assembly and error correction methods and have generated de novo assemblies with contiguity of 3.6 Mb and 2.1 Mb contig N50. These two assemblies were compared to contigs from an Illumina assembly of data from the same animal to assess base error rates and compared to EST data to assess completeness of the transcript representation. Based upon these metrics, the Celera contigs were chosen as the basis for further improvements, with polishing using Quiver, and further gap filling using the contigs from the Falcon assembly. The revised assembly will be scaffolded with the Lachesis method using Hi-C proximity ligation sequence that we generated from the same animal, followed by additional base quality polishing using Pilon and the Illumina data. This assembly is planned for public release before the end of the year.

Table 1. Primary Assembly Metrics

                 Falcon Assembly    Celera Assembly
Contig N50       3,612,954          2,156,026
Number to N50    227                365
Total Length     2,645,357,239      2,848,742,196
Mean             421,235            272,842
Max              14,866,463         16,257,484
Total Number     6,280              10,441

Samples were collected from 101 tissues from the Rambouillet reference animal with the intent of using those samples for FAANG assays to enhance the annotation of the sheep genomes. Additional sequencing of these samples is planned for the 12-month no-cost extension period and discussed in the plans below.

Publications

  • Type: Conference Papers and Presentations Status: Other Year Published: 2016 Citation: Yue Liu, Shwetha C. Murali, R. Alan Harris, Adam C. English, Xiang Qin, Evette Skinner, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, James Kijas, Noelle E. Cockett, Eric Boerwinkle, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Reference Genome Sequence Updates: Texel Improvements and Rambouillet Progress. International Plant and Animal Genome XXIV Conference, January 9-13, 2016, San Diego, California.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2016 Citation: Yue Liu, Shwetha C. Murali, R. Alan Harris, Adam C. English, Xiang Qin, Evette Skinner, Mike Heaton, Timothy Smith, Brian Dalrymple, James Kijas, Noelle E. Cockett, Eric Boerwinkle, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Sheep Reference Genome Sequence Updates: Texel Improvements and Rambouillet Progress. Biology of Genomes, May 10-14, 2016, Cold Spring Harbor, New York.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2016 Citation: Yue Liu, Shwetha C. Murali, R. Alan Harris, Adam C. English, Xiang Qin, Evette Skinner, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, James Kijas, Noelle E. Cockett, Eric Boerwinkle, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Sheep Reference Genome Sequence Updates: Texel Improvements and Rambouillet Progress. Southeast Texas Evolutionary Genetics and Genomics, June 3, 2016, Houston, Texas.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2016 Citation: Yue Liu, Shwetha C. Murali, R. Alan Harris, Adam C. English, Xiang Qin, Evette Skinner, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Qingchang Meng, Mike Heaton, Timothy Smith, Brian Dalrymple, James Kijas, Noelle E. Cockett, Eric Boerwinkle, Donna M. Muzny, Richard A. Gibbs and Kim C. Worley. Sheep Reference Genome Sequence Updates: Texel Improvements and Rambouillet Progress. 35th International Society for Animal Genetics, Salt Lake City, Utah, July 23-27, 2016.


Progress 09/01/14 to 08/31/15

Outputs
Target Audience: The target audience for this project is researchers who rely upon the reference genomes to study the genetics and genomics of sheep and cattle. This includes breeders who use genotyping information as well as the scientific community working to understand the biology of these and related animals. Changes/Problems: The reduced sequencing cost and improved assembly methods have enabled a full de novo assembly of a second sheep breed using the funds available here. The data have been generated for this assembly and the assembly is planned for early in the third funding period. What opportunities for training and professional development has the project provided? No students are funded as a part of this project. Young professionals, including a recent graduate, are learning techniques for genome assembly improvement and software development with these data. How have the results been disseminated to communities of interest? Results from this project have been submitted to GenBank and presented at scientific meetings (see citation list). The bovine sequence data are available in the NCBI SRA under accessions SRX1123975 to SRX1123979. The cow genome improvements using PacBio data and PBJelly software produced an improved genome assembly that has been submitted to GenBank and will be available under accession AAFC00000000.4. The Texel sheep genome assembly has been submitted to GenBank and will be available under accession AMGL00000000.2. What do you plan to do during the next reporting period to accomplish the goals? We will work with our newly generated PacBio data from the Rambouillet ewe to produce a de novo assembly from a single individual. We will apply the current best methods for error correction of PacBio reads using the PacBio data and de novo assembly using the Celera assembler. We will also work with collaborators to use the PacBio data sets generated here, at USDA MARC, and at UC Davis from the Hereford L1 Dominette cow to assemble the data de novo.
With our collaborators, we will evaluate the quality of these improved genome sequence assemblies, and work toward release to the public databases.

Impacts
What was accomplished under these goals? The improved genome assemblies for the Texel sheep and the Hereford cow have been submitted to GenBank and will be publicly available as soon as NCBI finishes processing the files. The initial goals for the project have been revised, with plans to produce a full de novo assembly using ~50x Pacific Biosciences (PacBio) long-read data from a single individual for the sheep and to collaborate with Tim Smith and Juan Medrano to assemble de novo ~50x PacBio data for the cow. Progress in the second year toward the initial goals and the revised goals is discussed below. The improved sheep reference genome for the Texel breed used frozen spleen tissue samples from animal USMARC ID 200118011 for DNA sequencing using the PacBio technology. The ~19x coverage data were used to improve the public Oar_v3.1 assembly using the PBJelly method (English et al., 2012). This genome assembly has been submitted to GenBank and will be available under accession AMGL00000000.2 when NCBI finishes processing the files. The sequence files have been submitted to the NCBI SRA and are currently being processed. The assembly of these data improved the genome contiguity from 42 kb to 501 kb contig N50. With the improvement in sequencing efficiency, it became possible to produce enough data to generate a de novo genome assembly of a single individual. A Rambouillet ewe was selected for sequencing with plans to use the same animal for other types of genomic analysis in the FAANG project. This animal, USMARC ID 200935900 ("Benz 2616" born 9/23/2009), donated blood for the sequencing, which was collected by Tim Smith. Data were generated over the second half of the year, with assembly planned for the beginning of the third year. The data are of high quality, with the latest data having a 20 kb average read length.
The Hereford cow genome sequence from the reference animal L1 Dominette 01449 used frozen lung tissue samples obtained from Tim Smith, Tara Marostica, and Lee Alexander. DNA was prepared from the sample, and the resulting libraries were sequenced with the PacBio technology to produce ~19x data. Applying the PBJelly software to these PacBio data produced an improved genome assembly that has been submitted to GenBank and will be available under accession AAFC00000000.4. The sequence data are available in the NCBI SRA under accessions SRX1123975 to SRX1123979.
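
For readers less familiar with the contig N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembled bases. The function name and the toy contig lengths are hypothetical and chosen only for illustration; they are not taken from the sheep or cow assemblies.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    # Walk the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running total to at least
        # half of the assembled bases defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly: four contigs separated by gaps.
before_gap_closing = [40, 30, 20, 10]
# Same bases after closing the gap between the two longest contigs.
after_gap_closing = [70, 20, 10]

n50_before = contig_n50(before_gap_closing)
n50_after = contig_n50(after_gap_closing)
```

In this toy case the total number of bases is unchanged, but joining two contigs raises the N50, which is the same kind of contiguity gain that gap closing is intended to produce.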

Publications

  • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: K. C. Worley. Improving the reference genomes for the sheep and the cow. Cattle/Sheep/Goat 1 Workshop, International Plant and Animal Genome XXIII Conference, January 10-14, 2015, San Diego, California.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: K. C. Worley. Improving the reference - Better genomes for the sheep and the cow. NRSP8 Workshop, International Plant and Animal Genome XXIII Conference, January 10-14, 2015, San Diego, California.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: K. C. Worley. Improvement to OARv3.1 using long read technology. ISGC Workshop, International Plant and Animal Genome XXIII Conference, January 10-14, 2015, San Diego, California.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: K. C. Worley. Reference Genome Assembly in 2015 - Methods for the 21st Century. The Resurgence of Reference Quality Genome Sequence, International Plant and Animal Genome XXIII Conference, January 10-14, 2015, San Diego, California.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: Yue Liu, Shwetha C. Murali, Daniel S. T. Hughes, Adam C. English, Xiang Qin, Yi Han, Vanesa Vee, Min (Mark) Wang, Eric Boerwinkle, Donna M. Muzny, Jeffrey Rogers, Stephen Richards, Kim C. Worley and Richard A. Gibbs. De novo Genome Assembly for the 21st Century. The Biology of Genomes, May 5-9, 2015, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2015 Citation: K. C. Worley, A. C. English, X. Qin, S. C. Murali, D. S. T. Hughes, Y. Han, V. Vee, M. Wang, T. Smith, J. E. Decker, B. Dalrymple, J. Kijas, N. E. Cockett, J. F. Taylor, J. F. Medrano, D. Schwartz, S. Zhou, D. M. Muzny and R. A. Gibbs. Improving the Reference through Long Read Technology - Better Genomes for the Sheep and the Cow. The Biology of Genomes, May 5-9, 2015, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.


Progress 09/01/13 to 08/31/14

Outputs
Target Audience: The target audience for this project is researchers who rely on the reference genomes to study the genetics of sheep and cattle. This includes breeders who use genotyping information as well as the scientific community working to understand the biology of these animals. Changes/Problems: Brian Dalrymple has been unable to contribute as much as planned to this effort due to the loss of personnel at CSIRO. Jared Decker lost his funding to improve the bovine genome sequence by reassembly of the existing Sanger data. Both of these setbacks have delayed the assessment of the existing genome sequences, but BCM personnel are working on this task. The reduced cost of PacBio sequencing has allowed the generation of more sequence than initially budgeted. Prioritizing additional work on other individuals or other species is a task for the coming year. What opportunities for training and professional development has the project provided? No students are funded as a part of this project. Young professionals, including a recent graduate, are learning techniques for genome assembly improvement and software development with these data. How have the results been disseminated to communities of interest? Results from this project have been presented at scientific meetings (see citation list). What do you plan to do during the next reporting period to accomplish the goals? We will continue to work with our collaborators and existing data sets to identify issues in the current sheep and cow reference genome sequences in order to address known flaws. We will apply the genome improvement methods appropriate to the available data, including the improved PBJelly methods with the PacBio sequence data that we have produced. With our collaborators, we will evaluate the quality of these improved genome sequence assemblies and work toward release to the public databases.

Impacts
What was accomplished under these goals? Goals for the first year of the project have been met or exceeded. Detailed progress for the year included the following. The sheep samples from animal USMARC ID 200118011, the source of the BAC library, were obtained from Brad Freking and Michael P. Heaton, and DNA was prepared from the frozen spleen tissue sample. DNA sequencing libraries for Pacific Biosciences (PacBio) genomic sequencing were also generated. The improved throughput and decreased cost of PacBio sequencing allowed the generation of ~19x PacBio sequence data for the sheep sample. Preliminary assembly of these data improved the genome contiguity from a contig N50 of 42 kb to 501 kb. Current efforts are focused on identifying and correcting known issues with the published genome sequence before applying the PBJelly methods and the PacBio data to produce a final improved assembly. Brian Dalrymple has identified two known issues: small (20 to 500 bp) artifactual tandem duplications, and multi-copy repeat families that carry generic consensus sequences rather than the sequence of the particular repeat at each location. BCM employees are working to identify and address these issues. New methods developed by Adam English at BCM, which use the PacBio data with his Honey tool [1] to identify structural variants, are being explored as a way to find and correct the small duplications within the genome sequence. The generation of the cow genome sequence, originally scheduled for completion at the end of November 2014, is ahead of schedule, having been completed in July 2014. The cow lung and liver samples from the reference animal L1 Dominette 01449 were obtained from Tim Smith, Tara Marostica, and Lee Alexander, and DNA was prepared from the frozen lung tissue sample. PacBio sequencing produced ~19x PacBio data. The cow genome improvements using PacBio data were intended to be applied following generation of an improved genome sequence by Jared Decker.
Unfortunately, this preliminary improvement has not been generated. BCM employees are working directly with David Schwartz and Shiguo Zhou to use the new optical map data and the two existing genome assemblies to identify the best current starting material for PacBio improvement efforts. We received the locations of inconsistencies between the optical map and each of the two existing genome assemblies (UMD 3.1 and Btau 4.6), reported in the coordinates of each genome sequence. The Schwartz laboratory is working to provide the locations of inconsistencies with either genome sequence in optical map coordinates, which will make it easier to identify consistent regions and regions with quality issues. Using Honey with the PacBio data to identify potential issues in the current genome is another approach that will be tried. In addition to tasks specifically targeted to the two genomes above, we have been working to identify issues with submitting PacBio read data and improved genome assemblies to the NCBI repositories. These and other improvements to the PBJelly methods that are being made as a part of this process will be applied to the sheep and the cow genomes.
[1] English, A. C., Salerno, W. J. & Reid, J. G. PBHoney: identifying genomic variants via long-read discordance and interrupted mapping. BMC Bioinformatics 15, 180, doi:10.1186/1471-2105-15-180 (2014).

Publications

  • Type: Journal Articles Status: Published Year Published: 2014 Citation: Y. Jiang, M. Xie, W. Chen, R. Talbot, J. F. Maddox, T. Faraut, C. Wu, D. M. Muzny, Y. Li, W. Zhang, J. Stanton, R. Brauning, W. C. Barris, T. Hourlier, B. L. Aken, S. M. J. Searle, D. L. Adelson, C. Bian, G. R. Cam, Y. Chen, S. Cheng, U. DeSilva, K. Dixen, Y. Dong, G. Fan, I. R. Franklin, S. Fu, R. Guan, M. A. Highland, M. E. Holder, G. Huang, S. N. Jhangiani, D. Kalra, C. L. Kovar, S. L. Lee, W. Liu, X. Liu, C. Lu, T. Lv, T. Mathew, S. McWilliam, S. Pan, D. Robelin, B. Servin, D. Townley, W. Wang, B. Wei, S. N. White, X. Yang, C. Ye, Y. Yue, P. Zeng, Q. Zhou, J. B. Hansen, K. Kristensen, R. A. Gibbs, P. Flicek, C. C. Warkup, H. E. Jones, V. H. Oddy, F. W. Nicholas, J. C. McEwan, J. Kijas, J. Wang, K. C. Worley, A. L. Archibald, N. Cockett, X. Xun, W. Wang, B. P. Dalrymple. The reference genome of sheep provides new insights into the evolution of ruminants. Science 2014 Jun 6;344(6188):1168-73. doi: 10.1126/science.1252806. PMID: 24904168.
  • Type: Other Status: Other Year Published: 2014 Citation: Kim C. Worley, Adam C. English, Xiang Qin, Shwetha C. Murali, Daniel S. T. Hughes, Stephen Richards, Jeffrey Rogers, Yi Han, Vanesa Vee, Min (Mark) Wang, Eric Boerwinkle, Donna M. Muzny and Richard A. Gibbs. Improving reference genomes for agriculturally important animals. NIFA Joint Animal Nutrition, Growth and Lactation; Feed Efficiency; and Animal Genomics Project Director Meeting, July 24-25, 2014, Kansas City, Kansas.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: J. Rogers, A. English, Y. Han, S. Richards, M. Raveendran, D. Hughes, V. Vee, M. Wang, D. Rio Deiros, Y. Liu, D. M. Muzny, K. C. Worley, and R. A. Gibbs. Upgrading large genomes using Pacific Biosciences long reads and PBJelly software. Sequencing, Finishing and Analysis in the Future, May 28-30, 2014, Santa Fe, New Mexico.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: Adam English, Will Salerno, Christine Beck, Oliver Hampton, Jeffrey Rogers, Yi Han, Vanesa Vee, Mark Wang, Donna M. Muzny, Jeffrey G. Reid, Kim C. Worley, and Richard A. Gibbs. PBHoney: Resolving Structural Variation Using Long-Read Sequencing. Pacific Biosciences: A SMRT® Sequencing approach to Reference Genomes, Annotation, and Haplotyping #2326 Workshop, PAG ASIA, May 19-21, 2014, Singapore, Singapore.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: A. English, S. Richards, J. Rogers, Y. Han, D. Hughes, D. Rio Deiros, V. Vee, M. Wang, D. M. Muzny, J. G. Reid, K. C. Worley, and R. A. Gibbs. Improving Genomes for Agricultural Species. PAG ASIA, May 19-21, 2014, Singapore, Singapore. Abstract 13768.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: Kim C. Worley, Adam English, Stephen Richards, Jeffrey Rogers, Yi Han, Daniel Hughes, David Rio Deiros, Vanesa Vee, Mark Wang, Eric Boerwinkle, Donna M. Muzny, Jeffrey G. Reid, and Richard A. Gibbs. Better genomes using long reads and PBJelly 2. The Biology of Genomes, May 6-10, 2014, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York. Abstract 353.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: K. C. Worley, Y. Jiang, M. Xie, W. Chen, A. L. Archibald, N. Cockett, X. Xun, W. Wang, J. C. McEwan, J. Kijas, and B. P. Dalrymple for the International Sheep Genome Consortium. Ruminating on the sheep genome: Evolution of digestion and lipid metabolism. The Biology of Genomes, May 6-10, 2014, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York. Abstract 354.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: W. Salerno, K. C. Worley, A. English, et al. Comprehensive identification of structural variants in a well-characterized personal human genome. The Biology of Genomes, May 6-10, 2014, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York. Abstract 283.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: K. C. Worley. Improving the sheep genome using long reads and PBJelly 2. ISGC Workshop, International Plant and Animal Genome XXII Conference, January 11-15, 2014, San Diego, California.
  • Type: Conference Papers and Presentations Status: Other Year Published: 2014 Citation: K. C. Worley. Improving genomes using long reads and PBJelly 2. Bioinformatics Workshop, International Plant and Animal Genome XXII Conference, January 11-15, 2014, San Diego, California. Abstract #9773.