Development of EST (Expressed Sequence Tag) Database for Coffee, Studies of Genetic Variation and Anchoring to Other Plants

Sponsoring Institution

State Agricultural Experiment Station

Project Status

COMPLETE

Funding Source

STATE

Reporting Frequency

Annual

Accession No.

0190766

Grant No.

(N/A)

Cumulative Award Amt.

(N/A)

Proposal No.

(N/A)

Multistate No.

(N/A)

Project Start Date

Sep 1, 2001

Project End Date

Sep 30, 2009

Grant Year

(N/A)

Program Code

[(N/A)]- (N/A)

Recipient Organization
CORNELL UNIVERSITY
(N/A)
ITHACA,NY 14853

Performing Department
PLANT BREEDING

Non Technical Summary
It will be impractical to fully sequence the genomes of many plants in the near future. Therefore, it is critical that we create a common foundation of information and strategies by which we can compare genes and gene functions among plants. The goals of this project are to generate targeted gene (EST) databases for coffee, obtaining approximately 45,000 5' ESTs using cDNA libraries from multiple tissues.

Animal Health Component

10%

Research Effort Categories

Basic

60%

Applied

10%

Developmental

30%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	2232	1080	100%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms;

Subject Of Investigation
2232 - Coffee;

Field Of Science
1080 - Genetics;

Keywords

gene environment interaction

localization

Goals / Objectives
Coffee is a major commodity yet has received little or no attention with respect to genomics research. The opportunities for generating gene sequence information in this species are enormous. Should such sequence information be generated by another private entity and claimed as intellectual property, it could lock out Nestle (or any other interested corporation) from future derived benefits. Generating a baseline of gene sequence information in coffee would not only lead to many new gene discoveries, but would also connect this species to genomic research now being conducted on other plants (e.g. Arabidopsis, tomato, and maize) and would pave the way for future genetic improvement of these commodities via both classical and molecular techniques.

Project Methods
High-throughput single-pass sequencing can generate large numbers of expressed sequence tags (ESTs) from cDNA clones (Adams et al. 1995). Due to its relatively low cost, this approach has been widely accepted as the most efficient way to identify large numbers of genes from an organism. In addition to identifying genes, the EST strategy can identify the sets of genes that are expressed in a particular tissue, or under particular conditions, as well as providing a source of probes for mapping studies and microarray analysis (Adams et al. 1995). ESTs have even greater value when used for comparative studies. For example, a recent report used the genes for known Drosophila mutants to identify human ESTs, which have now become candidate genes for human diseases (Banfi et al. 1996). Given the value of ESTs, it is no surprise that they comprise over 70 percent of entries, and nearly 40 percent of the base pairs, in the dbEST division of GenBank (release 105). The latest version contains over 1,500,000 ESTs from 106 different species, and yet only 5 percent of these are from plant species. Of the 70,000 plant ESTs, nearly 90 percent are from Arabidopsis and rice (Newman et al. 1994) and no other plant species has more than 2,000 ESTs in the database. The situation is significantly different in the private sector. Several large U.S. companies have invested heavily in maize EST programs in the hope that these data will help identify genes for valuable agronomic traits and for intellectual property rights. These EST data are not generally available to the plant research community (Cohen, J. 1997).

Progress 10/01/07 to 09/30/08

Outputs
OUTPUTS: The goal is to weave a network of genome information, such that any sequence or genetic information from one species can be applied directly to the other via comparative genetic maps. To accomplish this, we have established or obtained genetic mapping populations for coffee, tomato, potato, eggplant, pepper and tobacco. All coffee sequences have been compared with sequences from tomato and other solanaceous species, in an effort to find orthologs. More than 2000 COS (conserved ortholog set) marker genes were thus identified and are now being mapped in coffee, tomato, potato, pepper and tobacco. So far, we have mapped more than 600 COSII markers in tomato, 500 in pepper, 450 in eggplant, 300 in coffee and 200 in tobacco. Coffee (a member of the family Rubiaceae with chromosome number x=11) and tomato (a member of the family Solanaceae with chromosome number x=12) are estimated to have shared a common ancestor approximately 85 MYA. The results indicate that each segment of the coffee genome corresponds to a single segment of the tomato genome (Crouzillat et al., in preparation). The same conclusion extends to the other taxonomically and phylogenetically related taxa (Wu et al. 2009a, 2009b in press). The changes (in order of frequency) are paracentric inversions, translocations and small insertions/duplications. The comparative maps for pepper and eggplant have been completed and the manuscripts accepted for publication. Nicotiana and coffee should be completed in the next 6-12 months.

PARTICIPANTS: Nothing significant to report during this reporting period.

TARGET AUDIENCES: Nothing significant to report during this reporting period.

PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
The comparative COSII maps (based on PCR markers) should provide useful tools for breeders and geneticists working on tomato, potato, eggplant, pepper, coffee and other related taxa. The combined results of these studies provide compelling evidence that: 1) coffee, tomato, pepper, eggplant, potato, tobacco (and probably all other solanaceous/rubiaceous species) share an ancestral karyotype whose genetic composition can be inferred from the resultant, extant genomes/comparative maps. 2) The ancestral species that gave rise to the core euasterid families Solanaceae, Convolvulaceae and Rubiaceae had a basic chromosome number of x=11 or 12. 3) No whole genome duplication event (i.e. polyploidization) occurred immediately prior to or after the radiation of either the family Solanaceae or Rubiaceae, as has been recently suggested.

Publications

Wang Y, Diehl A, Wu F, Vrebalov J, Giovannoni J, Siepel A, Tanksley S (2008) Sequencing and Comparative Analysis of a Conserved Syntenic Segment in the Solanaceae. Genetics 180: 391-408.
Liang H, Carlson J, Leebens-Mack J, Wall PK, Mueller L, Buzgo M, Landherr L, Hu Y, DiLoreto D, Ilut D, Fields D, Tanksley S, Ma H, dePamphilis C (2008) An EST database for Liriodendron tulipifera L. floral buds: the first EST resource for functional and comparative genomics in Liriodendron. Tree Genetics & Genomes, Volume 4:3.
Privat I, Foucrier S, Prins A, Epalle T, Eychenne M, Kandalaf L, Caille V, Lin C, Tanksley S, Foyer C, McCarthy J (2008) Differential regulation of grain sucrose accumulation and metabolism in Coffea arabica (Arabica) and Coffea canephora (Robusta) revealed through gene expression and enzyme activity analysis. The New Phytologist 178: 781-797.
Wu F, Eannetta N, Xu Y, Tanksley SD (2009) A detailed synteny map of the eggplant genome based on conserved ortholog set II (COSII) markers. Theor Appl Genet 1432-2242 (Online).

Progress 01/01/07 to 12/31/07

Outputs
Using the coffee EST database, and similar sequence databases from related solanaceous species (e.g. tomato, pepper, eggplant, tobacco, potato), we have endeavored to decipher the syntenic relationships amongst the genomes of these plant taxa. The goal is to weave a network of genome information, such that any sequence or genetic information from one species can be applied directly to the other via comparative genetic maps. To accomplish this, we have established or obtained genetic mapping populations for coffee, tomato, potato, eggplant, pepper and tobacco. All coffee sequences have been compared with sequences from tomato and other solanaceous species, in an effort to find orthologs. More than 2000 COS (conserved ortholog set) marker genes were thus identified and are now being mapped in coffee, tomato, potato, pepper and tobacco. So far, we have mapped more than 600 COSII markers in tomato, 200 in pepper, 150 in eggplant, 300 in coffee and 150 in tobacco. Coffee (a member of the family Rubiaceae with chromosome number x=11) and tomato (a member of the family Solanaceae with chromosome number x=12) are estimated to have shared a common ancestor approximately 85 MYA (Wikstrom et al. 2001). The results indicate that each segment of the coffee genome corresponds to a single segment of the tomato genome (Crouzillat et al., unpublished data). The same conclusion extends to the other taxonomically and phylogenetically related taxa. The changes (in order of frequency) are paracentric inversions, translocations and small insertions/duplications. We hope to complete these synteny maps for all of these species within the next 12-18 months. These results demonstrate that the COSII gene sets can be used for comparative genome mapping between plant families. Further, they provide insights into genome evolution in euasterids.
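
The ortholog search described above can be illustrated with a reciprocal-best-hit filter, a common first pass for pairing genes between two species. This is a minimal sketch only and is not the COSII procedure used in this project; the function names, gene names and similarity scores are all hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples, for example parsed
    from tabular similarity-search output (hypothetical input format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_gene, b_gene) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical scores for three coffee unigenes and three tomato unigenes.
coffee_vs_tomato = [
    ("coffee_1", "tomato_A", 310.0),
    ("coffee_1", "tomato_B", 120.0),
    ("coffee_2", "tomato_B", 250.0),
    ("coffee_3", "tomato_B", 90.0),
]
tomato_vs_coffee = [
    ("tomato_A", "coffee_1", 305.0),
    ("tomato_B", "coffee_2", 248.0),
    ("tomato_C", "coffee_3", 60.0),
]
pairs = reciprocal_best_hits(coffee_vs_tomato, tomato_vs_coffee)
```

In this toy input, coffee_3 is dropped because its best tomato hit prefers a different coffee gene, which is the behaviour one wants from a conservative single-copy ortholog filter.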

Impacts
The combined results of these studies provide compelling evidence that: 1) coffee, tomato, pepper, eggplant, potato, tobacco (and probably all other solanaceous/rubiaceous species) share an ancestral karyotype whose genetic composition can be inferred from the resultant, extant genomes/comparative maps. 2) The ancestral species that gave rise to the core euasterid families Solanaceae, Convolvulaceae and Rubiaceae had a basic chromosome number of x=11 or 12. 3) No whole genome duplication event (i.e. polyploidization) occurred immediately prior to or after the radiation of either the family Solanaceae or Rubiaceae, as has been recently suggested. These comparative synteny maps, combined with the 'Universal Primers' for these universal orthologs, should facilitate and expedite genetic and breeding research in all relevant taxa.

Publications

McCarthy, A., Biget, L., Lin, C., Petiard, V., Tanksley, S., and McCarthy, J. 2007. Expression, crystallization and preliminary X-ray analysis of the XMT and DXMT N-methyltransferases from Coffea canephora (robusta). Acta Crystallographica F63, 304-307.

Progress 01/01/06 to 12/31/06

Outputs
Sequencing and databasing of all coffee ESTs (mainly from cDNA libraries derived from mRNA from developing pods and seeds) has been completed. All coffee sequences have been compared with sequences from tomato and other solanaceous species, in an effort to find orthologs. More than 2000 COS (conserved ortholog set) marker genes were thus identified and are now being mapped in coffee, tomato, potato, pepper and tobacco. Most comparative mapping studies in plants have been restricted to species within the same plant family. Coffee (a member of the family Rubiaceae with chromosome number x=11) and tomato (a member of the family Solanaceae with chromosome number x=12) are estimated to have shared a common ancestor approximately 85 MYA (Wikstrom et al. 2001). To test the efficacy of COSII markers for comparative mapping across such large phylogenetic distances, a subset of COSII markers is being mapped in both tomato and diploid coffee (Coffea canephora). The results indicate that each segment of the coffee genome corresponds to a single segment of the tomato genome (Crouzillat et al., unpublished data). For example, the long arm of tomato chromosome 7, encompassing 14 COSII markers and 43 cM, corresponds to a 46 cM segment in coffee linkage group E in which gene order has been preserved. Likewise, the short arm of tomato chromosome 7, comprising 8 COSII markers and 28 cM, corresponds to a 30 cM segment of coffee chromosome F, although the two syntenic segments differ by at least 2 paracentric inversions. Thus far we have observed no cases in which single coffee chromosomes (defined by COSII markers) show a networked synteny with two corresponding tomato chromosomal pieces, or vice versa. Such would be the case if polyploidization had affected either the tomato or coffee lineage. These results demonstrate that the COSII gene sets can be used for comparative genome mapping between plant families. Further, they provide insights into genome evolution in euasterids.
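
The two checks described above (whether marker order is preserved within a syntenic segment, and whether segments pair one-to-one rather than in a networked pattern) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the analysis used in this project: the function names, segment labels and marker positions are hypothetical, and the classifier only distinguishes fully collinear, fully inverted and otherwise rearranged orders rather than counting individual inversions.

```python
def order_relationship(pairs):
    """Classify marker order between two maps.

    pairs: list of (position_in_map_a, position_in_map_b) in cM, one
    tuple per shared marker. Returns "collinear", "inverted" or
    "rearranged".
    """
    ordered = sorted(pairs)  # sort markers by their position in map A
    b_positions = [b for _, b in ordered]
    steps = list(zip(b_positions, b_positions[1:]))
    if all(x <= y for x, y in steps):
        return "collinear"
    if all(x >= y for x, y in steps):
        return "inverted"
    return "rearranged"


def one_to_one(segment_pairs):
    """True if every segment matches exactly one segment in the other map.

    segment_pairs: iterable of (segment_in_a, segment_in_b) labels. A
    polyploid lineage would show one label paired with two partners.
    """
    pairs = set(segment_pairs)
    a_sides = [a for a, _ in pairs]
    b_sides = [b for _, b in pairs]
    return len(a_sides) == len(set(a_sides)) and len(b_sides) == len(set(b_sides))


# Hypothetical marker positions (cM) in a tomato arm and a coffee segment.
preserved = [(0.0, 0.0), (10.0, 12.0), (43.0, 46.0)]
scrambled = [(0.0, 0.0), (10.0, 20.0), (20.0, 10.0), (28.0, 30.0)]
status_preserved = order_relationship(preserved)
status_scrambled = order_relationship(scrambled)
simple_pairing = one_to_one([("T7_long", "E"), ("T7_short", "F")])
networked_pairing = one_to_one([("T7_long", "E"), ("T7_long", "G")])
```

A segment flagged as "rearranged" would then be a candidate for closer inspection, for example to work out how many inversions separate the two marker orders.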

Impacts
The enabling power of this new ortholog resource was demonstrated in phylogenetic studies, as well as in comparative mapping across plant families - tomato (family Solanaceae) and coffee (family Rubiaceae) - relying for the first time on an incomplete EST-derived dataset. The 'Universal Ortholog Primers' thus generated have proved useful for mapping and genetic studies in the families Solanaceae, Rubiaceae and Convolvulaceae. The combined results of these studies provide compelling evidence that: 1) the ancestral species that gave rise to the core euasterid families Solanaceae, Convolvulaceae and Rubiaceae had a basic chromosome number of x=11 or 12. 2) No whole genome duplication event (i.e. polyploidization) occurred immediately prior to or after the radiation of either the family Solanaceae or Rubiaceae, as has been recently suggested. The 'Universal Primers' for these universal orthologs have been made freely available to the public through the Solanaceae Genome Network database (http://www.sgn.cornell.edu/).

Publications

Wu F, Mueller L, Crouzillat D, Petiard V. 2006. Bioinformatics and Phylogenetics to Identify Large Sets of Single Copy, Orthologous Genes (COSII) for Comparative, Evolutionary and Systematic Studies: A Test Case in the Euasterid Plant Clade. Genetics 174: 1407-1420.

Progress 01/01/05 to 12/31/05

Outputs
Twenty-seven tomato cDNA libraries, covering as many developmental stages/tissues/induction profiles as possible, have been sequenced, generating over 150,000 tomato ESTs (http://sgn.cornell.edu). These correspond to approx. 1/2 of the 38,000 estimated tomato genes and have been used to create a public microarray and expression database. We maintain these clones and distribute them to the research community via an SGN order page. Over 60,000 potato ESTs have also been generated with a heavy focus on tubers and tuberization, which are unique to potato. Also, approx. 5000 ESTs have been generated from petunia and eggplant. A set of approx. 1500 genes from the high-density tomato map were screened as overgo probes on the tomato HindIII BAC library. The goal was to anchor the existing fingerprint contig map and establish a set of seed BACs to initiate sequencing of each tomato chromosome. Over 1000 of the mapped overgo markers have been unambiguously anchored to the high-density map. A subset of BACs, anchored to each chromosome at regular intervals along the genetic map (referred to as seed BACs), are serving as starting points for initial sequencing by the 10 countries currently sequencing the tomato genome. Activities include verification of seed BAC integrity (via sequencing with primers designed to the same marker from which the corresponding overgo probe was designed) prior to delivery to our collaborators. Over 300,000 BAC end sequences have been generated from 75,000 HindIII, 50,000 MboI and 50,000 EcoRI clones. Additional end sequences will be generated for 5,000 MboI clones and 20,000 HindIII clones. At SGN, BAC end sequences have been base-called, quality and vector trimmed, analyzed for contamination, blasted against different sequence databases for annotation and classification into functional, gene and repeat categories, and loaded into a web-queryable database.
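
As an illustration of the quality-trimming step mentioned above, the sketch below removes low-quality bases from both ends of a read. It is a simplified example and not the SGN pipeline; the function name, the quality threshold of 20 and the example read are assumptions made for illustration.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Drop bases from both ends until a base with quality >= min_q is met.

    seq: read sequence as a string.
    quals: per-base quality scores (same length as seq).
    min_q: assumed threshold; the value used in practice may differ.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


# Hypothetical read with poor-quality bases at both ends.
trimmed_seq, trimmed_quals = trim_low_quality_ends(
    "ACGTACGT", [5, 10, 30, 35, 40, 30, 12, 8]
)
```

Vector trimming and contamination screening would follow as separate steps and are not shown here.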
Selected BAC clones anchored to orthologous map positions in the tomato, potato, eggplant, petunia and pepper genomes have been sequenced. The goal was to shed light on microsynteny and gene conservation across SOL genomes. The results indicate a nearly perfect conservation of gene content and gene order across the species examined, indicating that the tomato genome sequence will be a good predictor of gene content and order in all SOL species. Methods for BAC FISH analysis have been perfected, allowing accurate physical placement of BAC clones relative to euchromatin-heterochromatin boundaries and telomeres. We have localized 7 BAC inserts to unique chromosomal sites and estimated their distance from euchromatin/heterochromatin borders. Also, pachytene BAC FISH (as well as training) is provided to the international community sequencing the tomato genome, especially those with no FISH capabilities (UK, Italy, France). A centromere-specific sequence (from tomato) has also been isolated (Stack, unpublished data). This centromere-specific sequence is being used to precisely localize centromeres of all 12 chromosomes on the physical and genetic maps.

Impacts
Sequencing the tomato genome is the cornerstone of a larger international effort, the International Solanaceae Genome Initiative (SOL Initiative). The goal is to establish a network of information, resources and scientists to tackle two universal biological questions that the Solanaceae genomes are suited to address: 1) How can a common set of genes/proteins give rise to a wide range of morphologically and ecologically distinct organisms that occupy our planet? 2) How can a deeper understanding of the genetic basis of plant diversity be harnessed to better meet the needs of society in an environmentally friendly and sustainable manner? The tomato reference sequence will provide gene content and order, which is demonstrated to be similar to other Solanaceae genomes, making the phenotypic and evolutionary diversity in this family accessible for exploration at the sequence level. The tomato genome is connected to other important members of the family by detailed comparative genetic maps, and the level of microsynteny is known to be well conserved with respect to gene content and order. Because the Solanaceae represents a distinct and divergent clade of flowering plants, distant from Arabidopsis, Medicago, maize and rice, the tomato genome sequence provides a rich resource for investigating the forces of gene and genome evolution over long periods of time. The SOL Genome Network produces a web-based newsletter circulated to approx. 100 subscribers worldwide, incorporating news, availability of new tomato sequencing data, new bioinformatics tools, and other related information.

Publications

Lin C, Mueller L, McCarthy J, Crouzillat D, Petiard V, Tanksley S. 2005. Coffee and tomato share common gene repertoires as revealed by deep sequencing of seed and cherry transcripts. Theor Appl Genet 112:114-130
Wang Y, van der Hoeven R, Nielsen R, Mueller L, and Tanksley S. 2006. Characteristics of the Tomato Nuclear Genome as Determined by Sequencing Unmethylated DNA. Theor Appl Genet 112:72
Fei, Z., Tang, X., Alba, R., White J., Ronning, C., Martin, G., Tanksley, S. and Giovannoni, J. 2004. Comprehensive EST analysis of tomato and comparative genomics of fruit ripening. Plant Journal. 40:47-59.
Frary A, Xu Y, Liu J, Mitchell S, Tedeschi E, Tanksley S. 2005. Development of a set of PCR-based anchor markers encompassing the tomato genome and evaluation of their usefulness for genetics and breeding experiments. Theor Appl Genet 111:291-312.
Mueller LA, Tanksley SD, Giovannoni JJ, Van Eck J, Stack S, Choi D, Kim BD, Chen M, Cheng Z, Li C, Ling H, Xue Y, Seymour G, Bishop G, Bryan G, Sharma R, Khurana J, Tyagi A, Chattopadhyay, D, Singh NK, Stiekema W, Lindhout P, Jesse T, Lankhorst RK, Bouzayen M, Shibata D, Tabata S, Granell A, Botella MA, Giuliano G, Frusciante L, Causse M, and Zamir D. 2005. The Tomato Sequencing Project, the first cornerstone of the International Solanaceae Project (SOL). Comp Funct Genom, 6:153-158.

Progress 01/01/04 to 12/31/04

Outputs
Coffee EST database. An EST database has been generated for coffee based on sequences from approximately 47,000 cDNA clones derived from five different stages/tissues, with a special focus on developing seeds. When computationally assembled, these sequences correspond to 13,175 unigenes, which were analyzed with respect to functional annotation, expression and evolution. Compared with Arabidopsis, the coffee unigenes encode a higher proportion of proteins related to protein modification/turnover and metabolism, an observation that may explain the high diversity of metabolites found in coffee and related species. Among the most highly expressed genes detected are those encoding stage-specific 2S and 11S seed storage proteins. Several gene families were found to be either expanded or unique to coffee when compared with Arabidopsis. A high proportion of these families encode proteins assigned to functions related to disease resistance. Such families may have expanded and evolved rapidly under the intense pathogen pressure experienced by a tropical, perennial species like coffee. Unlike Arabidopsis, Solanaceae has a nearly perfect gene-for-gene match with coffee, suggesting that it will be a much better model for coffee genetics/molecular biology. These results are consistent with the fact that coffee and Solanaceae share very similar chromosome architecture and are closely related, both belonging to the Asterid I clade of dicot plant families.

Impacts
The coffee EST database represents a new public resource facilitating genomic, molecular and breeding research in coffee. Through in silico gene expression analysis, we have identified a number of highly expressed genes showing high specificity for different stages of seed development as well as for the pericarp tissue surrounding the seeds. Many of these genes are unique to coffee and/or the Asterid clade of higher plants. While the functions of most of these genes remain to be determined, the fact that they have been identified points to promoters, which can potentially be used to drive gene expression in specific stages/tissues of the coffee plant. Many of these genes are specific to defined periods of seed and/or pericarp development, which are critically important for insect/pathogen resistance and for determining the quality of the coffee bean with respect to commercial coffee products. Coffee, as a member of the family Rubiaceae, is distantly related to the model species Arabidopsis. A computational comparison of the coffee EST-derived unigene set with the sequence databases for Arabidopsis and tomato indicates that tomato (whose genome is currently being sequenced) is a better model for coffee than Arabidopsis, with most coffee genes having clear orthologs in the tomato genome. The results are consistent with coffee and tomato sharing very similar chromosome architecture and being closely related. Identifying orthologous genes between coffee and tomato facilitates the development of comparative maps for these species and the sharing of genomic and biological tools/discoveries, an outcome expediting research in both taxa.
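
In silico expression analysis of the kind mentioned above is commonly approximated by counting how many ESTs from each library fall into each unigene. The sketch below illustrates that idea only; the function names, unigene and library labels, and EST assignments are hypothetical, and it is not the procedure used for the coffee database.

```python
from collections import Counter, defaultdict


def est_counts(assignments):
    """Count ESTs per unigene per library.

    assignments: iterable of (est_id, unigene_id, library) tuples, as
    might be derived from an EST assembly (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for _est_id, unigene, library in assignments:
        counts[unigene][library] += 1
    return counts


def library_specificity(counts, unigene, library):
    """Fraction of a unigene's ESTs that come from the given library."""
    total = sum(counts[unigene].values())
    return counts[unigene][library] / total if total else 0.0


# Hypothetical assignments of six ESTs to two unigenes.
assignments = [
    ("est1", "U1", "seed_early"),
    ("est2", "U1", "seed_early"),
    ("est3", "U1", "seed_early"),
    ("est4", "U1", "pericarp"),
    ("est5", "U2", "pericarp"),
    ("est6", "U2", "pericarp"),
]
counts = est_counts(assignments)
u1_seed_fraction = library_specificity(counts, "U1", "seed_early")
```

Raw counts like these are usually normalized by the number of ESTs sequenced per library before libraries are compared; that step is omitted here for brevity.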

Publications

No publications reported this period

Progress 01/01/03 to 12/31/03

Outputs
We have generated 6 cDNA libraries for EST sequencing. So far, we have generated approximately 45,000 ESTs. These have been subjected to trimming, contig assembly and annotation. We are also using these sequences to develop COS (conserved ortholog set) markers for mapping of the coffee genome so that it can be compared with the genomes of solanaceous species. Mapping of these new COS markers has already begun in tomato.

Impacts
This work will help develop a genetic map and sequence database for coffee, one of the highest-value agricultural commodities. It will also tie together coffee and tomato (and other solanaceous species) into a common network of genomic resources, accelerating genetic/breeding research in all these species.

Publications

No publications reported this period

Progress 01/01/02 to 12/31/02

Outputs
We have generated 4 cDNA libraries for EST sequencing. So far, we have generated approximately 20,000 ESTs. These have been subjected to trimming, contig assembly and annotation. We are also using these sequences to develop COS (conserved ortholog set) markers for mapping of the coffee genome so that it can be compared with the genomes of solanaceous species.

Impacts
(N/A)

Publications

No publications reported this period