clustered instead using CAP3 [29]. These represented a further 1,968 TUs in addition to the 8,944 TUs that aligned to the gene models [8]. In total, we obtained 9,145 transcripts present more than once across different libraries and 3,225 single-copy transcripts, thereby comprising 12,370 TUs. The top 20 most abundant transcripts are represented by cDNAs ranging from 2,079 to 316 copies across all 16 libraries (Table 2). The most abundant transcript (G49202), with 2,079 copies, belongs to a P. tricornutum-specific gene family (family ID 4628) with nine members [8]. All nine encoded proteins contain predicted signal peptides, and transcripts for them were detected in one or more cDNA libraries. They do not show any homology with known proteins (e-value cutoff = 10^-5), with the exception of G49297, which shows some similarity to a bacterial protein containing a carbohydrate binding domain. When the above nine transcripts were subjected to PSI-Blast, we found a few transcripts showing low homology (e-value cutoff = 10^-3, iterations = 3) to murine-like glycoprotein most typically associated with animal viruses. Eight of the genes belonging to the above gene family are localized on chromosome 21. The absence of this gene family in T. pseudonana and its high level of expression across various cDNA libraries may indicate that it represents a P. tricornutum-specific expanded glycoprotein gene family.
Maheswari et al. Genome Biology 2010, 11:R85 http://genomebiology.com/2010/11/8/R
Figure 1 Transcript diversity across libraries. (a) Rarefaction curves of cDNAs sequenced from 16 different cDNA libraries. (b) Plot showing the Simpson’s diversity index across the 16 libraries. For two-letter library codes, see Table 1.
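As an illustration of the diversity measure plotted in Figure 1(b), the following is a minimal sketch of how a Simpson's diversity index could be computed from per-library transcript counts. The exact variant used for the figure is not stated in this section, so the sketch assumes the finite-sample form D = Σ n_i(n_i − 1) / (N(N − 1)) and reports 1 − D, so that higher values mean a more even library. The function name, library names and read assignments are hypothetical toy data, not values from the study.

```python
from collections import Counter


def simpson_diversity(counts):
    """Return 1 - D for a list of per-transcript cDNA counts.

    D = sum(n_i * (n_i - 1)) / (N * (N - 1)) is the probability that two
    cDNAs drawn without replacement belong to the same transcript.
    Assumption: the 1 - D form is used; the source does not specify it.
    """
    counts = [n for n in counts if n > 0]
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two sequenced cDNAs")
    d = sum(n * (n - 1) for n in counts) / (total * (total - 1))
    return 1.0 - d


# Hypothetical toy libraries: each entry is the TU assigned to one cDNA.
library_reads = {
    "lib_even": ["TU1", "TU2", "TU3", "TU4"] * 5,
    "lib_skewed": ["TU1"] * 17 + ["TU2", "TU3", "TU4"],
}

for name, reads in library_reads.items():
    tu_counts = Counter(reads).values()
    print(name, round(simpson_diversity(tu_counts), 3))
```

In this toy example the library dominated by a single transcript scores lower than the evenly sampled one, which is the kind of contrast a per-library plot of the index is meant to show.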
By comparing all of these highly expressed transcripts with those in 14 other eukaryotic genomes (see Materials and methods), we found that many are either present only in the two available diatom genomes or only in P. tricornutum (Table 2). Expression studies therefore represent a valuable resource for gene annotation in diatom and related genomes. Within the top 20 most abundant transcripts, some also encode highly conserved proteins such as glutamate dehydrogenase and glyceraldehyde-3-phosphate dehydrogenase, as well as others found in higher plants but not in animals (for example, ammonium transporter, light harvesting protein and alternative oxidase) (Table 2).
A range of different clustering and functional annotation methods was used to identify the libraries with similar gene expression patterns and to assess functional significance. We first performed hierarchical clustering [30] of the 9,145 transcripts expressed more than once, after normalizing transcript abundance in each individual library to library size. By this method we were able to identify libraries that share similar patterns of expression with reference to the presence or absence of a transcript and its relative abundance. Figure 2 shows the results visualized using ‘Java Treeview’ [31]. For example, from this analysis we see that libraries made from cells grown in chemostat cultures cluster together (NS, NR, C1 and C4). The oval morphotype (OM) and tropical accession (TA) libraries, which were derived from oval morphotypes grown at low salinity and low temperature, respectively, were also seen to cluster together. We classified transcripts into three categories: core transcripts (represented across all 16 eukaryotic genomes), diatom-specific transcripts (expanded in.