Parameters used in the Horse transcriptome analysis
Revision as of 03:24, 14 April 2012 by J
Alignment of 1.3 billion reads against the horse reference genome
We aligned all RNA-Seq raw sequences against the Ensembl horse genome (Release 62) using TopHat 1.2.0 with two options for paired-end sequences: --mate-inner-dist=200 and --allow-indels.
To increase the performance of the alignment process, we used cluster computers running SGE (Sun Grid Engine 6.2; http://wikis.sun.com/display/GridEngine/Home/), and all processes finished within one day on 96 CPU cores.
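As a rough illustration of how one such alignment job might be assembled and handed to SGE, the Python sketch below builds the command line for a single paired-end sample. Only the two TopHat options come from the text above. The index prefix, FASTQ names, output directory, thread count and the parallel environment name ("smp") are hypothetical placeholders and would differ on a real cluster.

```python
import shlex


def tophat_command(index_prefix, reads_1, reads_2, out_dir, threads=4):
    """Build a TopHat command line for one paired-end sample."""
    return [
        "tophat",
        "--mate-inner-dist=200",  # option stated in the text
        "--allow-indels",         # option stated in the text
        "-p", str(threads),       # assumption: threads per job
        "-o", out_dir,            # assumption: one output directory per sample
        index_prefix, reads_1, reads_2,
    ]


def sge_submit_command(job_name, slots, command, pe_name="smp"):
    """Wrap a command in a qsub call; pe_name is site-specific (assumed)."""
    return ["qsub", "-N", job_name, "-pe", pe_name, str(slots),
            "-cwd", "-b", "y"] + command


# Hypothetical sample; all file names are placeholders.
cmd = tophat_command("horse_index", "s01_1.fastq", "s01_2.fastq", "tophat_s01")
print(" ".join(shlex.quote(a) for a in sge_submit_command("tophat_s01", 4, cmd)))
```

Submitting one job per sample in this way would let the scheduler spread the samples across the available cores.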
The procedure for selecting the exons identified by Cufflinks without a gene model
A unigene sequence is generally expected to contain a sufficiently long translated region.
Assuming that most of the unigene sequences obtained in this study were not full-length cDNA sequences, we designed a state machine that selects a series of exons under two main conditions:
1) For unigene clusters containing one exon, unigene sequences in which 40% of the region was well translated were selected, and
2) For unigene clusters containing multiple exons, once a translated region was found in a given exon, it had to continue at least into the next exon.
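The two conditions above can be sketched as a small state machine in Python. This is an interpretation, not the original implementation: each exon is reduced to a hypothetical pair (exon length, translated length), the 40% rule is read as "at least 40%", and a translated region that first appears in the last exon is rejected because it has no following exon to continue into.

```python
def select_cluster(exons, min_single_fraction=0.4):
    """Decide whether a unigene cluster passes the exon selection rules.

    exons: list of (exon_length, translated_length) in transcript order.
    Both the data layout and the thresholds are assumptions.
    """
    # Condition 1: single-exon cluster, judged by translated fraction.
    if len(exons) == 1:
        length, translated = exons[0]
        return length > 0 and translated / length >= min_single_fraction

    # Condition 2: multi-exon cluster, translation must span two
    # consecutive exons.
    state = "SEARCH"
    for _length, translated in exons:
        if state == "SEARCH":
            if translated > 0:
                state = "IN_ORF"   # translated region found in this exon
        elif state == "IN_ORF":
            if translated > 0:
                return True        # it continues into the next exon
            state = "SEARCH"       # it stopped; keep looking downstream
    return False


print(select_cluster([(100, 45)]))
print(select_cluster([(100, 0), (50, 30), (80, 10)]))
```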
Identification of unigene clusters with the current gene model as well as de novo exon structures provided by Cufflinks
To identify novel genes not predicted by the gene prediction software, we used the results generated by Cufflinks without genome annotation data. From the results of the 24 samples, we clustered the novel genes by their genomic coordinates to define unigene clusters (UCs). The generated UCs were then passed through a filter that extracted the UCs overlapping the genome annotation. The expressed genes annotated by the pipeline and the filtered UCs were merged to form the final set of UCs.
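One simple way to cluster transcripts by genomic coordinates is to merge overlapping intervals on the same chromosome and strand. The Python sketch below shows that approach together with an overlap test against annotated genes. The tuple layout, the use of strand, and the "any overlap" criterion are assumptions; the text does not specify how clusters or overlaps were defined.

```python
from collections import defaultdict


def cluster_by_coordinates(transcripts):
    """Merge overlapping transcripts into clusters.

    transcripts: iterable of (chrom, strand, start, end), inclusive
    coordinates, pooled from all samples (assumed layout).
    """
    by_locus = defaultdict(list)
    for chrom, strand, start, end in transcripts:
        by_locus[(chrom, strand)].append((start, end))

    clusters = []
    for (chrom, strand), spans in sorted(by_locus.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:          # overlaps the current cluster
                cur_end = max(cur_end, end)
            else:                         # gap: close it and start a new one
                clusters.append((chrom, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        clusters.append((chrom, strand, cur_start, cur_end))
    return clusters


def overlaps_annotation(cluster, annotated_genes):
    """True if the cluster shares at least one base with an annotated gene."""
    chrom, strand, start, end = cluster
    return any(c == chrom and s == strand and g_start <= end and start <= g_end
               for c, s, g_start, g_end in annotated_genes)
```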
Assembly of the unmapped sequences with SOAPdenovo
From the BAM files generated by TopHat, a Perl script extracted two categories of raw reads:
1) read pairs in which neither paired-end sequence mapped to the reference genome,
2) read pairs in which one of the paired-end sequences did not map to the reference genome.
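The two categories can be told apart from the SAM/BAM flag field. The sketch below is written in Python rather than Perl and is not the original script. It assumes that each record carries the standard "read unmapped" (0x4) and "mate unmapped" (0x8) flag bits, which depends on how the aligner wrote its output.

```python
READ_UNMAPPED = 0x4   # SAM flag bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM flag bit: the mate is unmapped


def classify_pair(flag):
    """Classify a paired-end record by its SAM flag."""
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "both_unmapped"   # category 1
    if read_unmapped or mate_unmapped:
        return "one_unmapped"    # category 2
    return "both_mapped"         # not extracted


for flag in (77, 73, 99):
    print(flag, classify_pair(flag))
```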
To find an optimal k-mer value for these unmapped sequences, we tested values from k=17 to k=25; k=21 gave the best de novo assembly results (Supplementary Table S6). With k=21, we assembled the 24 sets of unmapped sequences.
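A k-mer sweep of this kind amounts to assembling once per candidate k and comparing a summary statistic across the runs. The text does not state which criterion was used to judge the "best" result, so the Python sketch below uses contig N50 purely as an assumed example. The input mapping from k to contig lengths is hypothetical and would come from the assembler's output.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def best_k(lengths_by_k):
    """Pick the k whose assembly has the largest N50 (assumed criterion).

    lengths_by_k: dict mapping each tested k to its list of contig lengths.
    """
    return max(lengths_by_k, key=lambda k: n50(lengths_by_k[k]))
```

Other criteria, such as total assembled length or the number of contigs above a length cutoff, could be substituted in `best_k` without changing the overall sweep.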