Building a new reference transcriptome

The distinguishing feature of (what I call) semi-model organisms is that while they may have a decent genome reference, their transcriptome annotation is poor. There can be several reasons for this, but generally it boils down to lack of resources and/or attention – it takes a lot of effort to build a high quality transcriptome!

For this purpose, we’ve already installed the chicken reference genome set on the HPC (as part of the data you loaded at the beginning). In this case we’ve loaded in the Illumina iGenomes project into the RNAseq-semimodel location.

See paper:

Map all the reads to the genome with TopHat

We’ll be using the TopHat software.

Do:

module load TopHat2/2.0.12

tophat -p 4 \
    -G  $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Annotation/Genes/genes.gtf \
    --transcriptome-index=$HOME/RNAseq-semimodel/tophat/transcriptome \
    -o tophat_all \
    $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
 female_repl1_R1.qc.fq.gz,male_repl1_R1.qc.fq.gz,female_repl2_R1.qc.fq.gz,male_repl2_R1.qc.fq.gz \
 female_repl1_R2.qc.fq.gz,male_repl1_R2.qc.fq.gz,female_repl2_R2.qc.fq.gz,male_repl2_R2.qc.fq.gz

Questions:

  • What are all these parameters?!
  • How do we pick the transcriptome/genome?
  • Why is it so slow?
  • What is different about mapping RNAseq reads vs mapping genomic reads?

Links:

Evaluating the mapping

Check out the details:

less tophat_all/align_summary.txt

Merge the new transcriptome with the existing reference transcriptome

We already have some decent gene models; let’s merge our new and the old ones:

ls -1 cuff_all/transcripts.gtf > cuff_list.txt

cuffmerge -g $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Annotation/Genes/genes.gtf \
    -o cuffmerge_all \
    -s $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
    cuff_list.txt

Do some cleanup:

curl -O http://2014-msu-rnaseq.readthedocs.org/en/latest/_static/remove-nostrand.py
python remove-nostrand.py cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf

Questions:

  • why do you want to merge?
  • why would you have a list of more than one thing in list.txt?
  • come to think of it, why aren’t you (re)mapping all your reads every time?
  • what’s with the ‘remove nostrand’ script?

Extracting your new transcriptome sequences

To get a look at the actual DNA sequences, do:

gffread -w cuffmerge_all.fa \
        -g $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
        cuffmerge_all/nostrand.gtf

Questions:

  • What’s the difference between a GTF file and the FA file?

Checking out your new transcriptome

Take a look at the top of your FASTA file:

head -30 cuffmerge_all.fa

Head on over to the chicken genome browser and try BLATing the sequence!

Next: Mapping reads to the transcriptome with TopHat


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.