Data analysis 2: more sequence fu

In Data analysis & differential expression, we talked about using existing annotations, transferred from the primary gene models to your new gene models. What if you also want to annotate any new genes (genes that weren’t in the original annotation)?

This is much more technically challenging, and the pipeline doesn’t work entirely on the HPC just yet. Here are some tips to get started.

Running TransDecoder to turn transcripts into ORFs

Cufflinks produces transcripts, but many programs want protein sequences. TransDecoder will turn transcripts into (predicted) peptide sequences:

module load TransDecoder
TransDecoder.LongOrfs -t cuffmerge_all.fa

curl -O
python cuffmerge_all.fa.transdecoder_dir/longest_orfs.pep > cuffmerge_orfs.pep

The file ‘cuffmerge_orfs.pep’ now contains entries that look like this:

>TCONS_00000001|m.1 TCONS_00000001|g.1 type:complete len:173 gc:universal TCONS_00000001:696-1214(+)

That is to say, the entries are now protein sequences rather than DNA transcripts as in cuffmerge_all.fa.
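If you want to work with these ORFs programmatically, it can help to pull the header apart into named fields. Here is a minimal sketch in Python that parses a header shaped like the example above. The function name `parse_orf_header` and the field names are made up for illustration, and the pattern is inferred from that single example header, so other TransDecoder versions may need a different pattern.

```python
import re

# Pattern inferred from the one example header shown above (an assumption);
# headers written by other TransDecoder versions may be laid out differently.
HEADER_RE = re.compile(
    r"^>(?P<orf_id>\S+)\s+(?P<gene_id>\S+)\s+"
    r"type:(?P<orf_type>\S+)\s+len:(?P<length>\d+)\s+gc:(?P<genetic_code>\S+)\s+"
    r"(?P<transcript>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)\s*$"
)


def parse_orf_header(line):
    """Return a dict of header fields, or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("length", "start", "end"):
        fields[key] = int(fields[key])
    return fields


example = (
    ">TCONS_00000001|m.1 TCONS_00000001|g.1 type:complete len:173 "
    "gc:universal TCONS_00000001:696-1214(+)"
)
record = parse_orf_header(example)
print(record["transcript"], record["orf_type"], record["length"])
# prints: TCONS_00000001 complete 173
```

In this example the coordinates 696-1214 span 519 bases, which is 3 x 173, so the `len` field lines up with the number of codons in the ORF.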

These can now be fed into InterProScan.

Running InterProScan (iprscan)

InterProScan searches your sequences against a number of protein signature databases and integrates the matches into a single annotation of your sequences.

module load iprscan
# -i cuffmerge_orfs.pep -f tsv

(The last command doesn’t work yet on the HPC!)

This will give you output in a tab-delimited format, cuffmerge_orfs.pep.tsv.

This can then potentially be used to annotate any new genes with GO terms and other putative functional annotation.
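As a starting point, here is a rough sketch in Python of collecting GO terms per ORF from the tab-delimited output. It assumes the usual InterProScan 5 TSV layout, with the protein ID in the first column and GO terms in the 14th, and that GO lookup was switched on for the run (the -goterms option), since otherwise that column is absent. The function name `collect_go_terms` and the sample rows are invented for illustration and are not real results.

```python
import io
import re
from collections import defaultdict

# GO identifiers look like "GO:" followed by seven digits.
GO_RE = re.compile(r"GO:\d{7}")

# Assumed 0-based column positions in the InterProScan TSV output.
PROTEIN_COL = 0
GO_COL = 13


def collect_go_terms(tsv_handle):
    """Map each protein ID to the set of GO terms found on its rows."""
    go_by_protein = defaultdict(set)
    for line in tsv_handle:
        columns = line.rstrip("\n").split("\t")
        if len(columns) <= GO_COL:
            continue  # this row has no GO column
        terms = GO_RE.findall(columns[GO_COL])
        if terms:
            go_by_protein[columns[PROTEIN_COL]].update(terms)
    return dict(go_by_protein)


# Invented sample rows (hypothetical values, not real InterProScan output).
sample = io.StringIO(
    "\t".join([
        "TCONS_00000001|m.1", "0" * 32, "173", "Pfam", "PF00069",
        "Protein kinase domain", "10", "160", "1.2E-30", "T", "01-01-2015",
        "IPR000719", "Protein kinase domain",
        "GO:0004672|GO:0005524|GO:0006468",
    ]) + "\n" +
    "\t".join([
        "TCONS_00000002|m.2", "1" * 32, "98", "Pfam", "PF00001",
        "Example domain", "5", "90", "3.4E-10", "T", "01-01-2015",
    ]) + "\n"
)

go_terms = collect_go_terms(sample)
for protein, terms in sorted(go_terms.items()):
    print(protein, ",".join(sorted(terms)))
# prints: TCONS_00000001|m.1 GO:0004672,GO:0005524,GO:0006468
```

On a real run you would pass an open file handle instead, e.g. `open("cuffmerge_orfs.pep.tsv")`. ORFs with no GO terms simply don’t show up in the result.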


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.