🤖  RNA-seq alignment using bowtie

This protocol outlines how to perform RNA alignment with bowtie using Paramecium as model organism.
If you want to align short reads (50bp or less), bowtie is more suitable than bowtie2 because of local alignment.

If you dont have a conda install it first.

Lets create a new environment using conda and activate it.
conda create --name paramecium
conda activate paramecium   

Next install bowtie using conda.
conda install -c bioconda bowtie

We are now ready to start downloading necessary files. So lets create a new working directory that we will name paramecium_playground:
mkdir paramecium_playground
cd paramecium_playground

Download reference genome (fasta) and annotation (gtf/gff3)

If you dont have curl on your computer make sure you first have it installed to download files.
sudo apt install curl

Current assembly of Paramecium tetraulia genome has the name ASM16542v1. Its 21 Mb in size compressed. Lets download it:
curl http://ftp.ensemblgenomes.org/pub/protists/release-55/fasta/paramecium_tetraurelia/dna/Paramecium_tetraurelia.ASM16542v1.dna.toplevel.fa.gz --output ASM16542v1.dna.toplevel.fa.gz

Lets also download the annotation file that describes all the annotated transcripts and gene features of the Paramecium genome:
curl http://ftp.ensemblgenomes.org/pub/protists/release-55/gff3/paramecium_tetraurelia/Paramecium_tetraurelia.ASM16542v1.55.gff3.gz --output ASM16542v1.55.gff3.gz 

Build an index

Unzip the fasta file: gzip -d ASM16542v1.dna.toplevel.fa.gz
Turn your fasta file of the genome into an index: 
bowtie-build ASM16542v1.dna.toplevel.fa paramecium

Which will create the following file structure with bowtie index files (.ebwt):
📂paramecium_playground
┣ 📄ASM16542v1.55.gff3.gz
┣ 📄ASM16542v1.dna.toplevel.fa
┣ 📜paramecium.1.ebwt
┣ 📜paramecium.2.ebwt
┣ 📜paramecium.3.ebwt
┣ 📜paramecium.4.ebwt
┣ 📜paramecium.rev.1.ebwt
┗ 📜paramecium.rev.2.ebwt

Test aligning a short sequence to the genome

The Paramecium genome uses Piwi-associated small RNAs that are generated upon the elimination of tens of thousands of short transposon-derived DNA to remove transposable elements in its genome. Allen et al. 2017 Cell found that they could detect concatenated excised elements by using Outward-directed primers on internal eliminated sequences (IESs).

Concatenation and circularization of excised DNA segments provides a template
for production of regulatory RNA. From: Allen et al. 2017 Cell

Lets use their primer sequences to find the crypticIES location in the reference genome:
crypIESa sequence (from Figure 1)
5’ TATAGGCATATAGAATAAGCAAATCTATA 3’

bowtie -c paramecium TATAGGCATATAGAATAAGCAAATCTATA

Which produces the following output:
0 + CT868031 179778 TATAGGCATATAGAATAAGCAAATCTATA IIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
# reads processed: 1
# reads with at least one reported alignment: 1 (100.00%)
# reads that failed to align: 0 (0.00%)
Reported 1 alignments to 1 output stream(s)

The default output of Bowtie reports the information with the following:

- 0 Name of read that aligned. 
- + Reference strand aligned to, + for forward strand, - for reverse 
- CT868031 Name of reference sequence (usually chromosome) where alignment occurs, or numeric ID if no name was provided 
- 179778 Zero-based offset into the forward reference strand where leftmost character of the alignment occurs
- TATAGGCATATAGAATAAGCAAATCTATA Read sequence (reverse-complemented if orientation is -).
- IIIIIIIIIIIIIIIIIIIIIIIIIIIII ASCII-encoded read qualities (reversed if orientation is -). The encoded quality values are on the Phred scale and the encoding is ASCII-offset by 33 (ASCII char !).
0 Comma-separated list of mismatch descriptors. If there are no mismatches in the alignment, this field is empty. A single descriptor has the format offset:reference-base>read-base. The offset is expressed as a 0-based offset from the high-quality (5’) end of the read.

Use some grep to count how many contigs we have in our genome

As you can see we have an alignment onto contig CT868031 so we might ask ourself how many contigs we have in the genome. Lets count them using grep applied to the original fasta file.
grep -c "^>" ASM16542v1.dna.toplevel.fa

Which gives the number 697 contigs in the genome.

Lets output all of their IDs into a text file list:
grep -o -E "^>\w+" ASM16542v1.dna.toplevel.fa | tr -d ">" > paramecium_contigIDs.txt

We can then display just the first 10 IDs in this list from the file using head.
head -n 10 paramecium_contigIDs.txt
Likewise we can display the last 10 entries in the file using tail:
tail -n 10 paramecium_contigIDs.txt