Read alignment and visualization using JBrowse2

This tutorial shows how to go from a set of fastq files, perform alignment and then visualize the alignments relative to genome annotations using JBrowse2.

We start with activating our conda environment (adding channels in the following order):
conda config --env --add channels defaults
conda config --env --add channels bioconda
conda config --env --add channels conda-forge

conda activate paramecium   

Install dependencies

We need to install samtools.
conda install -c bioconda samtools
Then install JBrowse2:
conda install -c bioconda jbrowse2
If you have difficulties installing jbrowse2 from bioconda you can also install npm and then do it manually: https://jbrowse.org/jb2/docs/quickstart_web/ 
Install GenomeTools so we can add the genome annotation to JBrowse2.
conda install -c bioconda genometools-genometools

We also need to download the Paramecium reference genome annotations.
curl http://ftp.ensemblgenomes.org/pub/protists/release-55/gff3/paramecium_tetraurelia/Paramecium_tetraurelia.ASM16542v1.55.gff3.gz --output ASM16542v1.55.gff3.gz

Download paired-end fastq data from European Nucleotide Archive, first read 1:
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR182/001/ERR1827401/ERR1827401_1.fastq.gz --output read1.fastq.gz

Then read 2:
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR182/001/ERR1827401/ERR1827401_2.fastq.gz --output read2.fastq.gz

Perform paired-end alignment

Run alignment with bowtie2:
bowtie2 -x paramecium -1 read1.fastq.gz -2 read2.fastq.gz -S alignment.sam              

This will align using the paramecium index and perform paired-end alignment with read1 and read2 and then save the output into alignment.sam.

Figure 1. paired-end reads

Next, convert the SAM file to a binary BAM file:
samtools view -S -b alignment.sam > alignment.bam

Doing anything meaningful such as calling variants or visualizing alignments requires that the BAM is further manipulated. It must be sorted such that the alignments occur in “genome order”. That is, ordered positionally based upon their alignment coordinates on each chromosome.
samtools sort alignment.bam -o alignment.sorted.bam

Next, lets index the sorted alignment:
samtools index alignment.sorted.bam

Create a JBrowse2 local server

Make a new directory called dev:
mkdir dev
cd dev 

Then create JBrowse2 inside of this development directory:
jbrowse create jbrowse2
cd jbrowse2

Start the server:
npx serve .

Open your browser http://localhost:3000/ and you will see something like this:

Configuration not found. You may have arrived here if you requested a config that does not exist or you have not set up your JBrowse yet.

Sample JBrowse config:

  • Volvox sample data
No config.json found. If you want to learn how to complete your setup, visit our quick start guide.

To cancel the server running press CTRL + C while having the terminal window open.

Next lets create an index fasta file (.fai) using samtools. You should have your paramecium fasta (📄ASM16542v1.dna.toplevel.fa) file in the directory structure:

📂paramecium_playground
┣ 📄ASM16542v1.55.gff3.gz
┣ 📄ASM16542v1.dna.toplevel.fa
┣ 📜paramecium.1.bt2
┣ 📜paramecium.2.bt2
┣ 📜paramecium.3.bt2
┣ 📜paramecium.4.bt2
┣ 📜paramecium.rev.1.bt2
┣ 📜paramecium.rev.2.bt2
┗ 📂dev
     ┗ 📂jbrowse2

Since you are standing in the jbrowse2 directory and the genome (ASM16542v1.dna.toplevel.fa) is two levels up we can call it using ../../ASM16542v1.dna.toplevel.fa :

samtools faidx ../../ASM16542v1.dna.toplevel.fa
jbrowse add-assembly ../../ASM16542v1.dna.toplevel.fa --load copy --out .

Which will output:
Added assembly "ASM16542v1.dna.toplevel" to ./config.json

Next, we will add the BAM file from the alignment:
jbrowse add-track ../../alignment.sorted.bam --load copy --out .

Which will give the following output:
Added track with name "alignment.sorted" and trackId "alignment.sorted" to ./config.json

Add the genome annotation file by doing the following using GenomeTools (gt):
gt gff3 -sortlines -tidy -retainids ../../ASM16542v1.55.gff3.gz > ../../ASM16542v1.55.sorted.gff3
bgzip ../../ASM16542v1.55.sorted.gff3
tabix ../../ASM16542v1.55.sorted.gff3.gz
jbrowse add-track ../../ASM16542v1.55.sorted.gff3.gz --load copy
We should now see the output:
Added track with name "ASM16542v1.55.sorted.gff3" and trackId "ASM16542v1.55.sorted.gff3" to ./config.json
The final step of loading your JBrowse instance may include adding a "search index" so that you can search by genes or other features by their name or ID.
jbrowse text-index --out .

Lets now rerun the server now that both alignments and reference genome has been added:
npx serve .

Browse the alignments

If you now open your browser http://localhost:3000/ you will see JBrowse2 start screen, click Linear Genome View in the center:

Figure 2. JBrowse start screen.


Open the ASM16542v1.dna.toplevel genome. 

Figure3. Open the ASM16542v1.dna.toplevel genome.

Open track selector

Figure 4. Open the track selector.

Select to show the annotation track (gff3) and then the alignment track (.bai):

Figure 5. select to display the annotations from the GFF file.

Then lets check a specific gene by searching for it. Search for: GSPATT00027514001

Figure 6. select to display the alignments as well.
Figure 7. Search for gene GSPATT00027514001.

It will take some seconds for Jbrowse to search. But when it shows up as GSPATT00027514001-1 click on it.

FIGURE 8. When JBrowse2 has found a correct match click on it.

Zoom out a little bit so you can see the entire gene body.

Figure 9. Press Zoom out so you can see clearly the entire gene body
Figure 10. In the annotations Coding Sequences (CDS) are shown in ocre and UTRs are shown in teal.

We might ask what gene symbol the gene ID GSPATT00027514001 has? The problem is that the annotation GFF3 file provided by Ensembl does not contain any gene symbols for us to guess its function. So lets use PANTHER, http://www.pantherdb.org/, to search for the gene ID. PANTHER stands for Protein ANalysis THrough Evolutionary Relationships and its purpose is to provide a gene family phylogenetic trees of protein coding genes.

Figure. Select Genes and orthologs from PANTHER search menu. 

Then search for GSPATT00027514001:

Figure. Search for GSPATT00027514001 in Genes and orthologs

Which will display the following result and we can infact see that the gene ID is equivalent to the translation initiation factor EF1a:

 
Gene ID
 
Gene Name
Gene Symbol

Persistent id

Orthologs

PANTHER Family/Subfamily
PANTHER Protein Class
Species
1. PARTE|EnsemblGenome=GSPATT00027514001|UniProtKB=A0EIP4 S1-like domain-containing protein
GSPATT00027514001
PTN002729134
orthologs
EUKARYOTIC TRANSLATION INITIATION FACTOR 4C (PTHR21668:SF0) translation initiation factor Paramecium tetraurelia


We can also find the gene in the Uniprot database where it has the gene name A0EIP4. 

With the advent of Google DeepBrain's AlphaFold we can even see the predicted 3D protein structure:

Figure 13. A0EIP4 protein according to AlphaFold.