Genomics pipeline
Sequencing platform: Illumina and Oxford Nanopore
0. Introduction
In this tutorial you will learn how to perform a hybrid assembly of bacterial genomes. Hybrid assembly is the best strategy for resolving genomes with repetitive regions in particular, because they are difficult to solve just with the short and long reads coming from Illumina and Oxford Nanopore platforms, respectively.
Since this website covers all the steps needed to perform the processing of Illumina and Oxford Nanopore reads, we will just explain how to perform the hybrid assembly. Have a look at the sections Genomics Illumina and Genomics Nanopore to know how to perform all the analyses prior to the hybrid assembly.
0.1 Requirements
This pipeline has been set up on a Linux server (Ubuntu distribution), with 96 cores and 500 MB of RAM memory. You have to be careful and change some parameters according to the computational power of your machine.
Furthermore, here we assume that all the packages, programs, conda environments, and databases are already correctly installed. Otherwise, you must install or download all of them as the first step.
Some knowledge of bash is also desirable.
| Program | Version |
|---|---|
| Unicycler | v.0.5.1 |
| Quast | v.5.2.0 |
| Checkm2 | v.1.1.0 |
0.2 Some useful tricks
-Remember that if you want to create or modify an script, you can write in the terminal:
nano name_of_the_script.sh (or whichever editor you use)
This command will open the nano text-editor, where you can write or paste your script. There are some keyboard shortcuts to manage the text editor.
If the script is written in bash, then you have to write this in the terminal to execute the script:
bash name_of_the_script.sh
-Preferably work with screens. Some steps are very time-consuming, so it is necessary to run them on a screen. Hence, you can leave the server working when you go home to rest, or you can use the server to perform other analyses meanwhile. Working with screen is mandatory if you have internet connection problems.
To create a screen, just run this command: screen -S name_of_screen. Once created, you can detach it and continue working on your server with this keyboard shortcut: ctr + a (and then) d
0.3 Initial data
To perform the hybrid assembly, we first had to filter the sequences—removing adapters, low-quality reads, and so on from both the raw reads obtained by Illumina and Nanopore.
That is to say, we have to assembly the reads obtained after applying the tool TrimGalore or fastp (Illumina; see the section Genomics Illumina, and those reads obtained after applying the software Chopper (Nanopore; see the section Genomes Nanopore).
1. Hybrid assembly
We will assembly all of them by means of the tool Unicycler1. This software shines when Illumina reads and long reads from the same isolte have to be assembled.
Be careful with the names of the reads, especially if the reads obtained from Illumina and Nanopore do not have the same name, as in this example.
For illustrative purposes, in this tutorial we have used reads named “sample_Illumina” and “sample_Nanopore”. Yo will have to adapt the script according to your samples’ name.
#!/usr/bin/env bash
set -euo pipefail
ILLUMINA_DIR="/home/ana/hybrid_assembly/03.trimgalore_results" #path to the Illumina reads, previously trimmed
NANOPORE_DIR="/home/ana/hybrid_assembly/12.chopper_results" #path to the Nanopore reads, already trimmed
OUTPUT_DIR="/home/ana/hybrid_assembly/14b.unicycler_results" #output directory
THREADS=48 #adapt it according to your machine's power
mkdir -p "$OUTPUT_DIR" #create the output directory
echo "Looking for the samples in: $ILLUMINA_DIR..."
shopt -s nullglob
R1_FILES=("$ILLUMINA_DIR"/*_1_val_1.fq.gz) #adapt it according to the name of yur reads.
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "Not *_1_val_1.fq.gz files found"
exit 1
fi
#Loop over all the files
for r1 in "${R1_FILES[@]}"; do
r2=${r1/_1_val_1.fq.gz/_2_val_2.fq.gz}
illumina_name=$(basename "$r1" _1_val_1.fq.gz)#extract the name of the samples sequenced by Illumina
#remove the prefix "Illumina" (this is specific of this dataset, you can )
sample=${illumina_name%_Illumina}
nanopore_file="$NANOPORE_DIR/${sample}_Nanopore.fastq" #the same for Nanopore reads
echo "......Processing sample: $sample"
# Check whether F and R files are available
if [ ! -f "$r2" ]; then
echo "ERROR: R read is not found for sample: $sample"
continue
fi
if [ ! -f "$nanopore_file" ]; then
echo "WARNING: Nanopore reads for sample $sample are not found --> omitted."
continue
fi
#Run Unicycler:
unicycler \
-1 "$r1" \ #Illumina F reads
-2 "$r2" \ #Illumina R reads
-l "$nanopore_file" \ #Nanopore reads
-o "$OUTPUT_DIR/$sample" \ #name of the assembled reads
-t "$THREADS" \
--racon_path /home/ana/miniconda3/bin/racon #Unicycler calls Racon tool, so you have to write the path to Racon
# Validation
if [ -f "$OUTPUT_DIR/$sample/assembly.fasta" ]; then #change here according to the name of your files.
echo "Assembly finished for sample: $sample"
else
echo "The assembly failed for sample: $sample"
fi
done
echo "......Results saved in: $OUTPUT_DIR"These lines create a directory for each sample analyzed. Inside this folder you can see these outputs (among others):
-assembly.gfa: this is the final assembly in GFA format.
-assembly.fasta: the same but in fasta format.
-XXXX.spades_graph_kY.gfa: where XXX is a number and Y indicates the k-mer size analyzed. Non-altered graphs obtained for each k-mer size.
Now, we can work with the files assembly.fasta.
2. Check the quality of the assemblies
We have to check the quality of the assembled genomes, and we can do that in two different ways.
2.a) QUAST software
We can use a script written in Python to evaluate the quality of the assembly. Specifically, we can use the quast.py script, whose main usage is already described in this website2.
You can run the script described in the section 7.a) of this tutorial. Remember that you have to adapt the script according to the paths of the current folder, the name of the files, the extension of the fasta files etc.
2.b) CheckM2 software
Software CheckM23 is a machine learning-based tool that predicts the completeness and contamination of bacterial genomes or bins.
You can run the script described in the section 7.b) of this tutorial. Remember that you have to adapt the script according to the paths of the current folder, the name of the files, the extension of the fasta files etc.
Footnotes
Wick, R.R., Judd, L.M., Gorrie, C.L., Holt, K.E. (2017) Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology 13: e1005595. https://doi.org/10.1371/journal.pcbi.1005595↩︎
Mikheenko, A., Saveliev, V., Hirsch, P., Gurevich, A., (2023) WebQUAST: online evaluation of genome assemblies, Nucleic Acids Research, 51:W601–W606, https://doi.org/10.1093/nar/gkad406↩︎
Chklovski, A., Parks, D.H., Woodcroft, B.J. et al., (2023) CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 20, 1203–1212. https://doi.org/10.1038/s41592-023-01940-w↩︎