Genomics pipeline
Sequencing platform: Illumina
0. Introduction
In this tutorial you will learn how to perform genomics analyses. We will go through all the necessary steps, from processing the raw sequencing reads produced by the Illumina platform to obtaining high-quality genomes.
0.1 Requirements
This pipeline was set up on a Linux server (Ubuntu distribution) with 96 cores and 500 GB of RAM. Be careful to adjust some parameters according to the computational power of your machine.
Furthermore, here we assume that all the packages, programs, conda environments, and databases are already correctly installed. Otherwise, you must install or download all of them as the first step.
Some knowledge of bash is also desirable.
| Program | Version |
|---|---|
| FastQC | v.0.12.1 |
| MultiQC | v.1.3 |
| TrimGalore | v.0.6.10 |
| fastp | v.0.22.0 |
| Bowtie2 | v.2.5.5 |
| SPAdes | v.3.13.1 |
| QUAST | v.5.2.0 |
| CheckM2 | v.1.1.0 |
0.2 Some useful tricks
-Remember that if you want to create or modify a script, you can type in the terminal:
nano name_of_the_script.sh (or whichever editor you use)
This command will open the nano text-editor, where you can write or paste your script. There are some keyboard shortcuts to manage the text editor.
If the script is written in bash, then you have to type this in the terminal to execute it:
bash name_of_the_script.sh
Alternatively, you can make the script executable once with chmod +x name_of_the_script.sh and then run it as ./name_of_the_script.sh.
-Preferably work inside screen sessions. Some steps are very time-consuming, so it is convenient to run them in a screen session. That way you can leave the server working when you go home to rest, or use the server for other analyses in the meantime. Working with screen is essential if your internet connection is unreliable.
To create a screen session, just run this command: screen -S name_of_screen. Once created, you can detach it and continue working on your server with this keyboard shortcut: Ctrl+a (and then) d. To list your open sessions, run screen -ls, and to reattach one, run screen -r name_of_screen.
1. Analyze the quality of the raw reads
Before starting to analyze the data, it is important to verify that the sequencing company has performed the sequencing properly. For that purpose, we will check several quality parameters. First, we will use the software FastQC [1]. Let's run these lines:
#!/usr/bin/env bash
set -euo pipefail #abort the script on errors, unset variables, and pipeline failures. We will use it throughout the whole protocol.
INPUT_DIR="/home/ana/hybrid_assembly/00.raw_data/Illumina" #input directory where raw reads given by the sequencing company are stored
OUTPUT_DIR="/home/ana/hybrid_assembly/01.fastqc_results" #output directory
THREADS=8 #change it according to your machine
mkdir -p "$OUTPUT_DIR" #create the output directory
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
FILES=("$INPUT_DIR"/*.fastq.gz) #change it according to the ending pattern of the names of your raw reads
# Check if there are files in your directory
if [ ${#FILES[@]} -eq 0 ]; then
echo "No .fastq.gz files found"
exit 1
fi
#Loop over all files in the specified directory
for fq in "${FILES[@]}"; do
sample=$(basename "$fq" .fastq.gz) #extract the name of the samples
echo "Processing sample: $sample"
fastqc \
-t "$THREADS" \
-o "$OUTPUT_DIR" \
"$fq"
done
echo "......Finished"
echo "......Results stored in: $OUTPUT_DIR"
This script creates, for each file, a quality report (name: samplename_fastqc.html, where samplename is the name of your sample/genome under study) and a compressed archive named samplename_fastqc.zip. If the sequencing followed a Paired-End strategy, you will have these two files for each of the Forward and Reverse (F and R from now on) files of every genome/sample; thus, four FastQC files per genome or sample.
If we download the samplename_fastqc.html report, we can see that it is a very detailed quality summary. We can check basic statistics (for instance, the total number of sequences, the size range, the GC percentage), as well as per base sequence quality, per sequence quality scores, and other data.
But if we have several different genomes, we may prefer a quick look at all of them in just one plot or table. For that purpose we will use the tool MultiQC [2].
This tool aggregates our data and summarizes it in clear plots. We can perform the aggregation by running these lines:
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="/home/ana/hybrid_assembly/01.fastqc_results" #Input directory where FastQC results are stored
OUTPUT_DIR="/home/ana/hybrid_assembly/02.multiqc_report" #output directory
mkdir -p "$OUTPUT_DIR" #create output directory
echo "Running MultiQC over FastQC results..."
multiqc \
"$INPUT_DIR" \
--outdir "$OUTPUT_DIR" \
--force #Overwrite any existing reports
echo "......MultiQC finished"
echo "......Report available in: $OUTPUT_DIR"
MultiQC offers many options and flags; it is worth having a look at the manual (multiqc --help).
As a result, these lines give us a directory and a file named multiqc_report.html. If we download it we will see all the details regarding the total number of unique and duplicated reads, sequence quality histograms, per sequence quality scores, per base sequence and GC content, per sequence N content, and much more details about all the F and R files.
2. Trimming of the reads
There are different tools for trimming raw reads. In this pipeline, we will introduce two of them. Both are efficient, so have a look at each script and, keeping your main objectives in mind, select one of them for your analyses. You can even compare the results produced by the two tools.
2.a) TrimGalore tool
Let’s remove short reads, reads with too many ambiguities, reads with low quality scores, etc. We will use the tool Trim Galore [3] for that purpose. It is a Perl script that has to be downloaded to your machine, and you should note its location or path. Have a look at these lines:
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="/home/ana/hybrid_assembly/00.raw_data/Illumina" #input file where Illumina raw reads are stored.
OUTPUT_DIR="/home/ana/hybrid_assembly/03.trimgalore_results" #output directory
JOBS=40 #number of jobs to run in parallel. Change this value according to your machine power
TRIMGALORE_DIR="/home/ana/Metagenomas_prueba_illumina/trim_galore" #path to the TrimGalore script. Change it accordingly.
mkdir -p "$OUTPUT_DIR" #create output directory
#Searching the files
echo "Looking for Paired-End files in: $INPUT_DIR..."
#Look for all F (R1) files
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
R1_FILES=("$INPUT_DIR"/*_1.fastq.gz) #change the ending pattern of the names according to your files
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "No files detected for *_1.fastq.gz"
exit 1
fi #it stops if no F (R1) files are found
#Run TrimGalore in parallel:
printf "%s\n" "${R1_FILES[@]}" | parallel -j "$JOBS" '
r1={}
r2=${r1/_1.fastq.gz/_2.fastq.gz} #the pattern of R (R2) files
sample=$(basename "$r1" _1.fastq.gz) #extract the sample name
echo "Processing sample: $sample"
#Flag summary (bash does not allow comments after a trailing backslash, so they are grouped here):
#--paired performs length trimming for PE files; --stringency 2 sets the minimum overlap
#between adapter and read (adjust it to your experiment; for example, here we reduce the
#stringency because we have 150bp-PE reads); --length 75 removes all reads shorter than
#75 bp (adapt it to your needs); --quality 20 sets the minimum Phred score; --max_n 2 sets
#the maximum number of ambiguities; --trim-n removes Ns from either side of the read.
perl "'"$TRIMGALORE_DIR"'" \
--paired \
--stringency 2 \
--length 75 \
--quality 20 \
--max_n 2 \
--trim-n \
--gzip \
--no_report_file \
--suppress_warn \
--output_dir "'"$OUTPUT_DIR"'" \
"$r1" "$r2"
'
#In this experiment, we do not know which adapters were used during sequencing, so we let the software auto-detect them. But if you are working, for instance, with Nextera adapters, you should add the flag "--nextera"
echo "......Trim Galore finished"
echo "......Results stored in: $OUTPUT_DIR"
At the time of writing this pipeline, the multithreading option of Trim Galore was not available.
Trim Galore gives users many options, and they have to adapt the analysis to their own dataset. So, we encourage having a look at its user guide. For instance, it is convenient to change the stringency parameter if the run was 2x150 bp.
This step is time-consuming; keep it in mind, and do not forget to create a screen session.
In our example, Trim Galore created files whose names end in 1_val_1.fq.gz and 2_val_2.fq.gz for F and R reads, respectively. Keep the ending pattern of your files’ names in mind; you will need it later.
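Since every later step assumes properly paired files, it can be worth a quick check that each forward file has its reverse mate. The helper below is a small sketch assuming the Trim Galore output directory and naming pattern used above:

```shell
#Check that every forward trimmed file has its reverse mate.
#Assumes the *_1_val_1.fq.gz / *_2_val_2.fq.gz naming pattern shown above.
check_pairs() {
    local r1 r2 missing=0
    for r1 in "$1"/*_1_val_1.fq.gz; do
        [ -f "$r1" ] || continue
        r2=${r1/_1_val_1.fq.gz/_2_val_2.fq.gz}
        if [ ! -f "$r2" ]; then
            echo "Missing mate for $(basename "$r1")"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "All pairs complete"
    fi
}
check_pairs /home/ana/hybrid_assembly/03.trimgalore_results
```

If any mate is missing, re-run Trim Galore for that sample before continuing.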
2.b) fastp tool
The software fastp [4] is a tool designed for ultrafast preprocessing of Illumina reads. It is very powerful: it cuts adapters, trims the poly-G regions commonly found in NovaSeq and NextSeq data, removes reads with low quality scores, and more. The program can also perform error correction. We suggest that you have a close look at the user guide.
Run these lines to do the trimming:
#!/usr/bin/env bash
set -euo pipefail
CONDA_ENV="fastp_env" #indicate the name of the environment where fastp is installed
INPUT_DIR="/home/ana/hybrid_assembly/00.raw_data/Illumina" #path to the folder containing the raw reads
OUTPUT_DIR="/home/ana/hybrid_assembly/04.fastp_results" #output directory
JOBS=16 #Number of jobs to run in parallel. Adapt it to your machine's power
#Activate conda environment
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$CONDA_ENV"
mkdir -p "$OUTPUT_DIR" #create output directory
#Look for F and R files
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
R1_FILES=("$INPUT_DIR"/*_1.fastq.gz) #adapt it according to the extension of your raw data
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "Files *_1.fastq.gz not found"
exit 1
fi
export OUTPUT_DIR
printf "%s\n" "${R1_FILES[@]}" | parallel -j "$JOBS" '
r1={}
r2=${r1/_1.fastq.gz/_2.fastq.gz}
sample=$(basename "$r1" _1.fastq.gz) #extract sample name
echo "Processing sample: $sample"
#Flag summary (bash does not allow comments after a trailing backslash, so they are grouped here):
#--in1/--in2 take the F and R input reads; --out1/--out2 name the trimmed output files (F and R);
#--unpaired1/--unpaired2 collect reads that pass the trimming while their mate does not;
#--detect_adapter_for_pe auto-detects the adapters (highly recommended to obtain high quality reads);
#--cut_mean_quality 20 sets the minimum mean quality in the sliding window; --cut_front and
#--cut_tail slide that window from 5' and from 3', respectively (read the documentation);
#--length_required 75 sets the minimum read length; --overrepresentation_analysis analyzes
#overrepresented sequences (adapters, Ns, etc.); --html and --json write the reports.
fastp \
--in1 "$r1" \
--in2 "$r2" \
--out1 "$OUTPUT_DIR/${sample}_R1.trimmed.fastq.gz" \
--out2 "$OUTPUT_DIR/${sample}_R2.trimmed.fastq.gz" \
--unpaired1 "$OUTPUT_DIR/${sample}_unpaired_R1.fastq.gz" \
--unpaired2 "$OUTPUT_DIR/${sample}_unpaired_R2.fastq.gz" \
--detect_adapter_for_pe \
--cut_mean_quality 20 \
--cut_front \
--cut_tail \
--length_required 75 \
--thread 1 \
--overrepresentation_analysis \
--html "$OUTPUT_DIR/${sample}.fastp.html" \
--json "$OUTPUT_DIR/${sample}.fastp.json"
'
echo "......fastp finished"
echo "......Results stored in: $OUTPUT_DIR"
These lines give us several files:
-samplename.fastp.html: an HTML report of the trimming. If you download it, you can see many details: the type of sequencing, mean read length before and after trimming, number of sequences, read quality before and after trimming, etc.
-samplename.fastp.json: the report in json format
-samplename_R1.trimmed.fastq.gz: trimmed files (F)
-samplename_R2.trimmed.fastq.gz: trimmed files (R)
-samplename_unpaired_R1.fastq.gz: F reads that pass the trimming but the corresponding R does not
-samplename_unpaired_R2.fastq.gz: R reads that pass the trimming but their F does not.
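The JSON reports can also be queried from the command line to get a quick per-sample summary of the filtering. The helper below is a sketch using only grep and sed; note the key-order assumption stated in the comments:

```shell
#Summarize read counts before and after filtering from the fastp JSON reports,
#using only grep/sed. ASSUMPTION: the report lists the summary's before_filtering
#block before the after_filtering block, as recent fastp versions do.
fastp_read_counts() {
    local report counts before after sample
    for report in "$1"/*.fastp.json; do
        [ -f "$report" ] || continue
        sample=$(basename "$report" .fastp.json)
        counts=$(grep -o '"total_reads": *[0-9][0-9]*' "$report" | grep -o '[0-9][0-9]*')
        before=$(echo "$counts" | sed -n 1p)
        after=$(echo "$counts" | sed -n 2p)
        echo "$sample: $before reads before trimming, $after after"
    done
}
fastp_read_counts /home/ana/hybrid_assembly/04.fastp_results
```

This gives a one-line overview per sample without opening each HTML report.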
These are two options for filtering sequences. For convenience, in this tutorial we will use 2.a) TrimGalore tool, but you can use either one and follow the rest of the tutorial. Just remember to adjust the file names to match the ones you’ve obtained.
If you have sequenced bacterial genomes using Illumina and Nanopore platforms and want to perform a hybrid assembly, now is the right time. In that case, go to Section X and do not continue with this pipeline.
EXTRA: removal of the phiX174 genome
During Illumina sequencing, it is common to spike in the phiX174 bacteriophage genome as an internal sequencing quality control. We can add a step to remove any potential contamination with this genome, using the Bowtie2 tool [5].
But first, we have to download the genome of the phage, for example:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz
Let’s unzip the file:
gunzip *fna.gz
Bowtie2 is an aligner that works with a special index database (see the corresponding manual), built from the genome(s) we want to compare against. So, let’s create it:
bowtie2-build --threads 48 GCF_000819615.1_ViralProj14015_genomic.fna phi_index
This produces six index files with the .bt2 extension. Now, let’s move all the index-related files to a new directory:
mkdir bowtie_index
mv phi_index* /home/ana/hybrid_assembly/bowtie_index/
And now it is time to remove the phiX174 genome:
#!/bin/bash
set -euo pipefail
THREADS=80 #change it according to your machine power
INPUT_DIR="/home/ana/hybrid_assembly/03.trimgalore_results" #path to the trimmed reads
OUTPUT_SAM_DIR="/home/ana/hybrid_assembly/20.sam_host" #directory where phiX174 reads will be stored (SAM format)
OUTPUT_FASTQ_DIR="/home/ana/hybrid_assembly/21.bowtie_noPhiX174_fastq" #path to the output files, without the phage's genome
BOWTIE_INDEX="/home/ana/hybrid_assembly/bowtie_index/phi_index" #path where bowtie2 index database is saved. Indicate it accordingly. We just indicate the prefix
mkdir -p "$OUTPUT_SAM_DIR" #create SAM directory
mkdir -p "$OUTPUT_FASTQ_DIR" #create output directory
shopt -s nullglob #make unmatched globs expand to nothing
#Loop over trimmed-reads files
for F1 in "$INPUT_DIR"/*_1_val_1.fq.gz; do #Be careful with the ending pattern of the files' names. Change it according to your files' names
# Extract the prefix of the samples
BASE=$(basename "$F1" "_1_val_1.fq.gz")
F2="$INPUT_DIR/${BASE}_2_val_2.fq.gz"
# Check that the R file exists
if [[ ! -f "$F2" ]]; then
echo "ERROR: missing pair for $BASE"
continue
fi
echo ">>> Cleaning host reads for sample $BASE"
#Align against the phage's genome
#-1 and -2 take the F and R files, respectively
bowtie2 -p $THREADS --sensitive-local \
-x "$BOWTIE_INDEX" \
-1 "$F1" -2 "$F2" \
-S "$OUTPUT_SAM_DIR/${BASE}.sam"
# Convert SAM to BAM
samtools view --threads $THREADS -bS "$OUTPUT_SAM_DIR/${BASE}.sam" > "$OUTPUT_SAM_DIR/${BASE}.bam"
rm "$OUTPUT_SAM_DIR/${BASE}.sam"
# Extract pairs of non-matching reads. These are the reads we are interested in
#visit samtool's manual to understand the meaning of all the flags
samtools view -b -f 12 -F 256 "$OUTPUT_SAM_DIR/${BASE}.bam" > "$OUTPUT_SAM_DIR/ext_${BASE}.bam"
rm "$OUTPUT_SAM_DIR/${BASE}.bam"
# Sort by name
samtools sort -n "$OUTPUT_SAM_DIR/ext_${BASE}.bam" > "$OUTPUT_SAM_DIR/sort_${BASE}.bam"
rm "$OUTPUT_SAM_DIR/ext_${BASE}.bam"
# Extract non-matching reads to fastq files
#visit bedtools's manual to understand the meaning of all the flags
bedtools bamtofastq \
-i "$OUTPUT_SAM_DIR/sort_${BASE}.bam" \
-fq "$OUTPUT_FASTQ_DIR/${BASE}_nohost_R1.fastq" \
-fq2 "$OUTPUT_FASTQ_DIR/${BASE}_nohost_R2.fastq"
echo ">>> Done sample $BASE"
done
We have included these lines of code as supplementary material, not as the main content of the tutorial. Therefore, if you use them, in the following steps you must adjust the names of the input files to match the names of the outputs generated here (generally, the previous lines of code generate files of the type samplename_Illumina_nohost_R1.fastq and samplename_Illumina_nohost_R2.fastq).
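As a quick sanity check, you can count how many read pairs survived the phiX174 removal. This is a sketch using the output directory from the script above and the fact that a FASTQ record spans four lines:

```shell
#Count the read pairs kept after phiX174 removal.
#A FASTQ record spans 4 lines, so reads = lines / 4.
count_nohost_pairs() {
    local fq
    for fq in "$1"/*_nohost_R1.fastq; do
        [ -f "$fq" ] || continue
        echo "$(basename "$fq" _nohost_R1.fastq): $(( $(wc -l < "$fq") / 4 )) read pairs retained"
    done
}
count_nohost_pairs /home/ana/hybrid_assembly/21.bowtie_noPhiX174_fastq
```

Comparing these counts with the trimmed-read counts tells you how much phiX174 contamination each sample carried.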
3. Assembly of the reads into genomes
One of the most widely used tools to assemble reads into contigs or scaffolds de novo is SPAdes [6]. In short, SPAdes is based on de Bruijn graphs.
As you know, SPAdes is a script that should be installed on your machine, so write down its path in order to use it later. Then, you just have to run these lines:
#!/usr/bin/env bash
set -euo pipefail
SPADES="/home/ana/miniconda3/bin/spades.py" #the path to the SPAdes script.
INPUT_DIR="/home/ana/hybrid_assembly/03.trimgalore_results/" #input directory where trimmed reads are stored
SPADES_DIR="/home/ana/hybrid_assembly/05.spades_trimgalore_results" #output directory
GENOMES_DIR="/home/ana/hybrid_assembly/06.assembled_genomes_trimgalore" #path to the already assembled genomes in fasta format
THREADS=16 #change it according to your machine
JOBS=4 #number of jobs to be run in parallel
mkdir -p "$SPADES_DIR" #create the output directory
mkdir -p "$GENOMES_DIR" #create the directory to store the genomes
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
R1_FILES=("$INPUT_DIR"/*_1_val_1.fq.gz) #indicate the name of the F files. Change it according to the ending pattern of the names of the trimmed files.
#Look for the files
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "Files *_1_val_1.fq.gz not found"
exit 1
fi
#export variables
export SPADES SPADES_DIR GENOMES_DIR THREADS
#Run SPAdes in parallel
printf "%s\n" "${R1_FILES[@]}" | parallel -j "$JOBS" '
r1={}
r2=${r1/_1_val_1.fq.gz/_2_val_2.fq.gz}
sample=$(basename "$r1" _1_val_1.fq.gz) #extract the name of the sample
echo "Assembling genome for sample: $sample"
#Flag summary (bash does not allow comments after a trailing backslash, so they are grouped here):
#--isolate is intended for high-coverage isolate (multicell) data; -1 and -2 take the
#F and R reads; -k is a comma-separated list of k-mer sizes.
"$SPADES" \
--isolate \
-1 "$r1" \
-2 "$r2" \
-k 55,75,97 \
-o "$SPADES_DIR/$sample" \
-t "$THREADS"
#copy the fasta files of the assembled genomes to another folder
if [ -f "$SPADES_DIR/$sample/contigs.fasta" ]; then
cp "$SPADES_DIR/$sample/contigs.fasta" "$GENOMES_DIR/${sample}.fna"
else
echo "WARNING: contigs.fasta not found for $sample"
fi
'
echo "......SPAdes assembly finished"
echo "......SPAdes outputs in $SPADES_DIR"
echo "......Final genomes in $GENOMES_DIR"
This script creates two directories: 05.spades_trimgalore_results and 06.assembled_genomes_trimgalore in this example. The first one contains one folder for each analyzed genome, which in turn contains these files and folders:
-KX, where X is the size of the k-mers. This folder contains the assembled contigs and the assembly graphs. There is a folder for each k-mer size indicated in the script.
-spades.log and params.txt: these files act as log files.
-contigs.fasta: fasta files of the contigs for the selected k-mer
-before_rr.fasta: fasta files of the contigs before resolving the repetitive regions.
-scaffolds.fasta: fasta files of the scaffold constructed based on the selected k-mer.
-assembly_graph.fastg, assembly_graph_with_scaffolds.gfa: assembly graphs. They can be visualized with the tool Bandage.
SPAdes creates also other files. If you want to have a close look, visit the developers’ website.
The other directory named 06.assembled_genomes_trimgalore contains the fasta file of each assembled genome. In this example, they have .fna extension, but you can change it.
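Before filtering, it can help to get a rough idea of each assembly directly from the FASTA files. The sketch below (assuming the directory and .fna extension from the example above) reports contig count and total assembly length per genome:

```shell
#Report contig count and total assembly length per genome, straight from the FASTA files.
assembly_stats() {
    local f
    for f in "$1"/*.fna; do
        [ -f "$f" ] || continue
        awk -v name="$(basename "$f" .fna)" '
            /^>/ { contigs++; next }   #each header line starts a new contig
            { bp += length($0) }       #sum the sequence-line lengths
            END { printf "%s: %d contigs, %d bp\n", name, contigs, bp }
        ' "$f"
    done
}
assembly_stats /home/ana/hybrid_assembly/06.assembled_genomes_trimgalore
```

An unusually high contig count or total length can flag a problematic assembly before you run QUAST or CheckM2.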
4. Contig filtering
By using the Illumina platform, we obtain short reads that result in the assembly of many contigs. Some of them are too short and have insufficient coverage, so it is often advisable to remove them. In this tutorial, we propose retaining only those contigs that are longer than 200 bp and have a coverage depth greater than 10. These parameters should be adjusted according to your results and the characteristics of the samples and sequencing. The script we propose here is based on the format of the names of the contigs. If we ran SPAdes to perform the assembly, the header of the fasta files will look like this:
head -n 2 06.assembled_genomes_trimgalore/Acibau_Illumina.fna
>NODE_1_length_237808_cov_34.133406
TAGCAAGCTACCTTCCCCCGCTCGACTTGCATGTGTTAAGCCTGCCGCCAGCGTTCAATC
-NODE_X (where X is a number): the number of each contig.
-length_YYY, indicates the total length, in bp, of each contig.
-cov_ZZZZ, indicates the coverage of each contig.
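To pick sensible cut-offs, it can help to eyeball the length/coverage distribution first. This small helper (a sketch using sed on the header format just described) prints one "length coverage" pair per contig:

```shell
#Print "length coverage" per contig by parsing the SPAdes headers with sed.
contig_len_cov() {
    grep '^>' "$1" | sed -n 's/.*length_\([0-9]*\)_cov_\([0-9.]*\).*/\1 \2/p'
}
#For example, to see the ten lowest-coverage contigs of one assembly:
#contig_len_cov 06.assembled_genomes_trimgalore/Acibau_Illumina.fna | sort -g -k2,2 | head
```

If many contigs cluster just below a candidate threshold, consider adjusting the cut-offs in the script below.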
Keeping this name structure, let’s run these lines:
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="/home/ana/hybrid_assembly/06.assembled_genomes_trimgalore" #path to the assembled genomes
OUTPUT_DIR="/home/ana/hybrid_assembly/15c.filtered_contigs_trimgalore" #output directory
# Cut-off. Change them according to your interests.
MIN_LENGTH=200 #length
MIN_COV=10 #coverage
mkdir -p "$OUTPUT_DIR" #create the output directory
#Loop over all the files in the input directory
for file in "$INPUT_DIR"/*.fna; do #be careful with the ending pattern of your files. Change the extension accordingly.
filename=$(basename "$file") #extract the name of the sample
sample="${filename%.fna}"
output_file="$OUTPUT_DIR/${sample}.fna" #the name you want to give to the output files
echo "Processing: $sample"
#NOTE: the three-argument match() used below requires GNU awk (gawk), not mawk
awk -v min_len="$MIN_LENGTH" -v min_cov="$MIN_COV" '
BEGIN { keep=0 }
/^>/ {
keep=0
# Extract the length and the coverage from the name of the contigs
if (match($0, /length_([0-9]+)_cov_([0-9.]+)/, arr)) {
len = arr[1]
cov = arr[2]
if (len > min_len && cov > min_cov) { #save if the filters are passed
keep=1
print
}
}
next
}
{
if (keep == 1) {
print
}
}
' "$file" > "$output_file"
done
echo "Filtering finished"
5. Check the quality of the assemblies
Now, we have to check the quality of the assembled genomes, and we can do that in two different ways.
5.a) QUAST software
We can use a script written in Python to evaluate the quality of the assembly. Specifically, we can use the quast.py script, whose main usage is described on its website [7].
You can run the script described in the section 7.a) of the pipeline to analyze Nanopore genomes. Remember that you have to adapt the script according to the paths of the current folder, the name of the files, the extension of the fasta files etc.
5.b) CheckM2 software
The software CheckM2 [8] is a machine learning-based tool that predicts the completeness and contamination of bacterial genomes or bins.
You can run the script described in the section 7.b) of the pipeline for Nanopore genome analysis. Remember that you have to adapt the script according to the paths of the current folder, the name of the files, the extension of the fasta files etc.
6. Analysis of the high quality genomes
From now on, you can work with your genomes as described in the section 8 of the pipeline for Nanopore genome analysis and subsequent sections. We recommend that you follow all the guidelines provided there. Don’t forget to adapt the names of your directories, files, and paths to match your data.
Footnotes
1. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
2. Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32:3047–3048. https://doi.org/10.1093/bioinformatics/btw354
3. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
4. Chen, S. (2025) fastp 1.0: An ultra-fast all-round tool for FASTQ data quality control and preprocessing. iMeta 4:e70078. https://doi.org/10.1002/imt2.70078
5. Langmead, B., Wilks, C., Antonescu, V., Charles, R. (2019) Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics 35:421–432. https://doi.org/10.1093/bioinformatics/bty648
6. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., Korobeynikov, A. (2020) Using SPAdes de novo assembler. Current Protocols in Bioinformatics 70:e102. https://doi.org/10.1002/cpbi.102
7. Mikheenko, A., Saveliev, V., Hirsch, P., Gurevich, A. (2023) WebQUAST: online evaluation of genome assemblies. Nucleic Acids Research 51:W601–W606. https://doi.org/10.1093/nar/gkad406
8. Chklovski, A., Parks, D.H., Woodcroft, B.J. et al. (2023) CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods 20:1203–1212. https://doi.org/10.1038/s41592-023-01940-w