Genomics pipeline
Sequencing platform: Illumina
0. Introduction
In this tutorial you will learn how to perform genomics analyses. We will go through all the necessary steps, from processing the raw sequencing reads produced by the Illumina platform to obtaining high-quality genomes.
0.1 Requirements
This pipeline was set up on a Linux server (Ubuntu distribution) with 96 cores and 500 GB of RAM. Be careful to adjust some parameters according to the computational power of your machine.
Furthermore, here we assume that all the packages, programs, conda environments, and databases are already correctly installed. Otherwise, you must install or download all of them as the first step.
Some knowledge of bash is also desirable.
| Program | Version |
|---|---|
| FastQC | v.0.12.1 |
| MultiQC | v.1.3 |
| TrimGalore | v.0.6.10 |
| fastp | v.0.22.0 |
| Bowtie2 | v.2.5.5 |
| SPAdes | v.3.13.1 |
| QUAST | v.5.2.0 |
| CheckM2 | v.1.1.0 |
0.2 Some useful tricks
-Remember that if you want to create or modify a script, you can type in the terminal:
nano name_of_the_script.sh (or whichever editor you use)
This command will open the nano text-editor, where you can write or paste your script. There are some keyboard shortcuts to manage the text editor.
If the script is written in bash, then you have to type this in the terminal to execute it:
bash name_of_the_script.sh
Alternatively, you can make the script executable once with chmod +x name_of_the_script.sh and then run it as ./name_of_the_script.sh.
-Preferably work inside screen sessions. Some steps are very time-consuming, so it is convenient to run them in a screen session. That way you can leave the server working when you go home to rest, or use the server for other analyses in the meantime. Working with screen is essential if your internet connection is unreliable.
To create a screen session, just run this command: screen -S name_of_screen. Once created, you can detach it and continue working on your server with this keyboard shortcut: Ctrl+a (and then) d. To list your open sessions, run screen -ls, and to reattach one, run screen -r name_of_screen.
1. Analyze the quality of the raw reads
Before starting to analyze the data, it is important to verify that the sequencing company has performed the sequencing properly. For that purpose, we will check several quality parameters. First, we will use the software FastQC [1]. Let's run these lines:
#!/usr/bin/env bash
set -euo pipefail #abort the script on errors, unset variables, and pipeline failures. We will use it throughout the whole protocol.
INPUT_DIR="/home/ana/hybrid_assembly/00.raw_data/Illumina" #input directory where raw reads given by the sequencing company are stored
OUTPUT_DIR="/home/ana/hybrid_assembly/01.fastqc_results" #output directory
THREADS=8 #change it according to your machine
mkdir -p "$OUTPUT_DIR" #create the output directory
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
FILES=("$INPUT_DIR"/*.fastq.gz) #change it according to the ending pattern of the names of your raw reads
# Check if there are files in your directory
if [ ${#FILES[@]} -eq 0 ]; then
echo "No .fastq.gz files found"
exit 1
fi
#Loop over all files in the specified directory
for fq in "${FILES[@]}"; do
sample=$(basename "$fq" .fastq.gz) #extract the name of the samples
echo "Processing sample: $sample"
fastqc \
-t "$THREADS" \
-o "$OUTPUT_DIR" \
"$fq"
done
echo "......Finished"
echo "......Results stored in: $OUTPUT_DIR"
This script creates, for each file, a quality report (name: samplename_fastqc.html, where samplename is the name of your sample/genome under study) and a compressed archive named samplename_fastqc.zip. If the sequencing followed a Paired-End strategy, you will have these two files for each of the Forward and Reverse (F and R from now on) files of every genome/sample; thus, four FastQC files per genome or sample.
If we download the samplename_fastqc.html report, we can see that it is a very detailed quality summary. We can check basic statistics (for instance, the total number of sequences, the size range, the GC percentage), as well as per base sequence quality, per sequence quality scores, and other data.
But if we have several different genomes, we may prefer a quick look at all of them in just one plot or table. For that purpose we will use the tool MultiQC [2].
This tool aggregates our data and summarizes it in clear plots. We can perform the aggregation by running these lines:
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="/home/ana/hybrid_assembly/01.fastqc_results" #Input directory where FastQC results are stored
OUTPUT_DIR="/home/ana/hybrid_assembly/02.multiqc_report" #output directory
mkdir -p "$OUTPUT_DIR" #create output directory
echo "Running MultiQC over FastQC results..."
multiqc \
"$INPUT_DIR" \
--outdir "$OUTPUT_DIR" \
--force #Overwrite any existing reports
echo "......MultiQC finished"
echo "......Report available in: $OUTPUT_DIR"
MultiQC offers many options and flags; it is worth having a look at the manual (multiqc --help).
As a result, these lines give us a directory and a file named multiqc_report.html. If we download it we will see all the details regarding the total number of unique and duplicated reads, sequence quality histograms, per sequence quality scores, per base sequence and GC content, per sequence N content, and much more details about all the F and R files.
2. Trimming of the reads
There are different tools for trimming raw reads. In this pipeline, we will introduce two of them. Both are efficient, so have a look at each script and, keeping your main objectives in mind, select one of them for your analyses. You can even compare the results produced by the two tools.
2.a) TrimGalore tool
Let’s remove short reads, reads with too many ambiguities, reads with low quality scores, etc. We will use the tool Trim Galore [3] for that purpose. It is a Perl script that has to be downloaded to your machine, and you should note its location or path. Have a look at these lines:
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="/home/ana/hybrid_assembly/00.raw_data/Illumina" #input file where Illumina raw reads are stored.
OUTPUT_DIR="/home/ana/hybrid_assembly/03.trimgalore_results" #output directory
JOBS=40 #number of jobs to run in parallel. Change this value according to your machine power
TRIMGALORE_DIR="/home/ana/Metagenomas_prueba_illumina/trim_galore" #path to the TrimGalore script. Change it accordingly.
mkdir -p "$OUTPUT_DIR" #create output directory
#Searching the files
echo "Looking for Paired-End files in: $INPUT_DIR..."
#Look for all F (R1) files
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
R1_FILES=("$INPUT_DIR"/*_1.fastq.gz) #change the ending pattern of the names according to your files
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "No files detected for *_1.fastq.gz"
exit 1
fi #it stops if no F (R1) files are found
#Run TrimGalore in parallel:
printf "%s\n" "${R1_FILES[@]}" | parallel -j "$JOBS" '
r1={}
r2=${r1/_1.fastq.gz/_2.fastq.gz} #the pattern of R (R2) files
sample=$(basename "$r1" _1.fastq.gz) #extract the sample name
echo "Processing sample: $sample"
#Flag summary (bash does not allow comments after a trailing backslash, so they are grouped here):
#--paired performs length trimming for PE files; --stringency 2 sets the minimum overlap
#between adapter and read (adjust it to your experiment; for example, here we reduce the
#stringency because we have 150bp-PE reads); --length 75 removes all reads shorter than
#75 bp (adapt it to your needs); --quality 20 sets the minimum Phred score; --max_n 2 sets
#the maximum number of ambiguities; --trim-n removes Ns from either side of the read.
perl "'"$TRIMGALORE_DIR"'" \
--paired \
--stringency 2 \
--length 75 \
--quality 20 \
--max_n 2 \
--trim-n \
--gzip \
--no_report_file \
--suppress_warn \
--output_dir "'"$OUTPUT_DIR"'" \
"$r1" "$r2"
'
#In this experiment, we do not know which adapters were used during sequencing, so we let the software auto-detect them. But if you are working, for instance, with Nextera adapters, you should add the flag "--nextera"
echo "......Trim Galore finished"
echo "......Results stored in: $OUTPUT_DIR"
At the time of writing this pipeline, the multithreading option of Trim Galore was not available.
Trim Galore gives users many options, and they have to adapt the analysis to their own dataset. So, we encourage having a look at its user guide. For instance, it is convenient to change the stringency parameter if the run was 2x150 bp.
This step is time-consuming; keep it in mind, and do not forget to create a screen session.
In our example, Trim Galore created files whose names end in 1_val_1.fq.gz and 2_val_2.fq.gz for F and R reads, respectively. Keep the ending pattern of your files’ names in mind; you will need it later.
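Since every later step assumes properly paired files, it can be worth a quick check that each forward file has its reverse mate. The helper below is a small sketch assuming the Trim Galore output directory and naming pattern used above:

```shell
#Check that every forward trimmed file has its reverse mate.
#Assumes the *_1_val_1.fq.gz / *_2_val_2.fq.gz naming pattern shown above.
check_pairs() {
    local r1 r2 missing=0
    for r1 in "$1"/*_1_val_1.fq.gz; do
        [ -f "$r1" ] || continue
        r2=${r1/_1_val_1.fq.gz/_2_val_2.fq.gz}
        if [ ! -f "$r2" ]; then
            echo "Missing mate for $(basename "$r1")"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "All pairs complete"
    fi
}
check_pairs /home/ana/hybrid_assembly/03.trimgalore_results
```

If any mate is missing, re-run Trim Galore for that sample before continuing.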
2.b) fastp tool
The software fastp [4] is a tool designed for ultrafast preprocessing of Illumina reads. It is very powerful: it cuts adapters, trims the poly-G regions commonly found in NovaSeq and NextSeq data, removes reads with low quality scores, and more. The program can also perform error correction. We suggest that you have a close look at the user guide.
Run these lines to do the trimming:
#!/usr/bin/env bash
set -euo pipefail
CONDA_ENV="fastp_env" #indicate the name of the environment where fastp is installed
INPUT_DIR="/home/ana/hybrid_assembly/00.raw_data/Illumina" #path to the folder containing the raw reads
OUTPUT_DIR="/home/ana/hybrid_assembly/04.fastp_results" #output directory
JOBS=16 #Number of jobs to run in parallel. Adapt it to your machine's power
#Activate conda environment
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate "$CONDA_ENV"
mkdir -p "$OUTPUT_DIR" #create output directory
#Look for F and R files
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
R1_FILES=("$INPUT_DIR"/*_1.fastq.gz) #adapt it according to the extension of your raw data
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "Files *_1.fastq.gz not found"
exit 1
fi
export OUTPUT_DIR
printf "%s\n" "${R1_FILES[@]}" | parallel -j "$JOBS" '
r1={}
r2=${r1/_1.fastq.gz/_2.fastq.gz}
sample=$(basename "$r1" _1.fastq.gz) #extract sample name
echo "Processing sample: $sample"
#Flag summary (bash does not allow comments after a trailing backslash, so they are grouped here):
#--in1/--in2 take the F and R input reads; --out1/--out2 name the trimmed output files (F and R);
#--unpaired1/--unpaired2 collect reads that pass the trimming while their mate does not;
#--detect_adapter_for_pe auto-detects the adapters (highly recommended to obtain high quality reads);
#--cut_mean_quality 20 sets the minimum mean quality in the sliding window; --cut_front and
#--cut_tail slide that window from 5' and from 3', respectively (read the documentation);
#--length_required 75 sets the minimum read length; --overrepresentation_analysis analyzes
#overrepresented sequences (adapters, Ns, etc.); --html and --json write the reports.
fastp \
--in1 "$r1" \
--in2 "$r2" \
--out1 "$OUTPUT_DIR/${sample}_R1.trimmed.fastq.gz" \
--out2 "$OUTPUT_DIR/${sample}_R2.trimmed.fastq.gz" \
--unpaired1 "$OUTPUT_DIR/${sample}_unpaired_R1.fastq.gz" \
--unpaired2 "$OUTPUT_DIR/${sample}_unpaired_R2.fastq.gz" \
--detect_adapter_for_pe \
--cut_mean_quality 20 \
--cut_front \
--cut_tail \
--length_required 75 \
--thread 1 \
--overrepresentation_analysis \
--html "$OUTPUT_DIR/${sample}.fastp.html" \
--json "$OUTPUT_DIR/${sample}.fastp.json"
'
echo "......fastp finished"
echo "......Results stored in: $OUTPUT_DIR"
These lines give us several files:
-samplename.fastp.html: an HTML report of the trimming. If you download it, you can see many details: the type of sequencing, mean read length before and after trimming, number of sequences, read quality before and after trimming, etc.
-samplename.fastp.json: the report in json format
-samplename_R1.trimmed.fastq.gz: trimmed files (F)
-samplename_R2.trimmed.fastq.gz: trimmed files (R)
-samplename_unpaired_R1.fastq.gz: F reads that pass the trimming but the corresponding R does not
-samplename_unpaired_R2.fastq.gz: R reads that pass the trimming but their F does not.
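The JSON reports can also be queried from the command line to get a quick per-sample summary of the filtering. The helper below is a sketch using only grep and sed; note the key-order assumption stated in the comments:

```shell
#Summarize read counts before and after filtering from the fastp JSON reports,
#using only grep/sed. ASSUMPTION: the report lists the summary's before_filtering
#block before the after_filtering block, as recent fastp versions do.
fastp_read_counts() {
    local report counts before after sample
    for report in "$1"/*.fastp.json; do
        [ -f "$report" ] || continue
        sample=$(basename "$report" .fastp.json)
        counts=$(grep -o '"total_reads": *[0-9][0-9]*' "$report" | grep -o '[0-9][0-9]*')
        before=$(echo "$counts" | sed -n 1p)
        after=$(echo "$counts" | sed -n 2p)
        echo "$sample: $before reads before trimming, $after after"
    done
}
fastp_read_counts /home/ana/hybrid_assembly/04.fastp_results
```

This gives a one-line overview per sample without opening each HTML report.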
These are two options for filtering sequences. For convenience, in this tutorial we will use 2.a) TrimGalore tool, but you can use either one and follow the rest of the tutorial. Just remember to adjust the file names to match the ones you’ve obtained.
If you have sequenced bacterial genomes using Illumina and Nanopore platforms and want to perform a hybrid assembly, now is the right time. In that case, go to Section X and do not continue with this pipeline.
EXTRA: removal of the phiX174 genome
During Illumina sequencing, it is common to spike in the phiX174 bacteriophage genome as an internal sequencing quality control. We can add a step to remove any potential contamination with this genome, using the Bowtie2 tool [5].
But first, we have to download the genome of the phage, for example:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz
Let’s unzip the file:
gunzip *fna.gz
Bowtie2 is an aligner that works with a special index database (see the corresponding manual), built from the genome(s) we want to compare against. So, let’s create it:
bowtie2-build --threads 48 GCF_000819615.1_ViralProj14015_genomic.fna phi_index
This produces six index files with the .bt2 extension. Now, let’s move all the index-related files to a new directory:
mkdir bowtie_index
mv phi_index* /home/ana/hybrid_assembly/bowtie_index/
And now it is time to remove the phiX174 genome:
#!/bin/bash
set -euo pipefail
THREADS=80 #change it according to your machine power
INPUT_DIR="/home/ana/hybrid_assembly/03.trimgalore_results" #path to the trimmed reads
OUTPUT_SAM_DIR="/home/ana/hybrid_assembly/20.sam_host" #directory where phiX174 reads will be stored (SAM format)
OUTPUT_FASTQ_DIR="/home/ana/hybrid_assembly/21.bowtie_noPhiX174_fastq" #path to the output files, without the phage's genome
BOWTIE_INDEX="/home/ana/hybrid_assembly/bowtie_index/phi_index" #path where bowtie2 index database is saved. Indicate it accordingly. We just indicate the prefix
mkdir -p "$OUTPUT_SAM_DIR" #create SAM directory
mkdir -p "$OUTPUT_FASTQ_DIR" #create output directory
shopt -s nullglob #make unmatched globs expand to nothing
#Loop over trimmed-reads files
for F1 in "$INPUT_DIR"/*_1_val_1.fq.gz; do #Be careful with the ending pattern of the files' names. Change it according to your files' names
# Extract the prefix of the samples
BASE=$(basename "$F1" "_1_val_1.fq.gz")
F2="$INPUT_DIR/${BASE}_2_val_2.fq.gz"
# Check that the R file exists
if [[ ! -f "$F2" ]]; then
echo "ERROR: missing pair for $BASE"
continue
fi
echo ">>> Cleaning host reads for sample $BASE"
#Align against the phage's genome
#-1 and -2 take the F and R files, respectively
bowtie2 -p $THREADS --sensitive-local \
-x "$BOWTIE_INDEX" \
-1 "$F1" -2 "$F2" \
-S "$OUTPUT_SAM_DIR/${BASE}.sam"
# Convert SAM to BAM
samtools view --threads $THREADS -bS "$OUTPUT_SAM_DIR/${BASE}.sam" > "$OUTPUT_SAM_DIR/${BASE}.bam"
rm "$OUTPUT_SAM_DIR/${BASE}.sam"
# Extract pairs of non-matching reads. These are the reads we are interested in
#visit samtool's manual to understand the meaning of all the flags
samtools view -b -f 12 -F 256 "$OUTPUT_SAM_DIR/${BASE}.bam" > "$OUTPUT_SAM_DIR/ext_${BASE}.bam"
rm "$OUTPUT_SAM_DIR/${BASE}.bam"
# Sort by name
samtools sort -n "$OUTPUT_SAM_DIR/ext_${BASE}.bam" > "$OUTPUT_SAM_DIR/sort_${BASE}.bam"
rm "$OUTPUT_SAM_DIR/ext_${BASE}.bam"
# Extract non-matching reads to fastq files
#visit bedtools's manual to understand the meaning of all the flags
bedtools bamtofastq \
-i "$OUTPUT_SAM_DIR/sort_${BASE}.bam" \
-fq "$OUTPUT_FASTQ_DIR/${BASE}_nohost_R1.fastq" \
-fq2 "$OUTPUT_FASTQ_DIR/${BASE}_nohost_R2.fastq"
echo ">>> Done sample $BASE"
done
We have included these lines of code as supplementary material, not as the main content of the tutorial. Therefore, if you use them, in the following steps you must adjust the names of the input files to match the names of the outputs generated here (generally, the previous lines of code generate files of the type samplename_Illumina_nohost_R1.fastq and samplename_Illumina_nohost_R2.fastq).
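As a quick sanity check, you can count how many read pairs survived the phiX174 removal. This is a sketch using the output directory from the script above and the fact that a FASTQ record spans four lines:

```shell
#Count the read pairs kept after phiX174 removal.
#A FASTQ record spans 4 lines, so reads = lines / 4.
count_nohost_pairs() {
    local fq
    for fq in "$1"/*_nohost_R1.fastq; do
        [ -f "$fq" ] || continue
        echo "$(basename "$fq" _nohost_R1.fastq): $(( $(wc -l < "$fq") / 4 )) read pairs retained"
    done
}
count_nohost_pairs /home/ana/hybrid_assembly/21.bowtie_noPhiX174_fastq
```

Comparing these counts with the trimmed-read counts tells you how much phiX174 contamination each sample carried.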
3. Assembly of the reads into genomes
One of the most widely used tools to assemble reads into contigs or scaffolds de novo is SPAdes [6]. In short, SPAdes is based on de Bruijn graphs.
As you know, SPAdes is a script that should be installed on your machine, so write down its path in order to use it later. Then, you just have to run these lines:
#!/usr/bin/env bash
set -euo pipefail
SPADES="/home/ana/miniconda3/bin/spades.py" #the path to the SPAdes script.
INPUT_DIR="/home/ana/hybrid_assembly/03.trimgalore_results/" #input directory where trimmed reads are stored
SPADES_DIR="/home/ana/hybrid_assembly/05.spades_trimgalore_results" #output directory
GENOMES_DIR="/home/ana/hybrid_assembly/06.assembled_genomes_trimgalore" #path to the already assembled genomes in fasta format
THREADS=16 #change it according to your machine
JOBS=4 #number of jobs to be run in parallel
mkdir -p "$SPADES_DIR" #create the output directory
mkdir -p "$GENOMES_DIR" #create the directory to store the genomes
shopt -s nullglob #make unmatched globs expand to nothing, so the check below works
R1_FILES=("$INPUT_DIR"/*_1_val_1.fq.gz) #indicate the name of the F files. Change it according to the ending pattern of the names of the trimmed files.
#Look for the files
if [ ${#R1_FILES[@]} -eq 0 ]; then
echo "Files *_1_val_1.fq.gz not found"
exit 1
fi
#export variables
export SPADES SPADES_DIR GENOMES_DIR THREADS
#Run SPAdes in parallel
printf "%s\n" "${R1_FILES[@]}" | parallel -j "$JOBS" '
r1={}
r2=${r1/_1_val_1.fq.gz/_2_val_2.fq.gz}
sample=$(basename "$r1" _1_val_1.fq.gz) #extract the name of the sample
echo "Assembling genome for sample: $sample"
#Flag summary (bash does not allow comments after a trailing backslash, so they are grouped here):
#--isolate is intended for high-coverage isolate (multicell) data; -1 and -2 take the
#F and R reads; -k is a comma-separated list of k-mer sizes.
"$SPADES" \
--isolate \
-1 "$r1" \
-2 "$r2" \
-k 55,75,97 \
-o "$SPADES_DIR/$sample" \
-t "$THREADS"
#copy the fasta files of the assembled genomes to another folder
if [ -f "$SPADES_DIR/$sample/contigs.fasta" ]; then
cp "$SPADES_DIR/$sample/contigs.fasta" "$GENOMES_DIR/${sample}.fna"
else
echo "WARNING: contigs.fasta not found for $sample"
fi
'
echo "......SPAdes assembly finished"
echo "......SPAdes outputs in $SPADES_DIR"
echo "......Final genomes in $GENOMES_DIR"
This script creates two directories: 05.spades_trimgalore_results and 06.assembled_genomes_trimgalore in this example. The first one contains one folder for each analyzed genome, which in turn contains these files and folders:
-KX, where X is the size of the k-mers. This folder contains the assembled contigs and the assembly graphs. There is a folder for each k-mer size indicated in the script.
-spades.log and params.txt: these files act as log files.
-contigs.fasta: fasta files of the contigs for the selected k-mer
-before_rr.fasta: fasta files of the contigs before resolving the repetitive regions.
-scaffolds.fasta: fasta files of the scaffold constructed based on the selected k-mer.
-assembly_graph.fastg, assembly_graph_with_scaffolds.gfa: assembly graphs. They can be visualized with the tool Bandage.
SPAdes creates also other files. If you want to have a close look, visit the developers’ website.
The other directory named 06.assembled_genomes_trimgalore contains the fasta file of each assembled genome. In this example, they have .fna extension, but you can change it.
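Before filtering, it can help to get a rough idea of each assembly directly from the FASTA files. The sketch below (assuming the directory and .fna extension from the example above) reports contig count and total assembly length per genome:

```shell
#Report contig count and total assembly length per genome, straight from the FASTA files.
assembly_stats() {
    local f
    for f in "$1"/*.fna; do
        [ -f "$f" ] || continue
        awk -v name="$(basename "$f" .fna)" '
            /^>/ { contigs++; next }   #each header line starts a new contig
            { bp += length($0) }       #sum the sequence-line lengths
            END { printf "%s: %d contigs, %d bp\n", name, contigs, bp }
        ' "$f"
    done
}
assembly_stats /home/ana/hybrid_assembly/06.assembled_genomes_trimgalore
```

An unusually high contig count or total length can flag a problematic assembly before you run QUAST or CheckM2.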
4. Contig filtering
By using the Illumina platform, we obtain short reads that result in the assembly of many contigs. Some of them are too short and have insufficient coverage, so it is often advisable to remove them. In this tutorial, we propose retaining only those contigs that are longer than 200 bp and have a coverage depth greater than 10. These parameters should be adjusted according to your results and the characteristics of the samples and sequencing. The script we propose here is based on the format of the names of the contigs. If we ran SPAdes to perform the assembly, the header of the fasta files will look like this:
head -n 2 06.assembled_genomes_trimgalore/Acibau_Illumina.fna
>NODE_1_length_237808_cov_34.133406
TAGCAAGCTACCTTCCCCCGCTCGACTTGCATGTGTTAAGCCTGCCGCCAGCGTTCAATC
-NODE_X (where X is a number): the number of each contig.
-length_YYY, indicates the total length, in bp, of each contig.
-cov_ZZZZ, indicates the coverage of each contig.
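To pick sensible cut-offs, it can help to eyeball the length/coverage distribution first. This small helper (a sketch using sed on the header format just described) prints one "length coverage" pair per contig:

```shell
#Print "length coverage" per contig by parsing the SPAdes headers with sed.
contig_len_cov() {
    grep '^>' "$1" | sed -n 's/.*length_\([0-9]*\)_cov_\([0-9.]*\).*/\1 \2/p'
}
#For example, to see the ten lowest-coverage contigs of one assembly:
#contig_len_cov 06.assembled_genomes_trimgalore/Acibau_Illumina.fna | sort -g -k2,2 | head
```

If many contigs cluster just below a candidate threshold, consider adjusting the cut-offs in the script below.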
Keeping this name structure, let’s run these lines:
#!/usr/bin/env bash
set -euo pipefail
INPUT_DIR="/home/ana/hybrid_assembly/06.assembled_genomes_trimgalore" #path to the assembled genomes
OUTPUT_DIR="/home/ana/hybrid_assembly/15c.filtered_contigs_trimgalore" #output directory
# Cut-off. Change them according to your interests.
MIN_LENGTH=200 #length
MIN_COV=10 #coverage
mkdir -p "$OUTPUT_DIR" #create the output directory
#Loop over all the files in the input directory
for file in "$INPUT_DIR"/*.fna; do #be careful with the ending pattern of your files. Change the extension accordingly.
filename=$(basename "$file") #extract the name of the sample
sample="${filename%.fna}"
output_file="$OUTPUT_DIR/${sample}.fna" #the name you want to give to the output files
echo "Processing: $sample"
#NOTE: the three-argument match() used below requires GNU awk (gawk), not mawk
awk -v min_len="$MIN_LENGTH" -v min_cov="$MIN_COV" '
BEGIN { keep=0 }
/^>/ {
keep=0
# Extract the length and the coverage from the name of the contigs
if (match($0, /length_([0-9]+)_cov_([0-9.]+)/, arr)) {
len = arr[1]
cov = arr[2]
if (len > min_len && cov > min_cov) { #save if the filters are passed
keep=1
print
}
}
next
}
{
if (keep == 1) {
print
}
}
' "$file" > "$output_file"
done
echo "Filtering finished"
5. Check the quality of the assemblies
Now, we have to check the quality of the assembled genomes, and we can do that in two different ways.
5.a) QUAST software
We can use a script written in Python to evaluate the quality of the assembly. Specifically, we can use the quast.py script, whose main usage is described on its website [7].
You can run the script described in the section 7.a) of the pipeline to analyze Nanopore genomes. Remember that you have to adapt the script according to the paths of the current folder, the name of the files, the extension of the fasta files etc.
5.b) CheckM2 software
The software CheckM2 [8] is a machine learning-based tool that predicts the completeness and contamination of bacterial genomes or bins.
You can run the script described in the section 7.b) of the pipeline for Nanopore genome analysis. Remember that you have to adapt the script according to the paths of the current folder, the name of the files, the extension of the fasta files etc.
6. Analysis of the high quality genomes
From now on, you can work with your genomes as described in the section 8 of the pipeline for Nanopore genome analysis and subsequent sections. We recommend that you follow all the guidelines provided there. Don’t forget to adapt the names of your directories, files, and paths to match your data.
Footnotes
1. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
2. Ewels, P., Magnusson, M., Lundin, S., Käller, M. (2016) MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32:3047–3048. https://doi.org/10.1093/bioinformatics/btw354
3. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
4. Chen, S. (2025) fastp 1.0: An ultra-fast all-round tool for FASTQ data quality control and preprocessing. iMeta 4:e70078. https://doi.org/10.1002/imt2.70078
5. Langmead, B., Wilks, C., Antonescu, V., Charles, R. (2019) Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics 35:421–432. https://doi.org/10.1093/bioinformatics/bty648
6. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., Korobeynikov, A. (2020) Using SPAdes de novo assembler. Current Protocols in Bioinformatics 70:e102. https://doi.org/10.1002/cpbi.102
7. Mikheenko, A., Saveliev, V., Hirsch, P., Gurevich, A. (2023) WebQUAST: online evaluation of genome assemblies. Nucleic Acids Research 51:W601–W606. https://doi.org/10.1093/nar/gkad406
8. Chklovski, A., Parks, D.H., Woodcroft, B.J. et al. (2023) CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods 20:1203–1212. https://doi.org/10.1038/s41592-023-01940-w