A Snakemake workflow for de novo genome assembly using Illumina reads.
This workflow was developed for the Fungarium Sequencing Project (FSP) at Royal Botanic Gardens, Kew.
The usage of this workflow is described in the Snakemake Workflow Catalog.
Detailed information about workflow configuration can also be found in the config/README.md.
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository or its DOI.
- Snakemake workflow:
FSP_assembly_benchmarking
The workflow was developed to benchmark the performance of different short reads assemblers using hDNA (historical DNA).
It may be useful to other projects that deal with difficult samples as the FSP, and need to find the best short reads assembler(s) for their own case.
The workflow was designed to process several samples in parallel.
Each sample is assembled using the following assemblers:
- SPAdes - multi k-mer assembler
- MEGAHIT - multi k-mer assembler
- AbySS - single k-mer assembler
- SparseAssembler - single k-mer assembler
- Minia - single k-mer assembler
- MaSuRCA - single k-mer assembler
Note: the workflow was designed to use previously pre-processed reads, although it works also with raw reads (but keep in mind that this doesn't pre-process reads automatically).
The user can choose which assemblers to use by setting them as True or False in the config.yml.
The user can choose between three different strategies to set the k-mer size for the genome assembly process:
manual: the k-mer size is set manually for each assembler in the config.yml.kmergenie: KMerGenie is used to estimate the best k-mer size for genome assembly for each library. Note that KMerGenie should be used with haploid or diploid genomes only.reads_length: SeqKit is used to calculate the median reads length and the k-mer size is set to be 2/3rds of the median.
The user can choose which k-mer strategues to use by setting them as True or False in the config.yml.
Furthermore, the user can choose to use two different input reads:
- forward and reverse reads files
- merged reads
The user can choose which type of reads to use by setting them as True or False in the config.yml.
The assemblies produced by each assembler for each sample are then quality inspected with the following tools:
- BUSCO - evaluates each produced assembly quality in terms of expected gene content. It is run twice for each sample, once using a general dataset, and once using a more closely related dataset. See
config/README.mdfor more details. - QUAST - computes assembly statistics.
- MerquryFK - computes k-mer analysis for each assembly and compares its content with the k-mer computed for raw reads by FastK.
The best assembly for each sample is then selected. The selection is based on the highest complete and single BUSCO content (absolute number, not percentage). If more than one assembly have the highest complete and single copy BUSCO score, the assembly with the highest aUN (calculated by QUAST) among those is selected.
The best assembly is then handed over to pypolca, to improve the assembly by performing substitution, insertion, and deletion errors correction.
The improved best assembly is then quality assessed again with BUSCO, QUAST, and MerquryFK, and aligned back to the reads with bwa-mem2 and samtools, which is also used to analyse the genome coverage.
git clone https://github.com/LiaOb21/FSP_assembly_benchmarking.git
The directory that contains your input samples (e.g. data) must be structured in the following way:
data/
├── 048ds
│ ├── 048ds_merge.fq.gz
│ ├── 048ds_trimmed.R1.fq.gz
│ ├── 048ds_trimmed.R2.fq.gz
│ ├── 048ds_unmerged.R1.fq.gz
│ └── 048ds_unmerged.R2.fq.gz
└── 048ss
├── 048ss_merge.fq.gz
├── 048ss_trimmed.R1.fq.gz
├── 048ss_trimmed.R2.fq.gz
├── 048ss_unmerged.R1.fq.gz
└── 048ss_unmerged.R2.fq.gz
In this example 048ds and 048ss are the only two samples present in the input directory. Note how each subdirectory is named after the sample, and the files inside each sample subdirectory have a standardised name. This is crucial for Snakemake to work properly. Even in the case in which you want to use raw reads, the name of the fastq (gzipped) files must match those shown in the snippet above.
This workflow runs BUSCO on each produced assembly twice using two different databases. One database more "general" (e.g. fungi_odb), and one as closely related as possible to the organisms analysed (e.g. agaricales_odb).
The recommendation here is to store all the BUSCO databases for the group of organisms of interest in a single directory, to allow the workflow to automatically use the closest database to the target organism for the second BUSCO run.
First of all, obtain the full list of busco lineages and use the awk command to extract the lineages of the group of interest (e.g. fungi). The _odb12 suffix refers to the version of the busco databases. The following commands can be used also with other groups of organisms or database versions, modifying the relevant parts of the code.
cd to/where/you/want/to/store/busco/databases
conda activate busco
busco --list > busco_lineages.txt
gawk '/fungi_odb12/{flag=1; indent=length($0)-length(ltrim($0)); print "fungi_odb12"; next}
flag && /- [a-z_]*_odb12/ {
current_indent=length($0)-length(ltrim($0))
if(current_indent <= indent) flag=0
else print gensub(/.*- ([a-z_]*_odb12).*/, "\\1", "g")
}
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }' busco_lineages.txt > fungi_busco_lineages.txt
fungi_busco_lineages.txt will be the input for this field in the config file:
busco:
database_list: "to/where/you/want/to/store/busco/databases/fungi_busco_lineages.txt"
You will need to download all the databases you need for your analysis before hands. You can easily do this in this way:
for i in $(cat fungi_busco_lineages.txt); do
echo "downloading $i database"
busco --download_path . --download $i
done
As you will download dabases before starting the workflow, you should set "--offline": True in the config.yaml for the BUSCO parameters. For more details, please refer to config/README.md.
We need a tab separated taxonomy file containing at least the following columns: 'Sample', 'Family', 'Order', 'Class', 'Phylum'. Name the columns exactly as shown here (with capital letters).
Example:
Here the column 'Genus' will be ignored (as BUSCO databases are available up to family level), but it's okay if it is in the file.
🍄🍄🍄 Note: before running the workflow you should set up your config.yml. You can find it in config/config.yml. The config/README.md explains how to set up the config.yml. 🍄🍄🍄
First install snakemake using conda (make sure to install version 9+):
conda create -n snakemake snakemake
conda activate snakemake
If you want to monitor the workflow through a console you can install the snkmt plugin within the snakemake environment (this is not mandatory):
pip install snakemake-logger-plugin-snkmt # used for monitoring resources
🍄🍄🍄 IMPORTANT: Snakemake workflows must be always run from the cloned directory. 🍄🍄🍄
To run the workflow, you can use the following command:
cd ~/FSP_assembly_benchmarking
snakemake --configfile config/config.yml --software-deployment-method conda --snakefile workflow/Snakefile --cores 8
If you want to use snkmt add this flag to the previous command: --logger snkmt.
If you only want to run the workflow using forward and reverse reads, the only files you need in your input data directory are *_trimmed.R1.fq.gz and *_trimmed.R2.fq.gz.
If you want to use merged reads (that you have to merge previously) the input directory must contain all the files showed in Your input data directory structure, including *_unmerged.R1.fq.gz, *_unmerged.R2.fq.gz, *_trimmed.R1.fq.gz, and *_trimmed.R2.fq.gz, as some of the assemblers needs these file in order to run in this mode, and the reads alignment (bwa mem2) and the k-mer computing (fastk) are always performed using *_trimmed.R1.fq.gz and *_trimmed.R2.fq.gz.
Make sure to always name the files as shown in Your input data directory structure section.
conda create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
pip install snakemake-executor-plugin-cluster-generic # to handle SLURM job submission
pip install snakemake-logger-plugin-snkmt # used for monitoring resources (optional)
🍄🍄🍄 IMPORTANT: Snakemake workflows must be always run from the cloned directory (i.e. FSP_assembly_benchmarking). Therefore, the following script must be placed in that directory. 🍄🍄🍄
Create run_snakemake.sh:
#!/bin/bash
#SBATCH --job-name=snakemake-fsp
#SBATCH --partition=medium
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=snakemake-%j.out
#SBATCH --error=snakemake-%j.err
# Create logs directory
mkdir -p logs
# Load conda environment
source ~/.bashrc
conda activate snakemake
# Run Snakemake (individual jobs will be distributed to appropriate partitions)
echo "Starting Snakemake workflow at $(date)"
snakemake --profile profile/ --logger snkmt
echo "Workflow completed at $(date)"
With this sbatch script example, Snakemake will look for the default Snakefile, which is workflow/Snakefile, and will run the workflow using that file.
🍄🍄🍄 IMPORTANT: These are examples of sbatch script. Make sure to set up your sbatch script according to your system settings. 🍄🍄🍄
In alternative, you can launch Snakemake in a screen shell using srun for an interactive run.
Submit the workflow throgh the sbatch script:
sbatch run_snakemake.sh
Monitor the workflow:
# Monitor using snkmt console
conda activate snakemake
snkmt console
# Monitor with SLURM
squeue -u $USER
squeue -o "%.18i %.100j %.8u %.2t %.10M %.6D %R" -u $USER # to see full name of jobs including sample name
# Monitor through the logs
less snakemake-JOBID.err # to have a look to the log
grep -A 20 "Error in rule" snakemake-JOBID.err # to screen the log for possible errors
To run the workflow from command line, change the working directory.
cd path/to/snakemake-workflow-nameAdjust options in the default config file config/config.yml.
Before running the complete workflow, you can perform a dry run using:
snakemake --dry-runTo run the workflow with test files using conda:
snakemake --cores 2 --sdm conda --directory .testTo run the workflow with apptainer / singularity, add a link to a container registry in the Snakefile, for example container: "oras://ghcr.io/<user>/<repository>:<version>" for Github's container registry.
Run the workflow with:
snakemake --cores 2 --sdm conda apptainer --directory .test- Lia Obinu
- Royal Botanic Gardens, Kew
- ORCID profile
The other members of the FSP bioinformatics team, Wu Huang and Niall Garvey, and George Mears also contributed to the development and testing of this part of the workflow (which is only a part of the full pipeline developed for the FSP).
Köster, J., Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., Forster, J., Lee, S., Twardziok, S. O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., & Nahnsen, S. Sustainable data analysis with Snakemake. F1000Research, 10:33, 10, 33, 2021. https://doi.org/10.12688/f1000research.29032.2.
- Update the deployment and references sections.
- Update the
README.mdbadges. Add or remove badges forconda/singularity/apptainerusage depending on the workflow's deployment options.