Snakemake-based workflow for the assembly of chloroplast genomes

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

About CLAW

CLAW (Chloroplast Long-read Assembly Workflow) is an mostly-automated Snakemake-based workflow for the assembly of chloroplast genomes. CLAW uses chloroplast long-reads, which are baited out of larger read libraries (e.g., an Oxford Nanopore Technologies MinION read library derived from photosynthetic tissue), for assembly with Flye and/or Unicycler. CLAW was designed with the novice bioinformatician in mind - it is easy to install and easy to use, requiring only minimal user input.

Download and Install

Clone Git repository:

git clone https://github.com/aaronphillips7493/CLAW
Move into the directory:
```
 cd long-read-chloroplast-assembly
```
Install conda (if not already present):

https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Make sure you have snakemake and Biopython installed. We use Mamba for increased speed:

a) Create a conda environment:
```
 conda create --name snakemake
```
b) Activate the conda environment:
```
 conda activate snakemake
```
c) Install mamba in the environment:
```
 conda install mamba
```
d) Install snakemake and biopython in the environment:
```
 mamba install snakemake biopython
```
e) OPTIONAL: create all environments needed for CLAW:
```
 snakemake -j 1 --conda-create-envs-only --use-conda
```

Steps

Test CLAW. We provide a test read file containing ONT reads from Oryza sativa ("chloro_assembly/reads/DRR196880_subset.fastq") and the reference Oryza sativa chloroplast genome ("chloro_assembly/reference/NC_008155.1_single.fasta").
```
 snakemake --profile profiles/slurm --use-conda --keep-going
```
#note: profiles/slurm may not be appropriate for your PC

This test should complete with no errors, and should generate a rotated choloroplast fasta file ("chloro_assembly/{sample}~{assembler}_chloroplast.fasta") derived from Flye and/or Unicycler. If outside network access is problematic, run downloadReference.sh to download the reference genome. Alternatively, download your reference genome of interest manually and save it into the directory

"chloro_assembly/reference/{file_name}\_single.fasta"

Run your samples through CLAW.

a) modify "config.yml":

 i) ncbi_reference_accession = change this to the NCBI accession number for the reference chloroplast genome of interest. Default = NC_008155.1 (O. sativa). If outside network access is problematic, run downloadReference.sh to download the reference genome. Alternatively, download your reference genome of interest manually and save it into the directory "chloro_assembly/reference/{file_name}\_single.fasta". Make sure to change this parameter even if you do not use CLAW to download the reference genome.
 ii) my_Email = provide an email address. Neccessary for the automated download of genomes from NCBI - prevents system abuse.
 iii) fast_file = declare whether your read file is in FASTA ("fasta"/"fasta.gz") or FASTQ ("fastq"/"fastq.gz") format.
 iv) randseed = you can supply a seed (integer) here if you wish to re-run read extraction the same way with each redo, or leave this blank to let CLAW randomly generate a seed each time it is run. Change this parameter if assembly fails, and re-run random read subsampling.
 v) numberReads = declare the max. number of reads CLAW will subsample. Change value if assembly fails or to increase or decrease genome coverage. Default = 3000.
 vi) readMinLength = declare the smallest read to be used in the assembly. Default = 5000 bp.
 vii) flyeParameter = tell Flye what kind of reads you are using as input (raw, corrected or HQ ONT long reads, or raw, corrected, or HIFI PacBio long reads). Default = --nano-raw.
 viii) minimap2Parameter = tell minimap2 what kinds of reads you are using (ONT, PacBio, or HIFI). Default = map-ont.
 iX) chloroplastSize = the expected size of the chloroplast genome to be assembled. This is usually set as the size of the reference chloroplast genome. Default = 135000.
 X) cpus = declare the number of CPUs to use. Default = 4.

b) make sure you have saved your reads in the directory:

 chloro_assembly/reads

c) make sure you have saved your reference genome in the directory:

 chloro_assembly/reference

d) run {The Workflow}:

 snakemake --profile profiles/slurm --use-conda --keep-going

If {The Workflow} fails, try modifying "randSeed" and/or "numberReads" in config.yml. You will need to delete the files in the following directories to re-run {The Workflow}:
```
 chloro_assembly/assemblies chloro_assembly/subReads chloro_assembly/alignments chloro_assembly/dotPlots
```
You may also need to delete the file:

chloro_assembly/{sample}~{assembler}_chloroplast.fasta

Note

While the read library used for chloroplast genome assembly is enriched for chloroplast reads, there is a possibility that the library will also contain mitochondrial reads because of high sequence similarity between mitochondrial and chloroplast genomes. Thus, it is possible for CLAW to assemble mitochondrial (likely fragmented) contigs too. CLAW makes no attempt to resolve this, so users will have to manually check the assembled contigs for similarity to chloroplast and/or mitochondrial sequences.

While we report on the successful runs of CLAW here, there were several instances in which CLAW failed to complete or when chloroplast genome assembly failed. Typically, CLAW failed to complete due to insufficient memory allocation for certain steps of CLAW. These steps were commonly the read mapping and genome assembly steps. In this case, we were able to overcome this issue by modifying the memory allocation in the “cluster-configs/default.yaml” file. Future releases should aim to integrate dynamic memory allocations to further minimise user input for the successful completion of CLAW. When genome assembly failed, we found that re-running CLAW with a different seed (randomly generated as part of the CLAW, unless specified by the user in “config.yaml”) resulted in successful completion of genome assembly. We also found it necessary to, on occasion, adjust the number of chloroplast reads used for genome assembly as too many reads (and therefore too high genome coverage) could cause the assemblers to fail. Thus, by modifying the random seed and the number of reads used for assembly we were able to generate assemblies for all 19 test datasets.

Code Snippets

shell:
    """
    if [ -s {input.genome} ]
    then
        mkdir -p {params.dir}
        for contig in `grep '^>' {input.genome} | sed -e 's/>//g'`
        do
            echo $contig > {params.dir}/tmp
            seqtk subseq {input.genome} {params.dir}/tmp > {params.dir}/$contig.fasta

            nucmer --maxmatch {input.index} {params.dir}/$contig.fasta -p {params.dir}/out
            show-coords -THrd {params.dir}/out.delta > {params.dir}/out.coords
            start=`sort -k6,6hr {params.dir}/out.coords | head -n 1| cut -f3`
            echo ">$contig" >> {output}
            echo "$start XXX"
            if [ $start == 1 ]
            then
                grep -v '^>' {params.dir}/$contig.fasta | tr -d '\n' >> {output}
                echo "" >> {output} 
            elif [ ! -z $start ]
            then
                grep -v '^>' {params.dir}/$contig.fasta | tr -d '\n' > {params.dir}/temp.fasta
                cut -c ${{start}}- {params.dir}/temp.fasta > {params.dir}/start.fasta
                cut -c -$[start-1] {params.dir}/temp.fasta > {params.dir}/end.fasta
                cat {params.dir}/start.fasta {params.dir}/end.fasta | tr -d '\n' >> {output}
                echo "" >> {output}
            else
                grep -v '^>' {params.dir}/$contig.fasta | tr -d '\n' >> {output}
                echo "" >> {output}
            fi
            rm -rf {params.dir}/*
        done
        rm -rf {params.dir}
    else
        touch {output}
    fi
    """

SnakeMake seqtk subSeq mummer2circos From line 82 of main/Snakefile

shell:
    """
    nucmer {input.reference} {input.query} -p {params}
    # delta-filter -1 -i 50 {params}.delta > {params}.1delta
    mummerplot --fat --png --large {params}.delta -p {params}
    rm {params}.filter {params}.fplot {params}.gp {params}.rplot
    """

SnakeMake mummer2circos From line 143 of main/Snakefile

shell:
    """
    bamCoverage -b {input} -o {output.bigwig}
    """

SnakeMake DeepTools From line 166 of main/Snakefile

shell:
    """
    samtools sort -o {output.bam} {input}
    samtools index {output.bam}
    """

SnakeMake SAMtools From line 182 of main/Snakefile

shell:
    """
    minimap2 -ax {config[minimap2_parameter]} -t {threads} {input.assembly} {input.fastFile} \
        | samtools view -b -F 4 -@ {threads} \
        > {output.bam}
    """

SnakeMake SAMtools Minimap2 From line 202 of main/Snakefile

shell:
    """
    mkdir -p chloro_assembly/assemblies/
    unicycler -l {input} -o {params.assembly}
    """

SnakeMake Unicycler From line 229 of main/Snakefile

shell:
    """
    mkdir -p chloro_assembly/assemblies/
    flye --threads {threads} --genome-size {config[chloroplast_size]} -o {params.flye_assembly} {config[flye_parameter]} {input}
    """

SnakeMake Flye From line 250 of main/Snakefile

shell:
    """
    bamCoverage -b {input} -o {output.bigwig}
    """

SnakeMake DeepTools From line 271 of main/Snakefile

shell:
    """
    samtools sort -o {output.bam} {input}
    samtools index {output.bam}
    """

SnakeMake SAMtools From line 287 of main/Snakefile

shell:
    """
    minimap2 -ax {config[minimap2_parameter]} -t {threads} {input.reference} {input.fastFile} \
        | samtools view -b -F 4 -@ {threads} \
        > {output.bam}
    """

SnakeMake SAMtools Minimap2 From line 307 of main/Snakefile

shell:
    """
    echo {params.random_seed}
    seqtk sample -s {params.random_seed} {input} {config[number_reads]} > {output}
    """

SnakeMake seqtk From line 332 of main/Snakefile

shell:
    """
    samtools view {input.bam} | cut -f1 | sort | uniq > {output.list}
    seqtk subseq {input.fastFile} {output.list} \
        | bioawk -c fastx \
            'length($seq) > {config[read_min_length]} && length($seq) < {config[chloroplast_size]} \
            {{print \">\"$name\"\\n\"$seq}}' > {output.fastFile}
    """

SnakeMake SAMtools seqtk subSeq bioawk From line 355 of main/Snakefile

shell:
    """
    minimap2 -ax {config[minimap2_parameter]} -t {threads} {input.reference} {input.fastFile} \
        | samtools view -b -F 4 -@ {threads} \
        | samtools sort -o {output.bam}
    """

SnakeMake SAMtools Minimap2 From line 382 of main/Snakefile

shell:
    """
    awk 'NR == 1 {{print substr($1,2,length($1)), \"0\", \"10000\"}}' {input} > chloro_assembly/reference/index.bed
    seqtk subseq {input} chloro_assembly/reference/index.bed > {output}
    rm chloro_assembly/reference/index.bed
    """

SnakeMake seqtk subSeq From line 405 of main/Snakefile

shell:
    """
    head -n 1 {input} > {output}
    tail -n +2 {input} | tr -d '\n' >> {output}
    tail -n +2 {input} | tr -d '\n' >> {output}
    """

SnakeMake From line 426 of main/Snakefile

shell:
    """
    mv {input} {output}
    """

SnakeMake From line 453 of main/Snakefile

ShowHide 9 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/aaronphillips7493/CLAW

Name: claw

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

Accessed: 12

License: None

Keywords:

bioawk mummer2circos DeepTools Flye Minimap2 SAMtools seqtk Snakemake subSeq Unicycler

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free