Pore-C Snakemake Workflow for Processing and Analysis of Nanopore Sequencing Data
1. Introduction
Overview:
This pipeline manages a Pore-C workflow, starting from raw fastq files and converting them to standard file formats for use by downstream tools. The steps involved are:

- Pre-processing a reference genome or draft assembly to generate auxiliary files used in downstream analyses
- Creating virtual digests of the genome
- Filtering the raw reads to remove any that might break downstream tools
- Aligning the reads against a reference genome
- Processing the alignment results to filter spurious alignments, detect ligation junctions and assign fragments. The results are stored in a parquet table for downstream processing.
- Converting the results to other formats: a multi-resolution cooler (`.mcool`) file, a pairix-indexed pairs file, a juicebox-compatible `.hic` file, a salsa2-compatible `.bed` file, and an experimental `.mnd.txt` file for 3d-dna (see "Workflow targets" below)
2. Getting started
In most cases, it is best to pre-install conda before starting. All other dependencies will be installed automatically when running the pipeline for the first time.
Requirements:
This pipeline requires a computer running Linux (Ubuntu 16); >64 GB of memory is recommended. The pipeline has been tested on minimal server installs.
Most software dependencies are managed using conda. To install conda, install miniconda3 and refer to its installation instructions. You will need to accept the license agreement during installation, and we recommend that you allow the conda installer to prepend its path to your `.bashrc` file when asked.
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Check that conda has installed successfully:

```
conda -h
```

If conda has installed correctly, you should see the following output. If you do not, you may need to close and reopen your terminal.
```
$ conda
usage: conda [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:
  command
    clean        Remove unused packages and caches.
    config       Modify configuration values in .condarc. This is modeled
                 after the git config command. Writes to the user .condarc
                 file ($HOME/.condarc) by default.
    create       Create a new conda environment from a list of specified
                 packages.
    ..............
```
Installation:
Clone this git repository to the location where you want to run your analysis, and create the conda environment that will be used to run the pipeline:

```
git clone https://github.com/nanoporetech/Pore-C-Snakemake.git
cd Pore-C-Snakemake
# Creates the environment; the dependencies will install automatically
conda env create
conda activate pore_c_snakemake
```
Note: before you run any of the snakemake commands below, make sure that you've run `conda activate pore_c_snakemake`.
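A quick way to confirm that the environment is active is to check that `snakemake` itself resolves; this check is only a suggestion and not part of the pipeline:

```
conda activate pore_c_snakemake
snakemake --version   # prints the installed Snakemake version if the environment is active
```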
3. Usage
Testing:
Test data is included in the `.test` subfolder (git-lfs is required to download it). To run the tests use:

```
snakemake --use-conda test -j 4 --config=output_dir=results.test
```

The results of the test run will appear in the `results.test` directory.
Configure workflow:
The pipeline configuration is split across several files:
* `config/config.yaml` - A yaml file containing settings for the pipeline. Input data is specified in the following tab-delimited files.
* `config/basecall.tsv` - Metadata and locations of the pore-c sequencing run fastqs.
* `config/references.tsv` - Locations of the draft/scaffold/reference assemblies that the pore-c reads will be mapped to.
* `config/phased_vcfs.tsv` - [Optional] The locations of phased vcf files that can be used to haplotag Pore-C reads.
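For orientation only, a `config/basecall.tsv` might look something like the sketch below. The column names here are hypothetical; the authoritative schema is given by the template files shipped in the repository's `config/` directory.

```
# Hypothetical columns - consult the config/basecall.tsv template for the real schema
run_id    enzyme    refgenome_id    fastq_path
run01     NlaIII    refgenome1      /data/run01/basecalls.fastq.gz
```

(The real file is tab-delimited; the columns above are spaced out for readability.)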
Execute workflow:
Test your configuration by performing a dry-run via

```
snakemake --use-conda -n
```

Execute the workflow locally via

```
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```
snakemake --use-conda --cluster qsub --jobs 100
```

or

```
snakemake --use-conda --drmaa --jobs 100
```

in combination with any of the modes above. See the Snakemake documentation for further details.
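If your scheduler needs explicit resource requests, they can be embedded in the `--cluster` string; Snakemake substitutes per-rule values such as `{threads}` into it. The flags below are an SGE-style sketch, and the `-pe smp` parallel environment name is an assumption to adapt to your site:

```
snakemake --use-conda --jobs 100 \
    --cluster "qsub -cwd -V -pe smp {threads}"
```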
Workflow targets
The pipeline defines several targets that can be specified on the command line:

- all: The default target, which builds the pore_c contact and concatemer parquet files under the `merged_contacts` directory.
- cooler: Builds a multi-resolution `.mcool` file.
- pairs: Builds a pairix-indexed pairs file.
- juicer: Builds a `.hic` file compatible with the juicebox suite of tools.
- salsa: Builds a `.bed` file for use with the salsa2 scaffolding tool.
- mnd: Builds a `.mnd.txt` file compatible with the 3d-dna scaffolding tool [experimental].

To build the files for a particular target:

```
snakemake --use-conda -j 8 <target>
```
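Several targets can be combined in a single invocation; for example, to build both the cooler and pairs outputs:

```
snakemake --use-conda -j 8 cooler pairs
```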
4. Output files
Once the pipeline has run successfully, you should expect the following files in the output directory:

- `refgenome/`:
  - `{refgenome_id}.rg.metadata.csv` - chromosome metadata in csv format.
  - `{refgenome_id}.rg.chromsizes` - reference genome chromosome lengths.
  - `{refgenome_id}.rg.fa.gz` - reference genome compressed with bgzip.
  - `{refgenome_id}.rg.fa.gz.fai` - samtools index of the reference genome.
  - `{refgenome_id}.rg.fa.gz.bwt` - bwa index of the reference genome.
- `virtual_digest/`:
  - `{enzyme}_{refgenome_id}.vd.fragments.parquet` - a table containing the intervals generated by the virtual digest.
  - `{enzyme}_{refgenome_id}.vd.digest_stats.csv` - virtual digest aggregate statistics.
- `basecall/`:
  - `{enzyme}_{run_id}.rd.{batch_id}.fq.gz` - basecalls that have passed filtering, split into batches of 50,000 (can be changed in config).
  - `{enzyme}_{run_id}.rd.catalog.yaml` - an intake catalog containing read metadata.
  - `{enzyme}_{run_id}.rd.read_metadata.parquet` - a table of per-read statistics.
  - `{enzyme}_{run_id}.rd.summary.csv` - a table of aggregate statistics for the reads.
- `mapping/`:
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}.coord_sort.bam` - bam alignment file sorted by genome coordinate, with an alignment index added to the query name.
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.coord_sort.bam` - a whatshap-produced file mapping alignments to phase sets (a dummy file is produced if unphased).
- `align_table/`:
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.at.alignment.parquet` - a parquet file with alignment information extracted from the corresponding bam file.
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.at.pore_c.parquet` - a parquet file with the same information as the alignment parquet, plus additional data on fragment assignments and the pass/fail status of each alignment.
- `contacts/`:
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.contacts.parquet` - a table derived from the pore_c.parquet file, consisting of all pairwise contacts (equivalent to a .pairs file).
- `merged_contacts/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.contacts.parquet` - a merged version of the contacts files for a run (see the loading sketch after this list).
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.concatemers.parquet` - a table with per-read (aka concatemer) statistics.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.concatemer_summary.csv` - a table with per-run statistics.
- `matrix/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.catalog.yaml` - an intake catalog containing metadata about the aggregate matrix.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.coo.csv.gz` - aggregate read counts in the format 'bin1_id,bin2_id,count', suitable for use with `cooler load`; the bin width for this file is set by the *base* matrix resolution in the config file.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.cool` - the aggregate contact counts in cooler format.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.counts.mcool` - a multi-resolution cool file.
- `pairs/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.pairs.pairs.gz` - fragment positions and fragment pairs in pairs format.
- `assembly/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.salsa2.bed` - [optional] a bed file compatible with the salsa2 scaffolding tool.
- `juicebox/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.hicRef` - [optional] a restriction-site format file.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.hic` - [optional] a hic medium-format file of pairwise contacts.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.mnd.txt` - [optional] a merged_no_dups format file (experimental).
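The parquet, intake and cooler outputs are easiest to explore from Python. The sketch below is illustrative rather than part of the pipeline: the file names are placeholders, and it assumes pandas, intake and cooler are installed (all available via conda).

```python
import pandas as pd
import cooler
import intake

# Load the merged pairwise-contact table (placeholder file name)
contacts = pd.read_parquet("merged_contacts/NlaIII_run01_refgenome1_unphased.contacts.parquet")
print(contacts.head())

# Per-read (concatemer) statistics sit alongside it
concatemers = pd.read_parquet("merged_contacts/NlaIII_run01_refgenome1_unphased.concatemers.parquet")
print(concatemers.describe())

# Intake catalogs describe the read/matrix tables; open_catalog is intake's standard entry point
cat = intake.open_catalog("basecall/NlaIII_run01.rd.catalog.yaml")
print(list(cat))

# Multi-resolution cool files are addressed as <file>::/resolutions/<bin size>;
# the 1000 bp resolution here is an example and must actually exist in the file
clr = cooler.Cooler("matrix/NlaIII_run01_refgenome1_unphased.matrix.counts.mcool::/resolutions/1000")
print(clr.chromnames)
```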
Code Snippets
```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} cooler {params.prefix} "
    "--fragment-table {input.fragments} --chromsizes {input.chromsizes} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} cooler {params.prefix} --by-haplotype "
    "--fragment-table {input.fragments} --chromsizes {input.chromsizes} 2>{log}"
```

```
shell:
    "cooler zoomify -n {threads} -r {params.resolutions} -o {output} {input} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} pairs {params.prefix} --chromsizes {input.chromsizes} 2>{log}"
```

```
shell:
    "(pairtools sort {input} --nproc {threads} > {params.uncomp_file} && bgzip {params.uncomp_file} -) 2>{log}"
```

```
shell:
    "pairix {input} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} salsa_bed {params.prefix} 2>{log}"
```

```
shell:
    "wget -O - {params.url} > {output} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "refgenome fragments-to-hicref {input} {output} 2>{log}"
```

```
shell:
    "java -Xmx2g -jar {input.tools} pre {input.pairs} {output} {input.chromsizes} -f {input.hicref} &>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} merged_no_dups {params.prefix} "
    "--reference-fasta {input.refgenome} 2>{log}"
```

```
shell:
    "( bwa {params.cli_opts} -t {threads} "
    "{input.refgenome} {input.fastq} "
    "| pore_c alignments reformat-bam - - "
    "| samtools sort -O bam -m {params.memory} -@ {params.sort_threads} -o {output.bam} - ) 2>{log} ;"
    "samtools index {output.bam} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments create-table {input.bam} {output.pq} --alignment-haplotypes {input.alignment_haplotypes} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments assign-fragments {input.align_table} {input.fragments_table} {output} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments to-contacts {input} {output.contacts} 2>{log}"
```
```
run:
    # Write the list of input files to a file-of-file-names (fofn) for the merge step
    with open(output[0], "w") as fh:
        fh.write("{}\n".format("\n".join(input)))
```
```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts merge {input} {output} --fofn"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts summarize {input.contacts} {input.read_summary} {output.pq} {output.csv} "
    "--user-metadata '{params.metadata}' 2>{log}"
```

```
shell:
    """
    mkdir -p {params.bindir} ;
    curl -L {params.f5curl} \
    | tar -zxf - ;
    mv f5c-{params.version}/f5c_x86_64_linux* {params.bindir};
    """
```

```
shell:
    "( pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments filter-bam {input.bam} {input.pore_c_table} {output.filtered_bam} "
    "--clean-read-name ;"
    " samtools index {output.filtered_bam} ; ) 2>{log}"
```

```
shell:
    """
    {input.binary} index -d {input.fast5} -s {input.summary} {input.fastq} 2>{log}
    """
```

```
shell:
    "touch {output}"
```

```
shell:
    """
    {input.binary} call-methylation -t {threads} \
        {params.cli_opts} -B 6.0M -K 550 --iop 20 \
        -b {input.bam} -g {input.reference} -r {input.fastq} \
        > {output.per_read} 2>{log}
    """
```

```
run:
    # Concatenate the per-batch tables, keeping the header from the first file only
    shell("head -n 1 {input[0]} > {output} ")
    for f in input:
        shell("tail -n+2 {f} >> {output}")
```

```
shell:
    """
    {input.binary} meth-freq -s -i {input.per_read} > {output.split_pos} 2>{log}
    """
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "reads prepare {params.fname} {params.prefix} --max-read-length {config[max_read_length]} "
    "--batch-size {config[reads_per_batch]} 2> {log}"
```

```
run:
    expand_tilde = os.path.expanduser(params.fname)
    path_is_absolute = os.path.isabs(expand_tilde)
    # cp -rs requires an absolute path for the source
    if path_is_absolute:
        shell("cp -rs {params.fname} {output}")
    else:
        shell('cp -rs "$(pwd)/{params.fname}" {output}')
```

```
shell:
    """
    cp {params.fname} {output}
    """
```
```
run:
    def fast5_are_not_split(directory):
        # True if no fast5_pass/fail/skip subfolders exist under the directory
        folders = []
        for split_type in ["*fast5_pass*", "*fast5_fail*", "*fast5_skip*"]:
            folders.extend(list(Path(directory).rglob(split_type)))
        return len(folders) == 0

    seq_sum = pd.read_table(input.summary)
    print(seq_sum.head(1).T)
    if "filename_fast5" in seq_sum.columns:
        fast5_column = "filename_fast5"
    elif "filename" in seq_sum.columns:
        fast5_column = "filename"
    else:
        raise ValueError("Neither 'filename' nor 'filename_fast5' columns found in summary file")

    example_fn = seq_sum[fast5_column][0]
    if any(tag in example_fn for tag in ["_pass_", "_fail_", "_skip_"]):
        print("Pass/fail files already uniquely named; No adjustments needed")
        shell("cp {input.summary} {output.modified_summary}")
    elif fast5_are_not_split(input.fast5):
        print("Fast5 files were not split into pass/fail folders; No adjustments needed")
        shell("cp {input.summary} {output.modified_summary}")
    else:
        # Rewrite the summary so pass/fail fast5 files are uniquely named
        seq_sum.loc[seq_sum.passes_filtering == True, fast5_column] = seq_sum[fast5_column].replace(
            r"\.fast5$", ".pass.fast5", regex=True
        )
        seq_sum.loc[seq_sum.passes_filtering == False, fast5_column] = seq_sum[fast5_column].replace(
            r"\.fast5$", ".fail.fast5", regex=True
        )
        seq_sum.to_csv(output.modified_summary, sep="\t", index=False)

        def adjust_filenames_command(passes_filtering):
            # Rename the fast5 symlinks on disk to match the adjusted summary
            command = (
                "find " + input.fast5 + " -type l -name '*.fast5' "
                + "| grep fast5_{status} "
                + "| grep -v -e '\.{status}\.fast5$' -e '_{status}_' "
                + "| xargs rename 's!\.fast5$!\.{status}\.fast5!' ;"
            ).format(status=passes_filtering)
            return command

        shell(adjust_filenames_command("pass"))
        shell(adjust_filenames_command("fail"))
```
```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "refgenome prepare {params.fasta} {params.prefix} --genome-id {wildcards.refgenome_id} 2> {log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "refgenome virtual-digest {input} {wildcards.enzyme} {params.prefix} -n {threads} 2> {log}"
```

```
shell:
    "bwa index {input} 2>{log}"
```
Support
- Future updates