Pore-C Snakemake Workflow for Processing and Analysis of Nanopore Sequencing Data
1. Introduction
Overview:
This pipeline manages a Pore-C workflow, starting from raw fastq files and converting them to standard file formats for use by downstream tools. The steps involved are:

- Pre-processing a reference genome or draft assembly to generate auxiliary files used in downstream analyses
- Creating virtual digests of the genome
- Filtering the raw reads to remove any that might break downstream tools
- Aligning the reads against a reference genome
- Processing the alignment results to filter spurious alignments, detect ligation junctions and assign fragments. The results are stored in a parquet table for downstream processing.
- Converting the results to other formats: a multi-resolution cooler (`.mcool`) file, a pairix-indexed pairs file, a juicebox-compatible `.hic` file, a salsa2-compatible `.bed` file, and an experimental `.mnd.txt` file for 3d-dna (see "Workflow targets" below)
2. Getting started
In most cases, it is best to pre-install conda before starting. All other dependencies will be installed automatically when running the pipeline for the first time.
Requirements:
This pipeline requires a computer running Linux (Ubuntu 16); >64 GB of memory is recommended. The pipeline has been tested on minimal server installs.
Most software dependencies are managed using conda. To install conda, install miniconda3 and refer to its installation instructions. You will need to accept the license agreement during installation, and we recommend that you allow the conda installer to prepend its path to your `.bashrc` file when asked.
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Check that conda has installed successfully:

```
conda -h
```

If conda has installed correctly, you should see the following output. If you do not, you may need to close and reopen your terminal.
```
$ conda
usage: conda [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:
  command
    clean        Remove unused packages and caches.
    config       Modify configuration values in .condarc. This is modeled
                 after the git config command. Writes to the user .condarc
                 file ($HOME/.condarc) by default.
    create       Create a new conda environment from a list of specified
                 packages.
    ..............
```
Installation:
Clone this git repository to the location where you want to run your analysis, and create the conda environment that will be used to run the pipeline:

```
git clone https://github.com/nanoporetech/Pore-C-Snakemake.git
cd Pore-C-Snakemake
# Creates the environment; the dependencies will install automatically
conda env create
conda activate pore_c_snakemake
```
Note: before you run any of the snakemake commands below, make sure that you've run `conda activate pore_c_snakemake`.
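A quick way to confirm that the environment is active is to check that `snakemake` itself resolves; this check is only a suggestion and not part of the pipeline:

```
conda activate pore_c_snakemake
snakemake --version   # prints the installed Snakemake version if the environment is active
```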
3. Usage
Testing:
Test data is included in the `.test` subfolder (git-lfs is required to download it). To run the tests use:

```
snakemake --use-conda test -j 4 --config=output_dir=results.test
```

The results of the test run will appear in the `results.test` directory.
Configure workflow:
The pipeline configuration is split across several files:
* `config/config.yaml` - A yaml file containing settings for the pipeline. Input data is specified in the following tab-delimited files.
* `config/basecall.tsv` - Metadata and locations of the pore-c sequencing run fastqs.
* `config/references.tsv` - Locations of the draft/scaffold/reference assemblies that the pore-c reads will be mapped to.
* `config/phased_vcfs.tsv` - [Optional] The locations of phased vcf files that can be used to haplotag Pore-C reads.
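For orientation only, a `config/basecall.tsv` might look something like the sketch below. The column names here are hypothetical; the authoritative schema is given by the template files shipped in the repository's `config/` directory.

```
# Hypothetical columns - consult the config/basecall.tsv template for the real schema
run_id    enzyme    refgenome_id    fastq_path
run01     NlaIII    refgenome1      /data/run01/basecalls.fastq.gz
```

(The real file is tab-delimited; the columns above are spaced out for readability.)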
Execute workflow:
Test your configuration by performing a dry-run via

```
snakemake --use-conda -n
```

Execute the workflow locally via

```
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```
snakemake --use-conda --cluster qsub --jobs 100
```

or

```
snakemake --use-conda --drmaa --jobs 100
```

in combination with any of the modes above. See the Snakemake documentation for further details.
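If your scheduler needs explicit resource requests, they can be embedded in the `--cluster` string; Snakemake substitutes per-rule values such as `{threads}` into it. The flags below are an SGE-style sketch, and the `-pe smp` parallel environment name is an assumption to adapt to your site:

```
snakemake --use-conda --jobs 100 \
    --cluster "qsub -cwd -V -pe smp {threads}"
```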
Workflow targets
The pipeline defines several targets that can be specified on the command line:

- all: The default target, which builds the pore_c contact and concatemer parquet files under the `merged_contacts` directory.
- cooler: Builds a multi-resolution `.mcool` file.
- pairs: Builds a pairix-indexed pairs file.
- juicer: Builds a `.hic` file compatible with the juicebox suite of tools.
- salsa: Builds a `.bed` file for use with the salsa2 scaffolding tool.
- mnd: Builds a `.mnd.txt` file compatible with the 3d-dna scaffolding tool [experimental].

To build the files for a particular target:

```
snakemake --use-conda -j 8 <target>
```
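Several targets can be combined in a single invocation; for example, to build both the cooler and pairs outputs:

```
snakemake --use-conda -j 8 cooler pairs
```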
4. Output files
Once the pipeline has run successfully, you should expect the following files in the output directory:

- `refgenome/`:
  - `{refgenome_id}.rg.metadata.csv` - chromosome metadata in csv format.
  - `{refgenome_id}.rg.chromsizes` - reference genome chromosome lengths.
  - `{refgenome_id}.rg.fa.gz` - reference genome compressed with bgzip.
  - `{refgenome_id}.rg.fa.gz.fai` - samtools index of the reference genome.
  - `{refgenome_id}.rg.fa.gz.bwt` - bwa index of the reference genome.
- `virtual_digest/`:
  - `{enzyme}_{refgenome_id}.vd.fragments.parquet` - a table containing the intervals generated by the virtual digest.
  - `{enzyme}_{refgenome_id}.vd.digest_stats.csv` - virtual digest aggregate statistics.
- `basecall/`:
  - `{enzyme}_{run_id}.rd.{batch_id}.fq.gz` - basecalls that have passed filtering, split into batches of 50,000 (can be changed in config).
  - `{enzyme}_{run_id}.rd.catalog.yaml` - an intake catalog containing read metadata.
  - `{enzyme}_{run_id}.rd.read_metadata.parquet` - a table of per-read statistics.
  - `{enzyme}_{run_id}.rd.summary.csv` - a table of aggregate statistics for the reads.
- `mapping/`:
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}.coord_sort.bam` - bam alignment file sorted by genome coordinate, with an alignment index added to the query name.
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.coord_sort.bam` - a whatshap-produced file mapping alignments to phase sets (a dummy file is produced if unphased).
- `align_table/`:
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.at.alignment.parquet` - a parquet file with alignment information extracted from the corresponding bam file.
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.at.pore_c.parquet` - a parquet file with the same information as the alignment parquet, plus additional data on fragment assignments and the pass/fail status of each alignment.
- `contacts/`:
  - `{enzyme}_{run_id}_{batch_id}_{refgenome_id}_{phase_set_id}.contacts.parquet` - a table derived from the pore_c.parquet file, consisting of all pairwise contacts (equivalent to a .pairs file).
- `merged_contacts/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.contacts.parquet` - a merged version of the contacts files for a run (see the loading sketch after this list).
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.concatemers.parquet` - a table with per-read (aka concatemer) statistics.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.concatemer_summary.csv` - a table with per-run statistics.
- `matrix/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.catalog.yaml` - an intake catalog containing metadata about the aggregate matrix.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.coo.csv.gz` - aggregate read counts in the format 'bin1_id,bin2_id,count', suitable for use with `cooler load`; the bin width for this file is set by the *base* matrix resolution in the config file.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.cool` - the aggregate contact counts in cooler format.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.matrix.counts.mcool` - a multi-resolution cool file.
- `pairs/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.pairs.pairs.gz` - fragment positions and fragment pairs in pairs format.
- `assembly/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.salsa2.bed` - [optional] a bed file compatible with the salsa2 scaffolding tool.
- `juicebox/`:
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.hicRef` - [optional] a restriction-site format file.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.hic` - [optional] a hic medium-format file of pairwise contacts.
  - `{enzyme}_{run_id}_{refgenome_id}_{phase_set_id}.mnd.txt` - [optional] a merged_no_dups format file (experimental).
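The parquet, intake and cooler outputs are easiest to explore from Python. The sketch below is illustrative rather than part of the pipeline: the file names are placeholders, and it assumes pandas, intake and cooler are installed (all available via conda).

```python
import pandas as pd
import cooler
import intake

# Load the merged pairwise-contact table (placeholder file name)
contacts = pd.read_parquet("merged_contacts/NlaIII_run01_refgenome1_unphased.contacts.parquet")
print(contacts.head())

# Per-read (concatemer) statistics sit alongside it
concatemers = pd.read_parquet("merged_contacts/NlaIII_run01_refgenome1_unphased.concatemers.parquet")
print(concatemers.describe())

# Intake catalogs describe the read/matrix tables; open_catalog is intake's standard entry point
cat = intake.open_catalog("basecall/NlaIII_run01.rd.catalog.yaml")
print(list(cat))

# Multi-resolution cool files are addressed as <file>::/resolutions/<bin size>;
# the 1000 bp resolution here is an example and must actually exist in the file
clr = cooler.Cooler("matrix/NlaIII_run01_refgenome1_unphased.matrix.counts.mcool::/resolutions/1000")
print(clr.chromnames)
```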
Code Snippets
```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} cooler {params.prefix} "
    "--fragment-table {input.fragments} --chromsizes {input.chromsizes} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} cooler {params.prefix} --by-haplotype "
    "--fragment-table {input.fragments} --chromsizes {input.chromsizes} 2>{log}"
```

```
shell:
    "cooler zoomify -n {threads} -r {params.resolutions} -o {output} {input} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} pairs {params.prefix} --chromsizes {input.chromsizes} 2>{log}"
```

```
shell:
    "(pairtools sort {input} --nproc {threads} > {params.uncomp_file} && bgzip {params.uncomp_file} -) 2>{log}"
```

```
shell:
    "pairix {input} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} salsa_bed {params.prefix} 2>{log}"
```

```
shell:
    "wget -O - {params.url} > {output} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "refgenome fragments-to-hicref {input} {output} 2>{log}"
```

```
shell:
    "java -Xmx2g -jar {input.tools} pre {input.pairs} {output} {input.chromsizes} -f {input.hicref} &>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts export {input.contacts} merged_no_dups {params.prefix} "
    "--reference-fasta {input.refgenome} 2>{log}"
```

```
shell:
    "( bwa {params.cli_opts} -t {threads} "
    "{input.refgenome} {input.fastq} "
    "| pore_c alignments reformat-bam - - "
    "| samtools sort -O bam -m {params.memory} -@ {params.sort_threads} -o {output.bam} - ) 2>{log} ;"
    "samtools index {output.bam} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments create-table {input.bam} {output.pq} --alignment-haplotypes {input.alignment_haplotypes} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments assign-fragments {input.align_table} {input.fragments_table} {output} 2>{log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments to-contacts {input} {output.contacts} 2>{log}"
```
```
run:
    # Write the list of input files to a file-of-file-names (fofn) for the merge step
    with open(output[0], "w") as fh:
        fh.write("{}\n".format("\n".join(input)))
```
```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts merge {input} {output} --fofn"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "contacts summarize {input.contacts} {input.read_summary} {output.pq} {output.csv} "
    "--user-metadata '{params.metadata}' 2>{log}"
```

```
shell:
    """
    mkdir -p {params.bindir} ;
    curl -L {params.f5curl} \
    | tar -zxf - ;
    mv f5c-{params.version}/f5c_x86_64_linux* {params.bindir};
    """
```

```
shell:
    "( pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "alignments filter-bam {input.bam} {input.pore_c_table} {output.filtered_bam} "
    "--clean-read-name ;"
    " samtools index {output.filtered_bam} ; ) 2>{log}"
```

```
shell:
    """
    {input.binary} index -d {input.fast5} -s {input.summary} {input.fastq} 2>{log}
    """
```

```
shell:
    "touch {output}"
```

```
shell:
    """
    {input.binary} call-methylation -t {threads} \
        {params.cli_opts} -B 6.0M -K 550 --iop 20 \
        -b {input.bam} -g {input.reference} -r {input.fastq} \
        > {output.per_read} 2>{log}
    """
```

```
run:
    # Concatenate the per-batch tables, keeping the header from the first file only
    shell("head -n 1 {input[0]} > {output} ")
    for f in input:
        shell("tail -n+2 {f} >> {output}")
```

```
shell:
    """
    {input.binary} meth-freq -s -i {input.per_read} > {output.split_pos} 2>{log}
    """
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "reads prepare {params.fname} {params.prefix} --max-read-length {config[max_read_length]} "
    "--batch-size {config[reads_per_batch]} 2> {log}"
```

```
run:
    expand_tilde = os.path.expanduser(params.fname)
    path_is_absolute = os.path.isabs(expand_tilde)
    # cp -rs requires an absolute path for the source
    if path_is_absolute:
        shell("cp -rs {params.fname} {output}")
    else:
        shell('cp -rs "$(pwd)/{params.fname}" {output}')
```

```
shell:
    """
    cp {params.fname} {output}
    """
```
```
run:
    def fast5_are_not_split(directory):
        # True if no fast5_pass/fail/skip subfolders exist under the directory
        folders = []
        for split_type in ["*fast5_pass*", "*fast5_fail*", "*fast5_skip*"]:
            folders.extend(list(Path(directory).rglob(split_type)))
        return len(folders) == 0

    seq_sum = pd.read_table(input.summary)
    print(seq_sum.head(1).T)
    if "filename_fast5" in seq_sum.columns:
        fast5_column = "filename_fast5"
    elif "filename" in seq_sum.columns:
        fast5_column = "filename"
    else:
        raise ValueError("Neither 'filename' nor 'filename_fast5' columns found in summary file")

    example_fn = seq_sum[fast5_column][0]
    if any(tag in example_fn for tag in ["_pass_", "_fail_", "_skip_"]):
        print("Pass/fail files already uniquely named; No adjustments needed")
        shell("cp {input.summary} {output.modified_summary}")
    elif fast5_are_not_split(input.fast5):
        print("Fast5 files were not split into pass/fail folders; No adjustments needed")
        shell("cp {input.summary} {output.modified_summary}")
    else:
        # Rewrite the summary so pass/fail fast5 files are uniquely named
        seq_sum.loc[seq_sum.passes_filtering == True, fast5_column] = seq_sum[fast5_column].replace(
            r"\.fast5$", ".pass.fast5", regex=True
        )
        seq_sum.loc[seq_sum.passes_filtering == False, fast5_column] = seq_sum[fast5_column].replace(
            r"\.fast5$", ".fail.fast5", regex=True
        )
        seq_sum.to_csv(output.modified_summary, sep="\t", index=False)

        def adjust_filenames_command(passes_filtering):
            # Rename the fast5 symlinks on disk to match the adjusted summary
            command = (
                "find " + input.fast5 + " -type l -name '*.fast5' "
                + "| grep fast5_{status} "
                + "| grep -v -e '\.{status}\.fast5$' -e '_{status}_' "
                + "| xargs rename 's!\.fast5$!\.{status}\.fast5!' ;"
            ).format(status=passes_filtering)
            return command

        shell(adjust_filenames_command("pass"))
        shell(adjust_filenames_command("fail"))
```
```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "refgenome prepare {params.fasta} {params.prefix} --genome-id {wildcards.refgenome_id} 2> {log}"
```

```
shell:
    "pore_c {DASK_SETTINGS} --dask-num-workers {threads} "
    "refgenome virtual-digest {input} {wildcards.enzyme} {params.prefix} -n {threads} 2> {log}"
```

```
shell:
    "bwa index {input} 2>{log}"
```
Support
- Future updates