This is the git repository for AMBIC Epigenomics project with Betenbaugh and Timp lab at JHU.
Use this repository for processing the raw data prior to analysis.
We are working on uploading scripts that are being used for our analysis of the data.
Getting Started
Prerequisites
The list of packages necessary for nanopore analysis :
R packages :
The list of packages necessary for ATAC-seq analysis :
Installing
All of these tools are available via conda (environment.yml) or via the website links.
To install via conda,
conda env update -n base --file ./environment.yml # to install to the base environment
conda env create -f ./environment.yml # to create a new environment
Snakemake pipeline
We are using
snakemake
to generate reproducible and easy-to-follow pipelines.
You can either install snakemake and follow the steps to run the pipeline or directly look at the snakemake scripts to determine what commands to perform.
Note : The snakemake workflow may not work completely as we have tested the workflow in limited settings. You may have to make edits to the code to make it work.
Preparing snakemake
To use snakemake, first modify the snakemake_config.yml file as appropriate for your environment.
Secondly, modify the nanopore_sample_info.csv and atacseq_sample_info.csv as necessary.
For nanopore sequencing data, we will start from basecalled data;
set up the data as follows :
-
fastq name : [sample].fastq.gz
-
summary name (optional) : [sample].summary.txt
-
move fastq and summary files to [workdir]/reads
-
fast5 in [workdir]/reads/[sample]/
For ATAC-seq data :
-
fastq name : [sample].fastq.gz
-
move fastq files into [workdir]/data/atacseq/fastq/
Running snakemake
To run snakemake, use the snakemake command in the root directory of this repo.
snakemake --cores 16
For running specific parts of the pipeline, any of the following (currently broken):
snakemake --cores 16 parse_nanopore
snakemake --cores 16 nanopore_methylation
snakemake --cores 16 nanopore_sv
snakemake --cores 16 parse_atacseq
For specific outputs,
snakemake --cores 16 path/to/output
e.g. :
snakemake --cores 16 methylation/[sample].cpg.meth.tsv.gz
will perform methylation calling on [sample] and any steps before that automatically, using 16 cores.
Code Snippets
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | import sys import random import argparse def parse_args(): ''' Gives options ''' parser = argparse.ArgumentParser(description='Saves reads below a alignment threshold and discards all others') parser.add_argument('-k', help='Alignment number cutoff') parser.add_argument('--paired-end', dest='paired_ended', action='store_true', help='Data is paired-end') args = parser.parse_args() alignment_cutoff = int(args.k) paired_ended = args.paired_ended return alignment_cutoff, paired_ended if __name__ == "__main__": ''' Runs the filtering step of choosing multimapped reads ''' [alignment_cutoff, paired_ended] = parse_args() if paired_ended: alignment_cutoff = int(alignment_cutoff) * 2 # Store each line in sam file as a list of reads, # where each read is a list of elements to easily # modify or grab things current_reads = [] current_qname = '' for line in sys.stdin: read_elems = line.strip().split('\t') if read_elems[0].startswith('@'): sys.stdout.write(line) continue # Keep taking lines that have the same qname if read_elems[0] == current_qname: # Add line to current reads current_reads.append(line) pass else: # Discard if there are more than the alignment cutoff if len(current_reads) >= alignment_cutoff: current_reads = [line] current_qname = read_elems[0] elif len(current_reads) > 0: # Just output all reads, which are then filtered with # samtools for read in current_reads: sys.stdout.write(str(read)) # And then discard current_reads = [line] current_qname = read_elems[0] else: # First read in file current_reads.append(line) current_qname = read_elems[0] |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | import os import math import sys import argparse import gzip import numpy as np from collections import namedtuple import pysam import re import multiprocessing as mp import time from multiprocess_utils import listener,init_mp,close_mp start_time = time.time() def parseArgs() : # dir of source code srcpath=sys.argv[0] srcdir=os.path.dirname(os.path.abspath(srcpath)) # parser parser = argparse.ArgumentParser(description='make wig file') parser.add_argument('-v','--verbose', action='store_true',default=False, help="verbose output") parser.add_argument('-i','--input',type=str,required=False,default="stdin", help="input file (methylation freq, default stdin)") parser.add_argument('-o','--output',type=argparse.FileType('w'),required=False, default=sys.stdout,help="output file (default stout)") parser.add_argument('-t','--threshold',type=int,required=False, default=3,help="coverage thresdhold (default 3)") parser.add_argument('-c','--coverage',type=str,required=False, help="output path for coverage file (optional)") args = parser.parse_args() args.srcdir=srcdir return args if __name__=="__main__": args=parseArgs() if args.coverage : covout = open(args.coverage,'w') print("track type=wiggle_0",file=covout) def printfxn(prechrom,chrom,pos,meth,cov,out,covout) : if chrom != prechrom : prechrom = fields[0] print("variableStep chrom={}".format(chrom),file=out) print("variableStep chrom={}".format(chrom),file=covout) print("{}\t{}".format(pos,meth/cov),file=out) print("{}\t{}".format(pos,cov),file=covout) return prechrom else : covout = None def printfxn(prechrom,chrom,pos,meth,cov,out,covout) : if chrom != prechrom : prechrom = fields[0] print("variableStep chrom={}".format(chrom),file=out) print("{}\t{}".format(pos,meth/cov),file=out) return prechrom if args.input == "stdin" : if args.verbose : print("reading from stdin",file=sys.stderr) in_fh = sys.stdin elif args.input.split(".")[-1] == "gz" : if args.verbose : print("reading gzipped input {}".format(args.input),file=sys.stderr) in_fh = gzip.open(args.input,'rt') else : if args.verbose : print("reading {}".format(args.input),file=sys.stderr) in_fh = open(args.input) i = 0 chrom="" # header print("track type=wiggle_0",file=args.output) for line in in_fh : fields = line.strip().split("\t") meth = int(fields[3]) unmeth = int(fields[4]) cov = meth+unmeth if cov < args.threshold : continue chrom = printfxn(chrom,fields[0],fields[1],meth,cov,args.output,covout) i += 1 if args.verbose : if i % 100000 == 0 : print("processed {} lines".format(i),file=sys.stderr) in_fh.close() |
13 14 15 | shell: "trim_galore {input} " "-o {wildcards.dir}/fastq_trimmed &> {log}" |
28 29 | shell: "bowtie2-build {input} {params} --threads {threads} &> {log}" |
44 45 46 47 48 | shell: "bowtie2 --very-sensitive -p {threads} -t --local " "-x {params} -U {input.fq} 2> {log} | " "samtools view -Sb - | " "samtools sort -o {output} -" |
59 60 61 62 63 64 65 | shell: "picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} " "-Djava.io.tmpdir={wildcards.dir}/tmp " "MarkDuplicates INPUT={input} OUTPUT={output} " "METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt " "VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true " "REMOVE_DUPLICATES=false &> {log}" |
72 73 74 | shell: "samtools view -F 1804 -b {input} > {output} && " "samtools index {output}" |
81 82 83 84 85 | shell: "bedtools bamtobed -i {input} | " "awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | " "sort -k1,1 -k2,2n -T {wildcards.dir}/tmp | " "bgzip > {output}" |
93 94 95 96 | shell: "gunzip -c {input} | " "sort -k1,1 -k2,2n -T {wildcards.dir}/tmp | " "bgzip > {output}" |
107 108 109 110 111 112 113 | shell: "macs2 callpeak -t {input} -f BED " "-n {wildcards.dir}/peaks/allsamples " "-g mm -p 0.01 --shift -75 --extsize 150 " "--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && " "{params}/scripts/atacseq_bed_to_saf.sh " "{wildcards.dir}/peaks/allsamples_peaks.narrowPeak > {output}" |
126 127 128 | shell: "featureCounts --verbose -a {input.peaks} " "-o {output} -F SAF -T {threads} {input.reads} &> {log}" |
14 15 16 | shell: "trim_galore {input} " "-o {wildcards.dir}/fastq_trimmed &> {log}" |
29 30 | shell: "bowtie2-build {input} {params} --threads {threads} &> {log}" |
44 45 46 47 | shell: "bowtie2 -k 4 -p {threads} -t --local " "-x {params} -U {input.fq} 2> {log} | " "samtools view -Sb - > {output}" |
56 57 58 59 60 61 62 63 64 | shell: "samtools sort -n -T " "{wildcards.dir}/bam/{wildcards.sample}.qsorting {input} | " "samtools view -h - | " "python {params}/scripts/assign_multimappers.py -k 4 | " "samtools view -F 1804 -Su - | " "samtools sort -T " "{wildcards.dir}/bam/{wildcards.sample}.mpsorting " "-o {output}" |
75 76 77 78 79 80 81 | shell: "picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} " "-Djava.io.tmpdir={wildcards.dir}/tmp " "MarkDuplicates INPUT={input} OUTPUT={output} " "METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt " "VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true " "REMOVE_DUPLICATES=false &> {log}" |
88 89 90 | shell: "samtools view -F 1804 -b {input} > {output} && " "samtools index {output}" |
99 100 101 102 103 104 | shell: "[ -e {params} ]||mkdir {params} && " "bedtools bamtobed -i {input} | " "awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | " "sort -k1,1 -k2,2n -T {params} | " "bgzip > {output}" |
113 114 115 116 117 118 | shell: "macs2 callpeak -t {input} -f BED " "-n {wildcards.dir}/peaks/{wildcards.sample} " "-g mm -p 0.01 --shift -75 --extsize 150 " "--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && " "touch {output}" |
130 131 132 133 134 135 136 | shell: "[ -e {params.tmpdir} ]||mkdir {params.tmpdir} && cat {input} | " "sort -u -k1,1 -k2,2n -k3,3n -T {params.tmpdir} | " "bedtools merge -i stdin | " "awk 'OFS=\"\t\"{{ print $0,\"peak\"NR }}' > {output.peaks} && " "{params.codedir}/scripts/atacseq_bed_to_saf.sh " "{output.peaks} > {output.saf}" |
149 150 151 | shell: "featureCounts --verbose -a {input.peaks} -O " "-o {output} -F SAF -T {threads} {input.reads} &> {log}" |
14 15 16 | shell: "trim_galore {input} " "-o {wildcards.dir}/fastq_trimmed &> {log}" |
29 30 | shell: "bowtie2-build {input} {params} --threads {threads} &> {log}" |
44 45 46 47 | shell: "bowtie2 -k 4 -p {threads} -t --local " "-x {params} -U {input.fq} 2> {log} | " "samtools view -Sb - > {output}" |
56 57 58 59 60 61 62 63 64 | shell: "samtools sort -n -T " "{wildcards.dir}/bam/{wildcards.sample}.qsorting {input} | " "samtools view -h - | " "python {params}/scripts/assign_multimappers.py -k 4 | " "samtools view -F 1804 -Su - | " "samtools sort -T " "{wildcards.dir}/bam/{wildcards.sample}.mpsorting " "-o {output}" |
75 76 77 78 79 80 81 | shell: "picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} " "-Djava.io.tmpdir={wildcards.dir}/tmp " "MarkDuplicates INPUT={input} OUTPUT={output} " "METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt " "VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true " "REMOVE_DUPLICATES=false &> {log}" |
88 89 90 | shell: "samtools view -F 1804 -b {input} > {output} && " "samtools index {output}" |
99 100 101 102 103 104 | shell: "[ -e {params} ]||mkdir {params} && " "bedtools bamtobed -i {input} | " "awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | " "sort -k1,1 -k2,2n -T {params} | " "bgzip > {output}" |
114 115 116 117 118 | shell: "[ -e {params} ]||mkdir {params} && " "gunzip -c {input} | " "sort -k1,1 -k2,2n -T {params} | " "bgzip > {output}" |
129 130 131 132 133 134 135 | shell: "macs2 callpeak -t {input} -f BED " "-n {wildcards.dir}/peaks/allsamples " "-g mm -p 0.01 --shift -75 --extsize 150 " "--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && " "{params}/scripts/atacseq_bed_to_saf.sh " "{wildcards.dir}/peaks/allsamples_peaks.narrowPeak > {output}" |
148 149 150 | shell: "featureCounts --verbose -a {input.peaks} -O " "-o {output} -F SAF -T {threads} {input.reads} &> {log}" |
22 23 24 25 26 27 28 | shell: "ngmlr -x ont --bam-fix " "-t {threads} -r {params} -q {input} 2> {log} | " "samtools view -q 20 -b - | " "samtools sort -T bam/{wildcards.sample}.sorting " "-o {output} && " "samtools index {output}" |
35 36 | shell: "minimap2 -x map-ont -d {output} {input}" |
47 48 49 50 51 52 | shell: "minimap2 --MD -L -t {threads} -a {input} 2> {log} | " "samtools view -q 20 -b - | " "samtools sort -T bam/{wildcards.sample}.sorting " "-o {output} && " "samtools index {output}" |
65 66 | shell: "nanopolish index --verbose -d {wildcards.pre} " |
81 82 83 84 | shell: "nanopolish call-methylation -v -t {threads} -q cpg " "-g {params} -r {input.fq} -b {input.bam} | " "gzip > {output}" |
94 95 96 97 98 | shell: "python {params}/nanopore-methylation-utilities/mtsv2bedGraph.py -g {input.fa} -i {input.meth} | " "sort -T tmp -k1,1 -k2,2n | " "bgzip > {output} && " "tabix -p bed {output}" |
110 111 112 113 114 | shell: "python {params}/nanopore-methylation-utilities/convert_bam_for_methylation.py " "-t {threads} -c {input.meth} -b {input.bam} | " "samtools sort -o {output} && " "samtools index {output}" |
125 126 127 128 129 | shell: "python -u {params}/nanopore-methylation-utilities/parseMethylbed.py frequency -v " "-i {input} -m {wildcards.mod} 2> {log} | " "bgzip > {output} && " "tabix -b 2 -e 2 {output}" |
140 141 142 | shell: "python {params}/scripts/makeWig.py -v -i {input} " "-o {output.methwig} -c {output.covwig} &> {output.log}" |
150 151 | shell: "wigToBigWig {input} {output}" |
167 168 169 170 | shell: "sniffles -m {input} -v {output} -t {threads} " "--tmp_file sv/{wildcards.sample}.sniffles.tmp " "-s 2 &> {log}" |
177 178 | shell: "bcftools sort -o {output} -T sv {input}" |
187 188 189 | shell: "echo {input} | tr \" \" \"\\n\" > {output.raw} && " "SURVIVOR merge {output.raw} 1000 1 1 -1 -1 -1 {output.vcf}" |
201 202 203 204 205 | shell: "sniffles -m {input.bam} -v {output} -t {threads} " "--Ivcf {input.candidates} --tmp_file " "sv/{wildcards.sample}.SURVIVORsniffles.tmp " "-s 2 -n -1 --genotype --cluster &> {log}" |
214 215 216 | shell: "echo {input} | tr \" \" \"\\n\" > {output.raw} && " "SURVIVOR merge {output.raw} 1000 -1 1 -1 -1 -1 {output.vcf}" |
Support
- Future updates
Related Workflows





