AMBIC Epigenomics project analysis

public 1yr ago 0 bookmarks

View Workflow

This is the git repository for AMBIC Epigenomics project with Betenbaugh and Timp lab at JHU.
Use this repository for processing the raw data prior to analysis.
We are working on uploading scripts that are being used for our analysis of the data.

Getting Started

Prerequisites

The list of packages necessary for nanopore analysis :

R packages :

The list of packages necessary for ATAC-seq analysis :

Installing

All of these tools are available via conda (environment.yml) or via the website links.
To install via conda,

conda env update -n base --file ./environment.yml # to install to the base environment
conda env create -f ./environment.yml # to create a new environment

Snakemake pipeline

We are using snakemake to generate reproducible and easy-to-follow pipelines.
You can either install snakemake and follow the steps to run the pipeline or directly look at the snakemake scripts to determine what commands to perform.
Note : The snakemake workflow may not work completely as we have tested the workflow in limited settings. You may have to make edits to the code to make it work.

Preparing snakemake

To use snakemake, first modify the snakemake_config.yml file as appropriate for your environment.

Secondly, modify the nanopore_sample_info.csv and atacseq_sample_info.csv as necessary.

For nanopore sequencing data, we will start from basecalled data;
set up the data as follows :

fastq name : [sample].fastq.gz
summary name (optional) : [sample].summary.txt
move fastq and summary files to [workdir]/reads
fast5 in [workdir]/reads/[sample]/

For ATAC-seq data :

fastq name : [sample].fastq.gz
move fastq files into [workdir]/data/atacseq/fastq/

Running snakemake

To run snakemake, use the snakemake command in the root directory of this repo.

snakemake --cores 16

For running specific parts of the pipeline, any of the following (currently broken):

snakemake --cores 16 parse_nanopore
snakemake --cores 16 nanopore_methylation
snakemake --cores 16 nanopore_sv
snakemake --cores 16 parse_atacseq

For specific outputs,

snakemake --cores 16 path/to/output

e.g. :

snakemake --cores 16 methylation/[sample].cpg.meth.tsv.gz

will perform methylation calling on [sample] and any steps before that automatically, using 16 cores.

Code Snippets

import sys
import random
import argparse

def parse_args():
    '''
    Gives options
    '''
    parser = argparse.ArgumentParser(description='Saves reads below a alignment threshold and discards all others')
    parser.add_argument('-k', help='Alignment number cutoff')
    parser.add_argument('--paired-end', dest='paired_ended', action='store_true', help='Data is paired-end')
    args = parser.parse_args()
    alignment_cutoff = int(args.k)
    paired_ended = args.paired_ended

    return alignment_cutoff, paired_ended


if __name__ == "__main__":
    '''
    Runs the filtering step of choosing multimapped reads
    '''

    [alignment_cutoff, paired_ended] = parse_args()

    if paired_ended:
        alignment_cutoff = int(alignment_cutoff) * 2

    # Store each line in sam file as a list of reads, 
    # where each read is a list of elements to easily 
    # modify or grab things
    current_reads = [] 
    current_qname = ''

    for line in sys.stdin:

        read_elems = line.strip().split('\t')

        if read_elems[0].startswith('@'):
            sys.stdout.write(line)
            continue

        # Keep taking lines that have the same qname
        if read_elems[0] == current_qname:
            # Add line to current reads
            current_reads.append(line)
            pass
        else:
            # Discard if there are more than the alignment cutoff
            if len(current_reads) >= alignment_cutoff:
                current_reads = [line]
                current_qname = read_elems[0]
            elif len(current_reads) > 0:
                # Just output all reads, which are then filtered with
                # samtools
                for read in current_reads:
                    sys.stdout.write(str(read))

                # And then discard
                current_reads = [line]
                current_qname = read_elems[0]
            else:
                # First read in file
                current_reads.append(line)
                current_qname = read_elems[0]

Python SAMtools From line 7 of scripts/assign_multimappers.py

import os
import math
import sys
import argparse
import gzip
import numpy as np
from collections import namedtuple
import pysam
import re
import multiprocessing as mp
import time
from multiprocess_utils import listener,init_mp,close_mp
start_time = time.time()

def parseArgs() :
    # dir of source code
    srcpath=sys.argv[0]
    srcdir=os.path.dirname(os.path.abspath(srcpath))
    # parser
    parser = argparse.ArgumentParser(description='make wig file')
    parser.add_argument('-v','--verbose', action='store_true',default=False,
            help="verbose output")
    parser.add_argument('-i','--input',type=str,required=False,default="stdin",
            help="input file (methylation freq, default stdin)")
    parser.add_argument('-o','--output',type=argparse.FileType('w'),required=False, 
            default=sys.stdout,help="output file (default stout)")
    parser.add_argument('-t','--threshold',type=int,required=False,
            default=3,help="coverage thresdhold (default 3)")
    parser.add_argument('-c','--coverage',type=str,required=False,
            help="output path for coverage file (optional)")
    args = parser.parse_args()
    args.srcdir=srcdir
    return args

if __name__=="__main__":
    args=parseArgs()
    if args.coverage :
        covout = open(args.coverage,'w')
        print("track type=wiggle_0",file=covout)
        def printfxn(prechrom,chrom,pos,meth,cov,out,covout) :
            if chrom != prechrom :
                prechrom = fields[0]
                print("variableStep chrom={}".format(chrom),file=out)
                print("variableStep chrom={}".format(chrom),file=covout)
            print("{}\t{}".format(pos,meth/cov),file=out)
            print("{}\t{}".format(pos,cov),file=covout)
            return prechrom
    else :
        covout = None
        def printfxn(prechrom,chrom,pos,meth,cov,out,covout) :
            if chrom != prechrom :
                prechrom = fields[0]
                print("variableStep chrom={}".format(chrom),file=out)
            print("{}\t{}".format(pos,meth/cov),file=out)
            return prechrom
    if args.input == "stdin" :
        if args.verbose : print("reading from stdin",file=sys.stderr)
        in_fh = sys.stdin
    elif args.input.split(".")[-1] == "gz" :
        if args.verbose : 
            print("reading gzipped input {}".format(args.input),file=sys.stderr)
        in_fh = gzip.open(args.input,'rt')
    else :
        if args.verbose : print("reading {}".format(args.input),file=sys.stderr)
        in_fh = open(args.input)
    i = 0
    chrom=""
    # header
    print("track type=wiggle_0",file=args.output)
    for line in in_fh :
        fields = line.strip().split("\t")
        meth = int(fields[3])
        unmeth = int(fields[4])
        cov = meth+unmeth
        if cov < args.threshold : continue
        chrom = printfxn(chrom,fields[0],fields[1],meth,cov,args.output,covout)
        i += 1
        if args.verbose : 
            if i % 100000 == 0 :
                print("processed {} lines".format(i),file=sys.stderr)

    in_fh.close()

Python numpy pysam From line 2 of scripts/makeWig.py

shell: 
	"trim_galore {input} "
	"-o {wildcards.dir}/fastq_trimmed &> {log}"

SnakeMake Trim_Galore From line 13 of snakemake/atacseq_data_parse_alt.smk

shell:
	"bowtie2-build {input} {params} --threads {threads} &> {log}"

SnakeMake Bowtie 2 From line 28 of snakemake/atacseq_data_parse_alt.smk

shell: 
	"bowtie2 --very-sensitive -p {threads} -t --local " 
	"-x {params} -U {input.fq} 2> {log} | " 
	"samtools view -Sb - | "
	"samtools sort -o {output} -"

SnakeMake SAMtools Bowtie 2 From line 44 of snakemake/atacseq_data_parse_alt.smk

shell:
	"picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} "
	"-Djava.io.tmpdir={wildcards.dir}/tmp "
	"MarkDuplicates INPUT={input} OUTPUT={output} "
	"METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt "
	"VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true "
	"REMOVE_DUPLICATES=false &> {log}"

SnakeMake Picard From line 59 of snakemake/atacseq_data_parse_alt.smk

shell:
	"samtools view -F 1804 -b {input} > {output} && "
	"samtools index {output}"

SnakeMake SAMtools From line 72 of snakemake/atacseq_data_parse_alt.smk

shell:
	"bedtools bamtobed -i {input} | "
	"awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | "
	"sort -k1,1 -k2,2n -T {wildcards.dir}/tmp | "
	"bgzip > {output}"

SnakeMake BEDTools From line 81 of snakemake/atacseq_data_parse_alt.smk

shell:
	"gunzip -c {input} | "
	"sort -k1,1 -k2,2n -T {wildcards.dir}/tmp | "
	"bgzip > {output}"

SnakeMake From line 93 of snakemake/atacseq_data_parse_alt.smk

shell:
	"macs2 callpeak -t {input} -f BED "
	"-n {wildcards.dir}/peaks/allsamples "
	"-g mm -p 0.01 --shift -75 --extsize 150 "
	"--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && "
	"{params}/scripts/atacseq_bed_to_saf.sh "
	"{wildcards.dir}/peaks/allsamples_peaks.narrowPeak > {output}"

SnakeMake macs2 From line 107 of snakemake/atacseq_data_parse_alt.smk

shell:
	"featureCounts --verbose -a {input.peaks} "
	"-o {output} -F SAF -T {threads} {input.reads} &> {log}"

SnakeMake FeatureCounts From line 126 of snakemake/atacseq_data_parse_alt.smk

shell: 
	"trim_galore {input} "
	"-o {wildcards.dir}/fastq_trimmed &> {log}"

SnakeMake Trim_Galore From line 14 of snakemake/atacseq_data_parse.smk

shell:
  "bowtie2-build {input} {params} --threads {threads} &> {log}"

SnakeMake Bowtie 2 From line 29 of snakemake/atacseq_data_parse.smk

shell: 
	"bowtie2 -k 4 -p {threads} -t --local " 
	"-x {params} -U {input.fq} 2> {log} | " 
	"samtools view -Sb - > {output}"

SnakeMake SAMtools Bowtie 2 From line 44 of snakemake/atacseq_data_parse.smk

shell:
	"samtools sort -n -T "
	"{wildcards.dir}/bam/{wildcards.sample}.qsorting {input} | "
	"samtools view -h - | "
	"python {params}/scripts/assign_multimappers.py -k 4 | "
	"samtools view -F 1804 -Su - | "
	"samtools sort -T "
	"{wildcards.dir}/bam/{wildcards.sample}.mpsorting "
	"-o {output}"

SnakeMake SAMtools From line 56 of snakemake/atacseq_data_parse.smk

shell:
	"picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} "
	"-Djava.io.tmpdir={wildcards.dir}/tmp "
	"MarkDuplicates INPUT={input} OUTPUT={output} "
	"METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt "
	"VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true "
	"REMOVE_DUPLICATES=false &> {log}"

SnakeMake Picard From line 75 of snakemake/atacseq_data_parse.smk

shell:
	"samtools view -F 1804 -b {input} > {output} && "
	"samtools index {output}"

SnakeMake SAMtools From line 88 of snakemake/atacseq_data_parse.smk

shell:
	"[ -e {params} ]||mkdir {params} && "
	"bedtools bamtobed -i {input} | "
	"awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | "
	"sort -k1,1 -k2,2n -T {params} | "
	"bgzip > {output}"

SnakeMake BEDTools From line 99 of snakemake/atacseq_data_parse.smk

shell:
	"macs2 callpeak -t {input} -f BED "
	"-n {wildcards.dir}/peaks/{wildcards.sample} "
	"-g mm -p 0.01 --shift -75 --extsize 150 "
	"--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && "
	"touch {output}"

SnakeMake macs2 From line 113 of snakemake/atacseq_data_parse.smk

shell:
	"[ -e {params.tmpdir} ]||mkdir {params.tmpdir} && cat {input} | "
	"sort -u -k1,1 -k2,2n -k3,3n -T {params.tmpdir} | "
	"bedtools merge -i stdin | "
	"awk 'OFS=\"\t\"{{ print $0,\"peak\"NR }}' > {output.peaks} && "
	"{params.codedir}/scripts/atacseq_bed_to_saf.sh "
	"{output.peaks} > {output.saf}"

SnakeMake BEDTools From line 130 of snakemake/atacseq_data_parse.smk

shell:
	"featureCounts --verbose -a {input.peaks} -O "
	"-o {output} -F SAF -T {threads} {input.reads} &> {log}"

SnakeMake FeatureCounts From line 149 of snakemake/atacseq_data_parse.smk

shell: 
	"trim_galore {input} "
	"-o {wildcards.dir}/fastq_trimmed &> {log}"

SnakeMake Trim_Galore From line 14 of snakemake/atacseq_data_parse_v1.smk

shell:
  "bowtie2-build {input} {params} --threads {threads} &> {log}"

SnakeMake Bowtie 2 From line 29 of snakemake/atacseq_data_parse_v1.smk

shell: 
	"bowtie2 -k 4 -p {threads} -t --local " 
	"-x {params} -U {input.fq} 2> {log} | " 
	"samtools view -Sb - > {output}"

SnakeMake SAMtools Bowtie 2 From line 44 of snakemake/atacseq_data_parse_v1.smk

shell:
	"samtools sort -n -T "
	"{wildcards.dir}/bam/{wildcards.sample}.qsorting {input} | "
	"samtools view -h - | "
	"python {params}/scripts/assign_multimappers.py -k 4 | "
	"samtools view -F 1804 -Su - | "
	"samtools sort -T "
	"{wildcards.dir}/bam/{wildcards.sample}.mpsorting "
	"-o {output}"

SnakeMake SAMtools From line 56 of snakemake/atacseq_data_parse_v1.smk

shell:
	"picard -Xmx8G -Xms256M -XX:ParallelGCThreads={threads} "
	"-Djava.io.tmpdir={wildcards.dir}/tmp "
	"MarkDuplicates INPUT={input} OUTPUT={output} "
	"METRICS_FILE={wildcards.dir}/bam/{wildcards.sample}.dupmetrics.txt "
	"VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true "
	"REMOVE_DUPLICATES=false &> {log}"

SnakeMake Picard From line 75 of snakemake/atacseq_data_parse_v1.smk

shell:
	"samtools view -F 1804 -b {input} > {output} && "
	"samtools index {output}"

SnakeMake SAMtools From line 88 of snakemake/atacseq_data_parse_v1.smk

shell:
	"[ -e {params} ]||mkdir {params} && "
	"bedtools bamtobed -i {input} | "
	"awk 'BEGIN{{OFS=\"\t\"}}{{$4=\"N\";$5=\"1000\";print $0 }}' | "
	"sort -k1,1 -k2,2n -T {params} | "
	"bgzip > {output}"

SnakeMake BEDTools From line 99 of snakemake/atacseq_data_parse_v1.smk

shell:
	"[ -e {params} ]||mkdir {params} && "
	"gunzip -c {input} | "
	"sort -k1,1 -k2,2n -T {params} | "
	"bgzip > {output}"

SnakeMake From line 114 of snakemake/atacseq_data_parse_v1.smk

shell:
	"macs2 callpeak -t {input} -f BED "
	"-n {wildcards.dir}/peaks/allsamples "
	"-g mm -p 0.01 --shift -75 --extsize 150 "
	"--nomodel -B --SPMR --keep-dup all --call-summits &> {log} && "
	"{params}/scripts/atacseq_bed_to_saf.sh "
	"{wildcards.dir}/peaks/allsamples_peaks.narrowPeak > {output}"

SnakeMake macs2 From line 129 of snakemake/atacseq_data_parse_v1.smk

shell:
	"featureCounts --verbose -a {input.peaks} -O "
	"-o {output} -F SAF -T {threads} {input.reads} &> {log}"

SnakeMake FeatureCounts From line 148 of snakemake/atacseq_data_parse_v1.smk

shell:
	"ngmlr -x ont --bam-fix "
	"-t {threads} -r {params} -q {input} 2> {log} | "
	"samtools view -q 20 -b - | "
	"samtools sort -T bam/{wildcards.sample}.sorting "
	"-o {output} && "
	"samtools index {output}"

SnakeMake SAMtools NGMLR From line 22 of snakemake/nanopore_data_parse.smk

shell:
	"minimap2 -x map-ont -d {output} {input}"

SnakeMake Minimap2 From line 35 of snakemake/nanopore_data_parse.smk

shell:
	"minimap2 --MD -L -t {threads} -a {input} 2> {log} | "
	"samtools view -q 20 -b - | "
	"samtools sort -T bam/{wildcards.sample}.sorting "
	"-o {output} && "
	"samtools index {output}"

SnakeMake SAMtools Minimap2 From line 47 of snakemake/nanopore_data_parse.smk

shell:
	"nanopolish index --verbose -d {wildcards.pre} "

SnakeMake nanopolish From line 65 of snakemake/nanopore_data_parse.smk

shell:
	"nanopolish call-methylation -v -t {threads} -q cpg "
	"-g {params} -r {input.fq} -b {input.bam} | " 
	"gzip > {output}"

SnakeMake nanopolish From line 81 of snakemake/nanopore_data_parse.smk

shell:
	"python {params}/nanopore-methylation-utilities/mtsv2bedGraph.py -g {input.fa} -i {input.meth} | "
	"sort -T tmp -k1,1 -k2,2n | "
	"bgzip > {output} && "
	"tabix -p bed {output}"

SnakeMake tabix From line 94 of snakemake/nanopore_data_parse.smk

	shell:
		"python {params}/nanopore-methylation-utilities/convert_bam_for_methylation.py "
        "-t {threads} -c {input.meth} -b {input.bam} | "
		"samtools sort -o {output} && "
		"samtools index {output}"

SnakeMake SAMtools From line 110 of snakemake/nanopore_data_parse.smk

shell: 
	"python -u {params}/nanopore-methylation-utilities/parseMethylbed.py frequency -v " 
	"-i {input} -m {wildcards.mod} 2> {log} | " 
	"bgzip > {output} && " 
	"tabix -b 2 -e 2 {output}"

SnakeMake tabix From line 125 of snakemake/nanopore_data_parse.smk

shell: 
	"python {params}/scripts/makeWig.py -v -i {input} " 
	"-o {output.methwig} -c {output.covwig} &> {output.log}"

SnakeMake From line 140 of snakemake/nanopore_data_parse.smk

shell: 
	"wigToBigWig {input} {output}"

SnakeMake wigToBigWig From line 150 of snakemake/nanopore_data_parse.smk

shell: 
	"sniffles -m {input} -v {output} -t {threads} "
	"--tmp_file sv/{wildcards.sample}.sniffles.tmp " 
	"-s 2  &> {log}"

SnakeMake sniffles From line 167 of snakemake/nanopore_data_parse.smk

shell:
	"bcftools sort -o {output} -T sv {input}"

SnakeMake BCFtools From line 177 of snakemake/nanopore_data_parse.smk

shell:
	"echo {input} | tr \" \" \"\\n\" > {output.raw} && "
	"SURVIVOR merge {output.raw} 1000 1 1 -1 -1 -1 {output.vcf}"

SnakeMake SURVIVOR From line 187 of snakemake/nanopore_data_parse.smk

shell:
	"sniffles -m {input.bam} -v {output} -t {threads} "
	"--Ivcf {input.candidates} --tmp_file "
	"sv/{wildcards.sample}.SURVIVORsniffles.tmp " 
	"-s 2 -n -1 --genotype --cluster &> {log}"

SnakeMake sniffles From line 201 of snakemake/nanopore_data_parse.smk

shell:
	"echo {input} | tr \" \" \"\\n\" > {output.raw} && "
	"SURVIVOR merge {output.raw} 1000 -1 1 -1 -1 -1 {output.vcf}"

SnakeMake SURVIVOR From line 214 of snakemake/nanopore_data_parse.smk

ShowHide 30 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/timplab/ambic-epigenome

Name: ambic-epigenome

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: MIT License

Keywords:

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free