This is a snakemake pipeline for bisulfite sequencing data. It implements:
- Adapter trimming and quality check
- Quality reports and statistics (fastqc + multiqc)
- Methylation extraction with bismark (bowtie2/hisat2 as aligners)
- DMR identification with Methylkit (in all contexts)
Authors
- Skander Hatira (@skanderhatira)
Usage
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
Clone the repository to your local system, into the place where you want to perform the data analysis:
git clone https://forgemia.inra.fr/irhs-bioinfo/biseps.git
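By default, git clones into a directory named after the repository, so the following steps assume you change into it first:
cd biseps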
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, and samples.tsv and units.tsv to specify your sample setup.
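Before editing, it can help to look at the files shipped in config/ to see the expected keys and columns (the exact fields are defined by this workflow's defaults, so keep their headers), for example:
head config/config.yaml config/samples.tsv config/units.tsv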
Step 3: Install Snakemake
Install Snakemake using conda:
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation.
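For example, you can check that the installation succeeded without activating the environment yet:
conda run -n snakemake snakemake --version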
Step 4: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via:
snakemake --profile config/profiles/local -n
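If the dry-run completes without errors, the same profile can be used for the actual run (a sketch; adjust the options in config/profiles/local to your machine):
snakemake --profile config/profiles/local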
See the Snakemake documentation for further details.
Step 5: Investigate results
After successful execution, you can create a self-contained interactive .html report with all results via:
snakemake --report report.html
Build docker image
A docker image of this workflow can be built from the repository by running this command:
docker build -t biseps --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) --build-arg USERNAME=$USER .
To run this container with your data you need to bind volumes specifying the raw data, configuration files, and necessary resources.
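As a sketch only (the /biseps/... destination paths are an assumption about the image's working directory, and the host-side paths are placeholders for your own data and configuration), a run might look like:
docker run \
  --mount type=bind,src="$(pwd)/config",dst=/biseps/config \
  --mount type=bind,src="$(pwd)/data",dst=/biseps/data \
  biseps --profile config/profiles/local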
Testing
The .test directory contains subsampled .fastq files for two samples (multi-lane + biological replicates) and a .fasta file containing a genome sequence from NCBI. It also contains CX reports from DMRcaller to test the DMR identification pipeline.
To test the pipeline you need either conda available on your machine's $PATH:
Testing alignment:
- You can either use the preconfigured .yaml profile:
snakemake --profile config/profiles/test
- Or enter the command line arguments as such:
snakemake --cores $N --use-conda --configfile .test/config/config.yaml
Testing differential methylation extraction:
- You can either use the preconfigured .yaml profile:
snakemake --profile config/profiles/testComparison
- Or enter the command line arguments as such:
snakemake --cores $N --use-conda --configfile .test/comparison/config.yaml
or a docker-enabled machine to build and run the image with a mounted folder containing the necessary data and configuration files pointing to that data:
docker run --mount type=bind,src="$(pwd)/.test",dst=/biseps/.test,readonly biseps \
--profile config/profiles/test
Note that your output data won't be accessible, as it isn't mounted/stored in a docker volume; refer to the docker documentation on best practices for persisting data in running containers.
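For instance, you could additionally bind-mount a local folder for the output (a sketch; /biseps/results is an assumption about where the workflow writes results inside the container, so verify it against your configuration):
mkdir -p results
docker run \
  --mount type=bind,src="$(pwd)/.test",dst=/biseps/.test,readonly \
  --mount type=bind,src="$(pwd)/results",dst=/biseps/results \
  biseps --profile config/profiles/test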
Code Snippets
shell:
    "bismark --score_min {params.score_min} -N {params.N} -L {params.L} --{params.aligner} --bam {params.aligner_options} {params.flags} {params.extra} "
    "--temp_dir {output.tempdir} -o {params.outdir} --parallel {params.instances} {params.genome} -1 {input.r1} -2 {input.r2} 2> {log}; "

shell:
    "mv {input.bam} {output.bam}; "
    "mv {input.report} {output.report}; "
    "mv {input.nucl} {output.nucl}; "

shell:
    "samtools view -h -@ {resources.cpus} -o {output.sam} {input} ;"

shell:
    "deduplicate_bismark -p {input} -o {params.basename} --output_dir {params.outdir} --sam 2> {log};"
    "samtools sort {output.dedup} -o {output.sort_bam} ;"
    "samtools index {output.sort_bam}"

shell:
    "bamCoverage -b {input.bam} -o {output} 2> {log}"

shell:
    "bash workflow/scripts/parallel_commands.sh \'bismark_genome_preparation {params.genome} --{params.aligner} --parallel {resources.cpus} --genomic_composition {params.extra}\' \'samtools faidx {params.genome_file} \' 2> {log} "

shell:
    "bismark_methylation_extractor {input} --genome_folder {params.genome} {params.flags} -p --parallel {params.instances} -o {params.out_dir} {params.extra} 2> {log} "

shell:
    "python workflow/scripts/methcalls2cgmap.py -n {input} -f bismark &> {log}"

shell:
    "sort -k1,1 -k2,2n {input.cx} > {output.sort};"
    "python3 workflow/scripts/bismark_to_bigwig_pe.py {input.index} {output.sort} &> {log}"

shell:
    "gunzip {input.bedgraph};"
    "sed -i '1d' {output.unzip};"
    "sort -k1,1 -k2,2n {output.unzip} > {output.sort}; "
    "bedGraphToBigWig {output.sort} {input.index} {output.bw} 2> {log}"

shell:
    "mv {input.cg} {output.cg};"
    "mv {input.chg} {output.chg};"
    "mv {input.chh} {output.chh};"

shell:
    "mkdir -p {output} ; "
    "fastqc {input} -o {output} {params.extra} 2> {log}"

shell:
    "multiqc {input.fqc} {params.aligndir} {params.methdir} -n {output.file} 2> {log}"

shell:
    "bismark2report --alignment_report {input.align} "
    "--dedup_report {input.dedup} "
    "--nucleotide_report {input.nucl} "
    "--splitting_report {input.split} "
    "--mbias_report {input.mbias} --dir {params.outdir} -o {params.filename} 2> {log}"

run:
    if os.path.splitext(input.r1)[1] == '.gz':
        shell("gunzip -c {input.r1} > {output.r1}; ")
    else:
        shell("ln -sr {input.r1} {output.r1}; ")
    if os.path.splitext(input.r2)[1] == '.gz':
        shell("gunzip -c {input.r2} > {output.r2}; ")
    else:
        shell("ln -sr {input.r2} {output.r2}; ")

shell:
    "cat {input.r1} > {output.r1};"
    "cat {input.r2} > {output.r2}"

shell:
    "seqtk sample -s {params.seed} {input.r1} {params.size} > {output.r1} ; "
    "seqtk sample -s {params.seed} {input.r2} {params.size} > {output.r2} "

shell:
    "trimmomatic PE -phred33 -threads {resources.cpus} "
    " {input} "
    " {output} "
    " {params.trimmer}:{params.adapters}:{params.trimmeropts} {params.extra}"
import sys, math, multiprocessing, subprocess, os

# Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-p=num_proc] <chrm_sizes> <allC_file> [allC_file]*
# NOTE: allc file contains the methylation information for all chromosomes
# Steps:
# 1. allC to bedGraph
# 2. sort bedGraph if necessary
# 3. bedGraph to BigWig
# 4. remove temporary files

NUMPROC = 1

def processInputs( allCFileAr, chrmFileStr, keepTmp, labelsAr, outID, numProc, isSort, useAll ):
    print( 'Keep temp files: {:s}; Sort bedGraph: {:s}; Use all positions: {:s}'.format( str(keepTmp), str(isSort), str(useAll) ) )
    print( 'Begin processing files with {:d} processors'.format( numProc ) )
    pool = multiprocessing.Pool( processes=numProc )
    results = [ pool.apply_async( processFile, args=(allCFileAr[i], chrmFileStr, labelsAr[i], outID, keepTmp, isSort, useAll) ) for i in range(len(allCFileAr)) ]
    suc = [ p.get() for p in results ]
    print( 'Done' )

def processFile( allCFileStr, chrmFileStr, label, outID, keepTmp, isSort, useAll ):
    if outID == None and label == None:
        outID = allCFileStr.replace( '.tsv','' ).replace( 'allc_','' )
    elif outID == None:
        outID = label
    elif label == None:
        outID += '_' + allCFileStr.replace( '.tsv','' ).replace( 'allc_','' )
    else:
        outID += '_' + label
    print( 'Reading allC file {:s}'.format( allCFileStr ) )
    # allC to bedGraphs
    bedGraphStr = outID + '.bedGraph'
    bedGraphAr = [ bedGraphStr + '.' + x for x in ['cg','chg','chh'] ]
    readAllC( allCFileStr, bedGraphAr, useAll )
    if isSort:
        print( 'Sorting bedGraph files' )
        for b in bedGraphAr:
            sortBedFile( b )
    print( 'Converting {:s} files to BigWig'.format( bedGraphStr ) )
    # bedGraph to bigWig
    for b in bedGraphAr:
        processBedGraph( b, chrmFileStr )
    # remove temporary
    if keepTmp == False:
        print( 'Removing temporary files...' )
        for b in bedGraphAr:
            os.remove( b )
    print( 'BigWig finished for {:s}.bw.*'.format( outID ) )

def readAllC( allCFileStr, outFileAr, useAll ):
    allCFile = open( allCFileStr, 'r' )
    outFileAr = [ open( outFileStr, 'w' ) for outFileStr in outFileAr ]
    mTypes = [ 'CG', 'CHG', 'CHH' ]
    for line in allCFile:
        lineAr = line.rstrip().split('\t')
        # (0) chr (1) pos (2) strand (3) mC (4) C (5) Cctxt (6) trintctxt
        if len(lineAr) < 7:
            continue
        elif ( useAll or int(lineAr[3]) > 0 ):
            chrm = lineAr[0]
            pos = int( lineAr[1] ) - 1
            methType = decodeMethType( lineAr[5] )
            try:
                mInd = mTypes.index( methType )
            except ValueError:
                continue
            value = float( lineAr[3] ) / (float( lineAr[4] ) + float( lineAr[3] ))
            # adjust for negative strand
            if lineAr[2] == '-':
                value = value * -1
            # (0) chrm (1) start (2) end (3) value
            outStr = '{:s}\t{:d}\t{:d}\t{:f}\n'.format( chrm, pos, pos+1, value )
            outFile = outFileAr[mInd]
            outFile.write( outStr )
        # end if
    # end for
    allCFile.close()
    [ outFile.close() for outFile in outFileAr ]

def decodeMethType( mStr ):
    if mStr.startswith( 'CG' ):
        return 'CG'
    elif mStr.endswith( 'G' ):
        return 'CHG'
    elif mStr == 'CNN':
        return 'CNN'
    else:
        return 'CHH'

def sortBedFile( bedFileStr ):
    command = 'bedSort {:s} {:s}'.format( bedFileStr, bedFileStr )
    subprocess.call( command, shell=True )

def processBedGraph( bedGraphStr, chrmFileStr ):
    bigWigStr = bedGraphStr.replace( '.bedGraph', '.bw' )
    # bedGraphToBigWig in.bedGraph chrom.sizes out.bw
    command = 'bedGraphToBigWig {:s} {:s} {:s}'.format( bedGraphStr, chrmFileStr, bigWigStr )
    subprocess.call( command, shell=True )

def parseInputs( argv ):
    # Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-p=num_proc] <chrm_sizes> <allC_file> [allC_file]*
    keepTmp = False
    labelsAr = []
    numProc = NUMPROC
    isSort = False
    useAll = False
    outID = None
    startInd = 0
    for i in range( min(5, len(argv)-2) ):
        if argv[i] == '-keep':
            keepTmp = True
            startInd += 1
        elif argv[i] == '-sort':
            isSort = True
            startInd += 1
        elif argv[i] == '-all':
            useAll = True
            startInd += 1
        elif argv[i].startswith( '-L=' ):
            labelsAr = argv[i][3:].split( ',' )
            startInd += 1
        elif argv[i].startswith( '-o=' ):
            outID = argv[i][3:]
            startInd += 1
        elif argv[i].startswith( '-p=' ):
            try:
                numProc = int( argv[i][3:] )
                startInd += 1
            except ValueError:
                print( 'ERROR: number of processors must be an integer' )
                exit()
        elif argv[i] in [ '-h', '--help', '-help' ]:
            printHelp()
            exit()
        elif argv[i].startswith( '-' ):
            print( 'ERROR: {:s} is not a valid parameter'.format( argv[i] ) )
            exit()
    chrmFileStr = argv[startInd]
    allCFileAr = []
    for j in range( startInd+1, len(argv) ):
        allCFileAr += [ argv[j] ]
    if len(labelsAr) == 0:
        labelsAr = [None] * len(allCFileAr)
    elif len(labelsAr) != len(allCFileAr):
        print( "ERROR: number of labels doesn't match number of allC files" )
        exit()
    processInputs( allCFileAr, chrmFileStr, keepTmp, labelsAr, outID, numProc, isSort, useAll )

def printHelp():
    print( "Usage: python3 bismark_to_bigwig_pe.py [-keep] [-sort] [-all] [-L=labels] [-o=out_id] [-p=num_proc] <chrm_sizes> <bismark_file> [bismark_file]*" )
    print( 'Converts bismark files to context-specific BigWig files' )
    print( 'Note: bedGraphToBigWig and bedSort programs must be in the path' )
    print( 'Required:' )
    print( 'chrm_file\ttab-delimited file with chromosome names and lengths,\n\t\ti.e. fasta index file' )
    print( 'bismark_file\tbismark file with all chrms and contexts' )
    print( 'Optional:' )
    print( '-keep\t\tkeep intermediate files' )
    print( '-sort\t\tcalls bedSort; add this option if bigwig conversion fails' )
    print( '-all\t\tuse all covered positions including 0s [default only includes mC > 1]' )
    print( '-L=labels\tcomma-separated list of labels to use for the allC files;\n\t\tdefaults to using information from the allc file name' )
    print( '-o=out_id\toptional identifier to be added to the output file names' )
    print( '-p=num_proc\tnumber of processors to use [default 1]' )

if __name__ == "__main__":
    if len(sys.argv) < 3:
        printHelp()
    else:
        parseInputs( sys.argv[1:] )
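For reference, the pipeline calls this script with a fasta index as the chromosome-sizes file followed by a sorted CX report, as in the rule above (the file names below are placeholders):
python3 workflow/scripts/bismark_to_bigwig_pe.py genome.fa.fai sample.CX_report.sorted.txt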
import pandas as pd
import numpy as np
import traceback
import argparse

def get_parser():
    """
    Create a parser and add arguments
    """
    parser = argparse.ArgumentParser()
    group1 = parser.add_argument_group('Input format')
    group1.add_argument("-n", "--filename", type=str, help="the file name that the users want to convert to CGMap format")
    group1.add_argument("-f", "--format", default="bismark", choices=["bismark", "bsmap", "methylpy", "methimpute"], type=str, help="the type of file to CGmap")
    return parser

def readfile(filename):
    """
    read files: allow the format with or without gz compression
    """
    if filename[-3:] == ".gz":
        inputfile = pd.read_csv(filename, header=None, sep="\t", compression='gzip')
    else:
        inputfile = pd.read_csv(filename, header=None, sep="\t")
    return inputfile

def trinuc(threeletter):
    """
    acquire the methylation context (CG, CHG, CHH) from a trinucleotide
    """
    if threeletter[1] == "G":
        context = "CG"
    if threeletter[1] != "G":
        if threeletter[2] == "G":
            context = "CHG"
        if threeletter[2] != "G":
            context = "CHH"
    return context

def bismark2cgmap(bismarkfile):
    """
    CX report file to CGmap.gz
    """
    name_bismark = bismarkfile
    bismark = readfile(name_bismark)
    # the file with or without header can be properly processed
    try:
        int(bismark.iloc[0, 1])
        header = "No"
    except:
        header = "Yes"
    if header == "Yes":
        bismark = bismark.iloc[1:, :]
    print(bismark.head())
    bismark["nuc"] = np.where(bismark[2] == "+", "C", "G")
    bismark["dinuc"] = [i[:2] for i in list(bismark[6])]
    bismark["mc_nc"] = bismark[3] + bismark[4]
    bismark["ratio"] = np.round(bismark[3].astype("float") / (bismark[3] + bismark[4]).astype("float"), 2)
    cgmap_format = bismark[[0, "nuc", 1, 5, "dinuc", "ratio", 3, "mc_nc"]]
    cgmap_format_finish = cgmap_format.loc[(cgmap_format["mc_nc"] != 0)]
    cgmap_format_finish.to_csv(name_bismark + ".CGmap.gz", header=False, index=False, sep="\t", compression='gzip')

def bsmap2cgmap(bsmapfile):
    """
    the methylation calls generated by methratio.py in BSMAP (v2.73) to CGmap.gz
    """
    name_bsmap = bsmapfile
    bsmap = readfile(name_bsmap)
    # the file with or without header can be properly processed
    try:
        int(bsmap.iloc[0, 1])
        header = "No"
    except:
        header = "Yes"
    if header == "Yes":
        bsmap = bsmap.iloc[1:, :]
    # to get the items in CGmap
    bsmap["nuc"] = np.where(bsmap[2] == "+", "C", "G")
    bsmap["dinuc"] = [i[:2] for i in list(bsmap[3])]
    cgmap_format = bsmap[[0, "nuc", 1, 3, "dinuc", 4, 6, 5]]
    cgmap_format_finish = cgmap_format.loc[(cgmap_format[5] != 0)]
    cgmap_format_finish.to_csv(name_bsmap[:-3] + "CGmap.gz", header=False, index=False, sep="\t", compression='gzip')

def methylpy2cgmap(methylpyfile):
    """
    the allc files by methylpy to CGmap.gz
    """
    name_methylpy = methylpyfile
    methylpy = readfile(name_methylpy)
    methylpy["nuc"] = np.where(methylpy[2] == "+", "C", "G")
    methylpy["dinuc"] = [i[:2] for i in list(methylpy[3])]
    methylpy["context"] = methylpy[3].apply(trinuc)
    methylpy["ratio"] = np.round(methylpy[4].astype("float") / methylpy[5].astype("float"), 2)
    cgmap_format = methylpy[[0, "nuc", 1, "context", "dinuc", "ratio", 4, 5]]
    cgmap_format_finish = cgmap_format.loc[(cgmap_format[5] != 0)]
    cgmap_format_finish.to_csv(name_methylpy + ".CGmap.gz", header=False, index=False, sep="\t", compression='gzip')

def methimpute2cgmap(methimputefile):
    """
    TSV files exported from methimpute to CGmap.gz
    """
    name_methimpute = methimputefile
    methimpute = readfile(name_methimpute)
    methimpute["nuc"] = np.where(methimpute[2] == "+", "C", "G")
    methimpute["dinuc"] = [i[:2] for i in list(methimpute[3])]
    methimpute["ratio"] = round(methimpute[4].astype("float") / methimpute[5].astype("float"), 2)
    cgmap_format = methimpute[[0, "nuc", 1, 3, "dinuc", "ratio", 4, 5]]
    # keep positions with non-zero total counts
    cgmap_format_finish = cgmap_format.loc[(cgmap_format[5] != 0)]
    cgmap_format_finish.to_csv(name_methimpute + ".CGmap.gz", header=False, index=False, sep="\t", compression='gzip')

def main():
    parser = get_parser()
    args = parser.parse_args()
    try:
        if args.format == "bismark":
            bismark2cgmap(args.filename)
        if args.format == "bsmap":
            bsmap2cgmap(args.filename)
        if args.format == "methylpy":
            methylpy2cgmap(args.filename)
        if args.format == "methimpute":
            methimpute2cgmap(args.filename)
    except Exception as e:
        print(e)
        print(traceback.format_exc())
        print("please check the input format or the parameter, see methcalls2cgmap.py -h")

if __name__ == '__main__':
    main()
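For reference, the pipeline invokes this script on Bismark CX reports as in the rule shown earlier (the file name below is a placeholder):
python workflow/scripts/methcalls2cgmap.py -n sample.CX_report.txt -f bismark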
for cmd in "$@"; do {
  echo "Process \"$cmd\" started";
  $cmd & pid=$!
  PID_LIST+=" $pid";
} done

trap "kill $PID_LIST" SIGINT

echo "Parallel processes have started";

wait $PID_LIST

echo
echo "All processes have completed";
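A standalone usage sketch of this helper, with two hypothetical commands just to illustrate the interface (each argument is launched in the background and then waited on):
bash workflow/scripts/parallel_commands.sh 'sleep 2' 'echo done'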
shell:
    "touch {output}"
Support
- Future updates