Authors
- Rowan Callahan (@callahro)
Usage
Simple
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.
Installing the workflow involves cloning this repository and making sure that Snakemake is on your path. To check that Snakemake is installed you can run
which snakemake
which will print the path of the Snakemake executable that will be used. Once you have cloned the workflow and confirmed your path, make sure your configuration file is correctly formatted.
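For example, a typical install and check might look like the following (the repository URL and directory name are placeholders; substitute your own fork or the release you downloaded):

git clone <repository-url> atac_seq_workflow   # placeholder URL
cd atac_seq_workflow
which snakemake        # should print the path to a Snakemake executable
snakemake --version    # confirm the installed version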
Step 2: Configure workflow
Configure the workflow according to your needs by editing the file config.yaml.
Alternatively, you can create your own configuration file and specify it when running Snakemake with
snakemake --configfile my_config_file.yaml
The included example configuration file shows which values are needed where, along with information about which indices and databases to use.
Some of these configuration values are paths to databases or whole genomes, several of which I have already downloaded.
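As a rough illustration only, a configuration file might contain entries along these lines. Apart from remove_filepath_regex (described below), the key names here are hypothetical, so consult the bundled example config.yaml for the exact options this workflow expects:

# Hypothetical sketch of a config file; check the bundled example for the real key names.
metadata: /path/to/metadata.tsv
remove_filepath_regex: '_long_filepath_to_ignore'
genome_index: /path/to/bwa/index      # hypothetical key
blacklist: /path/to/blacklist.bed     # hypothetical key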
If you are using human sample data, then you can probably use some of the defaults included in the file titled Snakefile. This file also lists the final target outputs that will be created by the pipeline, and it can be edited if you want to include or exclude certain final outputs.
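The final targets are declared in the usual Snakemake way, in the first rule of the Snakefile. A generic sketch (with made-up output paths, not this workflow's actual targets) looks like:

# Generic illustration of how final targets are listed; the real Snakefile's
# target paths will differ. SAMPLES is the list of sample names.
rule all:
    input:
        expand("bigwigs/{sample}.bw", sample=SAMPLES),
        expand("peaks/{sample}_peaks.broadPeak", sample=SAMPLES),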
Metadata File configuration
The metadata file is required to have two columns, one titled SampleID and the other titled Condition, and it must be tab separated. When you run this pipeline as a dry run, it will build a list of the sample names it has found and print them out. The pipeline works by matching these sample names exactly to the SampleID strings present in the metadata file.
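A minimal metadata file might look like this (sample names and conditions are invented for illustration):

SampleID	Condition
sample1	treatment
sample2	control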
One exception is if you use the remove_filepath_regex: option in the configuration file. This performs a string replacement on the listed sample names before trying to match them to the SampleID strings. It is useful when a large part of each filepath does not contain sample identifiers and is simply repeated; in that case you can write only the part of the filepath you care about in SampleID and cut off the rest with remove_filepath_regex:.
An example of this is as follows: let's say my sample list contains
['sample1_long_filepath_to_ignore', 'sample2_long_filepath_to_ignore']
I can set remove_filepath_regex in my config file to
remove_filepath_regex: '_long_filepath_to_ignore'
This means the pipeline now looks for SampleID matches that look like
['sample1', 'sample2']
These matches are used to build a dictionary that assigns each sample name to the condition it belongs to.
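The matching logic can be pictured roughly as follows; this is a simplified sketch of the behaviour described above, not the pipeline's actual code, and metadata.tsv is a placeholder filename:

import re
import pandas as pd

samples = ['sample1_long_filepath_to_ignore', 'sample2_long_filepath_to_ignore']
remove_filepath_regex = '_long_filepath_to_ignore'
metadata = pd.read_csv('metadata.tsv', sep='\t')   # columns: SampleID, Condition

# strip the repeated part of each filepath, then map cleaned names to conditions
cleaned = [re.sub(remove_filepath_regex, '', name) for name in samples]
condition_by_sample = {
    row.SampleID: row.Condition
    for row in metadata.itertuples(index=False)
    if row.SampleID in cleaned
}
print(condition_by_sample)   # e.g. {'sample1': 'treatment', 'sample2': 'control'}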
Step 3: Execute workflow
Test your configuration by performing a dry-run via
snakemake --use-conda -n
This pipeline requires conda to run, as all of its external dependencies are installed through conda. It is important that --use-conda is passed every time the pipeline is run.
It is also important to remember to specify a profile path using
--profile /path/to/profile/folder
With some versions of Snakemake this can lead to an error saying that no profile configuration file was found. If this happens, provide the path to the profile configuration file itself with
--profile /path/to/profile/config.yaml
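Putting these flags together, a typical invocation looks like the following (the profile path is an example; point it at your own profile folder or config file):

snakemake --use-conda -n --profile /path/to/profile/folder    # dry run first
snakemake --use-conda --profile /path/to/profile/folder       # then the real run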
You can submit your Snakemake command with the sbatch directive, but it is also possible to simply run it inside a terminal multiplexer such as screen or tmux, since the Snakemake scheduler process does not take up much CPU or memory. However, it is important that your profile is set up so that it sends jobs to the queue rather than running them on the head node.
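For example, you can keep the scheduler alive across logouts by starting it inside tmux (the session name is arbitrary):

tmux new -s atac_pipeline
snakemake --use-conda --profile /path/to/profile/folder
# detach with Ctrl-b d; reattach later with: tmux attach -t atac_pipeline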
All of these options can be used in combination with any of the modes above. See the Snakemake documentation for further details.
Advanced
The following recipe provides established best practices for running and extending this workflow in a reproducible way; a command-line sketch of the git-related steps follows the list.
- Fork the repo to a personal or lab account.
- Clone the fork to the desired working directory for the concrete project/run on your machine.
- Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).
- Modify the config, and any necessary sheets (and probably the workflow) as needed.
- Commit any changes and push the project-branch to your fork on GitHub.
- Run the analysis.
- Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.
- Optional: Push results (plots/tables) to the remote branch on your fork.
- Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).
- Optional: Delete the local clone/workdir to free space.
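As a rough command-line sketch of the fork/clone/branch/commit steps above (the fork URL, directory, and branch names are placeholders):

git clone git@github.com:<your-account>/<your-fork>.git my_project
cd my_project
git checkout -b my_project_branch
# edit config.yaml, the metadata sheet, and any rules you need to change
git add -A
git commit -m "project-specific configuration"
git push -u origin my_project_branch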
Code Snippets
shell: """
    bamCoverage --bam {input.quality} -o {output.bigwig_rough} --numberOfProcessors {threads} --skipNAs --binSize 5
    bamCoverage --bam {input.quality} -o {output.bigwig} --numberOfProcessors {threads} --skipNAs --binSize 5 --smoothLength 15
    """

shell: """
    sambamba sort -t {threads} {input.bamfile} -o {output.input_sort}
    sambamba index -t {threads} {output.input_sort} {output.input_index}
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    #if [[ $(grep {wildcards.merged_sample} {input.downsample_list}) ]]; then sambamba view --subsample=0.$(grep {wildcards.merged_sample} {input.downsample_list}| awk '{{print $2}}') --subsampling-seed=2 {output.input_sort} -o {output.bamfile}; else cp {output.input_sort} {output.bamfile}; fi
    if [[ $(grep {wildcards.merged_sample} {input.downsample_list}) ]]; then samtools view -@ 4 -b -h -s 1.$(grep {wildcards.merged_sample} {input.downsample_list}| awk '{{print $2}}') {output.input_sort} -o {output.bamfile}; else cp {output.input_sort} {output.bamfile}; fi
    sambamba index -t {threads} {output.bamfile} {output.index}
    """

run:
    import pandas as pd
    dataframes = []
    for file in input.readsfile:
        dataframes.append(pd.read_csv(file, names=["name", "description", "reads"]))
    merged_dataframe = pd.concat(dataframes, ignore_index=True)
    print(merged_dataframe)
    filtered = merged_dataframe[merged_dataframe.description == "satisfy_quality"]  # ensure that we are normalizing to the final processed reads
    print(filtered)
    filtered = filtered.drop_duplicates()  # it's possible that some values have been written more than once
    print(filtered)  # we want to drop these values as they are identical rows and are not needed
    filtered['downsample'] = pd.to_numeric(filtered['reads'].min()) / pd.to_numeric(filtered['reads'])  # what percentage of the smallest sample to downsample each of the samples to
    print(filtered)
    filtered.to_csv(output.filtered_list, header=True, index=False, sep='\t')
    out_data = filtered.loc[:, ["name", "downsample"]]
    normalizing_sample = out_data[out_data.downsample == 1]  # which one is the normalizing sample
    print(normalizing_sample)
    out_data = out_data[out_data.downsample < 1]  # only keep samples that need to be downsampled and skip the ones that won't be
    out_data["downsample"] = (round(out_data["downsample"], 8) * 100000000).astype(int)  # round to the nearest number; the decimal prefix is added later in the shell command
    out_data.to_csv(output.downsample_list, header=False, index=False, sep='\t')  # write everything out to a csv

shell: """
    alignmentSieve --bam {input.bamfile} --outFile {output.bamfile} --numberOfProcessors {threads} --ATACshift
    """

shell: """
    sambamba index -t {threads} {input.full_bam} {output.indexed_full}
    sambamba index -t {threads} {input.deduplicated} {output.indexed}
    """

shell: """
    samtools view -h -@ 4 -b -F 1804 {input.bamfile} > {output.quality}
    samtools view -@ 4 -c {output.quality}| xargs -I{{}} echo "{wildcards.merged_sample},satisfy_quality,"{{}}$'\n' >> {params.readsfile}
    """

shell: """
    sambamba markdup -t {threads} {input.bamfile} {output.deduplicated}
    samtools view -@ 4 -F 1024 -c {output.deduplicated}| xargs -I{{}} echo "{wildcards.merged_sample},no_duplicates,"{{}}$'\n' >> {params.readsfile}
    """

shell: """
    samtools view -@ 4 -c {input.bamfile}| xargs -I{{}} echo "{wildcards.merged_sample},starting_reads,"{{}}$'\n' >> {output.readsfile}
    samtools view -h -@ {threads} {input.bamfile}| grep -v chrM| samtools sort -@ {threads} -O BAM > {output.bamfile}
    samtools view -@ 4 -c {output.bamfile}| xargs -I{{}} echo "{wildcards.merged_sample},no_ChrM,"{{}}$'\n' >> {output.readsfile}
    """

shell: """
    cp {input.lane1} {output.merged_bamfile}
    """

shell: """
    multiBigwigSummary bins -b {params.bigwigs} --labels {params.names} -out {output.scores} -p {threads}
    plotCorrelation -in {output.scores} --corMethod pearson --skipZeros --plotTitle "Pearson Correlation of Read Counts" --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o {output.correlation_plot}
    """

shell: """
    export PATH=$PATH:/home/groups/MaxsonLab/software/bedtools/
    bedtools multiinter -i {params.bedfiles} -header -names {params.names} > {output.bedfile}
    """

shell: """
    export PATH=$PATH:/home/groups/MaxsonLab/software/bedtools/
    bedtools intersect -wa -a {input.read_catalog} -b {input.sample_condition_catalog} > {output.sample_condition_catalog}
    """

shell: """
    wc -l {input.summits}| awk '{{print $1}}'| xargs -I {{}} echo "{wildcards.merged_sample},unfiltered_peaks,"{{}}$'\n' >> {params.readsfile}
    touch {output.done}
    """

shell: """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools flagstat {input.bamfile} > {params.flagstat}
    grep total {params.flagstat}|awk '{{print $1}}'|xargs -I {{}} echo "{wildcards.merged_sample},total_reads_trimd,"{{}}$'\n' >> {params.readsfile}
    grep "[0-9] mapped" {params.flagstat} |awk '{{print $1}}' | xargs -I {{}} echo "{wildcards.merged_sample},mapping_reads,"{{}}$'\n' >> {params.readsfile}
    awk '{{sum +=$7}} END{{print sum}}' {input.read_count}| xargs -I {{}} echo "{wildcards.merged_sample},reads_in_peaks,"{{}}$'\n' >> {params.readsfile}
    touch {output.done}
    """

shell: """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools merge -@ {threads} - {params.merge_input} > {output.bamfile}
    samtools index {output.bamfile}
    bamCoverage --bam {output.bamfile} -o {output.bigwig} --numberOfProcessors {threads} --skipNAs --binSize 5 --smoothLength 15
    """

shell: """
    bedtools genomecov -ibam {input.non_downsampled} -bg > {output.non_downsampled_bedGraph}
    """

shell: """
    bedSort {input.non_downsampled_bedGraph} {output.non_downsampled_sorted_bedGraph}
    """

shell: """
    bedGraphToBigWig {input.non_downsampled_sorted_bedGraph} {params.chrom_size} {output.non_downsampled_bigwig}
    """

run:
    import pandas as pd
    dataframes = []
    for file in input.read_count:
        dataframes.append(pd.read_csv(file, sep='\t'))
    merged = pd.concat(dataframes, axis=1)
    merged = merged.loc[:, ~merged.columns.duplicated()]
    merged = merged.sort_index(axis=1)
    merged.to_csv(output.read_count, header=True, index=False, sep='\t')

shell: """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools sort -@ 4 {input.bamfile} > {output.bamfile}
    samtools index {output.bamfile}
    bedtools multicov -bams {output.bamfile} -bed {input.read_catalog} >> {output.read_count}
    sed -i "1i {params.header}" {output.read_count}
    """

shell: """
    echo "" > {params.merged_bed}
    cat {params.peaks_input} >> {params.merged_bed}
    Rscript --vanilla {params.script_file} {params.merged_bed} {params.blacklist} {params.present_in_number} {output.reads_catalog_bed} {params.genome} {params.metadata} {params.regex}
    """

shell: """
    macs2 callpeak --treatment {input.bamfile} --name {params.name} --format BAMPE --gsize {params.genome} --outdir {params.directory} --broad
    """

run:
    import pandas as pd
    dataframes = []
    for file in input.read_count:
        dataframes.append(pd.read_csv(file, sep='\t'))
    merged = pd.concat(dataframes, axis=1)
    merged = merged.loc[:, ~merged.columns.duplicated()]
    merged = merged.sort_index(axis=1)
    merged.to_csv(output.read_count, header=True, index=False, sep='\t')

shell: """
    bedtools multicov -bams {input.bamfile} -bed {input.read_catalog} >> {output.read_count}
    sed -i "1i {params.header}" {output.read_count}
    """

shell: """
    annotatePeaks.pl {input.bedfile} {params.genome} > {output.read_count}
    """

shell: """
    echo "" > {params.merged_bed}
    cat {params.peaks_input} >> {params.merged_bed}
    Rscript --vanilla {params.script_file} {params.merged_bed} {params.blacklist} {params.present_in_number} {params.reads_catalog_bed} {params.genome} {params.metadata} {params.regex}
    """

shell: """
    macs2 callpeak --treatment {input.bamfile} --name {params.name} --format BAMPE --gsize {params.genome} --outdir {params.directory} --broad
    """

shell: """samtools sort -T {params.outdirectory} -@ {threads} -o {output.sorted_bamfile} {input.bamfile}"""

shell: """samtools view -bh -@ {threads} -o {output.bamfile} {input.samfile}"""

shell: """bwa mem -t {threads} {params.reference} {input.forward_paired} {input.reverse_paired} > {output.samfile}"""

shell: """trimmomatic PE {input.forward_strand} {input.reverse_strand} {output.forward_paired} {output.forward_unpaired} {output.reverse_paired} {output.reverse_unpaired} ILLUMINACLIP:{params.adapter}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30"""

shell: """
    fastqc --outdir {params.outdir} --extract -f fastq {input.forward_strand} {input.reverse_strand}
    touch {output.fastqc_done}
    """

shell: """
    fastq_screen --aligner bowtie2 --conf {params.conf} --outdir {params.out_dir} {input.forward_strand}
    fastq_screen --aligner bowtie2 --conf {params.conf} --outdir {params.out_dir} {input.reverse_strand}
    touch {output.fastq_screen_done}
    """
Support
- Future updates
Related Workflows