This workflow performs differential gene expression (DGE) analysis of co-culture experiments. Based on the rules and scripts shown below, reads are quality- and adapter-trimmed with bbduk, mapped to one or more reference genomes with bowtie2 or bwa, counted per feature with htseq-count, and tested for differential expression with DESeq2; KEGG/eggNOG annotations are joined onto the results.
Insert your code into the respective folders, i.e. `scripts`, `rules`, and `envs`. Define the entry point of the workflow in the `Snakefile` and the main configuration in the `config.yaml` file.
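Under the standard snakemake-workflows layout (an assumption here — confirm against this repository), that corresponds roughly to:

```
workflow/
├── Snakefile         # entry point
├── rules/            # rule definitions included by the Snakefile
├── envs/             # conda environment specs
└── scripts/          # Python/R scripts called via the script: directive
config/
├── config.yaml       # workflow configuration
└── samples.tsv       # sample sheet
```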
Authors
- Konnor von Emster (@kve)
Usage
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis (see the sketch below).
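A minimal sketch of the clone step, assuming a hypothetical account name `your-account` and the default repository name:

```
git clone https://github.com/your-account/DGE-co-culture-workflow.git
cd DGE-co-culture-workflow
```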
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `samples.tsv` to specify your sample setup.
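The scripts in this workflow read a handful of keys from `config.yaml`. A minimal hypothetical sketch — organism names, paths, factor levels, and the `genome` key are placeholders; the remaining keys are those actually referenced in the code snippets below:

```yaml
input:
  sample table: config/samples.tsv
  reference genomes:
    organism_A:
      genome: resources/organism_A.fasta    # hypothetical key name; not shown in the snippets
      annotation: resources/organism_A.gff
feature_types_to_count:
  - CDS
Comparisons:
  organism_A:
    experiment_1:
      Design: "~ treatment"
      # optional: 'Included Samples' or 'Excluded Samples' lists
      Comparisons:
        treated_vs_control:
          Factor: treatment
          Treatment: treated
          Control: control
```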
Sample table specification
Required columns:
- "treatment" –– specifies the treatment of the sample. This is the same within biological replicates.
- "sample_name" –– specifies a sample identifier. This must be unique for each sample.

For paired-end reads:
- "fwd_read" –– specifies the path of the forward read.
- "rev_read" –– specifies the path of the reverse read.

For single-end reads:
- "read_path" –– specifies the path of the single read.
Step 3: Install Snakemake
Install Snakemake using conda:

```
conda create -c bioconda -c conda-forge -n snakemake snakemake
```

For installation details, see the instructions in the Snakemake documentation.
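To confirm the environment was created and Snakemake is callable, you can, for instance, run:

```
conda run -n snakemake snakemake --version
```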
Step 4: Execute workflow
Activate the conda environment:

```
conda activate snakemake
```

Test your configuration by performing a dry-run via

```
snakemake --use-conda -n
```

Execute the workflow locally via

```
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```
snakemake --use-conda --cluster qsub --jobs 100
```

or

```
snakemake --use-conda --drmaa --jobs 100
```

If you not only want to fix the software stack but also the underlying OS, use

```
snakemake --use-conda --use-singularity
```

in combination with any of the modes above. See the Snakemake documentation for further details.
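For example, a local run that fixes both the software stack and the OS (the core count is just an example):

```
snakemake --use-conda --use-singularity --cores 8
```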
Step 5: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:

```
snakemake --report report.html
```

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

```
git commit -a
git push
```
Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

- Once, register the upstream repository in your local copy:

  ```
  git remote add -f upstream git@github.com:snakemake-workflows/DGE-co-culture-workflow.git
  ```

  or, if you have not set up SSH keys:

  ```
  git remote add -f upstream https://github.com/snakemake-workflows/DGE-co-culture-workflow.git
  ```

- Update the upstream version: `git fetch upstream`.
- Create a diff with the current version: `git diff HEAD upstream/master workflow > upstream-changes.diff`.
- Investigate the changes: `vim upstream-changes.diff`.
- Apply the modified diff via: `git apply upstream-changes.diff`.
- Carefully check whether you need to update the config files: `git diff HEAD upstream/master config`. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 8: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:

- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g., `cp -r workflow path/to/fork`. Make sure not to accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
Testing
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with GitHub Actions.
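To run the tests locally before pushing, a dry-run against the test data should work, assuming `.test` follows the usual snakemake-workflows layout with its own config:

```
snakemake --use-conda -n --directory .test
```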
Code Snippets
The fragments below are the `shell:` and `script:` directives from the workflow's rule files; the full scripts they reference follow afterwards.

```
# count reads per feature with htseq-count
shell:
    "htseq-count --idattr=ID -a {params.mapping_quality_thresh} -s reverse -t exon -r pos --nonunique all "
    "{input.map_file} {input.gff} > {output}"
```

```
# read quality control with FastQC
shell:
    "fastqc {input}"
```

```
# move the FastQC outputs into place
shell:
    """
    mv {input.html} {output.html}
    mv {input.zipp} {output.zipp}
    """
```

```
# concatenate input files (this fragment appears twice in the source,
# presumably one rule per read direction)
shell:
    "cat {input} > {output}; sleep 1"
```

```
# prepare the GFF annotation for counting
script:
    "../scripts/gff_mod.py"
```

```
# paired-end mapping with bowtie2
shell:
    "bowtie2 -x {input.ref} -1 {input.r1} -2 {input.r2} -S {output} -p {resources.tasks}"
```

```
# build the bowtie2 index
shell:
    "bowtie2-build --threads {resources.tasks} {input.ref} {input.ref}"
```

```
# paired-end mapping with bwa mem
shell:
    "bwa mem {input.ref} {input.r1} {input.r2} > {output}"
```

```
# build the bwa index
shell:
    "bwa index {input.ref}"
```

```
# build the KEGG ID reference table
script:
    "../scripts/generate_bio_db_ref_table.py"
```

```
# build the combined annotation table
script:
    "../scripts/generate_annotation_df.py"
```

```
# assemble raw-count and mapping-metadata tables
script:
    "../scripts/generate_raw_counts_metadata_dfs.py"
```

```
# run the DESeq2 comparisons
script:
    "../scripts/generate_comparison_results.R"
```

```
# bundle all results into a compressed JSON
script:
    "../scripts/generate_data_json.py"
```

```
# adapter and quality trimming with bbduk
shell:
    "bbduk.sh threads={resources.tasks} "
    "in1={input.r1} in2={input.r2} "
    "out1={output.o1} out2={output.o2} "
    "minlen=25 qtrim=rl trimq=10 "
    "ref={input.ref} ktrim=r k=23 mink=11 hdist=1"
```

```
# SAM to BAM conversion
shell:
    "samtools view -@ {resources.tasks} -S -b {input} > {output}"
```

```
# sort the BAM file
shell:
    "samtools sort -@ {resources.tasks} {input} -o {output}"
```

```
# index the sorted BAM file
shell:
    "samtools index -@ {resources.tasks} -b {input}"
```

```
# decompress gzipped input
shell:
    "gzip -d {input}"
```
```python
# presumably scripts/generate_annotation_df.py: parse the per-genome GFF
# files into one combined annotation table, indexed by (organism, ID)
from pathlib import Path
import pandas as pd
import logging as log

log.basicConfig(format='%(levelname)s:%(message)s', level=log.DEBUG)

def create_attr_dict(attr_row):
    # split a GFF column-9 attribute string into a key/value dict
    log.debug(attr_row)
    d = {}
    for key_value_pair in attr_row.split(";"):
        k_v_list = key_value_pair.split(sep="=", maxsplit=1)
        if len(k_v_list) == 2:
            k, v = k_v_list
            d[k] = v
    return d

def generate_gff_df(gff_file):
    ## This method is heavily influenced by GFFpandas
    df = pd.read_csv(
        gff_file,
        sep="\t",
        comment="#",
        names=[
            "seq_id", "source", "type", "start", "end",
            "score", "strand", "phase", "attributes",
        ],
    )
    attr_dict_series = df.attributes.apply(
        lambda attributes: create_attr_dict(attributes)
    )
    # expand every attribute key into its own column
    key_set = set()
    attr_dict_series.apply(lambda at_dic: key_set.update(list(at_dic.keys())))
    for attr in sorted(list(key_set)):
        df[attr] = attr_dict_series.apply(
            lambda attr_dict: attr_dict.get(attr)
        )
    return df

gffs = []
for genome, info in snakemake.config['input']['reference genomes'].items():
    attributes_df = generate_gff_df(info['annotation'])
    attributes_df['organism'] = genome
    gffs.append(attributes_df)
attributes_df = pd.concat(gffs)

# filtering by 'type' column
feature_types_to_keep = snakemake.config["feature_types_to_count"]
attributes_df = attributes_df[attributes_df['type'].isin(feature_types_to_keep)]

attributes_df = attributes_df.set_index(['organism', 'ID'])
attributes_df.to_csv(snakemake.output[0], sep='\t')
```
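For reference, the GFF attribute strings that `create_attr_dict` above parses look like the following (hypothetical values):

```python
# hypothetical attribute string from column 9 of a GFF file
attrs = "ID=gene0001;em_KEGG_Pathway=ko00010,map00010;product=hypothetical protein"
print(create_attr_dict(attrs))
# -> {'ID': 'gene0001', 'em_KEGG_Pathway': 'ko00010,map00010', 'product': 'hypothetical protein'}
```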
```python
# presumably scripts/generate_bio_db_ref_table.py: look up human-readable
# names for the KEGG IDs found in the eggNOG GFF attribute columns
from pathlib import Path
import pandas as pd
import logging as log
import urllib.error
import Bio.KEGG.REST as kegg_rest

log.basicConfig(format='%(levelname)s:%(message)s', level=log.INFO)

# map eggNOG attribute columns to KEGG database names
kegg_pathway_columns = {
    'em_BRITE': 'brite',
    'em_KEGG_Module': 'module',
    'em_KEGG_Pathway': 'pathway',
    'em_KEGG_Reaction': 'reaction',
    'em_KEGG_rclass': 'rclass',
}

# collect all KEGG IDs mentioned in any genome's annotation
kegg_ids = {k: set() for k in kegg_pathway_columns.values()}
for genome, info in snakemake.config['input']['reference genomes'].items():
    attributes = pd.read_table(info['annotation'], header=None, comment="#")[8].to_list()
    for gene in attributes:
        for annot in gene.split(";"):
            if "=" in annot:
                k, v = annot.split("=", maxsplit=1)
                if k in kegg_pathway_columns.keys():
                    kegg_ids[kegg_pathway_columns[k]].update(v.split(","))

def ident_kegg_term(db, id):
    # query the KEGG REST API for the name belonging to an ID
    try:
        line = kegg_rest.kegg_find(db, id).readline()
        return line.split('\n')[0].split('\t')[1]
    except IndexError as e:
        log.error(f'problematic term {id}:\n{kegg_rest.kegg_find(db, id).readline()}')
        return ""
    except urllib.error.URLError as e:
        return e.read().decode("utf8", 'ignore')

ref_table = []
for col, db in kegg_pathway_columns.items():
    s = kegg_ids[db]
    s -= {'-', 'nan'}
    log.info(f"{col} with {len(s)} entries")
    for i in s:
        log.info(i)
        ref_table.append([col, db, i, ident_kegg_term(db, i)])

pd.DataFrame(
    ref_table,
    columns=['eggnog column', 'kegg hierarchy id', 'kegg id', 'kegg name'],
).to_csv(snakemake.output[0], sep='\t', index=False)
```
```r
# presumably scripts/generate_comparison_results.R: run one DESeq2 comparison
# per organism/experiment/comparison wildcard combination
library("DESeq2")

counts = read.csv(file=snakemake@input$counts, sep='\t')
annots = read.csv(file=snakemake@input$annotations, sep='\t')

organism = snakemake@wildcards$organism
experiment = snakemake@wildcards$experiment
comparison = snakemake@wildcards$comparison

exp_design = as.formula(snakemake@config[["Comparisons"]][[organism]][[experiment]]$Design)

# optionally restrict the sample sheet to included/excluded samples from the config
sample_use_df = read.csv(file=snakemake@config$input[['sample table']], sep='\t')
if ('Included Samples' %in% names(snakemake@config[["Comparisons"]][[organism]][[experiment]])) {
    included_samples = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Included Samples']]
    sample_use_df = subset(sample_use_df, sample %in% included_samples)
} else if ('Excluded Samples' %in% names(snakemake@config[["Comparisons"]][[organism]][[experiment]])) {
    excluded_samples = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Excluded Samples']]
    sample_use_df = subset(sample_use_df, !sample %in% excluded_samples)
}
row.names(sample_use_df) <- sample_use_df$sample

counts_subset = subset(counts, organism==snakemake@wildcards$organism)
rownames(counts_subset) = counts_subset$ID
counts_subset = subset(counts_subset, select=rownames(sample_use_df))

annots_subset = subset(annots, organism==snakemake@wildcards$organism)
rownames(annots_subset) = annots_subset$ID

# organism must be present in all samples. Good test to tell if it is from experience...
if (min(colMeans(counts_subset[sapply(counts_subset, is.numeric)], na.rm=TRUE)) < 1) {
    stop('organism must be present in all samples')
}

dds = DESeqDataSetFromMatrix(countData = counts_subset, colData = sample_use_df, design = exp_design)
dds_processed = DESeq(dds)

vst_counts = assay(varianceStabilizingTransformation(dds_processed))
vst_counts = cbind(ID=rownames(vst_counts), vst_counts)
write.table(vst_counts, file=snakemake@output$vst, quote=FALSE, sep='\t', row.names=FALSE)
print("vst file exists:")
print(snakemake@output$vst)
print(file.exists(snakemake@output$vst))

rlog_counts = assay(rlog(dds_processed))
rlog_counts = cbind(ID=rownames(rlog_counts), rlog_counts)
write.table(rlog_counts, file=snakemake@output$rlog, quote=FALSE, sep='\t', row.names=FALSE)
print("rlog file exists:")
print(snakemake@output$rlog)
print(file.exists(snakemake@output$rlog))

comparison_factor = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Factor']]
comparison_treatment = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Treatment']]
comparison_control = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Control']]

comparison_results = as.data.frame(results(dds_processed, contrast=c(comparison_factor, comparison_treatment, comparison_control)))
comparison_results = cbind(ID=rownames(comparison_results), symlog10baseMean = with(comparison_results, log10(baseMean + 1)), comparison_results)
write.table(comparison_results, file=snakemake@output$results, quote=FALSE, sep='\t', row.names=FALSE)
print("results file exists:")
print(snakemake@output$results)
print(file.exists(snakemake@output$results))

results_w_annot = merge(comparison_results, annots_subset, by=0)
write.table(results_w_annot, file=snakemake@output$results_w_annot, quote=FALSE, sep='\t', row.names=FALSE)
print("results_w_annot file exists:")
print(snakemake@output$results_w_annot)
print(file.exists(snakemake@output$results_w_annot))
```
```python
# presumably scripts/generate_data_json.py: bundle all tables and DESeq2
# results into one gzipped JSON file
from pathlib import Path
import pandas as pd
import logging as log
import json
import yaml
import gzip

log.basicConfig(format='%(levelname)s:%(message)s', level=log.INFO)

# touch the output file so it exists from the start
open(snakemake.output[0], 'w').close()

mapping_metadata = pd.read_table(snakemake.input['mapping_metadata'], index_col=0)
organism_occurance = pd.read_table(snakemake.input['organism_occurance'], index_col='organism')
gene_sparsity = pd.read_table(snakemake.input['gene_sparsity'], index_col='organism')
samples = pd.read_table(snakemake.config["input"]["sample table"], index_col="sample")
annotations = pd.read_table(snakemake.input['annotations'], index_col=['organism', 'ID'])
raw_counts = pd.read_table(snakemake.input['counts'], index_col=['organism', 'ID'])
bio_ref_df = pd.read_table(snakemake.input['bio_df_ref'])

data = {
    'metadata': {
        'mapping quality': mapping_metadata.to_dict(),
        'organism occurance': organism_occurance.to_dict(),
        'gene sparsity': gene_sparsity.to_dict(),
    },
    'samples': samples.to_dict(),
    'config': snakemake.config,
    'annotations': {
        org: annotations.loc[org].to_dict()
        for org in annotations.index.get_level_values('organism').unique()
    },
    'raw_counts': {
        org: raw_counts.loc[org].to_dict()
        for org in raw_counts.index.get_level_values('organism').unique()
    },
    'id ref table': bio_ref_df.to_dict(),
    'deseq2': {},
}

# organism/experiment/comparison are encoded in the output directory structure
for rlog, vst, result in zip(snakemake.input['rlogs'], snakemake.input['vsts'], snakemake.input['deseq2_results']):
    assert Path(rlog).parent == Path(vst).parent == Path(result).parent
    org = Path(rlog).parent.parent.parent.name
    exp = Path(rlog).parent.parent.name
    comp = Path(rlog).parent.name
    if org not in data['deseq2'].keys():
        data['deseq2'][org] = {}
    if exp not in data['deseq2'][org].keys():
        data['deseq2'][org][exp] = {}
    data['deseq2'][org][exp][comp] = {
        'rlog': pd.read_table(rlog, index_col='ID').to_dict(),
        'vst': pd.read_table(vst, index_col='ID').to_dict(),
        'results': pd.read_table(result, index_col='ID').to_dict(),
    }

with gzip.open(snakemake.output[0], 'w') as fout:
    fout.write(json.dumps(data).encode('utf-8'))
```
```python
# presumably scripts/generate_raw_counts_metadata_dfs.py: merge per-sample
# htseq-count outputs into count, metadata, occurrence, and sparsity tables
from pathlib import Path
import pandas as pd
import logging as log

log.basicConfig(format='%(levelname)s:%(message)s', level=log.DEBUG)

attributes_df = pd.read_table(snakemake.input['annotations'], index_col=[0, 1])

raw_counts_df_list = []
raw_metadata_df_list = []
reference_index = None
for path in [Path(p) for p in snakemake.input['counts']]:
    sample_name = path.stem
    temp_df = pd.read_table(path, names=['ID', sample_name])
    temp_df = temp_df.set_index('ID')
    if isinstance(reference_index, pd.Index):
        # all htseq-count outputs must share the same feature index;
        # the original `reference_index.all() == temp_df.index.all()` compared
        # two scalars and could not fail, so .equals() is the intended test
        assert reference_index.equals(temp_df.index)
    else:
        reference_index = temp_df.index
    # htseq-count prefixes its summary rows (e.g. __no_feature) with '__'
    temp_metadata_df = temp_df[temp_df.index.str.contains('__')]
    temp_counts_df = temp_df[~temp_df.index.str.contains('__')]
    raw_counts_df_list.append(temp_counts_df)
    raw_metadata_df_list.append(temp_metadata_df)

counts_df = pd.concat(raw_counts_df_list, axis=1)
metadata_df = pd.concat(raw_metadata_df_list, axis=1)
metadata_df.index = metadata_df.index.str.replace('__', '')

# align counts with the (organism, ID) annotation index
counts_df = counts_df.loc[attributes_df.index.get_level_values('ID')]
counts_df.index = attributes_df.index

feature_df = counts_df.join(attributes_df['type']).groupby(['type']).sum()
metadata_df = pd.concat([feature_df, metadata_df])

counts_df.to_csv(Path(snakemake.output["counts"]), sep="\t")
metadata_df.to_csv(Path(snakemake.output["mapping_metadata"]), sep='\t')
counts_df.join(attributes_df).to_csv(Path(snakemake.output["counts_w_annotations"]), sep="\t")

# fraction of each sample's counts attributable to each organism
organism_occurance = counts_df.groupby(level='organism').sum().div(counts_df.sum())
organism_occurance.to_csv(Path(snakemake.output["organism_occurance"]), sep='\t')

# fraction of each organism's genes detected (>= 1 read) per sample
sparsity_df = (counts_df >= 1).groupby(level='organism').sum().div(
    counts_df.index.get_level_values('organism').value_counts(), axis='index')
sparsity_df.index.name = 'organism'
sparsity_df.to_csv(Path(snakemake.output["gene_sparsity"]), sep="\t")
```
```python
# presumably scripts/gff_mod.py: keep only the feature types configured for
# counting and relabel them all as 'exon' so that htseq-count (run with
# -t exon above) picks them up
import pandas as pd

feature_types_to_keep = snakemake.config["feature_types_to_count"]

annotation = pd.read_table(snakemake.input[0], header=None, comment='#')
annotation = annotation[annotation[2].isin(feature_types_to_keep)]
annotation[2] = 'exon'
annotation.to_csv(snakemake.output[0], sep='\t', index=False, header=False)
```