This workflow performs differential gene expression (DGE) analysis of co-culture experiments. Based on the rules and scripts shown below, reads are quality- and adapter-trimmed with bbduk, mapped to one or more reference genomes with bowtie2 or bwa, counted per feature with htseq-count, and tested for differential expression with DESeq2; KEGG/eggNOG annotations are joined onto the results.
Insert your code into the respective folders, i.e. `scripts`, `rules`, and `envs`. Define the entry point of the workflow in the `Snakefile` and the main configuration in the `config.yaml` file.
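Under the standard snakemake-workflows layout (an assumption here — confirm against this repository), that corresponds roughly to:

```
workflow/
├── Snakefile         # entry point
├── rules/            # rule definitions included by the Snakefile
├── envs/             # conda environment specs
└── scripts/          # Python/R scripts called via the script: directive
config/
├── config.yaml       # workflow configuration
└── samples.tsv       # sample sheet
```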
Authors
- Konnor von Emster (@kve)
Usage
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis (see the sketch below).
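A minimal sketch of the clone step, assuming a hypothetical account name `your-account` and the default repository name:

```
git clone https://github.com/your-account/DGE-co-culture-workflow.git
cd DGE-co-culture-workflow
```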
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `samples.tsv` to specify your sample setup.
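The scripts in this workflow read a handful of keys from `config.yaml`. A minimal hypothetical sketch — organism names, paths, factor levels, and the `genome` key are placeholders; the remaining keys are those actually referenced in the code snippets below:

```yaml
input:
  sample table: config/samples.tsv
  reference genomes:
    organism_A:
      genome: resources/organism_A.fasta    # hypothetical key name; not shown in the snippets
      annotation: resources/organism_A.gff
feature_types_to_count:
  - CDS
Comparisons:
  organism_A:
    experiment_1:
      Design: "~ treatment"
      # optional: 'Included Samples' or 'Excluded Samples' lists
      Comparisons:
        treated_vs_control:
          Factor: treatment
          Treatment: treated
          Control: control
```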
Sample table specification
Required columns:
- "treatment" –– specifies the treatment of the sample. This is the same within biological replicates.
- "sample_name" –– specifies a sample identifier. This must be unique for each sample.

For paired-end reads:
- "fwd_read" –– specifies the path of the forward read.
- "rev_read" –– specifies the path of the reverse read.

For single-end reads:
- "read_path" –– specifies the path of the single read.
Step 3: Install Snakemake
Install Snakemake using conda:

```
conda create -c bioconda -c conda-forge -n snakemake snakemake
```

For installation details, see the instructions in the Snakemake documentation.
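To confirm the environment was created and Snakemake is callable, you can, for instance, run:

```
conda run -n snakemake snakemake --version
```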
Step 4: Execute workflow
Activate the conda environment:

```
conda activate snakemake
```

Test your configuration by performing a dry-run via

```
snakemake --use-conda -n
```

Execute the workflow locally via

```
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```
snakemake --use-conda --cluster qsub --jobs 100
```

or

```
snakemake --use-conda --drmaa --jobs 100
```

If you not only want to fix the software stack but also the underlying OS, use

```
snakemake --use-conda --use-singularity
```

in combination with any of the modes above. See the Snakemake documentation for further details.
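For example, a local run that fixes both the software stack and the OS (the core count is just an example):

```
snakemake --use-conda --use-singularity --cores 8
```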
Step 5: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:

```
snakemake --report report.html
```

This report can, e.g., be forwarded to your collaborators. An example (using some trivial test data) can be seen here.
Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

```
git commit -a
git push
```
Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.

- Once, register the upstream repository in your local copy:

  ```
  git remote add -f upstream git@github.com:snakemake-workflows/DGE-co-culture-workflow.git
  ```

  or, if you have not set up SSH keys:

  ```
  git remote add -f upstream https://github.com/snakemake-workflows/DGE-co-culture-workflow.git
  ```

- Update the upstream version: `git fetch upstream`.
- Create a diff with the current version: `git diff HEAD upstream/master workflow > upstream-changes.diff`.
- Investigate the changes: `vim upstream-changes.diff`.
- Apply the modified diff via: `git apply upstream-changes.diff`.
- Carefully check whether you need to update the config files: `git diff HEAD upstream/master config`. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 8: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:

- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g., `cp -r workflow path/to/fork`. Make sure not to accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
Testing
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with GitHub Actions.
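To run the tests locally before pushing, a dry-run against the test data should work, assuming `.test` follows the usual snakemake-workflows layout with its own config:

```
snakemake --use-conda -n --directory .test
```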
Code Snippets
The fragments below are the `shell:` and `script:` directives from the workflow's rule files; the full scripts they reference follow afterwards.

```
# count reads per feature with htseq-count
shell:
    "htseq-count --idattr=ID -a {params.mapping_quality_thresh} -s reverse -t exon -r pos --nonunique all "
    "{input.map_file} {input.gff} > {output}"
```

```
# read quality control with FastQC
shell:
    "fastqc {input}"
```

```
# move the FastQC outputs into place
shell:
    """
    mv {input.html} {output.html}
    mv {input.zipp} {output.zipp}
    """
```

```
# concatenate input files (this fragment appears twice in the source,
# presumably one rule per read direction)
shell:
    "cat {input} > {output}; sleep 1"
```

```
# prepare the GFF annotation for counting
script:
    "../scripts/gff_mod.py"
```

```
# paired-end mapping with bowtie2
shell:
    "bowtie2 -x {input.ref} -1 {input.r1} -2 {input.r2} -S {output} -p {resources.tasks}"
```

```
# build the bowtie2 index
shell:
    "bowtie2-build --threads {resources.tasks} {input.ref} {input.ref}"
```

```
# paired-end mapping with bwa mem
shell:
    "bwa mem {input.ref} {input.r1} {input.r2} > {output}"
```

```
# build the bwa index
shell:
    "bwa index {input.ref}"
```

```
# build the KEGG ID reference table
script:
    "../scripts/generate_bio_db_ref_table.py"
```

```
# build the combined annotation table
script:
    "../scripts/generate_annotation_df.py"
```

```
# assemble raw-count and mapping-metadata tables
script:
    "../scripts/generate_raw_counts_metadata_dfs.py"
```

```
# run the DESeq2 comparisons
script:
    "../scripts/generate_comparison_results.R"
```

```
# bundle all results into a compressed JSON
script:
    "../scripts/generate_data_json.py"
```

```
# adapter and quality trimming with bbduk
shell:
    "bbduk.sh threads={resources.tasks} "
    "in1={input.r1} in2={input.r2} "
    "out1={output.o1} out2={output.o2} "
    "minlen=25 qtrim=rl trimq=10 "
    "ref={input.ref} ktrim=r k=23 mink=11 hdist=1"
```

```
# SAM to BAM conversion
shell:
    "samtools view -@ {resources.tasks} -S -b {input} > {output}"
```

```
# sort the BAM file
shell:
    "samtools sort -@ {resources.tasks} {input} -o {output}"
```

```
# index the sorted BAM file
shell:
    "samtools index -@ {resources.tasks} -b {input}"
```

```
# decompress gzipped input
shell:
    "gzip -d {input}"
```
```python
# presumably scripts/generate_annotation_df.py: parse the per-genome GFF
# files into one combined annotation table, indexed by (organism, ID)
from pathlib import Path
import pandas as pd
import logging as log

log.basicConfig(format='%(levelname)s:%(message)s', level=log.DEBUG)

def create_attr_dict(attr_row):
    # split a GFF column-9 attribute string into a key/value dict
    log.debug(attr_row)
    d = {}
    for key_value_pair in attr_row.split(";"):
        k_v_list = key_value_pair.split(sep="=", maxsplit=1)
        if len(k_v_list) == 2:
            k, v = k_v_list
            d[k] = v
    return d

def generate_gff_df(gff_file):
    ## This method is heavily influenced by GFFpandas
    df = pd.read_csv(
        gff_file,
        sep="\t",
        comment="#",
        names=[
            "seq_id", "source", "type", "start", "end",
            "score", "strand", "phase", "attributes",
        ],
    )
    attr_dict_series = df.attributes.apply(
        lambda attributes: create_attr_dict(attributes)
    )
    # expand every attribute key into its own column
    key_set = set()
    attr_dict_series.apply(lambda at_dic: key_set.update(list(at_dic.keys())))
    for attr in sorted(list(key_set)):
        df[attr] = attr_dict_series.apply(
            lambda attr_dict: attr_dict.get(attr)
        )
    return df

gffs = []
for genome, info in snakemake.config['input']['reference genomes'].items():
    attributes_df = generate_gff_df(info['annotation'])
    attributes_df['organism'] = genome
    gffs.append(attributes_df)
attributes_df = pd.concat(gffs)

# filtering by 'type' column
feature_types_to_keep = snakemake.config["feature_types_to_count"]
attributes_df = attributes_df[attributes_df['type'].isin(feature_types_to_keep)]

attributes_df = attributes_df.set_index(['organism', 'ID'])
attributes_df.to_csv(snakemake.output[0], sep='\t')
```
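For reference, the GFF attribute strings that `create_attr_dict` above parses look like the following (hypothetical values):

```python
# hypothetical attribute string from column 9 of a GFF file
attrs = "ID=gene0001;em_KEGG_Pathway=ko00010,map00010;product=hypothetical protein"
print(create_attr_dict(attrs))
# -> {'ID': 'gene0001', 'em_KEGG_Pathway': 'ko00010,map00010', 'product': 'hypothetical protein'}
```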
```python
# presumably scripts/generate_bio_db_ref_table.py: look up human-readable
# names for the KEGG IDs found in the eggNOG GFF attribute columns
from pathlib import Path
import pandas as pd
import logging as log
import urllib.error
import Bio.KEGG.REST as kegg_rest

log.basicConfig(format='%(levelname)s:%(message)s', level=log.INFO)

# map eggNOG attribute columns to KEGG database names
kegg_pathway_columns = {
    'em_BRITE': 'brite',
    'em_KEGG_Module': 'module',
    'em_KEGG_Pathway': 'pathway',
    'em_KEGG_Reaction': 'reaction',
    'em_KEGG_rclass': 'rclass',
}

# collect all KEGG IDs mentioned in any genome's annotation
kegg_ids = {k: set() for k in kegg_pathway_columns.values()}
for genome, info in snakemake.config['input']['reference genomes'].items():
    attributes = pd.read_table(info['annotation'], header=None, comment="#")[8].to_list()
    for gene in attributes:
        for annot in gene.split(";"):
            if "=" in annot:
                k, v = annot.split("=", maxsplit=1)
                if k in kegg_pathway_columns.keys():
                    kegg_ids[kegg_pathway_columns[k]].update(v.split(","))

def ident_kegg_term(db, id):
    # query the KEGG REST API for the name belonging to an ID
    try:
        line = kegg_rest.kegg_find(db, id).readline()
        return line.split('\n')[0].split('\t')[1]
    except IndexError as e:
        log.error(f'problematic term {id}:\n{kegg_rest.kegg_find(db, id).readline()}')
        return ""
    except urllib.error.URLError as e:
        return e.read().decode("utf8", 'ignore')

ref_table = []
for col, db in kegg_pathway_columns.items():
    s = kegg_ids[db]
    s -= {'-', 'nan'}
    log.info(f"{col} with {len(s)} entries")
    for i in s:
        log.info(i)
        ref_table.append([col, db, i, ident_kegg_term(db, i)])

pd.DataFrame(
    ref_table,
    columns=['eggnog column', 'kegg hierarchy id', 'kegg id', 'kegg name'],
).to_csv(snakemake.output[0], sep='\t', index=False)
```
```r
# presumably scripts/generate_comparison_results.R: run one DESeq2 comparison
# per organism/experiment/comparison wildcard combination
library("DESeq2")

counts = read.csv(file=snakemake@input$counts, sep='\t')
annots = read.csv(file=snakemake@input$annotations, sep='\t')

organism = snakemake@wildcards$organism
experiment = snakemake@wildcards$experiment
comparison = snakemake@wildcards$comparison

exp_design = as.formula(snakemake@config[["Comparisons"]][[organism]][[experiment]]$Design)

# optionally restrict the sample sheet to included/excluded samples from the config
sample_use_df = read.csv(file=snakemake@config$input[['sample table']], sep='\t')
if ('Included Samples' %in% names(snakemake@config[["Comparisons"]][[organism]][[experiment]])) {
    included_samples = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Included Samples']]
    sample_use_df = subset(sample_use_df, sample %in% included_samples)
} else if ('Excluded Samples' %in% names(snakemake@config[["Comparisons"]][[organism]][[experiment]])) {
    excluded_samples = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Excluded Samples']]
    sample_use_df = subset(sample_use_df, !sample %in% excluded_samples)
}
row.names(sample_use_df) <- sample_use_df$sample

counts_subset = subset(counts, organism==snakemake@wildcards$organism)
rownames(counts_subset) = counts_subset$ID
counts_subset = subset(counts_subset, select=rownames(sample_use_df))

annots_subset = subset(annots, organism==snakemake@wildcards$organism)
rownames(annots_subset) = annots_subset$ID

# organism must be present in all samples. Good test to tell if it is from experience...
if (min(colMeans(counts_subset[sapply(counts_subset, is.numeric)], na.rm=TRUE)) < 1) {
    stop('organism must be present in all samples')
}

dds = DESeqDataSetFromMatrix(countData = counts_subset, colData = sample_use_df, design = exp_design)
dds_processed = DESeq(dds)

vst_counts = assay(varianceStabilizingTransformation(dds_processed))
vst_counts = cbind(ID=rownames(vst_counts), vst_counts)
write.table(vst_counts, file=snakemake@output$vst, quote=FALSE, sep='\t', row.names=FALSE)
print("vst file exists:")
print(snakemake@output$vst)
print(file.exists(snakemake@output$vst))

rlog_counts = assay(rlog(dds_processed))
rlog_counts = cbind(ID=rownames(rlog_counts), rlog_counts)
write.table(rlog_counts, file=snakemake@output$rlog, quote=FALSE, sep='\t', row.names=FALSE)
print("rlog file exists:")
print(snakemake@output$rlog)
print(file.exists(snakemake@output$rlog))

comparison_factor = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Factor']]
comparison_treatment = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Treatment']]
comparison_control = snakemake@config[["Comparisons"]][[organism]][[experiment]][['Comparisons']][[comparison]][['Control']]

comparison_results = as.data.frame(results(dds_processed, contrast=c(comparison_factor, comparison_treatment, comparison_control)))
comparison_results = cbind(ID=rownames(comparison_results), symlog10baseMean = with(comparison_results, log10(baseMean + 1)), comparison_results)
write.table(comparison_results, file=snakemake@output$results, quote=FALSE, sep='\t', row.names=FALSE)
print("results file exists:")
print(snakemake@output$results)
print(file.exists(snakemake@output$results))

results_w_annot = merge(comparison_results, annots_subset, by=0)
write.table(results_w_annot, file=snakemake@output$results_w_annot, quote=FALSE, sep='\t', row.names=FALSE)
print("results_w_annot file exists:")
print(snakemake@output$results_w_annot)
print(file.exists(snakemake@output$results_w_annot))
```
```python
# presumably scripts/generate_data_json.py: bundle all tables and DESeq2
# results into one gzipped JSON file
from pathlib import Path
import pandas as pd
import logging as log
import json
import yaml
import gzip

log.basicConfig(format='%(levelname)s:%(message)s', level=log.INFO)

# touch the output file so it exists from the start
open(snakemake.output[0], 'w').close()

mapping_metadata = pd.read_table(snakemake.input['mapping_metadata'], index_col=0)
organism_occurance = pd.read_table(snakemake.input['organism_occurance'], index_col='organism')
gene_sparsity = pd.read_table(snakemake.input['gene_sparsity'], index_col='organism')
samples = pd.read_table(snakemake.config["input"]["sample table"], index_col="sample")
annotations = pd.read_table(snakemake.input['annotations'], index_col=['organism', 'ID'])
raw_counts = pd.read_table(snakemake.input['counts'], index_col=['organism', 'ID'])
bio_ref_df = pd.read_table(snakemake.input['bio_df_ref'])

data = {
    'metadata': {
        'mapping quality': mapping_metadata.to_dict(),
        'organism occurance': organism_occurance.to_dict(),
        'gene sparsity': gene_sparsity.to_dict(),
    },
    'samples': samples.to_dict(),
    'config': snakemake.config,
    'annotations': {
        org: annotations.loc[org].to_dict()
        for org in annotations.index.get_level_values('organism').unique()
    },
    'raw_counts': {
        org: raw_counts.loc[org].to_dict()
        for org in raw_counts.index.get_level_values('organism').unique()
    },
    'id ref table': bio_ref_df.to_dict(),
    'deseq2': {},
}

# organism/experiment/comparison are encoded in the output directory structure
for rlog, vst, result in zip(snakemake.input['rlogs'], snakemake.input['vsts'], snakemake.input['deseq2_results']):
    assert Path(rlog).parent == Path(vst).parent == Path(result).parent
    org = Path(rlog).parent.parent.parent.name
    exp = Path(rlog).parent.parent.name
    comp = Path(rlog).parent.name
    if org not in data['deseq2'].keys():
        data['deseq2'][org] = {}
    if exp not in data['deseq2'][org].keys():
        data['deseq2'][org][exp] = {}
    data['deseq2'][org][exp][comp] = {
        'rlog': pd.read_table(rlog, index_col='ID').to_dict(),
        'vst': pd.read_table(vst, index_col='ID').to_dict(),
        'results': pd.read_table(result, index_col='ID').to_dict(),
    }

with gzip.open(snakemake.output[0], 'w') as fout:
    fout.write(json.dumps(data).encode('utf-8'))
```
```python
# presumably scripts/generate_raw_counts_metadata_dfs.py: merge per-sample
# htseq-count outputs into count, metadata, occurrence, and sparsity tables
from pathlib import Path
import pandas as pd
import logging as log

log.basicConfig(format='%(levelname)s:%(message)s', level=log.DEBUG)

attributes_df = pd.read_table(snakemake.input['annotations'], index_col=[0, 1])

raw_counts_df_list = []
raw_metadata_df_list = []
reference_index = None
for path in [Path(p) for p in snakemake.input['counts']]:
    sample_name = path.stem
    temp_df = pd.read_table(path, names=['ID', sample_name])
    temp_df = temp_df.set_index('ID')
    if isinstance(reference_index, pd.Index):
        # all htseq-count outputs must share the same feature index;
        # the original `reference_index.all() == temp_df.index.all()` compared
        # two scalars and could not fail, so .equals() is the intended test
        assert reference_index.equals(temp_df.index)
    else:
        reference_index = temp_df.index
    # htseq-count prefixes its summary rows (e.g. __no_feature) with '__'
    temp_metadata_df = temp_df[temp_df.index.str.contains('__')]
    temp_counts_df = temp_df[~temp_df.index.str.contains('__')]
    raw_counts_df_list.append(temp_counts_df)
    raw_metadata_df_list.append(temp_metadata_df)

counts_df = pd.concat(raw_counts_df_list, axis=1)
metadata_df = pd.concat(raw_metadata_df_list, axis=1)
metadata_df.index = metadata_df.index.str.replace('__', '')

# align counts with the (organism, ID) annotation index
counts_df = counts_df.loc[attributes_df.index.get_level_values('ID')]
counts_df.index = attributes_df.index

feature_df = counts_df.join(attributes_df['type']).groupby(['type']).sum()
metadata_df = pd.concat([feature_df, metadata_df])

counts_df.to_csv(Path(snakemake.output["counts"]), sep="\t")
metadata_df.to_csv(Path(snakemake.output["mapping_metadata"]), sep='\t')
counts_df.join(attributes_df).to_csv(Path(snakemake.output["counts_w_annotations"]), sep="\t")

# fraction of each sample's counts attributable to each organism
organism_occurance = counts_df.groupby(level='organism').sum().div(counts_df.sum())
organism_occurance.to_csv(Path(snakemake.output["organism_occurance"]), sep='\t')

# fraction of each organism's genes detected (>= 1 read) per sample
sparsity_df = (counts_df >= 1).groupby(level='organism').sum().div(
    counts_df.index.get_level_values('organism').value_counts(), axis='index')
sparsity_df.index.name = 'organism'
sparsity_df.to_csv(Path(snakemake.output["gene_sparsity"]), sep="\t")
```
```python
# presumably scripts/gff_mod.py: keep only the feature types configured for
# counting and relabel them all as 'exon' so that htseq-count (run with
# -t exon above) picks them up
import pandas as pd

feature_types_to_keep = snakemake.config["feature_types_to_count"]

annotation = pd.read_table(snakemake.input[0], header=None, comment='#')
annotation = annotation[annotation[2].isin(feature_types_to_keep)]
annotation[2] = 'exon'
annotation.to_csv(snakemake.output[0], sep='\t', index=False, header=False)
```