Single Cell Analysis with Snakemake Pipelines: From Data Download to Processing
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Snakemake Pipelines for Single Cell. This pipeline is based on snakemake . It is probably worth reading through the introductory material before using this code.
To get started, you must have the following on your path
-
Salmon (ensure that the index specified in your config is the same as the salmon in your path)
-
Samtools
-
Python (snakemake is only supported by Python 3)
Make sure the python requirements are fulfilled using the following:
pip install -r requirements.txt
How to Use
Generally, full use of the pipeline involves downloading the data from iRODS first before processing the data.
These two steps are not done withing the same snakemake workflow, as the iRODS download is limited to 16 simultaneous processes, and in snakemake you cannot specify a job limit per step within the same workflow.
Both steps are directed by a shared config.json file. An example file is included in the repository. Please copy and create your own file per new processing task instead of using the provided example file, as this will make updating this repository easier.
There are two ways of running the pipeline - either using the bash scripts, which call snakemake using the required cluster syntax, or by using the individual .snake files as specified in the original repository .
Running the pipeline
The pipeline bash script can be used as follows (assuming your config is in ~/workflows/config.json):
pipeline.sh -c ~/workflows/config.json
The config file
The config file should specify, as a minimum:
-
The directory where the cram / fastq files are for analysis
-
The lustre folder for temporary files
-
The salmon index directory (created with the same version as the salmon in your path)
-
A file mapping transcript to gene ID's
-
The study ID
-
The query of what runs / lanes to fetch from the study
-
The file pattern (e.g.
{run}_{lane}#{tag_index}
)
The
query
parameter specifies the iRODS query. The syntax is in JSON.
Code Snippets
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 | shell: "multiqc {salmon_results} -o {out_folder}".format( salmon_results=os.path.join(RESULTS_FOLDER, "quant"), out_folder = os.path.join(RESULTS_FOLDER, "qc", "multiqc_salmon")) rule read_salmon_qcs: # Combine the salmon QC output into a single table input: expand(os.path.join(RESULTS_FOLDER, "quant", "{sample}", "quant.genes.sf"), sample=final_samples) output: os.path.join(RESULTS_FOLDER, "qc", "salmon_qc.csv"), log: os.path.join(LOG_FOLDER, "parsing", "salmon_qc.log") run: qcs = read_qcs(pattern=os.path.join(RESULTS_FOLDER, "quant", "*")) qcs.to_csv(output[0]) rule read_reads: # Combine the read count into a single table input: expand(os.path.join(RESULTS_FOLDER, "quant", "{sample}", "quant.genes.sf"), sample=final_samples) output: os.path.join(RESULTS_FOLDER, "results", "counts.csv") log: os.path.join(LOG_FOLDER, "parsing", "reads.log") run: results = read_quants(os.path.join(RESULTS_FOLDER, "quant", "*"), col="NumReads") results.to_csv(output[0]) rule read_tpm: # Combine the computed TPM into a single table input: expand(os.path.join(RESULTS_FOLDER, "quant", "{sample}", "quant.genes.sf"), sample=final_samples) output: os.path.join(RESULTS_FOLDER, "results", "tpm.csv"), log: os.path.join(LOG_FOLDER, "parsing", "tpm.log") run: results = read_quants(pattern=os.path.join(RESULTS_FOLDER, "quant", "*")) results.to_csv(output[0]) rule quantify: input: forward=os.path.join(LUSTRE, "{sample}_" + FORWARD + ".fastq"), reverse=os.path.join(LUSTRE, "{sample}_" + REVERSE + ".fastq") output: os.path.join(RESULTS_FOLDER, "quant", "{sample}", "quant.genes.sf") log: lambda wildcards: os.path.join( LOG_FOLDER, "salmon", "{sample}.log".format(sample=wildcards.sample)) params: out_folder=lambda wildcards: os.path.join(RESULTS_FOLDER, "quant", wildcards.sample), index=config['index'], gene_map=config['gene_map'] shell: "if [ $(wc -c < '{input.forward}') -le 1200 ]; then " "touch {params.out_folder}/quant.genes.sf; " "else " "salmon quant -i {params.index} " "-g {params.gene_map} " "-l IU " "-1 {input.forward} -2 {input.reverse} " "-o {params.out_folder}; " "fi; " |
65 66 67 | run: qcs = read_qcs(pattern=os.path.join(RESULTS_FOLDER, "quant", "*")) qcs.to_csv(output[0]) |
78 79 80 81 | run: results = read_quants(os.path.join(RESULTS_FOLDER, "quant", "*"), col="NumReads") results.to_csv(output[0]) |
92 93 94 | run: results = read_quants(pattern=os.path.join(RESULTS_FOLDER, "quant", "*")) results.to_csv(output[0]) |
130 131 | shell: "cp {input} {output}" |
142 143 | shell: "cp {input} {output}" |
162 163 | shell: "cat {input} > {output}" |
182 183 | shell: "cat {input} > {output}" |
200 201 | shell: "multiqc {params.fastqc_folder} -o {params.multiqc_folder}" |
219 220 | shell: "fastqc -o {params.out_dir} {input.forward} {input.reverse}" |
235 236 237 | shell: "samtools sort -m 10G -n -T %s {input} | " "samtools fastq -F 0xB00 -1 {output.forward} -2 {output.reverse} -" |
Support
- Future updates
Related Workflows





