Workflow for RNA-seq using the HISAT2 aligner
Get the HISAT2 index for the human genome:
wget https://cloud.biohpc.swmed.edu/index.php/s/grch38/download
mv download grch38.tar.gz
tar -xvzf grch38.tar.gz
Link the index location in the Snakefile, e.g.:
GENOME="/cluster/home/michalo/project_michalo/hisat/grch38/genome"
Get the GTF annotation:
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
gunzip Homo_sapiens.GRCh38.99.gtf.gz
Link the GTF in the Snakefile, e.g.:
GTF="/cluster/home/michalo/project_michalo/hg38/Homo_sapiens.GRCh38.99.gtf"
Software required:
If you want to run the workflow locally, the tools it uses (Trimmomatic, HISAT2, Subread, samtools) must be installed and made runnable from the command line.
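One possible way to install them, assuming a conda setup with the bioconda and conda-forge channels (not required by the workflow itself), is:
conda install -c bioconda -c conda-forge trimmomatic hisat2 subread samtools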
Adapting
The paths to the genome index, GTF and adapter file need to be set in the Python constants in the Snakefile. If needed, also set the paths to the software commands and the Trimmomatic jar; it is recommended to have them on the executable or Java path, e.g. by setting the corresponding environment variables.
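As a minimal sketch, the constants near the top of the Snakefile could look like the following (GENOME and GTF as above; TRIMFILE and CORES are the names used in the code snippets further down, and their values here are only examples):
GENOME="/cluster/home/michalo/project_michalo/hisat/grch38/genome"
GTF="/cluster/home/michalo/project_michalo/hg38/Homo_sapiens.GRCh38.99.gtf"
TRIMFILE="adapters.fa"   # adapter file passed to Trimmomatic's ILLUMINACLIP (example value)
CORES="24"               # concatenated into shell commands, so kept as a string (example value)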
Running
Create a run directory containing the Snakefile, the adapters.fa file and your fastq.gz files in a "data" subdirectory. Update the Snakefile as described above (location of the genome index and GTF annotation), then:
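The run directory would then look roughly like this (the sample file names are only illustrative; cluster.json is needed only for cluster runs):
run_dir/
├── Snakefile
├── adapters.fa
├── cluster.json
└── data/
    ├── sample1.fastq.gz
    └── sample2.fastq.gz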
dry run
snakemake -np
normal run
snakemake -p
run on the cluster
Make Snakemake available in the cluster environment, e.g.:
module load gcc/8.2.0 python/3.10.4
LSF
snakemake -p -j 999 --cluster-config cluster.json --cluster "bsub -W {cluster.time} -n {cluster.n}"
SLURM
# change times in cluster.json to HH:MM:SS
snakemake -p -j 999 --cluster-config cluster.json --cluster "sbatch --time {cluster.time} -n {cluster.n}"
snakemake -p -j 999 --cluster-config cluster.json --cluster "sbatch --time {cluster.time} -n 1 --cpus-per-task={cluster.n}"
snakemake -p -j 999 --cluster-config cluster.json --cluster "sbatch --time {cluster.time} -n 1 --cpus-per-task={cluster.n} --mem-per-cpu={cluster.mem}"
SLURM with containers
Running the workflow with containers from the Galaxy software stack requires passing the external folders to Snakemake as Singularity bind parameters. The containers will be downloaded into the .snakemake folder.
snakemake -p -j 999 --use-singularity --cluster-config cluster.json \
--cluster "sbatch --time {cluster.time} -n 1 --cpus-per-task={cluster.n}" \
--singularity-args "--bind /cluster/scratch/michalo/Anthony_RNA/:/mnt2 --bind /cluster/home/michalo/project_michalo/hisat/grch38/:/genomes --bind /cluster/home/michalo/project_michalo/hg38/:/annots"
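With these bind mounts, the Snakefile constants would presumably point to the paths as seen inside the containers, for example:
GENOME="/genomes/genome"
GTF="/annots/Homo_sapiens.GRCh38.99.gtf"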
Code Snippets
run:
    shell(
        'module load gdc \n'+
        'module load java \n'+
        'module load trimmomatic \n'+
        'echo {input} \n'+
        'trimmomatic SE -phred33 {input} {output} ILLUMINACLIP:'+TRIMFILE+':2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36'
    )
run:
    shell(
        'module load gcc/4.8.2 gdc python/2.7.11 hisat2/2.1.0 \n'+
        'echo {input} \n'+
        'hisat2 -q -p '+CORES+' -x '+GENOME+' -U {input} -S mapped_reads/{wildcards.sample}.sam \n')
run:
    shell(
        'module load samtools \n'+
        'samtools view -@ '+CORES+' -bS {input} > {output} ')
shell:
    "module load samtools \n"
    "samtools sort -@ 24 -T sorted_reads/{wildcards.sample} "
    "-O bam {input} > {output}"
shell:
    "module load samtools \n"
    "samtools index {input}"
shell:
    "touch secondary_analysis/final_marker_bai.txt"
shell:
    "stringtie --rf -o {output} -p 24 {input}"
shell:
    "touch secondary_analysis/final_marker_string.txt"
shell:
    "module load legacy gcc/4.8.2 python/2.7.6 samtools/1.1 boost/1.55.0 eigen/3.2.1 cufflinks/2.2.1 \n"
shell:
    "touch secondary_analysis/final_marker_cuff.txt"
shell:
    "module load subread \n"
    "featureCounts -M -f --fraction -s 2 -T 24 -t gene -g gene_id -a "+GTF+" -o {output} {input}"
run:
    import pandas
    import glob

    filez = glob.glob('secondary_analysis/*.cnt')
    t1 = pandas.read_table(filez[1], header=1)
    tout = t1.iloc[:,0]

    for f in filez:
        t1 = pandas.read_table(f, header=1)
        tout = pandas.concat([tout, t1.iloc[:,6]], axis=1)
        print(f)

    tout.to_csv('secondary_analysis/counts.csv')