Contamination Detection in the Olive Genome: A Snakemake Workflow on AWS
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This workflow includes analyses to find contamination in the domesticated olive genome. The results are presented in this preprint .
This workflow was run on a amazon aws instance (ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20171026.1 (ami-1c1d217c)). To run this workflow, you must first install and configure miniconda (or anaconda if you prefer).
Install miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Run through the prompts.
Configure miniconda and create an environment
conda config --add channels r
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n snakemake python==3.6 snakemake=4.3.0 pysam=0.13 pandas=0.20.3 numpy=1.13.3 sourmash=2.0.0a2
conda activate snakemake
Then, run the workflow:
snakemake --use-conda -k
Currently, the all rule
dynamic("outputs/sylvestris/blast/tab/{contig_names_sylv}.tab")
in the
Snakefile
is commented out. This means that the pipeline will only output Oe6 relevant files. To output sylvestris files, uncomment the rule, and comment the Oe6 rule.
Code Snippets
10 11 12 13 | shell:''' wget -O {output.gz} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz gunzip -c {output.gz} > {output.uncmp} ''' |
21 22 23 24 25 26 | shell:''' wget -O {output.sangz} ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/678/115/GCA_001678115.1_ASM167811v1/GCA_001678115.1_ASM167811v1_genomic.fna.gz gunzip -c {output.sangz} > {output.sanuncmp} wget -O {output.ex150gz} ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/721/785/GCA_000721785.1_Aureobasidium_pullulans_var._pullulans_EXF-150_assembly_version_1.0/GCA_000721785.1_Aureobasidium_pullulans_var._pullulans_EXF-150_assembly_version_1.0_genomic.fna.gz gunzip -c {output.ex150gz} > {output.ex150uncmp} ''' |
34 35 36 | shell:''' nucmer --mum {input.Oe6} {input.san} -p outputs/aur-pul-nucmer/Oe6-APvarSan ''' |
42 43 44 | shell:''' delta-filter -l 500 -q {input} > {output} ''' |
50 51 52 | shell:''' show-coords -c -l -L 500 -r -T {input} > {output} ''' |
60 61 62 | shell:''' nucmer --mum {input.Oe6} {input.ex150} -p outputs/aur-pul-nucmer/Oe6-APvarEx ''' |
68 69 70 | shell:''' delta-filter -l 500 -q {input} > {output} ''' |
76 77 78 | shell:''' show-coords -c -l -L 500 -r -T {input} > {output} ''' |
13 14 15 16 | shell:''' wget -O {output.gz} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz gunzip -c {output.gz} > {output.uncmp} ''' |
22 23 24 25 | shell:''' wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea%3E1kb_scaffolds.gz gunzip -c {output.gz} > {output.uncmp} ''' |
31 32 33 | shell:''' samtools faidx {input} ''' |
39 40 41 | shell:''' samtools faidx {input} ''' |
53 54 55 56 57 58 59 60 61 62 63 | run: fasta = pysam.Fastafile(filename = input.genome) f=open(input.contigs,'r') for line in f.readlines(): # strip white space line = line.strip() # use pysam to grab the line of interest fasta_of_interest = fasta.fetch(line) with open(f"outputs/Oe6/suspicious_contigs/{line}.fa", "w") as text_file: # write content of fasta file with appropriate header and sequence text_file.write(f">{line}\n{fasta_of_interest}") |
71 72 73 74 75 76 77 78 79 80 81 | run: fasta = pysam.Fastafile(filename = input.genome) f=open(input.contigs,'r') for line in f.readlines(): # strip white space line = line.strip() # use pysam to grab the line of interest fasta_of_interest = fasta.fetch(line) with open(f"outputs/sylvestris/suspicious_contigs/{line}.fa", "w") as text_file: # write content of fasta file with appropriate header and sequence text_file.write(f">{line}\n{fasta_of_interest}") |
87 88 89 90 91 | shell:''' cd inputs/blast_db wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz' cat nt.*.tar.gz | tar -zxvi -f - -C . ''' |
99 100 101 | shell:''' blastn -query {input.contig} -db inputs/blast_db/nt -outfmt 11 -out {output} ''' |
109 110 111 | shell:''' blastn -query {input.contig} -db inputs/blast_db/nt -outfmt 11 -out {output} ''' |
117 118 119 | shell:''' blast_formatter -archive {input} -outfmt 6 -out {output} ''' |
125 126 127 | shell:''' blast_formatter -archive {input} -outfmt 6 -out {output} ''' |
8 9 10 11 | shell:''' wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea_all_scaffolds.fa.gz gunzip -c {output.gz} > {output.uncmp} ''' |
17 18 19 20 21 | shell:''' wget -O {output.tgz} http://busco.ezlab.org/datasets/embryophyta_odb9.tar.gz mkdir -p {output.db} tar -xf {output.tgz} --strip-components=1 -C {output.db} ''' |
28 29 30 31 | shell:''' run_busco -i {input.genome} -o wild_olive_busco -l {input.busco_db} -m geno mv run_wild_olive_busco {output} ''' |
8 9 10 11 | shell:''' wget -O {output.gz} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz gunzip -c {output.gz} > {output.uncmp} ''' |
17 18 19 20 | shell:''' wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea_chromosome+unchromosome.gz gunzip -c {output.gz} > {output.uncmp} ''' |
28 29 30 | shell:''' nucmer --mum {input.sylv} {input.Oe6} -p outputs/olive_genomes_nucmer/sylvester_santander_nucmer ''' |
36 37 38 | shell:''' delta-filter -l 500 -q {input} > {output} ''' |
44 45 46 | shell:''' show-coords -c -l -L 500 -r -T {input} > {output} ''' |
6 7 8 | shell:''' wget -O {output} http://denovo.cnag.cat/genomes/olive/download/Oe6/OE6A.pep.fa ''' |
14 15 16 17 | shell:''' wget -O {output.gz} http://olivegenome.org/genome_datasets/Olea_europaea.gene.pep.final.chr_and_chrUn_noTE.gz gunzip -c {output.gz} > {output.uncmp} ''' |
32 33 34 35 | shell:''' orthofinder -f inputs/peptides -og cp -a inputs/peptides/Results*/. {output} ''' |
8 9 10 | shell:''' wget -O {output} http://denovo.cnag.cat/genomes/olive/download/Oe6/Oe6.scaffolds.fa.gz ''' |
14 15 16 | shell:''' wget -O {output} http://olivegenome.org/genome_datasets/Olea_europaea%3E1kb_scaffolds.gz ''' |
22 23 24 25 | shell:''' # compute tetranucleotide frequency of scaffolds sourmash compute -k 31 --scaled 10000 -o {output} {input} ''' |
31 32 33 34 | shell:''' # compute tetranucleotide frequency of scaffolds sourmash compute -k 31 --scaled 10000 -o {output} {input} ''' |
38 39 40 | shell:''' wget -O inputs/genbank_lca/genbank-k31.lca.json.gz https://osf.io/zskb9/download ''' |
48 49 50 | shell:''' sourmash lca classify --db {input.genbank} --query {input.sig} -o {output} ''' |
58 59 60 | shell:''' sourmash lca classify --db {input.genbank} --query {input.sig} -o {output} ''' |
Support
- Future updates
Related Workflows





