Introduction
This repo contains Snakemake pipelines for replicating the analysis in our sstar paper. These pipelines were tested on Linux operating systems (CentOS 8, Oracle Linux 8, and Ubuntu 20.04).
To replicate our analysis, first install Anaconda or Miniconda, then use the following commands to create a virtual environment for the analysis.
conda install mamba -n base -c conda-forge
mamba env create -f environment.yml
conda activate sstar-analysis
For tools that cannot be installed through conda, users can follow the commands below.
mkdir ext && cd ext
# Download SPrime and its pipeline
mkdir SPrime && cd SPrime
wget https://faculty.washington.edu/browning/sprime.jar
git clone https://github.com/YingZhou001/sprimepipeline
chmod a+x sprimepipeline/pub.pipeline.pbs/tools/map_arch_genome/map_arch
sed 's/out<-c()/out<-data.frame()/' sprimepipeline/pub.pipeline.pbs/tools/score_summary.r > tmp
mv tmp sprimepipeline/pub.pipeline.pbs/tools/score_summary.r
cd ..
# Download SkovHMM
git clone https://github.com/LauritsSkov/Introgression-detection SkovHMM
cd SkovHMM
git checkout 3d1865a56b8fdecc2768448e6cb982a157c37c50
cd ..
# Download ArchaicSeeker2.0
git clone https://github.com/Shuhua-Group/ArchaicSeeker2.0
cd ArchaicSeeker2.0
cd libnlopt.so.0_free
chmod a+x ArchaicSeeker2
cd ../..
Users also need to download ms.tar.gz from the Hudson Lab website, decompress it under the ext folder, and compile it with the following commands.
cd msdir
${CONDA_PREFIX}/bin/gcc -o ms ms.c streec.c rand1.c -lm
cd ../..
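If the build succeeded, ms can be invoked directly. For example, the following standard ms command line (shown here only as a quick check, not part of the pipelines) simulates two replicates of four chromosomes with theta = 5, assuming you are back in the repository root:
ext/msdir/ms 4 2 -t 5.0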
Running the pipelines
After installing the tools above, users can test the pipelines locally with a dry run.
snakemake -np
To run all the pipelines at once, users can use the following command.
snakemake -c 1 --use-conda
-c specifies the number of threads; snakemake will run as many jobs in parallel as the given number of threads allows.
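For example, to allow Snakemake to schedule jobs across four threads:
snakemake -c 4 --use-conda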
However, we recommend running each pipeline individually. The two simulation pipelines must be run first.
snakemake -s workflows/1src/simulation.snake -c 1
snakemake -s workflows/2src/simulation.snake -c 1
The other pipelines, including workflows/1src/sstar.snake, workflows/1src/sprime.snake, workflows/1src/skovhmm.snake, workflows/2src/sstar.snake, workflows/2src/sprime.snake, and workflows/2src/archaicseeker2.snake, can be executed in any order after simulation.
For the SkovHMM pipeline, users need to add the --use-conda argument, because SkovHMM requires Python 2, unlike the main sstar-analysis environment; a dedicated conda environment is therefore created for it.
snakemake -s workflows/1src/skovhmm.snake -c 1 --use-conda
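The dedicated environment is declared in a small conda YAML file referenced by the rule's conda: directive. Purely as an illustration (the repo's actual file may differ), such a Python 2 environment has this shape:
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=2.7
  - numpy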
Finally, users can plot the results.
snakemake -s workflows/plot.snake -c 1
The plots and tables are written to results/plots. They may differ slightly from those in our manuscript because of stochastic effects in the simulations.
Users can also submit the pipelines to an HPC cluster. To do so, create a Snakemake profile that matches the HPC scheduler; an example profile for SLURM is provided in config/slurm/config.yaml. To submit jobs through SLURM, first create a logs_slurm folder.
mkdir logs_slurm
Then, for example, to run the simulation on the cluster:
snakemake -s workflows/1src/simulation.snake --profile config/slurm -j 200
-j specifies the maximum number of jobs that Snakemake submits to the cluster at the same time.
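For reference, a minimal Snakemake cluster profile has the following shape; this is an illustration only, and the shipped config/slurm/config.yaml may differ:
cluster: "sbatch --cpus-per-task={threads} --output=logs_slurm/%x.%j.out"
jobs: 200
latency-wait: 60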
Code Snippets
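The snippets below are excerpted from the simulation, SkovHMM, SPrime, sstar, and ArchaicSeeker2.0 rules described above, followed by the plotting script.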
run:
    # Select the demographic model and per-model parameters for this wildcard.
    if wildcards.demog == 'BonoboGhost':
        graph = demes.load("config/simulation/models/BonoboGhost_4K19.yaml")
        demography = msprime.Demography.from_demes(graph)
        ref_id = 'Western'
        tgt_id = 'Bonobo'
        mutation_rate = 1.2e-8
        recombination_rate = 0.7e-8
    if wildcards.demog == 'HumanNeanderthal':
        graph = demes.load("config/simulation/models/HumanNeanderthal_4G21.yaml")
        demography = msprime.Demography.from_demes(graph)
        ref_id = 'YRI'
        tgt_id = 'CEU'
        mutation_rate = 1.29e-8
        recombination_rate = 1e-8
    # Sample diploid individuals from the reference and target populations.
    samples = [
        msprime.SampleSet(int(wildcards.nref), ploidy=2, population=ref_id),
        msprime.SampleSet(int(wildcards.ntgt), ploidy=2, population=tgt_id),
    ]
    ts = run_simulation(demography=demography, samples=samples,
                        mut_rate=mutation_rate, recomb_rate=recombination_rate,
                        sequence_length=seq_len, random_seed=wildcards.seed)
    # Save the tree sequence and export the genotypes as VCF.
    ts.dump(output.ts)
    with open(output.vcf, 'w') as o:
        ts.write_vcf(o)
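run_simulation is a helper defined in the repo but not shown in these snippets. A minimal sketch of what it plausibly does, using msprime's standard two-step ancestry-plus-mutations recipe (the actual helper may differ):

import msprime

def run_simulation(demography, samples, mut_rate, recomb_rate,
                   sequence_length, random_seed):
    # Sketch: simulate ancestry under the demographic model, then overlay
    # neutral mutations at the given rate.
    ts = msprime.sim_ancestry(
        samples=samples,
        demography=demography,
        sequence_length=sequence_length,
        recombination_rate=recomb_rate,
        random_seed=random_seed,
    )
    return msprime.sim_mutations(ts, rate=mut_rate, random_seed=random_seed)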
run:
    ts = tskit.load(input.ts)
    if wildcards.demog == 'BonoboGhost':
        src_id = "Ghost"
        tgt_id = "Bonobo"
    if wildcards.demog == 'HumanNeanderthal':
        src_id = "Nea"
        tgt_id = "CEU"
    get_introgressed_tracts(ts, chr_name=1, src_name=src_id, tgt_name=tgt_id,
                            output=output.tracts)
shell:
    """
    bcftools view {input.vcf} -m 2 -M 2 | awk -F "\\t" 'BEGIN{{OFS="\\t";ORS=""}}{{if($0~/^#/){{print $0"\\n"}}else{{print $1,$2,$3,"A","T\\t";for(i=6;i<NF;i++){{print $i"\\t"}};print $NF"\\n"}}}}' | bgzip -c > {output.bvcf}
    bgzip -c {input.vcf} > {output.vcf}
    rm {input.vcf}
    """
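In this rule, the awk one-liner keeps only biallelic sites, drops the original REF and ALT columns, and recodes every record to REF=A and ALT=T, presumably so that downstream tools see consistent dummy nucleotide alleles; the uncompressed input VCF is then removed to save space.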
shell:
    """
    vcftools --gzvcf {input.vcf} --counts --stdout --keep {input.outgroup_file} --min-alleles 2 --max-alleles 2 > {output.outgroup_freq}
    vcftools --gzvcf {input.vcf} --indv {params.tgt_ind_id} --stdout --counts --min-alleles 2 --max-alleles 2 > {output.tgt_count}
    """
shell:
    """
    # SkovHMM: build the observations, train the HMM, then decode introgressed windows.
    cat {input.tgt_count} | python {params.skovhmm_exec_filter} {input.outgroup_freq} {params.window_size} {input.weights} {params.observations}
    python {params.skovhmm_exec_train} {params.observations} {params.train_output_prefix} {input.init_prob} {input.weights} {input.mut_rates}
    python {params.skovhmm_exec_decode} {params.observations} {params.decode_output_prefix} {params.train_output_prefix}.hmm {input.weights} {input.mut_rates} {params.window_size}
    """
run:
    for i in range(len(params.cutoffs)):
        cutoff = params.cutoffs[i]
        inferred_tracts = f'{params.inferred_tracts}.{cutoff}.bed'
        process_skovhmm_output(input.decoded_probs, inferred_tracts,
                               float(cutoff), int(params.win_len), params.src_id)
        precision, recall = cal_accuracy(input.true_tracts, inferred_tracts)
        with open(output[i], 'w') as o:
            o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{cutoff}\t{precision}\t{recall}\n')
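cal_accuracy is a helper shared by these pipelines but not shown here. A plausible sketch of a base-pair precision/recall computation over BED tracts, using pybedtools (already a dependency of these pipelines); the real implementation may differ:

import pybedtools

def cal_accuracy(true_tracts, inferred_tracts):
    # Hypothetical re-implementation: base-pair overlap between true and
    # inferred tracts, expressed as precision/recall percentages.
    truth = pybedtools.BedTool(true_tracts)
    inferred = pybedtools.BedTool(inferred_tracts)
    overlap = sum(i.end - i.start for i in truth.intersect(inferred))
    true_len = sum(i.end - i.start for i in truth)
    inferred_len = sum(i.end - i.start for i in inferred)
    precision = 100 * overlap / inferred_len if inferred_len else float('nan')
    recall = 100 * overlap / true_len if true_len else float('nan')
    return precision, recall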
shell:
    """
    cat {input.accuracy_files} | sed '1idemography\\tsample\\tcutoff\\tprecision\\trecall' > {output.accuracy_table}
    """
shell:
    """
    java -Xmx2g -jar {params.sprime_exec} gt={input.gt_file} outgroup={input.outgroup_file} map={input.map_file} out={params.output_prefix} minscore={params.threshold} mu={params.mu}
    """
run:
    process_sprime_output(input.scores, output.inferred_tracts)
    precision, recall = cal_accuracy(input.true_tracts, output.inferred_tracts)
    with open(output.accuracy, 'w') as o:
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{wildcards.threshold}\t{precision}\t{recall}\n')
shell:
    """
    cat {input.accuracy_files} | sed '1idemography\\tsample\\tcutoff\\tprecision\\trecall' > {output.accuracy_table}
    """
run:
    # Draw per-replicate scaled mutation (theta) and recombination (rho) rates
    # for the ms null simulations; draws below 0.001 are floored at 0.001.
    seq_len = 50000
    if wildcards.demog == 'BonoboGhost': seq_len = 40000
    N0 = 1000
    mut_rate_mean = 1.29e-8
    if wildcards.demog == 'BonoboGhost': mut_rate_mean = 1.2e-8
    scaled_mut_rate_mean = 4*N0*mut_rate_mean*seq_len
    scaled_mut_rate_sdv = 0.233
    rec_rate_mean = 0.7e-8
    scaled_rec_rate_mean = 4*N0*rec_rate_mean*seq_len
    mut_rate_list = norm.rvs(loc=scaled_mut_rate_mean, scale=scaled_mut_rate_sdv, size=20000)
    rec_rate_list = nbinom.rvs(n=0.5, p=0.5/(0.5+scaled_rec_rate_mean), size=20000)
    with open(output.rates, 'w') as o:
        for i in range(len(mut_rate_list)):
            if mut_rate_list[i] < 0.001: mut_rate_list[i] = 0.001
            if rec_rate_list[i] < 0.001: rec_rate_list[i] = 0.001
            mut_rate = mut_rate_list[i]
            rec_rate = rec_rate_list[i]
            o.write(f'{mut_rate}\t{rec_rate}\n')
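For intuition about the scaling: with N0 = 1000, mu = 1.29e-8, and a 50 kb window, the scaled mutation rate is theta = 4 * N0 * mu * L = 4 * 1000 * 1.29e-8 * 50000 ≈ 2.58 per window, and the scaled recombination rate is rho = 4 * 1000 * 0.7e-8 * 50000 = 1.4; these per-window values are what the rates file contains.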
shell:
    """
    cat {input.rates} | {params.ms_exec} {params.nsamp} {params.nreps} -t tbs -r tbs {params.seq_len} -s {wildcards.snp_num} {params.ms_params} > {output.ms}
    """
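The tbs placeholders are ms's mechanism for reading parameter values from standard input, one set per replicate; piping the rates file in therefore simulates each replicate under its own randomly drawn theta and rho.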
run:
    ms2vcf(input.ms, output.vcf, params.nsamp, params.seq_len)
shell:
    """
    sstar score --vcf {input.vcf} --ref {input.ref_ind} --tgt {input.tgt_ind} --output {output.score} --thread {threads} --win-len {params.seq_len} --win-step {params.seq_len}
    """
run:
    # Average S* scores across windows, then take the requested quantiles.
    df = pd.read_csv(input.score, sep="\t").dropna()
    mean_df = df.groupby(['chrom', 'start', 'end'], as_index=False)['S*_score'].mean().dropna()
    scores = np.quantile(mean_df['S*_score'], params.sim_quantiles)
    with open(output.quantile, 'w') as o:
        for i in range(len(scores)):
            o.write(f'{scores[i]}\t{wildcards.snp_num}\t{params.sim_quantiles[i]}\n')
shell:
    """
    cat {input.hum_nref_10_true} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_10_true}
    cat {input.hum_nref_50_true} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_50_true}
    cat {input.bon_nref_10_true} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_10_true}
    cat {input.bon_nref_50_true} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_50_true}
    cat {input.hum_nref_10_const} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_10_const}
    cat {input.hum_nref_50_const} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_50_const}
    cat {input.bon_nref_10_const} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_10_const}
    cat {input.bon_nref_50_const} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_50_const}
    cat {input.hum_nref_10_ref_tgt_only} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_10_ref_tgt_only}
    cat {input.hum_nref_50_ref_tgt_only} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_50_ref_tgt_only}
    cat {input.bon_nref_10_ref_tgt_only} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_10_ref_tgt_only}
    cat {input.bon_nref_50_ref_tgt_only} | sort -nk 2,2 | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_50_ref_tgt_only}
    """
shell:
    """
    sstar score --vcf {input.vcf} --ref {input.ref_ind} --tgt {input.tgt_ind} --output {output.score} --thread {threads} --win-len {params.seq_len} --win-step 10000
    """
shell:
    """
    export R_LIBS={params.R_LIBS}
    sstar threshold --score {input.score} --sim-data {input.summary} --quantile {wildcards.quantile} --output {output.quantiles}
    """
run:
    process_sstar_1src_output(input.quantiles, output.inferred_tracts)
    precision, recall = cal_accuracy(input.true_tracts, output.inferred_tracts)
    with open(output.accuracy, 'w') as o:
        o.write(f'{wildcards.demog}\t{wildcards.scenario}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{wildcards.quantile}\t{precision}\t{recall}\n')
shell:
    """
    cat {input.accuracy_files} | sed '1idemography\\tscenario\\tsample\\tcutoff\\tprecision\\trecall' > {output.accuracy_table}
    """
shell:
    """
    # Split the VCF by population and write the ArchaicSeeker2.0 VCF parameter file.
    bcftools view {input.vcf} -S {input.ref} | bgzip -c > {output.ref}
    bcftools view {input.vcf} -S {input.tgt} | bgzip -c > {output.tgt}
    bcftools view {input.vcf} -S {input.src1} | bgzip -c > {output.src1}
    bcftools view {input.vcf} -S {input.src2} | bgzip -c > {output.src2}
    echo vcf > {output.par}
    echo {output.ref} >> {output.par}
    echo {output.tgt} >> {output.par}
    echo {output.src1} >> {output.par}
    echo {output.src2} >> {output.par}
    """
shell:
    """
    {params.archaicseeker2_exec} -v {input.vcf_par} -r {input.remap_par} -m {input.model} -X {input.outgroup_par} -p {input.pop_par} -A {input.anc_par} -o {params.output_prefix}
    """
run:
    process_archaicseeker2_output(input.seg, output.src1_inferred_tracts, output.src2_inferred_tracts, params.src1_id, params.src2_id)
    src1_precision, src1_recall = cal_accuracy(input.src1_true_tracts, output.src1_inferred_tracts)
    src2_precision, src2_recall = cal_accuracy(input.src2_true_tracts, output.src2_inferred_tracts)
    with open(output.accuracy, 'w') as o:
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\tsrc1\t-1\t{src1_precision}\t{src1_recall}\n')
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\tsrc2\t-1\t{src2_precision}\t{src2_recall}\n')
shell:
    """
    cat {input.accuracy_files} | sed '1idemography\\tsample\\tsrc\\tcutoff\\tprecision\\trecall' > {output.accuracy_table}
    """
run:
    if wildcards.demog == 'HumanNeanderthalDenisovan':
        graph = demes.load("config/simulation/models/HumanNeanderthalDenisovan_PapuansOutOfAfrica_10J19.yaml")
        demography = msprime.Demography.from_demes(graph)
        ref_id = 'YRI'
        tgt_id = 'Papuan'
        src1_id = 'NeaA'
        src2_id = 'DenA'
        mutation_rate = 1.4e-8
        recombination_rate = 1e-8
        # The archaic genomes are sampled in the past (sample ages in generations).
        T_DenA = 59682 / 29
        T_NeaA = 75748 / 29
        samples = [
            msprime.SampleSet(int(wildcards.nref), ploidy=2, population=ref_id),
            msprime.SampleSet(int(wildcards.ntgt), ploidy=2, population=tgt_id),
            msprime.SampleSet(1, ploidy=2, population=src1_id, time=T_NeaA),
            msprime.SampleSet(1, ploidy=2, population=src2_id, time=T_DenA),
        ]
    if wildcards.demog == 'ChimpBonoboGhost':
        graph = demes.load("config/simulation/models/ChimpBonoboGhost_4K19.yaml")
        demography = msprime.Demography.from_demes(graph)
        ref_id = 'Western'
        tgt_id = 'Central'
        src1_id = 'Ghost'
        src2_id = 'Bonobo'
        mutation_rate = 1.2e-8
        recombination_rate = 0.7e-8
        samples = [
            msprime.SampleSet(int(wildcards.nref), ploidy=2, population=ref_id),
            msprime.SampleSet(int(wildcards.ntgt), ploidy=2, population=tgt_id),
            msprime.SampleSet(1, ploidy=2, population=src1_id),
            msprime.SampleSet(1, ploidy=2, population=src2_id),
        ]
    ts = run_simulation(demography=demography, samples=samples,
                        mut_rate=mutation_rate, recomb_rate=recombination_rate,
                        sequence_length=seq_len, random_seed=wildcards.seed)
    ts.dump(output.ts)
    with open(output.vcf, 'w') as o:
        ts.write_vcf(o)
run:
    ts = tskit.load(input.ts)
    if wildcards.demog == 'HumanNeanderthalDenisovan':
        src1_id = "Nea1"
        src2_id = "Den1"
        src3_id = "Den2"
        tgt_id = "Papuan"
        src3_tracts = output_dir + f'simulated_data/{wildcards.demog}/nref_{wildcards.nref}/ntgt_{wildcards.ntgt}/{wildcards.seed}/sim.src3.introgressed.tracts.bed'
        get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
        get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
        get_introgressed_tracts(ts, chr_name=1, src_name=src3_id, tgt_name=tgt_id, output=src3_tracts)
        # Merge the tracts from the two Denisovan-like sources into a single file.
        a = pybedtools.BedTool(output.src2_tracts)
        b = pybedtools.BedTool(src3_tracts)
        a.cat(b).sort().merge().saveas(output.src2_tracts)
    if wildcards.demog == 'ChimpBonoboGhost':
        src1_id = "Ghost"
        src2_id = "Bonobo"
        tgt_id = "Central"
        get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
        get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
shell:
    """
    bcftools view {input.vcf} -m 2 -M 2 | awk -F "\\t" 'BEGIN{{OFS="\\t";ORS=""}}{{if($0~/^#/){{print $0"\\n"}}else{{print $1,$2,$3,"A","T\\t";for(i=6;i<NF;i++){{print $i"\\t"}};print $NF"\\n"}}}}' | bgzip -c > {output.bvcf}
    bgzip -c {input.vcf} > {output.vcf}
    rm {input.vcf}
    """
shell:
    """
    bcftools view {input.vcf} -S {input.src1_list} | bgzip -c > {output.src1_vcf}
    bcftools view {input.vcf} -S {input.src2_list} | bgzip -c > {output.src2_vcf}
    """
shell:
    """
    java -Xmx2g -jar {params.sprime_exec} gt={input.gt_file} outgroup={input.outgroup_file} map={input.map_file} out={params.output_prefix} minscore={params.threshold} mu={params.mu} excludesamples={input.exsamps}
    """
shell:
    """
    # If SPrime reported any putative tracts, annotate match rates against both
    # source genomes; otherwise emit a placeholder table.
    if [[ $(wc -l {input.score} | awk '{{print $1}}') -gt 1 ]]; then \
        {params.map_arch_exec} --kpall --vcf {input.src1_vcf} --score {input.score} --tag "src1" --sep "\\t" > {params.src1_score};
        {params.map_arch_exec} --kpall --vcf {input.src2_vcf} --score {params.src1_score} --tag "src2" --sep "\\t" > {params.src2_score};
        rm {params.src1_score}
        sleep 10;
        Rscript {params.score_summary_exec} {params.out_dir} {output.match_rate};
    else \
        echo -e "chr seg from to src1 src2\nNA NA NA NA NA NA" > {output.match_rate};
    fi
    """
run:
    process_sprime_match_rate(input.match_rate, output.src1_inferred_tracts, output.src2_inferred_tracts)
    src1_precision, src1_recall = cal_accuracy(input.src1_true_tracts, output.src1_inferred_tracts)
    src2_precision, src2_recall = cal_accuracy(input.src2_true_tracts, output.src2_inferred_tracts)
    with open(output.accuracy, 'w') as o:
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{wildcards.threshold}\tsrc1\t{src1_precision}\t{src1_recall}\n')
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{wildcards.threshold}\tsrc2\t{src2_precision}\t{src2_recall}\n')
shell:
    """
    cat {input.accuracy_files} | sed '1idemography\\tsample\\tcutoff\\tsrc\\tprecision\\trecall' > {output.accuracy_table}
    """
run:
    seq_len = 50000
    if wildcards.demog == 'ChimpBonoboGhost': seq_len = 40000
    N0 = 1000
    mut_rate_mean = 1.4e-8
    if wildcards.demog == 'ChimpBonoboGhost': mut_rate_mean = 1.2e-8
    scaled_mut_rate_mean = 4*N0*mut_rate_mean*seq_len
    scaled_mut_rate_sdv = 0.233
    rec_rate_mean = 0.7e-8
    scaled_rec_rate_mean = 4*N0*rec_rate_mean*seq_len
    mut_rate_list = norm.rvs(loc=scaled_mut_rate_mean, scale=scaled_mut_rate_sdv, size=20000)
    rec_rate_list = nbinom.rvs(n=0.5, p=0.5/(0.5+scaled_rec_rate_mean), size=20000)
    with open(output.rates, 'w') as o:
        for i in range(len(mut_rate_list)):
            if mut_rate_list[i] < 0.001: mut_rate_list[i] = 0.001
            if rec_rate_list[i] < 0.001: rec_rate_list[i] = 0.001
            mut_rate = mut_rate_list[i]
            rec_rate = rec_rate_list[i]
            o.write(f'{mut_rate}\t{rec_rate}\n')
shell:
    """
    cat {input.rates} | {params.ms_exec} {params.nsam} {params.nreps} -t tbs -r tbs {params.seq_len} -s {wildcards.snp_num} {params.ms_params} > {output.ms}
    """
run:
    ms2vcf(input.ms, output.vcf, params.nsam, params.seq_len)
shell:
    """
    sstar score --vcf {input.vcf} --ref {input.ref_ind} --tgt {input.tgt_ind} --output {output.score} --thread {threads} --win-len {params.seq_len} --win-step {params.seq_len}
    """
run:
    df = pd.read_csv(input.score, sep="\t").dropna()
    mean_df = df.groupby(['chrom', 'start', 'end'], as_index=False)['S*_score'].mean().dropna()
    scores = np.quantile(mean_df['S*_score'], params.sim_quantiles)
    with open(output.quantile, 'w') as o:
        for i in range(len(scores)):
            o.write(f'{scores[i]}\t{wildcards.snp_num}\t{params.sim_quantiles[i]}\n')
shell:
    """
    cat {input.hum_nref_10} | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_10}
    cat {input.hum_nref_50} | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.hum_nref_50}
    cat {input.bon_nref_10} | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_10}
    cat {input.bon_nref_50} | sed '1iS*_score\\tSNP_number\\tquantile\n' > {output.bon_nref_50}
    """
shell:
    """
    sstar score --vcf {input.vcf} --ref {input.ref_ind} --tgt {input.tgt_ind} --output {output.score} --thread {threads} --win-len {params.seq_len} --win-step 10000
    # Pause before reusing the score file, presumably to let shared filesystems catch up.
    sleep 180
    sstar matchrate --vcf {input.vcf} --ref {input.ref_ind} --tgt {input.tgt_ind} --src {input.src1_ind} --score {output.score} --output {output.src1_match_pct}
    sstar matchrate --vcf {input.vcf} --ref {input.ref_ind} --tgt {input.tgt_ind} --src {input.src2_ind} --score {output.score} --output {output.src2_match_pct}
    """
shell:
    """
    export R_LIBS={params.R_LIBS}
    sstar threshold --score {input.score} --sim-data {input.summary} --quantile {wildcards.quantile} --output {output.quantile}
    """
run:
    process_sstar_2src_output(input.src1_match_pct, input.src2_match_pct, input.threshold, output.src1_inferred_tracts, output.src2_inferred_tracts)
    src1_precision, src1_recall = cal_accuracy(input.src1_true_tracts, output.src1_inferred_tracts)
    src2_precision, src2_recall = cal_accuracy(input.src2_true_tracts, output.src2_inferred_tracts)
    with open(output.accuracy, 'w') as o:
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{wildcards.quantile}\tsrc1\t{src1_precision}\t{src1_recall}\n')
        o.write(f'{wildcards.demog}\tnref_{wildcards.nref}_ntgt_{wildcards.ntgt}\t{wildcards.quantile}\tsrc2\t{src2_precision}\t{src2_recall}\n')
shell:
    """
    cat {input.accuracy_files} | sed '1idemography\\tsample\\tcutoff\\tsrc\\tprecision\\trecall' > {output.accuracy_table}
    """
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import seaborn as sns
sns.set_style("darkgrid")
from sklearn import metrics

# Load the per-run accuracy tables, average them per setting, and write the means.
sstar_1src_accuracy = pd.read_csv(snakemake.input.sstar_1src_accuracy, sep="\t").dropna()
sprime_1src_accuracy = pd.read_csv(snakemake.input.sprime_1src_accuracy, sep="\t").dropna()
skovhmm_1src_accuracy = pd.read_csv(snakemake.input.skovhmm_1src_accuracy, sep="\t").dropna()

sstar_1src_accuracy_grouped = sstar_1src_accuracy.groupby(['demography', 'scenario', 'sample', 'cutoff'], as_index=False)
sprime_1src_accuracy_grouped = sprime_1src_accuracy.groupby(['demography', 'sample', 'cutoff'], as_index=False)
skovhmm_1src_accuracy_grouped = skovhmm_1src_accuracy.groupby(['demography', 'sample', 'cutoff'], as_index=False)

sstar_1src_accuracy_mean = sstar_1src_accuracy_grouped.mean()
sprime_1src_accuracy_mean = sprime_1src_accuracy_grouped.mean()
skovhmm_1src_accuracy_mean = skovhmm_1src_accuracy_grouped.mean()

sstar_1src_accuracy_mean.to_csv(snakemake.output.sstar_1src_accuracy_mean, sep="\t", index=False)
sprime_1src_accuracy_mean.to_csv(snakemake.output.sprime_1src_accuracy_mean, sep="\t", index=False)
skovhmm_1src_accuracy_mean.to_csv(snakemake.output.skovhmm_1src_accuracy_mean, sep="\t", index=False)

sprime_1src_accuracy_mean['scenario'] = ['true'] * len(sprime_1src_accuracy_mean)
skovhmm_1src_accuracy_mean['scenario'] = ['true'] * len(skovhmm_1src_accuracy_mean)

sstar_2src_accuracy = pd.read_csv(snakemake.input.sstar_2src_accuracy, sep="\t").dropna()
sprime_2src_accuracy = pd.read_csv(snakemake.input.sprime_2src_accuracy, sep="\t").dropna()
archaicseeker2_2src_accuracy = pd.read_csv(snakemake.input.archaicseeker2_2src_accuracy, sep="\t").dropna()

sstar_2src_accuracy_grouped = sstar_2src_accuracy.groupby(['demography', 'sample', 'cutoff', 'src'], as_index=False)
sprime_2src_accuracy_grouped = sprime_2src_accuracy.groupby(['demography', 'sample', 'cutoff', 'src'], as_index=False)
archaicseeker2_2src_accuracy_grouped = archaicseeker2_2src_accuracy.groupby(['demography', 'sample', 'cutoff', 'src'], as_index=False)

sstar_2src_accuracy_mean = sstar_2src_accuracy_grouped.mean()
sprime_2src_accuracy_mean = sprime_2src_accuracy_grouped.mean()
archaicseeker2_2src_accuracy_mean = archaicseeker2_2src_accuracy_grouped.mean()

sstar_2src_accuracy_mean.to_csv(snakemake.output.sstar_2src_accuracy_mean, sep="\t", index=False)
sprime_2src_accuracy_mean.to_csv(snakemake.output.sprime_2src_accuracy_mean, sep="\t", index=False)
archaicseeker2_2src_accuracy_mean.to_csv(snakemake.output.archaicseeker2_2src_accuracy_mean, sep="\t", index=False)

methods1 = ['sstar', 'sprime', 'skovhmm']
demography1 = ['HumanNeanderthal', 'BonoboGhost']
samples = ['nref_10_ntgt_1', 'nref_50_ntgt_1']
scenarios = ['true', 'const', 'ref_tgt_only']
accuracy1 = {
    'sstar': sstar_1src_accuracy_mean,
    'sprime': sprime_1src_accuracy_mean,
    'skovhmm': skovhmm_1src_accuracy_mean,
}
methods2 = ['sstar', 'sprime', 'archaicseeker2']
demography2 = ['HumanNeanderthalDenisovan', 'ChimpBonoboGhost']
accuracy2 = {
    'sstar': sstar_2src_accuracy_mean,
    'sprime': sprime_2src_accuracy_mean,
    'archaicseeker2': archaicseeker2_2src_accuracy_mean,
}

fig, axs = plt.subplots(nrows=2, ncols=3, constrained_layout=True, figsize=(7.5,4), dpi=350)
gridspec = axs[0, 0].get_subplotspec().get_gridspec()
for a in axs[:,2]:
    a.remove()

markers = {
    'nref_10_ntgt_1': {'symbol':'.', 'size': 6},
    'nref_50_ntgt_1': {'symbol':'*', 'size': 6},
}
colors = {
    'sstar': {'true': 'blue', 'const': 'cyan', 'ref_tgt_only': 'purple'},
    'skovhmm': 'green',
    'sprime': 'orange',
    'archaicseeker2': 'magenta',
}
linestyles = {
    'const': 'dotted',
    'true': 'solid',
    'ref_tgt_only': (0, (3, 1, 1, 1, 1, 1)),
}
titles = {
    'HumanNeanderthal': 'Human-Neanderthal model',
    'BonoboGhost': 'Bonobo-Ghost model',
    'HumanNeanderthalDenisovan': 'Human-Neanderthal-Denisovan model',
    'ChimpBonoboGhost': 'Chimpanzee-Ghost-Bonobo model',
}
zorders = {
    'sstar': 2,
    'skovhmm': 5,
    'sprime': 10,
}

# Precision-recall curves for the single-source scenarios (top row).
j = 0
for d in demography1:
    for s in samples:
        for sc in scenarios:
            for m in methods1:
                if m == 'sstar':
                    color = colors[m][sc]
                else:
                    color = colors[m]
                df = accuracy1[m][
                    (accuracy1[m]['demography'] == d) &
                    (accuracy1[m]['sample'] == s) &
                    (accuracy1[m]['scenario'] == sc)
                ].sort_values(by='recall', ascending=False)
                recall = df['recall']
                precision = df['precision']
                if (m == 'sprime') or (m == 'skovhmm'):
                    if sc != 'true':
                        continue
                if d == 'BonoboGhost':
                    axs[0,j].plot(recall, precision, marker=markers[s]['symbol'], ms=markers[s]['size'], c=color, zorder=zorders[m])
                else:
                    axs[0,j].plot(recall, precision, marker=markers[s]['symbol'], ms=markers[s]['size'], c=color)
    axs[0,j].set_xlabel('Recall (%)', fontsize=10)
    axs[0,j].set_ylabel('Precision (%)', fontsize=10)
    axs[0,j].set_xlim([-5, 105])
    axs[0,j].set_ylim([-5, 105])
    axs[0,j].set_title(titles[d], fontsize=8, weight='bold')
    if j == 0:
        axs[0,j].text(-35, 110, 'B', fontsize=10, weight='bold')
        axs[0,j].plot([0,100],[2.25,2.25], c='red', alpha=0.5)
    if j == 1:
        axs[0,j].text(-35, 110, 'C', fontsize=10, weight='bold')
        axs[0,j].plot([0,100],[2,2], c='red', alpha=0.5)
    f_scores = np.linspace(20, 80, num=4)
    lines, labels = [], []
    for f_score in f_scores:
        x = np.linspace(1, 100)
        y = f_score * x / (2 * x - f_score)
        (l,) = axs[0,j].plot(x[y >= 0], y[y >= 0], color="black", alpha=0.4, linestyle='dotted', zorder=1)
        axs[0,j].annotate("F1={0:0.0f}%".format(f_score), xy=(101, y[45] + 2), fontsize=8)
    j += 1

# Precision-recall curves for the two-source scenarios (bottom row).
j = 0
for d in demography2:
    for s in samples:
        for m in methods2:
            if m == 'sstar':
                color = colors[m]['true']
            else:
                color = colors[m]
            src1_df = accuracy2[m][
                (accuracy2[m]['demography'] == d) &
                (accuracy2[m]['sample'] == s) &
                (accuracy2[m]['src'] == 'src1')
            ].sort_values(by='recall', ascending=False)
            src2_df = accuracy2[m][
                (accuracy2[m]['demography'] == d) &
                (accuracy2[m]['sample'] == s) &
                (accuracy2[m]['src'] == 'src2')
            ].sort_values(by='recall', ascending=False)
            src1_recall = src1_df['recall']
            src1_precision = src1_df['precision']
            src2_recall = src2_df['recall']
            src2_precision = src2_df['precision']
            axs[1,j].plot(src1_recall, src1_precision, marker=markers[s]['symbol'], ms=markers[s]['size'], c=color, markerfacecolor='white')
            axs[1,j].plot(src2_recall, src2_precision, marker=markers[s]['symbol'], ms=markers[s]['size'], c=color, linestyle='dashdot')
    axs[1,j].set_xlabel('Recall (%)', fontsize=10)
    axs[1,j].set_ylabel('Precision (%)', fontsize=10)
    axs[1,j].set_xlim([-5, 105])
    axs[1,j].set_ylim([-5, 105])
    axs[1,j].set_title(titles[d], fontsize=8, weight='bold')
    if j == 0:
        axs[1,j].text(-35, 110, 'D', fontsize=10, weight='bold')
        axs[1,j].plot([0,100],[0.2,0.2], c='red', alpha=0.5)
        axs[1,j].plot([0,100],[4,4], c='red', linestyle='dotted')
    if j == 1:
        axs[1,j].text(-35, 110, 'E', fontsize=10, weight='bold')
        axs[1,j].plot([0,100],[2,2], c='red', alpha=0.5)
        axs[1,j].plot([0,100],[2,2], c='red', linestyle='dotted')
    f_scores = np.linspace(20, 80, num=4)
    lines, labels = [], []
    for f_score in f_scores:
        x = np.linspace(1, 100)
        y = f_score * x / (2 * x - f_score)
        (l,) = axs[1,j].plot(x[y >= 0], y[y >= 0], color="black", alpha=0.4, linestyle='dotted', zorder=1)
        axs[1,j].annotate("F1={0:0.0f}%".format(f_score), xy=(101, y[45] + 2), fontsize=8)
    j += 1

# legend
subfig = fig.add_subfigure(gridspec[:,2])
handles, labels = subfig.gca().get_legend_handles_labels()
sstar_line = plt.Line2D([0], [0], label='sstar (full)', color=colors['sstar']['true'])
sstar_line2 = plt.Line2D([0], [0], label='sstar (constant)', color=colors['sstar']['const'])
sstar_line3 = plt.Line2D([0], [0], label='sstar (only ref & tgt)', color=colors['sstar']['ref_tgt_only'])
skovhmm_line = plt.Line2D([0], [0], label='SkovHMM', color=colors['skovhmm'])
sprime_line = plt.Line2D([0], [0], label='SPrime', color=colors['sprime'])
archaicseeker2_line = plt.Line2D([0], [0], label='ArchaicSeeker2.0', color=colors['archaicseeker2'])
baseline1 = plt.Line2D([0], [0], label='baseline/src1 baseline', color='red', alpha=0.5)
baseline2 = plt.Line2D([0], [0], label='src2 baseline', color='red', linestyle='dotted')
f1_curves = plt.Line2D([0], [0], label='iso-F1 curves', color='black', alpha=0.4, linestyle='dotted')
nref_10_ntgt_1 = plt.Line2D([0], [0], marker=markers['nref_10_ntgt_1']['symbol'], ms=5, label='Nref = 10', color='black', linewidth=0)
nref_50_ntgt_1 = plt.Line2D([0], [0], marker=markers['nref_50_ntgt_1']['symbol'], ms=5, label='Nref = 50', color='black', linewidth=0)
src1 = plt.Line2D([0], [0], label='src1', color='black', marker='o', ms=4, markerfacecolor='white')
src2 = plt.Line2D([0], [0], label='src2', color='black', marker='o', ms=4, linestyle='dotted')
handles.extend([sstar_line, sstar_line2, sstar_line3, skovhmm_line, sprime_line, archaicseeker2_line, baseline1, baseline2, f1_curves, nref_10_ntgt_1, nref_50_ntgt_1, src1, src2])
subfig.legend(handles=handles, fontsize=8, handlelength=1.5)

fig.set_constrained_layout_pads(w_pad=4 / 72, h_pad=4 / 72, hspace=0, wspace=0.1)
plt.savefig(snakemake.output.accuracy, bbox_inches='tight')
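Because the inputs and outputs are accessed through the snakemake object, this plotting script is executed via Snakemake's script directive by the plotting pipeline rather than run as a standalone program.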