Analysis pipeline associated with a master's thesis on the population structure, demographic history, and distribution of fitness effects of birches in Scandinavia.
Demography of Birch Populations across Scandinavia
Code and supplementary material associated with a master's thesis on the population structure, demographic history, and distribution of fitness effects of the diploid Betula pendula and the tetraploid B. pubescens across Scandinavia.
Workflows
All results are produced by a Snakemake pipeline, whose subworkflows are visualised below.
Schematics are provided for the total workflow and for each subworkflow:
- SNP calling
- δaδi
- polyDFE
- Sample set derivation
- VCF filtering
- Summary statistics
Code Snippets
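The snippets below show the `shell:` and `script:` directives of the individual Snakemake rules. Each `script:` entry refers to a Python or R script that receives the rule's inputs, outputs, parameters, and config through the `snakemake` object injected by Snakemake; the scripts in this repository additionally catch the `NameError` raised when that object is absent and fall back to hard-coded test paths, so they can also be run standalone. A minimal sketch of such a rule (the rule name and file paths here are hypothetical, not taken from the workflow):

```python
# Sketch of a Snakemake rule delegating to a script (hypothetical names/paths).
# Inside plot_pca.py, snakemake.input[0] and snakemake.output[0] resolve to
# the values declared here.
rule plot_pca:
    input:
        "output/default/snps/pendula/biallelic/snps.vcf.gz"
    output:
        "output/default/graphs/pca/pendula.svg"
    script:
        "../scripts/plot_pca.py"
```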
```
script: "../scripts/recode_variant_ids_plink.py"
script: "../scripts/create_pairwise_ld_lists.py"
script: "../scripts/extract_sites_plink.py"
script: "../scripts/run_admixture.py"
script: "../scripts/plot_cv_error_admixture.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/plot_pca.py"
script: "../scripts/plot_clusters_location.py"
script: "../scripts/create_barplot_admixture.py"
script: "../scripts/fit_pop_scenarios_dadi.py"
script: "../scripts/determine_best_batch_dadi.py"
script: "../scripts/plot_trajectory_dadi.py"
script: "../scripts/plot_sfs_comparison_2D.py"
script: "../scripts/plot_sfs_comparison_1D.py"
script: "../scripts/bootstrap_pop_scenarios_dadi.py"
script: "../scripts/determine_cis_dadi.py"
script: "../scripts/plot_models_lrt_dadi.py"
script: "../scripts/plot_models_lrt_dadi.py"
script: "../scripts/plot_models_lrt_dadi.py"
script: "../scripts/plot_models_lrt_dadi.py"
script: "../scripts/tabulate_results_dadi.py"
script: "../scripts/tabulate_results_dadi.py"
script: "../scripts/calculate_basic_stats.py"
script: "../scripts/calculate_basic_stats.py"
script: "../scripts/plot_cv_error_feems.py"
script: "../scripts/calculate_migration_surfaces_cv.py"
script: "../scripts/calculate_migration_surfaces_cv.py"
script: "../scripts/calculate_migration_surfaces_lambda.py"
script: "../scripts/calculate_migration_surfaces_lambda.py"
script: "../scripts/plot_sample_allocation_feems.py"
script: "../scripts/plot_sample_locations.py"
shell: "cairosvg {input} -d 200 -o {output}"
shell: "gatk IndexFeatureFile -I {input}"
script: "../scripts/plot_pca.py"
script: "../scripts/plot_pca.py"
script: "../scripts/plot_pca.py"
script: "../scripts/plot_pca.py"
script: "../scripts/plot_clusters_location.py"
shell: "wget {params.url} -O {output}"
script: "../scripts/prepare_input_polydfe.py"
script: "../scripts/run_polydfe.py"
script: "../scripts/run_polydfe.py"
script: "../scripts/parse_dfe_output.R"
script: "../scripts/plot_dfe_types.py"
script: "../scripts/plot_dfe_bootstrapped.py"
script: "../scripts/create_bootstraps_dfe.R"
script: "../scripts/create_init_bootstraps_dfe.R"
script: "../scripts/parse_dfe_output.R"
script: "../scripts/plot_dfe_bootstrapped.py"
script: "../scripts/plot_dfe_param_dist.py"
script: "../scripts/plot_dfe_bootstrapped.py"
script: "../scripts/plot_dfe_gradient_error.py"
script: "../scripts/compare_dfe_types.R"
script: "../scripts/plot_props_nested_dfe_types.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
shell: "cp {input} {output}"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_subpopulation_files.py"
script: "../scripts/create_subpopulation_files.py"
shell: "wget {params.url} -O {output}"
script: "../scripts/create_sample_set.py"
script: "../scripts/create_sample_set.py"
script: "../scripts/trim_reads.py"
shell: "fastqc {input} --outdir {params.outdir}"
shell: "multiqc {params.outdir} -o {params.outdir}"
shell: "bwa index {input}"
script: "../scripts/map_reads.py"
shell: "samtools coverage -m {input} > {output}"
shell: "samtools coverage -m {input} > {output}"
shell: "samtools faidx {input}"
script: "../scripts/mark_duplicates.py"
shell: "samtools index {input} {output}"
shell: "gatk CreateSequenceDictionary -R {input} -O {output}"
script: "../scripts/generate_genomic_intervals.py"
script: "../scripts/call_haplotypes.py"
script: "../scripts/gather_variants.py"
shell: "gatk SortVcf -I {input} -O {output} --MAX_RECORDS_IN_RAM 50000"
script: "../scripts/generate_genomic_intervals.py"
script: "../scripts/import_variants.py"
script: "../scripts/call_imported_variants.py"
script: "../scripts/quality_filter_snps.py"
script: "../scripts/filter_snps.py"
script: "../scripts/setup_est_sfs.py"
script: "../scripts/prepare_input_est_sfs.py"
script: "../scripts/gather_input_est_sfs.py"
script: "../scripts/derive_ancestral_alleles.py"
script: "../scripts/scatter_output_est_sfs.py"
script: "../scripts/recode_ancestral_alleles_vcf.py"
shell: "bgzip {input} -c > {output}"
script: "../scripts/cleanup_annotation_file.py"
shell: "grep -v '#' {input} | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > {output}"
shell: "tabix -p gff {input}"
script: "../scripts/predict_variant_effects_vep.py"
script: "../scripts/determine_degeneracy.py"
shell: "bgzip {input} -c > {output}"
script: "../scripts/determine_synonymy.py"
shell: "bgzip {input} -c > {output}"
script: "../scripts/gather_variants.py"
shell: "gatk SortVcf -I {input} -O {output} --MAX_RECORDS_IN_RAM 50000"
script: "../scripts/plot_sfs_est.py"
script: "../scripts/call_haplotypes.py"
script: "../scripts/import_variants.py"
script: "../scripts/calculate_fst.py"
script: "../scripts/calculate_fst_windowed.py"
script: "../scripts/calculate_pi.py"
script: "../scripts/calculate_tajimas_d.py"
script: "../scripts/plot_fst.py"
script: "../scripts/plot_fst_windowed.py"
script: "../scripts/plot_tajimas_d.py"
script: "../scripts/plot_pi.py"
script: "../scripts/collect_sample_set_stats.py"
script: "../scripts/calculate_hwe.py"
script: "../scripts/calculate_hwe.py"
script: "../scripts/plot_hwe.py"
script: "../scripts/plot_hwe.py"
script: "../scripts/plot_heterozygosity.py"
script: "../scripts/plot_heterozygosity_comparison.py"
script: "../scripts/calculate_missingness.py"
script: "../scripts/plot_missingness_sites.py"
script: "../scripts/plot_missingness_individuals.py"
script: "../scripts/calculate_linkage.py"
script: "../scripts/plot_umap.py"
script: "../scripts/plot_umap.py"
script: "../scripts/plot_umap.py"
script: "../scripts/generate_vcf_sample_set.py"
script: "../scripts/generate_vcf_sample_set.py"
script: "../scripts/generate_vcf_sample_set.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/plot_sfs_1D.py"
script: "../scripts/plot_sfs_2D.py"
script: "../scripts/plot_sfs_1D.py"
script: "../scripts/plot_sfs_1D.py"
script: "../scripts/plot_sfs_1D.py"
script: "../scripts/plot_degeneracy.py"
script: "../scripts/plot_synonymy.py"
script: "../scripts/annotate_ld.py"
script: "../scripts/filter_by_info.py"
script: "../scripts/convert_plink.py"
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

try:
    vcf = snakemake.input[0]
    window_size_kb = snakemake.params.get('window_size_kb', 50)
    out = snakemake.output[0]
except NameError:
    # testing
    vcf = 'output/default/snps/pendula/biallelic/snps.vcf.gz'
    window_size_kb = 50
    out = 'scratch/vcf.r2.gz'

shell(f"bcftools +prune --annotate r2 {vcf} --window {window_size_kb}kb -Oz -o {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import dadi
import pandas as pd

import demographic_scenarios
import sfs_utils

try:
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    result = snakemake.input.result
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    params_out = snakemake.output.params
    params_pretty_out = snakemake.output.params_pretty
    mode = snakemake.params.mode
    n_iter = snakemake.config['dadi_n_iter_' + mode]
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    vcf = "output/default/snps/pendula/synonymous/snps.vcf.gz"
    """pops = "output/default/sample_sets/subpopulations/pendula/2_pops.txt"
    scenario_name = 'split_symmetric_migration'
    n_pops = 2"""
    pops = "output/default/sample_sets/subpopulations/pendula/1_pops.txt"
    result = "output/default/dadi/one_pop_size_change_since_ice_age/pendula/synonymous/batches/1/dadi.csv"
    scenario_name = 'one_pop_size_change_since_ice_age'
    n_pops = 1
    sample_set = "pendula"
    params_out = "scratch/dadi.csv"
    params_pretty_out = "scratch/dadi.out"
    mode = 'local_optimization'
    n_iter = 10
    generation_time = 20
    mu = 1e-9
    N_e = 675000

n_proj = 20
dd = dadi.Misc.make_data_dict_vcf(vcf, pops)

# divide data into chunks of size chunk_size_bs
# we then subsample from these chunks with replacement
# to create the bootstraps
# https://dadi.readthedocs.io/en/latest/user-guide/bootstrapping/
chunks = dadi.Misc.fragment_data_dict(dd, demographic_scenarios.DemographicScenario.chunk_size_bs)

# obtain one bootstrap sample
fs_bs = dadi.Misc.bootstraps_from_dd_chunks(chunks, 1, sfs_utils.get_pop_ids(n_pops), [n_proj] * n_pops)[0]

# we pass the original data dict here as it is only
# used to create parametric bootstraps
params = dict(fs=fs_bs, dd=dd, n_pops=n_pops, n_proj=n_proj, n_iter=n_iter, N_e=N_e,
              mu=mu, generation_time=generation_time, mode=mode)

scenario = demographic_scenarios.create(scenario_name, params)

# load params from simulation result
params = pd.read_csv(result, index_col=None, header=0).iloc[0]

# determine initial value
scenario.p0 = [params[name] for name in scenario.params]

# repeat the simulation with the appropriate initial value for one
# iteration for convenience reasons
scenario.simulate()

# save to file
data = scenario.to_dataframe()
open(params_pretty_out, 'w').write(data.to_markdown())
open(params_out, 'w').write(data.to_csv())
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import pandas as pd

import sfs_utils
import utils

try:
    vcfs = snakemake.input.vcfs
    pops_1 = snakemake.input.pops_1
    pops_2 = snakemake.input.pops_2
    names = snakemake.params.names
    out_txt = snakemake.output.txt
    out_csv = snakemake.output.csv
except NameError:
    # testing
    sample_sets = ["pubescens"]
    flag = "0fold"
    vcfs = [f"output/tetraploid/snps/{sample_set}/{flag}/snps.vcf.gz" for sample_set in sample_sets]
    # vcf_all = [f"output/tetraploid/snps/{sample_set}/all/snps.vcf.gz" for sample_set in sample_sets]
    pops_1 = [f"output/tetraploid/sample_sets/subpopulations/{sample_set}/1_pops.txt" for sample_set in sample_sets]
    pops_2 = [f"output/tetraploid/sample_sets/subpopulations/{sample_set}/2_pops.txt" for sample_set in sample_sets]
    names = sample_sets
    out_txt = "scratch/stats.txt"
    out_csv = "scratch/stats.csv"

stats = pd.DataFrame(index=['n sites', "Tajima's D", 'pi', 'pi per site', 'S', 'FST', 'FST_0'])

for i, name in enumerate(names):
    print(f'Calculating for subset: {name}', flush=True)

    # load spectrum for subset
    # down-projecting does not seem to affect any of
    # the statistics calculated however
    fs_1D = sfs_utils.get_full_spectrum(vcfs[i], pops_1[i])
    fs_2D = sfs_utils.get_full_spectrum(vcfs[i], pops_2[i])

    n_sites = utils.count_lines_vcf(vcfs[i])
    # n_sites_all = utils.count_lines_vcf(vcf_all[i])

    stats[name] = [
        n_sites,
        fs_1D.Tajima_D(),
        fs_1D.pi(),
        fs_1D.pi() / n_sites,
        fs_1D.S(),
        fs_2D.Fst(),
        fs_2D.scramble_pop_ids().Fst()
    ]

open(out_txt, 'w').write(stats.to_markdown())
stats.to_csv(out_csv)
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

vcf = snakemake.input.vcf
pops = snakemake.input.pops
out = snakemake.params.prefix

shell(f"vcftools --gzvcf {vcf} --weir-fst-pop {pops[0]} --weir-fst-pop {pops[1]} --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

vcf = snakemake.input.vcf
pops = snakemake.input.pops
out = snakemake.params.prefix

shell(f"vcftools --gzvcf {vcf} --weir-fst-pop {pops[0]} --weir-fst-pop {pops[1]} --fst-window-size 1000000000 --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

file = snakemake.params.input_prefix
out = snakemake.params.output_prefix
midp = snakemake.params.midp

flag = 'midp' if midp else ''

# execute command
shell(f"plink --bfile {file} --hardy {flag} --hwe 0 --allow-extra-chr --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

file = snakemake.params.input_prefix
out = snakemake.params.output_prefix

# execute command
# LD window options obtained from Luis Leal
shell(f"plink --bfile {file} --r2 --ld-window-kb 50 --ld-window 500 "
      f"--ld-window-r2 0 --allow-extra-chr --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from feems_utils import *

try:
    test_mode = False
    sample_set = snakemake.params['sample_set']
    sample_class = snakemake.params['sample_class']
    flag = snakemake.params['flag']
    buffer = snakemake.params.get('buffer', 2)
    lambdas = snakemake.params.lambdas
    out_cv = snakemake.output.cv
    out_surfaces = snakemake.output.surfaces
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    sample_class = 'default'
    flag = 'biallelic'
    buffer = 2
    lambdas = np.geomspace(1e-8, 1e2, 11)
    out_cv = 'scratch/cv.png'
    out_surfaces = 'scratch/surfaces.png'

sp_graph, samples, coords = get_sp_graph(sample_set, flag, sample_class=sample_class, buffer=buffer)
projection = get_projection(coords)

# determine cross-validation error
best_lambda = determine_best_lamb(lambdas, sp_graph)

utils.save_fig(out_cv, tight_layout=True, show=test_mode, clear=True)

# fit the graph
sp_graph.fit(lamb=best_lambda)

# draw figure showing the migration surfaces
v = create_plot(sp_graph, projection)
v.draw_edge_colorbar()
v.draw_edges(use_weights=True)
v.draw_obs_nodes(use_ids=False)

utils.save_fig(out_surfaces, tight_layout=True, show=test_mode)
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from feems_utils import *

try:
    test_mode = False
    sample_set = snakemake.params['sample_set']
    sample_class = snakemake.params['sample_class']
    flag = snakemake.params['flag']
    lamb = snakemake.params.get('lamb', None)
    buffer = snakemake.params.get('buffer', 2)
    out_cv_error = snakemake.output.get('cv_error', None)
    out_surfaces = snakemake.output.surfaces
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    sample_class = 'default'
    flag = 'biallelic'
    lamb = 10.0
    buffer = 2
    out_cv_error = 'scratch/cv_error.txt'
    out_surfaces = 'scratch/surfaces.png'

sp_graph, samples, coords = get_sp_graph(sample_set, flag, sample_class=sample_class, buffer=buffer)
projection = get_projection(coords)

if out_cv_error:
    # determine cross-validation error
    cv_error = determine_cv_error(lamb, sp_graph)[0, 0]

    with open(out_cv_error, 'w') as f:
        f.write(str(cv_error))

# fit the graph
sp_graph.fit(lamb=lamb)

# draw figure showing the migration surfaces
v = create_plot(sp_graph, projection)
v.draw_edge_colorbar()
v.draw_edges(use_weights=True)
v.draw_obs_nodes(use_ids=False)

plt.savefig(out_surfaces, bbox_inches='tight', pad_inches=0)

if test_mode:
    plt.show()
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

file = snakemake.params.input_prefix
out = snakemake.params.output_prefix

# execute command
shell(f"plink --bfile {file} --missing --allow-extra-chr --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

vcf = snakemake.input.vcf
out = snakemake.params.prefix

shell(f"vcftools --site-pi --gzvcf {vcf} --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

vcf = snakemake.input.vcf
out = snakemake.params.prefix

shell(f"vcftools --gzvcf {vcf} --TajimaD 1000 --out {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

ref = snakemake.input.ref
bam = snakemake.input.bam
intervals = snakemake.input.intervals
java_opts = get_java_opts(snakemake)
tmp_dir = snakemake.resources.tmpdir
ploidy = snakemake.params.ploidy
out = snakemake.output[0]

shell(f"gatk --java-options '{java_opts}' HaplotypeCaller "
      f"-R {ref} -I {bam} -O {out} --tmp-dir {tmp_dir} "
      f"--interval-padding 0 -L {intervals} -ERC GVCF "
      f"--sample-ploidy {ploidy}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

ref = snakemake.input.ref
db = snakemake.input.db
intervals = snakemake.input.intervals
java_opts = get_java_opts(snakemake)
out = snakemake.output[0]

shell(f"gatk GenotypeGVCFs --java-options '{java_opts}' -V gendb://{db} \
    -R {ref} -O {out} --include-non-variant-sites -L {intervals}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

input = snakemake.input[0]
out = snakemake.output[0]

# cleanup file
shell(f"agat_convert_sp_gxf2gxf.pl -g {input} -o {out} --gvi 3")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import pandas as pd

import utils

try:
    vcfs = snakemake.input.vcfs
    sample_sets = snakemake.params.sample_sets
    sample_class = snakemake.params['sample_class']
    out = snakemake.output[0]
except NameError:
    sample_sets = ['pendula', 'pubescens', 'pendula_pubescens']
    sample_class = 'default'
    vcfs = ['output/default/snps/pendula/all/snps.vcf.gz',
            'output/default/snps/pubescens/all/snps.vcf.gz',
            'output/default/snps/pendula_pubescens/all/snps.vcf.gz'
            ]
    out = 'scratch/sample_set_stats.csv'

n_samples, n_sites = [], []
for i, sample_set in enumerate(sample_sets):
    n_samples.append(len(utils.get_samples(sample_set, sample_class)))
    n_sites.append(utils.count_lines_vcf(vcfs[i]))

data = pd.DataFrame(columns=[s.replace('_', ' ') for s in sample_sets])
data.loc['n samples'] = n_samples
data.loc['n sites'] = n_sites

data.T.to_csv(out)
```
```r
if (exists('snakemake')) {
    full_anc = snakemake@input[['full_anc']]
    full = snakemake@input[['full']]
    deleterious_anc = snakemake@input[['deleterious_anc']]
    deleterious = snakemake@input[['deleterious']]
    output = snakemake@output[[1]]
    postprocessing_source = snakemake@config$polydfe_postprocessing_source
} else {
    # testing
    full_anc = "output/default/polydfe/pendula/output/C.full_anc.out"
    full = "output/default/polydfe/pendula/output/C.full.out"
    deleterious_anc = "output/default/polydfe/pendula/output/C.deleterious_anc.out"
    deleterious = "output/default/polydfe/pendula/output/C.deleterious.out"
    output = "scratch/model_comparison.txt"
    postprocessing_source = "resources/polydfe/postprocessing/script.R"
}

source(postprocessing_source)

tt = rbind(compareModels(full_anc, full)$LRT,
           compareModels(deleterious_anc, deleterious)$LRT,
           compareModels(full, deleterious)$LRT,
           compareModels(full_anc, deleterious_anc)$LRT)

tt = data.frame(type1=c('full_anc', 'deleterious_anc', 'full', 'full_anc'), tt)
tt = data.frame(type2=c('full', 'deleterious', 'deleterious', 'deleterious_anc'), tt)

write.table(tt, output)
```
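`compareModels` from the polyDFE postprocessing script reports likelihood-ratio tests between nested model fits. As a pointer to what the `LRT` column contains, here is a hedged sketch of the underlying arithmetic (an illustration, not the postprocessing code itself):

```python
# Likelihood-ratio test between nested models: the statistic
# 2 * (lnL_full - lnL_nested) is asymptotically chi-squared distributed
# with df equal to the number of extra free parameters in the full model.
from scipy.stats import chi2

def lrt_p_value(lnl_nested: float, lnl_full: float, df: int) -> float:
    stat = 2 * (lnl_full - lnl_nested)
    return chi2.sf(stat, df)  # survival function = 1 - CDF
```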
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

vcf = snakemake.input.vcf
prefix = snakemake.params.prefix

# create bed, bim and fam files
shell(f"plink --vcf {vcf} --out {prefix} --allow-extra-chr")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import utils

try:
    test_mode = False
    Q = snakemake.input.Q
    sample_set = snakemake.params.sample_set
    flag = snakemake.params.flag
    sample_class = snakemake.params.sample_class
    subsample_sets = snakemake.params.subsample_sets
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula_pubescens'
    flag = 'biallelic'
    sample_class = 'default'
    subsample_sets = [
        'pendula_north_admixture',
        'pendula_south_admixture',
        'pubescens_north_admixture',
        'pubescens_south_admixture'
    ]
    Q = f"output/{sample_class}/admixture/{sample_set}/{flag}/snps.admixture.4.Q"
    out = 'scratch/admixture_locations.png'

# load data
data = pd.read_csv(Q, sep=' ', header=None)
K = data.columns.shape[0]

data[['name', 'latitude']] = utils.get_samples(sample_set, sample_class=sample_class)[['name', 'latitude']]
data['sample_set'] = ''

# assign subsample sets to samples
for subsample_set in subsample_sets:
    subsamples = utils.get_samples(subsample_set, sample_class=sample_class)
    data.sample_set[data.name.isin(subsamples.name)] = subsample_set

# sort individuals
# sort sample sets in the specified order
data['sample_set_order'] = [subsample_sets.index(s) for s in data.sample_set]
data = data.sort_values(['sample_set_order', 'latitude']).reset_index()

# plot data
ax = data[range(K)].plot(kind='bar', stacked=True, xticks=[], yticks=[0, 1], width=1, legend=None)

# define ticks
labels = [utils.get_proper_name(s) for s in subsample_sets]
tick_locs = [np.median(data[data.sample_set == s].index) for s in subsample_sets]
labels2 = [data[data.sample_set == s].sample_set.unique() for s in subsample_sets]
plt.xticks(ticks=tick_locs, labels=labels)

# plot vertical lines separating the subsample sets
plt.vlines([np.max(data[data.sample_set == s].index) for s in subsample_sets[:-1]],
           0, 1, linestyles='dashed', color='black')

ax.set_aspect(data.shape[0] / 5)
utils.scale_cf(4 / 3)
ax.margins(0)

utils.save_fig(out, tight_layout=True, show=test_mode)
```
```r
if (exists('snakemake')) {
    input = snakemake@input[[1]]
    n = snakemake@params[['n']]
    prefix = snakemake@params[['prefix']]
    postprocessing_source = snakemake@config$polydfe_postprocessing_source
} else {
    input = "output/default/polydfe/pendula.1_pops.C.full_anc.out"
    n = 10
    prefix = "scratch/polydfe"
    postprocessing_source = "resources/polydfe/postprocessing/script.R"
}

source(postprocessing_source)

bootstrapData(input, outputfile = prefix, rep = n)
```
```r
if (exists('snakemake')) {
    input = snakemake@input[[1]]
    prefix = snakemake@params[['prefix']]
    type = snakemake@params[['type']]
    postprocessing_source = snakemake@config$polydfe_postprocessing_source

    source(postprocessing_source)
} else {
    input = "output/default/polydfe/pendula/out/C.deleterious.out"
    prefix = "scratch/init_C.deleterious"
    type = "deleterious"
    postprocessing_source = "https://github.com/paula-tataru/polyDFE/raw/master/postprocessing.R"

    library(devtools)
    source_url(postprocessing_source)
}

# For some reason we need to pass the parameters to be fixed explicitly.
# We only support model C here.
fix = c("eps_cont")

if (type == 'deleterious_anc' | type == 'deleterious') {
    fix = c(fix, c("p_b", "S_b"))
}

if (type == 'deleterious' | type == 'full') {
    fix = c(fix, "eps_an")
}

createInitLines(input, prefix, fix = fix)
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import os.path

from snakemake.shell import shell

try:
    bed = snakemake.input.bed
    window_size = snakemake.params.get('window_size', 50)
    step_size = snakemake.params.get('step_size', 10)
    R2 = snakemake.params.get('R2', 0.1)
    out = snakemake.output[0]
except NameError:
    # testing (the test value must be assigned to `bed`, which is used below)
    bed = 'output/default/snps/pendula/biallelic/snps.bed'
    window_size = 50
    step_size = 10
    R2 = 0.1
    out = 'scratch/'

output_prefix = os.path.dirname(out)
input_file_prefix = os.path.basename(bed).replace('.bed', '')

shell(f"""
    cd {output_prefix}
    plink --bfile {input_file_prefix} --indep-pairwise {window_size} {step_size} {R2} --allow-extra-chr
""")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import pandas as pd

import pca_utils
import utils

# The derivation of some sample sets depends on the previous derivation of
# other, more general sample sets. This is because the utils module will
# look for previously generated files matching the sample set name.
# The order of derivation is thus important, and has to go from less
# specific to more specific.
try:
    input = snakemake.input[0]
    sample_set = snakemake.params.sample_set
    sample_class = snakemake.params.sample_class
    pendula_latitudinal_barrier = snakemake.config['latitudinal_barriers']['pendula']
    pubescens_latitudinal_barrier = snakemake.config['latitudinal_barriers']['pubescens']
    out = snakemake.output[0]
except NameError:
    # testing
    input = 'output/default/admixture/pubescens/biallelic/snps.admixture.2.Q'
    sample_set = 'pubescens_north_admixture'
    sample_class = 'default'
    pendula_latitudinal_barrier = 64
    pubescens_latitudinal_barrier = 65
    out = f'scratch/{sample_set}.args'


# get the samples to be included depending on ADMIXTURE's
# Q matrix denoting the provenance of an individual
def get_samples_admixture(condition, supersample_set):
    Q = pd.read_csv(input, index_col=None, header=None, sep=' ')
    mask = condition(Q.iloc[:, 0]).values
    return utils.get_samples(supersample_set, sample_class)[mask]


# based on PCA
pendula_pubescens_barrier = 20

birch_outliers = ['GNA02', 'GNA04', 'BUR27a', 'BUR27b']
birch_samples = pd.read_csv(utils.config['birch_samples'], index_col=0)

if sample_set == 'all':
    outgroup_samples = pd.read_csv(utils.config['outgroup_samples'], index_col=0)

    # concatenate outgroup and birch samples
    samples = pd.concat([birch_samples, outgroup_samples])

elif sample_set == 'birch':
    samples = birch_samples

elif sample_set == 'left_cluster':
    pca, pc, _ = pca_utils.get('birch', 'biallelic', sample_class)
    samples = birch_samples[pc[:, 0] < pendula_pubescens_barrier]

elif sample_set == 'right_cluster':
    pca, pc, _ = pca_utils.get('birch', 'biallelic', sample_class)
    samples = birch_samples[pc[:, 0] >= pendula_pubescens_barrier]

elif sample_set == 'pendula_pubescens':
    # remove outliers
    samples = birch_samples[~birch_samples.name.isin(birch_outliers)]

elif sample_set == 'pendula':
    samples = utils.get_samples('right_cluster', sample_class)
    samples = samples[~samples.name.isin(birch_outliers)]

elif sample_set == 'pendula_south':
    samples = utils.get_samples('pendula', sample_class)
    samples = samples[samples.latitude <= pendula_latitudinal_barrier]

elif sample_set == 'pendula_north':
    samples = utils.get_samples('pendula', sample_class)
    samples = samples[samples.latitude > pendula_latitudinal_barrier]

elif sample_set == 'pubescens':
    samples = utils.get_samples('left_cluster', sample_class)
    # remove outliers
    samples = samples[~samples.name.isin(birch_outliers)]

elif sample_set == 'pubescens_south':
    samples = utils.get_samples('pubescens', sample_class)
    samples = samples[samples.latitude <= pubescens_latitudinal_barrier]

elif sample_set == 'pubescens_north':
    samples = utils.get_samples('pubescens', sample_class)
    samples = samples[samples.latitude > pubescens_latitudinal_barrier]

elif sample_set == 'outgroups':
    samples = pd.read_csv(utils.config['outgroup_samples'], index_col=0)

elif sample_set == 'pendula_north_admixture':
    samples = get_samples_admixture(lambda Q: Q > 0.5, 'pendula')

elif sample_set == 'pendula_south_admixture':
    samples = get_samples_admixture(lambda Q: Q <= 0.5, 'pendula')

elif sample_set == 'pubescens_north_admixture':
    samples = get_samples_admixture(lambda Q: Q > 0.5, 'pubescens')

elif sample_set == 'pubescens_south_admixture':
    samples = get_samples_admixture(lambda Q: Q <= 0.5, 'pubescens')

elif sample_set == 'pubescens_admixture':
    samples = get_samples_admixture(lambda Q: Q > 0.5, 'pendula_pubescens')

elif sample_set == 'pendula_admixture':
    samples = get_samples_admixture(lambda Q: Q <= 0.5, 'pendula_pubescens')

# write samples to file
samples.name.to_csv(out, index=False, header=False)
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import os

import utils

sample_set = snakemake.params.sample_set
sample_class = snakemake.params.sample_class
n_pops = snakemake.params.n_pops
out = snakemake.output[0]

samples = utils.get_samples(sample_set, sample_class)
samples['pop'] = 'pop0'


def assign_pop1(subset_name, samples):
    subset = utils.get_samples(subset_name, sample_class)
    samples.loc[samples.name.isin(subset.name), 'pop'] = 'pop1'


# subpopulations:
# pubescens: pop0, pendula: pop1
# northern: pop0, southern: pop1
if n_pops == 2:
    if sample_set == 'pendula_pubescens':
        assign_pop1('pendula', samples)
    elif sample_set == 'pendula':
        assign_pop1('pendula_south', samples)
    elif sample_set == 'pubescens':
        assign_pop1('pubescens_south', samples)

# write to file
with open(out, 'w') as f:
    f.write(os.linesep.join(samples[['name', 'pop']].agg('\t'.join, axis=1)) + os.linesep)
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

data = snakemake.input.data
seed = snakemake.input.seed
config = snakemake.input.config
bin = snakemake.input.bin
out_sfs = snakemake.output.sfs
out_probs = snakemake.output.probs
tmp_dir = snakemake.resources.tmpdir

# execute command
shell(f"{bin} {config} {data} {seed} {out_sfs} {out_probs}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import pandas as pd

import demographic_scenarios

try:
    test_mode = False
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    results = snakemake.input.results
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    params_out = snakemake.output.params
    params_pretty_out = snakemake.output.params_pretty
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    test_mode = True
    vcf = "output/tetraploid/snps/pendula/synonymous/snps.vcf.gz"
    sample_set = 'pendula_pubescens'
    scenario_name = 'split_symmetric_migration'
    n_pops = 2
    """scenario_name = 'one_pop_size_change_since_ice_age'
    n_pops = 1"""
    results = [f"output/tetraploid/dadi/{scenario_name}/{sample_set}/synonymous/batches/{n}/dadi.csv"
               for n in range(1, 101)]
    pops = f"output/tetraploid/sample_sets/subpopulations/pendula/{n_pops}_pops.txt"
    params_out = "scratch/dadi.csv"
    params_pretty_out = "scratch/dadi.out"
    generation_time = 20
    mu = 1e-9
    N_e = 675000

res = []
for file in results:
    df = pd.read_csv(file, index_col=None, header=0)
    res.append(df)

res = pd.concat(res, axis=0, ignore_index=True)

# determine result with highest likelihood
best_params = res[res.lnl == res.lnl.max()].iloc[0]

# determine initial value
param_names = demographic_scenarios.get_class(scenario_name).params
p0 = [best_params[name] for name in param_names]

scenario = demographic_scenarios.restore(scenario_name, p0, vcf, pops, generation_time, n_pops, N_e, mu)

# calculate standard deviation from GIM
scenario.perform_uncertainty_analysis()

# save to file
data = scenario.to_dataframe()
open(params_pretty_out, 'w').write(data.to_markdown())
open(params_out, 'w').write(data.to_csv())
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import numpy as np
import pandas as pd

import demographic_scenarios
import utils

try:
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    results = snakemake.input.results
    result = snakemake.input.result_original
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    params_out = snakemake.output.params
    params_pretty_out = snakemake.output.params_pretty
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    vcf = "output/tetraploid/snps/pendula/synonymous/snps.vcf.gz"
    """pops = "output/tetraploid/sample_sets/subpopulations/pendula/2_pops.txt"
    scenario_name = 'split_symmetric_migration'
    n_pops = 2"""
    scenario_name = 'exp_pop_growth'
    sample_set = "pendula"
    pops = f"output/tetraploid/sample_sets/subpopulations/{sample_set}/1_pops.txt"
    results = [f"output/tetraploid/dadi/{scenario_name}/{sample_set}/synonymous/bootstraps/{n}/dadi.csv"
               for n in range(1, 11)]
    result = f"output/tetraploid/dadi/{scenario_name}/pendula/synonymous/dadi.csv"
    n_pops = 1
    params_out = "scratch/dadi.csv"
    params_pretty_out = "scratch/dadi.out"
    generation_time = 20
    mu = 1e-9
    N_e = 675000

# load bootstrap results
bootstraps = []
for file in results:
    bootstraps.append(pd.read_csv(file, index_col=None, header=0).head(1))

bootstraps = pd.concat(bootstraps, axis=0, ignore_index=True)

# load params from file
params = pd.read_csv(result, index_col=None, header=0).iloc[0]

# determine initial value
param_names = demographic_scenarios.get_class(scenario_name).params
p0 = [params[name] for name in param_names]

scenario = demographic_scenarios.restore(scenario_name, p0, vcf, pops, generation_time, n_pops, N_e, mu)

# calculate standard deviation from GIM
scenario.perform_uncertainty_analysis()

# get dataframe of params
data = scenario.to_dataframe()

# confidence interval threshold
a = 0.05

# calculate additional statistics based on the bootstraps
cols = [
    'cis_bs_percentile',  # percentile bootstrap confidence interval
    'cis_bs_bca',  # BCa bootstrap confidence interval
    'std_bs',  # bootstrap standard deviation
    'mean_bs',  # bootstrap mean
]
col_data = {key: [] for key in cols}

for name in ['lnl', 'theta'] + scenario.params:
    values = bootstraps[name].values.tolist()

    bs_percentile = utils.get_ci_percentile_bootstrap(values, a)
    bs_bca = utils.get_ci_bca(values, params[name], a)

    col_data['cis_bs_percentile'].append(bs_percentile)
    col_data['cis_bs_bca'].append(bs_bca)
    col_data['std_bs'].append(np.std(values, ddof=1))
    col_data['mean_bs'].append(np.mean(values))

# add to data frame
for col in cols:
    data.loc[col] = col_data[col]

# save to file
open(params_out, 'w').write(data.to_csv())
open(params_pretty_out, 'w').write(data.to_markdown())
```
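The interval columns above are filled by repo helpers (`utils.get_ci_percentile_bootstrap`, `utils.get_ci_bca`) whose source is not shown on this page. A minimal sketch of the percentile variant, assuming the conventional definition (the BCa variant additionally corrects for bias and acceleration):

```python
import numpy as np

def get_ci_percentile_bootstrap(values, a=0.05):
    """Percentile bootstrap CI: the (a/2, 1 - a/2) empirical quantiles
    of the bootstrap replicates."""
    values = np.asarray(values, dtype=float)
    return tuple(np.quantile(values, [a / 2, 1 - a / 2]))
```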
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import vcf
from Bio import SeqIO
from vcf.parser import _Info as Info

import codon_utils

try:
    vcf_file = snakemake.input.vcf
    gff = snakemake.input.gff
    ref = snakemake.input.ref
    out = snakemake.output[0]
except NameError:
    # testing
    vcf_file = 'testing/snps1.ancestral.vep.annotated.vcf.gz'
    gff = 'resources/reference/genome.corrected.gff.gz'
    ref = 'resources/reference/genome.fasta'
    out = 'scratch/annotated.vcf'

vcf_reader = vcf.Reader(filename=vcf_file)

# add new fields to header
vcf_reader.infos['Degeneracy'] = Info('Degeneracy', 1, 'Integer', 'n-fold degeneracy', None, None)
vcf_reader.infos['Degeneracy_Info'] = Info('Degeneracy_Info', 1, 'String',
                                           'Additional information about custom degeneracy annotation', None, None)

vcf_writer = vcf.Writer(open(out, 'w'), vcf_reader)
ref_reader = SeqIO.parse(ref, 'fasta')


# add degeneracy annotation for given record
def annotate_degeneracy(record, cd, codon, codon_pos, codon_start, pos_codon, contig):
    degeneracy = None
    if 'N' not in codon:
        degeneracy = codon_utils.get_degeneracy(codon, pos_codon)

    record.INFO['Degeneracy'] = degeneracy
    record.INFO['Degeneracy_Info'] = f"{pos_codon},{cd.strand},{codon}"

    print(f'pos codon: {pos_codon}, pos abs: {record.POS}, '
          f'codon start: {codon_start}, codon: {codon}, '
          f'strand: {cd.strand}, ref allele: {contig[record.POS - 1]}, '
          f'degeneracy: {degeneracy}, codon pos: {str(codon_pos)}, '
          f'rec allele: {record.REF}', flush=True)

    return record


info_fields = {'Degeneracy': None, 'Degeneracy_Info': None}

# perform annotation
codon_utils.map_sites(vcf_reader, vcf_writer, gff, ref_reader, annotate_degeneracy, info_fields=info_fields)
```
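`codon_utils.get_degeneracy` is a repo-local helper whose source is not shown here. Under the conventional definition, the n-fold degeneracy of a codon position is the number of bases at that position that leave the encoded amino acid unchanged; a sketch using Biopython's standard codon table (an illustration of the concept, not the repo's implementation):

```python
from Bio.Data.CodonTable import standard_dna_table

def get_degeneracy(codon: str, pos: int) -> int:
    """n-fold degeneracy of position `pos` (0-2) in `codon` (stop codons not handled)."""
    table = standard_dna_table.forward_table  # codon -> amino acid, stops excluded
    aa = table.get(codon)
    # count bases at `pos` encoding the same amino acid (including the codon itself)
    n_same = sum(table.get(codon[:pos] + b + codon[pos + 1:]) == aa for b in 'ACGT')
    return {1: 0, 2: 2, 3: 3, 4: 4}[n_same]

assert get_degeneracy('GGA', 2) == 4  # glycine: third position is fully degenerate
assert get_degeneracy('GGA', 0) == 0  # first position always changes the amino acid
```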
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import numpy as np
import vcf
from Bio import SeqIO
from vcf.parser import _Info as Info

import codon_utils

try:
    vcf_file = snakemake.input.vcf
    gff = snakemake.input.gff
    ref = snakemake.input.ref
    out = snakemake.output[0]
except NameError:
    # testing
    vcf_file = 'testing/snps1.ancestral.vep.annotated.vcf.gz'
    gff = 'resources/reference/genome.corrected.gff.gz'
    ref = 'resources/reference/genome.fasta'
    out = 'scratch/annotated.vcf'

vcf_reader = vcf.Reader(filename=vcf_file)

# add new fields to header
vcf_reader.infos['Synonymy'] = Info('Synonymy', 1, 'Integer', 'Whether coding site is synonymous', None, None)
vcf_reader.infos['Synonymy_Info'] = Info('Synonymy_Info', 1, 'String',
                                         'Additional information about custom synonymy annotation', None, None)

vcf_writer = vcf.Writer(open(out, 'w'), vcf_reader)
ref_reader = SeqIO.parse(ref, 'fasta')

# Maximum number of assertion errors until raising an exception.
# We introduce some leniency here as VEP produced different
# codons in a very few cases for which no explanation could be found.
n_errors_tol = 3

# the erroneous records
records_err = []


# add synonymy annotation for given record
def annotate_synonymy(record, cd, codon, codon_pos, codon_start, pos_codon, contig):
    # fetch the alternative allele if present
    alt = codon_utils.get_alt_allele(record, cd.strand)

    info = ''
    synonymy, alt_codon, codons_vep = None, None, None

    if alt:
        # alternative codon
        # 'n' might not be in uppercase
        alt_codon = codon_utils.mutate(codon, alt, pos_codon).upper()

        # whether the alternative codon is synonymous
        if 'N' not in codon and 'N' not in alt_codon:
            synonymy = int(codon_utils.is_synonymous(codon, alt_codon))

        info += f'{alt_codon}'

        if alt_codon in codon_utils.stop_codons:
            info += ',stop_gained'

        if alt_codon in codon_utils.start_codons:
            info += ',start_gained'

        # fetch the codons from the VEP annotation
        codons_vep = codon_utils.parse_codons_vep(record)

        # make sure the codons determined by VEP are the same as our codons
        # we can only do the comparison for variant sites
        # only do the comparison if we could find exactly one pair of VEP codons
        # if there are several pairs then this might be because of adjacent SNPs
        # whose alternative codons are not supported at this point
        if len(codons_vep) == 1:
            # check for equality
            if not np.array_equal(codons_vep[0], [codon, alt_codon]):
                records_err.append(record)

                if len(records_err) > n_errors_tol:
                    raise AssertionError(f"VEP codons don't match at {codon_utils.format_errors(records_err)}. "
                                         f"Number of errors ({n_errors_tol}) exceeded.")

    # add to info field
    record.INFO['Synonymy'] = synonymy
    record.INFO['Synonymy_Info'] = info

    print(f'pos codon: {pos_codon}, pos abs: {record.POS}, '
          f'codon start: {codon_start}, strand: {cd.strand}, '
          f'ref allele: {contig[record.POS - 1]}, rec allele: {record.REF}, '
          f'codon pos: {str(codon_pos)}, codons: {str([codon, alt_codon])}, '
          f'VEP: {str(codons_vep)}', flush=True)

    return record


info_fields = {'Synonymy': None, 'Synonymy_Info': None}

# perform annotation
codon_utils.map_sites(vcf_reader, vcf_writer, gff, ref_reader, annotate_synonymy, info_fields=info_fields)

print(f"In total, {len(records_err)} codon assertion errors "
      f"occurred at {str([str(r) for r in records_err])}.")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

try:
    ids = snakemake.input.ids
    input = snakemake.input.bed
    out = snakemake.output.bed
except NameError:
    # testing
    ids = "output/default/snps/pendula/biallelic/plink.prune.in"
    input = "output/default/snps/pendula/biallelic/snps.bed"
    out = "output/default/snps/pendula/biallelic/snps.admixture.bed"

input_prefix = input.replace('.bed', '')
out_prefix = out.replace('.bed', '')

shell(f"plink --bfile {input_prefix} --extract {ids} --make-bed --out {out_prefix} --allow-extra-chr '0'")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)
vcf = snakemake.input.vcf
filter = snakemake.params.filter
out = snakemake.output[0]

shell(f"gatk --java-options '{java_opts}' SelectVariants -V {vcf} -O {out} "
      f"--selectExpressions '{filter}'")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
# from snakemake_wrapper_utils.java import get_java_opts

vcf = snakemake.input.vcf
# samples = snakemake.input.samples
# java_opts = get_java_opts(snakemake)
out = snakemake.output[0]

# keep only mono- and biallelic SNPs
# we retain non-variant sites here, which we will annotate
# we do this as we need the total number of surveyed sites for predicting the DFE
shell(f"bcftools view {vcf} -Oz --apply-filters .,PASS --max-alleles 2 --exclude-types indels,mnps,other > {out}")

# this didn't work for some reason; there were still indels among the filtered variants
"""shell(f"gatk --java-options '{java_opts}' SelectVariants -V {vcf} -O {out} \
    --select-type-to-include SNP --select-type-to-include NO_VARIATION \
    --select-type-to-exclude INDEL --select-type-to-exclude SYMBOLIC \
    --select-type-to-exclude MNP --select-type-to-exclude MIXED \
    --sample-name {samples}")"""
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import dadi

import demographic_scenarios
import sfs_utils

try:
    test_mode = False
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    params_out = snakemake.output.params
    params_pretty_out = snakemake.output.params_pretty
    mode = snakemake.params.mode
    n_iter = snakemake.config['dadi_n_iter_' + mode]
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    test_mode = True
    sample_set = "pendula"
    vcf = f"output/default/snps/{sample_set}/synonymous/snps.vcf.gz"
    n_pops = 2
    pops = f"output/default/sample_sets/subpopulations/{sample_set}/{n_pops}_pops.txt"
    scenario_name = 'split_unidirectional_migration_from_two_to_one_since_ice_age'
    params_out = "scratch/dadi.csv"
    params_pretty_out = "scratch/dadi.out"
    # mode = 'local_optimization'
    mode = 'random_hopping'
    n_iter = 1
    generation_time = 20
    mu = 1e-9
    N_e = 675000

n_proj = 20
dd = dadi.Misc.make_data_dict_vcf(vcf, pops)
fs = sfs_utils.get_spectrum_from_dd(dd, n_proj=n_proj, n_pops=n_pops)

params = dict(fs=fs, dd=dd, n_pops=n_pops, n_proj=n_proj, n_iter=n_iter, N_e=N_e,
              mu=mu, generation_time=generation_time, mode=mode)

scenario = demographic_scenarios.create(scenario_name, params)

# perform simulation
scenario.simulate()

# save to file
data = scenario.to_dataframe()
open(params_pretty_out, 'w').write(data.to_markdown())
open(params_out, 'w').write(data.to_csv())
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

try:
    data_files = snakemake.input.data
    out_data = snakemake.output.data
    out_seed = snakemake.output.seed
except NameError:
    # testing
    data_files = [f"output/default/est-sfs/data/{n}.txt" for n in range(1, 10)]
    out_data = "scratch/data.combined.txt"
    out_seed = "scratch/seed.txt"

# create seed file
open(out_seed, 'w').write('0')

# combine contents of all data files
with open(out_data, 'w') as out:
    for file in data_files:
        out.writelines(open(file, 'r').readlines())
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

vcfs = ' '.join([f"-I {vcf}" for vcf in snakemake.input.vcfs])
java_opts = get_java_opts(snakemake)
tmp_dir = snakemake.resources.tmpdir
out = snakemake.output[0]

shell(f"gatk --java-options '{java_opts}' GatherVcfs {vcfs} -O {out} \
    --TMP_DIR {tmp_dir} --REORDER_INPUT_BY_FIRST_VARIANT")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import os

import numpy as np
import pandas as pd

import utils

try:
    targets_file = snakemake.input.targets
    gff = snakemake.input.gff
    n_files = snakemake.params.n_files
    out = snakemake.output
except NameError:
    # testing
    targets_file = "resources/reference/targeted_positions.txt"
    gff = "resources/reference/genome.corrected.gff.gz"
    n_files = 1
    out = ["scratch/intervals1.bed"]

pd.options.mode.chained_assignment = None

# the list of targeted genes whose positions are relative to the gene
targets = pd.read_csv(targets_file, sep='\t')

# load the genes from the annotation file
genes = utils.load_gff(gff)
genes = genes[genes.feature == 'gene']

# only keep rows where TanjaKeep is true
targets = targets.loc[targets.TanjaKeep]
targets = targets.reset_index(drop=True)

# initialize array
rows = []

# loop through genes and fetch their positions relative to the contigs
for i, target in targets.iterrows():
    gene = genes[genes.attribute.str.contains('ID=' + target.CHROMOSOME + ';')].iloc[0]
    contig = gene.seqname

    # We use 0-based coordinates for start and 1-based for end positions,
    # as is done in BED files. The GFF file uses 1-based coordinates so we
    # need to offset start by 1.
    start = (gene.start - 1) + target.START

    # The stop position in the BED file is non-inclusive so we
    # just add the length to the start value.
    stop = start + target.LENGTH

    rows.append([contig, start, stop])

    if i % 100 == 0:
        print(f"Processed genes: {i}/{len(targets)}", flush=True)

# randomize order and split arrays into chunks
# we do this to obtain chunks of similar sequence length
np.random.seed(0)
np.random.shuffle(rows)

# partition set of intervals
rows = np.array_split(rows, n_files)

# write contents to files
for i, chunk in enumerate(rows):
    # sort chunk in ascending contig order
    chunk = sorted(chunk, key=lambda x: (len(x[0]), x[0]))

    with open(out[i], 'w') as f:
        f.write(os.linesep.join(['\t'.join(row) for row in chunk]) + os.linesep)
```
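The GFF-to-BED offset above is easy to get wrong, so here is a worked check with hypothetical numbers:

```python
# GFF coordinates are 1-based inclusive; BED uses a 0-based start and an
# exclusive stop. Hypothetical numbers, mirroring the arithmetic above:
gene_start_gff = 100                              # gene start in the GFF
target_start, target_length = 5, 10               # target relative to the gene
bed_start = (gene_start_gff - 1) + target_start   # 104
bed_stop = bed_start + target_length              # 114 (exclusive)
assert (bed_start, bed_stop) == (104, 114)
```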
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

vcf = snakemake.input.vcf
samples = snakemake.input.samples
max_nocall = snakemake.params.max_nocall_fraction
java_opts = get_java_opts(snakemake)
out = snakemake.output[0]

biallelic = snakemake.params.get('biallelic')
flag_biallelic = '--select-type-to-include SNP --restrict-alleles-to BIALLELIC'
optional_flags = flag_biallelic if biallelic else ''

shell(f"gatk --java-options '{java_opts}' SelectVariants -V {vcf} "
      f"-O {out} --sample-name {samples} --max-nocall-fraction {max_nocall} "
      f"--remove-unused-alternates {optional_flags}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

intervals = snakemake.input.intervals
variants = ' '.join([f"-V {vcf}" for vcf in snakemake.input.vcfs])
java_opts = get_java_opts(snakemake)
db_dir = snakemake.output[0]
tmp_dir = snakemake.resources.tmpdir
batch_size = snakemake.params.batch_size

shell(f"gatk --java-options '{java_opts}' GenomicsDBImport {variants} \
    --genomicsdb-workspace-path {db_dir} --tmp-dir {tmp_dir} \
    --interval-padding 100 -L {intervals} --batch-size {batch_size}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

ref = snakemake.input.ref
left = snakemake.input.left
right = snakemake.input.right
name = snakemake.wildcards.name
out = snakemake.output.bam
tmp_dir = snakemake.resources.tmpdir
threads = snakemake.threads

# only map paired-end reads, which account for most of the data
shell(f"samtools sort -T {tmp_dir} <(bwa mem {ref} {left} {right} -R '@RG\\tID:{name}\\tSM:{name}' -t {threads}) > {out}")
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

in_bam = snakemake.input.bam
java_opts = get_java_opts(snakemake)
out_bam = snakemake.output.bam
out_metric = snakemake.output.metrics

shell(f"gatk MarkDuplicates --java-options '{java_opts}' -I {in_bam} -O {out_bam} -M {out_metric}")
```
```r
if (exists('snakemake')) {
    input = snakemake@input[[1]]
    output = snakemake@output[[1]]
    postprocessing_source = snakemake@config$polydfe_postprocessing_source
} else {
    # testing
    input = "output/default/polydfe/pendula/output/C.full_anc.out"
    output = "scratch/dfe_parsed_output.txt"
    postprocessing_source = "resources/polydfe/postprocessing/script.R"
}

source(postprocessing_source)

intervals = c(-100, -10, -1, 0, 1)

est = c(sapply(input, parseOutput))
dfe = t(sapply(est, getDiscretizedDFE, intervals))

# write intervals and discretized DFE values
write(paste(intervals, collapse = " "), file = output, append = TRUE)
write(paste(dfe, collapse = " "), file = output, append = TRUE)

# write gradient
write(est[[1]]$criteria, file = output, append = TRUE)
```
```python
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np

import utils

try:
    test_mode = False
    sample_set = snakemake.params.sample_set
    subsample_sets = snakemake.params.subsample_sets
    sample_class = snakemake.params['sample_class']
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    subsample_sets = ['pendula_south_admixture', 'pendula_north_admixture']
    sample_class = 'default'
    out = 'scratch/admixture_locations.png'

# fetch all samples
samples = utils.get_samples(sample_set=sample_set, sample_class=sample_class)

# fetch subsamples and choose colors according to subsample set
# only two subsample sets are currently supported
subsamples = utils.get_samples(subsample_sets[0], sample_class=sample_class)
mask = samples.name.isin(subsamples.name)
colors = [int(m) for m in mask]

# perturb the sample locations to make individual samples visible
np.random.seed(seed=0)
samples.latitude += np.random.normal(size=samples.shape[0], scale=0.3)
samples.longitude += np.random.normal(size=samples.shape[0], scale=0.3)

# plot samples
scatter = plt.scatter(samples.longitude, samples.latitude, c=colors, s=3.5)
plt.gca().axis('square')
plt.xlabel('longitude')
plt.ylabel('latitude')

# add legend
labels = [utils.get_proper_name(s) for s in subsample_sets]
plt.legend(handles=scatter.legend_elements()[0], labels=labels)

utils.scale_cf(2 / 3)
utils.save_fig(out, show=test_mode, tight_layout=True)
```
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import re

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

import plot_utils

try:
    test_mode = False
    log_files = snakemake.input
    K = snakemake.params.K
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    log_files = [f"output/default/admixture/pendula/biallelic/snps.admixture.{n}.log" for n in range(1, 5)]
    K = range(1, 5)
    out = 'scratch/cv_error_admixture.svg'

cv_errors = {}
for k, file in zip(K, log_files):
    # get file contents
    contents = open(file, 'r').read()

    # match CV error
    match = re.search(r"CV error \(K=\d\): (\d+\.\d+)", contents)
    cv_errors[k] = float(match.group(1))

# plot CV errors
plt.plot(cv_errors.keys(), cv_errors.values(), marker='o')
plt.xlabel('K')
plt.ylabel('CV error')

# mark the K with the lowest CV error
best_K = list(cv_errors.keys())[list(cv_errors.values()).index(min(cv_errors.values()))]
plt.axvline(best_K, color="orange")

# only show integers on x-axis
plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))

# increase scaling
plot_utils.scale_cf(2 / 3)

# save plot
plot_utils.save_fig(out, show=test_mode, tight_layout=True)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import feems_utils
import utils

try:
    test_mode = False
    files = snakemake.input
    lambdas = snakemake.params.lambdas
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    lambdas = [0.01, 1]
    files = [f"output/default/feems/pendula/biallelic/cv_error_{n}.txt" for n in lambdas]
    out = 'scratch/cv_error_feems.svg'

cv_errors = []
for file in files:
    # get file contents
    cv_errors.append(float(open(file, 'r').read()))

feems_utils.plot_cv_errors(lambdas, cv_errors)

# save plot
utils.save_fig(out, show=test_mode, tight_layout=True)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np

import utils
import warnings

try:
    sites_all = snakemake.input.sites_all
    vcf_synonymous = snakemake.input.synonymous
    vcf_nonsynonymous = snakemake.input.nonsynonymous
    out = snakemake.output[0]
except NameError:
    # testing
    sites_all = "output/default/snps/pendula/snps.vcf.gz"
    vcf_synonymous = "output/default/snps/pendula_synonymous/snps.vcf.gz"
    vcf_nonsynonymous = "output/default/snps/pendula_nonsynonymous/snps.vcf.gz"
    out = 'scratch/degeneracy.svg'

# the degeneracy values
ns = [0, 2, 4]
labels_degeneracy = ['non-coding'] + [f"{n}-fold" for n in ns]
labels_synonymy = ['_', 'non-synonymous', '_', 'non-synonymous', 'synonymous', '_', 'synonymous', '_']

n_lines_total = utils.count_lines_vcf(sites_all)
degeneracy_all = {n: utils.get_n_degenerate(sites_all, n) for n in ns}
n_non_coding = n_lines_total - sum(degeneracy_all.values())

# the degeneracy classes with respect to synonymy
synonymous = {n: utils.get_n_degenerate(vcf_synonymous, n) for n in [2, 4]}
nonsynonymous = {n: utils.get_n_degenerate(vcf_nonsynonymous, n) for n in [0, 2]}

# the degeneracy classes with synonymy subclasses
degeneracy_synonymy = {
    0: [nonsynonymous[0], degeneracy_all[0] - nonsynonymous[0]],
    2: [nonsynonymous[2], synonymous[2], degeneracy_all[2] - nonsynonymous[2] - synonymous[2]],
    4: [synonymous[4], degeneracy_all[4] - synonymous[4]]
}

# all values in a nested array
degeneracy_synonymy = [[n_non_coding]] + list(degeneracy_synonymy.values())
degeneracy = [sum(s) for s in degeneracy_synonymy]

# prepare colors of disks
cmap = plt.get_cmap("tab20c")
outer_colors = cmap(np.arange(4) * 4)
inner_colors = cmap([1, 5, 6, 9, 10, 11, 13, 14])

fig, ax = plt.subplots()


# get wedge labels of outer disk
def autopict(percentage):
    n_sites = int(np.round(percentage * n_lines_total / 100))
    return str(n_sites) + '\n' + str(np.round(percentage, 1)) + '%'


widths = [0.5, 0.1]

# plot outer disk showing the degeneracy classes
ax.pie(degeneracy, labels=labels_degeneracy, autopct=autopict, startangle=90,
       wedgeprops=dict(edgecolor='w', linewidth=1, antialiased=True, width=widths[0]),
       pctdistance=0.75, colors=outer_colors, radius=1)

# plot inner disk showing the synonymy classes
wedges_synonymy, _ = ax.pie(utils.flatten(degeneracy_synonymy), startangle=90,
                            wedgeprops=dict(edgecolor='w', linewidth=1, antialiased=True, width=widths[1]),
                            colors=inner_colors, radius=1 - widths[0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    plt.legend(wedges_synonymy, labels_synonymy, loc="upper left")

ax.axis('equal')

utils.save_fig(out, tight_layout=True)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cycler

import plot_utils
import polydfe_utils
import utils

plt.rcParams['axes.prop_cycle'] = cycler('color', plt.get_cmap('Set2').colors)

try:
    test_mode = False
    input = snakemake.input
    bs_type = snakemake.params.bs_type
    label_type = snakemake.params.label_type
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    input = {}
    for key in ['pendula_south', 'pendula_north', 'pubescens_south', 'pubescens_north']:
        files = [f"output/default/polydfe/{key}/output/bootstraps/dfe_C.full_anc.{str(n).zfill(3)}.txt"
                 for n in range(1, 101)]
        input[key] = files
    bs_type = 'percentile'
    label_type = 'sample_set'
    out = "scratch/dfe.svg"

width_total = 0.9
width = width_total / len(input.keys())

for i, (key, files) in enumerate(input.items()):
    # parse bootstrapped discretized DFE values
    dfe = []
    for bs in files:
        dfe.append(polydfe_utils.parse_output(bs)['discretized'])

    n = len(dfe[0])

    # determine mean and 95% CIs
    a = 0.05
    values = np.mean(dfe, axis=0)
    dfe = np.array(dfe)

    if bs_type == 'percentile':
        ci = [utils.get_ci_percentile_bootstrap(dfe[:, i], a) for i in range(n)]
    elif bs_type == 'bca':
        ci = [utils.get_ci_bca(dfe[:, i], values[i], a) for i in range(n)]

    yerr = np.array([[values[i] - ci[i][0], ci[i][1] - values[i]] for i in range(n)]).T

    indices = np.array(range(n)) + i * width

    if label_type == 'sample_set':
        label = utils.get_proper_name(key)
    elif label_type == 'polydfe_type':
        label = polydfe_utils.get_label(key)

    # plot bar
    plt.bar(indices, values, width=width, label=label, yerr=yerr, error_kw=dict(capsize=3))

# adjust x-axis
plt.xlabel('$4N_e s$')
plt.xticks([r + (width_total - width) / 2 for r in range(n)], polydfe_utils.get_interval_names(files[0]))
plt.autoscale(tight=True)

eps = 0.01
plt.ylim(min(values - yerr[0]), max(values + yerr[1]) + eps)

plt.legend()

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

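The error bars above are 95% bootstrap intervals, either percentile or BCa depending on `bs_type`. As a point of reference only, a minimal sketch of what a percentile helper along the lines of `utils.get_ci_percentile_bootstrap` might look like (the project's actual implementation may differ):

import numpy as np


def ci_percentile_bootstrap(replicates, a=0.05):
    # two-sided (1 - a) interval taken directly from the
    # empirical quantiles of the bootstrap replicates
    replicates = np.asarray(replicates)
    return np.quantile(replicates, a / 2), np.quantile(replicates, 1 - a / 2)
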
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

import polydfe_utils

try:
    bootstrapped = snakemake.input.bootstrapped
    out = snakemake.output[0]
except NameError:
    # testing
    bootstrapped = [f"output/default/polydfe/pendula/output/bootstraps/dfe_C.full_anc.{str(n).zfill(3)}.txt"
                    for n in range(1, 101)]
    out = "scratch/dfe.svg"

data = polydfe_utils.parse_output_files(bootstrapped)
data = data.sort_values(by=['gradient'])
names = polydfe_utils.get_interval_names(bootstrapped[0])

x = data.gradient.values

# prepare axes and colors
axs = plt.subplots(2, 3)[1].flatten()
colors = plt.get_cmap('Set2').colors

# iterate over intervals
for i, name in enumerate(names):
    y = data[i].values

    # plot points
    axs[i].scatter(x, y, c=[colors[i]] * len(x), s=10, alpha=0.5)

    # plot linear regression line
    m, b = np.polyfit(np.log(x), y, 1)
    axs[i].plot(x, m * np.log(x) + b, c='black')

    # plot mean per bin
    bins = stats.binned_statistic(np.log(x), y, 'mean', bins=10)
    x_bins = np.exp(bins.bin_edges[1:] + (bins.bin_edges[:-1] - bins.bin_edges[1:]) / 2)
    y_bins = bins.statistic
    axs[i].plot(x_bins, y_bins, c='red', alpha=0.5)

    axs[i].set_xlabel(r"log($\delta$)")
    axs[i].set_ylabel(name)
    axs[i].set_xscale('log')

plt.tight_layout(pad=1)
plt.savefig(out)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np

import polydfe_utils
import utils

try:
    input = snakemake.input
    out = snakemake.output[0]
except NameError:
    # testing
    input = [f"output/default/polydfe/pendula_south/output/bootstraps/dfe_C.full_anc.{str(n).zfill(3)}.txt"
             for n in range(1, 101)]
    out = "scratch/param_dist.svg"

# number of intervals
n = len(polydfe_utils.parse_output(input[0])['discretized'])

fig, axes = plt.subplots(n)

# parse bootstrapped discretized DFE values
values = []
for bs in input:
    values.append(polydfe_utils.parse_output(bs)['discretized'])

# determine mean and 95% CIs
a = 0.05
mean = np.mean(values, axis=0)
values = np.array(values)

for i, ax in enumerate(axes):
    ax.hist(values[:, i], bins=50)

    ci_bca = utils.get_ci_bca(values[:, i], mean[i], a)
    ci_percentile = utils.get_ci_percentile_bootstrap(values[:, i], a)

    for x in ci_bca:
        ax.axvline(x=x, c='red', linewidth=1)

    for x in ci_percentile:
        ax.axvline(x=x, c='black', linewidth=1)

plt.tight_layout(pad=0)
fig.savefig(out)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np

import polydfe_utils
import plot_utils
from matplotlib import cycler

plt.rcParams['axes.prop_cycle'] = cycler('color', plt.get_cmap('Set2').colors)

try:
    test_mode = False
    files = snakemake.input
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    files = {'full_anc': "output/default/polydfe/pendula/output/dfe_C.full_anc.txt",
             'full': "output/default/polydfe/pendula/output/dfe_C.full.txt",
             'deleterious': "output/default/polydfe/pendula/output/dfe_C.deleterious.txt"}
    out = "scratch/dfe.svg"

width_total = 0.9
width = width_total / len(files)

# iterate over files and add bars
for i, (name, file) in enumerate(files.items()):
    values = polydfe_utils.parse_output(file)['discretized']
    indices = np.arange(len(values)) + i * width
    label = polydfe_utils.get_label(name)

    plt.bar(indices, values, width=width, label=label)

# adjust x-axis
plt.xlabel('$4N_e s$')
plt.xticks([r + (width_total - width) / 2 for r in range(len(values))], polydfe_utils.get_interval_names(file))
plt.autoscale(tight=True)

plt.legend()

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/fst.weir.fst'
    out = 'scratch/fst.svg'
    log_scale = True

plot_utils.plot_hist_basic_stat(input, colname='WEIR_AND_COCKERHAM_FST', xlabel="$F_{st}$",
                                ylabel="n SNPs", color=plot_utils.get_color('tab20c', 1),
                                log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/fst.windowed.weir.fst'
    out = 'scratch/fst_windowed.svg'
    log_scale = True

plot_utils.plot_hist_basic_stat(input, colname='WEIGHTED_FST', xlabel="$F_{st}$",
                                ylabel="n windows", color="turquoise",
                                log_scale=log_scale, weights='N_VARIANTS')

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/all/hwe.hwe'
    out = 'scratch/hwe.svg'
    log_scale = True

opts = dict(xlabel="heterozygosity", delim_whitespace=True, ylabel="n SNPs", log_scale=log_scale)

opts_observed = dict(colname='O(HET)', color=plot_utils.get_color('tab20b', 18),
                     bins=50, opts=dict(alpha=0.3, label='observed'))
plot_utils.plot_hist_basic_stat(input, **(opts_observed | opts))

opts_expected = dict(colname='E(HET)', color='cornflowerblue',
                     bins=25, opts=dict(alpha=0.3, label='expected'))
plot_utils.plot_hist_basic_stat(input, **(opts_expected | opts))

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    observed = snakemake.params.observed
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/hwe.hwe'
    out = 'scratch/hwe.svg'
    observed = False
    log_scale = True

colname = 'O(HET)' if observed else 'E(HET)'

plot_utils.plot_hist_basic_stat(input, colname=colname, xlabel="heterozygosity",
                                delim_whitespace=True, ylabel="n SNPs",
                                color=plot_utils.get_color('tab20b', 18), log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/hwe.hwe'
    out = 'scratch/hwe.svg'
    log_scale = False

# filtering dependent on allele frequency to check for bias
'''
def filter_ac(data):
    data['AC'] = data.GENO.str.split('/').apply(lambda x: float(x[1])) / \
                 data.GENO.str.split('/').apply(lambda x: float(x[2]))
    return data[~data.AC.between(0.001, 0.999)]
'''

plot_utils.plot_hist_basic_stat(input, colname='P', xlabel="p-value",
                                delim_whitespace=True, ylabel="n SNPs",
                                color=plot_utils.get_color('Paired', 0), log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/missingness.imiss'
    out = 'scratch/missingness.svg'
    log_scale = True

plot_utils.plot_hist_basic_stat(input, colname='F_MISS', xlabel="F_MISS",
                                delim_whitespace=True, ylabel="n individuals",
                                color=plot_utils.get_color('Pastel1', 6), log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/missingness.lmiss'
    out = 'scratch/missingness.svg'
    log_scale = True

plot_utils.plot_hist_basic_stat(input, colname='F_MISS', xlabel="F_MISS",
                                delim_whitespace=True, ylabel="n SNPs",
                                color=plot_utils.get_color('Pastel1', 6), log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import pandas as pd

import plot_utils
import utils

try:
    test_mode = False
    results = snakemake.input
    df = snakemake.params.get('df', 1)
    comparisons = snakemake.params['comparisons']
    labels = snakemake.params.get('labels', list(results.keys()))
    scaling = snakemake.params.get('scaling', 1)
    transpose = snakemake.params.get('transpose', False)
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    scenarios = [
        "split_no_migration",
        "split_unidirectional_migration_from_two_to_one",
        "split_unidirectional_migration_from_one_to_two",
        "split_symmetric_migration",
        "split_asymmetric_migration",
        "split_asymmetric_migration_one_pop_size_change",
        "split_no_migration_since_ice_age",
        "split_unidirectional_migration_from_two_to_one_since_ice_age",
        "split_unidirectional_migration_from_one_to_two_since_ice_age",
        "split_symmetric_migration_since_ice_age",
        "split_asymmetric_migration_since_ice_age",
        "split_asymmetric_migration_one_pop_size_change_since_ice_age"
    ]
    sample_set = "pendula_pubescens"
    results = {scenario: f"output/tetraploid/dadi/{scenario}/{sample_set}/synonymous/dadi.csv"
               for scenario in scenarios}

    # each element is a pair of models to compare,
    # where the first element is the simpler model
    comparisons = [
        ['split_no_migration_since_ice_age', 'split_no_migration', 1],
        ['split_unidirectional_migration_from_two_to_one_since_ice_age',
         'split_unidirectional_migration_from_two_to_one', 1],
        ['split_unidirectional_migration_from_one_to_two_since_ice_age',
         'split_unidirectional_migration_from_one_to_two', 1],
        ['split_symmetric_migration_since_ice_age', 'split_symmetric_migration', 1],
        ['split_asymmetric_migration_since_ice_age', 'split_asymmetric_migration', 1],
        ['split_asymmetric_migration_one_pop_size_change_since_ice_age',
         'split_asymmetric_migration_one_pop_size_change', 1]
    ]
    labels = [key.replace('_', ' ') for key in results.keys()]
    scaling = 1
    df = 1
    transpose = False
    out = 'scratch/dadi_nested.svg'


def parse_lnl(file):
    # take the last row if the file contains bootstrap CI estimates
    # and the first row otherwise
    n_row = -1 if '_ci' in file else 0
    return float(pd.read_csv(file, index_col=None, header=0).iloc[n_row]['lnl'])


simple_models, complex_models, p_values = [], [], []
for comp in comparisons:
    simple_models.append(comp[0])
    complex_models.append(comp[1])

    df = comp[2] if len(comp) == 3 else 1
    lnl_simple = parse_lnl(results[comp[0]])
    lnl_complex = parse_lnl(results[comp[1]])

    p = utils.lrt(lnl_simple, lnl_complex, df)
    p_values.append(p)

types = list(results.keys())
plot_utils.plot_lrt_p_values(simple_models, complex_models, p_values, types, labels, transpose=transpose)

plt.tight_layout()
plot_utils.scale_cf(scaling)
plot_utils.save_fig(out, show=test_mode, tight_layout=True)

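The nested-model comparison above hinges on `utils.lrt`. Assuming it implements a standard likelihood-ratio test, a minimal sketch would compare twice the log-likelihood difference against a chi-squared distribution with `df` degrees of freedom:

from scipy import stats


def lrt(lnl_simple, lnl_complex, df=1):
    # under the simpler (null) model, 2 * (lnl_complex - lnl_simple)
    # is asymptotically chi-squared distributed with df degrees of freedom
    return stats.chi2.sf(2 * (lnl_complex - lnl_simple), df)
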
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.cm as cm
import matplotlib.pyplot as plt

import pca_utils
import utils

try:
    test_mode = False
    sample_set = snakemake.params['sample_set']
    sample_class = snakemake.params['sample_class']
    flag = snakemake.params['flag']
    out = snakemake.output[0]
    add_names = snakemake.params.get('add_names')
    subsample_sets = snakemake.params.get('subsample_sets', [])
    demarcation_type = snakemake.params.get('demarcation_type', 'ellipse')
    scaling = snakemake.params.get('scaling', 1)
    marker_size = snakemake.params.get('marker_size', 20)
    tight_layout = snakemake.params.get('tight_layout', False)
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    sample_class = 'default'
    flag = 'no_missing'
    out = "scratch/all_sites_no_missing.png"
    add_names = False
    subsample_sets = ['pendula_north', 'pendula_south']
    demarcation_type = 'gradient'
    scaling = 1
    marker_size = 20
    tight_layout = False

pca, pc, samples = pca_utils.get(sample_set, flag, sample_class)

# determine color of dots
if demarcation_type == 'color':
    # we only support two colors here
    subsamples = utils.get_samples(subsample_sets[0], sample_class=sample_class)
    mask = samples.name.isin(subsamples.name)
    colors = [int(m) for m in mask]
else:
    colors = samples.latitude

# plot the 2 components
scatter = plt.scatter(x=pc[:, 0], y=pc[:, 1], c=colors, s=marker_size)

# draw ellipses around subsample sets if specified
if demarcation_type == 'ellipse':
    for i, subsample_set in enumerate(subsample_sets):
        pca_utils.encircle_subset(sample_set, subsample_set, sample_class, pc, i)

# add sample names if specified
if add_names:
    eps = 1
    min_lat = samples.latitude.min()
    max_lat = samples.latitude.max()

    for i in range(len(samples)):
        c = cm.viridis((samples.latitude[i] - min_lat) / (max_lat - min_lat))
        plt.text(x=pc[i, 0] + eps, y=pc[i, 1] + eps, s=samples.name[i], size=6, c=c)

plt.gca().axis('square')

# add axis labels
v1 = round(pca.explained_variance_ratio_[0] * 100, 2)
v2 = round(pca.explained_variance_ratio_[1] * 100, 2)
plt.xlabel(f'{v1}%')
plt.ylabel(f'{v2}%')

# only show legend and colorbar if we don't
# color the samples according to their specified subsets
if demarcation_type == 'color':
    labels = [utils.get_proper_name(s) for s in subsample_sets]
    plt.legend(handles=scatter.legend_elements()[0], labels=labels)
else:
    plt.colorbar().set_label('latitude')

    if len(subsample_sets):
        plt.legend()

# increase scaling
utils.scale_cf(scaling)

utils.save_fig(out, tight_layout=tight_layout, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/biallelic/pi.sites.pi'
    out = 'scratch/pi.svg'
    log_scale = True

plot_utils.plot_hist_basic_stat(input, colname='PI', xlabel=r"$\pi$",
                                ylabel="n SNPs", color=plot_utils.get_color('tab20b', 14),
                                log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cycler

import plot_utils
import polydfe_utils

plt.rcParams['axes.prop_cycle'] = cycler('color', plt.get_cmap('Set2').colors)

try:
    testing = False
    input = snakemake.input[0]
    out = snakemake.output[0]
except NameError:
    # testing
    testing = True
    input = "output/tetraploid/polydfe/pendula/type_comparison.C.txt"
    out = "scratch/probs_nested.svg"

# read data containing p-values
# columns: 'type1': complex model name, 'type2': simple model name, 'p.value': LRT p-value
data = pd.read_csv(input, sep=' ')

types = ['full', 'deleterious', 'full_anc', 'deleterious_anc']
labels = [polydfe_utils.get_short_label(t) for t in types]

plot_utils.plot_lrt_p_values(data.type1, data.type2, data['p.value'], types, labels)

plt.tight_layout()
plot_utils.scale_cf(0.75)
plot_utils.save_fig(out, show=testing, tight_layout=True)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt

import feems_utils

try:
    sample_set = snakemake.params['sample_set']
    sample_class = snakemake.params['sample_class']
    flag = snakemake.params['flag']
    out = snakemake.output[0]
except NameError:
    # testing
    sample_set = 'pendula'
    sample_class = 'default'
    flag = 'biallelic'
    out = 'scratch/feems_locations.svg'

sp_graph, samples, coords = feems_utils.get_sp_graph(sample_set, flag, sample_class=sample_class)

projection = feems_utils.get_projection(coords)

# draw figure showing which nodes the samples snapped to
v = feems_utils.create_plot(sp_graph, projection)
v.draw_samples()
v.draw_obs_nodes(use_ids=False)

plt.savefig(out, bbox_inches='tight', pad_inches=0)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import cartopy.feature as cfeature
import matplotlib.pyplot as plt
from feems import viz
from pyproj import Proj

import feems_utils
import utils

try:
    test_mode = False
    sample_set = snakemake.params['sample_set']
    sample_class = snakemake.params['sample_class']
    latitudinal_barriers = snakemake.config['latitudinal_barriers']

    # get latitudinal barrier of given sample set from the config
    if sample_set in latitudinal_barriers:
        latitudinal_barrier = latitudinal_barriers[sample_set]
    else:
        latitudinal_barrier = None

    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    sample_class = 'default'
    latitudinal_barrier = 64
    out = 'scratch/feems_locations.svg'

# fetch coordinates
coords = utils.get_samples(sample_set, sample_class)[['longitude', 'latitude']].values

# initialize figure and axis
fig = plt.figure()
projection = feems_utils.get_projection(coords)
ax = fig.add_subplot(1, 1, 1, projection=projection)
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0)
ax.axis("off")

# add cartopy features
ax.add_feature(cfeature.LAND, facecolor="#f7f7f7", zorder=0)
ax.coastlines("50m", color="#636363", linewidth=0.5, zorder=0)

# project coordinates
project = Proj(projection.proj4_init)
coords_projected = viz.project_coords(coords, project)

# draw coordinates
ax.scatter(
    coords_projected[:, 0],
    coords_projected[:, 1],
    edgecolors="black",
    linewidth=0.5,
    s=150,
    color="#30D5C8",
    marker=".",
    zorder=2
)

# draw line indicating subpopulation boundary
if latitudinal_barrier:
    p1 = project(min(coords[:, 0]), latitudinal_barrier)
    p2 = project(max(coords[:, 0]), latitudinal_barrier)

    ax.plot([p1[0], p2[0]], [p1[1], p2[1]], linestyle='dotted', c='grey', alpha=0.5, linewidth=3)

if test_mode:
    fig.show()

# save plot
plt.savefig(out, bbox_inches='tight', pad_inches=0)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import sfs_utils
import plot_utils

try:
    test_mode = False
    files = snakemake.input.vcf
    pops = snakemake.input.pops
    n_proj = snakemake.params.n_proj
    unfolded = snakemake.params.unfolded
    log_scale = snakemake.params.log_scale
    labels = snakemake.params.get('labels', [])
    tight_layout = snakemake.params.get('tight_layout', False)
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    files = ["output/tetraploid/snps/pendula/biallelic/snps.vcf.gz"]
    pops = "output/tetraploid/sample_sets/subpopulations/pendula/1_pops.txt"
    n_proj = 20
    unfolded = True
    log_scale = False
    labels = ['biallelic']
    tight_layout = False
    out = "scratch/pendula.pendula_one_pop.svg"

sfs_utils.plot_sfs_from_file(files, pops, n_proj, unfolded, log_scale, labels)

plot_utils.save_fig(out, tight_layout=tight_layout, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import sfs_utils
import plot_utils

try:
    test_mode = False
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    n_proj = snakemake.params.n_proj
    sample_set = snakemake.params.get('sample_set', None)
    out = snakemake.output[0]
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    vcf = f"output/default/snps/{sample_set}/4fold/snps.vcf.gz"
    pops = f"output/default/sample_sets/subpopulations/{sample_set}/2_pops.txt"
    n_proj = 20
    out = f"scratch/sfs_{sample_set}.svg"

sfs_utils.plot_sfs_from_file([vcf], pops, n_proj, unfolded=True, sample_set=sample_set)

plot_utils.scale_cf(0.65)
plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import numpy as np
import pandas as pd

import demographic_scenarios
import plot_utils

try:
    test_mode = False
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    result = snakemake.input.result
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    out_sfs = snakemake.output.sfs
    out_residuals = snakemake.output.residuals
    scaling = [1, 0.5]
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    test_mode = True
    n_pops = 1
    vcf = "output/tetraploid/snps/pendula/synonymous/snps.vcf.gz"
    pops = f"output/tetraploid/sample_sets/subpopulations/pendula/{n_pops}_pops.txt"
    scenario_name = 'one_pop_size_change_since_ice_age'
    result = f"output/tetraploid/dadi/{scenario_name}/pendula/synonymous/dadi.csv"
    out_sfs = "scratch/sfs.png"
    out_residuals = "scratch/residuals.png"
    scaling = np.array([1, 1.15]) * 0.6
    generation_time = 20
    mu = 1e-9
    N_e = 945883

# load params from file
params = pd.read_csv(result, index_col=None, header=0).iloc[0]

# determine initial value
param_names = demographic_scenarios.get_class(scenario_name).params
p0 = [params[name] for name in param_names]

scenario = demographic_scenarios.restore(scenario_name, p0, vcf, pops, generation_time, n_pops, N_e, mu)

# generate SFS plot
scenario.plot_sfs(scaling_1d=scaling)
plot_utils.save_fig(out_sfs, tight_layout=True, show=test_mode, clear=True)

# generate plot of residuals
scenario.plot_residuals_1d()
plot_utils.save_fig(out_residuals, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import pandas as pd

import demographic_scenarios
import plot_utils

try:
    test_mode = False
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    result = snakemake.input.result
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    out_data = snakemake.output.data
    out_model = snakemake.output.model
    out_residuals = snakemake.output.residuals
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    test_mode = True
    sample_set = "pendula"
    n_pops = 2
    vcf = "output/tetraploid/snps/pendula/synonymous/snps.vcf.gz"
    pops = f"output/tetraploid/sample_sets/subpopulations/pendula/{n_pops}_pops.txt"
    scenario_name = 'split_asymmetric_migration'
    result = f"output/tetraploid/dadi_old/{scenario_name}/pendula/synonymous/dadi.csv"
    out_data = "scratch/data.svg"
    out_model = "scratch/model.svg"
    out_residuals = "scratch/residuals.svg"
    generation_time = 20
    mu = 1e-9
    N_e = 675000

# load params from file
params = pd.read_csv(result, index_col=None, header=0).iloc[0]

# determine initial value
param_names = demographic_scenarios.get_class(scenario_name).params
p0 = [params[name] for name in param_names]

scenario = demographic_scenarios.restore(scenario_name, p0, vcf, pops, generation_time,
                                         n_pops, N_e, mu, sample_set=sample_set)

scenario.plot_data_2d()
plot_utils.save_fig(out_data, tight_layout=True, show=test_mode)

scenario.plot_model_2d()
plot_utils.save_fig(out_model, tight_layout=True, show=test_mode)

scenario.plot_residuals_2d()
plot_utils.save_fig(out_residuals, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import numpy as np
from matplotlib import pyplot as plt

import sfs_utils
import dadi

try:
    input = snakemake.input[0]
    n_proj = snakemake.params.n_proj
    unfolded = snakemake.params.unfolded
    log_scale = snakemake.params.log_scale
    out = snakemake.output[0]
except NameError:
    # testing
    input = "output_old/est-sfs/sfs/1.txt"
    n_proj = 100
    unfolded = False
    log_scale = True
    out = "scratch/pendula.pendula_one_pop.svg"

fs = dadi.Spectrum([int(np.round(float(n))) for n in open(input, 'r').readline().split(',')])

if not unfolded:
    fs = fs.fold()
    n_proj *= 2

fs = fs.project([n_proj])

sfs_utils.plot_sfs([fs], log_scale=log_scale)

plt.savefig(out)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import numpy as np

import utils

try:
    vcf_synonymous = snakemake.input.synonymous
    vcf_nonsynonymous = snakemake.input.nonsynonymous
    out = snakemake.output[0]
except NameError:
    # testing
    # both inputs point to the same annotated test file here
    vcf_synonymous = 'testing/snps1.ancestral.vep.annotated.vcf.gz'
    vcf_nonsynonymous = 'testing/snps1.ancestral.vep.annotated.vcf.gz'
    out = 'scratch/synonymy.svg'

# the labels
labels_degeneracy = [f"{n}-fold" for n in [2, 4] + [0, 2]]
labels_synonymy = ['synonymous', 'non-synonymous']

# the degeneracy values for each synonymy class
synonymous = [utils.get_n_degenerate(vcf_synonymous, n) for n in [2, 4]]
nonsynonymous = [utils.get_n_degenerate(vcf_nonsynonymous, n) for n in [0, 2]]

# the nested array containing the values
values = [synonymous, nonsynonymous]

# prepare colors for inner and outer disk
cmap = plt.get_cmap("tab20c")
outer_colors = cmap(np.arange(2) * 5)
inner_colors = cmap([1, 2, 6, 7])

fig, ax = plt.subplots()


# get wedge labels of outer disk
def autopict(percentage):
    n_sites = int(np.round(percentage * sum(utils.flatten(values)) / 100))
    return str(n_sites) + '\n' + str(np.round(percentage, 1)) + '%'


# plot outer disk showing the synonymy classes
ax.pie([sum(s) for s in values], startangle=90, labels=labels_synonymy, autopct=autopict,
       wedgeprops=dict(edgecolor='w', linewidth=1, antialiased=True, width=0.4),
       colors=outer_colors, radius=1, pctdistance=0.8)

# plot inner disk showing the degeneracy classes
ax.pie(utils.flatten(values), startangle=90, labels=labels_degeneracy,
       wedgeprops=dict(edgecolor='w', linewidth=1, antialiased=True, width=0.2),
       colors=inner_colors, radius=0.6, labeldistance=0.25)

ax.axis('equal')

utils.save_fig(out, tight_layout=True)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import plot_utils

try:
    test_mode = False
    input = snakemake.input[0]
    out = snakemake.output[0]
    log_scale = snakemake.params.log_scale
except NameError:
    # testing
    test_mode = True
    input = 'output/default/stats/pendula/all/D.Tajima.D'
    out = 'scratch/tajimas_d.svg'
    log_scale = False

plot_utils.plot_hist_basic_stat(input, colname='TajimaD', xlabel="Tajima's D",
                                weights='N_SNPS', ylabel="n windows",
                                color=plot_utils.get_color('tab20b', 10), log_scale=log_scale)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import numpy as np
import pandas as pd

import demographic_scenarios
import plot_utils

try:
    test_mode = False
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    result = snakemake.input.result
    scenario_name = snakemake.params.scenario
    sample_set = snakemake.params.sample_set
    n_pops = snakemake.params.n_pops
    out = snakemake.output[0]
    plot_opts = {}
    generation_time = snakemake.config['generation_time']
    mu = snakemake.config['mu']
    N_e = snakemake.config['N_e'][sample_set]
except NameError:
    # testing
    test_mode = True
    sample_set = 'pendula'
    vcf = f"output/tetraploid/snps/{sample_set}/synonymous/snps.vcf.gz"
    pops = f"output/tetraploid/sample_sets/subpopulations/{sample_set}/1_pops.txt"
    scenario_name = 'one_pop_size_change'
    result = f"output/tetraploid/dadi/{scenario_name}/{sample_set}/synonymous/dadi.csv"
    n_pops = 1
    out = "scratch/trajectory.png"
    plot_opts = {'scaling': np.array([1, 1 / 2]) * 0.8}
    generation_time = 20
    mu = 1e-9
    N_e = 675000

# load params from file
params = pd.read_csv(result, index_col=None, header=0).iloc[0]

# determine initial value
param_names = demographic_scenarios.get_class(scenario_name).params
p0 = [params[name] for name in param_names]

scenario = demographic_scenarios.restore(scenario_name, p0, vcf, pops, generation_time, n_pops, N_e, mu)

# plot trajectory
scenario.plot_trajectory(opts=plot_opts)

plot_utils.save_fig(out, tight_layout=True, show=test_mode)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import matplotlib.pyplot as plt
import umap

import pca_utils
import utils

try:
    sample_set = snakemake.params['sample_set']
    sample_class = snakemake.params['sample_class']
    flag = snakemake.params['flag']
    out = snakemake.output[0]
    subsample_sets = snakemake.params.get('subsample_sets', [])
    min_dist = snakemake.params.get('min_dist', 1.5)
    spread = snakemake.params.get('spread', 2)
    tight_layout = snakemake.params.get('tight_layout', False)
except NameError:
    # testing
    sample_set = 'birch'
    sample_class = 'default'
    flag = 'biallelic'
    out = "scratch/pca.svg"
    subsample_sets = ['pubescens', 'pendula']
    min_dist = 1.5
    spread = 2
    tight_layout = False

# fetch genotypes
genotypes, samples, bim, fam = utils.get_genotypes(sample_set, flag, sample_class)

# create umap object
reducer = umap.UMAP(n_neighbors=samples.shape[0] - 1, metric='euclidean',
                    min_dist=min_dist, random_state=4, spread=spread)

# fit embedding using umap
embedding = reducer.fit_transform(genotypes)

# plot and save embedding
plt.scatter(x=embedding[:, 0], y=embedding[:, 1], c=samples.latitude, s=20)
plt.gca().set_aspect('equal', 'datalim')

# draw ellipses around subsample sets if specified
for i, subsample_set in enumerate(subsample_sets):
    pca_utils.encircle_subset(sample_set, subsample_set, sample_class, embedding, i, buf=10)

plt.xlabel('x')
plt.ylabel('y')

if len(subsample_sets):
    plt.legend()

plt.colorbar().set_label('latitude')

if tight_layout:
    plt.savefig(out, bbox_inches='tight', pad_inches=0)
else:
    plt.savefig(out)

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

ref = snakemake.input.ref
vcf = snakemake.input.vcf
gff = snakemake.input.gff
out = snakemake.output[0]

shell(f"vep --gff {gff} --fasta {ref} -i {vcf} --vcf --species betula_pendula --allow_non_variant \
    -o {out} --fields 'Consequence,Codons' --compress_output bgzip")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import os

import utils
from pyvcf_utils import *

try:
    vcf_file = snakemake.input.vcf
    ingroups_file = snakemake.input.ingroups
    outgroups_file = snakemake.input.outgroups
    out_data = snakemake.output.data
    n_max_subsamples = snakemake.config['est_sfs_n_samples']
except NameError:
    # testing
    vcf_file = "testing/snps/pendula/synonymous/snps.vcf.gz"
    ingroups_file = "output/default/sample_sets/ingroups.args"
    outgroups_file = "output/default/sample_sets/outgroups.args"
    out_data = "scratch/pendula.synonymous.data.txt"
    n_max_subsamples = 50

# load ingroup and outgroup samples
ingroups = utils.get_samples_from_file(ingroups_file)
outgroups = utils.get_samples_from_file(outgroups_file)

# seed rng
np.random.seed(seed=0)

# number of subsamples to sample from haplotypes
n_subsamples = min(n_max_subsamples, len(ingroups))

# write to data file
with open(out_data, 'w') as f:
    i = 0
    for record in vcf.Reader(filename=vcf_file):

        # only do inference for polymorphic sites
        if not record.is_monomorphic:
            ingroup = count(subsample(haplotypes(restrict(record.samples, outgroups, True)), n_subsamples))
            outgroup1 = count(subsample(haplotypes(restrict(record.samples, [outgroups[0]])), 1))
            outgroup2 = count(subsample(haplotypes(restrict(record.samples, [outgroups[1]])), 1))

            f.write(' '.join([base_dict_to_string(r) for r in [ingroup, outgroup1, outgroup2]]) + os.linesep)

        i += 1
        if i % 1000 == 0:
            print(f"Processed sites: {i}", flush=True)

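Each line written above follows the EST-SFS input convention: one comma-separated A,C,G,T count vector for the subsampled ingroup, followed by one vector per outgroup. A purely hypothetical line for a site where 48 of 50 ingroup haplotypes carry A and the remaining two carry T, outgroup 1 carries A and outgroup 2 carries G, would read:

48,0,0,2 1,0,0,0 0,0,1,0
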
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import os

import sfs_utils
import utils

try:
    synonymous = snakemake.input.synonymous
    nonsynonymous = snakemake.input.nonsynonymous
    vcf = snakemake.input.vcf
    pops = snakemake.input.pops
    n_pops = snakemake.params.n_pops
    out = snakemake.output[0]
except NameError:
    # testing
    synonymous = "output/default/snps/pendula/synonymous/snps.vcf.gz"
    nonsynonymous = "output/default/snps/pendula/nonsynonymous/snps.vcf.gz"
    vcf = "output/default/snps/pendula/all/snps.vcf.gz"
    pops = "output/default/sample_sets/subpopulations/pendula/1_pops.txt"
    n_pops = 1
    out = "scratch/pendula.1_pops.txt"

# number of two-fold degenerate sites
n_2fold = utils.get_n_degenerate(vcf, 2)

# polyDFE fails to converge for large sample sizes
# n = 20 was used in their own simulations
n_proj = 20

freqs_neut = sfs_utils.get_spectrum(synonymous, pops, n_proj)
freqs_neut = freqs_neut[~freqs_neut.mask].tolist()
n_sites_neut = utils.get_n_degenerate(vcf, 4) + 1 / 3 * n_2fold

freqs_sel = sfs_utils.get_spectrum(nonsynonymous, pops, n_proj)
freqs_sel = freqs_sel[~freqs_sel.mask].tolist()
n_sites_sel = utils.get_n_degenerate(vcf, 0) + 2 / 3 * n_2fold


def write_line(file, arr, sep):
    file.write(sep.join(map(str, arr)) + os.linesep)


m_neut = 1
m_sel = 1

with open(out, 'w') as f:
    write_line(f, [m_neut, m_sel, n_proj], ' ')
    write_line(f, freqs_neut + [n_sites_neut], '\t')
    write_line(f, freqs_sel + [n_sites_sel], '\t')

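The resulting file thus follows the polyDFE input layout assumed here: a header giving the number of neutral and selected fragments plus the sample size, followed by one tab-separated line per fragment holding the SFS counts and the total number of sites. A toy illustration with made-up counts and a sample size of 4 (the script itself uses n_proj = 20):

1 1 4
812	301	120	152390.3
1502	610	257	304780.7
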
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts

java_opts = get_java_opts(snakemake)

ref = snakemake.input.ref
vcf = snakemake.input.vcf
out = snakemake.output[0]

# the filters are based on the GATK recommendations for hard filtering
# https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants
shell(f"gatk --java-options '{java_opts}' VariantFiltration -R {ref} -V {vcf} -O {out} \
    --filter-name 'filter1' \
    --filter-expression 'QD < 2.0' \
    --filter-name 'filter2' \
    --filter-expression 'SOR > 3.0' \
    --filter-name 'filter3' \
    --filter-expression 'MQ < 40.0' \
    --filter-name 'filter4' \
    --filter-expression 'MQRankSum < -12.5' \
    --filter-name 'filter5' \
    --filter-expression 'ReadPosRankSum < -8.0'")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from collections import Counter

import vcf
from vcf.parser import _Info as Info

import pyvcf_utils
import utils

try:
    vcf_file = snakemake.input.vcf
    probs = snakemake.input.probs
    data = snakemake.input.data
    samples_file = snakemake.input.samples
    out = snakemake.output.vcf
except NameError:
    # testing
    vcf_file = "output/default/snps/raw/intervals/snps1.vcf.gz"
    probs = "output/default/est-sfs/probs/1.txt"
    data = "output/default/est-sfs/data/1.txt"
    samples_file = "output/default/sample_sets/ingroups.args"
    out = "scratch/1.ancestral.vcf"

samples = utils.get_samples_from_file(samples_file)

vcf_reader = vcf.Reader(filename=vcf_file)

# add AA info field to header
vcf_reader.infos['AA'] = Info('AA', 1, 'String', 'Ancestral Allele', None, None)
vcf_reader.infos['EST_SFS_probs'] = Info('EST_SFS_probs', 1, 'String', 'EST-SFS probabilities', None, None)
vcf_reader.infos['EST_SFS_input'] = Info('EST_SFS_input', 1, 'String', 'EST-SFS input', None, None)

probs_reader = open(probs, 'r')
data_reader = open(data, 'r')
writer = vcf.Writer(open(out, 'w'), vcf_reader)

# annotate the records and write them to the output file
i = 0
for record in vcf_reader:

    # simply assign the ancestral allele to be the reference allele
    # if the record is monomorphic
    if record.is_monomorphic:
        record.INFO['AA'] = record.REF
        record.INFO['EST_SFS_probs'] = None
        record.INFO['EST_SFS_input'] = None
    else:
        line = probs_reader.readline()

        # read est-sfs input data
        est_input = data_reader.readline()

        # get probability of major allele
        prob_major_allele = float(line.split(' ')[2])

        # restrict haplotypes to the non-outgroup birch samples
        hps = pyvcf_utils.haplotypes(pyvcf_utils.restrict(record.samples, samples))

        major_allele = '.'
        if hps:
            # determine major allele
            major_allele = Counter(hps).most_common()[0][0]

            # take the major allele to be the ancestral allele
            # if its probability is greater than or equal to 0.5
            if prob_major_allele >= 0.5:
                ancestral_allele = major_allele
            else:
                # there are exactly two alleles
                for allele in record.alleles:
                    if str(allele) != major_allele:
                        ancestral_allele = str(allele)
        else:
            ancestral_allele = '.'

        # add ancestral allele annotation to record
        record.INFO['AA'] = ancestral_allele

        # add additional information from est-sfs
        record.INFO['EST_SFS_probs'] = line.replace('\n', '').replace(' ', '|')
        record.INFO['EST_SFS_input'] = est_input.replace('\n', '').replace(' ', '|').replace(',', ':')

        if not (major_allele == ancestral_allele == record.REF):
            print(f"site: {record.CHROM}:{record.POS}, ancestral allele: {ancestral_allele}, "
                  f"major allele: {major_allele}, reference: {record.REF}, "
                  f"prob major allele: {prob_major_allele}", flush=True)

    writer.write_record(record)

    i += 1
    if i % 1000 == 0:
        print(f"Processed sites: {i}", flush=True)

# raise error if there are lines left from the EST-SFS output
if next(probs_reader, None) is not None:
    raise AssertionError("Number of sites don't match.")

probs_reader.close()

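The polarization rule implemented above can be summarized schematically (not part of the pipeline itself):

# P(major allele ancestral) >= 0.5  ->  AA = major allele
# P(major allele ancestral) <  0.5  ->  AA = the remaining (minor) allele
# no ingroup haplotypes             ->  AA = '.'
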
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

try:
    input = snakemake.input.bed
    out = snakemake.output.bed
except NameError:
    # testing
    input = "output/default/snps/pendula/biallelic/snps.bed"
    out = "output/default/snps/pendula/biallelic/snps.recoded_ids.bed"

input_prefix = input.replace('.bed', '')
out_prefix = out.replace('.bed', '')

shell(f"plink2 --bfile {input_prefix} --set-all-var-ids @_# --make-bed --out {out_prefix} --allow-extra-chr")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import os

from snakemake.shell import shell

try:
    bed = snakemake.input.bed
    out_log = snakemake.output.log
    K = snakemake.params.K
except NameError:
    # testing
    bed = "output/default/snps/pendula/biallelic/snps.admixture.bed"
    out_log = "scratch/admixture"
    K = 4

# We can't specify the path of the output files,
# so we change the working directory to the output
# path of the given log file.
bed_abs_path = os.path.abspath(bed)
out_path = os.path.dirname(out_log)
log = os.path.basename(out_log)

shell(f"""
    mkdir -p {out_path}
    cd {out_path}
    admixture --cv {bed_abs_path} {K} | tee {log}
""")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

sfs = snakemake.input.sfs
init = snakemake.input.init
out = snakemake.output[0]
id = snakemake.params.id
m = snakemake.params.m

shell(f"polyDFE -d {sfs} -m {m} -i {init} {id} -v 100 > {out}")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

try:
    probs_in = snakemake.input.probs
    data_in = snakemake.input.data
    probs_out = snakemake.output.probs
except NameError:
    # testing
    probs_in = "output/default/est-sfs/probs/all.txt"
    data_in = [f"output/default/est-sfs/data/{n}.txt" for n in range(1, 11)]
    probs_out = [f"scratch/est-sfs/probs/{n}.txt" for n in range(1, 11)]

with open(probs_in, 'r') as probs_all:

    # iterate over files
    for data_file, probs_file in zip(data_in, probs_out):

        # obtain number of lines to read and write
        line_count = sum(1 for line in open(data_file))

        with open(probs_file, 'w') as out:

            # iterate over lines
            for i in range(line_count):
                line = probs_all.readline()

                # skip header lines
                while line.startswith('0'):
                    line = probs_all.readline()

                # write to out file
                out.write(line)

    # raise error if there are lines left from the EST-SFS output
    if next(probs_all, None) is not None:
        raise AssertionError("Unassigned sites left from EST-SFS output.")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

try:
    out = snakemake.output[0]
    tmp_dir = snakemake.resources.tmpdir
    max_sites = snakemake.config['est_sfs_max_sites']
except NameError:
    # testing
    out = "scratch/est-sfs"
    tmp_dir = "/tmp"
    max_sites = 100000

# set up est-sfs
# Note: the compilation only works on Linux
shell(f"""
    set -x
    work_dir=$(realpath .)
    cd {tmp_dir}

    wget https://sourceforge.net/projects/est-usfs/files/est-sfs-release-2.03.tar.gz/download -O est-sfs.tar.gz

    rm -rf est-sfs
    mkdir est-sfs
    tar -xvf est-sfs.tar.gz -C est-sfs --strip-components 1

    cd est-sfs
    sed -i 's/#define max_config 100000/#define max_config {max_sites}/g' est-sfs.c

    export C_INCLUDE_PATH=$CONDA_PREFIX/include
    make

    mv est-sfs "$work_dir/{out}"
""")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

import math

import numpy as np
import pandas as pd

try:
    test_mode = False
    results = snakemake.input
    labels = snakemake.params.labels
    out_pretty = snakemake.output.pretty
    out_tex = snakemake.output.tex
except NameError:
    # testing
    test_mode = True
    scenarios = [
        "split_no_migration",
        "split_unidirectional_migration_from_two_to_one",
        "split_unidirectional_migration_from_one_to_two",
        "split_symmetric_migration",
        "split_asymmetric_migration",
        "split_asymmetric_migration_one_pop_size_change",
        "split_no_migration_since_ice_age",
        "split_unidirectional_migration_from_two_to_one_since_ice_age",
        "split_unidirectional_migration_from_one_to_two_since_ice_age",
        "split_symmetric_migration_since_ice_age",
        "split_asymmetric_migration_since_ice_age",
        "split_asymmetric_migration_one_pop_size_change_since_ice_age"
    ]
    sample_set = "pendula_pubescens"
    results = {scenario: f"output/tetraploid/dadi/{scenario}/{sample_set}/synonymous/dadi_ci.csv"
               for scenario in scenarios}
    labels = [key.replace('_', ' ') for key in results.keys()]
    out_pretty = 'scratch/dadi_tabulated.out'
    out_tex = 'scratch/dadi_tabulated.tex'


# format the given float string
def format_number(x):
    x = float(x)

    if x == 0:
        return str(0)

    magnitude = math.floor(math.log(abs(float(x)), 10))

    if magnitude > 0:
        return str(int(x))
    else:
        return str(np.round(x, abs(magnitude) + 1))


def parse_row(file, delimiter):
    data = pd.read_csv(file, index_col=0, header=0)

    mean = data.loc['mean_bs']
    std = data.loc['std_bs']

    for key in mean.index:
        if mean[key]:
            mean[key] = format_number(mean[key])

            if float(std[key]) != 0:
                mean[key] += delimiter + format_number(std[key])

    return mean


def prepare_table(delimiter):
    data = pd.DataFrame()

    for i, (key, file) in enumerate(results.items()):
        row = parse_row(file, delimiter)
        row.name = labels[i]
        data = data.append(row)

    # remove unnamed columns
    data = data.loc[:, ~data.columns.str.contains('unnamed')]
    data = data.loc[:, ~(data.columns == 'theta')]

    return data.fillna('')


open(out_pretty, 'w').write(prepare_table(" ± ").to_markdown())
open(out_tex, 'w').write(prepare_table(r"\,$\pm$\,").to_latex(escape=False))

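For illustration, `format_number` truncates values with a magnitude of 10 or more to their integer part and rounds smaller values to one digit beyond their leading order of magnitude (hypothetical calls):

format_number("1234.5")   # -> '1234'
format_number("0.01234")  # -> '0.012'
format_number("0")        # -> '0'
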
__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

input = ' '.join(snakemake.input)
output = ' '.join(snakemake.output)
adapter = "$CONDA_PREFIX/share/trimmomatic-0.36-5/adapters/TruSeq3-PE-2.fa"
threads = snakemake.threads

shell(f"trimmomatic PE -threads {threads} {input} {output} \
    ILLUMINACLIP:{adapter}:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

snakefile = snakemake.params.snakefile
out = snakemake.output[0]

shell(f"snakemake --snakefile {snakefile} --rulegraph | dot -Tsvg > {out}")

__author__ = "Janek Sendrowski"
__contact__ = "j.sendrowski18@gmail.com"
__date__ = "2022-05-31"

from snakemake.shell import shell

input = snakemake.input[0]
out = snakemake.output[0]

shell(f"dot {input} -Tsvg > {out}")

207 208 | script: "scripts/visualize_complete_rulegraph.py" |
216 217 | script: "scripts/visualize_dotfile.py" |
227 228 | script: "scripts/visualize_complete_rulegraph.py" |
URL: https://github.com/Sendrowski/BirchesScandinavia
Name: birchesscandinavia
Version: 1
Copyright: Public Domain
License: None