kGWASflow is a Snakemake workflow for performing k-mers-based GWAS.
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
A modular, flexible, and reproducible Snakemake workflow to perform k-mers-based GWAS.
Table of Contents
Summary
kGWASflow is a Snakemake pipeline developed for performing k-mers-based genome-wide association study (GWAS) based on the method developed by Voichek et al. (2020) . It performs several pre-GWAS analyses, including read trimming, quality control, and k-mer counting. It implements the kmersGWAS method into an easy to use and accessible workflow. The pipeline also contains post-GWAS analyses, such as mapping k-mers to a reference genome, finding and mapping the source reads of k-mers, assembling source reads into contigs, and mapping them to a reference genome. kGWASflow is also highly customizable and offers users multiple options to choose from depending on their needs.
More information and explanations on how to install, configure and run kGWASflow are provided in the kGWASflow Wiki .
kGWASflow preprint is out on bioRxiv:
- Adnan Kivanc Corut, Jason G. Wallace. kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS (2023). bioRxiv. https://doi.org/10.1101/2023.07.10.548365
Installation
Installing via Bioconda
‼️This is the preferred method .‼️
In order to use this workflow, you need
conda
to be installed (to install
conda
, please follow the instructions
here
).
# Create a new conda environment with kgwasflow
# and its dependencies
conda create -c bioconda -n kgwasflow kgwasflow
# Activate kGWASflow conda environment
conda activate kgwasflow
# test kGWASflow
kgwasflow --help
Installing via GitHub (Alternative Method)
Alternatively, kGWASflow can be installed by cloning the GitHub repository .
# Clone this repository to your local machin
git clone https://github.com/akcorut/kGWASflow.git
# Change into the kGWASflow directory
cd kGWASflow
After cloning the GitHub repo, you can install snakemake and the other dependencies using
conda
as below:
# This assumes conda is installed
conda env create -f environment.yaml
# Activate kGWASflow conda environment
conda activate kGWASflow
Finally, you can install kGWASflow using the setup script as below:
# Install kgwasflow
python setup.py install
# Test kgwasflow
kgwasflow --help
Other Options:
The other options on how to deploy this workflow can be found in the Snakemake Workflow Catalog .
Configuration
Initializing kGWASflow
To configure kGWASflow, you first need to initialize a new kGWASflow working directory by following the below steps:
# Activating the conda environment
conda activate kgwasflow
# Initializing a new kgwasflow working dir
kgwasflow init --work-dir path/to/your/work_dir
or
# Activating the conda environment
conda activate kgwasflow
# Change into your preferred working directory
cd path/to/your/work_dir
# Initializing a new kgwasflow working dir
kgwasflow init
kGWASflow Working Directory Structure
This command will initialize a new kGWASflow working directory with the default configuration files. Below is the directory structure of the working directory:
path/to/your/work_dir
├── config
│ ├── config.yaml
│ ├── phenos.tsv
│ └── samples.tsv
└── test
kGWASflow Configuration Files
Below are the configuration files generated by
kgwasflow init
command:
-
config/config.yaml
is a YAML file containing the workflow configuration. -
config/samples.tsv
is a TSV file containing the sample information. -
config/phenos.tsv
is a TSV file containing the phenotype information.
For more information about each configuration file, please see kGWASflow Wiki .
Usage
After initializing (
kgwasflow init
)
step
and modifying the configuration files, kGWASflow can be run as below:
# Activating the conda environment
conda activate kgwasflow
# Change into your preferred working directory
cd path/to/your/work_dir
# Run kgwasflow
kgwasflow run -t 16
Below are some of the run examples of kGWASflow:
Run examples:
1. Run kGWASflow with the default config file, default arguments and 16 threads:
kgwasflow run -t 16 --snake-default
2. Run kGWASflow with a custom config file and default settings:
kgwasflow run -t 16 -c path/to/custom_config.yaml
3. Run kGWASflow with user defined working directory:
kgwasflow run -t 16 --work-dir path/to/work_dir
4. Run kGWASflow in dryrun mode to see what tasks would be executed:
kgwasflow run -t 16 -n
5. Run kGWASflow using mamba as the conda frontend:
kgwasflow run -t 16 --conda-frontend mamba
6. Run kGWASflow and generate an HTML report:
kgwasflow run -t 16 --generate-report
Information about how to use kGWASflow with Snakemake commands can be found in the Snakemake Workflow Catalog .
Testing
In order to test kGWASflow using an E.coli dataset ( Earle et al. 2016 , Rahman et al. 2018 )
# Activating the conda environment
conda activate kgwasflow
After activating kGWASflow, you can perform a test run as axplained below:
Test examples:
1. Run the kGWASflow test in dryrun mode to see what tasks would be executed:
kgwasflow test -t 16 -n
2. Run the kGWASflow test using the test config file with 16 threads:
kgwasflow test -t 16
3. Run the kGWASflow test and define the test working directory:
kgwasflow test -t 16 --work-dir path/to/test_work_dir
Authors
kGWASflow was developed by Adnan Kivanc Corut .
Issues
For Issues: https://github.com/akcorut/kGWASflow/issues
Contributions to the development of kGWASflow are welcome! Create Pull Requests to fix bugs or recommend new features!
Maintainers
Citation
‼️ kGWASflow preprint is out now on bioRxiv: https://doi.org/10.1101/2023.07.10.548365 ‼️
If you use kGWASflow in your research, please cite using the DOI: https://doi.org/10.1101/2023.07.10.548365 and the original method paper by Voichek et al. (2020) :
-
Adnan Kivanc Corut, Jason G. Wallace. kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS (2023). bioRxiv. https://doi.org/10.1101/2023.07.10.548365
-
Voichek, Y., Weigel, D. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 52, 534–540 (2020). https://doi.org/10.1038/s41588-020-0612-7
License
kGWASflow is licensed under the MIT license.
Code Snippets
19 20 21 22 | shell: """ minimap2 -t {threads} -a {input.ref_gen} {input.contigs} > {output} 2> {log} """ |
39 40 41 42 | shell: """ awk -v s={params.min_mapping_score} '$5 > s || $1 ~ /^@/' {input} > {output} 2> {log} """ |
61 62 63 64 | shell: """ samtools view -@ {threads} -Sbh {input} > {output} 2> {log} """ |
83 84 85 86 | shell: """ samtools sort -@ {threads} {input} -o {output} 2> {log} """ |
105 106 107 108 | shell: """ samtools index -@ {threads} {input} 2> {log} """ |
128 129 130 131 | shell: """ bedtools bamtobed -i {input.bam} > {output} 2> {log} """ |
146 147 148 149 | shell: """ touch {output} 2> {log} """ |
23 24 25 26 27 | shell: """ bowtie -p {threads} -a --best --strata {params.extra} \ -x {params.index} -f {input.kmers_list} --sam {output} 2> {log} """ |
48 49 50 51 52 | shell: """ bowtie2 -p {threads} {params.extra} \ -x {params.index} -f {input.kmers_list} -S {output} 2> {log} """ |
71 72 73 74 | shell: """ samtools view -@ {threads} -Sbh {input} > {output} 2> {log} """ |
93 94 95 96 | shell: """ samtools sort -@ {threads} {input} -o {output} 2> {log} """ |
115 116 117 118 | shell: """ samtools index -@ {threads} {input} 2> {log} """ |
150 151 | script: "../scripts/plot_manhattan.py" |
166 167 168 169 | shell: """ touch {output} 2> {log} """ |
16 17 18 19 20 21 22 23 24 | shell: """ echo Merging r1 files: echo "$(ls {input.dir}/*_reads_with_kmers_R1.fastq)" cat {input.dir}/*_reads_with_kmers_R1.fastq > {output.merged_r1} 2> {log} echo Merging r2 files: echo "$(ls {input.dir}/*_reads_with_kmers_R2.fastq)" cat {input.dir}/*_reads_with_kmers_R2.fastq > {output.merged_r2} 2> {log} """ |
44 45 46 47 48 | shell: """ seqkit sort -n {input.r1} > {output.sorted_r1} 2> {log} seqkit sort -n {input.r2} > {output.sorted_r2} 2> {log} """ |
73 74 75 76 | shell: """ bowtie2 -p {threads} --very-sensitive-local {params.extra} -x {params.index} -1 {input.r1} -2 {input.r2} -S {output.out_sam} 2> {log} """ |
93 94 95 96 | shell: """ awk -v s={params.min_mapping_score} '$5 > s || $1 ~ /^@/' {input.sam} > {output.filtered_sam} 2> {log} """ |
115 116 117 118 | shell: """ samtools view -@ {threads} -Sbh {input.sam} > {output.bam} 2> {log} """ |
138 139 140 141 | shell: """ samtools sort -@ {threads} {input.bam} -o {output.sorted_bam} 2> {log} """ |
162 163 164 165 | shell: """ samtools index -@ {threads} {input.bam} 2> {log} """ |
187 188 189 190 | shell: """ bedtools bamtobed -i {input.bam} > {output.bed} 2> {log} """ |
209 210 211 212 | shell: """ bedtools merge -i {input.bed} -s -c 6 -o distinct > {output.merged_bed} 2> {log} """ |
231 232 233 234 235 | shell: """ sort -k1,1 -k2,2n {input.bed} | bgzip > {input.bed}.gz 2> {log} tabix -p bed {input.bed}.gz 2> {log} """ |
250 251 252 253 | shell: """ touch {output} 2> {log} """ |
22 23 24 25 | shell: """ spades.py --careful --only-assembler -t {threads} {params.extra} --pe1-1 {input.R1} --pe1-2 {input.R2} -o {params.out_dir} 2> {log} """ |
22 23 | wrapper: "v1.25.0/bio/blast/makeblastdb" |
38 39 40 41 | shell: """ seqkit head -n 1 {input} > {output} 2> {log} """ |
68 69 | wrapper: "v1.25.0/bio/blast/blastn" |
82 83 84 85 | shell: """ touch {output} """ |
17 18 | shell: "wget https://github.com/voichek/kmersGWAS/releases/download/{params.version}/{params.prefix}.zip -O {output} 2> {log}" |
38 39 | shell: "unzip {input} -d {params.dir} 2> {log}" |
21 22 | script: "../scripts/generate_kmers_list_paths.py" |
47 48 49 50 51 52 53 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/list_kmers_found_in_multiple_samples -l {input.kmers_list} -k {params.kmer_len} \ --mac {params.mac} -p {params.min_app} -o {output} 2> {log} """ |
75 76 | script: "../scripts/plot_kmer_allele_counts.py" |
30 31 32 33 34 35 36 37 | shell: """ echo Working on fastq files: {input.fastqs} echo Symlink -fastq1: {input.fastqs[0]} to {output.r1} ln -rs {input.fastqs[0]} {output.r1} 2> {log.log1} echo Symlink -fastq2: {input.fastqs[1]} to {output.r2} ln -rs {input.fastqs[1]} {output.r2} 2> {log.log2} """ |
62 63 | script: "../scripts/generate_input_lists.py" |
84 85 | script: "../scripts/generate_input_lists.py" |
114 115 116 117 118 119 120 | shell: """ kmc -t{threads} {params.extra} -v -k{params.kmer_len} -ci{params.count_threshold} \ @{input} {params.outfile_prefix} {params.outdir_prefix} \ 1> {params.outdir_prefix}/kmc_canon.1 2> {params.outdir_prefix}/kmc_canon.2 \ > {log} """ |
148 149 150 151 152 153 154 | shell: """ kmc -t{threads} -v -k{params.kmer_len} -ci0 -b \ @{input} {params.outfile_prefix} {params.outdir_prefix} \ 1> {params.outdir_prefix}/kmc_all.1 2> {params.outdir_prefix}/kmc_all.2 \ > {log} """ |
186 187 188 189 190 191 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/kmers_add_strand_information -c {params.prefix_kmc_canon} -n {params.prefix_kmc_all} -k {params.kmer_len} -o {output.strand} > {log} """ |
258 259 | script: "../scripts/plot_kmer_stats.py" |
18 19 | script: "../scripts/fetch_significant_kmers.py" |
12 13 14 15 16 | shell: """ git clone --recurse-submodules https://github.com/voichek/fetch_reads_with_kmers.git 2> {log.log1} mv fetch_reads_with_kmers scripts/external 2> {log.log2} """ |
27 28 29 30 31 | shell: """ cd {params.prefix} make """ |
65 66 | script: "../scripts/fetch_source_reads.py" |
96 97 | script: "../scripts/fetch_source_reads.py" |
113 114 115 116 | shell: """ touch {output} 2> {log} """ |
21 22 23 24 25 26 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/filter_kmers -t {params.prefix} -k {input.lists} -o {output} 2> {log} """ |
41 42 43 44 | shell: """ touch {output} 2> {log} """ |
24 25 26 27 28 29 30 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/emma_kinship_kmers -t {params.prefix} -k {params.kmer_len} \ --maf {params.maf} > {output} 2> {log} """ |
52 53 54 55 56 57 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/emma_kinship {params.prefix} > {output} 2> {log} """ |
23 24 25 26 27 28 29 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/build_kmers_table -l {input.list_paths} -k {params.kmer_len} \ -a {input.kmers_to_use} -o {params.prefix} 2> {log} """ |
57 58 59 60 61 62 63 64 65 66 67 68 69 70 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib {input.kmersGWAS_bin}/kmers_table_to_bed \ -t {params.input_prefix} \ -p {input.phenotype} \ -o {params.output_prefix} \ -k {params.kmer_len} \ --maf {params.maf} --mac {params.mac} \ -b {params.max_num_var} \ {params.only_unique} \ >>{log} 2>&1 """ |
25 26 | wrapper: "v1.25.0/bio/igv-reports" |
41 42 43 44 | shell: """ touch {output} 2> {log} """ |
74 75 | wrapper: "v1.25.0/bio/igv-reports" |
90 91 92 93 | shell: """ touch {output} 2> {log} """ |
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | shell: """ export LD_LIBRARY_PATH=$CONDA_PREFIX/lib rm -r {params.out_prefix} python2 {input.kmers_gwas_py} {params.extra} \ --min_data_points {params.min_data_points} \ --pheno {input.phenotype} \ --kmers_table {params.kmers_tab_prefix} \ --kmers_number {params.kmers_number} \ --permutations {params.n_permutations} \ --maf {params.maf} --mac {params.mac} \ -l {params.kmer_len} -p {threads} \ --outdir {params.out_prefix} >>{log} 2>&1 """ |
17 18 | wrapper: "v1.12.2/bio/fastqc" |
39 40 | wrapper: "v1.12.2/bio/multiqc" |
15 16 17 18 | shell: """ ln -rs {input.reference} {output} 2> {log} """ |
37 38 39 40 | shell: """ samtools faidx {input.reference} 2> {log} """ |
62 63 | wrapper: "v1.25.0/bio/bowtie2/build" |
90 91 | shell: "bowtie-build --threads {threads} {input.reference} {params.prefix} 2> {log}" |
108 109 | wrapper: "v1.23.5/bio/sra-tools/fasterq-dump" |
120 121 | wrapper: "v1.23.5/bio/sra-tools/fasterq-dump" |
37 38 | script: "../scripts/generate_kmers_gwas_summary.py" |
21 22 | wrapper: "v1.18.3/bio/cutadapt/pe" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | import csv import os, glob, shutil import pandas as pd import numpy as np import sys # Logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f ## Set float presicion for the p-values pd.set_option("display.precision", 8) # Input directory path for kmersGWAS results res_dir_path = os.path.dirname(snakemake.params.in_prefix) ## Set the threshold for k-mers GWAS results (%5 or %10 family-wise error rate) ## 5 is the default threshold = snakemake.params.threshold # If threshold is not 5 or 10, print an error message and exit if threshold != 5 and threshold != 10: print("ERROR: threshold value must be 5 or 10. \nPlease enter a valid threshold (5 or 10). \nExiting...") sys.exit(1) # Define the output directory outdir = snakemake.output.dir print("Output directory: " + outdir) # Define a function to fetch significant k-mers def fetch_kmers(input_dir, pheno, out_dir, threshold): """ Fetch significant k-mers for a given phenotype from the k-mers GWAS results. """ kmers_pass_threshold = pd.read_csv(input_dir + '/' + pheno + '/kmers/pass_threshold_{threshold}per'. format(threshold=threshold), sep="\t") # read the k-mers that pass the threshold if not os.path.exists(out_dir): os.makedirs(out_dir) # create the output directory if it does not exist if kmers_pass_threshold.empty: # if no k-mers pass the 5% threshold: # create a file to indicate that no k-mers pass the 5% threshold with open(out_dir + '/{pheno}_NO_KMERS_PASS_5PER_THRESHOLD_FOUND'. format(pheno=pheno), 'a'): print("No k-mers pass the {threshold}% threshold for the phenotype {pheno}.". format(threshold=threshold, pheno=pheno)) # print a message to the user and exit the script pass else: # if k-mers pass thethreshold ## Get significant k-mers without k-mer number kmers_pass_threshold['rs']= kmers_pass_threshold['rs'].apply(lambda x: x.split('_')[0]) # print(kmers_pass_threshold) # For debugging only ## Write significant k-mers list in TEXT format print("Writing significant k-mers list in TEXT format...") kmers_pass_threshold['rs'].to_csv(out_dir + "/" + pheno + "_kmers_list.txt", index=False, sep="\t", header=None) # write the k-mers list in a text file print ("Done.") ## Make a new column with k-mer number and its p_value (e.g. kmer1_8.543334e-11) kmers_pass_threshold['kmer_name']= 'kmer' + (kmers_pass_threshold.index+1).astype(str) + '_' + kmers_pass_threshold['p_lrt'].astype(str) ### Write significant k-mers list in FASTA format (based on: https://www.biostars.org/p/271977/) ## Fasta file output fileOutput = open(out_dir + "/" + pheno + "_kmers_list.fa", "w") ## Seq count count = 1 ## Loop through each line in the input file print("Writing significant k-mers list in FASTA format...") for index, row in kmers_pass_threshold.iterrows(): #Output the header fileOutput.write(">" + row['kmer_name'] + "\n") fileOutput.write(row['rs'] + "\n") count = count + 1 print ("Done.") print("") #Close the input and output file fileOutput.close() # Get the list of phenotypes phenos = [os.path.basename(x) for x in glob.glob(res_dir_path + '/*')] print("Phenotypes: " + str(phenos)) # print the list of phenotypes for pheno in phenos: # for each phenotype print("Fetching significant k-mers for phenotype: " + pheno) print("Threshold: " + str(threshold) + "%") # Fetch significant k-mers for the current phenotype fetch_kmers(input_dir=res_dir_path, pheno=pheno, out_dir=outdir , threshold=threshold) print("Done!.") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 | import csv import sys import os, argparse, glob, shutil import pandas as pd import numpy as np import subprocess from pathlib import Path # logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f ################################################# ####### Load Snakemake Inputs and Params ######## ################################################# fq1_list= list(snakemake.input.r1) # list of fastq r1 files fq2_list= list(snakemake.input.r2) # list of fastq r2 files samp_list = list(snakemake.params.samples) # list of sample names lib_list = list(snakemake.params.library) # list of library names kmers_tab = snakemake.input.kmers_tab # kmers table kmers_list_fa = snakemake.input.kmers_list # kmers list fasta pheno = snakemake.params.pheno # phenotype name out_dir = snakemake.output.out_dir # output directory kmer_len = snakemake.params.kmer_len # kmer length fetch_reads = snakemake.input.fetch_reads # fetch_reads script ######################################### ####### Generate a Samples Table ######## ######################################### ## Make a samples info table w/ sample names, library names and corresponding fastq file paths reads_path_tab = {'sample_name': samp_list, 'library_name': lib_list, 'fq1': fq1_list, 'fq2': fq2_list} reads_path_tab = pd.DataFrame(data=reads_path_tab, dtype=object) print(reads_path_tab.head()) # reads_path_tab.to_csv(snakemake.params.out_prefix + "test.txt", index=False, sep="\t") ############################################################### ####### Identify the samples having the k-mers present ######## ############################################################### ## Read "kmers_table.txt" for a given phenotype kmers_table = pd.read_csv(kmers_tab, sep="\t") ## Filter out accessions/samples that doesn't have the kmers in them kmers_table_filt = kmers_table.loc[:, (kmers_table != 0).any(axis=0)] ## Get accesions with k-mers acc_list = list(kmers_table_filt.iloc[: , 1:].columns.values) print("Working on the phenotype: " + pheno) print("") print("################## Identifying accessions that has the k-mers... ##################") print("") print(str(len(acc_list)) + " accessions identified: " + ', \n'.join(acc_list)) print("") print("######################### Fetching reads with k-mers... #########################") print("") ## Filter samples table to get only samples with k-mers reads_path_tab_filt = reads_path_tab[reads_path_tab['sample_name'].isin(acc_list)] reads_path_tab_filt = reads_path_tab_filt.reset_index(drop=True) ## Check if output directory is exist. If not create one if not os.path.exists(out_dir): os.makedirs(out_dir) ## Filter the reads that have one of the k-mers in them for index, row in reads_path_tab_filt.iterrows(): if row["sample_name"] == row["library_name"]: ## Source code: https://github.com/voichek/fetch_reads_with_kmers subprocess.run(" {fetch_reads} {r1} {r2} {fa} {kmer_len} {out}/{acc}_reads_with_kmers".format( fetch_reads= fetch_reads, out=out_dir, r1= row["fq1"], r2= row["fq2"], acc= row["sample_name"], lib = row["library_name"], fa= kmers_list_fa, pheno= pheno, kmer_len= kmer_len), shell=True) # make sure that the output file is generated and if not exit the script with error if not os.path.exists("{out}/{acc}_reads_with_kmers_R1.fastq".format(out=out_dir, acc=row["sample_name"])): print("Error: {acc}_reads_with_kmers_R1.fastq is not generated!".format(acc=row["sample_name"])) sys.exit(1) if not os.path.exists("{out}/{acc}_reads_with_kmers_R2.fastq".format(out=out_dir, acc=row["sample_name"])): print("Error: {acc}_reads_with_kmers_R2.fastq is not generated!".format(acc=row["sample_name"])) sys.exit(1) else: ## Source code: https://github.com/voichek/fetch_reads_with_kmers subprocess.run(" {fetch_reads} {r1} {r2} {fa} {kmer_len} {out}/{acc}_{lib}_reads_with_kmers".format( fetch_reads= fetch_reads, out=out_dir, r1= row["fq1"], r2= row["fq2"], acc= row["sample_name"], lib = row["library_name"], fa= kmers_list_fa, pheno= pheno, kmer_len= kmer_len), shell=True) # make sure that the output file is generated and if not exit the script with error if not os.path.exists("{out}/{acc}_{lib}_reads_with_kmers_R1.fastq".format(out=out_dir, acc=row["sample_name"], lib=row["library_name"])): print("Error: {acc}_{lib}_reads_with_kmers_R1.fastq is not generated!".format(acc=row["sample_name"], lib=row["library_name"])) sys.exit(1) if not os.path.exists("{out}/{acc}_{lib}_reads_with_kmers_R2.fastq".format(out=out_dir, acc=row["sample_name"], lib=row["library_name"])): print("Error: {acc}_{lib}_reads_with_kmers_R2.fastq is not generated!".format(acc=row["sample_name"], lib=row["library_name"])) sys.exit(1) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | import csv import os, argparse, glob, shutil import pandas as pd import numpy as np import sys # logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f dir_path = snakemake.params.prefix # Directory path of input reads # Get dir names (each dir name shoud be same as sample names) dir_names = os.listdir(dir_path) # For each dir/sample create a list # that contains full path of each # fastq/fq/.gz files exist in that folder for dir in dir_names: input_list=[] df=[] types = ["*.gz", "*.fq", "*.fastq"] for type in types: temp = glob.glob(dir_path + '/' + dir + '/' + type) input_list += temp df = pd.DataFrame(sorted(input_list),columns=['input_files']) # Write out list of fastq files to input_files.txt # and save it in current dir/sample df.to_csv(dir_path + '/' + dir + '/' + "input_files.txt", index=False, header=False) # Check if input_files.txt is created and not empty try: if os.path.getsize(dir_path + '/' + dir + '/' + "input_files.txt") > 0: print("input_files.txt is successfully created for the sample: " + dir) else: print("Warning: input_files.txt is empty for the sample: " + dir) except OSError as e: print("input_files.txt cannot be created for the sample: " + dir) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 | import csv import os, glob, shutil import pandas as pd import numpy as np import sys import matplotlib.pyplot as plt # Logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f # Read the kmersGWAS results file (pass_threshold_5per) pass_threshold_5per = pd.read_csv(snakemake.input.kmers_gwas_res, sep="\t") # Input directory path for kmersGWAS results res_dir_path = os.path.dirname(snakemake.input.kmers_gwas_res) # Read the pass_threshold_10per file pass_threshold_10per = pd.read_csv(res_dir_path + '/pass_threshold_10per', sep="\t") # Get the permutation based p-value thresholds values for 5% and 10% threshold_5per = pd.read_csv(res_dir_path + '/threshold_5per', sep="\t", names=['threshold']) threshold_10per = pd.read_csv(res_dir_path + '/threshold_10per', header=None, sep="\t", names=['threshold']) # read the phenotype_value.assoc.txt.gz file pheno_val_assoc = pd.read_csv(res_dir_path + '/output/phenotype_value.assoc.txt.gz', compression='gzip', sep="\t") # Phenotype name pheno= snakemake.params.pheno # Number of k-mers filtered kmers_number=snakemake.params.kmers_number # Define a function to generate a table with the k-mers, p-values and p-value thresholds def generate_kmers_gwas_results_table(pass_threshold, threshold, pheno, output_file): # Generate an empty dataframe with the kmer, p-value and threshold res_table = pd.DataFrame(columns=['kmers_pass_threshold', 'pval', 'pval_threshold']) # Fill the dataframe with the kmer, p-value, threshold, phenotype, -log10(p-value) and allele frequency res_table['kmers_pass_threshold'] = pass_threshold['rs'].apply(lambda x: x.split('_')[0]) # get the kmer from the rs column # Get the p-value from the p_lrt column and round it to 6 decimal places res_table['pval'] = pass_threshold['p_lrt'] # Get the -log10(p-value) from the p-value column res_table['-log10(pval)'] = -np.log10(res_table['pval']) res_table['pval'] = res_table['pval'].map('{:.6e}'.format) # round to 6 decimal places res_table['-log10(pval)'] = res_table['-log10(pval)'].map('{:.5f}'.format) # round to 5 decimal places # Get the allele frequency from the af column res_table['allele_freq'] = pass_threshold['af'] # add the allele frequency column # Get the p-value threshold res_table['pval_threshold'] = threshold['threshold'][0] # add the p-value threshold column res_table['phenotype'] = pheno # add the phenotype column print(res_table.head()) # Save the table to a tsv file print('Saving the results summary table to a tsv file...') res_table.to_csv(output_file, index=False, sep='\t', columns=['kmers_pass_threshold', 'phenotype', 'allele_freq', 'pval', '-log10(pval)', 'pval_threshold']) print('Generating the results summary table is done!') # Define a function to plot the histogram of the -log10(p-value) values def plot_pval_histogram(pheno_val_assoc, threshold_5per, threshold_10per, kmers_number, pheno, out_dir): # If output directory does not exist, create it if not os.path.exists(out_dir): os.makedirs(out_dir) # Plot the histogram of the -log10(p-value) values f, ax = plt.subplots(figsize=(16, 8), facecolor='w', edgecolor='k') plt.hist(-np.log10(pheno_val_assoc['p_lrt']), bins=50, color='g', alpha=0.5) # Add labels to the axes plt.xlabel('-log10(pval)', fontsize=24) plt.ylabel('Frequency', fontsize=24) # Add a vertical line to the histogram based on the 5% threshold plt.axvline(x=threshold_5per['threshold'][0], color='r', linestyle='--') # Add a caption to the right bottom corner of the outside of the plot ax.text(1.0, -0.1, "*Red dashed line indicates 5% family-wise error-rate threshold", color='r', transform=ax.transAxes, ha='right', va='bottom', fontsize=12) # Add label to the top of the vertical line ax.text(threshold_5per['threshold'][0]+0.3, 1, '5% threshold', rotation=0, color='r', va='top' , ha='left', bbox=dict(facecolor='w', edgecolor='r', boxstyle='round,pad=0.5'), transform=ax.get_xaxis_transform(), fontsize=12) # Add a vertical line to the histogram based on the 10% threshold plt.axvline(x=threshold_10per['threshold'][0], color='b', linestyle='--') # Add a caption to the right bottom corner of the outside of the plot ax.text(1.0, -0.13, "*Blue dashed line indicates 10% family-wise error-rate threshold", color='b', transform=ax.transAxes, ha='right', va='bottom', fontsize=12) # Add label to the top of the vertical line ax.text(threshold_10per['threshold'][0]-0.3, 1, '10% threshold', rotation=0, color='b', va='top' , ha='right', bbox=dict(facecolor='w', edgecolor='b', boxstyle='round,pad=0.5'), transform=ax.get_xaxis_transform(), fontsize=12) # Add title to the histogram plt.title('Histogram of (-log10) p-values - k-mers passing the first filtering step (n={kmers_num})'.format(kmers_num=kmers_number), fontsize=16, y=1.02) plt.xticks(fontsize = 18) plt.yticks(fontsize = 18) # Adjust the layout of the plot plt.tight_layout() print('Saving the histogram plot to a pdf file...') # Save the plot to a PDF file plt.savefig(out_dir + "/{pheno}.kmers_pass_first_filter.pval_hist.pdf".format(pheno=pheno)) print("Generating the p-value histogram plot is done!") outdir = os.path.dirname(snakemake.output.res_sum_5per) # Check if there are k-mers that pass the 5% threshold if pass_threshold_5per.empty: # if no k-mers pass the 5% threshold with open(outdir + "/NO_KMERS_PASS_5PER_THRESHOLD_FOUND", 'a'): # create a file to indicate that no k-mers pass the 5% threshold print("No k-mers pass the 5% threshold.") # print a message to the user # generate the table with the k-mers that pass the 5% threshold print("Generating kmersGWAS summary table with k-mers that pass the 5% threshold...") generate_kmers_gwas_results_table(pass_threshold=pass_threshold_5per, threshold=threshold_5per, pheno=pheno, output_file=snakemake.output.res_sum_5per) # Check if there are k-mers that pass the 10% threshold if pass_threshold_10per.empty: # if no k-mers pass the 10% threshold with open(outdir + "/NO_KMERS_PASS_10PER_THRESHOLD_FOUND", 'a'): # create a file to indicate that no k-mers pass the 10% threshold print("No k-mers pass the 10% threshold.") # print a message to the user and exit the script # generate the table with the k-mers that pass the 10% threshold print("Generating kmersGWAS summary table with k-mers that pass the 10% threshold...") generate_kmers_gwas_results_table(pass_threshold=pass_threshold_10per, threshold=threshold_10per, pheno=pheno, output_file=snakemake.output.res_sum_10per) # if k-mers pass the 5% threshold or the 10% threshold if not pass_threshold_5per.empty or not pass_threshold_10per.empty: # Plot the histogram of the -log10(p-value) values print("Plotting the histogram of the -log10(p-value) values...") plot_pval_histogram(pheno_val_assoc=pheno_val_assoc, threshold_5per=threshold_5per, threshold_10per=threshold_10per, kmers_number=kmers_number, pheno=pheno, out_dir=snakemake.output.pval_hist_dir) # if no k-mers pass the 5% threshold and no k-mers pass the 10% threshold if pass_threshold_5per.empty and pass_threshold_10per.empty: # If output directory does not exist, create it if not os.path.exists(snakemake.output.pval_hist_dir): os.makedirs(snakemake.output.pval_hist_dir) # Create a file to indicate that no k-mers pass the 5% threshold with open(snakemake.output.pval_hist_dir + "/NO_KMERS_PASS_10PER_THRESHOLD_FOUND", 'a'): print("No k-mers pass the 5% or the 10% threshold. No plots will be generated.") # print a message to the user and exit the script print("Done!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | import csv import os, argparse, glob, shutil import pandas as pd import numpy as np import sys # logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f in_dir = snakemake.params.input_dir # Path for k-mers counts folder samp_tab = snakemake.input.samples_tab # Path for sampels sheet (samples.tsv) dir_names = os.listdir(in_dir) ## Read samples.tsv (provided by the user) file # to get the order of samples samples_tab = pd.read_csv(samp_tab, sep="\t", usecols=["sample_name"], dtype='object') samples_tab = samples_tab.drop_duplicates() ## Loop through kmers_count results for each # accession/sample directory and make a list of # kmers_with_strand file paths for each accession/sample input_list=[] dirs=[] for dir in dir_names: if os.path.isfile(in_dir + '/' + dir + '/' + "kmers_with_strand"): input_list.append(in_dir + '/' + dir + '/' + "kmers_with_strand") dirs.append(dir) ## Make a dataframe with columns accession name and # kmers_with_strand file path for that acession list_paths = pd.DataFrame(input_list,columns=['input_files']) list_paths["accession"] = dirs ## Make sure the output file has same sample order as in samples.tsv file list_paths_sorted = samples_tab.merge(list_paths, left_on='sample_name', right_on = 'accession').drop('sample_name',axis=1) ## Write out output file list_paths_sorted.to_csv(snakemake.params.out_dir + '/' + "kmers_list_paths.txt", index=False, header=False, sep="\t") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | import csv import os, glob, shutil import pandas as pd import numpy as np import seaborn as sns import scipy as sp from scipy import stats import matplotlib.ticker as ticker import matplotlib.pyplot as plt from matplotlib.pyplot import hist # Logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f ## Read kmers_to_use.shareness file. # This file contains the information about # how many k-mers appeared in each individual/sample. # More info: https://github.com/voichek/kmersGWAS/blob/master/manual.pdf kmers_shareness = pd.read_csv(snakemake.input.shareness, sep="\t", header=None, skiprows=1) print(kmers_shareness) ## Plot the stats from kmers_to_use.shareness file # x-axis is allele count and y-axis is no. of k-mers # For every n accession/sample, the number of k-mers # appeared in exactly that number of accessions. fig_dims = (10, 10) fig, ax = plt.subplots(figsize=fig_dims) sns_plot = sns.barplot(x=kmers_shareness[0], y=kmers_shareness[1], color="darksalmon", saturation=1) sns_plot.set_xlabel("No. of samples",fontsize=24) sns_plot.set_ylabel("No. of k-mers",fontsize=24) sns_plot.set_yscale("log") # set y-axis ticks yticks = [10**i for i in range(4, 10)] sns_plot.yaxis.set_major_locator(ticker.FixedLocator(yticks)) # set y-axis tick labels ytick_labels = [f"$10^{i}$" for i in range(4, 10)] sns_plot.set_yticklabels(ytick_labels) # set x-axis ticks sns_plot.xaxis.set_major_formatter(ticker.FormatStrFormatter('%d')) sns_plot.xaxis.set_major_locator(ticker.MultipleLocator(base=30)) plt.xticks(fontsize = 18) plt.yticks(fontsize = 18) # rotate x axis labels ax.tick_params(axis='x', rotation=45) fig.tight_layout() ## Save the plot sns_plot.figure.savefig(snakemake.output.kmer_allele_counts_plot, dpi=300) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 | import csv import os, glob, shutil import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import scipy as sp from scipy import stats import matplotlib.ticker as ticker # Logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f ## Gather dir (sample names as directory) names in a list kmc_logs_path= snakemake.params.input_path dir_names= os.listdir(kmc_logs_path) ## Define a function to generate k-mers count stats table # for each accession/sample. Accession name, no. total reads and no. of # unique k-mers as columns, samples are rows. def generate_kmers_stats_tab(dir_path, file_name, dir_names): """ Generate k-mers count stats table for each accession/sample. """ # Define lists to store values unique_kmers=[] # list to store unique k-mers total_reads=[] # list to store total reads accessions=[] # list to store accession names # For each accession/sample gather accesssion name, # from dir names and no. total reads and no. of unique # k-mers from kmc log files. for dir in dir_names: # for each accession/sample if os.path.isfile(dir_path + '/' + dir + '/' + file_name): # if kmc log file exists df= pd.read_csv(dir_path + '/' + dir + '/' + file_name, delimiter = ":", header=None) # read kmc log file df= df.dropna() # drop rows with NaN values unique_kmers.append(df.iloc[6][1]) # append unique k-mers to list total_reads.append(df.iloc[9][1]) # append total reads to list accessions.append(dir) # append accession name to list ## Create k-mers count stats table. target_df = pd.DataFrame( {'accessions': accessions, 'unique_kmers': unique_kmers, 'total_reads': total_reads }) ## Convert unique k-mers and total reads values to integer. target_df["unique_kmers"] = pd.to_numeric(target_df["unique_kmers"]) # target_df["total_reads"] = pd.to_numeric(target_df["total_reads"]) return target_df # return k-mers count stats table # Define functions to generete plots from k-mers count stats table (above). def plot_kmers_stats_scatter(target_df, out_file, plot_name): """ Plot the scatter plot with regression line. """ # Define x-axis and y-axis X=target_df["unique_kmers"] # Unique k-mers as x-axis Y=target_df["total_reads"] # Total reads as y-axis # Plot scatter plot plt.figure(figsize=(10, 10)) # set figure size plt.scatter(X, Y, alpha=0.5, s=200) slope, intercept = np.polyfit(X, Y, 1) # calculate regression line plt.plot(X, slope*X + intercept,color="red") # plot regression line plt.xlabel("No. of unique kmers", fontsize=24) # add x-axis label plt.ylabel("No. of total reads", fontsize=24) # add y-axis label plt.xticks(fontsize = 18) plt.yticks(fontsize = 18) plt.title(plot_name) # add plot title plt.tight_layout() # tight layout plt.savefig(out_file, dpi=300) # save plot plt.close() # close plot # Define a function to plot the joint plot with regression line and pearson correlation coefficient. def plot_kmers_stats_joint(target_df, out_file, plot_name): """ Plot the joint plot with regression line and pearson correlation coefficient. """ # plot joint plot joint_plot= sns.jointplot(x="unique_kmers", y="total_reads", data=target_df, # data kind="reg", # regression line height=10, joint_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'s': 150}}) # pearson correlation coefficient r, p = stats.pearsonr(target_df["unique_kmers"], target_df["total_reads"]) # calculate pearson correlation coefficient phantom, = joint_plot.ax_joint.plot([], [], linestyle="", alpha=0) # add pearson correlation coefficient to the legend legend = joint_plot.ax_joint.legend([phantom],['r={:f}, p={:f}'.format(r,p)]) # add pearson correlation coefficient to the legend # add x-axis label, y-axis label and plot title joint_plot.ax_joint.set_xlabel('No. of unique kmers', fontsize=24) # add x-axis label joint_plot.ax_joint.set_ylabel('No. of total reads', fontsize=24) # add y-axis label plt.figure(figsize=(10, 10)) # set figure size joint_plot.ax_joint.xaxis.set_tick_params(labelsize=18) joint_plot.ax_joint.yaxis.set_tick_params(labelsize=18) plt.setp(legend.get_texts(), fontsize='14') # for legend text joint_plot.fig.suptitle(plot_name, y = 1) # add plot title joint_plot.fig.tight_layout() # tight layout joint_plot.savefig(out_file, dpi=300) # save plot plt.close() # close plot # Define a function to plot the distribution of k-mer counts for canonical and non-canonical k-mers. def plot_kmer_dist(target_df, out_file, plot_name): """ Plot the distribution of k-mer counts for canonical and non-canonical k-mers. """ # Plot the histogram of k-mer counts plt.figure(figsize=(10, 10)) # set figure size plt.hist(target_df["unique_kmers"], bins=30, color="#86bf91", alpha=0.5, edgecolor="black", linewidth=1.2) # plt.xlabel("No. of unique kmers", fontsize=24) # add x-axis label plt.ylabel("No. of samples", fontsize=24) # add y-axis label plt.xticks(fontsize = 18) plt.yticks(fontsize = 18) plt.title(plot_name, fontsize=24) # add plot title plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0)) # set x-axis to scientific notation plt.gca().xaxis.set_major_formatter(plt.ScalarFormatter()) # Add a boxed text indicating the total number of k-mers to the top right corner of the plot textstr = 'Total no. of k-mers:\n{:,}'.format(target_df["unique_kmers"].sum()) # add box around the text plt.gca().text(0.95, 0.95, textstr, transform=plt.gca().transAxes, fontsize=16, color="black", verticalalignment='top', horizontalalignment='right', bbox=dict(facecolor='wheat', pad=10.0), alpha=0.5) plt.tight_layout() # tight layout plt.savefig(out_file, dpi=300) # save plot plt.close() # close plot ## Generate stats tables for both canonized and non-canonized k-mers kmc_canon_stats= generate_kmers_stats_tab(dir_path=kmc_logs_path, file_name="kmc_canon.log", dir_names=dir_names) kmc_non_canon_stats= generate_kmers_stats_tab(dir_path=kmc_logs_path, file_name="kmc_all.log", dir_names=dir_names) ## Writing out KMC stats as a tsv table print("Writing out KMC stats as a tsv table...") kmc_canon_stats.to_csv(snakemake.output.kmc_canon_stats, index=False, sep="\t") kmc_non_canon_stats.to_csv(snakemake.output.kmc_all_stats, index=False, sep="\t") ## Plot the stats print("Plotting total reads vs. number of unique k-mers for canonical k-mers...") plot_kmers_stats_scatter(target_df=kmc_canon_stats, out_file=snakemake.output.kmc_canon_plot, plot_name="Total Reads vs. Number of Unique k-mers (Canonical)") print("Plotting total reads vs. number of unique k-mers for non-canonical k-mers...") plot_kmers_stats_scatter(target_df=kmc_non_canon_stats, out_file=snakemake.output.kmc_all_plot, plot_name="Total Reads vs. Number of Unique k-mers (Non-canonical)") print("Plotting total reads vs. number of unique k-mers for canonical k-mers (joint plot)...") plot_kmers_stats_joint(target_df=kmc_canon_stats, out_file=snakemake.output.kmc_canon_joint_plot, plot_name="Total Reads vs. Number of Unique k-mers (Canonical)") print("Plotting total reads vs. number of unique k-mers for non-canonical k-mers (joint plot)...") plot_kmers_stats_joint(target_df=kmc_non_canon_stats, out_file=snakemake.output.kmc_all_joint_plot, plot_name="Total Reads vs. Number of Unique k-mers (Non-canonical)") print("Plotting the distribution of k-mer counts for canonical k-mers...") plot_kmer_dist(target_df=kmc_canon_stats, out_file=snakemake.output.kmc_canon_kmer_dist_plot, plot_name="Distribution of k-mer counts (Canonical)") print("Plotting the distribution of k-mer counts for non-canonical k-mers...") plot_kmer_dist(target_df=kmc_non_canon_stats, out_file=snakemake.output.kmc_all_kmer_dist_plot, plot_name="Distribution of k-mer counts (Non-canonical)") print("Done!") |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 | import csv import os, glob, shutil import re import sys import pandas as pd import numpy as np import matplotlib.pyplot as plt from qmplot import manhattanplot import natsort from natsort import natsorted import seaborn as sns import pysam if __name__ == "__main__": ## Logging with open(snakemake.log[0], "w") as f: sys.stderr = sys.stdout = f ## Read the alignment data align_kmers_sam = pd.read_table(snakemake.input[0], sep='\t', comment='@', header=None, usecols=[0,2,3], names=['kmer_id', 'chr', 'bp']) ## Preparing the data for plotting align_kmers_sam['kmer'] = align_kmers_sam['kmer_id'].str.split('_').str[0] align_kmers_sam['p_value'] = align_kmers_sam['kmer_id'].str.split('_').str[1] align_kmers_sam['p_value'] = align_kmers_sam['p_value'].astype(float) align_kmers_sam['bp'] = align_kmers_sam['bp'].astype(int) # Sort the data by chromosome and chromosome position align_kmers_sam_sorted = align_kmers_sam.sort_values(by=["chr", "bp"], key=natsort.natsort_keygen()) # Get colors for manhattan plot colors = sns.color_palette("colorblind").as_hex() colors_2 = sns.color_palette("husl").as_hex() # Make a column of minus log10 p-values align_kmers_sam_sorted['minuslog10pvalue'] = -np.log10(align_kmers_sam_sorted.p_value) ## Get min & max minus log10 p-values for y axis limits y_max = align_kmers_sam_sorted['minuslog10pvalue'].max() y_min = align_kmers_sam_sorted['minuslog10pvalue'].min() print("y_max: " + str(y_max)) print("y_min: " + str(y_min)) ## Check if only one chromosome is provided for the manhattan plot num_of_chrs = len(pd.unique(align_kmers_sam_sorted['chr'])) # Define a function to extract chromosome names and lengths from a SAM file header def extract_chromosome_info(sam_file): """ Extract chromosome names and lengths from a SAM file header and return as a Pandas DataFrame with columns "chr" and "bp". """ chromosome_info = {} # dictionary to store chromosome names and lengths pattern = re.compile(r'^([Cc][Hh][Rr])?\d*[XYM]?$') # chromosome names pattern with pysam.AlignmentFile(sam_file, "r") as sam: # open SAM file for header_line in sam.header["SQ"]: # iterate over SQ header lines chromosome_name = header_line["SN"] # get chromosome name length = header_line["LN"] # get chromosome length if chromosome_name.startswith("chr"): # if chromosome name starts with "chr" name = chromosome_name # use name as is else: # if chromosome name does not start with "chr" match = pattern.match(chromosome_name) # if match: # if chromosome name matches pattern # Convert name to "chrX" format name = "chr" + match.group(1) else: # if chromosome name does not match pattern continue # skip chromosome chromosome_info[name] = length # add chromosome name and length to dictionary # Convert dictionary to DataFrame df = pd.DataFrame(chromosome_info.items(), columns=["chr", "bp"]) return df # return the DataFrame # Plotting the manhattan plot print("Plotting...") # Set font sizes tick_fontsize = snakemake.params["tick_fontsize"] label_fontsize = snakemake.params["label_fontsize"] title_fontsize = snakemake.params["title_fontsize"] # Set the figure dpi dpi = snakemake.params["dpi"] ## If only one chromosome is provided, plot the k-mer's position on ## that chromosome on the x axis if num_of_chrs == 1: f, ax = plt.subplots(figsize=(18, 9), facecolor='w', edgecolor='k') manhattanplot(data=align_kmers_sam_sorted, snp="kmer_id", chrom="chr", CHR= pd.unique(align_kmers_sam_sorted['chr']), color=colors_2, pos="bp", pv="p_value", suggestiveline=None, # Turn off suggestiveline genomewideline=None, # Turn off genomewideline xticklabel_kws={"rotation": "vertical"}, ax=ax, s = snakemake.params["point_size"], clip_on=False) ax.set_ylim([y_min-0.5, y_max+1]) # Set y axis limits # Set x axis tick interval xtick_interval = snakemake.params["xtick_interval"] # Calculate the minimum and maximum of your data, rounded to the nearest multiple of 5000 min_val = align_kmers_sam_sorted['bp'].min() // xtick_interval * xtick_interval max_val = (align_kmers_sam_sorted['bp'].max() // xtick_interval + 1) * xtick_interval # Generate the tick locations xtick_locs = np.arange(min_val, max_val, xtick_interval) f.suptitle('k-mer Based GWAS Manhattan Plot for ' + snakemake.params["pheno"], fontsize=title_fontsize) plt.xlabel('Chromosome: ' + pd.unique(align_kmers_sam_sorted['chr'])[0], fontsize=label_fontsize) plt.ylabel(r"$-log_{10}{(P)}$", fontsize=label_fontsize) plt.xticks(xtick_locs, fontsize = tick_fontsize) plt.yticks(fontsize = tick_fontsize) plt.tight_layout() ## If more than one chromosome is provided, use all chromosomes if num_of_chrs > 1: # Extract chromosome names and lengths from the SAM file header chrom_info_tab = extract_chromosome_info(snakemake.input[0]) chrom_names = natsorted(chrom_info_tab['chr'].tolist()) # Add extra chromosome names and lengths to the data frame align_kmers_sam_with_all_chrom = pd.concat([align_kmers_sam, chrom_info_tab], ignore_index=True) align_kmers_sam_with_all_chrom = align_kmers_sam_with_all_chrom[align_kmers_sam_with_all_chrom['chr'].isin(chrom_names)] # Sort the data by chromosome and chromosome position align_kmers_sam_with_all_chrom_sorted = align_kmers_sam_with_all_chrom.sort_values(by=["chr", "bp"], key=natsort.natsort_keygen()) # Fill NaN values with 1 align_kmers_sam_with_all_chrom_sorted['p_value'] = align_kmers_sam_with_all_chrom_sorted['p_value'].fillna(1) # Plot the dots of chromosome length rows wiht the lowest opacity extra_rows = align_kmers_sam_with_all_chrom_sorted[align_kmers_sam_with_all_chrom_sorted['p_value'] == 1] f, ax = plt.subplots(figsize=(18, 9), facecolor='w', edgecolor='k') manhattanplot(data=align_kmers_sam_with_all_chrom_sorted, snp="kmer_id", chrom="chr", color=colors, pos="bp", pv="p_value", suggestiveline=None, # Turn off suggestiveline genomewideline=None, # Turn off genomewideline xticklabel_kws={"rotation": "vertical"}, ax=ax, s = snakemake.params["point_size"], clip_on=True) ax.set_ylim([y_min-2, y_max+1]) # Set y axis limits plt.scatter(extra_rows['bp'], -np.log10(extra_rows['p_value']), alpha=0) # Calculate the cumulative distances from the start of each chromosome and store them in a list chrom_ends = chrom_info_tab['bp'].cumsum().tolist() # Plot the vertical lines for the end of each chromosome for end_position in chrom_ends: plt.axvline(x=end_position, color='grey', linestyle='--', alpha=0.2) # Add a caption to the right bottom corner of the outside of the plot caption = '*Vertical dashed lines indicate chromosome boundaries.' ax.text(1.0, -0.2, caption, transform=ax.transAxes, ha='right', va='bottom', fontsize=12) # Set the title of the plot f.suptitle('k-mer Based GWAS Manhattan Plot for ' + snakemake.params["pheno"], fontsize=title_fontsize) # Set the x and y axis labels plt.xlabel('Chromosome', fontsize=label_fontsize) # Set the x axis label plt.ylabel(r"$-log_{10}{(P)}$", fontsize=label_fontsize) # Set the y axis label plt.xticks(fontsize = tick_fontsize) plt.yticks(fontsize = tick_fontsize) # Adjust the plot layout plt.tight_layout() ## Saving the plot as pdf print("Plotting is done. Saving the plot...") plt.savefig(snakemake.output["manhattan_plot"], dpi=dpi) |
2
of
scripts/plot_manhattan.py
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path import re from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) # Remove file extension(s) (similar to the internal fastqc approach) base = re.sub("\\.gz$", "", base) base = re.sub("\\.bz2$", "", base) base = re.sub("\\.txt$", "", base) base = re.sub("\\.fastq$", "", base) base = re.sub("\\.fq$", "", base) base = re.sub("\\.sam$", "", base) base = re.sub("\\.bam$", "", base) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell( "fastqc {snakemake.params} -t {snakemake.threads} " "--outdir {tempdir:q} {snakemake.input[0]:q}" " {log}" ) # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path:q} {snakemake.output.html:q}") if snakemake.output.zip != zip_path: shell("mv {zip_path:q} {snakemake.output.zip:q}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell extra = snakemake.params.get("extra", "") # Set this to False if multiqc should use the actual input directly # instead of parsing the folders where the provided files are located use_input_files_only = snakemake.params.get("use_input_files_only", False) if not use_input_files_only: input_data = set(path.dirname(fp) for fp in snakemake.input) else: input_data = set(snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {extra}" " --force" " -o {output_dir}" " -n {output_name}" " {input_data}" " {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." extra = snakemake.params.get("extra", "") adapters = snakemake.params.get("adapters", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) assert ( extra != "" or adapters != "" ), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='." shell( "cutadapt" " --cores {snakemake.threads}" " {adapters}" " {extra}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " {snakemake.input}" " > {snakemake.output.qc} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 | __author__ = "Johannes Köster, Derek Croote" __copyright__ = "Copyright 2020, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" import os import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.snakemake import get_mem log = snakemake.log_fmt_shell(stdout=True, stderr=True) extra = snakemake.params.get("extra", "") # Parse memory mem_mb = get_mem(snakemake, "MiB") # Outdir outdir = os.path.dirname(snakemake.output[0]) if outdir: outdir = f"--outdir {outdir}" # Output compression compress = "" mem = f"-m{mem_mb}" if mem_mb else "" for output in snakemake.output: out_name, out_ext = os.path.splitext(output) if out_ext == ".gz": compress += f"pigz -p {snakemake.threads} {out_name}; " elif out_ext == ".bz2": compress += f"pbzip2 -p{snakemake.threads} {mem} {out_name}; " with tempfile.TemporaryDirectory() as tmpdir: mem = f"--mem {mem_mb}M" if mem_mb else "" shell( "(fasterq-dump --temp {tmpdir} --threads {snakemake.threads} {mem} " "{extra} {outdir} {snakemake.wildcards.accession}; " "{compress}" ") {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | __author__ = "Antonie Vietor" __copyright__ = "Copyright 2021, Antonie Vietor" __email__ = "antonie.v@gmx.de" __license__ = "MIT" from snakemake.shell import shell from os import path log = snakemake.log_fmt_shell(stdout=False, stderr=True) format = snakemake.params.get("format", "") blastdb = snakemake.input.get("blastdb", "")[0] db_name = path.splitext(blastdb)[0] if format: out_format = " -outfmt '{}'".format(format) shell( "blastn" " -query {snakemake.input.query}" " {out_format}" " {snakemake.params.extra}" " -db {db_name}" " -num_threads {snakemake.threads}" " -out {snakemake.output[0]}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Antonie Vietor" __copyright__ = "Copyright 2021, Antonie Vietor" __email__ = "antonie.v@gmx.de" __license__ = "MIT" from snakemake.shell import shell from os import path log = snakemake.log out = snakemake.output[0] db_type = "" (out_name, ext) = path.splitext(out) if ext.startswith(".n"): db_type = "nucl" elif ext.startswith(".p"): db_type = "prot" shell( "makeblastdb" " -in {snakemake.input.fasta}" " -dbtype {db_type}" " {snakemake.params}" " -logfile {log}" " -out {out_name}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | __author__ = "Daniel Standage" __copyright__ = "Copyright 2020, Daniel Standage" __email__ = "daniel.standage@nbacc.dhs.gov" __license__ = "MIT" import os from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) index = os.path.commonprefix(snakemake.output).rstrip(".") shell( "bowtie2-build" " --threads {snakemake.threads}" " {extra}" " {snakemake.input.ref}" " {index}" " {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2019, Johannes Köster" __email__ = "johannes.koester@uni-due.de" __license__ = "MIT" from snakemake.shell import shell extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) tracks = snakemake.input.get("tracks", []) if tracks: if isinstance(tracks, str): tracks = [tracks] tracks = "--tracks {}".format(" ".join(tracks)) shell( "create_report {extra} --standalone --output {snakemake.output[0]} {snakemake.input.vcf} {snakemake.input.fasta} {tracks} {log}" ) |
Support
- Future updates
Related Workflows





