Analysis of pharmacogenomic targets from Genomic Medicine Sweden's haematology panel.
This workflow generates a sample-specific report with clinical guidelines that depend on the variants detected in a Genomic Medicine Sweden sequencing panel. It is designed to be coupled to an existing pipeline and needs to be given the location of analysis-ready BAM files, the path to which is specified in data/example_config.yaml.
Setup
In data/example_config.yaml, set the paths for reference_fasta and dbsnp to suitable files, and set bam_location to an input pattern matching your BAM files. Also specify the desired samples and sequencerun. A hypothetical sketch of such a config is shown below.
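For orientation, here is a minimal sketch of what data/example_config.yaml could contain. Only reference_fasta, dbsnp, bam_location, samples and sequencerun are named above; the exact layout, nesting and wildcard pattern below are assumptions, not taken from the repository:

# Hypothetical config sketch; key names come from the Setup text above,
# layout and values are assumptions (requires pyyaml).
import yaml

example_config = """
reference_fasta: /path/to/hg19.fasta
dbsnp: /path/to/dbsnp_138.hg19.vcf
# input pattern for analysis-ready BAMs; {sample} and {seqID} are
# the wildcards used by the report rule below
bam_location: /path/to/bams/{sample}_{seqID}.bam
samples:
  - sample1
  - sample2
sequencerun: run2101
"""

config = yaml.safe_load(example_config)
print(config["bam_location"].format(sample="sample1", seqID="run2101"))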
If using Singularity with Snakemake, first run the script envs/get_containers.sh to fetch the required containers.
This was tested using:
- Snakemake 5.6.0
- hg19 reference genome
- dbSNP build 138
Code Snippets
shell:
    """
    gatk VariantAnnotator \
        -R {params.ref} \
        -V {input.vcf} \
        -I {input.bam} \
        -O {output.vcf} \
        --dbsnp {params.dbsnp}
    """
shell:
    """
    python3 {params.script_location}/src/Summary/reform_genomic_region.py \
        --target_bed={params.target_bed} \
        --output_file={output.interval} \
        --padding={params.padding} \
        --format='bed'
    """
shell:
    """
    samtools view -b {input.bam} -L {input.region_list} > {output.bam}
    samtools index {output.bam}
    """
from pysam import VariantFile
import argparse
import sys


def filter_variants(vcf, read_ratio, depth, output):
    """
    Soft filter all variants with suspicious read ratio and insufficient read-depth
    """
    vcf_in = VariantFile(vcf)
    new_header = vcf_in.header
    new_header.filters.add(f"AR{read_ratio}", None, None,
                           f"Ratio of ref/alt reads lower than {read_ratio}")
    new_header.filters.add(f"DP{depth}", None, None, f"DP is lower than {depth}x")

    vcf_out = VariantFile(output, "w", header=new_header)
    for record in vcf_in.fetch():
        ad = record.samples[0]["AD"]  # No multiallelic split
        if record.info["DP"] < depth:
            # use the filter name declared in the header
            # (was hardcoded "DP100", which breaks for any other depth cutoff)
            record.filter.add(f"DP{depth}")
        elif len(ad) == 2:
            n_ref, n_alt = ad
            if n_alt / (n_ref + n_alt) < read_ratio:
                record.filter.add(f"AR{read_ratio}")
            else:
                record.filter.add("PASS")
        vcf_out.write(record)


def main():
    parser = argparse.ArgumentParser(
        description="Filter variants on depth and read ratio"
    )
    parser.add_argument("--input_vcf", type=str)
    parser.add_argument("--read_ratio", type=float)
    parser.add_argument("--depth", type=int)
    parser.add_argument("--output_file", type=str)
    args = parser.parse_args(sys.argv[1:])

    input_vcf = args.input_vcf
    read_ratio = args.read_ratio
    depth = args.depth
    output_file = args.output_file

    filter_variants(input_vcf, read_ratio, depth, output_file)


if __name__ == '__main__':
    main()
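To make the soft-filter thresholds concrete, here is a small standalone sketch of the same decision rule with made-up read counts (pure Python, no pysam; the thresholds read_ratio=0.2 and depth=100 are illustrative, corresponding to filter names AR0.2 and DP100):

def soft_filter_label(n_ref, n_alt, total_depth, read_ratio=0.2, depth=100):
    """Mirror of the decision rule in filter_variants (illustrative only)."""
    if total_depth < depth:
        return f"DP{depth}"        # insufficient depth
    if n_alt / (n_ref + n_alt) < read_ratio:
        return f"AR{read_ratio}"   # suspiciously low alt-read ratio
    return "PASS"

print(soft_filter_label(95, 5, 150))   # AR0.2 (5/100 = 0.05 < 0.2)
print(soft_filter_label(60, 40, 150))  # PASS  (40/100 = 0.4)
print(soft_filter_label(60, 40, 80))   # DP100 (80 < 100)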
shell:
    """
    python3 {params.script_location}/src/Filtering/variant_filtration.py \
        --input_vcf={input.vcf} \
        --read_ratio={params.read_ratio} \
        --depth={params.DP} \
        --output_file={output.filtered_vcf}
    """
shell:
    """
    python3 {params.script_location}/src/Summary/get_possible_diplotypes.py \
        --variant_csv {input.found_variants} \
        --haplotype_definitions {params.haplotype_definitions} \
        --clinical_guidelines {params.clinical_guidelines} \
        --haplotype_activity {params.haplotype_activity} \
        --output {output.csv} \
        --hidden_haplotypes {params.hidden_haplotypes}
    """
shell:
    """
    python3 {params.script_location}/src/Summary/get_interaction_guidelines.py \
        --diploids {input.diploids} \
        --interaction_guidelines {params.interacting_targets} \
        --output {output.csv}
    """
shell:
    """
    wkdir=$(pwd)  # Needed since Rscript will set wd to location of file, not session
    intdir=$(echo {output.html} | head -c -6)
    Rscript \
        -e ".libPaths('/lib/rlib'); library(rmdformats); rmarkdown::render('{params.script_location}/src/Report/generate_sample_report.Rmd', output_file='$wkdir/{output.html}', output_format=c('readthedown'), intermediates_dir='$wkdir/$intdir')" \
        --args --title='Farmakogenomisk analys av {wildcards.sample}' --author=joel \
        --found_variants=$wkdir/{input.found_variants} \
        --missed_variants=$wkdir/{input.missed_variants} \
        --haplotype_definitions={params.haplotype_definitions} \
        --clinical_guidelines=$wkdir/{input.diploids} \
        --interaction_guidelines=$wkdir/{input.interactions} \
        --data_location={params.script_location}/data \
        --depth_file=$wkdir/{input.depth_at_baits} \
        --sample={wildcards.sample} \
        --seqid={wildcards.seqID} \
        --dbsnp=$(basename {params.dbsnp}) \
        --ref=$(basename {params.ref}) \
        --name="{params.name}" \
        --adress="{params.adress}" \
        --mail="{params.mail}" \
        --phone="{params.phone}"
    rmdir $wkdir/$intdir
    """
shell:
    """
    python3 {params.script_location}/src/Summary/append_rsid_to_gdf.py \
        --input_gdf={input.gdf} \
        --target_bed={params.target_bed} \
        --output_file={output.gdf}
    """
from get_target_variants import GDF
import argparse
import sys

# Run as a script via "shell:" in the snakemake rule to allow execution inside singularity


def main():
    parser = argparse.ArgumentParser(
        description="Annotate each position in a GDF coverage file with overlapping RSIDs from a target bed"
    )
    parser.add_argument("--input_gdf", type=str)
    parser.add_argument("--target_bed", type=str)
    parser.add_argument("--output_file", type=str)
    args = parser.parse_args(sys.argv[1:])

    input_gdf = args.input_gdf
    target_bed = args.target_bed
    output_file = args.output_file

    c_gdf = GDF(input_gdf)
    c_gdf.rsid_per_position(target_bed)
    c_gdf.write_proccessed_gdf(output_file)


if __name__ == '__main__':
    main()
shell:
    """
    python3 {params.script_location}/src/Summary/reform_genomic_region.py \
        --target_bed={params.target_bed} \
        --output_file={output.interval} \
        --detected_variants={input.detected_variants}
    """
shell:
    """
    java -jar /usr/GenomeAnalysisTK.jar -T DepthOfCoverage -R {params.ref} -I {input.bam} -o {output.gdf} -L {input.interval}
    """
shell:
    """
    python3 {params.script_location}/src/Summary/reform_genomic_region.py \
        --target_bed={params.target_bed} \
        --output_file={output.interval} \
        --padding={params.padding}
    """
shell:
    """
    # NOTE: does not work with openjdk-11, openjdk-8 works
    java -jar /usr/GenomeAnalysisTK.jar -T DepthOfCoverage -R {params.ref} -I {input.bam} -o {output.gdf} -L {input.interval}
    """
shell:
    """
    python3 {params.script_location}/src/Summary/get_target_variants.py \
        --target_bed {params.target_bed} \
        --vcf {input.vcf} \
        --output {output.csv}
    """
import pandas as pd
import argparse
import sys
import numpy as np
from itertools import product


class GetInteractions:
    def __init__(self, variants_file, guideline_file):
        self.variants_df = pd.read_csv(variants_file, sep="\t")
        self.variants_df = self.variants_df.sort_values(by=["gene"])
        self.interaction_guidelines_df = pd.read_csv(guideline_file, sep="\t")

    def get_possible_interactions(self):
        # TODO: refactor inefficient and ugly
        interacting_genes = set(self.interaction_guidelines_df[["gene1", "gene2"]].values.flat)
        if self.variants_df.empty:
            return self.interaction_guidelines_df[np.zeros(len(self.interaction_guidelines_df), dtype=bool)]
        self.variants_df = self.variants_df[self.variants_df.gene.isin(interacting_genes)]
        if self.variants_df.empty:
            return self.interaction_guidelines_df[np.zeros(len(self.interaction_guidelines_df), dtype=bool)]
        index_combinations = product(
            range(len(self.variants_df)), range(len(self.variants_df))
        )
        prev_idx = np.zeros(len(self.interaction_guidelines_df), dtype=bool)
        haplotypes = [""] * len(self.interaction_guidelines_df)
        for i, j in index_combinations:
            if not i == j:
                gene1 = self.variants_df.iloc[[i]].gene.values[0]
                gene2 = self.variants_df.iloc[[j]].gene.values[0]
                activity1 = int(self.variants_df.iloc[[i]].Genotype_activity)
                activity2 = int(self.variants_df.iloc[[j]].Genotype_activity)
                all_haplotypes = ",".join([
                    self.variants_df.iloc[[i]]["Haplotype1"].values.flat[0],
                    self.variants_df.iloc[[i]]["Haplotype2"].values.flat[0],
                    self.variants_df.iloc[[j]]["Haplotype1"].values.flat[0],
                    self.variants_df.iloc[[j]]["Haplotype2"].values.flat[0]
                ])
                idx = (self.interaction_guidelines_df.gene1 == gene1) & \
                      (self.interaction_guidelines_df.gene2 == gene2) & \
                      (self.interaction_guidelines_df.activity_1 == activity1) & \
                      (self.interaction_guidelines_df.activity_2 == activity2)
                if any(idx):
                    prev_idx += idx
                    # Record the haplotypes at the matching row positions
                    # (the original indexed with the boolean value itself)
                    for pos, hit in enumerate(idx):
                        if hit:
                            haplotypes[pos] = all_haplotypes
        haplotypes = [h for h in haplotypes if h != ""]
        interactions = self.interaction_guidelines_df[prev_idx]
        interactions["haplotypes"] = haplotypes
        return interactions

    def run_and_write(self, output):
        interactions = self.get_possible_interactions()
        interactions.to_csv(output, index=None, sep="\t")


def main():
    parser = argparse.ArgumentParser(
        description="Get guidelines based on interacting variants in different genes"
    )
    parser.add_argument("--diploids", type=str)
    parser.add_argument("--interaction_guidelines", type=str)
    parser.add_argument("--output", type=str, help="Location of output")
    args = parser.parse_args(sys.argv[1:])

    get_interactions = GetInteractions(args.diploids, args.interaction_guidelines)
    get_interactions.run_and_write(args.output)


if __name__ == '__main__':
    main()
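The pairwise lookup above is easy to lose in the pandas indexing. This toy example (hypothetical genes, activity scores and guideline text, not from the repository) shows the core idea: every ordered pair of detected genes is matched against guideline rows on (gene1, gene2, activity_1, activity_2):

# Toy sketch of the pairwise guideline lookup (all values made up).
import pandas as pd
from itertools import product

detected = pd.DataFrame({
    "gene": ["GENEA", "GENEB"],
    "Genotype_activity": [1, 2],
})
guidelines = pd.DataFrame({
    "gene1": ["GENEA"], "gene2": ["GENEB"],
    "activity_1": [1], "activity_2": [2],
    "guideline": ["Reduce dose of drug X"],
})

for i, j in product(range(len(detected)), repeat=2):
    if i != j:
        hit = guidelines[
            (guidelines.gene1 == detected.gene[i]) &
            (guidelines.gene2 == detected.gene[j]) &
            (guidelines.activity_1 == detected.Genotype_activity[i]) &
            (guidelines.activity_2 == detected.Genotype_activity[j])
        ]
        if not hit.empty:
            print(hit.guideline.values[0])  # -> Reduce dose of drug X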
import pandas as pd
import re
import argparse
import sys
import numpy as np


class ArrangeHaplotype:
    """
    Get possible haplotypes from variant combinations detected in file.
    Add clinical guidelines based on the haplotypes detected
    """

    def __init__(self, detected_variants, haplotype_definitions,
                 activity_scores, clinical_guidelines):
        self.detected_variants = pd.read_csv(detected_variants, sep="\t")
        self.haplotype_definitions = pd.read_csv(haplotype_definitions, sep="\t")
        self.activity_scores = pd.read_csv(activity_scores, sep="\t")
        self.clinical_guidelines = pd.read_csv(clinical_guidelines, sep="\t")
        if not self.detected_variants.empty:
            self.merge_data()

    def merge_data(self):
        """
        Join possible haplotypes containing ids
        :return:
        """
        self.detected_variants["multival_haplotype"] = \
            self.detected_variants.ID.apply(
                lambda x: list(self.haplotype_definitions["HAPLOTYPE"][
                    self.haplotype_definitions.ID == x
                ])
            )
        # Copy number of each variant, from the genotype (0/1 -> 1, 1/1 -> 2, ...)
        self.detected_variants["CN"] = self.detected_variants["GT"].apply(
            lambda x: sum(map(int, re.split('[/|]+', x)))
        )

    def get_haplotypes(self):
        """
        Return tree-structure of possible combinations of haplotypes explaining
        seen variants.
        Assumptions: the haplotype explaining the most variants goes first; when
        the copy number of any variant in a haplotype reaches zero, all
        haplotypes containing that variant are removed from further choices.
        :return: Gene - haplotype-tree dict
        """
        # Return a new list without y (list.remove mutates and returns None)
        remove_hap = lambda x, y: [h for h in x if h != y]

        def _get_haplotypes(variant_subdf, current_haplotype, depth=2):
            idx = variant_subdf["multival_haplotype"].apply(lambda x: current_haplotype in x)
            variant_subdf.loc[idx, "CN"] -= 1
            if any(variant_subdf["CN"] == 0):
                variant_subdf["multival_haplotype"] = variant_subdf["multival_haplotype"].apply(
                    lambda x: remove_hap(x, current_haplotype)
                )
                variant_subdf = variant_subdf[variant_subdf["CN"] != 0]

            if depth == 1:
                if len(variant_subdf) == 0:
                    return [current_haplotype, True]
                else:
                    return [current_haplotype, False]

            if len(variant_subdf) == 0 or not any(variant_subdf["multival_haplotype"].apply(lambda x: bool(x))):
                wt_haplotype = "WT"
                return [current_haplotype,
                        [_get_haplotypes(variant_subdf.copy(), wt_haplotype, depth - 1)]]

            remaining_haplo = set([element
                                   for haplolist in variant_subdf["multival_haplotype"]
                                   for element in haplolist])
            return [current_haplotype, [
                _get_haplotypes(variant_subdf.copy(), hap, depth - 1)
                for hap in remaining_haplo
            ]]

        genes = set(self.detected_variants["GENE"])
        full_mat = {}
        for gene in genes:
            gene_subset = self.detected_variants[self.detected_variants.GENE == gene]
            print(gene_subset)
            candidate_haplotypes = np.array(list(set(
                [element
                 for haplolist in gene_subset["multival_haplotype"]
                 for element in haplolist]
            )))
            # Order candidates by how many detected variants they explain
            order = list(reversed(np.argsort(
                [sum(gene_subset["multival_haplotype"].apply(lambda y: x in y))
                 for x in candidate_haplotypes]
            )))
            candidate_haplotypes = candidate_haplotypes[order]
            gene_haps = []
            for current_haplotype in candidate_haplotypes:
                gene_haps.append(_get_haplotypes(gene_subset.copy(), current_haplotype))
                idx = gene_subset["multival_haplotype"].apply(lambda x: current_haplotype in x)
                cn = gene_subset.loc[idx, "CN"]
                if any([(c - 1) == 0 for c in cn]):
                    gene_subset["multival_haplotype"] = gene_subset["multival_haplotype"].apply(
                        lambda x: remove_hap(x, current_haplotype)
                    )
            full_mat.update({gene: gene_haps})
        return full_mat

    def get_haplotype_dataframe(self):
        # Wow what a mess
        hap_mat = self.get_haplotypes()

        def _haplot_to_row(hap, gene):
            def prim_haplot_to_row(hap, gene):
                current_hap = (f"{gene}-1" if hap[0] == "WT" else hap[0])
                if not type(hap[1]) is list:
                    return [current_hap, gene, hap[1]]
                else:
                    next_hap = [prim_haplot_to_row(c_hap, gene) for c_hap in hap[1]]
                    return [[current_hap] + c_hap for c_hap in next_hap][0]

            return [prim_haplot_to_row(c_hap, gene) for c_hap in hap]

        out = []
        for gene, hap in hap_mat.items():
            if len(hap) != 0:
                out += _haplot_to_row(hap, gene)

        hap_df = pd.DataFrame(out, columns=["Haplotype1", "Haplotype2", "gene", "pass"])
        hap_df = hap_df[hap_df["pass"]]
        return hap_df[["gene", "Haplotype1", "Haplotype2"]]

    def get_clinical_guidelines_table(self):
        if self.detected_variants.empty:
            columns = [
                "gene", "Haplotype1", "Haplotype2", "HAPLOTYPE1",
                "ACTIVITY_SCORE1", "HAPLOTYPE2", "ACTIVITY_SCORE2",
                "Genotype_activity", "Gene", "Activity", "Guideline"
            ]
            return pd.DataFrame(columns=columns)
        hap_df = self.get_haplotype_dataframe()
        hap_df = hap_df.merge(self.activity_scores, how="left",
                              left_on="Haplotype1", right_on="HAPLOTYPE")
        hap_df = hap_df.merge(self.activity_scores, how="left",
                              left_on="Haplotype2", right_on="HAPLOTYPE",
                              suffixes=("1", "2"))
        hap_df["Genotype_activity"] = hap_df["ACTIVITY_SCORE1"] + hap_df["ACTIVITY_SCORE2"]
        hap_df = hap_df.merge(self.clinical_guidelines, how="left",
                              left_on=["gene", "Genotype_activity"],
                              right_on=["Gene", "Activity"])
        return hap_df

    def get_wildtypes(self, hap_df):
        hap_genes = list(hap_df.gene.values)
        for gene in set(self.clinical_guidelines.Gene):
            if hap_df.empty or gene not in hap_genes:
                gene_df = pd.DataFrame(
                    {"gene": [gene],
                     "Haplotype1": [gene + "-1"],
                     "Haplotype2": [gene + "-1"],
                     "HAPLOTYPE1": [gene + "-1"],
                     "ACTIVITY_SCORE1": [1],
                     "HAPLOTYPE2": [gene + "-1"],
                     "ACTIVITY_SCORE2": [1],
                     "Genotype_activity": [2.0]})
                gene_df = gene_df.merge(self.clinical_guidelines, how="left",
                                        left_on=["gene", "Genotype_activity"],
                                        right_on=["Gene", "Activity"])
                # NOTE: DataFrame.append requires pandas < 2.0
                hap_df = hap_df.append(gene_df, ignore_index=True)
        return hap_df


def main():
    parser = argparse.ArgumentParser(
        description="Get possible diplotypes and clinical guidelines for detected variants"
    )
    parser.add_argument("--variant_csv", type=str)
    parser.add_argument("--haplotype_definitions", type=str)
    parser.add_argument("--clinical_guidelines", type=str)
    parser.add_argument("--haplotype_activity", type=str)
    parser.add_argument("--hidden_haplotypes", type=str)
    parser.add_argument("--output", type=str, help="Location of output")
    args = parser.parse_args(sys.argv[1:])

    variant_csv = args.variant_csv
    haplotype_definitions = args.haplotype_definitions
    clinical_guidelines = args.clinical_guidelines
    haplotype_activity = args.haplotype_activity
    output = args.output
    hidden_haplotypes = args.hidden_haplotypes

    ah = ArrangeHaplotype(
        variant_csv,
        haplotype_definitions,
        haplotype_activity,
        clinical_guidelines
    )

    df = ah.get_clinical_guidelines_table()
    df = ah.get_wildtypes(df)
    columns = [
        "gene", "Haplotype1", "Haplotype2", "HAPLOTYPE1", "ACTIVITY_SCORE1",
        "HAPLOTYPE2", "ACTIVITY_SCORE2", "Genotype_activity", "Gene",
        "Activity", "Guideline"
    ]
    if not df.empty:
        hidden_haplotypes = pd.read_csv(hidden_haplotypes, sep="\t")
        hidden_haplotypes["comb"] = hidden_haplotypes[["Haplotype1", "Haplotype2"]].apply(
            lambda x: "".join(sorted(x.tolist())), axis=1
        )
        df["comb"] = df[["Haplotype1", "Haplotype2"]].apply(
            lambda x: "".join(sorted(x.tolist())), axis=1
        )
        print(df["comb"])
        print(hidden_haplotypes["comb"])
        df = df[~df["comb"].isin(hidden_haplotypes["comb"])]
    df.to_csv(output, sep="\t", index=False, columns=columns)


if __name__ == '__main__':
    main()
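The copy-number bookkeeping in get_haplotypes is the subtle part. The toy sketch below (hypothetical variant and haplotype names, and a deliberately simplified data structure) shows the idea: assigning a haplotype consumes one copy of each of its defining variants, and once a variant is exhausted that haplotype is removed from further choices:

# Toy sketch of the CN-elimination idea (all names and numbers made up).
variants = {
    "rs111": {"CN": 2, "haps": {"GENEX-2", "GENEX-3"}},
    "rs222": {"CN": 1, "haps": {"GENEX-3"}},
}

def consume(variants, hap):
    """Assign `hap` once: decrement the CN of its defining variants.
    If any variant is exhausted (CN 0), `hap` cannot be chosen again,
    so it is also removed from the remaining candidate sets."""
    out, exhausted = {}, False
    for rsid, v in variants.items():
        cn = v["CN"] - (hap in v["haps"])
        if cn == 0:
            exhausted = True
        else:
            out[rsid] = {"CN": cn, "haps": set(v["haps"])}
    if exhausted:
        for v in out.values():
            v["haps"].discard(hap)
    return out

after = consume(variants, "GENEX-3")
print(after)  # {'rs111': {'CN': 1, 'haps': {'GENEX-2'}}}
# The remaining copy of rs111 must come from GENEX-2, so this branch
# yields GENEX-3/GENEX-2 as the candidate diplotype.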
import pandas as pd
import gzip
import re
import os
import argparse
import sys


class GDF:
    """
    We are assuming single sample GDF
    """

    def __init__(self, filename):
        self.data = pd.read_csv(filename, sep="\t")
        self.data_cols = self.data.columns.values
        pos_df = pd.DataFrame(
            self.data.Locus.apply(lambda x: x.split(":")).to_list(),
            columns=["CHROM", "POS"]
        )
        pos_df.POS = pos_df.POS.astype('int64')
        self.data = pd.concat([self.data, pos_df], axis=1)

    def rsid_per_position(self, target_bed):
        def _annotate(x, targets):
            try:
                ids = targets[(targets.CHROM == x.CHROM) &
                              (targets.START <= x.POS) &
                              (x.POS <= targets.END)].ID.to_list()
                return ", ".join(ids)
            except IndexError:
                return "-"

        targets = pd.read_csv(
            target_bed, sep="\t",
            names=["CHROM", "START", "END", "ID", "GENE"]
        )
        # Swap START/END where the bed interval is given in reverse order
        targets["save"] = targets.START
        idx_swap = targets.START > targets.END
        targets.loc[idx_swap, "START"] = targets.loc[idx_swap, "END"]
        targets.loc[idx_swap, "END"] = targets.loc[idx_swap, "save"]
        targets["CHROM"] = targets.CHROM.apply(lambda x: f"chr{x}")

        self.data["ID"] = self.data.apply(lambda x: _annotate(x, targets), axis=1)

    def write_proccessed_gdf(self, filename, annotate=True):
        if annotate:
            self.data.to_csv(filename, sep="\t", index=False)
        else:
            self.data.to_csv(filename, sep="\t", columns=self.data_cols, index=False)


class VCF:
    """
    We are assuming single sample VCF
    """

    def __init__(self, filename):
        self.meta = []
        self.data = pd.DataFrame()
        self.original_header = []
        self.read_vcf(filename)

    def read_vcf(self, filename):
        if ".gz" in filename:
            f = gzip.open(filename, "rt")
        else:
            f = open(filename, "r")
        lines = f.readlines()
        lines = [l.strip() for l in lines]
        i = None
        for i, line in enumerate(lines):
            if re.search("^#CHROM", line):
                break
        if i is None:
            raise ImportError("No lines in: " + filename)
        self.meta = lines[:i]  # all meta lines up to, not including, #CHROM (was lines[:i - 1], dropping one)
        data = [l.split("\t") for l in lines[i:]]
        self.data = pd.DataFrame(data[1:], columns=data[0])
        self.original_header = self.data.columns.values

        if not self.data.empty:
            sample_column = self.data.columns.values[-1]
            max_len_idx = self.data.FORMAT.str.len().idxmax()  # was missing the call parentheses
            format_columns = self.data.FORMAT[max_len_idx].split(":")
            format_split = pd.DataFrame(
                self.data[sample_column].apply(lambda x: x.split(":")).to_list(),
                columns=format_columns
            )
            self.data = pd.concat([self.data, format_split], axis=1)

    def filter_snp(self, filter_file, exclude=True, column="SNP"):
        filter_snps = pd.read_csv(filter_file, sep="\t")[column]
        if exclude:
            self.data = self.data[~self.data.ID.isin(filter_snps)]
        else:
            self.data = self.data[self.data.ID.isin(filter_snps)]

    def write_vcf(self, filename_out):
        with open(filename_out, "w+") as f:
            for line in self.meta:
                f.write(line + "\n")
            self.data.to_csv(f, mode="a", sep="\t",
                             columns=self.original_header, index=False)


class VariantQCCollection:
    def __init__(self, target_bed, vcf_filename):
        self.bed_targets = pd.read_csv(target_bed, names=None, sep="\t")
        self.bed_targets.columns = ["#CHROM", "START", "END", "ID", "GENE"]
        self.sample = os.path.basename(vcf_filename)
        self.vcf = VCF(vcf_filename)
        self.detected_variants = []

    def detect_variants(self):
        """
        Select all variants within VCF with ID same as in target_bed
        """
        self.detected_variants = self.vcf.data.ID[
            (self.vcf.data.ID.isin(self.bed_targets.ID)) &
            (self.vcf.data.FILTER == "PASS")
        ].tolist()

    def write_detected_variant_qc(self, output_file):
        """
        Write detected variants to csv
        """
        if not self.detected_variants:
            self.detect_variants()
        current_variants = self.vcf.data[self.vcf.data.ID.isin(self.detected_variants)]
        current_variants = current_variants.merge(
            self.bed_targets[["ID", "GENE"]], on="ID"
        )
        current_variants.to_csv(output_file, index=False, sep="\t")


def main():
    parser = argparse.ArgumentParser(
        description="Finds selected RSIDs from bed file in input VCF"
    )
    parser.add_argument("--target_bed", type=str,
                        help="Bed-file containing RSIDs of interest")
    parser.add_argument("--vcf", type=str)
    parser.add_argument("--output", type=str,
                        help="Location of output, NOTE: will overwrite")
    args = parser.parse_args(sys.argv[1:])

    vcf_f = args.vcf
    bed_f = args.target_bed
    output_f = args.output

    var_collect = VariantQCCollection(bed_f, vcf_f)
    var_collect.write_detected_variant_qc(output_f)


if __name__ == '__main__':
    main()
import pandas as pd
import argparse
import sys


def reform(target_bed, output_f, detected_variants, padding, file_format):
    targets = pd.read_csv(target_bed, sep="\t",
                          names=["CHROM", "START", "END", "ID", "GENE"],
                          dtype={"START": int, "END": int})

    # Drop targets already detected as variants (i.e. keep only wt targets)
    if detected_variants != "":
        detected_rsid = pd.read_csv(detected_variants, sep="\t").ID
        targets = targets[~targets.ID.isin(detected_rsid)]

    bed = file_format == "bed"
    with open(output_f, "w+") as f:
        for i, row in targets.iterrows():
            chrom, start, end, id = row[0:4]
            if start > end:
                end, start = start, end
            start -= padding
            end += padding
            if bed:
                f.write(f"chr{chrom}\t{start}\t{end}\t{id}\n")
            else:
                f.write(f"chr{chrom}:{start}-{end}\n")


def main():
    parser = argparse.ArgumentParser(
        description="Rewrite bed to chr:start-end list. Removing wt targets or adding padding"
    )
    parser.add_argument("--target_bed", type=str)
    parser.add_argument("--output_file", type=str)
    parser.add_argument("--detected_variants", type=str, default="")
    parser.add_argument("--padding", type=int, default=0)
    parser.add_argument("--format", type=str, default="list")
    args = parser.parse_args(sys.argv[1:])

    target_bed = args.target_bed
    output_file = args.output_file
    detected_variants = args.detected_variants
    padding = args.padding
    file_format = args.format

    reform(target_bed, output_file, detected_variants, padding, file_format)


if __name__ == '__main__':
    main()
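A quick worked example of the rewrite, with made-up coordinates: a target is orientation-fixed, padded, and emitted either as a bed row or as a chr:start-end region:

# Worked example of the region rewrite (hypothetical target, padding=100).
chrom, start, end, rsid = "7", 117120148, 117120150, "rs0000001"
padding = 100

if start > end:          # same orientation fix as in reform()
    start, end = end, start
start -= padding
end += padding

print(f"chr{chrom}\t{start}\t{end}\t{rsid}")  # --format='bed': chr7  117120048  117120250  rs0000001
print(f"chr{chrom}:{start}-{end}")            # default list format: chr7:117120048-117120250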
shell:
    """
    gatk HaplotypeCaller \
        -R {params.ref} \
        -I {input.bam} \
        --dbsnp {params.dbsnp} \
        -O {output.vcf}
    """
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/JoelAAs/pgx_module
Name: pgx_module
Version: 1
Downloaded: 0
Copyright: Public Domain
License: MIT License