Variant Calling Workflow for Joint Multi-Sample VCF Generation
Forked from https://gitlab.com/data-analysis5/dna-sequencing/54gene-wgs-germline
This workflow takes either FASTQs or gVCFs as input and emits a joint-called, multi-sample VCF. Please see the workflow's Read the Docs site for additional documentation.
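For orientation only, a Snakemake workflow of this kind is typically launched with a command along the following lines. The config path, core count, and conda flag here are assumptions rather than instructions from this repository, so check the Read the Docs documentation for the supported invocation:

    # hypothetical invocation; adjust the config path and resources to your setup
    snakemake --use-conda --cores 8 --configfile config/config.yaml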
Code Snippets
# Align reads with bwa mem, attach read group information, and coordinate-sort with samtools
shell:
    "mkdir -p {output.t} && "
    "bwa mem "
    "-K 10000000 -M "
    '-R "@RG\\tCN:{params.center}\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}\\tLB:N/A" '
    "-t {threads} "
    "{input.r} {input.r1} {input.r2} | "
    "samtools sort -@ {params.samtools_threads} -T {output.t} -m {params.samtools_mem}M -o {output.bam} - "
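If you want to confirm that the read-group information supplied via -R above made it into the sorted BAM, a quick header check with samtools works; the BAM path in this sketch is hypothetical:

    # inspect @RG header lines in the sorted output BAM (hypothetical path)
    samtools view -H results/sample1.bam | grep '^@RG'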
# Mark and remove duplicate reads with GATK/Picard MarkDuplicates
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" MarkDuplicates '
    "TMP_DIR={params.t} "
    "REMOVE_DUPLICATES=true "
    "INPUT={params.l} "
    "OUTPUT={output.bam} "
    "METRICS_FILE={output.metrics}"
# Build the base quality score recalibration (BQSR) table from known sites
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" BaseRecalibrator '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--known-sites {input.snps} "
    "--known-sites {input.indels} "
    "-O {output.table}"
# Apply the recalibration table to the BAM
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" ApplyBQSR '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--bqsr-recal-file {input.recal} "
    "-O {output.bam}"
shell:
    "samtools stats {input.bam} > {output}"
shell:
    "ln -s {input.r1} {output.r1} && "
    "ln -s {input.r2} {output.r2}"
shell:
    "fastqc {input.r1} -d {params.t} --quiet -t {threads} --outdir={params.o} && "
    "fastqc {input.r2} -d {params.t} --quiet -t {threads} --outdir={params.o}"
# Adapter and quality trimming of paired reads with fastp
shell:
    "fastp -i {input.r1} -I {input.r2} "
    "-w {threads} "
    "-o {output.r1_paired} -O {output.r2_paired} "
    "-h {output.h} -j {output.j} "
    "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA "
    "--adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT "
    "-l 36"
# Normalize the VCF and split multiallelic sites into biallelic records
shell:
    "bcftools norm "
    "-f {input.ref} "
    "-m - "
    "--threads {threads} "
    "{input.vcf} "
    "-Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type SNP "
    "-O {output.vcf}"
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type INDEL "
    "-O {output.vcf}"
# Hard-filter SNPs on standard site-level annotations
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "SOR > 3.0" --filter-name "SOR3" '
    '-filter "FS > 60.0" --filter-name "FS60" '
    '-filter "MQ < 40.0" --filter-name "MQ40" '
    '-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" '
    '-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" '
    "-O {output.vcf}"
# Hard-filter indels on standard site-level annotations
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "FS > 200.0" --filter-name "FS200" '
    '-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" '
    "-O {output.vcf}"
shell:
    "bcftools view --threads {threads} -r {params.region} -Oz -o {output.vcf} {input.vcf} && "
    "tabix -p vcf {output.vcf}"
# Estimate within-species contamination with verifyBamID
shell:
    "verifyBamID "
    "--best "
    "--vcf {input.vcf} "
    "--bam {input.bam} "
    "--out {params.d}"
shell:
    'grep -v "^#" {params}*selfSM > {output}'
shell:
    "bcftools concat -a -Ov {input.snps} {input.indels} | "
    "bcftools sort -T {params.t} -Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" CollectVariantCallingMetrics '
    "--TMP_DIR {params.t} "
    "-I {input.calls} "
    "--DBSNP {input.dbsnp} "
    "-SD {input.d} "
    "-O {params.d}"
# Drop variants that failed the hard filters
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--exclude-filtered "
    "-O {output.vcf}"
# Call per-sample variants over one interval, emitting a gVCF
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" HaplotypeCaller '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "-ERC GVCF "
    "-L {input.intervalfile} "
    "-O {output.gvcf} "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"
shell:
    "bgzip -c {input.gvcf} > {output.gvcf} && "
    "tabix -p vcf {input.gvcf}.gz"
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GatherVcfs '
    "-I {params.l} "
    "-O {output}"
shell:
    "tabix -p vcf {input}"
shell:
    "ln -s {input.gvcf} {output.gvcf} && "
    "ln -s {input.index} {output.index}"
# Write a one-line sample map entry (sample name, tab, gVCF path) for GenomicsDBImport
shell:
    "n=$(bcftools query -l {input.gvcf});"
    'echo "${{n}}\t{input.gvcf}" > {output}'
shell:
    "for i in {params.mapDir}*.map; do cat $i >> {output}; done"
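The concatenated file built above is used as the --sample-name-map for GenomicsDBImport below, which expects one tab-separated sample-name/gVCF-path pair per line. A minimal sketch of its contents, with made-up sample names and paths:

    # illustrative combined sample map (tab-separated), e.g. from: cat all_samples.map
    sample1	/path/to/sample1.g.vcf.gz
    sample2	/path/to/sample2.g.vcf.gz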
# Consolidate the per-sample gVCFs for this interval into a GenomicsDB workspace
shell:
    'export _JAVA_OPTIONS="" && '
    "rm -r {params.db} && "
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenomicsDBImport '
    "--batch-size {params.batch_size} "
    "--disable-bam-index-caching "
    "--sample-name-map {input.sampleMap} "
    "--genomicsdb-workspace-path {params.db} "
    "-L {input.intervalfile} "
    "--tmp-dir {params.t} "
    "--reader-threads {params.reader_threads} "
    "--genomicsdb-shared-posixfs-optimizations"
# Joint-genotype all samples in the GenomicsDB workspace for this interval
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenotypeGVCFs '
    "-R {input.r} "
    "-V gendb://{params.db} "
    "-O {output.vcf} "
    "--tmp-dir {params.t} "
    "--max-alternate-alleles {params.m} "
    "--genomicsdb-max-alternate-alleles {params.g} "
    "--max-genotype-count {params.c} "
    "-stand-call-conf 30 "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"
# Concatenate and sort the per-interval VCFs into the project-level VCF
shell:
    "bcftools concat -a {input.vcfList} -Ou | "
    "bcftools sort -T {params.t} -Oz -o {output.projectVCF} && "
    "tabix -p vcf {output.projectVCF}"
shell:
    "bcftools stats "
    "--af-bins 0.01,0.05,0.1,1 "
    "-F {input.r} "
    "-s- {input.vcf} > {output}"
shell:
    "plot-vcfstats -P -p {params.d} {input}"
shell:
    "python workflow/scripts/generate_ped.py {input} {params.prefix}"
# Extract per-sample genotype sketches and assess relatedness with somalier
shell:
    "somalier extract "
    "-d {params.d} "
    "--sites $CONDA_PREFIX/share/somalier/sites.hg38.vcf.gz "
    "-f {input.r} {input.vcf} && "
    "somalier relate --ped {input.ped} -o {params.o} {params.d}/*.somalier"
shell:
    "touch {output}"
shell:
    "bcftools +guess-ploidy -g hg38 {input.vcf} -v > {output.txt} && "
    "guess-ploidy.py {output.txt} {params.p}"
shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} --verify {input.v} -r {params.r} -d {params.d} -c {params.c}"
shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} -r {params.r} -d {params.d}"
# Remove excluded samples and rewrite variant IDs as CHROM:POS:REF:ALT
shell:
    "bcftools view -S ^{input.l} --threads {threads} -Ou {input.v} | "
    "bcftools annotate --threads {threads} --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o {output.v} && "
    "tabix -p vcf {output.v}"
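As a quick sanity check after the exclusion and ID-annotation step above, you can list the remaining samples and inspect a few of the rewritten IDs with bcftools; the output path in this sketch is hypothetical:

    # hypothetical path for the post-exclusion, ID-annotated VCF
    bcftools query -l results/post_qc.vcf.gz
    bcftools query -f '%ID\n' results/post_qc.vcf.gz | head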
shell:
    "multiqc --force -o {params.outDir} -n {params.outName} --config {input.mqc_config} {params.inDirs} {params.relatedness}"
shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs} {params.relatedness}"
shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs}"
shell:
    "bcftools query -l {input} > {output}"
script:
    "../scripts/run_summary.Rmd"
script:
    "../scripts/run_summary.Rmd"
script:
    "../scripts/run_initial_summary.Rmd"
script:
    "../scripts/combine_benchmarks.R"
script:
    "../scripts/benchmarking_report.Rmd"
# Download the GRCh38 reference bundle (fasta, indexes, known sites) from the Broad public S3 bucket
shell:
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx resources/ --no-sign-request "
shell:
    "date +%s > {output}"
# Load required packages
knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=8)
require(knitr, quietly = TRUE)
require(ggplot2, quietly=TRUE)
require(RColorBrewer, quietly=TRUE)
require(dplyr, quietly = TRUE)
require(gluedown, quietly = TRUE)
require(gridExtra, quietly = TRUE)
require(lubridate, quietly = TRUE)

# set input variable from snakemake
benchmarks.table <- snakemake@input[["benchmarks"]]

# read in tsv
dataset <- read.table(benchmarks.table, header=TRUE, sep="\t")

# some light date/time manipulation for ease in plotting
dataset$times <- hms(dataset$h.m.s)
d.lub <- hour(dataset$times) + minute(dataset$times)/60 + second(dataset$times)/3600
dataset$walltime <- d.lub

# some upfront unit conversions & derivations for ease in comprehension
#
# convert IO units from MB to GB
dataset$io_in <- dataset$io_in / 1000
dataset$io_out <- dataset$io_out / 1000
#
# convert CPU time from seconds to hours
dataset$cpu_time_hrs <- dataset$cpu_time / 3600
#
# calculate cpu/walltime ratio
dataset$cpuwall_ratio <- (dataset$cpu_time / dataset$s)

# subset the dataset to get data for rules where there is more than 1
# observation/process to pass to geom_boxplot
subset.df <- dataset %>% group_by(rule) %>% filter(n() > 1)

# subset the dataset to get data points for rules meeting time_threshold
# parameter specified at the rule level
time.param <- snakemake@params[["threshold"]]
time.threshold <- time.param * 60
filtered.df <- dataset %>% group_by(rule) %>% filter(s >= time.threshold)

# start time to calculate elapsed time
tat.file <- snakemake@input[["start_time"]]
# plotting function for box/jitter plot where single data points are omitted
# from boxplots
create.boxjitter.plot <- function(df, df.subset, xvar, yvar, xlab, ylab) {
  bd.plot <- ggplot(data=df, aes_string(x=xvar, y=yvar)) +
    geom_boxplot(data = df.subset, aes_string(fill=xvar), show.legend = FALSE) +
    geom_point(position="identity", show.legend = FALSE, alpha=1/2, size=1) +
    labs(x=xlab, y=ylab) +
    theme_minimal() +
    theme(axis.text.x=element_text(angle = 90, vjust=0.5, hjust=1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  bd.plot
}

create.subset.plot <- function(subdf, xvar, yvar, xlab, ylab, title){
  sb.plot <- ggplot(data = subdf, aes_string(x = xvar, y = yvar)) +
    geom_boxplot(aes_string(fill=xvar), show.legend = FALSE) +
    geom_point(position = "identity", show.legend = FALSE, alpha=1/2, size=1) +
    labs(x=xlab, y=ylab, title=title) +
    theme_minimal() +
    theme(axis.text.x=element_text(angle = 90, vjust=0.5, hjust=1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  sb.plot
}
inlines <- c(
  md_bold(unique(dataset$rule))
)
md_bullet(inlines)
start.time <- read.table(tat.file, header=FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
# plot walltime for all rules benchmarked
exec.time.all <- create.boxjitter.plot(dataset, subset.df, "rule", "walltime",
                                       "Rule", "Total walltime in hours for all rules")
exec.time.all
max.pss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_pss",
                                     "Rule", "PSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.pss.all
max.uss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_uss",
                                     "Rule", "USS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.uss.all
max.rss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss",
                                     "Rule", "RSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.rss.all
max.vms.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_vms",
                                     "Rule", "VMS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.vms.all
cat(paste0("Time threshold (minutes): ", time.param))
# plot the memory usage with boxplots for rules with >1 process otherwise
# just plot the datapoint
max.pss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_pss",
                                        "Rule", "PSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.pss.detail
max.uss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_uss",
                                        "Rule", "USS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.uss.detail
max.rss.detail <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss",
                                        "Rule", "RSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.rss.detail
max.vms.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_vms",
                                        "Rule", "VMS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.vms.detail
# plot detailed grid view
mem.df1 <- subset(filtered.df, max_pss > 5000)
mem.df2 <- subset(filtered.df, max_pss < 5000)
max.pss.subset1 <- create.subset.plot(mem.df1, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory >5GB")
max.pss.subset2 <- create.subset.plot(mem.df2, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory <5GB")
grid.arrange(max.pss.subset1, max.pss.subset2, nrow=1)
# plot the io read
io.read <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_in", "Rule", "Cumulative GB Read")
io.read
# plot io write
io.write <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_out", "Rule", "Cumulative GB Written")
io.write
# plot detailed grid view
ioin.df1 <- subset(filtered.df, io_in > 10)
ioin.df2 <- subset(filtered.df, io_in < 10)
ioout.df1 <- subset(filtered.df, io_out > 10)
ioout.df2 <- subset(filtered.df, io_out < 10)
ioin.subset1 <- create.subset.plot(ioin.df1, "rule", "io_in", "Rule", "Cumulative GB Written", "Rules with >10GB Read")
ioout.subset1 <- create.subset.plot(ioout.df1, "rule", "io_out", "Rule", "Cumulative GB Written", "Rules with >10GB Written")
grid.arrange(ioin.subset1, ioout.subset1, nrow=2)
# plot the CPU time in hours
cpu.time <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpu_time_hrs", "Rule", "CPU time in hours")
cpu.time
# plot cpu to walltime ratio (i.e. CPU/s)
cpuwall.ratio <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpuwall_ratio", "Rule", "CPU/second")
cpuwall.ratio
sessionInfo()
library(tidyverse)

# explicitly specify column types to avoid errors when coercing rows
col.types <- cols(
  s = col_double(),
  `h:m:s` = col_time(format = ""),
  max_rss = col_double(),
  max_vms = col_double(),
  max_uss = col_double(),
  max_pss = col_double(),
  io_in = col_double(),
  io_out = col_double(),
  mean_load = col_double(),
  cpu_time = col_double()
)

# function to read tsv input and append rule name and process
read_benchmarks <- function(filename) {
  read_tsv(filename, col_types=col.types) %>%
    mutate(
      process = gsub("*.tsv","",basename(filename)),
      rule = basename(dirname(filename))
    )
}

# run function for all specified snakemake inputs and output benchmarks tsv
dataset <- snakemake@input[["tsv"]] %>%
  lapply(read_benchmarks) %>%
  bind_rows()
head(dataset)
write_tsv(dataset, snakemake@output[['benchmarks']])
import argparse
from typing import Dict, List, Optional, Set, Tuple

import pandas as pd


def get_args():
    """Handle command line arguments"""
    parser = argparse.ArgumentParser(
        description="Generates a list of samples to exclude based on previously-run QC metrics."
    )
    parser.add_argument("bstats", type=str, help="bcftools stats file name")
    parser.add_argument("outfile", type=str, help="Output file prefix")
    parser.add_argument(
        "-v", "--verify", type=str, help="concatenated verifyBamID *.selfSM output files", default=False
    )
    parser.add_argument(
        "-r", "--ratio", type=float, help="maximum allowed het/hom_alt ratio", default=2.5
    )
    parser.add_argument(
        "-d", "--depth", type=float, help="minimum allowed average depth", default=20.0
    )
    parser.add_argument(
        "-c", "--contam", type=float, help="maximum allowed contamination", default=0.03
    )
    results = parser.parse_args()
    return (
        results.bstats,
        results.outfile,
        results.verify,
        results.ratio,
        results.depth,
        results.contam,
    )


def read_in_bcftools_PSC(bstats: str) -> pd.DataFrame:
    """Read into a dataframe just the PSC portion of bcftools stats output"""
    skip: list[int] = [i for i, line in enumerate(open(bstats)) if not line.startswith("PSC")]
    return pd.read_csv(
        bstats,
        sep="\t",
        skiprows=skip,
        header=None,
        names=[
            "PSC", "id", "sample", "nRefHom", "nNonRefHom", "nHets",
            "nTransitions", "nTransversions", "nIndels", "average depth",
            "nSingletons", "nHapRef", "nHapAlt", "nMissing",
        ],
    )


def exclude_high_het_hom(df: pd.DataFrame, r: float) -> pd.DataFrame:
    """Exclude any samples with a het/hom ratio over 2.5"""
    df.loc[:, "het_hom_ratio"] = df["nHets"] / df["nNonRefHom"]
    df_het = df.loc[(df["het_hom_ratio"] > r)].copy()
    if df_het.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_het.loc[:, "exclude_reason"] = "high_het_hom"
        return df_het[["sample", "exclude_reason"]]


def exclude_low_depth(df: pd.DataFrame, d: float) -> pd.DataFrame:
    """Setting a 20x avg depth cutoff

    This is reported by the lab as well, but this should catch samples that are
    borderline, that were accidentally included despite low coverage, or that are
    from outside collaborators."""
    df_depth = df.loc[df["average depth"] < d].copy()
    if df_depth.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_depth.loc[:, "exclude_reason"] = "low_depth"
        return df_depth[["sample", "exclude_reason"]]


def read_in_verifybamid(verify: str) -> pd.DataFrame:
    """Requires single *.selfSM files to be concatenated without headers"""
    return pd.read_csv(
        verify,
        sep="\t",
        header=None,
        names=[
            "#SEQ_ID", "RG", "CHIP_ID", "#SNPS", "#READS", "AVG_DP",
            "FREEMIX", "FREELK1", "FREELK0", "FREE_RH", "FREE_RA",
            "CHIPMIX", "CHIPLK1", "CHIPLK0", "CHIP_RH", "CHIP_RA",
            "DPREF", "RDPHET", "RDPALT",
        ],
    )


def exclude_contam(df_v: pd.DataFrame, c: float) -> pd.DataFrame:
    """Setting a 3% contamination threshold"""
    df_contam = df_v.loc[df_v["FREEMIX"] > c].copy()
    if df_contam.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_contam.loc[:, "exclude_reason"] = "contamination"
        t = df_contam["#SEQ_ID"].str.split(":", expand=True)
        if len(t.columns) == 2:
            df_contam["sample"] = t[1]
        else:
            # if you're running a single sample, verifyBamID doesn't use the
            # sample_self:sample format in this field
            df_contam["sample"] = t[0]
        return df_contam[["sample", "exclude_reason"]]


def combine_exclusions(df_list: list) -> pd.DataFrame:
    """Combine per-metric exclusion tables, joining exclusion reasons per sample"""
    all_df = pd.concat(df_list).groupby("sample")["exclude_reason"].apply(",".join).reset_index()
    return all_df


if __name__ == "__main__":
    bstats, outfile, verify, r, d, c = get_args()
    df = read_in_bcftools_PSC(bstats)
    exclude1 = exclude_high_het_hom(df, r)
    exclude2 = exclude_low_depth(df, d)
    if verify:
        df_v = read_in_verifybamid(verify)
        exclude3 = exclude_contam(df_v, c)
        exclude_df = combine_exclusions([exclude1, exclude2, exclude3])
    else:
        exclude_df = combine_exclusions([exclude1, exclude2])
    exclude_df.to_csv(outfile + "_with_annotation.tsv", sep="\t", header=None, index=None)
    exclude_df[["sample"]].to_csv(outfile + ".tsv", sep="\t", header=None, index=None)
import argparse as ap
import re
import shutil
import sys

import pandas as pd


def get_args() -> list:
    """Get command line arguments"""
    parser = ap.ArgumentParser(
        description="Creates a plink-formatted ped from a tsv with subject and sex"
    )
    parser.add_argument("infile", type=str, help="linker filename")
    parser.add_argument(
        "outprefix", type=str, help="output filename prefix (.ped extension will be added)"
    )
    results = parser.parse_args()
    return (results.infile, results.outprefix)


def check_if_ped(infile: str) -> bool:
    """Look for .ped file extension

    This script can either be provided a tsv, which it will convert to a
    plink-style ped, or it can be provided a ped which it will simply copy
    to the expected output name/location.
    """
    if re.search(".ped$", infile, re.IGNORECASE):
        return True
    else:
        return False


def read_in_linker(infile: str) -> pd.DataFrame:
    """Read tsv linker file into a dataframe"""
    df = pd.read_table(infile, sep="\t")
    return df


def check_column_number(df: pd.DataFrame):
    """Check for two and only two columns in input tsv"""
    if len(df.columns) != 2:
        raise


def check_column_headers(df: pd.DataFrame):
    """Check for required headers in input tsv"""
    if ["Sample", "Sex"] != list(df.columns)[0:2]:
        raise


def add_ped_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add dummy columns to conform to plink ped format

    Should end up with a dataframe like so:
    Sample Sample 0 0 [F|M] -9
    """
    df.insert(1, "a", df.loc[:, "Sample"])
    df.insert(2, "b", 0)
    df.insert(2, "c", 0)
    df["d"] = "-9"
    return df


def check_sex_values(df: pd.DataFrame) -> pd.DataFrame:
    """Check for allowed values for sex information

    Input tsv may only have (case-insensitive) f, female, m, or male to encode
    sex.  Missing data can be encoded by one of the standard NA values that
    pandas can deal with automatically.  The error message emitted when this
    error is raised lists out the allowed values.
    """
    df["Sex"] = df["Sex"].str.lower()
    if not all(df[~df["Sex"].isna()].isin(["f", "female", "m", "male"])["Sex"]):
        raise


def encode_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Convert sex information to ped encoding

    Missing sex information is converted to 0; male and female are converted
    to 1 and 2 respectively.
    """
    df["Sex"] = df["Sex"].str.lower()
    df["Sex"] = df["Sex"].replace(["f", "female", "m", "male"], [2, 2, 1, 1])
    df = df.fillna(0)
    df["Sex"] = df["Sex"].astype(int)
    return df


if __name__ == "__main__":
    infile, outprefix = get_args()
    if check_if_ped(infile):
        shutil.copyfile(infile, outprefix + ".ped")
        sys.exit(0)
    df = read_in_linker(infile)
    try:
        check_column_number(df)
    except Exception:
        sys.exit("Error: {} is required to have exactly two columns.".format(infile))
    try:
        check_column_headers(df)
    except Exception:
        sys.exit(
            "Error: Correct headers not detected in {} (requires 'Sample' and 'Sex', in that order).".format(infile)
        )
    try:
        check_sex_values(df)
    except Exception:
        sys.exit(
            "Error: Sex reported in {} must be represented by f, female, m, or male (case insensitive). Missing data can be represented by any of the following strings: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', or 'null'.".format(infile)
        )
    df = add_ped_columns(df)
    df = encode_sex(df)
    df.to_csv(outprefix + ".ped", sep="\t", header=False, index=False)
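For reference, generate_ped.py expects the linker TSV to have exactly two columns headed Sample and Sex, with sex encoded as f/female/m/male (case-insensitive) or a standard missing value. A minimal sketch of creating and converting such a file; the subject names and prefix here are invented:

    # illustrative linker file for generate_ped.py (tab-separated; values are made up)
    printf 'Sample\tSex\nsubjA\tf\nsubjB\tmale\n' > linker.tsv
    python workflow/scripts/generate_ped.py linker.tsv example_prefix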
## Load required R packages
library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)
## Load tested functions for this report
source(snakemake@input[["r_resources"]])
## Configure standard theme information for ggplot2
my.theme <- theme_light() + theme(plot.title = element_text(size = 15, hjust = 0.5),
                                  axis.title = element_text(size = 14),
                                  axis.text = element_text(size = 12),
                                  strip.background = element_blank(),
                                  strip.text = element_text(size = 14, colour = "black"))
input.subjects <- snakemake@params[["input_samples"]]
fastqc.filename <- snakemake@input[["fastqc"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
res <- prepare.initial.subject.tracking.table(input.subjects)
res <- add.fastqc.data(res, fastqc.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL
knitr::kable(res, caption = "Summary of Subject Fate") %>%
  kableExtra::kable_styling("striped", position = "left", full_width = FALSE)
write.output.table(res, tsv.path)
start.time <- read.table(start.time.filename, header = FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
sessionInfo()
## Load required R packages
library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)
## Load tested functions for this report
source(snakemake@input[["r_resources"]])
## Configure standard theme information for ggplot2
my.theme <- theme_light() + theme(plot.title = element_text(size = 15, hjust = 0.5),
                                  axis.title = element_text(size = 14),
                                  axis.text = element_text(size = 12),
                                  strip.background = element_blank(),
                                  strip.text = element_text(size = 14, colour = "black"))
input.subjects <- snakemake@params[["input_samples"]]
exclude.reasons.filename <- snakemake@input[["exclude_list"]]
output.subjects.filename <- snakemake@input[["output_subject_list"]]
somalier.run <- snakemake@params[["somalier"]]
somalier.relatedness.filename <- snakemake@input[["relatedness"]]
guess.ploidy.filename <- snakemake@input[["sex"]]
sex.linker.filename <- snakemake@input[["ped_file"]]
fastqc.filename <- snakemake@input[["fastqc"]]
bcftools.stats.filename <- snakemake@input[["bcftools_stats"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
run.mode <- snakemake@params[["run_mode"]]
somalier.related.min <- 0.5
somalier.duplicate.min <- 0.95
res <- prepare.subject.tracking.table(input.subjects, output.subjects.filename, exclude.reasons.filename)
if (!run.mode == "jointgeno"){
  res <- add.fastqc.data(res, fastqc.filename)
}
res <- add.coverage(res, bcftools.stats.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL
knitr::kable(res, caption = "Summary of Subject Fate") %>%
  kableExtra::kable_styling("striped", position = "left", full_width = FALSE)
write.output.table(res, tsv.path)
res <- report.sex.discordances(sex.linker.filename, guess.ploidy.filename)
if (nrow(res) == 0) {
  cat("No sex discordant subjects were identified.")
} else {
  print(knitr::kable(res, caption = "Sex Discordant Subjects") %>%
    kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename, somalier.duplicate.min, 1.0)
if (nrow(res) == 0) {
  cat("No duplicate subjects were detected above the specified relatedness cutoff.")
} else {
  print(knitr::kable(res, caption = "Genetic Duplicates") %>%
    kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename, somalier.related.min, somalier.duplicate.min)
if (nrow(res) == 0) {
  cat("No non-identical related subjects were detected within the specified range.")
} else {
  print(knitr::kable(res, caption = "Potentially Related Non-identical Subjects") %>%
    kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
start.time <- read.table(start.time.filename, header = FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
sessionInfo()
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/bballew/54gene-wgs-germline
Name: 54gene-wgs-germline
Version: 1.0.0
Other Versions:
Downloaded: 0
Copyright: Public Domain
License: MIT License