This workflow is designed to take either FASTQs or gVCFs as input and emit a joint-called, multi-sample VCF. Please see Read the Docs for additional documentation.
You can find a small test dataset and pre-configured files for this pipeline here.
Authors
- Esha Joshi
- Cameron Palmer
- Bari Jane Ballew (@bballew)
Usage
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository.
Step 1: Obtain a copy of this workflow
Clone this repository to your local system, into the place where you want to perform the data analysis.
git clone git@gitlab.com:data-analysis5/54gene-wgs-germline.git
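After cloning, change into the repository root (the directory name below is assumed from the repository URL) so that the relative paths used in later steps resolve:

cd 54gene-wgs-germline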
Step 2: Configure workflow
The pipeline inputs include:
- A configuration file
- A manifest file (a hypothetical sketch follows this list)
- A list of intervals
- A sex linker file
- A MultiQC config file (provided)
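For orientation only: in fastq mode, the manifest maps samples to their read files. The sketch below is an assumed shape, not this pipeline's actual schema; consult Read the Docs for the authoritative column layout.

# hypothetical manifest sketch -- column names and layout are assumptions
printf 'sample\treadgroup\tr1\tr2\n' > manifest.tsv
printf 'NA12878\trg1\tfastqs/NA12878_R1.fastq.gz\tfastqs/NA12878_R2.fastq.gz\n' >> manifest.tsv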
Step 3: Install the run-time environment
If needed, install miniconda by following the steps here.
- Create a conda environment with, minimally, the dependencies defined in environment.yaml.
# create the env
conda env create -f environment.yaml
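To confirm the environment resolves and Snakemake is available (a quick sanity check, not a required step):

# activate the new environment and verify the snakemake executable
conda activate 54gene-wgs-germline
snakemake --version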
Step 4: Execute workflow
Activate the conda environment:
conda activate 54gene-wgs-germline
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
To run the pipeline in a cluster environment, edit wrapper.sh as needed for your system, and then run via
bash run.sh
Alternatively, you may run Snakemake pipelines on a cluster via something like
snakemake --use-conda --cluster sbatch --jobs 100
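If you bypass run.sh, per-rule resources can also be forwarded to the scheduler through the --cluster string. The sbatch flags below are standard SLURM options, but whether {threads} and {resources.mem_mb} are defined for every rule in this pipeline is an assumption; adjust to match the rules' actual resource keys.

# example SLURM submission; the resource placeholders are assumptions
snakemake --use-conda --jobs 100 \
    --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"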
Step 5: Investigate results
Upon pipeline completion, verify that all steps have completed without error by checking the top-level Snakemake log. The bottom few lines of the log should contain something like
nnn of nnn steps (100%) done
Additional job logs (when run on a cluster) are stored in the logs/ directory.
All pipeline results are stored in the results/ directory.
The hard-filtered, joint-called VCF can be found at results/HaplotypeCaller/filtered/HC_variants.hardfiltered.vcf.gz.
For future joint-calling, the gVCFs are located at results/HaplotypeCaller/called/<sample>_all_chroms.g.vcf.gz.
Deduplicated and post-BQSR BAMs are found at results/bqsr/<sample>.bam.
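A couple of quick checks on the named outputs (bcftools is already a dependency of this workflow):

# count PASS records in the hard-filtered, joint-called VCF
bcftools view -H -f PASS results/HaplotypeCaller/filtered/HC_variants.hardfiltered.vcf.gz | wc -l
# list the samples included in the joint call
bcftools query -l results/HaplotypeCaller/filtered/HC_variants.hardfiltered.vcf.gz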
Code Snippets
shell:
    "mkdir -p {output.t} && "
    "bwa mem "
    "-K 10000000 -M "
    '-R "@RG\\tCN:{params.center}\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}\\tLB:N/A" '
    "-t {threads} "
    "{input.r} {input.r1} {input.r2} | "
    "samtools sort -@ {params.samtools_threads} -T {output.t} -m {params.samtools_mem}M -o {output.bam} - "
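For readers unfamiliar with Snakemake templating, the braced fields above are filled in per sample at run time. With made-up values (the sample name, thread counts, and file names below are purely illustrative), the rendered command looks roughly like:

# illustrative expansion only -- all concrete values here are invented
mkdir -p tmp/NA12878 && \
bwa mem -K 10000000 -M \
    -R "@RG\tCN:54gene\tID:NA12878_rg1\tSM:NA12878\tPL:ILLUMINA\tLB:N/A" \
    -t 8 resources/Homo_sapiens_assembly38.fasta NA12878_R1.fastq.gz NA12878_R2.fastq.gz | \
samtools sort -@ 4 -T tmp/NA12878 -m 2048M -o results/NA12878.bam -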
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" MarkDuplicates '
    "TMP_DIR={params.t} "
    "REMOVE_DUPLICATES=true "
    "INPUT={params.l} "
    "OUTPUT={output.bam} "
    "METRICS_FILE={output.metrics}"

shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" BaseRecalibrator '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--known-sites {input.snps} "
    "--known-sites {input.indels} "
    "-O {output.table}"

shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" ApplyBQSR '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--bqsr-recal-file {input.recal} "
    "-O {output.bam}"

shell:
    "samtools stats {input.bam} > {output}"

shell:
    "ln -s {input.r1} {output.r1} && "
    "ln -s {input.r2} {output.r2}"

shell:
    "fastqc {input.r1} -d {params.t} --quiet -t {threads} --outdir={params.o} && "
    "fastqc {input.r2} -d {params.t} --quiet -t {threads} --outdir={params.o}"

shell:
    "fastp -i {input.r1} -I {input.r2} "
    "-w {threads} "
    "-o {output.r1_paired} -O {output.r2_paired} "
    "-h {output.h} -j {output.j} "
    "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA "
    "--adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT "
    "-l 36"

shell:
    "bcftools norm "
    "-f {input.ref} "
    "-m - "
    "--threads {threads} "
    "{input.vcf} "
    "-Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"

shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type SNP "
    "-O {output.vcf}"

shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type INDEL "
    "-O {output.vcf}"

shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "SOR > 3.0" --filter-name "SOR3" '
    '-filter "FS > 60.0" --filter-name "FS60" '
    '-filter "MQ < 40.0" --filter-name "MQ40" '
    '-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" '
    '-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" '
    "-O {output.vcf}"
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "FS > 200.0" --filter-name "FS200" '
    '-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" '
    "-O {output.vcf}"
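The two VariantFiltration blocks above apply GATK's documented hard-filter heuristics for SNPs and indels, respectively. To tally how many records each filter caught after the fact (the VCF path below is a placeholder):

# count records per FILTER value; replace the path with the filtered VCF of interest
bcftools query -f '%FILTER\n' filtered.vcf.gz | sort | uniq -c | sort -rn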
shell:
    "bcftools view --threads {threads} -r {params.region} -Oz -o {output.vcf} {input.vcf} && "
    "tabix -p vcf {output.vcf}"

shell:
    "verifyBamID "
    "--best "
    "--vcf {input.vcf} "
    "--bam {input.bam} "
    "--out {params.d}"

shell:
    'grep -v "^#" {params}*selfSM > {output}'

shell:
    "bcftools concat -a -Ov {input.snps} {input.indels} | "
    "bcftools sort -T {params.t} -Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"

shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" CollectVariantCallingMetrics '
    "--TMP_DIR {params.t} "
    "-I {input.calls} "
    "--DBSNP {input.dbsnp} "
    "-SD {input.d} "
    "-O {params.d}"

shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" HaplotypeCaller '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "-ERC GVCF "
    "-L {input.intervalfile} "
    "-O {output.gvcf} "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"

shell:
    "bgzip -c {input.gvcf} > {output.gvcf} && "
    "tabix -p vcf {input.gvcf}.gz"

shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GatherVcfs '
    "-I {params.l} "
    "-O {output}"

shell:
    "tabix -p vcf {input}"

shell:
    "ln -s {input.gvcf} {output.gvcf} && "
    "ln -s {input.index} {output.index}"

shell:
    "n=$(bcftools query -l {input.gvcf});"
    'echo "${{n}}\t{input.gvcf}" > {output}'

shell:
    "for i in {params.mapDir}*.map; do cat $i >> {output}; done"

shell:
    'export _JAVA_OPTIONS="" && '
    "rm -r {params.db} && "
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenomicsDBImport '
    "--batch-size {params.batch_size} "
    "--disable-bam-index-caching "
    "--sample-name-map {input.sampleMap} "
    "--genomicsdb-workspace-path {params.db} "
    "-L {input.intervalfile} "
    "--tmp-dir {params.t} "
    "--reader-threads {params.reader_threads} "
    "--genomicsdb-shared-posixfs-optimizations"

shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenotypeGVCFs '
    "-R {input.r} "
    "-V gendb://{params.db} "
    "-O {output.vcf} "
    "--tmp-dir {params.t} "
    "--max-alternate-alleles {params.m} "
    "--genomicsdb-max-alternate-alleles {params.g} "
    "--max-genotype-count {params.c} "
    "-stand-call-conf 30 "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"

shell:
    "bcftools concat -a {input.vcfList} -Ou | "
    "bcftools sort -T {params.t} -Oz -o {output.projectVCF} && "
    "tabix -p vcf {output.projectVCF}"

shell:
    "bcftools stats "
    "--af-bins 0.01,0.05,0.1,1 "
    "-F {input.r} "
    "-s- {input.vcf} > {output}"

shell:
    "plot-vcfstats -P -p {params.d} {input}"

shell:
    "python workflow/scripts/generate_ped.py {input} {params.prefix}"

shell:
    "somalier extract "
    "-d {params.d} "
    "--sites $CONDA_PREFIX/share/somalier/sites.hg38.vcf.gz "
    "-f {input.r} {input.vcf} && "
    "somalier relate --ped {input.ped} -o {params.o} {params.d}/*.somalier"

shell:
    "touch {output}"

shell:
    "bcftools +guess-ploidy -g hg38 {input.vcf} -v > {output.txt} && "
    "guess-ploidy.py {output.txt} {params.p}"

shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} --verify {input.v} -r {params.r} -d {params.d} -c {params.c}"

shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} -r {params.r} -d {params.d}"

shell:
    "bcftools view -S ^{input.l} --threads {threads} -Ou {input.v} | "
    "bcftools annotate --threads {threads} --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o {output.v} && "
    "tabix -p vcf {output.v}"

shell:
    "multiqc --force -o {params.outDir} -n {params.outName} --config {input.mqc_config} {params.inDirs} {params.relatedness}"

shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs} {params.relatedness}"

shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs}"

shell:
    "bcftools query -l {input} > {output}"

script:
    "../scripts/run_summary.Rmd"

script:
    "../scripts/run_summary.Rmd"

script:
    "../scripts/run_initial_summary.Rmd"

script:
    "../scripts/combine_benchmarks.R"

script:
    "../scripts/benchmarking_report.Rmd"
shell:
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx resources/ --no-sign-request"
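Since every file comes from the same bucket prefix, the same download can be written more compactly as a loop. This is a functionally equivalent sketch, not the pipeline's own code:

# equivalent loop over the same Broad references bucket
base=s3://broad-references/hg38/v0
for f in Homo_sapiens_assembly38.fasta Homo_sapiens_assembly38.dict \
         Homo_sapiens_assembly38.fasta.64.{alt,amb,ann,bwt,pac,sa} \
         Homo_sapiens_assembly38.fasta.fai \
         Homo_sapiens_assembly38.known_indels.vcf.gz{,.tbi} \
         Homo_sapiens_assembly38.dbsnp138.vcf{,.idx}; do
    aws s3 cp "$base/$f" resources/ --no-sign-request
done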
shell:
    "date +%s > {output}"
# Load required packages
knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=8)
require(knitr, quietly = TRUE)
require(ggplot2, quietly=TRUE)
require(RColorBrewer, quietly=TRUE)
require(dplyr, quietly = TRUE)
require(gluedown, quietly = TRUE)
require(gridExtra, quietly = TRUE)
require(lubridate, quietly = TRUE)

# set input variable from snakemake
benchmarks.table <- snakemake@input[["benchmarks"]]

# read in tsv
dataset <- read.table(benchmarks.table, header=TRUE, sep="\t")

# some light date/time manipulation for ease in plotting
dataset$times <- hms(dataset$h.m.s)
d.lub <- hour(dataset$times) + minute(dataset$times)/60 + second(dataset$times)/3600
dataset$walltime <- d.lub

# some upfront unit conversions & derivations for ease in comprehension
#
# convert IO units from MB to GB
dataset$io_in <- dataset$io_in / 1000
dataset$io_out <- dataset$io_out / 1000
#
# convert CPU time from seconds to hours
dataset$cpu_time_hrs <- dataset$cpu_time / 3600
#
# calculate cpu/walltime ratio
dataset$cpuwall_ratio <- (dataset$cpu_time / dataset$s)

# subset the dataset to get data for rules where there is more than 1
# observation/process to pass to geom_boxplot
subset.df <- dataset %>% group_by(rule) %>% filter(n() > 1)

# subset the dataset to get data points for rules meeting the time_threshold
# parameter specified at the rule level
time.param <- snakemake@params[["threshold"]]
time.threshold <- time.param * 60
filtered.df <- dataset %>% group_by(rule) %>% filter(s >= time.threshold)

# start time to calculate elapsed time
tat.file <- snakemake@input[["start_time"]]
# plotting function for box/jitter plot where single data points are omitted
# from boxplots
create.boxjitter.plot <- function(df, df.subset, xvar, yvar, xlab, ylab) {
  bd.plot <- ggplot(data=df, aes_string(x=xvar, y=yvar)) +
    geom_boxplot(data = df.subset, aes_string(fill=xvar), show.legend = FALSE) +
    geom_point(position="identity", show.legend = FALSE, alpha=1/2, size=1) +
    labs(x=xlab, y=ylab) +
    theme_minimal() +
    theme(axis.text.x=element_text(angle = 90, vjust=0.5, hjust=1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  bd.plot
}

create.subset.plot <- function(subdf, xvar, yvar, xlab, ylab, title){
  sb.plot <- ggplot(data = subdf, aes_string(x = xvar, y = yvar)) +
    geom_boxplot(aes_string(fill=xvar), show.legend = FALSE) +
    geom_point(position = "identity", show.legend = FALSE, alpha=1/2, size=1) +
    labs(x=xlab, y=ylab, title=title) +
    theme_minimal() +
    theme(axis.text.x=element_text(angle = 90, vjust=0.5, hjust=1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  sb.plot
}
inlines <- c(
  md_bold(unique(dataset$rule))
)
md_bullet(inlines)

start.time <- read.table(tat.file, header=FALSE)[1,1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)

# plot walltime for all rules benchmarked
exec.time.all <- create.boxjitter.plot(dataset, subset.df, "rule", "walltime",
                                       "Rule", "Total walltime in hours for all rules")
exec.time.all

max.pss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_pss",
                                     "Rule", "PSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.pss.all

max.uss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_uss",
                                     "Rule", "USS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.uss.all

max.rss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss",
                                     "Rule", "RSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.rss.all

max.vms.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_vms",
                                     "Rule", "VMS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.vms.all

cat(paste0("Time threshold (minutes): ", time.param))

# plot the memory usage with boxplots for rules with >1 process, otherwise
# just plot the data point
max.pss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_pss",
                                        "Rule", "PSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.pss.detail

max.uss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_uss",
                                        "Rule", "USS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.uss.detail
max.rss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_rss",
                                        "Rule", "RSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.rss.detail
max.vms.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_vms",
                                        "Rule", "VMS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.vms.detail

# plot detailed grid view
mem.df1 <- subset(filtered.df, max_pss > 5000)
mem.df2 <- subset(filtered.df, max_pss < 5000)
max.pss.subset1 <- create.subset.plot(mem.df1, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory >5GB")
max.pss.subset2 <- create.subset.plot(mem.df2, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory <5GB")
grid.arrange(max.pss.subset1, max.pss.subset2, nrow=1)

# plot the io read
io.read <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_in", "Rule", "Cumulative GB Read")
io.read

# plot io write
io.write <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_out", "Rule", "Cumulative GB Written")
io.write
# plot detailed grid view
ioin.df1 <- subset(filtered.df, io_in > 10)
ioin.df2 <- subset(filtered.df, io_in < 10)
ioout.df1 <- subset(filtered.df, io_out > 10)
ioout.df2 <- subset(filtered.df, io_out < 10)
ioin.subset1 <- create.subset.plot(ioin.df1, "rule", "io_in", "Rule", "Cumulative GB Read", "Rules with >10GB Read")
ioout.subset1 <- create.subset.plot(ioout.df1, "rule", "io_out", "Rule", "Cumulative GB Written", "Rules with >10GB Written")
grid.arrange(ioin.subset1, ioout.subset1, nrow=2)
# plot the CPU time in hours
cpu.time <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpu_time_hrs", "Rule", "CPU time in hours")
cpu.time

# plot cpu to walltime ratio (i.e. CPU/s)
cpuwall.ratio <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpuwall_ratio", "Rule", "CPU/second")
cpuwall.ratio

sessionInfo()
library(tidyverse)

# explicitly specify column types to avoid errors when coercing rows
col.types <- cols(
  s = col_double(),
  `h:m:s` = col_time(format = ""),
  max_rss = col_double(),
  max_vms = col_double(),
  max_uss = col_double(),
  max_pss = col_double(),
  io_in = col_double(),
  io_out = col_double(),
  mean_load = col_double(),
  cpu_time = col_double()
)

# function to read tsv input and append rule name and process
read_benchmarks <- function(filename) {
  read_tsv(filename, col_types=col.types) %>%
    mutate(
      process = gsub("\\.tsv$", "", basename(filename)),
      rule = basename(dirname(filename))
    )
}

# run function for all specified snakemake inputs and output benchmarks tsv
dataset <- snakemake@input[["tsv"]] %>%
  lapply(read_benchmarks) %>%
  bind_rows()

head(dataset)

write_tsv(dataset, snakemake@output[['benchmarks']])
import argparse
from typing import Dict, List, Optional, Set, Tuple

import pandas as pd


def get_args():
    """Handle command line arguments"""
    parser = argparse.ArgumentParser(
        description="Generates a list of samples to exclude based on previously-run QC metrics."
    )
    parser.add_argument("bstats", type=str, help="bcftools stats file name")
    parser.add_argument("outfile", type=str, help="Output file prefix")
    parser.add_argument(
        "-v",
        "--verify",
        type=str,
        help="concatenated verifyBamID *.selfSM output files",
        default=False,
    )
    parser.add_argument(
        "-r", "--ratio", type=float, help="maximum allowed het/hom_alt ratio", default=2.5
    )
    parser.add_argument(
        "-d", "--depth", type=float, help="minimum allowed average depth", default=20.0
    )
    parser.add_argument(
        "-c", "--contam", type=float, help="maximum allowed contamination", default=0.03
    )
    results = parser.parse_args()
    return (
        results.bstats,
        results.outfile,
        results.verify,
        results.ratio,
        results.depth,
        results.contam,
    )


def read_in_bcftools_PSC(bstats: str) -> pd.DataFrame:
    """Read into a dataframe just the PSC portion of bcftools stats output"""
    skip: list[int] = [i for i, line in enumerate(open(bstats)) if not line.startswith("PSC")]
    return pd.read_csv(
        bstats,
        sep="\t",
        skiprows=skip,
        header=None,
        names=[
            "PSC",
            "id",
            "sample",
            "nRefHom",
            "nNonRefHom",
            "nHets",
            "nTransitions",
            "nTransversions",
            "nIndels",
            "average depth",
            "nSingletons",
            "nHapRef",
            "nHapAlt",
            "nMissing",
        ],
    )


def exclude_high_het_hom(df: pd.DataFrame, r: float) -> pd.DataFrame:
    """Exclude any samples with a het/hom ratio over the allowed maximum (2.5 by default)"""
    df.loc[:, "het_hom_ratio"] = df["nHets"] / df["nNonRefHom"]
    df_het = df.loc[(df["het_hom_ratio"] > r)].copy()
    if df_het.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_het.loc[:, "exclude_reason"] = "high_het_hom"
        return df_het[["sample", "exclude_reason"]]


def exclude_low_depth(df: pd.DataFrame, d: float) -> pd.DataFrame:
    """Set an average depth cutoff (20x by default)

    This is reported by the lab as well, but this should catch samples that
    are borderline, that were accidentally included despite low coverage, or
    that are from outside collaborators."""
    df_depth = df.loc[df["average depth"] < d].copy()
    if df_depth.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_depth.loc[:, "exclude_reason"] = "low_depth"
        return df_depth[["sample", "exclude_reason"]]


def read_in_verifybamid(verify: str) -> pd.DataFrame:
    """Requires single *.selfSM files to be concatenated without headers"""
    return pd.read_csv(
        verify,
        sep="\t",
        header=None,
        names=[
            "#SEQ_ID",
            "RG",
            "CHIP_ID",
            "#SNPS",
            "#READS",
            "AVG_DP",
            "FREEMIX",
            "FREELK1",
            "FREELK0",
            "FREE_RH",
            "FREE_RA",
            "CHIPMIX",
            "CHIPLK1",
            "CHIPLK0",
            "CHIP_RH",
            "CHIP_RA",
            "DPREF",
            "RDPHET",
            "RDPALT",
        ],
    )


def exclude_contam(df_v: pd.DataFrame, c: float) -> pd.DataFrame:
    """Set a contamination threshold (3% by default)"""
    df_contam = df_v.loc[df_v["FREEMIX"] > c].copy()
    if df_contam.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_contam.loc[:, "exclude_reason"] = "contamination"
        t = df_contam["#SEQ_ID"].str.split(":", expand=True)
        if len(t.columns) == 2:
            df_contam["sample"] = t[1]
        else:
            # if you're running a single sample, verifyBamID doesn't use the
            # sample_self:sample format in this field
            df_contam["sample"] = t[0]
        return df_contam[["sample", "exclude_reason"]]


def combine_exclusions(df_list: list) -> pd.DataFrame:
    """Combine exclusion lists, joining all exclusion reasons per sample"""
    all_df = pd.concat(df_list).groupby("sample")["exclude_reason"].apply(",".join).reset_index()
    return all_df


if __name__ == "__main__":
    bstats, outfile, verify, r, d, c = get_args()
    df = read_in_bcftools_PSC(bstats)
    exclude1 = exclude_high_het_hom(df, r)
    exclude2 = exclude_low_depth(df, d)
    if verify:
        df_v = read_in_verifybamid(verify)
        exclude3 = exclude_contam(df_v, c)
        exclude_df = combine_exclusions([exclude1, exclude2, exclude3])
    else:
        exclude_df = combine_exclusions([exclude1, exclude2])
    exclude_df.to_csv(outfile + "_with_annotation.tsv", sep="\t", header=None, index=None)
    exclude_df[["sample"]].to_csv(outfile + ".tsv", sep="\t", header=None, index=None)
import argparse as ap
import re
import shutil
import sys

import pandas as pd


def get_args() -> list:
    """Get command line arguments"""
    parser = ap.ArgumentParser(
        description="Creates a plink-formatted ped from a tsv with subject and sex"
    )
    parser.add_argument("infile", type=str, help="linker filename")
    parser.add_argument(
        "outprefix", type=str, help="output filename prefix (.ped extension will be added)"
    )
    results = parser.parse_args()
    return (results.infile, results.outprefix)


def check_if_ped(infile: str) -> bool:
    """Look for .ped file extension

    This script can either be provided a tsv, which it will convert to a
    plink-style ped, or it can be provided a ped which it will simply copy
    to the expected output name/location.
    """
    if re.search(".ped$", infile, re.IGNORECASE):
        return True
    else:
        return False


def read_in_linker(infile: str) -> pd.DataFrame:
    """Read tsv linker file into a dataframe"""
    df = pd.read_table(infile, sep="\t")
    return df


def check_column_number(df: pd.DataFrame):
    """Check for two and only two columns in input tsv"""
    if len(df.columns) != 2:
        raise


def check_column_headers(df: pd.DataFrame):
    """Check for required headers in input tsv"""
    if ["Sample", "Sex"] != list(df.columns)[0:2]:
        raise


def add_ped_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add dummy columns to conform to plink ped format

    Should end up with a dataframe like so:
    Sample  Sample  0  0  [F|M]  -9
    """
    df.insert(1, "a", df.loc[:, "Sample"])
    df.insert(2, "b", 0)
    df.insert(2, "c", 0)
    df["d"] = "-9"
    return df


def check_sex_values(df: pd.DataFrame) -> pd.DataFrame:
    """Check for allowed values for sex information

    Input tsv may only have (case-insensitive) f, female, m, or male to
    encode sex.  Missing data can be encoded by one of the standard NA
    values that pandas can deal with automatically.  The error message
    emitted when this error is raised lists out the allowed values.
    """
    df["Sex"] = df["Sex"].str.lower()
    if not all(df[~df["Sex"].isna()].isin(["f", "female", "m", "male"])["Sex"]):
        raise


def encode_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Convert sex information to ped encoding

    Missing sex information is converted to 0; male and female are
    converted to 1 and 2 respectively.
    """
    df["Sex"] = df["Sex"].str.lower()
    df["Sex"] = df["Sex"].replace(["f", "female", "m", "male"], [2, 2, 1, 1])
    df = df.fillna(0)
    df["Sex"] = df["Sex"].astype(int)
    return df


if __name__ == "__main__":
    infile, outprefix = get_args()
    if check_if_ped(infile):
        shutil.copyfile(infile, outprefix + ".ped")
        sys.exit(0)
    df = read_in_linker(infile)
    try:
        check_column_number(df)
    except Exception:
        sys.exit("Error: {} is required to have exactly two columns.".format(infile))
    try:
        check_column_headers(df)
    except Exception:
        sys.exit(
            "Error: Correct headers not detected in {} (requires 'Sample' and 'Sex', in that order).".format(
                infile
            )
        )
    try:
        check_sex_values(df)
    except Exception:
        sys.exit(
            "Error: Sex reported in {} must be represented by f, female, m, or male (case insensitive). "
            "Missing data can be represented by any of the following strings: '', '#N/A', '#N/A N/A', '#NA', "
            "'-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', "
            "'n/a', 'nan', or 'null'.".format(infile)
        )
    df = add_ped_columns(df)
    df = encode_sex(df)
    df.to_csv(outprefix + ".ped", sep="\t", header=False, index=False)
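Given the checks in this script, the linker tsv needs exactly two columns headed Sample and Sex. A minimal worked example (sample names invented):

# build a two-column linker and convert it to a plink-style ped
printf 'Sample\tSex\nNA12878\tfemale\nNA12891\tM\n' > sex_linker.tsv
python workflow/scripts/generate_ped.py sex_linker.tsv cohort
# cohort.ped now contains lines like: NA12878  NA12878  0  0  2  -9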
## Load required R packages
library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)

## Load tested functions for this report
source(snakemake@input[["r_resources"]])

## Configure standard theme information for ggplot2
my.theme <- theme_light() + theme(plot.title = element_text(size = 15, hjust = 0.5),
                                  axis.title = element_text(size = 14),
                                  axis.text = element_text(size = 12),
                                  strip.background = element_blank(),
                                  strip.text = element_text(size = 14, colour = "black"))

input.subjects <- snakemake@params[["input_samples"]]
fastqc.filename <- snakemake@input[["fastqc"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]

res <- prepare.initial.subject.tracking.table(input.subjects)
res <- add.fastqc.data(res, fastqc.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL
knitr::kable(res, caption = "Summary of Subject Fate") %>%
  kableExtra::kable_styling("striped", position = "left", full_width = FALSE)
write.output.table(res, tsv.path)

start.time <- read.table(start.time.filename, header = FALSE)[1, 1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)

sessionInfo()

## Load required R packages
library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)

## Load tested functions for this report
source(snakemake@input[["r_resources"]])

## Configure standard theme information for ggplot2
my.theme <- theme_light() + theme(plot.title = element_text(size = 15, hjust = 0.5),
                                  axis.title = element_text(size = 14),
                                  axis.text = element_text(size = 12),
                                  strip.background = element_blank(),
                                  strip.text = element_text(size = 14, colour = "black"))

input.subjects <- snakemake@params[["input_samples"]]
exclude.reasons.filename <- snakemake@input[["exclude_list"]]
output.subjects.filename <- snakemake@input[["output_subject_list"]]
somalier.run <- snakemake@params[["somalier"]]
somalier.relatedness.filename <- snakemake@input[["relatedness"]]
somalier.sex.filename <- snakemake@input[["sex"]]
fastqc.filename <- snakemake@input[["fastqc"]]
bcftools.stats.filename <- snakemake@input[["bcftools_stats"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
run.mode <- snakemake@params[["run_mode"]]

somalier.related.min <- 0.5
somalier.duplicate.min <- 0.95

res <- prepare.subject.tracking.table(input.subjects, output.subjects.filename, exclude.reasons.filename)
if (!run.mode == "jointgeno") {
  res <- add.fastqc.data(res, fastqc.filename)
}
res <- add.coverage(res, bcftools.stats.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL
knitr::kable(res, caption = "Summary of Subject Fate") %>%
  kableExtra::kable_styling("striped", position = "left", full_width = FALSE)
write.output.table(res, tsv.path)

res <- report.sex.discordances(output.subjects.filename, somalier.sex.filename)
if (nrow(res) == 0) {
  cat("No sex discordant subjects were identified.")
} else {
  print(knitr::kable(res, caption = "Sex Discordant Subjects") %>%
          kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}

res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename,
                                    somalier.duplicate.min, 1.0)
if (nrow(res) == 0) {
  cat("No duplicate subjects were detected above the specified relatedness cutoff.")
} else {
  print(knitr::kable(res, caption = "Genetic Duplicates") %>%
          kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}

res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename,
                                    somalier.related.min, somalier.duplicate.min)
if (nrow(res) == 0) {
  cat("No non-identical related subjects were detected within the specified range.")
} else {
  print(knitr::kable(res, caption = "Potentially Related Non-identical Subjects") %>%
          kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}

start.time <- read.table(start.time.filename, header = FALSE)[1, 1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)

sessionInfo()