Trimming, alignment, variant calling, and QC for germline data.
This workflow is designed to take either fastqs or gVCFs as input and emit a joint-called, multi-sample VCF. The pipeline analyzes germline samples only and does not currently support multiplexed data. It is optimized for deployment on AWS ParallelCluster, which can be set up as described here, though it should run without issue on any HPC system or local machine.
Authors
- Esha Joshi
- Cameron Palmer
- Bari Jane Ballew (@bballew)
Usage
If you use this workflow in a paper, please give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
Step 1: Obtain a copy of this workflow
- Clone this repository to your local system, into the place where you want to perform the data analysis.
git clone git@gitlab.com:data-analysis5/54gene-wgs-germline.git
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files in the config/ folder: adjust config.yaml to configure the workflow execution, and manifest.txt to specify your sample setup.
A. Config file
This pipeline currently offers three run modes. Please specify the run mode in config.yaml.
- full: This mode starts with fastq pairs and emits a joint-called, filtered, multi-sample VCF.
- joint_genotyping: This mode starts with gVCFs and runs joint calling and filtering, emitting a multi-sample VCF. The idea is that samples can be analyzed in batches in full run mode, and the batches can then be jointly re-genotyped together with this mode.
- fastq_qc_only: This mode starts with fastq pairs, performs trimming, runs FastQC on both the pre- and post-trimming fastq pairs, and emits a MultiQC report. It is designed to be run before a full pipeline run, so that any lanes that need to be dropped, or trimming parameters that need adjusting, can be identified before spending time and compute on alignment, joint calling, etc.
After joint-calling the samples listed in the manifest, the pipeline will create a new multi-sample VCF from which samples have been automatically removed based on the following thresholds, sourced from the config (see the example config snippet after this list):
- max_het_ratio (defaults to 2.5): excludes samples whose het/hom-alt ratio, as calculated by bcftools stats, is above the threshold
- min_avg_depth (defaults to 20.0): excludes samples whose average depth of coverage, as calculated by bcftools stats, is below the threshold
- max_contam (defaults to 0.03): only applied in full runs; excludes samples whose contamination estimate, as calculated by verifyBamID, is above the threshold
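For orientation, a minimal sketch of the relevant config.yaml entries might look like the following. The three threshold keys are those documented above; the run-mode key name and the flat layout are illustrative assumptions only, so consult the config file shipped with the repository for the exact schema.

```yaml
# Illustrative sketch; only the threshold key names below are taken from the documentation above
mode: full            # hypothetical key name; one of full, joint_genotyping, fastq_qc_only
max_het_ratio: 2.5    # drop samples with het/hom-alt ratio above this value
min_avg_depth: 20.0   # drop samples with mean coverage below this value
max_contam: 0.03      # drop samples with verifyBamID contamination above this value (full runs only)
```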
B. Manifest file
You will need to provide a headerless, whitespace-delimited manifest file to run the pipeline (example manifests are sketched after this list).

For full mode and for fastq_qc_only mode:
- Columns: readgroup sample_ID path/to/r1.fastq path/to/r2.fastq
- readgroup values should be unique, e.g. sampleID_flowcellID
- sample_ID should be the same for all fastq pairs from a single sample, and can differ from the fastq filenames

For joint_genotyping mode:
- Columns: sample_ID path/to/file.g.vcf.gz
- sample_ID values should be unique and should correspond to the sample IDs in the gVCFs
- gVCFs should be bgzipped and indexed
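As a concrete illustration (the sample IDs, flowcell IDs, and paths below are hypothetical), a manifest for full or fastq_qc_only mode might look like:

```
Sample1_FC001  Sample1  /data/fastq/Sample1_FC001_R1.fastq.gz  /data/fastq/Sample1_FC001_R2.fastq.gz
Sample1_FC002  Sample1  /data/fastq/Sample1_FC002_R1.fastq.gz  /data/fastq/Sample1_FC002_R2.fastq.gz
Sample2_FC001  Sample2  /data/fastq/Sample2_FC001_R1.fastq.gz  /data/fastq/Sample2_FC001_R2.fastq.gz
```

and a manifest for joint_genotyping mode might look like:

```
Sample1  /data/gvcf/Sample1.g.vcf.gz
Sample2  /data/gvcf/Sample2.g.vcf.gz
```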
Step 3: Install the run-time environment
If needed, install miniconda by following the steps here.
Create a conda environment with, minimally, the dependencies defined in environment.yaml:
# create the env
conda env create -f environment.yaml
Step 4: Execute workflow
Activate the conda environment:
conda activate 54gene-wgs-germline
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
To run the pipeline in a cluster environment, edit wrapper.sh as needed for your system, and then run via
bash run.sh
Alternatively, you may run Snakemake pipelines on a cluster via something like
snakemake --use-conda --cluster qsub --jobs 100
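For example, on a SLURM cluster a roughly equivalent invocation might look like the sketch below; the sbatch flags are illustrative placeholders rather than values taken from this workflow, so adapt them to your scheduler and resource needs.

```bash
# Illustrative only: adjust CPUs, memory, and walltime for your cluster
snakemake --use-conda --jobs 100 \
    --cluster "sbatch -c {threads} --mem=16G --time=24:00:00"
```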
Step 5: Investigate results
Upon pipeline completion, verify that all steps have completed without error by checking the top-level log WGS_<datestamp>.out. The bottom few lines of the file should contain something like nnn of nnn steps (100%) done. Additional job logs (when run on a cluster) are stored in the logs/ directory.
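For example, a quick way to spot-check the end of the top-level log (the datestamp in the filename will match your run):

```bash
tail -n 5 WGS_<datestamp>.out
# expect a closing line similar to: nnn of nnn steps (100%) done
```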
All pipeline results are stored in the results/ directory. The hard-filtered, joint-called VCF can be found at results/HaplotypeCaller/filtered/HC_variants.hardfiltered.vcf.gz. For future joint-calling, the gVCFs are located at results/HaplotypeCaller/called/<sample>_all_chroms.g.vcf.gz. Deduplicated, post-BQSR BAMs are found at results/bqsr/<sample>.bam.
Samples that fail the following thresholds are automatically removed from the above file, and the output is placed in results/post_qc_exclusions/samples_excluded.HC_variants.hardfiltered.vcf.gz. The record of sample exclusions, along with the reason(s) for exclusion, is found at results/post_qc_exclusions/exclude_list_with_annotation.tsv (see the example commands after this list). Samples are excluded for at least one of the following reasons; the values listed are defaults, but can be changed in config.yaml.
- Average depth of coverage < 20x
- Contamination > 3%
- Het/hom-alt ratio > 2.5
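To review which samples were dropped and confirm they are absent from the post-QC VCF, commands along these lines work (bcftools is already a dependency of the pipeline, and the paths are the outputs listed above):

```bash
# list excluded samples and the reason(s) for each exclusion
cat results/post_qc_exclusions/exclude_list_with_annotation.tsv

# list the samples retained in the post-QC, joint-called VCF
bcftools query -l results/post_qc_exclusions/samples_excluded.HC_variants.hardfiltered.vcf.gz
```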
The following QC metrics are available:
- FastQC at results/fastqc/
- Trimming stats via fastp at results/paired_trimmed_reads/
- Alignment stats via samtools at results/alignment_stats/
- Recalibration stats from BQSR at results/bqsr/
- Relatedness via somalier at results/qc/relatedness/
- Sample contamination via verifyBamID at results/qc/contamination_check/ (full runs only; not included in joint_genotyping runs)
- Inferred sex via bcftools +guess-ploidy at results/qc/sex_check/
- Picard metrics at results/HaplotypeCaller/filtered/
- bcftools stats at results/qc/bcftools_stats/
- MultiQC report at results/multiqc/
A final summary report for the run is available at results/run_summary/. This report contains overview information, such as samples starting vs. samples emitted into the final VCF, a table of key QC stats, and lists of samples flagged for unexpected relatedness, duplication, or sex discordance.
Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your own copy of the repository:
git commit -a
git push
Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
- Once, register the upstream repository in your local copy:
git remote add -f upstream git@github.com:snakemake-workflows/54gene-wgs-germline.git
or
git remote add -f upstream https://github.com/snakemake-workflows/54gene-wgs-germline.git
if you do not have SSH keys set up.
- Update the upstream version: git fetch upstream
- Create a diff with the current version: git diff HEAD upstream/master workflow > upstream-changes.diff
- Investigate the changes: vim upstream-changes.diff
- Apply the modified diff via: git apply upstream-changes.diff
- Carefully check whether you need to update the config files: git diff HEAD upstream/master config. If so, do it manually and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 8: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g., cp -r workflow path/to/fork. Make sure not to accidentally copy config file contents or sample sheets; instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
Testing
Test cases are in the subfolder tests. They are automatically executed via continuous integration (TBD).
Unit tests for scripts are found in workflow/scripts/. They can be executed with pytest or testthat for Python and R scripts, respectively; example invocations are sketched below.
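A hedged example, assuming the unit tests sit alongside the scripts under workflow/scripts/ (check the repository for the exact test layout before running):

```bash
# Python unit tests
pytest workflow/scripts/

# R unit tests
Rscript -e 'testthat::test_dir("workflow/scripts")'
```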
Test input data, configs, and example manifests for both pipeline modes can be found here. Note that there are a few important caveats when testing:
- Somalier does not appear to work on macOS, so make sure you are in a Linux environment or comment out that target from the Snakefile. (Using a container for the OS would solve this problem.)
- Make sure you update the manifests to point to wherever you put the input data.
- You can run the tests locally using the script in that same repo, via bash run_local_test.sh.
Code Snippets
```
# Align reads with bwa mem and coordinate-sort with samtools
shell:
    "mkdir -p {output.t} && "
    "bwa mem "
    "-K 10000000 -M "
    '-R "@RG\\tCN:54gene\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}\\tLB:N/A" '
    "-t {threads} "
    "{input.r} {input.r1} {input.r2} | "
    "samtools sort -@ {params.samtools_threads} -T {output.t} -m {params.samtools_mem}M -o {output.bam} - "
```

```
# Mark and remove duplicate reads with GATK/Picard MarkDuplicates
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" MarkDuplicates '
    "TMP_DIR={params.t} "
    "REMOVE_DUPLICATES=true "
    "INPUT={params.l} "
    "OUTPUT={output.bam} "
    "METRICS_FILE={output.metrics}"
```

```
# Generate the base quality score recalibration table
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" BaseRecalibrator '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--known-sites {input.snps} "
    "--known-sites {input.indels} "
    "-O {output.table}"
```

```
# Apply BQSR to produce recalibrated BAMs
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" ApplyBQSR '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "--bqsr-recal-file {input.recal} "
    "-O {output.bam}"
```

```
# Collect alignment statistics
shell:
    "samtools stats {input.bam} > {output}"
```

```
# Symlink input fastqs into the working directory
shell:
    "ln -s {input.r1} {output.r1} && "
    "ln -s {input.r2} {output.r2}"
```

```
# Run FastQC on each read of the pair
shell:
    "fastqc {input.r1} -d {params.t} --quiet -t {threads} --outdir={params.o} && "
    "fastqc {input.r2} -d {params.t} --quiet -t {threads} --outdir={params.o}"
```

```
# Trim adapters and short reads with fastp
shell:
    "fastp -i {input.r1} -I {input.r2} "
    "-w {threads} "
    "-o {output.r1_paired} -O {output.r2_paired} "
    "-h {output.h} -j {output.j} "
    "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA "
    "--adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT "
    "-l 36"
```
```
# Normalize variants and split multiallelic sites with bcftools norm
shell:
    "bcftools norm "
    "-f {input.ref} "
    "-m - "
    "--threads {threads} "
    "{input.vcf} "
    "-Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"
```

```
# Select SNPs for hard filtering
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type SNP "
    "-O {output.vcf}"
```

```
# Select indels for hard filtering
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" SelectVariants '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    "--select-type INDEL "
    "-O {output.vcf}"
```

```
# Hard-filter SNPs
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "SOR > 3.0" --filter-name "SOR3" '
    '-filter "FS > 60.0" --filter-name "FS60" '
    '-filter "MQ < 40.0" --filter-name "MQ40" '
    '-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" '
    '-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" '
    "-O {output.vcf}"
```

```
# Hard-filter indels
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" VariantFiltration '
    "--tmp-dir {params.t} "
    "-V {input.vcf} "
    '-filter "QD < 2.0" --filter-name "QD2" '
    '-filter "QUAL < 30.0" --filter-name "QUAL30" '
    '-filter "FS > 200.0" --filter-name "FS200" '
    '-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" '
    "-O {output.vcf}"
```

```
# Subset the VCF to a region and index the result
shell:
    "bcftools view --threads {threads} -r {params.region} -Oz -o {output.vcf} {input.vcf} && "
    "tabix -p vcf {output.vcf}"
```

```
# Estimate sample contamination with verifyBamID
shell:
    "verifyBamID "
    "--best "
    "--vcf {input.vcf} "
    "--bam {input.bam} "
    "--out {params.d}"
```

```
# Concatenate per-sample verifyBamID *.selfSM output, dropping headers
shell:
    'grep -v "^#" {params}*selfSM > {output}'
```

```
# Combine filtered SNPs and indels, then sort and index
shell:
    "bcftools concat -a -Ov {input.snps} {input.indels} | "
    "bcftools sort -T {params.t} -Oz -o {output.vcf} && "
    "tabix -p vcf {output.vcf}"
```

```
# Collect Picard variant calling metrics
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" CollectVariantCallingMetrics '
    "--TMP_DIR {params.t} "
    "-I {input.calls} "
    "--DBSNP {input.dbsnp} "
    "-SD {input.d} "
    "-O {params.d}"
```
```
# Call variants per sample and interval with HaplotypeCaller in GVCF mode
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" HaplotypeCaller '
    "--tmp-dir {params.t} "
    "-R {input.r} "
    "-I {input.bam} "
    "-ERC GVCF "
    "-L {input.intervalfile} "
    "-O {output.gvcf} "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"
```

```
# Compress and index the gVCF
shell:
    "bgzip -c {input.gvcf} > {output.gvcf} && "
    "tabix -p vcf {input.gvcf}.gz"
```

```
# Gather per-interval gVCFs into one file per sample
shell:
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GatherVcfs '
    "-I {params.l} "
    "-O {output}"
```

```
# Index the gathered gVCF
shell:
    "tabix -p vcf {input}"
```

```
# Symlink user-supplied gVCFs and indexes (joint_genotyping mode)
shell:
    "ln -s {input.gvcf} {output.gvcf} && "
    "ln -s {input.index} {output.index}"
```

```
# Write a sample-name-to-gVCF map entry
shell:
    "n=$(bcftools query -l {input.gvcf});"
    'echo "${{n}}\t{input.gvcf}" > {output}'
```

```
# Concatenate all per-sample map files
shell:
    "for i in {params.mapDir}*.map; do cat $i >> {output}; done"
```

```
# Import gVCFs into a GenomicsDB workspace
shell:
    'export _JAVA_OPTIONS="" && '
    "rm -r {params.db} && "
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenomicsDBImport '
    "--batch-size {params.batch_size} "
    "--disable-bam-index-caching "
    "--sample-name-map {input.sampleMap} "
    "--genomicsdb-workspace-path {params.db} "
    "-L {input.intervalfile} "
    "--tmp-dir {params.t} "
    "--reader-threads {params.reader_threads} "
    "--genomicsdb-shared-posixfs-optimizations"
```

```
# Joint-genotype the GenomicsDB workspace
shell:
    'export _JAVA_OPTIONS="" && '
    'gatk --java-options "-Xmx{resources.xmx}m {params.java_opts}" GenotypeGVCFs '
    "-R {input.r} "
    "-V gendb://{params.db} "
    "-O {output.vcf} "
    "--tmp-dir {params.t} "
    "--max-alternate-alleles {params.m} "
    "--genomicsdb-max-alternate-alleles {params.g} "
    "--max-genotype-count {params.c} "
    "-stand-call-conf 30 "
    "-G StandardAnnotation "
    "-G StandardHCAnnotation"
```

```
# Concatenate per-interval VCFs into the project-level VCF
shell:
    "bcftools concat -a {input.vcfList} -Ou | "
    "bcftools sort -T {params.t} -Oz -o {output.projectVCF} && "
    "tabix -p vcf {output.projectVCF}"
```
```
# Per-sample statistics with bcftools stats
shell:
    "bcftools stats "
    "--af-bins 0.01,0.05,0.1,1 "
    "-F {input.r} "
    "-s- {input.vcf} > {output}"
```

```
# Plot bcftools stats output
shell:
    "plot-vcfstats -P -p {params.d} {input}"
```

```
# Generate a plink-style ped for somalier
shell:
    "python workflow/scripts/generate_ped.py {input} {params.prefix}"
```

```
# Extract somalier sites and assess relatedness
shell:
    "somalier extract "
    "-d {params.d} "
    "--sites $CONDA_PREFIX/share/somalier/sites.hg38.vcf.gz "
    "-f {input.r} {input.vcf} && "
    "somalier relate --ped {input.ped} -o {params.o} {params.d}/*.somalier"
```

```
# Create an empty placeholder output
shell:
    "touch {output}"
```

```
# Infer sample sex with bcftools +guess-ploidy
shell:
    "bcftools +guess-ploidy -g hg38 {input.vcf} -v > {output.txt} && "
    "guess-ploidy.py {output.txt} {params.p}"
```

```
# Build the exclusion list (full runs, including the verifyBamID contamination check)
shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} --verify {input.v} -r {params.r} -d {params.d} -c {params.c}"
```

```
# Build the exclusion list (joint_genotyping runs, no contamination check)
shell:
    "python workflow/scripts/create_exclude_list.py {input.b} {params.out} -r {params.r} -d {params.d}"
```

```
# Remove excluded samples and annotate variant IDs
shell:
    "bcftools view -S ^{input.l} --threads {threads} -Ou {input.v} | "
    "bcftools annotate --threads {threads} --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o {output.v} && "
    "tabix -p vcf {output.v}"
```

```
# Aggregate QC into MultiQC reports (one invocation per run mode)
shell:
    "multiqc --force -o {params.outDir} -n {params.outName} --config {input.mqc_config} {params.inDirs} {params.relatedness}"

shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs} {params.relatedness}"

shell:
    "multiqc --force -o {params.outDir} --config {input.mqc_config} -n {params.outName} {params.inDirs}"
```
```
# List the samples present in a VCF
shell:
    "bcftools query -l {input} > {output}"
```

```
# Render the run summary and benchmarking reports
script:
    "../scripts/run_summary.Rmd"

script:
    "../scripts/run_summary.Rmd"

script:
    "../scripts/run_initial_summary.Rmd"

script:
    "../scripts/combine_benchmarks.R"

script:
    "../scripts/benchmarking_report.Rmd"
```

```
# Download the GRCh38 reference bundle from the Broad public S3 bucket
shell:
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf resources/ --no-sign-request && "
    "aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx resources/ --no-sign-request "
```

```
# Record the pipeline start time (epoch seconds)
shell:
    "date +%s > {output}"
```
```r
# Load required packages
knitr::opts_chunk$set(echo = TRUE, fig.width = 12, fig.height = 8)
require(knitr, quietly = TRUE)
require(ggplot2, quietly = TRUE)
require(RColorBrewer, quietly = TRUE)
require(dplyr, quietly = TRUE)
require(gluedown, quietly = TRUE)
require(gridExtra, quietly = TRUE)
require(lubridate, quietly = TRUE)

# set input variable from snakemake
benchmarks.table <- snakemake@input[["benchmarks"]]

# read in tsv
dataset <- read.table(benchmarks.table, header = TRUE, sep = "\t")

# some light date/time manipulation for ease in plotting
dataset$times <- hms(dataset$h.m.s)
d.lub <- hour(dataset$times) + minute(dataset$times)/60 + second(dataset$times)/3600
dataset$walltime <- d.lub

# some upfront unit conversions & derivations for ease in comprehension
#
# convert IO units from MB to GB
dataset$io_in <- dataset$io_in / 1000
dataset$io_out <- dataset$io_out / 1000
#
# convert CPU time from seconds to hours
dataset$cpu_time_hrs <- dataset$cpu_time / 3600
#
# calculate cpu/walltime ratio
dataset$cpuwall_ratio <- (dataset$cpu_time / dataset$s)

# subset the dataset to get data for rules where there is more than 1
# observation/process to pass to geom_boxplot
subset.df <- dataset %>% group_by(rule) %>% filter(n() > 1)

# subset the dataset to get data points for rules meeting time_threshold
# parameter specified at the rule level
time.param <- snakemake@params[["threshold"]]
time.threshold <- time.param * 60
filtered.df <- dataset %>% group_by(rule) %>% filter(s >= time.threshold)
```

```r
# plotting function for box/jitter plot where single data points are omitted
# from boxplots
create.boxjitter.plot <- function(df, df.subset, xvar, yvar, xlab, ylab) {
  bd.plot <- ggplot(data = df, aes_string(x = xvar, y = yvar)) +
    geom_boxplot(data = df.subset, aes_string(fill = xvar), show.legend = FALSE) +
    geom_point(position = "identity", show.legend = FALSE, alpha = 1/2, size = 1) +
    labs(x = xlab, y = ylab) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  bd.plot
}

create.subset.plot <- function(subdf, xvar, yvar, xlab, ylab, title) {
  sb.plot <- ggplot(data = subdf, aes_string(x = xvar, y = yvar)) +
    geom_boxplot(aes_string(fill = xvar), show.legend = FALSE) +
    geom_point(position = "identity", show.legend = FALSE, alpha = 1/2, size = 1) +
    labs(x = xlab, y = ylab, title = title) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          axis.title = element_text(size = 14),
          axis.text = element_text(size = 12))
  sb.plot
}
```
```r
inlines <- c(
  md_bold(unique(dataset$rule))
)
md_bullet(inlines)
```

```r
# plot walltime for all rules benchmarked
exec.time.all <- create.boxjitter.plot(dataset, subset.df, "rule", "walltime", "Rule", "Total walltime in hours for all rules")
exec.time.all
```

```r
max.pss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_pss", "Rule", "PSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.pss.all
```

```r
max.uss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_uss", "Rule", "USS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.uss.all
```

```r
max.rss.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss", "Rule", "RSS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.rss.all
```

```r
max.vms.all <- create.boxjitter.plot(dataset, subset.df, "rule", "max_vms", "Rule", "VMS (MB) for all rules") +
  scale_y_continuous(n.breaks = 12)
max.vms.all
```

```r
cat(paste0("Time threshold (minutes): ", time.param))
```
```r
# plot the memory usage with boxplots for rules with >1 process otherwise
# just plot the datapoint
max.pss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_pss", "Rule", "PSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.pss.detail
```

```r
max.uss.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_uss", "Rule", "USS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.uss.detail
```

```r
max.rss.detail <- create.boxjitter.plot(dataset, subset.df, "rule", "max_rss", "Rule", "RSS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.rss.detail
```

```r
max.vms.detail <- create.boxjitter.plot(filtered.df, subset.df, "rule", "max_vms", "Rule", "VMS (MB)") +
  scale_y_continuous(n.breaks = 12)
max.vms.detail
```

```r
# plot detailed grid view
mem.df1 <- subset(filtered.df, max_pss > 5000)
mem.df2 <- subset(filtered.df, max_pss < 5000)
max.pss.subset1 <- create.subset.plot(mem.df1, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory >5GB")
max.pss.subset2 <- create.subset.plot(mem.df2, "rule", "max_pss", "Rule", "max_pss (MB)", "Rules using memory <5GB")
grid.arrange(max.pss.subset1, max.pss.subset2, nrow = 1)
```

```r
# plot the io read
io.read <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_in", "Rule", "Cumulative GB Read")
io.read
```

```r
# plot io write
io.write <- create.boxjitter.plot(filtered.df, subset.df, "rule", "io_out", "Rule", "Cumulative GB Written")
io.write
```

```r
# plot detailed grid view
ioin.df1 <- subset(filtered.df, io_in > 10)
ioin.df2 <- subset(filtered.df, io_in < 10)
ioout.df1 <- subset(filtered.df, io_out > 10)
ioout.df2 <- subset(filtered.df, io_out < 10)
ioin.subset1 <- create.subset.plot(ioin.df1, "rule", "io_in", "Rule", "Cumulative GB Written", "Rules with >10GB Read")
ioin.subset2 <- create.subset.plot(ioin.df2, "rule", "io_in", "Rule", "Cumulative GB Written", "Rules with <10GB Read")
ioout.subset1 <- create.subset.plot(ioout.df1, "rule", "io_out", "Rule", "Cumulative GB Written", "Rules with >10GB Written")
ioout.subset2 <- create.subset.plot(ioout.df2, "rule", "io_out", "Rule", "Cumulative GB Written", "Rules with <10GB Written")
grid.arrange(ioin.subset1, ioin.subset2, ioout.subset1, ioout.subset2, nrow = 2)
```

```r
# plot the CPU time in hours
cpu.time <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpu_time_hrs", "Rule", "CPU time in hours")
cpu.time
```

```r
# plot cpu to walltime ratio (i.e. CPU/s)
cpuwall.ratio <- create.boxjitter.plot(filtered.df, subset.df, "rule", "cpuwall_ratio", "Rule", "CPU/second")
cpuwall.ratio
```
```r
library(tidyverse)

# explicitly specify column types to avoid errors when coercing rows
col.types <- cols(
  s = col_double(),
  `h:m:s` = col_time(format = ""),
  max_rss = col_double(),
  max_vms = col_double(),
  max_uss = col_double(),
  max_pss = col_double(),
  io_in = col_double(),
  io_out = col_double(),
  mean_load = col_double(),
  cpu_time = col_double()
)

# function to read tsv input and append rule name and process
read_benchmarks <- function(filename) {
  read_tsv(filename, col_types = col.types) %>%
    mutate(
      process = gsub("*.tsv", "", basename(filename)),
      rule = basename(dirname(filename))
    )
}

# run function for all specified snakemake inputs and output benchmarks tsv
dataset <- snakemake@input[["tsv"]] %>%
  lapply(read_benchmarks) %>%
  bind_rows()
head(dataset)
write_tsv(dataset, snakemake@output[["benchmarks"]])
```
```python
# workflow/scripts/create_exclude_list.py: builds the post-QC sample exclusion list
import argparse
from typing import Dict, List, Optional, Set, Tuple

import pandas as pd


def get_args():
    """Handle command line arguments"""
    parser = argparse.ArgumentParser(
        description="Generates a list of samples to exclude based on previously-run QC metrics."
    )
    parser.add_argument("bstats", type=str, help="bcftools stats file name")
    parser.add_argument("outfile", type=str, help="Output file prefix")
    parser.add_argument(
        "-v",
        "--verify",
        type=str,
        help="concatenated verifyBamID *.selfSM output files",
        default=False,
    )
    parser.add_argument(
        "-r", "--ratio", type=float, help="maximum allowed het/hom_alt ratio", default=2.5
    )
    parser.add_argument(
        "-d", "--depth", type=float, help="minimum allowed average depth", default=20.0
    )
    parser.add_argument(
        "-c", "--contam", type=float, help="maximum allowed contamination", default=0.03
    )
    results = parser.parse_args()
    return (
        results.bstats,
        results.outfile,
        results.verify,
        results.ratio,
        results.depth,
        results.contam,
    )


def read_in_bcftools_PSC(bstats: str) -> pd.DataFrame:
    """Read into a dataframe just the PSC portion of bcftools stats output"""
    skip: list[int] = [i for i, line in enumerate(open(bstats)) if not line.startswith("PSC")]
    return pd.read_csv(
        bstats,
        sep="\t",
        skiprows=skip,
        header=None,
        names=[
            "PSC",
            "id",
            "sample",
            "nRefHom",
            "nNonRefHom",
            "nHets",
            "nTransitions",
            "nTransversions",
            "nIndels",
            "average depth",
            "nSingletons",
            "nHapRef",
            "nHapAlt",
            "nMissing",
        ],
    )


def exclude_high_het_hom(df: pd.DataFrame, r: float) -> pd.DataFrame:
    """Exclude any samples with a het/hom ratio over 2.5"""
    df.loc[:, "het_hom_ratio"] = df["nHets"] / df["nNonRefHom"]
    df_het = df.loc[(df["het_hom_ratio"] > r)].copy()
    if df_het.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_het.loc[:, "exclude_reason"] = "high_het_hom"
        return df_het[["sample", "exclude_reason"]]


def exclude_low_depth(df: pd.DataFrame, d: float) -> pd.DataFrame:
    """Setting a 20x avg depth cutoff

    This is reported by the lab as well, but this should catch samples that are
    borderline, that were accidentally included despite low coverage, or that are
    from outside collaborators."""
    df_depth = df.loc[df["average depth"] < d].copy()
    if df_depth.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_depth.loc[:, "exclude_reason"] = "low_depth"
        return df_depth[["sample", "exclude_reason"]]


def read_in_verifybamid(verify: str) -> pd.DataFrame:
    """Requires single *.selfSM files to be concatenated without headers"""
    return pd.read_csv(
        verify,
        sep="\t",
        header=None,
        names=[
            "#SEQ_ID",
            "RG",
            "CHIP_ID",
            "#SNPS",
            "#READS",
            "AVG_DP",
            "FREEMIX",
            "FREELK1",
            "FREELK0",
            "FREE_RH",
            "FREE_RA",
            "CHIPMIX",
            "CHIPLK1",
            "CHIPLK0",
            "CHIP_RH",
            "CHIP_RA",
            "DPREF",
            "RDPHET",
            "RDPALT",
        ],
    )


def exclude_contam(df_v: pd.DataFrame, c: float) -> pd.DataFrame:
    """Setting a 3% contamination threshold"""
    df_contam = df_v.loc[df_v["FREEMIX"] > c].copy()
    if df_contam.empty:
        return pd.DataFrame({"sample": [], "exclude_reason": []})
    else:
        df_contam.loc[:, "exclude_reason"] = "contamination"
        t = df_contam["#SEQ_ID"].str.split(":", expand=True)
        if len(t.columns) == 2:
            df_contam["sample"] = t[1]
        else:
            # if you're running a single sample, verifyBamID doesn't use the
            # sample_self:sample format in this field
            df_contam["sample"] = t[0]
        return df_contam[["sample", "exclude_reason"]]


def combine_exclusions(df_list: list) -> pd.DataFrame:
    """"""
    all_df = pd.concat(df_list).groupby("sample")["exclude_reason"].apply(",".join).reset_index()
    return all_df


if __name__ == "__main__":
    bstats, outfile, verify, r, d, c = get_args()
    df = read_in_bcftools_PSC(bstats)
    exclude1 = exclude_high_het_hom(df, r)
    exclude2 = exclude_low_depth(df, d)
    if verify:
        df_v = read_in_verifybamid(verify)
        exclude3 = exclude_contam(df_v, c)
        exclude_df = combine_exclusions([exclude1, exclude2, exclude3])
    else:
        exclude_df = combine_exclusions([exclude1, exclude2])
    exclude_df.to_csv(outfile + "_with_annotation.tsv", sep="\t", header=None, index=None)
    exclude_df[["sample"]].to_csv(outfile + ".tsv", sep="\t", header=None, index=None)
```
```python
# workflow/scripts/generate_ped.py: creates a plink-formatted ped for somalier
import argparse as ap
import re
import shutil
import sys

import pandas as pd


def get_args() -> list:
    """Get command line arguments"""
    parser = ap.ArgumentParser(
        description="Creates a plink-formatted ped from a tsv with subject and sex"
    )
    parser.add_argument("infile", type=str, help="linker filename")
    parser.add_argument(
        "outprefix", type=str, help="output filename prefix (.ped extension will be added)"
    )
    results = parser.parse_args()
    return (results.infile, results.outprefix)


def check_if_ped(infile: str) -> bool:
    """Look for .ped file extension

    This script can either be provided a tsv, which it will convert to a
    plink-style ped, or it can be provided a ped which it will simply copy to
    the expected output name/location.
    """
    if re.search(".ped$", infile, re.IGNORECASE):
        return True
    else:
        return False


def read_in_linker(infile: str) -> pd.DataFrame:
    """Read tsv linker file into a dataframe"""
    df = pd.read_table(infile, sep="\t")
    return df


def check_column_number(df: pd.DataFrame):
    """Check for two and only two columns in input tsv"""
    if len(df.columns) != 2:
        raise


def check_column_headers(df: pd.DataFrame):
    """Check for required headers in input tsv"""
    if ["Sample", "Sex"] != list(df.columns)[0:2]:
        raise


def add_ped_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add dummy columns to conform to plink ped format

    Should end up with a dataframe like so:
    Sample Sample 0 0 [F|M] -9
    """
    df.insert(1, "a", df.loc[:, "Sample"])
    df.insert(2, "b", 0)
    df.insert(2, "c", 0)
    df["d"] = "-9"
    return df


def check_sex_values(df: pd.DataFrame) -> pd.DataFrame:
    """Check for allowed values for sex information

    Input tsv may only have (case-insensitive) f, female, m, or male to encode
    sex.  Missing data can be encoded by one of the standard NA values that
    pandas can deal with automatically.  The error message emitted when this
    error is raised lists out the allowed values.
    """
    df["Sex"] = df["Sex"].str.lower()
    if not all(df[~df["Sex"].isna()].isin(["f", "female", "m", "male"])["Sex"]):
        raise


def encode_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Convert sex information to ped encoding

    Missing sex information is converted to 0; male and female are converted
    to 1 and 2 respectively.
    """
    df["Sex"] = df["Sex"].str.lower()
    df["Sex"] = df["Sex"].replace(["f", "female", "m", "male"], [2, 2, 1, 1])
    df = df.fillna(0)
    df["Sex"] = df["Sex"].astype(int)
    return df


if __name__ == "__main__":
    infile, outprefix = get_args()
    if check_if_ped(infile):
        shutil.copyfile(infile, outprefix + ".ped")
        sys.exit(0)
    df = read_in_linker(infile)
    try:
        check_column_number(df)
    except Exception:
        sys.exit("Error: {} is required to have exactly two columns.".format(infile))
    try:
        check_column_headers(df)
    except Exception:
        sys.exit(
            "Error: Correct headers not detected in {} (requires 'Sample' and 'Sex', in that order).".format(
                infile
            )
        )
    try:
        check_sex_values(df)
    except Exception:
        sys.exit(
            "Error: Sex reported in {} must be represented by f, female, m, or male (case insensitive). "
            "Missing data can be represented by any of the following strings: '', '#N/A', '#N/A N/A', '#NA', "
            "'-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', "
            "'n/a', 'nan', or 'null'.".format(infile)
        )
    df = add_ped_columns(df)
    df = encode_sex(df)
    df.to_csv(outprefix + ".ped", sep="\t", header=False, index=False)
```
```r
## Load required R packages
library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)
```

```r
## Load tested functions for this report
source(snakemake@input[["r_resources"]])
```

```r
## Configure standard theme information for ggplot2
my.theme <- theme_light() +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.background = element_blank(),
        strip.text = element_text(size = 14, colour = "black"))
```

```r
input.subjects <- snakemake@params[["input_samples"]]
fastqc.filename <- snakemake@input[["fastqc"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
```

```r
res <- prepare.initial.subject.tracking.table(input.subjects)
res <- add.fastqc.data(res, fastqc.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL
knitr::kable(res, caption = "Summary of Subject Fate") %>%
  kableExtra::kable_styling("striped", position = "left", full_width = FALSE)
write.output.table(res, tsv.path)
```

```r
start.time <- read.table(start.time.filename, header = FALSE)[1, 1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
```

```r
sessionInfo()
```
```r
## Load required R packages
library(knitr, quietly = TRUE)
library(kableExtra, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(RColorBrewer, quietly = TRUE)
```

```r
## Load tested functions for this report
source(snakemake@input[["r_resources"]])
```

```r
## Configure standard theme information for ggplot2
my.theme <- theme_light() +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.background = element_blank(),
        strip.text = element_text(size = 14, colour = "black"))
```

```r
input.subjects <- snakemake@params[["input_samples"]]
exclude.reasons.filename <- snakemake@input[["exclude_list"]]
output.subjects.filename <- snakemake@input[["output_subject_list"]]
somalier.run <- snakemake@params[["somalier"]]
somalier.relatedness.filename <- snakemake@input[["relatedness"]]
somalier.sex.filename <- snakemake@input[["sex"]]
fastqc.filename <- snakemake@input[["fastqc"]]
bcftools.stats.filename <- snakemake@input[["bcftools_stats"]]
tsv.path <- snakemake@params[["out_prefix"]]
start.time.filename <- snakemake@input[["start_time"]]
run.mode <- snakemake@params[["run_mode"]]
```

```r
somalier.related.min <- 0.5
somalier.duplicate.min <- 0.95
```

```r
res <- prepare.subject.tracking.table(input.subjects, output.subjects.filename, exclude.reasons.filename)
if (!run.mode == "jointgeno") {
  res <- add.fastqc.data(res, fastqc.filename)
}
res <- add.coverage(res, bcftools.stats.filename)
res <- res[order(res[, "Subject"]), ]
rownames(res) <- NULL
knitr::kable(res, caption = "Summary of Subject Fate") %>%
  kableExtra::kable_styling("striped", position = "left", full_width = FALSE)
write.output.table(res, tsv.path)
```

```r
res <- report.sex.discordances(output.subjects.filename, somalier.sex.filename)
if (nrow(res) == 0) {
  cat("No sex discordant subjects were identified.")
} else {
  print(knitr::kable(res, caption = "Sex Discordant Subjects") %>%
    kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
```

```r
res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename, somalier.duplicate.min, 1.0)
if (nrow(res) == 0) {
  cat("No duplicate subjects were detected above the specified relatedness cutoff.")
} else {
  print(knitr::kable(res, caption = "Genetic Duplicates") %>%
    kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
```

```r
res <- report.related.subject.pairs(output.subjects.filename, somalier.relatedness.filename, somalier.related.min, somalier.duplicate.min)
if (nrow(res) == 0) {
  cat("No non-identical related subjects were detected within the specified range.")
} else {
  print(knitr::kable(res, caption = "Potentially Related Non-identical Subjects") %>%
    kableExtra::kable_styling("striped", position = "left", full_width = FALSE))
}
```

```r
start.time <- read.table(start.time.filename, header = FALSE)[1, 1]
end.time <- as.integer(format(Sys.time(), "%s"))
elapsed.time <- round((end.time - start.time) / 3600, 2)
```

```r
sessionInfo()
```
Support
- Future updates