About
This is a snakemake workflow for analyzing targeted HiFi sequence datasets. The workflow was designed for Twist gene panels sequenced on PacBio Sequel IIe or Revio systems. Learn more in this App Note. Please note that this workflow is still in development; there may be updates to the input/output files and workflow steps that affect the behavior of the program.
Getting started
Dependencies
This workflow uses conda to install dependencies and set up the compute environment. The workflow is tested with conda v22+ on a SLURM cluster, and we recommend 16 cores with 4 GB RAM per core on a single machine (64 GB total memory). Otherwise, use at your own risk.
Installation
# create the base conda environment
$ conda create \
--channel conda-forge \
--channel bioconda \
--prefix ./conda_env \
python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow
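# Optional sanity check before moving on: with the base environment activated,
# the core tools should resolve and report their versions.
$ snakemake --version
$ python --version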
Quick start
We recommend using this GRCh38 reference. Refer to the FAQ for more details on input files and formats.
# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into reference directory
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Run full workflow starting with demultiplexing for batch <batch_name>
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# Use the <target_bed> as the probes bed if you do not have the probes. This will allow for generation of HS_Metrics
Understanding output
BAM and VCF files
Haplotagged aligned BAM files and VCF files are available in the whatshap subdirectory for each sample. These files were produced using DeepVariant and WhatsHap. See the FAQ for a detailed directory structure.
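As a quick sanity check, you can count how many aligned reads received a haplotype assignment via the WhatsHap HP tag. The BAM path and name below are illustrative; see the directory tree in the FAQ for actual locations.
# count reads assigned to each haplotype in the haplotagged BAM
$ samtools view batches/<batch_name>/<sample_id>/whatshap/<sample>.haplotagged.bam \
    | grep -o 'HP:i:[12]' | sort | uniq -c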
If pbsv is used in the pipeline, a VCF of structural variant calls can be found in the pbsv subdirectory for each sample.
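For a quick look at the structural variant calls, something like the following works; the VCF path and name are guesses, so adjust them to the actual file in the pbsv directory.
# tally structural variant types reported by pbsv
$ bcftools query -f '%INFO/SVTYPE\n' batches/<batch_name>/<sample_id>/pbsv/<sample>.pbsv.vcf.gz \
    | sort | uniq -c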
Demultiplexed output
The demux directory contains unmapped BAM files for each sample, named by barcode ID. The directory also contains all of the output from lima, the PacBio demultiplexing software.
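To see how demultiplexing went, lima's own report files in the demux directory are a good first stop. Exact file names depend on the output prefix the workflow passes to lima, so treat the globs below as illustrative.
# demultiplexed, unmapped BAMs (one per barcode pair)
$ ls batches/<batch_name>/demux/*.bam
# lima's summary of reads passing/failing each filter
$ cat batches/<batch_name>/demux/*.lima.summary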
Target enrichment stats
See the stats directory for many useful tables and summary figures. The file hs_metrics_consolidated_quickview.tsv is a good place to start to understand run performance; definitions of each statistic can be found here. read_categories.png gives a graphical summary of read lengths and the fraction of reads in each category: on-target, off-target, unmapped, PCR duplicates, and failed demultiplexing.
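For a quick look at the consolidated HS metrics table in the terminal, something like this works (the path assumes the batches/<batch_name>/stats layout shown in the FAQ):
# view the quickview table with aligned columns
$ column -t -s $'\t' batches/<batch_name>/stats/hs_metrics_consolidated_quickview.tsv | less -S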
FAQ
1) What is the format for the biosample_csv file?
This file provides a mapping of barcodes to samples. It is used for sample demultiplexing, to trim adapter sequences, and to rename files with the sample ID. The file is comma-delimited; the first column is a pair of barcode IDs (forward and reverse, which must match the headers in the barcodes fasta) and the second column is a sample ID (which must be unique). The csv must contain a header row, but the header names can be arbitrary.
Here is an example:
Barcode,Biosample
UDI_1_A01_f--UDI_1_A01_r,HG002
UDI_2_B01_f--UDI_2_B01_r,HG003
UDI_3_C01_f--UDI_3_C01_r,HG004
UDI_4_D01_f--UDI_4_D01_r,HG005
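Since duplicate sample IDs (or duplicate barcode pairs) will cause problems downstream, a quick check of the CSV is worthwhile; both commands should print nothing if all entries are unique.
# skip the header row, then report any duplicated barcode pairs or sample IDs
$ tail -n +2 <biosample_csv> | cut -d, -f1 | sort | uniq -d
$ tail -n +2 <biosample_csv> | cut -d, -f2 | sort | uniq -d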
2) What is the file format for HiFi reads?
Please use hifi_reads.bam, the native format for PacBio HiFi data, which is an unmapped BAM file ("ubam"). Learn more here.
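samtools can read unmapped BAMs directly, so a quick inspection of the input looks like this:
# count HiFi reads and peek at the first record of the unmapped BAM
$ samtools view -c hifi_reads.bam
$ samtools view hifi_reads.bam | head -n 1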
3) Can I change the barcode file?
By default the pipeline uses the Tru-Seq barcodes from Twist. You can change your barcode file by providing a different fasta file in workflow/barcodes. For example, the barcode file "Sequel_96_barcodes_v1.IDT_pad.fasta" was used for this dataset and in this preprint. Adjust the path to the barcode file in workflow/config.yaml and make sure you update your biosample_csv file.
4) What is the format of the target.bed file?
This is a standard BED file that specifies the coordinates of the regions of interest. It is used by HS Metrics to compute target coverage, on-target mapping rate, and other statistics.
Here is an example targets.bed:
chr1 155234451 155244627 GBA
chr5 70049638 70078522 SMN2
chr5 70925030 70953942 SMN1
chr22 42126498 42130810 CYP2D6
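A couple of one-liners to sanity-check the BED, e.g. the number of targets and the total panel size in bases:
# number of target regions
$ wc -l targets.bed
# total bases spanned by the target regions
$ awk '{sum += $3 - $2} END {print sum}' targets.bed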
5) What is a probes.bed file? What if I don't have one?
The probes.bed file specifies the coordinates of the probes used to prepare the target capture library. The file usually contains hundreds or even thousands of probes depending on the size of the panel. You may not have access to the probe design if it is proprietary; in this case, you can use the target.bed file in place of the probes.bed file (you will pass target.bed twice when you call the workflow command).
Here is an example probes.bed:
chr1 48037726 48037846
chr1 48038919 48039039
chr1 48040111 48040231
chr1 48041516 48041636
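If you do have a probe design, you can check that the probes actually fall within your targets using bedtools (which the workflow itself uses):
# count probes that overlap at least one target region
$ bedtools intersect -u -a probes.bed -b targets.bed | wc -l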
6) What software is used in the pipeline?
python v3.9+
conda v22+
lima v2.5+
pbmarkdup v1+
pbmm2 v1.7+
DeepVariant v1.4+
WhatsHap v1.1+
GLnexus v1.4.1+
pbsv v2.8+
htslib v1.13+
hsmetrics v1.3.1
PharmCAT
pangu v0.2.1
7) What are the steps in the workflow?
- Demultiplex HiFi reads with lima
- Mark PCR duplicate HiFi reads by sample with pbmarkdup
- Align HiFi reads to the reference with pbmm2
- Call small variants with DeepVariant v1.4.0
- Phase small variants with WhatsHap
- Haplotag aligned BAMs with WhatsHap
- Call structural variants with pbsv
- Jointly call all variants (excluding pbsv) with GLnexus
- [optionally] Call PGx star (*) alleles with PharmCAT and pangu
- [optionally] Annotate the output gVCF with dbSNP or another database containing variant IDs
- Run QC reports, including HsMetrics
8) What is the full directory structure?
.
├── cluster_logs # slurm stderr/stdout logs
├── reference
│ ├── reference.chr_lengths.txt # cut -f1,2 reference.fasta.fai > reference.chr_lengths.txt
│ ├── reference.fasta
│ └── reference.fasta.fai
├── batches
│ └── <batch_name> # batch_id
│ ├── benchmarks/ # cpu time per task
│ ├── demux/ # demultiplexed hifi reads
│ ├── glnexus/ # intermediate cohort vcf files
│ ├── logs/ # per-rule stdout/stderr logs
│ ├── picard/ # interval lists for hsmetrics
│ ├── stats/ # batch-wide collated reports, including HS Metrics summary
│ ├── whatshap_cohort/ # joint-called, phased SNV
│ ├── merged_gvcf/ # [optional] annotated gvcf with all batch samples included
│ ├── <sample_id 1>/ # per-sample results, one for each sample
│ : ...
│ └── <sample_id n>/ # per-sample results, one for each sample
│ ├── coverage/ # read coverage by target beds
│ ├── deepvariant/ # intermediate DV vcf, incl g.vcf per sample
│ ├── hs_metrics/ # picard hsmetrics for this sample
│ ├── markdup/ # unaligned reads with PCR dups marked
│ ├── pangu/ # [optional] HiFi CYP2D6 star calling results
│ ├── pbsv/ # structural variant calls
│ ├── pharmcat/ # [optional] pharmcat results
│ ├── read_metrics/ # per-read information
│ └── whatshap/ # phased small variants; merged haplotagged alignments
│
└── workflow # clone of this repo
Detailed run guidance
# create the base conda environment
$ conda create \
--channel conda-forge \
--channel bioconda \
--prefix ./conda_env \
python=3.9 snakemake mamba pysam lockfile
# activate the base conda environment
$ conda activate ./conda_env
# clone the github repo
$ git clone https://github.com/PacificBiosciences/HiFiTargetEnrichment.git workflow
# create a couple directories
$ mkdir reference annotation cluster_logs
# drop your reference.fasta and reference.fasta.fai into reference
# and adjust the [ref][fasta|index] paths in workflow/config.yaml
# Annotation [optional]
# drop your annotation file and index into annotation
# and adjust the [annotate][variants] path in workflow/config.yaml
# ensure that [annotate][gVCF] is set to True in workflow/config.yaml
# PharmCAT [optional]
# Set [pharmcat][run_analysis] to True in workflow/config.yaml
# run the full workflow including demux/markdup/mapping from a single HiFi movie for batch <batch_name>
# Use the <target_bed> as the probes bed if you do not have the probes. This will allow for generation of HS_Metrics
$ sbatch workflow/run_snakemake.sh <batch_name> <biosample_csv> <hifi_reads> <target_bed> [<probe_bed>]
# run just variant calling and phasing for a set of bams following demux/markdup/mapping on SL
# <hifi_reads> can be a directory of bams or a textfile with one bam path per line (fofn)
$ sbatch workflow/run_snakemake_SLmapped.sh <batch_name> <hifi_reads> <target_bed> [<probe_bed>]
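If you go the fofn route for the SLmapped entry point, the file is simply one BAM path per line, for example:
# build a file-of-filenames (fofn) listing mapped HiFi BAMs, one path per line
$ ls /path/to/mapped_bams/*.bam > hifi_reads.fofn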
Code Snippets
16 17 18 19 20 21 22 23 24 | shell: ''' (bcftools convert --gvcf2vcf \ -f {input.reference} \ -R {params.variants} \ -Oz \ -o {output} \ {input.gvcf}) > {log} 2>&1 ''' |
41 42 43 44 45 46 47 48 | shell: ''' (bcftools annotate -c ID \ -R {params.region} \ -a {params.variants} \ -Oz -o {output} \ {input.gvcf}) > {log} 2>&1 ''' |
65 66 67 68 69 70 71 | shell: ''' (bcftools merge \ -R {params.region} \ -Oz -o {output} \ {input.vcf} {params.variants}) > {log} 2>&1 ''' |
95 96 97 98 99 100 101 | shell: ''' (bcftools merge \ -R {params.region} \ -Oz -o {output} \ {input.gvcfs}) > {log} 2>&1 ''' |
17 18 | shell: "tabix {params} {input} > {log} 2>&1" |
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | shell: """ (/opt/deepvariant/bin/make_examples \ --add_hp_channel \ --alt_aligned_pileup=diff_channels \ --min_mapping_quality=1 \ --parse_sam_aux_fields \ --partition_size=25000 \ --max_reads_per_partition=600 \ --phase_reads \ --pileup_image_width {params.pileup_image_width} \ --norealign_reads \ --sort_by_haplotypes \ --track_ref_reads \ --vsc_min_fraction_indels {params.vsc_min_fraction_indels} \ --mode calling \ --ref {input.reference} \ --reads {input.bam} \ --examples {params.examples} \ --gvcf {params.gvcf} \ --task {params.shard}) > {log} 2>&1 """ |
73 74 75 76 77 78 79 80 | shell: """ (echo "CUDA_VISIBLE_DEVICES=" $CUDA_VISIBLE_DEVICES; \ /opt/deepvariant/bin/call_variants \ --outfile {output} \ --examples {params.examples} \ --checkpoint {params.model}) > {log} 2>&1 """ |
108 109 110 111 112 113 114 115 116 | shell: """ (/opt/deepvariant/bin/postprocess_variants \ --ref {input.reference} \ --infile {input.tfrecord} \ --outfile {output.vcf} \ --nonvariant_site_tfrecord_path {params.nonvariant_tfrecs} \ --gvcf_outfile {output.gvcf}) > {log} 2>&1 """ |
134 135 | shell: "(bcftools stats --threads 3 {params} {input} > {output}) > {log} 2>&1" |
shell:
    '''
    (mkdir -p {output.odir}
    lima {params.filters} \
        --split-named \
        --dump-removed \
        -j {threads} \
        --log-level {params.loglevel} \
        --biosample-csv {input.biosamples} \
        {input.reads} {input.barcodes} {params.odir}/demultiplex.bam) > {log} 2>&1
    '''
run:
    missing = sample2barcode.keys() - params.samples
    if len( missing ):
        with open( f"batches/{batch}/demux_FAIL.txt", 'w' ) as ofile:
            for sample in sorted( missing ):
                ofile.write( f'{sample2barcode[sample]},{sample}\n' )
26 27 28 29 30 | shell: ''' (bedtools merge -i <(bedtools slop -b 10000 -i {input.targets} -g {input.chr_len} | sort -k1,1 -k2,2n) \ -d {params.distance} -c 4 -o first > {output}) 2> {log} ''' |
54 55 56 57 58 59 60 61 62 | shell: """ (rm -rf {output.scratch_dir} && \ glnexus_cli --threads {threads} \ --dir {output.scratch_dir} \ --config DeepVariant_unfiltered \ --bed {input.bed} \ {input.gvcf} > {output.bcf}) 2> {log} """ |
79 80 81 82 | shell: ''' (bcftools view {params} {input} -o {output}) > {log} 2>&1 ''' |
104 105 106 107 | shell: ''' (tabix {params.extra} {input.vcf} {params.region} > {params.vcf} && bgzip {params.vcf}) > {log} 2>&1 ''' |
133 134 135 136 137 138 139 | shell: """ (whatshap phase {params.extra} \ --output {output} \ --reference {input.reference} \ {input.vcf} {input.phaseinput}) > {log} 2>&1 """ |
168 169 170 171 | shell: ''' (bcftools concat {params} -o {output} {input.calls}) > {log} 2>&1 ''' |
191 192 193 194 195 196 197 198 199 | shell: """ (whatshap stats \ --gtf {output.gtf} \ --tsv {output.tsv} \ --block-list {output.blocklist} \ --chr-lengths {input.chr_lengths} \ {input.vcf}) > {log} 2>&1 """ |
13 14 | wrapper: "v1.10.0/bio/picard/createsequencedictionary" |
29 30 | wrapper: "v1.10.0/bio/picard/bedtointervallist" |
54 55 | wrapper: "v1.3.1/bio/picard/collecthsmetrics" |
66 67 68 69 70 71 72 | shell: ''' awk 'BEGIN {{ FS=OFS="\\t" }} \ /METRICS/ {{getline; print; getline; split(FILENAME,s,"/"); $67=s[3]; print}}' {input} \ | awk 'NR==1 || !/^BAIT_SET/' > {output} ''' |
run:
    import csv
    from operator import itemgetter
    quickgetter = itemgetter(*quickviewColumns)
    with open( input[0], newline='' ) as hsmetrics,\
         open( output[0], 'w', newline='' ) as quickview:
        reader = csv.DictReader( hsmetrics, dialect='unix', delimiter='\t')
        writer = csv.DictWriter( quickview, quickviewColumns, dialect='unix', delimiter='\t', quoting=0 )
        writer.writeheader()
        for row in reader:
            writer.writerow( dict( zip( quickviewColumns, quickgetter(row) ) ) )
21 22 | shell: "(pangu -m capture -p {params.prefix} {input.bam}) > {log} 2>&1" |
34 35 36 37 | shell: ''' awk 'BEGIN {{OFS="\t"}} !($2 ~ /\//) {{$2=$2"/[]"}} 1' {input} > {output} ''' |
23 24 25 26 27 28 29 30 31 32 33 | shell: """ (pbmm2 align --num-threads {threads} \ --preset {params.preset} \ --sample {params.sample} \ --log-level {params.loglevel} \ {params.extra} \ {input.reference} \ {input.query} \ {output.bam}) > {log} 2>&1 """ |
20 21 22 23 24 25 26 | shell: """ (pbsv discover {params.extra} \ --log-level {params.loglevel} \ --tandem-repeats {input.tr_bed} \ {input.bam} {output}) > {log} 2>&1 """ |
47 48 49 50 51 52 53 | shell: """ (pbsv call {params.extra} \ --log-level {params.loglevel} \ --num-threads {threads} \ {input.reference} {input.svsigs} {output}) > {log} 2>&1 """ |
18 19 20 21 22 23 24 25 26 27 | shell: ''' (/pharmcat/pharmcat_vcf_preprocessor.py \ --missing-to-ref \ -vcf {input.vcf} \ -refFna {input.reference} \ -refVcf {params.regions} \ -bf {params.basefile} \ -o {params.odir} ) > {log} 2>&1 ''' |
45 46 47 48 49 50 51 52 53 54 55 56 | shell: ''' (bedtools coverage \ -sorted \ -g {input.genome} \ -f 1 \ -header \ -mean \ -a {input.vcf} \ -b {input.bam} \ | ( sed -u '/^#CHROM/q' ; awk '$11 >= {params.mincov}' | cut -f1-10 ) > {output}) > {log} 2>&1 ''' |
73 74 75 76 77 78 79 80 | shell: ''' (/pharmcat/pharmcat \ -vcf {input.vcf} \ -reporterJson \ -po {input.cyp2d6} \ -o {params.odir}) > {log} 2>&1 ''' |
shell:
    '''
    (pbmarkdup --log-level {params.loglevel} \
        -j {threads} \
        {params.options} \
        {input} {output.markdup}
    samtools index {output.markdup}) > {log} 2>&1
    '''
36 37 38 39 | shell: ''' pbindex {input} ''' |
23 24 25 26 | shell: ''' (bedtools intersect -u -a {input.ensmbl} -b {input.targets} > {output}) > {log} 2>&1 ''' |
35 36 37 38 | shell: ''' cp {input} {output} ''' |
shell:
    '''
    samtools view -hbF {params.filter} {input.bam} > {output.filtered}
    #by element count of reads
    bedtools intersect -a {input.bed} -b {output.filtered} -c > {output.by_elem}
    #by read count of elements
    bedtools intersect -a {output.filtered} -b {input.bed} -bed -c \
        | awk -v elem={params.elem} \
          'BEGIN {{ OFS="," ; print "readname,chr,start,stop,"elem }} \
           {{ print $4,$1,$2,$3,$NF }}' \
        | ( sed -u 1q; sort ) > {output.by_read}
    '''
shell:
    '''
    bedtools intersect -a <(samtools view -hbF {params.filter} {input.bam}) -b {input.bed} > {output.ontarget}
    bedtools genomecov -bga -ibam {output.ontarget} > {output.bg}
    bedtools intersect -a {output.bg} -b {input.bed} -wb \
        | awk -v s={params.sample} \
          'BEGIN {{ OFS = ","; print "sample,chr,start,stop,coverage,target" }} \
           {{ print s,$1,$2,$3,$4,$8 }}' \
        > {output.cov}
    '''
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 | shell: ''' awk -F, -v s={params.sample} -v low={params.lowcov} \ 'BEGIN {{ OFS=","; print "sample,target,totalBp,lowcovBp,fracLow" }} NR>1 {{ span = $4 - $3; split( $6, t, "_" ); target = t[1]; total += span; subset[target] += span; if( $5 <= low ) {{ lowcov += span; lowsubset[target] += span; }} }} END {{ for ( tg in subset ) {{ print s, tg, subset[tg], lowsubset[tg], lowsubset[tg]/subset[tg]; }} print s,"all",total,lowcov,lowcov/total; }}' {input} > {output} ''' |
135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 | shell: ''' tail -qn1 {input} \ | awk -F, \ 'BEGIN {{ OFS=","; print "sample,target,totalBp,lowcovBp,fracLow" }} \ {{ \ total += $3; \ lowcov += $4; \ print; \ }} \ END {{ \ print "Total","allTargets",total,lowcov,lowcov/total; }}' \ > {output} ''' |
162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 | shell: ''' samtools view -F {params.filter} {input.bam} \ | awk 'BEGIN {{ OFS="," ; print "readname,length,qual,ismapped,duplicates,softclip" }} \ {{ if( match( $0, /ds:i:[0-9]/ ) ) {{ split( substr( $0, RSTART, RLENGTH ),d,":"); dup=d[3] }} \ else {{ dup=0 }} \ }} ; \ {{ soft = match( $6, /[0-9]+S/ ) ? \ substr( $6, RSTART, RLENGTH-1 ) : \ 0 \ }} ; \ {{ mapped = !and($2,4) }} ; \ {{ match( $0, /rq:f:[0-9.]+/ ); \ split( substr( $0, RSTART, RLENGTH ),q,":") ; \ print $1, length($10), q[3], mapped, dup, soft \ }}' \ | ( sed -u 1q; sort) > {output.stats} ''' |
193 194 195 196 197 198 199 | shell: ''' bedtools intersect -a <(samtools view -hbF {params.filter} {input.bam}) -b {input.bed} -bed -loj \ | awk 'BEGIN {{ print "readname,target" }} \ {{ print $4","$NF }}' \ | ( sed -u 1q; sort -t, -u -k1,1 ) > {output.target} ''' |
212 213 214 215 216 217 | shell: ''' samtools view -f 1024 {input} \ | awk -v s={params.sample} '{{print s","length($10)",Duplicates"}}' \ > {output} ''' |
226 227 228 229 230 | shell: ''' awk ' BEGIN {{ print "sample,length,category" }} {{ print }}' {input} > {output} ''' |
243 244 | script: 'scripts/merge_read_metrics.py' |
253 254 255 256 | shell: ''' awk 'NR==1 || FNR>1' {input} > {output} ''' |
265 266 267 268 | shell: ''' awk 'NR==1 || FNR>1' {input} > {output} ''' |
279 280 281 282 283 284 285 286 287 | shell: ''' awk -v min={params.low} '$NF <= min {{ print $1, $2, $3, $4 ~ /^[0-9]+$/ ? "" : $4 }}' {input} \ | sort -k1,1 -k2,2n \ | uniq -c \ | awk 'BEGIN {{ OFS=","; print "chr,start,stop,target,samplesDropped" }} \ {{ print $2,$3,$4,$5,$1 }}' \ > {output} ''' |
297 298 299 300 301 302 303 | shell: ''' bedtools nuc -fi {input.ref} -bed {input.bed} \ | awk 'BEGIN {{ OFS=","; print "chr,start,stop,target,frac_gc" }} \ NR>1 {{ print $1,$2,$3,$4,$(NF-7) }}' \ > {output} ''' |
327 328 | script: 'scripts/plot_read_data.py' |
339 340 | script: 'scripts/plot_coverage.py' |
351 352 | script: 'scripts/plot_multi_coverage.py' |
375 376 | script: 'scripts/plot_multi_reads.py' |
13 14 | script: 'scripts/plot_read_cats.py' |
import pandas as pd

res = pd.concat( [ pd.read_csv( csv, index_col='readname' )
                   for csv in snakemake.input ],
                 axis=1 )\
        .reset_index()
res.insert( 0, 'sample', snakemake.params.sample )
res.to_csv( snakemake.output[0], index=False )
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DPI=200

coverage = pd.read_csv( snakemake.input[0] )

order = sorted(coverage.target.unique())
maxCols = int( len(order) ** 0.5 ) + 1

plt.figure(figsize=(40,40))
g = sns.FacetGrid(data=coverage, sharex=False,
                  col='target', col_wrap=maxCols, col_order=order,
                  hue='target')
g.map(plt.plot, 'start', 'coverage')
g.set_xlabels('chr start pos')
g.set_ylabels('Coverage')

g.savefig(f'{snakemake.params.odir}/coverage_by_target.png', dpi=DPI)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

DPI=400

data = pd.read_csv( snakemake.input[0] )
#replace "." with "off-target"
ot = 'off-target'
data.target = data.target.str.replace('.',ot,regex=False)

####
#multi-sample coverage per target
####
order = sorted(data.target.unique())
ncols = ceil(len(order)**.5)
g = sns.FacetGrid(data=data, sharex=False,
                  col='target', col_wrap=ncols, col_order=order,
                  hue='sample')
g.map(plt.plot, 'start','coverage')\
 .set_xlabels('chr start pos')\
 .set_ylabels('Coverage')\
 .add_legend()\
 .savefig(f'{snakemake.params.odir}/multi_coverage_by_target.png',dpi=DPI)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 | import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns DPI=400 readCsv = snakemake.input.csv targetBed = snakemake.input.bed targetBuffer = int(snakemake.params.buffer) targetPerPlot = int(snakemake.params.targetsPerPanel) outDir = snakemake.params.odir # function for partitioning target sets into readable subplots def partition(lst, size): for i in range(0, len(lst), size): yield lst[i:i+size] targets = pd.read_csv(targetBed, sep='\t', names=['chr','start','stop','target'])\ .set_index('target') #Set length of target region targets['tlength'] = targets.eval('stop - start + 2 * @targetBuffer') data = pd.read_csv(readCsv) #replace "." with "off-target" ot = 'off-target' data.target = data.target.str.replace('.',ot,regex=False) #### #mean base coverage #### print('writing mean base cov') pdata = data.query(' target != "off-target" ')\ .groupby( ['target','sample'] )\ .length.sum().reset_index() pdata['meanBaseCoverage'] = pdata.length / pdata.target.map(targets.tlength) # assign groups to partition targets order = sorted(pdata.target.unique()) grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp } pdata[ 'plotGroup' ] = pdata.target.map(grps) g = sns.catplot(data=pdata, x='target',sharex=False, y='meanBaseCoverage', col='plotGroup', col_wrap=2, kind='box',aspect=2) g.set_xticklabels(rotation=45); # remove plotgroup labels for ax in g.axes.flatten(): ax.set_title('') plt.tight_layout() g.savefig(f'{outDir}/mean_base_coverage.png',dpi=DPI) plt.clf() g = sns.catplot(data=pdata, x='target', sharex=False, y='meanBaseCoverage', col='plotGroup', col_wrap=2, hue='sample', kind='strip',aspect=2 ) #facet_kws=dict(legend_out=True)) g.set_xticklabels(rotation=45); for ax in g.axes.flatten(): ax.set_title('') sns.move_legend(g, "upper left", bbox_to_anchor=(1, 0.75), frameon=False) plt.tight_layout() g.savefig(f'{outDir}/mean_base_coverage_by_sample.png',dpi=DPI) plt.clf() #### #dedup rate by target #### print('writing dedup_rate') def dedup_rate(d): dups = d.duplicates.sum() return dups / ( dups + len(d) ) pdata = data.groupby(['target','sample'])\ .apply(dedup_rate)\ .rename('Duplication Rate')\ .reset_index() order = sorted(pdata.target.unique()) grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp } pdata[ 'plotGroup' ] = pdata.target.map(grps) g = sns.catplot(data=pdata, x='target', sharex=False, col='plotGroup',col_wrap=2, y='Duplication Rate', kind='box',aspect=2) g.set_xticklabels(rotation=45); for ax in g.axes.flatten(): ax.set_title('') plt.tight_layout() g.savefig(f'{outDir}/dedup_rate_by_target.png',dpi=DPI) plt.clf() #### #readlength by target #### #TODO better represent dups by replicating length dup times print('writing dedup_length') pdata = data order = sorted(pdata.target.unique()) grps = { tgt:i for i,grp in enumerate( partition( order, targetPerPlot ) ) for tgt in grp } pdata[ 'plotGroup' ] = pdata.target.map(grps) g = sns.catplot( data=pdata, x='target', sharex=False, col='plotGroup',col_wrap=2, y='length', 
kind='violin',aspect=2) g.set_xticklabels(rotation=45); for ax in g.axes.flatten(): ax.set_title('') plt.tight_layout() g.savefig(f'{outDir}/dedup_length_by_target.png',dpi=DPI) plt.clf() #### #readlength by target #### print('writing readlength by target') pdata = data wrap = int( len(order) ** 0.5 + 1 ) g = sns.FacetGrid(data=pdata, hue='sample', col='target',col_wrap=wrap, col_order=order,sharey=False) order = sorted(pdata.target.unique()) g.map(sns.kdeplot,'length') g.set_xlabels('Read Length') g.set_ylabels('HiFi Reads') g.add_legend() g.savefig(f'{outDir}/readlength_hist_by_target.png',dpi=DPI) plt.clf() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import sys allDemuxReads = pd.read_csv( snakemake.input.readCsv ) limaReport = pd.read_csv( snakemake.input.lima, sep='\t', usecols=['ReadLengths','PassedFilters'] ) dups = pd.read_csv( snakemake.input.dups, usecols=['sample','length','category'] ) outdir = snakemake.params.odir #load data onTarget = allDemuxReads.query('target != "."')\ .assign(category='OnTarget Unique')[['sample','category','length']] offTarget = allDemuxReads.query('target == "." and ismapped==1')\ .assign(category='OffTarget Unique')[['sample','category','length']] unmapped = allDemuxReads.query('ismapped==0')\ .assign(category='Unmapped Unique')[['sample','category','length']] noDemux = limaReport.query('PassedFilters==0')\ .assign(category='Failed Demux')\ .drop('PassedFilters',axis=1).rename(columns={'ReadLengths':'length'}) #merge allData = pd.concat([onTarget,offTarget,unmapped,dups,noDemux],ignore_index=True) #plot hue_order=['OnTarget Unique','OffTarget Unique','Unmapped Unique','Duplicates','Failed Demux'] f, (ax0, ax1) = plt.subplots(1, 2, figsize=(10,8), gridspec_kw={'width_ratios': [1, 5]}) #histogram sns.histplot(data=allData, x='length', hue_order=hue_order, hue='category', multiple='stack', binrange=(0,15000), ax=ax1) #side plot data = allData.groupby('category').size().reindex(hue_order[::-1]).rename('reads').reset_index() data['yieldFrac'] = data.reads.transform(lambda x:x/x.sum()).fillna(0) bottom=0 colors = sns.color_palette()[:5][::-1] for i,row in data.iterrows(): ax0.bar(1,row.yieldFrac,bottom=bottom,color=colors[i],alpha=0.75) ax0.annotate(f'{row.yieldFrac:.2f}',(1,bottom + .5*row.yieldFrac),ha='center',va='center') bottom += row.yieldFrac ax0.set_xticks([]) ax0.set_ylabel('Total Fraction') plt.tight_layout() f.savefig(f'{outdir}/read_categories.png',dpi=400) # write readlength legnth stats table allData.fillna('noSampleData')\ .groupby(['category','sample'])\ .length.describe().dropna().astype(int)\ .to_csv(f'{outdir}/read_length_by_sample.csv') |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 | import sys import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns DPI=200 readCsv = snakemake.input.csv targetBed = snakemake.input.bed targetBuffer = int(snakemake.params.buffer) outDir = snakemake.params.odir targets = pd.read_csv(targetBed, sep='\t', names=['chr','start','stop','target'])\ .set_index('target') #Set length of target region targets['tlength'] = targets.eval('stop - start + 2 * @targetBuffer') data = pd.read_csv(readCsv) #replace "." with "off-target" ot = 'off-target' data.target = data.target.str.replace('.',ot,regex=False) #### #mean base coverage #### pdata = data.query(' target != @ot ')\ .groupby( 'target' )\ .length.sum().reset_index() try: pdata['meanBaseCoverage'] = pdata.length / pdata.target.map(targets.tlength) except pd.errors.InvalidIndexError as e: print(f'Error: Is your target name list unique?\n\n{e}') sys.exit() order = sorted(pdata.target) g = sns.catplot(data=pdata, x='target',order=order, y='meanBaseCoverage', kind='bar',aspect=2) plt.xticks(rotation=45) g.savefig(f'{outDir}/mean_base_coverage_by_target.png',dpi=DPI) plt.clf() ### #sample readlength ### pdata = data.query( ' target != @ot ') g = sns.displot(pdata.length,kind='hist',kde=True) g.savefig(f'{outDir}/readlength_hist.png',dpi=DPI) plt.clf() #### #readlength by target ### pdata = data order = sorted(pdata.target.unique()) g = sns.FacetGrid(data=pdata, xlim=(0,10000), col='target',col_wrap=7, col_order=order, hue='sample', sharey=False) g.map(sns.kdeplot,'length') g.set_xlabels('Read Length') g.set_ylabels('HiFi Reads') g.add_legend() g.savefig(f'{outDir}/readlength_hist_by_target.png',dpi=DPI) plt.clf() #### #readlength by target #### pdata = data order = sorted(pdata.target.unique()) g = sns.catplot( data=pdata, x='target', y='length', order=order, kind='violin', aspect=2) plt.xticks(rotation=45); g.savefig(f'{outDir}/readlength_violins_by_target.png',dpi=DPI) plt.clf() #### #dedup count by target #### pdata = data order = sorted(pdata.target.unique()) g = sns.catplot( data=pdata, x='target', y='duplicates', order=order, kind='box',aspect=2) plt.xticks(rotation=45); g.savefig(f'{outDir}/dedup_count_by_target.png',dpi=DPI) plt.clf() #### #dedup rate by target #### def dedup_rate(d): dups = d.duplicates.sum() return dups / ( dups + len(d) ) pdata = data.groupby('target')\ .apply(dedup_rate)\ .rename('Duplication Rate')\ .reset_index() order = sorted(pdata.target) g = sns.catplot(data=pdata, x='target',order=order, y='Duplication Rate', kind='bar',aspect=2) plt.xticks(rotation=45) g.savefig(f'{outDir}/dedup_rate_by_target.png',dpi=DPI) plt.clf() #### #rq by target ### pdata = data order = sorted(pdata.target.unique()) g = sns.catplot( data=pdata, x='target', y='qual', order=order, kind='box',aspect=2) plt.xticks(rotation=45) g.savefig(f'{outDir}/readqual_by_target.png',dpi=DPI) plt.clf() ### #off target length ### pdata = data.query( ' target == @ot ') g = sns.displot(pdata.length,kind='hist',kde=True) 
g.savefig(f'{outDir}/off-target_length_hist.png',dpi=DPI) |
20 21 22 23 24 25 26 27 | shell: """ (whatshap phase {params.extra} \ --output {output} \ --reference {input.reference} \ {input.vcf} \ {input.phaseinput}) > {log} 2>&1 """ |
47 48 49 50 51 52 53 54 55 | shell: """ (whatshap stats \ --gtf {output.gtf} \ --tsv {output.tsv} \ --block-list {output.blocklist} \ --chr-lengths {input.chr_lengths} \ {input.vcf}) > {log} 2>&1 """ |
77 78 79 80 81 82 83 | shell: """ (whatshap haplotag {params} \ --output {output} \ --reference {input.reference} \ {input.vcf} {input.bam}) > {log} 2>&1 """ |
100 101 | shell: "(samtools index -@ 3 {input}) > {log} 2>&1" |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | __author__ = "Fabian Kilpert" __copyright__ = "Copyright 2020, Fabian Kilpert" __email__ = "fkilpert@gmail.com" __license__ = "MIT" import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts log = snakemake.log_fmt_shell() extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) with tempfile.TemporaryDirectory() as tmpdir: shell( "picard BedToIntervalList" " {java_opts} {extra}" " --INPUT {snakemake.input.bed}" " --SEQUENCE_DICTIONARY {snakemake.input.dict}" " --TMP_DIR {tmpdir}" " --OUTPUT {snakemake.output}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2018, Johannes Köster" __email__ = "johannes.koester@protonmail.com" __license__ = "MIT" import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=False, stderr=True) with tempfile.TemporaryDirectory() as tmpdir: shell( "picard CreateSequenceDictionary" " {java_opts} {extra}" " --REFERENCE {snakemake.input[0]}" " --TMP_DIR {tmpdir}" " --OUTPUT {snakemake.output[0]}" " {log}" ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts log = snakemake.log_fmt_shell(stdout=False, stderr=True) extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) with tempfile.TemporaryDirectory() as tmpdir: shell( "picard CollectHsMetrics" " {java_opts} {extra}" " --INPUT {snakemake.input.bam}" " --TMP_DIR {tmpdir}" " --OUTPUT {snakemake.output[0]}" " --REFERENCE_SEQUENCE {snakemake.input.reference}" " --BAIT_INTERVALS {snakemake.input.bait_intervals}" " --TARGET_INTERVALS {snakemake.input.target_intervals}" " {log}" ) |
Support
- Future updates