Structural Variant Detection and Analysis Pipeline for High-Quality Human Genomes using SVDSS
Tools
conda install minimap2 pbsv=2.6.2 sniffles=1.0.12 cutesv=1.0.11 svim=1.4.2 dipcall samtools bcftools pysam biopython numpy pandas seaborn
pip3 install truvari
# We need ngmlr v0.2.8 (master) due to a bug in previous releases (not yet in conda at the time of the experiments)
git clone https://github.com/philres/ngmlr.git
cd ngmlr
git checkout a2a31fb6a63547be29c5868a7747e0c6f6e9e41f
mkdir build ; cd build
cmake ..
make
# pbmm2 and debreak will be loaded by snakemake from envs/ folder due to conflicts
# clone and install SVDSS from https://github.com/Parsoa/SVDSS
Data
In our experimental evaluation we used hg38.
Reference and annotations
# Reference
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
# Keep only the primary chromosomes
samtools faidx hg38.fa chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > hg38.chroms.fa
# Map non-ACGT characters to N
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' hg38.chroms.fa
samtools faidx hg38.chroms.fa
# Build the genome file (chrom, size) later used by bedtools complement
cut -f1,2 hg38.chroms.fa.fai | sort -k 1 > hg38.chroms.fa.genome
# Dipcall PAR
wget https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed
# pbsv TRF regions
wget https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_GRCh38_no_alt_analysis_set.trf.bed
Tiers
In our paper we used GIAB Tier 1 and built an Extended Tier 2, defined as everything outside Tier 1. Tier 1 is downloaded from the GIAB FTP server and lifted to hg38; the Extended Tier 2 (here called Tier23) is computed by complementing Tier 1.
# Get Tier 1 and lift to hg38
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed
sed -e 's/^/chr/' HG002_SVs_Tier1_v0.6.bed > Tier1_v0.6.hg19.bed
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chmod +x liftOver
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip hg19ToHg38.over.chain.gz
./liftOver Tier1_v0.6.hg19.bed hg19ToHg38.over.chain Tier1_v0.6.hg38.bed Tier1_v0.6.unlifted.bed
# Clean and sort bed to be able to complement it
for c in $(seq 1 22) X Y ; do grep -P "^chr${c}\t" Tier1_v0.6.hg38.bed ; done | sort -k1,1 -k2,2n > Tier1_v0.6.hg38.noalt.bed
bedtools complement -i Tier1_v0.6.hg38.noalt.bed -g hg38.chroms.fa.genome > Tier23_v0.6.hg38.noalt.bed
Truth VCFs
For users' convenience, we provide the three VCFs computed with dipcall against hg38 and used as ground truth in our analysis (see the truthsets folder).
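A minimal sketch of how one of these truth VCFs could be used directly with truvari, outside of the Snakemake workflow (the file name under truthsets/ and the caller VCF path are hypothetical placeholders; Snakefile.dipvari performs this comparison automatically):
# Hypothetical example: benchmark one caller VCF against a provided dipcall truth VCF
bgzip -c truthsets/hg007.dipcall.vcf > hg007.dipcall.vcf.gz   # skip if already bgzip-compressed
tabix -p vcf hg007.dipcall.vcf.gz
truvari bench -b hg007.dipcall.vcf.gz -c ${OUT}/pbmm2/pp.vcf.gz -f hg38.chroms.fa \
        -o truvari-hg007-pp -r 1000 -p 0.00 -s 20 -S 20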
config.yaml
The experiments are provided as Snakemake workflows. The file config.yaml is used to tweak the snakemake execution (a sketch of a possible config is shown after the list):
- name: custom name for the run
- fa: reference genome (i.e., hg38.chroms.fa)
- fq: HiFi sample
- trf: trf annotation
- par: PAR regions
- hap1: first haplotype
- hap2: second haplotype
- tier1: GIAB Tier 1
- tier23: Extended Tier 2
- out: output directory (everything will go here)
- pingpong_bin: path to SVDSS binary
- ngmlr_bin: path to ngmlr (v0.2.8) binary
- threads: number of threads for each rule that requires more threads
- aligners: list of aligners to use
- callers: list of callers to use
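A minimal sketch of what such a config.yaml could look like (all paths, the sample name, and the exact list syntax are illustrative assumptions; refer to the config.yaml shipped with the workflow for the authoritative format):
# Write a hypothetical config.yaml (placeholder values only)
cat > config.yaml <<'EOF'
name: HG007                                               # hypothetical run name
fa: /data/hg38.chroms.fa
fq: /data/HG007.hifi.fq
trf: /data/human_GRCh38_no_alt_analysis_set.trf.bed
par: /data/hs38.PAR.bed
hap1: /data/HG007.hap1.fa
hap2: /data/HG007.hap2.fa
tier1: /data/Tier1_v0.6.hg38.noalt.bed
tier23: /data/Tier23_v0.6.hg38.noalt.bed
out: /data/out-hg007
pingpong_bin: /opt/SVDSS/SVDSS
ngmlr_bin: /opt/ngmlr/bin/ngmlr
threads: 16
aligners: [pbmm2, minimap2, ngmlr]
callers: [pp, pbsv, cutesv, svim, sniffles, debreak]
EOF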
HG007
Download corrected HiFi reads and corresponding haplotypes, change config.yaml accordingly, and run:
# Mappers + callers
snakemake --use-conda [-n] -p -j 16
# Dipcall and truvari
snakemake -s Snakefile.dipvari [-n] -p -j 8
Coverage titration for 5x and 10x
samtools view -b -s 0.3 /path/to/out/dir/pbmm2.bam > pbmm2.5x.bam
samtools fastq pbmm2.5x.bam > pbmm2.5x.fq
# change config.yaml
snakemake --use-conda -p -j 16
snakemake -s Snakefile.dipvari -p -j 8
samtools view -b -s 0.6 /path/to/out/dir/pbmm2.bam > pbmm2.10x.bam
samtools fastq pbmm2.10x.bam > pbmm2.10x.fq
# change config.yaml
snakemake --use-conda -p -j 16
snakemake -s Snakefile.dipvari -p -j 8
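The subsampling fractions above (0.3 and 0.6) are relative to the coverage of the full HG007 BAM; for a dataset with different coverage, the fraction for a target coverage can be derived first. A minimal sketch (file name and target value are placeholders):
# Hypothetical helper: estimate mean coverage and derive the samtools -s fraction
TARGET=5                                   # desired coverage (x)
BAM=/path/to/out/dir/pbmm2.bam             # full-coverage BAM (placeholder path)
COV=$(samtools depth -a ${BAM} | awk '{sum+=$3; n++} END {printf "%.2f", sum/n}')
FRAC=$(python3 -c "print(round(${TARGET}/${COV}, 2))")
samtools view -b -s ${FRAC} ${BAM} > pbmm2.${TARGET}x.bam
samtools fastq pbmm2.${TARGET}x.bam > pbmm2.${TARGET}x.fq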
To plot results, assuming ${OUT-*} to be the output folders set in config.yaml, run:
python3 plot_coverage.py ${OUT-5x}/pbmm2/results.csv ${OUT-10x}/pbmm2/results.csv ${OUT}/pbmm2/results.csv
Other plots
Assuming ${OUT} to be the output folder set in config.yaml:
# VCF length distribution
python3 plot_vcflen.py ${OUT}
# Upset plot for Ext. Tier 2 region
python3 plot_upset.py ${OUT}/pbmm2/truvari/ tier23conf > hg007.tier23conf.ppexclusive.vcf
# Exclusive calls from SVDSS heterozygosity
python3 check_exclusive.py get hg007.tier23conf.ppexclusive.vcf ${OUT}/pbmm2/truvari/dipcall.tier23conf.vcf > hg007.tier23conf.ppexclusive.dist
python3 check_exclusive.py plot hg007.tier23conf.ppexclusive.dist
HG002
Download corrected HiFi reads and corresponding haplotypes, update config.yaml accordingly, and run snakemake.
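As for HG007, this amounts to running, for example:
# Mappers + callers
snakemake --use-conda -p -j 16
# Dipcall and truvari
snakemake -s Snakefile.dipvari -p -j 8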
CMRG analysis
To compare the callsets to the CMRG callset:
# Download the CMRG callset
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz.tbi
# Run truvari against the CMRG callset (assuming ${HG2-OUT} to be the output folder for the HG002 analysis)
for t in pp cutesv pbsv svim sniffles debreak ; do truvari bench -b HG002_GRCh38_CMRG_SV_v1.00.vcf.gz -c ${HG2-OUT}/pbmm2/${t}.vcf.gz -o ${t} -f ~/data/hg38-refanno/hg38.chroms.fa -r 1000 -p 0.00 -s 20 -S 20 ; done
# Plot the CMRG venn and supervenn
python3 medical_analysis.py . # . is the directory containing the folders created by truvari in the loop above
GIAB vs dipcall analysis
To compare the tools against the GIAB callset, download the hg19 data and run the pipeline.
Download (hg19) reference and annotations:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
# Extract chromosomes
samtools faidx hs37d5.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y > hs37d5.chroms.fa
# Map non-ACGT characters to N:
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' hs37d5.chroms.fa
# Get hg19 PAR region from dipcall repository:
wget https://raw.githubusercontent.com/lh3/dipcall/master/data/hs37d5.PAR.bed
Download GIAB Tier 1 and complement it:
# Get GIAB VCF and tiers:
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz.tbi
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed
# Here we just complement the Tier 1 to get the Extended Tier 2
bedtools complement -i HG002_SVs_Tier1_v0.6.bed -g hs37d5.chroms.fa.fai > HG002_SVs_Tier23_v0.6.bed
Then, update config.yaml accordingly (see config-hg19.yaml, i.e., everything should point to hg19 and a new output directory) and:
# run everything on the older reference release
snakemake --use-conda -p -j 16
# compare the new results with dipcall (against hg19)
snakemake -s Snakefile.dipvari -p -j 8
# compare the new results with the GIAB callset, GIAB vs dipcall, and dipcall vs GIAB
snakemake -s Snakefile.giabvari -p -j 8
# Heterozygosity plot
python3 plot_hetebars.py {dipcall.clean.vcf.gz} {GIAB.clean.vcf.gz} ${OUT}/pbmm2/
CHM13
Download HiFi reads and corresponding haplotypes, update config.yaml accordingly, and run snakemake.
Code Snippets
# Rebuild a clean VCF: keep the header of the first VCF, borrow contig/FORMAT lines
# from the second, and drop records with SVLEN=NULL
vcf=$1
vcf2=$2
grep "^##" ${vcf}
grep "^##contig" ${vcf2}
grep "^##FORMAT" ${vcf2}
grep "^#C" ${vcf}
grep -v "^#" ${vcf} | grep -v "SVLEN=NULL"
shell:
    """
    minimap2 -ax map-hifi --MD --eqx -Y -R \'@RG\\tID:{params.name}\' -t {threads} {input.fa} {input.fq} -o {params.sam}
    samtools view -u {params.sam} | samtools sort -T {output.bam}.sort-tmp > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    pbmm2 align -j {threads} --preset CCS --sort --rg \'@RG\\tID:{params.name}\' --sample {params.name} {input.fa} {input.fq} {params.bam}
    samtools index {params.bam}
    samtools calmd -b {params.bam} {input.fa} > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    {NGMLR_BIN} -t {threads} --rg-id {params.name} -r {input.fa} -q {input.fq} -o {params.sam}
    samtools view -uS {params.sam} | samtools sort -T {output.bam}.sort-tmp > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    # keep only primary alignments (flag 2304 = secondary + supplementary)
    samtools view -b -F 2304 {input.bam} > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    pbsv discover --tandem-repeats {input.bed} {input.bam} {output.svsig}
    """
shell:
    """
    pbsv call -j {threads} --ccs -t INS,DEL {input.fa} {input.svsig} {output.vcf}
    """
shell:
    """
    mv {input.vcf} {output.vcf}
    """
shell:
    """
    mkdir -p {params.wdir}
    cuteSV -t {threads} -s 2 --max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0.5 {input.bam} {input.fa} {output.vcf} {params.wdir}
    """
shell:
    """
    # drop INV/DUP/BND records and re-add a STRAND INFO definition to the header
    grep -v 'INV\|DUP\|BND' {input.vcf} > {output.vcf}
    sed -i '4 a ##INFO=<ID=STRAND,Number=1,Type=String,Description="Strand ? (line manually added)">' {output.vcf}
    """
shell:
    """
    svim alignment --min_sv_size 30 --cluster_max_distance 1.4 --interspersed_duplications_as_insertions --tandem_duplications_as_insertions {output.odir} {input.bam} {input.fa}
    """
shell:
    """
    # normalize SVIM calls: relabel duplications as insertions, keep only symbolic
    # <INS>/<DEL> records, drop SUPPORT=1 calls, and rename the q5 filter to PASS
    cat {input.idir}/variants.vcf \
    | sed 's/INS:NOVEL/INS/g' \
    | sed 's/DUP:INT/INS/g' \
    | sed 's/DUP:TANDEM/INS/g' \
    | awk '{{ if($1 ~ /^#/) {{ print $0 }} else {{ if($5=="<DEL>" || $5=="<INS>") {{ print $0 }} }} }}' \
    | grep -v 'SUPPORT=1;' \
    | sed 's/q5/PASS/g' > {output.vcf}
    """
shell:
    """
    sniffles --ccs_reads -s 2 -l 30 -m {input.bam} -v {output.vcf} --genotype
    """
shell:
    """
    # keep DUP/INS/DEL records (relabeling DUP as INS), sort the body, and add a missing STRANDBIAS FILTER header line
    cat <(cat {input.vcf} | grep "^#") <(cat {input.vcf} | grep -vE "^#" | grep 'DUP\|INS\|DEL' | sed 's/DUP/INS/g' | sort -k1,1 -k2,2g) > {output.vcf}
    sed -i '4 a ##FILTER=<ID=STRANDBIAS,Description="Strand is biased.">' {output.vcf}
    """
shell:
    """
    debreak --bam {input.bam} --outpath {params.odir} --rescue_large_ins --poa --ref {input.fa} --min_size 30 --min_support 2 --aligner {params.aligner} --thread {threads}
    """
shell:
    """
    bash clean_debreak.sh {input.debreak} {input.pp} > {output.vcf}
    """
shell:
    """
    {PP_BIN} index --fastq {input.fa} --index {output.fmd}
    """
shell:
    """
    {PP_BIN} smooth --reference {input.fa} --bam {input.bam} --workdir {params.wd} --threads {threads}
    """
shell:
    """
    samtools sort -T {output.bam}.sort-tmp {input.bam} > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    {PP_BIN} search --index {input.fmd} --bam {input.bam} --threads {threads} --workdir {params.wd} --assemble
    """
shell:
    """
    # count the assembled SFS batches produced by the search step and pass that number to the caller
    n=$(ls {params.wd}/solution_batch_*.assembled.sfs | wc -l)
    {PP_BIN} call --reference {input.fa} --bam {input.bam} --threads {threads} --workdir {params.wd} --batches ${{n}} --min-cluster-weight 2
    mv {params.wd}/svs_poa.vcf {output.vcf}
    """
shell:
    """
    bcftools sort -T {params.tmp_prefix} -Oz {input} > {output}
    tabix -p vcf {output}
    """