Structural Variant Detection and Analysis Pipeline for High-Quality Human Genomes using SVDSS
Tools
conda install minimap2 pbsv=2.6.2 sniffles=1.0.12 cutesv=1.0.11 svim=1.4.2 dipcall samtools bcftools pysam biopython numpy pandas seaborn
pip3 install truvari
# We need ngmlr v0.2.8 (master) due to a bug in previous releases (not yet in conda at the time of the experiments)
git clone https://github.com/philres/ngmlr.git
cd ngmlr
git checkout a2a31fb6a63547be29c5868a7747e0c6f6e9e41f
mkdir build ; cd build
cmake ..
make
# pbmm2 and debreak will be loaded by snakemake from envs/ folder due to conflicts
# clone and install SVDSS from https://github.com/Parsoa/SVDSS
Data
In our experimental evaluation we used hg38.
Reference and annotations
# Reference
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
# Keep only the primary chromosomes
samtools faidx hg38.fa chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > hg38.chroms.fa
# Map non-ACGT characters to N
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' hg38.chroms.fa
samtools faidx hg38.chroms.fa
# Build the genome file (chrom, size) later used by bedtools complement
cut -f1,2 hg38.chroms.fa.fai | sort -k 1 > hg38.chroms.fa.genome
# Dipcall PAR
wget https://raw.githubusercontent.com/lh3/dipcall/master/data/hs38.PAR.bed
# pbsv TRF regions
wget https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_GRCh38_no_alt_analysis_set.trf.bed
Tiers
In our paper we used GIAB Tier 1 and built an Extended Tier 2, defined as everything outside Tier 1. Tier 1 is downloaded from the GIAB FTP server and lifted to hg38; the Extended Tier 2 (here called Tier23) is computed by complementing Tier 1.
# Get Tier 1 and lift to hg38
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed
sed -e 's/^/chr/' HG002_SVs_Tier1_v0.6.bed > Tier1_v0.6.hg19.bed
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chmod +x liftOver
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip hg19ToHg38.over.chain.gz
./liftOver Tier1_v0.6.hg19.bed hg19ToHg38.over.chain Tier1_v0.6.hg38.bed Tier1_v0.6.unlifted.bed
# Clean and sort bed to be able to complement it
for c in $(seq 1 22) X Y ; do grep -P "^chr${c}\t" Tier1_v0.6.hg38.bed ; done | sort -k1,1 -k2,2n > Tier1_v0.6.hg38.noalt.bed
bedtools complement -i Tier1_v0.6.hg38.noalt.bed -g hg38.chroms.fa.genome > Tier23_v0.6.hg38.noalt.bed
Truth VCFs
For users' convenience, we provide the three VCFs computed with dipcall against hg38 and used as ground truth in our analysis (see the truthsets folder).
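A minimal sketch of how one of these truth VCFs could be used directly with truvari, outside of the Snakemake workflow (the file name under truthsets/ and the caller VCF path are hypothetical placeholders; Snakefile.dipvari performs this comparison automatically):
# Hypothetical example: benchmark one caller VCF against a provided dipcall truth VCF
bgzip -c truthsets/hg007.dipcall.vcf > hg007.dipcall.vcf.gz   # skip if already bgzip-compressed
tabix -p vcf hg007.dipcall.vcf.gz
truvari bench -b hg007.dipcall.vcf.gz -c ${OUT}/pbmm2/pp.vcf.gz -f hg38.chroms.fa \
        -o truvari-hg007-pp -r 1000 -p 0.00 -s 20 -S 20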
config.yaml
The experiments are provided as Snakemake workflows. The file config.yaml is used to tweak the snakemake execution (a sketch of a possible config is shown after the list):
- name: custom name for the run
- fa: reference genome (i.e., hg38.chroms.fa)
- fq: HiFi sample
- trf: trf annotation
- par: PAR regions
- hap1: first haplotype
- hap2: second haplotype
- tier1: GIAB Tier 1
- tier23: Extended Tier 2
- out: output directory (everything will go here)
- pingpong_bin: path to SVDSS binary
- ngmlr_bin: path to ngmlr (v0.2.8) binary
- threads: number of threads for each rule that requires more threads
- aligners: list of aligners to use
- callers: list of callers to use
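A minimal sketch of what such a config.yaml could look like (all paths, the sample name, and the exact list syntax are illustrative assumptions; refer to the config.yaml shipped with the workflow for the authoritative format):
# Write a hypothetical config.yaml (placeholder values only)
cat > config.yaml <<'EOF'
name: HG007                                               # hypothetical run name
fa: /data/hg38.chroms.fa
fq: /data/HG007.hifi.fq
trf: /data/human_GRCh38_no_alt_analysis_set.trf.bed
par: /data/hs38.PAR.bed
hap1: /data/HG007.hap1.fa
hap2: /data/HG007.hap2.fa
tier1: /data/Tier1_v0.6.hg38.noalt.bed
tier23: /data/Tier23_v0.6.hg38.noalt.bed
out: /data/out-hg007
pingpong_bin: /opt/SVDSS/SVDSS
ngmlr_bin: /opt/ngmlr/bin/ngmlr
threads: 16
aligners: [pbmm2, minimap2, ngmlr]
callers: [pp, pbsv, cutesv, svim, sniffles, debreak]
EOF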
HG007
Download corrected HiFi reads and corresponding haplotypes, change config.yaml accordingly, and run:
# Mappers + callers
snakemake --use-conda [-n] -p -j 16
# Dipcall and truvari
snakemake -s Snakefile.dipvari [-n] -p -j 8
Coverage titration for 5x and 10x
samtools view -b -s 0.3 /path/to/out/dir/pbmm2.bam > pbmm2.5x.bam
samtools fastq pbmm2.5x.bam > pbmm2.5x.fq
# change config.yaml
snakemake --use-conda -p -j 16
snakemake -s Snakefile.dipvari -p -j 8
samtools view -b -s 0.6 /path/to/out/dir/pbmm2.bam > pbmm2.10x.bam
samtools fastq pbmm2.10x.bam > pbmm2.10x.fq
# change config.yaml
snakemake --use-conda -p -j 16
snakemake -s Snakefile.dipvari -p -j 8
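The subsampling fractions above (0.3 and 0.6) are relative to the coverage of the full HG007 BAM; for a dataset with different coverage, the fraction for a target coverage can be derived first. A minimal sketch (file name and target value are placeholders):
# Hypothetical helper: estimate mean coverage and derive the samtools -s fraction
TARGET=5                                   # desired coverage (x)
BAM=/path/to/out/dir/pbmm2.bam             # full-coverage BAM (placeholder path)
COV=$(samtools depth -a ${BAM} | awk '{sum+=$3; n++} END {printf "%.2f", sum/n}')
FRAC=$(python3 -c "print(round(${TARGET}/${COV}, 2))")
samtools view -b -s ${FRAC} ${BAM} > pbmm2.${TARGET}x.bam
samtools fastq pbmm2.${TARGET}x.bam > pbmm2.${TARGET}x.fq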
To plot results, assuming ${OUT-*} to be the output folders set in config.yaml, run:
python3 plot_coverage.py ${OUT-5x}/pbmm2/results.csv ${OUT-10x}/pbmm2/results.csv ${OUT}/pbmm2/results.csv
Other plots
Assuming ${OUT} to be the output folder set in config.yaml:
# VCF length distribution
python3 plot_vcflen.py ${OUT}
# Upset plot for Ext. Tier 2 region
python3 plot_upset.py ${OUT}/pbmm2/truvari/ tier23conf > hg007.tier23conf.ppexclusive.vcf
# Exclusive calls from SVDSS heterozygosity
python3 check_exclusive.py get hg007.tier23conf.ppexclusive.vcf ${OUT}/pbmm2/truvari/dipcall.tier23conf.vcf > hg007.tier23conf.ppexclusive.dist
python3 check_exclusive.py plot hg007.tier23conf.ppexclusive.dist
HG002
Download corrected HiFi reads and corresponding haplotypes, update config.yaml accordingly, and run snakemake.
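As for HG007, this amounts to running, for example:
# Mappers + callers
snakemake --use-conda -p -j 16
# Dipcall and truvari
snakemake -s Snakefile.dipvari -p -j 8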
CMRG analysis
To compare the callsets to the CMRG callset:
# Download the CMRG callset
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz.tbi
# Run truvari against the CMRG callset (assuming ${HG2-OUT} to be the output folder for the HG002 analysis)
for t in pp cutesv pbsv svim sniffles debreak ; do truvari bench -b HG002_GRCh38_CMRG_SV_v1.00.vcf.gz -c ${HG2-OUT}/pbmm2/${t}.vcf.gz -o ${t} -f ~/data/hg38-refanno/hg38.chroms.fa -r 1000 -p 0.00 -s 20 -S 20 ; done
# Plot the CMRG venn and supervenn
python3 medical_analysis.py . # . is the directory containing the folders created by truvari in the loop above
GIAB vs dipcall analysis
To compare the tools against the GIAB callset, download the hg19 data and run the pipeline.
Download (hg19) reference and annotations:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gunzip hs37d5.fa.gz
# Extract chromosomes
samtools faidx hs37d5.fa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y > hs37d5.chroms.fa
# Map non-ACGT characters to N:
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' hs37d5.chroms.fa
# Get hg19 PAR region from dipcall repository:
wget https://raw.githubusercontent.com/lh3/dipcall/master/data/hs37d5.PAR.bed
Download GIAB Tier 1 and complement it:
# Get GIAB VCF and tiers:
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz.tbi
wget https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed
# Here we just complement the Tier 1 to get the Extended Tier 2
bedtools complement -i HG002_SVs_Tier1_v0.6.bed -g hs37d5.chroms.fa.fai > HG002_SVs_Tier23_v0.6.bed
Then, update config.yaml accordingly (see config-hg19.yaml, i.e., everything should point to hg19 and a new output directory) and:
# run everything on the older reference release
snakemake --use-conda -p -j 16
# compare the new results with dipcall (against hg19)
snakemake -s Snakefile.dipvari -p -j 8
# compare the new results with the GIAB callset, GIAB vs dipcall, and dipcall vs GIAB
snakemake -s Snakefile.giabvari -p -j 8
# Heterozygosity plot
python3 plot_hetebars.py {dipcall.clean.vcf.gz} {GIAB.clean.vcf.gz} ${OUT}/pbmm2/
CHM13
Download HiFi reads and corresponding haplotypes, update config.yaml accordingly, and run snakemake.
Code Snippets
# Rebuild a clean VCF: keep the header of the first VCF, borrow contig/FORMAT lines
# from the second, and drop records with SVLEN=NULL
vcf=$1
vcf2=$2
grep "^##" ${vcf}
grep "^##contig" ${vcf2}
grep "^##FORMAT" ${vcf2}
grep "^#C" ${vcf}
grep -v "^#" ${vcf} | grep -v "SVLEN=NULL"
shell:
    """
    minimap2 -ax map-hifi --MD --eqx -Y -R \'@RG\\tID:{params.name}\' -t {threads} {input.fa} {input.fq} -o {params.sam}
    samtools view -u {params.sam} | samtools sort -T {output.bam}.sort-tmp > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    pbmm2 align -j {threads} --preset CCS --sort --rg \'@RG\\tID:{params.name}\' --sample {params.name} {input.fa} {input.fq} {params.bam}
    samtools index {params.bam}
    samtools calmd -b {params.bam} {input.fa} > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    {NGMLR_BIN} -t {threads} --rg-id {params.name} -r {input.fa} -q {input.fq} -o {params.sam}
    samtools view -uS {params.sam} | samtools sort -T {output.bam}.sort-tmp > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    # keep only primary alignments (flag 2304 = secondary + supplementary)
    samtools view -b -F 2304 {input.bam} > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    pbsv discover --tandem-repeats {input.bed} {input.bam} {output.svsig}
    """
shell:
    """
    pbsv call -j {threads} --ccs -t INS,DEL {input.fa} {input.svsig} {output.vcf}
    """
shell:
    """
    mv {input.vcf} {output.vcf}
    """
shell:
    """
    mkdir -p {params.wdir}
    cuteSV -t {threads} -s 2 --max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0.5 {input.bam} {input.fa} {output.vcf} {params.wdir}
    """
shell:
    """
    # drop INV/DUP/BND records and re-add a STRAND INFO definition to the header
    grep -v 'INV\|DUP\|BND' {input.vcf} > {output.vcf}
    sed -i '4 a ##INFO=<ID=STRAND,Number=1,Type=String,Description="Strand ? (line manually added)">' {output.vcf}
    """
shell:
    """
    svim alignment --min_sv_size 30 --cluster_max_distance 1.4 --interspersed_duplications_as_insertions --tandem_duplications_as_insertions {output.odir} {input.bam} {input.fa}
    """
shell:
    """
    # normalize SVIM calls: relabel duplications as insertions, keep only symbolic
    # <INS>/<DEL> records, drop SUPPORT=1 calls, and rename the q5 filter to PASS
    cat {input.idir}/variants.vcf \
    | sed 's/INS:NOVEL/INS/g' \
    | sed 's/DUP:INT/INS/g' \
    | sed 's/DUP:TANDEM/INS/g' \
    | awk '{{ if($1 ~ /^#/) {{ print $0 }} else {{ if($5=="<DEL>" || $5=="<INS>") {{ print $0 }} }} }}' \
    | grep -v 'SUPPORT=1;' \
    | sed 's/q5/PASS/g' > {output.vcf}
    """
shell:
    """
    sniffles --ccs_reads -s 2 -l 30 -m {input.bam} -v {output.vcf} --genotype
    """
shell:
    """
    # keep DUP/INS/DEL records (relabeling DUP as INS), sort the body, and add a missing STRANDBIAS FILTER header line
    cat <(cat {input.vcf} | grep "^#") <(cat {input.vcf} | grep -vE "^#" | grep 'DUP\|INS\|DEL' | sed 's/DUP/INS/g' | sort -k1,1 -k2,2g) > {output.vcf}
    sed -i '4 a ##FILTER=<ID=STRANDBIAS,Description="Strand is biased.">' {output.vcf}
    """
shell:
    """
    debreak --bam {input.bam} --outpath {params.odir} --rescue_large_ins --poa --ref {input.fa} --min_size 30 --min_support 2 --aligner {params.aligner} --thread {threads}
    """
shell:
    """
    bash clean_debreak.sh {input.debreak} {input.pp} > {output.vcf}
    """
shell:
    """
    {PP_BIN} index --fastq {input.fa} --index {output.fmd}
    """
shell:
    """
    {PP_BIN} smooth --reference {input.fa} --bam {input.bam} --workdir {params.wd} --threads {threads}
    """
shell:
    """
    samtools sort -T {output.bam}.sort-tmp {input.bam} > {output.bam}
    samtools index {output.bam}
    """
shell:
    """
    {PP_BIN} search --index {input.fmd} --bam {input.bam} --threads {threads} --workdir {params.wd} --assemble
    """
shell:
    """
    # count the assembled SFS batches produced by the search step and pass that number to the caller
    n=$(ls {params.wd}/solution_batch_*.assembled.sfs | wc -l)
    {PP_BIN} call --reference {input.fa} --bam {input.bam} --threads {threads} --workdir {params.wd} --batches ${{n}} --min-cluster-weight 2
    mv {params.wd}/svs_poa.vcf {output.vcf}
    """
shell:
    """
    bcftools sort -T {params.tmp_prefix} -Oz {input} > {output}
    tabix -p vcf {output}
    """