A Snakemake pipeline for variant calling of genomic FASTQ data using GATK
Introduction
This project is a Snakemake workflow for processing `.fastq.gz` files and performing downstream analyses. Variant calling primarily follows the GATK best practices for germline short variant discovery. Downstream analyses include pairwise relationship estimation and admixture.
Dependencies
`conda` is a package manager that can install many of the tools used by this workflow. Once it is installed, use the conda commands below to obtain the required packages:

```
conda install mamba
mamba install -c bioconda snakemake
mamba install -c conda-forge biopython
mamba install -c anaconda pandas
```
Some tools will still have to be installed manually, outside of conda. Depending on which part of the workflow is being used, further programs may need to be installed and must be accessible through `$PATH`; the main one is `gatk`.

Other necessary libraries have their packages and versions recorded in `workflow/envs`. These will be installed automatically in their own conda environments when running this Snakemake pipeline.
Configuration
Configuration settings are defined in a `.py` file inside the `config` directory. By default, the file `template.py` is active. It can be copied and renamed to create multiple config files, which is convenient for testing different datasets. The config file used for a particular run, however, is set in `workflow/Snakefile`, on the line that reads `from template import __dict__ as config`. For example, say we create a new config `rhesus.py`. This file would then be imported by changing `template` in that line to `rhesus`.
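After that change, the import line in `workflow/Snakefile` would read:

```python
# workflow/Snakefile — switch the active config from template.py to rhesus.py
from rhesus import __dict__ as config
```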
The initial files that must be added to the config file for the first steps of variant calling are:

- Species reference genome
- Raw FASTQ files (ending with `R1.fastq.gz` and `R2.fastq.gz`)

Subsequent steps require their own additional user-created files, which can all be specified in the config file. Also, be sure to set the value of `path` for where the `resources/` and `results/` directories should go.
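As a rough illustration, a copied config might start like the sketch below. Only `path` and the `results`/`samples` keys are referenced elsewhere in this README; the other names and all values are assumptions, so check `template.py` for the real ones.

```python
# Hypothetical excerpt from config/rhesus.py (a copy of template.py).
path = "/scratch/rhesus/"                             # parent of resources/ and results/
results = path + "results/"                           # referenced as config["results"] in rules
samples = path + "resources/samples/sample_list.txt"  # referenced as config["samples"] in rules
ref_fna = path + "resources/ref_fna/genome.fna.gz"    # species reference genome (illustrative key)
```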
Setting Target Files
In addition to setting input files and variables inside the config file, target outputs (the files that we want to generate) must also be listed there, inside the variable `target_files`. When we run, Snakemake will create the files (as well as any intermediates) based on what is listed there.

Writing the output files requires finding the rule in `workflow/rules/` that produces the wanted files and copying its `output`. For instance, if I want to make BAM files, I can find the corresponding `rule align`, which is in `variant_calling.smk`, and find:

```
output: alignment = config["results"] + "alignments/raw/{sample}.bam",
```
Now, place that path inside the list of target files back in our `Snakefile`, making sure to replace `config["results"]` with just `results`. Also replace `{sample}` with the actual sample name that we are interested in.

```
target_files = [
    results + "alignments/raw/WGS12345.bam",
]
```
Often, many samples will be worked on at the same time. To facilitate this, we could just add more to the target list like so:

```
target_files = [
    results + "alignments/raw/WGS12345.bam",
    results + "alignments/raw/WGS23456.bam",
]
```
A more efficient way, though, is to utilize the `expand` function:

```
target_files = [
    expand(results + "alignments/raw/{sample}.bam", sample=[12345, 23456]),
]
```
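To make the expansion concrete, here is a minimal sketch of what `expand` produces; the resulting list is equivalent to writing both paths by hand:

```python
# expand() substitutes every value of the keyword wildcard into the pattern:
expand(results + "alignments/raw/{sample}.bam", sample=[12345, 23456])
# -> [results + "alignments/raw/12345.bam",
#     results + "alignments/raw/23456.bam"]
```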
A couple of convenience variables are preset: `SAMPLE_NAMES` and `CHROMOSOMES`. `SAMPLE_NAMES` stores all of the sample names listed in `config["samples"]`, and `CHROMOSOMES` stores all numbered chromosomes, plus X, Y, and MT if they are in the reference genome.

```
target_files = [
    expand(results + "alignments/raw/{sample}.bam", sample=SAMPLE_NAMES),
]
```
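The two variables can also be combined in a single `expand` call, which generates one target per sample per chromosome. The path below is hypothetical and only illustrates that each keyword fills its own wildcard:

```python
# Illustrative path, not taken from the workflow's rules:
expand(results + "genotypes/{sample}.chr{chr}.vcf.gz",
       sample=SAMPLE_NAMES, chr=CHROMOSOMES)
```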
As another example, to create genotyped VCFs for each chromosome, we can look at `rule genotype_passing`:

```
output: config["results"] + "genotypes/pass/{dataset}.{mode}.chr{chr}.vcf.gz",
```

For our config file, we could then have the following. `{dataset}` can essentially be named anything; just be consistent. Modifying this name is helpful when running multiple parallel analyses with tweaked parameters. Currently, this workflow only works with `{mode}` set to SNP; the `mode` wildcard is for future work with indels, or indels and SNPs together.

```
target_files = [
    expand(results + "genotypes/pass/rhesus.SNP.chr{chr}.vcf.gz", chr=CHROMOSOMES),
]
```
Running
Running `snakemake` directly within the project's main directory on the command line will initiate the program by calling `workflow/Snakefile`. To perform a "dry run", use the command `snakemake -n`. This checks that the rules are functional without actually processing any data, which is helpful for quickly catching obvious problems with new rules. If an upstream file has been modified more recently than a downstream one, Snakemake will redo the pipeline from the modified file onward; if the modification was accidental, this deletes the downstream files and reruns the rules in between. So it is good practice to perform the dry run before each actual run to make sure important files won't be deleted. If a file was accidentally modified and the downstream files should be kept, the Unix command `touch -d` can be used to set the file's timestamp back to an earlier time. For example, setting `file.txt` to a timestamp of October 3rd could look like `touch -d '3 Oct' file.txt`.
To find the reason Snakemake wants to rerun files, run `snakemake -n -r`. One of the proposed jobs should say "reason: Updated input files: _________"; those file(s) are the source of the problem.
On SGE Cluster
To actually run the program on a Sun Grid Engine cluster, the following command can be issued from the project directory:

```
NAME=smk_variant; LOG=log/dirname; nohup snakemake --profile profile --cluster "qsub -V -b n -cwd -pe smp {threads} -N $NAME -o $NAME.out.log -e $NAME.err.log" > $LOG/$NAME.smk.log 2>&1
```
Most of the Snakemake settings are stored in `profile/config.yaml`, which is read via the `--profile profile` option. The values there can be modified to adjust the maximum number of jobs and other behavior. The `--cluster` option sets the options passed to the cluster; these can be left as is. However, do set the variables `NAME` and `LOG`. `NAME` will be the name given to the cluster's queue entry, viewable with the `qstat` command, and is also used for the log files. `LOG` is the directory in which the log files will be stored; this can be changed as needed to help organize logs.
Locally
Ideally, use the cluster configuration above. Otherwise, to run locally, use `nohup snakemake --use-conda -c2`. The integer in `-c2` is the number of cores to use. Most rules list the number of threads they use, which can guide this value, but unlike the cluster method above it must be set manually. Higher values (as long as the system has the cores) lead to faster processing. This command will leave the error log inside a file in the current directory labelled `nohup.out`.
Log Files
In addition to generating the files given in `target_files`, the cluster command above will also generate log files. Check these when errors occur. There are three logs, ending with the following patterns:

- `.smk.log` tells which jobs have been submitted, which files each is trying to make, and when jobs have completed.
- `.err.log` gives details about the jobs themselves. When submitting more than one job at once, this can become cluttered with interleaved messages from different jobs. To better find an error message for a specific job, it may be necessary to run only a single job at a time, which can be done by setting `-j 1`.
- `.out.log` prints output messages from the jobs. Once again, these will be interleaved, but in general there is less information in this file and it is often less important than the error log.

Make sure to create the directories in which the log files will be placed prior to running, to avoid errors.
Common Errors
Error: Directory cannot be locked

- All that is required is to add `--unlock` to the command from earlier, as below. Then, once that finishes, run again without the `--unlock` option.

```
NAME=smk_variant; LOG=log/dirname; nohup snakemake --use-conda --cluster "qsub -V -b n -cwd -pe smp {threads} -N $NAME -o $NAME.out.log -e $NAME.err.log" -j 20 --unlock > $LOG/$NAME.smk.log 2>&1
```
MissingOutputException

- Usually the run can simply be restarted and will work; the error is likely due to filesystem latency (Snakemake's standard `--latency-wait` option can give the filesystem more time). Otherwise, if the same file continues to cause this problem, the files the rule actually writes (if any) do not match the file(s) listed under the rule's `output`.
JSONDecodeError

- Removing the `.snakemake/metadata` directory (which holds JSON files), e.g. with `rm -r .snakemake/metadata`, appears to fix this. The issue is documented here: https://github.com/snakemake/snakemake/issues/1342. Newer versions of `snakemake` may not have this problem.
Sometimes, submitted Snakemake commands will persist even after their jobs have stopped running. To deal with this, occasionally check running processes with `ps -ax | grep variant-analysis`. The first column of each row shows the process ID; a "zombie process" with ID `12345` can be cleared with `kill -9 12345`.
Code Snippets
```
shell: """
    awk -v START=0 'BEGIN {{PREV_START=START}} {{$4=(($2-PREV_START)*$4*0.000001)+PREV_CM; printf "%s %s %s %0.9f\n", $1, $2, $3, $4; PREV_START=$2; PREV_CM=$4}}' {input.recomb_rates_chrom} \
    > {output.genetic_map}
    """
```

```
shell: """
    liftOver {input.recomb_rates} {input.chain} {output.remapped} {output.unmapped}; \
    """
```

```
shell: """
    rfmix \
        -f {input.query_VCF} \
        -r {input.ref_VCF} \
        -m {input.sample_map} \
        -g {input.genetic_map} \
        -o {params.output_base} \
        --chromosome={wildcards.chr} \
        -e 3 \
        --reanalyze-reference \
    """
```

```
shell: """
    FILES=({input}); \
    head -n 2 ${{FILES[0]}} > {output.concat}; \
    for FILE in "${{FILES[@]}}"; do \
        sed '1d;2d' $FILE >> {output.concat}; \
    done \
    """
```

```
shell: """
    bcftools view {input.joint_vcf} \
        -s {wildcards.sample} \
        -Oz \
        -o {output.indiv_vcf} \
    """
```

```
shell: """
    bcftools annotate {input.vcf} \
        -x INFO,FORMAT \
    | bcftools view \
        -s {wildcards.sample1},{wildcards.sample2} \
    | bcftools view \
        -i 'F_MISSING=0' -H \
    | cut -f 10-11 \
    | sed 's/|/\//g' \
    | sed 's/1\/0/0\/1/g' \
    | awk -v OFS='\t' \
        'BEGIN {{print "SAMPLE1","SAMPLE2","MATCHES","MISMATCHES","CONCORDANCE"}} \
        {{if ($1 == $2) MATCH+=1; else MISMATCH+=1}} \
        END {{CONCORDANCE = MATCH/(MATCH + MISMATCH); print "{wildcards.sample1}","{wildcards.sample2}",MATCH,MISMATCH,CONCORDANCE}}' \
    > {output.concordance} \
    """
```

```
shell: """
    cat {input[0]} | head -n 1 > {output}; \
    for FILE in {input}; do \
        cat $FILE | sed '1d' | sed 's/\t\t/\t0\t/g' >> {output}; \
    done \
    """
```

```
shell: """
    bcftools view {input.vcf} \
        -s {wildcards.sample} \
    | bcftools view \
        -i "F_MISSING=0" -H \
    | wc -l \
    > {output.called_count} \
    """
```

```
shell: """
    cat {input.called_counts} > {output.combined_counts}; \
    paste {config[samples]} {output.combined_counts} > {output.tsv} \
    """
```

```
shell: "bedtools genomecov \
    -ibam {input.bam} \
    -bg > {output.counts}"
```

```
shell: "bedtools unionbedg \
    -i {input} \
    > {output}"
```

```
shell: """
    awk -v 'OFS=\t' '{{for (i=4; i<=NF; i++) {{if ($i > 0) count += 1}}; if (count/(NF - 3) > {params.cutoff}) print $1,$2,$3; count = 0}}' {input} |
    grep '^NC' |
    bedtools merge -d 1 > {output}"""
```

```
shell: """
    bedtools intersect \
        -a {input.target} \
        -b {input.ref_bed} \
    | samtools view \
    | wc -l \
    > {output}; \
    samtools view {input.target} \
    | wc -l \
    >> {output} \
    """
```

```
shell: """
    for i in {{1..2}}; \
    do awk -v i=$i 'FNR==i {{ print }}' {input.intersections} | awk '{{ sum+=$1 }} END {{ print sum }}'; \
    done \
    > {output.total} \
    """
```

```
shell: """
    bcftools view {input.vcf} \
        -R {input.bed} \
        -S {input.subset} \
        --min-ac 3 \
        -Oz \
        -o {output.vcf} \
    """
```
```
run:
    import allel
    import pandas as pd

    # Read sample names and genotype calls from the VCF
    callset = allel.read_vcf(input.vcf, ['samples', 'calldata/GT'])
    g = allel.GenotypeArray(callset['calldata/GT'])
    # Fraction of non-missing genotype calls per sample
    nonmissing_GT = g.count_called(axis=0)/len(g)
    data = {
        "sample": callset['samples'],
        "nonmissing_GT": nonmissing_GT,
    }
    # output is a Snakemake Namedlist, so index its first (only) file
    with open(output[0], 'w') as f:
        f.write("# Based on the file: " + input.vcf + "\n")
    df = pd.DataFrame(data).set_index('sample')
    df.to_csv(output[0], mode="a")
```
```
shell: """
    mosdepth {params.prefix} {input.bam} \
        --by {params.window_size} \
        -n \
        --fast-mode \
        -t {threads} \
    """
```

```
shell: """
    for BED in {input.bed}; do \
        SAMPLE=$(echo $BED | cut -d "." -f 1 | rev | cut -d "/" -f 1 | rev); \
        gunzip -c $BED \
        | awk -v SAMPLE=$SAMPLE 'BEGIN {{OFS="\\t"}} {{print SAMPLE, $0}}' \
        >> {output.merged_bed}; \
    done \
    """
```
```
shell: """echo '''
Information about where these files fall in the workflow can be seen under the `.smk` files under `workflow/rules`.
The directories for variant calling from `variant_calling.smk` and `phasing.smk` are created in the following order:

├─ alignments                   # Duplicate sequences are marked and sorted
|  |
|  ├─ raw                       # .bam files after .fastq files aligned to reference
|  |
|  ├─ markdup                   # .bam files after fixing mate pairs and marking duplicate reads
|  |
|  └─ alignments_recalibrated   # BQSR tables applied to sequences
|     |
|     └─ recal_tables           # BQSR recalibration tables that are applied to sequences
|
├─ gvcf                         # Called variants as .g.vcf files
|
├─ db                           # GenomicsDB datastore consolidating .g.vcf files. Not human-readable
|
├─ joint_call                   # db data converted to a joint .vcf file with VQSR applied
|
├─ vcf                          # Multisample VCF
|  |
|  ├─ hard_filtered             # (Option 1) Filters by hard values
|  |  |
|  |  ├─ filter_applied         # FILTER column filled, including those variants that didn't pass
|  |  |
|  |  └─ pass_only              # Only variants with FILTER=PASS
|  |
|  └─ VQSR                      # (Option 2) Variant recalibration, filters by using truth and training data
|     |
|     ├─ model                  # Models to be applied to VCFs
|     |
|     ├─ filter_applied         # FILTER column filled, including those variants that didn't pass
|     |
|     └─ pass_only              # Only variants with FILTER=PASS
|
├─* plink                       # Contains plink files
|
├─ genotypes
|  |
|  ├─ posteriors                # Calculate PP tag (posterior probabilities) and recalculate MQ tag
|  |
|  ├─ filtered                  # Set to missing genotypes with MQ < 20
|  |
|  └─ subsets                   # Taking subsets of samples based on sequencing methods: GBS, WES, WGS
|
└─ haplotypes                   # Phased/imputed VCFs
   |
   ├─ scaffolds                 # Preliminary VCFs of genotypes incorporating pedigree information
   |
   └─ SHAPEIT4                  # Phased/imputed VCFs using scaffold and original VCF

The directories for relations from `relations.smk` are as follows. `plink` is created first.
The others do not follow an order. However, these require the joint vcf file from `joint_call`:

├─ plink                        # Contains plink files
|
├─ relatedness                  # Pairwise estimates of relatedness between samples
|
├─ admixture                    # Finds rates of admixture among samples
|  |
|  ├─ supervised                # Uses additional samples of known origin
|  |
|  └─ unsupervised              # No known origins
|
└─ aims                         # Ancestry informative markers

Directories for determining kinship using LASAR and SEEKIN:

└─ kinship                      # Allele frequencies and pairwise relatedness

Other:

└─ structural_variants          # Structural variants
''' > {output}
"""
```
```
shell: """echo '''
Recommended layout for resources:

├─ barcodes   # Table of barcodes for samples (used for GBS and AMP reads)
|
├─ reads      # .fastq.gz file for each sample (R1 and R2)
|
├─ ref_fna    # Reference genome
|
├─ ref_vcf    # Reference .vcf.gz files
|
└─ samples    # List of samples
''' > {output}
"""
```

```
script: "../scripts/fst.py"
```

```
shell: """
    vcftools \
        --bcf {input.bcf} \
        --weir-fst-pop {input.pop1} \
        --weir-fst-pop {input.pop2} \
        --out {params.out_prefix} \
        --fst-window-size {params.size} \
        --fst-window-step {params.step} \
    """
```

```
shell: """
    gatk --java-options '-Xmx8g' VariantFiltration \
        -V {input.vcf} \
        -O {output.filtered} \
        {params.filters}
    """
```

```
shell: """
    bcftools view {input.vcf} \
        -f PASS \
        -Oz \
        -o {output.vcf}"""
```
```
shell: """
    bcftools concat {input.vcfs} \
        --allow-overlaps \
        -Oz \
        -o {output} \
        --threads {threads}"""
```
```
shell: """
    bcftools view {input.vcf} \
        -S {input.samples_subset} \
        -Ou \
    | bcftools reheader \
        -s {input.sample_map} \
    | bcftools view \
        --min-ac=1 \
        -Ou \
    | bcftools norm \
        -m+any \
        -Oz \
        -o {output}"""
```

```
shell: """
    bcftools view {input.vcf} \
        -S {input.samples_subset} \
        -Ou \
    | bcftools norm \
        -m+any \
        -Ov \
    | vcftools \
        --vcf - \
        --max-missing 0.7 \
        -c \
        --recode --recode-INFO-all \
    | bgzip > {output}"""
```

```
shell: """
    gunzip -c {input.ref_fna} \
    | grep ">" \
    | cut -d ":" -f 4,6 \
    | sed 's/:/\t/' \
    > {output.contig_lengths} \
    """
```

```
shell: """
    python3 workflow/scripts/runs_of_homozygosity.py {input.vcf} {input.contig_lengths} {output.roh_pickle} {output.froh_pickle} {wildcards.chr} \
    """
```

```
shell: """
    makeScaffold \
        --gen {input.vcf} \
        --fam {input.fam} \
        --reg {wildcards.chr} \
        --out {output.scaffold} \
    """
```

```
shell: """
    bcftools annotate {input.vcf} \
        -a {input.phased} \
        -c FORMAT/GT \
        -o {output.annotated} \
        -Oz \
    """
```

```
shell: """
    sed 's/\\t\\t/\\tNA\\t/g;s/\\t$/\\tNA/g' {input.tsv} \
    | grep -v NA$'\\t'NA \
    > {output.fam} \
    """
```

```
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
```

```
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --reference {input.ref} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
```

```
shell: """
    java -Xmx50g -jar {params.jar} \
        gt={input.vcf} \
        out={output.out} \
        nthreads={threads} \
    """
```

```
shell: """
    SHAPEIT5_phase_common \
        --input {input.vcf} \
        --pedigree {input.fam} \
        --region {wildcards.chr} \
        --output {output.phased} \
        --log {log} \
        --thread {threads} \
    """
```

```
shell: """
    GLIMPSE2_chunk \
        --input {input.vcf} \
        --region {wildcards.chr} \
        --sequential \
        --output {output.chunks} \
        --log {log} \
        --threads {threads} \
    """
```

```
shell: """
    ORG=$(grep {wildcards.chr}:{wildcards.start}-{wildcards.end} {input.chunks} | cut -f 4); \
    GLIMPSE2_split_reference \
        --reference {input.vcf} \
        --input-region {wildcards.chr}:{wildcards.start}-{wildcards.end} \
        --output-region $ORG \
        --output {params.out_prefix} \
        --threads {threads} \
        --log {log.log}; \
    """
```

```
shell: """
    for SAMPLE in $(bcftools query -l {input.vcf}); do \
        echo {params.prefix}${{SAMPLE}}{params.suffix}; \
    done > {output.bam_list} \
    """
```

```
shell: """
    GLIMPSE2_phase \
        --bam-list {input.bam_list} \
        --reference {input.ref} \
        --output {output.vcf} \
        --threads {threads} \
    """
```

```
run:
    with open(output.txt, "w") as f:
        for line in input.chunk_files:
            f.write(line + "\n")
```

```
shell: """
    GLIMPSE2_ligate \
        --input {input.txt} \
        --output {output.vcf} \
        --threads {threads} \
    """
```
```
shell: "bwa index {input}"
```

```
shell: "samtools faidx {input}"
```

```
shell: "gatk CreateSequenceDictionary \
    -R {input.fasta}"
```

```
shell: "samtools index {input}"
```

```
shell: "bcftools index {input} \
    --tbi"
```

```
shell: "bcftools index {input} \
    --csi"
```

```
shell: """
    bcftools concat {input.vcfs} \
        -o {output.vcf} \
        -Oz \
    """
```

```
shell: """
    gatk --java-options "-Xmx8g" CalculateGenotypePosteriors \
        -V {input.vcf} \
        -O {output} \
        --tmp-dir ~/tmp/{rule}"""
```

```
shell: """
    gatk --java-options "-Xmx8g" VariantFiltration \
        -V {input.vcf} \
        --genotype-filter-expression "GQ < {params.GQ}" \
        --genotype-filter-name "GQ{params.GQ}" \
        --set-filtered-genotype-to-no-call \
        -O {output.vcf}"""
```

```
shell: """
    bcftools view {input.vcf} \
        -S {input.samples} \
        -Ou \
    | bcftools view \
        --min-af {params.min_AF} \
        --max-af {params.max_AF} \
        --min-ac {params.min_AC} \
        -Oz \
        -o {output.vcf} \
    """
```

```
shell: """
    bcftools query -l {input.vcf} | grep WGS > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -Oz \
        -o {output.vcf} \
    """
```

```
shell: """
    bcftools query -l {input.vcf} | grep WES > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -R {input.bed} \
        -Oz \
        -o {output.vcf} \
    """
```

```
shell: """
    sed 's/./&\t/3' {input.samples} \
    | awk -v OFS='\t' '{{print $2,$1}}' \
    | sort \
    | awk '$1!=p{{if(p)print s; p=$1; s=$0; next}}{{sub(p,x); s=s $0}} END {{print s}}' \
    | awk -v OFS='' '{{print $NF,$1}}' \
    | grep -v "_" \
    | sort > {output.largest_samples} \
    """
```

```
shell: """
    sed 's/AMP/AMP\t/g;s/GBS/GBS\t/g; s/WES/WES\t/g; s/WGS/WGS\t/g' {input.samples} \
    | sort -k 2 \
    | join - {input.parents} -1 2 -2 1 \
    | awk '{{print $2$1"\t"$3"\t"$4}}' \
    > {output.parents} \
    """
```

```
run:
    samples = None
    with open(input.samples, "r") as f:
        samples = f.read().strip().split("\n")
    with open(input.parents, "r") as f:
        with open(output.tsv, "a") as out:
            for line in f:
                child, sire, dam = line.strip("\n").split("\t")
                seq_type = child[:3]
                if f"WGS{sire}" in samples:
                    sire = f"WGS{sire}"
                elif f"{seq_type}{sire}" in samples:
                    sire = f"{seq_type}{sire}"
                else:
                    sire = ""
                if f"WGS{dam}" in samples:
                    dam = f"WGS{dam}"
                elif f"{seq_type}{dam}" in samples:
                    dam = f"{seq_type}{dam}"
                else:
                    dam = ""
                out.write(f"{child}\t{sire}\t{dam}\n")
```
```
shell: """
    awk 'BEGIN {{OFS="\t"}} {{print 0,$1,$2,$3,0,0}}' {input.tsv} \
    | sed 's/\t\t/\t0\t/g' \
    | sed 's/\t\t/\t0\t/g' \
    > {output.ped} \
    """
```

```
shell: """
    grep -E "(WGS.*|WES.*|GBS.*|AMP.*){{3}}" {input.tsv} \
    > {output.tsv} \
    """
```

```
shell: """
    read -r CHILD SIRE DAM <<< $(grep ^{wildcards.sample} {input.tsv}); \
    if [ -n "$SIRE" ]; \
        then SIRE=",$SIRE"; \
    fi; \
    if [ -n "$DAM" ]; \
        then DAM=",$DAM"; \
    fi; \
    bcftools view {input.vcf} \
        -s {wildcards.sample}$SIRE$DAM \
        -Oz \
        -o {output.trio} \
    """
```

```
shell: """
    bcftools annotate -x INFO,^FORMAT/GT {input.trio} \
    | bcftools view -H -i "F_MISSING==0" \
    | awk '{{print $10,$11,$12}}' \
    | python3 {params.script} \
    | sort \
    | uniq -c > {output.counts} \
    """
```

```
shell: """
    whatshap phase {input.vcf} {params.child_bam} {params.sire_bam} {params.dam_bam} \
        --reference={input.ref_fasta} \
        --ped={input.ped} \
        --tag=PS \
        -o {output.phased} \
    """
```

```
shell: """
    SAMPLE=$(bcftools query -l {input.trio} | head -n 1); \
    bcftools view {input.trio} \
        -s $SAMPLE \
        -Ou \
        -o {output.indiv} \
    """
```

```
shell: """
    bcftools merge {input.trios} \
        -Oz \
        -o {output.merged} \
    """
```

```
shell: """
    bcftools query -l {input.vcf} | grep {wildcards.seq} > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -Oz \
        -o {output.vcf} \
    """
```

```
shell: """
    bcftools query -l {input.vcf} | grep {wildcards.seq} > {output.samples};
    bcftools view {input.vcf} \
        -S {output.samples} \
        -R {input.bed} \
        -Oz \
        -o {output.vcf} \
    """
```

```
shell: """
    bcftools concat {input.vcfs} \
        -o {output} \
        -Oz \
    """
```

```
shell: """
    bcftools concat {input.vcfs} \
        -o {output} \
        -Oz \
    """
```

```
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset} \
    """
```

```
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.same_fid \
    """
```

```
shell: """
    king \
        -b {input.bed} \
        --kinship \
        --prefix {params.prefix} \
    """
```

```
shell: """
    plink \
        --vcf {input.vcf} \
        --update-parents {input.parents_table} \
        --update-sex {input.sex_table} \
        --recode12 \
        --make-bed \
        --out {params.out_path}/{wildcards.dataset}.chr{wildcards.chr} \
    """
```

```
shell: """
    plink \
        --bfile {params.path}/{wildcards.dataset}.chr{wildcards.chr} \
        --update-ids {input.update_fids} \
        --make-bed \
        --out {params.path}/{wildcards.dataset}.chr{wildcards.chr}.same_fid \
    """
```

```
shell: """
    king \
        -b {input.bed} \
        --kinship \
        --prefix {params.prefix} \
    """
```

```
shell: """
    vcftools \
        --gzvcf {input.vcf} \
        --het \
        --stdout \
    > {output.het} \
    """
```

```
shell: "lcmlkin \
    -i {input.vcf} \
    -o {output} \
    -g all \
    -l phred \
    -u {input.founders} \
    -t {threads}"
```

```
script: "../scripts/fst.py"
```

```
script: "../scripts/diversity.py"
```
```
shell: "plink \
    --file " + config["results"] + "plink/{wildcards.dataset} \
    --out " + config["results"] + "plink/recode12/{wildcards.dataset}_recode12 \
    --recode12"
'''
```
```
shell: "admixture {input.ped} {wildcards.clusters} \
    -j{threads} \
    --cv | tee {output.out}; \
    mv {wildcards.dataset}_recode12.{wildcards.clusters}.Q {output.q}; \
    mv {wildcards.dataset}_recode12.{wildcards.clusters}.P {output.p}"
```

```
shell: "echo -e 'sample_id\tCluster_1\tCluster_2' > {output}; \
    TMP=delimited.tmp; \
    sed 's/ /\t/g' {input.q} > $TMP; \
    cut {input.ped} -d ' ' -f 2 \
    | paste - $TMP >> {output}; \
    rm $TMP;"
```

```
shell: "plink \
    --vcf {input.vcf} \
    --recode12 \
    --make-bed \
    --out {config[results]}admixture/supervised/plink/{wildcards.sample}"
```

```
shell: "admixture {input.bed} {wildcards.clusters} \
    --supervised \
    -j{threads}; \
    mv {wildcards.dataset}.{wildcards.clusters}.Q {output.q}; \
    mv {wildcards.dataset}.{wildcards.clusters}.P {output.p}"
```

```
shell: "plink \
    --bfile " + config["results"] + "admixture/supervised/plink/{wildcards.dataset} \
    --recode \
    --out " + config["results"] + "admixture/supervised/plink/{wildcards.dataset}"
```

```
shell: "echo -e 'Sample_id\tChinese\tIndian' > {output}; \
    TMP=delimited.tmp; \
    TMP2=pasted.tmp; \
    sed 's/ /\t/g' {input.q} > $TMP; \
    cut {input.ped} -d ' ' -f 2 \
    | paste - $TMP >> $TMP2; \
    tail -n 674 $TMP2 >> {output}; \
    rm $TMP; rm $TMP2"
```

```
run:
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    data = pd.read_table(str(input))
    sns.set_theme()
    ax = sns.histplot(data=data, bins=20, multiple="stack")
    ax.set(title="Ancestral Admixture in SNPRC Rhesus Macaques",
           xlabel="Ancestry ratio",
           xlim=(0.5, 1),
           ylabel="Number of individuals")
    plt.savefig(str(output))
```

```
shell: "awk '{{print $1,$4}}' {input.map} \
    | paste - {input.p} \
    | awk '$4 - $3 > {params.max_diff} || $3 - $4 > {params.max_diff} {{print $0}}' - \
    > {output}"
```

```
shell: """
    delly call {input.bam} \
        -g {input.ref_fasta} \
        > {output.bcf} \
    """
```

```
shell: """
    delly merge {input.bcfs} \
        -o {output.bcf} \
    """
```

```
shell: """
    delly call {input.bam} \
        -g {input.ref_fasta} \
        -v {input.bcf} \
        -o {output.bcf} \
    """
```

```
shell: """
    bcftools merge {input.bcfs} \
        -m id \
        -Ob \
        -o {output.bcf} \
    """
```

```
shell: """
    delly filter {input.bcf} \
        -f germline \
        -o {output.bcf} \
    """
```

```
shell: """
    bcftools view {input.bcf} \
        -i "ALT='<{wildcards.SV_type}>'" \
        -Ob \
        -o {output.bcf} \
    """
```

```
shell: """
    dicey chop {input.ref_fasta} \
        --fq1 {params.R1} \
        --fq2 {params.R2} \
    """
```

```
shell: """
    bwa mem {input.ref_fasta} {input.R1} {input.R2} \
        -t {params.bwa_threads} \
    | samtools sort \
        -@ {params.samtools_threads} \
        -o {output.bam} \
    """
```

```
shell: """
    dicey mappability2 {input.bam} \
        -o {output.map}; \
    gunzip {output.map} && bgzip {params.map} \
    """
```

```
shell: """
    delly cnv {input.bam} \
        -g {input.ref_fasta} \
        -l {input.map} \
        -o {output.bcf} \
    """
```

```
shell: """
    delly merge {input.bcfs} \
        --cnvmode \
        --pass \
        --minsize 1000 \
        --maxsize 100000 \
        -o {output.bcf} \
    """
```

```
shell: """
    delly cnv {input.bam} \
        --segmentation \
        -v {input.bcf} \
        -g {input.ref_fasta} \
        -m {input.map} \
        -o {output.bcf} \
    """
```

```
shell: """
    bcftools merge {input.bcfs} \
        -m id \
        -Ob \
        -o {output.bcf} \
    """
```

```
shell: """
    delly classify {input.bcf} \
        --filter germline \
        -o {output.bcf} \
    """
```
```
shell: """
    # ADAPTERS=$(gunzip -c {input.reads[0]} | head -n 1 | cut -d " " -f 2 | cut -d ":" -f 4);
    # i7=$(echo $ADAPTERS | cut -d "+" -f 1);
    # i5=$(echo $ADAPTERS | cut -d "+" -f 2);
    R1_END_ADAPTER="{params.pre_i7}";  # ${{i7}}{params.post_i7}";
    R2_END_ADAPTER="{params.pre_i5}";  # ${{i5}}{params.post_i5}";
    cutadapt {input.reads} \
        -a $R1_END_ADAPTER \
        -A $R2_END_ADAPTER \
        --cores {threads} \
        -o {output.trimmed[0]} \
        -p {output.trimmed[1]}"""
```

```
shell: """
    ADAPTERS=$(gunzip -c {input.reads[0]} | head -n 1 | cut -d " " -f 2 | cut -d ":" -f 4);
    R1_END_ADAPTER="{params.R1_adapter}";
    R2_END_ADAPTER="{params.barcode}{params.R2_adapter}";
    cutadapt {input.reads} \
        -a $R1_END_ADAPTER \
        -A $R2_END_ADAPTER \
        --cores {threads} \
        -o {output.trimmed[0]} \
        -p {output.trimmed[1]}"""
```

```
shell: """
    bash workflow/scripts/align.sh {input.trimmed[0]} {input.trimmed[1]} {input.ref} {output} {threads} {wildcards.sample}"""
```

```
shell: """
    samtools sort {input.alignment} \
        -n \
        -T ~/tmp/{rule} \
    | samtools fixmate - - \
        -m \
    | samtools sort - \
        -T ~/tmp/{rule} \
    | samtools markdup - {output.alignment} \
        -T ~/tmp/{rule} \
    """
```

```
shell: """
    samtools sort {input.alignment} \
        -n \
        -T ~/tmp/{rule} \
    | samtools fixmate - - \
        -m \
    | samtools sort - \
        -T ~/tmp/{rule} \
        -o {output.alignment} \
    """
```

```
shell: """
    gatk --java-options '-Xmx8g' BaseRecalibrator \
        -R {input.ref} \
        -I {input.bam} \
        --known-sites {input.known_variants} \
        -O {output} \
        --tmp-dir ~/tmp/{rule}"""
```

```
shell: """
    gatk --java-options '-Xmx8g' ApplyBQSR \
        -R {input.ref} \
        -I {input.bam} \
        --bqsr-recal-file {input.recal} \
        -O {output} \
        --tmp-dir ~/tmp/{rule} \
        --create-output-bam-index false"""
```

```
shell: """
    gatk --java-options '-Xmx16g' HaplotypeCaller \
        -R {input.ref} \
        -I {input.bam} \
        -O {output.vcf} \
        -ERC GVCF \
        --native-pair-hmm-threads {threads} \
        --tmp-dir ~/tmp/{rule}"""
```

```
run:
    with open(output.sample_map, "w") as sample_map:
        for gvcf in input.gvcfs:
            sample = gvcf.split("/")[-1].split(".")[0]
            sample_map.write(f"{sample}\t{gvcf}\n")
```

```
shell: """
    zcat {input.ref} | grep "^>" | cut -c 2- | cut -d " " -f 1 | grep -x -E "^[0-9]+|X|Y|MT" > {output}
    """
```

```
shell: """
    CONTIGS=$(awk 'BEGIN {{ORS = ","}} {{print $0}}' {input.contigs}); \
    if [ -d {params.db} ]; \
        then WORKSPACE_FLAG="genomicsdb-update-workspace-path"; \
        else WORKSPACE_FLAG="genomicsdb-workspace-path"; \
    fi; \
    gatk --java-options '-Xmx8g' GenomicsDBImport \
        --$WORKSPACE_FLAG {params.db} \
        --intervals {input.contigs} \
        --sample-name-map {input.sample_map} \
        --batch-size 50 \
        --genomicsdb-shared-posixfs-optimizations true \
        --reader-threads {threads} \
        --max-num-intervals-to-import-in-parallel {params.parallel_intervals} \
    && touch {output}
    """
```

```
shell: """
    gatk --java-options '-Xmx16g' GenotypeGVCFs \
        -R {input.ref} \
        -V gendb://{params.db} \
        -O {output.vcf} \
        -L {wildcards.chr} \
    """
```

```
shell: """
    bcftools norm {input.vcf} \
        -m-any \
        --fasta-ref {input.ref_fasta} \
        -Ou \
    | bcftools view \
        -e'type{params.equality}"snp"' \
        -Ou \
    | bcftools norm \
        -m+any \
        -Ou \
    | bcftools view \
        -M2 \
        -m2 \
        -Oz \
        -o {output.split} \
    """
```

```
# workflow/scripts/align.sh
READ1=$1
READ2=$2
REF=$3
OUTPUT=$4
THREADS=$5
SAMPLE=$6

echo "In .sh file";

ID=$(gunzip -c "$READ1" | head -n 1 | cut -c 2- | awk -v FS=':' '{print "ID:"$1"."$2}');
echo $ID;
echo post_id;
PU=$(gunzip -c "$READ1" | head -n 1 | cut -c 2- | awk -v FS=':' '{print "PU:"$3"."$4}');
echo $PU;
echo post_pu;
echo "$SAMPLE"
echo '@RG\tSM:'$SAMPLE'\tLB:'$SAMPLE'\t'$ID'\t'$PU'.'$SAMPLE'\tPL:ILLUMINA'

bwa mem $REF $READ1 $READ2 \
    -t $THREADS \
    -R '@RG\tSM:'$SAMPLE'\tLB:'$SAMPLE'\t'$ID'\t'$PU'.'$SAMPLE'\tPL:ILLUMINA' \
    | samtools view -Sb - > $OUTPUT
```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 | import allel import numpy as np import pandas as pd # Variables taken from Snakemake rule vcf = snakemake.input.vcf chromosome_file = snakemake.output.chromosomes annotation_file = snakemake.output.annotations # Reference genome Mmul_8.0.1 chromosomes chromosomes = {"name": ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chrX"], "start": [1 for i in range(1, 22)], "stop": [225_584_828, 204_787_373, 185_818_997, 172_585_720, 190_429_646, 180_051_392, 169_600_520, 144_306_982, 129_882_849, 92_844_088, 133_663_169, 125_506_784, 108_979_918, 127_894_412, 111_343_173, 77_216_781, 95_684_472, 70_235_451, 53_671_032, 74_971_481, 149_150_640]} chromosomes_df = pd.DataFrame(data=chromosomes) chromosomes_df.to_csv(chromosome_file, header=False, index=False, sep="\t") # Current ids don't include any individuals with missing years. # Colony 1 subpop = ['17509', '17515', '17511', '17514', '17516', '17584', '17534', '17535', '17527', '17536', '17525', '17553', '17546', '17552', '17551', '17557', '8796', '17563', '17572', '17570', '17567', '9793', '17562', '18314', '17578', '17581', '17582', '17521', '17575', '17588', '17583', '17585', '17740', '10244', '17590', '19325', '17591', '17597', '19323', '17606', '12908', '17640', '17620', '17639', '17619', '17626', '17624', '17615', '17622', '17629', '16340', '16345', '16346', '16347', '16349', '16350', '16351', '16353', '16355', '16356', '16358', '16639', '16640', '16641', '16643', '16644', '16647', '16649', '16653', '16655', '16656', '16658', '16342', '16344', '16348', '16354', '17657', '16657', '16645', '16646', '16650', '16652', '17674', '17663', '17661', '17647', '17679', '17678', '17656', '17642', '17660', '17641', '14970', '17666', '17654', '17658', '17659', '17662', '17645', '16339', '17701', '17684', '17710', '17680', '17723', '17732', '17733', '17730', '17714', '17717', '17745', '18400', '18408', '17133', '17724', '18416', '18401', '17737', '17736', '19223', '19224', '19150', '19177', '19312', '18987', '19315', '18419', '18994', '18421', '19316', '18990', '19314', '18983', '18985', '18422', '19699', '19702', '19852', '19857', '19853', '19846', '20064', '19689', '19695', '19845', '19691', '20077', '20082', '20078', '20080', '20083', '19851', '19856', '19688', '19700', '19318', '29495', '29501', '26837', '26858', '26353', '26935', '27010', '26957', '29453', '29461', '29454', '29455', '29458', '26786', '26875', '26955', '26899', '27128', '27441', '26515', '27037', '26695', '26177', '26813', '26232', '27056', '26992', '26732', '28035', '28324', '28282', '28303', '28403', '28356', '28092', '28138', '28302', '28215', '27646', '28168', '28195', '28210', '27783', '28196', '27682', '28109', '28110', '28066', '28240', '28189', '28144', '28060', '28071', '28310', '27886', '28829', '28796', '28898', '28935', '29141', '28927', '28937', '28794', '29576', '29372', '29429', '29445', '29555', '29532', '29364', '29362', '29560', '29437', '29436', '29507', '30618', '30142', '30009', '30125', '30118', '30019', '30204', '30024', '30122', '30128', '30139', '30285', '30003', '30034', '30086', '30739', '30649', '30506', '30513', '30633', '30732', '30644', '30663', '30742', '30736', '30735', '30643', '30876', '30881', 
'30651', '30956', '30638', '30755', '30565', '30855', '30505', '30654', '30610', '30641', '30848', '31310', '31204', '31209', '31225', '31163', '31127', '31414', '31357', '31160', '31397', '31236', '31143', '31229', '31154', '31351', '31137', '31144', '31161', '31194', '31147', '31808', '31798', '31776', '31809', '31802', '31810', '31833', '32351', '31834', '32111', '31715', '31691', '31832', '32022', '32217', '31806', '31814', '31885', '31762', '31781', '31982', '32017', '31783', '31840', '32767', '32642', '32578', '32594', '32593', '32609', '32872', '32592', '32764', '32914', '32693', '32789', '32571', '32542', '33055', '32694', '32639', '33202', '33229', '33397', '33326', '33221', '33177', '33442', '33312', '33171', '33576', '33231', '33166', '33325', '33313', '33302', '34746', '34921', '34353', '34561', '35034', '34717', '34901', '34986', '34696', '34722', '34362', '34223', '34186', '34723', '34764', '35422', '35955', '35325', '35351', '36041', '35454', '36089', '36033', '35949', '36063', '35444', '36066', '36032', '35964', '35443', '35948', '35947', '35958', '35975', '35956', '36806', '36671', '37218', '37005', '36793', '36992', '36931', '36906', '36894', '36878', '36881', '36862', '36875', '36672', '36805', '36932' ] # Colony 2 # subpop = ['33914', '33916', '33917', '33919', '33918', '33920', '33923', '33926', '33935', '33929', '33932', '33938', '33924', '33937', '33933', '33930', '33931', '33943', '34229', '34235', '34241', '34228', '33945', '34231', '34230', '33942', '33947', # '33952', '33951', '33949', '33948', '34257', '34245', '34246', '34252', '34256', '34243', '34244', '34249', '34247', '34250', '34254', '34273', '34272', '34274', '34266', '34271', '34261', '34260', '34265', '34258', '34270', '34264', '34259', # '34283', '34285', '34282', '34281', '34276', '34284', '33959', '34277', '33960', '34280', '34296', '34298', '34294', '34297', '34293', '34292', '33963', '34291', '34295', '34290', '34289', '34286', '34288', '34299', '34287', '33965', '33964', # '33990', '33997', '34007', '34002', '34004', '34305', '33979', '33982', '33984', '33993', '33998', '33992', '34008', '34009', '34006', '33971', '34311', '34309', '34308', '33980', '33981', '33968', '34301', '33967', '33976', '34319', '33987', '33978', '34001', '34005', '33999', '33991', '33995', '33986', '34003', '34000', '34320', '34313', '34312', '34310', '34306', '34304', '34321', '34316', '34307', '34318', '34314', '34322', '33970', '33994', '33988', '33973', '34300', '34302', '33966', '33975', '33972', '33989', '33977', '33983', '33996', '33985', '34303', '34315', '33969', '34023', '34018', '34342', '34024', '34021', '34010', '34343', '34349', '34346', '34326', '34329', '34328', '34014', '34019', '34020', '34344', '34331', '34332', '34341', '34012', '34011', '34013', '34017', '34352', '34333', '34334', '34015', # '35162', '35087', '35043', '34911', '35169', '34899', '34898', '35196', '35191', '35434', '35195', '35268', '35170', '34846', '35038', '34707', '34841', '34695', '34988', '34697', '34698', '34948', '34845', '34980', '34969', '36405', '35270', '35373', '35326', '35269', '35475', '36044', '36029', '35969', '35372', '35324', '36258', '36131', '35430', '36102', '36239', '35961', '36421', '35366', '36198', '36380', '35301', '35455', '36176', '36873', '36930', '36720', '36936', '36905', '36880', '36911', '37169', '36997', '36699', '36710', '36856', '36874', '36821', '37006', '37076', '36929', '36677', '37053' # ] # Pull genotype data from VCF. 
callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'samples', 'calldata/GT']) #sample_filter = np.isin(callset["samples"], subpop) sample_subpop_indices = [list(callset["samples"]).index(id) for id in subpop] # Create pandas dataframe for chromosome file (for chromoMap) header = ["id", "chrom", "win_start", "win_end", "diversity"] df = pd.DataFrame(columns=header) # Create pandas dataframe for annotation file (for chromoMap) header = ["id", "chrom", "win_start", "win_end", "diversity"] df = pd.DataFrame(columns=header) # Loop through each chromosome for chrom in sorted(list(set(callset['variants/CHROM']))): chrom_filter = np.equal(callset['variants/CHROM'], chrom) # Take genotypes from current chromosome genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_filter]) # Keep only genotypes from subpopulation genotypes = genotypes.take(sample_subpop_indices, axis=1) ac = genotypes.count_alleles() pos = callset["variants/POS"][chrom_filter] end = pos.max() pi, windows, n_bases, counts = allel.windowed_diversity(pos, ac, size=1_000_000, start=1, stop=end) data_dict = {"id": [f"chr{chrom}:{window[0]}" for window in windows], "chrom": [f"chr{chrom}"]*len(pi), "win_start": [window[0] for window in windows], "win_end": [window[1] for window in windows], "diversity": pi, } chr_df = pd.DataFrame.from_dict(data_dict) df = pd.concat([df, chr_df]) df = df.sort_values(["chrom", "win_start"]) df.to_csv(annotation_file, header=False, index=False, sep="\t") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 | import allel import numpy as np import snakemake as snek # Variables taken from Snakemake rule vcf = snakemake.input.vcf pickle = snakemake.output.pickle # Taken from config file. Needs an implementation to pull from there subpops_dict = { '1980-1995': ['10235', '10244', '10250', '10814', '11452', '11491', '12001', '12147', '12540', '12543', '12908', '12932', '14392', '14393', '14396', '17504', '17505', '17507', '17508', '17509', '17511', '17514', '17515', '17516', '17521', '17522', '17525', '17527', '17534', '17535', '17536', '17537', '17539', '17544', '17546', '17551', '17552', '17553', '17557', '17562', '17563', '17566', '17567', '17568', '17569', '17570', '17572', '17573', '17575', '17578', '17581', '17582', '17583', '17584', '17585', '17588', '17590', '17591', '17597', '17606', '17740', '18314', '19323', '19325', '7139', '7893', '8796', '8896', '9038', '9590', '9710', '9793', '9908'], '1996-2001': ['14970', '16339', '16340', '16342', '16344', '16345', '16346', '16347', '16348', '16349', '16350', '16351', '16353', '16354', '16355', '16356', '16358', '16639', '16640', '16641', '16643', '16644', '16645', '16646', '16647', '16649', '16650', '16652', '16653', '16655', '16656', '16657', '16658', '17133', '17615', '17619', '17620', '17622', '17624', '17626', '17629', '17639', '17640', '17641', '17642', '17645', '17647', '17654', '17656', '17657', '17658', '17659', '17660', '17661', '17662', '17663', '17666', '17674', '17678', '17679', '17680', '17684', '17701', '17710', '17714', '17717', '17723', '17724', '17730', '17732', '17733', '17745', '18400', '18408', '33914', '33916', '33917', '33918', '33919', '33920'], '2002-2005': ['17736', '17737', '18401', '18416', '18419', '18421', '18422', '18983', '18985', '18987', '18990', '18994', '19150', '19177', '19223', '19224', '19312', '19314', '19315', '19316', '19318', '19688', '19689', '19691', '19695', '19699', '19700', '19702', '19845', '19846', '19851', '19852', '19853', '19856', '19857', '20064', '20077', '20078', '20080', '20082', '20083', '26177', '26232', '26353', '26515', '26695', '26732', '26786', '26813', '26837', '26858', '26875', '26899', '26935', '26955', '26957', '26992', '27010', '27037', '27056', '27128', '27441', '29453', '29454', '29455', '29458', '29461', '29495', '29501', '33923', '33924', '33926', '33929', '33930', '33931', '33932', '33933', '33935', '33937', '33938', '33942', '33943', '33945', '33947', '33948', '33949', '33951', '33952', '34228', '34229', '34230', '34231', '34235', '34241'], '2006-2009': ['27646', '27682', '27783', '27886', '28035', '28060', '28066', '28071', '28092', '28109', '28110', '28138', '28144', '28168', '28189', '28195', '28196', '28210', '28215', '28240', '28282', '28302', '28303', '28310', '28324', '28356', '28403', '28794', '28796', '28829', '28898', '28927', '28935', '28937', '29141', '29362', '29364', '29372', '29429', '29436', '29437', '29445', '29507', '29532', '29555', '29560', '29576', '30003', '30009', '30019', '30024', '30034', '30086', '30118', '30122', '30125', '30128', '30139', '30142', '30204', '30285', '30618', '33959', '33960', '34243', '34244', '34245', '34246', '34247', '34249', '34250', '34252', '34254', '34256', '34257', '34258', '34259', '34260', '34261', '34264', '34265', '34266', '34270', '34271', '34272', 
'34273', '34274', '34276', '34277', '34280', '34281', '34282', '34283', '34284', '34285'], '2010-2012': ['30505', '30506', '30513', '30565', '30610', '30633', '30638', '30641', '30643', '30644', '30649', '30651', '30654', '30663', '30732', '30735', '30736', '30739', '30742', '30755', '30848', '30855', '30876', '30881', '30956', '31127', '31137', '31143', '31144', '31147', '31154', '31160', '31161', '31163', '31194', '31204', '31209', '31225', '31229', '31236', '31310', '31351', '31357', '31397', '31414', '31691', '31715', '31762', '31776', '31781', '31783', '31798', '31802', '31806', '31808', '31809', '31810', '31814', '31832', '31833', '31834', '31840', '31885', '31982', '32017', '32022', '32111', '32217', '32351', '33963', '33964', '33965', '34286', '34287', '34288', '34289', '34290', '34291', '34292', '34293', '34294', '34295', '34296', '34297', '34298', '34299'], '2013': ['32542', '32571', '32578', '32592', '32593', '32594', '32609', '32639', '32642', '32693', '32694', '32764', '32767', '32789', '32872', '32914', '33055', '33966', '33967', '33968', '33969', '33970', '33971', '33972', '33973', '33975', '33976', '33977', '33978', '33979', '33980', '33981', '33982', '33983', '33984', '33985', '33986', '33987', '33988', '33989', '33990', '33991', '33992', '33993', '33994', '33995', '33996', '33997', '33998', '33999', '34000', '34001', '34002', '34003', '34004', '34005', '34006', '34007', '34008', '34009', '34300', '34301', '34302', '34303', '34304', '34305', '34306', '34307', '34308', '34309', '34310', '34311', '34312', '34313', '34314', '34315', '34316', '34318', '34319', '34320', '34321', '34322'], '2014-2015': ['33166', '33171', '33177', '33202', '33221', '33229', '33231', '33302', '33312', '33313', '33325', '33326', '33397', '33442', '33576', '34010', '34011', '34012', '34013', '34014', '34015', '34017', '34018', '34019', '34020', '34021', '34023', '34024', '34186', '34223', '34326', '34328', '34329', '34331', '34332', '34333', '34334', '34341', '34342', '34343', '34344', '34346', '34349', '34352', '34353', '34362', '34561', '34695', '34696', '34697', '34698', '34707', '34717', '34722', '34723', '34746', '34764', '34841', '34845', '34846', '34898', '34899', '34901', '34911', '34921', '34948', '34969', '34980', '34986', '34988', '35034', '35038', '35043', '35087', '35162', '35169', '35170', '35191', '35195', '35196', '35268', '35434'], '2016-2017': ['35269', '35270', '35301', '35324', '35325', '35326', '35351', '35366', '35372', '35373', '35422', '35430', '35443', '35444', '35454', '35455', '35475', '35947', '35948', '35949', '35955', '35956', '35958', '35961', '35964', '35969', '35975', '36029', '36032', '36033', '36041', '36044', '36063', '36066', '36089', '36102', '36131', '36176', '36198', '36239', '36258', '36380', '36405', '36421', '36671', '36672', '36677', '36699', '36710', '36720', '36793', '36805', '36806', '36821', '36856', '36862', '36873', '36874', '36875', '36878', '36880', '36881', '36894', '36905', '36906', '36911', '36929', '36930', '36931', '36932', '36936', '36992', '36997', '37005', '37006', '37053', '37076', '37169', '37218'], } # # Stringify subpops (because ids later were interpreted as strings and subpops as integers) # stringified_subpops = [] # for subpop in subpops: # stringified_subpops.append([str(item) for item in subpop]) # subpops = stringified_subpops ids = list(snek.shell(f'bcftools query -l {vcf}', iterable=True)) print("ids:", ids) # Find indicies of individuals #subpops_by_indices = [] subpops_by_indices = {} #print("subs:", subpops_dict) for 
subpop_name, subpop_members in subpops_dict.items(): print("subpop_name:", subpop_name, "subpop_members", subpop_members) indices = [ids.index(id) for id in subpop_members] #subpops_by_indices.append(indices) subpops_by_indices[subpop_name] = indices print("subpops_by_indices:", subpops_by_indices) # Pull genotype data from VCF. callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'calldata/GT']) from itertools import combinations from statistics import mean import pandas as pd # Create pandas dataframe header = ["chrom", "center", "fst", "pop1", "pop2"] df = pd.DataFrame(columns=header) # Loop through populations pariwise for pair_names, pair_indices in zip(combinations(subpops_by_indices.keys(), 2), combinations(subpops_by_indices.values(), 2)): # Loop through each chromosome for chrom in sorted(list(set(callset['variants/CHROM']))): chrom_sites = np.equal(callset['variants/CHROM'], chrom) genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_sites]) positions = callset['variants/POS'][chrom_sites] # Run Fst. fsts, windows, counts = allel.windowed_weir_cockerham_fst(positions, genotypes, pair_indices, size=500000) # Set negative Fst values to 0. for idx, fst in enumerate(fsts): if fst < 0: fsts[idx] = 0 # Create chromosome dataframe window_centers = [mean(positions) for positions in windows] print("Lengths:") print("chrom", len([chrom]*len(fsts))) print("center", len(window_centers)) print("fst", len(fsts)) print("pop1", len([pair_names[0]]*len(fsts))) print("pop2", len([pair_names[1]]*len(fsts))) data_dict = {"chrom": [chrom]*len(fsts), "center": window_centers, "fst": fsts, "pop1": [pair_names[1]]*len(fsts), "pop2": [pair_names[0]]*len(fsts)} chr_df=pd.DataFrame.from_dict(data_dict) df = pd.concat([df, chr_df]) df = df.sort_values(["chrom", "center"]) # Store data as a pickle file. # This can then be opened into a pandas dataframe and used to make graphs with Seaborn df.to_pickle(pickle) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | import allel import numpy as np import pandas as pd # Variables taken from Snakemake rule vcf = snakemake.input.vcf roh_pickle = snakemake.output.roh_pickle froh_pickle = snakemake.output.froh_pickle # Reference genome Mmul_8.0.1 chromosomes chromosome_lengths = {"1": 225_584_828, "2": 204_787_373, "3": 185_818_997, "4": 172_585_720, "5": 190_429_646, "6": 180_051_392, "7": 169_600_520, "8": 144_306_982, "9": 129_882_849, "10": 92_844_088, "11": 133_663_169, "12": 125_506_784, "13": 108_979_918, "14": 127_894_412, "15": 111_343_173, "16": 77_216_781, "17": 95_684_472, "18": 70_235_451, "19": 53_671_032, "20": 74_971_481, "X": 149_150_640} # Create empty roh list for all samples roh_list = [] # Create empty froh dictionary, which will contain proportion of genome in a ROH for each sample. froh_list = [] # Pull genotype data from VCF callset = allel.read_vcf(vcf, ['variants/CHROM', 'variants/POS', 'samples', 'calldata/GT']) # Loop through each sample for sample_idx, sample in enumerate(callset['samples']): for chrom in sorted(list(set(callset['variants/CHROM']))): chrom_sites = np.equal(callset['variants/CHROM'], chrom) # Select genotypes by chromosome and sample genotypes = allel.GenotypeArray(callset['calldata/GT'][chrom_sites]) genotypes = genotypes[:, sample_idx] positions = callset['variants/POS'][chrom_sites] roh_df, froh = allel.roh_poissonhmm(genotypes, positions, min_roh=1_000_000, contig_size=chromosome_lengths[chrom]) # Append to samples list roh_list.append(roh_df) # Add froh to dictionary froh_list.append((sample, chrom, froh)) # Store data as pickle files. # These can then be opened into a pandas dataframe and used to make graphs with Seaborn pd.concat(roh_list, ignore_index=True).to_pickle(roh_pickle) pd.DataFrame(froh_list, columns=['samples', 'chrom', 'froh']).to_pickle(froh_pickle) |
Support
- Future updates