Bo_demography

Demographic analysis of Brassica oleracea

- Alignment of sequence data
- Follows GATK4 best practices (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145)
- Implemented as a Snakemake workflow (https://snakemake.readthedocs.io/en/stable/index.html)
TODO

- Clean up the filtering steps
- Add adjustable parameters & metrics for the filtering criteria (depth, missingness)
- Make more efficient use of scratch space
- Include sample ids & read groups in the bwa-mem step (thanks, Michelle!): `bwa mem -R $(echo "@RG\tID:${LANE}_${SAMPLE}\tSM:$SAMPLE\tLB:${SAMPLE}.1\tPL:ILLUMINA") -a $B73FA $R1 $R2`
- Add a config file to specify samples/reference/paths/etc., which will make the workflow more general (a sketch of a possible config file follows this list)
  - include a variable for the scratch directory
- Add a rule for generating the mappability mask using SNPable
- Find a better solution for creating the ADMIXTURE input (it currently requires manually editing chromosome names and SNP ids)
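For the config-file item above, here is a rough sketch of what such a config.yaml could look like under Snakemake's `configfile:` mechanism. All keys and values are illustrative assumptions, not an actual file from this repository; the reference path and the depth/missingness cutoffs are borrowed from the commands shown later on this page.

```yaml
# Hypothetical config.yaml sketch for the config-file TODO item.
# Keys and values are assumptions, not the repository's actual config.
samples: "samples.tsv"                                  # hypothetical sample sheet
ref: "data/external/ref/Boleracea_chromosomes.fasta"    # HDEM reference (path from the download rule below)
scratch_dir: "/path/to/scratch"                         # variable for scratch space, per the TODO
filtering:
  min_depth: 5             # per-genotype DP cutoffs used in the hard-filtering rules below
  max_depth: 200
  max_nocall_fraction: 0.1
```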
Workflow

- The Snakefile provides the general organization (names of samples; pulls in the other scripts/rules; see the sketch after this list)
- rules/ contains the rules (*.smk) that tell snakemake which jobs to run, organized broadly into categories:
  - mapping.smk - map reads to the reference genome
    - download fastq files (fasterq-dump)
    - trim adapters (Trimmomatic PE)
    - map to the reference genome (bwa mem; B. oleracea HDEM reference)
    - sort BAM files (gatk/SortSam)
    - mark duplicates with Picard (gatk/MarkDuplicates)
    - add read groups with Picard (gatk/AddOrReplaceReadGroups)
  - calling.smk - call variants
    - identify raw SNPs/indels (gatk/HaplotypeCaller)
    - combine GVCFs (gatk/GenomicsDBImport)
    - joint genotyping (gatk/GenotypeGVCFs)
  - filtering.smk - filter SNPs
    - select SNPs (gatk/SelectVariants)
    - apply hard filters (gatk/VariantFiltration)
    - export diagnostics (QUAL, quality by depth, depth, mapping quality, etc.) (gatk/VariantsToTable)
    - remove sites with missing data and select biallelic SNPs (gatk/SelectVariants)
    - filter on individual sample depth (gatk/VariantFiltration + gatk/SelectVariants)
    - combine vcfs into a single file (bcftools concat)
    - export filtered depth for each accession (gatk/VariantsToTable)
  - smc.smk - estimate population size history with SMC++
    - convert vcf to smc format; requires a mappability mask (smc++ vcf2smc)
    - fit population size history with cross-validation (smc++ cv)
    - fit population size history without cross-validation (smc++ estimate)
  - admixture.smk - maximum likelihood estimation of individual ancestries
    - create the input file (some bash commands + LD pruning in plink)
    - run admixture (admixture)
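The Snakefile itself is not reproduced on this page. As a minimal sketch of the organization described above — the sample list and the `rule all` targets are hypothetical placeholders; only the rules/*.smk file names come from this repository:

```python
# Sketch of a Snakefile that ties the rule files together via include:.
# SAMPLES and the rule all targets are illustrative assumptions.
SAMPLES = ["sample1", "sample2"]

rule all:
    input:
        "models/admixture",   # hypothetical final outputs
        "models/smc"

include: "rules/mapping.smk"
include: "rules/calling.smk"
include: "rules/filtering.smk"
include: "rules/smc.smk"
include: "rules/admixture.smk"
```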
To run the workflow:

- Create a conda environment (you only need to do this once): `conda env create bo-demography --file environment.yaml`
  - To access it for future use, start a screen session with `screen` and activate it with `source activate bo-demography`
- If running on a cluster, update the cluster config file (submit.json)
- Do a dry run to check that everything is in place: `snakemake -n`
- Submit the workflow with `./submit.sh` (a sketch of what such a wrapper might contain follows)
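submit.sh is not shown on this page. A minimal sketch of what such a wrapper could contain, assuming a SLURM scheduler and Snakemake's older `--cluster`/`--cluster-config` interface that pairs with a submit.json file — the flags and resource keys here are assumptions, so check the actual script in the repository:

```bash
#!/bin/bash
# Hypothetical submit.sh sketch: per-rule resources come from submit.json,
# and each Snakemake job is handed to sbatch. Not the repository's actual script.
snakemake \
    --cluster-config submit.json \
    --cluster "sbatch --time={cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.n}" \
    --jobs 100 \
    --latency-wait 60
```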
Project Organization

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── data           <- Scripts to download or generate data
│   ├── features       <- Scripts to turn raw data into features for modeling
│   ├── models         <- Scripts to train models and then use trained models to make
│   │                     predictions
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
```

Project based on the cookiecutter data science project template. #cookiecutterdatascience
Code Snippets
```
run:
    shell("fastqc \
        -t {threads} \
        -o {params} \
        --noextract \
        {input}")
```
```
run:
    shell("multiqc {params.indir}/*L004* -o {params.outdir} -n STJRI01_multiqc.html")
    shell("multiqc {params.indir}/*L001* -o {params.outdir} -n STJRI02_multiqc.html")
    shell("multiqc {params.indir}/*L002* -o {params.outdir} -n STJRI03_multiqc.html")
```
```
run:
    shell("wget --directory-prefix=data/external/ref \
        http://www.genoscope.cns.fr/externe/plants/data/Boleracea_chromosomes.fasta")
    shell("bwa index -a bwtsw {output.ref}")
    shell("samtools faidx {output.ref}")
    shell("gatk CreateSequenceDictionary \
        -R={output.ref} \
        -O={output.dict}")
```
```
run:
    print({params.sample}, {params.SRR})
    shell("mkdir -p {params.tmp}")
    shell("fasterq-dump \
        {params.SRR} \
        -O {params.tmp} \
        -o {params.sample} \
        -t {params.tmp} \
        -e {threads} \
        -p")
    shell("java -jar /share/apps/Trimmomatic-0.36/trimmomatic.jar PE \
        {params.stem}_1.fastq {params.stem}_2.fastq \
        {params.stem}.forward.1.fastq \
        {params.stem}.unpaired.1.fastq \
        {params.stem}.reverse.2.fastq \
        {params.stem}.unpaired.2.fastq \
        LEADING:3 \
```
```
run:
    shell("mkdir -p {params.tmp}")
    shell("java -jar /share/apps/Trimmomatic-0.36/trimmomatic.jar PE \
        {input.fastq1} {input.fastq2} \
        {params.stem}.forward.1.fastq \
        {params.stem}.unpaired.1.fastq \
        {params.stem}.reverse.2.fastq \
        {params.stem}.unpaired.2.fastq \
```
```
run:
    shell("mkdir -p {params.tmp}")
```
```
run:
    shell("mkdir -p {params.tmp}")
    shell("gatk MarkDuplicates \
        -I={input} \
        -O={output.bam} \
        --METRICS_FILE={output.metrics} \
        --CREATE_INDEX=true \
        -MAX_FILE_HANDLES=1000 \
        --ASSUME_SORT_ORDER=coordinate \
        --TMP_DIR={params.tmp}")
    shell("rm -rf {params.tmp}")
```
```
run:
    shell("mkdir -p {params.tmp}")
```
```
run:
    shell("qualimap bamqc \
        -bam {input.bam} \
        --paint-chromosome-limits \
        -gff {input.gff} \
        -nt {threads} \
        -outdir {params.dir} \
        -outformat HTML \
        --skip-duplicated")
```
```
run:
    shell("find reports/bamqc -mindepth 1 -maxdepth 1 -type d | grep SamC > models/ALL.bamqclist.txt")
    import pandas as pd
    data = pd.read_csv("models/ALL.bamqclist.txt", sep = " ", header = None, names = ['filename'])
    data['sample'] = data['filename'].str.split('.').str[0].str.split('/').str[2].str.split('_stats').str[0]
    data = data.sort_values('sample', axis = 0, ascending = True)
    data = data[['sample','filename']]
    data.to_csv(r'models/bamqc_list.txt', header = None, index = None, sep = ' ', mode = 'a')
    shell("qualimap multi-bamqc \
        --data {params.infile} \
        --paint-chromosome-limits \
        -outdir {params.outdir} \
        -outformat html \
        --java-mem-size=40G")
```
```
run:
    shell("gatk HaplotypeCaller \
        -I {input.bam} \
        -O {output} \
        -R {input.ref} \
        -L {params.regions} \
        -G StandardAnnotation \
        -G AS_StandardAnnotation \
        --emit-ref-confidence BP_RESOLUTION")
```
```
run:
    shell("gatk SplitIntervals \
        -R {input.ref} \
        -L {params.regions} \
        --scatter-count {params.intervals} \
        -O {params.outdir}")
```
```
run:
    shell("mkdir -p {params.tmp}")
    shell("gatk --java-options \"-Xmx60g -Xms60g\" \
        GenomicsDBImport \
        --genomicsdb-workspace-path {output} \
        --batch-size 50 \
        --reader-threads 6 \
        --sample-name-map {input.map} \
        --intervals {input.region} \
        --tmp-dir {params.tmp}")
    shell("rm -rf {params.tmp}")
```
```
run:
    shell("gatk GenotypeGVCFs \
        -R {input.ref} \
        -V {params.db} \
        -L {params.region} \
        -new-qual \
        -G StandardAnnotation \
        -G AS_StandardAnnotation \
        --include-non-variant-sites \
        -O {output}")
```
```
run:
    shell("bcftools concat {input} -Oz -o {output}")
    shell("tabix -p vcf {output}")
```
```
run:
    shell("gatk SelectVariants \
        -R {input.ref} \
        -V {input.vcf} \
        -select-type SNP \
        --restrict-alleles-to BIALLELIC \
        -O {output}")
```
```
run:
    shell("gatk SelectVariants \
        -R {input.ref} \
        -V {input.vcf} \
        -select-type NO_VARIATION \
        -O {output}")
```
```
run:
    shell("gatk VariantsToTable \
        -R {input.ref} \
        -V {input.variant_vcf} \
        -F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F FS -F ReadPosRankSum -F SOR -F AD \
        -O {output.variant_stats}")
    shell("gatk VariantsToTable \
        -R {input.ref} \
        -V {input.invariant_vcf} \
        -F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F FS -F ReadPosRankSum -F SOR -F AD \
        -O {output.invariant_stats}")
    shell("Rscript src/filtering_diagnostics.R")
```
```
run:
    shell("gatk VariantFiltration \
        -V {input.raw_snp_vcf} \
        --intervals {params.chr} \
        -filter \"QD < 2.0\" --filter-name \"QD2\" \
        -filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
        -filter \"SOR > 3.0\" --filter-name \"SOR3\" \
        -filter \"FS > 60.0\" --filter-name \"FS60\" \
        -filter \"MQ < 40.0\" --filter-name \"MQ40\" \
        -filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
        -filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
        -G-filter \"DP < 5 || DP > 200\" \
        -G-filter-name \"DP_5-200\" \
        -O {output.snps_flagged}")
```
```
run:
    shell("gatk VariantFiltration \
        -V {input.snps_flagged} \
        --set-filtered-genotype-to-no-call true \
        -O {output.snps_nocall}")
```
```
run:
    shell("gatk SelectVariants \
        -R {input.ref} \
        -V {input.snps_nocall} \
        -select-type SNP \
        --restrict-alleles-to BIALLELIC \
        --max-nocall-fraction 0.1 \
        --exclude-filtered true \
        -O {output.filtered_snps}")
```
```
run:
    shell("gatk VariantFiltration \
        -V {input.raw_invariant_vcf} \
        --intervals {params.chr} \
        -filter \"QUAL < 30.0\" --filter-name \"QUAL30\" \
        -filter \"MQ < 40.0\" --filter-name \"MQ40\" \
        -filter \"MQRankSum < -12.5\" --filter-name \"MQRankSum-12.5\" \
        -filter \"ReadPosRankSum < -8.0\" --filter-name \"ReadPosRankSum-8\" \
        -G-filter \"DP < 5 || DP > 200\" \
        -G-filter-name \"DP_5-200\" \
        -O {output.invariant_flagged}")
```
```
run:
    shell("gatk VariantFiltration \
        -V {input.invariant_flagged} \
        --set-filtered-genotype-to-no-call true \
        -O {output.invariant_nocall}")
```
```
run:
    shell("gatk SelectVariants \
        -R {input.ref} \
        -V {input.invariant_nocall} \
        --max-nocall-fraction 0.1 \
        --exclude-filtered true \
        -O {output.filtered_invariant}")
```
```
run:
    shell("bgzip {input}")
    shell("tabix -p vcf {input}.gz")
```
```
run:
    shell("gatk MergeVcfs \
        -I={params.input} \
        -O={output.merged_snps}")
```
```
run:
    shell("gatk MergeVcfs \
        -I={input.snps} \
        -I={input.invariant} \
        -O={output.allsites_vcf}")
```
```
run:
    shell("vcftools \
        --gzvcf {input.merged_snps} \
        --missing-indv \
        --out reports/filtered.qual.dp5_200.maxnocall10")
    shell("vcftools \
        --gzvcf {input.merged_snps} \
        --het \
        --out reports/filtered.qual.dp5_200.maxnocall10")
```
```
run:
    shell("{params.plink2} --vcf {input.vcf} \
        --allow-extra-chr \
        --max-alleles 2 \
        --vcf-filter \
        --geno 0.1 \
        --mind {params.mind} \
        --maf 0.05 \
        --make-bed \
        --out {params.stem}")
    shell("cp {params.stem}.bim {params.stem}_temp.bim")
    shell("""sed \"s/^C//g" {params.stem}_temp.bim > {params.stem}_temp2.bim""")
    shell("./src/replace_rsids.sh {params.stem}_temp2.bim > {params.stem}.bim")
    shell("rm {params.stem}_temp*.bim")
    shell("{params.plink2} --bfile {params.stem} \
        --allow-extra-chr \
        --indep-pairwise 50 10 0.1 \
        --out {params.stem}")
    shell("{params.plink2} --bfile {params.stem} \
        --allow-extra-chr \
        --extract {params.stem}.prune.in \
        --out {params.pruned} \
        --make-bed")
```
```
run:
    shell("admixture -B --cv -j{threads} {input.bed} {params.k}")
    shell("mv {params.stem}.{params.k}.* models/admixture")
```
```
run:
    shell("tabix -p vcf -f {input.allsites_vcf}")
    shell("pixy --stats pi \
        --vcf {input.allsites_vcf} \
        --populations {input.samps} \
        --window_size {params.window_size} \
        --n_cores {threads} \
        --output_folder models/pixy \
        --output_prefix B_oleracea_grouped_{params.chr} \
        --chunk_size 50000")
```
```
run:
    shell("mkdir -p {params.tmp}")
    shell("samtools faidx {input} {params.chr} > {params.newfasta}")
    shell("{snpable}/splitfa {params.newfasta} 100 | split -l 80000000 - {params.tmp}/x")
    shell("cat {params.tmp}/x* > {params.tmp}/simu_reads.fq")
    shell("bwa aln -R 1000000 -O 3 -E 3 -t {threads} {input} {params.tmp}/simu_reads.fq > {params.tmp}/simu_reads.sai")
    shell("bwa samse {input} {params.tmp}/simu_reads.sai {params.tmp}/simu_reads.fq > {params.tmp}/simu_reads.sam")
    shell("perl {snpable}/gen_raw_mask.pl {params.tmp}/simu_reads.sam > {params.tmp}/raw_mask_100.fa")
    shell("{snpable}/gen_mask -l 100 -r 0.5 {params.tmp}/raw_mask_100.fa > {params.famask}")
    shell("rm -rf {params.tmp}")
    shell("./src/makeMappabilityMask.py {params.famask}")
```
```
shell:
    "smc++ cv \
    --cores {threads} \
    --spline cubic \
    -o {params.model_out_dir} {params.mu} {params.model_in}"
```
```
shell:
    "smc++ plot \
    --csv \
    -g {params.gen} \
    {output} \
    {input.smc_out}"
```
```
shell:
    "python3 src/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/bootstrap_input/{params.population}.{params.distind}_rep \
    {params.input_dir}"
```
```
shell:
    "smc++ cv \
    --cores {threads} \
    --spline cubic \
    -o {params.model_out_dir} {params.mu} {params.model_in}"
```
```
shell:
    "smc++ plot \
    --csv \
    -g {params.gen} \
    {output} \
    {input.smc_out}"
```
```
shell:
    """
    smc++ vcf2smc \
    --mask {input.mask} \
    -d {params.distind1} {params.distind1} \
    {input.vcf} {output.out12} {params.chrom} {params.pop_pair_string12}
    """
```
```
shell:
    """
    smc++ vcf2smc \
    --mask {input.mask} \
    -d {params.distind2} {params.distind2} \
    {input.vcf} {output.out21} {params.chrom} {params.pop_pair_string21}
    """
```
```
shell:
    "smc++ split \
    --cores 8 \
    -o {params.model_out_dir} \
    {input}"
```
```
shell:
    "python3 scripts/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/split_bootstrap_input/{params.pop_pair}_12.{params.distind1}_rep \
    {params.input_dir}"
```
```
shell:
    "python3 scripts/smc_bootstrap.py \
    --nr_bootstraps {params.nr_bootstraps} \
    --chunk_size {params.chunk_size} \
    --chunks_per_chromosome 10 \
    --nr_chromosomes {params.nr_chr} \
    models/smc/split_bootstrap_input/{params.pop_pair}_21.{params.distind1}_rep \
    {params.input_dir}"
```
```
shell:
    "smc++ split \
    --cores 2 \
    -o {params.model_out_dir} \
    {input}"
```
```
shell:
    """
    smc++ plot \
    --csv \
    -g {params.gen} \
    {output.split} \
    {input.smc_split}

    smc++ plot \
    --csv \
    -g {params.gen} \
    {output.bootstrap} \
    {input.smc_split_bootstrap}
    """
```
```
shell:
    """
    cat {input.allsites_vcf} | sed -e 's/chr//' | awk '{{OFS="\\t"; if (!/^#/){{print $1,$2,$2}}}}' | bedtools complement -i stdin -g {input.chr_lengths} | grep {params.chr} > {output}
    """
```
```
run:
    shell("gunzip -c {input} > {output}")
```
```
run:
    shell("RAiSD -R -s -m 0.05 -P -X {input.excluded_sites} -n {params.population} -I {input.allsites_vcf} -S {input.samples} -w 500 -f -c 2")
    shell("mv {params.out}* {params.outdir}")
```
```r
library(tidyverse)
library(cowplot)

VCFsnps <- read.csv('reports/filtering_bpres/all_samps.raw.snps.table',
                    header = T, na.strings = c("", "NA"), sep = "\t")
dim(VCFsnps)
VCFinvariant <- read.csv('reports/filtering_bpres/all_samps.raw.invariant.table',
                         header = T, na.strings = c("", "NA"), sep = "\t")
dim(VCFinvariant)
VCF <- bind_rows(list(variant = VCFsnps, invariant = VCFinvariant), .id = "class")
dim(VCF)

xmax <- mean(VCF$DP) * 3

# colors
snps <- '#A9E2E4'
invariant <- '#F4CCCA'

DP <- ggplot(VCF, aes(x = DP, fill = class)) +
  geom_density() +
  geom_vline(xintercept = c(10, xmax)) +
  xlim(c(0, 6000)) +
  theme_bw()

QD <- ggplot(VCF, aes(x = QD, fill = class)) +
  geom_density(alpha = .3) +
  geom_vline(xintercept = 2, size = 0.7) +
  theme_bw()

FS <- ggplot(VCF, aes(x = FS, fill = class)) +
  geom_density(alpha = .3) +
  geom_vline(xintercept = c(60, 200), size = 0.7) +
  ylim(0, 0.1) +
  theme_bw()

MQ <- ggplot(VCF, aes(x = MQ, fill = class)) +
  geom_density(alpha = .3) +
  geom_vline(xintercept = 40, size = 0.7) +
  theme_bw()

MQRankSum <- ggplot(VCF, aes(x = MQRankSum, fill = class)) +
  geom_density(alpha = .3) +
  geom_vline(xintercept = -20, size = 0.7) +
  theme_bw()

SOR <- ggplot(VCF, aes(x = SOR, fill = class)) +
  geom_density(alpha = .3) +
  geom_vline(xintercept = c(4, 10), size = 1, colour = c(snps, invariant)) +
  theme_bw()

ReadPosRankSum <- ggplot(VCF, aes(x = ReadPosRankSum, fill = class)) +
  geom_density(alpha = .3) +
  geom_vline(xintercept = c(-10, 10), size = 1) +
  xlim(-30, 30) +
  theme_bw()

plot_grid(QD, DP, FS, MQ, MQRankSum, SOR, ReadPosRankSum, nrow = 4)
ggsave("reports/filtering_bpres/unfiltered_diagnostics.png")
```
```python
import click
import os
import sys
import random
import gzip
import time

@click.command()
@click.option('--nr_bootstraps', type=int, help="nr of bootstraps [20]", default=20)
@click.option("--chunk_size", type=int, help="size of bootstrap chunks [5000000]", default=5000000)
@click.option("--chunks_per_chromosome", type=int, help="nr of chunks to put on one chromosome in the bootstrap [20]", default=20)
@click.option("--nr_chromosomes", type=int, help="nr of chromosomes to write [30]", default=24)
@click.option("--seed", type=int, help="initialize the random number generator", default=None)
@click.argument("out_dir_prefix")
@click.argument("files", nargs=-1)
def main(nr_bootstraps, chunk_size, chunks_per_chromosome, nr_chromosomes, seed,
         out_dir_prefix, files):
    chunks = []
    offset = 0
    chunks_in_chrom = []
    if not seed:
        seed = int(time.time())
    random.seed(seed)
    print('seed: %s' % seed)
    # Slice each smc input file into chunk_size-sized pieces.
    for file in files:
        print(file)
        with gzip.open(file, 'rb') as f:
            header = f.readline().decode()
            prev_chunk_idx = -1
            for line_count, line in enumerate(f):
                line = line.decode()
                line = [int(x) for x in line.strip().split()]
                if line_count == 0:
                    pos = 1
                    offset += 1
                else:
                    pos = line[0] + offset
                    offset += line[0]
                chunk_index = (pos - 1) // chunk_size
                # The code below won't work if there are super large hom. stretches
                # in the smc input file. This might happen if, for example, you're
                # only analyzing part of the chromosome, and you mask everything
                # outside of that part. The last line in the input file might look like:
                #   72402527 -1 0 0
                # which tells SMC++ to mask the final ~72 Mb of the chromosome.
                # The check below should deal with this and stop the for loop if the
                # next position is more than (2 * chunk_size) away.
                if (chunk_index - prev_chunk_idx) >= 2:
                    print('\n')
                    print('Next position is in %s bases' % pos)
                    print('Last relative position was at %s bases' % lastpos)
                    print('This might be the result of how vcf2smc masks large stretches of the vcf (say for subsampling a chromosome), or it might be a real error.\nDouble check that this input file is formatted how you intend.\nStopping "chunking" of input file and moving on\n')
                    break
                print('Processing chunk %s of input file' % chunk_index)
                if chunk_index > len(chunks_in_chrom) - 1:
                    chunks_in_chrom.append([])
                chunks_in_chrom[chunk_index].append(line)
                prev_chunk_idx = prev_chunk_idx + (chunk_index - prev_chunk_idx)
                lastpos = pos
    chunks.extend(chunks_in_chrom)
    # Resample chunks with replacement to build each bootstrap replicate.
    for bootstrap_id in range(1, nr_bootstraps + 1):
        for chr_ in range(1, nr_chromosomes + 1):
            chr_dir = "{}_{}".format(out_dir_prefix, bootstrap_id)
            if not os.path.exists(chr_dir):
                os.makedirs(chr_dir)
            chr_file = "{}/bootstrap_chr{}.gz".format(chr_dir, chr_)
            print("writing", chr_file, file=sys.stderr)
            with gzip.open(chr_file, 'wb') as f:
                f.write(header.encode())
                for i in range(chunks_per_chromosome):
                    chunk_id = random.randrange(len(chunks))
                    for line in chunks[chunk_id]:
                        line = ' '.join([str(x) for x in line]) + '\n'
                        f.write(line.encode())

if __name__ == "__main__":
    main()
```
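For reference, a command-line invocation of this script mirroring the smc.smk rules above might look like the following. The output prefix and input paths are placeholders rather than the workflow's actual values, and nine chromosomes are assumed (B. oleracea has nine):

```bash
# Illustrative invocation; paths and the POP/DISTIND naming are placeholders.
# Resamples 5 Mb chunks from the smc input files into 10 bootstrap replicates.
python3 src/smc_bootstrap.py \
    --nr_bootstraps 10 \
    --chunk_size 5000000 \
    --chunks_per_chromosome 10 \
    --nr_chromosomes 9 \
    models/smc/bootstrap_input/POP.DISTIND_rep \
    models/smc/input/POP.DISTIND.chr*.smc.gz
```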