Use an ensemble of variant callers to call variants from ATAC-seq data
varCA
A pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.
The entire pipeline is made up of two smaller subworkflows. The prepare subworkflow calls each variant caller and prepares the resulting data for use by the classify subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.
Code Ocean
Using our Code Ocean compute capsule, you can execute VarCA v0.2.1 on example data without downloading or setting up the project. To interpret the output of VarCA, see the output sections of the prepare subworkflow and the classify subworkflow in the rules README.
download
Execute the following command or download the latest release manually.
git clone https://github.com/aryarm/varCA.git
Also consider downloading the example data:
cd varCA
wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz | tar xvzf -
setup
The pipeline is written as a Snakefile, which can be executed via Snakemake. We recommend installing version 5.18.0:
conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
We highly recommend you install Snakemake via conda like this so that you can use the --use-conda flag when calling snakemake, letting it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the env files.
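If you would rather not use the provided run.bash wrapper, Snakemake can also be invoked directly. The sketch below only illustrates what such a call might look like; it is not the project's prescribed command. It assumes you run it from the repository root (so Snakemake finds the Snakefile there) and that the Snakefile loads its own config; if it does not, add --configfile configs/config.yaml. The flags themselves (--use-conda, --cores, --rerun-incomplete) are standard Snakemake options.
# Illustrative only: run.bash wraps a snakemake call along these lines.
conda activate snakemake
snakemake --use-conda --cores 4 --rerun-incomplete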
execution
- Activate snakemake via conda:
conda activate snakemake
- Execute the pipeline on the example data, either locally:
./run.bash &
or on an SGE cluster:
./run.bash --sge-cluster &
Output
VarCA will place all of its output in a new directory (out/, by default). Log files describing the progress of the pipeline will also be created there: the log file contains a basic description of the progress of each step, while the qlog file is more detailed and will contain any errors or warnings. You can read more about the pipeline's output in the rules README.
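Because the pipeline runs in the background, a convenient way to follow its progress is to watch those log files. A minimal sketch, assuming the default out/ directory and that the two logs are the files named log and qlog inside it:
# follow the high-level progress log as the pipeline runs
tail -f out/log
# check the detailed log for warnings or errors from individual steps
less out/qlog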
Executing the pipeline on your own data
You must modify the config.yaml file to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.
Executing each portion of the pipeline separately
The pipeline is made up of two subworkflows. These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the rules README for execution instructions and a description of the outputs. You will need to execute the subworkflows separately if you ever want to create your own trained models.
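The rules README is the authoritative guide for standalone execution. Purely as an illustration, a direct run of one subworkflow might look like the sketch below; the Snakefile and config paths shown here are assumptions, not taken from this page, so substitute the ones documented in the rules README.
# Hypothetical invocation of the prepare subworkflow on its own.
# -s (choose a Snakefile) and --configfile are standard Snakemake flags;
# the exact paths below are assumed for illustration only.
conda activate snakemake
snakemake -s rules/prepare.smk --configfile configs/prepare.yaml --use-conda --cores 4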
Reproducing our results
We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow these steps to create all of the plots and tables in our paper.
If this is your first time using Snakemake
We recommend that you run snakemake --help to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the -n -p -r flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- i.e., the all rule).
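Because run.bash passes its arguments directly to snakemake, a dry run can also be requested through the wrapper. A small sketch (the flags are standard Snakemake options):
# -n: dry run (build nothing), -p: print each shell command, -r: give the reason for each job.
# This previews every step of the pipeline without creating any output files.
./run.bash -n -p -r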
Note that Snakemake will not recreate output that it has already generated unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will simply pick up where the previous run left off. This also applies to files that you create yourself and provide in place of the files Snakemake would have generated.
By default, the pipeline will automatically delete some files it deems unnecessary (e.g., unsorted copies of a BAM). You can opt to keep these files instead by providing the --notemp flag to Snakemake when executing the pipeline.
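As with the dry-run flags above, --notemp can simply be appended to the usual command, since run.bash forwards its arguments to snakemake:
# keep intermediate files (e.g. unsorted BAMs) that would normally be deleted
./run.bash --notemp &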
files and directories
Snakefile
A Snakemake pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:
- the prepare subworkflow, which prepares the reads for classification, and
- the classify subworkflow, which creates a VCF containing predicted variants
rules/
Snakemake rules for the prepare and classify subworkflows. You can either execute these subworkflows from the master Snakefile or individually as their own Snakefiles. See the rules README for more information.
configs/
Config files that define options and input for the pipeline and the prepare and classify subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out the config file for the pipeline.
callers/
Scripts for executing each of the variant callers used by the prepare subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the caller README for more information.
breakCA/
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are adapted from @Arkosen's BreakCA code.
scripts/
Various scripts used by the pipeline. See the script README for more information.
run.bash
An example bash script for executing the pipeline using snakemake and conda. Any arguments to this script are passed directly to snakemake.
citation
There is an option to "Cite this repository" on the right sidebar of the repository homepage.
Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., & McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621
Code Snippets
shell: "zcat {input} | scripts/cgrep.bash - -E '^(CHROM|POS|REF)$|^({params.callers})~' | gzip > {output}"
shell: "zcat {input} | scripts/filter.bash {params.expr:q} | gzip > {output}"
shell: "zcat {input} | scripts/cgrep.bash - -E '^CHROM$|^POS$|~CLASS:' | gzip > {output}"
shell: "paste " "<(zcat {input.annot} | cut -f 3- | scripts/cgrep.bash - -v '{params.truth}') " "<(zcat {input.annot} | cut -f 3- | scripts/cgrep.bash - '{params.truth}') | " "sed 's/^\\t//' | paste <(zcat {input.tsv} | " "scripts/cgrep.bash - -v '~CLASS:' | " "scripts/cgrep.bash - -Evx '(CHROM|POS|REF)') - | gzip > {output}"
shell: "Rscript scripts/train_RF.R {input} {params.balance} {output}"
shell: "Rscript scripts/tune_plot.R {input} {output}"
shell: "Rscript scripts/predict_RF.R {input.predict} {input.model} {output}"
shell: "cat {input.predict} | paste <(zcat {input.annot}) " "<(read -r head && echo \"$head\" | tr '\\t' '\\n' | " "sed 's/response/CLASS:/' | sed 's/^/varca~/' | " "paste -s && cat) | gzip > {output}"
shell: "paste " "<(zcat {input.results} | scripts/cgrep.bash - '{params.truth}~CLASS:') " "<(zcat {input.results} | scripts/cgrep.bash - '{wildcards.caller}~CLASS:') " "<(zcat {input.predicts} | scripts/cgrep.bash - -F '{wildcards.caller}~{params.predict_col}')" " | tail -n+2 | scripts/metrics.py -o {output} {params.ignore_probs} {params.flip}"
shell: "paste " "<(zcat {input.annot} | scripts/cgrep.bash - '{params.truth}~CLASS:') " "<(zcat {input.predicts} | scripts/cgrep.bash - -F '{wildcards.caller}~{params.predict_col}') | " "tail -n+2 | scripts/statistics.py -o {output} {params.flip} {params.thresh}"
shell: "scripts/prc.py {output} {params.pts} {params.curves}"
shell: "scripts/2vcf.py -o {output} {params.callers} {input.results} {input.merge}"
shell: "bcftools reheader -f {input.genome} {input.vcf} -o {output}"
shell: "bwa mem {input.ref} {input.fastq1} {input.fastq2} " "-R '@RG\\tID:{wildcards.sample}\\tLB:lib1\\tPL:Illumina\\tPU:unit1\\tSM:{wildcards.sample}' | " "samtools view -b -h -F 4 -q 20 -> {output}"
shell: "samtools sort -n {input} | " "samtools fixmate -m -O bam - - | " "samtools sort -o {output} -"
shell: "samtools markdup {input} {output.final_bam}"
shell: "samtools index -b {input}"
shell: "macs2 callpeak --nomodel --extsize 200 --slocal 1000 --qvalue 0.05 " "-g hs -f BAMPE -t {input.final_bam} -n {wildcards.sample} --outdir {params.output_dir}"
shell: "mkdir -p \"{output}\" && " "{input.caller_script} {input.bam} {input.peaks} " "{input.genome} {output} {wildcards.sample} " "{input.shared} {params.caller_params}"
shell: "mkdir -p \"{params.out_dir}\" && " "{input.caller_script} {input.bam} {input.peaks} " "{input.genome} {params.out_dir} {wildcards.sample} " "{input.shared} {params.caller_params}"
shell: "bcftools view {config[bcftools_params]} {input.vcf} > {output.vcf}"
shell: "bcftools norm -m -any {input.vcf} | bcftools norm --check-ref xw -d " "all -f {input.ref} > {output.vcf}"
shell: "bgzip <{input.vcf} >{output.gzvcf} && tabix -p vcf -f {output.gzvcf}"
shell: "echo -e '{params.cols}' > {output.tsv} && " "bcftools query -f '{params.qstr}' {input.gzvcf} >> {output.tsv}"
shell: "tail -n+2 {input} | awk -F '\\t' -v 'OFS=\\t' '{{for (i=1; i<=NF; i++) if ($i==\"NA\") $i=\".\"}}1' | " "sed 's/\\t\+/,/' | LC_ALL=C sort -t $'\\t' -k1,1 > {output}"
shell: "awk '{{printf(\"%s\\t%d\\t%d\\t%s\\n\",$1,int($2)+1,int($3),$4);}}' {input} | " "awk -F '\\t' -v 'OFS=\\t' '{{for (i=$2;i<$3;i++) print $1\",\"i,substr($4,i-$2+1,1)}}' | " "LC_ALL=C sort -t $'\\t' -k1,1 > {output}"
shell: "LC_ALL=C join -t $'\\t' -e. -a1 -j1 -o auto --nocheck-order " "<(cut -f 1 {input.sites}) {input.prepared_tsv} | cut -f 2- | cat " "<(head -n1 {input.tsv} | cut -f 3- | tr '\\t' '\\n' | " "sed 's/^/{wildcards.caller}~/' | paste -s) - > {output}"
shell: "paste <(echo -e 'CHROM\\tPOS\\tREF'; sed 's/,/\\t/' " "{input.all_sites}) {input.caller_output} | " "(read -r head; echo \"$head\"; sort -t $'\\t' -k1,1V -k2,2n) | gzip > {output}"
shell: "paste <(zcat {input} | scripts/cgrep.bash - -Ev '~(REF|ALT)$') <(scripts/classify.bash {input} '{params.label:q}') | gzip >{output}"
shell: "scripts/fillna.bash {input} {params.na_vals:q} | " "awk -F $'\\t' -v 'OFS=\\t' '{{for (i=1;i<=NF;i++) if ($i==\".\") $i=0}}1' | " "gzip >{output}"
shell: "zcat {input} | scripts/filter.bash {params.expr:q} | gzip > {output}"
shell: "zcat {input} | awk -f scripts/norm_nums.awk -F $'\\t' -v 'OFS=\\t' | " "gzip >{output}"
shell: "ln -sf \"$(basename {input})\" {output}"
scripts/2vcf.py (full source not reproduced here): merges the prepare subworkflow's table of per-caller REF/ALT alleles with the classifier's predictions, selects an allele from the highest-priority caller at each predicted variant site, recalibrates the classifier probabilities into phred-scaled QUAL values, and writes the result as a VCF with pysam; its -i/--internal mode fits and plots the probability-to-QUAL recalibration.
script_dir="$(dirname "$BASH_SOURCE")";

pfiles=()
for (( i=2; i<$#; i+=2 )); do
    j=$((i+1))
    reg="${!i}"
    val="${!j}"
    # retrieve the idx of the column among the columns
    reg="$(zcat "$1" | head -n1 | tr '\t' '\n' | grep -Fn "$reg" | cut -f1 -d: | head -n1)"
    pfiles+=("{if (\$$reg==\".\") \$$reg=\"$val\"}")
done

zcat "$1" | awk -F $'\t' -v 'OFS=\t' "${pfiles[*]} 1"
import sys
import argparse
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.metrics import auc
from itertools import product

parser = argparse.ArgumentParser()
parser.add_argument(
    "out", nargs="?", default=sys.stdout, help="the filename to save the data to"
)
known_args, unknown_args = parser.parse_known_args()

# dynamically parse whatever options the user passes us
count = 0
for arg in unknown_args:
    if arg.startswith('--'):
        parser.add_argument('--{}'.format(arg[2:]), type=argparse.FileType('r'))
        count += 1
if count < 1:
    parser.error("Specify the path to at least one space separated table (w/o a header) with two rows (recall/precision) using options like --gatk-indel path/to/gatk-table")
args = parser.parse_args()
all_args = vars(args)

def get_marker():
    """ retrieve a unique marker in a deterministic order """
    yield from product(range(2, 6), range(1, 3), [52])
    # if that wasn't enough, create some more just at a different angle
    yield from product(range(2, 8), range(1, 3), [0])
    # if we ever get to this point, we need more than 20 markers
    yield from product([2]+list(range(3, 6, 2)), range(1, 3), [90])

# initialize all of the colors
callers = ['gatk_indel', 'varscan_indel', 'vardict_indel', 'breakca', 'delly', 'pindel', 'illumina_manta', 'illumina_strelka', 'pg_indel']
colors = dict(zip(callers, plt.cm.Set1(range(len(callers)))))
# add the other callers, as well
for k in list(colors.keys()):
    if k.endswith('_indel'):
        colors[k[:-len('indel')]+"snp"] = colors[k]
    elif k == 'breakca':
        colors['varca'] = colors[k]
markers = get_marker()

# go through each table and get its name from all of the args
for arg in sorted(all_args.keys()):
    if arg not in known_args:
        table = np.loadtxt(all_args[arg])
        # recall: 1st row, precision: 2nd row
        if table.ndim != 1:
            # area = auc(table[0], table[1])
            # colors[arg] = plt.step(
            #     table[0], table[1], where='post',
            #     label=arg+": area={0:0.2f}".format(area)
            # )[0].get_color()
            if arg in colors:
                plt.step(
                    table[0], table[1], where='post', color=colors[arg],
                    label=arg.replace('breakca', 'varca')
                )
            else:
                # pick the color randomly
                plt.step(
                    table[0], table[1], where='post',
                    label=arg.replace('breakca', 'varca')
                )
        else:
            area = table[1]
            # check if this pt has a curve of the same name
            # to make sure they're the same color
            if arg.endswith("_pt") and arg[:-3] in colors:
                extra = {'color': colors[arg[:-3]]}
            else:
                extra = {}
            plt.plot(
                table[0], table[1], marker=next(markers), ms=12,
                label=arg.replace('breakca', 'varca')+": height={0:0.2f}".format(area),
                **extra
            )

plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", fontsize='small')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.0])
plt.xlim([0.0, 1.0])
plt.gcf().set_size_inches(10, 10)
plt.savefig(args.out, bbox_inches='tight', pad_inches=0.1, set_dpi=350)
args <- commandArgs(trailingOnly = TRUE)
test.data<- args[1]
model <- args[2]
output<- args[3]

# load libraries
library(data.table)
library(plyr)
library(dplyr)
suppressMessages(library(mlr))

# load model
print("loading appropriate model")
load(model)

# load test
print("loading and formatting test data")
test<- read.table(test.data, header=TRUE, sep="\t", na.strings=c("NA",".","na","N/A"), skipNul=FALSE, row.names=NULL)

# making predictions
print("making predictions and outputting results")
pred= predict(fit, newdata= test, type="prob")
write.table(pred$data, sep='\t', quote=FALSE, row.names=FALSE, na=".", output)
args <- commandArgs(trailingOnly = TRUE)
training<- args[1]
balance<- args[2] # an integer (0 or 1) indicating whether to balance the data
model<- args[3]
importance<- args[4] # importance for each of the variables is saved here
tune<- args[5] # if specified, the results of cross validation are saved here

# load libraries
library(plyr)
library(dplyr)
suppressMessages(library(mlr))
library(parallelMap)
library(parallel)

# load data.frame
print("loading training data into R")
training<- read.table(training, header=TRUE, sep="\t", na.strings=c("NA",".","na","N/A","inf","Inf","infinity","Infinity"), skipNul=T, row.names=NULL)
nrows_total <- nrow(training)
training <- na.omit(training)
if (nrows_total-nrow(training)>0) {
  print(paste("ignoring", nrows_total-nrow(training), "rows that have NA values"))
}

print("creating training task and making RF learner")
# optimize hyper parameters
# make training task
traintask <- makeClassifTask(data = training, target = colnames(training)[ncol(training)], positive = 1)
# create learner
rf.lrn <- makeLearner("classif.ranger", predict.type = "prob")

if (as.integer(balance)) {
  print("calculating class weights in order to ensure data is balanced when sampled")
  # first, retrieve the inverse of the counts of each of the labels
  w <- 1/table(training[,ncol(training)])
  # calculate probabilities for each label
  w <- w/sum(w)
  # create a vector containing weights instead of labels
  weights <- rep(0, nrow(training))
  for (val in names(w)) {
    weights[training[,ncol(training)] == val] <- w[val]
  }
  # create par.vals
  rf.lrn$par.vals <- list(importance='impurity', verbose=TRUE, case.weights=weights)
} else {
  # create par.vals
  rf.lrn$par.vals <- list(importance='impurity', verbose=TRUE)
}

if (!is.na(tune)) {
  # mtry default: sqrt(number of features)
  # nodesize default: 1
  params <- makeParamSet(makeIntegerParam("mtry",lower = 1,upper = 10), makeIntegerParam("min.node.size",lower = 7,upper = 25))
  # set validation strategy; 4-fold cross validation
  rdesc <- makeResampleDesc("CV",iters=5L)
  # set optimization technique
  ctrl <- makeTuneControlGrid(resolution=c(mtry=10, min.node.size=19))
  # tune hyperparameters
  print("initiating multicore tuning of hyperparameters")
  # but run the hyperparameter tuning in parallel, since it'll take a while
  # number of cores should be detected automatically (but don't use
  # all of the cores because otherwise we'll use too much memory!)
  parallelStartSocket(cpus=trunc(detectCores()/12), level="mlr.tuneParams")
  parallelLibrary("mlr")
  # create a custom F beta measure
  fbeta = makeMeasure(id = "fbeta", minimize = FALSE, best = 1, worst = 0,
    properties = c("classif", "req.pred", "req.truth"),
    name = "Fbeta measure",
    note = "Defined as: (1+beta^2) * tp/ (beta^2 * sum(truth == positive) + sum(response == positive))",
    fun = function(task, model, pred, feats, extra.args) {
      beta = 0.5
      beta = beta^2
      truth = pred$data$truth
      response = pred$data$response
      positive = pred$task.desc$positive
      (1+beta) * measureTP(truth, response, positive) / (beta * sum(truth == positive) + sum(response == positive))
    }
  )
  tuned = tuneParams(learner=rf.lrn, task=traintask, resampling=rdesc, measures=list(fbeta), par.set=params, control=ctrl, show.info=T)
  parallelStop()
  print("matrix of classifier performance for each pair of hyperparams")
  data = generateHyperParsEffectData(tuned)
  print(data$data)
  write.table(data$data, sep="\t", file=tune, quote=FALSE, row.names=F)
  print("tuned params are")
  print(tuned$x)
  rf.lrn$par.vals = c(rf.lrn$par.vals, tuned$x)
} else {
  # set default hyperparameter values
  rf.lrn$par.vals = c(rf.lrn$par.vals, mtry=9, min.node.size=16)
}

print("training model")
fit = mlr::train(rf.lrn, traintask)

# print out variable importance
print("recording variable importance:")
importance_df = as.data.frame(sort(fit$learner.model$variable.importance, decreasing=TRUE))
print(importance_df)
names(importance_df) <- c("variable\timportance")
write.table(importance_df, sep="\t", file=importance, quote=FALSE)

# save.data
save.image( model )
args <- commandArgs(trailingOnly = TRUE)
results<- args[1]
out<- args[2]

# load libraries
suppressMessages(library(mlr))
library(BBmisc)
library(ggplot2)
library(R.devices)

# load data.frame
print("loading results of hyperparam tuning into R")
results<- read.table(results, header=TRUE, sep="\t", na.strings=c("NA",".","na","N/A"), row.names=NULL)

data = makeS3Obj("HyperParsEffectData", data = results, measures=c("fbeta.test.mean"), hyperparameters=c("mtry", "min.node.size"), partial=F, nested=F)

plt = plotHyperParsEffect(data, x = "mtry", y = "min.node.size", z = "fbeta.test.mean", plot.type = "heatmap", show.experiments = TRUE)
# min_plt = min(data$data$acc.test.mean, na.rm = TRUE)
# max_plt = max(data$data$acc.test.mean, na.rm = TRUE)
# med_plt = mean(c(min_plt, max_plt))
# plt = plt + scale_fill_gradient2(breaks = seq(min_plt, max_plt, length.out = 5), low = "blue", mid = "white", high = "red", midpoint = med_plt)

suppressGraphics(ggsave(out, plt))