Identification of 3' UTR Variants Disrupting Transcript Function Pipeline
Identify variants occurring in 3’ UTRs that could disrupt the function of a transcript.
This project involves:
- retrieving the reference genome using genomepy
- extracting 3'UTR regions from the gene annotation
- extracting PAS hexamers from PolyA_DB3
- computing the MAPS score on different 3'UTR variants, locally and on the cloud using hailctl dataproc
Evaluation Metrics
The different 3' UTR variant groups are evaluated with the MAPS (mutability-adjusted proportion of singletons) score, which measures the overall depletion of singletons, as described in the gnomAD flagship paper. A higher MAPS score (compared to a baseline) suggests stronger evolutionary selection against the variant group.
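Conceptually, MAPS for a group of variants is the observed proportion of singletons minus the proportion expected from the mutability of the variants' sequence contexts. Below is a minimal sketch of that idea in pandas; it is not the pipeline's implementation (which lives in the utr3variants package and operates on hail tables), and all column names are illustrative.

import pandas as pd

def maps_score(variants: pd.DataFrame, singleton_rate_by_context: pd.Series) -> pd.Series:
    """
    variants: one row per variant with columns 'group' (variant annotation),
        'is_singleton' (bool) and 'context' (sequence context).
    singleton_rate_by_context: expected singleton proportion per context,
        calibrated on a presumably neutral variant class.
    Returns one MAPS value per variant group.
    """
    # expected singleton proportion of each variant, given its context
    expected = variants['context'].map(singleton_rate_by_context)
    # observed singleton proportion per variant group
    observed = variants.groupby('group')['is_singleton'].mean()
    # MAPS = observed minus mutability-adjusted expectation
    return observed - expected.groupby(variants['group']).mean()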
Repository Files
The repository mainly contains a Snakemake workflow in workflow and a python package in utr3variants with the functions used in the pipeline scripts. Input paths, URLs and parameters are collected in configs/config.yml and should be adapted by the user. Additionally, there is a qc/ directory for exploratory analyses on the input databases.
The Snakemake file structure under workflow is organised as recommended by the Snakemake documentation. It contains Snakefiles with the pipeline rules in rules/ and python scripts in scripts/. The Snakefiles in rules/ are used by the Snakefile in the repository root and cannot be run independently. When the pipeline is called locally, Snakemake objects are passed to the python scripts. Scripts that are submitted to GCP, however, do not handle Snakemake objects and can also be run independently. How the python scripts are used is best demonstrated in the Snakemake rules, where input and output files are defined. More information on scripts in Snakemake pipelines is documented here.
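As an illustration of this pattern (the rule, script and file names below are made up and not part of this repository), a rule in rules/ points to a script via the script directive, and the script reads its inputs and outputs from the injected snakemake object:

# hypothetical rule in workflow/rules/example.smk
rule annotate_example:
    input:
        intervals='results/intervals.bed',
    output:
        table='results/annotated.tsv',
    script:
        '../scripts/annotate_example.py'

# hypothetical workflow/scripts/annotate_example.py
import pandas as pd

# `snakemake` is injected by Snakemake when the script is run via `script:`
df = pd.read_table(snakemake.input.intervals, header=None)
df.to_csv(snakemake.output.table, sep='\t', index=False)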
Pipeline
The workflow is implemented as a Snakemake pipeline. A good resource on how Snakemake works is its official documentation.
Installation
The dependencies are available as a conda environment.
conda env create -f workflow/envs/environment.yml
conda activate utr-variants
The pipeline can run hail operations locally or on the Google Cloud Platform (GCP). In order to run locally, you need to install the gcs-connector in the new environment.
curl -sSL https://broad.io/install-gcs-connector | python
Configuration
The configs/config.yml file contains links to hail datasets and local files. Local input files and output directories can be modified accordingly. For GCP runs, the keys bucket and cluster need to be specified, and the key local needs to be set to false in the config.yml.
bucket: bucket_name
cluster: cluster_name
local: false
The key bucket specifies the GCP storage bucket in which your output will be stored, while cluster defines the name of the hailctl dataproc cluster. Before calling the Snakemake pipeline on GCP, you need to start the dataproc session, preferably with a compute timeout.
hailctl dataproc start cluster_name --max-age=2h --packages gnomad
You don't need to start a dataproc session if you want to run the pipeline locally by setting local to true. In that case, the bucket and cluster keys do not need to be specified.
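For a local run, a minimal configuration could therefore look like the sketch below. The output_root key is the output directory mentioned in the Workflow Graph section; its value here is only an example.

local: true
output_root: output   # example path, adapt to your setup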
Call the Pipeline
The pipeline is meant to be called from the root directory of this repository. For a dry run, you can supply the -n argument:
snakemake -n
In order to execute the steps listed in the dry run, use -j or --cores to set the number of cores available to the pipeline:
snakemake -j 10
Plotting rules have a separately defined conda environment for R dependencies (workflow/envs/utr-variants-r.yml). In order to use this environment, call the pipeline as follows:
snakemake --use-conda -j10
This will create the environment once and activate it for every rule call that requires it. You can also precompute the environment by calling
snakemake --use-conda --conda-create-envs-only
See Snakemake's documentation for more information on integrating conda environments.
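For context, a plotting rule declares this environment via Snakemake's conda directive, roughly as sketched below. The rule name and input/output paths are illustrative; only the environment file and the maps.R script are taken from the repository.

# hypothetical plotting rule in workflow/rules/
rule plot_maps:
    input:
        tsv='results/maps.tsv',
    output:
        png='results/maps.png',
    conda:
        '../envs/utr-variants-r.yml'
    script:
        '../scripts/maps.R'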
Workflow Graph
It is useful to get a visual representation of what will be computed. For that, you can call the dependency rule, which creates graphs of the dependencies between rules (more general) and between jobs (specific to wildcards).
snakemake dependency -fj
Here, -f forces the pipeline to overwrite any pre-existing dependency graphs.
The output files can be found in the output_root directory specified in configs/config.yml. Below are some examples of a local run.
Rule graph for GCP run
Job graph for GCP run
Code Snippets
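The snippets below are taken from the Snakemake rule files in workflow/rules/ and the scripts in workflow/scripts/ that they call.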
shell:
    """
    genome_dir=$(dirname "{output[0]}")
    genome_dir=$(dirname "$genome_dir")
    echo downloading genome to: $genome_dir
    genomepy install {wildcards.assembly} \
        -p {params.provider} --annotation \
        -g $genome_dir >> {log} 2>&1
    """
shell:
    """
    wget --timestamping {params.url} -O {output}.gz
    gunzip {output}.gz
    """
shell:
    """
    tmpdir="$(dirname "$(tempfile)")"
    # TODO: download from GCP bucket
    wget --no-check-certificate -nc -P $tmpdir {params.url}
    unzip "$tmpdir"/human_pas.zip -d $(dirname {output})
    """
shell:
    """
    wget -nc -P $(dirname {output}) {params.url}
    """
shell:
    """
    wget -nc -P $(dirname {output}) {params.url}
    gzip -d {output}.gz
    """
shell:
    """
    INTERVAL_PATH="gs://{config[bucket]}/$(basename {input})"
    gsutil cp {input} $INTERVAL_PATH
    hailctl dataproc submit {config[cluster]} \
        --pyfiles {params.annotate_gnomad},{params.maps} \
        {params.script} \
        -o gs://{output.maps} \
        --intervals $INTERVAL_PATH \
        --annotations {params.annotations} \
        --gnomAD_ht {config[gnomAD][gnomAD_ht]} \
        --context_ht {config[gnomAD][context_ht]} \
        --mutation_ht {config[gnomAD][mutation_rate_ht]} \
        --genome_assembly {config[genome_assembly]} \
        --chr_subset {params.chr_subset} \
        --skip_checks {config[gnomAD][skip_checks]}
    gsutil rm $INTERVAL_PATH
    """
script: '../scripts/prepare_gnomad.py'

script: '../scripts/count_singletons.py'
run:
    import pandas as pd
    from utr3variants.utils import extract_annotations

    df = pd.read_table(input.counts, sep='\t')
    df = extract_annotations(df, 'target', ALL_ANNOTATIONS)

    if 'hexamer_motif' in ALL_ANNOTATIONS:
        df = df[df.hexamer_motif != 'AAAAAA']  # remove A-rich hexamers
        # bin hexamers
        df['hexamer_motif_bin'] = df.hexamer_motif
        df.loc[
            (df.feature == 'hexamer')
            & ~df.hexamer_motif.isin(['AATAAA', 'ATTAAA']),
            'hexamer_motif_bin'
        ] = 'other hexamer'

    if 'percent_expressed' in ALL_ANNOTATIONS:
        df['percent_expressed'] = pd.cut(
            pd.to_numeric(df.percent_expressed, errors='coerce'), 10
        ).astype(str)

    # fill in blank values
    df['database'].fillna('gnomAD', inplace=True)
    df['feature'].fillna('other variant', inplace=True)
    # for anno in [x for x in ALL_ANNOTATIONS if x not in ['feature', 'database']]:
    #     df[anno].fillna(df.feature, inplace=True)

    df.to_csv(output.counts, sep='\t', index=False)
script: '../scripts/maps.R'
shell:
    """
    zcat {input} | grep three_prime_UTR | sortBed |\
        mergeBed -c 3,6,7 -o distinct,distinct,distinct > {output.utr}
    # TODO: determine 3UTR start depending on strand
    """
script: '../scripts/extract_polyadb.py'

script: '../scripts/extract_polyasite2.py'

script: '../scripts/merge_utr_intervals.py'
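The following hail script (most likely scripts/count_singletons.py, referenced above) annotates the gnomAD table with the interval annotations and counts singletons for the MAPS calculation: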
import hail as hl
import pandas as pd

from utr3variants.annotate_gnomad import annotate_by_intervals
from utr3variants.maps import count_for_maps

if __name__ == '__main__':
    ref = snakemake.config['genome_assembly']
    interval_file = snakemake.input['intervals'].__str__()
    gnomad_path = snakemake.input['gnomAD'].__str__()
    mutation_path = snakemake.config['gnomAD']['mutation_rate_ht']
    counts_out = snakemake.output.counts

    hl.init(
        local=f'local[{snakemake.threads}]',
        log=snakemake.log['hail'],
        default_reference=ref,
    )

    mutation_ht = hl.read_table(mutation_path)
    ht = hl.read_table(gnomad_path)

    intervals = hl.import_bed(interval_file, min_partitions=100)
    ht = annotate_by_intervals(ht, intervals, columns=['target'])

    count_ht = count_for_maps(
        ht,
        mutation_ht,
        additional_grouping=['target'],
        skip_mut_check=snakemake.config['gnomAD']['skip_checks'],
    )

    print('save...')
    count_ht.export(counts_out)

    # Preview of counts table
    count_df = pd.read_table(counts_out)
    print(count_df.head())
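scripts/extract_polyadb.py extracts PAS intervals and their upstream hexamer motifs from the PolyA_DB annotation: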
import re
from typing import Iterable
from typing import List

import numpy as np
import pandas as pd
import pybedtools
from pysam import FastaFile  # pylint: disable=no-name-in-module

from utr3variants.utils import extract_annotations, get_most_expressed

ANNO_COLUMN_MAP = {  # Mapping of snakemake wildcard to column name in database
    'hexamer_motif': 'PAS Signal',
    'conservation': 'Conservation',
    'percent_expressed': 'PSE',
    'expression': 'Mean RPM',
    'most_expressed': 'most_expressed',
}


def create_signal_interval(
    feature: pybedtools.Interval, signal: str, sequence: str, name: str = None
) -> Iterable[pybedtools.Interval]:
    """
    Create pybedtools.Interval objects of signal locations within given interval

    :param feature: interval in which to search for signal
    :param signal: sequence of signal motif
    :param sequence: sequence of feature interval
    :param name: value to overwrite in feature.name if specified
    :return: generator of signal locations
    """
    name = feature.name if name is None else name
    return (
        pybedtools.Interval(
            chrom=feature.chrom,
            start=feature.end - i.end()
            if feature.strand == '-'
            else feature.start + i.start(),
            end=feature.end - i.start()
            if feature.strand == '-'
            else feature.start + i.end(),
            name=name,
            score=feature.score,
            strand=feature.strand,
        )
        for i in re.finditer(signal, sequence, re.IGNORECASE)
    )


def get_hexamers(
    feature: pybedtools.Interval, sequence: str, hexamer_column: int
) -> List[pybedtools.Interval]:
    """
    Retrieve all hexamers of regions from within given interval

    :param feature: interval with interval.name containing annotation of form
        <PAS signal>|<conservation>
    :param sequence: sequence of feature interval
    :param hexamer_column: column index of hexamer in feature name
    :return: list of hexamer coordinates
    """
    name_split = feature.name.split('|')
    hexamer_motif = name_split[hexamer_column]

    if hexamer_motif == 'NoPAS':
        return []

    if hexamer_motif == 'OtherPAS':
        signal_sequences = [
            'AGTAAA',
            'TATAAA',
            'CATAAA',
            'GATAAA',
            'AATATA',
            'AATACA',
            'AATAGA',
            'AAAAAG',
            'ACTAAA',
        ]
    elif hexamer_motif == 'Arich':
        signal_sequences = ['AAAAAA']
    else:
        signal_sequences = [hexamer_motif]
    dna_signals = [s.replace('U', 'T') for s in signal_sequences]

    hexamer_list = []
    for signal in dna_signals:
        name_split[hexamer_column] = signal  # rename hexamer annotation
        intervals = create_signal_interval(
            feature, signal, sequence, name='|'.join(name_split)
        )
        hexamer_list.extend(intervals)
    return hexamer_list


if __name__ == '__main__':
    genome = snakemake.config['assembly_ucsc']
    annotations = snakemake.params.annotations
    pas_db = snakemake.input.db.__str__()
    fasta_file = snakemake.input.fasta.__str__()
    output = snakemake.output

    print('read database')
    df = pd.read_csv(pas_db, sep='\t')

    # cleanup columns
    df['Start'] = df['Position'] - 1
    df['PSE'] = df['PSE'].str.rstrip('%').astype('float') / 100
    df['score'] = '.'

    df = get_most_expressed(
        df,
        aggregate_column='Gene Symbol',
        expression_column='Mean RPM',
        interval_columns=['Chromosome', 'Position', 'Strand'],
        new_column='most_expressed',
    )
    annotations.append('most_expressed')

    # put annotation into name
    db_cols = [ANNO_COLUMN_MAP[a] for a in annotations]
    df['Name'] = df[db_cols].astype(str).agg('|'.join, axis=1)

    # create bedtools object/table
    print('Create BedTools object')
    # BedTool object needed for hexamer extraction
    bed_cols = [
        'Chromosome',  # chrom
        'Start',  # start
        'Position',  # end
        'Name',  # name
        'score',  # score
        'Strand',  # strand
    ]
    bed = pybedtools.BedTool.from_dataframe(df[bed_cols])

    # convert back to dataframe with annotations
    intervals_df = extract_annotations(
        bed.to_dataframe(),
        annotation_string='name',
        annotation_columns=annotations,
        database='PolyA_DB',
        feature='PAS',
    )

    if 'hexamer_motif' in annotations:
        print('extract hexamers')
        bed_40nt_us = bed.slop(  # pylint: disable=unexpected-keyword-arg
            l=40, r=0, s=True, genome=genome
        ).sequence(fi=fasta_file, s=True, fullHeader=True)
        sequences = FastaFile(bed_40nt_us.seqfn)

        hexamer_intervals = []
        hex_column = annotations.index('hexamer_motif')
        for f in bed_40nt_us:
            seq = sequences.fetch(f'{f.chrom}:{f.start}-{f.stop}({f.strand})')
            hex_intervals = get_hexamers(
                feature=f, sequence=seq, hexamer_column=hex_column
            )
            hexamer_intervals.extend(hex_intervals)

        hex_df = pybedtools.BedTool(hexamer_intervals).to_dataframe()
        hex_df = extract_annotations(
            hex_df,
            annotation_string='name',
            annotation_columns=annotations,
            database='PolyA_DB',
            feature='hexamer',
        )
        intervals_df['hexamer_motif'] = np.nan
        intervals_df = pd.concat([intervals_df, hex_df])

    print('save...')
    # get stats
    intervals_df['feature'].value_counts().to_csv(output.stats, sep='\t')
    intervals_df.to_csv(output.intervals, index=False, sep='\t')
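scripts/extract_polyasite2.py does the same for the PolyASite 2.0 atlas, additionally annotating the closest gene and the most expressed PAS: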
from collections import OrderedDict
import pandas as pd
import pybedtools

from utr3variants.utils import (
    encode_annotations,
    extract_annotations,
    convert_chromosome,
    get_most_expressed,
)

COLUMN_TYPES = OrderedDict(
    chrom=str,
    start=int,
    end=int,
    cluster_id=str,
    score=float,
    strand=str,
    percent_expressed=float,
    n_protocols=int,
    expression=float,
    cluster_annotation=str,
    hexamer_motif=str,
)

MERGED_COLUMN_TYPES = OrderedDict(
    chrom=str,
    start=int,
    end=int,
    name=str,
    score=float,
    strand=str,
    gchrom=str,
    gstart=int,
    gend=int,
    gname=str,
    gscore=str,
    gstrand=str,
    g13=str,
    g14=str,
    g15=str,
    g16=str,
    g17=str,
    g18=str,
    g19=str,
)

if __name__ == '__main__':
    genome = snakemake.config['assembly_ucsc']
    annotations = snakemake.params.annotations
    pas_db = snakemake.input.db.__str__()
    genes_file = snakemake.input.genes.__str__()
    output = snakemake.output

    df_db = pd.read_csv(
        pas_db, sep='\t', names=COLUMN_TYPES.keys(), dtype=COLUMN_TYPES
    )

    # match chromosome style to genome annotation
    df_db['chrom'] = df_db.chrom.apply(lambda x: convert_chromosome(x, style='chr'))

    if 'most_expressed' not in annotations:
        df_db = encode_annotations(df_db, annotations, 'name')
    else:
        df_db = encode_annotations(
            df_db,
            annotation_columns=[x for x in annotations if x != 'most_expressed'],
            encode_column='name',
        )

    bed_cols = ['chrom', 'start', 'end', 'name', 'score', 'strand']
    bed = pybedtools.BedTool.from_dataframe(df_db[bed_cols])

    print('Annotate overlapping genes')
    genes_bed = pybedtools.BedTool(genes_file).sort()
    df_db = (
        bed.sort()
        .closest(genes_bed, s=True, fu=True, D='ref')
        .to_dataframe(names=MERGED_COLUMN_TYPES.keys(), dtype=MERGED_COLUMN_TYPES)
    )

    print('Annotate most expressed PAS')
    df_db = get_most_expressed(
        df_db,
        aggregate_column='gname',
        expression_column='score',
        interval_columns=['chrom', 'start', 'end', 'strand'],
        new_column='most_expressed',
    )
    df_db['name'] = df_db['name'] + '|' + df_db['most_expressed'].map(str)

    # drop unnecessary columns
    df_db.drop(
        columns=[x for x in df_db.columns if x.startswith('g')], inplace=True
    )

    intervals_df = extract_annotations(
        df_db,
        annotation_string='name',
        annotation_columns=annotations,
        database='PolyASite2',
        feature='PAS',
    )

    if 'hexamer_motif' in annotations:
        print('Extract hexamers')
        hexamers_df = intervals_df[intervals_df['hexamer_motif'] != ''].copy()

        # split by different signals per PAS by ;
        hexamers_df['hexamer_motif_split'] = hexamers_df['hexamer_motif'].str.split(';')
        hexamers_df = hexamers_df.explode('hexamer_motif_split')

        # split each signal by @ into different columns
        hexamers_df[['hexamer_motif', 'rel_pos', 'hex_start']] = hexamers_df[
            'hexamer_motif_split'
        ].str.split('@', expand=True)

        # retrieve hexamer interval
        hexamers_df.hex_start = hexamers_df.hex_start.astype(int)
        hexamers_df.start = (hexamers_df.hex_start - 1).where(
            hexamers_df.strand == '+', hexamers_df.hex_start - 6
        )
        hexamers_df.end = hexamers_df.hex_start.where(
            hexamers_df.strand == '-', hexamers_df.hex_start - 1 + 6
        )

        # remove unnecessary columns
        hexamers_df.drop(
            columns=['hexamer_motif_split', 'rel_pos', 'hex_start'], inplace=True
        )

        # append hexamer regions translated into BED coordinates
        hexamers_df['database'] = 'PolyASite2'
        hexamers_df['feature'] = 'hexamer'
        intervals_df['hexamer_motif'] = ''
        intervals_df = pd.concat([intervals_df, hexamers_df])

    print('save...')
    intervals_df['feature'].value_counts().to_csv(output.stats, sep='\t')  # get stats
    intervals_df.to_csv(output.intervals, index=False, sep='\t')
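scripts/maps.R computes MAPS per aggregation group and plots the results against the reference consequences: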
library(data.table)
library(ggplot2)

source(snakemake@input$utils)

count_dt <- fread(snakemake@input$counts)

# get reference MAPS
dt_csq <- maps_reference(count_dt)

# get aggregation MAPS
aggregations <- strsplit(snakemake@wildcards$aggregation, '-')[[1]]
dt <- maps(count_dt, grouping = aggregations)
setorder(dt, -maps)
fwrite(dt, snakemake@output$tsv, sep = '\t')

# Plot only top n variants
dt <- dt[variant_count > snakemake@params$variant_count_min]

new_columns <- c('anno_x', 'shape', 'facet')[seq_along(aggregations)]

# Flatten for 3'UTR and other variants
if (length(aggregations) == 3) {
  if ('database' %in% aggregations) {
    flatten_values <- c('GENCODE', 'gnomAD')
    #dt_lines <- dt[database %in% flatten_values]
    dt <- dt[!database %in% flatten_values]
  } else if ('feature' %in% aggregations) {
    flatten_values <- c('3UTR', 'other variant')
    #dt_lines <- dt[feature %in% flatten_values]
    dt <- dt[!feature %in% flatten_values]
  }
  #setnames(dt_lines, old = aggregations, new = new_columns)
}

setnames(dt, old = aggregations, new = new_columns)
dt[, anno_x := factor(anno_x, levels = unique(anno_x), ordered = TRUE)]

title <- paste0('MAPS on ', snakemake@wildcards$chr_subset, ' (', snakemake@params$chr_subset, ')')
dodge_width <- 0.5

if ('shape' %in% names(dt)) {
  p <- ggplot(dt, aes(reorder(anno_x, -maps), maps, color = shape, group = shape))
} else {
  p <- ggplot(dt, aes(reorder(anno_x, -maps), maps))
}

if ('facet' %in% names(dt)) {
  p <- p + facet_grid('facet~.') +
    ggtitle(title, subtitle = paste('Facet:', toTitleCase(aggregations[3])))
}

if (exists('dt_lines')) {
  p <- p + geom_hline(
    aes(yintercept = maps, color = shape),
    data = dt_lines[, .(maps, shape)]
  )
}

p <- p +
  geom_point(position = position_dodge(width = dodge_width)) +
  labs(
    title = title,
    x = toTitleCase(aggregations[1]),
    y = 'MAPS',
    color = toTitleCase(aggregations[2])
  ) +
  geom_hline(  # add lines for reference
    aes(yintercept = maps, group = consequence),
    data = dt_csq[, .(maps, consequence)],
    linetype = 'dashed',
    color = 'grey50'
  ) +
  geom_text(
    aes(x = 0, y = maps, label = consequence, vjust = 1.3),
    hjust = -0.05,
    data = dt_csq,
    inherit.aes = FALSE,
    size = 2.5
  ) +
  geom_text(
    aes(y = min(maps - maps_sem) - 0.03, label = variant_count),
    size = 3,
    angle = 30,
    position = position_dodge(dodge_width)
  ) +
  geom_errorbar(
    aes(ymin = maps - maps_sem, ymax = maps + maps_sem),
    width = 0.15,
    position = position_dodge(width = dodge_width)
  ) +
  #scale_color_brewer(palette = 'Set1') +
  theme_classic() +
  theme(
    legend.position = 'bottom',
    axis.text.x = element_text(angle = 90, hjust = 1),
    axis.ticks = element_blank(),
  )

ggsave(snakemake@output$png, width = 8, height = 6)
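scripts/merge_utr_intervals.py merges the 3'UTR, PolyA_DB and PolyASite2 intervals into hail-parsable locus intervals and a BED file: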
import tempfile
from functools import reduce
import pandas as pd
from pandas.api.types import CategoricalDtype
import pybedtools

from utr3variants.utils import (
    encode_annotations,
    extract_annotations,
    convert_chromosome,
)

BED_COLS = ['chrom', 'start', 'end', 'name', 'score', 'strand']
CANONICAL_CHROMOSOMES = [str(x) for x in list(range(1, 23))] + ['X', 'Y']
CANONICAL_CHROMOSOMES_CHR = [f'chr{chr}' for chr in CANONICAL_CHROMOSOMES]
chr_dtype = CategoricalDtype(CANONICAL_CHROMOSOMES, ordered=True)


def rename_interval(
    interval: pybedtools.Interval,
    prefix: str = None,
    name: str = None,
    sep='|',
    keep_col=-1,
) -> pybedtools.Interval:
    """
    Change name of interval by adding a prefix and/or overwriting the old name

    :param interval:
    :param prefix: prefix to add to existing name if specified
    :param name: value to overwrite old name, depending on value of keep_col
    :param sep: character to separate prefix from rest of name
    :param keep_col: column index of annotation column to keep in interval.name
        list separated by sep. Overwrite all annotations with name if keep_col=-1
        (default) e.g. for interval.name = 'AAUAAA|No' and keep_col = 0:
        keep 'AAUAAA', drop 'No'
    """
    if name is None:
        name = interval.name
    if keep_col >= 0:
        name = name.split(sep)[keep_col]
    if prefix is not None:
        name = sep.join(filter(None, [prefix, name]))
    interval.name = name
    return interval


if __name__ == '__main__':
    chr_style = snakemake.params.chr_style_gnomAD
    annotations = snakemake.params.annotations
    chainfile = snakemake.input.chainfile.__str__()

    utrs = pd.read_table(
        snakemake.input.utrs.__str__(), sep='\t', dtype=str, names=BED_COLS
    )  # pybedtools.BedTool(snakemake.input.utrs.__str__())
    polya_db = pd.read_table(snakemake.input.PolyA_DB.__str__(), sep='\t', dtype=str)
    polya_site = pd.read_table(
        snakemake.input.PolyASite2.__str__(), sep='\t', dtype=str
    )

    # liftover PolyASite2 to hg19
    pasite_anno = [x for x in annotations if x in polya_site.columns]
    polya_site = encode_annotations(polya_site, pasite_anno, 'name')
    polya_site_bed = pybedtools.BedTool.from_dataframe(polya_site[BED_COLS]).liftover(
        chainfile
    )
    polya_site = extract_annotations(
        polya_site_bed.to_dataframe(dtype=str), 'name', pasite_anno
    )
    polya_site = polya_site[polya_site.chrom.isin(CANONICAL_CHROMOSOMES_CHR)]

    # convert chromosome format to match gnomAD dataset
    print('Convert chromosome styles')
    utrs.chrom = utrs.chrom.apply(lambda x: convert_chromosome(x, chr_style))
    polya_db.chrom = polya_db.chrom.apply(lambda x: convert_chromosome(x, chr_style))
    polya_site.chrom = polya_site.chrom.apply(
        lambda x: convert_chromosome(x, chr_style)
    )

    # subtract from 3'UTR
    utrs_bed = pybedtools.BedTool.from_dataframe(utrs)
    polya_db_bed = pybedtools.BedTool.from_dataframe(polya_db)
    polya_site_bed = pybedtools.BedTool.from_dataframe(polya_site)

    with tempfile.TemporaryDirectory() as tmp_dir:
        other_utr = (
            utrs_bed.subtract(polya_db_bed)  # pylint: disable=too-many-function-args
            .subtract(polya_site_bed)
            .saveas(f'{tmp_dir}/other_utr.bed')
            .to_dataframe(dtype=str)
        )
    other_utr['database'] = 'GENCODE'
    other_utr['feature'] = '3UTR'

    # merge all intervals via pandas
    df = reduce(
        lambda df_left, df_right: pd.merge(
            df_left,
            df_right,
            how='outer',
            on=list(set(df_left.columns) & set(df_right.columns)),
        ),
        [other_utr, polya_db, polya_site],
    )

    # sort intervals
    df['chrom'] = df['chrom'].astype(str).astype(chr_dtype)
    df.sort_values(by=['chrom', 'start', 'end', 'strand'], inplace=True)

    print('Create hail-parsable locus intervals')
    df['locus_interval'] = (
        '('
        + df.chrom.astype(str)
        + ':'
        + df.start.map(str)
        + '-'
        + df.end.map(str)
        + ']'
    )

    # Encode annotations
    df = encode_annotations(df, annotations, 'name', verbose=True)

    # save
    print('save...')
    df[['locus_interval', 'strand'] + annotations].to_csv(
        snakemake.output.intervals, sep='\t', index=False
    )
    df[['chrom', 'start', 'end', 'name', 'score', 'strand']].to_csv(
        snakemake.output.bed, sep='\t', index=False, header=False
    )
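scripts/prepare_gnomad.py filters the gnomAD hail table to the chromosome subset and annotates it for the MAPS calculation: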
import hail as hl

from utr3variants.annotate_gnomad import filter_gnomad, annotate_for_maps

if __name__ == '__main__':
    hl.init(
        local=f'local[{snakemake.threads}]',
        log=snakemake.log['hail'],
        default_reference=snakemake.config['genome_assembly'],
    )

    context_ht = hl.read_table(snakemake.config['gnomAD']['context_ht'])
    ht = hl.read_table(snakemake.config['gnomAD']['gnomAD_ht'])
    subset_interval = hl.parse_locus_interval(snakemake.params['chr_subset'])

    print('Filter gnomAD')
    ht = filter_gnomad(ht, [subset_interval])

    print('Annotate')
    ht = annotate_for_maps(ht, context_ht)

    print('save...')
    ht.write(snakemake.output['gnomAD_ht'])
shell:
    """
    snakemake --dag | dot -Tsvg -Grankdir=TB > {output.dag[0]}
    snakemake --dag | dot -Tpng -Grankdir=TB > {output.dag[1]}
    snakemake --rulegraph | dot -Tsvg -Grankdir=TB > {output.rulegraph[0]}
    snakemake --rulegraph | dot -Tpng -Grankdir=TB > {output.rulegraph[1]}
    """