Snakemake workflow for Illumina RNA-sequencing experiments - extract population genomic signals from RNA-Seq data
Documentation : https://sanjaynagi.github.io/rna-seq-pop/
Walkthrough : https://www.youtube.com/watch?v=5QQe7DLHO4M
RNA-Seq-Pop is a computational pipeline to analyse Illumina RNA-Sequencing data of any organism. As well as performing standard transcriptomic analyses, such as differential expression, RNA-Seq-Pop also calls and analyses genetic polymorphisms, extracting population genomic signals. The workflow can perform the following analyses of illumina paired-end RNA-Sequencing data:
-
Quality control of fastq reads with fastqc , QC metrics integrated into a final report with multiQC
-
Differential expression analysis with Kallisto at the gene level ( DESeq2 ) and transcript level ( Sleuth ), and gene set enrichment analyses.
-
Variant calling with freebayes
-
Fst and Population branch statistic (PBS) , both in windows across contigs and at the gene-level ( Scikit-allel ).
-
Allele frequencies at variants of interest (pre-specified loci of choice).
-
Various summary statistics (Wattersons Theta, Sequence Diversity, Dxy etc)
-
Analysis of Ancestry Informative Markers (AIMs) to determine relative ancestry of An.gambiae/coluzzii/arabiensis ( Anopheles gambiae s.l )
-
Determines Karyotype of chromosomal inversions using compkaryo ( Anopheles gambiae s.l )
The workflow is generalised, and will function with any Illumina single or paired-end RNA-sequencing. However, certain modules, such as the ancestry and karyotyping, are only appropriate for An. gambiae s.l . These can be activated in the configuration file (config.yaml). The workflow works with pooled samples, diploid, or haploid individuals.
If you have any feedback on how the workflow may be improved, please do get in touch, or feel free to fork the github repo and create a pull request for any additional features you would like to implement. If you are using the workflow and would like to give feedback or troubleshoot, consider joining the discord server here , otherwise email or log an issue on github.
Authors
- Sanjay Curtis Nagi (@sanjaynagi) - sanjay.c.nagi@gmail.com
Citation
Sanjay C Nagi, Ambrose Oruni, David Weetman, Martin J Donnelly (2022). RNA-Seq-Pop : Exploiting the sequence in RNA-Seq - a Snakemake workflow reveals patterns of insecticide resistance in the malaria vector Anopheles gambiae . Molecular Ecology Resources ; doi: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13759
If you use this workflow in a paper, please give credits to the author by citing the manuscript and its DOI - https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13759
Usage
Please see the [documentation](https://sanjaynagi.github.io/rna-seq-pop/
- for instructions on how to run and contribute to RNA-Seq-Pop.
Release notes
-
1.0.4 - config files have been slightly restructured to make them neater. FDR control in GSEA. exampleconfig now up to date.
-
1.0.3 - support for single-end reads, venn and custom heatmaps added, updated results folder structure. fastqs can be specified in sample metadata.
-
1.0.2 - Changed Pi, theta calculations to be in windows across genome, and removed plotting from SummaryStats.py. Changed use of 'chrom' to 'contig'
-
1.0.1 - New feature to plot a heatmap of various gene families
Code Snippets
14 15 | wrapper: "v1.15.0/bio/kallisto/index" |
35 36 | wrapper: "v1.15.0/bio/kallisto/quant" |
48 49 | shell: "gzip -d -c {input} > {output} 2> {log}" |
61 62 | wrapper: "v1.15.0/bio/samtools/faidx" |
81 82 | shell: "hisat2-build -p {threads} {input.fasta} {params.prefix} 2> {log}" |
107 108 109 110 111 | shell: """ hisat2 {params.extra} --threads {threads} -x {params.idx} {params.readflags} 2> {log.align} | samblaster {params.samblaster} 2> {log.sort} | samtools sort -@{threads} - -o {output} 2>> {log.sort} """ |
124 125 | wrapper: "v1.15.0/bio/samtools/index" |
28 29 30 31 | shell: """ python -m ipykernel install --user --name=pythonGenomics 2> {log} """ |
55 56 57 58 | shell: """ papermill -k pythonGenomics {input.input_nb} {output.out_nb} -p wkdir {params.wkdir} 2> {log} """ |
29 30 | script: "../scripts/DeseqGeneDE.R" |
57 58 | script: "../scripts/SleuthIsoformsDE.R" |
90 91 | script: "../scripts/ProgressiveGenes.R" |
128 129 130 131 132 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p go_path {input.gaf} -p metadata_path {input.metadata} -p config_path {params.config} -p selection {params.selection} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
154 155 | script: "../scripts/Ag1000gSweepsDE.py" |
198 199 200 201 202 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p metadata_path {input.metadata} -p config_path {params.config_path} -p go_path {input.eggnog} -p normcounts_path {input.normcounts} -p pfam_path {input.pfam} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
223 224 225 226 227 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p wkdir {params.wd} -p metadata_path {input.metadata} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
251 252 253 254 255 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p wkdir {params.wd} -p dataset {params.dataset} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
19 20 | shell: "bcftools concat {input.calls} | vcfuniq > {output} 2> {log}" |
35 36 | wrapper: "v1.15.0/bio/bcftools/view" |
52 53 | shell: "snpEff download {params.ref} -dataDir {params.dir} 2> {log}" |
77 78 79 80 81 | shell: """ snpEff eff {params.db} -dataDir {params.dir} -csvStats {output.csvStats} {input.calls} > {params.prefix} 2> {log} bgzip {params.prefix} """ |
23 24 25 26 27 | shell: """ jupyter-book build --all docs/rna-seq-pop-results --path-output results/rna-seq-pop-results 2> {log} ln -sf docs/rna-seq-pop-results/_build/html/index.html {params.dataset}-results.html 2>> {log} """ |
23 24 | script: "../scripts/VennDiagrams.py" |
22 23 | script: "../scripts/checkInputs.py" |
38 39 | wrapper: "v2.2.1/bio/fastp" |
54 55 | wrapper: "v1.15.0/bio/samtools/flagstat" |
74 75 | shell: "mosdepth --threads {threads} --fast-mode {params.prefix} {input.bam}" |
93 94 95 96 | shell: """ bcftools stats {input} > {output} 2> {log} """ |
120 121 | wrapper: "v2.2.1/bio/multiqc" |
138 139 140 141 142 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p wkdir {params.wd} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
35 36 | script: "../scripts/SNPstatistics.py" |
100 101 102 103 104 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p metadata_path {input.metadata} -p dataset {params.dataset} -p config_path {params.config_path} -p ploidy {params.ploidy} -p missingprop {params.missingprop} -p qualflt {params.qualflt} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
137 138 139 140 141 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p metadata_path {input.metadata} -p dataset {params.dataset} -p config_path {params.config_path} -p ploidy {params.ploidy} -p missingprop {params.missingprop} -p qualflt {params.qualflt} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
188 189 190 191 192 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p config_path {params.config} -p pbs {params.pbs} -p pbscomps {params.pbscomps} -p metadata_path {input.metadata} -p dataset {params.dataset} -p ploidy {params.ploidy} -p missingprop {params.missingprop} -p qualflt {params.qualflt} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
227 228 | script: "../scripts/PerGeneFstPBS.py" |
262 263 | script: "../scripts/AncestryInformativeMarkers.py" |
285 286 287 288 289 290 | shell: """ paste <(bcftools query -l {input.vcf}) \ <(python {params.basedir}/scripts/compkaryo/compkaryo/compkaryo.py {input.vcf} {wildcards.karyo} -p {params.ploidy}) | column -s $'\\t' -t | sort -k 1 > {output} """ |
320 321 322 323 324 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p config_path {params.config_path} -p metadata_path {params.metadata} -p dataset {params.dataset} -p ploidy {params.ploidy} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
24 25 | script: "../scripts/GenerateFreebayesParams.R" |
48 49 | shell: "freebayes -f {input.ref} -t {input.regions} --ploidy {params.ploidy} --populations {input.pops} --pooled-discrete --use-best-n-alleles 5 -L {input.samples} > {output} 2> {log}" |
21 22 23 24 25 | shell: """ samtools mpileup {input.bam} -r {params.region} -f {params.ref} 2> {log} | python {params.basedir}/scripts/BaseParser.py > {output} 2>> {log} """ |
52 53 | script: "../scripts/VariantsOfInterestAlleleBalance.R" |
100 101 102 103 104 | shell: """ papermill {input.nb} {output.nb} -k pythonGenomics -p voi_path {input.VariantsOfInterest} 2> {log} cp {output.nb} {output.docs_nb} 2>> {log} """ |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import numpy as np from collections import defaultdict # Read in parameters pval_threshold = snakemake.params['pval'] upper_fc = snakemake.params['fc'] lower_fc = 1/upper_fc # if someone wants a FC threshold of 2, need to have lower threshold of 0.5. # Read in list of contrasts comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast']) # Read in .csv file containing selection signals and associated metadata signals = pd.read_csv("resources/signals.csv") for comp in comparisons['contrast']: # Load DE results DEgenes = pd.read_csv(f"results/genediff/{comp}.csv") # Filter to overexpressed genes sigup = DEgenes[np.logical_and(DEgenes['padj'] < pval_threshold, np.logical_or(DEgenes['FC'] > upper_fc, DEgenes['FC'] < lower_fc))] sweep = {} nswept = {} # Loop through each signal, recording if any DE genes are found in one for i, cols in signals.iterrows(): if pd.isnull(cols['overlapping_genes']): # Skip signal if no overlapping genes continue sweptgenes = np.array(cols['overlapping_genes'].split(" ")) # Get boolean array - if list of swept genes isin our DE genes overlap = np.isin(sweptgenes, sigup['GeneID']) sweep[cols['uid']] = sweptgenes[overlap] nswept[cols['uid']] = sweptgenes genes = np.concatenate(list(sweep.values())) swept = sigup[np.isin(sigup['GeneID'], genes)] for k,v in sweep.items(): sweep[k] = ' '.join(v) # Build dataframe and add sweep metadata columns sweptDE = pd.DataFrame.from_dict(sweep, orient='index', columns=['overlapping_DE_genes']) sweptDE = sweptDE.reset_index().rename(columns={'index': 'Ag1000g_sweep'}) sweptDE['overlapping_genes'] = signals['overlapping_genes'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) sweptDE['chromosome'] = signals['peak_end_seqid'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) sweptDE['epicenter'] = signals['epicenter_coord'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) sweptDE['known_loci'] = signals['known_loci'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) wheresweep = defaultdict(dict) whatsweep = defaultdict(list) # Now loop through each swept gene to find name of sweeps it lies under # And location of sweep for gene in genes: for i, cols in sweptDE.iterrows(): sweptgenes = np.array(cols['overlapping_DE_genes'].split(" ")) if np.isin(sweptgenes, gene).any(): wheresweep[gene]['contig'] = cols['chromosome'] wheresweep[gene]['epicenter'] = cols['epicenter'] wheresweep[gene]['known_loci'] = cols['known_loci'] whatsweep[gene].append(cols['Ag1000g_sweep']) # Join name of sweeps column into a single string, so it fits in one column of data frame for k,v in whatsweep.items(): whatsweep[k] = ' '.join(v) dfwhere = pd.DataFrame.from_dict(wheresweep, orient='index') dfwhat = pd.DataFrame.from_dict(whatsweep, orient='index', columns=['Ag1000g_sweeps']) df = pd.concat([dfwhat, dfwhere], axis=1) df = df.reset_index().rename(columns={'index': 'GeneID'}) swept = swept.merge(df) swept.to_csv(f"results/genediff/ag1000gSweeps/{comp}_swept.tsv", sep="\t") |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 | import sys sys.stderr = open(snakemake.log[0], "w") import rnaseqpoptools as rnaseqpop import pandas as pd import numpy as np import zarr import allel from collections import defaultdict ### AIMS ### dataset = snakemake.params['dataset'] metadata = rnaseqpop.load_metadata(snakemake.input['metadata']) metadata = metadata.sort_values(by='species').reset_index(drop=True) contigs = snakemake.params['contigs'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) qualflt = snakemake.params['qualflt'] missingprop = snakemake.params['missingprop'] # read AIMs aims = zarr.open(snakemake.input['aims_zarr_gambcolu'], mode='r') ## initialize dicts ancestryPerAim = {} aims_chrom_gamb = {} aims_chrom_colu = {} all_gamb = defaultdict(list) all_colu = defaultdict(list) n_aims_per_chrom = {} for contig in contigs: # read in and filter data path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path, contig=contig, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=qualflt, missingfltprop=missingprop) aimspos = aims[contig]['POS'][:] # get intersection of aims and our SNPs aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos) our_aims = pos[aims_pos_mask] print(f"\n In the data, across all samples there are {our_aims.shape[0]} Ancestry Informative markers on Chromosome {contig}") # get gamb and colu alleles, and subset to aims that we have in the rna-seq data aimscolu = aims[contig]['colu_allele'][:][aims_mask_2] aimsgamb = aims[contig]['gamb_allele'][:][aims_mask_2] # get mask that was used in readAndFilterVcf() mask = pos.locate_intersection(vcf['variants/POS'])[1] ref = vcf['variants/REF'][mask][aims_pos_mask] alt = vcf['variants/ALT'][mask][aims_pos_mask] # filter geno array to set of aims geno_aims = geno.compress(aims_pos_mask, axis=0) totalgambscore = {} totalcoluscore = {} for aim in our_aims: gambscore = {} coluscore = {} # filter arrays mask = our_aims == aim ref_ = ref[mask] alt_ = alt[mask] aimscolu_ = aimscolu[mask] aimsgamb_ = aimsgamb[mask] gn_aim = geno_aims.compress(mask, axis=0) # convert genotypes to nucleotides gn2nucleotide = {0:ref_[0], 1:alt_[0][0], 2:alt_[0][1], 3:alt_[0][2], -1:float("nan")} gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide) # for each sample, get proportion of gambiae/coluzzii alleles # alleles that are 
different to both will be missed here for sample in metadata.treatment.unique(): alleles = gn.take(subpops[sample], axis=1).flatten() # at each AIM, do we have gamb or colu alleles gamb = alleles[alleles != 'nan'] == aimsgamb_ colu = alleles[alleles != 'nan'] == aimscolu_ # get proportion of gamb v colu alleles at each locus gambscore[sample] = np.mean(gamb) coluscore[sample] = np.mean(colu) totalgambscore[aim] = dict(gambscore) totalcoluscore[aim] = dict(coluscore) gambscores = rnaseqpop.flip_dict(totalgambscore) coluscores = rnaseqpop.flip_dict(totalcoluscore) prop_gambiae = {} prop_colu = {} n_aims_per_sample = {} for sample in metadata.treatment.unique(): prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values()))) all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values())))) prop_colu[sample] = np.nanmean(np.array(list(coluscores[sample].values()))) all_colu[sample].append(np.nanmean(np.array(list(coluscores[sample].values())))) arr = np.array(list(gambscores[sample].values())) dim = arr.shape[0] n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr)) # store AIM fractions for each chromosome in outer dict aims_chrom_gamb[contig] = dict(prop_gambiae) aims_chrom_colu[contig] = dict(prop_colu) n_aims_per_chrom[contig] = dict(n_aims_per_sample) # Store ancestry score per aim ancestryPerAim[contig] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(coluscores).add_suffix("_colu")], axis=1) ancestryPerAim[contig]['contig'] = contig # plot and store for each chromosome coludf = pd.DataFrame.from_dict(prop_colu, orient='index', columns=['AIM_fraction_coluzzii']) gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae']) perchromdf = gambdf.merge(coludf, left_index=True, right_index=True) aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs']) perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{contig}.tsv", sep="\t", index=True) rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="coluzzii", species2="gambiae", figtitle=f"AIM_fraction_{contig}", total=False) aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb) aims_chrom_colu = rnaseqpop.flip_dict(aims_chrom_colu) n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom) # get ancestry per aim for later plotting on chromosome ancestryPerAim = pd.concat(ancestryPerAim, axis=0) ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t") # get genome wide average AIM fractions for k in all_gamb: all_gamb[k] = np.nanmean(all_gamb[k]) all_colu[k] = np.nanmean(all_colu[k]) df1 = pd.DataFrame.from_dict(all_gamb, orient='index',columns=['AIM_fraction_gambiae']) df2 = pd.DataFrame.from_dict(all_colu, orient='index', columns=['AIM_fraction_coluzzii']) n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom) n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom.tsv", sep="\t", index=True) df = df1.merge(df2, left_index=True, right_index=True) df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary.tsv", sep="\t", index=True) rnaseqpop.plot_aims(df, n_aimsdf, species1="coluzzii", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True) ########### The same but for arabiensis v gambiae/coluzzii # script should be modularised but no time atm, isnt necessary if metadata['species'].isin(['arabiensis']).any(): aims = zarr.open(snakemake.input['aims_zarr_arab'], mode='r') aims_chrom_gamb = {} aims_chrom_arab = {} all_gamb = defaultdict(list) all_arab = 
defaultdict(list) n_aims_per_chrom = {} for contig in contigs: # read in and filter data path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path, contig=contig, samples=metadata, numbers=numbers, qualflt=qualflt, missingfltprop=missingprop) aimspos = aims[contig]['POS'][:] # get intersection of aims and our SNPs aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos) our_aims = pos[aims_pos_mask] print(f"\n In the data, across all samples there are {our_aims.shape[0]} gamb-arab Ancestry Informative markers on Chromosome {contig}") # get gamb and colu alleles, and subset to aims that we have in the rna-seq data aimsgamb = aims[contig]['gambcolu_allele'][:][aims_mask_2] aimsarab = aims[contig]['arab_allele'][:][aims_mask_2] # get mask that was used in readAndFilterVcf() mask = pos.locate_intersection(vcf['variants/POS'])[1] ref = vcf['variants/REF'][mask][aims_pos_mask] alt = vcf['variants/ALT'][mask][aims_pos_mask] # filter geno array to set of aims geno_aims = geno.compress(aims_pos_mask, axis=0) totalgambscore = {} totalarabscore = {} for aim in our_aims: gambscore = {} arabscore = {} # filter arrays mask = our_aims == aim ref_ = ref[mask] alt_ = alt[mask] aimsarab_ = aimsarab[mask] aimsgamb_ = aimsgamb[mask] gn_aim = geno_aims.compress(mask, axis=0) #convert genotypes to nucleotides gn2nucleotide = {0:ref_[0], 1:alt_[0][0], 2:alt_[0][1], 3:alt_[0][2], -1:float("nan")} gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide) # for each sample, get proportion of gambiae/arabiensis alleles # alleles that are different to both will be missed here for sample in metadata.treatment.unique(): alleles = gn.take(subpops[sample], axis=1).flatten() # at each AIM, do we have gamb or arab alleles gamb = alleles[alleles != 'nan'] == aimsgamb_ arab = alleles[alleles != 'nan'] == aimsarab_ # get proportion of gamb v arab alleles at each locus gambscore[sample] = np.mean(gamb) arabscore[sample] = np.mean(arab) totalgambscore[aim] = dict(gambscore) totalarabscore[aim] = dict(arabscore) gambscores = rnaseqpop.flip_dict(totalgambscore) arabscores = rnaseqpop.flip_dict(totalarabscore) prop_gambiae = {} prop_arab = {} n_aims_per_sample = {} for sample in metadata.treatment.unique(): prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values()))) all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values())))) prop_arab[sample] = np.nanmean(np.array(list(arabscores[sample].values()))) all_arab[sample].append(np.nanmean(np.array(list(arabscores[sample].values())))) arr = np.array(list(gambscores[sample].values())) dim = arr.shape[0] n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr)) aims_chrom_gamb[contig] = dict(prop_gambiae) aims_chrom_arab[contig] = dict(prop_arab) n_aims_per_chrom[contig] = dict(n_aims_per_sample) # Store ancestry score per aim ancestryPerAim[contig] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(coluscores).add_suffix("_colu")], axis=1) ancestryPerAim[contig]['contig'] = contig # plot and store for each chromosome gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae']) arabdf = pd.DataFrame.from_dict(prop_arab, orient='index', columns=['AIM_fraction_arabiensis']) perchromdf = gambdf.merge(arabdf, left_index=True, right_index=True) aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs']) 
perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{contig}.arab.tsv", sep="\t", index=True) rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="arabiensis", species2="gambiae", figtitle=f"AIM_fraction_arab_{contig}", total=False) aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb) aims_chrom_arab = rnaseqpop.flip_dict(aims_chrom_arab) n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom) # get ancestry per aim for later plotting on chromosome ancestryPerAim = pd.concat(ancestryPerAim, axis=0) ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t") # get genome wide average AIM fractions for k in all_gamb: all_gamb[k] = np.nanmean(all_gamb[k]) all_arab[k] = np.nanmean(all_arab[k]) df1 = pd.DataFrame.from_dict(all_gamb, orient='index', columns=['AIM_fraction_gambiae']) df2 = pd.DataFrame.from_dict(all_arab, orient='index', columns=['AIM_fraction_arabiensis']) n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom) n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom_arab.tsv", sep="\t", index=True) df = df1.merge(df2, left_index=True, right_index=True) df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary_arab.tsv", sep="\t", index=True) rnaseqpop.plot_aims(df, n_aimsdf, species1="arabiensis", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | import os import sys class parseString(object): def __init__(self, ref, string): self.ref = ref.upper() self.string = string.upper() self.types = {'A':0,'G':0,'C':0,'T':0,'-':[],'*':0,'+':[],'X':[]} self.process() def process(self): # remove end of read character self.string = self.string.replace('$','') while self.string != '': if self.string[0] == '^': # skip two characters when encountering '^' as it indicates # a read start mark and the read mapping quality self.string = self.string[2:] elif self.string[0] == '*': self.types['*'] += 1 # skip to next character self.string = self.string[1:] elif self.string[0] in ['.',',']: if (len(self.string)== 1) or (self.string[1] not in ['+','-']): # a reference base self.types[self.ref] += 1 self.string = self.string[1:] elif self.string[1] == '+': insertionLength = int(self.string[2]) insertionSeq = self.string[3:3+ insertionLength] self.types['+'].append(insertionSeq) self.string = self.string[3+insertionLength:] elif self.string[1] == '-': deletionLength = int(self.string[2]) deletionSeq = self.string[3:3+deletionLength] self.types['-'].append(deletionSeq) self.string = self.string[3+deletionLength:] elif self.string[0] in self.types and\ ((len(self.string)==1) or (self.string[1] not in ['-','+'])): # one of the four bases self.types[self.string[0]] += 1 self.string = self.string[1:] else: # unrecognized character # or a read that reports a substitition followed by an insertion/deletion self.types['X'].append(self.string[0]) self.string = self.string[1:] return def __repr__(self): types = self.types return '\t'.join(list(map(str,[types['A'], types['C'], types['G'],types['T'],\ types['*']])) +\ list(map(','.join, [types['-'],types['+'],types['X']]))) def main(): print("chrom\tpos\tref\tcov\tA\tC\tG\tT\t*\t-\t+\tX", file=sys.stdout) for line in sys.stdin: toks = line.strip('\n').split('\t') ref = toks[2].upper() cov = toks[3] print('\t'.join([toks[0], toks[1],ref, cov]) + '\t' + \ parseString(ref, toks[4]).__repr__(), file=sys.stdout) if __name__ == '__main__': main() |
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | import sys sys.stderr = open(snakemake.log[0], "w") import os import pandas as pd import numpy as np import rnaseqpoptools as rnaseqpop import re import allel # Read in parameters metadata = rnaseqpop.load_metadata(snakemake.params['metadata']) ref = snakemake.input['ref'] gffpath = snakemake.params['gffpath'] contigs = snakemake.params['contigs'] autofastq = snakemake.params['fastq'] sweeps = snakemake.params['sweeps'] signalpath = "resources/signals.csv" ## Read fasta file and grep chromosome names, check if they match config file = open(ref, "r") refchroms = [] for line in file: if re.search(">", line): refchroms.append(line.split()[0].replace(">", "")) assert (np.isin(contigs, refchroms)).all(), f"Not all chromosome names specified in config.yaml are found in the reference fasta file ({ref})" # Check that the contrasts specified in config.yaml are all in the treatment column comparisons = pd.DataFrame(snakemake.params['contrasts'], columns=['contrast']) comparisons = comparisons.contrast.str.split("_", expand=True) # split contrasts into case v control comparisons = comparisons.values.ravel() # make 1d array assert (np.isin(comparisons, metadata['treatment']).all()), f"treatments specified in config.yaml contrasts do not exist in samples.tsv. {comparisons}" # Check that samples in fastq.tsv match those in metadata.tsv if autofastq == True: # Check to see if fastq files match the metadata for sample in metadata['sampleID']: for n in [1,2]: fqpath = f"resources/reads/{sample}_{n}.fastq.gz" assert os.path.isfile(fqpath), f"all sample names in 'samples.tsv' do not match a .fastq.gz file in the {snakemake.workflow.basedir}/resources/reads/ directory, please rename or consider using the fastq table option" # Check column names of gene_names.tsv gene_names = pd.read_csv(snakemake.params['gene_names'], sep="\t") colnames = ['GeneID', 'TranscriptID', 'GeneDescription', 'GeneName'] assert (gene_names.columns.isin(colnames)).all(), f"Column names of gene_names.tsv should be {colnames}" # Check that the chromosomes are present in the gff file gff = allel.gff3_to_dataframe(gffpath) assert np.isin(contigs, gff.seqid).all(), f"All provided chromosome names ({contigs}) are not present in the reference GFF file" |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' This script performs differential expression analysis at the #' gene level using DESeq2. It outputs heatmaps, PCAs, gene counts, #' as well as DE results. Final DE results for all contrasts are #' saved in an .xslx report ## Differential expression library(DESeq2) library(pheatmap) library(data.table) library(ggrepel) library(openxlsx) library(glue) library(RColorBrewer) library(tidyverse) library(jsonlite) load_metadata <- function(metadata_path) { # Check the file extension and load metadata accordingly if (tools::file_ext(metadata_path) == "xlsx") { metadata <- readxl::read_excel(metadata_path) } else if (tools::file_ext(metadata_path) == "tsv") { metadata <- data.table::fread(metadata_path, sep = "\t") } else if (tools::file_ext(metadata_path) == "csv") { metadata <- data.table::fread(metadata_path, sep = ",") } else { stop("Metadata file must be .xlsx, .tsv, or .csv") } return(metadata) } #read metadata and get contrasts metadata = load_metadata(snakemake@input[['metadata']]) %>% as.data.frame() gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct() contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']]) contrasts = contrastsdf$contrast ##### define functions ###### round_df = function(df, digits) { #' This function rounds all the numeric columns of a data.frame nums = vapply(df, is.numeric, FUN.VALUE = logical(1)) df[,nums] = round(df[,nums], digits = digits) (df) } vst_pca = function(counts, samples, colourvar, name="PCA", st="", comparison=""){ #' This function takes counts and sample metadata as input and builds a DESeq2 dataset #' Returning normalised and variance-stabilised counts, and performs PCA on the data #' Plotting and saving to pdf # make DESeq dataset dds = DESeqDataSetFromMatrix(countData = counts, colData = samples, design = ~ treatment) ###### estimate paramters and normalise dds = estimateSizeFactors(dds) dds = estimateDispersions(dds) vsd = varianceStabilizingTransformation(dds) normcounts = counts(dds, normalized=TRUE) vstcounts = assay(vsd) vstcounts = vstcounts[order(apply(vstcounts,1,sum),decreasing=TRUE),] #### write pca of samples to pdf pca2 = prcomp(t(vstcounts),center=TRUE) pc = data.frame(pca2$x) %>% rownames_to_column("sampleID") pc = 
left_join(pc, samples) pdf(glue("results/counts/{name}{st}{comparison}.pdf")) plt = ggplot(data=pc,aes(x=PC1, y=PC2, colour=treatment)) + geom_point(size=6, alpha=0.8) + geom_text_repel(aes(label=sampleID), colour="black") + theme_light() + labs(title=glue("{name} {st} {comparison}"), x=glue("PC1 - Variance explained - {round(summary(pca2)$importance[2,][1], 3)}"), y=glue("PC2 - Variance explained - {round(summary(pca2)$importance[2,][2], 3)}")) + theme(legend.position = "bottom", legend.title = element_blank(), legend.background=element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.background = element_blank(), plot.title = element_text(hjust = 0.5), axis.text.x = element_text(colour="black", size=18), axis.text.y = element_text(colour="black", size=18)) print(plt) null = dev.off() ggsave(glue("results/counts/{name}{st}{comparison}.png"), plot=plt) return(list(vstcounts, dds, normcounts)) } volcano = function(data, title){ data = data %>% mutate(col = case_when(padj < 0.05 & abs(log2FoldChange) > 1 ~ "#ff6666", padj > 0.05 & abs(log2FoldChange) > 1 ~ "#39e600", padj < 0.05 & abs(log2FoldChange) < 1 ~ "#3385ff", padj > 0.05 & abs(log2FoldChange) < 1 ~ "#b3b3b3")) data$labels = data %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID, is.na(GeneName) ~ GeneID, TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe() plot = ggplot(data=data, aes(x=log2FoldChange, y=-log10(padj), color=col, alpha=0.4), size=2) + geom_point() + scale_color_identity() + xlim(-10, 10) + ylab("-log10(P)") + xlab("log2 Fold Change") + ggtitle(title) + theme_light() + theme(legend.position = "none") + geom_vline(xintercept = log2(2), linetype='dashed', colour='grey') + geom_vline(xintercept = log2(0.5), linetype='dashed', colour='grey') + geom_hline(yintercept = -log10(0.05), linetype='dashed', colour='grey')# + #geom_text_repel(data = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2), aes(label = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2)[["labels"]], colour='black')) print(plot) } #### main #### cat("\n", "------------- Kallisto - DESeq2 - RNASeq Differential expression ---------", "\n") #### Read counts for each sample df = list() for (sample in metadata$sampleID){ df[[sample]]= fread(glue("results/counts/{sample}/abundance.tsv"), sep = "\t") } counts = data.frame('TranscriptID' = df[[1]]$target_id) # Get read counts for each gene and fill table for (sample in metadata$sampleID){ reads = df[[sample]]$est_counts counts = cbind(counts, reads) } # Rename columns colnames(counts) = c("TranscriptID", metadata$sampleID) ## Aggregate to gene level counts = left_join(counts, gene_names) %>% select(GeneID, metadata$sampleID) counts = counts %>% group_by(GeneID) %>% summarise_all(sum) gene_names = gene_names %>% select(-TranscriptID) ### Count total reads counted counts = counts[!is.na(counts$GeneID),] counts = counts %>% column_to_rownames('GeneID') #### Get count statistics for each sample and plot #### count_stats = apply(counts, 2, sum) %>% enframe(name="Sample", value="total_counts") # total counts ngenes = nrow(counts) count_stats$genes_zerocounts = apply(counts, 2, function(x){sum(x==0)}) # genes with zero counts count_stats$genes_lessthan10counts = apply(counts, 2, function(x){sum(x<10)}) # genes with less than 10 counts count_stats = count_stats %>% dplyr::mutate("proportion_zero" = genes_zerocounts/ngenes, "proportion_low" = genes_lessthan10counts/ngenes) count_stats %>% fwrite(., 
"results/counts/countStatistics.tsv",sep="\t") print("Counting and plotting total reads per sample...") pdf("results/counts/total_reads_counted.pdf") ggplot(count_stats, aes(x=Sample, y=total_counts, fill=metadata$treatment)) + geom_bar(stat='identity') + theme_light() + ggtitle("Total reads counted to genes") + theme(axis.text.x = element_text(angle=45), plot.title = element_text(hjust = 0.5)) null = dev.off() # round numbers to be whole, they are not due averaging across transcripts counts = counts %>% rownames_to_column('GeneID') %>% dplyr::mutate_if(is.numeric, round) %>% column_to_rownames('GeneID') ############ Plots PCA with all data, and performs DESeq2 normalisation ######################## res = vst_pca(counts, metadata, colourvar = 'strain', name="PCA") vstcounts = res[[1]] dds = res[[2]] normcounts = res[[3]] ### write out raw and normalised counts counts %>% rownames_to_column("GeneID") %>% fwrite(., "results/counts/rawcounts.tsv", sep="\t", row.names = FALSE) normcounts %>% as.data.frame() %>% rownames_to_column("GeneID") %>% round_df(., 1) %>% fwrite(., "results/counts/normCounts.tsv", sep="\t", row.names = FALSE) # calculate correlations between samples based on the count data, and plot heatmap correlations = cor(vstcounts) pdf("results/counts/heatmap_correlations.pdf") pheatmap(correlations) garbage = dev.off() png("results/counts/heatmap_correlations.png") pheatmap(correlations) garbage = dev.off() # add column if it doesnt exist if(!("lab" %in% colnames(metadata))) { metadata$lab = 'FALSE' } # for each strain in the dataset, do a PCA plot # analysis of case v control with DESEq2 and make PCAs and volcano plots. if ("strain" %in% colnames(metadata)){ for (sp in unique(metadata$species)){ df_samples = metadata %>% filter(species == sp) df_samples = df_samples %>% filter(lab == 'FALSE') #remove lab samples for (st in unique(df_samples$strain)){ ai_list = list() ai_list[[st]] = df_samples[df_samples$strain == st,] # if the strain has both case and control (i.e. 
exposed v unexposed) if (length(unique(ai_list[[st]]$cohort)) > 1){ cat(glue("\n Running PCA for all {st} samples \n")) # subset counts data to our strains of interest subcounts = counts[colnames(counts) %in% ai_list[[st]]$sampleID] # perform PCA on the data at strain level vstcounts = vst_pca(subcounts, ai_list[[st]], 'treatment', "PCA_", st=st)[[1]] } } } } results_list = list() names_list = list() ngenes_list = list() ######### subset data and run DESeq for each combo we need, store in xlsx ######## for (cont in contrasts){ control = str_split(cont, "_")[[1]][1] # get first of string, which is control case = str_split(cont, "_")[[1]][2] # get case controls = which(metadata$treatment %in% control) cases = which(metadata$treatment %in% case) ## Perform PCA for each comparison subcounts = counts[,c(controls, cases)] subsamples = metadata[c(controls, cases),] # make treatment a factor with the 'susceptible' as reference subsamples$treatment = as.factor(as.character(subsamples$treatment)) subsamples$treatment = relevel(subsamples$treatment, control) null = vst_pca(subcounts, subsamples, colourvar='treatment', "PCA_", comparison=cont)[[]] # perform DE with each comparison, using DESeq dataset (dds) from whole library cat("\n", glue("--- Running DESeq2 differential expression analysis on {cont} ---"), "\n") cds = nbinomWaldTest(dds) results = results(cds, contrast = c("treatment", case, control)) %>% as.data.frame() results = results[order(results$padj),] #order by pvalue results = results %>% rownames_to_column("GeneID") %>% dplyr::mutate("FC" = (2^log2FoldChange)) ### absolute difference #### Get rowsums of counts, grouping by case/control. Then get difference of counts and join with DE results readdiff = data.frame(t(rowsum(t(subcounts), group = subsamples$treatment, na.rm = T))) #transpose and get rowsums for each group readdiff$absolute_diff = readdiff[,case] - readdiff[,control] #get difference readdiff = data.frame(readdiff) %>% rownames_to_column('GeneID') results = unique(left_join(results, readdiff[,c('GeneID','absolute_diff')])) # join DE results with normal gene names results = unique(left_join(results, gene_names)) fwrite(results, glue("results/genediff/{cont}.csv")) #write to csv # volcano plot for each comparison, using EnhancedVolcano. 
First make vector of labels which is AGAPs unless a gene name exists results$labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID, is.na(GeneName) ~ GeneID, TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe() #get number of sig genes res1 = results %>% filter(padj < 0.05) %>% count("direction" = FC > 1) %>% dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.05", direction == TRUE ~ "Upregulated, padj = 0.05")) %>% dplyr::rename(!!glue("{cont}_ngenes") := "n") res2 = results %>% filter(padj < 0.001) %>% count("direction" = FC > 1) %>% dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.001", direction == TRUE ~ "Upregulated, padj = 0.001")) %>% dplyr::rename(!!glue("{cont}_ngenes") := "n") ngenes_list[[cont]] = bind_rows(res1, res2) # store names of comparisons for xlsx report sheets results_list[[cont]] = results names_list[[cont]] = cont pdf(glue("results/genediff/Volcano_plot_{cont}.pdf")) volcano(data = results_list[[cont]], title = cont) null = dev.off() cat("\n", glue("{cont} complete!"), "\n") } # Join different comparisons together and write out number of sig genes purrr::reduce(ngenes_list, inner_join) %>% fwrite("results/genediff/nsig_genes.tsv", sep="\t", col.names = TRUE) #### write to excel file on diff sheets #### sheets = unlist(names_list) wb <- createWorkbook("Workbook") for (i in 1:length(sheets)){ addWorksheet(wb, glue("{sheets[[i]]}")) writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE) } #### save workbook to disk once all worksheets and data have been added #### saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE) ### Get kallisto statistics ### totalReads = c() totalAligned = c() # open kallisto json info and store stats, save to file for (sample in metadata$sampleID){ run = fromJSON(glue("results/counts/{sample}/run_info.json"), flatten=TRUE) totalReads = c(totalReads, run$n_processed) totalAligned = c(totalAligned, run$n_pseudoaligned) } df = data.frame("sample" = c(metadata$sampleID, "Total"), "totalReads" = c(totalReads, sum(totalReads)), "totalAligned" = c(totalAligned, sum(totalAligned))) %>% mutate("percentage" = (totalAligned/totalReads)*100) %>% fwrite("results/counts/KallistoQuantSummary.tsv", sep="\t", col.names = TRUE) sessionInfo() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Generate freebayes params ## # 1) make bed for parallelising freebayes library(dplyr) library(data.table) library(glue) load_metadata <- function(metadata_path) { # Check the file extension and load metadata accordingly if (tools::file_ext(metadata_path) == "xlsx") { metadata <- readxl::read_excel(metadata_path) } else if (tools::file_ext(metadata_path) == "tsv") { metadata <- data.table::fread(metadata_path, sep = "\t") } else if (tools::file_ext(metadata_path) == "csv") { metadata <- data.table::fread(metadata_path, sep = ",") } else { stop("Metadata file must be .xlsx, .tsv, or .csv") } return(metadata) } # read inputs contigs = snakemake@params[['contigs']] chunks = snakemake@params[['chunks']] fai = fread(snakemake@input[['index']]) # select contigs we want, and start, end columns fai = fai[fai$V1 %in% contigs, c(1,2)] # for each chromsome for (contig in contigs){ #subset index to desired contig f = fai[fai$V1 == contig] #get sequence of n chunks from 0 to length of contig bedseq = round(seq(0, f$V2, length.out = chunks)) #for each chunk for (i in 1:(chunks-1)){ #write bed file, one for each chunk/interval, which will be passed as input to freebayes row = c(contig, bedseq[i], bedseq[i+1]) data.frame(row) %>% t() %>% fwrite(., glue("resources/regions/genome.{contig}.region.{i}.bed"), sep="\t", col.names = FALSE) } } # 2) Make bamlist and populations.tsv file metadata = load_metadata(snakemake@params[['metadata']]) metadata$bams = paste0("results/alignments/", metadata$sampleID,".bam") metadata %>% select(bams, strain) %>% fwrite(., snakemake@output[['pops']], sep="\t", row.names = FALSE, col.names = FALSE) metadata %>% select(bams) %>% fwrite(.,snakemake@output[['bamlist']], sep="\t", row.names = FALSE, col.names = FALSE) sessionInfo() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | import sys sys.stderr = open(snakemake.log[0], "w") import rnaseqpoptools as rnaseqpop import pandas as pd import numpy as np import allel from scipy import stats from functools import partial, reduce import warnings warnings.filterwarnings('ignore') # suppress numpy runtime warnings, this is a bit dangerous, should be removed for release or resolve source of warnings # snakemake inputs and params dataset = snakemake.params['dataset'] metadata_path = snakemake.input['metadata'] metadata = rnaseqpop.load_metadata(metadata_path) gffpath = snakemake.input['gff'] pbs = snakemake.params['pbs'] pbscomps = snakemake.params['pbscomps'] contigs = snakemake.params['contigs'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) missingprop = snakemake.params['missingprop'] # gff features = allel.gff3_to_dataframe(gffpath, attributes=["ID", "description"]) gff = features[features.type == 'gene'] # gene names file, rename gene_names = pd.read_csv(snakemake.input['geneNames'], sep="\t").drop(columns=['TranscriptID']).drop_duplicates() ### main #### # Read in list of contrasts comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast']) comparisons = comparisons.contrast.str.split("_", expand=True) comparisons.columns = ['sus', 'res'] comparisons = [list(row) for i,row in comparisons.iterrows()] print(f"The pairwise comparisons for Fst are {comparisons}") fstbychrom={} if pbs: pbsbychrom={} tajdbychrom={} gdivbychrom = {} dxybychrom = {} for contig in contigs: # path to vcf path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz" # function to read in vcfs and associated SNP data vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path, contig=contig, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=30, missingfltprop=missingprop) # subset gff to appropriate contig genes = gff[gff.seqid == contig].sort_values('start').reset_index(drop=True) ### Average Fst, pbs, tajima d for each gene fst_per_comp = {} fst_per_gene = {} pbs_per_gene = {} pbs_per_comp = {} tajd_per_pop = {} tajd_per_gene = {} gdiv_per_pop = {} gdiv_per_gene = {} se_per_comp = {} se_per_gene = {} dxy_per_comp = {} dxy_per_gene = {} pos_dict = {} n_dict = {} # loop through each gene and calculate fst, pbs, tajimas d, or sequence diversity for each comparison for i, gene in genes.iterrows(): ID = gene.ID # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False) nsnps = gene_bool.sum() # if there are less than 3 snps in this gene then skip to next in loop if nsnps < 2: continue # store number of snps per 
gene n_dict[ID] = nsnps # store midpoint positions of gene pos_dict[ID] = (gene['start'] + gene['end'])/2 # fst and dxy per gene between each comparison for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 ac1 = acsubpops[comp1].compress(gene_bool, axis=0) ac2 = acsubpops[comp2].compress(gene_bool, axis=0) fst_per_comp[name], se_per_comp[name],_,_= allel.average_hudson_fst(ac1, ac2, blen=1) dxy_per_comp[name] = allel.sequence_divergence(pos[gene_bool], ac1, ac2) # tajimas d and sequence diversity per gene for each subpop(i.e treatment) for subpop in subpops: ac = acsubpops[subpop].compress(gene_bool) genepos = pos[gene_bool] tajd_per_pop[subpop] = allel.tajima_d(ac=ac, pos=genepos) gdiv_per_pop[subpop] = allel.sequence_diversity(ac=ac, pos=genepos) # pbs for each gene for each pbc comparison as defined in config.yaml if pbs is True: for pbscomp in pbscomps: pop1, pop2, outpop = pbscomp.split("_") pbs_per_comp[pbscomp],se,_,_ = rnaseqpop.meanPBS(acsubpops[pop1].compress(gene_bool, axis=0), acsubpops[pop2].compress(gene_bool, axis=0), acsubpops[outpop].compress(gene_bool, axis=0), window_size=1, normalise=True) # store inner dict in outer dicts fst_per_gene[ID] = dict(fst_per_comp) se_per_gene[ID] = dict(se_per_comp) if pbs is True : pbs_per_gene[ID] = dict(pbs_per_comp) tajd_per_gene[ID] = dict(tajd_per_pop) gdiv_per_gene[ID] = dict(gdiv_per_pop) dxy_per_gene[ID] = dict(dxy_per_comp) #reverse the dicts so the comparisons/subpops are on the outer dict fst_per_gene = rnaseqpop.flip_dict(fst_per_gene) se_per_gene = rnaseqpop.flip_dict(se_per_gene) if pbs is True : pbs_per_gene = rnaseqpop.flip_dict(pbs_per_gene) tajd_per_gene = rnaseqpop.flip_dict(tajd_per_gene) gdiv_per_gene = rnaseqpop.flip_dict(gdiv_per_gene) dxy_per_gene = rnaseqpop.flip_dict(dxy_per_gene) print(f"Chromosome {contig} complete...\n") for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 a = np.array(list(fst_per_gene[name].values())) print(f"Overall Fst (averaged across genes) for chromosome {contig} between {name} is {np.nanmean(a)}") # Make dataframe of number of snps per gene (that pass quality and missingness filters) ndf = pd.DataFrame.from_dict(n_dict, orient='index').reset_index(drop=False) ndf.columns = ['GeneID', 'nSNPs'] # Make dataframe of midpoints of each gene posdf = pd.DataFrame.from_dict(pos_dict, orient='index').reset_index(drop=False) posdf.columns = ['GeneID', 'Gene_midpoint'] # Make dataframe of fst for each comparison fst_dfs = {} se_dfs = {} dxy_dfs = {} for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 fst_df = pd.DataFrame.from_dict(fst_per_gene[name], orient='index').reset_index(drop=False) fst_df.columns = ['GeneID', (name + '_zFst')] fst_dfs[name] = fst_df se_df = pd.DataFrame.from_dict(se_per_gene[name], orient='index').reset_index(drop=False) se_df.columns = ['GeneID', (name + '_SE')] se_dfs[name] = se_df dxy_df = pd.DataFrame.from_dict(dxy_per_gene[name], orient='index').reset_index(drop=False) dxy_df.columns = ['GeneID', (name + '_dxy')] dxy_dfs[name] = dxy_df my_reduce = partial(pd.merge, on='GeneID', how='outer') fst_allcomparisons = reduce(my_reduce, fst_dfs.values()) se_allcomparisons = reduce(my_reduce, se_dfs.values()) fst_allcomparisons = reduce(my_reduce, [fst_allcomparisons, se_allcomparisons]) fst_allcomparisons['contig'] = contig dxy_allcomparisons = reduce(my_reduce, dxy_dfs.values()) dxy_allcomparisons['contig'] = contig tajd_dfs = {} gdiv_dfs = {} # Store sequence diversity and tajimas d for each gene and each subpop for subpop in subpops: tajd_df = 
pd.DataFrame.from_dict(tajd_per_gene[subpop], orient='index').reset_index(drop=False) tajd_df.columns = ['GeneID', (subpop+"_Tajima_d")] tajd_dfs[subpop] = tajd_df gdiv_df = pd.DataFrame.from_dict(gdiv_per_gene[subpop], orient='index').reset_index(drop=False) gdiv_df.columns = ['GeneID', (subpop+"_SeqDiv")] gdiv_dfs[subpop] = gdiv_df # Combine tajimas d and sequence diversity for each sample tajdall = reduce(my_reduce, tajd_dfs.values()) gdivall = reduce(my_reduce, gdiv_dfs.values()) tajdall['contig'] = contig gdivall['contig'] = contig if pbs is True: # PBS store as dataframes pbs_dfs = {} for pbscomp in pbscomps: pbs_df = pd.DataFrame.from_dict(pbs_per_gene[pbscomp], orient='index').reset_index(drop=False) pbs_df.columns = ['GeneID', (pbscomp+"PBS")] pbs_dfs[pbscomp] = pbs_df pbs_allcomparisons = reduce(my_reduce, pbs_dfs.values()) pbs_allcomparisons['contig'] = contig fstbychrom[contig] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'], how='inner'), [fst_allcomparisons, gene_names, ndf, posdf]) tajdbychrom[contig] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'], how='inner'), [tajdall, gene_names, ndf,posdf]) gdivbychrom[contig] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'], how='inner'), [gdivall, gene_names, ndf, posdf]) dxybychrom[contig] = reduce(lambda left, right: pd.merge(left, right, on=['GeneID'], how='inner'), [dxy_allcomparisons, gene_names, ndf, posdf]) if pbs is True: pbsbychrom[contig] = reduce(lambda left,right: pd.merge(left,right,on=['GeneID'], how='inner'), [pbs_allcomparisons, gene_names, ndf, posdf]) ## Concatenate chromosome dfs to one big data frame and remove duplicates fstall = pd.concat(fstbychrom.values(), ignore_index=True).drop_duplicates() tajdall = pd.concat(tajdbychrom.values(), ignore_index=True).drop_duplicates() gdivall = pd.concat(gdivbychrom.values(), ignore_index=True).drop_duplicates() dxyall = pd.concat(dxybychrom.values(), ignore_index=True).drop_duplicates() # Write to csv fstall.to_csv(f"results/variantAnalysis/selection/FstPerGene.tsv", index=False, sep="\t") tajdall.to_csv(f"results/variantAnalysis/selection/TajimasDPerGene.tsv", index=False, sep="\t") gdivall.to_csv(f"results/variantAnalysis/diversity/SequenceDivPerGene.tsv", index=False, sep="\t") dxyall.to_csv(f"results/variantAnalysis/diversity/DxyPerGene.tsv", index=False, sep="\t") if pbs is True: pbsall = pd.concat(pbsbychrom.values(), ignore_index=True).drop_duplicates() pbsall.to_csv(f"results/variantAnalysis/selection/PbsPerGene.tsv", index=False, sep="\t") |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' Finds genes that are differentially expressed in the same direction #' Across two DE comparisons. For example, a gene may be significantly #' upregulated in the control v intermediate phenotype comparison. #' And also in the intermediate phenotype vs full phenotype comparison. library(data.table) library(tidyverse) library(glue) # read metadata and get contrasts comps = snakemake@params['comps'][[1]] padj_threshold = snakemake@params['pval'] isoform_level = snakemake@params['isoform_level'] gene_level = snakemake@params['gene_level'] upper_fc = snakemake@params['fc'] lower_fc = 1/as.numeric(upper_fc) # if someone wants a FC threshold of 2, need to have lower threshold of 0.5. for (cont in comps){ sus = str_split(cont, "_")[[1]][1] # get control/susceptible name (such as Kisumu) intermediate = str_split(cont, "_")[[1]][2] # get intermediate res = str_split(cont, "_")[[1]][3] # get last of string, which is resistant/case if (gene_level == TRUE){ #### Gene diff #### one = fread(glue("results/genediff/{intermediate}_{res}.csv")) up1 = one %>% filter(FC > upper_fc, padj < padj_threshold) down1 = one %>% filter(FC < lower_fc, padj < padj_threshold) two = fread(glue("results/genediff/{sus}_{intermediate}.csv")) up2 = two %>% filter(FC > upper_fc, padj < padj_threshold) down2 = two %>% filter(FC < lower_fc, padj < padj_threshold) intersectdown = inner_join(down1, down2, by="GeneID", suffix=c("_field", "_lab")) intersectup = inner_join(up1, up2, by="GeneID", suffix=c("_field", "_lab")) fwrite(intersectup, glue("results/genediff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t") fwrite(intersectdown, glue("results/genediff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t") } if (isoform_level == TRUE){ #### Isoforms #### one = fread(glue("results/isoformdiff/{intermediate}_{res}.csv")) up1 = one %>% filter(FC > upper_fc, qval < padj_threshold) down1 = one %>% filter(FC < lower_fc, qval < padj_threshold) two = fread(glue("results/isoformdiff/{sus}_{intermediate}.csv")) up2 = two %>% filter(FC > upper_fc, qval < padj_threshold) down2 = two %>% filter(FC < lower_fc, qval < padj_threshold) intersectdown = inner_join(down1, down2, by="TranscriptID", suffix=c("_field", "_lab")) intersectup = inner_join(up1, up2, by="TranscriptID", suffix=c("_field", "_lab")) fwrite(intersectup, glue("results/isoformdiff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t") fwrite(intersectdown, glue("results/isoformdiff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t") } } sessionInfo() |
```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

#' This script uses the sleuth R package to perform differential isoform analysis,
#' which is run for each contrast. Sleuth uses the bootstraps of kallisto to
#' estimate uncertainty.

## Differential expression
library(tidyverse)
library(sleuth)
library(data.table)
library(ggrepel)
library(openxlsx)
library(glue)
library(RColorBrewer)
library(EnhancedVolcano)

print("------------- Kallisto - Sleuth - RNASeq isoform Differential expression ---------")

#### read data ####
load_metadata <- function(metadata_path) {
  # Check the file extension and load metadata accordingly
  if (tools::file_ext(metadata_path) == "xlsx") {
    metadata <- readxl::read_excel(metadata_path)
  } else if (tools::file_ext(metadata_path) == "tsv") {
    metadata <- data.table::fread(metadata_path, sep = "\t")
  } else if (tools::file_ext(metadata_path) == "csv") {
    metadata <- data.table::fread(metadata_path, sep = ",")
  } else {
    stop("Metadata file must be .xlsx, .tsv, or .csv")
  }
  return(metadata)
}

metadata = load_metadata(snakemake@input[['metadata']]) %>%
  as.data.frame() %>%
  dplyr::rename('sample' = "sampleID")

# add path column for sleuth object
metadata$path = paste0("results/counts/", metadata$sample)

# read gene names and get contrasts
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>%
  select(-GeneID)

# contrasts
contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
contrasts = contrastsdf$contrast

######### subset data and run sleuth for each contrast we need, store in xlsx ########
results_list = list()
names_list = list()

# loop through each chosen contrast and perform DE
for (cont in contrasts){
  control = str_split(cont, "_")[[1]][1] # first part of string is the control (or susceptible)
  case = str_split(cont, "_")[[1]][2]    # second part is the case (or resistant)
  controls = which(metadata$treatment %in% control) # get indices of case/control
  cases = which(metadata$treatment %in% case)

  ## subset to samples of our contrast
  subsamples = metadata[c(controls, cases),]

  # make treatment a factor with the 'susceptible' as reference
  subsamples$treatment = as.factor(as.character(subsamples$treatment))
  subsamples$treatment = relevel(subsamples$treatment, control)

  print(glue("Running Sleuth differential isoform expression analysis on {cont}"))
  so = sleuth_prep(subsamples, extra_bootstrap_summary = TRUE,
                   transformation_function = function(x) log2(x + 0.5))
  so = sleuth_fit(so, ~treatment, 'full')
  # fit reduced model necessary for isoform analysis
  so = sleuth_fit(so, ~1, 'reduced')
  # Wald test on the treatment coefficient
  so = sleuth_wt(so, which_beta = paste0("treatment", case))

  results = sleuth_results(so, test = paste0("treatment", case), 'reduced:full', show_all = FALSE) %>%
    dplyr::rename("TranscriptID" = "target_id") %>%
    dplyr::mutate("FC" = (2^b))

  # Join DE results with normal gene names
  results = unique(left_join(results, gene_names))
  fwrite(results, glue("results/isoformdiff/{cont}.csv")) # write to csv

  # Store names of contrasts and results tables for xlsx report sheets
  results_list[[cont]] = results
  names_list[[cont]] = cont

  # volcano plot for each comparison, using EnhancedVolcano.
  # First make a vector of labels, which is AGAPs unless a gene name exists
  labels = results %>%
    dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ TranscriptID,
                                          is.na(GeneName) ~ TranscriptID,
                                          TRUE ~ GeneName)) %>%
    select(Gene_name) %>%
    deframe()

  pdf(glue("results/isoformdiff/Volcano_plot_{cont}.pdf"))
  print(EnhancedVolcano(results_list[[cont]], lab=labels, x='b', y='pval', title = cont))
  garbage = dev.off()
  print(glue("{cont} complete!"))
}

#### write to excel file on diff sheets ####
sheets = unlist(names_list)
wb <- createWorkbook("Workbook")
for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE)
}

#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb, file=snakemake@output[['xlsx']], overwrite = TRUE)

sessionInfo()
```
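Because `sleuth_prep` is called with `transformation_function = function(x) log2(x + 0.5)`, the Wald-test effect size `b` reported by sleuth is on a log2 scale, which is why the script derives fold change as

$$\mathrm{FC} = 2^{b},$$

so `b = 1` corresponds to a doubling of expression and `b = -1` to a halving.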
```python
import sys
sys.stderr = open(snakemake.log[0], "w")

import pandas as pd
import numpy as np
import allel
import rnaseqpoptools as rnaseqpop

# Read in parameters from snakemake
dataset = snakemake.params['dataset']
metadata = rnaseqpop.load_metadata(snakemake.input['metadata'])
metadata = metadata.sort_values(by='species')
contigs = snakemake.params['contigs']
ploidy = snakemake.params['ploidy']
numbers = rnaseqpop.get_numbers_dict(ploidy)
qualflt = snakemake.params['qualflt']
missingprop = snakemake.params['missingprop']
gffpath = snakemake.input['gff']

# load gff
features = allel.gff3_to_dataframe(gffpath, attributes=["ID", "Parent", "description"])

# define functions
def isnotmissing(gn):
    """
    Function to check if missing values are present at a SNP
    """
    return((gn != -1).all())

# initialise dicts
snpsPerGff = {}
total_snps_per_chrom = {}
snps_per_gene_allchroms = {}
snpeffdict = {}

seqdivdictchrom = {}
thetadictchrom = {}
coefdictchrom = {}

for i, contig in enumerate(contigs):
    # Read in and filter VCF
    path = f"results/variantAnalysis/vcfs/{dataset}.{contig}.vcf.gz"
    vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path,
                                                                                                contig=contig,
                                                                                                samples=metadata,
                                                                                                numbers=numbers,
                                                                                                ploidy=ploidy,
                                                                                                qualflt=qualflt,
                                                                                                missingfltprop=missingprop)

    # Store total SNPs per chromosome, and summaries from SnpEff
    total_snps_per_chrom[contig] = geno.shape[0]
    snpeffdict[contig] = snpeff[1].value_counts(normalize=True)

    # Plot SNP density
    rnaseqpop.plot_density(pos,
                           window_size=100000,
                           title=f"Variant Density | {dataset} | Chromosome {contig}",
                           path=f"results/variantAnalysis/diversity/{dataset}_SNPdensity_{contig}.svg")

    ######## SNP counts per gene ########
    # Subset GFF to appropriate chromosome
    gff = features.query("seqid == @contig").sort_values('start').reset_index(drop=True)
    genes = gff.query("type == 'gene'")
    exons = features.query("type == 'exon'").reset_index(drop=True)

    ## Proportion of SNPs per GFF feature
    snpsPerGff[contig] = rnaseqpop.getSNPGffstats(gff, pos)
    snpsPerGff[contig]['chromosome'] = contig

    ## Calculate missing SNPs per sample, SNPs per gene etc
    snpsnotmissing = {}
    missing_array = {}
    snps_per_gene = {}
    snps_per_sample = {}

    # For each sample in metadata, compress the genotype array and check if data is missing
    for sample in metadata['sampleID']:
        bool_ = sample == populations
        gn = geno.compress(bool_, axis=1)

        res = map(isnotmissing, gn[:, 0])
        missing_array[sample] = np.fromiter(res, dtype=bool)
        snpsnotmissing[sample] = missing_array[sample].sum()

    # For each gene, find out how many SNPs per gene
    for i, gene in genes.iterrows():
        # locate_ranges() to get a boolean; needed as locate_range() will throw errors if no snps are found in the gene
        gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False)

        for sample in metadata['sampleID']:
            presentSNPs = missing_array[sample].compress(gene_bool, axis=0)  # subset to each gene
            snps_per_sample[sample] = presentSNPs.sum()

        snps_per_gene[gene['ID']] = dict(snps_per_sample)

    snps_per_gene_allchroms[contig] = pd.DataFrame.from_dict(rnaseqpop.flip_dict(snps_per_gene))

snpsPerGff = pd.concat(snpsPerGff).reset_index().rename(columns={'level_1': 'feature'})
snpsPerGff = snpsPerGff.groupby('feature').agg({'ReferenceGenomeProportion': 'mean', 'called': 'sum', 'proportion': 'mean', 'total': 'sum'})
snpsPerGff.to_csv(snakemake.output['snpsPerGenomicFeature'], sep="\t", index=True)

# Total SNPs per contig, all samples
totalsnpsdf = pd.DataFrame.from_dict(total_snps_per_chrom, orient='index', columns=['Total_SNPs'])
totalsnpsdf.to_csv("results/variantAnalysis/SNPstats/totalSNPs.tsv", sep="\t", index=True)

# SNP count per gene
snpcountsdf = pd.concat(snps_per_gene_allchroms)
snpcountsdf.to_csv("results/variantAnalysis/SNPstats/nSNPsPerGene.tsv", sep="\t", index=True)
genesNoSNPs = pd.DataFrame((snpcountsdf == 0).sum(axis=0), columns=['Genes with zero SNPs'])
genesNoSNPs.to_csv("results/variantAnalysis/SNPstats/nGenesZeroSNPs.tsv", sep="\t", index=True)

# snpEff summary
snpeffdf = pd.concat(snpeffdict)
snpeffdf.to_csv("results/variantAnalysis/SNPstats/snpEffProportions.tsv", sep="\t", index=True)
```
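`rnaseqpop.flip_dict` belongs to the workflow's own `rnaseqpoptools` module and is not shown here. Judging by how its output is passed to `pd.DataFrame.from_dict`, it presumably inverts the nesting of `snps_per_gene` so that samples become the outer keys (and hence DataFrame columns); a sketch under that assumption:

```python
from collections import defaultdict

def flip_dict(nested):
    # Hypothetical reconstruction: turn {gene: {sample: count}} into
    # {sample: {gene: count}}, so that pd.DataFrame.from_dict produces
    # one column per sample and one row per gene.
    flipped = defaultdict(dict)
    for outer_key, inner in nested.items():
        for inner_key, value in inner.items():
            flipped[inner_key][outer_key] = value
    return dict(flipped)
```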
```r
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library(tidyverse)
library(data.table)
library(glue)
library(openxlsx)

load_metadata <- function(metadata_path) {
  # Check the file extension and load metadata accordingly
  if (tools::file_ext(metadata_path) == "xlsx") {
    metadata <- readxl::read_excel(metadata_path)
  } else if (tools::file_ext(metadata_path) == "tsv") {
    metadata <- data.table::fread(metadata_path, sep = "\t")
  } else if (tools::file_ext(metadata_path) == "csv") {
    metadata <- data.table::fread(metadata_path, sep = ",")
  } else {
    stop("Metadata file must be .xlsx, .tsv, or .csv")
  }
  return(metadata)
}

#### allele imbalance ####
metadata = load_metadata(snakemake@input[['metadata']])
samples = metadata$sampleID

# Read IR mutation data
mutation_data = fread(snakemake@input[['mutations']], sep="\t")

all_list = list()
mean_list = list()

# Loop through each mutation, finding allele coverage at each position
for (m in mutation_data$Name){
  base = mutation_data[mutation_data$Name == m]$ALT
  propstring = glue("proportion{base}")
  base2 = mutation_data[mutation_data$Name == m]$ALT2 # second alt allele, if present
  propstring2 = glue("proportion{base2}")

  #### load allele balance data ####
  allele_list = list()

  # read data for each sample and subset to what we want
  for (sample in samples){
    # for each sample read data, and store the first 8 columns
    allele_list[[sample]] = fread(glue("results/variantAnalysis/variantsOfInterest/counts/{sample}_{m}_allele_counts.tsv"))[, c(1:8)]
    allele_list[[sample]]$sample = sample # add sample column
    allele_list[[sample]]$treatment = metadata$treatment[samples == sample] # add treatment column
    allele_list[[sample]]$mutation = m
    allele_list[[sample]]$gene = mutation_data[mutation_data$Name == m]$Gene

    cover = allele_list[[sample]] %>% select(A, C, G, T) %>% rowSums()
    # new column: proportion of Alts to total coverage
    allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring := (!!sym(base))/cover)

    if (!base2 %in% c("", NA)){
      allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring2 := (!!sym(base2))/cover)
    }
  }

  # Bind the per-sample dataframes in the list together into one big dataframe
  alleles = rbindlist(allele_list, fill = TRUE)

  # now calculate population means and lower/upper CIs
  alleles_per_pop_list = list()
  for (pop in unique(metadata$treatment)){
    alleles_per_pop = alleles %>% filter(treatment == pop)
    pop_prop = sum(alleles_per_pop[, ..base]) / sum(alleles_per_pop[, 'cov'])
    error = sqrt((pop_prop*(1-pop_prop))/nrow(alleles_per_pop))*1.96
    lower = pmax(pop_prop - error, 0)
    upper = pmin(pop_prop + error, 1)
    alleles_per_pop_list[[pop]] = alleles_per_pop %>%
      mutate(!!propstring := pop_prop, lowerCI = lower, upperCI = upper)

    if (!base2 %in% c("", NA)){
      pop_prop2 = sum(alleles_per_pop[, ..base2]) / sum(alleles_per_pop[, 'cov'])
      error2 = sqrt((pop_prop2*(1-pop_prop2))/nrow(alleles_per_pop))*1.96
      lower2 = pmax(pop_prop2 - error2, 0)
      upper2 = pmin(pop_prop2 + error2, 1)
      alleles_per_pop_list[[pop]] = alleles_per_pop_list[[pop]] %>%
        mutate(!!propstring2 := pop_prop2, lowerCI_2 = lower2, upperCI_2 = upper2)
    }
  }

  mean_alleles = rbindlist(alleles_per_pop_list, fill=TRUE)

  # average across replicates
  if (!base2 %in% c("", NA)){
    mean_alleles = mean_alleles %>%
      group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI, !!sym(propstring2), lowerCI_2, upperCI_2) %>%
      summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>%
      select(chrom, pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring, !!propstring2, lowerCI_2, upperCI_2)
  } else {
    mean_alleles = mean_alleles %>%
      group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI) %>%
      summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>%
      select(chrom, pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring)
  }

  # write to file
  fwrite(alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/{m}_alleleBalance.csv"))
  fwrite(mean_alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/mean_{m}_alleleBalance.csv"))
  all_list[[m]] = alleles
  mean_list[[m]] = mean_alleles
}

#### write to excel file on diff sheets ####
results_list = all_list
sheets = unique(mutation_data$Name)

wb <- createWorkbook("Workbook")
for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb, file=snakemake@output[['alleleBalance']], overwrite = TRUE)

#### mean allele balance: same again, one sheet per mutation ####
results_list = mean_list
sheets = unique(mutation_data$Name)

wb <- createWorkbook("Workbook")
for (i in 1:length(sheets)){
  addWorksheet(wb, glue("{sheets[[i]]}"))
  writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE)
}
#### save workbook to disk once all worksheets and data have been added ####
saveWorkbook(wb, file=snakemake@output[['mean_alleleBalance']], overwrite = TRUE)

sessionInfo()
```

Note: in the second-alt branch above, `error2`, `lower2` and `upper2` are computed from `pop_prop2`; the original computed them from `pop_prop`, which appears to be a copy-paste slip.
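The per-population confidence intervals above are the standard normal-approximation (Wald) interval for a proportion: with pooled alternate-allele proportion $\hat{p}$ and $n$ rows for that population,

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

clipped to $[0, 1]$ with `pmax`/`pmin`.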
```python
import sys
sys.stderr = open(snakemake.log[0], "w")

import itertools
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3

direction = snakemake.wildcards['dir_']
pval = snakemake.params['padj_threshold']
dataset = snakemake.params['dataset']
comparisons = snakemake.params['comparisons']  # e.g. ['BusiaParental_BusiaSelected', 'Kisumu_BusiaParental', 'Kisumu_BusiaSelected']
assert len(comparisons) > 1, "Only one differential expression comparison is specified, cannot run venn analysis. Please disable venn in config.yaml or specify more comparisons"

n_comps = len(comparisons)
de_data = {}
de_genes = {}
for comp in comparisons:
    de_data[comp] = pd.read_csv(f"results/genediff/{comp}.csv")
    de_genes[comp] = de_data[comp].query(f"padj < {pval}")
    if direction == 'up':
        de_genes[comp] = de_genes[comp].query("FC > 1")['GeneID'].to_numpy()
    elif direction == 'down':
        de_genes[comp] = de_genes[comp].query("FC < 1")['GeneID'].to_numpy()

# Pairwise venn diagrams
for comp1, comp2 in itertools.combinations(comparisons, 2):
    all_de_comps = [comp1, comp2]
    all_de_comps_string = '.'.join(all_de_comps)
    all_de_comps = [a.replace("_", " v ") for a in all_de_comps]

    de1 = set(de_genes[comp1])
    de2 = set(de_genes[comp2])
    s = [de1, de2]

    venn2(s, all_de_comps)
    plt.title(f"Venn - {all_de_comps_string} - {direction}")
    plt.savefig(f"results/genediff/venn/{all_de_comps_string}-{direction}-Venn.png")
    plt.close()

# Three-way venn diagrams, if there are enough comparisons
if n_comps >= 3:
    for all_de_comps in itertools.combinations(comparisons, 3):
        comp1, comp2, comp3 = all_de_comps
        all_de_comps_string = '.'.join(all_de_comps)
        all_de_comps = [a.replace("_", " v ") for a in all_de_comps]

        de_genes2 = dict((k, de_genes[k]) for k in (comp1, comp2, comp3))
        s = [set(v) for v in de_genes2.values()]

        venn3(s, all_de_comps)
        plt.title(f"Venn - {all_de_comps_string} - {direction}")
        plt.savefig(f"results/genediff/venn/{all_de_comps_string}-{direction}-Venn.png")
        plt.close()
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.bcftools import get_bcftools_opts

bcftools_opts = get_bcftools_opts(snakemake, parse_ref=False, parse_memory=False)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("bcftools view {bcftools_opts} {extra} {snakemake.input[0]} {log}")
```
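Wrapper scripts such as this one are never run directly; Snakemake injects the `snakemake` object when a rule points at the wrapper via the `wrapper:` directive. A minimal sketch of how a rule might invoke it — the rule name, file paths, wrapper version, and the `--min-ac 1` filter are all illustrative, not taken from the workflow:

```python
rule bcftools_view:
    input:
        "results/variantAnalysis/vcfs/example.vcf",          # illustrative path
    output:
        "results/variantAnalysis/vcfs/example.filtered.vcf", # illustrative path
    log:
        "logs/bcftools_view/example.log",
    params:
        extra="--min-ac 1",  # forwarded to bcftools view via {extra}
    wrapper:
        "v1.15.0/bio/bcftools/view"  # wrapper path/version assumed for illustration
```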
```python
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "simoneaujoel@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTA files
fasta = snakemake.input.get("fasta")
assert fasta is not None, "input-> a FASTA-file is required"
fasta = " ".join(fasta) if isinstance(fasta, list) else fasta

shell(
    "kallisto index "  # Tool
    "{extra} "  # Optional parameters
    "--index={snakemake.output.index} "  # Output file
    "{fasta} "  # Input FASTA files
    "{log}"  # Logging
)
```
```python
__author__ = "Joël Simoneau"
__copyright__ = "Copyright 2019, Joël Simoneau"
__email__ = "simoneaujoel@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell

# Creating log
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Placeholder for optional parameters
extra = snakemake.params.get("extra", "")

# Allowing for multiple FASTQ files
fastq = snakemake.input.get("fastq")
assert fastq is not None, "input-> a FASTQ-file is required"
fastq = " ".join(fastq) if isinstance(fastq, list) else fastq

shell(
    "kallisto quant "  # Tool
    "{extra} "  # Optional parameters
    "--threads={snakemake.threads} "  # Number of threads
    "--index={snakemake.input.index} "  # Input file
    "--output-dir={snakemake.output} "  # Output directory
    "{fastq} "  # Input FASTQ files
    "{log}"  # Logging
)
```
```python
__author__ = "Michael Chambers"
__copyright__ = "Copyright 2019, Michael Chambers"
__email__ = "greenkidneybean@gmail.com"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_threads=False, parse_write_index=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell("samtools faidx {samtools_opts} {extra} {snakemake.input[0]} {log}")
```
```python
__author__ = "Christopher Preusch"
__copyright__ = "Copyright 2017, Christopher Preusch"
__email__ = "cpreusch[at]ust.hk"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake_wrapper_utils.samtools import get_samtools_opts

samtools_opts = get_samtools_opts(
    snakemake, parse_write_index=False, parse_output=False, parse_output_format=False
)
extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=False, stderr=True)

shell(
    "samtools flagstat {samtools_opts} {extra} {snakemake.input[0]} > {snakemake.output[0]} {log}"
)
```
```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Samtools takes additional threads through its option -@
# One thread for samtools index
# Other threads are *additional* threads passed to the '-@' argument
threads = "" if snakemake.threads <= 1 else " -@ {} ".format(snakemake.threads - 1)

shell(
    "samtools index {threads} {extra} {snakemake.input[0]} {snakemake.output[0]} {log}"
)
```
```python
__author__ = "Sebastian Kurscheid"
__copyright__ = "Copyright 2019, Sebastian Kurscheid"
__email__ = "sebastian.kurscheid@anu.edu.au"
__license__ = "MIT"

from snakemake.shell import shell
import re

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

# Assert input
n = len(snakemake.input.sample)
assert (
    n == 1 or n == 2
), "input->sample must have 1 (single-end) or 2 (paired-end) elements."

# Input files
if n == 1:
    reads = "--in1 {}".format(snakemake.input.sample)
else:
    reads = "--in1 {} --in2 {}".format(*snakemake.input.sample)

# Output files
trimmed_paths = snakemake.output.get("trimmed", None)
if trimmed_paths:
    if n == 1:
        trimmed = "--out1 {}".format(snakemake.output.trimmed)
    else:
        trimmed = "--out1 {} --out2 {}".format(*snakemake.output.trimmed)

        # Output unpaired files
        unpaired = snakemake.output.get("unpaired", None)
        if unpaired:
            trimmed += f" --unpaired1 {unpaired} --unpaired2 {unpaired}"
        else:
            unpaired1 = snakemake.output.get("unpaired1", None)
            if unpaired1:
                trimmed += f" --unpaired1 {unpaired1}"

            unpaired2 = snakemake.output.get("unpaired2", None)
            if unpaired2:
                trimmed += f" --unpaired2 {unpaired2}"

        # Output merged PE reads
        merged = snakemake.output.get("merged", None)
        if merged:
            if not re.search(r"--merge\b", extra):
                raise ValueError(
                    "output.merged specified but '--merge' option missing from params.extra"
                )
            trimmed += f" --merged_out {merged}"
else:
    trimmed = ""

# Output failed reads
failed = snakemake.output.get("failed", None)
if failed:
    trimmed += f" --failed_out {failed}"

# Stats
html = "--html {}".format(snakemake.output.html)
json = "--json {}".format(snakemake.output.json)

shell(
    "(fastp --thread {snakemake.threads} "
    "{extra} "
    "{adapters} "
    "{reads} "
    "{trimmed} "
    "{json} "
    "{html} ) {log}"
)
```
```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "julianderuiter@gmail.com"
__license__ = "MIT"

from os import path
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
# Set this to False if multiqc should use the actual input directly
# instead of parsing the folders where the provided files are located
use_input_files_only = snakemake.params.get("use_input_files_only", False)

if not use_input_files_only:
    input_data = set(path.dirname(fp) for fp in snakemake.input)
else:
    input_data = set(snakemake.input)

output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {extra}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_data}"
    " {log}"
)
```