RNA-Seq-Pop: Analysis Workflow for Illumina Paired-End RNA Sequencing Data
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
- Lack of a description for a new keyword tool/pypi/adjustText .
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
RNA-Seq-Pop
This snakemake workflow performs various analyses of illumina paired-end RNA-Sequencing data:
-
Quality control of fastq reads with FASTQC
-
QC metrics integrated into one final QC report with multiQC
-
Differential expression analysis with Kallisto at the gene level ( DESeq2 ) and transcript level ( Sleuth )
-
Variant calling with freebayes , and an Fst and Population branch statistic (PBS) analysis, both in windows and at the gene-level ( Scikit-allel ).
-
Allele counts and frequencies at variants of interest (pre-specified loci of choice).
-
Various summary statistics are calculated (Wattersons Theta, Sequence Diversity, Dxy etc)
-
Differential SNP testing with the R package kissDE , which accounts for allele-specific expression.
-
Gene Set Enrichment analyses.
-
Anopheles gambiae s.l - Analysis of Ancestry Informative Markers (AIMs) to determine relative ancestry of An.gambiae/coluzzii/arabiensis .
-
Anopheles gambiae s.l - Reports if DE genes are found underneath known selective sweep signals in the Ag1000g .
-
Anopheles gambiae s.l - Determines Karyotype of chromosome 2 inversions using compkaryo - GH
The workflow is generalised, and will function with any trimmed Illumina paired-end RNA-sequencing. However, certain modules, such as the AIMs analysis, are only appropriate for specific species. These can be activated in the configuration file (config.yaml). The workflow works with pooled samples, diploid, or haploid individuals.
The workflow is still in construction, and not yet ready for release. If you have any feedback on how the workflow may be improved, please get in touch, or feel free to fork the github repo and create a pull request for any additional features you would like to implement. If you are using the workflow and would like to give feedback or troubleshoot, consider joining the discord server here
Authors
- Sanjay Curtis Nagi (@sanjaynagi) - sanjay.c.nagi@gmail.com
Usage
If you use this workflow in a paper, don't forget to give credits to the author by citing the URL of this (original) repository and, if available, its DOI -
Workflow
Step 1: Obtain a copy of this workflow
-
Create a new github repository using this workflow as a template .
-
Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
-
As the workflow contains submodules (compkaryo & mpileup2readcounts),
git clone --recursive
should be used to clone the repo. -
If you want to run the differential SNPs analysis, please compile mpileup2readcounts by navigating to its folder in workflow/scripts and:
g++ -std=c++11 -O3 mpileup2readcounts.cc -o mpileup2readcounts
Step 2: Configure workflow
Configure the workflow according to your needs via editing the files in the
config/
folder. Adjust the example
config.yaml
to configure the workflow execution, and
samples.tsv
to specify your sample setup.
Step 3: Install Snakemake
Install Snakemake using conda :
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation .
Step 4: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using
$N
cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above. See the Snakemake documentation for further details.
Step 5: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
Step 6: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
-
Once, register the upstream repository in your local copy:
git remote add -f upstream git@github.com:snakemake-workflows/rna-seq-pop.git
orgit remote add -f upstream https://github.com/snakemake-workflows/rna-seq-pop.git
if you do not have setup ssh keys. -
Update the upstream version:
git fetch upstream
. -
Create a diff with the current version:
git diff HEAD upstream/master workflow > upstream-changes.diff
. -
Investigate the changes:
vim upstream-changes.diff
. -
Apply the modified diff via:
git apply upstream-changes.diff
. -
Carefully check whether you need to update the config files:
git diff HEAD upstream/master config
. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
Step 7: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
-
Fork the original repo to a personal or lab account.
-
Clone the fork to your local system, to a different place than where you ran your analysis.
-
Copy the modified files from your analysis to the clone of your fork, e.g.,
cp -r workflow path/to/fork
. Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary. -
Commit and push your changes to your fork.
-
Create a pull request against the original repository.
Testing
Test cases are in the subfolder
.test
. They are automatically executed via continuous integration with
Github Actions
.
Code Snippets
14 15 | wrapper: "0.66.0/bio/kallisto/index" |
35 36 | wrapper: "0.66.0/bio/kallisto/quant" |
49 50 | wrapper: "v0.69.0/bio/samtools/faidx" |
69 70 | shell: "hisat2-build -p {threads} {input.fasta} {params.prefix} 2> {log}" |
94 95 96 97 98 | shell: """ hisat2 {params.extra} --threads {threads} -x {params.idx} {params.readflags} 2> {log.align} | samblaster 2> {log.sort} | samtools sort -@{threads} - -o {output} 2>> {log.sort} """ |
111 112 | wrapper: "0.65.0/bio/samtools/index" |
127 128 | shell: "samtools dict {input} > {params.output} 2> {log} " |
145 146 | wrapper: "v1.4.0/bio/gatk/splitncigarreads" |
165 166 | wrapper: "v1.4.0/bio/gatk/baserecalibrator" |
184 185 | wrapper: "v1.4.0/bio/gatk/applybqsr" |
197 198 | wrapper: "0.65.0/bio/samtools/index" |
211 212 | wrapper: "0.65.0/bio/samtools/index" |
35 36 | script: "../scripts/DeseqGeneDE.R" |
63 64 | script: "../scripts/SleuthIsoformsDE.R" |
97 98 | script: "../scripts/ProgressiveDE.R" |
151 152 | script: "../scripts/GeneSetEnrichment.R" |
174 175 | script: "../scripts/Ag1000gSweepsDE.py" |
191 192 | script: "../scripts/GeneCategoryContribution.R" |
16 17 | shell: "bcftools concat {input.calls} | vcfuniq > {output} 2> {log}" |
32 33 | wrapper: "v1.0.0/bio/bcftools/view" |
49 50 | shell: "snpEff download {params.ref} -dataDir {params.dir} 2> {log}" |
71 72 73 74 75 | shell: """ snpEff eff {params.db} -dataDir {params.dir} -csvStats {output.csvStats} {input.calls} > {params.prefix} 2> {log} bgzip {params.prefix} """ |
92 93 94 95 | shell: """ SnpSift filter "{params.expression}" {input.vcf} > {output} 2> {log} """ |
116 117 | script: "../scripts/ExtractBedVCF.py" |
21 22 23 24 25 | shell: """ samtools mpileup -f {input.ref} -l {input.bed} {input.bam} 2> {log} | {params.basedir}/scripts/mpileup2readcounts/mpileup2readcounts.cc 0 {params.baseflt} {params.ignore_indels} {params.min_alt} {params.min_af} > {output} 2>> {log} """ |
58 59 | script: "../scripts/DifferentialSNPs.R" |
91 92 | script: "../scripts/VennDiagrams.py" |
26 27 | script: "../scripts/checkInputs.py" |
43 44 | wrapper: "0.74.0/bio/fastqc" |
64 65 | wrapper: "v0.86.0/bio/cutadapt/pe" |
82 83 | wrapper: "0.70.0/bio/samtools/flagstat" |
103 104 | shell: "mosdepth --threads {threads} --fast-mode --by {params.windowsize} --no-per-base {params.prefix} {input.bam}" |
119 120 121 122 | shell: """ bcftools stats {input} > {output} 2> {log} """ |
138 139 | shell: "qualimap bamqc -bam {input.bam} -c -gff {input.gff} -outfile {output} 2> {log}" |
161 162 | wrapper: "0.74.0/bio/multiqc" |
36 37 | script: "../scripts/SNPstatistics.py" |
66 67 | script: "../scripts/pca.py" |
96 97 | script: "../scripts/SummaryStats.py" |
143 144 | script: "../scripts/WindowedFstPBS.py" |
176 177 | script: "../scripts/PerGeneFstPBS.py" |
209 210 | script: "../scripts/AncestryInformativeMarkers.py" |
232 233 234 235 236 237 | shell: """ paste <(bcftools query -l {input.vcf}) \ <(python {params.basedir}/scripts/compkaryo/compkaryo/compkaryo.py {input.vcf} {wildcards.karyo} -p {params.ploidy}) | column -s $'\\t' -t | sort -k 1 > {output} """ |
258 259 | script: "../scripts/KaryoPlots.py" |
20 21 | script: "../scripts/GenerateFreebayesParams.R" |
44 45 | shell: "freebayes -f {input.ref} -t {input.regions} --ploidy {params.ploidy} --populations {params.pops} --pooled-discrete --use-best-n-alleles 5 -L {input.samples} > {output} 2> {log}" |
63 64 65 66 67 | shell: """ octopus -R {input.reference} -I {input.bam} -T {wildcards.chrom} --organism-ploidy {params.ploidy} --downsample-above {params.max_coverage} \ --sequence-error-model {params.err_model} --forest {params.forest} -o {output} --threads {threads} 2> {log} """ |
21 22 23 24 25 | shell: """ samtools mpileup {input.bam} -r {params.region} -f {params.ref} 2> {log} | python2 {params.basedir}/scripts/BaseParser.py > {output} 2>> {log} """ |
52 53 | script: "../scripts/VariantsOfInterestAlleleBalance.R" |
74 75 | script: "../scripts/VariantsOfInterestPlot.py" |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import numpy as np from collections import defaultdict # Read in parameters pval_threshold = snakemake.params['pval'] upper_fc = snakemake.params['fc'] lower_fc = 1/upper_fc # if someone wants a FC threshold of 2, need to have lower threshold of 0.5. # Read in list of contrasts comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast']) # Read in .csv file containing selection signals and associated metadata signals = pd.read_csv("resources/signals.csv") for comp in comparisons['contrast']: # Load DE results DEgenes = pd.read_csv(f"results/genediff/{comp}.csv") # Filter to overexpressed genes sigup = DEgenes[np.logical_and(DEgenes['padj'] < pval_threshold, np.logical_or(DEgenes['FC'] > upper_fc, DEgenes['FC'] < lower_fc))] sweep = {} nswept = {} # Loop through each signal, recording if any DE genes are found in one for i, cols in signals.iterrows(): if pd.isnull(cols['overlapping_genes']): # Skip signal if no overlapping genes continue sweptgenes = np.array(cols['overlapping_genes'].split(" ")) # Get boolean array - if list of swept genes isin our DE genes overlap = np.isin(sweptgenes, sigup['GeneID']) sweep[cols['uid']] = sweptgenes[overlap] nswept[cols['uid']] = sweptgenes genes = np.concatenate(list(sweep.values())) swept = sigup[np.isin(sigup['GeneID'], genes)] for k,v in sweep.items(): sweep[k] = ' '.join(v) # Build dataframe and add sweep metadata columns sweptDE = pd.DataFrame.from_dict(sweep, orient='index', columns=['overlapping_DE_genes']) sweptDE = sweptDE.reset_index().rename(columns={'index': 'Ag1000g_sweep'}) sweptDE['overlapping_genes'] = signals['overlapping_genes'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) sweptDE['chromosome'] = signals['peak_end_seqid'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) sweptDE['epicenter'] = signals['epicenter_coord'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) sweptDE['known_loci'] = signals['known_loci'][~pd.isnull(signals['overlapping_genes'])].reset_index(drop=True) wheresweep = defaultdict(dict) whatsweep = defaultdict(list) # Now loop through each swept gene to find name of sweeps it lies under # And location of sweep for gene in genes: for i, cols in sweptDE.iterrows(): sweptgenes = np.array(cols['overlapping_DE_genes'].split(" ")) if np.isin(sweptgenes, gene).any(): wheresweep[gene]['chrom'] = cols['chromosome'] wheresweep[gene]['epicenter'] = cols['epicenter'] wheresweep[gene]['known_loci'] = cols['known_loci'] whatsweep[gene].append(cols['Ag1000g_sweep']) # Join name of sweeps column into a single string, so it fits in one column of data frame for k,v in whatsweep.items(): whatsweep[k] = ' '.join(v) dfwhere = pd.DataFrame.from_dict(wheresweep, orient='index') dfwhat = pd.DataFrame.from_dict(whatsweep, orient='index', columns=['Ag1000g_sweeps']) df = pd.concat([dfwhat, dfwhere], axis=1) df = df.reset_index().rename(columns={'index': 'GeneID'}) swept = swept.merge(df) swept.to_csv(f"results/genediff/ag1000gSweeps/{comp}_swept.tsv", sep="\t") |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 | import sys sys.stderr = open(snakemake.log[0], "w") import rnaseqpoptools as rnaseqpop import pandas as pd import numpy as np import zarr import allel from collections import defaultdict ### AIMS ### dataset = snakemake.params['dataset'] metadata = pd.read_csv(snakemake.input['metadata'], sep="\t") metadata = metadata.sort_values(by='species').reset_index(drop=True) chroms = snakemake.params['chroms'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) qualflt = snakemake.params['qualflt'] missingprop = snakemake.params['missingprop'] # read AIMs aims = zarr.open(snakemake.input['aims_zarr_gambcolu'], mode='r') ## initialize dicts ancestryPerAim = {} aims_chrom_gamb = {} aims_chrom_colu = {} all_gamb = defaultdict(list) all_colu = defaultdict(list) n_aims_per_chrom = {} for chrom in chroms: # read in and filter data path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=qualflt, missingfltprop=missingprop) aimspos = aims[chrom]['POS'][:] # get intersection of aims and our SNPs aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos) our_aims = pos[aims_pos_mask] print(f"\n In the data, across all samples there are {our_aims.shape[0]} Ancestry Informative markers on Chromosome {chrom}") # get gamb and colu alleles, and subset to aims that we have in the rna-seq data aimscolu = aims[chrom]['colu_allele'][:][aims_mask_2] aimsgamb = aims[chrom]['gamb_allele'][:][aims_mask_2] # get mask that was used in readAndFilterVcf() mask = pos.locate_intersection(vcf['variants/POS'])[1] ref = vcf['variants/REF'][mask][aims_pos_mask] alt = vcf['variants/ALT'][mask][aims_pos_mask] # filter geno array to set of aims geno_aims = geno.compress(aims_pos_mask, axis=0) totalgambscore = {} totalcoluscore = {} for aim in our_aims: gambscore = {} coluscore = {} # filter arrays mask = our_aims == aim ref_ = ref[mask] alt_ = alt[mask] aimscolu_ = aimscolu[mask] aimsgamb_ = aimsgamb[mask] gn_aim = geno_aims.compress(mask, axis=0) # convert genotypes to nucleotides gn2nucleotide = {0:ref_[0], 1:alt_[0][0], 2:alt_[0][1], 3:alt_[0][2], -1:float("nan")} gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide) # for each sample, get proportion of gambiae/coluzzii alleles # alleles that are different to both will be missed here for sample in metadata.treatment.unique(): alleles = gn.take(subpops[sample], axis=1).flatten() # at each AIM, do we have gamb or colu alleles gamb = alleles[alleles != 'nan'] == aimsgamb_ colu = alleles[alleles != 'nan'] == aimscolu_ # get proportion of gamb v colu alleles at each locus gambscore[sample] = np.mean(gamb) coluscore[sample] = np.mean(colu) totalgambscore[aim] = dict(gambscore) totalcoluscore[aim] = dict(coluscore) gambscores = rnaseqpop.flip_dict(totalgambscore) coluscores = rnaseqpop.flip_dict(totalcoluscore) prop_gambiae = {} prop_colu = {} n_aims_per_sample = {} for sample in metadata.treatment.unique(): prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values()))) all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values())))) prop_colu[sample] = np.nanmean(np.array(list(coluscores[sample].values()))) all_colu[sample].append(np.nanmean(np.array(list(coluscores[sample].values())))) arr = np.array(list(gambscores[sample].values())) dim = arr.shape[0] n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr)) # store AIM fractions for each chromosome in outer dict aims_chrom_gamb[chrom] = dict(prop_gambiae) aims_chrom_colu[chrom] = dict(prop_colu) n_aims_per_chrom[chrom] = dict(n_aims_per_sample) # Store ancestry score per aim ancestryPerAim[chrom] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(coluscores).add_suffix("_colu")], axis=1) ancestryPerAim[chrom]['contig'] = chrom # plot and store for each chromosome coludf = pd.DataFrame.from_dict(prop_colu, orient='index', columns=['AIM_fraction_coluzzii']) gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae']) perchromdf = gambdf.merge(coludf, left_index=True, right_index=True) aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs']) perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{chrom}.tsv", sep="\t", index=True) rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="coluzzii", species2="gambiae", figtitle=f"AIM_fraction_{chrom}", total=False) aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb) aims_chrom_colu = rnaseqpop.flip_dict(aims_chrom_colu) n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom) # get ancestry per aim for later plotting on chromosome ancestryPerAim = pd.concat(ancestryPerAim, axis=0) ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t") # get genome wide average AIM fractions for k in all_gamb: all_gamb[k] = np.nanmean(all_gamb[k]) all_colu[k] = np.nanmean(all_colu[k]) df1 = pd.DataFrame.from_dict(all_gamb, orient='index',columns=['AIM_fraction_gambiae']) df2 = pd.DataFrame.from_dict(all_colu, orient='index', columns=['AIM_fraction_coluzzii']) n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom) n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom.tsv", sep="\t", index=True) df = df1.merge(df2, left_index=True, right_index=True) df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary.tsv", sep="\t", index=True) rnaseqpop.plot_aims(df, n_aimsdf, species1="coluzzii", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True) ########### The same but for arabiensis v gambiae/coluzzii # script should be modularised but no time atm, isnt necessary if metadata['species'].isin(['arabiensis']).any(): aims = zarr.open(snakemake.input['aims_zarr_arab'], mode='r') aims_chrom_gamb = {} aims_chrom_arab = {} all_gamb = defaultdict(list) all_arab = defaultdict(list) n_aims_per_chrom = {} for chrom in chroms: # read in and filter data path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, qualflt=qualflt, missingfltprop=missingprop) aimspos = aims[chrom]['POS'][:] # get intersection of aims and our SNPs aims_pos_mask, aims_mask_2 = pos.locate_intersection(aimspos) our_aims = pos[aims_pos_mask] print(f"\n In the data, across all samples there are {our_aims.shape[0]} gamb-arab Ancestry Informative markers on Chromosome {chrom}") # get gamb and colu alleles, and subset to aims that we have in the rna-seq data aimsgamb = aims[chrom]['gambcolu_allele'][:][aims_mask_2] aimsarab = aims[chrom]['arab_allele'][:][aims_mask_2] # get mask that was used in readAndFilterVcf() mask = pos.locate_intersection(vcf['variants/POS'])[1] ref = vcf['variants/REF'][mask][aims_pos_mask] alt = vcf['variants/ALT'][mask][aims_pos_mask] # filter geno array to set of aims geno_aims = geno.compress(aims_pos_mask, axis=0) totalgambscore = {} totalarabscore = {} for aim in our_aims: gambscore = {} arabscore = {} # filter arrays mask = our_aims == aim ref_ = ref[mask] alt_ = alt[mask] aimsarab_ = aimsarab[mask] aimsgamb_ = aimsgamb[mask] gn_aim = geno_aims.compress(mask, axis=0) #convert genotypes to nucleotides gn2nucleotide = {0:ref_[0], 1:alt_[0][0], 2:alt_[0][1], 3:alt_[0][2], -1:float("nan")} gn = rnaseqpop.replace_with_dict2_generic(gn_aim, gn2nucleotide) # for each sample, get proportion of gambiae/arabiensis alleles # alleles that are different to both will be missed here for sample in metadata.treatment.unique(): alleles = gn.take(subpops[sample], axis=1).flatten() # at each AIM, do we have gamb or arab alleles gamb = alleles[alleles != 'nan'] == aimsgamb_ arab = alleles[alleles != 'nan'] == aimsarab_ # get proportion of gamb v arab alleles at each locus gambscore[sample] = np.mean(gamb) arabscore[sample] = np.mean(arab) totalgambscore[aim] = dict(gambscore) totalarabscore[aim] = dict(arabscore) gambscores = rnaseqpop.flip_dict(totalgambscore) arabscores = rnaseqpop.flip_dict(totalarabscore) prop_gambiae = {} prop_arab = {} n_aims_per_sample = {} for sample in metadata.treatment.unique(): prop_gambiae[sample] = np.nanmean(np.array(list(gambscores[sample].values()))) all_gamb[sample].append(np.nanmean(np.array(list(gambscores[sample].values())))) prop_arab[sample] = np.nanmean(np.array(list(arabscores[sample].values()))) all_arab[sample].append(np.nanmean(np.array(list(arabscores[sample].values())))) arr = np.array(list(gambscores[sample].values())) dim = arr.shape[0] n_aims_per_sample[sample] = dim-np.sum(np.isnan(arr)) aims_chrom_gamb[chrom] = dict(prop_gambiae) aims_chrom_arab[chrom] = dict(prop_arab) n_aims_per_chrom[chrom] = dict(n_aims_per_sample) # Store ancestry score per aim ancestryPerAim[chrom] = pd.concat([pd.DataFrame(gambscores).add_suffix("_gamb"), pd.DataFrame(coluscores).add_suffix("_colu")], axis=1) ancestryPerAim[chrom]['contig'] = chrom # plot and store for each chromosome gambdf = pd.DataFrame.from_dict(prop_gambiae, orient='index', columns=['AIM_fraction_gambiae']) arabdf = pd.DataFrame.from_dict(prop_arab, orient='index', columns=['AIM_fraction_arabiensis']) perchromdf = gambdf.merge(arabdf, left_index=True, right_index=True) aimsperchromdf = pd.DataFrame.from_dict(n_aims_per_sample, orient='index', columns=['n_AIMs']) perchromdf.to_csv(f"results/variantAnalysis/ancestry/AIM_fraction_{chrom}.arab.tsv", sep="\t", index=True) rnaseqpop.plot_aims(perchromdf, aimsperchromdf, species1="arabiensis", species2="gambiae", figtitle=f"AIM_fraction_arab_{chrom}", total=False) aims_chrom_gamb = rnaseqpop.flip_dict(aims_chrom_gamb) aims_chrom_arab = rnaseqpop.flip_dict(aims_chrom_arab) n_aims_per_chrom = rnaseqpop.flip_dict(n_aims_per_chrom) # get ancestry per aim for later plotting on chromosome ancestryPerAim = pd.concat(ancestryPerAim, axis=0) ancestryPerAim.to_csv("results/variantAnalysis/ancestry/ancestryScorePerAim.tsv", sep="\t") # get genome wide average AIM fractions for k in all_gamb: all_gamb[k] = np.nanmean(all_gamb[k]) all_arab[k] = np.nanmean(all_arab[k]) df1 = pd.DataFrame.from_dict(all_gamb, orient='index', columns=['AIM_fraction_gambiae']) df2 = pd.DataFrame.from_dict(all_arab, orient='index', columns=['AIM_fraction_arabiensis']) n_aimsdf = pd.DataFrame.from_dict(n_aims_per_chrom) n_aimsdf.to_csv(f"results/variantAnalysis/ancestry/n_AIMS_per_chrom_arab.tsv", sep="\t", index=True) df = df1.merge(df2, left_index=True, right_index=True) df.to_csv(f"results/variantAnalysis/ancestry/AIMs_summary_arab.tsv", sep="\t", index=True) rnaseqpop.plot_aims(df, n_aimsdf, species1="arabiensis", species2="gambiae", figtitle="AIM_fraction_whole_genome", total=True) |
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | import sys sys.stderr = open(snakemake.log[0], "w") import os import pandas as pd import numpy as np import re import allel # Read in parameters metadata = pd.read_csv(snakemake.params['metadata'], sep="\t") ref = snakemake.input['ref'] gffpath = snakemake.params['gffpath'] chroms = snakemake.params['chroms'] autofastq = snakemake.params['fastq'] sweeps = snakemake.params['sweeps'] signalpath = "resources/signals.csv" ## Read fasta file and grep chromosome names, check if they match config file = open(ref, "r") refchroms = [] for line in file: if re.search(">", line): refchroms.append(line.split()[0].replace(">", "")) assert (np.isin(chroms, refchroms)).all(), f"Not all chromosome names specified in config.yaml are found in the reference fasta file ({ref})" # Check that the contrasts specified in config.yaml are all in the treatment column comparisons = pd.DataFrame(snakemake.params['contrasts'], columns=['contrast']) comparisons = comparisons.contrast.str.split("_", expand=True) # split contrasts into case v control comparisons = comparisons.values.ravel() # make 1d array assert (np.isin(comparisons, metadata['treatment']).all()), f"treatments specified in config.yaml contrasts do not exist in samples.tsv. {comparisons}" # Check that samples in fastq.tsv match those in metadata.tsv if autofastq is False: fastq = pd.read_csv(snakemake.params['table'], sep="\t") assert np.isin(fastq['sampleID'], metadata['sampleID']).all(), f"all samples specified in fastq table do not match those in samples.tsv, please check your fastq.tsv file" else: # Check to see if fastq files match the metadata for sample in metadata['sampleID']: for n in [1,2]: fqpath = f"resources/reads/{sample}_{n}.fastq.gz" assert os.path.isfile(fqpath), f"all sample names in 'samples.tsv' do not match a .fastq.gz file in the {snakemake.workflow.basedir}/resources/reads/ directory, please rename or consider using the fastq table option" # Check column names of gene_names.tsv gene_names = pd.read_csv(snakemake.params['gene_names'], sep="\t") colnames = ['GeneID', 'TranscriptID', 'GeneDescription', 'GeneName'] assert (gene_names.columns.isin(colnames)).all(), f"Column names of gene_names.tsv should be {colnames}" # Check that the chromosomes are present in the gff file gff = allel.gff3_to_dataframe(gffpath) assert np.isin(chroms, gff.seqid).all(), f"All provided chromosome names ({chroms}) are not present in the reference GFF file" |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' This script performs differential expression analysis at the #' gene level using DESeq2. It outputs heatmaps, PCAs, gene counts, #' as well as DE results. Final DE results for all contrasts are #' saved in an .xslx report ## Differential expression library(DESeq2) library(pheatmap) library(data.table) library(ggrepel) library(openxlsx) library(glue) library(RColorBrewer) library(tidyverse) library(jsonlite) #read metadata and get contrasts metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame() gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct() contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']]) contrasts = contrastsdf$contrast ##### define functions ###### round_df = function(df, digits) { #' This function rounds all the numeric columns of a data.frame nums = vapply(df, is.numeric, FUN.VALUE = logical(1)) df[,nums] = round(df[,nums], digits = digits) (df) } vst_pca = function(counts, samples, colourvar, name="PCA", st="", comparison=""){ #' This function takes counts and sample metadata as input and builds a DESeq2 dataset #' Returning normalised and variance-stabilised counts, and performs PCA on the data #' Plotting and saving to pdf # make DESeq dataset dds = DESeqDataSetFromMatrix(countData = counts, colData = samples, design = ~ treatment) ###### estimate paramters and normalise dds = estimateSizeFactors(dds) dds = estimateDispersions(dds) vsd = varianceStabilizingTransformation(dds) normcounts = counts(dds, normalized=TRUE) vstcounts = assay(vsd) vstcounts = vstcounts[order(apply(vstcounts,1,sum),decreasing=TRUE),] #### write pca of samples to pdf pca2=prcomp(t(vstcounts),center=TRUE) pc = data.frame(pca2$x) %>% rownames_to_column("sampleID") pc = left_join(pc, samples) pdf(glue("results/plots/{name}{st}{comparison}.pdf")) print(ggplot(data=pc,aes(x=PC1, y=PC2, colour=treatment)) + geom_point(size=6, alpha=0.8) + geom_text_repel(aes(label=sampleID), colour="black") + theme_light() + labs(title=glue("{name} {st} {comparison}"), x=glue("PC1 - Variance explained - {round(summary(pca2)$importance[2,][1], 3)}"), y=glue("PC2 - Variance explained - {round(summary(pca2)$importance[2,][2], 3)}")) + theme(legend.position = "bottom", legend.title = element_blank(), legend.background=element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.background = element_blank(), plot.title = element_text(hjust = 0.5), axis.text.x = element_text(colour="black", size=18), axis.text.y = element_text(colour="black", size=18))) null = dev.off() return(list(vstcounts, dds, normcounts)) } volcano = function(data, title){ data = data %>% mutate(col = case_when(padj < 0.05 & abs(log2FoldChange) > 1 ~ "#ff6666", padj > 0.05 & abs(log2FoldChange) > 1 ~ "#39e600", padj < 0.05 & abs(log2FoldChange) < 1 ~ "#3385ff", padj > 0.05 & abs(log2FoldChange) < 1 ~ "#b3b3b3")) data$labels = data %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID, is.na(GeneName) ~ GeneID, TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe() plot = ggplot(data=data, aes(x=log2FoldChange, y=-log10(padj), color=col, alpha=0.4), size=2) + geom_point() + scale_color_identity() + xlim(-10, 10) + ylab("-log10(P)") + xlab("log2 Fold Change") + ggtitle(title) + theme_light() + theme(legend.position = "none") + geom_vline(xintercept = log2(2), linetype='dashed', colour='grey') + geom_vline(xintercept = log2(0.5), linetype='dashed', colour='grey') + geom_hline(yintercept = -log10(0.05), linetype='dashed', colour='grey')# + #geom_text_repel(data = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2), aes(label = subset(data, data[['padj']] < 0.01 & abs(data[['log2FoldChange']]) > 2)[["labels"]], colour='black')) print(plot) } #### main #### cat("\n", "------------- Kallisto - DESeq2 - RNASeq Differential expression ---------", "\n") #### Read counts for each sample df = list() for (sample in metadata$sampleID){ df[[sample]]= fread(glue("results/quant/{sample}/abundance.tsv"), sep = "\t") } counts = data.frame('TranscriptID' = df[[1]]$target_id) # Get read counts for each gene and fill table for (sample in metadata$sampleID){ reads = df[[sample]]$est_counts counts = cbind(counts, reads) } # Rename columns colnames(counts) = c("TranscriptID", metadata$sampleID) ## Aggregate to gene level counts = left_join(counts, gene_names) %>% select(GeneID, metadata$sampleID) counts = counts %>% group_by(GeneID) %>% summarise_all(sum) gene_names = gene_names %>% select(-TranscriptID) ### Count total reads counted counts = counts[!is.na(counts$GeneID),] counts = counts %>% column_to_rownames('GeneID') #### Get count statistics for each sample and plot #### count_stats = apply(counts, 2, sum) %>% enframe(name="Sample", value="total_counts") # total counts ngenes = nrow(counts) count_stats$genes_zerocounts = apply(counts, 2, function(x){sum(x==0)}) # genes with zero counts count_stats$genes_lessthan10counts = apply(counts, 2, function(x){sum(x<10)}) # genes with less than 10 counts count_stats = count_stats %>% dplyr::mutate("proportion_zero" = genes_zerocounts/ngenes, "proportion_low" = genes_lessthan10counts/ngenes) count_stats %>% fwrite(., "results/quant/countStatistics.tsv",sep="\t") print("Counting and plotting total reads per sample...") pdf("results/quant/total_reads_counted.pdf") ggplot(count_stats, aes(x=Sample, y=total_counts, fill=metadata$treatment)) + geom_bar(stat='identity') + theme_light() + ggtitle("Total reads counted (mapped to Ag transcriptome (PEST))") + theme(axis.text.x = element_text(angle=45), plot.title = element_text(hjust = 0.5)) null = dev.off() # round numbers to be whole, they are not due averaging across transcripts counts = counts %>% rownames_to_column('GeneID') %>% dplyr::mutate_if(is.numeric, round) %>% column_to_rownames('GeneID') ############ Plots PCA with all data, and performs DESeq2 normalisation ######################## res = vst_pca(counts, metadata, colourvar = 'strain', name="PCA") vstcounts = res[[1]] dds = res[[2]] normcounts = res[[3]] ### write out raw and normalised counts counts %>% rownames_to_column("GeneID") %>% fwrite(., "results/quant/rawcounts.tsv", sep="\t", row.names = FALSE) normcounts %>% as.data.frame() %>% rownames_to_column("GeneID") %>% round_df(., 1) %>% fwrite(., "results/quant/normCounts.tsv", sep="\t", row.names = FALSE) # calculate correlations between samples based on the count data, and plot heatmap correlations = cor(vstcounts) pdf("results/plots/heatmap_correlations.pdf") pheatmap(correlations) garbage = dev.off() # add column if it doesnt exist if(!("lab" %in% colnames(metadata))) { metadata$lab = 'FALSE' } # for each strain in the dataset, do a PCA plot # analysis of case v control with DESEq2 and make PCAs and volcano plots. if ("strain" %in% colnames(metadata)){ for (sp in unique(metadata$species)){ df_samples = metadata %>% filter(species == sp) df_samples = df_samples %>% filter(lab == 'FALSE') #remove lab samples for (st in unique(df_samples$strain)){ ai_list = list() ai_list[[st]] = df_samples[df_samples$strain == st,] # if the strain has both case and control (i.e. exposed v unexposed) if (length(unique(ai_list[[st]]$cohort)) > 1){ cat(glue("\n Running PCA for all {st} samples \n")) # subset counts data to our strains of interest subcounts = counts[colnames(counts) %in% ai_list[[st]]$sampleID] # perform PCA on the data at strain level vstcounts = vst_pca(subcounts, ai_list[[st]], 'treatment', "PCA_", st=st)[[1]] } } } } results_list = list() names_list = list() ngenes_list = list() ######### subset data and run DESeq for each combo we need, store in xlsx ######## for (cont in contrasts){ control = str_split(cont, "_")[[1]][1] # get first of string, which is control case = str_split(cont, "_")[[1]][2] # get case controls = which(metadata$treatment %in% control) cases = which(metadata$treatment %in% case) ## Perform PCA for each comparison subcounts = counts[,c(controls, cases)] subsamples = metadata[c(controls, cases),] # make treatment a factor with the 'susceptible' as reference subsamples$treatment = as.factor(as.character(subsamples$treatment)) subsamples$treatment = relevel(subsamples$treatment, control) null = vst_pca(subcounts, subsamples, colourvar='treatment', "PCA_", comparison=cont)[[]] # perform DE with each comparison, using DESeq dataset (dds) from whole library cat("\n", glue("--- Running DESeq2 differential expression analysis on {cont} ---"), "\n") cds = nbinomWaldTest(dds) results = results(cds, contrast = c("treatment", case, control)) %>% as.data.frame() results = results[order(results$padj),] #order by pvalue results = results %>% rownames_to_column("GeneID") %>% dplyr::mutate("FC" = (2^log2FoldChange)) ### absolute difference #### Get rowsums of counts, grouping by case/control. Then get difference of counts and join with DE results readdiff = data.frame(t(rowsum(t(subcounts), group = subsamples$treatment, na.rm = T))) #transpose and get rowsums for each group readdiff$absolute_diff = readdiff[,case] - readdiff[,control] #get difference readdiff = data.frame(readdiff) %>% rownames_to_column('GeneID') results = unique(left_join(results, readdiff[,c('GeneID','absolute_diff')])) # join DE results with normal gene names results = unique(left_join(results, gene_names)) fwrite(results, glue("results/genediff/{cont}.csv")) #write to csv # volcano plot for each comparison, using EnhancedVolcano. First make vector of labels which is AGAPs unless a gene name exists results$labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID, is.na(GeneName) ~ GeneID, TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe() #get number of sig genes res1 = results %>% filter(padj < 0.05) %>% count("direction" = FC > 1) %>% dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.05", direction == TRUE ~ "Upregulated, padj = 0.05")) %>% dplyr::rename(!!glue("{cont}_ngenes") := "n") res2 = results %>% filter(padj < 0.001) %>% count("direction" = FC > 1) %>% dplyr::mutate("direction" = case_when(direction == FALSE ~ "Downregulated, padj = 0.001", direction == TRUE ~ "Upregulated, padj = 0.001")) %>% dplyr::rename(!!glue("{cont}_ngenes") := "n") ngenes_list[[cont]] = bind_rows(res1, res2) # store names of comparisons for xlsx report sheets results_list[[cont]] = results names_list[[cont]] = cont pdf(glue("results/genediff/Volcano_plot_{cont}.pdf")) volcano(data = results_list[[cont]], title = cont) null = dev.off() cat("\n", glue("{cont} complete!"), "\n") } # Join different comparisons together and write out number of sig genes purrr::reduce(ngenes_list, inner_join) %>% fwrite("results/genediff/nsig_genes.tsv", sep="\t", col.names = TRUE) #### write to excel file on diff sheets #### sheets = unlist(names_list) wb <- createWorkbook("Workbook") for (i in 1:length(sheets)){ addWorksheet(wb, glue("{sheets[[i]]}")) writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE) } #### save workbook to disk once all worksheets and data have been added #### saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE) ### Get kallisto statistics ### totalReads = c() totalAligned = c() # open kallisto json info and store stats, save to file for (sample in metadata$sampleID){ run = fromJSON(glue("results/quant/{sample}/run_info.json"), flatten=TRUE) totalReads = c(totalReads, run$n_processed) totalAligned = c(totalAligned, run$n_pseudoaligned) } df = data.frame("sample" = c(metadata$sampleID, "Total"), "totalReads" = c(totalReads, sum(totalReads)), "totalAligned" = c(totalAligned, sum(totalAligned))) %>% mutate("percentage" = (totalAligned/totalReads)*100) %>% fwrite("results/quant/KallistoQuantSummary.tsv", sep="\t", col.names = TRUE) sessionInfo() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' This script reads in tables of allele counts, produced with samtools mpileup. #' Using the R package kissDE, it performs a differential SNP testing #' analysis, testing if certain alleles are biased to be found in one #' treatment group library(tidyverse) library(data.table) library(kissDE) library(glue) library(rtracklayer) ######## parse inputs ############# chroms = snakemake@params[['chroms']] metadata = fread(snakemake@input[['metadata']]) contrasts = data.frame("contrast" = snakemake@params[['DEcontrasts']]) comparisons = contrasts %>% separate(contrast, into = c("control", "case"), sep = "_") mincounts = snakemake@params[['mincounts']] pval = snakemake@params[['pval_flt']] gene_names = fread(snakemake@input[['geneNames']], sep="\t") %>% select(-TranscriptID) %>% distinct() gffpath = snakemake@input[['gff']] #load gff with rtracklayer and filter to genes only gff = rtracklayer::import(gffpath) %>% as.data.frame() %>% filter(type == 'gene') %>% select(seqnames, start, end, ID) %>% dplyr::rename("chrom" = 'seqnames') %>% arrange(chrom, start) %>% as.data.table() #remove prefix for chrom column, so it matches the bed file #for each contrast/comparison, perform differential SNP analysis for (i in 1:nrow(comparisons)){ name = contrasts[i,] control = comparisons$control[i] case = comparisons$case[i] nrepscontrol = sum(metadata$treatment %in% control) nrepscase = sum(metadata$treatment %in% case) samples = metadata[metadata$treatment %in% c(case, control),]$sampleID print(glue("Extracting allele tables for {case}, {control}")) sample_list = list() for (sample in samples){ chrom_list = list() for (chrom in chroms){ #read in data alleles = fread(glue("results/variantAnalysis/alleleTables/{sample}.chr{chrom}.allele.table")) #sum lowercase and uppercase alleles, make new column (refcount) chrom_list[[chrom]] = alleles %>% mutate("A" = A+a,"T" = T+t,"G" = G+g,"C" = C+c) %>% select(-c(a,t,c,g,V15)) %>% mutate(refcount = case_when(ref == 'T' ~ T, ref == 'G' ~ G, ref == 'A' ~ A, ref == 'C' ~ C)) } #bind chroms together alleles = rbindlist(chrom_list) #this step find the most abundant ALT base, whether A,T,C or G, assigns its count as 'altcount' list_ = list() for (base in c("A", "C", "G", "T")){ a = alleles %>% filter(ref == base) %>% select(A,T,C,G) %>% select(-base) b = alleles %>% filter(ref == base) col = colnames(a)[apply(a, 1 , which.max)] b$alt = col b$altcount = apply(a, 1, max) list_[[base]] = b } alleles = rbindlist(list_) %>% arrange(chr, loc) #pivot to get separate rows for refcount and altcount as preferred by kissDE package sample_list[[sample]] = alleles %>% select(chr, loc, ref,alt, refcount, altcount) %>% pivot_longer(c(refcount, altcount),names_to="type", values_to=sample) %>% mutate(eventsName = paste0("chr",chr,"_",loc,"_",ref,">",alt)) %>% select(eventsName,type, sample) } #merge all counts across samples counts = sample_list %>% purrr::reduce(full_join, by=c("eventsName", "type")) %>% mutate("eventsLength" = 1) %>% replace(is.na(.), 0) %>% select(-type) %>% relocate(eventsLength, .after=eventsName) %>% as.data.frame() #sum refs and alts together for each snp to use as filter print(glue("Filtering SNP count data at a minimum of {mincounts} summed across samples")) summed = counts %>% group_by(eventsName, eventsLength) %>% summarise_all(sum) #get rowsums of counts, to use as filter rowcounts = summed %>% ungroup %>% select(-c(eventsName, eventsLength)) %>% rowSums() summed = summed[rowcounts >= mincounts,] print(glue("Retaining {nrow(summed)} SNPs, out of {length(rowcounts)}")) #remove snps that have a count across samples less than mincounts(100) counts = counts[counts$eventsName %in% summed$eventsName,] conditionsde = c(rep(control, nrepscontrol), rep(case, nrepscase)) #rename counts columns for kissde #colnames(counts) = colnames(counts) %>% # str_replace("1", "rep1") %>% # str_replace("2", "rep2") %>% # str_replace("3", "rep3") #run kissde quality control print(glue("Running kissDE QC for {name}")) qualityControl(counts, conditionsde, storeFigs = glue("results/variantAnalysis/diffsnps/kissDEfigs_{name}")) #Run kissde algorithm to find differentially expressed variants de_Vars = diffExpressedVariants(counts, conditions = conditionsde) #parse results to more readable, filterable form results = de_Vars[[1]] %>% separate(ID, into = c("chrom", "pos", "REF>ALT"), sep="_") %>% mutate("pos" = as.numeric(pos), "chrom" = str_remove(chrom, "chr")) %>% arrange(chrom, pos) ##### intersect (data.table::foverlap) gff and results to get gene names and descriptions of snps ####### bed = results %>% select(chrom, pos, Adjusted_pvalue, `Deltaf/DeltaPSI`, `REF>ALT`) %>% mutate("chrom" = as.character(chrom), "start" = as.numeric(pos) -1, "end" = as.numeric(pos)) %>% select(chrom, start, end, Adjusted_pvalue, `Deltaf/DeltaPSI`,`REF>ALT`, -pos) %>% as.data.table() # Data table fast overlaps, useful to find intersections and join setkey(gff, chrom, start, end) de_variants = foverlaps(bed, gff, by.x=c("chrom", "start", "end"), by.y=c("chrom", "start", "end"), type = "within", nomatch = 0L) %>% dplyr::rename("GeneID" = "ID") # Write to file print(glue("Writing {name} results to results/variantAnalysis/diffsnps/")) results %>% fwrite(., glue("results/variantAnalysis/diffsnps/{name}.normcounts.tsv"), sep="\t", row.names=FALSE) de_variants = left_join(de_variants, gene_names) %>% distinct() fwrite(de_variants, glue("results/variantAnalysis/diffsnps/{name}.kissDE.tsv"), sep="\t", row.names=FALSE) de_variants %>% filter(Adjusted_pvalue <= pval) %>% fwrite(., glue("results/variantAnalysis/diffsnps/{name}.sig.kissDE.tsv"), sep="\t", row.names=FALSE) } sessionInfo() |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | import sys sys.stderr = open(snakemake.log[0], "w") import numpy as np import pandas as pd import allel chroms = snakemake.params['chroms'] for chrom in chroms: vcf = allel.read_vcf(f"results/variantAnalysis/vcfs/annot.missense.{chrom}.vcf") pos = vcf['variants/POS'] pos1 = pos+1 data = {'chrom':chrom, 'start':pos, 'stop':pos1} bed = pd.DataFrame(data) bed.to_csv(f"resources/regions/missense.pos.{chrom}.bed", sep="\t", header=None, index=None) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ### Contribution of IR gene categories to overall gene expression library(tidyverse) library(data.table) library(glue) ##### define functions ###### round_df = function(df, digits) { #' This function rounds all the numeric columns of a data.frame nums = vapply(df, is.numeric, FUN.VALUE = logical(1)) df[,nums] = round(df[,nums], digits = digits) (df) } metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame() #samples = fread("config/samples.tsv", sep="\t") %>% as.data.frame() gene_names = fread("resources/gene_names.tsv", sep="\t")[1:14459,] %>% distinct() # Read in normalised counts and sum biological replicates counts = fread("results/quant/normcounts.tsv", sep="\t") %>% column_to_rownames("GeneID") counts = data.frame(t(rowsum(t(counts), group = metadata$treatment, na.rm = T))) apply(counts, 2, var) colSums(counts) # Get total counts for each treatment totalCounts = colSums(counts) # Get lists of P450s, COEs and GSTs based on gene description p450genes = gene_names[str_detect(gene_names$Gene_description, "P450"),] %>% select(Gene_stable_ID, Gene_name) p450genes = p450genes[!p450genes$Gene_name %in% c("CYP4G16", "CYP4G17")] %>% distinct() # remove these genes from P450 column as I have added to cuticular coegenes = gene_names[str_detect(gene_names$Gene_description, "carboxylesterase"),] %>% select(Gene_stable_ID) coegenes = bind_rows(coegenes, data.frame("Gene_stable_ID" = "AGAP006227")) gstgenes = gene_names[str_detect(gene_names$Gene_description, c("glutathione S"))] %>% select(Gene_stable_ID, Gene_name, Gene_description) genegrps = list() for (file in list.files("resources/annotation/")){ group = str_remove(file, ".txt") genegrps[[group]] = fread(glue("resources/annotation/{file}"), sep="\t") %>% rename("Gene_stable_ID" = "Gene stable ID") %>% select(`Gene_stable_ID`) %>% distinct() } # Add manually searched groups genegrps[["P450s"]] = p450genes genegrps[["COEs"]] = coegenes genegrps[["GSTs"]] = gstgenes # Loop through groups and calculate % contribution to total counts perc_contrib = list() for (group in names(genegrps)){ subcounts = counts %>% rownames_to_column("GeneID") %>% filter(GeneID %in% genegrps[[group]]$`Gene_stable_ID`) %>% column_to_rownames("GeneID") perc_contrib[[group]] = data.frame((colSums(subcounts) / totalCounts) *100) %>% rownames_to_column("sample") colnames(perc_contrib[[group]]) = c("sample", group) } # Join dfs perc_contrib_df = purrr::reduce(perc_contrib, merge) %>% round_df(3) # Loop through each category and plot % contribution to total counts for (group in names(genegrps)){ plt = ggplot(perc_contrib_df, aes_string(x="sample", y=glue("{group}"), fill="sample")) + geom_bar(stat="identity") + ggtitle(glue("% of read counts {group}")) + theme_light() pdf(glue("results/plots/percentageContribution_{group}.pdf")) print(plt) dev.off() } # Write to file fwrite(perc_contrib_df, "results/quant/percentageContributionGeneCategories.tsv", sep="\t") sessionInfo() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") ## Generate freebayes params ## # 1) make bed for parallelising freebayes library(dplyr) library(data.table) library(glue) # read inputs chroms = snakemake@params[['chroms']] chunks = snakemake@params[['chunks']] fai = fread(snakemake@input[['index']]) # select chroms we want, and start, end columns fai = fai[fai$V1 %in% chroms, c(1,2)] # for each chromsome for (chrom in chroms){ #subset index to desired chrom f = fai[fai$V1 == chrom] #get sequence of n chunks from 0 to length of chrom bedseq = round(seq(0, f$V2, length.out = chunks)) #for each chunk for (i in 1:(chunks-1)){ #write bed file, one for each chunk/interval, which will be passed as input to freebayes row = c(chrom, bedseq[i], bedseq[i+1]) data.frame(row) %>% t() %>% fwrite(., glue("resources/regions/genome.{chrom}.region.{i}.bed"), sep="\t", col.names = FALSE) } } # 2) Make bamlist and populations.tsv file metadata = fread(snakemake@params[['metadata']], sep="\t") metadata$bams = paste0("results/alignments/", metadata$sampleID,".split.bq.bam") metadata %>% select(bams, strain) %>% fwrite(., snakemake@output[['pops']], sep="\t", row.names = FALSE, col.names = FALSE) metadata %>% select(bams) %>% fwrite(.,snakemake@output[['bamlist']], sep="\t", row.names = FALSE, col.names = FALSE) sessionInfo() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' This script runs GSEA analysis on each contrast in the workflow #' on gene lists, ranked using fold-change from DE data, Fst, PBS #' (optional) and differential SNPs analysis # GSEA RNA-Seq Ag library(GO.db) library(gage) library(fgsea) library(data.table) library(glue) library(tidyverse) ## functions ## runEnrich = function(rankedList, GeneSetList, outName){ for (set in c("kegg", "GO")){ # pathway enrichment with fgsea for both kegg and GO terms fgseaRes = fgsea(pathways = GeneSetList[[set]], stats = na.omit(rankedList), minSize = 15, maxSize = 500, nperm=10000) fgseaRes %>% arrange(padj) %>% fwrite(., file = glue("results/gsea/{outName}.{set}.tsv"), sep = "\t") topPathwaysUp = fgseaRes[ES > 0][head(order(pval), n=10), pathway] topPathwaysDown = fgseaRes[ES < 0][head(order(pval), n=10), pathway] topPathways = c(topPathwaysUp, rev(topPathwaysDown)) pdf(glue("results/gsea/{outName}.{set}.fgsea.pdf")) plotGseaTable(GeneSetList[[set]][topPathways], na.omit(rankedList), fgseaRes) null = dev.off() print(glue("{set} complete")) } } ###### configuration - metadata and parameters ###### metadata = fread(snakemake@input[['metadata']]) %>% as.data.frame() comparisons = data.frame("contrast" = snakemake@params[['DEcontrasts']]) gaffile = snakemake@input[['gaf']] variantCalling = snakemake@params[['VariantCalling']] pbs = snakemake@params[['pbs']] diffsnps = snakemake@params[['diffsnps']] pbscomps = snakemake@params[['pbscomps']] replaceString = snakemake@params[['replaceStringKegg']] speciesID = snakemake@params[['KeggSpeciesID']] GeneSetList = list() ######### GO Terms ######### # read in gaf file and select to appropriate columns and rename go = fread(gaffile, sep = "\t", skip = 1, header = FALSE) %>% as_tibble() %>% dplyr::select(c(2,5)) %>% distinct() %>% dplyr::rename("GeneID" = V2, "GO.id" = V5) # download GO terms and descriptions as we need descriptions goterms = Term(GOTERM) %>% enframe() %>% dplyr::rename("GO.id" = "name", "description" = "value") # join with our go terms go = inner_join(go, goterms) %>% dplyr::select("description", "GeneID") # turn into a large list of terms and all the genes with that term GOlist = go %>% pivot_wider(names_from="description", values_from="GeneID") %>% transpose() GeneSetList[["GO"]] = GOlist[[1]] #### KEGG Pathways #### kg.geneset = kegg.gsets(speciesID) GeneSetList[['kegg']] = kg.geneset$kg.sets[kg.geneset$sigmet.idx] # optional str_replace due to strange KEGG naming if (!is.null(replaceString)){ GeneSetList[['kegg']] = lapply(GeneSetList[['kegg']], str_remove, replaceString) } ##### gene DE #### for (comp in comparisons$contrast){ print(glue("Running KEGG and GO enrichment analyses for {comp}")) # make ranked list using DE results, rank based on log2foldchange rank = fread(glue("results/genediff/{comp}.csv")) %>% filter(FC > 1) %>% arrange("padj") %>% dplyr::select(c('GeneID', all_of("padj"))) %>% deframe() runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("genediff/{comp}.DE")) } ### Fst #### if (variantCalling == TRUE){ for (comp in comparisons$contrast){ print(glue("Running KEGG and GO enrichment analyses for Fst {comp}")) # make ranked list using DE results, rank based on log2foldchange rank = fread("results/variantAnalysis/selection/FstPerGene.tsv") %>% distinct() %>% dplyr::select(GeneID, glue("{comp}_zFst")) %>% deframe() rank = rank %>% sort(decreasing = TRUE) runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("/fst/{comp}.FST")) } } ### PBS #### if (pbs == TRUE){ for (comp in pbscomps){ print(glue("Running KEGG and GO enrichment analyses for PBS {comp}")) # make ranked list using DE results, rank based on log2foldchange rank = fread("results/variantAnalysis/selection/PbsPerGene.tsv") %>% distinct() %>% drop_na() %>% dplyr::select(GeneID, glue("{comp}PBS")) %>% deframe() rank = rank %>% abs() %>% sort(decreasing = TRUE) runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("/pbs/{comp}.PBS")) } } #### diff SNPs #### if (diffsnps == TRUE){ for (comp in comparisons$contrast){ print(glue("Running KEGG and GO enrichment analyses for diffSNPs {comp}")) # make ranked list using DE results, rank based on log2foldchange rank = fread(glue("results/variantAnalysis/diffsnps/{comp}.kissDE.tsv"), sep="\t") %>% arrange("Deltaf/DeltaPSI") %>% dplyr::select(c('GeneID', all_of("Deltaf/DeltaPSI"))) %>% deframe() runEnrich(rankedList = rank, GeneSetList = GeneSetList, outName = glue("/diffsnps/{comp}.diffsnps")) } } sessionInfo() |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | import sys sys.stderr = open(snakemake.log[0], "w") import rnaseqpoptools as rnaseqpop import pandas as pd import matplotlib.pyplot as plt # Read in parameters from snakemake ploidy = snakemake.params['ploidy'] invs = snakemake.params['inversions'] metadata = pd.read_csv(snakemake.params['metadata'], sep="\t") metadata = metadata.sort_values(by='species') dataset = snakemake.params['dataset'] karyo = {} for inv in invs: df = pd.read_csv(f"results/karyotype/{inv}.{dataset}.karyo.txt", sep="\s+", header=None) df = df.rename(columns={0:'sampleID', 1:'KaryoScore', 2:'n_SNPtags'}) df[inv] = df['KaryoScore']/ploidy df.rename(columns={inv:f'{inv} frequency'}).to_csv(f"results/karyotype/{inv}.{dataset}.karyo.txt", sep="\t") karyo[inv] = df[['sampleID', inv]] # concat all the dfs in the dict and remove duplicate cols karyo = pd.concat(karyo.values(), axis=1).T.drop_duplicates().T.set_index("sampleID") ## transpose and round to 2 decimals karyo = karyo.T.astype("float64").round(2) rnaseqpop.plotRectangular(karyo, path="results/karyotype/karyoFreqs.svg" , cmap='mako_r', ylab='Inversion', figsize=[10,5]) # Produce for average karyos per treatment df = karyo.T.reset_index() df = df.merge(metadata[['sampleID', 'treatment']]) df = df.groupby("treatment").agg('mean').T.astype("float64").round(2) rnaseqpop.plotRectangular(df, path="results/karyotype/karyoOverallFreqs.svg", ylab='Inversion', cbar=False, figsize=[8,4]) |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | import sys sys.stderr = open(snakemake.log[0], "w") import rnaseqpoptools as rnaseqpop import pandas as pd import numpy as np import allel from adjustText import adjust_text # Read in parameters from snakemake dataset = snakemake.params['dataset'] metadata = pd.read_csv(snakemake.input['metadata'], sep="\t") metadata = metadata.sort_values(by='species') chroms = snakemake.params['chroms'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) qualflt = snakemake.params['qualflt'] missingprop = snakemake.params['missingprop'] for i, chrom in enumerate(chroms): # Read in and Filter VCF path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=qualflt, missingfltprop=missingprop) #### Principal Components Analysis (PCA) #### # Set up dict to store indices for colours d={} for name, inds in subpops.items(): for n in range(len(inds)): p = inds[n] d[p] = name # Store dict as a dataframe and get colours treatment_indices = pd.DataFrame.from_dict(d, orient='index').reset_index() treatment_indices = treatment_indices.rename(columns = {'index':'sample_index', 0:"name"}) pop_colours = rnaseqpop.get_colour_dict(treatment_indices['name'], "viridis") # Run PCA function defined in tools.py print(f"Performing PCA on {dataset} chromosome {chrom}") rnaseqpop.pca(geno, chrom, ploidy, dataset, populations, metadata, pop_colours, prune=True, scaler=None) |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 | import sys sys.stderr = open(snakemake.log[0], "w") import rnaseqpoptools as rnaseqpop import pandas as pd import numpy as np import allel from scipy import stats from functools import partial, reduce import warnings warnings.filterwarnings('ignore') # suppress numpy runtime warnings, this is a bit dangerous, should be removed for release or resolve source of warnings # snakemake inputs and params dataset = snakemake.params['dataset'] metadata_path = snakemake.input['metadata'] metadata = pd.read_csv(metadata_path, sep="\s+") gffpath = snakemake.input['gff'] pbs = snakemake.params['pbs'] pbscomps = snakemake.params['pbscomps'] chroms = snakemake.params['chroms'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) missingprop = snakemake.params['missingprop'] # gff features = allel.gff3_to_dataframe(gffpath, attributes=["ID", "description"]) gff = features[features.type == 'gene'] # gene names file, rename gene_names = pd.read_csv(snakemake.input['geneNames'], sep="\t").drop(columns=['TranscriptID']).drop_duplicates() ### main #### # Read in list of contrasts comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast']) comparisons = comparisons.contrast.str.split("_", expand=True) comparisons.columns = ['sus', 'res'] comparisons = [list(row) for i,row in comparisons.iterrows()] print(f"The pairwise comparisons for Fst are {comparisons}") fstbychrom={} if pbs: pbsbychrom={} tajdbychrom={} gdivbychrom = {} dxybychrom = {} for chrom in chroms: # path to vcf path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" # function to read in vcfs and associated SNP data vcf, geno, acsubpops, pos, depth, snpeff, subpops, pops = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=30, missingfltprop=missingprop) # subset gff to appropriate chrom genes = gff[gff.seqid == chrom].sort_values('start').reset_index(drop=True) ### Average Fst, pbs, tajima d for each gene fst_per_comp = {} fst_per_gene = {} pbs_per_gene = {} pbs_per_comp = {} tajd_per_pop = {} tajd_per_gene = {} gdiv_per_pop = {} gdiv_per_gene = {} se_per_comp = {} se_per_gene = {} dxy_per_comp = {} dxy_per_gene = {} pos_dict = {} n_dict = {} # loop through each gene and calculate fst, pbs, tajimas d, or sequence diversity for each comparison for i, gene in genes.iterrows(): ID = gene.ID # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False) nsnps = gene_bool.sum() # if there are less than 3 snps in this gene then skip to next in loop if nsnps < 2: continue # store number of snps per gene n_dict[ID] = nsnps # store midpoint positions of gene pos_dict[ID] = (gene['start'] + gene['end'])/2 # fst and dxy per gene between each comparison for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 ac1 = acsubpops[comp1].compress(gene_bool, axis=0) ac2 = acsubpops[comp2].compress(gene_bool, axis=0) fst_per_comp[name], se_per_comp[name],_,_= allel.average_hudson_fst(ac1, ac2, blen=1) dxy_per_comp[name] = allel.sequence_divergence(pos[gene_bool], ac1, ac2) # tajimas d and sequence diversity per gene for each subpop(i.e treatment) for subpop in subpops: ac = acsubpops[subpop].compress(gene_bool) genepos = pos[gene_bool] tajd_per_pop[subpop] = allel.tajima_d(ac=ac, pos=genepos) gdiv_per_pop[subpop] = allel.sequence_diversity(ac=ac, pos=genepos) # pbs for each gene for each pbc comparison as defined in config.yaml if pbs is True: for pbscomp in pbscomps: pop1, pop2, outpop = pbscomp.split("_") pbs_per_comp[pbscomp],se,_,_ = rnaseqpop.meanPBS(acsubpops[pop1].compress(gene_bool, axis=0), acsubpops[pop2].compress(gene_bool, axis=0), acsubpops[outpop].compress(gene_bool, axis=0), window_size=1, normalise=True) # store inner dict in outer dicts fst_per_gene[ID] = dict(fst_per_comp) se_per_gene[ID] = dict(se_per_comp) if pbs is True : pbs_per_gene[ID] = dict(pbs_per_comp) tajd_per_gene[ID] = dict(tajd_per_pop) gdiv_per_gene[ID] = dict(gdiv_per_pop) dxy_per_gene[ID] = dict(dxy_per_comp) #reverse the dicts so the comparisons/subpops are on the outer dict fst_per_gene = rnaseqpop.flip_dict(fst_per_gene) se_per_gene = rnaseqpop.flip_dict(se_per_gene) if pbs is True : pbs_per_gene = rnaseqpop.flip_dict(pbs_per_gene) tajd_per_gene = rnaseqpop.flip_dict(tajd_per_gene) gdiv_per_gene = rnaseqpop.flip_dict(gdiv_per_gene) dxy_per_gene = rnaseqpop.flip_dict(dxy_per_gene) print(f"Chromosome {chrom} complete...\n") for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 a = np.array(list(fst_per_gene[name].values())) print(f"Overall Fst (averaged across genes) for chromosome {chrom} between {name} is {np.nanmean(a)}") # Make dataframe of number of snps per gene (that pass quality and missingness filters) ndf = pd.DataFrame.from_dict(n_dict, orient='index').reset_index(drop=False) ndf.columns = ['GeneID', 'nSNPs'] # Make dataframe of midpoints of each gene posdf = pd.DataFrame.from_dict(pos_dict, orient='index').reset_index(drop=False) posdf.columns = ['GeneID', 'Gene_midpoint'] # Make dataframe of fst for each comparison fst_dfs = {} se_dfs = {} dxy_dfs = {} for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 fst_df = pd.DataFrame.from_dict(fst_per_gene[name], orient='index').reset_index(drop=False) fst_df.columns = ['GeneID', (name + '_zFst')] fst_dfs[name] = fst_df se_df = pd.DataFrame.from_dict(se_per_gene[name], orient='index').reset_index(drop=False) se_df.columns = ['GeneID', (name + '_SE')] se_dfs[name] = se_df dxy_df = pd.DataFrame.from_dict(dxy_per_gene[name], orient='index').reset_index(drop=False) dxy_df.columns = ['GeneID', (name + '_dxy')] dxy_dfs[name] = dxy_df my_reduce = partial(pd.merge, on='GeneID', how='outer') fst_allcomparisons = reduce(my_reduce, fst_dfs.values()) se_allcomparisons = reduce(my_reduce, se_dfs.values()) fst_allcomparisons = reduce(my_reduce, [fst_allcomparisons, se_allcomparisons]) fst_allcomparisons['chrom'] = chrom dxy_allcomparisons = reduce(my_reduce, dxy_dfs.values()) dxy_allcomparisons['chrom'] = chrom tajd_dfs = {} gdiv_dfs = {} # Store sequence diversity and tajimas d for each gene and each subpop for subpop in subpops: tajd_df = pd.DataFrame.from_dict(tajd_per_gene[subpop], orient='index').reset_index(drop=False) tajd_df.columns = ['GeneID', (subpop+"_Tajima_d")] tajd_dfs[subpop] = tajd_df gdiv_df = pd.DataFrame.from_dict(gdiv_per_gene[subpop], orient='index').reset_index(drop=False) gdiv_df.columns = ['GeneID', (subpop+"_SeqDiv")] gdiv_dfs[subpop] = gdiv_df # Combine tajimas d and sequence diversity for each sample tajdall = reduce(my_reduce, tajd_dfs.values()) gdivall = reduce(my_reduce, gdiv_dfs.values()) tajdall['chrom'] = chrom gdivall['chrom'] = chrom if pbs is True: # PBS store as dataframes pbs_dfs = {} for pbscomp in pbscomps: pbs_df = pd.DataFrame.from_dict(pbs_per_gene[pbscomp], orient='index').reset_index(drop=False) pbs_df.columns = ['GeneID', (pbscomp+"PBS")] pbs_dfs[pbscomp] = pbs_df pbs_allcomparisons = reduce(my_reduce, pbs_dfs.values()) pbs_allcomparisons['chrom'] = chrom fstbychrom[chrom] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'], how='inner'), [fst_allcomparisons, gene_names, ndf, posdf]) tajdbychrom[chrom] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'], how='inner'), [tajdall, gene_names, ndf,posdf]) gdivbychrom[chrom] = reduce(lambda left, right: pd.merge(left,right, on=['GeneID'], how='inner'), [gdivall, gene_names, ndf, posdf]) dxybychrom[chrom] = reduce(lambda left, right: pd.merge(left, right, on=['GeneID'], how='inner'), [dxy_allcomparisons, gene_names, ndf, posdf]) if pbs is True: pbsbychrom[chrom] = reduce(lambda left,right: pd.merge(left,right,on=['GeneID'], how='inner'), [pbs_allcomparisons, gene_names, ndf, posdf]) ## Concatenate chromosome dfs to one big data frame and remove duplicates fstall = pd.concat(fstbychrom.values(), ignore_index=True).drop_duplicates() tajdall = pd.concat(tajdbychrom.values(), ignore_index=True).drop_duplicates() gdivall = pd.concat(gdivbychrom.values(), ignore_index=True).drop_duplicates() dxyall = pd.concat(dxybychrom.values(), ignore_index=True).drop_duplicates() # Write to csv fstall.to_csv(f"results/variantAnalysis/selection/FstPerGene.tsv", index=False, sep="\t") tajdall.to_csv(f"results/variantAnalysis/selection/TajimasDPerGene.tsv", index=False, sep="\t") gdivall.to_csv(f"results/variantAnalysis/diversity/SequenceDivPerGene.tsv", index=False, sep="\t") dxyall.to_csv(f"results/variantAnalysis/diversity/DxyPerGene.tsv", index=False, sep="\t") if pbs is True: pbsall = pd.concat(pbsbychrom.values(), ignore_index=True).drop_duplicates() pbsall.to_csv(f"results/variantAnalysis/selection/PbsPerGene.tsv", index=False, sep="\t") |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' Finds genes that are differentially expressed in the same direction #' Across two DE comparisons. For example, a gene may be significantly #' upregulated in the control v intermediate phenotype comparison. #' And also in the intermediate phenotype vs full phenotype comparison. library(data.table) library(tidyverse) library(glue) # read metadata and get contrasts comps = snakemake@params['comps'][[1]] padj_threshold = snakemake@params['pval'] upper_fc = snakemake@params['fc'] lower_fc = 1/as.numeric(upper_fc) # if someone wants a FC threshold of 2, need to have lower threshold of 0.5. for (cont in comps){ sus = str_split(cont, "_")[[1]][1] # get control/susceptible name (such as Kisumu) intermediate = str_split(cont, "_")[[1]][2] # get intermediate res = str_split(cont, "_")[[1]][3] # get last of string, which is resistant/case #### Gene diff #### one = fread(glue("results/genediff/{intermediate}_{res}.csv")) up1 = one %>% filter(FC > upper_fc, padj < padj_threshold) down1 = one %>% filter(FC < lower_fc, padj < padj_threshold) two = fread(glue("results/genediff/{sus}_{intermediate}.csv")) up2 = two %>% filter(FC > upper_fc, padj < padj_threshold) down2 = two %>% filter(FC < lower_fc, padj < padj_threshold) intersectdown = inner_join(down1, down2, by="GeneID", suffix=c("_field", "_lab")) intersectup = inner_join(up1, up2, by="GeneID", suffix=c("_field", "_lab")) fwrite(intersectup, glue("results/genediff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t") fwrite(intersectdown, glue("results/genediff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t") #### Isoforms #### one = fread(glue("results/isoformdiff/{intermediate}_{res}.csv")) up1 = one %>% filter(FC > upper_fc, qval < padj_threshold) down1 = one %>% filter(FC < lower_fc, qval < padj_threshold) two = fread(glue("results/isoformdiff/{sus}_{intermediate}.csv")) up2 = two %>% filter(FC > upper_fc, qval < padj_threshold) down2 = two %>% filter(FC < lower_fc, qval < padj_threshold) intersectdown = inner_join(down1, down2, by="TranscriptID", suffix=c("_field", "_lab")) intersectup = inner_join(up1, up2, by="TranscriptID", suffix=c("_field", "_lab")) fwrite(intersectup, glue("results/isoformdiff/{sus}_{intermediate}_{res}.up.progressive.tsv"), sep="\t") fwrite(intersectdown, glue("results/isoformdiff/{sus}_{intermediate}_{res}.down.progressive.tsv"), sep="\t") } sessionInfo() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") #' This script uses the sleuth R package to perform differential isoform analysis #' This will be performed for each #' Sleuth uses the bootstraps of kallisto to estimate uncertainty ## Differential expression library(tidyverse) library(sleuth) library(data.table) library(ggrepel) library(openxlsx) library(glue) library(RColorBrewer) library(EnhancedVolcano) print("------------- Kallisto - Sleuth - RNASeq isoform Differential expression ---------") #### read data #### metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame() %>% dplyr::rename('sample' = "sampleID") #add path column for sleuth object metadata$path = paste0("results/quant/", metadata$sample) #read metadata and get contrasts gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% select(-GeneID) #contrasts contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']]) contrasts = contrastsdf$contrast ######### subset data and run DESeq for each combo we need, store in xlsx ######## results_list = list() names_list = list() #loop through each chosen contrast and perform DE for (cont in contrasts){ control = str_split(cont, "_")[[1]][1] #get first of string, which is control (or susceptible) case = str_split(cont, "_")[[1]][2] #get second of string, which is case (or resistant) controls = which(metadata$treatment %in% control) #get indices of case/control cases = which(metadata$treatment %in% case) ##subset to subcounts of our contrast subsamples = metadata[c(controls, cases),] #make treatment a factor with the 'susceptible' as reference subsamples$treatment = as.factor(as.character(subsamples$treatment)) subsamples$treatment = relevel(subsamples$treatment, control) print(glue("Running Sleuth differential isoform expression analysis on {cont}")) so = sleuth_prep(subsamples, extra_bootstrap_summary = TRUE, transformation_function = function(x) log2(x + 0.5)) so = sleuth_fit(so, ~treatment, 'full') #fit reduced model necessary for isoform analysis so = sleuth_fit(so, ~1, 'reduced') #fit likelihood ratio test so = sleuth_wt(so, which_beta = paste0("treatment",case)) results = sleuth_results(so, test =paste0("treatment",case), 'reduced:full', show_all = FALSE) %>% dplyr::rename("TranscriptID" = "target_id") %>% dplyr::mutate("FC" = (2^b)) # Join DE results with normal gene names results = unique(left_join(results, gene_names)) fwrite(results, glue("results/isoformdiff/{cont}.csv")) # write to csv # Store names of contrasts and results tables for xlsx report sheets results_list[[cont]] = results names_list[[cont]] = cont # volcano plot for each comparison, using EnhancedVolcano. First make vector of labels which is AGAPs unless a gene name exists labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ TranscriptID, is.na(GeneName) ~ TranscriptID, TRUE ~ GeneName)) %>% select(Gene_name) %>% deframe() pdf(glue("results/isoformdiff/Volcano_plot_{cont}.pdf")) print(EnhancedVolcano(results_list[[cont]], lab=labels, x='b', y='pval', title = cont)) garbage = dev.off() print(glue("{cont} complete!")) } #### write to excel file on diff sheets #### sheets = unlist(names_list) wb <- createWorkbook("Workbook") for (i in 1:length(sheets)){ addWorksheet(wb, glue("{sheets[[i]]}")) writeData(wb, sheets[i], results_list[[i]], rowNames = FALSE, colNames = TRUE) } #### save workbook to disk once all worksheets and data have been added #### saveWorkbook(wb,file=snakemake@output[['xlsx']], overwrite = TRUE) sessionInfo() |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import numpy as np import allel import rnaseqpoptools as rnaseqpop # Read in parameters from snakemake dataset = snakemake.params['dataset'] metadata = pd.read_csv(snakemake.input['metadata'], sep="\t") metadata = metadata.sort_values(by='species') chroms = snakemake.params['chroms'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) qualflt = snakemake.params['qualflt'] missingprop = snakemake.params['missingprop'] gffpath = snakemake.input['gff'] # load gff features = allel.gff3_to_dataframe(gffpath, attributes=["ID", "Parent", "description"]) # define functions def isnotmissing(gn): """ Function to check if missing values are present at a SNP """ return((gn != -1).all()) #initialise dicts snpsPerGff = {} total_snps_per_chrom = {} snps_per_gene_allchroms = {} snpeffdict = {} seqdivdictchrom = {} thetadictchrom = {} coefdictchrom= {} for i, chrom in enumerate(chroms): # Read in and Filter VCF path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=qualflt, missingfltprop=missingprop) # Store total SNPs per chromosome # And summaries from SnpEff total_snps_per_chrom[chrom] = geno.shape[0] snpeffdict[chrom] = snpeff[1].value_counts(normalize=True) # Plot SNP density rnaseqpop.plot_density(pos, window_size=100000, title=f"Variant Density | {dataset} | Chromosome {chrom}", path=f"results/variantAnalysis/diversity/{dataset}_SNPdensity_{chrom}.svg") ######## SNP counts per gene ######## # Subset GFF to appropriate chromosome gff = features.query("seqid == @chrom").sort_values('start').reset_index(drop=True) genes = gff.query("type == 'gene'") exons = features.query("type == 'exon'").reset_index(drop=True) ## Proportion SNPs per GFF feature snpsPerGff[chrom] = rnaseqpop.getSNPGffstats(gff, pos) snpsPerGff[chrom]['chromosome'] = chrom ## Calculate missing SNPs per sample, SNPs per gene etc snpsnotmissing = {} missing_array = {} snps_per_gene = {} snps_per_sample = {} # For each sample in metadata, compress the genotype array and check is data is missing for sample in metadata['sampleID']: bool_ = sample == populations gn = geno.compress(bool_, axis=1) res = map(isnotmissing, gn[:,0]) missing_array[sample] = np.fromiter(res, dtype=bool) snpsnotmissing[sample] = missing_array[sample].sum() # For each gene, find out how many SNPs per gene for i, gene in genes.iterrows(): # locate_ranges() to get a boolean, needed as locate_range() will throw errors if no snps found in gene gene_bool = pos.locate_ranges([gene['start']], [gene['end']], strict=False) for sample in metadata['sampleID']: presentSNPs = missing_array[sample].compress(gene_bool, axis=0) # subset to each gene snps_per_sample[sample] = presentSNPs.sum() snps_per_gene[gene['ID']] = dict(snps_per_sample) snps_per_gene_allchroms[chrom] = pd.DataFrame.from_dict(rnaseqpop.flip_dict(snps_per_gene)) snpsPerGff = pd.concat(snpsPerGff).reset_index().rename(columns={'level_1':'feature'}) snpsPerGff = snpsPerGff.groupby('feature').agg({'ReferenceGenomeProportion': 'mean', 'called':'sum', 'proportion': 'mean', 'total':'sum'}) snpsPerGff.to_csv(snakemake.output['snpsPerGenomicFeature'], sep="\t", index=True) # Total SNPs per chrom, all samples totalsnpsdf = pd.DataFrame.from_dict(total_snps_per_chrom, orient='index', columns=['Total_SNPs']) totalsnpsdf.to_csv(f"results/variantAnalysis/SNPstats/totalSNPs.tsv", sep="\t", index=True) # SNPcount per gene snpcountsdf = pd.concat(snps_per_gene_allchroms) snpcountsdf.to_csv("results/variantAnalysis/SNPstats/nSNPsPerGene.tsv", sep="\t", index=True) genesNoSNPs = pd.DataFrame((snpcountsdf == 0).sum(axis=0), columns=['Genes with zero SNPs']) genesNoSNPs.to_csv("results/variantAnalysis/SNPstats/nGenesZeroSNPs.tsv", sep="\t", index=True) # snpEff summary snpeffdf = pd.concat(snpeffdict) snpeffdf.to_csv("results/variantAnalysis/SNPstats/snpEffProportions.tsv", sep="\t", index=True) |
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import numpy as np import allel from collections import defaultdict import rnaseqpoptools as rnaseqpop # Read in parameters from snakemake dataset = snakemake.params['dataset'] metadata = pd.read_csv(snakemake.input['metadata'], sep="\t") metadata = metadata.sort_values(by='species') chroms = snakemake.params['chroms'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) qualflt = snakemake.params['qualflt'] missingprop = snakemake.params['missingprop'] #initialise dicts total_snps_per_chrom = {} snps_per_gene_allchroms = {} snpeffdict = {} seqdivdictchrom = {} thetadictchrom = {} coefdictchrom= {} for i, chrom in enumerate(chroms): # Read in and Filter VCF path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=qualflt, missingfltprop=missingprop) #### Genome-wide statistics (seqDiv, Wattersons Theta, LD, inbreeding coefficient) #### seqdivdict = {} thetadict = {} coefdict= {} allcoef = defaultdict(list) for pop in metadata['treatment'].unique(): # Sequence diversity seqdivdict[pop] = allel.sequence_diversity(pos, acsubpops[pop]) # Wattersons theta thetadict[pop] = allel.watterson_theta(pos, acsubpops[pop]) # Inbreeding coefficient if ploidy > 1: gn = geno.take(subpops[pop], axis=1) coef = allel.moving_statistic(gn, statistic=allel.inbreeding_coefficient, size=1000, step=100) coef = np.nanmean(coef, axis=1) coefdict[pop] = np.mean(coef) allcoef[pop].append(np.array(coef)) print(f"{pop} | {chrom} | Nucleotide Diversity (Pi) =", seqdivdict[pop]) print(f"{pop} | {chrom} | Wattersons Theta =", thetadict[pop]) if ploidy > 1: print(f"{pop} | {chrom} | Inbreeding Coef =", np.mean(coef), "\n") seqdivdictchrom[chrom] = dict(seqdivdict) thetadictchrom[chrom] = dict(thetadict) if ploidy > 1: coefdictchrom[chrom] = dict(coefdict) seqdivdictchrom= rnaseqpop.flip_dict(seqdivdictchrom) thetadictchrom = rnaseqpop.flip_dict(thetadictchrom) if ploidy > 1: coefdictchrom = rnaseqpop.flip_dict(coefdictchrom) # Get stats per chromosome and plot heatmap pidf = pd.DataFrame.from_dict(seqdivdictchrom) pidf.to_csv("results/variantAnalysis/diversity/SequenceDiversity.tsv", sep="\t", index=True) rnaseqpop.plotRectangular(pidf.round(4), path="results/variantAnalysis/diversity/piPerChrom.svg", ylab="Chromosome", xlab="Treatment", figsize=[5,5], title=r'$\pi$') thetadf = pd.DataFrame.from_dict(thetadictchrom) rnaseqpop.plotRectangular(thetadf.round(4), path="results/variantAnalysis/diversity/thetaPerChrom.svg", ylab="Chromosome", xlab="Treatment", figsize=[5,5], title=r'$\theta$') thetadf.to_csv("results/variantAnalysis/diversity/WattersonsTheta.tsv", sep="\t", index=True) thetamean = thetadf.apply(np.mean, axis=0).round(4) pimean = pidf.apply(np.mean, axis=0).round(4) summaryStats = pd.DataFrame({r'$\theta$':thetamean, r'$\pi$':pimean}) rnaseqpop.plotRectangular(summaryStats, path="results/variantAnalysis/diversity/pi_theta.overall.svg", ylab="", xlab="Statistic", figsize=[5,5], rotate=False) theta = pd.DataFrame(summaryStats.iloc[:,0]).round(4) pi = pd.DataFrame(summaryStats.iloc[:,1]).round(4) rnaseqpop.plotTwoRectangular(pi, True, theta, True, path="results/variantAnalysis/diversity/piThetaBoth.svg", cmap="Greys", figsize=[5,5], annotFontsize=20, ytickfontsize=12, ylab="") if ploidy > 1: pd.DataFrame.from_dict(coefdictchrom).to_csv("results/variantAnalysis/diversity/inbreedingCoef.tsv", sep="\t", index=True) # Get genome wide average stats if ploidy > 1: for pop in allcoef.keys(): allcoef[pop] = np.nanmean(allcoef[pop]) coefdf = pd.DataFrame.from_dict(allcoef, orient='index', columns=['InbreedingCoefficient']) coefdf.to_csv(f"results/variantAnalysis/diversity/inbreedingCoef.mean.tsv", sep="\t", index=True) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 | log <- file(snakemake@log[[1]], open="wt") sink(log) sink(log, type="message") library(tidyverse) library(data.table) library(glue) library(openxlsx) #### allele imbalance #### metadata = fread(snakemake@input[['metadata']], sep="\t") samples = metadata$sampleID # Read IR mutation data mutation_data = fread(snakemake@input[['mutations']], sep="\t") all_list = list() mean_list = list() # Loop through each mutation, finding allele coverage at each position for (m in mutation_data$Name){ base = mutation_data[mutation_data$Name == m]$ALT propstring = glue("proportion{base}") base2 = mutation_data[mutation_data$Name == m]$ALT2 #### add in if second alt propstring2 = glue("proportion{base2}") #### load allele balance data #### allele_list = list() # read data for each sample and subset to what we want for (sample in samples){ allele_list[[sample]] = fread(glue("results/variantAnalysis/variantsOfInterest/counts/{sample}_{m}_allele_counts.tsv"))[,c(1:8)] #for each sample read data, and store first 8 columns allele_list[[sample]]$sample = sample #add sample column allele_list[[sample]]$treatment = metadata$treatment[samples == sample] #add treatment column allele_list[[sample]]$mutation = m allele_list[[sample]]$gene = mutation_data[mutation_data$Name == m]$Gene cover = allele_list[[sample]] %>% select(A,C,G,T) %>% rowSums() allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring := (!!sym(base))/cover) #new column, proportion of Alts to total. if (!base2 %in% c("", NA)){ allele_list[[sample]] = allele_list[[sample]] %>% mutate(!!propstring2 := (!!sym(base2))/cover) #new column, proportion of Alts to total. } } # We have 24 separate dataframes in a list, bind them together into one big dataframe alleles = rbindlist(allele_list, fill = TRUE) # now lets calculate population means and lower and upper CIs alleles_per_pop_list = list() for (pop in unique(metadata$treatment)){ alleles_per_pop = alleles %>% filter(treatment == pop) pop_prop = sum(alleles_per_pop[, ..base])/ sum(alleles_per_pop[, 'cov']) error = sqrt((pop_prop*(1-pop_prop))/nrow(alleles_per_pop))*1.96 lower = pmax(pop_prop - error, 0) upper = pmin(pop_prop + error, 1) alleles_per_pop_list[[pop]] = alleles_per_pop %>% mutate(!!propstring := pop_prop, lowerCI = lower, upperCI = upper) # average across replicates if (!base2 %in% c("", NA)){ pop_prop2 = sum(alleles_per_pop[, ..base2])/ sum(alleles_per_pop[, 'cov']) error2 = sqrt((pop_prop*(1-pop_prop))/nrow(alleles_per_pop))*1.96 lower2 = pmax(pop_prop - error, 0) upper2 = pmin(pop_prop + error, 1) alleles_per_pop_list[[pop]] = alleles_per_pop_list[[pop]] %>% mutate(!!propstring2 := pop_prop2, lowerCI_2 = lower2, upperCI_2 = upper2) } } mean_alleles = rbindlist(alleles_per_pop_list, fill=TRUE) # average across replicates if (!base2 %in% c("", NA)){ mean_alleles = mean_alleles %>% group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI, !!sym(propstring2), lowerCI_2, upperCI_2) %>% summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>% select(chrom, pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring, !!propstring2, lowerCI_2, upperCI_2) } else { mean_alleles = mean_alleles %>% group_by(chrom, pos, ref, mutation, treatment, !!sym(propstring), lowerCI, upperCI) %>% summarise_at(.vars = c("cov","A","C","G","T"), .funs = c(mean="mean"), na.rm = TRUE) %>% select(chrom,pos, ref, mutation, treatment, cov_mean, A_mean, C_mean, G_mean, T_mean, lowerCI, upperCI, !!propstring) } #write to file, reorder mean_kdr_alleles fwrite(alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/{m}_alleleBalance.csv")) fwrite(mean_alleles, glue("results/variantAnalysis/variantsOfInterest/csvs/mean_{m}_alleleBalance.csv")) all_list[[m]] = alleles mean_list[[m]] = mean_alleles } #### write to excel file on diff sheets #### results_list = all_list sheets = unique(mutation_data$Name) wb <- createWorkbook("Workbook") for (i in 1:length(sheets)){ addWorksheet(wb, glue("{sheets[[i]]}")) writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE) } #### save workbook to disk once all worksheets and data have been added #### saveWorkbook(wb,file=snakemake@output[['alleleBalance']], overwrite = TRUE) ### mean balance #### #### write to excel file on diff sheets #### results_list = mean_list sheets = unique(mutation_data$Name) wb <- createWorkbook("Workbook") for (i in 1:length(sheets)){ addWorksheet(wb, glue("{sheets[[i]]}")) writeData(wb, sheets[i], results_list[[i]], rowNames = TRUE, colNames = TRUE) } #### save workbook to disk once all worksheets and data have been added #### saveWorkbook(wb,file=snakemake@output[['mean_alleleBalance']], overwrite = TRUE) sessionInfo() |
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import matplotlib import rnaseqpoptools as rnaseqpop ### Variants of Interest patjh ### voiPath = snakemake.input['VariantsOfInterest'] ## Read VOI data muts = pd.read_csv(voiPath, sep="\t") ## separate chrom and pos data and sort muts['chrom'] = muts['Location'].str.split(":").str.get(0) muts['pos'] = muts['Location'].str.split(":").str.get(1).str.split("-").str.get(0) muts = muts.sort_values(['chrom', 'pos']) ## Run for all samples df, annot = rnaseqpop.getAlleleFreqTable(muts, "results/variantAnalysis/variantsOfInterest/csvs/{mut}_alleleBalance.csv", var="sample") rnaseqpop.plotRectangular(df, annot=annot, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerSample.svg") rnaseqpop.plotRectangular(df, annot=annot, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerSample.pdf") ## Run for avarage frequencies across treatments df2, annot2 = rnaseqpop.getAlleleFreqTable(muts, "results/variantAnalysis/variantsOfInterest/csvs/mean_{mut}_alleleBalance.csv", var="treatment", mean_=True) rnaseqpop.plotRectangular(df2, annot=annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerTreatment.svg", xlab="strain") rnaseqpop.plotRectangular(df2, annot=annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapPerTreatment.pdf", xlab="strain") # Join both plots rnaseqpop.plotTwoRectangular(df, annot, df2, annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapBothPlots.svg", ratio='auto') rnaseqpop.plotTwoRectangular(df, annot, df2, annot2, path="results/variantAnalysis/variantsOfInterest/VOI.heatmapBothPlots.pdf", ratio='auto') |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 | import sys sys.stderr = open(snakemake.log[0], "w") import matplotlib matplotlib.use('agg') import matplotlib.pyplot as plt from matplotlib_venn import * import pandas as pd import numpy as np from pathlib import Path def plotvenn2(name, group1, group2, nboth,stat="DE_PBS", group1name='Significant up DE genes', group2name='High PBS'): print(f"There are {group2} high Fst genes in {name}") print(f"There are {nboth} shared in {name}") venn2(subsets = (group1, group2, nboth), set_labels = (group1name, group2name), set_colors=('r', 'g'), alpha = 0.5); venn2_circles(subsets = (group1, group2, nboth)) plt.title(f"{name}") plt.savefig(f"results/venn/{name}_{stat}.venn.png") plt.close() def intersect2(one, two, df, write=True, path=None): inter = [x for x in list(one.GeneID) if x in list(two.GeneID)] length = len(inter) intersected_df = df[df.GeneID.isin(inter)] intersected_df.to_csv(f"{path}", sep="\t") return(length, intersected_df) def add_columns_xlsx(name, de, fst, highfst, diffsnps, diffsnpsdf=None): rnaxlsx = pd.read_excel("results/genediff/RNA-Seq_diff.xlsx", sheet_name=name) highfst_bool = de.GeneID.isin(highfst.GeneID).astype(str) rnaxlsx['HighFst'] = highfst_bool if diffsnps: diffsnps_bool = de.GeneID.isin(diffsnpsdf.GeneID).astype(str) rnaxlsx['DiffSNPs'] = diffsnps_bool # add column of number of SNPs merged = pd.merge(de, fst, how="outer") rnaxlsx['nSNPs'] = merged['nSNPs'] return(rnaxlsx) #### Main #### # Read contrasts in and other snakemake params comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast']) comparisons = comparisons.contrast.str.split("_", expand=True) comparisons = [list(row) for i,row in comparisons.iterrows()] percentile = snakemake.params['percentile'] diffsnps = snakemake.params['diffsnps'] # Create a Pandas Excel writer using XlsxWriter as the engine. writer = pd.ExcelWriter('results/RNA-Seq-full.xlsx', engine='xlsxwriter') #### Differential expression v Fst venn diagram for comp1,comp2 in comparisons: name = comp1 + "_" + comp2 print(f"\n-------------- Venn Diagram for {name} --------------") de = pd.read_csv(f"results/genediff/{name}.csv") fst = pd.read_csv("results/variantAnalysis/selection/FstPerGene.tsv", sep="\t") #compare sig DE genes and top 5% fst genes? #get sig up and down diffexp genes sigde = de[de['padj'] < pval_threshold] sigde_up = sigde[sigde['FC'] > upper_fc] sigde_down = sigde[sigde['FC'] < lower_fc] #take top percentile of fst genes highfst = fst.nlargest(int(fst.shape[0]*percentile),f"{name}_zFst") #how many fst? how many sig de up and down? nfst = highfst.shape[0] nde_up = sigde_up.shape[0] nde_down = sigde_down.shape[0] print(f"There are {nde_up} significantly upregulated genes in {name}") print(f"There are {nde_down} significantly downregulated genes in {name}") nboth, _ = intersect2(sigde_up, highfst, de, write=True, path=f"results/venn/{name}.DE.Fst.intersection.tsv") ###### XLSX file ###### if diffsnps: diffsnpsDE = pd.read_csv("results/diffsnps/{name}.sig.kissDE.tsv", sep="\t") sheet = add_columns_xlsx(name, de, fst, highfst, diffsnps, diffsnpsDE) else: sheet = add_columns_xlsx(name, de, fst, highfst, diffsnps, diffsnpsDE=None) # Write each dataframe to a different worksheet. sheet.to_excel(writer, sheet_name=name) # Close the Pandas Excel writer and output the Excel file. writer.save() |
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 | import sys sys.stderr = open(snakemake.log[0], "w") import pandas as pd import numpy as np import allel import rnaseqpoptools as rnaseqpop dataset = snakemake.params['dataset'] metadata = pd.read_csv(snakemake.input['metadata'], sep="\t") metadata = metadata.sort_values(by='species') chroms = snakemake.params['chroms'] ploidy = snakemake.params['ploidy'] numbers = rnaseqpop.get_numbers_dict(ploidy) pbs = snakemake.params['pbs'] pbscomps = snakemake.params['pbscomps'] qualflt = snakemake.params['qualflt'] missingprop = snakemake.params['missingprop'] #Fst/PBS window size windowsizes = snakemake.params['windowsizes'] windowsteps = snakemake.params['windowsteps'] windownames = snakemake.params['windownames'] # Read in list of contrasts comparisons = pd.DataFrame(snakemake.params['DEcontrasts'], columns=['contrast']) comparisons = comparisons.contrast.str.split("_", expand=True) comparisons.columns = ['sus', 'res'] comparisons = [list(row) for i,row in comparisons.iterrows()] for i, chrom in enumerate(chroms): path = f"results/variantAnalysis/vcfs/{dataset}.{chrom}.vcf.gz" vcf, geno, acsubpops, pos, depth, snpeff, subpops, populations = rnaseqpop.readAndFilterVcf(path=path, chrom=chrom, samples=metadata, numbers=numbers, ploidy=ploidy, qualflt=qualflt, missingfltprop=missingprop) #### Fst in windows #### for sus, res in comparisons: name = sus + "_" + res cohortText = f"{sus} v {res}" print(f"Calculating Fst values in sliding windows for {name}\n") for wname, size, step in zip(windownames, windowsizes, windowsteps): FstArray = allel.moving_hudson_fst(acsubpops[sus], acsubpops[res], size=size, step=step) midpoint = allel.moving_statistic(pos, np.median, size=size, step=step) print(FstArray) print(midpoint) cohortNoSpaceText = name + "." + wname rnaseqpop.plotWindowed(statName="Fst", cohortText=cohortText, cohortNoSpaceText=cohortNoSpaceText, values=FstArray, midpoints=midpoint, colour='dodgerblue', prefix="results/variantAnalysis/selection/fst", chrom=chrom, ylim=0.5, save=True) #### Population Branch Statistic (PBS) in windows #### if pbs: for pbscomp in pbscomps: pop1, pop2, outpop = pbscomp.split("_") cohortText = f"(({pop1}, {pop2}), {outpop})" print(f"Calculating PBS values in sliding window for {pbscomp}\n") for wname, size, step in zip(windownames, windowsizes, windowsteps): pbsArray = allel.pbs(acsubpops[pop1], acsubpops[pop2], acsubpops[outpop], window_size=size, window_step=step, normed=True) midpoint = allel.moving_statistic(pos, np.median, size=size, step=step) cohortNoSpaceText = pbscomp + "." + wname rnaseqpop.plotWindowed(statName="PBS", cohortText=cohortText, cohortNoSpaceText=cohortNoSpaceText, values=pbsArray, midpoints=midpoint, colour='dodgerblue', prefix="results/variantAnalysis/selection/pbs", chrom=chrom, ylim=0.5, save=True) |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell shell("samtools index {snakemake.params} {snakemake.input[0]} {snakemake.output[0]}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Joël Simoneau" __copyright__ = "Copyright 2019, Joël Simoneau" __email__ = "simoneaujoel@gmail.com" __license__ = "MIT" from snakemake.shell import shell # Creating log log = snakemake.log_fmt_shell(stdout=True, stderr=True) # Placeholder for optional parameters extra = snakemake.params.get("extra", "") # Allowing for multiple FASTA files fasta = snakemake.input.get("fasta") assert fasta is not None, "input-> a FASTA-file is required" fasta = " ".join(fasta) if isinstance(fasta, list) else fasta shell( "kallisto index " # Tool "{extra} " # Optional parameters "--index={snakemake.output.index} " # Output file "{fasta} " # Input FASTA files "{log}" # Logging ) |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | __author__ = "Joël Simoneau" __copyright__ = "Copyright 2019, Joël Simoneau" __email__ = "simoneaujoel@gmail.com" __license__ = "MIT" from snakemake.shell import shell # Creating log log = snakemake.log_fmt_shell(stdout=True, stderr=True) # Placeholder for optional parameters extra = snakemake.params.get("extra", "") # Allowing for multiple FASTQ files fastq = snakemake.input.get("fastq") assert fastq is not None, "input-> a FASTQ-file is required" fastq = " ".join(fastq) if isinstance(fastq, list) else fastq shell( "kallisto quant " # Tool "{extra} " # Optional parameters "--threads={snakemake.threads} " # Number of threads "--index={snakemake.input.index} " # Input file "--output-dir={snakemake.output} " # Output directory "{fastq} " # Input FASTQ files "{log}" # Logging ) |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Christopher Preusch" __copyright__ = "Copyright 2017, Christopher Preusch" __email__ = "cpreusch[at]ust.hk" __license__ = "MIT" from snakemake.shell import shell shell("samtools flagstat {snakemake.input[0]} > {snakemake.output[0]}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path import re from tempfile import TemporaryDirectory from snakemake.shell import shell log = snakemake.log_fmt_shell(stdout=True, stderr=True) def basename_without_ext(file_path): """Returns basename of file path, without the file extension.""" base = path.basename(file_path) # Remove file extension(s) (similar to the internal fastqc approach) base = re.sub("\\.gz$", "", base) base = re.sub("\\.bz2$", "", base) base = re.sub("\\.txt$", "", base) base = re.sub("\\.fastq$", "", base) base = re.sub("\\.fq$", "", base) base = re.sub("\\.sam$", "", base) base = re.sub("\\.bam$", "", base) return base # Run fastqc, since there can be race conditions if multiple jobs # use the same fastqc dir, we create a temp dir. with TemporaryDirectory() as tempdir: shell( "fastqc {snakemake.params} -t {snakemake.threads} " "--outdir {tempdir:q} {snakemake.input[0]:q}" " {log}" ) # Move outputs into proper position. output_base = basename_without_ext(snakemake.input[0]) html_path = path.join(tempdir, output_base + "_fastqc.html") zip_path = path.join(tempdir, output_base + "_fastqc.zip") if snakemake.output.html != html_path: shell("mv {html_path:q} {snakemake.output.html:q}") if snakemake.output.zip != zip_path: shell("mv {zip_path:q} {snakemake.output.zip:q}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from os import path from snakemake.shell import shell input_dirs = set(path.dirname(fp) for fp in snakemake.input) output_dir = path.dirname(snakemake.output[0]) output_name = path.basename(snakemake.output[0]) log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "multiqc" " {snakemake.params}" " --force" " -o {output_dir}" " -n {output_name}" " {input_dirs}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 | __author__ = "Michael Chambers" __copyright__ = "Copyright 2019, Michael Chambers" __email__ = "greenkidneybean@gmail.com" __license__ = "MIT" from snakemake.shell import shell shell("samtools faidx {snakemake.params} {snakemake.input[0]} > {snakemake.output[0]}") |
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | __author__ = "Julian de Ruiter" __copyright__ = "Copyright 2017, Julian de Ruiter" __email__ = "julianderuiter@gmail.com" __license__ = "MIT" from snakemake.shell import shell n = len(snakemake.input) assert n == 2, "Input must contain 2 (paired-end) elements." extra = snakemake.params.get("extra", "") adapters = snakemake.params.get("adapters", "") log = snakemake.log_fmt_shell(stdout=False, stderr=True) assert ( extra != "" or adapters != "" ), "No options provided to cutadapt. Please use 'params: adapters=' or 'params: extra='." shell( "cutadapt" " {adapters}" " {extra}" " -o {snakemake.output.fastq1}" " -p {snakemake.output.fastq2}" " -j {snakemake.threads}" " {snakemake.input}" " > {snakemake.output.qc} {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | __author__ = "Johannes Köster" __copyright__ = "Copyright 2016, Johannes Köster" __email__ = "koester@jimmy.harvard.edu" __license__ = "MIT" from snakemake.shell import shell from snakemake_wrapper_utils.bcftools import get_bcftools_opts bcftools_opts = get_bcftools_opts(snakemake) extra = snakemake.params.get("extra", "") log = snakemake.log_fmt_shell(stdout=True, stderr=True) shell( "bcftools view {bcftools_opts} " "{extra} " "{snakemake.input[0]} " "-o {snakemake.output} " "{log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | __author__ = "Christopher Schröder" __copyright__ = "Copyright 2020, Christopher Schröder" __email__ = "christopher.schroeder@tu-dortmund.de" __license__ = "MIT" import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True, append=True) with tempfile.TemporaryDirectory() as tmpdir: shell( "gatk --java-options '{java_opts}' ApplyBQSR" " --input {snakemake.input.bam}" " --bqsr-recal-file {snakemake.input.recal_table}" " --reference {snakemake.input.ref}" " {extra}" " --tmp-dir {tmpdir}" " --output {snakemake.output.bam}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | __author__ = "Christopher Schröder" __copyright__ = "Copyright 2020, Christopher Schröder" __email__ = "christopher.schroeder@tu-dortmund.de" __license__ = "MIT" import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True) known = snakemake.input.get("known", "") if known: if isinstance(known, str): known = [known] known = list(map("--known-sites {}".format, known)) with tempfile.TemporaryDirectory() as tmpdir: shell( "gatk --java-options '{java_opts}' BaseRecalibrator" " --input {snakemake.input.bam}" " --reference {snakemake.input.ref}" " {known}" " {extra}" " --tmp-dir {tmpdir}" " --output {snakemake.output.recal_table}" " {log}" ) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | __author__ = "Jan Forster" __copyright__ = "Copyright 2019, Jan Forster" __email__ = "jan.forster@uk-essen.de" __license__ = "MIT" import os import tempfile from snakemake.shell import shell from snakemake_wrapper_utils.java import get_java_opts extra = snakemake.params.get("extra", "") java_opts = get_java_opts(snakemake) log = snakemake.log_fmt_shell(stdout=True, stderr=True) with tempfile.TemporaryDirectory() as tmpdir: shell( "gatk --java-options '{java_opts}' SplitNCigarReads" " --reference {snakemake.input.ref}" " --input {snakemake.input.bam}" " {extra}" " --tmp-dir {tmpdir}" " --output {snakemake.output[0]}" " {log}" ) |
Support
- Future updates
Related Workflows





