Behavioral Genetics in Drosophila: RNA-Seq analysis pipeline

public 1yr ago 0 bookmarks

View Workflow

volkanlab_behaviorgenetics — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Software used to analyze data for:

Deanhardt, B., Duan, Q., Du, C., Soeder, C., Morlote, A., Garg, D., Saha, A., Jones, C. D., & Volkan, P. C. (2023). Social experience and pheromone receptor activity reprogram gene expression in sensory neurons. G3 Genes|Genomes|Genetics , 00 (March), 2006–2021. https://doi.org/10.1093/g3journal/jkad072

Social experience and pheromone receptor activity reprogram behavioral switch and neuromodulatory gene expression in sensory neurons Bryson Deanhardt, Qichen Duan, Chengcheng Du, Charles Soeder, Alec Morlote, Deeya Garg, Corbin D. Jones, Pelin Cayirlioglu Volkan bioRxiv 2021.06.18.449021; doi: https://doi.org/10.1101/2021.06.18.449021 latest version: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v3

Changes in splicing and neuromodulatory gene expression programs in sensory neurons with pheromone signaling and social experience Deanhardt Bryson, Duan Qichen, Du Chengcheng, Soeder Charles, Jones Corbin, C. Volkan Pelin bioRxiv 2021.06.18.449021; doi: https://doi.org/10.1101/2021.06.18.449021 version 1: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v1 version 2: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v2

Analysis downstream of this workflow was performed by Quichen Duan.

Dependencies

The following were required to run the full workflow:

module load python/3.6.6 bedtools bedops samtools r/3.6.0 rstudio/1.1.453 bowtie sratoolkit subread

as well, the Snakefile includes rules with hard paths that will need changing:

fastp_clean_sample_pe	:	fastp
write_report	:	pandoc_path

Data

Not all the data in the config.yaml have been published; so full results summary won't run :( Intermediate results can be generated though.

Use

Deanhardt et al. 2021

The results summary from Deanhardt et al. 2021 can be generated: (untested!)

nohup time snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2021 --configfile config.Deanhardt2021.yaml &

This recursively generates the differential expression analysis from raw data, summarizes/vizualizes it in a PDF, and bundles/timestamps the results summary (".pdf")

The underlying differential expression and differential exon use data from Deanhardt et al. 2021 can be generated:

snakemake diff_expr/grpWtVs47b/grpWtVs47b.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVs67d/grpWtVs67d.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVsFru_smolFru/grpWtVsFru_smolFru.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVsMut/grpWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de 
snakemake diff_exon_use/dex_grpWtVs47b/grpWtVs47b.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVs67d/grpWtVs67d.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVsFru_smolFru/grpWtVsFru_smolFru.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVsMut/grpWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.M.de

Full Results Summary

The full results summary ("VolkanLab_BehaviorGenetics.05_Apr_2021.pdf") was generated simply by running

snakemake --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G "

Bypassing the Hard Parts

Users may wish to avoid the time and memory consuming steps of read downloading, cleaning, mapping, and counting. The raw counts underlying the results of Deanhardt et al. 2021 are stored in the NCBI GEO ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE179213 ); a shortcut script is included to download this data and prepare it such that it's where the Snakemake workflow anticipates. Currently only the "multi" alignment is available.

The shortcut script is run thus:

bash utils/shortcut.bash

Fruitless replicate #1 was problematic and not used in the final analysis; it is not included in the download by default. To include it, run:

bash utils/shortcut.bash -f

The main time bottleneck is the "rando" alignment, which downsamples but does not entirely remove multimapping reads. Alignement strategy ultimately had little impact on results, so users may wish to omit them by generating the "noRando" PDF (untested!)

nohup time snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2021 --configfile config.Deanhardt2021.yaml results/VolkanLabBehaviorGenetics.Deanhardt2021.noRando.pdf &

Deanhardt et al. 2023

In order to respond to reviewer comments, the population genetics of the experimental flies were compared to those of published DGRP lines sequenced in Huang et al. (2014). Calling the variants in parallel required some platform-dependent tweaks which will need to be modified; unless these population genetics are of specific interest it is probably best to use the 2021 workflow. However, the 5 May 2023 results summary can be rebuilt with the updated workflow:

snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2023 &

In particular, scripts/freebayes-parallel requires a hard path to the GNU parallel utility and to the vcfstreamsort utility from vcflib.

Code of Note

├── config.yaml
│ └── (configuration file for the pipeline)
├── dev
│ ├── ...
│ └── (files from earlier stages in project development, including an older FAIRE-seq experiment)
├── misc
│ ├── ...
│ └── (mostly metadata re: sequencing)
├── README.md
├── quichen_duan
│ ├── ...
│ └── (scripts for followup analyses from Quichen Duan)
├── scripts
│ ├── bam_summarizer.mapspliced.py
│ ├── collapsedSignalMerger.py
│ ├── correlate.R
│ ├── deSeqer.R
│ │ └── (differential expression testing script)
│ ├── dexSeqe.R
│ │ └── (differential exon use testing script)
│ ├── edgeHog.py
│ ├── expressionOmete.R
│ ├── faire_results.Rmd
│ ├── fastp_reporter.py
│ ├── geneOntologe.R
│ ├── multimap_spreader.py
│ ├── overlapSignalCollapser.py
│ ├── read_Hollower.py
│ ├── references.bib
│ ├── replicateCorrelations.png
│ └── RNAseq_results.Rmd
│ └── (summarize results)
├── Snakefile
│ └── (rules file for the pipeline)
├── utils
│ └── annotations
│ ├── ...
│ └── (includes lists of genes of interest)
└── VolkanLab_BehaviorGenetics.05_Apr_2021.pdf └── (pipeline walkthrough and results summary)

Code Snippets

args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast


library("DGCA")
library("dplyr")
library("yaml")
library("readr")
library("tidyr")
library("stringr")
library("magrittr")

xpr_in <- args[1]
nombre <- args[2]
cor_pre <- args[3]




#	build the sample DF
trammel <- read_yaml("config.yaml")
data_sets.df <- plyr::ldply(trammel$data_sets, data.frame)
data_sets.df$name <- as.factor(data_sets.df$name)
data_sets.df$day<- as.factor(data_sets.df$day)
data_sets.df$subgroups<- as.factor(data_sets.df$subgroups)
data_sets.df$rep<- as.factor(data_sets.df$rep)
data_sets.df$housing<- as.factor(data_sets.df$housing)
data_sets.df$genotype<- as.factor(data_sets.df$genotype)
data_sets.df$tissue<- as.factor(data_sets.df$tissue)

#	build the correlation experiment
corr.df <- plyr::ldply(trammel$correlations, data.frame) %>% filter(name == nombre )

# prep the sample metadata
sbgrp <- (corr.df %>% select(filt))[1,] %>% as.character
#coldata <- data_sets.df %>% as.data.frame() %>% ungroup()%>% filter(subgroups == sbgrp)
#coldata <- data_sets.df %>% ungroup() %>% group_by(name) %>%  mutate(genotype = paste0(as.character(genotype), collapse = "," ))  %>% filter(subgroups == sbgrp)
coldata <- data_sets.df%>% group_by(name) %>%  mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup()  %>% filter(subgroups == sbgrp)  %>% as.data.frame() 

rownames(coldata) <- coldata$name
dsgn.df <- coldata %>% select(name, corr.df$vars %>% as.character() )%>% mutate(present =1) %>% spread(key=genotype, value=present, fill=0) 
rownames(dsgn.df) <- dsgn.df$name
dsgn.df %<>% select(-c("name"))
dsgn.mat <- dsgn.df %>% as.matrix()




#xpr_in <- "expression/rotund.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.rpkm"

#xpr_in <- "expression/rotund.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.rpkm"


#load and format the count data
xpr.df <- read_delim(xpr_in, "\t", escape_double = FALSE, trim_ws = TRUE)
xpr.df %<>% as.data.frame()
rownames(xpr.df) <- xpr.df$Geneid
xpr.df %<>% select(rownames(dsgn.df)) 

ddcor_res <- ddcorAll(inputMat = xpr.df, design = dsgn.mat, compare = c("wt", "rn"),adjust = "none", heatmapPlot = FALSE, nPerm = 0)

#Write to TSV
write_delim(ddcor_res, paste(cor_pre, "corr","tsv", sep="."), delim = "\t")

R dplyr tidyr stringr readr magrittr yaml From line 2 of scripts/correlate.R

args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast


library("DESeq2")
library("dplyr")
library("yaml")
library("readr")
library("tidyr")
library("stringr")

counts_in <- args[1]
dsq_pre <- args[2]
nombre <- args[3]
config <- args[4]

#	build the sample DF
#trammel <- read_yaml("config.yaml")
trammel <- read_yaml(config)
data_sets.df <- plyr::ldply(trammel$data_sets, data.frame)
data_sets.df$name <- as.factor(data_sets.df$name)
data_sets.df$day<- as.factor(data_sets.df$day)
data_sets.df$subgroups<- as.factor(data_sets.df$subgroups)
data_sets.df$rep<- as.factor(data_sets.df$rep)
data_sets.df$housing<- as.factor(data_sets.df$housing)
data_sets.df$genotype<- as.factor(data_sets.df$genotype)
data_sets.df$tissue<- as.factor(data_sets.df$tissue)

#	build the contrast experiment
contrasts.df <- plyr::ldply(trammel$contrasts, data.frame) %>% filter(name == nombre )

# prep the sample metadata
sbgrp <- (contrasts.df %>% select(filt))[1,] %>% as.character
#coldata <- data_sets.df %>% as.data.frame() %>% ungroup()%>% filter(subgroups == sbgrp)
#coldata <- data_sets.df %>% ungroup() %>% group_by(name) %>%  mutate(genotype = paste0(as.character(genotype), collapse = "," ))  %>% filter(subgroups == sbgrp)
coldata <- data_sets.df%>% group_by(name) %>%  mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup()  %>% filter(subgroups == sbgrp) %>% unique() %>% as.data.frame() 

rownames(coldata) <- coldata$name
coldata <- coldata %>% select(contrasts.df$vars %>% as.character() )

#load and format the count data
counts.df <- read_delim(counts_in, "\t", escape_double = FALSE, trim_ws = TRUE)
counts.df.gath <- counts.df %>% select(-c(Chr, Start, End, Strand, Length)) %>% gather(key = "sample", value = "count", -one_of("Geneid"))
counts.df.sprud <- counts.df.gath %>% spread(key = sample, value = count) %>% as.data.frame() 
rownames(counts.df.sprud) <- counts.df.sprud$Geneid
counts.df.sprud <- counts.df.sprud %>%  select(rownames(coldata))

# convert counts, metadata, design into DESeq data object
#counts.dds <- DESeqDataSetFromMatrix(countData = counts.df.sprud, colData = coldata, design = ~ housing) 
desgn <- as.character(contrasts.df$design[1])
counts.dds <- DESeqDataSetFromMatrix(countData = counts.df.sprud, colData = coldata, design = eval(parse(text=desgn)))

#prefilter to remove low-count rows
keep <- rowSums(counts(counts.dds)) >= 10
counts.dds <- counts.dds[keep,]

#relevel 
#counts.dds$condition <- relevel(counts.dds$condition, ref = "untreated")
for (i in seq(nrow(contrasts.df))){
	var <- contrasts.df[i,]$vars %>% as.character()
	ruf <-contrasts.df[i,]$ref %>% as.character()
	counts.dds[[var]] <- relevel(counts.dds[[var]], ref = ruf)
}

#Run DESeq
counts.dds.dsq <- DESeq(counts.dds)

# run the vanilla DE
counts.dds.res <- results(counts.dds.dsq) %>% as.data.frame()
counts.dds.res$geneid <- rownames(counts.dds.res)

# Effect-size shrinkage
coff <- resultsNames(counts.dds.dsq)[2]
counts.dds.res.shrunk <- lfcShrink(counts.dds.dsq, coef=coff, type="apeglm") %>% as.data.frame()
#explore shrinkage types??
counts.dds.res.shrunk$geneid <- rownames(counts.dds.res.shrunk)

# scale by gene length
counts.dds.res<-inner_join(counts.dds.res, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length))
counts.dds.res.shrunk<-inner_join(counts.dds.res.shrunk, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length))

#Write to TSV
counts.dds.res.full <- inner_join(counts.dds.res %>%  as.data.frame(), counts.dds.res.shrunk %>%  as.data.frame(), by =c("geneid"="geneid"), suffix=c(".fresh",".shrunk")) %>% select(geneid, everything())
write_delim(counts.dds.res.full, paste(dsq_pre,"de",sep="."), delim = "\t")


###		Now run the itemized DE

#get the factors in the design
desgn.splt <- (desgn %>% str_replace_all("~","")%>% str_replace_all(" ","") %>% strsplit("+", fixed=TRUE))[[1]]

#first, collect the geneids and the basemeans: just run the first factor and its first nonref level
fact <- desgn.splt[1]
ref_lev <- (contrasts.df %>% filter(vars == fact))$ref
all_levs<-levels(counts.dds.dsq[[fact]])
alt_levs<- all_levs[all_levs != ref_lev]

basic.df <- results(counts.dds.dsq, contrast=c(fact, as.character(ref_lev), alt_levs[1])) %>% as.data.frame() %>% select(c("baseMean"))
basic.df$geneid <- rownames(basic.df)
basic.df <- select(basic.df, c("geneid", "baseMean"))

#for each factor in factors....

dds.dsq.byFactor.df <- select(basic.df, c("geneid") )


for (fact in desgn.splt){

	ref_lev <- as.character((contrasts.df %>% filter(vars == fact))$ref)
	all_levs<-levels(counts.dds.dsq[[fact]])
	alt_levs<- all_levs[all_levs != ref_lev]


#running basemean by level:
#https://support.bioconductor.org/p/63567/

	#	# gather baseMeans for each level in factor.levels:
	factor.baseMeanPerLvl <- sapply( levels(counts.dds.dsq[[fact]]), function(lvl) rowMeans( counts(counts.dds.dsq,normalized=TRUE)[,counts.dds.dsq[[fact]] == lvl, drop=F] ) ) %>% as.data.frame()

	names(factor.baseMeanPerLvl ) <- paste0(paste("baseMean.",fact,".", sep=""), names(factor.baseMeanPerLvl))
	factor.baseMeanPerLvl$geneid <- rownames(factor.baseMeanPerLvl)
	factor.baseMeanPerLvl <-inner_join(factor.baseMeanPerLvl, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid"))

	for (levi in all_levs){

		factor.baseMeanPerLvl$x <- factor.baseMeanPerLvl[[paste("baseMean",fact,levi,sep=".")]]
		factor.baseMeanPerLvl[[paste("expression",fact,levi,sep=".")]] <- factor.baseMeanPerLvl$x/factor.baseMeanPerLvl$Length

	}

	factor.baseMeanPerLvl <- factor.baseMeanPerLvl %>% select(-c("x", "Length"))

	dds.dsq.byFactor.df <- full_join(dds.dsq.byFactor.df, factor.baseMeanPerLvl, by = c("geneid"="geneid"))

#	#for each alt_level in factor.levels:
	for (aLev in alt_levs){

		#dds.res_factor_level <- results(counts.dds.dsq, contrast=c(fact, ref_lev, aLev))
		dds.res_factor_level.shrunk <- lfcShrink(counts.dds.dsq, coef=paste(fact, aLev, "vs", ref_lev, sep = "_"), type="apeglm") %>% as.data.frame()

		dds.res_factor_level.shrunk$geneid <- rownames(dds.res_factor_level.shrunk)
		dds.res_factor_level.shrunk<-inner_join(dds.res_factor_level.shrunk, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length))
		rownames(dds.res_factor_level.shrunk) <- dds.res_factor_level.shrunk$geneid
		dds.res_factor_level.shrunk <- dds.res_factor_level.shrunk %>% select(-c(geneid))
		names(dds.res_factor_level.shrunk ) <- paste0(names(dds.res_factor_level.shrunk),paste(".",fact,".vs_",aLev,".apeglm", sep=""))
		dds.res_factor_level.shrunk$geneid <- rownames(dds.res_factor_level.shrunk)

		dds.dsq.byFactor.df <- full_join(dds.dsq.byFactor.df, dds.res_factor_level.shrunk, by = c("geneid"="geneid"))
#		counts.dds.res_factor_level <- results(counts.dds.dsq, contrast=c(factor1, ref_level, alt_level1)) 
#		counts.dds.res_factor_level$geneid <- rownames(counts.dds.res_factor_level)

	}

}

#Write to TSV
dds.dsq.byFactor.df <- dds.dsq.byFactor.df %>% select(geneid, everything())
write_delim(dds.dsq.byFactor.df, paste(dsq_pre, "itemized.de", sep="."), delim = "\t")

R dplyr tidyr DESeq2 stringr readr yaml From line 2 of scripts/deSeqer.R

args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast

source("utils/DEXSeq/Subread_to_DEXSeq/load_SubreadOutput.R")

library("dplyr")
library("yaml")
library("readr")
library("tidyr")
library("magrittr")
library("stringr")

counts_in <- args[1] #dxCounts/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts
dsq_pre <- args[2] #diff_exon_use/dex_wtHousing/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M
dsqx_gtf <- args[3] #utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf
nombre <- args[4] #dex_wtHousing
mPhatic <- args[5] #"chr3R_FBgn0004652-"
#mPhatic <- "chr3R_FBgn0004652-"
#use example:
#	Rscript scripts/deSeqer.R {input.fc_in} diff_exon_use/{wildcards.contrast}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} {wildcards.contrast} {gene_id}
#	Rscript scripts/deSeqer.R dxCounts/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts diff_exon_use/dex_wtHousing/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf dex_wtHousing "_FBgn0004652-"

#counts_in <- "counts/DEXSeq/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts"
#dsqx_gtf <- "utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf"

#	build the sample DF
trammel <- read_yaml("config.yaml")
data_sets.df <- plyr::ldply(trammel$data_sets, data.frame)
data_sets.df$name <- as.factor(data_sets.df$name)
data_sets.df$day<- as.factor(data_sets.df$day)
data_sets.df$subgroups<- as.factor(data_sets.df$subgroups)
data_sets.df$rep<- as.factor(data_sets.df$rep)
data_sets.df$housing<- as.factor(data_sets.df$housing)
data_sets.df$genotype<- as.factor(data_sets.df$genotype)
data_sets.df$tissue<- as.factor(data_sets.df$tissue)
print("metadata frame built")

#	build the contrast experiment
contrasts.df <- plyr::ldply(trammel$xcontrasts, data.frame) %>% filter(name == nombre )
print("contrast defined")

# prep the sample metadata
sbgrp <- (contrasts.df %>% select(filt))[1,] %>% as.character
fact <- (contrasts.df %>% select(vars))[1,] %>% as.character
coldata.df <- data_sets.df%>% group_by(name) %>%  mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup()  %>% filter(subgroups == sbgrp) %>% as.data.frame() 
row.names(coldata.df) <- coldata.df$name
coldata.df %<>% dplyr::select(fact)
names(coldata.df) <- "condition"
print("experimental metadata selecteted")

coldata.df$condition <- factor(coldata.df$condition, ordered = FALSE)

coldata.df$condition = relevel(coldata.df$condition, ref = as.character(contrasts.df$ref))

#relevel 
#dxd.fc$condition <- factor(dxd.fc$condition, ordered=FALSE)
#eg, https://support.bioconductor.org/p/74520/
#dxd.fc$condition <- relevel(dxd.fc$condition, ref = as.factor(contrasts.df$ref))

# #counts.dds$condition <- relevel(counts.dds$condition, ref = "untreated")
# for (i in seq(nrow(contrasts.df))){
# 	var <- contrasts.df[i,]$vars %>% as.character()
# 	ruf <-contrasts.df[i,]$ref %>% as.character()
# 	counts.dds[[var]] <- relevel(counts.dds[[var]], ref = ruf)
# }

# Run DEXSeq
#desgn <- "~ sample + exon + condition:exon"
desgn <- as.character(contrasts.df$design[1])

dxd.fc = DEXSeqDataSetFromFeatureCounts(counts_in,flattenedfile = dsqx_gtf, sampleData =coldata.df, design= eval(parse(text=desgn)))
print("DEXSeq Dataset Built")

dxd.fc = estimateSizeFactors( dxd.fc )
print("size factors estimated")

dxd.fc = estimateDispersions( dxd.fc )
print("dispersion estimated")

png(filename = paste(dsq_pre,".dispersionPlot.png", sep=""), height=500, width=800)
plotDispEsts( dxd.fc , main = paste("DEXSeq Dispersion Plot: ",nombre," Contrast", sep=""))
dev.off()

dxd.fc = testForDEU( dxd.fc )
print("DEU tested for")

dxd.fc = estimateExonFoldChanges( dxd.fc, fitExpToVar="condition")
print("Exon fold changes estimated")

dxd.fc.results = DEXSeqResults( dxd.fc )
print("Results gathered")

png(filename = paste(dsq_pre,".",gsub("-","",gsub("\\+","",gsub("_","",mPhatic))),".differentialExonUsePlot.png", sep=""), height=500, width=800)
plotDEXSeq( dxd.fc.results, mPhatic, expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2, FDR=0.01 )
dev.off()

png(filename = paste(dsq_pre,".",gsub("-","",gsub("\\+","",gsub("_","",mPhatic))),".differentialExonExpressionPlot.png", sep=""), height=500, width=800)
plotDEXSeq( dxd.fc.results, mPhatic, expression=TRUE, splicing=FALSE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 , FDR=0.01)
dev.off()

dxd.fc.results.df <- dxd.fc.results %>% as_tibble()

#Write to TSV
#dxd.fc.results.df <- dxd.fc.results.df %>% select(geneid, everything())
write_delim(dxd.fc.results.df %>% mutate(transcripts=as.character(transcripts)), paste(dsq_pre, "de", sep="."), delim = "\t")
print("Results written")

print("................... DONE!!!")

R dplyr tidyr stringr readr magrittr yaml From line 2 of scripts/dexSeqe.R

import argparse
import gffutils


parser = argparse.ArgumentParser()
parser.add_argument("-i", "--gtf_in", help="gene model to cronch")
parser.add_argument("-o", "--output_prefix", help="file to write ORFs to")
parser.add_argument("-n", "--gene_name", help="name of gene to cronch ")
#parser.add_argument("-c", "--length_cutoff", help="ignore ORFs with fewer internal (ie, not start or stops) codons", type=int, default=75)
#parser.add_argument("-x", "--extend_across_gaps", help="extend the ORF even if it encounters a gap/N", action="store_true", default=False)
#parser.add_argument("-a", "--include_all_starts", help="report all ORFs when multiple start codons present (else choose longest)", action="store_true", default=False)

args = parser.parse_args()


file_out = args.output_prefix

#path2file="/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/utils/annotations/fru.gtf"
path2file="/Users/csoeder/Research/VolkanLab_BehaviorGenetics/utils/annotations/fru.gtf"
path2file=args.gtf_in


fn = gffutils.example_filename(path2file)
print(open(fn).read()) 

#db = gffutils.create_db(fn, dbfn='/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)
#db = gffutils.create_db(fn, dbfn='/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, id_spec=['ID', 'Name'] ,disable_infer_genes=True)

path2gtf = "%s.db" % tuple([path2file])
#db = gffutils.create_db(fn, dbfn='/Users/csoeder/Research/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True)
db = gffutils.create_db(fn, dbfn=path2gtf, force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True)

gene_name="FBgn0004652"
gene_name=args.gene_name

gene = db[gene_name]


transcription_edges = []

exon_to_transcript = {}


for transcript in db.children(gene, featuretype='transcript'): 
	transcription_edges.append(transcript.start)
	transcription_edges.append(transcript.stop)
	for exon in db.children(transcript, featuretype='exon'): 
		if exon.start in exon_to_transcript.keys():
			exon_to_transcript[exon.start].append(transcript.id)
		else:
			exon_to_transcript[exon.start] = [transcript.id]
		if exon.stop in exon_to_transcript.keys():
			exon_to_transcript[exon.stop].append(transcript.id)
		else:
			exon_to_transcript[exon.stop] = [transcript.id]

# currently ignoring the *transcript* start/stop sites .... kick all of them out


startStop = []
for transcript in db.children(gene, featuretype='transcript'): 
	startStop.extend([transcript.start, transcript.stop])
startStop = list(set(startStop))

for term in startStop:
	exon_to_transcript.pop(term)



isoid_definitions = {}
isoid_num=0
for iso in exon_to_transcript.values():
	if iso in isoid_definitions.values():
		pass
	else:
		isoid_definitions["isoid_%s"%tuple([isoid_num])] = iso
		isoid_num +=1


isoids = {}
for key in isoid_definitions.keys():
	isoids[key]={"transcripts":isoid_definitions[key],"junctions":[]}
	for junk in exon_to_transcript.keys():
		if exon_to_transcript[junk] == isoid_definitions[key]:
			#print(junk, key, exon_to_transcript[junk])
			isoids[key]["junctions"].append(junk)




phial_name = "%s.isoids.gtf" % tuple([file_out])
phial_out = open(phial_name, "w")


for isoid_name in isoids.keys():
#isoid_name='isoid_0'
	isoid = isoids[isoid_name]
	j_list =isoid['junctions']

	j_list.sort()


	gene_gtf_list = ["edgeHog","gene",min(j_list), max(j_list), 0, gene.strand, ".", "gene_id '%s'; transcript_id '%s';\n" %tuple([isoid_name, isoid_name]) ]
	gene_gtf_str="\t%s"*len(gene_gtf_list) % tuple(gene_gtf_list)
	gene_gtf_str = "%s%s" % tuple([gene.chrom , gene_gtf_str])


	junk_strungs = []
	j_count=0
	for jay in j_list:
		junct_gtf_list = ["edgeHog","exon",jay, jay, 0, gene.strand, ".", "gene_id '%s'; transcript_id '%s'; junction_id j_%s;\n" %tuple([isoid_name, isoid_name, j_count]) ]
		junct_gtf_str="\t%s"*len(junct_gtf_list) % tuple(junct_gtf_list)
		junct_gtf_str = "%s%s" % tuple([gene.chrom , junct_gtf_str])
		junk_strungs.append(junct_gtf_str)
		j_count += 1


	phial_out.write(gene_gtf_str)
	for junk_line in junk_strungs:
		phial_out.write(junk_line)

phial_out.close()


phial_name = "isoids.list"
phial_name = "%s.isoids.list" % tuple([file_out])

phial_out = open(phial_name, "w")

for isoid_name in isoids.keys():
	for trans in isoids[isoid_name]["transcripts"]:
		phial_out.write("'%s'\t%s\n" %tuple([isoid_name, trans]))


phial_out.close()

Python From line 1 of scripts/edgeHog.py

args = commandArgs(trailingOnly=TRUE)
# 1: counts file in
# 2: prefix for deseq out
# 3: name of contrast


library("dplyr")
#library("yaml")
library("readr")
library("tidyr")
library("magrittr")

#library("stringr")

counts_in <- args[1] # counts from featureCount
#counts_in <- "counts/all.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.counts"
aln_meta <- args[2] # metadata for alignments (to get million mapped reads)
#aln_meta <- "summaries/alignments.vs_dm6main.mapspliceMulti.summary"
exp_out <- args[3] # output file prefix to write expression
#exp_out <- "expression.potato"

counts.df <- read_delim(counts_in, "\t", escape_double = FALSE, trim_ws = TRUE)

RPK.df <- counts.df %>% select(-c(Chr, Start, End, Strand)) %>% mutate(Length = Length/1000) # convert bases to kb
#turn counts into per-kb rates
#https://stackoverflow.com/a/48693978
RPK.df[,-(1:2)] %<>% sapply(`/`, RPK.df[,2])
RPK.df.gath <- RPK.df  %>% gather(key = "sample", value = "count", -one_of("Geneid", "Length"))


#collect the million reads mapped (proper pairs, given the current featureCounts settings)
aln_meta.df <- read_delim(aln_meta, "\t", escape_double = FALSE, trim_ws = TRUE,col_names = F)
names(aln_meta.df)<- c("ref_genome","aligner","sample","measure","value")

aln_meta.df %<>% filter(measure == "properly_paired_count") %<>% select(-c("measure", "aligner", "ref_genome"))


RPKM.df.gath <- inner_join(RPK.df.gath, aln_meta.df, by =c("sample"="sample"))

RPKM.df.gath %<>% mutate(MMR = value/1000000, expression = count/MMR )  %>% select(c("Geneid", "sample", "expression"))

RPKM.df.sprud <- RPKM.df.gath %>% spread(key = "sample", value = "expression")

write_delim(RPKM.df.sprud, paste(exp_out,".rpkm", sep=""), delim = "\t", col_names = T)





RPK.df <- counts.df %>% select(-c(Chr, Start, End, Strand)) %>% mutate(Length = Length/1000) # convert bases to kb
# TPM definition https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
#				https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
RPK.df[,-(1:2)] %<>% sapply(`/`, RPK.df[,2])

RPK.gath.df <- RPK.df %>% select(-c("Length")) %>% gather(key = "sample", value = "rpk", -one_of("Geneid"))
RPK.gath.grupt.sum.df <- RPK.gath.df  %>% group_by(sample) %>% summarise(total=sum(rpk))

TPM.df <- inner_join(RPK.gath.df, RPK.gath.grupt.sum.df, by=c("sample"="sample")) %>% mutate(TPM = rpk/(total/1000000)) %>% select(-c("rpk","total")) %>% spread(key=sample, value=TPM) 

write_delim(TPM.df, paste(exp_out,".tpm", sep=""), delim = "\t", col_names = T)

R dplyr tidyr FeatureCounts readr magrittr From line 2 of scripts/expressionOmete.R

import json
#	https://stackoverflow.com/questions/19483351/converting-json-string-to-dictionary-not-list
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("json_in", help="JSON file to be condensed")
parser.add_argument("flat_out", help="flatfile summary")
parser.add_argument("-t", "--tag", help="line-name for the processed JSON", default=None)
args = parser.parse_args()

def prune_jason(jsn):
	sparse_dict = {}
	sparse_dict = dict([(k,{}) for k in ['prefiltered', 'postfiltered', 'filtration', 'duplication']])

	for i in ['gc_content','q20_rate', 'q30_rate', 'read1_mean_length','total_reads']:
		try:
			sparse_dict['prefiltered'][i] =  jsn['summary']['before_filtering'][i] 
		except KeyError:
			sparse_dict['prefiltered'][i] = "NA"

		try:
			sparse_dict['postfiltered'][i] =  jsn['summary']['after_filtering'][i] 
		except KeyError:
			sparse_dict['postfiltered'][i] = "NA"

	for i in ['adaptor_trimmed_reads']:
		try: 
			sparse_dict['filtration'][i] = jsn['adapter_cutting']['adapter_trimmed_reads']
		except KeyError:
			sparse_dict['filtration'][i] = "NA"

	for i in ['corrected_reads','low_quality_reads','passed_filter_reads']:
		try:
			sparse_dict['filtration'][i] = jsn['filtering_result'][i]
		except KeyError:
			sparse_dict['filtration'][i] = "NA"

	for i in ['rate']:
		sparse_dict['duplication'][i] = jsn['duplication'][i]

	return sparse_dict

def print_pruned(prn):
	lines2write = [ [x, y, prn[x][y]]  for x in prn.keys() for y in prn[x]]
	if args.tag:
		[ ell.insert(0, args.tag) for ell in lines2write ]
	phial_out = open(args.flat_out, 'w')
	for preline in lines2write:
		field_count = len(preline)
		line = ("%s" + "\t%s"*(field_count-1) + "\n") % tuple(preline)
		phial_out.write(line)
	phial_out.close()

jason_file=open(args.json_in)
jason_str = jason_file.read()
jason = json.loads(jason_str)

jason_condensed = prune_jason(jason)
print_pruned(jason_condensed)

Python JSON From line 1 of scripts/fastp_reporter.py

args = commandArgs(trailingOnly=TRUE)

library("readr")
library("biomaRt")
library("org.Dm.eg.db")
library("topGO")
library("dplyr")
library("yaml")
library("stringr")
library("magrittr")
data("geneList")

deSeq_file <- args[1]
nombre <- args[2]
go_out <- args[3]
config <- args[4]

#nombre <- "hausWtVsMut"


sig_thresh <- 0.01

#hausWtVsMut_vs_dm6main_dm6_genes_mapspliceMulti_MpBC_itemized <- read_delim("diff_expr/hausWtVsMut/hausWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de", "\t", escape_double = FALSE, trim_ws = TRUE)
DESeq.itemized<- read_delim(deSeq_file, "\t", escape_double = FALSE, trim_ws = TRUE)

trammel <- read_yaml(config)
#trammel <- read_yaml("config.yaml")
#	build the contrast experiment
contrasts.df <- plyr::ldply(trammel$contrasts, data.frame) %>% filter(name == nombre )
#get the factors in the design
desgn <- as.character(contrasts.df$design[1])
desgn.splt <- (desgn %>% str_replace_all("~","")%>% str_replace_all(" ","") %>% strsplit("+", fixed=TRUE))[[1]]

query_genes.list <- DESeq.itemized$geneid %>% unique() %>% as.list()

#ensembl_dm = useMart("ensembl",dataset="dmelanogaster_gene_ensembl")
marty <- useDataset("dmelanogaster_gene_ensembl",  useMart("ensembl",  host = "useast.ensembl.org") )
G_list <- getBM(attributes= c("flybase_gene_id", "ensembl_gene_id", "external_gene_name", "go_id", "name_1006"), mart= marty, useCache= FALSE) %>% mutate(go_name = name_1006) %>% select(-c("name_1006"))
#G_list <- getBM(attributes= c("flybase_gene_id", "ensembl_gene_id", "external_gene_name", "go_id", "name_1006"), mart= marty) %>% mutate(go_name = name_1006) %>% select(-c("name_1006"))


### ### ### ### ### ### ### ### 
### plz fix this
go_bad <- c("GO:0110109","GO:0120176","GO:0120177","GO:0120170","GO:0062023")
bad_boiz <- G_list %>% filter(go_id %in% go_bad) %>% dplyr::select("flybase_gene_id")
bad_boiz <- rbind(bad_boiz, "FBgn0050046", "FBgn0033710",  "FBgn0050203", "FBgn0026721")
#mask the baddies
G_list.unq <- G_list %>% filter(!flybase_gene_id %in% bad_boiz$flybase_gene_id ) %>% dplyr::select("flybase_gene_id") %>% unique()
DESeq.itemized %<>% filter(!(geneid %in% bad_boiz$flybase_gene_id))
### ### ### ### ### ### ### ### 

#present_gene_list <- G_list.unq$flybase_gene_id %in% DESeq.itemized$geneid %>% as.numeric()
#names(present_gene_list) <- G_list.unq$flybase_gene_id

jeanOnt.mega.df <- as.data.frame(c())

#for each factor in factors....

for (fact in desgn.splt){

	ref_lev <- (contrasts.df %>% filter(vars == fact))$ref
	#all_levs<-levels(counts.dds.dsq[[fact]])

	#subset the DE data to the factor
#	DESeq.itemized.factor <- DESeq.itemized[,!is.na(str_locate(names(DESeq.itemized), paste(fact,".vs_" ,sep=""))[,1])]
	DESeq.itemized.factor <- DESeq.itemized[,!is.na(str_match(names(DESeq.itemized), paste(fact,".vs_(.*?).apeglm" ,sep=""))[,1])]

	#pull out the alt levels from the data
#	alt_levs<- (str_split_fixed(names(DESeq.itemized.factor), paste(fact,".vs_", sep=""), n=2)[,2] %>% str_split_fixed("\\.", n=2))[,1] %>% unique()

	noms <- str_match(names(DESeq.itemized), paste(fact,".vs_(.*?).apeglm" ,sep=""))[,2]
	alt_levs <- unique(noms[!is.na(noms)])
	DESeq.itemized.factor$geneid <- DESeq.itemized$geneid

	#	#for each alt_level in factor.levels:

	for (alt in alt_levs) {
		print(alt)
		#subset the DE data to that level
		DESeq.itemized.level <- DESeq.itemized.factor[,!is.na(str_locate(names(DESeq.itemized.factor), paste(fact,"\\.vs_",alt,"\\." ,sep=""))[,1])]
		DESeq.itemized.level$geneid <- DESeq.itemized.factor$geneid
		#pull out the genes & significance values
		sig_gene_list <- DESeq.itemized.level[paste("padj.",fact,".vs_",alt,".apeglm",sep="")][[1]]# < sig_thresh)[,1] #%>% as.numeric()
		sig_gene_list[is.na(sig_gene_list)] <- 1
		names(sig_gene_list) <- DESeq.itemized.level$geneid

		jeanOnt.df <- as.data.frame(c())

		for (ont in c("MF","BP","CC") ){

			#topDiffGenes uses p cutoff of 0.01
			GOdata <- new("topGOdata", ontology = ont, allGenes = sig_gene_list, geneSel = topDiffGenes, annot = annFUN.org, mapping = "org.Dm.eg.db", ID="Ensembl")
			fishy <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
			kissy <- runTest(GOdata, algorithm = "classic", statistic = "ks")

			goRes.tmp <- GenTable(GOdata, classicFisher =fishy, classicKS = kissy,  topNodes = 50) %>% mutate(ontology = as.factor(ont), classicFisher = as.numeric(classicFisher), classicKS = as.numeric(classicKS)) %>% select(-c(Term), Term) %>% as_tibble() 
			goRes.tmp %<>% left_join(G_list%>% select(c("go_id", "go_name")) %>% unique(), by =c("GO.ID" = "go_id")) %>% select(-c("Term"))
			jeanOnt.df <- rbind(jeanOnt.df, goRes.tmp)
			#store & bind the topGO results. 

		}

		jeanOnt.df$variable <- as.factor(fact)
		jeanOnt.df$alt_level <- as.factor(alt)

		jeanOnt.mega.df <- rbind(jeanOnt.mega.df,jeanOnt.df)

	}

}

write.table(jeanOnt.mega.df, file=args[3], row.names=FALSE, col.names = TRUE, sep = "\t", quote=F)

R dplyr stringr readr org.Dm.eg.db magrittr yaml biomaRt topGO From line 2 of scripts/geneOntologe.R

import pysam
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--samfile_in", help="file in SAM format to ")
# parser.add_argument("-i", "--idxstat_in", help="samtools idxstat report")
# parser.add_argument("-g", "--split_coverage", help="bedtools genomecov report, with split reads")
# parser.add_argument("-G", "--spanning_coverage", help="bedtools genomecov report, with spanning reads")
# parser.add_argument("-d", "--depthstats_in", help="samtools depth report")
# parser.add_argument("-D", "--dupestats_in", help="samtools markdup stats report")
# parser.add_argument( "-m", "--mapping_multiplicity", help="histogram of mapping multiplicity scraped from the IH:: tags")
# parser.add_argument( "-c", "--mapped_count", help="count of mapped reads")
parser.add_argument("-o", "--samfile_out", help="flatfile summary")
# parser.add_argument("-t", "--tag", help="line-name for the flatfile", default=None)
args = parser.parse_args()

sam_in = args.samfile_in
sammy = pysam.AlignmentFile(sam_in,"rb")  


sam_out = args.samfile_out
samuel = pysam.AlignmentFile(sam_out, "w", template=sammy)


# cigartuples encoding:
# M = 0
# I = 1
# D = 2
# N = 3
# S = 4


for read in sammy.fetch():
	cig = read.cigar
	nu_cig = []
	suffix = []
	if cig[0][0] == 4:
		cig = cig[1:]
	if cig[-1][0] == 4:	#discard the softclips
		cig = cig[:-1]
	front_pad_tally = 0
	while cig[0][0] in [0,1,2]:
		front_pad_tally += cig[0][1]
		cig.pop(0)
	back_pad_tally = 0
	while cig[-1][0] in [0,1,2]:
		back_pad_tally += cig[-1][1]
		cig.pop(-1)

	nu_cig.extend([(3,front_pad_tally-1),(0,1)]) # new cigar starts with 5' end clipped down to the actual splice

	# if cig[0][0] == 0: #take the terminal match blocks
	# 	num_match = cig[0][-1]
	# 	prefix = [(3,num_match-1),(0,1)]
	# 	nu_cig.extend(prefix) #add a bunch of Ns to pad ; ignore 5' and 3' read ends. 
	# 	cig = cig[1:]
	# if cig[-1][0] == 0:
	# 	num_match = cig[-1][1]
	# 	suffix = 
	# 	cig = cig[:-1]

	rill_tally = 0
	mids = []
	for rillo in cig:
		if rillo[0] in [0,1,2]:
			rill_tally += rillo[1] #consolidate exons
		else:
			if rill_tally > 0:
				mids.extend([(0,1),(3,rill_tally-2),(0,1)]) #write hollowed exon
			mids.extend([rillo]) # write intron
			rill_tally=0 # reset tally
	nu_cig.extend(mids)
	nu_cig.extend([(0,1),(3,back_pad_tally-1),]) # end cigar with splice and Ns to 3' end

	read.cigar = nu_cig
	read.seq = "*"
	samuel.write(read)

samuel.close()

Python SAMtools BEDTools pysam From line 2 of scripts/read_Hollower.py

run:
	shell(""" mkdir -p results/figures/; touch results/figures/null.png; for fig in results/figures/*png; do mv $fig $(echo $fig| rev | cut -f 2- -d . | rev ).$(date +%d_%b_%Y).png; done;  rm results/figures/null.*.png; """)
	shell(""" mkdir -p results/figures/supp/ ; touch results/figures/supp/null.png; for fig in results/figures/supp/*png; do mv $fig $(echo $fig| rev | cut -f 2- -d . | rev ).$(date +%d_%b_%Y).png; done; rm results/figures/supp/null.*.png; """)

	shell(""" mkdir -p results/tables/ ; touch results/tables/null.tmp ; for phial in $(ls -p results/tables/ | grep -v /); do pre=$(echo $phial | rev | cut -f 2- -d . | rev ); suff=$(echo $phial | rev | cut -f 1 -d . | rev ); mv results/tables/$phial results/tables/$pre.$(date +%d_%b_%Y).$suff; done ; rm results/tables/null.*.tmp; """)
	shell(""" mkdir -p results/tables/supp/ ; touch results/tables/supp/null.tmp ; for phial in $(ls -p results/tables/supp/ | grep -v /); do pre=$(echo $phial | rev | cut -f 2- -d . | rev ); suff=$(echo $phial | rev | cut -f 1 -d . | rev ); mv results/tables/supp/$phial results/tables/supp/$pre.$(date +%d_%b_%Y).$suff; done ; rm results/tables/supp/null.*.tmp; """)

	shell(""" mv results/VolkanLabBehaviorGenetics.full.pdf results/VolkanLab_BehaviorGenetics.full.$(date +%d_%b_%Y).pdf """)
	shell(""" tar cf results.full.$(date +%d_%b_%Y).tar results/ """)

SnakeMake From line 60 of master/Snakefile

run:
	print("%s%s" % tuple(["FASTQ/", wildcards.path]))
	try:
		sra = sra_by_fqpath["%s%s/" % tuple(["FASTQ/", wildcards.path])]
		shell(""" mkdir -p FASTQ/{wildcards.path}/ """)
		shell("""
			fasterq-dump  --split-3 --outdir FASTQ/{wildcards.path}/ {sra}
		""")
	except KeyError:
		raise KeyError("Sample is listed as empirical but no reads found and no SRA to download!" )

SnakeMake SRA From line 79 of master/Snakefile

run:
	shell("mkdir -p utils/")
	shell(
		""" 
		cd utils;
		wget http://protocols.netlab.uky.edu/~zeng/MapSplice-v2.1.8.zip ;
		unzip MapSplice-v2.1.8.zip;
		2to3 -w MapSplice-v2.1.8/;
		cd MapSplice-v2.1.8/;
		make;
		cd ..;
		rm MapSplice-v2.1.8.zip;
		"""
	)		

SnakeMake From line 97 of master/Snakefile

run:
	fai_path = ref_genome_by_name[wildcards.ref_genome]['fai'],
	shell("mkdir -p utils")
	shell(
		'bedtools makewindows -w {wildcards.window_size} -s {wildcards.slide_rate} -g {fai_path} -i winnum | bedtools sort -i - > {output.windowed}'
	)

SnakeMake BEDTools From line 121 of master/Snakefile

run:
	shell("mkdir -p utils/annotations")
	base_path = annotation_by_name[genelist_by_name[wildcards.subname]["base"]]["bed_path"]
	list_file = genelist_by_name[wildcards.subname]["list_path"]
	shell(""" grep -wFf <( tail -n +2 {list_file} | cut -f 2 ) {base_path} > {output.annot_out} """)

SnakeMake From line 135 of master/Snakefile

shell:
	"""
	mkdir -p summaries/reference_genomes/
	cat {input.fai_in} | awk '{{sum+=$2}} END {{ print "number_contigs\t",NR; print "number_bases\t",sum}}' | sed -e 's/^/{wildcards.ref_gen}\t/g' > {output.report_out};
	"""

SnakeMake From line 150 of master/Snakefile

shell:
	"cat {input.refgen_reports} > {output.refgen_summary}"

SnakeMake From line 165 of master/Snakefile

run:
	shell(""" mkdir -p summaries/annotations/ """)
	shell(""" rm -f {output.report_out} """)
	shell(""" cat {input.annot} | cut -f 1 | grep -v "chr2110000222\|chrmitochondrion\|Un\|rand\|chrrDNA" | sort | uniq -c | tr -s " " | tr " " "\t" | awk '{{print"count\t"$2"\t"$1}}' >> {output.report_out} """)
	shell(""" cat {input.annot} | wc -l | awk '{{print"count\ttotal\t"$0}}' >> {output.report_out} """)
	shell(""" cat {input.annot} | awk '{{print$3-$2;}}' | awk '{{sum+=$1; sumsq+=$1*$1}} END {{ print "size\ttotal\t",sum; print "size\tavg\t",sum/NR; print "size\tstd\t",sqrt(sumsq/NR - (sum/NR)**2)}}'  >> {output.report_out} """)

SnakeMake From line 178 of master/Snakefile

run:
	print([a["name"] for a in config['annotations'] if not a["derived"] ])
	shell(""" rm -f {output.refann_summary} """)
	for anne in [a["name"] for a in config['annotations'] if not a["derived"] ]:
		shell(""" cat summaries/annotations/{anne}.stats | awk '{{print"{anne}\t"$0}}' >> {output.refann_summary}""")

SnakeMake From line 195 of master/Snakefile

run:
	shell(""" python scripts/edgeHog.py --gtf_in $(pwd)/utils/annotations/fru.gtf -o utils/annotations/fru -n FBgn0004652 """)

SnakeMake From line 213 of master/Snakefile

run:
	shell(""" rm -f {output.statsOut} """)
	shell(""" echo "count	total	$( tail -n +2 {input.list_in} | wc -l  )" >> {output.statsOut} """)
	annot_path = annotation_by_name[genelist_by_name[wildcards.listicle]["base"]]["bed_path"]
	annot_name = annotation_by_name[genelist_by_name[wildcards.listicle]["base"]]["name"]
	shell(""" echo "count\tannotated\t$(cat {input.bed_in} | wc -l )">> {output.statsOut} """)
	shell(""" cat {input.bed_in} | awk '{{print$3-$2;}}' | awk '{{sum+=$1; sumsq+=$1*$1}} END {{ print "size\ttotal\t",sum; print "size\tavg\t",sum/NR; print "size\tstd\t",sqrt(sumsq/NR - (sum/NR)**2)}}'  >> {output.statsOut} """)

SnakeMake From line 227 of master/Snakefile

run:
	for listicle in genelist_by_name.keys():
		annot = annotation_by_name[genelist_by_name[listicle]["base"]]["name"]
		shell(""" cat summaries/geneLists/{listicle}.stats | awk '{{print"{listicle}\t{annot}\t"$0}}' >> {output.allStats} """)

SnakeMake From line 245 of master/Snakefile

shell:
	"/nas/longleaf/home/csoeder/modules/fastp/fastp {params.common_params} {params.pe_params} --in1 {input.fileIn[0]} --out1 {output.fileOut[0]} --in2 {input.fileIn[1]} --out2 {output.fileOut[1]}"

SnakeMake fastp From line 268 of master/Snakefile

shell:
	"""
	cp {input.jason} summaries/FASTP/{wildcards.samplename}.json
	python3 scripts/fastp_reporter.py {input.jason} {output.jason_pruned} -t {wildcards.samplename}
	"""

SnakeMake From line 284 of master/Snakefile

shell:
	"cat {input.jasons_in} > {output.summary}"

SnakeMake From line 301 of master/Snakefile

run:
	ref_genome_split = ref_genome_by_name[wildcards.ref_genome]['split'],
	ref_genome_bwt = ref_genome_by_name[wildcards.ref_genome]['bowtie'],
	shell(""" mkdir -p mapped_reads/mapspliceRaw/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		python utils/MapSplice-v2.1.8/mapsplice.py -c {ref_genome_split} -x {ref_genome_bwt} -1 {input.reads_in[0]} -2 {input.reads_in[1]} -o mapped_reads/mapspliceRaw/{wildcards.sample} 
		""")
	shell(""" 
		samtools view -Sbh mapped_reads/mapspliceRaw/{wildcards.sample}/alignments.sam | samtools sort - > mapped_reads/mapspliceRaw/{wildcards.sample}/{wildcards.sample}.vs_{wildcards.ref_genome}.mapspliceRaw.sort.bam;
		samtools index {output.raw_bam};
		rm mapped_reads/mapspliceRaw/{wildcards.sample}/alignments.sam;
		""")

SnakeMake SAMtools Bowtie 2 MapSplice From line 318 of master/Snakefile

run:
	shell(""" mkdir -p mapped_reads/mapspliceMulti/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		samtools sort -n {input.raw_bam} | samtools fixmate -m - - | samtools sort - | samtools markdup {params.dup_flg} - - | samtools view -bh {params.quality} > {output.multi_bam} ;
		samtools index {output.multi_bam} 
		""")

SnakeMake SAMtools From line 346 of master/Snakefile

run:
	shell(""" mkdir -p mapped_reads/mapspliceUniq/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		samtools view {input.multi_bam} | grep -w "IH:i:1" | cat <( samtools view -H {input.multi_bam} ) - | samtools view -Sbh > {output.uniq_bam};
		samtools index {output.uniq_bam}
		""")

SnakeMake SAMtools From line 366 of master/Snakefile

run:
	shell(""" mkdir -p mapped_reads/mapspliceRando/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 

		samtools sort -n {input.multi_bam} | samtools view | grep -v "IH:i:[01]" > {output.rando_bam}.multi.unsampled

		rm -f {output.rando_bam}.multi.sampled
		for idx in $(cut -f 1 {output.rando_bam}.multi.unsampled | sort | uniq); do 
			grep -w $idx {output.rando_bam}.multi.unsampled | sort  --random-sort | head -n 1 >> {output.rando_bam}.multi.sampled
		done;

		cat <( samtools view -h {input.uniq_bam} ) {output.rando_bam}.multi.sampled | samtools view -Sbh | samtools sort - > {output.rando_bam};
		samtools index {output.rando_bam}
		rm {output.rando_bam}.multi.unsampled {output.rando_bam}.multi.sampled
		""")

SnakeMake SAMtools From line 388 of master/Snakefile

run:
	shell(""" mkdir -p mapped_reads/{wildcards.aligner}_SpliceOnly/{wildcards.sample}/ summaries/BAMs/""")
	shell(""" 
		cat <(samtools view -H {input.in_bam}) <(samtools view {input.in_bam}| awk '($6 ~ /N/)' | less) | samtools view -hbS > {output.out_bam}.tmp;
		samtools index {output.out_bam}.tmp;
		python scripts/read_Hollower.py -i {output.out_bam}.tmp -o {output.out_bam}.tmp.sam;
		cat {output.out_bam}.tmp.sam |  sed -e 's/\tN\t/\t*\t/g' | samtools view -hbS > {output.out_bam};
		samtools index {output.out_bam};
		rm {output.out_bam}.tmp {output.out_bam}.tmp.sam;
		""")

SnakeMake SAMtools From line 416 of master/Snakefile

	run:
 		ref_genome_idx=ref_genome_by_name[wildcards.ref_genome]['fai']
 		shell(""" which samtools """)
 		shell("samtools idxstats {input.bam_in} > {input.bam_in}.idxstats")

SnakeMake SAMtools From line 440 of master/Snakefile

shell:
	"""cat {input.bam_reports} | awk '{{print "{wildcards.ref_genome}\t{wildcards.aligner}\t"$0}}'> {output.full_report}"""

SnakeMake From line 475 of master/Snakefile

run:
	fai = ref_genome_by_name[wildcards.ref_genome]['fai']
	shell(""" mkdir -p summaries/coverage/ """)
	shell(""" samtools view -hb {input.bam_in} {params.location} | bedtools genomecov -split -bga -ibam - -g {fai} > {output.bg_out} """)

SnakeMake SAMtools BEDTools From line 491 of master/Snakefile

run:
	shell(""" touch {output} """)

SnakeMake From line 510 of master/Snakefile

run:
	flug_str = wildcards.flags
	flug_lst = [ s for s in flug_str if s != "_"]
	flug_str = " -%s "*len(flug_lst) % tuple(flug_lst)


	sed_suff = ".vs_%s.%s.sort.bam" % tuple([wildcards.ref_genome, wildcards.aligner])		
	sed_cmd =  " sed -e 's/mapped_reads\/[a-zA-Z0-9\/_]*\///g' | sed -e 's/%s//g' " % tuple([sed_suff])

	annot_gtf = annotation_by_name[wildcards.annot]["gtf_path"]
	shell(""" mkdir -p counts summaries/counts/""")
	shell("""
		featureCounts {flug_str} {params.fc_params} -t exon -g gene_id -F GTF -a <(cat {annot_gtf} | awk '{{print"chr"$0}}') -o {output.counted_features}.tmp {input.alignments_in}
		""")
	shell("""
		cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp | tail -n +2 | {sed_cmd} > {output.counted_features}
		cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp.jcounts | {sed_cmd} > counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.jcounts
		cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp.summary | {sed_cmd} > counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.summary
		rm counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp* 
		""")

SnakeMake FeatureCounts From line 530 of master/Snakefile

run:
	shell("""
		head -n 1 {input.sum_in} > {output.count_sum};
		cat {input.sum_in} | grep -w "Assigned\|Unassigned_NoFeatures\|Unassigned_Ambiguity" >> {output.count_sum};
		""")

SnakeMake From line 563 of master/Snakefile

run:
	shell(""" mkdir -p expression/ """)
	shell(""" Rscript  scripts/expressionOmete.R {input.counted_features} {input.aln_rprt} expression/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags} """)

SnakeMake From line 583 of master/Snakefile

run:
	shell(""" mkdir -p utils/ """)
	shell(""" touch {output.expFlg} """)

SnakeMake From line 599 of master/Snakefile

run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p diff_expr/{wildcards.contrast} """)
	shell(""" Rscript scripts/deSeqer.R {input.fc_in} diff_expr/{wildcards.contrast}/{wildcards.contrast}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} {wildcards.contrast} config.yaml """)
	#shell(""" mkdir -p meta/DESeq2_methods/ """)
	#shell(""" mv diff_expr/{wildcards.contrast}/{wildcards.contrast}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag}.method meta/DESeq2_methods/ """)

SnakeMake From line 617 of master/Snakefile

run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p gene_ont/{wildcards.contrast}/ """)
	shell(""" Rscript scripts/geneOntologe.R {input.deSq_itemized} {wildcards.contrast} {output.deSq_go} config.yaml """)

SnakeMake From line 637 of master/Snakefile

run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p correlations/{wildcards.correlation}/ """)
	shell(""" Rscript scripts/correlate.R {input.xpr_in} {wildcards.correlation} correlations/{wildcards.correlation}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} """)

SnakeMake From line 656 of master/Snakefile

run:
	shell(
		"""
		mkdir -p utils/DEXSeq/;
		""")

SnakeMake From line 673 of master/Snakefile

run:
	shell(
		"""	
			mkdir -p utils/annotations/DEXSeq/;
			python utils/DEXSeq/Subread_to_DEXSeq/dexseq_prepare_annotation2.py {params.prep_params} <( cat {input.anne}| awk '{{print"chr"$0}}' ) -f {output.fc_ready} {output.nu_gff}
		""")

SnakeMake From line 702 of master/Snakefile

run:
	flug_str = wildcards.flags
	flug_lst = [ s for s in flug_str if s != "_"]
	flug_str = " -%s "*len(flug_lst) % tuple(flug_lst)

	sed_suff = ".vs_%s.%s.sort.bam" % tuple([wildcards.ref_genome, wildcards.aligner])		
	sed_cmd =  " sed -e 's/mapped_reads\/[a-zA-Z0-9\/_]*\///g' | sed -e 's/%s//g' " % tuple([sed_suff])

	shell(""" mkdir -p dxCounts summaries/dxCounts/""")
	shell("""
		featureCounts {flug_str} {params.fc_params} -t exon -g gene_id -F GTF -a {input.dxq_ann} -o {output.counted_features}.tmp {input.alignments_in}
		""")
	shell("""
		cat {output.counted_features}.tmp | tail -n +2 | {sed_cmd} > {output.counted_features}
		cat {output.counted_features}.tmp.summary | {sed_cmd} > {output.count_stats}
		rm {output.counted_features}.tmp* 
		""")
	##cat {output.counted_features}.tmp.jcounts | {sed_cmd} > counts/DEXSeq/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.anne}.{wildcards.aligner}.{wildcards.flags}.counts.jcounts

SnakeMake FeatureCounts From line 723 of master/Snakefile

run:
	#cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"]
	shell(""" mkdir -p diff_exon_use/{wildcards.contrast}/ """)
	shell("""  Rscript scripts/dexSeqe.R {input.counted_features} diff_exon_use/{wildcards.contrast}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.anne}.{wildcards.aligner}.{wildcards.flags} {input.dxq_ann} {wildcards.contrast} 'chr3R_FBgn0004652-'  """)

SnakeMake From line 756 of master/Snakefile

run:
	pandoc_path="/nas/longleaf/apps/rstudio/1.0.136/bin/pandoc"
	pwd = subprocess.check_output("pwd",shell=True).decode().rstrip()+"/"
	shell("""mkdir -p results/figures/supp/ results/tables/supp/""")
	shell(""" R -e "setwd('{pwd}');Sys.setenv(RSTUDIO_PANDOC='{pandoc_path}')" -e  "peaDubDee='{pwd}'; rmarkdown::render('scripts/RNAseq_results.Rmd',output_file='{pwd}{output.pdf_out}')"  """)