Behavioral Genetics in Drosophila: RNA-Seq analysis pipeline
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
Software used to analyze data for:
Deanhardt, B., Duan, Q., Du, C., Soeder, C., Morlote, A., Garg, D., Saha, A., Jones, C. D., & Volkan, P. C. (2023). Social experience and pheromone receptor activity reprogram gene expression in sensory neurons. G3 Genes|Genomes|Genetics , 00 (March), 2006–2021. https://doi.org/10.1093/g3journal/jkad072
Social experience and pheromone receptor activity reprogram behavioral switch and neuromodulatory gene expression in sensory neurons Bryson Deanhardt, Qichen Duan, Chengcheng Du, Charles Soeder, Alec Morlote, Deeya Garg, Corbin D. Jones, Pelin Cayirlioglu Volkan bioRxiv 2021.06.18.449021; doi: https://doi.org/10.1101/2021.06.18.449021 latest version: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v3
Changes in splicing and neuromodulatory gene expression programs in sensory neurons with pheromone signaling and social experience Deanhardt Bryson, Duan Qichen, Du Chengcheng, Soeder Charles, Jones Corbin, C. Volkan Pelin bioRxiv 2021.06.18.449021; doi: https://doi.org/10.1101/2021.06.18.449021 version 1: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v1 version 2: https://www.biorxiv.org/content/10.1101/2021.06.18.449021v2
Analysis downstream of this workflow was performed by Quichen Duan.
Dependencies
The following were required to run the full workflow:
module load python/3.6.6 bedtools bedops samtools r/3.6.0 rstudio/1.1.453 bowtie sratoolkit subread
as well, the Snakefile includes rules with hard paths that will need changing:
fastp_clean_sample_pe : fastp
write_report : pandoc_path
Data
Not all the data in the config.yaml have been published; so full results summary won't run :( Intermediate results can be generated though.
Use
Deanhardt et al. 2021
The results summary from Deanhardt et al. 2021 can be generated: (untested!)
nohup time snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2021 --configfile config.Deanhardt2021.yaml &
This recursively generates the differential expression analysis from raw data, summarizes/vizualizes it in a PDF, and bundles/timestamps the results summary (".pdf")
The underlying differential expression and differential exon use data from Deanhardt et al. 2021 can be generated:
snakemake diff_expr/grpWtVs47b/grpWtVs47b.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVs67d/grpWtVs67d.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVsFru_smolFru/grpWtVsFru_smolFru.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de diff_expr/grpWtVsMut/grpWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de
snakemake diff_exon_use/dex_grpWtVs47b/grpWtVs47b.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVs67d/grpWtVs67d.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVsFru_smolFru/grpWtVsFru_smolFru.vs_dm6main.dm6_genes.mapspliceMulti.M.de diff_exon_use/dex_grpWtVsMut/grpWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.M.de
Full Results Summary
The full results summary ("VolkanLab_BehaviorGenetics.05_Apr_2021.pdf") was generated simply by running
snakemake --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G "
Bypassing the Hard Parts
Users may wish to avoid the time and memory consuming steps of read downloading, cleaning, mapping, and counting. The raw counts underlying the results of Deanhardt et al. 2021 are stored in the NCBI GEO ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE179213 ); a shortcut script is included to download this data and prepare it such that it's where the Snakemake workflow anticipates. Currently only the "multi" alignment is available.
The shortcut script is run thus:
bash utils/shortcut.bash
Fruitless replicate #1 was problematic and not used in the final analysis; it is not included in the download by default. To include it, run:
bash utils/shortcut.bash -f
The main time bottleneck is the "rando" alignment, which downsamples but does not entirely remove multimapping reads. Alignement strategy ultimately had little impact on results, so users may wish to omit them by generating the "noRando" PDF (untested!)
nohup time snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2021 --configfile config.Deanhardt2021.yaml results/VolkanLabBehaviorGenetics.Deanhardt2021.noRando.pdf &
Deanhardt et al. 2023
In order to respond to reviewer comments, the population genetics of the experimental flies were compared to those of published DGRP lines sequenced in Huang et al. (2014). Calling the variants in parallel required some platform-dependent tweaks which will need to be modified; unless these population genetics are of specific interest it is probably best to use the 2021 workflow. However, the 5 May 2023 results summary can be rebuilt with the updated workflow:
snakemake --restart-times 6 --latency-wait 60 --jobs 24 -p --cluster "sbatch --time={params.runtime} -n {params.cores} --mem={params.runmem_gb}G " --snakefile Snakefile.Deanhardt2023 &
In particular, scripts/freebayes-parallel requires a hard path to the GNU parallel utility and to the vcfstreamsort utility from vcflib.
Code of Note
├── config.yaml
│ └── (configuration file for the pipeline)
├── dev
│ ├── ...
│ └── (files from earlier stages in project development, including an older FAIRE-seq experiment)
├── misc
│ ├── ...
│ └── (mostly metadata re: sequencing)
├── README.md
├── quichen_duan
│ ├── ...
│ └── (scripts for followup analyses from Quichen Duan)
├── scripts
│ ├── bam_summarizer.mapspliced.py
│ ├── collapsedSignalMerger.py
│ ├── correlate.R
│ ├── deSeqer.R
│ │ └── (differential expression testing script)
│ ├── dexSeqe.R
│ │ └── (differential exon use testing script)
│ ├── edgeHog.py
│ ├── expressionOmete.R
│ ├── faire_results.Rmd
│ ├── fastp_reporter.py
│ ├── geneOntologe.R
│ ├── multimap_spreader.py
│ ├── overlapSignalCollapser.py
│ ├── read_Hollower.py
│ ├── references.bib
│ ├── replicateCorrelations.png
│ └── RNAseq_results.Rmd
│ └── (summarize results)
├── Snakefile
│ └── (rules file for the pipeline)
├── utils
│ └── annotations
│ ├── ...
│ └── (includes lists of genes of interest)
└── VolkanLab_BehaviorGenetics.05_Apr_2021.pdf └── (pipeline walkthrough and results summary)
Code Snippets
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 | args = commandArgs(trailingOnly=TRUE) # 1: counts file in # 2: prefix for deseq out # 3: name of contrast library("DGCA") library("dplyr") library("yaml") library("readr") library("tidyr") library("stringr") library("magrittr") xpr_in <- args[1] nombre <- args[2] cor_pre <- args[3] # build the sample DF trammel <- read_yaml("config.yaml") data_sets.df <- plyr::ldply(trammel$data_sets, data.frame) data_sets.df$name <- as.factor(data_sets.df$name) data_sets.df$day<- as.factor(data_sets.df$day) data_sets.df$subgroups<- as.factor(data_sets.df$subgroups) data_sets.df$rep<- as.factor(data_sets.df$rep) data_sets.df$housing<- as.factor(data_sets.df$housing) data_sets.df$genotype<- as.factor(data_sets.df$genotype) data_sets.df$tissue<- as.factor(data_sets.df$tissue) # build the correlation experiment corr.df <- plyr::ldply(trammel$correlations, data.frame) %>% filter(name == nombre ) # prep the sample metadata sbgrp <- (corr.df %>% select(filt))[1,] %>% as.character #coldata <- data_sets.df %>% as.data.frame() %>% ungroup()%>% filter(subgroups == sbgrp) #coldata <- data_sets.df %>% ungroup() %>% group_by(name) %>% mutate(genotype = paste0(as.character(genotype), collapse = "," )) %>% filter(subgroups == sbgrp) coldata <- data_sets.df%>% group_by(name) %>% mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup() %>% filter(subgroups == sbgrp) %>% as.data.frame() rownames(coldata) <- coldata$name dsgn.df <- coldata %>% select(name, corr.df$vars %>% as.character() )%>% mutate(present =1) %>% spread(key=genotype, value=present, fill=0) rownames(dsgn.df) <- dsgn.df$name dsgn.df %<>% select(-c("name")) dsgn.mat <- dsgn.df %>% as.matrix() #xpr_in <- "expression/rotund.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.rpkm" #xpr_in <- "expression/rotund.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.rpkm" #load and format the count data xpr.df <- read_delim(xpr_in, "\t", escape_double = FALSE, trim_ws = TRUE) xpr.df %<>% as.data.frame() rownames(xpr.df) <- xpr.df$Geneid xpr.df %<>% select(rownames(dsgn.df)) ddcor_res <- ddcorAll(inputMat = xpr.df, design = dsgn.mat, compare = c("wt", "rn"),adjust = "none", heatmapPlot = FALSE, nPerm = 0) #Write to TSV write_delim(ddcor_res, paste(cor_pre, "corr","tsv", sep="."), delim = "\t") |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 | args = commandArgs(trailingOnly=TRUE) # 1: counts file in # 2: prefix for deseq out # 3: name of contrast library("DESeq2") library("dplyr") library("yaml") library("readr") library("tidyr") library("stringr") counts_in <- args[1] dsq_pre <- args[2] nombre <- args[3] config <- args[4] # build the sample DF #trammel <- read_yaml("config.yaml") trammel <- read_yaml(config) data_sets.df <- plyr::ldply(trammel$data_sets, data.frame) data_sets.df$name <- as.factor(data_sets.df$name) data_sets.df$day<- as.factor(data_sets.df$day) data_sets.df$subgroups<- as.factor(data_sets.df$subgroups) data_sets.df$rep<- as.factor(data_sets.df$rep) data_sets.df$housing<- as.factor(data_sets.df$housing) data_sets.df$genotype<- as.factor(data_sets.df$genotype) data_sets.df$tissue<- as.factor(data_sets.df$tissue) # build the contrast experiment contrasts.df <- plyr::ldply(trammel$contrasts, data.frame) %>% filter(name == nombre ) # prep the sample metadata sbgrp <- (contrasts.df %>% select(filt))[1,] %>% as.character #coldata <- data_sets.df %>% as.data.frame() %>% ungroup()%>% filter(subgroups == sbgrp) #coldata <- data_sets.df %>% ungroup() %>% group_by(name) %>% mutate(genotype = paste0(as.character(genotype), collapse = "," )) %>% filter(subgroups == sbgrp) coldata <- data_sets.df%>% group_by(name) %>% mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup() %>% filter(subgroups == sbgrp) %>% unique() %>% as.data.frame() rownames(coldata) <- coldata$name coldata <- coldata %>% select(contrasts.df$vars %>% as.character() ) #load and format the count data counts.df <- read_delim(counts_in, "\t", escape_double = FALSE, trim_ws = TRUE) counts.df.gath <- counts.df %>% select(-c(Chr, Start, End, Strand, Length)) %>% gather(key = "sample", value = "count", -one_of("Geneid")) counts.df.sprud <- counts.df.gath %>% spread(key = sample, value = count) %>% as.data.frame() rownames(counts.df.sprud) <- counts.df.sprud$Geneid counts.df.sprud <- counts.df.sprud %>% select(rownames(coldata)) # convert counts, metadata, design into DESeq data object #counts.dds <- DESeqDataSetFromMatrix(countData = counts.df.sprud, colData = coldata, design = ~ housing) desgn <- as.character(contrasts.df$design[1]) counts.dds <- DESeqDataSetFromMatrix(countData = counts.df.sprud, colData = coldata, design = eval(parse(text=desgn))) #prefilter to remove low-count rows keep <- rowSums(counts(counts.dds)) >= 10 counts.dds <- counts.dds[keep,] #relevel #counts.dds$condition <- relevel(counts.dds$condition, ref = "untreated") for (i in seq(nrow(contrasts.df))){ var <- contrasts.df[i,]$vars %>% as.character() ruf <-contrasts.df[i,]$ref %>% as.character() counts.dds[[var]] <- relevel(counts.dds[[var]], ref = ruf) } #Run DESeq counts.dds.dsq <- DESeq(counts.dds) # run the vanilla DE counts.dds.res <- results(counts.dds.dsq) %>% as.data.frame() counts.dds.res$geneid <- rownames(counts.dds.res) # Effect-size shrinkage coff <- resultsNames(counts.dds.dsq)[2] counts.dds.res.shrunk <- lfcShrink(counts.dds.dsq, coef=coff, type="apeglm") %>% as.data.frame() #explore shrinkage types?? counts.dds.res.shrunk$geneid <- rownames(counts.dds.res.shrunk) # scale by gene length counts.dds.res<-inner_join(counts.dds.res, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length)) counts.dds.res.shrunk<-inner_join(counts.dds.res.shrunk, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length)) #Write to TSV counts.dds.res.full <- inner_join(counts.dds.res %>% as.data.frame(), counts.dds.res.shrunk %>% as.data.frame(), by =c("geneid"="geneid"), suffix=c(".fresh",".shrunk")) %>% select(geneid, everything()) write_delim(counts.dds.res.full, paste(dsq_pre,"de",sep="."), delim = "\t") ### Now run the itemized DE #get the factors in the design desgn.splt <- (desgn %>% str_replace_all("~","")%>% str_replace_all(" ","") %>% strsplit("+", fixed=TRUE))[[1]] #first, collect the geneids and the basemeans: just run the first factor and its first nonref level fact <- desgn.splt[1] ref_lev <- (contrasts.df %>% filter(vars == fact))$ref all_levs<-levels(counts.dds.dsq[[fact]]) alt_levs<- all_levs[all_levs != ref_lev] basic.df <- results(counts.dds.dsq, contrast=c(fact, as.character(ref_lev), alt_levs[1])) %>% as.data.frame() %>% select(c("baseMean")) basic.df$geneid <- rownames(basic.df) basic.df <- select(basic.df, c("geneid", "baseMean")) #for each factor in factors.... dds.dsq.byFactor.df <- select(basic.df, c("geneid") ) for (fact in desgn.splt){ ref_lev <- as.character((contrasts.df %>% filter(vars == fact))$ref) all_levs<-levels(counts.dds.dsq[[fact]]) alt_levs<- all_levs[all_levs != ref_lev] #running basemean by level: #https://support.bioconductor.org/p/63567/ # # gather baseMeans for each level in factor.levels: factor.baseMeanPerLvl <- sapply( levels(counts.dds.dsq[[fact]]), function(lvl) rowMeans( counts(counts.dds.dsq,normalized=TRUE)[,counts.dds.dsq[[fact]] == lvl, drop=F] ) ) %>% as.data.frame() names(factor.baseMeanPerLvl ) <- paste0(paste("baseMean.",fact,".", sep=""), names(factor.baseMeanPerLvl)) factor.baseMeanPerLvl$geneid <- rownames(factor.baseMeanPerLvl) factor.baseMeanPerLvl <-inner_join(factor.baseMeanPerLvl, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) for (levi in all_levs){ factor.baseMeanPerLvl$x <- factor.baseMeanPerLvl[[paste("baseMean",fact,levi,sep=".")]] factor.baseMeanPerLvl[[paste("expression",fact,levi,sep=".")]] <- factor.baseMeanPerLvl$x/factor.baseMeanPerLvl$Length } factor.baseMeanPerLvl <- factor.baseMeanPerLvl %>% select(-c("x", "Length")) dds.dsq.byFactor.df <- full_join(dds.dsq.byFactor.df, factor.baseMeanPerLvl, by = c("geneid"="geneid")) # #for each alt_level in factor.levels: for (aLev in alt_levs){ #dds.res_factor_level <- results(counts.dds.dsq, contrast=c(fact, ref_lev, aLev)) dds.res_factor_level.shrunk <- lfcShrink(counts.dds.dsq, coef=paste(fact, aLev, "vs", ref_lev, sep = "_"), type="apeglm") %>% as.data.frame() dds.res_factor_level.shrunk$geneid <- rownames(dds.res_factor_level.shrunk) dds.res_factor_level.shrunk<-inner_join(dds.res_factor_level.shrunk, counts.df%>% select(Geneid, Length), by=c("geneid"="Geneid")) %>% mutate(expression = baseMean/Length) %>% select(-c(Length)) rownames(dds.res_factor_level.shrunk) <- dds.res_factor_level.shrunk$geneid dds.res_factor_level.shrunk <- dds.res_factor_level.shrunk %>% select(-c(geneid)) names(dds.res_factor_level.shrunk ) <- paste0(names(dds.res_factor_level.shrunk),paste(".",fact,".vs_",aLev,".apeglm", sep="")) dds.res_factor_level.shrunk$geneid <- rownames(dds.res_factor_level.shrunk) dds.dsq.byFactor.df <- full_join(dds.dsq.byFactor.df, dds.res_factor_level.shrunk, by = c("geneid"="geneid")) # counts.dds.res_factor_level <- results(counts.dds.dsq, contrast=c(factor1, ref_level, alt_level1)) # counts.dds.res_factor_level$geneid <- rownames(counts.dds.res_factor_level) } } #Write to TSV dds.dsq.byFactor.df <- dds.dsq.byFactor.df %>% select(geneid, everything()) write_delim(dds.dsq.byFactor.df, paste(dsq_pre, "itemized.de", sep="."), delim = "\t") |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 | args = commandArgs(trailingOnly=TRUE) # 1: counts file in # 2: prefix for deseq out # 3: name of contrast source("utils/DEXSeq/Subread_to_DEXSeq/load_SubreadOutput.R") library("dplyr") library("yaml") library("readr") library("tidyr") library("magrittr") library("stringr") counts_in <- args[1] #dxCounts/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts dsq_pre <- args[2] #diff_exon_use/dex_wtHousing/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M dsqx_gtf <- args[3] #utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf nombre <- args[4] #dex_wtHousing mPhatic <- args[5] #"chr3R_FBgn0004652-" #mPhatic <- "chr3R_FBgn0004652-" #use example: # Rscript scripts/deSeqer.R {input.fc_in} diff_exon_use/{wildcards.contrast}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} {wildcards.contrast} {gene_id} # Rscript scripts/deSeqer.R dxCounts/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts diff_exon_use/dex_wtHousing/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf dex_wtHousing "_FBgn0004652-" #counts_in <- "counts/DEXSeq/wtHousing.vs_dm6main.dm6_genes.mapspliceMulti.M.dxsq.counts" #dsqx_gtf <- "utils/annotations/DEXSeq/dm6_genes.dxsqReady.gtf" # build the sample DF trammel <- read_yaml("config.yaml") data_sets.df <- plyr::ldply(trammel$data_sets, data.frame) data_sets.df$name <- as.factor(data_sets.df$name) data_sets.df$day<- as.factor(data_sets.df$day) data_sets.df$subgroups<- as.factor(data_sets.df$subgroups) data_sets.df$rep<- as.factor(data_sets.df$rep) data_sets.df$housing<- as.factor(data_sets.df$housing) data_sets.df$genotype<- as.factor(data_sets.df$genotype) data_sets.df$tissue<- as.factor(data_sets.df$tissue) print("metadata frame built") # build the contrast experiment contrasts.df <- plyr::ldply(trammel$xcontrasts, data.frame) %>% filter(name == nombre ) print("contrast defined") # prep the sample metadata sbgrp <- (contrasts.df %>% select(filt))[1,] %>% as.character fact <- (contrasts.df %>% select(vars))[1,] %>% as.character coldata.df <- data_sets.df%>% group_by(name) %>% mutate(genotype = paste0(unique(as.character(genotype)), collapse = "." )) %>% ungroup() %>% filter(subgroups == sbgrp) %>% as.data.frame() row.names(coldata.df) <- coldata.df$name coldata.df %<>% dplyr::select(fact) names(coldata.df) <- "condition" print("experimental metadata selecteted") coldata.df$condition <- factor(coldata.df$condition, ordered = FALSE) coldata.df$condition = relevel(coldata.df$condition, ref = as.character(contrasts.df$ref)) #relevel #dxd.fc$condition <- factor(dxd.fc$condition, ordered=FALSE) #eg, https://support.bioconductor.org/p/74520/ #dxd.fc$condition <- relevel(dxd.fc$condition, ref = as.factor(contrasts.df$ref)) # #counts.dds$condition <- relevel(counts.dds$condition, ref = "untreated") # for (i in seq(nrow(contrasts.df))){ # var <- contrasts.df[i,]$vars %>% as.character() # ruf <-contrasts.df[i,]$ref %>% as.character() # counts.dds[[var]] <- relevel(counts.dds[[var]], ref = ruf) # } # Run DEXSeq #desgn <- "~ sample + exon + condition:exon" desgn <- as.character(contrasts.df$design[1]) dxd.fc = DEXSeqDataSetFromFeatureCounts(counts_in,flattenedfile = dsqx_gtf, sampleData =coldata.df, design= eval(parse(text=desgn))) print("DEXSeq Dataset Built") dxd.fc = estimateSizeFactors( dxd.fc ) print("size factors estimated") dxd.fc = estimateDispersions( dxd.fc ) print("dispersion estimated") png(filename = paste(dsq_pre,".dispersionPlot.png", sep=""), height=500, width=800) plotDispEsts( dxd.fc , main = paste("DEXSeq Dispersion Plot: ",nombre," Contrast", sep="")) dev.off() dxd.fc = testForDEU( dxd.fc ) print("DEU tested for") dxd.fc = estimateExonFoldChanges( dxd.fc, fitExpToVar="condition") print("Exon fold changes estimated") dxd.fc.results = DEXSeqResults( dxd.fc ) print("Results gathered") png(filename = paste(dsq_pre,".",gsub("-","",gsub("\\+","",gsub("_","",mPhatic))),".differentialExonUsePlot.png", sep=""), height=500, width=800) plotDEXSeq( dxd.fc.results, mPhatic, expression=FALSE, splicing=TRUE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2, FDR=0.01 ) dev.off() png(filename = paste(dsq_pre,".",gsub("-","",gsub("\\+","",gsub("_","",mPhatic))),".differentialExonExpressionPlot.png", sep=""), height=500, width=800) plotDEXSeq( dxd.fc.results, mPhatic, expression=TRUE, splicing=FALSE, legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 , FDR=0.01) dev.off() dxd.fc.results.df <- dxd.fc.results %>% as_tibble() #Write to TSV #dxd.fc.results.df <- dxd.fc.results.df %>% select(geneid, everything()) write_delim(dxd.fc.results.df %>% mutate(transcripts=as.character(transcripts)), paste(dsq_pre, "de", sep="."), delim = "\t") print("Results written") print("................... DONE!!!") |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | import argparse import gffutils parser = argparse.ArgumentParser() parser.add_argument("-i", "--gtf_in", help="gene model to cronch") parser.add_argument("-o", "--output_prefix", help="file to write ORFs to") parser.add_argument("-n", "--gene_name", help="name of gene to cronch ") #parser.add_argument("-c", "--length_cutoff", help="ignore ORFs with fewer internal (ie, not start or stops) codons", type=int, default=75) #parser.add_argument("-x", "--extend_across_gaps", help="extend the ORF even if it encounters a gap/N", action="store_true", default=False) #parser.add_argument("-a", "--include_all_starts", help="report all ORFs when multiple start codons present (else choose longest)", action="store_true", default=False) args = parser.parse_args() file_out = args.output_prefix #path2file="/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/utils/annotations/fru.gtf" path2file="/Users/csoeder/Research/VolkanLab_BehaviorGenetics/utils/annotations/fru.gtf" path2file=args.gtf_in fn = gffutils.example_filename(path2file) print(open(fn).read()) #db = gffutils.create_db(fn, dbfn='/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True) #db = gffutils.create_db(fn, dbfn='/proj/cdjones_lab/csoeder/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, id_spec=['ID', 'Name'] ,disable_infer_genes=True) path2gtf = "%s.db" % tuple([path2file]) #db = gffutils.create_db(fn, dbfn='/Users/csoeder/Research/VolkanLab_BehaviorGenetics/test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True) db = gffutils.create_db(fn, dbfn=path2gtf, force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True, disable_infer_genes=True) gene_name="FBgn0004652" gene_name=args.gene_name gene = db[gene_name] transcription_edges = [] exon_to_transcript = {} for transcript in db.children(gene, featuretype='transcript'): transcription_edges.append(transcript.start) transcription_edges.append(transcript.stop) for exon in db.children(transcript, featuretype='exon'): if exon.start in exon_to_transcript.keys(): exon_to_transcript[exon.start].append(transcript.id) else: exon_to_transcript[exon.start] = [transcript.id] if exon.stop in exon_to_transcript.keys(): exon_to_transcript[exon.stop].append(transcript.id) else: exon_to_transcript[exon.stop] = [transcript.id] # currently ignoring the *transcript* start/stop sites .... kick all of them out startStop = [] for transcript in db.children(gene, featuretype='transcript'): startStop.extend([transcript.start, transcript.stop]) startStop = list(set(startStop)) for term in startStop: exon_to_transcript.pop(term) isoid_definitions = {} isoid_num=0 for iso in exon_to_transcript.values(): if iso in isoid_definitions.values(): pass else: isoid_definitions["isoid_%s"%tuple([isoid_num])] = iso isoid_num +=1 isoids = {} for key in isoid_definitions.keys(): isoids[key]={"transcripts":isoid_definitions[key],"junctions":[]} for junk in exon_to_transcript.keys(): if exon_to_transcript[junk] == isoid_definitions[key]: #print(junk, key, exon_to_transcript[junk]) isoids[key]["junctions"].append(junk) phial_name = "%s.isoids.gtf" % tuple([file_out]) phial_out = open(phial_name, "w") for isoid_name in isoids.keys(): #isoid_name='isoid_0' isoid = isoids[isoid_name] j_list =isoid['junctions'] j_list.sort() gene_gtf_list = ["edgeHog","gene",min(j_list), max(j_list), 0, gene.strand, ".", "gene_id '%s'; transcript_id '%s';\n" %tuple([isoid_name, isoid_name]) ] gene_gtf_str="\t%s"*len(gene_gtf_list) % tuple(gene_gtf_list) gene_gtf_str = "%s%s" % tuple([gene.chrom , gene_gtf_str]) junk_strungs = [] j_count=0 for jay in j_list: junct_gtf_list = ["edgeHog","exon",jay, jay, 0, gene.strand, ".", "gene_id '%s'; transcript_id '%s'; junction_id j_%s;\n" %tuple([isoid_name, isoid_name, j_count]) ] junct_gtf_str="\t%s"*len(junct_gtf_list) % tuple(junct_gtf_list) junct_gtf_str = "%s%s" % tuple([gene.chrom , junct_gtf_str]) junk_strungs.append(junct_gtf_str) j_count += 1 phial_out.write(gene_gtf_str) for junk_line in junk_strungs: phial_out.write(junk_line) phial_out.close() phial_name = "isoids.list" phial_name = "%s.isoids.list" % tuple([file_out]) phial_out = open(phial_name, "w") for isoid_name in isoids.keys(): for trans in isoids[isoid_name]["transcripts"]: phial_out.write("'%s'\t%s\n" %tuple([isoid_name, trans])) phial_out.close() |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | args = commandArgs(trailingOnly=TRUE) # 1: counts file in # 2: prefix for deseq out # 3: name of contrast library("dplyr") #library("yaml") library("readr") library("tidyr") library("magrittr") #library("stringr") counts_in <- args[1] # counts from featureCount #counts_in <- "counts/all.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.counts" aln_meta <- args[2] # metadata for alignments (to get million mapped reads) #aln_meta <- "summaries/alignments.vs_dm6main.mapspliceMulti.summary" exp_out <- args[3] # output file prefix to write expression #exp_out <- "expression.potato" counts.df <- read_delim(counts_in, "\t", escape_double = FALSE, trim_ws = TRUE) RPK.df <- counts.df %>% select(-c(Chr, Start, End, Strand)) %>% mutate(Length = Length/1000) # convert bases to kb #turn counts into per-kb rates #https://stackoverflow.com/a/48693978 RPK.df[,-(1:2)] %<>% sapply(`/`, RPK.df[,2]) RPK.df.gath <- RPK.df %>% gather(key = "sample", value = "count", -one_of("Geneid", "Length")) #collect the million reads mapped (proper pairs, given the current featureCounts settings) aln_meta.df <- read_delim(aln_meta, "\t", escape_double = FALSE, trim_ws = TRUE,col_names = F) names(aln_meta.df)<- c("ref_genome","aligner","sample","measure","value") aln_meta.df %<>% filter(measure == "properly_paired_count") %<>% select(-c("measure", "aligner", "ref_genome")) RPKM.df.gath <- inner_join(RPK.df.gath, aln_meta.df, by =c("sample"="sample")) RPKM.df.gath %<>% mutate(MMR = value/1000000, expression = count/MMR ) %>% select(c("Geneid", "sample", "expression")) RPKM.df.sprud <- RPKM.df.gath %>% spread(key = "sample", value = "expression") write_delim(RPKM.df.sprud, paste(exp_out,".rpkm", sep=""), delim = "\t", col_names = T) RPK.df <- counts.df %>% select(-c(Chr, Start, End, Strand)) %>% mutate(Length = Length/1000) # convert bases to kb # TPM definition https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ # https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ RPK.df[,-(1:2)] %<>% sapply(`/`, RPK.df[,2]) RPK.gath.df <- RPK.df %>% select(-c("Length")) %>% gather(key = "sample", value = "rpk", -one_of("Geneid")) RPK.gath.grupt.sum.df <- RPK.gath.df %>% group_by(sample) %>% summarise(total=sum(rpk)) TPM.df <- inner_join(RPK.gath.df, RPK.gath.grupt.sum.df, by=c("sample"="sample")) %>% mutate(TPM = rpk/(total/1000000)) %>% select(-c("rpk","total")) %>% spread(key=sample, value=TPM) write_delim(TPM.df, paste(exp_out,".tpm", sep=""), delim = "\t", col_names = T) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | import json # https://stackoverflow.com/questions/19483351/converting-json-string-to-dictionary-not-list import argparse parser = argparse.ArgumentParser() parser.add_argument("json_in", help="JSON file to be condensed") parser.add_argument("flat_out", help="flatfile summary") parser.add_argument("-t", "--tag", help="line-name for the processed JSON", default=None) args = parser.parse_args() def prune_jason(jsn): sparse_dict = {} sparse_dict = dict([(k,{}) for k in ['prefiltered', 'postfiltered', 'filtration', 'duplication']]) for i in ['gc_content','q20_rate', 'q30_rate', 'read1_mean_length','total_reads']: try: sparse_dict['prefiltered'][i] = jsn['summary']['before_filtering'][i] except KeyError: sparse_dict['prefiltered'][i] = "NA" try: sparse_dict['postfiltered'][i] = jsn['summary']['after_filtering'][i] except KeyError: sparse_dict['postfiltered'][i] = "NA" for i in ['adaptor_trimmed_reads']: try: sparse_dict['filtration'][i] = jsn['adapter_cutting']['adapter_trimmed_reads'] except KeyError: sparse_dict['filtration'][i] = "NA" for i in ['corrected_reads','low_quality_reads','passed_filter_reads']: try: sparse_dict['filtration'][i] = jsn['filtering_result'][i] except KeyError: sparse_dict['filtration'][i] = "NA" for i in ['rate']: sparse_dict['duplication'][i] = jsn['duplication'][i] return sparse_dict def print_pruned(prn): lines2write = [ [x, y, prn[x][y]] for x in prn.keys() for y in prn[x]] if args.tag: [ ell.insert(0, args.tag) for ell in lines2write ] phial_out = open(args.flat_out, 'w') for preline in lines2write: field_count = len(preline) line = ("%s" + "\t%s"*(field_count-1) + "\n") % tuple(preline) phial_out.write(line) phial_out.close() jason_file=open(args.json_in) jason_str = jason_file.read() jason = json.loads(jason_str) jason_condensed = prune_jason(jason) print_pruned(jason_condensed) |
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 | args = commandArgs(trailingOnly=TRUE) library("readr") library("biomaRt") library("org.Dm.eg.db") library("topGO") library("dplyr") library("yaml") library("stringr") library("magrittr") data("geneList") deSeq_file <- args[1] nombre <- args[2] go_out <- args[3] config <- args[4] #nombre <- "hausWtVsMut" sig_thresh <- 0.01 #hausWtVsMut_vs_dm6main_dm6_genes_mapspliceMulti_MpBC_itemized <- read_delim("diff_expr/hausWtVsMut/hausWtVsMut.vs_dm6main.dm6_genes.mapspliceMulti.MpBC.itemized.de", "\t", escape_double = FALSE, trim_ws = TRUE) DESeq.itemized<- read_delim(deSeq_file, "\t", escape_double = FALSE, trim_ws = TRUE) trammel <- read_yaml(config) #trammel <- read_yaml("config.yaml") # build the contrast experiment contrasts.df <- plyr::ldply(trammel$contrasts, data.frame) %>% filter(name == nombre ) #get the factors in the design desgn <- as.character(contrasts.df$design[1]) desgn.splt <- (desgn %>% str_replace_all("~","")%>% str_replace_all(" ","") %>% strsplit("+", fixed=TRUE))[[1]] query_genes.list <- DESeq.itemized$geneid %>% unique() %>% as.list() #ensembl_dm = useMart("ensembl",dataset="dmelanogaster_gene_ensembl") marty <- useDataset("dmelanogaster_gene_ensembl", useMart("ensembl", host = "useast.ensembl.org") ) G_list <- getBM(attributes= c("flybase_gene_id", "ensembl_gene_id", "external_gene_name", "go_id", "name_1006"), mart= marty, useCache= FALSE) %>% mutate(go_name = name_1006) %>% select(-c("name_1006")) #G_list <- getBM(attributes= c("flybase_gene_id", "ensembl_gene_id", "external_gene_name", "go_id", "name_1006"), mart= marty) %>% mutate(go_name = name_1006) %>% select(-c("name_1006")) ### ### ### ### ### ### ### ### ### plz fix this go_bad <- c("GO:0110109","GO:0120176","GO:0120177","GO:0120170","GO:0062023") bad_boiz <- G_list %>% filter(go_id %in% go_bad) %>% dplyr::select("flybase_gene_id") bad_boiz <- rbind(bad_boiz, "FBgn0050046", "FBgn0033710", "FBgn0050203", "FBgn0026721") #mask the baddies G_list.unq <- G_list %>% filter(!flybase_gene_id %in% bad_boiz$flybase_gene_id ) %>% dplyr::select("flybase_gene_id") %>% unique() DESeq.itemized %<>% filter(!(geneid %in% bad_boiz$flybase_gene_id)) ### ### ### ### ### ### ### ### #present_gene_list <- G_list.unq$flybase_gene_id %in% DESeq.itemized$geneid %>% as.numeric() #names(present_gene_list) <- G_list.unq$flybase_gene_id jeanOnt.mega.df <- as.data.frame(c()) #for each factor in factors.... for (fact in desgn.splt){ ref_lev <- (contrasts.df %>% filter(vars == fact))$ref #all_levs<-levels(counts.dds.dsq[[fact]]) #subset the DE data to the factor # DESeq.itemized.factor <- DESeq.itemized[,!is.na(str_locate(names(DESeq.itemized), paste(fact,".vs_" ,sep=""))[,1])] DESeq.itemized.factor <- DESeq.itemized[,!is.na(str_match(names(DESeq.itemized), paste(fact,".vs_(.*?).apeglm" ,sep=""))[,1])] #pull out the alt levels from the data # alt_levs<- (str_split_fixed(names(DESeq.itemized.factor), paste(fact,".vs_", sep=""), n=2)[,2] %>% str_split_fixed("\\.", n=2))[,1] %>% unique() noms <- str_match(names(DESeq.itemized), paste(fact,".vs_(.*?).apeglm" ,sep=""))[,2] alt_levs <- unique(noms[!is.na(noms)]) DESeq.itemized.factor$geneid <- DESeq.itemized$geneid # #for each alt_level in factor.levels: for (alt in alt_levs) { print(alt) #subset the DE data to that level DESeq.itemized.level <- DESeq.itemized.factor[,!is.na(str_locate(names(DESeq.itemized.factor), paste(fact,"\\.vs_",alt,"\\." ,sep=""))[,1])] DESeq.itemized.level$geneid <- DESeq.itemized.factor$geneid #pull out the genes & significance values sig_gene_list <- DESeq.itemized.level[paste("padj.",fact,".vs_",alt,".apeglm",sep="")][[1]]# < sig_thresh)[,1] #%>% as.numeric() sig_gene_list[is.na(sig_gene_list)] <- 1 names(sig_gene_list) <- DESeq.itemized.level$geneid jeanOnt.df <- as.data.frame(c()) for (ont in c("MF","BP","CC") ){ #topDiffGenes uses p cutoff of 0.01 GOdata <- new("topGOdata", ontology = ont, allGenes = sig_gene_list, geneSel = topDiffGenes, annot = annFUN.org, mapping = "org.Dm.eg.db", ID="Ensembl") fishy <- runTest(GOdata, algorithm = "classic", statistic = "fisher") kissy <- runTest(GOdata, algorithm = "classic", statistic = "ks") goRes.tmp <- GenTable(GOdata, classicFisher =fishy, classicKS = kissy, topNodes = 50) %>% mutate(ontology = as.factor(ont), classicFisher = as.numeric(classicFisher), classicKS = as.numeric(classicKS)) %>% select(-c(Term), Term) %>% as_tibble() goRes.tmp %<>% left_join(G_list%>% select(c("go_id", "go_name")) %>% unique(), by =c("GO.ID" = "go_id")) %>% select(-c("Term")) jeanOnt.df <- rbind(jeanOnt.df, goRes.tmp) #store & bind the topGO results. } jeanOnt.df$variable <- as.factor(fact) jeanOnt.df$alt_level <- as.factor(alt) jeanOnt.mega.df <- rbind(jeanOnt.mega.df,jeanOnt.df) } } write.table(jeanOnt.mega.df, file=args[3], row.names=FALSE, col.names = TRUE, sep = "\t", quote=F) |
2
of
scripts/geneOntologe.R
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | import pysam import argparse parser = argparse.ArgumentParser() parser.add_argument("-i", "--samfile_in", help="file in SAM format to ") # parser.add_argument("-i", "--idxstat_in", help="samtools idxstat report") # parser.add_argument("-g", "--split_coverage", help="bedtools genomecov report, with split reads") # parser.add_argument("-G", "--spanning_coverage", help="bedtools genomecov report, with spanning reads") # parser.add_argument("-d", "--depthstats_in", help="samtools depth report") # parser.add_argument("-D", "--dupestats_in", help="samtools markdup stats report") # parser.add_argument( "-m", "--mapping_multiplicity", help="histogram of mapping multiplicity scraped from the IH:: tags") # parser.add_argument( "-c", "--mapped_count", help="count of mapped reads") parser.add_argument("-o", "--samfile_out", help="flatfile summary") # parser.add_argument("-t", "--tag", help="line-name for the flatfile", default=None) args = parser.parse_args() sam_in = args.samfile_in sammy = pysam.AlignmentFile(sam_in,"rb") sam_out = args.samfile_out samuel = pysam.AlignmentFile(sam_out, "w", template=sammy) # cigartuples encoding: # M = 0 # I = 1 # D = 2 # N = 3 # S = 4 for read in sammy.fetch(): cig = read.cigar nu_cig = [] suffix = [] if cig[0][0] == 4: cig = cig[1:] if cig[-1][0] == 4: #discard the softclips cig = cig[:-1] front_pad_tally = 0 while cig[0][0] in [0,1,2]: front_pad_tally += cig[0][1] cig.pop(0) back_pad_tally = 0 while cig[-1][0] in [0,1,2]: back_pad_tally += cig[-1][1] cig.pop(-1) nu_cig.extend([(3,front_pad_tally-1),(0,1)]) # new cigar starts with 5' end clipped down to the actual splice # if cig[0][0] == 0: #take the terminal match blocks # num_match = cig[0][-1] # prefix = [(3,num_match-1),(0,1)] # nu_cig.extend(prefix) #add a bunch of Ns to pad ; ignore 5' and 3' read ends. # cig = cig[1:] # if cig[-1][0] == 0: # num_match = cig[-1][1] # suffix = # cig = cig[:-1] rill_tally = 0 mids = [] for rillo in cig: if rillo[0] in [0,1,2]: rill_tally += rillo[1] #consolidate exons else: if rill_tally > 0: mids.extend([(0,1),(3,rill_tally-2),(0,1)]) #write hollowed exon mids.extend([rillo]) # write intron rill_tally=0 # reset tally nu_cig.extend(mids) nu_cig.extend([(0,1),(3,back_pad_tally-1),]) # end cigar with splice and Ns to 3' end read.cigar = nu_cig read.seq = "*" samuel.write(read) samuel.close() |
60 61 62 63 64 65 66 67 68 | run: shell(""" mkdir -p results/figures/; touch results/figures/null.png; for fig in results/figures/*png; do mv $fig $(echo $fig| rev | cut -f 2- -d . | rev ).$(date +%d_%b_%Y).png; done; rm results/figures/null.*.png; """) shell(""" mkdir -p results/figures/supp/ ; touch results/figures/supp/null.png; for fig in results/figures/supp/*png; do mv $fig $(echo $fig| rev | cut -f 2- -d . | rev ).$(date +%d_%b_%Y).png; done; rm results/figures/supp/null.*.png; """) shell(""" mkdir -p results/tables/ ; touch results/tables/null.tmp ; for phial in $(ls -p results/tables/ | grep -v /); do pre=$(echo $phial | rev | cut -f 2- -d . | rev ); suff=$(echo $phial | rev | cut -f 1 -d . | rev ); mv results/tables/$phial results/tables/$pre.$(date +%d_%b_%Y).$suff; done ; rm results/tables/null.*.tmp; """) shell(""" mkdir -p results/tables/supp/ ; touch results/tables/supp/null.tmp ; for phial in $(ls -p results/tables/supp/ | grep -v /); do pre=$(echo $phial | rev | cut -f 2- -d . | rev ); suff=$(echo $phial | rev | cut -f 1 -d . | rev ); mv results/tables/supp/$phial results/tables/supp/$pre.$(date +%d_%b_%Y).$suff; done ; rm results/tables/supp/null.*.tmp; """) shell(""" mv results/VolkanLabBehaviorGenetics.full.pdf results/VolkanLab_BehaviorGenetics.full.$(date +%d_%b_%Y).pdf """) shell(""" tar cf results.full.$(date +%d_%b_%Y).tar results/ """) |
79 80 81 82 83 84 85 86 87 88 | run: print("%s%s" % tuple(["FASTQ/", wildcards.path])) try: sra = sra_by_fqpath["%s%s/" % tuple(["FASTQ/", wildcards.path])] shell(""" mkdir -p FASTQ/{wildcards.path}/ """) shell(""" fasterq-dump --split-3 --outdir FASTQ/{wildcards.path}/ {sra} """) except KeyError: raise KeyError("Sample is listed as empirical but no reads found and no SRA to download!" ) |
97 98 99 100 101 102 103 104 105 106 107 108 109 110 | run: shell("mkdir -p utils/") shell( """ cd utils; wget http://protocols.netlab.uky.edu/~zeng/MapSplice-v2.1.8.zip ; unzip MapSplice-v2.1.8.zip; 2to3 -w MapSplice-v2.1.8/; cd MapSplice-v2.1.8/; make; cd ..; rm MapSplice-v2.1.8.zip; """ ) |
121 122 123 124 125 126 | run: fai_path = ref_genome_by_name[wildcards.ref_genome]['fai'], shell("mkdir -p utils") shell( 'bedtools makewindows -w {wildcards.window_size} -s {wildcards.slide_rate} -g {fai_path} -i winnum | bedtools sort -i - > {output.windowed}' ) |
135 136 137 138 139 | run: shell("mkdir -p utils/annotations") base_path = annotation_by_name[genelist_by_name[wildcards.subname]["base"]]["bed_path"] list_file = genelist_by_name[wildcards.subname]["list_path"] shell(""" grep -wFf <( tail -n +2 {list_file} | cut -f 2 ) {base_path} > {output.annot_out} """) |
150 151 152 153 154 | shell: """ mkdir -p summaries/reference_genomes/ cat {input.fai_in} | awk '{{sum+=$2}} END {{ print "number_contigs\t",NR; print "number_bases\t",sum}}' | sed -e 's/^/{wildcards.ref_gen}\t/g' > {output.report_out}; """ |
165 166 | shell: "cat {input.refgen_reports} > {output.refgen_summary}" |
178 179 180 181 182 183 | run: shell(""" mkdir -p summaries/annotations/ """) shell(""" rm -f {output.report_out} """) shell(""" cat {input.annot} | cut -f 1 | grep -v "chr2110000222\|chrmitochondrion\|Un\|rand\|chrrDNA" | sort | uniq -c | tr -s " " | tr " " "\t" | awk '{{print"count\t"$2"\t"$1}}' >> {output.report_out} """) shell(""" cat {input.annot} | wc -l | awk '{{print"count\ttotal\t"$0}}' >> {output.report_out} """) shell(""" cat {input.annot} | awk '{{print$3-$2;}}' | awk '{{sum+=$1; sumsq+=$1*$1}} END {{ print "size\ttotal\t",sum; print "size\tavg\t",sum/NR; print "size\tstd\t",sqrt(sumsq/NR - (sum/NR)**2)}}' >> {output.report_out} """) |
195 196 197 198 199 | run: print([a["name"] for a in config['annotations'] if not a["derived"] ]) shell(""" rm -f {output.refann_summary} """) for anne in [a["name"] for a in config['annotations'] if not a["derived"] ]: shell(""" cat summaries/annotations/{anne}.stats | awk '{{print"{anne}\t"$0}}' >> {output.refann_summary}""") |
213 214 | run: shell(""" python scripts/edgeHog.py --gtf_in $(pwd)/utils/annotations/fru.gtf -o utils/annotations/fru -n FBgn0004652 """) |
227 228 229 230 231 232 233 | run: shell(""" rm -f {output.statsOut} """) shell(""" echo "count total $( tail -n +2 {input.list_in} | wc -l )" >> {output.statsOut} """) annot_path = annotation_by_name[genelist_by_name[wildcards.listicle]["base"]]["bed_path"] annot_name = annotation_by_name[genelist_by_name[wildcards.listicle]["base"]]["name"] shell(""" echo "count\tannotated\t$(cat {input.bed_in} | wc -l )">> {output.statsOut} """) shell(""" cat {input.bed_in} | awk '{{print$3-$2;}}' | awk '{{sum+=$1; sumsq+=$1*$1}} END {{ print "size\ttotal\t",sum; print "size\tavg\t",sum/NR; print "size\tstd\t",sqrt(sumsq/NR - (sum/NR)**2)}}' >> {output.statsOut} """) |
245 246 247 248 | run: for listicle in genelist_by_name.keys(): annot = annotation_by_name[genelist_by_name[listicle]["base"]]["name"] shell(""" cat summaries/geneLists/{listicle}.stats | awk '{{print"{listicle}\t{annot}\t"$0}}' >> {output.allStats} """) |
268 269 | shell: "/nas/longleaf/home/csoeder/modules/fastp/fastp {params.common_params} {params.pe_params} --in1 {input.fileIn[0]} --out1 {output.fileOut[0]} --in2 {input.fileIn[1]} --out2 {output.fileOut[1]}" |
284 285 286 287 288 | shell: """ cp {input.jason} summaries/FASTP/{wildcards.samplename}.json python3 scripts/fastp_reporter.py {input.jason} {output.jason_pruned} -t {wildcards.samplename} """ |
301 302 | shell: "cat {input.jasons_in} > {output.summary}" |
318 319 320 321 322 323 324 325 326 327 328 329 | run: ref_genome_split = ref_genome_by_name[wildcards.ref_genome]['split'], ref_genome_bwt = ref_genome_by_name[wildcards.ref_genome]['bowtie'], shell(""" mkdir -p mapped_reads/mapspliceRaw/{wildcards.sample}/ summaries/BAMs/""") shell(""" python utils/MapSplice-v2.1.8/mapsplice.py -c {ref_genome_split} -x {ref_genome_bwt} -1 {input.reads_in[0]} -2 {input.reads_in[1]} -o mapped_reads/mapspliceRaw/{wildcards.sample} """) shell(""" samtools view -Sbh mapped_reads/mapspliceRaw/{wildcards.sample}/alignments.sam | samtools sort - > mapped_reads/mapspliceRaw/{wildcards.sample}/{wildcards.sample}.vs_{wildcards.ref_genome}.mapspliceRaw.sort.bam; samtools index {output.raw_bam}; rm mapped_reads/mapspliceRaw/{wildcards.sample}/alignments.sam; """) |
346 347 348 349 350 351 | run: shell(""" mkdir -p mapped_reads/mapspliceMulti/{wildcards.sample}/ summaries/BAMs/""") shell(""" samtools sort -n {input.raw_bam} | samtools fixmate -m - - | samtools sort - | samtools markdup {params.dup_flg} - - | samtools view -bh {params.quality} > {output.multi_bam} ; samtools index {output.multi_bam} """) |
366 367 368 369 370 371 | run: shell(""" mkdir -p mapped_reads/mapspliceUniq/{wildcards.sample}/ summaries/BAMs/""") shell(""" samtools view {input.multi_bam} | grep -w "IH:i:1" | cat <( samtools view -H {input.multi_bam} ) - | samtools view -Sbh > {output.uniq_bam}; samtools index {output.uniq_bam} """) |
388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 | run: shell(""" mkdir -p mapped_reads/mapspliceRando/{wildcards.sample}/ summaries/BAMs/""") shell(""" samtools sort -n {input.multi_bam} | samtools view | grep -v "IH:i:[01]" > {output.rando_bam}.multi.unsampled rm -f {output.rando_bam}.multi.sampled for idx in $(cut -f 1 {output.rando_bam}.multi.unsampled | sort | uniq); do grep -w $idx {output.rando_bam}.multi.unsampled | sort --random-sort | head -n 1 >> {output.rando_bam}.multi.sampled done; cat <( samtools view -h {input.uniq_bam} ) {output.rando_bam}.multi.sampled | samtools view -Sbh | samtools sort - > {output.rando_bam}; samtools index {output.rando_bam} rm {output.rando_bam}.multi.unsampled {output.rando_bam}.multi.sampled """) |
416 417 418 419 420 421 422 423 424 425 | run: shell(""" mkdir -p mapped_reads/{wildcards.aligner}_SpliceOnly/{wildcards.sample}/ summaries/BAMs/""") shell(""" cat <(samtools view -H {input.in_bam}) <(samtools view {input.in_bam}| awk '($6 ~ /N/)' | less) | samtools view -hbS > {output.out_bam}.tmp; samtools index {output.out_bam}.tmp; python scripts/read_Hollower.py -i {output.out_bam}.tmp -o {output.out_bam}.tmp.sam; cat {output.out_bam}.tmp.sam | sed -e 's/\tN\t/\t*\t/g' | samtools view -hbS > {output.out_bam}; samtools index {output.out_bam}; rm {output.out_bam}.tmp {output.out_bam}.tmp.sam; """) |
440 441 442 443 | run: ref_genome_idx=ref_genome_by_name[wildcards.ref_genome]['fai'] shell(""" which samtools """) shell("samtools idxstats {input.bam_in} > {input.bam_in}.idxstats") |
475 476 | shell: """cat {input.bam_reports} | awk '{{print "{wildcards.ref_genome}\t{wildcards.aligner}\t"$0}}'> {output.full_report}""" |
491 492 493 494 | run: fai = ref_genome_by_name[wildcards.ref_genome]['fai'] shell(""" mkdir -p summaries/coverage/ """) shell(""" samtools view -hb {input.bam_in} {params.location} | bedtools genomecov -split -bga -ibam - -g {fai} > {output.bg_out} """) |
510 511 | run: shell(""" touch {output} """) |
530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 | run: flug_str = wildcards.flags flug_lst = [ s for s in flug_str if s != "_"] flug_str = " -%s "*len(flug_lst) % tuple(flug_lst) sed_suff = ".vs_%s.%s.sort.bam" % tuple([wildcards.ref_genome, wildcards.aligner]) sed_cmd = " sed -e 's/mapped_reads\/[a-zA-Z0-9\/_]*\///g' | sed -e 's/%s//g' " % tuple([sed_suff]) annot_gtf = annotation_by_name[wildcards.annot]["gtf_path"] shell(""" mkdir -p counts summaries/counts/""") shell(""" featureCounts {flug_str} {params.fc_params} -t exon -g gene_id -F GTF -a <(cat {annot_gtf} | awk '{{print"chr"$0}}') -o {output.counted_features}.tmp {input.alignments_in} """) shell(""" cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp | tail -n +2 | {sed_cmd} > {output.counted_features} cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp.jcounts | {sed_cmd} > counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.jcounts cat counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp.summary | {sed_cmd} > counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.summary rm counts/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags}.counts.tmp* """) |
563 564 565 566 567 | run: shell(""" head -n 1 {input.sum_in} > {output.count_sum}; cat {input.sum_in} | grep -w "Assigned\|Unassigned_NoFeatures\|Unassigned_Ambiguity" >> {output.count_sum}; """) |
583 584 585 | run: shell(""" mkdir -p expression/ """) shell(""" Rscript scripts/expressionOmete.R {input.counted_features} {input.aln_rprt} expression/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flags} """) |
599 600 601 | run: shell(""" mkdir -p utils/ """) shell(""" touch {output.expFlg} """) |
617 618 619 620 621 622 | run: #cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"] shell(""" mkdir -p diff_expr/{wildcards.contrast} """) shell(""" Rscript scripts/deSeqer.R {input.fc_in} diff_expr/{wildcards.contrast}/{wildcards.contrast}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} {wildcards.contrast} config.yaml """) #shell(""" mkdir -p meta/DESeq2_methods/ """) #shell(""" mv diff_expr/{wildcards.contrast}/{wildcards.contrast}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag}.method meta/DESeq2_methods/ """) |
637 638 639 640 | run: #cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"] shell(""" mkdir -p gene_ont/{wildcards.contrast}/ """) shell(""" Rscript scripts/geneOntologe.R {input.deSq_itemized} {wildcards.contrast} {output.deSq_go} config.yaml """) |
656 657 658 659 | run: #cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"] shell(""" mkdir -p correlations/{wildcards.correlation}/ """) shell(""" Rscript scripts/correlate.R {input.xpr_in} {wildcards.correlation} correlations/{wildcards.correlation}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.annot}.{wildcards.aligner}.{wildcards.flag} """) |
673 674 675 676 677 | run: shell( """ mkdir -p utils/DEXSeq/; """) |
702 703 704 705 706 707 | run: shell( """ mkdir -p utils/annotations/DEXSeq/; python utils/DEXSeq/Subread_to_DEXSeq/dexseq_prepare_annotation2.py {params.prep_params} <( cat {input.anne}| awk '{{print"chr"$0}}' ) -f {output.fc_ready} {output.nu_gff} """) |
723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 | run: flug_str = wildcards.flags flug_lst = [ s for s in flug_str if s != "_"] flug_str = " -%s "*len(flug_lst) % tuple(flug_lst) sed_suff = ".vs_%s.%s.sort.bam" % tuple([wildcards.ref_genome, wildcards.aligner]) sed_cmd = " sed -e 's/mapped_reads\/[a-zA-Z0-9\/_]*\///g' | sed -e 's/%s//g' " % tuple([sed_suff]) shell(""" mkdir -p dxCounts summaries/dxCounts/""") shell(""" featureCounts {flug_str} {params.fc_params} -t exon -g gene_id -F GTF -a {input.dxq_ann} -o {output.counted_features}.tmp {input.alignments_in} """) shell(""" cat {output.counted_features}.tmp | tail -n +2 | {sed_cmd} > {output.counted_features} cat {output.counted_features}.tmp.summary | {sed_cmd} > {output.count_stats} rm {output.counted_features}.tmp* """) ##cat {output.counted_features}.tmp.jcounts | {sed_cmd} > counts/DEXSeq/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.anne}.{wildcards.aligner}.{wildcards.flags}.counts.jcounts |
756 757 758 759 | run: #cont_sbgrp = contrasts_by_name[wildcards.contrast]["filt"] shell(""" mkdir -p diff_exon_use/{wildcards.contrast}/ """) shell(""" Rscript scripts/dexSeqe.R {input.counted_features} diff_exon_use/{wildcards.contrast}/{wildcards.group}.vs_{wildcards.ref_genome}.{wildcards.anne}.{wildcards.aligner}.{wildcards.flags} {input.dxq_ann} {wildcards.contrast} 'chr3R_FBgn0004652-' """) |
790 791 792 793 794 | run: pandoc_path="/nas/longleaf/apps/rstudio/1.0.136/bin/pandoc" pwd = subprocess.check_output("pwd",shell=True).decode().rstrip()+"/" shell("""mkdir -p results/figures/supp/ results/tables/supp/""") shell(""" R -e "setwd('{pwd}');Sys.setenv(RSTUDIO_PANDOC='{pandoc_path}')" -e "peaDubDee='{pwd}'; rmarkdown::render('scripts/RNAseq_results.Rmd',output_file='{pwd}{output.pdf_out}')" """) |
Support
- Future updates
Related Workflows





