Global workflow: RNA Sequencing Analysis Pipeline with Miniconda and Snakemake

public 1yr ago 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Global workflow

alt text

First step : Data cleaning (Optionnal)

Second step : Get references files (reference genome, transcriptome) Then:

Differently Alternatively Spliced Genes (DASG)
Differently Expressed Genes (DEG)
ChromatineStateEnrichment (from DEG list)
Diffential Transcrits Usage (DTU)

Prerequires

If it isn't install, please download Miniconda latest-version here (https://docs.conda.io/en/latest/miniconda.html)

After that, download the RNA_Seq_Pipeline project:

git clone https://github.com/vindarbot/RNA_Seq_Pipeline.git

And go in the repository

cd RNA_Seq_Pipeline

To create the environment to run the analysis:

on macx OS:

conda env create -f envs/macos.yml

conda activate macos

Also for macos, please install gawk:

brew install gawk

on Linux:

conda env create -f envs/linux.yml

conda activate linux

Configure the experimental design of the analysis, and which parts of the analysis to execute

This step is the more important. The "config.yaml" file indicates important parameters to correctly execute the pipeline. For exemple, the extension of fastq files (default: .fastq.gz ), the reads length ..

By default, all the steps are executed:

DEG (Differently Expressed Genes)
DTU (DIffenrially Transcrits Usage)
DASG (Differently Alternatively Spliced Genes)
CSE (Chromatin State Enrichment)

If you don'y want to use one of the step, please put the variable "exec" to "False"

DEG :

exec : False

WARNING

If you want to perform CSE analysis, DEG analysis must also be executed.

If you want to only perform CSE analysis, please download the tool individually : (https://github.com/vindarbot/ChromatineStateEnrichment)

Files and Folders

- Experience/

This folder must include all raw files (.fastq format) for the analysis. The nomenclature to use for the files name is as follows:

paired-ends reads: {nameOfCondition}_{n°ofReplicate}_R1.fastq.gz, {nameOfCondition} _{n°ofReplicate}_R2.fastq.gz
single-ends reads: {nameOfCondition}_{n°ofReplicate}.fastq.gz

For exemple, for paired-ends reads from two conditions with two replicates:

Experience/Col_1_R1.fastq.gz
Experience/Col_2_R1.fastq.gz
Experience/Col_1_R2.fastq.gz
Experience/Col_2_R2.fastq.gz
Experience/Mut_1_R1.fastq.gz
Experience/Mut_1_R2.fastq.gz
Experience/Mut_2_R1.fastq.gz
Experience/Mut_2_R2.fastq.gz

In this case, the file name

! WARNING !

If you don't want to run the trimming step (for exemple if the fastq files are already trimmed, you have to include the fastq files in the Trimming/ folder and add ".trim" extension in the file name as follow:

Trimming/Col_1_R1.trim.fastq.gz
Trimming/Col_1_R2.trim.fastq.gz
etc

- example/

Include small .fastq file in order to try the pipeline. To achieve this, check that the Experience/ folder is empty and move the examples files like this:

mv example/* Experience/

rules

/ Include all rules in .smk files to indicate to Snakemake which files to generate.

scripts/

Additional python and R scripts (for exemple, DEG.R is necessary to run DESeq2)

Snakefile

This file is read by Snakemake to run the pipeline, "Snakefile" is the default name, to launch the pipeline, you have to be in the parent folder (RNA_Seq_Pipeline/ ), and execute the command line:

snakemake

If for some reasons the name of the snakefile is no longer "Snakefile", you have to specify the snakefile name as follow:

snakemake --snakefile Snakefile_Name

Code Snippets

shell: ' STAR --runThreadN {threads} --runMode genomeGenerate --genomeSAindexNbases 4 --sjdbOverhang {params.read_length} \
--genomeDir {input.starref} --genomeFastaFiles {input.genome} --sjdbGTFfile {input.gtf}'

SnakeMake STAR From line 47 of rules/1st_pass.smk

shell:' STAR --runThreadN {threads} --genomeDir {params.direct} \
	--outFileNamePrefix pass1/{wildcards.sample} --readFilesIn {input.r1} {input.r2} \
	--outSAMtype BAM Unsorted --readFilesCommand "gunzip -c"'

SnakeMake STAR From line 69 of rules/1st_pass.smk

shell:' STAR --runThreadN {threads} --genomeDir {params.direct} \
	--outFileNamePrefix pass1/{wildcards.sample} --readFilesIn {input.r} \
	--readFilesCommand "gunzip -c" \
	--outSAMtype BAM Unsorted'

SnakeMake STAR From line 89 of rules/1st_pass.smk

shell:''' 
gawk '$6==1 || ($6==0 && $7>2)' {input} >> {output};

'''

SnakeMake From line 104 of rules/1st_pass.smk

shell: '''
awk 'BEGIN {{OFS="\t"; strChar[0]="."; strChar[1]="+"; strChar[2]="-";}} {{if($5>0){{print $1,$2,$3,strChar[$4]}}}}' \
 {input} > {output};
 mv {input} logs/1st_pass/
'''

SnakeMake From line 118 of rules/1st_pass.smk

shell:' STAR --runThreadN {threads} --runMode genomeGenerate --sjdbOverhang {params.read_length} \
--genomeDir {params.genomedir} --genomeFastaFiles {input.genome} --sjdbFileChrStartEnd {input.SJ}'

SnakeMake STAR From line 46 of rules/2sd_pass.smk

	shell:' STAR --runThreadN {threads} --chimSegmentMin 2 --outFilterMismatchNmax 3\
		--alignIntronMin 20 \
 		--alignIntronMax 1000000 \
 		--genomeDir {params.direct} \
 		--outFileNamePrefix pass2/{wildcards.sample} --readFilesIn {input.r1} {input.r2} \
 		--sjdbOverhang  {params.read_length} --readFilesCommand "gunzip -c" \
 		--outSAMtype BAM SortedByCoordinate; \
 		mv pass2/{wildcards.sample}Aligned.sortedByCoord.out.bam {output};'

SnakeMake STAR From line 70 of rules/2sd_pass.smk

	shell:' STAR --runThreadN {threads} --chimSegmentMin 2 --outFilterMismatchNmax 3\
 		--alignIntronMax 2999999 \
 		--genomeDir {params.direct} \
 		--outFileNamePrefix pass2/{wildcards.sample} --readFilesIn {input.r} \
 		--sjdbOverhang  {params.read_length} --readFilesCommand "gunzip -c" \
 		--outSAMtype BAM SortedByCoordinate; \
 		mv pass2/{wildcards.sample}Aligned.sortedByCoord.out.bam {output};'

SnakeMake STAR From line 96 of rules/2sd_pass.smk

run:
	with open(output.b1,'w') as file1:
		file1.write(",".join(params.file1))

	with open(output.b2,'w') as file2:
		file2.write(",".join(params.file2))

SnakeMake From line 121 of rules/2sd_pass.smk

shell: ' STAR --runThreadN {threads} --runMode genomeGenerate \
--genomeDir {input.starref} \
--genomeFastaFiles {input.genome} \
--sjdbGTFfile {input.gtf} \
--genomeSAindexNbases 11 \
>{log} 2>&1'

SnakeMake STAR From line 28 of rules/classic_mapping.smk

shell: ' STAR --runThreadN {threads} \
	--genomeDir {input.starref} \
	--outFileNamePrefix Mapping/{wildcards.sample} --readFilesIn {input.r1} {input.r2} \
	--readFilesCommand "gunzip -c" \
	--outSAMtype BAM Unsorted; \
	mv Mapping/{wildcards.sample}Aligned.out.bam {output};\
	mv Mapping/{wildcards.sample}*out* logs/Mapping'		

SnakeMake STAR From line 53 of rules/classic_mapping.smk

shell: ' STAR --runThreadN {threads} --genomeDir {input.starref} --readFilesCommand "gunzip -c" \
	--outFileNamePrefix Mapping/{wildcards.sample} --readFilesIn {input.r} --outSAMtype BAM Unsorted; \
	mv Mapping/{wildcards.sample}*.bam {output}; mv Mapping/*out* logs/Mapping '

SnakeMake STAR From line 78 of rules/classic_mapping.smk

shell: ''' samtools sort {input} -o {output} -@ {threads} '''

SnakeMake SAMtools From line 93 of rules/classic_mapping.smk

shell: ''' samtools index {input} > {output} '''

SnakeMake SAMtools From line 104 of rules/classic_mapping.smk

shell: ''' featureCounts -p -s 2 -T {threads} -t exon -g gene_id -a {input.gtf} -o {output} {input.mapping} '''

SnakeMake FeatureCounts From line 25 of rules/counts.smk

shell: ''' featureCounts -T {threads} -t exon -g gene_id -a {input.gtf} -o {output} {input.mapping} '''	

SnakeMake FeatureCounts From line 41 of rules/counts.smk

shell: ''' python3 scripts/RPKM.py featureCounts/counts.txt'''

SnakeMake From line 50 of rules/counts.smk

shell:
	"git clone https://github.com/torkian/subread-1.6.1.git; \
	mv subread-1.6.1/ scripts/"

SnakeMake From line 13 of rules/get_packages.smk

run:
	with open("experimentalDesign.txt","w") as xpDesign:
		xpDesign.write("batch,condition\n")

		for condition,samples in CONDITION_TO_SAMPLES.items():
			for sample in samples:
				xpDesign.write(sample+".sorted.bam,"+condition+"\n")

SnakeMake From line 55 of rules/get_ref_files.smk

shell: ''' 
	wget {params.get_genome}; mv {params.fasta_name} {output.fasta}
	wget {params.get_transcripto}; mv {params.transcripto_name} {output.transcripto}
	wget {params.get_gtf}; mv {params.gtf_name} reference.gff
	awk '{{ sub(/'ChrM'/,"mitochondria"); sub(/'ChrC'/,"chloroplast"); sub(/'Chr'/,"");print}}' reference.gff > reference_clean.gff
	rm reference.gff
	gffread reference_clean.gff -T -o {output.gtf}
	rm reference_clean.gff
	wget {params.get_description}
	mv {params.description_name} {output.description}
	'''

SnakeMake gffread From line 104 of rules/get_ref_files.smk

shell: 
	'git clone https://github.com/Darbinator/ChromatineStateEnrichment.git'

SnakeMake From line 15 of rules/init_CSE.smk

shell:
	'python3 ChromatineStateEnrichment/python/fisher_chrState.py -l {input.liste_deg} -o {params.dir} -p {params.padj}; \
	ln -s {params.dir_html} CSE_results/results.html'

SnakeMake From line 33 of rules/init_CSE.smk

shell: '''python3 scripts/generate_results.py {input.up} {output.up};
	python3 scripts/generate_results.py {input.down} {output.down} '''

SnakeMake From line 48 of rules/init_CSE.smk

	shell: '''
        Rscript scripts/DEG.R --padj={params.padj} \
            --lfc={params.lfc} \
            --xpdesign={input.xpdesign} \
            --description={params.description} \
            --outprefix={params.outprefix} \
            >{log} 2>&1
        '''

SnakeMake From line 33 of rules/run_DESeq2.smk

	shell: '''
        Rscript scripts/DTU.R --cutoff={params.cutoff} \
    		--padj={params.padj} \
    		--gtf={params.gtf} \
   			--cond_controle={params.cond_control} \
   			--cond_traitment={params.cond_traitment} \
    		--outprefix={params.outprefix} \
            >{log} 2>&1 \
        '''

SnakeMake From line 35 of rules/run_RATs.smk

shell:'''
	python2.7 {params.rmats_dir} --b1 {input.b1} --b2 {input.b2} --readLength {params.read_length} -t paired --gtf {input.gtf} --bi {input.starRef} --od {input.dir} --libType {params.strand} >{log} 2>&1''' 

SnakeMake From line 54 of rules/run_rMATS.smk

shell:'''
	python2.7 {params.rmats_dir} --b1 {input.b1} --b2 {input.b2} --readLength {params.read_length} -t single --gtf {input.gtf} --bi {input.starRef} --od {input.dir} --libType {params.strand} >{log} 2>&1''' 

SnakeMake From line 82 of rules/run_rMATS.smk

shell:
	" salmon index -p {threads} -t {input} -i {params.dir} >{log} 2>&1"

SnakeMake Salmon From line 33 of rules/salmon_pseudomapping.smk

shell:
	" salmon quant -l A --index {input.index} -1 {input.r1} -2 {input.r2} -o {params.outdir} --numBootstraps {params.boots} --writeMappings={params.outdir}/out.sam >>{log} 2>&1"

SnakeMake Quant Salmon From line 56 of rules/salmon_pseudomapping.smk

shell:
	" salmon quant -l A --index {input.index} -r {input.r} -o {params.outdir} --numBootstraps {params.boots} --writeMappings={params.outdir}/out.sam >>{log} 2>&1"

SnakeMake Quant Salmon From line 76 of rules/salmon_pseudomapping.smk

shell:
	"python3 scripts/rMATS_filt.py -p {params.padj} -c {params.psi} >{log} 2>&1"

SnakeMake From line 36 of rules/sign_DAS.smk

shell: ' cutadapt -l {params.cut_trim} -m {params.cut_trim} -o {output.r1} -p {output.r2} {input.r1} {input.r2} -j 8 --pair-filter=any >{log} 2>&1'

SnakeMake Cutadapt From line 40 of rules/trimming_AS.smk

shell: ' cutadapt -l {params.cut_trim} -m {params.cut_trim} -o {output} {input} -j 8 >{log} 2>&1'

SnakeMake Cutadapt From line 61 of rules/trimming_AS.smk

shell: ' bbduk.sh in1="{input.r1}" in2="{input.r2}" out1="{output.r1}" out2="{output.r2}" \
	ref="{input.adapters}" minlen=25 ktrim=r k=22 qtrim=rl trimq=20 hdist=1 tpe tbo ziplevel=7 >{log} 2>&1'

SnakeMake BBMap From line 27 of rules/trimming.smk

shell: ' bbduk.sh in="{input.r}" out="{output.r}" \
	ref="{input.adapters}" minlen=25 ktrim=r k=22 qtrim=rl trimq=20 hdist=1 tpe tbo ziplevel=7 >{log} 2>&1 '

SnakeMake BBMap From line 49 of rules/trimming.smk

library(docopt)

"Analyse des gènes différentiellement exprimés (DESeq2)
Usage:
  DEG.R --lfc=<lfc> --padj=<padj> --xpdesign=<xpdesign> --description=<description> --outprefix=<outprefix> 
  Options:
    --lfc=<lfc>                      Log 2 Fold change à appliquer entre les deux contions
    --padj=<padj>                    Seuil de pvalue ajustée à appliquer 
    --xpdesign=<xpdesign>            Design expérimental de l'expérience
    --description=<description>      Fichier d'annoation des gènes
    --outprefix=<outprefix>          Dossier avec les résultats
" -> doc

opts <- docopt(doc)


lfc <- as.numeric(opts[['lfc']])  
padj <- as.numeric(opts[['padj']])
xpdesign <- opts[['xpdesign']]
outprefix <- opts[['outprefix']]
description <- opts[['description']]

library(DESeq2)
library(tidyverse)
library(biomaRt)
library(ggplot2)

# On lit le fichier design expérimental, qui a était généré avec Snakemake
xpdesign = read.csv(xpdesign, row.names=1, sep=",")

# On enlève l'extension du nom des fichiers
rownames(xpdesign) <- gsub(".sorted.bam","",rownames(xpdesign))

RPKM=read.csv("DEG/RPKM.txt",row.names=1,sep="\t")

colnames(RPKM) <- gsub("Mapping.", "", colnames(RPKM))
colnames(RPKM) <- gsub(".sorted.bam","",colnames(RPKM))

RPKM <- RPKM[,1:ncol(RPKM)-1]


RPKM$mean <- rowSums(RPKM)/length(colnames(RPKM))

RPKM$ID <- rownames(RPKM)

RPKM$ExpressionLvl <- ifelse(RPKM$mean <1, "very_low", ifelse(RPKM$mean < 10, "low", ifelse(RPKM$mean < 100, "medium","high")))


# On lit la matrice de comptage des reads par gène généré par featureCounts
featurescounts=read.csv("featureCounts/counts.txt", sep="", head=T, skip=1, row.names = "Geneid")


#Formatage de la matrice de comptage généré par featureCounts, pour que celle-ci
# puisse être lu par DESeq2 

#Ici on retire les 5 premières colonnes inutiles pour ne garder que les colonnes associées aux nombre 
# de reads par compté pour chaque échantillon
featureMatrix <- featurescounts[ ,6:ncol(featurescounts)]

# Formatage du nom des échantillon 
colnames(featureMatrix) <- gsub("Mapping.", "", colnames(featureMatrix))
colnames(featureMatrix) <- gsub(".sorted.bam","",colnames(featureMatrix))


featureMatrix <- as.matrix(featureMatrix)

# On récupère les conditions expérimentales 
conditionTest <- factor(xpdesign$condition)

# Pour chaque gène, on récupère la valeur médiane du nombre de reads entre les 3 réplicats
# Par exemple, à partir de la matrice de featureCounts, pour le gène AT1G01010:

#           Col_1 Col_2 Col_3 hon4_1 hon4_2 hon4_3
# AT1G01010   201   104   944     95    138    173

# La médiane calculé sera :

#             Col   hon4
#AT1G01010    201    138
MedianeParCondition = t(apply(featureMatrix, 1, tapply, 
                              conditionTest, median)) 

# Ici, on récupère la valeur médiane maximal entre les deux conditions pour chaque gène
# Exemple pour AT1GO1O1O1: 201 (car 201>138)

maxMedian=apply(MedianeParCondition, 1, max) 

# Ainsi, pour chaque gène, si cette valeur médiane maximal est inférieur à 10, on supprime le gène de la matrice de comptage
# Ceci permet d'éliminer les gènes avec très peu de reads, ces gènes 
# pouvant être à tord détectés comme différentiellement exprimés
#
featureMatrix = featureMatrix[maxMedian >= 10,]



# On créé un tableau qui associe pour chaque échantillon le nom de la condition associé (colonne conditionTest)
# Ceci permet de specifier le design expérimental à DESeq2
# Exemple:
#        conditionTest
#Col_2            Col
#Col_1            Col
#Col_3            Col
#hon4_3          hon4
#hon4_1          hon4
#hon4_2          hon4

(coldata <- data.frame(row.names=colnames(featureMatrix[,rownames(xpdesign)]), conditionTest))



# dds permet de transformer la matrice que nous venons de formater en un object DESeq
# ATTENTION à ce que les colonnes de featureMAtrix soit dans le même ordre que le nom
# des lignes (et donc des échantillons) du fichier de design expérimental
# pour s'assurer de ceci, on formate la matrice comme ceci: featureMatrix[,rownames(xpdesign)]

dds <- DESeqDataSetFromMatrix(countData=featureMatrix[,rownames(xpdesign)], colData=coldata, design= ~ conditionTest)

# Cette fonction permet de normaliser le nombre de reads compté selon la taille de la librairie
# de séuqneçage pour chaque échantillon, selon le nombre de total de reads compté
# Ceci permet de normaliser le nombre de reads compté par gène entre les échantillons

dds <- estimateSizeFactors(dds)

dds = estimateDispersions( dds)
dds <- DESeq(dds)

# On transorme le nombre de reads comptés par le log de ce nombre, afin de minimiser les grands écarts
# pouvant être observés entre les gènes avec peu de reads comptés, et les gènes avec beaucoup de reads comptés
rld <- rlog(dds, blind = FALSE)

# On génère une ACP sur les 1000 gènes les + exprimés 
jpeg('DEG/pca.jpg')
plotPCA(rld, intgroup="conditionTest",ntop = 1000)
dev.off()
# On récupère les résultats obtenus par DESeq2
res <- results(dds)

jpeg('DEG/MA.jpg')
plotMA(res, ylim=c(-5,5))
dev.off()


## Gènes induits lorsque le gène testé est muté
# Ici on récupère les gènes différentiellement surexprimés dans la condition test,
# ceci selon un seuil de p-value ajustée définie (exemple padj<0.1) et selon un log2FoldChange définit
# (exemple log2FoldCHange de 1, soit un FoldChange de 2)
genes_up = res[ which(res$padj < padj & res$log2FoldChange > lfc), ]


# On fait de même pour les gènes sous-exprimés
genes_down = res[ which(res$padj < padj & res$log2FoldChange < -lfc), ]

genes_up <- genes_up[order(genes_up$padj, decreasing = F),]

genes_down <- genes_down[order(genes_down$padj, decreasing = F),]

genes_down <- as.data.frame(genes_down)
genes_up <- as.data.frame(genes_up)

# On va récupérer les identifiers TAIR pour chaque gène et les ajouter dans une colonne "ID"
# Ceci sera nécessaire plus tard pour récuperer les Gene Symbols 
identifiants_genes_up <- rownames(genes_up)
identifiants_genes_down <- rownames(genes_down)
identifiants <- c(identifiants_genes_down,identifiants_genes_up)

genes_up$ID <- identifiants_genes_up
genes_down$ID <- identifiants_genes_down

## Ensemble des gènes différentiellement exprimés.
genes_signif = rbind(genes_up, genes_down)

## Le fichier gene_description a été récupéré sur TAIR10 et nous permet d'accéder aux annotations
# des gènes, à partir de leur identifiants.

# Ici on récupère le fichier de description des gènes , récupéré sur le site de TAIR
gene_description <- read.delim(description,
                               header=FALSE, quote="")

# On formate le nom des identifiants au sein du fichier
genenames = gsub("[.][1234567890]", "",
                 gene_description[,1])

gene_description[,1]=genenames

## Ici on récupère la description des gènes différentiellement exprimés, en regardant les identifants
# des gènes qui "matchent" entre les deux fichiers
genes_match_rows <- match(rownames(genes_signif), gene_description[,1])



# On fait ceci pour les gènes up
genes_match_up <- match(rownames(genes_up), gene_description[,1])

Code_to_description <- gene_description[genes_match_rows,c(1,3)]

Code_to_description_up <- gene_description[genes_match_up,c(1,3)]

colnames(Code_to_description) <- c("Code","Description")

colnames(Code_to_description_up) <- c("Code","Description")

# Et les gènes downs
genes_match_down <- match(rownames(genes_down), gene_description[,1])

Code_to_description_down <- gene_description[genes_match_down,c(1,3)]

colnames(Code_to_description_down) <- c("Code","Description")




# On ajoute la description des gènes dans la variable gene_up et gene_dpwn
genes_up$Description <- Code_to_description_up$Description

genes_down$Description <- Code_to_description_down$Description


# Ici on utilise la librairie bioMart qui permet de traduire et donc de récupérer les Gene Symbols,
# à partir des identifiants TAIR


ensembl <- useMart(biomart="plants_mart",host="plants.ensembl.org",dataset = "athaliana_eg_gene")


TAIRID <- genes_down$ID

# Ici, on déifinit un tableau symbol qui associe chaque TAIR ID avec le GeneSymbol associé, exemple:

#       tair_symbol tair_locus
#1        AT-AER  AT5G16970
#2                AT4G32100
#3                AT2G43120
#4                AT1G30814
#5         PUB29  AT3G18710
#6         APUM6  AT4G25880
symbol <- getBM(attributes=c("tair_symbol","tair_locus"), mart=ensembl)

# On ajoute dans les variables genes_up et genes_up les genesSymbols associés
# Puis on garde uniquement les colonnes d'interet 
# Enfin, on trie le tableau par padj décroissante:
# Exemple

genes_up <- (merge(x=symbol,y=genes_up,by.x="tair_locus",by.y="ID",all.y=TRUE))
genes_up <- genes_up[c(1,2,4,8,9)]
genes_up <- genes_up[order(genes_up$padj,decreasing = F),]

#     tair_locus tair_symbol log2FoldChange     padj
#25   AT1G36180        ACC2       2.438645 1.252854e-39
#47   AT2G02120      PDF2.1       2.082633 8.393737e-19
#170  AT5G25980        TGG2       1.589085 1.758894e-16
#150  AT4G29190                   1.349187 5.826872e-16

genes_down <- (merge(x=symbol,y=genes_down,by.x="tair_locus",by.y="ID",all.y=TRUE))
genes_down <- genes_down[c(1,2,4,8,9)]
genes_down <- genes_down[order(genes_down$padj,decreasing = F),]

# Pour avoir l'nesemble des résultats et des RPKM

all_results_genes             <- as.data.frame(res)

all_results_genes$ID          <- rownames(all_results_genes)

genes_match                   <-  match(rownames(all_results_genes), gene_description[,1])

Code_to_description           <- gene_description[genes_match,c(1,3)]

all_results_genes$Description <- Code_to_description$Description

all_results_genes             <- (merge(x=symbol,y=all_results_genes,by.x="tair_locus",by.y="ID",all.y=TRUE))

all_results_genes             <- all_results_genes[,c(1,2,4,8)]

all_results_genes             <- (merge(x=RPKM,y=all_results_genes,by.x="ID",by.y="tair_locus",all.y=TRUE))

all_results_genes$DEG         <- ifelse(all_results_genes$padj< padj & all_results_genes$log2FoldChange > lfc,"over",ifelse(all_results_genes$padj < padj & all_results_genes$log2FoldChange < lfc,"under","not_deg"))



# Enfin on génère le fichiers de sortie des gènes up et down

write_tsv(as.data.frame(genes_up),"DEG/genes_up.txt")
write_tsv(as.data.frame(genes_down),"DEG/genes_down.txt")
write_tsv(as.data.frame(all_results_genes),"DEG/all_results_genes.txt")
write_tsv(as.data.frame(identifiants),"DEG/tair_ids.txt")

R ggplot2 tidyverse DESeq2 FeatureCounts PCAtools biomaRt docopt From line 1 of scripts/DEG.R

library(docopt)
#if (!require("devtools")) install.packages("devtools", repos="http://cran.us.r-project.org")
if (!require("rats")) devtools::install_github("bartongroup/rats", ref="master")
library(rats)

library(tidyverse)


"Analyse des gènes différentiellement transcrits (RATs)
Usage:
  DTU.R --cutoff=cutoff --padj=<padj> --gtf=<gtf> --cond_controle=<cond_controle> --cond_traitment=<cond_traitment> --outprefix=<outprefix> 
  Options:
    --cutoff=cutoff                  Log 2 Fold change à appliquer entre les deux contions
    --padj=<padj>                    Seuil de pvalue ajustée à appliquer 
    --gtf=<gtf>                      GTF file
    --cond_controle=<cond_controle>  Condition controle
    --cond_traitment=<cond_traitment> Condition traitement
    --outprefix=<outprefix>          Dossier avec les résultats
" -> doc

opts <- docopt(doc)

cutoff <- opts[['cutoff']]
cutoff <- as.numeric(cutoff)  
padj <- opts[['padj']]
padj <- as.numeric(padj)
outprefix <- opts[['outprefix']]
gtf <- opts[['gtf']]
cond_controle <- opts[['cond_controle']]
cond_traitment <- opts[['cond_traitment']]

ident <- annot2ids(gtf)

# Dossiers parents des quantifications, des echantillons controles et parents
samples_A <- list.files(path='Salmon/quants', pattern=cond_controle, full.names = TRUE)
samples_B <- list.files(path='Salmon/quants', pattern=cond_traitment, full.names = TRUE)

samples_A <- as.vector(samples_A)
samples_B <- as.vector(samples_B)

# 2. Calculate length- & library-normalised abundances. 
#    Scale them to 1M reads for TPM values.
mydata_2 <- fish4rodents(A_paths= samples_A, B_paths= samples_B, 
                         annot=ident, scaleto=100000000)

mydtu_2 <- call_DTU(annot= ident, boot_data_A= mydata_2$boot_data_A, 
                    boot_data_B= mydata_2$boot_data_B, verbose= FALSE,abund_thresh = 5, p_thresh = padj,dprop_thresh = cutoff, threads = 2)


# 4. Plot significance VS effect size:
#plot_overview(mydtu_2)


# 5a. Get all gene and transcript identifiers per category 
# (significant DTU, no DTU, Not Applicable):
myids_2 <- get_dtu_ids(mydtu_2)

#dtu_summary(mydtu_2)
# 5b. Get all gene and transcript identifiers implicated in isoform switching:


DTU_genes <- myids_2$`DTU genes (gene test)`
write_tsv(as.data.frame(DTU_genes),"DTU/DTU.txt")

#dir = setwd("~/Desktop/Data/FLUX-SIMU/AS_genes_detected/RATs/")
write_tsv(mydtu_2$Abundances[[1]], "DTU/abondance_control.txt")
write_tsv(mydtu_2$Abundances[[2]], "DTU/abondance_traitment.txt")

R tidyverse docopt devtools From line 2 of scripts/DTU.R

import sys
import os
import re
import subprocess
import glob
from collections import defaultdict
from decimal import *

file1 = open(sys.argv[1],"r")

file2 = open("DEG/RPKM.txt","r")

file3 = open("CSE_results/genes_to_states.txt",'r')

up = file1.readlines()

RPKM = file2.readlines()

STATES = file3.readlines()

RPKM_to_join = {}

for ligne in RPKM:

	RPKM_to_join[ligne.split()[0]] = ligne.split()[-1]

formate_states = {}

for ligne in STATES[1:]:

	formate_states[ligne.split('\t')[0].rstrip()] = ligne.split('\t')[1].rstrip()

all_states = []

for gene,states in formate_states.items():

	for state in states.split():

		if int(state) not in all_states:

			all_states.append(int(state))

			all_states.sort()

new_file = []

header= ["tair_locus","tair_symbol","log2FoldChange","padj","Description","mean_FPKM"]
for state  in all_states:
	header.append(str(state))

new_file.append(header)

for ligne in up[1:]:

	if ligne.split()[0].rstrip() in RPKM_to_join:

		if ligne.split('\t')[4] == "\n":

			full_RPKM = [ligne.split('\t')[0],ligne.split('\t')[1],ligne.split('\t')[2],ligne.split('\t')[3],"NA"]

			full_RPKM.append(RPKM_to_join[ligne.split('\t')[0].rstrip()])
		else:

			full_RPKM = ligne.rstrip().split('\t')

			full_RPKM.append(RPKM_to_join[ligne.split('\t')[0].rstrip()])

		if ligne.split()[0].rstrip() in formate_states:

			states_gene = [str(x) for x in formate_states[ligne.split()[0].rstrip()].split()]

			for state in header[6:]:
				if state in states_gene:
					full_RPKM.append('V')
				else:
					full_RPKM.append('X')

			# for state in formate_states[ligne.split()[0].rstrip()].split():
			# 	print(state)

		new_file.append(full_RPKM)

write_file = open(sys.argv[2],"w")

for ligne in new_file:
	write_file.write("\t".join(ligne))
	write_file.write("\n")

Python From line 10 of scripts/generate_results.py

import sys
import os
import argparse
import glob
### 
# Script qui prend en entrée un fichier de sortie de rMATS pour un évenèment de splicing donné, et ressort
# les évènement différentiellement exprimés entre les deux conditions.
# 
# On filtre d'abord le fichier pour éliminés les évè!nements donc le nombre moyen de reads associés entre les échantillons
# est inférieur à 5
#
# Esnuite, on garde les évènements dont la valeur de FDR (False Discovery Rate) est inférieur à 0.05, 
# et dont la valeur absolue du IncLevelDifference est supérieur à 0.1
#
# Enfin, on sélectionne uniquement les colonnes intéressantes à garder.

parser = argparse.ArgumentParser(description='filter signficatives splice events')

parser.add_argument("-p","--padj",action='store',
	help="padj cutoff\n")

parser.add_argument("-c","--psi", action='store', 
	help="PSI cutoff")

args = parser.parse_args()

PSI = float(args.psi)
padj = args.padj


columnsToKeep = [0,1,3,4,5,6,7,8,9,10,12,13,14,15,19,22]

EVENTS = glob.glob("DAS/rMATS_output/*MATS.JCEC.txt")

for event in EVENTS:

	rMATs_file = open(event,"r")

	file_name = os.path.basename(event).rstrip(".txt")

	with open("DAS/topSplicingEvents/"+file_name+".top.txt","w") as rMATSs_top_file:

		for ligne in rMATs_file.readlines():

			if ligne.startswith("ID"):

				for i in columnsToKeep:

					rMATSs_top_file.write(ligne.split()[i]+"\t")
				rMATSs_top_file.write("\n")



			else:
				# On compte le nombre total de reads dans les 2 conditions (ligne.split()[12,13] pour accéeder aux nombre de reads des)
				# échantillons de la condition 1, et ligne.split()[14,15] pour accéder aux reads de la condition 2
				nbTotalReads = sum(list(map(int, ligne.split()[12].split(","))))+sum(list(map(int, ligne.split()[14].split(","))))+sum(list(map(int, ligne.split()[13].split(","))))+sum(list(map(int, ligne.split()[15].split(","))))

				# On compte le nombre d'échantillon , par exemple dans la colonne 12, si on a 3 échantillons on aura cette valeur la:
				# 12,14,8
				# Ainsi, en splittant la ligne par (",") et en comptant le nombre d'éléments obtenus, on a accès au nombre d'échantillons
				# On somme pour les 2 colonnes associés aux 2 conditiosns.
				nbSamples = len(ligne.split()[12].split(","))+len(ligne.split()[14].split(","))

				avgNbReads = nbTotalReads / nbSamples


				try:
					if avgNbReads > 10  and abs(float(ligne.split()[22])) > PSI :

						print(abs(float(ligne.split()[22])))
						for i in columnsToKeep:

							rMATSs_top_file.write(ligne.split()[i]+"\t")

						rMATSs_top_file.write("\n")
				except:
					break

	rMATs_file.close()


if len(os.listdir("DAS/topSplicingEvents")) == 5:

	liste_DAS = []

	for file in os.listdir("DAS/topSplicingEvents"):

		with open("DAS/topSplicingEvents/"+file,"r") as file:

			for ligne in file.readlines():

				if ligne.split()[1].startswith("\""):

					liste_DAS.append(ligne.split()[1].strip("\""))

	liste_unique_DAS = list(set(liste_DAS))


	with open("DAS/DAS.txt","w") as DAS:
		for gene in liste_unique_DAS:
			DAS.write(gene+"\n")

Python From line 3 of scripts/rMATS_filt.py

import sys
import os
import re
import subprocess
import glob
from collections import defaultdict
from decimal import *

# le premier argument a donné au script est la matrice de comptage de reads par gène par featureCounts
featureCounts = open(sys.argv[1],"r")

# On récupère chaque ligne du fichier sous forme d'une liste
lignes = featureCounts.readlines()

# On initialise un dictionnaire qui récupère chaque échantillon (le nom des échantillons sont retrouvés au sein de la 2ème ligne du fichier 
# featureCounts, accessibles de cette manière :  lignes[1] , à partir de la 7ème colonne : .rstrip().split()[6:] )

sample_to_total_reads = defaultdict(int)

[sample_to_total_reads[sample] for sample in range(len(lignes[1].rstrip().split()[6:]))]


# Pour calculer le nombre total de reads comptés dans chaque alignement (à partir de la 3ème ligne du fichier : lignes[2:])
# A chaque ligne le compteur "sample" se déplace pour récuperer le nombre de reads
# pour un gène, et implémente la valeur dans le dictionnaire sample_to_total_reads

for ligne in lignes[2:]:

	for sample in range(len(ligne.rstrip().split()[6:])):

		sample_to_total_reads[sample] += int(ligne.rstrip().split()[6:][sample])


# Ouverture du fichier qui va générer la matrice avec les valeurs de RPKM

with open("DEG/RPKM.txt", "w") as RPKM:
	RPKM.write('gene')
	# Header du fichier avec le nom des échantillons
	# On accède au nom des échantillons dans la 2ème ligne du fichier généré par featureCounts (lignes[1])
	RPKM.write("\t")
	for sample in lignes[1].rstrip().split()[6:]:
		RPKM.write(sample+"\t")

	RPKM.write('Mean_FPKM'+"\t")
	RPKM.write("\n")



	# Pour chaque gène (à partir de lignes[2:]), on calcule le RPKM pour chaque échantillon et on ajoute la valeur dans le fichier et dans la bonne colonne
	#
	for ligne in lignes[2:]:
		RPKM.write(ligne.rstrip().split()[0]+"\t")

		tot_RPKM = 0
		for sample in range(len(ligne.rstrip().split()[6:])):

			RPKM_value = int(ligne.rstrip().split()[6:][sample]) / ((int(ligne.rstrip().split()[5]) / 1000) * (int(sample_to_total_reads[sample])) / 1e6 )

			RPKM_value = round(RPKM_value,2)

			tot_RPKM+= round(RPKM_value,2)



			#        RPKM = nombre de reads compté pour un échantillon  /  longueur du gène (6ème colonne du fichier) / 1000    *   nombre total de reads de l'échantillon / 1e6
			RPKM.write(str(RPKM_value)+"\t")

		RPKM.write(str(tot_RPKM/len(ligne.rstrip().split()[6:]))+"\t")
		tot_RPKM = 0

		RPKM.write("\n")