A Snakemake workflow that takes fastq files as input and creates a mitochondrial genome assembly. The workflow is built in Snakemake and integrates different bioinformatics tools such as Porechop ABI, Prowler and SACRA, among others, together with many Python scripts that gather output and generate an HTML report to visualize the results of each step in the pipeline.
The main purpose is to detect chimeric reads in a dataset and split those sequences into non-chimeric sequences. The corrected sequences are then used to generate a mitochondrial genome assembly.
WGA - Chimera reads
Multiple displacement amplification (MDA) is a whole genome amplification method used to rapidly amplify DNA samples. It is very efficient at increasing small amounts of DNA with high genome coverage, thanks to its strand displacement activity and low error rate. The process starts by annealing random hexamer primers to the ssDNA template, so synthesis starts at multiple sites on the template DNA. Chain elongation is mediated by the phi 29 polymerase, which has high proofreading and strong strand displacement activity. When the phi 29 polymerase encounters a downstream primer, its strong strand displacement activity causes the downstream strand to be gradually displaced from its 5'-end. Chain elongation continues as multiple rounds of hexamer primers and polymerases are added to the newly generated ssDNA strands. The exponential amplification of DNA produces a branched structure, generating clusters of DNA molecules.
While MDA creates this branched structure, the displaced ssDNA strands compete with the newly generated template. When a displaced strand re-attaches to the template, the newly generated strand falls off at its 3'-end. The extended 3'-end then binds to other secondary structures, creating chimeric reads. Chimeras are multiple stretches of DNA sequence joined together, also called split reads.
When the phi 29 polymerase of a newly generated strand attaches to another displaced ssDNA strand, chain elongation continues along the new template, creating a ssDNA molecule consisting of two or more regions that do not belong together; these are inverted chimeras. It is also possible that the phi 29 polymerase binds the original template in a region with a similar, but not identical, base sequence. Elongation then skips from the displaced 3'-end of the ssDNA to the newly annealed position of the phi 29 polymerase.
Installation
Start by cloning the repository to your local system. You can use SSH, but then you need a key pair on your local computer and have to add the public key to your account.
git clone git@github.com:GuillaumeLeBerreBIT/ChimereNanoporePipeline.git
Alternatively, use the HTTPS link.
git clone https://github.com/GuillaumeLeBerreBIT/ChimereNanoporePipeline.git
Snakemake requires a conda installation on your system. The preferred conda distribution is Mambaforge, since it provides the required Python commands as well as mamba, which installs packages very quickly. Miniconda can be used as well:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
When the Miniconda download is complete, you can start the local installation. Where you install it on your system is completely up to you.
bash Miniconda3-latest-Linux-x86_64.sh
Snakemake will use conda to create the environments defined in the provided YAML files. The configuration of the .condarc file is very important for Snakemake!
conda config --add channels conda-forge
conda config --add channels bioconda
The resulting .condarc file looks like this.
auto_activate_base: false
channels:
- bioconda
- conda-forge
- defaults
WATCH OUT! If channel_priority is set to strict, the conda installations will fail. You can delete that line from the .condarc file.
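Alternatively, instead of editing .condarc by hand, the priority can be relaxed from the command line; this is a standard conda command suggested here as a convenience, not part of the original instructions.
conda config --set channel_priority flexible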
Snakemake can be installed using conda. Mamba can be used as well if it is installed.
conda create -c conda-forge -c bioconda -n snakemake snakemake=7.25.0
You can check whether Snakemake is installed; the version should be 7.25.0 if it was installed with the previous command.
conda activate snakemake
snakemake --help
snakemake --version
The pipeline uses multiple scripts; one of them, SACRA.sh, calls several other scripts. You need to add the path to the scripts folder to your .bashrc file.
nano ~/.bashrc
Add the path to the scripts folder to the PATH variable. The path depends on where the repository was cloned/placed. The example below assumes the folder is placed in the home directory.
export PATH=$PATH:~/ChimereNanoporePipeline/workflow/scripts/
Source the file or restart the terminal to apply the new PATH variable.
source ~/.bashrc
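As a quick sanity check (not part of the original instructions), you can verify that the shell now finds the SACRA wrapper script:
which SACRA.sh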
Repository structure
An overview of how the repository is set up: the results of the pipeline are collected in the results folder, which contains subfolders named after the identifier. The config file used by Snakemake is in the config folder and contains all the parameters used in the Snakefile. The Snakefile itself is in the workflow folder, which also collects the data generated by the different rules of the pipeline. The workflow folder additionally holds the scripts and envs folders used by the rules of the Snakefile.
├── Cmaenas_minion_data
├── config
├── resources
│ ├── DIAMOND-DB
│ ├── DIAMOND-Genes
│ ├── MITOS2-DB
│ ├── Styles
│ └── software
├── results
│ └── fastq_runid_211
├── tuning-params
│ ├── config
│ ├── input-file
│ └── scripts
└── workflow
    ├── AssemblyFasta
    ├── Diamond
    ├── FlyeResults
    ├── MITOS2Results
    ├── PorechopABI
    ├── ProwlerProcessed
    ├── SACRAResults
    ├── envs
    ├── scripts
    └── snakefile-templates
Usage
To run the pipeline, go to the workflow folder. From there you can use the following command (not recommended). The first time the pipeline runs it needs to install all the programs it uses, which is done by installing all dependencies through conda from the YAML files; these conda installations can take a long time. NOTE! Activate the snakemake conda environment before running.
snakemake --use-conda
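If you prefer to build the (slow to install) conda environments ahead of time, Snakemake can create them without executing any rules. This is a standard Snakemake option suggested here as a convenience, not part of the original instructions.
snakemake --use-conda --conda-create-envs-only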
Depending on the system you are using, you will have a limited amount of resources. You can limit the number of threads used with the --jobs option. Using --jobs 8 lets Snakemake run up to 8 jobs in parallel; depending on your system, you can lower or raise this value.
snakemake --use-conda --jobs 8
Depending on the number of cores available, you may need to lower this value. It is very important to set the right number of threads/jobs. If --jobs is set to a value other than the one given here, you MUST also change the number of threads the Porechop ABI rule can use: when not enough threads are available for the Porechop ABI rule, or too many Porechop rules run in parallel, the pipeline crashes. The other programs are able to scale down their processes when the specified number of cores is not available.
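Before launching a full run it can also help to do a dry run, which lists the jobs Snakemake would execute with the chosen settings without running anything (standard Snakemake flag, added here as a suggestion):
snakemake --use-conda --jobs 8 -n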
The Snakefile uses config_snakemake.yaml, which contains all the parameters used by the pipeline. Before starting Snakemake, specify the input folder with the startfolder: key; this folder contains all the fastq files to send through the pipeline.
startfolder: "../Cmaenas_minion_data"
A unique identifier is used to create folder and file names, so use a different name to keep the results of different runs apart. WATCH OUT! Reusing a previous identifier while its files are still present may overwrite files or mix files from different experiments. The identifier can also serve as a species identifier, so that files of a previously analysed species can later be added to extend earlier results.
identifier: "fastq_runid_211"
Credits
Credits to the authors of the different tools that have been used to create this pipeline.
Bio-informatics tool for adapter removal: Porechop ABI
Bio-informatics tool for quality trimming: ProwlerTrimmer
Bio-informatics tool for chimera removal: SACRA Paper
Bio-informatics tool for BLASTX: DIAMOND Paper
Bio-informatics tool for assembly: Flye
Bio-informatics tool for assembly: SPAdes
Bio-informatics tool for gene annotation: MITOS2
Code Snippets
shell:
    """
    Bandage image {input} {output} --names --lengths --depth --fontsize 4
    """
shell:
    """
    python3 scripts/ConcatFiles.py {params.fileSacra} {output}
    """
shell:
    """
    for db in ATP6 ATP8 COX1 COX2 COX3 CYTB NAD1 NAD2 NAD3 NAD4 NAD4L NAD5 NAD6;
    do
        diamond blastx -d resources/DIAMOND-DB/"$db" -q {input} -k {params.k} -f {params.f} \
        -p 4 -o "Diamond/{params.folder}/{params.folder}Diamond_$db.csv"
    done
    """
shell:
    """
    python3 scripts/DiamondToAssembly.py -i {params.idper} -l {params.length} -e {params.evalue} Diamond/{params.folder} {params.sacraF} {output}
    """
shell:
    """
    python3 scripts/Filtering_SACRA_sequences.py -b {params.bases} {input} {output}
    """
shell:
    """
    python3 scripts/StatisticalReportGenerator.py {output} {params.poreStat} {params.poreFastq} {params.prowFold} {input.sacraF} {input.BandageAssem} {params.mitosFold}
    """
shell:
    """
    unset _JAVA_OPTIONS
    runmitos.py -i {input} -c {params.gencode} -r {params.MitRefSeq} -o {params.MitosFolder} --refdir {params.RefFolder}
    """
shell:
    """
    porechop_abi -abi --no_split -i {input} -o {output.reads} -t 8 > {output.statistics} 2>&1
    """
shell:
    """
    python3 scripts/TrimmerLarge.py -f {input.out_fastq} -i PorechopABI/{params.folder}/ -o ProwlerProcessed/{params.folder}/ \
    -m {params.trimmode} -c {params.clip} -g {params.fragments} -q {params.qscore} -w {params.windowsize} -l {params.minlen} \
    -d {params.datamax} -r '.fasta'
    """
shell:
    """
    scripts/SACRA.sh -i {input} -p {output} -t 4 -c ../config/MyConfig.yaml
    rm {params.blasttab}* {params.bck} {params.des} {params.prj} {params.sds} {params.ssp} {params.suf} {params.tis}
    """
shell:
    """
    spades.py --only-assembler -s {input} -o {params.SpadesFolder} -t 8
    """
import os, argparse, re

#####################################################################
# COMMAND LINE INPUT
#####################################################################
parser = argparse.ArgumentParser(description='Concatenate files')
parser.add_argument('inputFolFil', type=str,
                    help='Provide the folder containing all the fastq files to be concatenated (or file of that folder).')
parser.add_argument('outputFile', type=str,
                    help='Give a (path to and) name to call the outputfile.')
#parser.add_argument('targetNum', type=int,
#                    help='Give a target File.')
args = parser.parse_args()

#####################################################################
# FILE HANDLING
#####################################################################
# Check whether the path to the files is given OR a file from that folder
if os.path.isdir(args.inputFolFil):
    path_to_files = args.inputFolFil
else:
    # Splitting the file from the path
    splitted_path = os.path.split(args.inputFolFil)
    # Only use the path to move further
    path_to_files = splitted_path[0]

## Setting a flag to break/continue the loop
#flag = 0
## While the amount of total files is not present keep looping until all files are collected
#while flag == 0:
#    # Changes when the folder of SACRA results has all files
#    if args.targetNum != len(os.listdir(path_to_files)):
#        flag = 0
#        #print(len(os.listdir(path_to_files)))
#    else:
#        # Loop breaks
#        flag = 1

# Open a file to write everything in to
with open(args.outputFile, "w") as file_to_write:
    # Get every file from the directory
    for filename in os.listdir(path_to_files):
        # Extra failsafe to only keep the resulting fasta after SACRA. After changing the Snakefile locations of the resulting files this is not necessary anymore and can be deleted.
        if not re.search(".csv", filename) and\
           not re.search(".non_chimera.fasta", filename) and\
           not re.search(".split.fasta", filename):
            # Paste the path to the file together with the filename
            file_path = os.path.join(path_to_files, filename)
            # True == File
            if os.path.isfile(file_path):
                # Open the file
                with open(file_path, 'r') as file_to_read:
                    # By reading in the file, can write the output each time to the outputfile.
                    file_lines = file_to_read.read()
                    file_to_write.write(file_lines)
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 | import os, argparse, csv, re from Bio import SeqIO # pip install biopython import matplotlib.pyplot as plt import numpy as np ##################################################################### # COMMAND LINE INPUT ##################################################################### parser = argparse.ArgumentParser(description='From DIAMOND to Assembly') parser.add_argument('inputFolder', type=str, help='Give the (path to and) name of the folder containing the files after DIAMOND.') parser.add_argument('sacraFiltered', type=str, help='Give the Fasta file containing the filtered reads on a certain threshold.') parser.add_argument('outputAssembly', type=str, help='Give the (path to and) name for the fasta file with the output reads after filtering and DIAMOND mathces with DB.') parser.add_argument('-i', '--identity', type=int, default = 70, required = False, help ='Give a number of the percentage identity as the minimum treshold.') parser.add_argument('-l', '--len', type=int, default = 20, required = False, help ='Give a number of the min length of residues want to set as minimum treshold.') # Will later convert it to a float -- > YAML sets it as an string parser.add_argument('-e', '--evalue', type=str, default = 10e-6, required = False, help ='Give the minimum value to filter the sequences from DIAMOND.') args = parser.parse_args() ##################################################################### # FILE HANDLING ##################################################################### # VARIABLES & LISTS (set before loop or will reset on each new file) # Will create a list with all the headers validated by the filter parameters filtered_list = [] # Need a list with all the crab families to match on crab_genus = ["Carcinus", "Charybdis", "Thalamita"] # Testing for the Enoplea class >> Halalaimus enoplea = ["Eucoleus","Longidorus","Trichinella", "Trichuris"] # Testing for the Chromadorea class >> Acantholaimus & Desmoscolex chromadorea = ["Gnathostoma", "Rhigonema", "Strongyloides", "Camallanus", "Meloidogyne"] # Empty dictionary to save the amount of matches per gene gene_matches = {} # Empty dictionary for saving the headers sequence_count = {} # The pipeline starts from the ChimereNanoporePipeline/ # Adding the folder containing the files after Diamond # Do check if the user gave the slash with it at the end as well or not if args.inputFolder[-1] == '/': diamond_folder = args.inputFolder else: diamond_folder = args.inputFolder + "/" # Using the unique identifier as name splitted_folder = diamond_folder.split('/') identifier = splitted_folder[1] # Get each file from the list (13 files) and 
open each one of them to read its content. for file in os.listdir(diamond_folder): #Split the file name to get the gene name extracted >> Splitting will return a list splitted_gene = file.split("_") # Saving the gene name and then removing the .csv # Taking the last item from the list, since it can vary depending on name given. gene_name = splitted_gene[-1].replace(".csv","") # Want to for each gene a starting value of 0 if gene_name not in gene_matches: gene_matches[gene_name] = 0 else: continue # Link each file again with the full path to open it file_to_open = diamond_folder + file # Can open each csv file with open(file_to_open, "r") as csv_file: csv_reader = csv.reader(csv_file, delimiter='\t') # Each line from the csv file be returned as list # 0. qseqid query or source (gene) sequence id <-- # 1. sseqid subject or target (reference genome) sequence id <-- # 2. pident percentage of identical positions <-- # 3. length alignment length (sequence overlap) <-- # 4. mismatch number of mismatches # 5. gapopen number of gap openings # 6. qstart start of alignment in query # 7. qend end of alignment in query # 8. sstart start of alignment in subject # 9. send end of alignment in subject # 10. evalue expect value <-- # 11. bitscore bit score for row in csv_reader: """ print(f"Sequence ID: {row[0]}\ \nTarget Species_Gene: {row[1]}\ \nPercentage Identical positions: {row[2]}\ \nAlignement length: {row[3]}\ \nE-value: {row[10]}") """ ################# # FILTER ################# # First item of each row is the header name header = row[0] # Need to split the sseqid to match on the Crab genus names splitted_genus = row[1].split("_") genus_name = splitted_genus[0] gene = splitted_genus[2] # The float will function in the same way as the integers and can now compare numbers # Convert str -- > float/int # args.identity == Percentage Identity # args.len == Min length of residues # args.evalue == Max e-value a sequence can have as treshold if float(row[2]) >= args.identity and\ int(row[3]) >= args.len and\ float(row[10]) <= float(args.evalue) and\ genus_name in crab_genus: # Get all the headers that match the set filters # The headers that will be used to create FASTA file for assembly # Add a '>' since the fasta files start with a '>' filtered_list.append('>' + header) # For each matching gene name increase the value with 1 gene_matches[gene] += 1 # Dictionary for all headers, to see if reads are found in multiple genes or not if header not in sequence_count: sequence_count[header] = 1 else: sequence_count[header] += 1 # Need to know as well how many hits were not matched after the BLAST # Rearead the input file from DIAMIND and loop over the headers to check if they are in the dictionary or not. # If they are not in the dictionary set to a value of 0 for seq_record in SeqIO.parse(args.sacraFiltered, "fasta"): if seq_record.id not in sequence_count: sequence_count[seq_record.id] = 0 # Now need to reverse the logic and count the frequencys of dictionary per header. # Then can see how much a sequence has not been matched or has been matched after BLASTING against # the genes from the database. 
header_freq = {} for val in sequence_count.values(): if val not in header_freq: header_freq[val] = 0 else: header_freq[val] += 1 # Header list with all IDs #print(filtered_list) # Hits per gene #print(gene_matches) # The count of how many times a read seen per hit #print(sequence_count) # Can check now the freq of how much a header has been found in a dictionary #print(header_freq) ##################################################################### # WRITE TO FASTA ##################################################################### # HANDLING THE FILE # Opening a file to write to with open(args.outputAssembly, "w") as file_to_write: # Opening a file to read to with open(args.sacraFiltered, "r") as file_to_read: # This will create a list based on the newlines, containing strings reading_lines = file_to_read.readlines() # Setting a counter counter_2 = 0 counter_3 = 0 counter_4 = 0 # Setting a flag flag = 0 # Iterating over the lines in the list for line in reading_lines: # Check for header lines/ record IDs if re.search("^>", line): # Only want to retain the beginning of the line, this will be the only identical part of the line # Split on spaces splitted_header = line.split(" ") # Retain the ID:START-END == >@9030c343-3ff2-4b66-ab1b-c1e3327c3de0:1-188 iso_header = splitted_header[0] # If the header is in the list set the flag to 1 == to write if iso_header in filtered_list: counter_2 += 1 flag = 1 # If the header is not in the list then set flag to 0 == not write elif iso_header not in filtered_list: flag = 0 # This flag will allow ot print the lines to a file if flag == 1: file_to_write.writelines(line) counter_3 += 1 # When flag is 0 it will skip the lines. elif flag == 0: counter_4 += 1 continue # Closing the files for good practice file_to_read.close() file_to_write.close() # To get information out the parsed reads. """ print(f"File handled for checking matching lines (manually): {counter_2}\ \nLines printed {counter_3}\ \nLines skipped {counter_4}") """ ##################################################################### # VISUALIZATION ##################################################################### # Have a predefined width so the x-axis is more spread out plt.figure(figsize=(10,5)) # PLOT HITS PER GENE bar_colors = ['tab:red','tab:blue','tab:brown','tab:orange','tab:green','tab:purple','tab:cyan','tab:olive','tab:pink','yellow', 'forestgreen','darkred','khaki'] # The x-axis values from the dictionary plotted per tick plt.bar(range(len(gene_matches)), gene_matches.values(), align='center', color=bar_colors) # The x-axis labels from the dictionary plotted plt.xticks(range(len(gene_matches)), gene_matches.keys()) # x-axis label plt.xlabel('Genes') # y-axis label plt.ylabel('No. 
of hits') # Title for the plot plt.title('DIAMOND BLAST Results') plt.savefig(f"../results/{identifier}/{identifier}Bar-HitsPerGene-DIAMOND&Filtering.png", dpi=200, bbox_inches='tight') # Close the plot plt.clf() # REPORT HOW MANY TIMES HEADER BEEN MATCHED with open(f"../results/{identifier}/{identifier}HeaderCountDIAMOND.txt","w") as tmp_to_write: for key, val in header_freq.items(): tmp_to_write.writelines(f"For the headers found {key} time(s) after DIAMOND: {val} sequences.\n") # HISTOGRAM LEN OF THE READS assembly_seq_len = [] for seq_record in SeqIO.parse(args.outputAssembly, "fasta"): assembly_seq_len.append(len(seq_record)) # Convert the read length lists to numpy arrays for plotting assembly_array = np.array(assembly_seq_len) # Set a predefined picture size to save in format plt.figure(figsize=(7,5)) # Plot a histogram of sequence lengths after DIAMOND & Filtering BLAST matches # Setting amount of bins & range of the graph. plt.hist(assembly_array, bins = 40, range = [min(assembly_array), 1000]) # Setting title, x and y labels. plt.title('Length of sequences after DIAMOND & Filtering') plt.xlabel('Sequence length') plt.ylabel('Frequency') # Determining to show the interval of x-axis ticks. plt.xticks(np.arange(0, 1000, 100)) plt.savefig(f"../results/{identifier}/{identifier}Hist-SequenceLengthAfterDIAMOND&Filtering.png", dpi=200, bbox_inches='tight') # Savefig does not close the plot. plt.clf() |
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 | import argparse from Bio import SeqIO # pip install biopython ####################################### # COMMAND LINE INPUT ####################################### parser = argparse.ArgumentParser(description='Filter length sequences') parser.add_argument('inputFile', type=str, help='Give the input fasta file to filter the legth of the reads on. Can parse the path of file with it.') parser.add_argument('outputFile', type=str, help='Give an output fasta file name to write to. Can parse the path of file with it.') parser.add_argument('-b', '--bases', type=int, default = 50, required = False, help ='Give a number of bases want to have as minimum treshold to filter on.') args = parser.parse_args() ############################## BIOPYTHON ############################## # Parse the input file want to filter on input_seq_iterator = SeqIO.parse(args.inputFile, "fasta") # Using a generator expression, more memory efficient then using list comprehension creating list of records. treshold_seq_iterator = (record for record in input_seq_iterator if len(record.seq) >= args.bases) # Writes the wanetd sequences to the outputfile SeqIO.write(treshold_seq_iterator, args.outputFile, "fasta") """ ####################################### # FILE HANDLING ####################################### # GATHERING ALL IDS ABOVE X # OF NT list_wanted_records = [] # Setting a counter #counter_1 = 0 # Readigng in the input file for seq_record in SeqIO.parse(args.inputFile, "fasta"): # Filter on the length of the amount of bases smaller then X, default 50 if len(seq_record) < args.bases: continue # Filter on the length of the amount of bases bigger then X, default 50 elif len(seq_record) >= args.bases: # The records do not have the starting ">", so add it here so the parsing can be done much easier. # Will have == >@9030c343-3ff2-4b66-ab1b-c1e3327c3de0:1-188 list_wanted_records.append(">" + seq_record.id) #counter_1 += 1 # HANDLING THE FILE # Opening a file to write to with open(args.outputFile, "w") as file_to_write: # Opening a file to read to with open(args.inputFile, "r") as file_to_read: # This will create a list based on the newlines, containing strings reading_lines = file_to_read.readlines() # Setting a counter #counter_2 = 0 #counter_3 = 0 #counter_4 = 0 # Setting a flag flag = 0 # Iterating over the lines in the list for line in reading_lines: # Check for header lines/ record IDs if re.search("^>", line): # Only want to retain the beginning of the line, this will be the only identical part of the line # Split on spaces splitted_header = line.split(" ") # Retain the ID:START-END == >@9030c343-3ff2-4b66-ab1b-c1e3327c3de0:1-188 iso_header = splitted_header[0] # If the header is in the list set the flag to 1 == to write if iso_header in list_wanted_records: #counter_2 += 1 flag = 1 # If the header is not in the list then set flag to 0 == not write elif iso_header not in list_wanted_records: flag = 0 # This flag will allow ot print the lines to a file if flag == 1: file_to_write.writelines(line) #counter_3 += 1 # When flag is 0 it will skip the lines. elif flag == 0: #counter_4 += 1 continue # Closing the files for good practice file_to_read.close() file_to_write.close() # To get information out the parsed reads. 
#print(f"File readed through Biopython: {counter_1}\ # \nFile handled for checking matching lines (manually): {counter_2}\ # \nLines printed {counter_3}\ # \nLines skipped {counter_4}") """ |
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 | Version="2.0" ################################# function parse_yaml { local prefix=$2 local s='[[:space:]]*' w='[a-zA-Z0–9_]*' fs=$(echo @|tr @ '\034') sed -ne "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \ -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 | awk -F$fs '{ indent = length($1)/2; vname[indent] = $2; for (i in vname) {if (i > indent) {delete vname[i]}} if (length($3) > 0) { vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")} printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3); } }' } while getopts ":i:p:t:c:" o; do case "${o}" in i) i=${OPTARG} ;; p) p=${OPTARG} ;; t) t=${OPTARG} ;; c) c=${OPTARG} ;; *) usage ;; esac done shift $((OPTIND-1)) if [ -z "${i}" ] || [ -z "${p}" ] || [ -z "${t}" ] || [ -z "${c}" ]; then echo "Usage: $0 [-i <input fasta file>] [-p <prefix>] [-t <max no. of cpu cores>] [-c <config.yml>]" 1>&2; exit 1; fi eval $(parse_yaml $c) echo -e "***** [$0, Version: $Version] start " `date +'%Y/%m/%d %H:%M:%S'` " *****" echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] $0 -i $i -p $p -t $t -c $c\n" ########## STEP1 ########## echo "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 1. alignment: All vs all pairwise alignment of long-read by LAST aligner" makedb_cmd="lastdb8 -P $t -R $alignment_R -u $alignment_u $i $i" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $makedb_cmd" eval $makedb_cmd alignment_cmd="lastal8 -a $alignment_a -A $alignment_A -b $alignment_b -B $alignment_B -S $alignment_S -P $t -f $alignment_f $i $i > $i.blasttab" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $alignment_cmd" eval $alignment_cmd id=`grep -v "#" $i.blasttab | shuf -n 1000 | awk '{m+=$3} END{print m/NR}'` echo "[`date +'%Y/%m/%d %H:%M:%S'`] Average read-vs-read identity (%): $id" echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n" ########################### ########## STEP2 ########## echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 2. pars depth: Detecting the partially aligned reads (PARs)" parsdepth_cmd="SACRA_PARs_depth.pl -i $i.blasttab -al $parsdepth_al -tl $parsdepth_tl -pd $parsdepth_pd -id $parsdepth_id > $i.blasttab.depth" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $parsdepth_cmd" eval $parsdepth_cmd echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n" ########################### ########## STEP3 ########## echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 3. 
cal pc ratio: Obtaining the PARs/CARs ratio (PC ratio) at the putative chimeric positions" pcratio_cmd="SACRA_multi.pl $t $i.blasttab.depth" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $pcratio_cmd" eval $pcratio_cmd for k in `ls $i.blasttab.depth.split*` do echo "[`date +'%Y/%m/%d %H:%M:%S'`] SACRA_PCratio.pl -i $i.blasttab -pa $k -ad $pcratio_ad -id $pcratio_id > $k.pcratio" SACRA_PCratio.pl -i $i.blasttab -pa $k -ad $pcratio_ad -id $pcratio_id > $k.pcratio & done # wait for all backgroud jobs to finish wait cat="cat $i*.depth.split*.pcratio > $i.blasttab.depth.pcratio" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $cat" eval $cat rm -rf $i*.depth.split* echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n" ########################### ########## STEP4 ########## echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 4. cal mPC ratio: Calculate mPC ratio based on spike-in reference genome" if [ $mpc_sp = true ] && [ -e $mpc_rf ]; then echo "[`date +'%Y/%m/%d %H:%M:%S'`] mPC ratio was calculated based on provided spike-in reference genome." echo "[`date +'%Y/%m/%d %H:%M:%S'`] Spike-in reference genome: $mpc_rf" rfdb_cmd="lastdb8 -P $t -R $mpc_R -u $mpc_u $mpc_rf $mpc_rf" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $rfdb_cmd" eval $rfdb_cmd rfalign_cmd="lastal8 -a $mpc_a -A $mpc_A -b $mpc_b -B $mpc_B -S $mpc_S -P $t -f $mpc_f $mpc_rf $i > $i.spike-in.blasttab" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $rfalign_cmd" eval $rfalign_cmd grep -v "#" $i.spike-in.blasttab | awk -v "id=$mpc_id" '$3>=id' | awk -v "al=$mpc_al" '$4>=al' > $i.spike-in.blasttab.aligned awk '{print $1}' $i.spike-in.blasttab.aligned | sort | uniq > $i.spike-in.blasttab.aligned.id chimera_cmd="SACRA_detect_chimera.pl -i $i.spike-in.blasttab -id $mpc_id -al $mpc_al -lt $mpc_lt > $i.spike-in.blasttab.chimera" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $chimera_cmd" eval $chimera_cmd chimera_pos_cmd="SACRA_detect_chimeric_position.pl $i.spike-in.blasttab.chimera > $i.spike-in.blasttab.chimera.position" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $chimera_pos_cmd" eval $chimera_pos_cmd cal_mpc="SACRA_reference_PC.pl $i.spike-in.blasttab.aligned.id $i.blasttab.depth.pcratio $i.spike-in.blasttab.chimera.position" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $cal_mpc" eval $cal_mpc split_pc=`sort -k4,4nr $i.blasttab.depth.pcratio.mPC | awk '{print $1}' | sed 's/mPC=//g' | head -n 1` echo "[`date +'%Y/%m/%d %H:%M:%S'`] Calculated mPC ratio: $split_pc" echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n" else echo "[`date +'%Y/%m/%d %H:%M:%S'`] Spike-in reference genome is not provided, so the mPC ratio in the config file was used." echo "[`date +'%Y/%m/%d %H:%M:%S'`] Calculated mPC ratio: $split_pc" echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n" fi ########################### ########## STEP5 ########## echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] STEP 5. 
split: Split chimeras at the chimeric positions" split_cmd="SACRA_split.pl -i $i.blasttab.depth.pcratio -pc $split_pc -dp $split_dp -sl $split_sl | awk '{split(\$0,a,\":\"); split(a[2],b,\"-\"); print a[1],b[1]-1,b[2]}' > $i.blasttab.depth.pcratio.faidx" echo "[`date +'%Y/%m/%d %H:%M:%S'`] $split_cmd" eval $split_cmd seqtk subseq $i $i.blasttab.depth.pcratio.faidx > $i.split.fasta echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] DONE\n" ########################### ########## Output ########## echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] Output final reads" awk '{print $1}' $i.blasttab.depth.pcratio.faidx | uniq > $i.blasttab.depth.pcratio.faidx.id seqkit grep -v -f $i.blasttab.depth.pcratio.faidx.id $i > $i.non_chimera.fasta cat $i.non_chimera.fasta $i.split.fasta > $p echo "[`date +'%Y/%m/%d %H:%M:%S'`] Split reads: $i.split.fasta" echo "[`date +'%Y/%m/%d %H:%M:%S'`] Non-chimeras: $i.non_chimera.fasta" echo -e "[`date +'%Y/%m/%d %H:%M:%S'`] Combined reads: $p\n" ########################### echo "***** [$0, Version: $Version] finished " `date +'%Y/%m/%d %H:%M:%S'` " *****" |
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 | import re, os, argparse #, webbrowser, time from Bio import SeqIO # pip install biopython import numpy as np # pip install numpy import matplotlib.pyplot as plt # pip install matplotlib import pandas as pd ############################# COMMAND LINE INPUT ############################# parser = argparse.ArgumentParser(description='Generate Statistical Report. The report will contain the results in graph, tables, text, figures fo all different input files/folders gathered throughout the Snakemake pipeline.') parser.add_argument('outputFile', type=str, help='The name of the folder that will contain the output files.') parser.add_argument('porechopStat', type=str, help='The folder containing the Porechop Statistic files. 
THe files are the output from the terminal printed in text files.') parser.add_argument('porechopFastq', type=str, help='The folder containing the fastq files generated by Porechop ABI.') parser.add_argument('prowlerFolder', type=str, help='Give the folder containing the fasta files generated by Prowler Trimmer.') parser.add_argument('sacraFiltered', type=str, help='Give the Fasta file containing the filtered sequences on a certain length.') parser.add_argument('BandagePict', type=str, help='Give the assembly picture created in BANDAGE.') parser.add_argument('MITOS2Folder', type=str, help='Give the folder with the MITOS2 output folders.') args = parser.parse_args() ## PARSING UNIQUE IDENTIFIER # Use the unique identifier per sample uniquelabel = args.outputFile # Will use the unique label, ued to create a folder as identifier. Split on paths ('/') and get the unique identifier name out of the list. splitted_label = uniquelabel.split("/") identifier = splitted_label[2] ############################# HANDLING FILES ####################################### # This section will contain each of the files used for statistical summary; ############################# PORECHOP REPORT ####################################### ## TERMINAL OUTPUT # Empty list == The list used to gather all the lines containing relevant information from the terminal statistics_list = [] # Go over all the files in the directory for file in os.listdir(args.porechopStat): # Combining the path to the file + file name -- > To open the file file_path = f"{args.porechopStat}/{file}" # Opening the file from the current location with open(file_path, "r") as text_file: # Readlines returns a list of lines split on newline character == "\n" splitted_file = text_file.readlines() # Flag == Define when to print/parse or not. flag = 0 for line in splitted_file: # Setting up the different conditions to handle the output # When the re.search matches the pattern of a line it will return == True if not == False if re.search("Trimming adapters from read ends", line): flag = 1 elif re.search("adapters trimmed from their", line): flag = 2 elif re.search("reads were split based on middle adapters", line): flag = 3 # When encoutering a newline flag == 0 == Not add the line to the list elif re.search("^\n", line): flag = 0 # If any of the flags match the set value == strip the line on unwanted characters & save the line in a list if flag == 1 or flag == 2 or flag == 3: # Removing leading or trailing characters stripped_line = line.strip() # Add each stripped line to the list statistics_list.append(stripped_line) # Empty list == Removing special characters from the lines. cleaned_text_list = [] for text in statistics_list: # If any of the patterns match in the item from statistics_list then replace it by nothing == Removing unwanted characters. cleaned_text = re.sub(r'\x1b\[1m\x1b\[4m|\x1b\[0m|\x1b\[31m', '', text) # If the filtered item has nothing == Delete/Skip # Else keep it and add to a list if cleaned_text == '': continue else: cleaned_text_list.append(cleaned_text) ## ADAPTERS # Define an empty set == Will keep unique items only in a list (no duplicates). adapters = set() # Save all the statistical stats of trimmed reads in a list. 
trimmed_count = [] # Loop over the list that has been fully cleaned/stripped for item in cleaned_text_list: # ADAPTER LINES == ":" if re.search(":", item): adapters.add(item) # REMOVED READS STATS == "/" elif re.search("/", item): trimmed_count.append(item) # Need to extract now Start - Mid - End start = [] middle = [] end = [] # Gather the lines containg start - middle - end in each respective list. for item in trimmed_count: if re.search("start", item): start.append(item) elif re.search("middle", item): middle.append(item) elif re.search("end", item): end.append(item) StartAdapRem = 0 StartTotReads = 0 StartBasePairs = 0 # Split each item == sentence -- > Get the numbers only to generate a summary of all files togheter. for s in start: splitted_start = s.split(" ") # [0] == Nr reads adapters removed & [2] == Total Reads & [10] == BP StartAdapRem += int(splitted_start[0].replace(",","")) StartTotReads += int(splitted_start[2].replace(",","")) StartBasePairs += int(splitted_start[10].replace(",","").replace("(","")) #print(f"{StartAdapRem}-{StartTotReads}-{StartBasePairs}") MidAdapRem = 0 MidTotReads = 0 # Split each item == sentence -- > Get the numbers only to generate a summary of all files togheter. for m in middle: splitted_mid = m.split(" ") # [0] == Nr reads adapters removed & [2] == Total Reads MidAdapRem += int(splitted_mid[0].replace(",","")) MidTotReads += int(splitted_mid[2].replace(",","")) #print(f"{MidAdapRem}-{MidTotReads}") EndAdapRem = 0 EndTotReads = 0 EndBasePairs = 0 # Split each item == sentence -- > Get the numbers only to generate a summary of all files togheter. for e in end: splitted_end = e.split(" ") # [0] == Nr reads adapters removed & [2] == Total Reads & [10] == BP EndAdapRem += int(splitted_end[0].replace(",","")) EndTotReads += int(splitted_end[2].replace(",","")) EndBasePairs += int(splitted_end[10].replace(",","").replace("(","")) #print(f"{EndAdapRem}-{EndTotReads}-{EndBasePairs}") ############################# PROWLER REPORT ############################# # Empty lists to store the read lengths before and after trimming before_trim = [] after_trim = [] # Biopython library can handle fastq & fasta files. # Makes it much more easier to read in files and can filter on what part of each sequence you want == .seq, .id # Much faster then writing it yourself. # Read the input fastq file and append the length of each read to before_trim list. for pore_file in os.listdir(args.porechopFastq): # Concat the filepath == To open the file file_path_pore = f"{args.porechopFastq}/{pore_file}" for seq_record in SeqIO.parse(file_path_pore, "fastq"): # The length of each read before_trim.append(len(seq_record.seq)) # Read the trimmed fasta file and append the length of each read to after_trim list for prow_file in os.listdir(args.prowlerFolder): # Due to other files present (output SACRA), only want the fasta file before SACRA. 
if not re.search(".csv",prow_file) or\ re.search(".non_chimera.fasta",prow_file) or\ re.search(".split.fasta", prow_file): file_path_prow = f"{args.prowlerFolder}/{prow_file}" for seq_record in SeqIO.parse(file_path_prow, "fasta"): after_trim.append(len(seq_record.seq)) # Convert the read length lists to numpy arrays >> Plots require numpy array == [[...]] before_array = np.array(before_trim) after_array = np.array(after_trim) # Create a figure with two subplots + tight layout fig, axs = plt.subplots(1, 2, tight_layout=True, figsize = (10,5)) # Plot a histogram of read lengths before trimming + setting number of bins + setting the range of x axis axs[0].hist(before_array, bins = 40, range = [0,10000]) axs[0].set_title('Before trimming Reads') axs[0].set_xlabel('Read length') axs[0].set_ylabel('Frequency') # Plot a histogram of read lengths before trimming + setting number of bins + setting the range of x axis axs[1].hist(after_array, bins = 40, range = [0,10000]) axs[1].set_title('After trimming Reads') axs[1].set_xlabel('Read length') axs[1].set_ylabel('Frequency') # Saving the graph picture. plt.savefig(f"../results/{identifier}/{identifier}Before&After-Prowler.png", dpi=200, bbox_inches='tight') # Savefig does not close the plot. >> clf = close plt.clf() ############################# SACRA REPORT ############################# # Defining empty lists beforehand == Amount of records prow_records = [] chim_records = [] nonchim_records = [] # Gather seq len from Non-Cimera and Chimera reads -- > After SACRA sacra_seq_len = [] ### PROWLER SEQUENCES # Loop over the files in the folder. for prow_file in os.listdir(args.prowlerFolder): # Want one specif file == .fasta if not re.search(".csv",prow_file) and\ re.search(".non_chimera.fasta",prow_file) and\ re.search(".split.fasta", prow_file): file_path_prow = f"{args.prowlerFolder}/{prow_file}" # Reading in the Prowler Fasta file for seq_record in SeqIO.parse(file_path_prow, "fasta"): # Append each record to a list prow_records.append(seq_record.id) # Counting the number of IDs == Number of sequences >> total number of sequences present in the fasta file. count_prow = len([record for record in prow_records]) ### CHIMERA SEQUENCES # Loop over the files in the folder. for chim_file in os.listdir(args.prowlerFolder): if re.search(".split.fasta", chim_file): file_path_chim = f"{args.prowlerFolder}/{chim_file}" # Reading in the Fasta file with Chimere sequences for seq_record in SeqIO.parse(file_path_chim, "fasta"): # Append each record/header to a list chim_records.append(seq_record.id) # Gathering the read lengths. >> Want all read lengths (Chimera sequences multiple headers same BUT different Start - End position added) sacra_seq_len.append(len(seq_record.seq)) # Counting the number of IDs == Number of sequences >> total number of sequences present in the fasta file. count_chim = len([record for record in chim_records]) ### UNIQUE CHIMERA SEQUENCE ID # Will try to count and see how many reads had one/multiple chimera reads == UNIQUE HEADERS!! 
>> .split.fasta # Setting empty list beforehand record_splitted = [] # Iterating over the collected chimera sequence records for chim in chim_records: # Split each item >> HEADER:START-END == Want only the first part to find unique headers chim_splitted = chim.split(":") # Only ID part retained record_splitted.append(chim_splitted[0]) # Set an empty set >> Set == only unique IDs unique_chim = set() # Iterate over the sequence IDs for i in record_splitted: # Add to the set of IDs unique_chim.add(i) # Counting the number of IDs == Number of UNIQUE sequences count_unique_chim = len([record for record in unique_chim]) ### NON CHIMERA SEQUENCES for chim_file in os.listdir(args.prowlerFolder): if re.search(".non_chimera.fasta", chim_file): file_path_non_chim = f"{args.prowlerFolder}/{chim_file}" # Non chimera sequences were not splitted or so thus headers remain unique # Reading in the SACRA fasta file with Non Chimere reads for seq_record in SeqIO.parse(file_path_non_chim, "fasta"): nonchim_records.append(seq_record.id) # Gathering the read lengths. sacra_seq_len.append(len(seq_record.seq)) # Counting the number of IDs == Number of reads count_nonchim = len([record for record in nonchim_records]) # Informative print of the results gathered #print(f"Reads after Prowler: {count_prow}\ # \nHow many chimera sequences: {count_chim}\ # \nHow many from original sequence, are unique: {count_unique_chim}\ # \nHow many non chimera sequences: {count_nonchim}\ # \nSum chimera unique IDs & non chimera IDs: {(total_sacra_seq := count_unique_chim + count_nonchim)}") ### VISUALIZATION ### RAW RESULTS # Creating a pandas dataframe >> Parse to matplotlib sacraDf = pd.DataFrame([["No. sequences", count_unique_chim, count_nonchim]], columns = ["Amount", "Chimera", "Non chimera"]) # Plotting the rows of the No. sequences per bar. # Setting the width of the bars bit smaller. # Adding colormap for visualization. ax = sacraDf.plot(x='Amount', kind='bar', stacked=True, width = 0.2, colormap = "Set3", title='Total amounft of chimera and non-chimera sequences after SACRA') # Iterating over the patches to obtain the width and height + x and y coordinates # Using the x, y coordinates to place the label in the center of corresponding bar for p in ax.patches: width, height = p.get_width(), p.get_height() x, y = p.get_xy() # labelling text based on gathered positions. ax.text(x+width/2, y+height/2, '{:.0f}'.format(height), horizontalalignment='center', verticalalignment='center') # For some reason have to set the ticks to 0 to get the label horizontally. plt.xticks(rotation=0) # Legend location to the upper right # bbox anchor is to change the location, placed it outside the box, bbox_to_anchor(x,y) plt.legend(loc = 'upper right', bbox_to_anchor=(1.4, 0.95)) # Shrink current axis by 20% (x-axis) box = ax.get_position() ax.set_position([box.x0, box.y0, box.width * 0.8, box.height]) # Label y-axis plt.ylabel("No. of sequences") # Saving the picture, using bbox_inches plt.savefig(f"../results/{identifier}/{identifier}SACRA-Stacked-Seq-Amount.png", dpi=200, bbox_inches='tight') # Savefig does not close the plot. plt.clf() ### RELATIVE RESULTS # Calculations Relative Amount oc sequences. total_sacra_seq = count_unique_chim + count_nonchim rel_unique_chim = (count_unique_chim / total_sacra_seq) * 100 rel_nonchim = (count_nonchim / total_sacra_seq) * 100 # Creating a pandas dataframe. >> Parse to matplotlib sacraDf = pd.DataFrame([["No. 
sequences", rel_unique_chim, rel_nonchim]], columns = ["Amount", "Chimera", "Non chimera"]) #Plotting the rows of the No. sequences per bar. # Setting the width of the bars bit smaller. # Adding colormap for visualization. ax = sacraDf.plot(x='Amount', kind='bar', stacked=True, width = 0.2, colormap = "Set3", title='Total amount of chimera and non-chimera sequences after SACRA') # Iterating over the patches to obtain the width and height + x and y coordinates # Using the x, y coordinates to place it in the center of corresponding bar for p in ax.patches: width, height = p.get_width(), p.get_height() x, y = p.get_xy() # labelling text based on gathered positions. ax.text(x+width/2, y+height/2, '{:.2f} %'.format(height), horizontalalignment='center', verticalalignment='center') # For some reason have to set the ticks to 0 to get the label horizontally. plt.xticks(rotation=0) # Shrink current axis by 20% box = ax.get_position() ax.set_position([box.x0, box.y0, box.width * 0.8, box.height]) # Legend location to the upper right # bbox anchor is to change the location, placed it outside the box, bbox_to_anchor(x,y) plt.legend(loc = 'upper right', bbox_to_anchor=(1.4, 0.95)) # Label y-axis plt.ylabel("No. of sequences (%)") # Saving the created plot as .png, using the dpi to set the size of the figure. bbox_inches makes sure everything is saved in the canvas. plt.savefig(f"../results/{identifier}/{identifier}SACRA-Stacked-Seq-Rel-Amount.png", dpi=200, bbox_inches='tight') # Savefig does not close the plot. plt.clf() ### HISTOGRAM LENGTH READS # Convert the read length lists to numpy arrays for plotting sacra_array = np.array(sacra_seq_len) # Plot a histogram of sequence lengths after SACRA # Setting amount of bins & range of the graph. plt.hist(sacra_array, bins = 40, range = [min(sacra_array), 1000]) # Setting title, x and y labels. plt.title('Sequence lengths after SACRA') plt.xlabel('Sequence length') plt.ylabel('Frequency') # Determining to show the interval of x-axis ticks. plt.xticks(np.arange(0, 1000, 100)) # Saving the figure in .png format. bbox_inches makes sure everything is saved in the canvas. plt.savefig(f"../results/{identifier}/{identifier}SACRA-Hist-Distribution.png", dpi=200, bbox_inches='tight') # Savefig does not close the plot. plt.clf() ####################################### # SACRA FILTERING STEP ####################################### # Lists to store the sequence lengths after the filtering of fasta file on certain length. filtered_sacra = [] # Read the input fastq file and append the length of each sequence to before_trim list for seq_record in SeqIO.parse(args.sacraFiltered, "fasta"): filtered_sacra.append(len(seq_record.seq)) # Convert the read length lists to numpy arrays for plotting before_array = np.array(filtered_sacra) # Plot a histogram with a predefined number of bins & a range set from x1 to x2. plt.hist(before_array, bins=40, range=[0,1000]) # Setting the title plt.title('Filtering on a treshhold of ' + str(min(filtered_sacra)) + ' bases') # X-axis label plt.xlabel('Read length') # Setting y-axis label plt.ylabel('Frequency') # Detrmining to show the interval of x-axis ticks. # np.arange to go from a to b in x amount of steps. 
The remainder of scripts/StatisticalReportGenerator.py saves the histogram of the SACRA-filtered read lengths and then assembles the HTML statistical report. The report starts with a table of the adapters that Porechop ABI found, together with how many reads had adapters trimmed from their start, middle and end, followed by the before-and-after plots of Prowler, the SACRA stacked read-count and length-distribution plots, the DIAMOND hits per gene with the sequence lengths after filtering, and the SPAdes - Bandage assembly graph picture. For every contig annotated by MITOS2 the script parses the result.gff file into a table with the source, feature, start and end position, score, strand, frame and attributes, and prints the corresponding sequences from result.fas underneath. A fixed sidebar with a link to every section, including one per annotated contig, closes the page.
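As an illustration of how the MITOS2 part of that report is built, the sketch below parses a result.gff file into an HTML table. It is a minimal, self-contained example rather than the pipeline code itself: the gff_to_html_table function name, the html.escape call and the example path are assumptions added for clarity.

import html

def gff_to_html_table(gff_path):
    """Render the tab-separated GFF columns of a MITOS2 result.gff as an HTML table."""
    header = ("<table>\n\t<tr>\n\t\t<th>Source</th>\n\t\t<th>Feature</th>\n"
              "\t\t<th>Start position</th>\n\t\t<th>End position</th>\n\t\t<th>Score</th>\n"
              "\t\t<th>Strand</th>\n\t\t<th>Frame</th>\n\t\t<th>Attributes</th>\n\t</tr>\n")
    rows = []
    with open(gff_path) as gff:
        for line in gff:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 9:
                # Skip empty or malformed lines
                continue
            cells = "".join(f"\t\t<td>{html.escape(col)}</td>\n" for col in cols[1:9])
            rows.append(f"\t<tr>\n{cells}\t</tr>\n")
    return header + "".join(rows) + "</table>\n"

# Example usage with a hypothetical results path:
# print(gff_to_html_table("results/sample01/MITOS2/contig_1/result.gff"))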
The Prowler trimming script bundled with the workflow reads the input fastq file in buffers of roughly 1 MB, performs a rolling-window quality trim in static or dynamic mode, clips leading and/or trailing Ns, optionally splits the reads into fragments, and writes the trimmed reads to fastq or fasta under a file name that encodes the trim settings. It finishes by writing a CSV file with the coverage after trimming and the time the trim took. A simplified sketch of the rolling-window trim follows the option list below. The main command-line options are:
- -f / --file: the fastq file to trim, without the folder path
- -i / --infolder and -o / --outfolder: the folders where the input file is located and where the trimmed file is saved (default: current directory)
- -w / --windowsize: size of the trimming window (default 100 bp)
- -l / --minlen: minimum acceptable number of bases in a read (default 100)
- -m / --trimmode: S for static or D for dynamic trimming (default S)
- -q / --qscore: Phred quality score trimming threshold (default 7)
- -d / --datamax: maximum data subsample in MB (default 0, the entire file)
- -r / --outformat: output format of the trimmed file, .fastq or .fasta (default .fastq)
- -c / --clip: L to clip leading Ns, T to clip trailing Ns, LT for both (default LT)
- -g / --fragments: fragmentation mode (default U0)
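The sketch below shows the rolling-window idea behind that quality trim in a few lines. It is not the Prowler implementation itself: the function names, the Phred offset of 33 and the toy read in the usage example are assumptions made for demonstration.

def rolling_mask(seq, qual, q_cutoff=7, window=100, phred_offset=33):
    """Replace every window whose mean Phred score is below q_cutoff with Ns.
    seq and qual must have the same length."""
    masked = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        scores = [ord(char) - phred_offset for char in qual[start:start + window]]
        mean_q = sum(scores) / len(scores)
        masked.append(chunk if mean_q >= q_cutoff else "N" * len(chunk))
    return "".join(masked)

def clip_ns(seq):
    """Clip leading and trailing Ns, as the default LT clipping mode does."""
    return seq.strip("N")

# Example usage with a toy read whose second half is low quality:
# read = "ACGTACGTAC" * 10
# qual = "I" * 50 + "#" * 50
# print(clip_ns(rolling_mask(read, qual, q_cutoff=7, window=50)))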
Support
- Future updates