SMARTDADA2: High-Resolution Sequencing Quality Control Workflow
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
For more information here is the SMARTDADA2 website: https://smartdada2.readthedocs.io/en/latest/#
Repo Directory
Directory Name | Description |
---|---|
.github | Contains github actions tests |
notebooks | Contains series of steps and explanation of our method |
smartdada2 | Contains main scripts that will allow user to better understand the quality of their reads |
Installations and Dependencies
To install this program:
-
clone repository
-
create environment
-
activate environment
-
install smartdada2 module into python environment
For example in the terminal type:
git clone https://github.com/acolorado1/SMARTDADA2.git
conda env create -f smartdada2_env.yaml
conda activate {environemtnname i.e., smartdada2}
pip install -e .
Workflow
To Run Program
This program can be run using snakemake. In the Snakefile you must put the file path of the FASTQ file of your choosing and adjust parameters as needed. Parameters include:
-
PAIRED (bool): Are you using paired or signle end reads?
-
SUBSAMPLE (int): Number of reads you want to use when creating visualizations.
-
AVG_Q_SCORE (default = 30.0): Determines when obvious trimming will stop.
-
OBV_TRIMMING_MAX (default = 0.1): Determines when obvious trimming will throw a warning that the read might be too short. Default warns if trimming is over 10% of the read on either end.
-
MAX_TRIMMING (default = 0.2): Calculates max index trimming on either end when finding average expected error for position. Default will only calculate up to 20% trim/truncating on each end.
Once that is done write in the terminal:
snakemake -c 1
Note:
-c
specifies the max number of cores you want to use.
Input
This script takes in one FASTQ formatted file. For example:
@M07186:25:000000000-DHMKP:1:1101:15331:1350 1:N:0:1
TNCGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGCTTGTGAAAGCCCGGGGCTTAACTCCGGGTCTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAAC
+
>#>>>>>AA1>AEEEGEEGGFDFEGGEGHGHHHHGGGGGGGHHHFHHGGGGGFFFEFHGHBEGGCEE>EGGHHBBF1GGHGGGGGGHGHFHDHGGCBCGCGFH0EC/CGCGCGCGHEFCGHHGC=<GGDDDGHGG?-.EGHACCHFGFFFBCF.BB9?@;-AEBFFFGG?@-FFF9/;/BEBFFAFFF@@@FF--@@=?EFF<FFFFFFEF?-;FFFBF//FBBAB-FEBAE==@;B9BFFFFBA-->@@-
@M07186:25:000000000-DHMKP:1:1101:15094:1354 1:N:0:1
TNCGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA
+
A#>>>ABBBAABFGGGFGGGGGFDGGGGHFHHGHHHFFGGHGHFEEGECGGGGGGFGGEGFHHHHHGG@EGFGHFHHHHHGGGGGGHHFGHHHHGHHGHFHHHHHHHEE@GHHHHGGGHGHFGHHHHGHFHHHHGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFACDFAFFDAFFFFFFFFFFFFFFFFFFDDDDFFFFBFDDCFFFFFFFFFFFFFFFF
@M07186:25:000000000-DHMKP:1:1101:14910:1354 1:N:0:1
TNCGATTAACCCAAACTAATAGGCCTACGGCGTAAAGCGTGTTCAAGATACTTTTACACTAAAGTTAAAACTTAACTAAGCCGTAAAAAGCTACAGTTATCATAAAATAAACCACGAAAGTGACTTTATAATAATCTGACTACACGATAGCTAAGACCCAAACTGGGATTAGAAACCCCTGTAGTCCGGCTGGCTGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAATAGACGTGCTAGGTAT
+
A#>>>AABFFFB2AAEGGGGGGGHCGFHFG?EAEEAFGGGFEGH5AGGHFHHHHHHFFHHHGGHHHHHHGHHHHBDGGFHHHGGGGGGEDHHHHGHHGHHHHHFGEFGGFFFFFEFGGEF?FDEFHHHGHFFHGGHHEFFHHGHHDGEGGHGFHHGHHHHHHHBFGCGHHHHFDGGHFGFGCDGHHHDCGGCCABEGGHHH0CGHGGG/FGGG@FFGGGGFEFFB0CFBB<-;--.////.....//.9//
Output
An interactive output containing resulting images and two TSV files are output from these scripts in a directory called forward/output and reverse/output . In addition, static formats of the images will also be present in the subdirectory called plots .
The interactive output will appear as an html file that will resemble the following:
https://github.com/acolorado1/SMARTDADA2/assets/68305443/9f44536e-c4d5-466d-85b6-fca4a6c92a02
The two TSVs that will be output and that will be used to create the figures look like the following:
TrimInfo:
LeftIndex RightIndex ReadLength AvgEEPerPosition RightTrim LeftTrim
0 201 201 0.00025149061996092216 234 0
0 202 202 0.0002525037876296975 234 0
0 203 203 0.0002531623874528494 234 0
0 204 204 0.00025375036528248743 234 0
SumEEInfo:
NoTrimming ObviousTrimming
1.7478821923032126 1.545527496734793
0.686050226996564 0.6824597001220604
1.6827705639478587 0.9463769578124168
Contact
Angela Sofia Burkhart Colorado - angelasofia.burkhartcolorado@cuanschutz.edu
Erik Serrano - erik.serrano@cuanschutz.edu
Code Snippets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | import argparse as arg from smartdada2.reader import reader def getPairedFiles(main_fastq: str, output_f: str, output_r: str): """If paired end sequencing was performed get two files containing reverse and forward sequences. Args: main_fastq (str): one FASTq formatted file containing all reads output_f (str): output file path for forward sequences output_r (str): output file path for reverse sequences Returns: files: two fastq files will be written containing forward and reverse reads separately. """ # make sure input is of right type if not isinstance(main_fastq, str): raise TypeError("must be a file path of type string") # read in data fqe = reader.FastqReader(main_fastq) # make sure data is correct type if not isinstance(fqe, reader.FastqReader): raise TypeError("must be of FastqEntry type") # get forward reads in list format forward = [] for reads in fqe.get_forward_reads(): forward.extend((reads.header, reads.seq, "+", reads.scores)) # write forward file with open(output_f, "+w") as f: for line in forward: line = str(line) + "\n" f.write(line) # get reverse reads in list format reverse = [] for reads in fqe.get_reverse_reads(): reverse.extend((reads.header, reads.seq, "+", reads.scores)) # write reverse file with open(output_r, "+w") as f: for line in reverse: line = str(line) + "\n" f.write(line) return None # created parser to run through terminal def main(): parser = arg.ArgumentParser( description="This script will separate forward and reverse " + "reads for either paired or single end sequencing" ) parser.add_argument( "--fq", "-fastq", type=str, required=True, help="file path to fastq files", ) parser.add_argument( "--of_f", "-output_file_forward", type=str, required=True, help="output file path for forward sequences", ) parser.add_argument( "--of_r", "-output_file_reverse", type=str, required=True, help="output file path for reverse sequences", ) args = parser.parse_args() getPairedFiles(args.fq, args.of_f, args.of_r) exit() if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 | import argparse as arg from smartdada2.reader import reader def subsample(input_fp: str, subsample: int, output_fp: str): """Takes one FASTQ file and outputs subsets of the reads. Args: input_fp (str): FASTQ file path subsample (int): number of reads in subset output_fp (str): FASTQ file subset output path Returns: A fastq formatted file containing a subset of the original reads. """ # make sure input is of right type if not isinstance(input_fp, str): raise TypeError("must be a file path of type string") if not isinstance(subsample, int): raise TypeError("subsampling must be of integer type") if not isinstance(output_fp, str): raise TypeError("must be a file path of type string") # read in data fqe = reader.FastqReader(input_fp) # make sure data is correct type if not isinstance(fqe, reader.FastqReader): raise TypeError("must be of FastqEntry type") # perform reservoir sampling samples = fqe.reservoir_sampling(n_samples=subsample, seed=5) # create list of samples sample_list = [] for sample in samples: sample_list.extend((sample.header, sample.seq, "+", sample.scores)) # write into a fastq formatted file with open(output_fp, "+w") as f: for line in sample_list: line = str(line) + "\n" f.write(line) return None # created parser to run through terminal def main(): parser = arg.ArgumentParser( description="This script will take a fastq formatted file" + "and output a subset of it" ) parser.add_argument( "--fq", "-fastq", type=str, required=True, help="file path to fastq files", ) parser.add_argument( "--n", "-n_sample", type=int, required=True, help="Number of samples in subset" ) parser.add_argument( "--of", "-output_file", type=str, required=True, help="output file path" ) args = parser.parse_args() subsample(args.fq, args.n, args.of) exit() if __name__ == "__main__": main() |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 | import argparse as arg import warnings import GetMaxEE as GME import GetTrimParameters as GTP from smartdada2.reader import reader def main(): parser = arg.ArgumentParser( description="This script will output a dataframe containing" + "trimming information read length, average expected error by" + "position and the number of reads below and over a threshold" ) parser.add_argument( "--fq", "-fastq", type=str, required=True, help="file path to fastq files", ) parser.add_argument( "--t_of", "-trim_output_file", type=str, required=False, help="name of the output file containing trim information", default="parameter_info.tsv", ) parser.add_argument( "--th", "-threshold", type=int, required=False, help="Phred score threshold for obvious trimming", default=30, ) parser.add_argument( "--o_mtp", "-obvs_max_trim_percentage", type=float, required=False, help="max percent of the length of the original trim to be" + "trimmed on either end of the read in obvious trimming", default=0.1, ) parser.add_argument( "--a_mtp", "-additional_max_trim_percentage", type=float, required=False, help="when performing further trimming the max percentage of" + "the read to be trimmed at either end", default=0.2, ) parser.add_argument( "--EE_of", "-EE_output_file", type=str, required=False, help="name of output file containing sum of EE information", default="SumEEInfo.tsv", ) args = parser.parse_args() # ensure o_mtp is less than a_mtp if args.o_mtp > args.a_mtp: warnings.warn( "max percentage of trimming done in obvious trimming is larger" + "than the max in following steps which is not recommended" ) # read in data fqe = reader.FastqReader(args.fq) # get average scores per position avg_scores_df = fqe.get_average_score() avg_scores_list = avg_scores_df["AverageQualityScore"].values.tolist() # perform obvious trimming left, right = GTP.trim_ends_less_than_threshold( avg_scores_list, args.th, args.o_mtp ) # get average EE by position for different read sizes EE_by_size_df = GTP.read_size_by_avg_EE(fqe, left, right, args.a_mtp) # get dataframe containing the sum of the expected error per sequence sumEE = GME.read_size_by_maxEE(fqe, left, right) # convert final dataframes to TSV EE_by_size_df.to_csv(args.t_of, sep="\t", index=False) sumEE.to_csv(args.EE_of, sep="\t", index=False) exit() if __name__ == "__main__": main() |
33 34 | shell: "python smartdada2/GetPairedendFiles.py --fq {input.fq} --of_f {output.forward} --of_r {output.rev}" |
43 44 | shell: "python smartdada2/GetSubsamples.py --fq {input} --n {params.n} --of {output}" |
56 57 | shell: "python smartdada2/main.py --fq {input} --t_of {output.TrimInfo} --th {params.threshold} --o_mtp {params.o_mtp} --a_mtp {params.a_mtp} --EE_of {output.SumEEInfo}" |
72 73 | shell: "Rscript " + "smartdada2/R_scripts/vizualize.R --trim_file {input.TrimInfo} --sumEE_file {input.sumEEInfo} --output_file {input.output_file}" |
82 83 | shell: "Rscript " + "smartdada2/R_scripts/wrapper_RunRender.R --trim_file {input.TrimInfo} --sumEE_file {input.sumEEInfo} --output_file {input.output_file}" |
Support
- Future updates
Related Workflows





