SMARTDADA2: High-Resolution Sequencing Quality Control Workflow

public 1yr ago Version: v1.0 0 bookmarks

View Workflow

dada2parameterexploration — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

SMARTDADA2

For more information here is the SMARTDADA2 website: https://smartdada2.readthedocs.io/en/latest/#

Repo Directory

Directory Name	Description
.github	Contains github actions tests
notebooks	Contains series of steps and explanation of our method
smartdada2	Contains main scripts that will allow user to better understand the quality of their reads

Installations and Dependencies

To install this program:

clone repository
create environment
activate environment
install smartdada2 module into python environment

For example in the terminal type:

git clone https://github.com/acolorado1/SMARTDADA2.git
conda env create -f smartdada2_env.yaml
conda activate {environemtnname i.e., smartdada2}
pip install -e .

Workflow

To Run Program

This program can be run using snakemake. In the Snakefile you must put the file path of the FASTQ file of your choosing and adjust parameters as needed. Parameters include:

PAIRED (bool): Are you using paired or signle end reads?
SUBSAMPLE (int): Number of reads you want to use when creating visualizations.
AVG_Q_SCORE (default = 30.0): Determines when obvious trimming will stop.
OBV_TRIMMING_MAX (default = 0.1): Determines when obvious trimming will throw a warning that the read might be too short. Default warns if trimming is over 10% of the read on either end.
MAX_TRIMMING (default = 0.2): Calculates max index trimming on either end when finding average expected error for position. Default will only calculate up to 20% trim/truncating on each end.

Screen Shot 2023-05-23 at 1 58 10 PM Once that is done write in the terminal:

snakemake -c 1

Note: -c specifies the max number of cores you want to use.

Input

This script takes in one FASTQ formatted file. For example:

@M07186:25:000000000-DHMKP:1:1101:15331:1350 1:N:0:1
TNCGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGCTTGTGAAAGCCCGGGGCTTAACTCCGGGTCTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAAC
+
>#>>>>>AA1>AEEEGEEGGFDFEGGEGHGHHHHGGGGGGGHHHFHHGGGGGFFFEFHGHBEGGCEE>EGGHHBBF1GGHGGGGGGHGHFHDHGGCBCGCGFH0EC/CGCGCGCGHEFCGHHGC=<GGDDDGHGG?-.EGHACCHFGFFFBCF.BB9?@;-AEBFFFGG?@-FFF9/;/BEBFFAFFF@@@FF--@@=?EFF<FFFFFFEF?-;FFFBF//FBBAB-FEBAE==@;B9BFFFFBA-->@@-
@M07186:25:000000000-DHMKP:1:1101:15094:1354 1:N:0:1
TNCGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA
+
A#>>>ABBBAABFGGGFGGGGGFDGGGGHFHHGHHHFFGGHGHFEEGECGGGGGGFGGEGFHHHHHGG@EGFGHFHHHHHGGGGGGHHFGHHHHGHHGHFHHHHHHHEE@GHHHHGGGHGHFGHHHHGHFHHHHGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFACDFAFFDAFFFFFFFFFFFFFFFFFFDDDDFFFFBFDDCFFFFFFFFFFFFFFFF
@M07186:25:000000000-DHMKP:1:1101:14910:1354 1:N:0:1
TNCGATTAACCCAAACTAATAGGCCTACGGCGTAAAGCGTGTTCAAGATACTTTTACACTAAAGTTAAAACTTAACTAAGCCGTAAAAAGCTACAGTTATCATAAAATAAACCACGAAAGTGACTTTATAATAATCTGACTACACGATAGCTAAGACCCAAACTGGGATTAGAAACCCCTGTAGTCCGGCTGGCTGACTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAATAGACGTGCTAGGTAT
+
A#>>>AABFFFB2AAEGGGGGGGHCGFHFG?EAEEAFGGGFEGH5AGGHFHHHHHHFFHHHGGHHHHHHGHHHHBDGGFHHHGGGGGGEDHHHHGHHGHHHHHFGEFGGFFFFFEFGGEF?FDEFHHHGHFFHGGHHEFFHHGHHDGEGGHGFHHGHHHHHHHBFGCGHHHHFDGGHFGFGCDGHHHDCGGCCABEGGHHH0CGHGGG/FGGG@FFGGGGFEFFB0CFBB<-;--.////.....//.9//

Output

An interactive output containing resulting images and two TSV files are output from these scripts in a directory called forward/output and reverse/output . In addition, static formats of the images will also be present in the subdirectory called plots .

The interactive output will appear as an html file that will resemble the following:

https://github.com/acolorado1/SMARTDADA2/assets/68305443/9f44536e-c4d5-466d-85b6-fca4a6c92a02

The two TSVs that will be output and that will be used to create the figures look like the following:

TrimInfo:

LeftIndex	RightIndex	ReadLength	AvgEEPerPosition RightTrim	LeftTrim
0 201 201 0.00025149061996092216	234 0
0 202 202 0.0002525037876296975 234 0
0 203 203 0.0002531623874528494 234 0
0 204 204 0.00025375036528248743	234 0

SumEEInfo:

NoTrimming ObviousTrimming
1.7478821923032126 1.545527496734793
0.686050226996564 0.6824597001220604
1.6827705639478587 0.9463769578124168

Contact

Angela Sofia Burkhart Colorado - angelasofia.burkhartcolorado@cuanschutz.edu

Erik Serrano - erik.serrano@cuanschutz.edu

Code Snippets

import argparse as arg

from smartdada2.reader import reader


def getPairedFiles(main_fastq: str, output_f: str, output_r: str):
    """If paired end sequencing was performed get
    two files containing reverse and forward sequences.

    Args:
        main_fastq (str): one FASTq formatted file containing all
        reads
        output_f (str): output file path for forward sequences
        output_r (str): output file path for reverse sequences

    Returns:
        files: two fastq files will be written containing
        forward and reverse reads separately.
    """
    # make sure input is of right type
    if not isinstance(main_fastq, str):
        raise TypeError("must be a file path of type string")

    # read in data
    fqe = reader.FastqReader(main_fastq)

    # make sure data is correct type
    if not isinstance(fqe, reader.FastqReader):
        raise TypeError("must be of FastqEntry type")

    # get forward reads in list format
    forward = []
    for reads in fqe.get_forward_reads():
        forward.extend((reads.header, reads.seq, "+", reads.scores))

    # write forward file
    with open(output_f, "+w") as f:
        for line in forward:
            line = str(line) + "\n"
            f.write(line)

    # get reverse reads in list format
    reverse = []
    for reads in fqe.get_reverse_reads():
        reverse.extend((reads.header, reads.seq, "+", reads.scores))

    # write reverse file
    with open(output_r, "+w") as f:
        for line in reverse:
            line = str(line) + "\n"
            f.write(line)

    return None


# created parser to run through terminal
def main():
    parser = arg.ArgumentParser(
        description="This script will separate forward and reverse "
        + "reads for either paired or single end sequencing"
    )

    parser.add_argument(
        "--fq",
        "-fastq",
        type=str,
        required=True,
        help="file path to fastq files",
    )

    parser.add_argument(
        "--of_f",
        "-output_file_forward",
        type=str,
        required=True,
        help="output file path for forward sequences",
    )

    parser.add_argument(
        "--of_r",
        "-output_file_reverse",
        type=str,
        required=True,
        help="output file path for reverse sequences",
    )

    args = parser.parse_args()

    getPairedFiles(args.fq, args.of_f, args.of_r)

    exit()


if __name__ == "__main__":
    main()

Python From line 1 of smartdada2/GetPairedendFiles.py

import argparse as arg

from smartdada2.reader import reader


def subsample(input_fp: str, subsample: int, output_fp: str):
    """Takes one FASTQ file and outputs subsets of the reads.

    Args:
        input_fp (str): FASTQ file path
        subsample (int): number of reads in subset
        output_fp (str): FASTQ file subset output path

    Returns:
        A fastq formatted file containing a subset of the original
        reads.
    """
    # make sure input is of right type
    if not isinstance(input_fp, str):
        raise TypeError("must be a file path of type string")
    if not isinstance(subsample, int):
        raise TypeError("subsampling must be of integer type")
    if not isinstance(output_fp, str):
        raise TypeError("must be a file path of type string")

    # read in data
    fqe = reader.FastqReader(input_fp)

    # make sure data is correct type
    if not isinstance(fqe, reader.FastqReader):
        raise TypeError("must be of FastqEntry type")

    # perform reservoir sampling
    samples = fqe.reservoir_sampling(n_samples=subsample, seed=5)

    # create list of samples
    sample_list = []
    for sample in samples:
        sample_list.extend((sample.header, sample.seq, "+", sample.scores))

    # write into a fastq formatted file
    with open(output_fp, "+w") as f:
        for line in sample_list:
            line = str(line) + "\n"
            f.write(line)

    return None


# created parser to run through terminal
def main():
    parser = arg.ArgumentParser(
        description="This script will take a fastq formatted file"
        + "and output a subset of it"
    )

    parser.add_argument(
        "--fq",
        "-fastq",
        type=str,
        required=True,
        help="file path to fastq files",
    )

    parser.add_argument(
        "--n", "-n_sample", type=int, required=True, help="Number of samples in subset"
    )

    parser.add_argument(
        "--of", "-output_file", type=str, required=True, help="output file path"
    )

    args = parser.parse_args()

    subsample(args.fq, args.n, args.of)

    exit()


if __name__ == "__main__":
    main()

Python From line 1 of smartdada2/GetSubsamples.py

import argparse as arg
import warnings

import GetMaxEE as GME
import GetTrimParameters as GTP

from smartdada2.reader import reader


def main():
    parser = arg.ArgumentParser(
        description="This script will output a dataframe containing"
        + "trimming information read length, average expected error by"
        + "position and the number of reads below and over a threshold"
    )

    parser.add_argument(
        "--fq",
        "-fastq",
        type=str,
        required=True,
        help="file path to fastq files",
    )
    parser.add_argument(
        "--t_of",
        "-trim_output_file",
        type=str,
        required=False,
        help="name of the output file containing trim information",
        default="parameter_info.tsv",
    )
    parser.add_argument(
        "--th",
        "-threshold",
        type=int,
        required=False,
        help="Phred score threshold for obvious trimming",
        default=30,
    )
    parser.add_argument(
        "--o_mtp",
        "-obvs_max_trim_percentage",
        type=float,
        required=False,
        help="max percent of the length of the original trim to be"
        + "trimmed on either end of the read in obvious trimming",
        default=0.1,
    )
    parser.add_argument(
        "--a_mtp",
        "-additional_max_trim_percentage",
        type=float,
        required=False,
        help="when performing further trimming the max percentage of"
        + "the read to be trimmed at either end",
        default=0.2,
    )
    parser.add_argument(
        "--EE_of",
        "-EE_output_file",
        type=str,
        required=False,
        help="name of output file containing sum of EE information",
        default="SumEEInfo.tsv",
    )

    args = parser.parse_args()

    # ensure o_mtp is less than a_mtp
    if args.o_mtp > args.a_mtp:
        warnings.warn(
            "max percentage of trimming done in obvious trimming is larger"
            + "than the max in following steps which is not recommended"
        )

    # read in data
    fqe = reader.FastqReader(args.fq)

    # get average scores per position
    avg_scores_df = fqe.get_average_score()
    avg_scores_list = avg_scores_df["AverageQualityScore"].values.tolist()

    # perform obvious trimming
    left, right = GTP.trim_ends_less_than_threshold(
        avg_scores_list, args.th, args.o_mtp
    )

    # get average EE by position for different read sizes
    EE_by_size_df = GTP.read_size_by_avg_EE(fqe, left, right, args.a_mtp)

    # get dataframe containing the sum of the expected error per sequence
    sumEE = GME.read_size_by_maxEE(fqe, left, right)

    # convert final dataframes to TSV
    EE_by_size_df.to_csv(args.t_of, sep="\t", index=False)
    sumEE.to_csv(args.EE_of, sep="\t", index=False)

    exit()


if __name__ == "__main__":
    main()

Python From line 1 of smartdada2/main.py

shell:
    "python smartdada2/GetPairedendFiles.py --fq {input.fq} --of_f {output.forward} --of_r {output.rev}"

SnakeMake From line 33 of master/Snakefile

shell:
    "python smartdada2/GetSubsamples.py --fq {input} --n {params.n} --of {output}"

SnakeMake From line 43 of master/Snakefile

shell:
    "python smartdada2/main.py --fq {input} --t_of {output.TrimInfo} --th {params.threshold} --o_mtp {params.o_mtp} --a_mtp {params.a_mtp} --EE_of {output.SumEEInfo}"

SnakeMake From line 56 of master/Snakefile

shell:
    "Rscript " + "smartdada2/R_scripts/vizualize.R --trim_file {input.TrimInfo} --sumEE_file {input.sumEEInfo} --output_file {input.output_file}"

SnakeMake From line 72 of master/Snakefile

shell:
    "Rscript " + "smartdada2/R_scripts/wrapper_RunRender.R --trim_file {input.TrimInfo} --sumEE_file {input.sumEEInfo} --output_file {input.output_file}"