A metatranscriptomic pipeline optimized for the study of microeukaryotes.
This README is a brief introduction to eukrhythmic. For more detailed information, please visit the readthedocs.
Metatranscriptomic Pipelining
Documentation for eukrhythmic, a metatranscriptomics analysis pipeline written as a Snakemake workflow that enables users to run:

- Trimming
- Quality statistics
- Assembly (using several available assemblers)
- Assembly analysis / quality comparisons
Initializing the pipeline
Initialize the pipeline by setting up a conda environment, so that all the required packages are loaded.

conda env create -n metatrans --file environment.yaml
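Once the environment has been created, activate it before invoking any pipeline commands (assuming the environment name metatrans used above):

```bash
conda activate metatrans
```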
Pipeline schematic
File naming
No file name should be entirely contained within another file name. For example, if one file is called "supercool_sample", another should not be called "supercool_sample_2". However, "supercool_sample_1" and "supercool_sample_2" are fine, because neither name is entirely contained within the other.
The metaT_sample file

The config.yaml file contains an entry for a file called metaT_sample. This file is the data input source: it lists the sample names you expect to include in your analysis, along with any other information about the samples that you would like to be used. It is essential if you would like to apply groupings and co-assemble several samples together, and in general it is required for the pipeline to work as intended.
Depending on the application, some columns of this file must be added or may not be necessary. For example, for repeated samples in the same location, the latitude and longitude may not be necessary, because geographic variation in the metatranscriptomic assembly will not be evaluated. Any data that are not included in the default steps in the provided script can be excluded.
As detailed in scripts/make_sampleinfo.ipynb, the minimum columns required to run the general (default) pipeline, without any comparative analysis between samples, are "SampleName", "SampleID", and "FastqFileNames". SampleName and SampleID can be identical if no such distinction exists for your samples. Strictly speaking, however, "SampleName" is a descriptive name for your samples, whereas "SampleID" is how the samples will be named throughout the pipeline, so it is preferable to minimize special characters and length in the "SampleID" column, whereas "SampleName" may be more verbose.
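For illustration, a minimal metaT_sample file might look like the sketch below. The sample names, IDs, and file names are hypothetical, columns are shown aligned for readability, and the exact formatting expected for each column (for example, how paired fastq files are listed) should be checked against scripts/make_sampleinfo.ipynb:

```
SampleName           SampleID   FastqFileNames                              AssemblyGroup
Station 1 surface    St1Surf    St1Surf_1.fastq.gz, St1Surf_2.fastq.gz      1
Station 1 deep       St1Deep    St1Deep_1.fastq.gz, St1Deep_2.fastq.gz      1
```

The AssemblyGroup column, discussed below, is optional and controls which samples are co-assembled.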
Autogeneration of the full metaT_sample file

Using the script scripts/autogenerate_metaT_sample.py, you can autogenerate a working sample file for whatever files are present in the directory specified in your configuration file as inputDIR. The script should be run from the base eukrhythmic directory. At minimum, it requires a name for the output metaT_sample file as a parameter; the file will be saved in the input directory and should then be specified in the configuration file as the metaT_sample entry. (TODO: in the future, autopopulate the config file with this entry and allow users to run the pipeline without running this script separately, using a default name for the metaT_sample file, as long as the inputDIR is specified.)
So within the base eukrhythmic directory, the following command may be run:
python scripts/autogenerate_metaT_sample.py testsampledata.txt
Optionally, additional positional parameters may be provided after the output file name. The second parameter is a file extension, which defaults to "fastq.gz". The third and fourth parameters are labels for forward and reverse reads (defaults "_1" and "_2", respectively), and the fifth parameter is an additional file suffix used to split the filename (e.g. 001; defaults to the file extension; particularly important for single-end reads).
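For example, the following hypothetical invocation sets all of the optional parameters explicitly (output file name, file extension, forward label, reverse label, and splitting suffix; the values shown are illustrative):

```bash
python scripts/autogenerate_metaT_sample.py testsampledata.txt fastq.gz _R1 _R2 001
```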
Autogeneration of "FastqFileNames" column with "SampleID" column
In /scripts/, there is a Python script called
make_sample_file.py
that will generate the
fastq
file names column, given your data input folder and an existing
metaT_sample
file that contains sample IDs. This is a good option if you only wish to run a subset of the files in your input data folder, and want to make sure the
fastq
file names are properly formatted.
If you use this script, all of the files with the fastq extension listed in your INPUTDIR that have a match to entries in your SampleID column will be included. Optionally, the script accepts up to two input arguments that specify how forward and reverse reads are labeled. By default, "_R", "_1", and "_2" are searched for.
Notes about manually creating metaT_sample

If you specify "FastqFileNames" manually, ensure that the files are named uniquely and that the entire unique choice of name is specified in this column. Filenames that match more than two fastq files in your input directory will raise an exception.
In the event that you want more control over how your samples are named, use scripts/make_sampleinfo.ipynb. The AssemblyGroup column may be omitted if you do not mind your samples being assigned assembly groups numbered 1 through n, where n is your number of samples; in this case, however, separategroups must be set to 0 in your config.yaml file. (In the future, this may be updated to be done automatically if the column is absent.)
Once you have generated the metaT_sample file containing the information about your samples, and have populated config.yaml with the relevant directories (including, importantly, outputDIR, which will be the location of your results), you are ready to run the pipeline.
Configuration file entries

Below is a listing of each supported entry in the configuration file (config.yaml in the base directory) and how to specify each flag when using the pipeline.

Flag in file | Meaning & how to specify |
---|---|
metaT_sample | The name of the sample file containing sample IDs to be used as unique identifiers in the pipeline, descriptive sample names, and input FASTQ file names. |
inputDIR | The directory where the input data are found. Currently, this should be specified with "/" separators but no trailing "/", and should begin with "/" only if you are specifying an absolute path from the root of your file system (not a relative path). |
checkqual | Boolean flag for whether to run quality checking with salmon, QUAST, BUSCO, etc. on assemblies. If 1, these quality checks are performed. |
spikefile | A path to a FASTA file containing the sequence of any spike-in that might affect reads. This will depend on the experimental setup. If the file is not valid (e.g., if this flag is set to 0), nothing is done. |
runbbmap | A Boolean flag specifying whether to use the spike file to drop spiked reads, according to what was done in your experiment. If 1, the spikefile is used; otherwise, this filtering is either not performed or not used downstream in the pipeline (depending on whether a spike file exists). |
kmers | A list of k-mer sizes to use, where applicable, in assembly. These should all be integer values (default: 20, 50, 110). The median k-mer value in this list is used when just one k-mer value is required. |
assemblers | The assemblers used to assemble the metatranscriptomes (which will later be merged). Each assembler in this list should have a matching Snakemake rule in the modules folder of the main pipeline directory (named identically), as well as a "clean" rule (explained below). |
jobname | A descriptive name used to name jobs on your high-performance computing system, so that you can track the progress of your workflow. |
adapter | Path to a FASTA file containing the adapter used during sequencing. Defaults to a static adapter file in the static directory. |
separategroups | A Boolean flag. If 1, the assembly groups specified in the metaT_sample file are used to co-assemble raw files. Otherwise, each raw file is assembled separately, regardless of what is specified in the "AssemblyGroup" column of the input file. |
outputDIR | The path to a directory where all program output will be stored. |
assembledDIR | The directory to move assembled files to, relative to the output directory. Defaults to "assembled"; not necessary to specify. |
renamedDIR | The directory to move "renamed" files to (files with the name of the assembler added to each FASTA header), relative to the output directory. Defaults to "assembled"; not necessary to specify. |
scratch | The location to move unnecessary intermediate files to after computation. |
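As a sketch, a config.yaml using these entries might look like the following. The values and paths are placeholders, and the exact key names and defaults should be checked against the config.yaml shipped with the repository (for example, the assembler names listed must match rules in the modules folder):

```yaml
metaT_sample: mysamples.txt        # the sample file described above
inputDIR: /path/to/raw_fastq       # no trailing "/"
outputDIR: /path/to/eukrhythmic_output
scratch: /path/to/scratch
checkqual: 1                       # run salmon / QUAST / BUSCO checks on assemblies
runbbmap: 0                        # set to 1 and point spikefile at a FASTA if reads were spiked
spikefile: 0
adapter: static/adapters.fa        # placeholder; defaults to the file in the static directory
kmers:
  - 20
  - 50
  - 110
assemblers:                        # names must match Snakemake rules in the modules folder
  - trinity
  - megahit
jobname: eukrhythmic_run
separategroups: 1                  # use the AssemblyGroup column to co-assemble samples
```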
How to use the pipeline

There are three main options for using the pipeline: you can run the pipeline directly using command-line arguments, you can set up a simple batch script with those command-line options and run it via SLURM, or you can treat the pipeline as a normal Snakemake workflow and edit its components manually by changing the configuration file.
Running the pipeline from the command line

To run the pipeline from the command line, you likely still want to make sure that the sample file is generated the way you intend. To explore what goes into this file, and the way it will be autogenerated when executing eukrhythmic from the command line, please read the section on file naming above.

Other than that, all you need to do to execute eukrhythmic from the command line is to alias the executable while in the eukrhythmic base directory after cloning from GitHub, like so:

alias eukrhythmic='./bin/eukrhythmic.sh'

And then invoke:

eukrhythmic <arguments>
To run eukrhythmic on the provided sample data, simply invoke eukrhythmic on a clean install of the program without arguments. You can also run the sample data by using the argument --use-sample, which will copy the relevant configuration entries. The exception is if you are running on an HPC system that uses the SLURM scheduler: in that case, invoke eukrhythmic with one argument, -l or --slurm, to leverage your computing resources, and submit it as a batch job (sbatch eukrhythmic --slurm) to avoid running any steps on the machine you are currently logged into. A list of additional arguments commonly required by users is given below:
Flag | config.yaml entry | Meaning |
---|---|---|
-n / --job-name | jobname | A descriptive/unique name to be applied to SLURM jobs. |
-s / --sample-file-name | metaT_sample | The listing of SampleIDs, SampleNames, FastqFiles, and assembly groups for each sample. |
-o / --out-dir | outputDIR | The output directory to write to. |
-i / --in-dir | inputDIR | The location of the sample FASTQ files (crucial unless running sample data). |
-q / --check-quality | checkqual | Takes no arguments. If specified, quality assessment is run on assemblies. |
-b / --run-bbmap | runbbmap; spikefile | If your data contain spiked reads, you can list this flag and specify a FASTA file containing the spike. |
-l / --slurm | - | If provided, run snakemake on SLURM. |
-g / --generate-file | - | If provided, create the sample file automatically from the files present in the input directory. |
--use-sample | - | If provided, override all other parameters and run the provided sample data. |
-c / --scratch-dir | scratch | Takes an argument: the directory to be used as scratch space. |
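Putting these together, a hypothetical invocation that generates the sample file from the input directory, runs quality checks on assemblies, and submits jobs via SLURM might look like the following (the job name and paths are placeholders):

```bash
eukrhythmic --job-name mytest \
            --in-dir /path/to/raw_fastq \
            --out-dir /path/to/eukrhythmic_output \
            --generate-file --check-quality --slurm
```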
Running the pipeline using a batch script

You can use the command-line arguments described above if you run the pipeline from submit/eukrhythmic.sh. Submitting this script, customized with whatever arguments you prefer, to a scheduler (from the home directory) will run the pipeline in full.

To use the pipeline, the most important thing to do is to specify the paths to your particular input and output directories. Personalizing these will allow the pipeline to pull the relevant files when computing the results of individual rules, so this step is crucial. You can edit the configuration file directly, or use the command-line options specified above. Specifically, you would change this line in eukrhythmic.sh:

eukrhythmic --use-sample --slurm

and add in-dir and out-dir flags like so:

eukrhythmic --use-sample --in-dir <name-of-your-input> --out-dir <name-of-your-output>
All input fastq files must be contained in the same directory, the inputDIR location, specified by the --in-dir flag. Only these metatranscriptomic data will be included in the analysis. These files do not, however, need to be located inside the eukrhythmic directory (and it is recommended that they are located elsewhere).
The next thing that needs to be done is to produce the sample file containing all of the relevant information needed to run the pipeline. You can create this yourself, generate it semi-automatically from a list of sample IDs, or generate it completely automatically. The fully automatic route can be done all in one step with the eukrhythmic bash script, by specifying:

eukrhythmic --use-sample --in-dir <name-of-your-input> --out-dir <name-of-your-output> -g

using the -g or --generate-file option, which runs the included autogeneration script.
If you are using the SLURM scheduler, you can run the pipeline by simply adding/keeping the --slurm flag in the eukrhythmic command after configuring options for your particular system, or by setting the rewritecluster configuration flag to 1 and specifying the options for all jobs in the required section of the cluster.yaml file.
Running the pipeline manually

To use the pipeline as a Snakemake workflow manually, the most important thing to do is to populate config.yaml with the paths to your particular input and output directories, and to generate the sample file (either manually or using a semi- or fully automatic procedure). Personalizing these will allow the pipeline to pull the relevant files when computing the results of individual rules, so this step is crucial. You should also edit any other important parts of the configuration file, as described in the configuration file entries section above.
Once the pieces are in place, and you have either activated an environment using environment.yaml or otherwise installed snakemake, you can run the pipeline using:

sbatch submit/snake_submit.sh
If you are using the SLURM scheduler, you can run the pipeline by simply executing the submit/snake_submit.sh file in the eukrhythmic directory after configuring options for your particular system, or by setting the rewritecluster configuration flag to 1 and specifying the options for all jobs in the required section of the cluster.yaml file. If you are not using a scheduler, or are logged into a computer with sufficient computational resources (e.g., a SLURM job run in interactive mode), you can execute Snakemake directly.
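For example, from the base eukrhythmic directory and with the conda environment active, a direct run might look like the following (the core count is illustrative and should match your machine):

```bash
snakemake --cores 16
```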
Running the pipeline with SLURM

In order to run the pipeline with SLURM, or any other similar scheduling platform on an HPC system, the file cluster.yaml in the base directory needs to be populated. Specifications for individual rules can be added or removed as needed; the default configuration is what must absolutely be specified for the pipeline to run properly. Make sure that you include the following:

- Your account name
- Any flags that you typically use when running commands on the system, in the __default__ -> slurm -> command string
- The partition of your system that you plan to use, as queue. By default, this might be compute or normal.
If you set your account name at the top of the cluster.yaml file, as well as setting the default partition just once, and you do not change the parameter rewritecluster to 0 in config.yaml, you can use the command-line interface or the provided submission file to circumvent filling out the rest of cluster.yaml. You can also do this by invoking python scripts/importworkspace.py once before running the pipeline, if you already have a valid config.yaml. If you do this, you won't need to change these values in the specifications for all of the individual rules, unless you have specific computational needs or usage requirements, in which case you should set rewritecluster to 0. If defaults are not specified at the beginning of the cluster.yaml file for the user, maximum memory usage, maximum number of cores, and maximum number of threads, eukrhythmic will not execute successfully and an error will be thrown.
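As a sketch, the defaults at the top of cluster.yaml might look something like the following. The key names and structure here are assumptions based on the description above and should be checked against the cluster.yaml shipped with the repository:

```yaml
# top-level defaults (key names are illustrative placeholders)
account: myproject            # your SLURM account name
user: myusername
maxmemory: 180                # maximum memory (GB) a job may request
maxcores: 30                  # maximum number of cores
maxthreads: 30                # maximum number of threads

__default__:
    queue: compute            # the partition to submit jobs to
    slurm:
        command: "sbatch --export=ALL"   # flags you typically use when submitting jobs
```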
Setting CPUs and memory requirements

As a general rule for memory-intensive assemblers, the memory available to the process (in GB) should be about ten times the number of cores/CPUs available to the job. For example, on a machine with 30 cores and 300 GB of memory available, you may want to configure your jobs to use 15 cores and 150 GB of memory, allowing two jobs to run concurrently on one node while keeping memory in proportion to the number of cores.
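As a sketch of how this rule of thumb might translate into a per-rule entry in cluster.yaml (the rule name and key names are placeholders; follow the structure used in the cluster.yaml shipped with the repository):

```yaml
trinity:
    cores: 15      # half of a 30-core node
    mem: 150       # GB; roughly ten times the core count
```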
Adding new assemblers

In order to add a new assembler to the pipeline, three things need to be included (a sketch of what this might look like follows the list):

- A rule for the assembler, named <assembler>, including a command that can be run using software in the specified conda environment.
- A rule to "clean" the assembly output, named <assembler>_clean, including moving the completed assemblies to the shared assembly folder, specified as assembledDIR, which is a subdirectory of the outputDIR, also specified in the configuration file. Intermediate files should also be moved to a scratch directory or deleted, based on user preferences and space requirements. Any other files needed by other tools or desired by the user should be moved to a subdirectory of the output directory. If they are specified as output files, snakemake will create the needed directories automatically; otherwise, the user will need to manually create directories that do not already exist (specifying them as output files is more extensible).
- A list entry for the assembler in the configuration file that matches the name of <assembler> in each snakemake rule.
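A minimal sketch of such a rule pair is shown below. The assembler name, file paths, wildcard, and command are hypothetical placeholders, and real modules in the repository define fuller inputs, params, and logging; the point is the naming pattern of an <assembler> rule plus an <assembler>_clean rule, with the same name listed under assemblers in the configuration file.

```python
# hypothetical module: modules/myassembler-snake
# OUTPUTDIR and ASSEMBLEDDIR stand in for the configured output and assembled directories

rule myassembler:
    input:
        left = os.path.join(OUTPUTDIR, "trimmed", "{assembly}_1.trimmed.fastq.gz"),
        right = os.path.join(OUTPUTDIR, "trimmed", "{assembly}_2.trimmed.fastq.gz")
    output:
        os.path.join(OUTPUTDIR, "myassembler", "{assembly}_myassembler.fasta")
    shell:
        """
        # placeholder command; replace with the real assembler invocation
        myassembler --left {input.left} --right {input.right} --out {output}
        """

rule myassembler_clean:
    input:
        os.path.join(OUTPUTDIR, "myassembler", "{assembly}_myassembler.fasta")
    output:
        os.path.join(ASSEMBLEDDIR, "{assembly}_myassembler.fasta")
    shell:
        """
        # move the finished assembly into the shared assembled directory and
        # tidy intermediate files into scratch (or delete them)
        cp {input} {output}
        """
```

The corresponding configuration change would then be adding myassembler to the assemblers list in config.yaml.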
Intermediate files and cleanup
A "scratch" directory, specified in the configuration file, is used to store intermediate files after execution of rules such as assembly, which produce many files which are not needed downstream in the pipeline. To override this behavior, specify the output directory and the scratch directory to be the same location.
After the pipeline has been run, simply enter
snakemake hardclean --cores 1
To safely remove the scratch directory, if you don't need the intermediate files generated in individual pipeline steps.
Code Snippets

The fragments below are the shell and run blocks from the workflow's Snakemake rules, grouped by the step they perform.

BUSCO assesses the completeness of each assembly against the eukaryote lineage dataset:

```
shell:
    """
    #busco --list-datasets
    #-l eukaryota_odb10
    #cd ./scripts/busco
    #python3 setup.py install --user
    #cd ../..
    #python ./scripts/busco/src/busco/run_BUSCO.py -i {input.outputassemblies} --auto-lineage-euk -o {output} -c 8 -m transcriptome --long
    busco -f -i {input.outputassemblies} -l ./input/eukaryota_odb10 -o {params.assembly}_busco -c 8 -m transcriptome --long
    mv run_{params.assembly}_busco {output}
    """
```

Three clustering rules share the same CD-HIT-EST command to collapse redundant contigs:

```
shell:
    '''
    cd-hit-est -d 0 -i {input.infiles} -o {output} -T {params.threads} -M {params.maxmemory} -G 0 -c {params.identityparam} -aS {params.mincoverageshorter} -aL {params.mincoveragelong} 2> {log.err} 1> {log.out}
    '''
```

The scratch directory is removed once it is no longer needed:

```
shell:
    '''
    # remove the scratch directory if it exists
    if [ -d {params.scratch} ]
    then
        rm -r {params.scratch}
    fi
    '''
```

Assembly FASTA headers are tagged with the assembler and assembly-group name before merging:

```
shell:
    '''
    sed 's/>/>{params.assemblername}_{params.assembly}_/' {input.assemblyfiles} > {output}
    '''
```

The renamed assemblies are then concatenated into a single merged file (one rule skips the step if the merged file already exists):

```
shell:
    '''
    if [ ! -f {output} ]; then
        cat {params.assemblyfiles} > {output}
    fi
    '''
```

```
shell:
    '''
    cat {params.assemblyfiles} > {output}
    '''
```

QUAST evaluates the assemblies, once labeled by assembler and once per assembly:

```
shell:
    '''
    quast {params.outputassemblies} -o {output} --threads 8 --labels {params.assemblers} 2> {log.err} 1> {log.out}
    '''
```

```
shell:
    '''
    quast {params.outputassemblies} -o {output} --threads 8 --labels {params.assembly} 2> {log.err} 1> {log.out}
    '''
```

The individual QUAST reports are combined into one table, with each column suffixed by the assembly it came from (a second, nearly identical run block does the same for the reports listed under os.path.join(OUTPUTDIR, "quast")):

```
run:
    quastfiles = os.listdir(params.quastdir)
    if len(quastfiles) > 0:
        outputassemblers = pd.DataFrame()
        for r in quastfiles:
            if os.path.isfile(os.path.join(params.quastdir, r, "report.tsv")):
                currentreport = pd.read_csv(os.path.join(params.quastdir, r, "report.tsv"), index_col = 0, sep = "\t")
                for c in range(0, len(currentreport.columns)):
                    currentreport = currentreport.rename(columns = {currentreport.columns[c]: (currentreport.columns[c] + "_" + str(r))})
                outputassemblers = pd.concat([outputassemblers, currentreport], axis = 1)
        outputassemblers.to_csv(path_or_buf = params.outputfile, sep = "\t")
```

Four salmon rules build an index from the assembly (given as {input.fastafile} in two rules and {input.assemblyfile} in the other two) and quantify the reads against it:

```
shell:
    """
    salmon index -t {input.fastafile} -i {params.indexname} -k {params.kval} 2> {log.err} 1> {log.out}
    salmon quant -i {params.indexname} -l {params.libtype} -1 {input.left} -2 {input.right} --validateMappings -o {params.outdir} 2>> {log.err} 1>> {log.out}
    """
```

The per-sample quant.sf outputs are relabeled and merged into a single count table:

```
shell:
    """
    for i in {input.input}; do awk -F, -v OFS=, 'NR==1{{split(FILENAME,a,"/quant.sf");$2= a[1] ""}}1' ${{i}} | awk '{{gsub(".*salmon_quant_assembly_","",$5)}}1' > {output.out}; done
    """
```

```
shell:
    """
    awk '
    {{samples[$1] = samples[$1] OFS $NF}}
    END {{
        print "Name", samples["Name"]
        delete samples["Name"]
        for (name in samples)
            print name, samples[name]
    }}
    ' {input.input} > {output.out}
    """
```

Counts are converted to absolute abundances using the internal genomic DNA spike-in standard:

```
run:
    final = pd.DataFrame()
    spike = pd.read_csv(input.copiestable, sep = ' ')
    counts = pd.read_csv(input.countfile, sep = ' ')
    for i in params.IDs:
        df = (spike.loc[spike['ID'] == (i)])
        # Spike calculation reference: https://www.protocols.io/view/internal-genomic-dna-standard-for-quantitative-met-jxdcpi6?step=12
        calc = (counts[i] * ((float((df['bbmap'])) * (params.spikecopies / (float((df['bbmap-total']))))) / (float((df['bbmap']))))) / float((df['VolumeFiltered']))
        colNames = calc.name
        rowNames = counts.iloc[:,0]
        calc.index = rowNames
        calc.column_name = colNames
        final = final.append(calc)
    final_transpose = final.T
    print(final_transpose)
    final_transpose.to_csv(output.out, sep = ' ', index = True)
```

TransDecoder predicts coding regions from the merged assembly; two rules start from {input.merged} and a third from {input.clustered}:

```
shell:
    """
    unset PERL5LIB
    mkdir -p {params.wd_path}
    cp {input.merged} {params.merged}.fasta
    (cd {params.wd_path} && TransDecoder.LongOrfs -t {params.filename}.fasta -m {params.size}) 2> {log.err} 1> {log.out}
    (cd {params.wd_path} && TransDecoder.Predict -t {params.filename}.fasta -O --no_refine_starts) 2>> {log.err} 1>> {log.out}
    rm {params.merged}.fasta
    """
```

The predicted coding sequences are extracted with bedtools (this block appears in three rules):

```
shell:
    """
    unset PERL5LIB
    bedtools getfasta -fi {input.merged} -bed {params.merged}.fasta.transdecoder.bed -fo {output.cds}
    """
```

Finally, the TransDecoder outputs are moved into place and working files are removed (this block appears in two rules):

```
shell:
    """
    mv {input.pep} {output.pep}
    mv {input.cds} {output.cds}
    mv {input.gff} {output.gff}
    mv {input.bed} {output.bed}
    rm -rf {params.merged}.fasta.transdecoder_dir*
    rm -rf pipeliner.*.cmds
    """
```