A single-cell RNAseq pipeline for 10X genomics data

public 1yr ago Version: 2.3.2 0 bookmarks

Introduction

nf-core/scrnaseq is a bioinformatics best-practice analysis pipeline for processing 10x Genomics single-cell RNA-seq data.

This is a community effort in building a pipeline capable to support:

Alevin-Fry + AlevinQC
STARSolo
Kallisto + BUStools
Cellranger
UniverSC

Documentation

The nf-core/scrnaseq pipeline comes with documentation about the pipeline usage , parameters and output .

scrnaseq workflow

Usage

Note If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv :

sample,fastq_1,fastq_2,expected_cells
pbmc8k,pbmc8k_S1_L007_R1_001.fastq.gz,pbmc8k_S1_L007_R2_001.fastq.gz,"10000"
pbmc8k,pbmc8k_S1_L008_R1_001.fastq.gz,pbmc8k_S1_L008_R2_001.fastq.gz,"10000"

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).

Now, you can run the pipeline using:

nextflow run nf-core/scrnaseq \
 -profile <docker/singularity/.../institute> \
 --input samplesheet.csv \
 --genome_fasta GRCm38.p6.genome.chr19.fa \
 --gtf gencode.vM19.annotation.chr19.gtf \
 --protocol 10XV2 \
 --aligner <alevin/kallisto/star/cellranger/universc> \
 --outdir <OUTDIR>

Warning: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters ; see docs .

For more details and further functionality, please refer to the usage documentation and the parameter documentation .

Decision Tree for users

The nf-core/scrnaseq pipeline features several paths to analyze your single cell data. Future additions will also be done soon, e.g. the addition of multi-ome analysis types. To aid users in analyzing their data, we have added a decision tree to help people decide on what type of analysis they want to run and how to choose appropriate parameters for that.

graph TD
 A[sc RNA] -->|alevin-fry| B(h5ad/seurat/mtx matrices)
 A[sc RNA] -->|CellRanger| B(h5ad/seurat/mtx matrices)
 A[sc RNA] -->|kbpython| B(h5ad/seurat/mtx matrices)
 A[sc RNA] -->|STARsolo| B(h5ad/seurat/mtx matrices)
 A[sc RNA] -->|Universc| B(h5ad/seurat/mtx matrices)

Options for the respective alignment method can be found here to choose between methods.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation .

Credits

nf-core/scrnaseq was originally written by Bailey PJ, Botvinnik O, Marques de Almeida F, Gabernet G, Peltzer A, Sturm G.

We thank the following people for their extensive assistance in the development of this pipeline:

@heylf
@KevinMenden
@FloWuenne
@rob-p

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines .

For further information or help, don't hesitate to get in touch on the Slack #scrnaseq channel (you can join with this invite ).

Citations

If you use nf-core/scrnaseq for your analysis, please cite it using the following doi: 10.5281/zenodo.3568187

The basic benchmarks that were used as motivation for incorporating the three available modular workflows can be found in this publication .

We offer all three paths for the processing of scRNAseq data so it remains up to the user to decide which pipeline workflow is chosen for a particular analysis question.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Code Snippets

"""
#!/usr/bin/env Rscript
require(alevinQC)
alevinFryQCReport(
    mapDir = "${alevin_results}/af_map",
    quantDir = "${alevin_results}/af_quant",
    permitDir= "${alevin_results}/af_quant",
    sampleId = "${prefix}",
    outputFile = "alevin_report_${meta.id}.html",
    outputFormat = "html_document",
    outputDir = "./",
    forceOverwrite = TRUE
)

yaml::write_yaml(
    list(
        '${task.process}'=list(
            'alevinqc' = paste(packageVersion('alevinQC'), collapse='.')
        )
    ),
    "versions.yml"
)
"""

NextFlow alevinQC From line 23 of local/alevinqc.nf

"""
concat_h5ad.py \\
    --input $samplesheet \\
    --out combined_matrix.h5ad \\
    --suffix "_matrix.h5ad"
"""

NextFlow From line 20 of local/concat_h5ad.nf

"""
touch combined_matrix.h5ad
"""

NextFlow From line 28 of local/concat_h5ad.nf

"""
$unzip
cat $name | t2g.py --use_version > transcripts_to_genes.txt
"""

NextFlow From line 30 of local/gene_map.nf

"""
gffread -F $gtf -w "${genome_fasta}.transcriptome.fa" -g $genome_fasta

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gffread: \$(gffread --version 2>&1)
END_VERSIONS
"""

NextFlow gffread From line 22 of local/gffread_transcriptome.nf

"""
filter_gtf_for_genes_in_genome.py \\
    --gtf $gtf \\
    --fasta $fasta \\
    -o ${fasta.baseName}_genes.gtf
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version | sed 's/Python //g')
END_VERSIONS
"""

NextFlow From line 22 of local/gtf_gene_filter.nf

"""
# convert file types
mtx_to_h5ad.py \\
    --aligner ${params.aligner} \\
    --input filtered_feature_bc_matrix.h5 \\
    --sample ${meta.id} \\
    --out ${meta.id}/${meta.id}_matrix.h5ad
"""

NextFlow From line 45 of local/mtx_to_h5ad.nf

"""
# convert file types
for input_type in spliced unspliced ; do
    mtx_to_h5ad.py \\
        --aligner ${params.aligner} \\
        --sample ${meta.id} \\
        --input *count/counts_unfiltered/\${input_type}.mtx \\
        --barcode *count/counts_unfiltered/\${input_type}.barcodes.txt \\
        --feature *count/counts_unfiltered/\${input_type}.genes.txt \\
        --txp2gene ${txp2gene} \\
        --star_index ${star_index} \\
        --out ${meta.id}/${meta.id}_\${input_type}_matrix.h5ad ;
done
"""

NextFlow From line 55 of local/mtx_to_h5ad.nf

"""
# convert file types
mtx_to_h5ad.py \\
    --task_process ${task.process} \\
    --aligner ${params.aligner} \\
    --sample ${meta.id} \\
    --input $mtx_matrix \\
    --barcode $barcodes_tsv \\
    --feature $features_tsv \\
    --txp2gene ${txp2gene} \\
    --star_index ${star_index} \\
    --out ${meta.id}/${meta.id}_matrix.h5ad
"""

NextFlow From line 71 of local/mtx_to_h5ad.nf

"""
mkdir ${meta.id}
touch ${meta.id}/${meta.id}_matrix.h5ad
touch versions.yml
"""

NextFlow From line 86 of local/mtx_to_h5ad.nf

"""
mkdir ${meta.id}
"""

NextFlow From line 39 of local/mtx_to_seurat.nf

"""
# convert file types
for input_type in spliced unspliced ; do
    mtx_to_seurat.R \\
        *count/counts_unfiltered/\${input_type}.mtx \\
        *count/counts_unfiltered/\${input_type}.barcodes.txt \\
        *count/counts_unfiltered/\${input_type}.genes.txt \\
        ${meta.id}/${meta.id}_\${input_type}_matrix.rds \\
        ${aligner}
done
"""

NextFlow From line 44 of local/mtx_to_seurat.nf

"""
mtx_to_seurat.R \\
    $matrix \\
    $barcodes \\
    $features \\
    ${meta.id}/${meta.id}_matrix.rds \\
    ${aligner}
"""

NextFlow From line 57 of local/mtx_to_seurat.nf

"""
mkdir ${meta.id}
touch ${meta.id}/${meta.id}_matrix.rds
touch versions.yml
"""

NextFlow From line 67 of local/mtx_to_seurat.nf

"""
check_samplesheet.py \\
    $samplesheet \\
    samplesheet.valid.csv

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    python: \$(python --version | sed 's/Python //g')
END_VERSIONS
"""

NextFlow From line 21 of local/samplesheet_check.nf

"""
# export required var
export ALEVIN_FRY_HOME=.

# prep simpleaf
simpleaf set-paths

# run simpleaf index
simpleaf \\
    index \\
    --threads $task.cpus \\
    --fasta $genome_fasta \\
    $seq_inputs \\
    $args \\
    -o salmon

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    simpleaf: \$(simpleaf -V | tr -d '\\n' | cut -d ' ' -f 2)
    salmon: \$(salmon --version | sed -e "s/salmon //g")
END_VERSIONS
"""

NextFlow Salmon simpleaf From line 27 of local/simpleaf_index.nf

"""
# export required var
export ALEVIN_FRY_HOME=.

# prep simpleaf
simpleaf set-paths

# run simpleaf quant
gzip -dcf $whitelist > whitelist.uncompressed.txt
simpleaf quant \\
    -1 ${forward.join( "," )} \\
    -2 ${reverse.join( "," )} \\
    -i ${index} \\
    -o ${prefix}_alevin_results \\
    -m $txp2gene \\
    -t $task.cpus \\
    -c $protocol \\
    $expect_cells \\
    $unfiltered_command \\
    $args

$save_whitelist
[[ ! -f ${prefix}_alevin_results/af_quant/all_freq.bin ]] && cp ${prefix}_alevin_results/af_quant/permit_freq.bin ${prefix}_alevin_results/af_quant/all_freq.bin

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    simpleaf: \$(simpleaf -V | tr -d '\\n' | cut -d ' ' -f 2)
    salmon: \$(salmon --version | sed -e "s/salmon //g")
END_VERSIONS
"""

NextFlow Quant Salmon simpleaf From line 50 of local/simpleaf_quant.nf

"""
STAR \\
    --genomeDir $index \\
    --readFilesIn ${reverse.join( "," )} ${forward.join( "," )} \\
    --runThreadN $task.cpus \\
    --outFileNamePrefix $prefix. \\
    --soloCBwhitelist <(gzip -cdf $whitelist) \\
    --soloType $protocol \\
    --soloFeatures $star_feature \\
    $other_10x_parameters \\
    $out_sam_type \\
    $ignore_gtf \\
    $seq_center \\
    $cell_filter \\
    $args \\

$mv_unsorted_bam

if [ -f ${prefix}.Unmapped.out.mate1 ]; then
    mv ${prefix}.Unmapped.out.mate1 ${prefix}.unmapped_1.fastq
    gzip ${prefix}.unmapped_1.fastq
fi
if [ -f ${prefix}.Unmapped.out.mate2 ]; then
    mv ${prefix}.Unmapped.out.mate2 ${prefix}.unmapped_2.fastq
    gzip ${prefix}.unmapped_2.fastq
fi

if [ -d ${prefix}.Solo.out ]; then
    # Backslashes still need to be escaped (https://github.com/nextflow-io/nextflow/issues/67)
    find ${prefix}.Solo.out \\( -name "*.tsv" -o -name "*.mtx" \\) -exec gzip {} \\;
fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    star: \$(STAR --version | sed -e "s/STAR_//g")
END_VERSIONS
"""

NextFlow STAR From line 54 of local/star_align.nf

"""
mkdir -p "${prefix}/outs/"
touch ${prefix}/outs/fake_file.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cellranger: \$(echo \$( cellranger --version 2>&1) | sed 's/^.*[^0-9]\\([0-9]*\\.[0-9]*\\.[0-9]*\\).*\$/\\1/' )
END_VERSIONS
"""

NextFlow From line 33 of count/main.nf

"""
cellranger \\
    mkgtf \\
    $gtf \\
    ${prefix}.gtf \\
    $args

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cellranger: \$(echo \$( cellranger --version 2>&1) | sed 's/^.*[^0-9]\\([0-9]*\\.[0-9]*\\.[0-9]*\\).*\$/\\1/' )
END_VERSIONS
"""

NextFlow From line 24 of mkgtf/main.nf

"""
cellranger \\
    mkref \\
    --genome=$reference_name \\
    --fasta=$fasta \\
    --genes=$gtf \\
    $args

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cellranger: \$(echo \$( cellranger --version 2>&1) | sed 's/^.*[^0-9]\\([0-9]*\\.[0-9]*\\.[0-9]*\\).*\$/\\1/' )
END_VERSIONS
"""

NextFlow From line 25 of mkref/main.nf

"""
printf "%s %s\\n" $rename_to | while read old_name new_name; do
    [ -f "\${new_name}" ] || ln -s \$old_name \$new_name
done

fastqc \\
    $args \\
    --threads $task.cpus \\
    $renamed_files

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
END_VERSIONS
"""

NextFlow FastQC From line 28 of fastqc/main.nf

"""
touch ${prefix}.html
touch ${prefix}.zip

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
END_VERSIONS
"""

NextFlow FastQC From line 46 of fastqc/main.nf

"""
gffread \\
    $gff \\
    $args \\
    -o ${prefix}.gtf
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gffread: \$(gffread --version 2>&1)
END_VERSIONS
"""

NextFlow gffread From line 23 of gffread/main.nf

"""
# Not calling gunzip itself because it creates files
# with the original group ownership rather than the
# default one for that user / the work directory
gzip \\
    -cd \\
    $args \\
    $archive \\
    > $gunzip

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gunzip: \$(echo \$(gunzip --version 2>&1) | sed 's/^.*(gzip) //; s/ Copyright.*\$//')
END_VERSIONS
"""

NextFlow From line 23 of gunzip/main.nf

"""
touch $gunzip
cat <<-END_VERSIONS > versions.yml
"${task.process}":
    gunzip: \$(echo \$(gunzip --version 2>&1) | sed 's/^.*(gzip) //; s/ Copyright.*\$//')
END_VERSIONS
"""

NextFlow From line 41 of gunzip/main.nf

"""
kb \\
    count \\
    -t $task.cpus \\
    -i $index \\
    -g $t2g \\
    $cdna \\
    $introns \\
    -x $technology \\
    $args \\
    -o ${prefix}.count \\
    ${reads.join( " " )} \\
    -m ${memory}G

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    kallistobustools: \$(echo \$(kb --version 2>&1) | sed 's/^.*kb_python //;s/positional arguments.*\$//')
END_VERSIONS
"""

NextFlow kallisto kallistobustools From line 32 of count/main.nf

"""
kb \\
    ref \\
    -i kb_ref_out.idx \\
    -g t2g.txt \\
    -f1 cdna.fa \\
    --workflow $workflow_mode \\
    $fasta \\
    $gtf

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    kallistobustools: \$(echo \$(kb --version 2>&1) | sed 's/^.*kb_python //;s/positional arguments.*\$//')
END_VERSIONS
"""

NextFlow kallisto kallistobustools From line 30 of ref/main.nf

"""
kb \\
    ref \\
    -i kb_ref_out.idx \\
    -g t2g.txt \\
    -f1 cdna.fa \\
    -f2 intron.fa \\
    -c1 cdna_t2c.txt \\
    -c2 intron_t2c.txt \\
    --workflow $workflow_mode \\
    $fasta \\
    $gtf

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    kallistobustools: \$(echo \$(kb --version 2>&1) | sed 's/^.*kb_python //;s/positional arguments.*\$//')
END_VERSIONS
"""

NextFlow kallisto kallistobustools From line 46 of ref/main.nf

"""
multiqc \\
    --force \\
    $args \\
    $config \\
    $extra_config \\
    .

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""

NextFlow MultiQC From line 28 of multiqc/main.nf

"""
touch multiqc_data
touch multiqc_plots
touch multiqc_report.html

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    multiqc: \$( multiqc --version | sed -e "s/multiqc, version //g" )
END_VERSIONS
"""

NextFlow MultiQC From line 43 of multiqc/main.nf

"""
mkdir star
STAR \\
    --runMode genomeGenerate \\
    --genomeDir star/ \\
    --genomeFastaFiles $fasta \\
    --sjdbGTFfile $gtf \\
    --runThreadN $task.cpus \\
    $memory \\
    $args

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    star: \$(STAR --version | sed -e "s/STAR_//g")
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    gawk: \$(echo \$(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*\$//')
END_VERSIONS
"""

NextFlow STAR From line 26 of genomegenerate/main.nf

"""
samtools faidx $fasta
NUM_BASES=`gawk '{sum = sum + \$2}END{if ((log(sum)/log(2))/2 - 1 > 14) {printf "%.0f", 14} else {printf "%.0f", (log(sum)/log(2))/2 - 1}}' ${fasta}.fai`

mkdir star
STAR \\
    --runMode genomeGenerate \\
    --genomeDir star/ \\
    --genomeFastaFiles $fasta \\
    --sjdbGTFfile $gtf \\
    --runThreadN $task.cpus \\
    --genomeSAindexNbases \$NUM_BASES \\
    $memory \\
    $args

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    star: \$(STAR --version | sed -e "s/STAR_//g")
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    gawk: \$(echo \$(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*\$//')
END_VERSIONS
"""

NextFlow SAMtools STAR From line 45 of genomegenerate/main.nf

"""
mkdir star
touch star/Genome
touch star/Log.out
touch star/SA
touch star/SAindex
touch star/chrLength.txt
touch star/chrName.txt
touch star/chrNameLength.txt
touch star/chrStart.txt
touch star/exonGeTrInfo.tab
touch star/exonInfo.tab
touch star/geneInfo.tab
touch star/genomeParameters.txt
touch star/sjdbInfo.txt
touch star/sjdbList.fromGTF.out.tab
touch star/sjdbList.out.tab
touch star/transcriptInfo.tab

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    star: \$(STAR --version | sed -e "s/STAR_//g")
    samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    gawk: \$(echo \$(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*\$//')
END_VERSIONS
"""

NextFlow From line 70 of genomegenerate/main.nf

"""
universc \\
    --id 'sample-${meta.id}' \\
    ${input_reads} \\
    --technology '${meta.technology}' \\
    --chemistry '${meta.chemistry}' \\
    --reference ${reference_name} \\
    --description ${sample_arg} \\
    --jobmode "local" \\
    --localcores ${task.cpus} \\
    --localmem ${task.memory.toGiga()} \\
    --per-cell-data \\
    $args 1> _log 2> _err

# save log files
echo !! > sample-${meta.id}/outs/_invocation
cp _log sample-${meta.id}/outs/_log
cp _err sample-${meta.id}/outs/_err

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cellranger:  \$(echo \$(cellranger count --version 2>&1 | head -n 2 | tail -n 1 | sed 's/^.* //g' | sed 's/(//g' | sed 's/)//g' ))
    universc:  \$(echo \$(bash /universc/launch_universc.sh --version | grep version | grep universc  | sed 's/^.* //g' ))
END_VERSIONS
"""

NextFlow UniverSC From line 38 of universc/main.nf

"""
mkdir -p "sample-${meta.id}/outs/"
touch sample-${meta.id}/outs/fake_file.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    cellranger:  \$(echo \$(cellranger count --version 2>&1 | head -n 2 | tail -n 1 | sed 's/^.* //g' | sed 's/(//g' | sed 's/)//g' ))
    universc:  \$(echo \$(bash /universc/launch_universc.sh --version | grep version | grep universc | sed 's/^.* //g' ))
END_VERSIONS
"""