ENA SARS-CoV-2 sequence analysis workflow
This is the official repository of the SARS-CoV-2 variant surveillance pipeline developed by the Technical University of Denmark (DTU), Eötvös Loránd University (ELTE), EMBL-EBI and Erasmus Medical Center (EMC) under the Versatile Emerging infectious disease Observatory (VEO) project. The project consists of 20 European partners and is funded by the European Commission.
The pipeline has been integrated into the EMBL-EBI infrastructure to automatically process raw SARS-CoV-2 read data, with the results presented in the COVID-19 Data Portal: https://www.covid19dataportal.org/sequences?db=sra-analysis-covid19&size=15&crossReferencesOption=all#search-content.
Architecture
The pipeline supports sequence reads from both the Illumina and Nanopore platforms. It is designed to be highly portable, running on both Google Cloud Platform and High Performance Computing clusters with IBM Spectrum LSF. We have performed secondary and tertiary analysis on millions of public samples, and the pipeline has shown good performance in large-scale production.
The pipeline takes raw read data (SRA) from ENA's public FTP and submits analysis objects back to ENA on the fly. Intermediate results and logs are stored in cloud storage buckets or on a high-performance local POSIX file system. Metadata is stored in Google BigQuery for status tracking and analysis. The runtime is created with Docker/Singularity containers and Nextflow.
Process to run the pipelines
The pipeline requires Nextflow Tower for application-level monitoring. A free test account can be created for evaluation purposes at https://tower.nf/.
Preparation
- Store `export TOWER_ACCESS_TOKEN='...'` in `$HOME/.bash_profile`. Restart the current session or source the updated `$HOME/.bash_profile`.
- Run `git clone https://github.com/enasequence/covid-sequence-analysis-workflow`.
- Create `./covid-sequence-analysis-workflow/data/projects_accounts.csv` with `submission_account_id` and `submission_password`, for example:

  ```
  project_id,center_name,meta_key,submission_account_id,submission_password,ftp_password
  PRJEB45555,"European Bioinformatics Institute",public,,,
  ```
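Taken together, the preparation steps can be scripted in one pass. This is a minimal sketch: the token value is a placeholder, the `git clone` is commented out to keep the sketch offline-safe, and the `mkdir -p` only exists so the CSV step works even if the repository has not been cloned yet:

```shell
# Persist the Tower token for future sessions (placeholder value).
echo "export TOWER_ACCESS_TOKEN='replace-with-real-token'" >> "$HOME/.bash_profile"

# Fetch the workflow (uncomment to actually clone).
# git clone https://github.com/enasequence/covid-sequence-analysis-workflow

# Write the project accounts file; submission credentials are left blank here.
mkdir -p covid-sequence-analysis-workflow/data
cat > covid-sequence-analysis-workflow/data/projects_accounts.csv <<'EOF'
project_id,center_name,meta_key,submission_account_id,submission_password,ftp_password
PRJEB45555,"European Bioinformatics Institute",public,,,
EOF
```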
Running pipelines
- Run `./covid-sequence-analysis-workflow/init.sra_index.sh` to initialize or reinitialize the metadata in BigQuery.
- Run `./covid-sequence-analysis-workflow/start.lsf.jobs.sh` with proper parameters to start the batch jobs on LSF, or `./covid-sequence-analysis-workflow/start.gls.jobs.sh` with proper parameters to start the batch jobs on GCP.
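As a sketch, selecting the launcher per platform might look like the following. The `PLATFORM` variable and the dry-run `echo` are illustrative only; the actual parameters each script expects are site-specific and documented in the scripts themselves:

```shell
#!/bin/sh
# Choose the launcher for the target platform (placeholder selection logic).
PLATFORM="lsf"   # or "gls" for Google Cloud Platform
PIPELINE_DIR=./covid-sequence-analysis-workflow

case "$PLATFORM" in
  lsf) LAUNCHER="$PIPELINE_DIR/start.lsf.jobs.sh" ;;
  gls) LAUNCHER="$PIPELINE_DIR/start.gls.jobs.sh" ;;
  *)   echo "unknown platform: $PLATFORM" >&2; exit 1 ;;
esac

# Dry run: print instead of executing, since the parameters are site-specific.
echo "would run: $LAUNCHER <parameters>"
```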
Error handling
If a job is killed or dies, run the following to update the metadata so that samples that completed successfully are not reprocessed.

- Run `./covid-sequence-analysis-workflow/update.receipt.sh <batch_id>` to collect the submission receipts and update the submission metadata. The script can be run at any time; it must be run if a batch job was killed rather than completed for any reason.
- Run `./covid-sequence-analysis-workflow/set.archived.sh` to update the stats for submitted analyses. The script can be run at any time; it must be run at least once before ending a snapshot to make sure that the stats are up to date.

To reprocess failed samples, delete the corresponding records in the `sra_processing` table.
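Since the `sra_processing` records live in BigQuery, clearing a failed sample for reprocessing amounts to a `DELETE` query. In this sketch the dataset name `sarscov2` and the run accession are placeholders, and the commented-out `bq` invocation assumes an authenticated gcloud setup:

```shell
# Placeholder run accession for the failed sample to reprocess.
RUN_ACC="ERR0000000"
# Dataset name "sarscov2" is a placeholder; use your BigQuery dataset.
QUERY="DELETE FROM sarscov2.sra_processing WHERE run_accession = '$RUN_ACC'"

# Execute against BigQuery (requires gcloud auth and the real dataset name):
# bq query --use_legacy_sql=false "$QUERY"
echo "$QUERY"
```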
Code Snippets
The following commands are excerpted from the Nextflow processes; variables such as `${task.cpus}`, `${run_id}` and `${reads}` are resolved by Nextflow at runtime.

```bash
# Initial QC on the raw reads
fastqc -t ${task.cpus} -q ${reads[0]} ${reads[1]}
```

```bash
# Adapter/quality trimming of paired-end reads
trimmomatic PE ${reads} ${run_id}_trim_1.fq \
    ${run_id}_trim_1_un.fq ${run_id}_trim_2.fq ${run_id}_trim_2_un.fq \
    -summary ${run_id}_trim_summary -threads ${task.cpus} \
    SLIDINGWINDOW:5:30 MINLEN:50
```

```bash
# QC on the trimmed reads
fastqc -t ${task.cpus} -q ${trimmed_reads}
```

```bash
# Remove human reads: keep only reads unmapped against the human index (-f 4)
bowtie2 --very-sensitive-local -p ${task.cpus} \
    -x $index_base --met-file ${run_id}_bowtie_human_summary \
    -1 ${trimmed_reads[0]} -2 ${trimmed_reads[2]} \
    -U ${trimmed_reads[1]},${trimmed_reads[3]} | \
    samtools view -Sb -f 4 > ${run_id}_nohuman.bam
```

```bash
# Convert the human-depleted BAM back to FASTQ
samtools bam2fq -1 ${run_id}_nohuman_1.fq -2 ${run_id}_nohuman_2.fq \
    -s ${run_id}_nohuman_s.fq ${bam} > ${run_id}_nohuman_3.fq
```

```bash
# Align to the SARS-CoV-2 reference, keep mapped reads (-F 4), sort and index
bowtie2 -p ${task.cpus} --no-mixed --no-discordant \
    --met-file ${run_id}_bowtie_nohuman_summary -x $index_base \
    -1 ${fastq[0]} -2 ${fastq[1]} | samtools view -bST ${sars2_fasta} | \
    samtools sort | samtools view -h -F 4 -b > ${run_id}.bam
samtools index ${run_id}.bam
```

```bash
# Remove PCR duplicates
picard MarkDuplicates I=${bam} O=${run_id}_dep.bam REMOVE_DUPLICATES=true \
    M=${run_id}_marked_dup_metrics.txt
```

```bash
# Generate a pileup (minimum base quality 30, depth cap 1000000)
samtools mpileup -A -Q 30 -d 1000000 -f ${sars2_fasta} ${bam} > \
    ${run_id}.pileup
```

```bash
# Extract per-position coverage from the pileup
cat ${pileup} | awk '{print \$2,","\$3,","\$4}' > ${run_id}.coverage
```

```bash
# Call variants with LoFreq, then compress, index and summarize the VCF
samtools index ${bam}
lofreq call-parallel --pp-threads ${task.cpus} -f ${sars2_fasta} \
    -o ${run_id}.vcf ${bam}
bgzip ${run_id}.vcf
tabix ${run_id}.vcf.gz
bcftools stats ${run_id}.vcf.gz > ${run_id}.stat
```

```bash
# Build a consensus sequence from high-confidence variants (DP>50, AF>0.5)
bcftools filter -i "DP>50" ${vcf} -o ${run_id}.cfiltered.vcf
bgzip ${run_id}.cfiltered.vcf
tabix ${run_id}.cfiltered.vcf.gz
bcftools filter -i "AF>0.5" ${run_id}.cfiltered.vcf.gz > \
    ${run_id}.cfiltered_freq.vcf
bgzip -c ${run_id}.cfiltered_freq.vcf > ${run_id}.cfiltered_freq.vcf.gz
bcftools index ${run_id}.cfiltered_freq.vcf.gz
bcftools consensus -f ${sars2_fasta} ${run_id}.cfiltered_freq.vcf.gz > \
    ${run_id}.cons.fa
sed -i "1s/.*/>${run_id}/" ${run_id}.cons.fa
rm ${run_id}.cfiltered.vcf.gz
rm ${run_id}.cfiltered.vcf.gz.tbi
rm ${run_id}.cfiltered_freq.vcf
rm ${run_id}.cfiltered_freq.vcf.gz.csi
rm ${run_id}.cfiltered_freq.vcf.gz
```

```bash
# Filter variants for annotation (DP>50, AF>0.1)
bcftools filter -i "DP>50" ${vcf} -o ${run_id}.filtered.vcf
bgzip ${run_id}.filtered.vcf
tabix ${run_id}.filtered.vcf.gz
bcftools filter -i "AF>0.1" ${run_id}.filtered.vcf.gz > \
    ${run_id}.filtered_freq.vcf
```

```bash
# Annotate variants with SnpEff (chromosome renamed to match the SnpEff DB)
cat ${vcf} | sed "s/^NC_045512.2/NC_045512/" > \
    ${run_id}.newchr.filtered_freq.vcf
java -Xmx4g -jar /data/tools/snpEff/snpEff.jar -v \
    -s ${run_id}.snpEff_summary.html sars.cov.2 \
    ${run_id}.newchr.filtered_freq.vcf > ${run_id}.annot.n.filtered_freq.vcf
```
Support
- Future updates
Related Workflows





