Prelimiary CpG Methylation Data

public 1yr ago 0 bookmarks

View Workflow

preliminary-cpg-data-analysis — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Introduction

This repository contains the scripts for processing and analyzing some preliminary data for my work on plant aging, first by following a workflow to extract CpG methylation data from PacBio HiFi subreads, and then by analyzing this data with the methylKit R package.

Conda Environment Creation

Environment for PacBio CpG workflow without snakemake

conda env create -n my.bio3 pbccs pbjasmine pbmm2
#or using the my.bio3.yml file in this repo
conda env create -n my.bio3 -f my.bio3.yml

Environment for PacBio CpG workflow with snakemake

conda env create -n snakebio3 pbccs pbjasmine pbmm2 snakemake
#or using the snakebio3.yml file in this repo
conda env create -n snakebio3 -f snakebio2.yml

Environment for analysis with methylKit

#works best if you use the rmethyl.yml file in this repo rather than listing packages on command line
conda env create -n rmethyl -f rmethyl.yml

PacBio subreads to CpG data workflow

This workflow takes PacBio subreads as input and produces a .bed file and a .bw file that give the probability of methylation of each CpG site across the genome. The .bw file can be viewed in IGV or another genome browser. The .bed file will be used for further analysis in this project. To run a sample through this workflow, the scripts in this repo must be run in the following order with commandline input:

ccs2.sh SUBREADSFILE SAMPLENAME
jasmine.sh HIFIFILE SAMPLENAME
align.sh 5MCFILE SAMPLENAME
cpgtools.sh ALIGNEDFILE SAMPLENAME

The merge.sh script was used in between steps 2 and 3 for some samples that had both subreads and already existing 5mc.hifi.bam files.

The output .bed file will have the following tab-delimited columns of data:

reference name
start coordinate
end coordinate
modification probability
haplotype
coverage
estimated modified site count (extrapolated from model modification probability)
estimated unmodified site count (extrapolated from model modification probability)
discretized modification probability (calculated from estimated mod/unmod site counts)

methylKit workflow

Because methylKit is designed to be used with methylation data from bisulfite sequencing, it takes input data in a couple of formats specific to that method of data generation. One of these input formats is a tab-delimited file with the following columns of data:

unique.id
chromosome name
site position
strand
coverage
percent methylated
percent unmethylated

As a result, the .bed file(s) from the previous workflow needs to be modified to fit this format. The methylformat.R and methylformat.sh scripts perform this reformatting. The usage is as follows:

methylformat.sh SAMPLENAME

This script will activate the rmethyl conda environment and run the R script with the sample name you indicate on the command line. The output will be a .tsv file with the correct columns needed for use with methylKit. Strand is indicated as 'F' for every entry, which is consistent with the output format of the PacBio Jasmine program.

Code Snippets

shell: """
  pbmm2 align /home/asillers/Genome/farr1.mmi {input} {output} --sort
"""

SnakeMake pbmm2 From line 15 of main/Snakefile

shell: """
  pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores --bam {input} --output-prefix {wildcards.sample}.cpg --model pb-CpG-tools-v2.3.1-x86_64-unknown-linux-gnu/models/pileup_calling_model.v1.tflite --threads 12
"""