Snakemake workflow for the project Exploring the Impact of Parity and its Interaction with History of Preterm Delivery on Gestational Duration

public 1yr ago 0 bookmarks

View Workflow

parity_gestational_duration_interaction — View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

The script in this folder present the Snakmake workflow for the project Exploring the Impact of Parity and its Interaction with History of Preterm Delivery on Gestational Duration.

##Contact

Karin Ytterberg (@ykarin97)
karin.ytterberg@gu.se

Code Snippets

library(dplyr)
library(data.table)


#### Loading data ####
dat = fread(snakemake@input[[1]]) # Swedish Medical Birth Register
p_id = fread(snakemake@input[[2]]) # Multi-Generation Register
edu = fread(snakemake@input[[3]]) # Education Register



#### Creating variables ####

## PTB < 37 weeks
dat = dat %>% mutate(PTB = ifelse(GRDBS< (37*7),1,0))


## Sex
# male as ref
# 1=Pojke (boy); 2=Flicka (girl)
dat = dat %>% mutate(KON = ifelse(KON == 1, "M","F"))
dat$KON = factor(dat$KON, level = c("M","F"))


## Maternal age categorized:
# < 20
# 20-29
# 30-39
# >= 40

# Maternal age categorization
dat = dat %>% mutate(MALDER_c = ifelse(MALDER<20,1,0),
                     MALDER_c = ifelse(MALDER>=20 & MALDER<=29,0,MALDER_c),
                     MALDER_c = ifelse(MALDER>=30 & MALDER<=39,2,MALDER_c),
                     MALDER_c = ifelse(MALDER>=40,3,MALDER_c))
dat$MALDER_c = as.factor(dat$MALDER_c)


## Finding father to each child in  the multi-generation register and the fathers age
f_id = p_id %>% select(LopnrBarn, LopnrFar, FoddArBioFar) # selecting columns of interest
colnames(f_id) = c("lpnr_BARN","lpnr_far", "ar_far")
dat = left_join(dat,f_id,by="lpnr_BARN") # adding fathers info from parents.csv to mfr data
rm(f_id)
dat = dat %>% mutate(FALDER = AR-ar_far) # calculating how old the father was when their child was born


## Nationality  
# As maternal citizenship and the mothers birth country
dat = dat %>% mutate(swe_citizenship =as.numeric(MNAT %in% c('SVERIGE'))) %>% mutate(mor_birth_country_NORDIC = as.numeric(MFODLAND %in% c('SVERIGE','NORGE','FINLAND','ISLAND','DANMARK')))


## First child (parity 0) born preterm
dat = dat %>% group_by(lpnr_mor) %>% arrange(parity_clean) %>%
  mutate(PTB_first_born = any(row_number() == 1 & PTB ==1)*1) %>% 
  mutate(PTB_first_born = ifelse(dplyr::first(parity_clean)!=1,NA,PTB_first_born)) #parity_clean == 1 is parity 0


## Previous preterm delivery
dat = dat %>% group_by(lpnr_mor) %>% arrange(parity_clean )%>% mutate(prev_PTD = ifelse(dplyr::lag(PTB)==1,1,0))
dat = dat %>% group_by(lpnr_mor) %>% arrange(parity_clean) %>%  mutate(diff_p = parity_clean - dplyr::lag(parity_clean)) %>% mutate(prev_PTD = ifelse(diff_p == 1,prev_PTD,NA))


## Mother was born preterm herself
barn = dat %>% pull(lpnr_BARN)
mor_also_barn_in_mfr =  dat[dat$lpnr_mor %in% barn,]
mor_also_barn_in_mfr = mor_also_barn_in_mfr %>% select(lpnr_mor)
mor_also_barn_in_mfr = unique(mor_also_barn_in_mfr)
mor_as_barn = inner_join(dat,mor_also_barn_in_mfr, by = c("lpnr_BARN" ="lpnr_mor")) # The pregnancies in which the mothers where born

mor_as_barn = mor_as_barn %>% mutate(mother_herself_PTB = ifelse(PTB == 1, 1,0)) %>%
  select(lpnr_BARN,mother_herself_PTB) #lpnr_BARN here are barn that also are mothers in mfr

dat = full_join(dat, mor_as_barn, by = c("lpnr_mor" ="lpnr_BARN"))
dat = select(dat, -lpnr_mor.y)


## Diabetes
dat = dat %>% mutate(diab1 = ifelse(DIABETES != 1 | is.na(DIABETES), 0,1))  #diabetes according to mfr variable Diabetes

test = dat[grepl("O24|E10|E11|E12|E13|E14|648A|250A|250B|250C|250D|250E|250F|250G|250H|250X|25000| 25001| 25002| 25003| 25004| 25005| 25006| 25007| 25008| 2500",paste(dat$MDIAG1,dat$MDIAG2,dat$MDIAG3,dat$MDIAG4,dat$MDIAG5,dat$MDIAG6,dat$MDIAG7,dat$MDIAG8,dat$MDIAG9,dat$MDIAG10,dat$MDIAG11,dat$MDIAG12)),] #ICD codes (ICD-10-SE,ICD9-SE,ICD-8) related to diabetes, extracted from maternal icd diagnosis in mfr
mor_with_diabetes = test %>% pull(sq) # rows of mothers that have diabetes according to icd codes

dat = dat %>% mutate(diab2 = ifelse(sq %in% mor_with_diabetes,1,0))

dat = dat %>% mutate(diab = ifelse(diab1==1 | diab2 ==1,1,0)) # mother will have diabetes based on icd codes and the mfr variable Diabetes
dat = select(dat, -diab1,-diab2) 


## BMI
print("s1")
dat1 = dat %>% mutate(BMI = MVIKT / (MLANGD/100)^2) # BMI
#source("/home/karin/Parity_Project1/scripts/functions/1_cleaning_modules.R")
source(snakemake@params[[1]]) #fun_mBmiQC modified to not remove "bad" BMI, just set them as NA.
print("s2")
year_matrix = NULL
dat2 = fun_mBmiQC(as.data.frame(dat1)) # setting bad BMI to NA
print("s3")
dat = dat2
rm(dat1,dat2)


## Smoking 
dat = dat %>% mutate(smoking = ifelse((ROK1 ==1 | is.na(ROK1)) & (ROK0 == 1 | is.na(ROK0)) ,0,1),
                     smoking = ifelse(ROK2 == 1 |is.na(ROK2),smoking,2)) # 0 = Not smoking, 1 =  Smoking 3 months prior to the current pregnancy or/and Smoking at admission to maternal health, 2 = Smoking in pregnancy week 30-32  


## Preeclampsia
test = dat[grepl("O14|O11|O15|642E|642F|642H|63703 |63704 | 63709| 63710|6612",paste(dat$MDIAG1,dat$MDIAG2,dat$MDIAG3,dat$MDIAG4,dat$MDIAG5,dat$MDIAG6,dat$MDIAG7,dat$MDIAG8,dat$MDIAG9,dat$MDIAG10,dat$MDIAG11,dat$MDIAG12)),] #ICD codes (ICD-10-SE,ICD9-SE,ICD-8) related to preeclampsia, extracted from maternal icd diagnosis in mfr
mor_with_preeclampsia = test %>% pull(sq) # rows of mothers that have preeclampsia according to icd codes

dat = dat %>% mutate(preeclamspia = ifelse(sq %in% mor_with_preeclampsia,1,0))


## Education 
# find the maximum edu + filtering
edu = edu %>% group_by(LopNr) %>% filter(n()==1) # can not tell which of the rows are the ture one when ID for the same person exist in several rows, are removed
edu = as.data.frame(edu)
edu_grades = edu[grep("SUN2000", names(edu))] # education based on SUN2000
edu_grades[, "max"] <- apply(edu_grades, 1, max, na.rm=TRUE) # Finding highest education for each person
edu = cbind(edu, edu_grades[,"max"])
names(edu)[names(edu) == 'edu_grades[, "max"]'] = "max_grade"

# Remove reused LopNr based on AterPnr
edu_rm = edu[grep("Ater", names(edu))]
edu_rm = edu_rm %>% mutate(remove = ifelse(rowSums(edu_rm == 1,na.rm = TRUE) > 0, F, T))
edu = edu[edu_rm$remove,]

# Remove reused LopNr based on SenPnr
edu_rm = edu[grep("Sen", names(edu))]
#nr = ncol(edu_rm)
edu_rm = edu_rm %>% mutate(remove = ifelse(rowSums(edu_rm == 0,na.rm = TRUE) >0 , F, T))
edu = edu[edu_rm$remove,]
#nrow(edu) == 5828310

#Join with mfr
edu_max = edu[grep("LopNr|max_grade", names(edu))]
d_mor = left_join(dat, edu_max, by = c("lpnr_mor" = "LopNr") )
names(d_mor)[names(d_mor) == 'max_grade'] = "max_grade_mor"

d_mor_far = left_join(d_mor, edu_max, by = c("lpnr_far" = "LopNr") )
names(d_mor_far)[names(d_mor_far) == 'max_grade'] = "max_grade_far"

d_mor_far_child = left_join(d_mor_far, edu_max, by = c("lpnr_BARN" = "LopNr") )
names(d_mor_far_child)[names(d_mor_far_child) == 'max_grade'] = "max_grade_child"

dat = d_mor_far_child
rm(edu_grades,edu_max,d_mor,d_mor_far,d_mor_far_child)

# Max_grade in categories
dat = dat %>% mutate(max_grade_mor_c  = ifelse(max_grade_mor==2 | max_grade_mor==1,1,0),                   # 9 years or less
                     max_grade_mor_c = ifelse(max_grade_mor==3 | max_grade_mor ==4,2,max_grade_mor_c),     # Gymnasial utbilding (additional 2-3 years)      
                     max_grade_mor_c = ifelse(max_grade_mor >=5,3,max_grade_mor_c)) # 0 is nas             # Eftergymnasial utbildning (shorter than 3 years, 3 years or longer, postgraduate education)

dat = dat %>% mutate(max_grade_far_c  = ifelse(max_grade_far==2 | max_grade_far==1,1,0),
                     max_grade_far_c = ifelse(max_grade_far==3 | max_grade_far ==4,2,max_grade_far_c),
                     max_grade_far_c = ifelse(max_grade_far >=5,3,max_grade_far_c)) # 0 is nas


## Parity, grouping after parity 4
dat = dat %>% mutate(Parity_logreg = ifelse(as.numeric(parity_clean)<5,parity_clean,4))


#### Saving ####
fwrite(dat, snakemake@output[[1]], sep=",")

R dplyr data.table preeclampsia From line 1 of scripts/create_variables.R

source(snakemake@params[[1]]) # "1_cleaning_modules.R"
source(snakemake@params[[2]]) # "2_renumber_parity_to_parityF.R"




#### Cleaning the data by Removing maternal with no ID, Removing mother-kids duplications, Remove Stillbirths, Remove rows with no GestAge and Only keeping singeltons. Also cleaning parity ####

## Loading data
library(data.table)
mfr_raw = fread(snakemake@input[[1]]) # load data
print("Number of pregnancies in the data")
print(nrow(mfr_raw)) #Number of pregnancies in the data


###  Removing maternal with no ID -> Removing mother-kids duplications -> Remove Stillbirths -> Remove rows with no GestAge -> Only keeping singeltons -> cleaning parity
year_matrix = NULL  # For the function fun_momID
cleaning = function(dat){

  dat = fun_momID(dat)  # Removing mothers without IDs
  colnames(dat)[grep("lpnr_barn",colnames(dat))]  = "lpnr_BARN"  # Change the name the column with child id 
  print("Removed mothers without IDs, pregnancies left:")
  print(row(dat))

  dat = fun_momkidID(dat)  # Removing mother-child duplications 
  print("Removed mother-child duplications, pregnancies left:")
  print(nrow(dat))

  dat = fun_deadborn(dat)  # Removing stillbirths
  print("Removed stillbirths, pregnancies left:")
  print(nrow(dat))

  dat = fun_GAmiss(dat)  # Removing pregnancies with no gestational age 
  print("Removed pregnancies with no gestational age, pregnancies left:")
  print(nrow(dat))

  dat = fun_multipregs(dat)   # Only keeping singeltons 
  print("Only keeping singeltons, pregnancies left:")
  print(nrow(dat)) 

  nnn = recalculateParity(dat, thr_d_low = 31, thr_d_upp = 200)  # Function resulting in the parity (number of pregnancies) for each woman (cleaning parity)
  mfr = cbind(dat, data.frame(parity_clean = nnn))  # Adding the parity to the "input data" that has been filtered

  dat = mfr %>% filter(BORDF2==1)
  mor = dat %>% group_by(lpnr_mor) %>% filter(parity_clean==1) %>% filter(n()>1) %>% pull(lpnr_mor)
  dat = dat[!(dat$lpnr_mor %in% mor),]
  print("Only keeping singeltons, pregnancies left:")
  print(nrow(dat))

  print("Done: Removed maternal with no ID -> Removed mother-kids duplications -> Removed Stillbirths -> Removed rows with no GestAge -> Only keeping singeltons -> cleaning parity")

dat
}

mfr_new = cleaning(mfr_raw)  # Cleaning data with the function cleaning
print("Number of pregnancies in the data after the filering function:")
print(nrow(mfr_new))




####
## Filtering GD?
# No using GRDBS. 
# GRDBS range between 154 to 321 days. The number of GD > 310 days is 2774, and more common in the earlier years of the register. Maybe there are some reason that the GD was bigger than 310 days, for now they are included.
####




#### Cleaning for outliers based on deviating child weight and gestational duration correlation with a Generalised Additive Models for Location Scale and Shape ####
library(dplyr)
library(gamlss)
b=subset(mfr_new,select=c("BVIKTBS","GRDBS")) # extracting weigth and gestational duration from data
b=na.omit(b) # removing incompleacte vases of weigth and gestational duration

newy=b$BVIKTBS
newx=b$GRDBS

mBCTcs<- gamlss(BVIKTBS~cs(GRDBS),family=BCT, data=b) # A generalised additive model utilizing a Box-Cox-t distribution and cubic spline (cs) smoothing term
saveRDS(mBCTcs, file = "~/Parity_Project1/mBCTcs.rds")
h=centiles.pred(mBCTcs, xname="GRDBS",xvalues=newx, type="s",dev = c( -5, -4.5, -4,-3.5, -3, -2, -1, 0, 1, 2, 3,3.5, 4,4.5, 5)) # looks good!

mfr_new$sq = seq(nrow(mfr_new)) # to restore the original order in the end
lower<-h[,3]
upper<-h[,15]
dat2<- arrange(mfr_new,GRDBS) %>% filter(!is.na(BVIKTBS), !is.na(GRDBS)) %>% mutate(rows_to_remove = ifelse(((BVIKTBS<lower)|
                 (BVIKTBS>upper)),T,F)) # to remove = T, removing any values above 4.5 or below -4.5 standard deviations (upper and lower)
dat3 <- filter(dat2,rows_to_remove==F)
#ggplot(dat3, aes(x=GRDBS,y=BVIKTBS)) + geom_point() # control removal

dat = dat3[order(dat3$sq),] # to restore order
rm(mfr_new,mBCTcs,dat2,dat3,b)

print("Pregnancies left after cleaning for outliers based on deviating child weight and gestational duration correlation:")
print(nrow(dat))




#### IVF ####
## Filter IVF cases and making a variable telling if this mother had a problem becomming pregnant = unwilling subfertility ##
# Variable
dat = dat %>% mutate(unwilling_subfertility = ifelse(is.na(OFRIBARN) | OFRIBARN == 0,0, OFRIBARN)) # 0 = no unwilling subfertility, variable >= 1 --> years in unwilling subfertility

# Filtering
# Variables below regards action on infertility --> pregancies with a 1 (=yes) are removed
# - OFRIIATG
# - OFRIABEF
# - OFRISTIM
# - OFRIKIRU
# - OFRIICSI
# - OFRIANN
dat = dat %>% filter(is.na(OFRIIATG), is.na(OFRIABEF), is.na(OFRISTIM), is.na(OFRIKIRU), is.na(OFRIICSI), is.na(OFRIANN))

print("Pregnancies left after removing pregnancies with action on infertility:")
print(nrow(dat))



#### Saving ####
fwrite(dat, snakemake@output[[1]], sep=",")

R dplyr data.table gamlss From line 5 of scripts/filtering_script_10feb2022.R

script:
    "scripts/filtering_script_10feb2022.R"

SnakeMake From line 49 of master/Snakefile

script:
    "scripts/create_variables.R"

SnakeMake From line 65 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/descriptive_table.R"

SnakeMake From line 85 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Parity_affects_gd.R"

SnakeMake From line 105 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/variance.R"

SnakeMake From line 126 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/clinical_history.R"

SnakeMake From line 146 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/family_history_maternal_sisters_given_birth_preterm.R"

SnakeMake From line 168 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/clinical_history_only_spontaneous.R"

SnakeMake From line 189 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/famliy_history_no_parity_interaction_plot.R"

SnakeMake From line 210 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/mother_herself_ptd.R"

SnakeMake From line 230 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/year_of_delivery_interaction.R"

SnakeMake From line 248 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/parity_risk_of_ptd_maternal_age_grouping_mixed_model.R"

SnakeMake From line 268 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/PTB_by_parity_BMI_unwilling_subfertilitu_smoking_diab_preeclamspia_descriptive_table.R"

SnakeMake From line 286 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/simulation_group_effect_bias.R"

SnakeMake From line 305 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/ultrasound.R"

SnakeMake From line 326 of master/Snakefile

script:
    "/home/karin/Parity_gestational_duration_interaction/scripts/plots/Supplementary/Parity_affects_gd.R"

SnakeMake From line 347 of master/Snakefile

ShowHide 16 more snippets with no or duplicated tags.

Comments

Support

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Created: 1yr ago

Updated: 1yr ago

Maitainers: public

URL: https://github.com/PerinatalLab/Parity_gestational_duration_interaction

Name: parity_gestational_duration_interaction

Version: 1

Badge:

Insert copied code into your website to add a link to this workflow.

License: None

Keywords:

Snakemake data.table dplyr gamlss preeclampsia

Future updates

Related Workflows

psychip_snakemake — Show Details View Workflow

ENCODE pipeline for histone marks developed for the psychENCODE project

public

psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project. The o...

raw sequence reads Alignment Sequence alignment report macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

Near-real time tracking of SARS-CoV-2 in Connecticut

public

Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

JSON nextclade Augur Biopython FOCUS Pandas Snakemake bs4 epiweeks geopy matplotlib numpy pycountry pycountry-convert uszipcode

Free

cellranger-snakemake-gke — Show Details View Workflow

snakemake workflow to run cellranger on a given bucket using gke.

public

A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

macs2 ucsc-bedclip bedGraphToBigWig BEDTools BWA Picard SAMtools Snakemake

Free

ATLAS - Three commands to start analyzing your metagenome data

public

Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...

raw sequence reads Genome assembly Annotation track checkm2 gunc prodigal snakemake-wrapper-utils MEGAHIT Atlas BBMap Biopython BioRuby Bwa-mem2 cd-hit CheckM DAS Diamond eggNOG-mapper v2 MetaBAT 2 Minimap2 MMseqs MultiQC Pandas Picard pyfastx SAMtools SemiBin Snakemake SPAdes SqueezeMeta TADpole VAMB CONCOCT ete3 gtdbtk h5py networkx numpy plotly psutil utils metagenomics

Free

175

rna-seq-star-deseq2 — Show Details View Workflow

RNA-seq workflow using STAR and DESeq2

public

This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

Free

dna-seq-gatk-variant-calling — Show Details View Workflow

This Snakemake pipeline implements the GATK best-practices workflow

public

This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...

VCF raw sequence reads Variant calling genetic variants gatk rust-bio-tools snakemake-wrapper-utils tabix BCFtools BWA FastQC MultiQC Pandas Picard SAMtools Snakemake Trimmomatic Variant Effect Predictor (VEP) common matplotlib numpy seaborn DNA

Free