Apis-wings-EU: A workflow for morphometric identification of honey bees from Europe
We present an R script that describes the workflow for analysing honey bee (Apis mellifera) wing shape. It is based on a large dataset of wing images and landmark coordinates available at Zenodo: https://doi.org/10.5281/zenodo.7244070 . The dataset can be used as a reference for the identification of unknown samples. As unknown samples, we used data from Nawrocka et al. (2018), available at Zenodo: https://doi.org/10.5281/zenodo.7567336 . Among others, the script can be used to identify the geographic origin of unknown samples and therefore assist in the monitoring and conservation of honey bee biodiversity in Europe.
Code Snippets
9 10 11 12 13 | knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE ) |
19 20 21 22 23 24 25 26 27 28 29 30 31 32 | # calculations library(geomorph) # GPA library(shapes) # OPA library(Morpho) # CVA library(phangorn) # UPGMA library(geosphere) # distm # plotting and visualization library(ggplot2) # plots ggplot2::theme_set(theme_light()) library(ggforce) # geom_ellipse library(mgcViz) # GAM visualization library(viridis) # plot scale library(rnaturalearth) # maps |
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | wings <- read.csv("https://zenodo.org/record/7244070/files/EU-raw-coordinates.csv", header = TRUE) # extract sample classifier wings <- tidyr::separate( data = wings, col = file, sep = c(7), # sample name is in the first 7 characters of file name into = c("sample", NA), remove = FALSE ) # extract country classifier wings <- tidyr::separate( data = wings, col = sample, sep = c(2), # country code is in the first 2 characters of sample name into = c("country", NA), remove = FALSE ) head(wings, 2) |
84 85 86 87 88 89 90 91 92 93 94 95 96 97 | geo.data <- read.csv("https://zenodo.org/record/7244070/files/EU-geo-data.csv", header = TRUE) sample.geo.data <- aggregate(cbind(geo.data$latitude, geo.data$longitude), by = list(geo.data$sample), FUN = mean) sample.geo.data <- reshape::rename(sample.geo.data, c(Group.1 = "sample", V1 = "latitude", V2 = "longitude")) # extract country classifier sample.geo.data <- tidyr::separate( data = sample.geo.data, col = sample, sep = c(2), # country code is in the first 2 characters of sample name into = c("country", NA), remove = FALSE ) head(sample.geo.data, 2) |
103 104 105 106 107 108 109 110 111 112 113 114 115 116 | world <- ne_countries(scale = "medium", returnclass = "sf") # jitter the coordinates to show all locations jitter.x <- jitter(sample.geo.data$longitude, amount = 0.2) jitter.y <- jitter(sample.geo.data$latitude, amount = 0.2) ggplot(data = world) + geom_sf() + geom_point(data = sample.geo.data, aes(x = jitter.x, y = jitter.y, colour = country, shape = country), size = 1) + scale_color_manual(name ="Country", values = rainbow(13)) + scale_shape_manual(name ="Country", values = c(0:2,15:25)) + coord_sf(xlim = c(-11, 33), ylim = c(34, 56)) + xlab("Longitude") + ylab("Latitude") |
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 | p <- 19 # number of landmarks k <- 2 # number of dimensions, in this case 2 for coordinates (x, y) # create coordinates names used by IdentiFly xyNames <- c("x1", "y1") for (i in 2:p) { xyNames <- c(xyNames, paste0("x", i)) xyNames <- c(xyNames, paste0("y", i)) } xyNames # create coordinates names used by geomorph xy.Names <- c("1.X", "1.Y") for (i in 2:p) { xy.Names <- c(xy.Names, paste(i, "X", sep = ".")) xy.Names <- c(xy.Names, paste(i, "Y", sep = ".")) } xy.Names # The number of principal components used is 2*p-4 = 34, which is equal to the degrees of freedom # create principal components names used by prcomp pcNames <- paste0("PC", 1:(2*p-4)) pcNames # create principal components names used by geomorph compNames <- paste0("Comp", 1:(2*p-4)) compNames geoNames <- c("latitude", "longitude") |
162 163 164 165 166 167 168 169 170 171 | # Convert 2D array into a 3D array wings.coords <- arrayspecs(wings[xyNames], p, k) dimnames(wings.coords)[[3]] <- wings$file # Align the coordinates using Generalized Procrustes Analysis GPA <- gpagen(wings.coords, print.progress = FALSE) # Convert 3D array into a 2D array - opposite to arrayspecs wings.aligned <- two.d.array(GPA$coords) head(wings.aligned, 2) |
177 178 179 180 181 182 183 184 185 186 187 188 | sample.aligned <- aggregate(wings.aligned, by = list(wings$sample), FUN = mean) names(sample.aligned)[names(sample.aligned) == "Group.1"] <- "sample" # rename column # extract country code as classifier sample.aligned <- tidyr::separate( data = sample.aligned, col = sample, sep = c(2), into = c("country", NA), remove = FALSE ) head(sample.aligned, 2) |
194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 | # Convert 2D array into a 3D array sample.3D <- arrayspecs(sample.aligned[xy.Names], p, k) dimnames(sample.3D)[[3]] <- sample.aligned$sample sample.pca <- gm.prcomp(sample.3D) sample.pca.scores <- as.data.frame(sample.pca$x[ , compNames]) sample.pca.scores <- cbind(sample.geo.data, sample.pca.scores) # create plot labels variance.tab <- summary(sample.pca)$PC.summary variance <- variance.tab["Proportion of Variance", "Comp1"] variance <- round(100 * variance, 1) label.x <- paste0("PC1 (", variance, "%)") variance <- variance.tab["Proportion of Variance", "Comp2"] variance <- round(100 * variance, 1) label.y <- paste0("PC2 (", variance, "%)") ggplot(sample.pca.scores, aes(x = Comp1, y = Comp2, shape = sample.aligned$country, color = sample.aligned$country)) + geom_point() + scale_shape_manual(name ="Country", values = c(0:2,15:25)) + scale_color_manual(name ="Country", values = rainbow(13)) + stat_ellipse() + xlab(label.x) + ylab(label.y) |
223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 | # define which landmarks are connected by lines in wireframe graph link.x <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 14, 15, 16, 17) link.y <- c(2, 3, 4, 5, 6, 19, 6, 10, 12, 8, 8, 14, 19, 9, 10, 15, 11, 12, 16, 13, 18, 15, 16, 17, 18) links.apis <- cbind(link.x, link.y) # mark minimum blue and maximum red GP1 <- gridPar(pt.bg = "blue", link.col = "blue", pt.size = 1, tar.pt.bg ="red", tar.link.col ="red") # wing shape change along PC1 plotRefToTarget(M1 = sample.pca$shapes$shapes.comp1$min, M2 = sample.pca$shapes$shapes.comp1$max, gridPars=GP1, method = "points", links = links.apis) # wing shape change along PC2 plotRefToTarget(M1 = sample.pca$shapes$shapes.comp2$min, M2 = sample.pca$shapes$shapes.comp2$max, gridPars=GP1, method = "points", links = links.apis) # countries difer significantly in wing shape country.MANOVA <- manova(as.matrix(sample.pca.scores[ , compNames]) ~ country, data=sample.pca.scores) summary(country.MANOVA) |
248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 | # PC1 ggplot(data = world) + geom_sf() + geom_point(data = sample.pca.scores, aes(x = jitter.x, y = jitter.y, colour = Comp1), size = 1) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + scale_color_viridis(name = "PC1") + xlab("Longitude") + ylab("Latitude") GAM <- mgcv::gam(Comp1 ~ s(longitude, latitude), data = sample.pca.scores) anova(GAM) viz.GAM <- getViz(GAM) plot(viz.GAM, 1) + l_fitRaster() + l_fitContour(color ="grey30") + l_points() + labs(title = NULL) + scale_fill_continuous(type = "viridis", name = "PC1", na.value = "transparent") + scale_x_continuous(limits = c(-11, 32), expand = c(0, 0)) + scale_y_continuous(limits = c(34, 57), expand = c(0, 0)) + geom_path(data = map_data("world"), aes(x = long, y = lat, group = group), color ="black", inherit.aes = FALSE) # PC2 ggplot(data = world) + geom_sf() + geom_point(data = sample.pca.scores, aes(x = jitter.x, y = jitter.y, colour = Comp2), size = 1) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + scale_color_viridis(name = "PC2") + xlab("Longitude") + ylab("Latitude") GAM <- mgcv::gam(Comp2 ~ s(longitude, latitude), data = sample.pca.scores) anova(GAM) viz.GAM <- getViz(GAM) plot(viz.GAM, 1) + l_fitRaster() + scale_fill_continuous(na.value = "transparent") + l_fitContour(color ="grey30") + l_points() + labs(title = NULL) + scale_fill_continuous(type = "viridis", name = "PC2", na.value = "transparent") + scale_x_continuous(limits = c(-11, 32), expand = c(0, 0)) + scale_y_continuous(limits = c(34, 57), expand = c(0, 0)) + geom_path(data = map_data("world"), aes(x = long, y = lat, group = group), color ="black", inherit.aes = FALSE) |
301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 | # Convert 2D array into a 3D array sample.3D <- arrayspecs(sample.aligned[xy.Names], p, k) dimnames(sample.3D)[[3]] <- sample.aligned$sample # use equal prior probability for all groups n.country <- length(unique(sample.aligned$country)) # number of groups sample.cva <- CVA(sample.3D, sample.aligned$country, rounds = 10000, cv = TRUE, prior = rep(1/n.country, n.country)) sample.cva.scores <- as.data.frame(sample.cva$CVscores) # remove unwanted spaces in variable names otherwise use `CV 1` colnames(sample.cva.scores) <- gsub(" ", "", colnames(sample.cva.scores)) sample.cva.scores <- cbind(sample.geo.data, sample.cva.scores) ggplot(sample.cva.scores, aes(x = CV1, y = CV2, shape = country, color = country)) + geom_point() + scale_shape_manual(name ="Country", values = c(0:2,15:25)) + scale_color_manual(name ="Country", values = rainbow(13)) + stat_ellipse() |
327 328 | CVA.class <- typprobClass(sample.cva$CVscores, groups = as.factor(sample.pca.scores$country), outlier = 0) print(CVA.class) |
338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 | sample.class <- data.frame(sample = sample.geo.data$sample, country = sample.geo.data$country, latitude = sample.geo.data$latitude, longitude = sample.geo.data$longitude, classified.as = CVA.class$groupaffinCV) # add column indicating incorrect classification sample.class$error <- ifelse(sample.class$country == sample.class$classified.as, "OK", "error") # only the incorrectly classified samples sample.error <- sample.class[sample.class$error == "error", c("sample", "country", "latitude", "longitude","classified.as")] # incorrect countries classification map ggplot(data = world) + geom_sf() + geom_point(data = sample.error, aes(x = longitude, y = latitude, colour = classified.as, shape = classified.as), size = 2) + scale_color_manual(name ="Classified as", values = rainbow(13)) + scale_shape_manual(name ="Classified as", values = c(0:2, 15:25)) + coord_sf(xlim = c(-11, 33), ylim = c(34, 56), expand = FALSE) + labs(x = "Longitude", y = "Latitude") |
367 368 369 370 371 372 373 374 | # Mahalanobis Distnces between countries dist <- sample.cva$Dist MD.dist <- as.matrix(dist$GroupdistMaha) MD.dist # Significance of differences betwen countries MD.prob <- as.matrix(dist$probsMaha) MD.prob |
381 382 383 | # UPGMA tree tree.upgma <- upgma(MD.dist) plot(tree.upgma, label.offset = 0.1, cex = 0.5) |
393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 | # calculate geographical distance between countries country.geo.data <- aggregate(cbind(sample.geo.data$latitude, sample.geo.data$longitude), by = list(sample.geo.data$country), FUN = mean) country.geo.data <- reshape::rename(country.geo.data, c(Group.1 = "country", V1 = "latitude", V2 = "longitude")) geo.dist <- distm(country.geo.data[c("longitude","latitude")], fun = distGeo) row.names(geo.dist) <- country.geo.data$country colnames(geo.dist) <- country.geo.data$country geo.dist <- geo.dist / 1000 # convert meters to kilometers geo.dist vegan::mantel(geo.dist, MD.dist, permutations = 9999, method = "spearman") # convert distance table to get distance in one column MD.distance <- as.dist(MD.dist) MD.distance <- as.numeric(MD.distance) MD.distance <- data.frame(t(combn(rownames(MD.dist),2)), MD.distance) geo.distance <- as.dist(geo.dist) geo.distance <- as.numeric(geo.distance) geo.distance <- data.frame(t(combn(rownames(geo.dist),2)), geo.distance) geo.distance$MD.distance <- MD.distance$MD.distance geo.distance$dist.name <- paste0(geo.distance$X1, "-", geo.distance$X2) outliers <- subset(geo.distance, dist.name %in% c("AT-HU","AT-HR", "AT-SI")) ggplot(geo.distance, aes(x = geo.distance, y = MD.distance)) + geom_point() + coord_trans(x = "log10") + geom_text( data = outliers, aes(x = geo.distance , y = MD.distance, label = dist.name), hjust =-0.2) + xlab("Geographical distance (km)") + ylab("Mahalanobis distance between wing shape") |
431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 | # add column with regions consisting of well represented countries or pairs of neighboring countries sample.aligned$region <- ifelse(sample.aligned$country == "ES" | sample.class$country == "PT" , "ES-PT", sample.aligned$country) sample.aligned$region <- ifelse(sample.aligned$country == "RO" | sample.class$country == "MD" , "RO-MD", sample.aligned$region) sample.aligned$region <- ifelse(sample.aligned$country == "HR" | sample.class$country == "SI" , "HR-SI", sample.aligned$region) # exclude countries with sample size smaller than 25 sample.region <- sample.aligned[sample.aligned$country != "AT" & sample.aligned$country != "HU" & sample.aligned$country != "ME" & sample.aligned$country != "RS", ] sample.3D <- arrayspecs(sample.region[xy.Names], p, k) dimnames(sample.3D)[[3]] <- sample.region$sample # use equal prior probability for all groups n.region <- length(unique(sample.region$region)) # number of groups region.cva <- CVA(sample.3D, sample.region$region, rounds = 10000, cv = TRUE, prior = rep(1/n.region, n.region)) region.cva.scores <- as.data.frame(region.cva$CVscores) # remove unwanted spaces in variable names otherwise use `CV 1` colnames(region.cva.scores) <- gsub(" ", "", colnames(region.cva.scores)) ggplot(region.cva.scores, aes(x = CV1, y = CV2, shape = sample.region$region, color = sample.region$region)) + geom_point() + scale_shape_manual(name ="Region", values = c(0:2,19,5:6)) + scale_color_manual(name ="Region", values = rainbow(6)) + stat_ellipse() CVA.class.region <- typprobClass(region.cva$CVscores, groups = as.factor(sample.region$region), outlier = 0) print(CVA.class.region) |
475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 | wings.lin <- read.csv("https://zenodo.org/record/7567336/files/Nawrocka_et_al2018.csv", header = TRUE) # extract sample classifier wings.lin <- tidyr::separate( data = wings.lin, col = file, sep = c(10), # sample name is in the first 10 characters of file name into = c("sample", NA), remove = FALSE ) # extract lineage classifier wings.lin <- tidyr::separate( data = wings.lin, col = file, sep = c(1), # lineage code is in the first character of file name into = c("lineage", NA), remove = FALSE ) head(wings.lin, 2) |
503 504 505 506 507 508 509 510 | # Convert 2D array into a 3D array wings.lin.3D <- arrayspecs(wings.lin[xyNames], p, k) dimnames(wings.lin.3D)[[3]] <- wings.lin$file # Align the coordinates using Generalized Procrustes Analysis GPA.lin <- gpagen(wings.lin.3D, print.progress = FALSE) # Convert 3D array into a 2D array - opposite to arrayspecs wings.lin.aligned <- two.d.array(GPA.lin$coords) head(wings.lin.aligned, 2) |
516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 | sample.lin.aligned <- aggregate(wings.lin.aligned, by = list(wings.lin$sample), FUN = mean) names(sample.lin.aligned)[names(sample.lin.aligned) == "Group.1"] <- "sample" # rename column # extract lineage code as classifier sample.lin.aligned <- tidyr::separate( data = sample.lin.aligned, col = sample, sep = c(1), into = c("lineage", NA), remove = FALSE ) geo.data.lin <- read.csv("https://zenodo.org/record/7567336/files/Nawrocka_et_al2018-geo-data.csv", header = TRUE) sample.lin.aligned$subspecies <- geo.data.lin$subspecies sample.lin.aligned$country <- geo.data.lin$country head(sample.lin.aligned, 2) |
537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 | # Convert 2D array into a 3D array for easier analysis of unknown samples sample.lin.3D <- arrayspecs(sample.lin.aligned[xy.Names], p, k) dimnames(sample.lin.3D)[[3]] <- sample.lin.aligned$sample # use equal prior probability for all groups n.lin <- length(unique(sample.lin.aligned$lineage)) # number of groups sample.lin.cva <- CVA(sample.lin.3D, sample.lin.aligned$lineage, rounds = 10000, cv = TRUE, prior = rep(1/n.lin, n.lin)) sample.lin.cva.scores <- as.data.frame(sample.lin.cva$CVscores) # remove unwanted spaces in variable names otherwise use `CV 1` colnames(sample.lin.cva.scores) <- gsub(" ", "", colnames(sample.lin.cva.scores)) sample.lin.cva.scores <- cbind(sample.lin.aligned$lineage, sample.lin.cva.scores) names(sample.lin.cva.scores)[1] <- "lineage" # rename column rownames(sample.lin.cva.scores) <- sample.lin.aligned$sample ggplot(sample.lin.cva.scores, aes(x = CV1, y = CV2, color = lineage)) + geom_point() + coord_fixed() + stat_ellipse() |
563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 | # Convert 2D array into a 3D array sample.only <- sample.aligned[xy.Names] unknown.wings <- arrayspecs(sample.only, p, k) dimnames(unknown.wings)[[3]] <- sample.only$sample # calculate covarience for each lineage covariances <- lapply(unique(sample.lin.cva.scores$lineage), function(x) cov(sample.lin.cva.scores[sample.lin.cva.scores$lineage==x,-1])) means <- aggregate(sample.lin.cva.scores[,names(sample.lin.cva.scores) != "lineage"], list(sample.lin.cva.scores$lineage), FUN=mean) rownames(means) <- means$Group.1 means <- means[,names(means) != "Group.1"] # Projection of the samples to canonical variate space from Nawrocka et al. 2018 groups <- rownames(means) result.list <- vector(mode = "list", length = nrow(sample.aligned)) CV.list <- vector(mode = "list", length = nrow(sample.aligned)) for (r in 1:nrow(sample.aligned)) { # Align unknown consensus with consensus from reference samples unknown.OPA <- procOPA(GPA.lin$consensus, unknown.wings[,,r]) unknown.aligned <- unknown.OPA$Bhat CV.row <- predict(sample.lin.cva, unknown.aligned) # create empty list for results result <- numeric(length(groups)) for (i in 1:length(groups)) { MD <- mahalanobis(CV.row, unlist(means[i, ]), as.matrix(covariances[[i]])) result[i] <- MD } result.list[[r]] <- result CV.list[[r]] <- CV.row } CV.tab <- do.call(rbind, CV.list) # remove unwanted spaces in variable names otherwise use `CV 1` colnames(CV.tab) <- gsub(" ", "", colnames(CV.tab)) CV.tab <- as.data.frame(CV.tab) CV.tab <- cbind(CV.tab, sample.aligned$country) colnames(CV.tab) <- gsub("sample.aligned\\$", "", colnames(CV.tab)) # Black ellipses A, C, M and O indicate 95% confidence regions of reference samples from Nawrocka et al. 2018 ggplot(CV.tab, aes(x = CV1, y = CV2, shape = country, color = country))+ geom_point() + scale_shape_manual(name ="Country", values = c(0:2,15:25)) + scale_color_manual(name ="Country", values = rainbow(13)) + stat_ellipse() + stat_ellipse(data = subset(sample.lin.cva.scores, lineage == "A"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(sample.lin.cva.scores, lineage == "C"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(sample.lin.cva.scores, lineage == "M"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(sample.lin.cva.scores, lineage == "O"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + geom_label(means, mapping = aes(x = CV1, y = CV2, label = rownames(means)), inherit.aes = FALSE) MD.tab <- do.call(rbind, result.list) MD.tab <- sqrt(MD.tab) # column names for lineages colnames(MD.tab) <- groups MD.tab <- as.data.frame(MD.tab) lineage <- colnames(MD.tab)[apply(MD.tab, 1, which.min)] # final column names colnames(MD.tab) <- paste0("MD.",groups) MD.tab <- cbind(MD.tab, lineage) rownames(MD.tab) <- sample.aligned$sample # frequencies of the lineages table(MD.tab$lineage) # frequencies of the lineage in countries table(sample.aligned$country, MD.tab$lineage) sample.lineages <- MD.tab sample.lineages <- cbind(sample.lineages, sample.geo.data$longitude, sample.geo.data$latitude) colnames(sample.lineages) <- gsub("sample.geo.data\\$", "", colnames(sample.lineages)) # Classification of the samples to lineages according to <a href="#Nawrocka et al. 2018">Nawrocka et al. 2018</a> ggplot(data = world) + geom_sf() + geom_point(data = sample.lineages, aes(x = jitter.x, y = jitter.y, colour = lineage), size = 1) + scale_color_manual(name ="Lineage", values = rainbow(4)) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + xlab("Longitude") + ylab("Latitude") # Lineage A # Mahalanobis Distance to lineage A (with jitter to show multiple samples from one location) ggplot(data = world) + geom_sf() + geom_point(data = sample.lineages, aes(x = jitter.x, y = jitter.y, colour = MD.A), size = 1) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + scale_color_viridis(name = "Distance to\nlineage A", direction = -1) + xlab("Longitude") + ylab("Latitude") # spatial interpolation of the above data using GAM GAM <- mgcv::gam(MD.A ~ s(longitude, latitude), data = sample.lineages) viz.GAM <- getViz(GAM) GAM.A <- plot(viz.GAM, 1) + l_fitRaster() + l_fitContour(color ="grey30") + l_points() + labs(title = NULL) + scale_fill_continuous(type = "viridis", direction = -1, name = "Distance to\nlineage A", na.value = "transparent") + scale_x_continuous(limits = c(-11, 32), expand = c(0, 0)) + scale_y_continuous(limits = c(34, 57), expand = c(0, 0)) + geom_path(data = map_data("world"), aes(x = long, y = lat, group = group), color ="black", inherit.aes = FALSE) GAM.A # Lineage C # Mahalanobis Distance to lineage C (with jitter to show multiple samples from one location) ggplot(data = world) + geom_sf() + geom_point(data = sample.lineages, aes(x = jitter.x, y = jitter.y, colour = MD.C), size = 1) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + scale_color_viridis(name = "Distance to\nlineage C", direction = -1) + xlab("Longitude") + ylab("Latitude") # spatial interpolation of the above data using GAM GAM <- mgcv::gam(MD.C ~ s(longitude, latitude), data = sample.lineages) viz.GAM <- getViz(GAM) GAM.C <- plot(viz.GAM, 1) + l_fitRaster() + l_fitContour(color ="grey30") + l_points() + labs(title = NULL) + scale_fill_continuous(type = "viridis", direction = -1, name = "Distance to\nlineage C", na.value = "transparent") + scale_x_continuous(limits = c(-11, 32), expand = c(0, 0)) + scale_y_continuous(limits = c(34, 57), expand = c(0, 0)) + geom_path(data = map_data("world"), aes(x = long, y = lat, group = group), color ="black", inherit.aes = FALSE) GAM.C # Lineage M # Mahalanobis Distance to lineage M (with jitter to show multiple samples from one location) ggplot(data = world) + geom_sf() + geom_point(data = sample.lineages, aes(x = jitter.x, y = jitter.y, colour = MD.M), size = 1) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + scale_color_viridis(name = "Distance to\nlineage M", direction = -1) + xlab("Longitude") + ylab("Latitude") # spatial interpolation of the above data using GAM GAM <- mgcv::gam(MD.M ~ s(longitude, latitude), data = sample.lineages) viz.GAM <- getViz(GAM) GAM.M <- plot(viz.GAM, 1) + l_fitRaster() + l_fitContour(color ="grey30") + l_points() + labs(title = NULL) + scale_fill_continuous(type = "viridis", direction = -1, name = "Distance to\nlineage M", na.value = "transparent") + scale_x_continuous(limits = c(-11, 32), expand = c(0, 0)) + scale_y_continuous(limits = c(34, 57), expand = c(0, 0)) + geom_path(data = map_data("world"), aes(x = long, y = lat, group = group), color ="black", inherit.aes = FALSE) GAM.M # Lineage O # Mahalanobis Distance to lineage O (with jitter to show multiple samples from one location) ggplot(data = world) + geom_sf() + geom_point(data = sample.lineages, aes(x = jitter.x, y = jitter.y, colour = MD.O), size = 1) + coord_sf(xlim = c(-11, 32), ylim = c(34, 57), expand = FALSE) + scale_color_viridis(name = "Distance to\nlineage O", direction = -1) + xlab("Longitude") + ylab("Latitude") # spatial interpolation of the above data using GAM GAM <- mgcv::gam(MD.O ~ s(longitude, latitude), data = sample.lineages) viz.GAM <- getViz(GAM) GAM.O <- plot(viz.GAM, 1) + l_fitRaster() + l_fitContour(color ="grey30") + l_points() + labs(title = NULL) + scale_fill_continuous(type = "viridis", direction = -1, name = "Distance to\nlineage O", na.value = "transparent") + scale_x_continuous(limits = c(-11, 32), expand = c(0, 0)) + scale_y_continuous(limits = c(34, 57), expand = c(0, 0)) + geom_path(data = map_data("world"), aes(x = long, y = lat, group = group), color ="black", inherit.aes = FALSE) GAM.O |
758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 | # use samples from lineages M and C as unknown sample.unknown <- subset(sample.lin.aligned, lineage == "M" | lineage == "C") # convert 2D array into a 3D array unknown.lin <- arrayspecs(sample.unknown[xy.Names], p, k) dimnames(unknown.lin)[[3]] <- sample.unknown$sample # calculate CVA scores for unknown samples CV.list <- vector(mode = "list", length = nrow(sample.unknown)) for (r in 1:nrow(sample.unknown)) { # Align unknown consensus with consensus from reference samples unknown.OPA <- procOPA(GPA$consensus, unknown.lin[,,r]) unknown.aligned <- unknown.OPA$Bhat CV.row <- predict(region.cva, unknown.aligned) CV.list[[r]] <- CV.row } CV.tab <- do.call(rbind, CV.list) # remove unwanted spaces in variable names otherwise use `CV 1` colnames(CV.tab) <- gsub(" ", "", colnames(CV.tab)) CV.tab <- as.data.frame(CV.tab) outlier.threshold <- 0.001 typprobs <- typprobClass(CV.tab, region.cva$CVscores, groups = as.factor(sample.region$region), outlier = outlier.threshold, sep = FALSE) print(typprobs) # probability of classification for all groups P.tab <- typprobs$probs rownames(P.tab) <- sample.unknown$sample # for each sample find maximum probability P.max <- apply(P.tab, 1, FUN = max) # for each sample find region with larges probability region.max <- colnames(P.tab)[apply(P.tab, 1, which.max)] # samples with probability below 0.001 are considered as outlines P.tab.max <- data.frame(country = sample.unknown$country, P.max, region.max) head(P.tab.max, 2) # frequencies of the countries in regions without detection of outliers table(P.tab.max$country, P.tab.max$region.max) P.tab.max$region.none <- ifelse(P.max < outlier.threshold, "none", region.max) # frequencies of the countries in regions with detection of outliers table(P.tab.max$country, P.tab.max$region.none) # regions labels region.cva.scores$region <- sample.region$region region.cva.means <- aggregate(region.cva.scores[,names(region.cva.scores) != "region"], list(region.cva.scores$region), FUN = mean) rownames(region.cva.means) <- region.cva.means$Group.1 region.cva.means <- region.cva.means[,names(region.cva.means) != "Group.1"] ggplot(CV.tab, aes(x = CV1, y = CV2, shape = sample.unknown$subspecies, color = sample.unknown$subspecies)) + geom_point() + scale_shape_manual(name ="subspecies", values = c(0:5)) + scale_color_manual(name ="subspecies", values = rainbow(6)) + stat_ellipse() + stat_ellipse(data = subset(region.cva.scores, region == "ES-PT"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(region.cva.scores, region == "GR"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(region.cva.scores, region == "HR-SI"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(region.cva.scores, region == "PL"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(region.cva.scores, region == "RO-MD"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + stat_ellipse(data = subset(region.cva.scores, region == "TR"), mapping = aes(x = CV1, y = CV2), inherit.aes = FALSE) + geom_label(region.cva.means, mapping = aes(x = CV1, y = CV2, label = rownames(region.cva.means)), inherit.aes = FALSE) |
851 | sessionInfo() |
Support
Do you know this workflow well? If so, you can
request seller status , and start supporting this workflow.
Created: 1yr ago
Updated: 1yr ago
Maitainers:
public
URL:
https://workflowhub.eu/workflows/422
Name:
apis-wings-eu-a-workflow-for-morphometric-identifi
Version:
Version 1
Downloaded:
0
Copyright:
Public Domain
License:
None
Keywords:
Refs:
- Future updates
Related Workflows

ENCODE pipeline for histone marks developed for the psychENCODE project
psychip pipeline is an improved version of the ENCODE pipeline for histone marks developed for the psychENCODE project.
The o...

Near-real time tracking of SARS-CoV-2 in Connecticut
Repository containing scripts to perform near-real time tracking of SARS-CoV-2 in Connecticut using genomic data. This pipeli...

snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...

ATLAS - Three commands to start analyzing your metagenome data
Metagenome-atlas is a easy-to-use metagenomic pipeline based on snakemake. It handles all steps from QC, Assembly, Binning, t...
raw sequence reads
Genome assembly
Annotation track
checkm2
gunc
prodigal
snakemake-wrapper-utils
MEGAHIT
Atlas
BBMap
Biopython
BioRuby
Bwa-mem2
cd-hit
CheckM
DAS
Diamond
eggNOG-mapper v2
MetaBAT 2
Minimap2
MMseqs
MultiQC
Pandas
Picard
pyfastx
SAMtools
SemiBin
Snakemake
SPAdes
SqueezeMeta
TADpole
VAMB
CONCOCT
ete3
gtdbtk
h5py
networkx
numpy
plotly
psutil
utils
metagenomics

RNA-seq workflow using STAR and DESeq2
This workflow performs a differential gene expression analysis with STAR and Deseq2. The usage of this workflow is described ...

This Snakemake pipeline implements the GATK best-practices workflow
This Snakemake pipeline implements the GATK best-practices workflow for calling small germline variants. The usage of thi...