Last updated: 2022-02-03
Checks: 7 0
Knit directory: beer_manuscript/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20210907)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version c566d60. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: analysis/partials/.DS_Store
Ignored: data_processed/.DS_Store
Ignored: data_processed/simulation_2beads_mle/.DS_Store
Ignored: data_processed/simulation_2beads_mom/.DS_Store
Ignored: data_processed/simulation_2beads_truth/.DS_Store
Ignored: data_processed/simulation_4beads_edgeR/.DS_Store
Ignored: data_processed/simulation_curves.rda
Ignored: data_raw/.DS_Store
Ignored: figures/.DS_Store
Unstaged changes:
Deleted: analysis/figure/hiv_ec.Rmd/figure_hiv_protein-1.png
Deleted: analysis/figure/hiv_ec.Rmd/figure_hiv_protein_A-1.png
Deleted: analysis/figure/hiv_ec.Rmd/figure_hiv_protein_noB-1.png
Deleted: analysis/figure/hiv_ec.Rmd/figure_hiv_replicates-1.png
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/hiv_ec.Rmd
) and HTML (docs/hiv_ec.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | c566d60 | Athena Chen | 2022-02-03 | updated Figure S13 |
Rmd | 7f65c20 | Athena Chen | 2022-01-30 | added new figure |
html | 7f65c20 | Athena Chen | 2022-01-30 | added new figure |
html | bbf7cba | Athena Chen | 2022-01-23 | rebuilt webpage |
html | 3b06e48 | Athena Chen | 2022-01-23 | Build site. |
html | d8b023f | Athena Chen | 2022-01-18 | Build site. |
html | 75f2a6d | Athena Chen | 2022-01-18 | Build site. |
Rmd | 9d7b43a | Athena Chen | 2022-01-18 | Updated repo to use most up-to-date figures and code |
html | 9245fe2 | Athena Chen | 2021-09-15 | Build site. |
Rmd | 917bca7 | Athena Chen | 2021-09-15 | wflow_publish(list.files("analysis", pattern = "Rmd", full.names = TRUE)) |
html | 54af830 | Athena Chen | 2021-09-15 | Build site. |
Rmd | 26ee6f3 | Athena Chen | 2021-09-15 | wflow_publish(list.files("analysis", pattern = ".Rmd", full.names = TRUE)) |
html | e30784b | Athena Chen | 2021-09-14 | Build site. |
html | 4984369 | Athena Chen | 2021-09-14 | Build site. |
html | 34893c7 | Athena Chen | 2021-09-14 | Build site. |
Rmd | 667d1af | Athena Chen | 2021-09-14 | Added simulation output |
html | 667d1af | Athena Chen | 2021-09-14 | Added simulation output |
The HIV EC data consists of read counts for 3395 HIV peptidesfor six beads-only samples and ten serum samples run on the same plate. Of these ten serum samples, eight were from individuals infected with HIV subtype B. The remaining two serum samples are identical samples from an individual infected with HIV subtype A.
#' Code to load required packages to reproduce the results and figures in
#' the manuscript.
required_packages <- c('plyr', 'tidyverse', 'here', 'ggpubr', 'gridExtra',
'latex2exp', 'kableExtra', 'RColorBrewer', 'BiocManager')
for (pkg in required_packages) {
if (!(pkg %in% rownames(installed.packages()))) {
install.packages(pkg)
}
library(pkg, character.only = TRUE)
}
bioc_packages <- c("beer")
for(pkg in bioc_packages){
if(!(pkg %in% rownames(installed.packages()))) {
BiocManager::install(pkg)
}
library(pkg, character.only = TRUE)
}
rm(list = c("required_packages", "bioc_packages", "pkg"))
#' Define global variables for plotting
hot_cold_cols <- c("navy", "blue", "deepskyblue", "cyan", "lightcyan",
"yellow", "orange", "red")
#' helper_functions.R
#'
#' functions used to help process and analyze output from BEER simulations
#' integrate_vector()
#'
#' Function to perform trapezoidal approximation given two vectors.
#'
#' @param x numeric vector of x values
#' @param y numeric vector with the same length as x
#'
#' @return numeric value of the trapezoidal approximation
integrate_vector <- function(x, y){
if (length(y) != length(x)) {
stop("The length of the vectors must be equal.")
}
n_points <- length(y)
sorted_ind <- sort(x, index.return = TRUE)$ix
sorted_x <- x[sorted_ind]
sorted_y <- y[sorted_ind]
# trapezoidal rule
sum(0.5 * (sorted_y[-1] + sorted_y[-n_points]) *
(sorted_x[-1] - sorted_x[-n_points]))
}
#' get_roc()
#'
#' Function to calculate the ppv, sens, and spec for various
#' cutoffs between 0 and 1.
#'
#' @param data data frame with columns `prop_enriched` for the threshold and
#' `Z` for the true enrichment status, and optional column `extra_info`
#' @param min_cutoff minimum cutoff, default to 0
#' @param max_cutoff maximum cutoff, default ot 1 - 1e-6
#' @param extra_info boolean indicating whether extra information should be used
#' in the classification at each cutoff. If `TRUE`, then the column `extra_info`
#' must be present in the data frame.
#'
#' @return data fram with columns `cutoff`, `ppv`, `sens`, and `spec`.
get_roc <- function(data, min_cutoff = 0, max_cutoff = 1 - 1e-6,
extra_info = FALSE){
cutoffs <- seq(min_cutoff, max_cutoff, length.out = 1000)
# Calculate ppv, sens, spec for each cutoff
ppv <- sapply(cutoffs, function(x) {
if(extra_info){
predict <- (data$prop_enriched >= x & data$extra_info)
} else {
predict <- (data$prop_enriched >= x)
}
sum((data$Z == 1) & predict)/sum(predict)
})
spec <- sapply(cutoffs, function(x){
if(extra_info){
predict <- (data$prop_enriched >= x & data$extra_info)
} else {
predict <- (data$prop_enriched >= x)
}
sum((data$Z == 0) & !predict)/sum(data$Z == 0)
})
sens <- sapply(cutoffs, function(x){
if(extra_info){
predict <- (data$prop_enriched >= x & data$extra_info)
} else {
predict <- (data$prop_enriched >= x)
}
sum((data$Z == 1) & predict)/sum(data$Z == 1)
})
# Get AUC for ROC
npoints <- length(sens)
area_roc <- integrate_vector(1-spec, sens)
return(cutoffs = data.frame(cutoff = cutoffs,
ppv = ppv,
sens = sens,
spec = spec,
area_roc = area_roc))
}
#' get_legend()
#'
#' Function given [here](https://stackoverflow.com/questions/12539348/ggplot-separate-legend-and-plot)
#' to extract the legend for plotting purposes.
#'
#' @param plot ggplot
#' @return ggplot legend
get_legend <- function(myggplot){
tmp <- ggplot_gtable(ggplot_build(myggplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
#' mysqrt_trans()
#'
#' Function given [here](https://stackoverflow.com/questions/47944992/ggplot2-removes-zero-when-using-scale-x-sqrt)
#' to add zero to the plot after sqrt transforming the x-axis
#'
#' @param plot ggplot
#' @return ggplot legend
mysqrt_trans <- function() {
scales::trans_new("mysqrt",
transform = base::sqrt,
inverse = function(x) ifelse(x<0, 0, x^2),
domain = c(0, Inf))
}
#' penriched_fit()
#'
#' Function that returns a data frame with the point estimate, and 95\%
#' confidence intervals.
#'
#' @param model logistic regression model
#' @param covariates data frame of covariates
#' @return data frame with columns for the covariate, point estimate, lower CI
#' and upper CI.
penriched_fit <- function(model, covariates){
# Predict based on the model
prediction <- predict(model, covariates, type = "link", se.fit = TRUE)
pred_lower <- prediction$fit - 1.96*prediction$se.fit
pred_upper <- prediction$fit + 1.96*prediction$se.fit
# Transform logit to probabilities
ppred <- 1/(1 + exp(-prediction$fit))
plower <- 1/(1 + exp(-pred_lower))
pupper <- 1/(1 + exp(-pred_upper))
# Return covariates with prediction + 95 CI added
bind_cols(covariates,
data.frame(predict_p = ppred,
lower_ci = plower,
upper_ci = pupper))
}
hiv <- readRDS(here::here("data_raw", "hiv.rds"))
sampleInfo(hiv)
DataFrame with 16 rows and 3 columns
group subtype n
<character> <character> <numeric>
BEADS 1 beads NA 61472
BEADS 2 beads NA 60633
BEADS 3 beads NA 56353
BEADS 4 beads NA 58873
BEADS 5 beads NA 55549
... ... ... ...
HIV EC 6 sample B 315839
HIV EC 7 sample B 569059
HIV EC 8 sample B 261074
HIV EC 9 sample A 131095
replicate of HIV EC 9 sample A 164912
We can run BEER and edgeR using the following code. To estimate the false positive rate, we run each beads-only sample against the five other beads-only samples.
## Run edgeR with beadsRR
hiv_out <- edgeR(hiv, assay.names = c("edgeR_logfc", "edgeR_prob"),
parallel = "multisession", beadsRR = TRUE)
## Run beer with beadsRR
beer_assays <- c(phi = NULL, phi_Z = "beer_logfc", Z = "beer_prob",
c = "sampleInfo", pi = "sampleInfo")
hiv_out <- brew(hiv_out, assay.names = beer_assays, beadsRR = TRUE,
parallel = "multisession")
For plotting convenience, we convert the PhIPData object to a tidy dataframe.
hiv_tidy <- as(hiv_out, "DataFrame") %>%
as_tibble() %>%
group_by(sample) %>%
mutate(is_se = ifelse(sample != "beads" & is.na(beer_prob), TRUE, FALSE),
sample = factor(sample, levels = colnames(hiv)),
beer_hits = ifelse(beer_prob > 0.5 | is_se, 1, 0),
edgeR_bh = p.adjust(10^(-edgeR_prob), method = "BH"),
edgeR_hits = ifelse(edgeR_bh < 0.05, 1, 0)) %>%
ungroup()
Bland-Altman (MA) plots for the proportion of enriched peptides by protein, for eight elite controller samples. Points represent individual proteins, point colors indicate protein virus types, point diameters indicate the number of peptides tiling the respective proteins. All subjects shown here were infected with subtype B (red).
# Color palette
hiv_subtype <- unique(hiv_tidy$taxon_species)
grey_palette <- palette(gray(seq(0.1, 0.8, len = (length(hiv_subtype) - 1))))
# Make `grey_palette` is of the correct length. Not sure why this has to be run twice
grey_palette <- if(length(grey_palette) < (length(hiv_subtype) - 1)){
palette(gray(seq(0.1, 0.8, length.out = (length(hiv_subtype) - 1))))
} else grey_palette
num_B <- grep("HIV type 1 group M subtype B", hiv_subtype)
subtype_order <- c(hiv_subtype[num_B], hiv_subtype[-num_B])
hiv_tidy %>%
filter(group != "beads" & !grepl("9", sample)) %>%
select(sample, peptide, UniProt_acc, taxon_species,
beer_hits, edgeR_hits) %>%
group_by(sample, UniProt_acc, taxon_species) %>%
dplyr::summarize(prot_prop_Bayes = mean(beer_hits),
prot_prop_edgeR = mean(edgeR_hits),
num_peptides = n(), .groups = "drop") %>%
mutate(taxon_species = factor(taxon_species, subtype_order)) %>%
arrange(desc(taxon_species), desc(num_peptides)) %>%
ggplot(aes(x = 0.5*(prot_prop_Bayes + prot_prop_edgeR),
y = prot_prop_Bayes - prot_prop_edgeR,
color = taxon_species,
size = num_peptides)) +
facet_wrap(sample ~., ncol = 4) +
geom_hline(aes(yintercept = 0), size = 0.5, color = "grey50") +
geom_vline(aes(xintercept = 0), size = 0.5, color = "grey50") +
geom_point(alpha = 0.8) +
labs(x = TeX("$\\frac{1}{2}$ (BEER + edgeR)"),
y = "BEER - edgeR",
color = "HIV strain",
size = "# peptides") +
scale_x_continuous(trans = "mysqrt",
limits = c(0, 1),
breaks = seq(0, 1, by = 0.2)) +
scale_y_continuous(breaks = seq(-0.25, 1, by = 0.25),
limits = c(-0.25, 1)) +
scale_color_manual(values = c( "firebrick2", grey_palette),
breaks = subtype_order) +
theme_bw() +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.key.size = unit(0.75, "lines"),
legend.position = "bottom") +
guides(color = guide_legend(ncol = 2, order = 1),
size = guide_legend(ncol = 1, order = 2))
Version | Author | Date |
---|---|---|
54af830 | Athena Chen | 2021-09-15 |
Proportion of enriched peptides by protein without HIV subtype B. Each point represents a protein. The color of the point indicates which virus the protein belongs to, and the size of the point corresponds to the number of peptides tiling the protein.
# Figure S6: hiv_protein_noB.png ----------
hiv_tidy %>%
filter(group != "beads" & !grepl("9", sample) &
!grepl("HIV type 1 group M subtype B", taxon_species)) %>%
select(sample, peptide, UniProt_acc, taxon_species,
beer_hits, edgeR_hits) %>%
group_by(sample, UniProt_acc, taxon_species) %>%
summarize(prot_prop_Bayes = mean(beer_hits),
prot_prop_edgeR = mean(edgeR_hits),
num_peptides = n(), .groups = "drop") %>%
mutate(taxon_species = factor(taxon_species, subtype_order)) %>%
arrange(desc(taxon_species), desc(num_peptides)) %>%
ggplot(aes(x = 0.5*(prot_prop_Bayes + prot_prop_edgeR),
y = prot_prop_Bayes - prot_prop_edgeR,
color = taxon_species,
size = num_peptides)) +
facet_wrap(sample ~., ncol = 4) +
geom_hline(aes(yintercept = 0), size = 0.5, color = "grey50") +
geom_vline(aes(xintercept = 0), size = 0.5, color = "grey50") +
geom_point(alpha = 0.8) +
labs(x = TeX("$\\frac{1}{2}$ (BEER + edgeR)"),
y = "BEER - edgeR",
color = "HIV strain",
size = "# peptides") +
scale_x_continuous(trans = "mysqrt",
limits = c(0, 1),
breaks = seq(0, 1, by = 0.2)) +
scale_y_continuous(breaks = seq(-0.25, 1, by = 0.25),
limits = c(-0.25, 1)) +
scale_color_manual(values = c( "firebrick2", grey_palette),
breaks = subtype_order) +
theme_bw() +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.key.size = unit(0.75, "lines"),
legend.position = "bottom") +
guides(color = guide_legend(ncol = 2, order = 1),
size = guide_legend(ncol = 1, order = 2))
Version | Author | Date |
---|---|---|
54af830 | Athena Chen | 2021-09-15 |
Proportion of enriched peptides by protein across technical replicates. This individual was infected with HIV subtype A. Each point represents a protein. The color of the point indicates which virus the protein belongs to, and the size of the point corresponds to the number of peptides tiling the protein.
# Color palette
num_A <- grep("HIV type 1 group M subtype A", hiv_subtype)
subtype_order <- c(hiv_subtype[c(num_B, num_A)], hiv_subtype[-c(num_B, num_A)])
hiv_tidy %>%
filter(grepl("9", sample)) %>%
select(sample, peptide, UniProt_acc, taxon_species,
beer_hits, edgeR_hits) %>%
group_by(sample, UniProt_acc, taxon_species) %>%
summarize(prot_prop_Bayes = mean(beer_hits),
prot_prop_edgeR = mean(edgeR_hits),
num_peptides = n(), .groups = "drop") %>%
mutate(taxon_species = factor(taxon_species, subtype_order)) %>%
arrange(desc(taxon_species), desc(num_peptides)) %>%
ggplot(aes(x = 0.5*(prot_prop_Bayes + prot_prop_edgeR),
y = prot_prop_Bayes - prot_prop_edgeR,
color = taxon_species,
size = num_peptides)) +
facet_wrap(sample ~., ncol = 4) +
geom_hline(aes(yintercept = 0), size = 0.5, color = "grey50") +
geom_vline(aes(xintercept = 0), size = 0.5, color = "grey50") +
geom_point(alpha = 0.8) +
labs(x = TeX("$\\frac{1}{2}$ (BEER + edgeR)"),
y = "BEER - edgeR",
color = "HIV strain",
size = "# peptides") +
scale_x_continuous(trans = "mysqrt",
limits = c(0, 1),
breaks = seq(0, 1, by = 0.2)) +
scale_y_continuous(breaks = seq(-0.25, 1, by = 0.25),
limits = c(-0.25, 1)) +
scale_color_manual(values = c("firebrick2", "blue", grey_palette[-num_A]),
breaks = subtype_order) +
theme_bw() +
theme(legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.key.size = unit(0.75, "lines"),
legend.position = "bottom") +
guides(color = guide_legend(ncol = 2, order = 1),
size = guide_legend(ncol = 1, order = 2))
Version | Author | Date |
---|---|---|
54af830 | Athena Chen | 2021-09-15 |
Left: proportion of reads pulled for 3,395 HIV peptides for two technical replicates. Right: concordance of HIV technical replicates, shown as proportion of peptides among the top ranked peptides in both replicates. For BEER, peptides are ranked by decreasing posterior probability of enrichment. For edgeR, peptides are ranked by increasing p-values. For both methods, ties of posterior probabilities and p-value (e.g., 0 and 1) were broken by the estimated fold-change. The top eight peptides from BEER are all highly enriched and treated exchangeably as no fold-change estimates are returned.
# Plot for proportion of reads
prop_reads <- hiv_tidy %>%
mutate(prop_reads = counts/n) %>%
filter(sample %in% c("HIV EC 9", "replicate of HIV EC 9")) %>%
select(sample, peptide, prop_reads) %>%
pivot_wider(names_from = "sample",
values_from = "prop_reads") %>%
dplyr::rename(sample_1 = `HIV EC 9`, sample_2 = `replicate of HIV EC 9`) %>%
ggplot(aes(x = log10(sample_1), y = log10(sample_2))) +
geom_abline(aes(intercept = 0, slope = 1)) +
geom_point() +
labs(title = "Proportion of reads pulled",
x = "sample, log10(proportion)",
y = "replicate, log10(proportion)") +
scale_x_continuous(breaks = seq(-5, -1, by = 1),
minor_breaks = seq(-5, -1, by = 0.5),
labels = TeX(paste0("$10^{", -5:-1, "}$"))) +
scale_y_continuous(breaks = seq(-5, -1, by = 1),
minor_breaks = seq(-5, -1, by = 0.5),
labels = TeX(paste0("$10^{", -5:-1, "}$"))) +
theme_bw() +
theme(aspect.ratio = 1,
title = element_text(size = 10))
# CAT curve for BEER posterior probabilities
# For a fair comparison, set super-enriched peptides in the replicate sample
# to have a rank of 1 as all of these peptides are identified as enriched
# by BEER and edgeR.
se_peps <- hiv_tidy %>%
filter(is_se & grepl("9", sample)) %>%
group_by(peptide) %>%
filter(row_number() == 1) %>%
ungroup() %>%
mutate(rank = 1:n())
# Ties are broken by the posterior marginal estimates of phi
beer_ranks <- hiv_tidy %>%
filter(sample %in% c("HIV EC 9", "replicate of HIV EC 9")) %>%
group_by(sample) %>%
arrange(desc(beer_prob), desc(beer_logfc), peptide, .by_group = TRUE) %>%
mutate(rank_se = nrow(se_peps) + cumsum(!peptide %in% se_peps$peptide)) %>%
left_join(se_peps %>% select(peptide, rank), by = c("peptide")) %>%
mutate(rank = ifelse(peptide %in% se_peps$peptide, rank, rank_se)) %>%
ungroup() %>%
select(sample, peptide, rank) %>%
pivot_wider(names_from = "sample",
values_from = "peptide") %>%
dplyr::rename(sample_1 = `HIV EC 9`, sample_2 = `replicate of HIV EC 9`) %>%
arrange(rank)
beer_conc <- sapply(1:nrow(beer_ranks), function(rank){
length(intersect(beer_ranks$sample_1[beer_ranks$rank <= rank],
beer_ranks$sample_2[beer_ranks$rank <= rank]))/rank
})
# CAT curve for edgeR p-values
edgeR_ranks <- hiv_tidy %>%
filter(sample %in% c("HIV EC 9", "replicate of HIV EC 9")) %>%
group_by(sample) %>%
arrange(edgeR_bh, desc(edgeR_logfc), .by_group = TRUE) %>%
mutate(rank_se = 8 + cumsum(!peptide %in% se_peps$peptide)) %>%
left_join(se_peps %>% select(peptide, rank), by = c("peptide")) %>%
mutate(rank = ifelse(peptide %in% se_peps$peptide, rank, rank_se)) %>%
ungroup() %>%
select(sample, peptide, rank) %>%
pivot_wider(names_from = "sample",
values_from = "peptide") %>%
dplyr::rename(sample_1 = `HIV EC 9`, sample_2 = `replicate of HIV EC 9`) %>%
arrange(rank)
edgeR_conc <- sapply(1:nrow(edgeR_ranks), function(rank){
length(intersect(edgeR_ranks$sample_1[edgeR_ranks$rank <= rank],
edgeR_ranks$sample_2[edgeR_ranks$rank <= rank]))/rank
})
cat_plot <- data.frame(rank = 1:length(beer_conc),
BEER = beer_conc,
edgeR = edgeR_conc) %>%
pivot_longer(cols = c("BEER", "edgeR"),
names_to = "approach",
values_to = "concordance") %>%
ggplot(aes(x = rank, y = concordance, color = approach)) +
geom_line() +
labs(title = "Concordance at the top",
y = "concordance",
x = "rank") +
coord_cartesian(ylim = c(0, 1), xlim = c(0, 200)) +
scale_x_continuous(breaks = seq(0, 200, by = 25)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.2)) +
scale_color_manual(values = c("red", "black")) +
theme_bw() +
theme(aspect.ratio = 1,
title = element_text(size = 10),
legend.background = element_rect(color = "black", size = 0.3),
legend.position = c(0.8, 0.2))
ggarrange(prop_reads, cat_plot, align = "hv",
nrow = 1, ncol = 2)
Version | Author | Date |
---|---|---|
54af830 | Athena Chen | 2021-09-15 |
HIV replicate posterior probabilities by rank. For each of the technical replicates, peptides are sorted in decreasing order by posterior probability and -log10(edgeR p-values). edgeR -log10(p-values) are truncated at 6.
post_prob <- hiv_tidy %>%
filter(sample %in% c("HIV EC 9", "replicate of HIV EC 9")) %>%
select(sample, peptide, beer_prob) %>%
group_by(sample) %>%
arrange(desc(beer_prob), .by_group = TRUE) %>%
mutate(rank = 1:n()) %>%
ggplot(aes(x = rank, y = beer_prob, group = sample, color = sample)) +
geom_point(size = 1) +
geom_line() +
coord_cartesian(xlim = c(0, 400), ylim = c(0, 1)) +
labs(title = "BEER",
x = "rank",
y = "posterior probabilities",
color = "sample") +
scale_color_manual(values = c("red", "black")) +
theme_bw() +
theme(aspect.ratio = 1,
title = element_text(size = 10),
legend.background = element_rect(color = "black", size = 0.3),
legend.position = "none")
p_values <- hiv_tidy %>%
filter(sample %in% c("HIV EC 9", "replicate of HIV EC 9")) %>%
select(sample, peptide, edgeR_prob) %>%
group_by(sample) %>%
arrange(desc(edgeR_prob), .by_group = TRUE) %>%
mutate(rank = 1:n()) %>%
ggplot(aes(x = rank, y = edgeR_prob, group = sample, color = sample)) +
geom_point(size = 1) +
geom_line() +
labs(title = "edgeR",
x = "rank",
y = "-log10(p-values)",
color = "sample") +
coord_cartesian(xlim = c(0, 400), ylim = c(0, 6)) +
scale_color_manual(values = c("red", "black")) +
theme_bw() +
theme(aspect.ratio = 1,
title = element_text(size = 10),
legend.background = element_rect(color = "black", size = 0.3),
legend.position = c(0.725, 0.825))
Posterior predictive 95% credible intervals for HIV EC 1 (horizontal lines) and medians (black circles) from the posterior predictive distribution for 67 of 3,394 HIV peptides, compared to the observed read counts (blue points). The peptides were chosen by ordering the observed read counts and selecting every 50th peptide.
hiv <- readRDS(here("data_processed", "hiv_results.rds"))
hiv_samples <- readRDS(here("data_processed", "hiv_samples.rds"))
thetas <- as.matrix(hiv_samples)
thetas <- thetas[, grepl("theta", colnames(thetas))]
sample_num <- which(hiv$group != "beads")[1]
n <- librarySize(hiv)[sample_num]
# Sample from the posterior predictive distribution
postpred_samples <- apply(thetas, 2, function(chain){
as.vector(sapply(chain, function(x){rbinom(10, n, x)}))
})
# Get 95% credible intervals
ci <- map_dfr(1:nrow(hiv), function(row_number){
row <- postpred_samples[, row_number]
obs_count <- counts(hiv)[row_number, sample_num]
# Calculate p-value, get empirical counts distribution
emp_dist <- table(row)/length(row)
dens_threshold <- emp_dist[names(emp_dist) == obs_count]
p_value <- sum(emp_dist[emp_dist <= dens_threshold])
output <- c(obs_count, quantile(row, c(0.025, 0.5, 0.975)), p_value)
names(output) <- c("counts","low", "med", "upper", "p_value")
output
})
ci %>%
arrange(counts) %>%
mutate(peptide = 1:n()) %>%
filter(peptide %% 50 == 0) %>%
ggplot(aes(y = peptide)) +
geom_errorbarh(aes(xmin = low, xmax = upper, color = "95% interval")) +
geom_point(aes(x = counts, color = "observed count")) +
geom_point(aes(x = med, color = "median"), fill = NA, shape = 21) +
coord_cartesian(xlim = c(0, 160), ylim = c(1, 3300)) +
scale_x_continuous(breaks = seq(0, 160, by = 20)) +
labs(x = "read counts", y = "", color = "") +
scale_color_manual(breaks = c("95% interval", "median", "observed count"),
values = c("black", "black", "blue")) +
guides(colour = guide_legend(
override.aes = list(linetype = c("solid", "blank", "blank"),
shape = c(NA, 21, 16)))) +
theme_bw() +
theme(aspect.ratio = 1,
legend.title = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
legend.background = element_rect(color = "black"),
legend.position = c(0.75, 0.17))
Version | Author | Date |
---|---|---|
7f65c20 | Athena Chen | 2022-01-30 |
Concordance of enrichment calls between two technical replicates of an HIV subtype A infected individual for BEER and edgeR. A total of 204 peptides from subtype A and 3,191 peptides from other subtypes were present on the platform. n: number of peptides; p: proportion of peptides.
hiv_tidy %>%
filter(sample %in% c("HIV EC 9", "replicate of HIV EC 9")) %>%
mutate(subtype_a = ifelse(taxon_species == "HIV type 1 group M subtype A", 1, 0)) %>%
select(subtype_a, sample, peptide, beer_hits, edgeR_hits) %>%
pivot_longer(cols = contains("hits"),
names_to = "method",
names_pattern = "([a-zA-Z]*)_hits",
values_to = "hits") %>%
pivot_wider(names_from = sample, values_from = hits) %>%
dplyr::rename(sample_1 = `HIV EC 9`,
sample_2 = `replicate of HIV EC 9`) %>%
mutate(concordance = case_when(sample_1 == 1 & sample_2 == 1 ~ "both enriched",
sample_1 == 0 & sample_2 == 0 ~ "both not enriched",
TRUE ~ "discordant")) %>%
group_by(subtype_a, method, concordance) %>%
summarize(n = n(), .groups = "drop_last") %>%
mutate(total_peps = sum(n),
proportion = n/total_peps) %>%
pivot_longer(cols = c("n", "proportion"),
names_to = "param_name",
values_to = "value") %>%
unite("param", method, param_name) %>%
pivot_wider(names_from = "param", values_from = "value") %>%
arrange(desc(subtype_a)) %>%
ungroup() %>%
mutate(across(contains("proportion"), ~ round(.x, 3))) %>%
select(-subtype_a, -total_peps) %>%
kbl(col.names = c("concordance", rep(c("n", "p"), times = 2))) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
pack_rows(index = c("Subtype A" = 3, "Other Subtypes" = 3)) %>%
add_header_above(c(" " = 1, "BEER" = 2, "edgeR" = 2))
concordance | n | p | n | p |
---|---|---|---|---|
Subtype A | ||||
both enriched | 16 | 0.078 | 14 | 0.069 |
both not enriched | 186 | 0.912 | 188 | 0.922 |
discordant | 2 | 0.010 | 2 | 0.010 |
Other Subtypes | ||||
both enriched | 118 | 0.037 | 94 | 0.029 |
both not enriched | 3039 | 0.952 | 3085 | 0.967 |
discordant | 34 | 0.011 | 12 | 0.004 |
sessionInfo()
R Under development (unstable) (2022-01-12 r81477)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur/Monterey 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] beer_0.99.2 rjags_4-12
[3] coda_0.19-4 PhIPData_1.3.1
[5] SummarizedExperiment_1.25.3 Biobase_2.55.0
[7] GenomicRanges_1.47.6 GenomeInfoDb_1.31.4
[9] IRanges_2.29.1 S4Vectors_0.33.10
[11] BiocGenerics_0.41.2 MatrixGenerics_1.7.0
[13] matrixStats_0.61.0 BiocManager_1.30.16
[15] RColorBrewer_1.1-2 kableExtra_1.3.4
[17] latex2exp_0.5.0 gridExtra_2.3
[19] ggpubr_0.4.0 here_1.0.1
[21] forcats_0.5.1 stringr_1.4.0
[23] dplyr_1.0.7 purrr_0.3.4
[25] readr_2.1.1 tidyr_1.1.4
[27] tibble_3.1.6 ggplot2_3.3.5
[29] tidyverse_1.3.1 plyr_1.8.6
[31] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ggsignif_0.6.3 ellipsis_0.3.2
[4] rprojroot_2.0.2 XVector_0.35.0 fs_1.5.2
[7] rstudioapi_0.13 farver_2.1.0 fansi_1.0.0
[10] lubridate_1.8.0 xml2_1.3.3 knitr_1.37
[13] jsonlite_1.7.2 broom_0.7.11 dbplyr_2.1.1
[16] compiler_4.2.0 httr_1.4.2 backports_1.4.1
[19] assertthat_0.2.1 Matrix_1.4-0 fastmap_1.1.0
[22] limma_3.51.3 cli_3.1.0 later_1.3.0
[25] htmltools_0.5.2 tools_4.2.0 gtable_0.3.0
[28] glue_1.6.0 GenomeInfoDbData_1.2.7 Rcpp_1.0.8
[31] carData_3.0-5 cellranger_1.1.0 jquerylib_0.1.4
[34] vctrs_0.3.8 svglite_2.0.0 progressr_0.10.0
[37] xfun_0.29 ps_1.6.0 rvest_1.0.2
[40] lifecycle_1.0.1 rstatix_0.7.0 edgeR_3.37.0
[43] getPass_0.2-2 zlibbioc_1.41.0 scales_1.1.1
[46] hms_1.1.1 promises_1.2.0.1 parallel_4.2.0
[49] yaml_2.2.1 stringi_1.7.6 highr_0.9
[52] BiocParallel_1.29.12 rlang_0.4.12 pkgconfig_2.0.3
[55] systemfonts_1.0.3 bitops_1.0-7 evaluate_0.14
[58] lattice_0.20-45 labeling_0.4.2 cowplot_1.1.1
[61] processx_3.5.2 tidyselect_1.1.1 magrittr_2.0.1
[64] R6_2.5.1 generics_0.1.1 DelayedArray_0.21.2
[67] DBI_1.1.2 pillar_1.6.4 haven_2.4.3
[70] whisker_0.4 withr_2.4.3 abind_1.4-5
[73] RCurl_1.98-1.5 modelr_0.1.8 crayon_1.4.2
[76] car_3.0-12 utf8_1.2.2 tzdb_0.2.0
[79] rmarkdown_2.11 locfit_1.5-9.4 grid_4.2.0
[82] readxl_1.3.1 callr_3.7.0 git2r_0.29.0
[85] reprex_2.0.1 digest_0.6.29 webshot_0.5.2
[88] httpuv_1.6.5 munsell_0.5.0 viridisLite_0.4.0